Common Kubernetes Issues and Solutions
This reference provides detailed information about common Kubernetes issues, their root causes, and remediation steps.
Table of Contents
- Pod Issues
- Node Issues
- Networking Issues
- Storage Issues
- Resource Issues
- Security Issues
Pod Issues
ImagePullBackOff / ErrImagePull
Symptoms:
- Pod stuck in ImagePullBackOff or ErrImagePull state
- Events show image pull errors
Common Causes:
- Image doesn't exist or tag is wrong
- Registry authentication failure
- Network issues reaching registry
- Rate limiting from registry
- Private registry without imagePullSecrets
Diagnostic Commands:
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
Remediation Steps:
- Verify the image name and tag: docker pull <image> from a local machine
- Check that imagePullSecrets exist: kubectl get secrets -n <namespace> (a pod-spec sketch follows this list)
- Verify the secret has correct registry credentials
- Check if registry is accessible from cluster
- For rate limiting: implement imagePullSecrets with authenticated account
- Update deployment with correct image path
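Where a private registry is involved, the fix is usually a docker-registry secret (created with kubectl create secret docker-registry) referenced from the pod spec. A minimal sketch, assuming a hypothetical secret named regcred and an illustrative image path:

apiVersion: v1
kind: Pod
metadata:
  name: private-image-example          # hypothetical pod name
spec:
  containers:
  - name: app
    image: registry.example.com/team/app:1.4.2   # pin a specific tag rather than latest
  imagePullSecrets:
  - name: regcred                      # must exist in the same namespace as the pod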
Prevention:
- Use specific image tags instead of latest
- Implement image pull secrets for private registries
- Set up local registry or cache to reduce external pulls
- Use image validation in CI/CD pipeline
CrashLoopBackOff
Symptoms:
- Pod repeatedly crashes and restarts
- Increasing restart count
- Container exits shortly after starting
Common Causes:
- Application error on startup
- Missing environment variables or config
- Resource limits too restrictive
- Liveness probe too aggressive
- Missing dependencies (DB, cache, etc.)
- Command/args misconfiguration
Diagnostic Commands:
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl describe pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o yaml
Remediation Steps:
- Check current logs: kubectl logs <pod-name>
- Check previous container logs: kubectl logs <pod-name> --previous
- Look for startup errors, stack traces, or missing config messages
- Verify environment variables are set correctly
- Check if external dependencies are accessible
- Review and adjust resource limits if OOMKilled
- Adjust the liveness probe's initialDelaySeconds if it fails too early (see the probe sketch after this list)
- Verify command and args in pod spec
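If the crash loop is driven by the liveness probe rather than the application, relaxing the probe is often enough. A sketch with assumed values; the /healthz path, port, and timings are placeholders to tune against the application's real startup behaviour:

livenessProbe:
  httpGet:
    path: /healthz          # assumed health endpoint
    port: 8080
  initialDelaySeconds: 30   # give the application time to start before the first check
  periodSeconds: 10
  failureThreshold: 3       # restart only after three consecutive failures

On clusters where startup time varies widely, a startupProbe can protect slow-starting containers without loosening the liveness probe itself.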
Prevention:
- Implement proper application health checks and readiness
- Use init containers for dependency checks (see the init-container sketch after this list)
- Set appropriate resource requests/limits based on profiling
- Configure liveness probes with sufficient delay
- Add retry logic and graceful degradation to applications
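One way to implement the dependency check above is an init container that blocks until a backing service answers. The db host, port, and images below are placeholders, and the pattern assumes busybox's nc supports the -z flag:

spec:
  initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z db 5432; do echo waiting for db; sleep 2; done']
  containers:
  - name: app
    image: registry.example.com/team/app:1.4.2   # hypothetical application image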
Pending Pods
Symptoms:
- Pod stuck in Pending state
- Not scheduled to any node
Common Causes:
- Insufficient cluster resources (CPU/memory)
- Node selectors/affinity rules can't be satisfied
- No nodes match taints/tolerations
- PersistentVolume not available
- Resource quotas exceeded
- Scheduler not running or misconfigured
Diagnostic Commands:
kubectl describe pod <pod-name> -n <namespace>
kubectl get nodes -o wide
kubectl describe nodes
kubectl get pv,pvc -n <namespace>
kubectl get resourcequota -n <namespace>
Remediation Steps:
- Check pod events for scheduling failure reason
- Verify node resources: kubectl describe nodes | grep -A 5 "Allocated resources"
- Check if the pod has node selectors: verify nodes have the required labels
- Review taints on nodes and tolerations on the pod (a toleration sketch follows this list)
- For PVC issues: verify the PV exists and is in the Available state
- Check the namespace resource quota: kubectl describe resourcequota -n <namespace>
- If no resources: scale the cluster or reduce resource requests
- If affinity issue: adjust affinity rules or add matching nodes
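If describe shows the pod was rejected because of node taints or missing labels, the pod spec needs a matching toleration (and, where relevant, a node selector). A sketch with hypothetical taint and label values:

spec:
  tolerations:
  - key: "dedicated"          # must match the taint key on the node
    operator: "Equal"
    value: "batch"            # must match the taint value
    effect: "NoSchedule"
  nodeSelector:
    workload-type: batch      # only needed if the pod should also target labeled nodes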
Prevention:
- Set appropriate resource requests (not just limits)
- Monitor cluster capacity and scale proactively
- Use pod disruption budgets to prevent total unavailability
- Implement cluster autoscaling
- Use multiple node pools for different workload types
OOMKilled Pods
Symptoms:
- Pod terminates with exit code 137
- Container status shows OOMKilled reason
- Frequent restarts due to memory exhaustion
Common Causes:
- Memory limit set too low
- Memory leak in application
- Unexpected load increase
- No memory limits (using node's memory)
Diagnostic Commands:
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> --previous -n <namespace>
kubectl top pod <pod-name> -n <namespace>
Remediation Steps:
- Confirm OOMKilled in pod events or status
- Check memory usage before crash if metrics available
- Review application logs for memory-intensive operations
- Increase memory limit if application legitimately needs more
- Profile application to identify memory leaks
- Set memory limits with requests equal to limits for Guaranteed QoS (see the sketch after this list)
- Consider horizontal scaling instead of vertical
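A sketch of the requests-equal-limits pattern that yields the Guaranteed QoS class; the numbers are placeholders to be replaced with profiled values:

resources:
  requests:
    memory: "512Mi"     # equal to the limit
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "250m"

Guaranteed QoS requires requests to equal limits for every resource in every container of the pod; such pods are the last candidates for eviction under node memory pressure.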
Prevention:
- Profile applications to determine actual memory needs
- Set memory requests based on normal usage, limits with headroom
- Implement application-level memory monitoring
- Use horizontal pod autoscaling based on memory metrics
- Regular load testing to understand resource requirements
Node Issues
Node NotReady
Symptoms:
- Node status shows NotReady
- Pods on the node may be evicted
- Node may be cordoned automatically
Common Causes:
- Kubelet stopped or crashed
- Network connectivity issues
- Disk pressure
- Memory pressure
- PID pressure
- Container runtime issues
Diagnostic Commands:
kubectl describe node <node-name>
kubectl get nodes -o wide
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
ssh <node> "df -h"
ssh <node> "free -m"
Remediation Steps:
- Check node conditions in describe output
- Verify kubelet is running: systemctl status kubelet
- Check kubelet logs: journalctl -u kubelet
- For disk pressure: clean up unused images/containers
- For memory pressure: identify and stop memory-hogging processes
- Restart kubelet if crashed: systemctl restart kubelet
- Check the container runtime: systemctl status docker or systemctl status containerd
- Verify network connectivity to the API server
Prevention:
- Monitor node resources with alerts
- Implement log rotation and image cleanup
- Set up node problem detector
- Use resource quotas to prevent resource exhaustion
- Regular maintenance windows for updates
Disk Pressure
Symptoms:
- Node condition DiskPressure is True
- Pod evictions may occur
- Node may become NotReady
Common Causes:
- Docker/containerd image cache filling disk
- Container logs consuming space
- Ephemeral storage usage by pods
- System logs filling up
Diagnostic Commands:
kubectl describe node <node-name> | grep -A 10 Conditions
ssh <node> "df -h"
ssh <node> "du -sh /var/lib/docker/*"
ssh <node> "du -sh /var/lib/containerd/*"
ssh <node> "docker system df"
Remediation Steps:
- Clean up unused images: docker image prune -a
- Clean up stopped containers: docker container prune
- Clean up unused volumes: docker volume prune
- Rotate and compress logs
- Check for pods with excessive ephemeral storage usage
- Expand disk if consistently running out of space
- Configure kubelet garbage-collection and eviction parameters (see the sketch after this list)
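The garbage-collection and eviction knobs live in the kubelet configuration. A hedged KubeletConfiguration fragment is below with illustrative thresholds; how the file is applied depends on how the cluster provisions its kubelets (kubeadm config, managed node group settings, and so on):

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 80    # start pruning images above this disk usage
imageGCLowThresholdPercent: 70     # prune down to this usage level
evictionHard:
  imagefs.available: "15%"
  nodefs.available: "10%"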
Prevention:
- Set imagefs.available threshold in kubelet config
- Implement automated image pruning
- Use log rotation for container logs
- Monitor disk usage with alerts
- Set ephemeral-storage limits on pods
- Size nodes appropriately for workload
Networking Issues
Pod-to-Pod Communication Failure
Symptoms:
- Services can't reach other services
- Connection timeouts between pods
- DNS resolution may or may not work
Common Causes:
- Network policy blocking traffic
- CNI plugin issues
- Firewall rules blocking traffic
- Service misconfiguration
- Pod CIDR exhaustion
Diagnostic Commands:
kubectl get networkpolicies --all-namespaces
kubectl exec -it <pod> -- ping <target-pod-ip>
kubectl exec -it <pod> -- nslookup <service-name>
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
Remediation Steps:
- Test basic connectivity: exec into pod and ping target
- Check network policies: look for policies affecting the source and destination namespaces (an example policy follows this list)
- Verify the service has endpoints: kubectl get endpoints
- Check if pod labels match the service selector
- Verify CNI plugin pods are healthy (usually in kube-system)
- Check node-level firewall rules
- Verify pod CIDR hasn't been exhausted
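If a deny-by-default policy is in place, traffic has to be re-allowed explicitly. A sketch of a NetworkPolicy admitting one namespace's pods into another namespace's API pods; the namespaces, labels, and port are hypothetical, and the namespaceSelector relies on the kubernetes.io/metadata.name label that recent Kubernetes versions set automatically:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-api       # hypothetical
  namespace: backend
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: frontend
    ports:
    - protocol: TCP
      port: 8080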
Prevention:
- Document network policies and their intent
- Use namespace labels for network policy management
- Monitor CNI plugin health
- Regularly audit network policies
- Implement network policy testing in CI/CD
Service Not Accessible
Symptoms:
- Cannot connect to service
- LoadBalancer stuck in Pending
- NodePort not accessible
Common Causes:
- Service has no endpoints (no matching pods)
- Pods not passing readiness checks
- LoadBalancer controller not working
- Cloud provider integration issues
- Port conflicts
Diagnostic Commands:
kubectl get svc <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
kubectl get pods -l <service-selector> -n <namespace>
kubectl logs -l <service-selector> -n <namespace>
Remediation Steps:
- Verify service type and ports are correct
- Check if the service has endpoints: kubectl get endpoints
- If no endpoints: verify the pod selector matches the pod labels (see the sketch after this list)
- Check pod readiness: kubectl describe pod
- For LoadBalancer: check the cloud provider controller logs
- For NodePort: verify node firewall allows the port
- Test from within cluster first, then external access
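An empty endpoints list almost always means the Service selector and the pod labels disagree. A sketch with hypothetical names showing the two places that must match:

apiVersion: v1
kind: Service
metadata:
  name: web                  # hypothetical
spec:
  selector:
    app: web                 # must equal the pod template labels below
  ports:
  - port: 80
    targetPort: 8080         # the port the containers actually listen on
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web             # matches the Service selector
    spec:
      containers:
      - name: web
        image: registry.example.com/team/web:1.0   # hypothetical image
        ports:
        - containerPort: 8080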
Prevention:
- Use meaningful service selectors and pod labels
- Implement proper readiness probes
- Test services in staging before production
- Monitor service endpoint counts
- Document external access requirements
Storage Issues
PVC Pending
Symptoms:
- PersistentVolumeClaim stuck in Pending state
- Pod can't start because the volume is not available
Common Causes:
- No PV matches PVC requirements
- StorageClass doesn't exist or is misconfigured
- Dynamic provisioner not working
- Insufficient permissions for provisioner
- Volume capacity exhausted in storage backend
Diagnostic Commands:
kubectl describe pvc <pvc-name> -n <namespace>
kubectl get pv
kubectl get storageclass
kubectl describe storageclass <storage-class-name>
kubectl get pods -n <provisioner-namespace>
Remediation Steps:
- Check PVC events for specific error message
- Verify the StorageClass exists: kubectl get sc (a StorageClass/PVC sketch follows this list)
- Check if the dynamic provisioner pod is running
- For static provisioning: ensure PV exists with matching size/access mode
- Verify provisioner has correct cloud credentials/permissions
- Check storage backend capacity
- Review StorageClass parameters for typos
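A sketch pairing a PVC with the StorageClass it expects; the class name, provisioner, and size are assumptions, and the provisioner in particular must match the cluster's actual storage backend (the AWS EBS CSI driver is used purely as an example):

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd                        # hypothetical class name
provisioner: ebs.csi.aws.com            # example; use the backend's real provisioner
volumeBindingMode: WaitForFirstConsumer # binds when a pod is scheduled, not at PVC creation
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: fast-ssd            # must name an existing StorageClass
  resources:
    requests:
      storage: 20Gi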
Prevention:
- Use dynamic provisioning with reliable StorageClass
- Monitor storage backend capacity
- Set up alerts for PVC binding failures
- Test storage provisioning in non-production first
- Document storage requirements and limitations
Volume Mount Failures
Symptoms:
- Pod fails to start with volume mount errors
- Events show mounting failures
- Container create errors related to volumes
Common Causes:
- Volume already mounted to different node (RWO with multi-attach)
- Volume doesn't exist
- Insufficient permissions
- Node can't reach storage backend
- Filesystem issues on volume
Diagnostic Commands:
kubectl describe pod <pod-name> -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
kubectl describe pv <pv-name>
kubectl get volumeattachments
Remediation Steps:
- Check pod events for specific mount error
- For RWO volumes: ensure pod is scheduled to node with volume attached
- Verify PVC is bound to a PV
- Check node can reach storage backend (cloud/NFS/iSCSI)
- For CSI volumes: check CSI driver pods are healthy
- Delete and recreate pod if volume is stuck in multi-attach state
- Check filesystem on volume if accessible
Prevention:
- Use ReadWriteMany volumes for multi-pod access scenarios (see the sketch after this list)
- Implement pod disruption budgets to prevent scheduling conflicts
- Monitor volume attachment status
- Use StatefulSets for stateful workloads with volumes
- Regular backup and restore testing
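Where several pods genuinely need the same volume at once, a ReadWriteMany claim sidesteps the multi-attach problem entirely; this only works on backends that support RWX (NFS, CephFS, and similar), and the names below are hypothetical:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-data
spec:
  accessModes:
  - ReadWriteMany               # requires an RWX-capable storage backend
  storageClassName: nfs-shared  # hypothetical class backed by such storage
  resources:
    requests:
      storage: 10Gi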
Resource Issues
Resource Quota Exceeded
Symptoms:
- New pods fail to schedule
- Error: "exceeded quota"
- ResourceQuota limits preventing resource allocation
Common Causes:
- Namespace resource quota exceeded
- Not enough resource budget available
- Resource requests not specified on pods
- Quota misconfiguration
Diagnostic Commands:
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
kubectl get pods -n <namespace> -o json | jq '.items[].spec.containers[].resources'
kubectl describe namespace <namespace>
Remediation Steps:
- Check current quota usage: kubectl describe resourcequota (an example quota follows this list)
- Identify which pods are consuming resources
- Either increase quota limits or reduce resource requests
- Delete unused resources to free up quota
- Optimize pod resource requests based on actual usage
- Consider splitting workloads across namespaces
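A sketch of a namespace ResourceQuota for reference when adjusting limits; every number here is a placeholder that should come from observed usage rather than guesswork:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota          # hypothetical
  namespace: team-a         # hypothetical
spec:
  hard:
    requests.cpu: "8"
    requests.memory: 16Gi
    limits.cpu: "16"
    limits.memory: 32Gi
    pods: "50"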
Prevention:
- Set quotas based on actual usage patterns
- Monitor quota usage with alerts
- Right-size pod resource requests
- Implement automatic cleanup of completed jobs/pods
- Regular quota review and adjustment
CPU Throttling
Symptoms:
- Application performance degradation
- High CPU throttling metrics
- Services responding slowly despite available CPU
Common Causes:
- CPU limits set too low
- Burst workloads hitting limits
- Noisy neighbor effects
- CPU limits set without understanding workload
Diagnostic Commands:
kubectl top pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu/cpu.stat
Remediation Steps:
- Check current CPU usage vs limits
- Review throttling metrics if available
- Increase CPU limits if application legitimately needs more
- Remove CPU limits if the workload is bursty, but keep requests (see the sketch after this list)
- Use HPA if load varies significantly
- Profile application to identify CPU-intensive operations
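A hedged sketch of the requests-without-CPU-limit pattern for bursty workloads; whether to drop CPU limits is a policy call (some clusters mandate them via LimitRange or admission policy), so treat the values as illustrative:

resources:
  requests:
    cpu: "500m"          # roughly the average usage observed under normal load
    memory: "512Mi"
  limits:
    memory: "512Mi"      # keep a memory limit to contain leaks
    # no cpu entry here: the container can burst into idle node capacity without throttling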
Prevention:
- Set CPU requests based on average usage
- Set CPU limits with 50-100% headroom for bursts
- Use HPA for variable workloads
- Monitor CPU throttling metrics
- Regular performance testing and profiling
Security Issues
Image Vulnerability
Symptoms:
- Security scanner reports vulnerabilities
- Compliance violations
- Known CVEs in running images
Diagnostic Commands:
kubectl get pods -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u
trivy image <image-name>
Remediation Steps:
- Identify vulnerable images with scanner
- Update base images to patched versions
- Rebuild application images with updated dependencies
- Update deployments with new image tags
- Implement admission controller to block vulnerable images
Prevention:
- Scan images in CI/CD pipeline
- Regular base image updates
- Use minimal base images
- Implement admission controllers (OPA, Kyverno)
- Automated image updates and testing
RBAC Permission Denied
Symptoms:
- Users or service accounts can't perform operations
- "forbidden" errors in logs
- API calls fail with 403 errors
Common Causes:
- Missing Role or ClusterRole binding
- Overly restrictive RBAC policies
- Wrong service account in pod
- Namespace-scoped role for cluster-wide resource
Diagnostic Commands:
kubectl auth can-i <verb> <resource> --as=<user/sa>
kubectl get rolebindings -n <namespace>
kubectl get clusterrolebindings
kubectl describe role <role-name> -n <namespace>
kubectl describe serviceaccount <sa-name> -n <namespace>
Remediation Steps:
- Identify exact permission needed from error message
- Check what the user or service account can do: kubectl auth can-i --list
- Create an appropriate Role or ClusterRole with the needed permissions (a Role/RoleBinding sketch follows this list)
- Create RoleBinding/ClusterRoleBinding
- Verify service account is correctly set in pod spec
- Test with kubectl auth can-i before deploying
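A sketch granting a service account read access to pods in a single namespace; the names, namespace, and verb list are placeholders to be derived from the exact permission in the forbidden error:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader               # hypothetical
  namespace: team-a              # hypothetical
rules:
- apiGroups: [""]
  resources: ["pods", "pods/log"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: team-a
subjects:
- kind: ServiceAccount
  name: app-sa                   # the service account named in the pod spec
  namespace: team-a
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io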
Prevention:
- Follow principle of least privilege
- Use namespace-scoped roles where possible
- Document RBAC policies and their purpose
- Regular RBAC audits
- Use pre-defined roles when possible
This reference covers the most common Kubernetes issues. For each issue, always:
- Gather information (describe, logs, events)
- Form hypothesis based on symptoms
- Test hypothesis with targeted diagnostics
- Apply remediation
- Verify fix
- Document for future reference