Common Kubernetes Issues and Solutions

This reference provides detailed information about common Kubernetes issues, their root causes, and remediation steps.

Table of Contents

  1. Pod Issues
  2. Node Issues
  3. Networking Issues
  4. Storage Issues
  5. Resource Issues
  6. Security Issues

Pod Issues

ImagePullBackOff / ErrImagePull

Symptoms:

  • Pod stuck in ImagePullBackOff or ErrImagePull state
  • Events show image pull errors

Common Causes:

  1. Image doesn't exist or tag is wrong
  2. Registry authentication failure
  3. Network issues reaching registry
  4. Rate limiting from registry
  5. Private registry without imagePullSecrets

Diagnostic Commands:

kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>

Remediation Steps:

  1. Verify the image name and tag: try docker pull <image> from a local machine
  2. Check imagePullSecrets exist: kubectl get secrets -n <namespace>
  3. Verify secret has correct registry credentials
  4. Check if registry is accessible from cluster
  5. For rate limiting: implement imagePullSecrets with an authenticated account (see the example after this list)
  6. Update deployment with correct image path
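
A sketch of steps 2 and 5, assuming a private registry at registry.example.com and a secret named regcred (both hypothetical):

kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>

# Reference the secret in the pod spec (fragment):
spec:
  imagePullSecrets:
    - name: regcred
  containers:
    - name: app
      image: registry.example.com/team/app:1.4.2   # hypothetical image; avoid the latest tag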

Prevention:

  • Use specific image tags instead of latest
  • Implement image pull secrets for private registries
  • Set up local registry or cache to reduce external pulls
  • Use image validation in CI/CD pipeline

CrashLoopBackOff

Symptoms:

  • Pod repeatedly crashes and restarts
  • Increasing restart count
  • Container exits shortly after starting

Common Causes:

  1. Application error on startup
  2. Missing environment variables or config
  3. Resource limits too restrictive
  4. Liveness probe too aggressive
  5. Missing dependencies (DB, cache, etc.)
  6. Command/args misconfiguration

Diagnostic Commands:

kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl describe pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o yaml

Remediation Steps:

  1. Check current logs: kubectl logs <pod-name>
  2. Check previous container logs: kubectl logs <pod-name> --previous
  3. Look for startup errors, stack traces, or missing config messages
  4. Verify environment variables are set correctly
  5. Check if external dependencies are accessible
  6. Review and adjust resource limits if OOMKilled
  7. Adjust the liveness probe's initialDelaySeconds if it fails too early (see the probe example after this list)
  8. Verify command and args in pod spec
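
A hedged sketch for step 7: a livenessProbe that gives the container time to start, assuming the application serves a /healthz endpoint on port 8080 (both assumptions):

containers:
  - name: app
    image: <image>
    livenessProbe:
      httpGet:
        path: /healthz          # assumed health endpoint
        port: 8080              # assumed container port
      initialDelaySeconds: 30   # wait before the first probe so slow starts aren't killed
      periodSeconds: 10
      failureThreshold: 3       # restart only after three consecutive failures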

Prevention:

  • Implement proper application health checks and readiness
  • Use init containers for dependency checks (see the sketch after this list)
  • Set appropriate resource requests/limits based on profiling
  • Configure liveness probes with sufficient delay
  • Add retry logic and graceful degradation to applications
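
One way to implement the init-container suggestion above, assuming the dependency is a Service named db in the same namespace (hypothetical name); note this only waits for the Service's DNS record, not for the database to accept connections:

initContainers:
  - name: wait-for-db
    image: busybox:1.36
    command: ['sh', '-c', 'until nslookup db; do echo waiting for db; sleep 2; done']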

Pending Pods

Symptoms:

  • Pod stuck in Pending state
  • Not scheduled to any node

Common Causes:

  1. Insufficient cluster resources (CPU/memory)
  2. Node selectors/affinity rules can't be satisfied
  3. No nodes match taints/tolerations
  4. PersistentVolume not available
  5. Resource quotas exceeded
  6. Scheduler not running or misconfigured

Diagnostic Commands:

kubectl describe pod <pod-name> -n <namespace>
kubectl get nodes -o wide
kubectl describe nodes
kubectl get pv,pvc -n <namespace>
kubectl get resourcequota -n <namespace>

Remediation Steps:

  1. Check pod events for scheduling failure reason
  2. Verify node resources: kubectl describe nodes | grep -A 5 "Allocated resources"
  3. Check if pod has node selectors: verify nodes have required labels
  4. Review taints on nodes and tolerations on the pod (a sample toleration follows this list)
  5. For PVC issues: verify PV exists and is in Available state
  6. Check namespace resource quota: kubectl describe resourcequota -n <namespace>
  7. If no resources: scale cluster or reduce resource requests
  8. If affinity issue: adjust affinity rules or add matching nodes
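
A fragment for steps 3 and 4, assuming nodes carry a label workload=batch and a taint dedicated=batch:NoSchedule (both hypothetical):

spec:
  nodeSelector:
    workload: batch            # pod schedules only to nodes with this label
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"     # tolerate the matching taint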

Prevention:

  • Set appropriate resource requests (not just limits)
  • Monitor cluster capacity and scale proactively
  • Use pod disruption budgets so voluntary disruptions never take down every replica at once
  • Implement cluster autoscaling
  • Use multiple node pools for different workload types

OOMKilled Pods

Symptoms:

  • Pod terminates with exit code 137
  • Container status shows OOMKilled reason
  • Frequent restarts due to memory exhaustion

Common Causes:

  1. Memory limit set too low
  2. Memory leak in application
  3. Unexpected load increase
  4. No memory limits, so the container can consume the node's memory until the kernel OOM killer steps in

Diagnostic Commands:

kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> --previous -n <namespace>
kubectl top pod <pod-name> -n <namespace>

Remediation Steps:

  1. Confirm OOMKilled in pod events or status
  2. Check memory usage before crash if metrics available
  3. Review application logs for memory-intensive operations
  4. Increase memory limit if application legitimately needs more
  5. Profile application to identify memory leaks
  6. Set memory requests equal to limits for Guaranteed QoS (see the snippet after this list)
  7. Consider horizontal scaling instead of vertical
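
Step 6 as a fragment; setting requests equal to limits (for CPU and memory, in every container) yields the Guaranteed QoS class. The figures are placeholders to be replaced with values from profiling:

resources:
  requests:
    memory: "512Mi"   # placeholder
    cpu: "250m"
  limits:
    memory: "512Mi"   # equal to the request
    cpu: "250m"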

Prevention:

  • Profile applications to determine actual memory needs
  • Set memory requests based on normal usage, limits with headroom
  • Implement application-level memory monitoring
  • Use horizontal pod autoscaling based on memory metrics
  • Regular load testing to understand resource requirements

Node Issues

Node NotReady

Symptoms:

  • Node status shows NotReady
  • Pods on node may be evicted
  • Node may be tainted automatically (e.g. node.kubernetes.io/not-ready), keeping new pods off it

Common Causes:

  1. Kubelet stopped or crashed
  2. Network connectivity issues
  3. Disk pressure
  4. Memory pressure
  5. PID pressure
  6. Container runtime issues

Diagnostic Commands:

kubectl describe node <node-name>
kubectl get nodes -o wide
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
ssh <node> "df -h"
ssh <node> "free -m"

Remediation Steps:

  1. Check node conditions in describe output
  2. Verify kubelet is running: systemctl status kubelet
  3. Check kubelet logs: journalctl -u kubelet
  4. For disk pressure: clean up unused images/containers
  5. For memory pressure: identify and stop memory-hogging processes
  6. Restart kubelet if crashed: systemctl restart kubelet
  7. Check the container runtime: systemctl status docker or systemctl status containerd
  8. Verify network connectivity to API server

Prevention:

  • Monitor node resources with alerts
  • Implement log rotation and image cleanup
  • Set up node problem detector
  • Use resource quotas to prevent resource exhaustion
  • Regular maintenance windows for updates

Disk Pressure

Symptoms:

  • Node condition DiskPressure is True
  • Pod evictions may occur
  • Node may become NotReady

Common Causes:

  1. Docker/containerd image cache filling disk
  2. Container logs consuming space
  3. Ephemeral storage usage by pods
  4. System logs filling up

Diagnostic Commands:

kubectl describe node <node-name> | grep -A 10 Conditions
ssh <node> "df -h"
ssh <node> "du -sh /var/lib/docker/*"
ssh <node> "du -sh /var/lib/containerd/*"
ssh <node> "docker system df"

Remediation Steps:

  1. Clean up unused images: docker image prune -a
  2. Clean up stopped containers: docker container prune
  3. Clean up volumes: docker volume prune
  4. Rotate and compress logs
  5. Check for pods with excessive ephemeral storage usage
  6. Expand disk if consistently running out of space
  7. Configure kubelet garbage collection and eviction parameters (see the sample config after this list)
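
A sketch of step 7, and of the imagefs.available threshold mentioned under Prevention, using KubeletConfiguration fields; the percentages are illustrative only:

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 80   # start image garbage collection above 80% disk usage
imageGCLowThresholdPercent: 65    # stop once usage falls below 65%
evictionHard:
  imagefs.available: "10%"        # evict pods when image filesystem free space drops below 10%
  nodefs.available: "10%"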

Prevention:

  • Set imagefs.available threshold in kubelet config
  • Implement automated image pruning
  • Use log rotation for container logs
  • Monitor disk usage with alerts
  • Set ephemeral-storage limits on pods
  • Size nodes appropriately for workload

Networking Issues

Pod-to-Pod Communication Failure

Symptoms:

  • Services can't reach other services
  • Connection timeouts between pods
  • DNS resolution may fail, or may succeed while direct connections still time out

Common Causes:

  1. Network policy blocking traffic
  2. CNI plugin issues
  3. Firewall rules blocking traffic
  4. Service misconfiguration
  5. Pod CIDR exhaustion

Diagnostic Commands:

kubectl get networkpolicies --all-namespaces
kubectl exec -it <pod> -- ping <target-pod-ip>
kubectl exec -it <pod> -- nslookup <service-name>
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>

Remediation Steps:

  1. Test basic connectivity: exec into pod and ping target
  2. Check network policies: look for policies affecting source/dest namespaces
  3. Verify service has endpoints: kubectl get endpoints
  4. Check if pod labels match service selector
  5. Verify CNI plugin pods are healthy (usually in kube-system)
  6. Check node-level firewall rules
  7. Verify pod CIDR hasn't been exhausted

Prevention:

  • Document network policies and their intent
  • Use namespace labels for network policy management
  • Monitor CNI plugin health
  • Regularly audit network policies
  • Implement network policy testing in CI/CD

Service Not Accessible

Symptoms:

  • Cannot connect to service
  • LoadBalancer stuck in Pending
  • NodePort not accessible

Common Causes:

  1. Service has no endpoints (no matching pods)
  2. Pods not passing readiness checks
  3. LoadBalancer controller not working
  4. Cloud provider integration issues
  5. Port conflicts

Diagnostic Commands:

kubectl get svc <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
kubectl get pods -l <service-selector> -n <namespace>
kubectl logs -l <service-selector> -n <namespace>

Remediation Steps:

  1. Verify service type and ports are correct
  2. Check if service has endpoints: kubectl get endpoints
  3. If there are no endpoints: verify the Service selector matches the pod labels (see the example after this list)
  4. Check pod readiness: kubectl describe pod
  5. For LoadBalancer: check cloud provider controller logs
  6. For NodePort: verify node firewall allows the port
  7. Test from within cluster first, then external access

Prevention:

  • Use meaningful service selectors and pod labels
  • Implement proper readiness probes
  • Test services in staging before production
  • Monitor service endpoint counts
  • Document external access requirements

Storage Issues

PVC Pending

Symptoms:

  • PersistentVolumeClaim stuck in Pending state
  • Pod can't start because its volume is unavailable

Common Causes:

  1. No PV matches PVC requirements
  2. StorageClass doesn't exist or is misconfigured
  3. Dynamic provisioner not working
  4. Insufficient permissions for provisioner
  5. Volume capacity exhausted in storage backend

Diagnostic Commands:

kubectl describe pvc <pvc-name> -n <namespace>
kubectl get pv
kubectl get storageclass
kubectl describe storageclass <storage-class-name>
kubectl get pods -n <provisioner-namespace>

Remediation Steps:

  1. Check PVC events for specific error message
  2. Verify StorageClass exists: kubectl get sc
  3. Check if dynamic provisioner pod is running
  4. For static provisioning: ensure PV exists with matching size/access mode
  5. Verify provisioner has correct cloud credentials/permissions
  6. Check storage backend capacity
  7. Review StorageClass parameters for typos
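
For dynamic provisioning, a minimal PVC looks like the following; the storageClassName is a placeholder and must match a class listed by kubectl get sc:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  namespace: <namespace>
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard   # placeholder; must exist in the cluster
  resources:
    requests:
      storage: 10Gi            # placeholder size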

Prevention:

  • Use dynamic provisioning with reliable StorageClass
  • Monitor storage backend capacity
  • Set up alerts for PVC binding failures
  • Test storage provisioning in non-production first
  • Document storage requirements and limitations

Volume Mount Failures

Symptoms:

  • Pod fails to start with volume mount errors
  • Events show mounting failures
  • Container creation errors related to volumes

Common Causes:

  1. Volume already mounted to different node (RWO with multi-attach)
  2. Volume doesn't exist
  3. Insufficient permissions
  4. Node can't reach storage backend
  5. Filesystem issues on volume

Diagnostic Commands:

kubectl describe pod <pod-name> -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
kubectl describe pv <pv-name>
kubectl get volumeattachments

Remediation Steps:

  1. Check pod events for specific mount error
  2. For RWO volumes: ensure pod is scheduled to node with volume attached
  3. Verify PVC is bound to a PV
  4. Check node can reach storage backend (cloud/NFS/iSCSI)
  5. For CSI volumes: check CSI driver pods are healthy
  6. Delete and recreate pod if volume is stuck in multi-attach state
  7. Check filesystem on volume if accessible

Prevention:

  • Use ReadWriteMany for multi-pod access scenarios
  • Use the Recreate deployment strategy for single-replica workloads with RWO volumes so the old pod detaches before the new one starts
  • Monitor volume attachment status
  • Use StatefulSets for stateful workloads with volumes
  • Regular backup and restore testing

Resource Issues

Resource Quota Exceeded

Symptoms:

  • New pods fail to schedule
  • Error: "exceeded quota"
  • ResourceQuota limits preventing resource allocation

Common Causes:

  1. Namespace resource quota exceeded
  2. Existing workloads already consuming most of the namespace's quota
  3. Resource requests not specified on pods
  4. Quota misconfiguration

Diagnostic Commands:

kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
kubectl get pods -n <namespace> -o json | jq '.items[].spec.containers[].resources'
kubectl describe namespace <namespace>

Remediation Steps:

  1. Check current quota usage: kubectl describe resourcequota
  2. Identify pods consuming resources
  3. Either increase the quota limits or reduce resource requests (a sample ResourceQuota follows this list)
  4. Delete unused resources to free up quota
  5. Optimize pod resource requests based on actual usage
  6. Consider splitting workloads across namespaces
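
For reference when adjusting limits in step 3, a ResourceQuota looks like this; the figures are placeholders:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: <namespace>
spec:
  hard:
    requests.cpu: "4"       # total CPU requests allowed in the namespace
    requests.memory: 8Gi
    limits.cpu: "8"
    limits.memory: 16Gi
    pods: "20"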

Prevention:

  • Set quotas based on actual usage patterns
  • Monitor quota usage with alerts
  • Right-size pod resource requests
  • Implement automatic cleanup of completed jobs/pods
  • Regular quota review and adjustment

CPU Throttling

Symptoms:

  • Application performance degradation
  • High CPU throttling metrics
  • Services responding slowly despite available CPU

Common Causes:

  1. CPU limits set too low
  2. Burst workloads hitting limits
  3. Noisy neighbor effects
  4. CPU limits set without understanding workload

Diagnostic Commands:

kubectl top pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat   # nr_throttled / throttled_time (on cgroup v2 nodes: /sys/fs/cgroup/cpu.stat)

Remediation Steps:

  1. Check current CPU usage vs limits
  2. Review throttling metrics if available
  3. Increase CPU limits if application legitimately needs more
  4. Remove CPU limits if the workload is bursty, but keep requests (see the snippet after this list)
  5. Use HPA if load varies significantly
  6. Profile application to identify CPU-intensive operations
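
Steps 3 and 4 as a fragment: keep the CPU request for scheduling, and either raise the limit with headroom or omit it entirely for bursty workloads (values are placeholders):

resources:
  requests:
    cpu: "500m"   # sized from average usage
  limits:
    cpu: "1"      # roughly 100% headroom over the request; omit for very bursty workloads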

Prevention:

  • Set CPU requests based on average usage
  • Set CPU limits with 50-100% headroom for bursts
  • Use HPA for variable workloads
  • Monitor CPU throttling metrics
  • Regular performance testing and profiling

Security Issues

Image Vulnerability

Symptoms:

  • Security scanner reports vulnerabilities
  • Compliance violations
  • Known CVEs in running images

Diagnostic Commands:

kubectl get pods -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u
trivy image <image-name>

Remediation Steps:

  1. Identify vulnerable images with scanner
  2. Update base images to patched versions
  3. Rebuild application images with updated dependencies
  4. Update deployments with new image tags
  5. Implement admission controller to block vulnerable images

Prevention:

  • Scan images in CI/CD pipeline
  • Regular base image updates
  • Use minimal base images
  • Implement admission controllers (OPA, Kyverno)
  • Automated image updates and testing

RBAC Permission Denied

Symptoms:

  • Users or service accounts can't perform operations
  • "forbidden" errors in logs
  • API calls fail with 403 errors

Common Causes:

  1. Missing Role or ClusterRole binding
  2. Overly restrictive RBAC policies
  3. Wrong service account in pod
  4. Namespace-scoped role for cluster-wide resource

Diagnostic Commands:

kubectl auth can-i <verb> <resource> --as=<user>
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<sa-name>
kubectl get rolebindings -n <namespace>
kubectl get clusterrolebindings
kubectl describe role <role-name> -n <namespace>
kubectl describe serviceaccount <sa-name> -n <namespace>

Remediation Steps:

  1. Identify exact permission needed from error message
  2. Check what user/SA can do: kubectl auth can-i --list
  3. Create appropriate Role/ClusterRole with needed permissions
  4. Create the RoleBinding/ClusterRoleBinding (a minimal example follows this list)
  5. Verify service account is correctly set in pod spec
  6. Test with kubectl auth can-i before deploying
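
A minimal sketch for steps 3 and 4, granting a hypothetical service account app-sa read access to pods in one namespace:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: <namespace>
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: <namespace>
subjects:
  - kind: ServiceAccount
    name: app-sa               # hypothetical service account
    namespace: <namespace>
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io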

Prevention:

  • Follow principle of least privilege
  • Use namespace-scoped roles where possible
  • Document RBAC policies and their purpose
  • Regular RBAC audits
  • Use pre-defined roles when possible

This reference covers the most common Kubernetes issues. For each issue, always:

  1. Gather information (describe, logs, events)
  2. Form hypothesis based on symptoms
  3. Test hypothesis with targeted diagnostics
  4. Apply remediation
  5. Verify fix
  6. Document for future reference