Kubernetes Performance Troubleshooting
Systematic approach to diagnosing and resolving Kubernetes performance issues.
Table of Contents
- High Latency Issues
- CPU Performance
- Memory Performance
- Network Performance
- Storage I/O Performance
- Application-Level Metrics
- Cluster-Wide Performance
High Latency Issues
Symptoms
- Slow API response times
- Increased request latency
- Timeouts
- Degraded user experience
Investigation Workflow
1. Identify the layer with latency:
# Check pod resource usage for a first signal
kubectl top pods -n <namespace>
# If using a service mesh (Istio/Linkerd), check its latency metrics as well
# Check ingress controller metrics
kubectl logs -n ingress-nginx <ingress-controller-pod> | grep "request_time"
# Check application logs for slow requests
kubectl logs <pod-name> -n <namespace> | grep -i "slow\|timeout\|latency"
2. Profile application performance:
# Get pod metrics
kubectl top pod <pod-name> -n <namespace>
# Check the pod's CPU requests/limits (a low limit is the usual cause of throttling)
kubectl get pod <pod-name> -n <namespace> -o json | \
jq '.spec.containers[].resources'
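# Actual throttling counters live in the container's cgroup (a quick check;
# the path differs between cgroup v2 and v1 nodes):
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat
# cgroup v2: inspect nr_throttled and throttled_usec
# cgroup v1: use /sys/fs/cgroup/cpu/cpu.stat (nr_throttled, throttled_time)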
# Exec into pod and check application-specific metrics
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Then: curl localhost:8080/metrics (if Prometheus metrics available)
3. Check dependencies:
# Test connectivity to downstream services (curl-format.txt must exist inside
# the container; its content is shown below)
kubectl exec -it <pod-name> -n <namespace> -- \
curl -w "@curl-format.txt" -o /dev/null -s http://backend-service
# curl-format.txt content:
# time_namelookup: %{time_namelookup}\n
# time_connect: %{time_connect}\n
# time_appconnect: %{time_appconnect}\n
# time_pretransfer: %{time_pretransfer}\n
# time_redirect: %{time_redirect}\n
# time_starttransfer: %{time_starttransfer}\n
# time_total: %{time_total}\n
Common Causes and Solutions
CPU Throttling:
# Increase CPU limits or remove limits for bursty workloads
resources:
requests:
cpu: "500m" # What pod needs typically
limits:
cpu: "2000m" # Burst capacity (or remove for unlimited)
Insufficient Replicas:
# Scale up deployment
kubectl scale deployment <deployment-name> -n <namespace> --replicas=5
# Or enable HPA
kubectl autoscale deployment <deployment-name> -n <namespace> \
--cpu-percent=70 \
--min=2 \
--max=10
Slow Dependencies:
# Implement circuit breakers and timeouts in application
# Or use service mesh policies (Istio example):
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: backend-circuit-breaker
spec:
host: backend-service
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 50
http2MaxRequests: 100
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 30s
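Per-route timeouts and retries complement the circuit breaker; a minimal sketch assuming the same backend-service host (values are illustrative):
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: backend-timeouts
spec:
  hosts:
  - backend-service
  http:
  - route:
    - destination:
        host: backend-service
    timeout: 2s
    retries:
      attempts: 3
      perTryTimeout: 500ms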
CPU Performance
Symptoms
- High CPU usage
- Throttling
- Slow processing
- Queue buildup
Investigation Commands
# Check CPU usage
kubectl top nodes
kubectl top pods -n <namespace>
# Check configured CPU requests/limits (usage pinned at the limit means throttling)
kubectl get pod <pod-name> -n <namespace> -o json | \
jq '.spec.containers[].resources'
# Get detailed CPU metrics (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | jq
# Check container-level CPU from the node (SSH to node)
ssh <node> "crictl stats"              # containerd/CRI-O runtimes
ssh <node> "docker stats --no-stream"  # Docker runtime
Advanced CPU Profiling
Enable CPU profiling in application:
# For Go applications with pprof
kubectl port-forward <pod-name> 6060:6060 -n <namespace>
# Capture CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof
# Analyze with pprof
go tool pprof -http=:8080 cpu.prof
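The same port-forward exposes the other pprof endpoints (assuming net/http/pprof handlers are registered in the binary):
# Goroutine dump - useful when latency comes from blocked goroutines
curl "http://localhost:6060/debug/pprof/goroutine?debug=2" > goroutines.txt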
For Java applications:
# Use async-profiler (assumes it is bundled in the image; 1 = the JVM's PID)
kubectl exec -it <pod-name> -n <namespace> -- \
/profiler.sh -d 30 -f /tmp/flamegraph.html 1
# Copy flamegraph
kubectl cp <namespace>/<pod-name>:/tmp/flamegraph.html ./flamegraph.html
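Before full profiling, a quick GC health check often pays off (assumes JDK tools are present in the image; 1 is the JVM's PID in most containers):
# GC utilization, 5 samples at 1s intervals
kubectl exec <pod-name> -n <namespace> -- jstat -gcutil 1 1000 5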
Solutions
Vertical Scaling:
resources:
requests:
cpu: "1000m" # Increased from 500m
limits:
cpu: "2000m" # Increased from 1000m
Horizontal Scaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Remove CPU Limits for Bursty Workloads:
# Allow bursting to available CPU
resources:
requests:
cpu: "500m"
# No limits - can use all available CPU
Memory Performance
Symptoms
- OOMKilled pods
- Memory leaks
- Slow garbage collection
- Swap usage (if enabled)
Investigation Commands
# Check memory usage
kubectl top nodes
kubectl top pods -n <namespace>
# Check memory limits and requests
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits\|Requests"
# Check OOM kills
kubectl get pods -n <namespace> -o json | \
jq '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | .metadata.name'
# Detailed memory breakdown (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | \
jq '.containers[] | {name, usage: .usage.memory}'
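OOM kills are also logged by the node's kernel; a node-side cross-check:
# Look for kernel OOM-killer events
ssh <node> "dmesg -T | grep -i 'killed process'"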
Memory Profiling
Heap dump for Java:
# Capture heap dump
kubectl exec <pod-name> -n <namespace> -- \
jmap -dump:format=b,file=/tmp/heapdump.hprof 1
# Copy heap dump
kubectl cp <namespace>/<pod-name>:/tmp/heapdump.hprof ./heapdump.hprof
# Analyze with Eclipse MAT or VisualVM
Memory profiling for Go:
# Capture heap profile
kubectl port-forward <pod-name> 6060:6060 -n <namespace>
curl http://localhost:6060/debug/pprof/heap > heap.prof
# Analyze
go tool pprof -http=:8080 heap.prof
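To confirm a leak, capture a second heap profile a few minutes later and diff against the first using pprof's -base flag:
curl http://localhost:6060/debug/pprof/heap > heap2.prof
go tool pprof -base heap.prof -http=:8080 heap2.prof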
Solutions
Increase Memory Limits:
resources:
requests:
memory: "512Mi"
limits:
memory: "2Gi" # Increased from 1Gi
Optimize Application:
- Fix memory leaks
- Implement connection pooling
- Optimize caching strategies
- Tune garbage collection
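For the GC-tuning item above, container-aware JVM flags are a common starting point (a sketch; the right flags depend on workload and JVM version):
env:
- name: JAVA_TOOL_OPTIONS
  value: "-XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"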
Use Memory-Optimized Node Pools:
# Node affinity for memory-intensive workloads
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: workload-type
operator: In
values:
- memory-optimized
Network Performance
Symptoms
- High network latency
- Packet loss
- Connection timeouts
- Bandwidth saturation
Investigation Commands
# Check pod network statistics
kubectl exec <pod-name> -n <namespace> -- netstat -s
# Test network performance between pods
# Deploy netperf
kubectl run netperf-client --image=networkstatic/netperf --rm -it -- /bin/bash
# From client, run:
netperf -H <target-pod-ip> -t TCP_STREAM
netperf -H <target-pod-ip> -t TCP_RR # Request-response latency
# Check DNS resolution time
kubectl exec <pod-name> -n <namespace> -- \
sh -c "time nslookup service-name.namespace.svc.cluster.local"
# Check service mesh overhead (if using Istio)
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
curl -s localhost:15000/stats | grep "http.inbound\|http.outbound"
Check Network Policies
# List network policies
kubectl get networkpolicies -n <namespace>
# Check if policy is blocking traffic
kubectl describe networkpolicy <policy-name> -n <namespace>
# Temporarily remove policies to test (in non-production)
kubectl delete networkpolicy <policy-name> -n <namespace>
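Since NetworkPolicies are additive allow-lists, a temporary allow-all policy is a less destructive test than deletion (sketch; the name is illustrative):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: debug-allow-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - {}
  egress:
  - {}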
Solutions
DNS Optimization:
# Use CoreDNS caching
# Increase CoreDNS replicas
kubectl scale deployment coredns -n kube-system --replicas=5
# Or use NodeLocal DNSCache
# https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
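Pod-level DNS tuning also helps: lowering ndots stops the resolver from walking the cluster search suffixes for external names (sketch):
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"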
Optimize Service Mesh:
# Reduce Istio sidecar resources if over-provisioned (pod template annotations)
metadata:
  annotations:
    sidecar.istio.io/proxyCPU: "100m"
    sidecar.istio.io/proxyMemory: "128Mi"
# Or disable sidecar injection for internal, trusted services:
#   sidecar.istio.io/inject: "false"
Use HostNetwork for Network-Intensive Pods:
# Use with caution - bypasses pod networking
spec:
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
Enable Bandwidth Limits (QoS):
# Requires a CNI that includes the bandwidth plugin
metadata:
annotations:
kubernetes.io/ingress-bandwidth: "10M"
kubernetes.io/egress-bandwidth: "10M"
Storage I/O Performance
Symptoms
- Slow read/write operations
- High I/O wait
- Application timeouts during disk operations
- Database performance issues
Investigation Commands
# Check I/O metrics on node
ssh <node> "iostat -x 1 10"
# Check disk usage
kubectl exec <pod-name> -n <namespace> -- df -h
# Check I/O wait from pod (batch mode so it works without a TTY; watch the %wa column)
kubectl exec <pod-name> -n <namespace> -- top -bn1
# Test storage performance
kubectl exec <pod-name> -n <namespace> -- \
dd if=/dev/zero of=/data/test bs=1M count=1024 conv=fdatasync
# Check PV performance class
kubectl get pv <pv-name> -o yaml | grep storageClassName
kubectl describe storageclass <storage-class-name>
Storage Benchmarking
Deploy fio for benchmarking:
apiVersion: v1
kind: Pod
metadata:
name: fio-benchmark
spec:
containers:
- name: fio
image: ljishen/fio
command: ["/bin/sh", "-c"]
args:
- |
fio --name=seqread --rw=read --bs=1M --size=1G --runtime=60 --filename=/data/test
fio --name=seqwrite --rw=write --bs=1M --size=1G --runtime=60 --filename=/data/test
fio --name=randread --rw=randread --bs=4k --size=1G --runtime=60 --filename=/data/test
fio --name=randwrite --rw=randwrite --bs=4k --size=1G --runtime=60 --filename=/data/test
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: test-pvc
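Apply the manifest and read results from the pod log (the filename and PVC name are placeholders):
kubectl apply -f fio-benchmark.yaml
kubectl logs -f fio-benchmark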
Solutions
Use Higher Performance Storage Class:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: high-performance-pvc
spec:
accessModes:
- ReadWriteOnce
storageClassName: gp3 # or io2, premium-rwo (GKE), etc.
resources:
requests:
storage: 100Gi
Provision IOPS (AWS EBS io2):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: io2-high-iops
provisioner: ebs.csi.aws.com
parameters:
type: io2
iops: "10000"
fsType: ext4
volumeBindingMode: WaitForFirstConsumer
Use Local NVMe for Ultra-Low Latency:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: local-pv
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: local-nvme
local:
path: /mnt/disks/nvme0n1
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- node-with-nvme
Application-Level Metrics
Expose Prometheus Metrics
Add metrics endpoint to application:
apiVersion: v1
kind: Service
metadata:
name: app-metrics
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
selector:
app: myapp
ports:
- name: metrics
port: 8080
targetPort: 8080
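If the Prometheus Operator is installed, a ServiceMonitor replaces the scrape annotations (sketch; assumes the Service itself carries an app: myapp label):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s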
Key Metrics to Monitor
Application metrics:
- Request rate
- Request latency (p50, p95, p99)
- Error rate
- Active connections
- Queue depth
- Cache hit rate
Example Prometheus queries:
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Request rate
sum(rate(http_requests_total[5m]))
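A throttling query ties back to the CPU section (cAdvisor metrics exposed by the kubelet):
# Fraction of CFS periods in which each pod was throttled
sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod)
  / sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)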
Distributed Tracing
Implement OpenTelemetry:
# Deploy Jaeger
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:latest
        ports:
        - containerPort: 16686 # UI
        - containerPort: 14268 # Collector
Instrument application:
- Add OpenTelemetry SDK to application
- Configure trace export to Jaeger
- Analyze end-to-end request traces to identify bottlenecks
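Export configuration is typically injected through the standard OpenTelemetry environment variables (sketch; the endpoint assumes a jaeger Service exposing the OTLP receiver on port 4317, which recent all-in-one images support):
env:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: "http://jaeger:4317"
- name: OTEL_SERVICE_NAME
  value: "myapp"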
Cluster-Wide Performance
Cluster Resource Utilization
# Overall cluster capacity
kubectl top nodes
# Total resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Resource requests vs limits
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.spec.containers[].resources)"'
Control Plane Performance
# Check API server latency
kubectl get --raw /metrics | grep apiserver_request_duration_seconds
# Check etcd performance
kubectl exec -it -n kube-system etcd-<node> -- \
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
check perf
# Workqueue depth (these are the API server's own queues; kube-controller-manager
# exposes its metrics separately, typically on port 10257)
kubectl get --raw /metrics | grep workqueue_depth
Scheduler Performance
# Check scheduler latency - kube-scheduler serves metrics on its own endpoint
# (typically port 10259), not on the API server's /metrics; if Prometheus scrapes
# it, query scheduler_scheduling_attempt_duration_seconds
# Check pending pods
kubectl get pods --all-namespaces --field-selector status.phase=Pending
# Scheduler logs
kubectl logs -n kube-system kube-scheduler-<node>
Solutions for Cluster-Wide Issues
Scale Control Plane:
- Add more control plane nodes
- Increase API server replicas
- Tune etcd (increase memory, use SSD)
Optimize Scheduling:
- Use pod priority and preemption
- Implement pod topology spread constraints
- Use node affinity/anti-affinity appropriately
Resource Management:
- Set appropriate resource requests and limits
- Use LimitRanges and ResourceQuotas
- Implement VerticalPodAutoscaler for right-sizing
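For the VerticalPodAutoscaler item above, recommendation-only mode is a safe start (sketch; requires the VPA components to be installed):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Off"   # Emit recommendations without evicting pods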
Performance Optimization Checklist
Application Level
- Implement connection pooling
- Enable response caching
- Optimize database queries
- Use async/non-blocking I/O
- Implement circuit breakers
- Profile and optimize hot paths
Kubernetes Level
- Set appropriate resource requests/limits
- Use HPA for auto-scaling
- Implement readiness/liveness probes correctly
- Use anti-affinity for high-availability
- Optimize container image size
- Use multi-stage builds
Infrastructure Level
- Use appropriate instance/node types
- Enable cluster autoscaling
- Use high-performance storage classes
- Optimize network topology
- Implement monitoring and alerting
- Regular performance testing
Monitoring Tools
Essential tools:
- Prometheus + Grafana: Metrics and dashboards
- Jaeger/Zipkin: Distributed tracing
- kube-state-metrics: Kubernetes object metrics
- node-exporter: Node-level metrics
- cAdvisor: Container metrics
- kubectl-flame: In-cluster CPU profiling (flamegraphs)
Commercial options:
- Datadog
- New Relic
- Dynatrace
- Elastic APM