Kubernetes Performance Troubleshooting
Systematic approach to diagnosing and resolving Kubernetes performance issues.
Table of Contents
- High Latency Issues
- CPU Performance
- Memory Performance
- Network Performance
- Storage I/O Performance
- Application-Level Metrics
- Cluster-Wide Performance
High Latency Issues
Symptoms
- Slow API response times
- Increased request latency
- Timeouts
- Degraded user experience
Investigation Workflow
1. Identify the layer with latency:
# Check pod resource usage for a first signal
kubectl top pods -n <namespace>
# If using a service mesh (Istio/Linkerd), check its latency metrics as well
# Check ingress controller metrics
kubectl logs -n ingress-nginx <ingress-controller-pod> | grep "request_time"
# Check application logs for slow requests
kubectl logs <pod-name> -n <namespace> | grep -i "slow\|timeout\|latency"
2. Profile application performance:
# Get pod metrics
kubectl top pod <pod-name> -n <namespace>
# Check the pod's CPU requests/limits (a low limit is the usual cause of throttling)
kubectl get pod <pod-name> -n <namespace> -o json | \
jq '.spec.containers[].resources'
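# Actual throttling counters live in the container's cgroup (a quick check;
# the path differs between cgroup v2 and v1 nodes):
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat
# cgroup v2: inspect nr_throttled and throttled_usec
# cgroup v1: use /sys/fs/cgroup/cpu/cpu.stat (nr_throttled, throttled_time)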
# Exec into pod and check application-specific metrics
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Then: curl localhost:8080/metrics (if Prometheus metrics available)
3. Check dependencies:
# Test connectivity to downstream services (curl-format.txt must exist inside
# the container; its content is shown below)
kubectl exec -it <pod-name> -n <namespace> -- \
curl -w "@curl-format.txt" -o /dev/null -s http://backend-service
# curl-format.txt content:
# time_namelookup: %{time_namelookup}\n
# time_connect: %{time_connect}\n
# time_appconnect: %{time_appconnect}\n
# time_pretransfer: %{time_pretransfer}\n
# time_redirect: %{time_redirect}\n
# time_starttransfer: %{time_starttransfer}\n
# time_total: %{time_total}\n
Common Causes and Solutions
CPU Throttling:
# Increase CPU limits or remove limits for bursty workloads
resources:
requests:
cpu: "500m" # What pod needs typically
limits:
cpu: "2000m" # Burst capacity (or remove for unlimited)
Insufficient Replicas:
# Scale up deployment
kubectl scale deployment <deployment-name> -n <namespace> --replicas=5
# Or enable HPA
kubectl autoscale deployment <deployment-name> -n <namespace> \
--cpu-percent=70 \
--min=2 \
--max=10
Slow Dependencies:
# Implement circuit breakers and timeouts in application
# Or use service mesh policies (Istio example):
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: backend-circuit-breaker
spec:
host: backend-service
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 50
http2MaxRequests: 100
outlierDetection:
consecutive5xxErrors: 5
interval: 30s
baseEjectionTime: 30s
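Per-route timeouts and retries complement the circuit breaker; a minimal sketch assuming the same backend-service host (values are illustrative):
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: backend-timeouts
spec:
  hosts:
  - backend-service
  http:
  - route:
    - destination:
        host: backend-service
    timeout: 2s
    retries:
      attempts: 3
      perTryTimeout: 500ms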
CPU Performance
Symptoms
- High CPU usage
- Throttling
- Slow processing
- Queue buildup
Investigation Commands
# Check CPU usage
kubectl top nodes
kubectl top pods -n <namespace>
# Check configured CPU requests/limits (usage pinned at the limit means throttling)
kubectl get pod <pod-name> -n <namespace> -o json | \
jq '.spec.containers[].resources'
# Get detailed CPU metrics (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | jq
# Check container-level CPU from the node (SSH to node)
ssh <node> "crictl stats"              # containerd/CRI-O runtimes
ssh <node> "docker stats --no-stream"  # Docker runtime
Advanced CPU Profiling
Enable CPU profiling in application:
# For Go applications with pprof
kubectl port-forward <pod-name> 6060:6060 -n <namespace>
# Capture CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof
# Analyze with pprof
go tool pprof -http=:8080 cpu.prof
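The same port-forward exposes the other pprof endpoints (assuming net/http/pprof handlers are registered in the binary):
# Goroutine dump - useful when latency comes from blocked goroutines
curl "http://localhost:6060/debug/pprof/goroutine?debug=2" > goroutines.txt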
For Java applications:
# Use async-profiler (assumes it is bundled in the image; 1 = the JVM's PID)
kubectl exec -it <pod-name> -n <namespace> -- \
/profiler.sh -d 30 -f /tmp/flamegraph.html 1
# Copy flamegraph
kubectl cp <namespace>/<pod-name>:/tmp/flamegraph.html ./flamegraph.html
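Before full profiling, a quick GC health check often pays off (assumes JDK tools are present in the image; 1 is the JVM's PID in most containers):
# GC utilization, 5 samples at 1s intervals
kubectl exec <pod-name> -n <namespace> -- jstat -gcutil 1 1000 5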
Solutions
Vertical Scaling:
resources:
requests:
cpu: "1000m" # Increased from 500m
limits:
cpu: "2000m" # Increased from 1000m
Horizontal Scaling:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
Remove CPU Limits for Bursty Workloads:
# Allow bursting to available CPU
resources:
requests:
cpu: "500m"
# No limits - can use all available CPU
Memory Performance
Symptoms
- OOMKilled pods
- Memory leaks
- Slow garbage collection
- Swap usage (if enabled)
Investigation Commands
# Check memory usage
kubectl top nodes
kubectl top pods -n <namespace>
# Check memory limits and requests
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits\|Requests"
# Check OOM kills
kubectl get pods -n <namespace> -o json | \
jq '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | .metadata.name'
# Detailed memory breakdown (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | \
jq '.containers[] | {name, usage: .usage.memory}'
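OOM kills are also logged by the node's kernel; a node-side cross-check:
# Look for kernel OOM-killer events
ssh <node> "dmesg -T | grep -i 'killed process'"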
Memory Profiling
Heap dump for Java:
# Capture heap dump
kubectl exec <pod-name> -n <namespace> -- \
jmap -dump:format=b,file=/tmp/heapdump.hprof 1
# Copy heap dump
kubectl cp <namespace>/<pod-name>:/tmp/heapdump.hprof ./heapdump.hprof
# Analyze with Eclipse MAT or VisualVM
Memory profiling for Go:
# Capture heap profile
kubectl port-forward <pod-name> 6060:6060 -n <namespace>
curl http://localhost:6060/debug/pprof/heap > heap.prof
# Analyze
go tool pprof -http=:8080 heap.prof
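To confirm a leak, capture a second heap profile a few minutes later and diff against the first using pprof's -base flag:
curl http://localhost:6060/debug/pprof/heap > heap2.prof
go tool pprof -base heap.prof -http=:8080 heap2.prof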
Solutions
Increase Memory Limits:
resources:
requests:
memory: "512Mi"
limits:
memory: "2Gi" # Increased from 1Gi
Optimize Application:
- Fix memory leaks
- Implement connection pooling
- Optimize caching strategies
- Tune garbage collection
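For the GC-tuning item above, container-aware JVM flags are a common starting point (a sketch; the right flags depend on workload and JVM version):
env:
- name: JAVA_TOOL_OPTIONS
  value: "-XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"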
Use Memory-Optimized Node Pools:
# Node affinity for memory-intensive workloads
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: workload-type
operator: In
values:
- memory-optimized
Network Performance
Symptoms
- High network latency
- Packet loss
- Connection timeouts
- Bandwidth saturation
Investigation Commands
# Check pod network statistics
kubectl exec <pod-name> -n <namespace> -- netstat -s
# Test network performance between pods
# Deploy netperf
kubectl run netperf-client --image=networkstatic/netperf --rm -it -- /bin/bash
# From client, run:
netperf -H <target-pod-ip> -t TCP_STREAM
netperf -H <target-pod-ip> -t TCP_RR # Request-response latency
# Check DNS resolution time
kubectl exec <pod-name> -n <namespace> -- \
sh -c "time nslookup service-name.namespace.svc.cluster.local"
# Check service mesh overhead (if using Istio)
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
curl -s localhost:15000/stats | grep "http.inbound\|http.outbound"
Check Network Policies
# List network policies
kubectl get networkpolicies -n <namespace>
# Check if policy is blocking traffic
kubectl describe networkpolicy <policy-name> -n <namespace>
# Temporarily remove policies to test (in non-production)
kubectl delete networkpolicy <policy-name> -n <namespace>
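Since NetworkPolicies are additive allow-lists, a temporary allow-all policy is a less destructive test than deletion (sketch; the name is illustrative):
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: debug-allow-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - {}
  egress:
  - {}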
Solutions
DNS Optimization:
# Use CoreDNS caching
# Increase CoreDNS replicas
kubectl scale deployment coredns -n kube-system --replicas=5
# Or use NodeLocal DNSCache
# https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
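Pod-level DNS tuning also helps: lowering ndots stops the resolver from walking the cluster search suffixes for external names (sketch):
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"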
Optimize Service Mesh:
# Reduce Istio sidecar resources if over-provisioned (pod template annotations)
metadata:
  annotations:
    sidecar.istio.io/proxyCPU: "100m"
    sidecar.istio.io/proxyMemory: "128Mi"
# Or disable sidecar injection for internal, trusted services:
#   sidecar.istio.io/inject: "false"
Use HostNetwork for Network-Intensive Pods:
# Use with caution - bypasses pod networking
spec:
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
Enable Bandwidth Limits (QoS):
# Requires a CNI that includes the bandwidth plugin
metadata:
annotations:
kubernetes.io/ingress-bandwidth: "10M"
kubernetes.io/egress-bandwidth: "10M"
Storage I/O Performance
Symptoms
- Slow read/write operations
- High I/O wait
- Application timeouts during disk operations
- Database performance issues
Investigation Commands
# Check I/O metrics on node
ssh <node> "iostat -x 1 10"
# Check disk usage
kubectl exec <pod-name> -n <namespace> -- df -h
# Check I/O wait from pod (batch mode so it works without a TTY; watch the %wa column)
kubectl exec <pod-name> -n <namespace> -- top -bn1
# Test storage performance
kubectl exec <pod-name> -n <namespace> -- \
dd if=/dev/zero of=/data/test bs=1M count=1024 conv=fdatasync
# Check PV performance class
kubectl get pv <pv-name> -o yaml | grep storageClassName
kubectl describe storageclass <storage-class-name>
Storage Benchmarking
Deploy fio for benchmarking:
apiVersion: v1
kind: Pod
metadata:
name: fio-benchmark
spec:
containers:
- name: fio
image: ljishen/fio
command: ["/bin/sh", "-c"]
args:
- |
fio --name=seqread --rw=read --bs=1M --size=1G --runtime=60 --filename=/data/test
fio --name=seqwrite --rw=write --bs=1M --size=1G --runtime=60 --filename=/data/test
fio --name=randread --rw=randread --bs=4k --size=1G --runtime=60 --filename=/data/test
fio --name=randwrite --rw=randwrite --bs=4k --size=1G --runtime=60 --filename=/data/test
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: test-pvc
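Apply the manifest and read results from the pod log (the filename and PVC name are placeholders):
kubectl apply -f fio-benchmark.yaml
kubectl logs -f fio-benchmark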
Solutions
Use Higher Performance Storage Class:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: high-performance-pvc
spec:
accessModes:
- ReadWriteOnce
storageClassName: gp3 # or io2, premium-rwo (GKE), etc.
resources:
requests:
storage: 100Gi
Provision IOPS (AWS EBS io2):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: io2-high-iops
provisioner: ebs.csi.aws.com
parameters:
type: io2
iops: "10000"
fsType: ext4
volumeBindingMode: WaitForFirstConsumer
Use Local NVMe for Ultra-Low Latency:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: local-pv
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: local-nvme
local:
path: /mnt/disks/nvme0n1
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- node-with-nvme
Application-Level Metrics
Expose Prometheus Metrics
Add metrics endpoint to application:
apiVersion: v1
kind: Service
metadata:
name: app-metrics
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
selector:
app: myapp
ports:
- name: metrics
port: 8080
targetPort: 8080
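If the Prometheus Operator is installed, a ServiceMonitor replaces the scrape annotations (sketch; assumes the Service itself carries an app: myapp label):
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 30s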
Key Metrics to Monitor
Application metrics:
- Request rate
- Request latency (p50, p95, p99)
- Error rate
- Active connections
- Queue depth
- Cache hit rate
Example Prometheus queries:
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Request rate
sum(rate(http_requests_total[5m]))
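A throttling query ties back to the CPU section (cAdvisor metrics exposed by the kubelet):
# Fraction of CFS periods in which each pod was throttled
sum(rate(container_cpu_cfs_throttled_periods_total[5m])) by (pod)
  / sum(rate(container_cpu_cfs_periods_total[5m])) by (pod)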
Distributed Tracing
Implement OpenTelemetry:
# Deploy Jaeger
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:latest
        ports:
        - containerPort: 16686 # UI
        - containerPort: 14268 # Collector
Instrument application:
- Add OpenTelemetry SDK to application
- Configure trace export to Jaeger
- Analyze end-to-end request traces to identify bottlenecks
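Export configuration is typically injected through the standard OpenTelemetry environment variables (sketch; the endpoint assumes a jaeger Service exposing the OTLP receiver on port 4317, which recent all-in-one images support):
env:
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: "http://jaeger:4317"
- name: OTEL_SERVICE_NAME
  value: "myapp"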
Cluster-Wide Performance
Cluster Resource Utilization
# Overall cluster capacity
kubectl top nodes
# Total resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Resource requests vs limits
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.spec.containers[].resources)"'
Control Plane Performance
# Check API server latency
kubectl get --raw /metrics | grep apiserver_request_duration_seconds
# Check etcd performance
kubectl exec -it -n kube-system etcd-<node> -- \
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
check perf
# Workqueue depth (these are the API server's own queues; kube-controller-manager
# exposes its metrics separately, typically on port 10257)
kubectl get --raw /metrics | grep workqueue_depth
Scheduler Performance
# Check scheduler latency - kube-scheduler serves metrics on its own endpoint
# (typically port 10259), not on the API server's /metrics; if Prometheus scrapes
# it, query scheduler_scheduling_attempt_duration_seconds
# Check pending pods
kubectl get pods --all-namespaces --field-selector status.phase=Pending
# Scheduler logs
kubectl logs -n kube-system kube-scheduler-<node>
Solutions for Cluster-Wide Issues
Scale Control Plane:
- Add more control plane nodes
- Increase API server replicas
- Tune etcd (increase memory, use SSD)
Optimize Scheduling:
- Use pod priority and preemption
- Implement pod topology spread constraints
- Use node affinity/anti-affinity appropriately
Resource Management:
- Set appropriate resource requests and limits
- Use LimitRanges and ResourceQuotas
- Implement VerticalPodAutoscaler for right-sizing
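For the VerticalPodAutoscaler item above, recommendation-only mode is a safe start (sketch; requires the VPA components to be installed):
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: app-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  updatePolicy:
    updateMode: "Off"   # Emit recommendations without evicting pods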
Performance Optimization Checklist
Application Level
- Implement connection pooling
- Enable response caching
- Optimize database queries
- Use async/non-blocking I/O
- Implement circuit breakers
- Profile and optimize hot paths
Kubernetes Level
- Set appropriate resource requests/limits
- Use HPA for auto-scaling
- Implement readiness/liveness probes correctly
- Use anti-affinity for high-availability
- Optimize container image size
- Use multi-stage builds
Infrastructure Level
- Use appropriate instance/node types
- Enable cluster autoscaling
- Use high-performance storage classes
- Optimize network topology
- Implement monitoring and alerting
- Regular performance testing
Monitoring Tools
Essential tools:
- Prometheus + Grafana: Metrics and dashboards
- Jaeger/Zipkin: Distributed tracing
- kube-state-metrics: Kubernetes object metrics
- node-exporter: Node-level metrics
- cAdvisor: Container metrics
- kubectl-flame: In-cluster CPU profiling (flamegraphs)
Commercial options:
- Datadog
- New Relic
- Dynatrace
- Elastic APM