# Kubernetes Performance Troubleshooting
Systematic approach to diagnosing and resolving Kubernetes performance issues.
## Table of Contents
1. [High Latency Issues](#high-latency-issues)
2. [CPU Performance](#cpu-performance)
3. [Memory Performance](#memory-performance)
4. [Network Performance](#network-performance)
5. [Storage I/O Performance](#storage-io-performance)
6. [Application-Level Metrics](#application-level-metrics)
7. [Cluster-Wide Performance](#cluster-wide-performance)
---
## High Latency Issues
### Symptoms
- Slow API response times
- Increased request latency
- Timeouts
- Degraded user experience
### Investigation Workflow
**1. Identify the layer with latency:**
```bash
# Check pod resource usage (and service mesh dashboards, if using Istio/Linkerd)
kubectl top pods -n <namespace>
# Check ingress controller metrics
kubectl logs -n ingress-nginx <ingress-controller-pod> | grep "request_time"
# Check application logs for slow requests
kubectl logs <pod-name> -n <namespace> | grep -i "slow\|timeout\|latency"
```
**2. Profile application performance:**
```bash
# Get pod metrics
kubectl top pod <pod-name> -n <namespace>
# Check CPU requests/limits (throttling occurs when usage hits the limit)
kubectl get pod <pod-name> -n <namespace> -o json | \
jq '.spec.containers[].resources'
# Exec into pod and check application-specific metrics
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Then: curl localhost:8080/metrics (if Prometheus metrics available)
```
**3. Check dependencies:**
```bash
# Test connectivity to downstream services
kubectl exec -it <pod-name> -n <namespace> -- \
curl -w "@curl-format.txt" -o /dev/null -s http://backend-service
# curl-format.txt content:
# time_namelookup: %{time_namelookup}\n
# time_connect: %{time_connect}\n
# time_appconnect: %{time_appconnect}\n
# time_pretransfer: %{time_pretransfer}\n
# time_redirect: %{time_redirect}\n
# time_starttransfer: %{time_starttransfer}\n
# time_total: %{time_total}\n
```
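If creating a format file inside the pod is inconvenient, the same timings can be captured inline with `-w` variables. A minimal sketch, assuming `curl` is available in the container image and that `backend-service` (the placeholder used above) resolves from the pod:
```bash
# One-off timing breakdown without a format file (placeholders as above)
kubectl exec -it <pod-name> -n <namespace> -- \
  curl -s -o /dev/null \
  -w "dns=%{time_namelookup}s connect=%{time_connect}s ttfb=%{time_starttransfer}s total=%{time_total}s\n" \
  http://backend-service
```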
### Common Causes and Solutions
**CPU Throttling:**
```yaml
# Increase CPU limits or remove limits for bursty workloads
resources:
requests:
cpu: "500m" # What pod needs typically
limits:
cpu: "2000m" # Burst capacity (or remove for unlimited)
```
**Insufficient Replicas:**
```bash
# Scale up deployment
kubectl scale deployment <deployment-name> -n <namespace> --replicas=5
# Or enable HPA
kubectl autoscale deployment <deployment-name> \
--cpu-percent=70 \
--min=2 \
--max=10
```
**Slow Dependencies:**
```yaml
# Implement circuit breakers and timeouts in application
# Or use service mesh policies (Istio example):
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
name: backend-circuit-breaker
spec:
host: backend-service
trafficPolicy:
connectionPool:
tcp:
maxConnections: 100
http:
http1MaxPendingRequests: 50
http2MaxRequests: 100
outlierDetection:
consecutiveErrors: 5
interval: 30s
baseEjectionTime: 30s
```
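To verify the circuit breaker is actually engaging, the sidecar's Envoy statistics can be inspected. A hedged sketch, assuming Istio sidecar injection is enabled; exact stat names vary by Istio/Envoy version:
```bash
# Look for outlier ejections and circuit-breaker overflows in the sidecar stats
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  curl -s localhost:15000/stats | grep -E "outlier_detection|circuit_breakers"
```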
---
## CPU Performance
### Symptoms
- High CPU usage
- Throttling
- Slow processing
- Queue buildup
### Investigation Commands
```bash
# Check CPU usage
kubectl top nodes
kubectl top pods -n <namespace>
# Check CPU requests/limits (usage at the limit means throttling; see the cgroup check below)
kubectl get pod <pod-name> -n <namespace> -o json | \
jq '.spec.containers[].resources'
# Get detailed CPU metrics (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | jq
# Check container-level CPU from the node (SSH to node)
ssh <node> "docker stats --no-stream"   # or "crictl stats" on containerd-based nodes
```
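Resource limits only show the ceiling; actual throttling is visible in the container's cgroup counters. A minimal sketch, assuming cgroup v2 (on cgroup v1 the file is `/sys/fs/cgroup/cpu/cpu.stat`):
```bash
# nr_throttled / throttled_usec growing over time means the container is hitting its CPU limit
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat
```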
### Advanced CPU Profiling
**Enable CPU profiling in application:**
```bash
# For Go applications with pprof
kubectl port-forward <pod-name> 6060:6060 -n <namespace>
# Capture CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof
# Analyze with pprof
go tool pprof -http=:8080 cpu.prof
```
**For Java applications:**
```bash
# Use async-profiler
kubectl exec -it <pod-name> -n <namespace> -- \
/profiler.sh -d 30 -f /tmp/flamegraph.html 1
# Copy flamegraph
kubectl cp <namespace>/<pod-name>:/tmp/flamegraph.html ./flamegraph.html
```
### Solutions
**Vertical Scaling:**
```yaml
resources:
requests:
cpu: "1000m" # Increased from 500m
limits:
cpu: "2000m" # Increased from 1000m
```
**Horizontal Scaling:**
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: app
minReplicas: 3
maxReplicas: 20
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
```
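To roll this out and confirm it reacts to load, something like the following works; `app-hpa.yaml` is a hypothetical filename for the manifest above:
```bash
# Apply the HPA and watch current vs. target utilization and replica count
kubectl apply -f app-hpa.yaml -n <namespace>
kubectl get hpa app-hpa -n <namespace> --watch
```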
**Remove CPU Limits for Bursty Workloads:**
```yaml
# Allow bursting to available CPU
resources:
requests:
cpu: "500m"
# No limits - can use all available CPU
```
---
## Memory Performance
### Symptoms
- OOMKilled pods
- Memory leaks
- Slow garbage collection
- Swap usage (if enabled)
### Investigation Commands
```bash
# Check memory usage
kubectl top nodes
kubectl top pods -n <namespace>
# Check memory limits and requests
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits\|Requests"
# Check OOM kills
kubectl get pods -n <namespace> -o json | \
jq '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | .metadata.name'
# Detailed memory breakdown (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | \
jq '.containers[] | {name, usage: .usage.memory}'
```
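OOM kills also surface as events and node conditions, which is a quick cross-check before diving into profiling; event wording varies slightly across Kubernetes versions:
```bash
# Recent OOM-related events across the cluster
kubectl get events -A --sort-by=.lastTimestamp | grep -i oom
# Node-level memory pressure condition
kubectl describe node <node-name> | grep -i "MemoryPressure"
```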
### Memory Profiling
**Heap dump for Java:**
```bash
# Capture heap dump
kubectl exec <pod-name> -n <namespace> -- \
jmap -dump:format=b,file=/tmp/heapdump.hprof 1
# Copy heap dump
kubectl cp <namespace>/<pod-name>:/tmp/heapdump.hprof ./heapdump.hprof
# Analyze with Eclipse MAT or VisualVM
```
**Memory profiling for Go:**
```bash
# Capture heap profile
kubectl port-forward <pod-name> 6060:6060 -n <namespace>
curl http://localhost:6060/debug/pprof/heap > heap.prof
# Analyze
go tool pprof -http=:8080 heap.prof
```
### Solutions
**Increase Memory Limits:**
```yaml
resources:
requests:
memory: "512Mi"
limits:
memory: "2Gi" # Increased from 1Gi
```
**Optimize Application:**
- Fix memory leaks
- Implement connection pooling
- Optimize caching strategies
- Tune garbage collection (see the JVM sketch below)
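For JVM workloads, one common tuning step is sizing the heap relative to the container memory limit. A sketch, assuming a reasonably recent JVM that honors `JAVA_TOOL_OPTIONS` and container-aware flags; the deployment name is a placeholder:
```bash
# Cap the heap at ~75% of the container memory limit and use G1 GC
kubectl set env deployment/<deployment-name> -n <namespace> \
  JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"
```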
**Use Memory-Optimized Node Pools:**
```yaml
# Node affinity for memory-intensive workloads
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: workload-type
operator: In
values:
- memory-optimized
```
---
## Network Performance
### Symptoms
- High network latency
- Packet loss
- Connection timeouts
- Bandwidth saturation
### Investigation Commands
```bash
# Check pod network statistics
kubectl exec <pod-name> -n <namespace> -- netstat -s
# Test network performance between pods
# Deploy netperf
kubectl run netperf-client --image=networkstatic/netperf --rm -it -- /bin/bash
# From client, run:
netperf -H <target-pod-ip> -t TCP_STREAM
netperf -H <target-pod-ip> -t TCP_RR # Request-response latency
# Check DNS resolution time (run `time` inside a shell so the builtin is used)
kubectl exec <pod-name> -n <namespace> -- \
  sh -c 'time nslookup service-name.namespace.svc.cluster.local'
# Check service mesh overhead (if using Istio)
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
curl -s localhost:15000/stats | grep "http.inbound\|http.outbound"
```
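For a quick latency-only check between pods, a throwaway debug pod is often enough. A sketch, assuming the cluster permits ad-hoc pods and ICMP between pods; the image tag and target IP are placeholders:
```bash
# Measure round-trip latency to another pod
kubectl run netcheck --rm -it --restart=Never --image=busybox:1.36 -- \
  ping -c 10 <target-pod-ip>
```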
### Check Network Policies
```bash
# List network policies
kubectl get networkpolicies -n <namespace>
# Check if policy is blocking traffic
kubectl describe networkpolicy <policy-name> -n <namespace>
# Temporarily remove policies to test (in non-production)
kubectl delete networkpolicy <policy-name> -n <namespace>
```
### Solutions
**DNS Optimization:**
```bash
# Use CoreDNS caching
# Increase CoreDNS replicas
kubectl scale deployment coredns -n kube-system --replicas=5
# Or use NodeLocal DNSCache
# https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
```
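Before scaling CoreDNS, it is worth confirming it is actually struggling. A sketch, assuming the standard `k8s-app=kube-dns` label (some distributions label CoreDNS differently):
```bash
# Look for errors, timeouts, and upstream failures in CoreDNS
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=200 | grep -i "error\|timeout"
# Check CoreDNS resource usage
kubectl top pods -n kube-system -l k8s-app=kube-dns
```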
**Optimize Service Mesh:**
```yaml
# Pod-template annotations to reduce Istio sidecar resources if over-provisioned
metadata:
  annotations:
    sidecar.istio.io/proxyCPU: "100m"
    sidecar.istio.io/proxyMemory: "128Mi"
    # Or disable sidecar injection for internal, trusted services
    sidecar.istio.io/inject: "false"
```
**Use HostNetwork for Network-Intensive Pods:**
```yaml
# Use with caution - bypasses pod networking
spec:
hostNetwork: true
dnsPolicy: ClusterFirstWithHostNet
```
**Enable Bandwidth Limits (QoS):**
```yaml
# Requires a CNI that supports the bandwidth plugin
metadata:
annotations:
kubernetes.io/ingress-bandwidth: "10M"
kubernetes.io/egress-bandwidth: "10M"
```
---
## Storage I/O Performance
### Symptoms
- Slow read/write operations
- High I/O wait
- Application timeouts during disk operations
- Database performance issues
### Investigation Commands
```bash
# Check I/O metrics on node
ssh <node> "iostat -x 1 10"
# Check disk usage
kubectl exec <pod-name> -n <namespace> -- df -h
# Check I/O wait from pod
kubectl exec <pod-name> -n <namespace> -- top
# Test storage performance
kubectl exec <pod-name> -n <namespace> -- \
dd if=/dev/zero of=/data/test bs=1M count=1024 conv=fdatasync
# Check PV performance class
kubectl get pv <pv-name> -o yaml | grep storageClassName
kubectl describe storageclass <storage-class-name>
```
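Slow volume attach/mount operations also show up as events, which helps distinguish storage-backend latency from in-pod I/O problems:
```bash
# Volume-related warnings for the workload's PVC and namespace
kubectl describe pvc <pvc-name> -n <namespace> | grep -A 10 "Events"
kubectl get events -n <namespace> --field-selector type=Warning | grep -i "volume\|mount"
```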
### Storage Benchmarking
**Deploy fio for benchmarking:**
```yaml
apiVersion: v1
kind: Pod
metadata:
name: fio-benchmark
spec:
containers:
- name: fio
image: ljishen/fio
command: ["/bin/sh", "-c"]
args:
- |
fio --name=seqread --rw=read --bs=1M --size=1G --runtime=60 --filename=/data/test
fio --name=seqwrite --rw=write --bs=1M --size=1G --runtime=60 --filename=/data/test
fio --name=randread --rw=randread --bs=4k --size=1G --runtime=60 --filename=/data/test
fio --name=randwrite --rw=randwrite --bs=4k --size=1G --runtime=60 --filename=/data/test
volumeMounts:
- name: data
mountPath: /data
volumes:
- name: data
persistentVolumeClaim:
claimName: test-pvc
```
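A minimal way to run the benchmark above and collect results; assumes the manifest is saved as `fio-benchmark.yaml` (hypothetical filename) and that `test-pvc` already exists:
```bash
# Run the fio pod, stream its output, then clean up
kubectl apply -f fio-benchmark.yaml -n <namespace>
kubectl logs -f fio-benchmark -n <namespace>
kubectl delete pod fio-benchmark -n <namespace>
```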
### Solutions
**Use Higher Performance Storage Class:**
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: high-performance-pvc
spec:
accessModes:
- ReadWriteOnce
storageClassName: gp3 # or io2, premium-rwo (GKE), etc.
resources:
requests:
storage: 100Gi
```
**Provision IOPS (AWS EBS io2):**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: io2-high-iops
provisioner: ebs.csi.aws.com
parameters:
type: io2
iops: "10000"
fsType: ext4
volumeBindingMode: WaitForFirstConsumer
```
**Use Local NVMe for Ultra-Low Latency:**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
name: local-pv
spec:
capacity:
storage: 100Gi
accessModes:
- ReadWriteOnce
persistentVolumeReclaimPolicy: Retain
storageClassName: local-nvme
local:
path: /mnt/disks/nvme0n1
nodeAffinity:
required:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- node-with-nvme
```
---
## Application-Level Metrics
### Expose Prometheus Metrics
**Add metrics endpoint to application:**
```yaml
apiVersion: v1
kind: Service
metadata:
name: app-metrics
annotations:
prometheus.io/scrape: "true"
prometheus.io/port: "8080"
prometheus.io/path: "/metrics"
spec:
selector:
app: myapp
ports:
- name: metrics
port: 8080
targetPort: 8080
```
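Before relying on scrape annotations, confirm the endpoint actually serves metrics; this assumes the application listens on port 8080 as configured above:
```bash
# Port-forward the metrics service and sample the output
kubectl port-forward svc/app-metrics 8080:8080 -n <namespace> &
sleep 2  # give the tunnel a moment to establish
curl -s localhost:8080/metrics | head -n 20
```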
### Key Metrics to Monitor
**Application metrics:**
- Request rate
- Request latency (p50, p95, p99)
- Error rate
- Active connections
- Queue depth
- Cache hit rate
**Example Prometheus queries:**
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Request rate
sum(rate(http_requests_total[5m]))
```
### Distributed Tracing
**Implement OpenTelemetry:**
```yaml
# Deploy Jaeger (all-in-one; suitable for development and testing)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:latest
        ports:
        - containerPort: 16686  # UI
        - containerPort: 14268  # Collector (HTTP)
```
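To reach the Jaeger UI locally for ad-hoc analysis (a Service would normally front it in a real deployment):
```bash
# Forward the Jaeger UI port and open http://localhost:16686
kubectl port-forward deploy/jaeger 16686:16686
```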
**Instrument application:**
- Add OpenTelemetry SDK to application
- Configure trace export to Jaeger
- Analyze end-to-end request traces to identify bottlenecks (a minimal export-wiring sketch follows)
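Most OpenTelemetry SDKs honor the standard exporter environment variables, so export can often be wired up without code changes. A sketch; the collector endpoint and deployment name are placeholders, and 4317 is the default OTLP gRPC port:
```bash
# Point the SDK at an OTLP-capable collector or backend
kubectl set env deployment/<deployment-name> -n <namespace> \
  OTEL_SERVICE_NAME=myapp \
  OTEL_EXPORTER_OTLP_ENDPOINT=http://otel-collector.observability:4317
```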
---
## Cluster-Wide Performance
### Cluster Resource Utilization
```bash
# Overall cluster capacity
kubectl top nodes
# Allocated requests/limits per node
kubectl describe nodes | grep -A 5 "Allocated resources"
# Resource requests vs limits
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.spec.containers[].resources)"'
```
### Control Plane Performance
```bash
# Check API server latency
kubectl get --raw /metrics | grep apiserver_request_duration_seconds
# Check etcd performance
kubectl exec -it -n kube-system etcd-<node> -- \
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
check perf
# Workqueue depth (these are the API server's own workqueues; kube-controller-manager
# serves its metrics separately on its secure port, 10257 by default)
kubectl get --raw /metrics | grep workqueue_depth
```
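etcd database size and leader status are also worth checking when control plane latency climbs; this assumes a kubeadm-style static-pod etcd, as in the command above:
```bash
# Endpoint status includes DB size, leader, and raft term
kubectl exec -it -n kube-system etcd-<node> -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint status --write-out=table
```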
### Scheduler Performance
```bash
# Check scheduling latency (kube-scheduler serves its metrics on its own secure
# port, 10259 by default, so these are usually scraped via Prometheus rather than
# the API server's /metrics; metric names vary by Kubernetes version)
kubectl get --raw /metrics | grep scheduler_scheduling_duration_seconds
# Check pending pods
kubectl get pods --all-namespaces --field-selector status.phase=Pending
# Scheduler logs
kubectl logs -n kube-system kube-scheduler-<node>
```
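For individual stuck pods, the scheduler records its reasoning as events on the pod, which is usually faster than reading scheduler logs:
```bash
# Why is this pod Pending? (insufficient resources, taints, affinity, etc.)
kubectl describe pod <pending-pod-name> -n <namespace> | grep -A 10 "Events"
```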
### Solutions for Cluster-Wide Issues
**Scale Control Plane:**
- Add more control plane nodes
- Increase API server replicas
- Tune etcd (increase memory, use SSD)
**Optimize Scheduling:**
- Use pod priority and preemption
- Implement pod topology spread constraints
- Use node affinity/anti-affinity appropriately
**Resource Management:**
- Set appropriate resource requests and limits
- Use LimitRanges and ResourceQuotas (see the quota sketch below)
- Implement VerticalPodAutoscaler for right-sizing
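As a starting point, namespace guardrails can be created imperatively; the quota name and values below are illustrative only:
```bash
# Create and inspect a namespace-level compute quota
kubectl create quota compute-quota -n <namespace> \
  --hard=requests.cpu=20,requests.memory=64Gi,limits.cpu=40,limits.memory=128Gi
kubectl describe resourcequota compute-quota -n <namespace>
```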
---
## Performance Optimization Checklist
### Application Level
- [ ] Implement connection pooling
- [ ] Enable response caching
- [ ] Optimize database queries
- [ ] Use async/non-blocking I/O
- [ ] Implement circuit breakers
- [ ] Profile and optimize hot paths
### Kubernetes Level
- [ ] Set appropriate resource requests/limits
- [ ] Use HPA for auto-scaling
- [ ] Implement readiness/liveness probes correctly
- [ ] Use anti-affinity for high-availability
- [ ] Optimize container image size
- [ ] Use multi-stage builds
### Infrastructure Level
- [ ] Use appropriate instance/node types
- [ ] Enable cluster autoscaling
- [ ] Use high-performance storage classes
- [ ] Optimize network topology
- [ ] Implement monitoring and alerting
- [ ] Run regular performance tests
---
## Monitoring Tools
**Essential tools:**
- **Prometheus + Grafana**: Metrics and dashboards
- **Jaeger/Zipkin**: Distributed tracing
- **kube-state-metrics**: Kubernetes object metrics
- **node-exporter**: Node-level metrics
- **cAdvisor**: Container metrics
- **kubectl-flamegraph**: CPU profiling
**Commercial options:**
- Datadog
- New Relic
- Dynatrace
- Elastic APM