# Kubernetes Performance Troubleshooting

Systematic approach to diagnosing and resolving Kubernetes performance issues.

## Table of Contents

1. [High Latency Issues](#high-latency-issues)
2. [CPU Performance](#cpu-performance)
3. [Memory Performance](#memory-performance)
4. [Network Performance](#network-performance)
5. [Storage I/O Performance](#storage-io-performance)
6. [Application-Level Metrics](#application-level-metrics)
7. [Cluster-Wide Performance](#cluster-wide-performance)

---

## High Latency Issues

### Symptoms
- Slow API response times
- Increased request latency
- Timeouts
- Degraded user experience

### Investigation Workflow

**1. Identify the layer with latency:**

```bash
# Check service mesh metrics (if using Istio/Linkerd)
kubectl top pods -n <namespace>

# Check ingress controller metrics
kubectl logs -n ingress-nginx <ingress-controller-pod> | grep "request_time"

# Check application logs for slow requests
kubectl logs <pod-name> -n <namespace> | grep -i "slow\|timeout\|latency"
```

**2. Profile application performance:**

```bash
# Get pod metrics
kubectl top pod <pod-name> -n <namespace>

# Check the pod's CPU requests/limits (throttling only happens when a limit is set)
kubectl get pod <pod-name> -n <namespace> -o json | \
  jq '.spec.containers[].resources'

# Exec into pod and check application-specific metrics
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Then: curl localhost:8080/metrics (if Prometheus metrics available)
```

**3. Check dependencies:**

```bash
# Test connectivity to downstream services
kubectl exec -it <pod-name> -n <namespace> -- \
  curl -w "@curl-format.txt" -o /dev/null -s http://backend-service

# curl-format.txt content:
#     time_namelookup: %{time_namelookup}\n
#        time_connect: %{time_connect}\n
#     time_appconnect: %{time_appconnect}\n
#    time_pretransfer: %{time_pretransfer}\n
#       time_redirect: %{time_redirect}\n
#  time_starttransfer: %{time_starttransfer}\n
#          time_total: %{time_total}\n
```
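
If creating `curl-format.txt` inside the pod is awkward, the same timing fields can be passed inline. A minimal sketch (target URL as in the example above):

```bash
# One-shot timing without a format file
kubectl exec -it <pod-name> -n <namespace> -- \
  curl -o /dev/null -s -w 'dns=%{time_namelookup} connect=%{time_connect} ttfb=%{time_starttransfer} total=%{time_total}\n' \
  http://backend-service
```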

### Common Causes and Solutions

**CPU Throttling:**
```yaml
# Increase CPU limits or remove limits for bursty workloads
resources:
  requests:
    cpu: "500m"    # typical steady-state usage
  limits:
    cpu: "2000m"   # burst capacity (or remove the limit entirely)
```

**Insufficient Replicas:**
```bash
# Scale up deployment
kubectl scale deployment <deployment-name> -n <namespace> --replicas=5

# Or enable HPA
kubectl autoscale deployment <deployment-name> -n <namespace> \
  --cpu-percent=70 \
  --min=2 \
  --max=10
```

**Slow Dependencies:**
```yaml
# Implement circuit breakers and timeouts in application
# Or use service mesh policies (Istio example):
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-circuit-breaker
spec:
  host: backend-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```
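
For the request timeouts mentioned in the comment above, a companion sketch (assuming Istio; the host and values are illustrative) is a `VirtualService` that fails fast instead of letting slow calls stack up:

```yaml
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: backend-timeouts
spec:
  hosts:
  - backend-service
  http:
  - route:
    - destination:
        host: backend-service
    timeout: 2s          # overall per-request timeout
    retries:
      attempts: 2
      perTryTimeout: 1s
```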

---

## CPU Performance

### Symptoms
- High CPU usage
- Throttling
- Slow processing
- Queue buildup

### Investigation Commands

```bash
# Check CPU usage
kubectl top nodes
kubectl top pods -n <namespace>

# Check configured CPU requests/limits (throttling only applies when a limit is set)
kubectl get pod <pod-name> -n <namespace> -o json | \
  jq '.spec.containers[].resources'

# Get detailed CPU metrics (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | jq

# Check container-level CPU from node (SSH to node)
ssh <node> "docker stats --no-stream"   # or "crictl stats" on containerd-based nodes
```
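
The resource spec above only shows configured limits; actual throttling shows up in the container's cgroup counters. A sketch (cgroup v1 vs v2 paths differ depending on node setup):

```bash
# cgroup v2 nodes: look for nr_throttled / throttled_usec
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat

# cgroup v1 nodes: same counters under the cpu controller
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu/cpu.stat
```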

### Advanced CPU Profiling

**Enable CPU profiling in application:**

```bash
# For Go applications with pprof
kubectl port-forward <pod-name> 6060:6060 -n <namespace>

# Capture CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof

# Analyze with pprof
go tool pprof -http=:8080 cpu.prof
```

**For Java applications:**

```bash
# Use async-profiler
kubectl exec -it <pod-name> -n <namespace> -- \
  /profiler.sh -d 30 -f /tmp/flamegraph.html 1

# Copy flamegraph
kubectl cp <namespace>/<pod-name>:/tmp/flamegraph.html ./flamegraph.html
```

### Solutions

**Vertical Scaling:**
```yaml
resources:
  requests:
    cpu: "1000m"  # Increased from 500m
  limits:
    cpu: "2000m"  # Increased from 1000m
```

**Horizontal Scaling:**
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```

**Remove CPU Limits for Bursty Workloads:**
```yaml
# Allow bursting to available CPU
resources:
  requests:
    cpu: "500m"
  # No limits - can use all available CPU
```

---

## Memory Performance

### Symptoms
- OOMKilled pods
- Memory leaks
- Slow garbage collection
- Swap usage (if enabled)

### Investigation Commands

```bash
# Check memory usage
kubectl top nodes
kubectl top pods -n <namespace>

# Check memory limits and requests
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits\|Requests"

# Check OOM kills
kubectl get pods -n <namespace> -o json | \
  jq '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | .metadata.name'

# Detailed memory breakdown (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | \
  jq '.containers[] | {name, usage: .usage.memory}'
```

### Memory Profiling

**Heap dump for Java:**
```bash
# Capture heap dump
kubectl exec <pod-name> -n <namespace> -- \
  jmap -dump:format=b,file=/tmp/heapdump.hprof 1

# Copy heap dump
kubectl cp <namespace>/<pod-name>:/tmp/heapdump.hprof ./heapdump.hprof

# Analyze with Eclipse MAT or VisualVM
```

**Memory profiling for Go:**
```bash
# Capture heap profile
kubectl port-forward <pod-name> 6060:6060 -n <namespace>
curl http://localhost:6060/debug/pprof/heap > heap.prof

# Analyze
go tool pprof -http=:8080 heap.prof
```

### Solutions

**Increase Memory Limits:**
```yaml
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "2Gi"  # Increased from 1Gi
```

**Optimize Application:**
- Fix memory leaks
- Implement connection pooling
- Optimize caching strategies
- Tune garbage collection (see the sketch below)
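
As a concrete example for the garbage-collection item, a hedged sketch of sizing a JVM heap relative to the container's memory limit (standard JVM flags; the percentage and limits are illustrative):

```yaml
# Container spec fragment
env:
- name: JAVA_TOOL_OPTIONS
  value: "-XX:MaxRAMPercentage=75.0 -XX:+UseG1GC"   # heap capped at ~75% of the container limit
resources:
  requests:
    memory: "1Gi"
  limits:
    memory: "2Gi"
```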

**Use Memory-Optimized Node Pools:**
```yaml
# Node affinity for memory-intensive workloads
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: workload-type
          operator: In
          values:
          - memory-optimized
```

---

## Network Performance

### Symptoms
- High network latency
- Packet loss
- Connection timeouts
- Bandwidth saturation

### Investigation Commands

```bash
# Check pod network statistics
kubectl exec <pod-name> -n <namespace> -- netstat -s

# Test network performance between pods
# Deploy netperf
kubectl run netperf-client --image=networkstatic/netperf --rm -it -- /bin/bash

# From client, run:
netperf -H <target-pod-ip> -t TCP_STREAM
netperf -H <target-pod-ip> -t TCP_RR   # Request-response latency

# Check DNS resolution time (wrap in a shell so `time` is available)
kubectl exec <pod-name> -n <namespace> -- \
  sh -c 'time nslookup service-name.namespace.svc.cluster.local'

# Check service mesh overhead (if using Istio)
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  curl -s localhost:15000/stats | grep "http.inbound\|http.outbound"
```

### Check Network Policies

```bash
# List network policies
kubectl get networkpolicies -n <namespace>

# Check if policy is blocking traffic
kubectl describe networkpolicy <policy-name> -n <namespace>

# Temporarily remove policies to test (in non-production)
kubectl delete networkpolicy <policy-name> -n <namespace>
```

### Solutions

**DNS Optimization:**
```bash
# Use CoreDNS caching
# Increase CoreDNS replicas
kubectl scale deployment coredns -n kube-system --replicas=5

# Or use NodeLocal DNSCache
# https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
```
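
Another common lever is lowering the pod's `ndots` setting so single-label lookups don't fan out through every search domain. A sketch (the value is illustrative):

```yaml
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"
```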

**Optimize Service Mesh:**
```yaml
# Pod-template annotations - reduce Istio sidecar resources if over-provisioned
sidecar.istio.io/proxyCPU: "100m"
sidecar.istio.io/proxyMemory: "128Mi"

# Or disable sidecar injection for internal, trusted services
sidecar.istio.io/inject: "false"
```

**Use HostNetwork for Network-Intensive Pods:**
```yaml
# Use with caution - bypasses pod networking
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
```

**Enable Bandwidth Limits (QoS):**
```yaml
metadata:
  annotations:
    kubernetes.io/ingress-bandwidth: "10M"
    kubernetes.io/egress-bandwidth: "10M"
```

---

## Storage I/O Performance

### Symptoms
- Slow read/write operations
- High I/O wait
- Application timeouts during disk operations
- Database performance issues

### Investigation Commands

```bash
# Check I/O metrics on node
ssh <node> "iostat -x 1 10"

# Check disk usage
kubectl exec <pod-name> -n <namespace> -- df -h

# Check I/O wait from pod
kubectl exec <pod-name> -n <namespace> -- top

# Test storage performance
kubectl exec <pod-name> -n <namespace> -- \
  dd if=/dev/zero of=/data/test bs=1M count=1024 conv=fdatasync

# Check PV performance class
kubectl get pv <pv-name> -o yaml | grep storageClassName
kubectl describe storageclass <storage-class-name>
```

### Storage Benchmarking

**Deploy fio for benchmarking:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fio-benchmark
spec:
  containers:
  - name: fio
    image: ljishen/fio
    command: ["/bin/sh", "-c"]
    args:
    - |
      fio --name=seqread --rw=read --bs=1M --size=1G --runtime=60 --filename=/data/test
      fio --name=seqwrite --rw=write --bs=1M --size=1G --runtime=60 --filename=/data/test
      fio --name=randread --rw=randread --bs=4k --size=1G --runtime=60 --filename=/data/test
      fio --name=randwrite --rw=randwrite --bs=4k --size=1G --runtime=60 --filename=/data/test
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
```

### Solutions

**Use Higher Performance Storage Class:**
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: high-performance-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3  # or io2, premium-rwo (GKE), etc.
  resources:
    requests:
      storage: 100Gi
```

**Provision IOPS (AWS EBS io2):**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io2-high-iops
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iops: "10000"
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer
```

**Use Local NVMe for Ultra-Low Latency:**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme
  local:
    path: /mnt/disks/nvme0n1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-with-nvme
```

---

## Application-Level Metrics

### Expose Prometheus Metrics

**Add metrics endpoint to application:**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: app-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: myapp
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
```
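
The `prometheus.io/*` annotations only take effect if the Prometheus scrape config honors them. Where the cluster runs the Prometheus Operator instead, a `ServiceMonitor` sketch (assumes the Service above also carries an `app: myapp` label):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app-metrics
spec:
  selector:
    matchLabels:
      app: myapp
  endpoints:
  - port: metrics
    interval: 15s
```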

### Key Metrics to Monitor

**Application metrics:**
- Request rate
- Request latency (p50, p95, p99)
- Error rate
- Active connections
- Queue depth
- Cache hit rate

**Example Prometheus queries:**
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Request rate
sum(rate(http_requests_total[5m]))
```

### Distributed Tracing

**Implement OpenTelemetry:**
```yaml
# Deploy Jaeger
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:latest
        ports:
        - containerPort: 16686  # UI
        - containerPort: 14268  # Collector
```

**Instrument application:**
- Add OpenTelemetry SDK to application
- Configure trace export to Jaeger (see the sketch below)
- Analyze end-to-end request traces to identify bottlenecks
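
A sketch of the export configuration using the standard OpenTelemetry SDK environment variables (the collector address and sampling ratio are illustrative):

```yaml
# Container spec fragment
env:
- name: OTEL_SERVICE_NAME
  value: "myapp"
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: "http://jaeger-collector.observability.svc:4317"
- name: OTEL_TRACES_SAMPLER
  value: "parentbased_traceidratio"
- name: OTEL_TRACES_SAMPLER_ARG
  value: "0.1"
```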

---

## Cluster-Wide Performance

### Cluster Resource Utilization

```bash
# Overall cluster capacity
kubectl top nodes

# Total resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Resource requests vs limits
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.spec.containers[].resources)"'
```

### Control Plane Performance

```bash
# Check API server latency
kubectl get --raw /metrics | grep apiserver_request_duration_seconds

# Check etcd performance
kubectl exec -it -n kube-system etcd-<node> -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  check perf

# Controller manager metrics
kubectl get --raw /metrics | grep workqueue_depth
```

### Scheduler Performance

```bash
# Check scheduler latency
kubectl get --raw /metrics | grep scheduler_scheduling_duration_seconds

# Check pending pods
kubectl get pods --all-namespaces --field-selector status.phase=Pending

# Scheduler logs
kubectl logs -n kube-system kube-scheduler-<node>
```

### Solutions for Cluster-Wide Issues

**Scale Control Plane:**
- Add more control plane nodes
- Increase API server replicas
- Tune etcd (increase memory, use SSD)

**Optimize Scheduling:**
- Use pod priority and preemption
- Implement pod topology spread constraints (see the sketch below)
- Use node affinity/anti-affinity appropriately
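
A sketch of a topology spread constraint that keeps replicas evenly distributed across zones (label values are illustrative):

```yaml
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: topology.kubernetes.io/zone
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: myapp
```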

**Resource Management:**
- Set appropriate resource requests and limits
- Use LimitRanges and ResourceQuotas (see the sketch below)
- Implement VerticalPodAutoscaler for right-sizing
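
A sketch of namespace guardrails using a ResourceQuota plus a LimitRange (all values illustrative):

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
spec:
  limits:
  - type: Container
    defaultRequest:
      cpu: 100m
      memory: 128Mi
    default:
      cpu: 500m
      memory: 512Mi
```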

---

## Performance Optimization Checklist

### Application Level
- [ ] Implement connection pooling
- [ ] Enable response caching
- [ ] Optimize database queries
- [ ] Use async/non-blocking I/O
- [ ] Implement circuit breakers
- [ ] Profile and optimize hot paths

### Kubernetes Level
- [ ] Set appropriate resource requests/limits
- [ ] Use HPA for auto-scaling
- [ ] Implement readiness/liveness probes correctly (see the sketch below)
- [ ] Use anti-affinity for high-availability
- [ ] Optimize container image size
- [ ] Use multi-stage builds
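
For the probes item above, a sketch of reasonable defaults (paths, port, and thresholds are illustrative):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  failureThreshold: 3
```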

### Infrastructure Level
- [ ] Use appropriate instance/node types
- [ ] Enable cluster autoscaling
- [ ] Use high-performance storage classes
- [ ] Optimize network topology
- [ ] Implement monitoring and alerting
- [ ] Regular performance testing

---

## Monitoring Tools

**Essential tools:**
- **Prometheus + Grafana**: Metrics and dashboards
- **Jaeger/Zipkin**: Distributed tracing
- **kube-state-metrics**: Kubernetes object metrics
- **node-exporter**: Node-level metrics
- **cAdvisor**: Container metrics
- **kubectl-flame**: In-cluster CPU flame graphs

**Commercial options:**
- Datadog
- New Relic
- Dynatrace
- Elastic APM