Initial commit

Zhongwei Li
2025-11-29 17:51:20 +08:00
commit ad81bc571f
11 changed files with 3746 additions and 0 deletions

# Common Kubernetes Issues and Solutions
This reference provides detailed information about common Kubernetes issues, their root causes, and remediation steps.
## Table of Contents
1. [Pod Issues](#pod-issues)
2. [Node Issues](#node-issues)
3. [Networking Issues](#networking-issues)
4. [Storage Issues](#storage-issues)
5. [Resource Issues](#resource-issues)
6. [Security Issues](#security-issues)
---
## Pod Issues
### ImagePullBackOff / ErrImagePull
**Symptoms:**
- Pod stuck in `ImagePullBackOff` or `ErrImagePull` state
- Events show image pull errors
**Common Causes:**
1. Image doesn't exist or tag is wrong
2. Registry authentication failure
3. Network issues reaching registry
4. Rate limiting from registry
5. Private registry without imagePullSecrets
**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```
**Remediation Steps:**
1. Verify image name and tag: `docker pull <image>` from local machine
2. Check imagePullSecrets exist: `kubectl get secrets -n <namespace>`
3. Verify secret has correct registry credentials
4. Check if registry is accessible from cluster
5. For rate limiting: use imagePullSecrets with an authenticated registry account (see the sketch after this list)
6. Update deployment with correct image path
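For steps 2, 3, and 5, a minimal sketch of wiring up private-registry credentials; the secret name `regcred` and the registry address are placeholders to adapt:
```bash
# Create a registry credential secret (placeholder values)
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>

# Attach it to the namespace's default service account so pods pick it up automatically,
# or reference it under imagePullSecrets in the pod spec instead
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```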
**Prevention:**
- Use specific image tags instead of `latest`
- Implement image pull secrets for private registries
- Set up local registry or cache to reduce external pulls
- Use image validation in CI/CD pipeline
---
### CrashLoopBackOff
**Symptoms:**
- Pod repeatedly crashes and restarts
- Increasing restart count
- Container exits shortly after starting
**Common Causes:**
1. Application error on startup
2. Missing environment variables or config
3. Resource limits too restrictive
4. Liveness probe too aggressive
5. Missing dependencies (DB, cache, etc.)
6. Command/args misconfiguration
**Diagnostic Commands:**
```bash
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl describe pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o yaml
```
**Remediation Steps:**
1. Check current logs: `kubectl logs <pod-name>`
2. Check previous container logs: `kubectl logs <pod-name> --previous`
3. Look for startup errors, stack traces, or missing config messages
4. Verify environment variables are set correctly
5. Check if external dependencies are accessible
6. Review and adjust resource limits if OOMKilled
7. Adjust the liveness probe's `initialDelaySeconds`, or add a startup probe, if it fails before the app is ready (see the sketch after this list)
8. Verify command and args in pod spec
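For step 7, a hedged probe configuration that gives a slow-starting container room to come up; the port, path, and timings are assumptions to adapt:
```yaml
containers:
  - name: app
    startupProbe:              # holds off the liveness probe until the app is up
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 30     # up to 30 x 5s = 150s of startup time
    livenessProbe:
      httpGet:
        path: /healthz
        port: 8080
      initialDelaySeconds: 15
      periodSeconds: 10
      failureThreshold: 3
```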
**Prevention:**
- Implement proper application health checks and readiness
- Use init containers for dependency checks
- Set appropriate resource requests/limits based on profiling
- Configure liveness probes with sufficient delay
- Add retry logic and graceful degradation to applications
---
### Pending Pods
**Symptoms:**
- Pod stuck in `Pending` state
- Not scheduled to any node
**Common Causes:**
1. Insufficient cluster resources (CPU/memory)
2. Node selectors/affinity rules can't be satisfied
3. No nodes match taints/tolerations
4. PersistentVolume not available
5. Resource quotas exceeded
6. Scheduler not running or misconfigured
**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get nodes -o wide
kubectl describe nodes
kubectl get pv,pvc -n <namespace>
kubectl get resourcequota -n <namespace>
```
**Remediation Steps:**
1. Check pod events for scheduling failure reason
2. Verify node resources: `kubectl describe nodes | grep -A 5 "Allocated resources"`
3. Check if pod has node selectors: verify nodes have required labels
4. Review taints on nodes and tolerations on the pod (see the sketch after this list)
5. For PVC issues: verify PV exists and is in `Available` state
6. Check namespace resource quota: `kubectl describe resourcequota -n <namespace>`
7. If no resources: scale cluster or reduce resource requests
8. If affinity issue: adjust affinity rules or add matching nodes
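For steps 3 and 4, the pod spec has to line up with the node labels and taints; a sketch where the `disktype=ssd` label and `dedicated=batch:NoSchedule` taint are purely illustrative:
```yaml
spec:
  nodeSelector:
    disktype: ssd                # pod only schedules on nodes labeled disktype=ssd
  tolerations:
    - key: "dedicated"
      operator: "Equal"
      value: "batch"
      effect: "NoSchedule"       # matches: kubectl taint nodes <node> dedicated=batch:NoSchedule
```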
**Prevention:**
- Set appropriate resource requests (not just limits)
- Monitor cluster capacity and scale proactively
- Use pod disruption budgets to prevent total unavailability
- Implement cluster autoscaling
- Use multiple node pools for different workload types
---
### OOMKilled Pods
**Symptoms:**
- Pod terminates with exit code 137
- Container status shows `OOMKilled` reason
- Frequent restarts due to memory exhaustion
**Common Causes:**
1. Memory limit set too low
2. Memory leak in application
3. Unexpected load increase
4. No memory limits (using node's memory)
**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> --previous -n <namespace>
kubectl top pod <pod-name> -n <namespace>
```
**Remediation Steps:**
1. Confirm OOMKilled in pod events or status
2. Check memory usage before crash if metrics available
3. Review application logs for memory-intensive operations
4. Increase memory limit if application legitimately needs more
5. Profile application to identify memory leaks
6. Set memory requests equal to limits for the Guaranteed QoS class (see the sketch after this list)
7. Consider horizontal scaling instead of vertical
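For step 6, setting requests equal to limits places the pod in the Guaranteed QoS class, making it the last candidate for eviction under node memory pressure; the numbers here are placeholders:
```yaml
resources:
  requests:
    cpu: "500m"
    memory: "1Gi"
  limits:
    cpu: "500m"      # equal to requests -> Guaranteed QoS
    memory: "1Gi"
```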
**Prevention:**
- Profile applications to determine actual memory needs
- Set memory requests based on normal usage, limits with headroom
- Implement application-level memory monitoring
- Use horizontal pod autoscaling based on memory metrics
- Regular load testing to understand resource requirements
---
## Node Issues
### Node NotReady
**Symptoms:**
- Node status shows `NotReady`
- Pods on node may be evicted
- Node may be cordoned automatically
**Common Causes:**
1. Kubelet stopped or crashed
2. Network connectivity issues
3. Disk pressure
4. Memory pressure
5. PID pressure
6. Container runtime issues
**Diagnostic Commands:**
```bash
kubectl describe node <node-name>
kubectl get nodes -o wide
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
ssh <node> "df -h"
ssh <node> "free -m"
```
**Remediation Steps:**
1. Check node conditions in describe output
2. Verify kubelet is running: `systemctl status kubelet`
3. Check kubelet logs: `journalctl -u kubelet`
4. For disk pressure: clean up unused images/containers
5. For memory pressure: identify and stop memory-hogging processes
6. Restart kubelet if crashed: `systemctl restart kubelet`
7. Check container runtime: `systemctl status docker` or `containerd`
8. Verify network connectivity to API server
**Prevention:**
- Monitor node resources with alerts
- Implement log rotation and image cleanup
- Set up node problem detector
- Use resource quotas to prevent resource exhaustion
- Regular maintenance windows for updates
---
### Disk Pressure
**Symptoms:**
- Node condition `DiskPressure` is True
- Pod evictions may occur
- Node may become NotReady
**Common Causes:**
1. Docker/containerd image cache filling disk
2. Container logs consuming space
3. Ephemeral storage usage by pods
4. System logs filling up
**Diagnostic Commands:**
```bash
kubectl describe node <node-name> | grep -A 10 Conditions
ssh <node> "df -h"
ssh <node> "du -sh /var/lib/docker/*"
ssh <node> "du -sh /var/lib/containerd/*"
ssh <node> "docker system df"
```
**Remediation Steps:**
1. Clean up unused images: `docker image prune -a`
2. Clean up stopped containers: `docker container prune`
3. Clean up volumes: `docker volume prune`
4. Rotate and compress logs
5. Check for pods with excessive ephemeral storage usage
6. Expand disk if consistently running out of space
7. Tune kubelet image garbage collection and eviction thresholds (see the sketch after this list)
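For step 7, image garbage collection and eviction thresholds live in the kubelet configuration; a sketch with illustrative values (file location and defaults vary by distribution):
```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 80   # start pruning images when disk usage crosses 80%
imageGCLowThresholdPercent: 70    # prune until usage drops below 70%
evictionHard:
  imagefs.available: "10%"
  nodefs.available: "10%"
```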
**Prevention:**
- Set imagefs.available threshold in kubelet config
- Implement automated image pruning
- Use log rotation for container logs
- Monitor disk usage with alerts
- Set ephemeral-storage limits on pods
- Size nodes appropriately for workload
---
## Networking Issues
### Pod-to-Pod Communication Failure
**Symptoms:**
- Services can't reach other services
- Connection timeouts between pods
- DNS resolution may or may not work
**Common Causes:**
1. Network policy blocking traffic
2. CNI plugin issues
3. Firewall rules blocking traffic
4. Service misconfiguration
5. Pod CIDR exhaustion
**Diagnostic Commands:**
```bash
kubectl get networkpolicies --all-namespaces
kubectl exec -it <pod> -- ping <target-pod-ip>
kubectl exec -it <pod> -- nslookup <service-name>
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
```
**Remediation Steps:**
1. Test basic connectivity: exec into pod and ping target
2. Check network policies: look for policies affecting the source and destination namespaces (an allow-rule sketch follows this list)
3. Verify service has endpoints: `kubectl get endpoints`
4. Check if pod labels match service selector
5. Verify CNI plugin pods are healthy (usually in kube-system)
6. Check node-level firewall rules
7. Verify pod CIDR hasn't been exhausted
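If step 2 turns up a default-deny policy, traffic must be allowed explicitly; a minimal sketch permitting ingress to `app: backend` pods from `app: frontend` pods in the same namespace (labels and port are assumptions):
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      app: backend
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```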
**Prevention:**
- Document network policies and their intent
- Use namespace labels for network policy management
- Monitor CNI plugin health
- Regularly audit network policies
- Implement network policy testing in CI/CD
---
### Service Not Accessible
**Symptoms:**
- Cannot connect to service
- LoadBalancer stuck in Pending
- NodePort not accessible
**Common Causes:**
1. Service has no endpoints (no matching pods)
2. Pods not passing readiness checks
3. LoadBalancer controller not working
4. Cloud provider integration issues
5. Port conflicts
**Diagnostic Commands:**
```bash
kubectl get svc <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
kubectl get pods -l <service-selector> -n <namespace>
kubectl logs -l <service-selector> -n <namespace>
```
**Remediation Steps:**
1. Verify service type and ports are correct
2. Check if service has endpoints: `kubectl get endpoints`
3. If no endpoints: verify pod selector matches pod labels
4. Check pod readiness: `kubectl describe pod`
5. For LoadBalancer: check cloud provider controller logs
6. For NodePort: verify node firewall allows the port
7. Test from within cluster first, then external access
**Prevention:**
- Use meaningful service selectors and pod labels
- Implement proper readiness probes
- Test services in staging before production
- Monitor service endpoint counts
- Document external access requirements
---
## Storage Issues
### PVC Pending
**Symptoms:**
- PersistentVolumeClaim stuck in `Pending` state
- Pod can't start due to volume not available
**Common Causes:**
1. No PV matches PVC requirements
2. StorageClass doesn't exist or is misconfigured
3. Dynamic provisioner not working
4. Insufficient permissions for provisioner
5. Volume capacity exhausted in storage backend
**Diagnostic Commands:**
```bash
kubectl describe pvc <pvc-name> -n <namespace>
kubectl get pv
kubectl get storageclass
kubectl describe storageclass <storage-class-name>
kubectl get pods -n <provisioner-namespace>
```
**Remediation Steps:**
1. Check PVC events for specific error message
2. Verify StorageClass exists: `kubectl get sc`
3. Check if dynamic provisioner pod is running
4. For static provisioning: ensure a PV exists with matching size, access mode, and storageClassName (see the sketch after this list)
5. Verify provisioner has correct cloud credentials/permissions
6. Check storage backend capacity
7. Review StorageClass parameters for typos
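For step 4, a hedged static-provisioning pair; the PV must satisfy the PVC's size, access mode, and storageClassName for binding to happen (the NFS backend and all names are placeholders):
```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: data-pv
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  storageClassName: manual
  nfs:                             # any supported volume source works here
    server: nfs.example.com
    path: /exports/data
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-pvc
  namespace: <namespace>
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: manual         # must match the PV for the claim to bind
  resources:
    requests:
      storage: 10Gi
```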
**Prevention:**
- Use dynamic provisioning with reliable StorageClass
- Monitor storage backend capacity
- Set up alerts for PVC binding failures
- Test storage provisioning in non-production first
- Document storage requirements and limitations
---
### Volume Mount Failures
**Symptoms:**
- Pod fails to start with volume mount errors
- Events show mounting failures
- Container create errors related to volumes
**Common Causes:**
1. Volume already mounted to different node (RWO with multi-attach)
2. Volume doesn't exist
3. Insufficient permissions
4. Node can't reach storage backend
5. Filesystem issues on volume
**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
kubectl describe pv <pv-name>
kubectl get volumeattachments
```
**Remediation Steps:**
1. Check pod events for specific mount error
2. For RWO volumes: ensure pod is scheduled to node with volume attached
3. Verify PVC is bound to a PV
4. Check node can reach storage backend (cloud/NFS/iSCSI)
5. For CSI volumes: check CSI driver pods are healthy
6. Delete and recreate pod if volume is stuck in multi-attach state
7. Check filesystem on volume if accessible
**Prevention:**
- Use ReadWriteMany for multi-pod access scenarios
- Use pod disruption budgets to control voluntary disruptions of volume-backed workloads
- Monitor volume attachment status
- Use StatefulSets for stateful workloads with volumes
- Regular backup and restore testing
---
## Resource Issues
### Resource Quota Exceeded
**Symptoms:**
- New pods fail to schedule
- Error: "exceeded quota"
- ResourceQuota limits preventing resource allocation
**Common Causes:**
1. Namespace resource quota exceeded
2. Cumulative requests/limits of existing workloads already consume the available quota
3. Resource requests not specified on pods
4. Quota misconfiguration
**Diagnostic Commands:**
```bash
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
kubectl get pods -n <namespace> -o json | jq '.items[].spec.containers[].resources'
kubectl describe namespace <namespace>
```
**Remediation Steps:**
1. Check current quota usage: `kubectl describe resourcequota`
2. Identify pods consuming resources
3. Either increase the quota limits or reduce resource requests (a ResourceQuota sketch follows this list)
4. Delete unused resources to free up quota
5. Optimize pod resource requests based on actual usage
6. Consider splitting workloads across namespaces
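If the quota itself needs raising (step 3), it is a namespaced ResourceQuota object; the figures below are placeholders:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-quota
  namespace: <namespace>
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 40Gi
    limits.cpu: "40"
    limits.memory: 80Gi
    pods: "50"
```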
**Prevention:**
- Set quotas based on actual usage patterns
- Monitor quota usage with alerts
- Right-size pod resource requests
- Implement automatic cleanup of completed jobs/pods
- Regular quota review and adjustment
---
### CPU Throttling
**Symptoms:**
- Application performance degradation
- High CPU throttling metrics
- Services responding slowly despite available CPU
**Common Causes:**
1. CPU limits set too low
2. Burst workloads hitting limits
3. Noisy neighbor effects
4. CPU limits set without understanding workload
**Diagnostic Commands:**
```bash
kubectl top pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat   # nr_throttled / throttled_time (cgroup v1; on cgroup v2: /sys/fs/cgroup/cpu.stat)
```
**Remediation Steps:**
1. Check current CPU usage vs limits
2. Review throttling metrics if available
3. Increase CPU limits if application legitimately needs more
4. Remove CPU limits if workload is bursty (keep requests)
5. Use HPA if load varies significantly
6. Profile application to identify CPU-intensive operations
**Prevention:**
- Set CPU requests based on average usage
- Set CPU limits with 50-100% headroom for bursts
- Use HPA for variable workloads
- Monitor CPU throttling metrics
- Regular performance testing and profiling
---
## Security Issues
### Image Vulnerability
**Symptoms:**
- Security scanner reports vulnerabilities
- Compliance violations
- Known CVEs in running images
**Diagnostic Commands:**
```bash
kubectl get pods -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u
trivy image <image-name>
```
**Remediation Steps:**
1. Identify vulnerable images with scanner
2. Update base images to patched versions
3. Rebuild application images with updated dependencies
4. Update deployments with new image tags
5. Implement an admission controller to block non-compliant images (see the sketch after this list)
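One way to implement step 5 is a Kyverno policy that only admits images from a trusted registry; a sketch, assuming Kyverno is installed and `registry.example.com` stands in for your registry:
```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: restrict-image-registries
spec:
  validationFailureAction: Enforce
  rules:
    - name: trusted-registry-only
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: "Images must be pulled from registry.example.com."
        pattern:
          spec:
            containers:
              - image: "registry.example.com/*"
```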
**Prevention:**
- Scan images in CI/CD pipeline
- Regular base image updates
- Use minimal base images
- Implement admission controllers (OPA, Kyverno)
- Automated image updates and testing
---
### RBAC Permission Denied
**Symptoms:**
- Users or service accounts can't perform operations
- "forbidden" errors in logs
- API calls fail with 403 errors
**Common Causes:**
1. Missing Role or ClusterRole binding
2. Overly restrictive RBAC policies
3. Wrong service account in pod
4. Namespace-scoped role for cluster-wide resource
**Diagnostic Commands:**
```bash
kubectl auth can-i <verb> <resource> --as=<user>
kubectl auth can-i <verb> <resource> --as=system:serviceaccount:<namespace>:<sa-name>
kubectl get rolebindings -n <namespace>
kubectl get clusterrolebindings
kubectl describe role <role-name> -n <namespace>
kubectl describe serviceaccount <sa-name> -n <namespace>
```
**Remediation Steps:**
1. Identify exact permission needed from error message
2. Check what user/SA can do: `kubectl auth can-i --list`
3. Create appropriate Role/ClusterRole with needed permissions
4. Create a RoleBinding/ClusterRoleBinding for the user or service account (see the sketch after this list)
5. Verify service account is correctly set in pod spec
6. Test with `kubectl auth can-i` before deploying
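A minimal Role/RoleBinding pair for steps 3-4, granting a service account read access to pods in one namespace; all names are placeholders:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: <namespace>
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: <namespace>
subjects:
  - kind: ServiceAccount
    name: <sa-name>
    namespace: <namespace>
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```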
**Prevention:**
- Follow principle of least privilege
- Use namespace-scoped roles where possible
- Document RBAC policies and their purpose
- Regular RBAC audits
- Use pre-defined roles when possible
---
This reference covers the most common Kubernetes issues. For each issue, always:
1. Gather information (describe, logs, events)
2. Form hypothesis based on symptoms
3. Test hypothesis with targeted diagnostics
4. Apply remediation
5. Verify fix
6. Document for future reference

# Helm Troubleshooting Guide
Comprehensive guide for troubleshooting Helm releases, charts, and deployments.
## Table of Contents
1. [Helm Release Issues](#helm-release-issues)
2. [Chart Installation Failures](#chart-installation-failures)
3. [Upgrade and Rollback Problems](#upgrade-and-rollback-problems)
4. [Values and Configuration](#values-and-configuration)
5. [Chart Dependencies](#chart-dependencies)
6. [Hooks and Lifecycle](#hooks-and-lifecycle)
7. [Repository Issues](#repository-issues)
---
## Helm Release Issues
### Release Stuck in Pending-Install/Pending-Upgrade
**Symptoms:**
- Release shows status `pending-install` or `pending-upgrade`
- New installations or upgrades hang indefinitely
- `helm list` shows release but resources not created
**Diagnostic Commands:**
```bash
# Check release status
helm list -n <namespace>
helm status <release-name> -n <namespace>
# Check release history
helm history <release-name> -n <namespace>
# Get detailed release information
kubectl get secrets -n <namespace> -l owner=helm,status=pending-install
# Check helm operation pods/jobs
kubectl get pods -n <namespace> -l app.kubernetes.io/managed-by=Helm
```
**Common Causes:**
1. Previous installation failed and wasn't cleaned up
2. Helm hooks stuck or failed
3. Kubernetes resources can't be created (RBAC, quotas)
4. Timeout during installation
**Resolution:**
```bash
# Check for stuck hooks
kubectl get jobs -n <namespace>
helm get hooks <release-name> -n <namespace>   # hooks carry the helm.sh/hook annotation, not a label
# Delete stuck hooks
kubectl delete job <hook-job> -n <namespace>
# Rollback to previous version
helm rollback <release-name> -n <namespace>
# Force delete release (last resort)
helm delete <release-name> -n <namespace> --no-hooks
# Clean up secrets
kubectl delete secret -n <namespace> -l owner=helm,name=<release-name>
```
### Release Shows as Deployed but Resources Missing
**Symptoms:**
- `helm list` shows release as deployed
- Expected pods/services not running
- Resources partially created
**Investigation:**
```bash
# Get manifest from release
helm get manifest <release-name> -n <namespace>
# Compare with what's actually deployed
helm get manifest <release-name> -n <namespace> | kubectl diff -f -
# Check helm values used
helm get values <release-name> -n <namespace>
# Check for resource creation errors
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```
**Resolution:**
```bash
# Reapply the release
helm upgrade <release-name> <chart> -n <namespace> --reuse-values
# Or force recreate
helm upgrade <release-name> <chart> -n <namespace> --force
```
---
## Chart Installation Failures
### "Release Already Exists" Error
**Symptoms:**
```
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
```
**Resolution:**
```bash
# Check existing releases
helm list -n <namespace> --all
# Check for failed releases
helm list -n <namespace> --failed
# Uninstall existing release
helm uninstall <release-name> -n <namespace>
# Or use different release name
helm install <new-release-name> <chart> -n <namespace>
```
### "Invalid Chart" Error
**Symptoms:**
```
Error: INSTALLATION FAILED: chart requires kubeVersion: >=1.20.0 which is incompatible with Kubernetes v1.19.0
```
**Investigation:**
```bash
# Check chart requirements
helm show chart <chart-name>
# Check Kubernetes version
kubectl version   # the Server Version line is the cluster version
# Inspect chart contents
helm pull <chart> --untar
cat <chart>/Chart.yaml
```
**Resolution:**
- Upgrade Kubernetes cluster to meet requirements
- Use compatible chart version: `helm install <release> <chart> --version <compatible-version>`
- Modify Chart.yaml kubeVersion requirement (not recommended)
### Template Rendering Errors
**Symptoms:**
```
Error: INSTALLATION FAILED: template: <chart>/templates/deployment.yaml:10:4:
executing "<chart>/templates/deployment.yaml" at <.Values.invalid.path>:
nil pointer evaluating interface {}.path
```
**Investigation:**
```bash
# Render templates locally
helm template <release-name> <chart> -n <namespace>
# Render with your values
helm template <release-name> <chart> -f values.yaml -n <namespace>
# Debug mode
helm install <release-name> <chart> -n <namespace> --debug --dry-run
```
**Resolution:**
1. Check values.yaml for missing required fields
2. Verify template syntax in chart
3. Use `helm lint` to validate chart
```bash
# Lint chart
helm lint <chart-directory>
# Lint with values
helm lint <chart-directory> -f values.yaml
```
---
## Upgrade and Rollback Problems
### Upgrade Fails with "Rendered Manifests Contain Errors"
**Symptoms:**
```
Error: UPGRADE FAILED: unable to build kubernetes objects from release manifest
```
**Investigation:**
```bash
# Dry run upgrade
helm upgrade <release-name> <chart> -n <namespace> --dry-run --debug
# Compare differences
helm diff upgrade <release-name> <chart> -n <namespace> # requires helm-diff plugin
```
**Resolution:**
```bash
# Fix values.yaml and retry
helm upgrade <release-name> <chart> -n <namespace> -f fixed-values.yaml
# Skip tests if test hooks failing
helm upgrade <release-name> <chart> -n <namespace> --no-hooks
# Force upgrade
helm upgrade <release-name> <chart> -n <namespace> --force
```
### Rollback Fails
**Symptoms:**
```
Error: ROLLBACK FAILED: release <release-name> failed, and has been rolled back due to atomic being set
```
**Investigation:**
```bash
# Check release history
helm history <release-name> -n <namespace>
# Get specific revision manifest
helm get manifest <release-name> --revision <revision-number> -n <namespace>
```
**Resolution:**
```bash
# Rollback to specific working revision
helm rollback <release-name> <revision-number> -n <namespace>
# If rollback fails, uninstall and reinstall
helm uninstall <release-name> -n <namespace>
helm install <release-name> <chart> -n <namespace> -f values.yaml
# Clean up failed rollback
kubectl get secrets -n <namespace> -l owner=helm,status=pending-rollback
kubectl delete secret <secret-name> -n <namespace>
```
### "Immutable Field" Error During Upgrade
**Symptoms:**
```
Error: UPGRADE FAILED: Service "myapp" is invalid: spec.clusterIP: Invalid value: "": field is immutable
```
**Common immutable fields:**
- Service `clusterIP`
- StatefulSet `volumeClaimTemplates`
- PVC `storageClassName`
**Resolution:**
```bash
# Option 1: Delete and recreate resource
kubectl delete service <service-name> -n <namespace>
helm upgrade <release-name> <chart> -n <namespace>
# Option 2: Use --force to recreate resources
helm upgrade <release-name> <chart> -n <namespace> --force
# Option 3: Manually patch the resource
kubectl patch service <service-name> -n <namespace> --type='json' \
-p='[{"op": "remove", "path": "/spec/clusterIP"}]'
```
---
## Values and Configuration
### Values Not Applied
**Symptoms:**
- Deployed resources don't reflect values in values.yaml
- Default chart values used instead of custom values
**Investigation:**
```bash
# Check what values were used in deployment
helm get values <release-name> -n <namespace>
# Check all values (including defaults)
helm get values <release-name> -n <namespace> --all
# Test rendering with your values
helm template <release-name> <chart> -f values.yaml -n <namespace> | less
```
**Resolution:**
```bash
# Ensure values file is specified correctly
helm upgrade <release-name> <chart> -n <namespace> -f values.yaml
# Use multiple values files (later files override earlier)
helm upgrade <release-name> <chart> -n <namespace> \
-f values-common.yaml \
-f values-prod.yaml
# Set specific values via CLI
helm upgrade <release-name> <chart> -n <namespace> \
--set image.tag=v2.0.0 \
--set replicas=5
```
### Values File Parsing Errors
**Symptoms:**
```
Error: INSTALLATION FAILED: YAML parse error
```
**Investigation:**
```bash
# Validate YAML syntax
yamllint values.yaml
# Or use Python
python3 -c 'import yaml; yaml.safe_load(open("values.yaml"))'
# Check for tabs (YAML doesn't allow tabs)
grep -n $'\t' values.yaml
```
**Resolution:**
- Fix YAML syntax errors
- Replace tabs with spaces
- Ensure proper indentation
- Quote special characters in strings
### Secret Values Not Working
**Symptoms:**
- Secrets not created or contain wrong values
- Base64 encoding issues
**Investigation:**
```bash
# Check secret in manifest
helm get manifest <release-name> -n <namespace> | grep -A 10 "kind: Secret"
# Decode secret
kubectl get secret <secret-name> -n <namespace> -o json | \
jq '.data | map_values(@base64d)'
```
**Resolution:**
```yaml
# Plain value in values.yaml - the chart template is expected to base64-encode it (e.g. with b64enc)
secrets:
  password: "mySecretPassword"
# Or pre-encode it yourself if the template inserts the value into .data verbatim
secrets:
  password: "bXlTZWNyZXRQYXNzd29yZA=="   # base64 of "mySecretPassword"
```
---
## Chart Dependencies
### Dependency Update Fails
**Symptoms:**
```
Error: An error occurred while checking for chart dependencies
```
**Investigation:**
```bash
# Check Chart.yaml dependencies
cat Chart.yaml
# List current dependencies
helm dependency list <chart-directory>
# Check repository access
helm repo list
helm repo update
```
**Resolution:**
```bash
# Update dependencies
helm dependency update <chart-directory>
# Build dependencies (downloads to charts/)
helm dependency build <chart-directory>
# Add missing repositories
helm repo add <repo-name> <repo-url>
helm repo update
```
### Dependency Version Conflicts
**Symptoms:**
```
Error: found in Chart.yaml, but missing in charts/ directory
```
**Resolution:**
```bash
# Clean dependencies
rm -rf <chart-directory>/charts/*
rm -f <chart-directory>/Chart.lock
# Rebuild
helm dependency update <chart-directory>
# Verify
helm dependency list <chart-directory>
```
### Subchart Values Not Applied
**Investigation:**
```bash
# Check subchart values in parent chart
cat values.yaml | grep -A 20 <subchart-name>
# Render to see what values subchart receives
helm template <release-name> <chart> -f values.yaml | grep -A 50 "# Source: <subchart-name>"
```
**Resolution:**
```yaml
# In parent chart's values.yaml, nest subchart values under subchart name:
postgresql:          # subchart name
  auth:
    username: myuser
    password: mypass
    database: mydb
  primary:
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
```
---
## Hooks and Lifecycle
### Pre/Post Hooks Failing
**Symptoms:**
- Installation/upgrade hangs waiting for hooks
- Hook jobs fail
- Release stuck in pending state
**Investigation:**
```bash
# List hooks defined by the release (helm.sh/hook is an annotation, so label selectors won't match it)
helm get hooks <release-name> -n <namespace>
kubectl get jobs -n <namespace>
# Check hook status
kubectl describe job <hook-job-name> -n <namespace>
# Get hook logs
kubectl logs job/<hook-job-name> -n <namespace>
```
**Resolution:**
```bash
# Delete failed hooks
kubectl delete job <hook-job-name> -n <namespace>
# Retry without hooks
helm upgrade <release-name> <chart> -n <namespace> --no-hooks
# Or skip hooks during install
helm install <release-name> <chart> -n <namespace> --no-hooks
```
### Hook Cleanup Issues
**Symptoms:**
- Hook resources remain after installation
- Accumulating failed hook jobs
**Investigation:**
```bash
# Check hook deletion policy (hooks are not part of the manifest; use helm get hooks)
helm get hooks <release-name> -n <namespace> | grep -B 5 "helm.sh/hook-delete-policy"
# List leftover hook resources by name (helm.sh/hook is an annotation, so a label selector won't find them)
kubectl get jobs,pods -n <namespace>
```
**Resolution:**
```bash
# Manual cleanup
kubectl delete job <hook-job-name> -n <namespace>   # delete leftover hook resources by name
# Update chart template to include proper hook-delete-policy:
#   metadata:
#     annotations:
#       "helm.sh/hook": pre-install
#       "helm.sh/hook-delete-policy": hook-succeeded,hook-failed
```
---
## Repository Issues
### Unable to Get Chart from Repository
**Symptoms:**
```
Error: failed to download "<chart-name>"
```
**Investigation:**
```bash
# Check repository configuration
helm repo list
# Update repositories
helm repo update
# Search for chart
helm search repo <chart-name> --versions
# Test repository access
curl -I <repo-url>/index.yaml
```
**Resolution:**
```bash
# Remove and re-add repository
helm repo remove <repo-name>
helm repo add <repo-name> <repo-url>
helm repo update
# For private repos, configure credentials
helm repo add <repo-name> <repo-url> \
--username=<username> \
--password=<password>
# Or use OCI registry
helm pull oci://registry.example.com/charts/<chart-name> --version 1.0.0
```
### Chart Version Not Found
**Symptoms:**
```
Error: chart "<chart-name>" version "1.2.3" not found
```
**Investigation:**
```bash
# List available versions
helm search repo <chart-name> --versions
# Check if specific version exists
helm show chart <repo-name>/<chart-name> --version 1.2.3
```
**Resolution:**
```bash
# Use available version
helm install <release-name> <repo-name>/<chart-name> --version <available-version>
# Or use latest
helm install <release-name> <repo-name>/<chart-name>
```
---
## Debugging Tools and Commands
### Essential Helm Commands
```bash
# Get release information
helm list -n <namespace> --all
helm status <release-name> -n <namespace>
helm history <release-name> -n <namespace>
# Get release content
helm get values <release-name> -n <namespace>
helm get manifest <release-name> -n <namespace>
helm get hooks <release-name> -n <namespace>
helm get notes <release-name> -n <namespace>
# Debugging
helm install <release-name> <chart> --debug --dry-run -n <namespace>
helm template <release-name> <chart> --debug -n <namespace>
# Testing
helm test <release-name> -n <namespace>
helm lint <chart-directory>
```
### Useful Plugins
```bash
# Install helm-diff plugin
helm plugin install https://github.com/databus23/helm-diff
# Compare releases
helm diff upgrade <release-name> <chart> -n <namespace>
# Install helm-secrets plugin
helm plugin install https://github.com/jkroepke/helm-secrets
# Use encrypted values
helm secrets install <release-name> <chart> -f secrets.yaml -n <namespace>
```
### Helm Environment Issues
**Check Helm configuration:**
```bash
# Helm version
helm version
# Kubernetes context
kubectl config current-context
# Helm environment
helm env
# Cache location
helm env | grep CACHE
```
---
## Best Practices
### Release Management
- Use descriptive release names
- Always specify namespace explicitly
- Use `--atomic` flag for safer upgrades (rolls back on failure)
- Keep stored release history manageable: `helm upgrade ... --history-max 10` (full example below)
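Putting those practices together, a typical deploy command looks roughly like this (release name, chart path, and values file are placeholders):
```bash
helm upgrade myapp ./charts/myapp \
  --install \
  --namespace myapp --create-namespace \
  --atomic --wait --timeout 10m \
  --history-max 10 \
  -f values-prod.yaml
```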
### Values Management
- Use multiple values files for different environments
- Version control your values files
- Use `helm template` to preview changes before applying
- Document required values in chart README
### Chart Development
- Always run `helm lint` before packaging
- Test charts in multiple environments
- Use semantic versioning for charts
- Implement proper hooks with deletion policies
### Troubleshooting Workflow
1. Check release status: `helm status <release> -n <namespace>`
2. Check history: `helm history <release> -n <namespace>`
3. Get values: `helm get values <release> -n <namespace>`
4. Check manifest: `helm get manifest <release> -n <namespace>`
5. Check Kubernetes events: `kubectl get events -n <namespace>`
6. Check pod logs: `kubectl logs <pod> -n <namespace>`
7. Check hooks: `helm get hooks <release> -n <namespace>` and the corresponding jobs
---
## Quick Reference
### Common Flags
```bash
# Installation/Upgrade
--atomic # Rollback on failure
--wait # Wait for resources to be ready
--timeout 10m # Set timeout (default 5m)
--force # Force update by deleting and recreating resources
--cleanup-on-fail # Delete new resources created by a failed upgrade
# Debugging
--debug # Enable verbose output
--dry-run # Simulate operation
--no-hooks # Skip hooks
# Values
-f values.yaml # Use values file
--set key=value # Set value via command line
--reuse-values # Reuse values from previous release
```
### Typical Rescue Commands
```bash
# Release stuck? Force delete and reinstall
helm uninstall <release> -n <namespace> --no-hooks
kubectl delete secret -n <namespace> -l owner=helm,name=<release>
helm install <release> <chart> -n <namespace> -f values.yaml
# Upgrade failed? Rollback
helm rollback <release> 0 -n <namespace> # 0 = previous revision
# Can't rollback? Force upgrade
helm upgrade <release> <chart> -n <namespace> --force --recreate-pods   # --recreate-pods is deprecated in Helm 3
# Complete cleanup
helm uninstall <release> -n <namespace>
kubectl delete namespace <namespace> # If dedicated namespace
```

# Kubernetes Incident Response Playbook
This playbook provides structured procedures for responding to Kubernetes incidents.
## Incident Response Framework
### 1. Detection Phase
- Identify the incident (alerts, user reports, monitoring)
- Determine severity level
- Initiate incident response
### 2. Triage Phase
- Assess impact and scope
- Gather initial diagnostic data
- Determine if immediate action needed
### 3. Investigation Phase
- Collect comprehensive diagnostics
- Identify root cause
- Document findings
### 4. Resolution Phase
- Apply remediation
- Verify fix
- Monitor for recurrence
### 5. Post-Incident Phase
- Document incident
- Conduct blameless post-mortem
- Implement preventive measures
---
## Severity Levels
### SEV-1: Critical
- Complete service outage
- Data loss or corruption
- Security breach
- Impact: All users affected
- Response: Immediate, all-hands
### SEV-2: High
- Major functionality degraded
- Significant performance impact
- Impact: Large subset of users
- Response: Within 15 minutes
### SEV-3: Medium
- Minor functionality impaired
- Workaround available
- Impact: Some users affected
- Response: Within 1 hour
### SEV-4: Low
- Cosmetic issues
- Negligible impact
- Impact: Minimal
- Response: During business hours
---
## Common Incident Scenarios
### Scenario 1: Complete Cluster Outage
**Symptoms:**
- All services unreachable
- kubectl commands timing out
- API server not responding
**Immediate Actions:**
1. Verify the scope (single cluster or multi-cluster)
2. Check API server status and logs
3. Check control plane nodes
4. Verify network connectivity to control plane
5. Check etcd cluster health
**Investigation Steps:**
```bash
# Check control plane pods
kubectl get pods -n kube-system
# Check etcd
kubectl exec -it etcd-<node> -n kube-system -- etcdctl \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key endpoint health
# Check API server logs (systemd-managed setups; for static-pod control planes, check the kube-apiserver container logs on the node)
journalctl -u kube-apiserver -n 100
# Check control plane node resources
ssh <control-plane-node> "top"
```
**Common Causes:**
- etcd cluster failure
- API server OOM/crash
- Control plane network partition
- Certificate expiration
- Cloud provider outage
**Resolution Paths:**
1. etcd issue: Restore from backup or rebuild cluster
2. API server issue: Restart API server pods/service
3. Network: Fix routing, security groups, or DNS
4. Certificates: renew certificates (`kubeadm certs renew all`)
---
### Scenario 2: Service Degradation
**Symptoms:**
- Increased latency or error rates
- Some requests failing
- Intermittent issues
**Immediate Actions:**
1. Check service metrics and logs
2. Verify pod health and count
3. Check for recent deployments
4. Review resource utilization
**Investigation Steps:**
```bash
# Check service endpoints
kubectl get endpoints <service> -n <namespace>
# Check pod status
kubectl get pods -l <service-selector> -n <namespace>
# Review recent changes
kubectl rollout history deployment/<name> -n <namespace>
# Check resource usage
kubectl top pods -n <namespace>
# Get recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```
**Common Causes:**
- Insufficient replicas
- Pod restarts/crashes
- Resource contention
- Bad deployment
- External dependency failure
**Resolution Paths:**
1. Scale up replicas if needed
2. Roll back the bad deployment (commands below)
3. Increase resources if constrained
4. Fix configuration issues
5. Implement circuit breaker for external deps
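For resolution paths 1 and 2, the usual kubectl commands; deployment name, namespace, replica count, and revision are placeholders:
```bash
# Scale up replicas
kubectl scale deployment/<name> -n <namespace> --replicas=5

# Roll back the most recent rollout
kubectl rollout undo deployment/<name> -n <namespace>

# Or roll back to a specific revision from the history
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=3

# Watch the rollout converge
kubectl rollout status deployment/<name> -n <namespace>
```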
---
### Scenario 3: Node Failure
**Symptoms:**
- Node reported as NotReady
- Pods being evicted from node
- High node resource utilization
**Immediate Actions:**
1. Identify affected node
2. Check impact (which pods running on node)
3. Determine if pods need immediate migration
4. Assess if node is recoverable
**Investigation Steps:**
```bash
# Get node status
kubectl get nodes
# Describe the problem node
kubectl describe node <node-name>
# Check pods on the node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
# SSH to node and check
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
ssh <node> "docker ps" # or containerd
ssh <node> "df -h"
ssh <node> "free -m"
```
**Common Causes:**
- Kubelet failure
- Disk full
- Memory exhaustion
- Network issues
- Hardware failure
**Resolution Paths:**
1. Recoverable: Fix issue (clean disk, restart services)
2. Not recoverable: cordon, drain, and replace the node (commands below)
3. For critical pods: Manually reschedule if necessary
4. Update monitoring and alerting based on findings
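If the node is written off (path 2), the standard sequence is cordon, drain, then remove; the flags shown are the ones typically needed on nodes running DaemonSets and emptyDir volumes:
```bash
# Stop new pods from landing on the node
kubectl cordon <node-name>

# Evict workloads (DaemonSet pods stay; emptyDir data is discarded)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60

# Remove the node object once it has been replaced or decommissioned
kubectl delete node <node-name>
```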
---
### Scenario 4: Storage Issues
**Symptoms:**
- PVCs stuck in Pending
- Pods can't start due to volume issues
- Data access failures
**Immediate Actions:**
1. Identify affected PVCs/PVs
2. Check storage backend health
3. Verify provisioner status
4. Assess data integrity risk
**Investigation Steps:**
```bash
# Check PVC status
kubectl get pvc --all-namespaces
# Describe pending PVC
kubectl describe pvc <pvc-name> -n <namespace>
# Check PV status
kubectl get pv
# Check storage class
kubectl get storageclass
# Check provisioner
kubectl get pods -n <storage-namespace>
# Check volume attachments
kubectl get volumeattachments
```
**Common Causes:**
- Storage backend failure/full
- Provisioner issues
- Network to storage backend
- Volume attachment limits reached
- Corrupted volume
**Resolution Paths:**
1. Fix storage backend issues
2. Restart provisioner if needed
3. Manually provision PV if dynamic provisioning failed
4. Delete and recreate if volume corrupted
5. Restore from backup if data lost
---
### Scenario 5: Security Incident
**Symptoms:**
- Unauthorized access detected
- Suspicious pod behavior
- Security alerts triggered
- Unusual network traffic
**Immediate Actions:**
1. Assess severity and scope
2. Isolate affected resources
3. Preserve evidence
4. Engage security team
**Investigation Steps:**
```bash
# Check recent RBAC changes
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json
# Audit pod security contexts
kubectl get pods --all-namespaces -o json | jq '.items[].spec.securityContext'
# Check for privileged pods
kubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.containers[].securityContext.privileged==true)'
# Review service accounts
kubectl get serviceaccounts --all-namespaces
# Get audit logs
cat /var/log/kubernetes/audit/audit.log | grep <suspicious-activity>
```
**Common Causes:**
- Compromised credentials
- Vulnerable container image
- Misconfigured RBAC
- Exposed secrets
- Supply chain attack
**Resolution Paths:**
1. Isolate: apply network policies, cordon nodes (see the sketch after this list)
2. Investigate: Audit logs, pod logs, network flows
3. Remediate: Rotate credentials, patch vulnerabilities
4. Restore: From known-good state if needed
5. Prevent: Enhanced security policies, monitoring
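For the isolate step, a deny-all NetworkPolicy applied to suspect pods cuts their traffic while evidence is preserved; the `quarantine=true` label is an assumption, and enforcement requires a CNI that supports NetworkPolicy:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-suspect-pods
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      quarantine: "true"   # label suspect pods first: kubectl label pod <pod> quarantine=true
  policyTypes:
    - Ingress
    - Egress
  # no ingress/egress rules listed -> all traffic to and from the selected pods is denied
```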
---
## Diagnostic Commands Cheat Sheet
### Quick Health Check
```bash
# Overall cluster health
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
# Component status (older clusters)
kubectl get componentstatuses
# Recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
```
### Pod Diagnostics
```bash
# Pod details
kubectl describe pod <pod> -n <namespace>
kubectl get pod <pod> -n <namespace> -o yaml
# Logs
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl logs <pod> -c <container> -n <namespace>
# Interactive debugging
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl debug <pod> -it --image=busybox -n <namespace>
```
### Node Diagnostics
```bash
# Node details
kubectl describe node <node>
kubectl get node <node> -o yaml
# Resource usage
kubectl top nodes
kubectl top pods --all-namespaces
# Node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
```
### Service & Network Diagnostics
```bash
# Service details
kubectl describe svc <service> -n <namespace>
kubectl get endpoints <service> -n <namespace>
# Network policies
kubectl get networkpolicies --all-namespaces
# DNS testing
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot
# Then: nslookup <service>.<namespace>.svc.cluster.local
```
### Storage Diagnostics
```bash
# PVC and PV status
kubectl get pvc,pv --all-namespaces
# Storage class
kubectl get storageclass
kubectl describe storageclass <storage-class>
# Volume attachments
kubectl get volumeattachments
```
---
## Communication During Incidents
### Internal Communication
- Use dedicated incident channel
- Regular status updates (every 30 min)
- Clear roles (incident commander, scribe, experts)
- Document all actions taken
### External Communication
- Status page updates
- Customer notifications
- Clear expected resolution time
- Updates on progress
### Post-Incident Communication
- Incident report
- Root cause analysis
- Remediation steps taken
- Prevention measures
---
## Post-Incident Review Template
### Incident Summary
- Date and time
- Duration
- Severity
- Services affected
- User impact
### Timeline
- Detection time
- Response time
- Resolution time
- Key events during incident
### Root Cause
- What happened
- Why it happened
- Contributing factors
### Resolution
- What fixed the issue
- Who fixed it
- How long it took
### Lessons Learned
- What went well
- What could be improved
- Action items with owners
### Prevention
- Technical changes
- Process improvements
- Monitoring enhancements
- Documentation updates
---
## Best Practices
### Prevention
- Regular cluster audits
- Proactive monitoring and alerting
- Capacity planning
- Regular disaster recovery drills
- Automated backups
- Security scanning and policies
### Preparedness
- Document runbooks
- Practice incident response
- Keep contact lists updated
- Maintain up-to-date diagrams
- Pre-provision debugging tools
### Response
- Follow structured approach
- Document everything
- Communicate clearly
- Don't panic
- Think before acting
- Preserve evidence
### Recovery
- Verify fix thoroughly
- Monitor for recurrence
- Update documentation
- Conduct post-mortem
- Implement preventive measures

# Kubernetes Performance Troubleshooting
Systematic approach to diagnosing and resolving Kubernetes performance issues.
## Table of Contents
1. [High Latency Issues](#high-latency-issues)
2. [CPU Performance](#cpu-performance)
3. [Memory Performance](#memory-performance)
4. [Network Performance](#network-performance)
5. [Storage I/O Performance](#storage-io-performance)
6. [Application-Level Metrics](#application-level-metrics)
7. [Cluster-Wide Performance](#cluster-wide-performance)
---
## High Latency Issues
### Symptoms
- Slow API response times
- Increased request latency
- Timeouts
- Degraded user experience
### Investigation Workflow
**1. Identify the layer with latency:**
```bash
# Check service mesh metrics (if using Istio/Linkerd)
kubectl top pods -n <namespace>
# Check ingress controller metrics
kubectl logs -n ingress-nginx <ingress-controller-pod> | grep "request_time"
# Check application logs for slow requests
kubectl logs <pod-name> -n <namespace> | grep -i "slow\|timeout\|latency"
```
**2. Profile application performance:**
```bash
# Get pod metrics
kubectl top pod <pod-name> -n <namespace>
# Check if pod is CPU throttled
kubectl get pod <pod-name> -n <namespace> -o json | \
jq '.spec.containers[].resources'
# Exec into pod and check application-specific metrics
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Then: curl localhost:8080/metrics (if Prometheus metrics available)
```
**3. Check dependencies:**
```bash
# Test connectivity to downstream services
kubectl exec -it <pod-name> -n <namespace> -- \
curl -w "@curl-format.txt" -o /dev/null -s http://backend-service
# curl-format.txt content:
# time_namelookup: %{time_namelookup}\n
# time_connect: %{time_connect}\n
# time_appconnect: %{time_appconnect}\n
# time_pretransfer: %{time_pretransfer}\n
# time_redirect: %{time_redirect}\n
# time_starttransfer: %{time_starttransfer}\n
# time_total: %{time_total}\n
```
### Common Causes and Solutions
**CPU Throttling:**
```yaml
# Increase CPU limits or remove limits for bursty workloads
resources:
  requests:
    cpu: "500m"     # what the pod typically needs
  limits:
    cpu: "2000m"    # burst capacity (or remove limits entirely)
```
**Insufficient Replicas:**
```bash
# Scale up deployment
kubectl scale deployment <deployment-name> -n <namespace> --replicas=5
# Or enable HPA
kubectl autoscale deployment <deployment-name> \
--cpu-percent=70 \
--min=2 \
--max=10
```
**Slow Dependencies:**
```yaml
# Implement circuit breakers and timeouts in application
# Or use service mesh policies (Istio example):
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-circuit-breaker
spec:
  host: backend-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```
---
## CPU Performance
### Symptoms
- High CPU usage
- Throttling
- Slow processing
- Queue buildup
### Investigation Commands
```bash
# Check CPU usage
kubectl top nodes
kubectl top pods -n <namespace>
# Check CPU throttling
kubectl get pod <pod-name> -n <namespace> -o json | \
jq '.spec.containers[].resources'
# Get detailed CPU metrics (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | jq
# Check container-level CPU from node (SSH to node)
ssh <node> "docker stats --no-stream"
```
### Advanced CPU Profiling
**Enable CPU profiling in application:**
```bash
# For Go applications with pprof
kubectl port-forward <pod-name> 6060:6060 -n <namespace>
# Capture CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof
# Analyze with pprof
go tool pprof -http=:8080 cpu.prof
```
**For Java applications:**
```bash
# Use async-profiler
kubectl exec -it <pod-name> -n <namespace> -- \
/profiler.sh -d 30 -f /tmp/flamegraph.html 1
# Copy flamegraph
kubectl cp <namespace>/<pod-name>:/tmp/flamegraph.html ./flamegraph.html
```
### Solutions
**Vertical Scaling:**
```yaml
resources:
  requests:
    cpu: "1000m"   # increased from 500m
  limits:
    cpu: "2000m"   # increased from 1000m
```
**Horizontal Scaling:**
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```
**Remove CPU Limits for Bursty Workloads:**
```yaml
# Allow bursting to available CPU
resources:
  requests:
    cpu: "500m"
  # No limits - the container can use all available CPU
```
---
## Memory Performance
### Symptoms
- OOMKilled pods
- Memory leaks
- Slow garbage collection
- Swap usage (if enabled)
### Investigation Commands
```bash
# Check memory usage
kubectl top nodes
kubectl top pods -n <namespace>
# Check memory limits and requests
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits\|Requests"
# Check OOM kills
kubectl get pods -n <namespace> -o json | \
jq '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | .metadata.name'
# Detailed memory breakdown (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | \
jq '.containers[] | {name, usage: .usage.memory}'
```
### Memory Profiling
**Heap dump for Java:**
```bash
# Capture heap dump
kubectl exec <pod-name> -n <namespace> -- \
jmap -dump:format=b,file=/tmp/heapdump.hprof 1
# Copy heap dump
kubectl cp <namespace>/<pod-name>:/tmp/heapdump.hprof ./heapdump.hprof
# Analyze with Eclipse MAT or VisualVM
```
**Memory profiling for Go:**
```bash
# Capture heap profile
kubectl port-forward <pod-name> 6060:6060 -n <namespace>
curl http://localhost:6060/debug/pprof/heap > heap.prof
# Analyze
go tool pprof -http=:8080 heap.prof
```
### Solutions
**Increase Memory Limits:**
```yaml
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "2Gi"   # increased from 1Gi
```
**Optimize Application:**
- Fix memory leaks
- Implement connection pooling
- Optimize caching strategies
- Tune garbage collection
**Use Memory-Optimized Node Pools:**
```yaml
# Node affinity for memory-intensive workloads
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: workload-type
              operator: In
              values:
                - memory-optimized
```
---
## Network Performance
### Symptoms
- High network latency
- Packet loss
- Connection timeouts
- Bandwidth saturation
### Investigation Commands
```bash
# Check pod network statistics
kubectl exec <pod-name> -n <namespace> -- netstat -s
# Test network performance between pods
# Deploy netperf
kubectl run netperf-client --image=networkstatic/netperf --rm -it -- /bin/bash
# From client, run:
netperf -H <target-pod-ip> -t TCP_STREAM
netperf -H <target-pod-ip> -t TCP_RR # Request-response latency
# Check DNS resolution time
kubectl exec <pod-name> -n <namespace> -- \
time nslookup service-name.namespace.svc.cluster.local
# Check service mesh overhead (if using Istio)
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
curl -s localhost:15000/stats | grep "http.inbound\|http.outbound"
```
### Check Network Policies
```bash
# List network policies
kubectl get networkpolicies -n <namespace>
# Check if policy is blocking traffic
kubectl describe networkpolicy <policy-name> -n <namespace>
# Temporarily remove policies to test (in non-production)
kubectl delete networkpolicy <policy-name> -n <namespace>
```
### Solutions
**DNS Optimization:**
```bash
# Use CoreDNS caching
# Increase CoreDNS replicas
kubectl scale deployment coredns -n kube-system --replicas=5
# Or use NodeLocal DNSCache
# https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
```
**Optimize Service Mesh:**
```yaml
# Pod annotations: reduce Istio sidecar resources if over-provisioned
sidecar.istio.io/proxyCPU: "100m"
sidecar.istio.io/proxyMemory: "128Mi"
# Or disable sidecar injection for internal, trusted services (label or annotation, depending on Istio version)
sidecar.istio.io/inject: "false"
```
**Use HostNetwork for Network-Intensive Pods:**
```yaml
# Use with caution - bypasses pod networking
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
```
**Enable Bandwidth Limits (QoS):**
```yaml
metadata:
  annotations:
    kubernetes.io/ingress-bandwidth: "10M"
    kubernetes.io/egress-bandwidth: "10M"
```
---
## Storage I/O Performance
### Symptoms
- Slow read/write operations
- High I/O wait
- Application timeouts during disk operations
- Database performance issues
### Investigation Commands
```bash
# Check I/O metrics on node
ssh <node> "iostat -x 1 10"
# Check disk usage
kubectl exec <pod-name> -n <namespace> -- df -h
# Check I/O wait from pod
kubectl exec <pod-name> -n <namespace> -- top
# Test storage performance
kubectl exec <pod-name> -n <namespace> -- \
dd if=/dev/zero of=/data/test bs=1M count=1024 conv=fdatasync
# Check PV performance class
kubectl get pv <pv-name> -o yaml | grep storageClassName
kubectl describe storageclass <storage-class-name>
```
### Storage Benchmarking
**Deploy fio for benchmarking:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fio-benchmark
spec:
  containers:
    - name: fio
      image: ljishen/fio
      command: ["/bin/sh", "-c"]
      args:
        - |
          fio --name=seqread --rw=read --bs=1M --size=1G --runtime=60 --filename=/data/test
          fio --name=seqwrite --rw=write --bs=1M --size=1G --runtime=60 --filename=/data/test
          fio --name=randread --rw=randread --bs=4k --size=1G --runtime=60 --filename=/data/test
          fio --name=randwrite --rw=randwrite --bs=4k --size=1G --runtime=60 --filename=/data/test
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: test-pvc
```
### Solutions
**Use Higher Performance Storage Class:**
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: high-performance-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: gp3   # or io2, premium-rwo (GKE), etc.
  resources:
    requests:
      storage: 100Gi
```
**Provision IOPS (AWS EBS io2):**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io2-high-iops
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iops: "10000"
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer
```
**Use Local NVMe for Ultra-Low Latency:**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme
  local:
    path: /mnt/disks/nvme0n1
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values:
                - node-with-nvme
```
---
## Application-Level Metrics
### Expose Prometheus Metrics
**Add metrics endpoint to application:**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: app-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: myapp
  ports:
    - name: metrics
      port: 8080
      targetPort: 8080
```
### Key Metrics to Monitor
**Application metrics:**
- Request rate
- Request latency (p50, p95, p99)
- Error rate
- Active connections
- Queue depth
- Cache hit rate
**Example Prometheus queries:**
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
# Request rate
sum(rate(http_requests_total[5m]))
```
### Distributed Tracing
**Implement OpenTelemetry:**
```yaml
# Deploy Jaeger (all-in-one image, suitable for testing)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:latest
          ports:
            - containerPort: 16686   # UI
            - containerPort: 14268   # Collector
```
**Instrument application:**
- Add OpenTelemetry SDK to application
- Configure trace export to Jaeger
- Analyze end-to-end request traces to identify bottlenecks
---
## Cluster-Wide Performance
### Cluster Resource Utilization
```bash
# Overall cluster capacity
kubectl top nodes
# Total resources
kubectl describe nodes | grep -A 5 "Allocated resources"
# Resource requests vs limits
kubectl get pods --all-namespaces -o json | \
jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.spec.containers[].resources)"'
```
### Control Plane Performance
```bash
# Check API server latency
kubectl get --raw /metrics | grep apiserver_request_duration_seconds
# Check etcd performance
kubectl exec -it -n kube-system etcd-<node> -- \
etcdctl --endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key \
check perf
# Controller manager metrics
kubectl get --raw /metrics | grep workqueue_depth
```
### Scheduler Performance
```bash
# Check scheduler latency
kubectl get --raw /metrics | grep scheduler_scheduling_duration_seconds
# Check pending pods
kubectl get pods --all-namespaces --field-selector status.phase=Pending
# Scheduler logs
kubectl logs -n kube-system kube-scheduler-<node>
```
### Solutions for Cluster-Wide Issues
**Scale Control Plane:**
- Add more control plane nodes
- Increase API server replicas
- Tune etcd (increase memory, use SSD)
**Optimize Scheduling:**
- Use pod priority and preemption
- Implement pod topology spread constraints (sketch below)
- Use node affinity/anti-affinity appropriately
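A hedged example of a topology spread constraint that keeps replicas of one app spread across zones; the label key and skew are typical values to adapt:
```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: ScheduleAnyway   # use DoNotSchedule for a hard constraint
      labelSelector:
        matchLabels:
          app: myapp
```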
**Resource Management:**
- Set appropriate resource requests and limits
- Use LimitRanges and ResourceQuotas
- Implement VerticalPodAutoscaler for right-sizing
---
## Performance Optimization Checklist
### Application Level
- [ ] Implement connection pooling
- [ ] Enable response caching
- [ ] Optimize database queries
- [ ] Use async/non-blocking I/O
- [ ] Implement circuit breakers
- [ ] Profile and optimize hot paths
### Kubernetes Level
- [ ] Set appropriate resource requests/limits
- [ ] Use HPA for auto-scaling
- [ ] Implement readiness/liveness probes correctly
- [ ] Use anti-affinity for high-availability
- [ ] Optimize container image size
- [ ] Use multi-stage builds
### Infrastructure Level
- [ ] Use appropriate instance/node types
- [ ] Enable cluster autoscaling
- [ ] Use high-performance storage classes
- [ ] Optimize network topology
- [ ] Implement monitoring and alerting
- [ ] Regular performance testing
---
## Monitoring Tools
**Essential tools:**
- **Prometheus + Grafana**: Metrics and dashboards
- **Jaeger/Zipkin**: Distributed tracing
- **kube-state-metrics**: Kubernetes object metrics
- **node-exporter**: Node-level metrics
- **cAdvisor**: Container metrics
- **kubectl-flamegraph**: CPU profiling
**Commercial options:**
- Datadog
- New Relic
- Dynatrace
- Elastic APM