skills/references/common_issues.md

# Common Kubernetes Issues and Solutions

This reference provides detailed information about common Kubernetes issues, their root causes, and remediation steps.

## Table of Contents

1. [Pod Issues](#pod-issues)
2. [Node Issues](#node-issues)
3. [Networking Issues](#networking-issues)
4. [Storage Issues](#storage-issues)
5. [Resource Issues](#resource-issues)
6. [Security Issues](#security-issues)

---

## Pod Issues

### ImagePullBackOff / ErrImagePull

**Symptoms:**
- Pod stuck in `ImagePullBackOff` or `ErrImagePull` state
- Events show image pull errors

**Common Causes:**
1. Image doesn't exist or tag is wrong
2. Registry authentication failure
3. Network issues reaching registry
4. Rate limiting from registry
5. Private registry without imagePullSecrets

**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```

**Remediation Steps:**
1. Verify image name and tag: `docker pull <image>` from local machine
2. Check imagePullSecrets exist: `kubectl get secrets -n <namespace>`
3. Verify secret has correct registry credentials
4. Check if registry is accessible from cluster
5. For rate limiting: implement imagePullSecrets with an authenticated account (see the sketch below)
6. Update deployment with correct image path
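
A minimal sketch of wiring up registry credentials; the secret name `regcred`, the registry URL, and the choice to patch the default service account are placeholders to adapt:

```bash
# Create a docker-registry secret (credentials here are placeholders)
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>

# Attach it to the namespace's default service account so new pods use it,
# or reference it under imagePullSecrets in the pod spec instead
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```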

**Prevention:**
- Use specific image tags instead of `latest`
- Implement image pull secrets for private registries
- Set up local registry or cache to reduce external pulls
- Use image validation in CI/CD pipeline

---

### CrashLoopBackOff

**Symptoms:**
- Pod repeatedly crashes and restarts
- Increasing restart count
- Container exits shortly after starting

**Common Causes:**
1. Application error on startup
2. Missing environment variables or config
3. Resource limits too restrictive
4. Liveness probe too aggressive
5. Missing dependencies (DB, cache, etc.)
6. Command/args misconfiguration

**Diagnostic Commands:**
```bash
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl describe pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o yaml
```

**Remediation Steps:**
1. Check current logs: `kubectl logs <pod-name>`
2. Check previous container logs: `kubectl logs <pod-name> --previous`
3. Look for startup errors, stack traces, or missing config messages
4. Verify environment variables are set correctly
5. Check if external dependencies are accessible
6. Review and adjust resource limits if OOMKilled
7. Adjust liveness probe initialDelaySeconds if failing too early (see the probe sketch below)
8. Verify command and args in pod spec
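
A hedged probe sketch for step 7; the endpoint, port, and timings are assumptions to tune per application:

```yaml
livenessProbe:
  httpGet:
    path: /healthz          # assumes the app exposes an HTTP health endpoint
    port: 8080
  initialDelaySeconds: 30   # give the app time to start before the first check
  periodSeconds: 10
  failureThreshold: 3       # restart only after three consecutive failures
```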

**Prevention:**
- Implement proper application health checks and readiness
- Use init containers for dependency checks
- Set appropriate resource requests/limits based on profiling
- Configure liveness probes with sufficient delay
- Add retry logic and graceful degradation to applications

---

### Pending Pods

**Symptoms:**
- Pod stuck in `Pending` state
- Not scheduled to any node

**Common Causes:**
1. Insufficient cluster resources (CPU/memory)
2. Node selectors/affinity rules can't be satisfied
3. No nodes match taints/tolerations
4. PersistentVolume not available
5. Resource quotas exceeded
6. Scheduler not running or misconfigured

**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get nodes -o wide
kubectl describe nodes
kubectl get pv,pvc -n <namespace>
kubectl get resourcequota -n <namespace>
```

**Remediation Steps:**
1. Check pod events for scheduling failure reason
2. Verify node resources: `kubectl describe nodes | grep -A 5 "Allocated resources"`
3. Check if pod has node selectors: verify nodes have required labels
4. Review taints on nodes and tolerations on pod (see the sketch below)
5. For PVC issues: verify PV exists and is in `Available` state
6. Check namespace resource quota: `kubectl describe resourcequota -n <namespace>`
7. If no resources: scale cluster or reduce resource requests
8. If affinity issue: adjust affinity rules or add matching nodes
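
Two quick ways to surface scheduling mismatches for steps 3 and 4; `custom-columns` and `--show-labels` are standard kubectl options:

```bash
# List each node's taint keys in one view
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Show node labels to compare against the pod's nodeSelector/affinity terms
kubectl get nodes --show-labels
```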

**Prevention:**
- Set appropriate resource requests (not just limits)
- Monitor cluster capacity and scale proactively
- Use pod disruption budgets to prevent total unavailability
- Implement cluster autoscaling
- Use multiple node pools for different workload types

---

### OOMKilled Pods

**Symptoms:**
- Pod terminates with exit code 137
- Container status shows `OOMKilled` reason
- Frequent restarts due to memory exhaustion

**Common Causes:**
1. Memory limit set too low
2. Memory leak in application
3. Unexpected load increase
4. No memory limits (using node's memory)

**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> --previous -n <namespace>
kubectl top pod <pod-name> -n <namespace>
```

**Remediation Steps:**
1. Confirm OOMKilled in pod events or status
2. Check memory usage before crash if metrics available
3. Review application logs for memory-intensive operations
4. Increase memory limit if application legitimately needs more
5. Profile application to identify memory leaks
6. Implement memory limits with requests = limits for guaranteed QoS (see the sketch below)
7. Consider horizontal scaling instead of vertical
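
A sketch of the Guaranteed QoS pattern from step 6; the sizes are placeholders that should come from profiling (both CPU and memory requests must equal their limits for the Guaranteed class):

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"   # equal to the request
    cpu: "500m"       # equal to the request
```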

**Prevention:**
- Profile applications to determine actual memory needs
- Set memory requests based on normal usage, limits with headroom
- Implement application-level memory monitoring
- Use horizontal pod autoscaling based on memory metrics
- Regular load testing to understand resource requirements

---

## Node Issues

### Node NotReady

**Symptoms:**
- Node status shows `NotReady`
- Pods on node may be evicted
- Node may be cordoned automatically

**Common Causes:**
1. Kubelet stopped or crashed
2. Network connectivity issues
3. Disk pressure
4. Memory pressure
5. PID pressure
6. Container runtime issues

**Diagnostic Commands:**
```bash
kubectl describe node <node-name>
kubectl get nodes -o wide
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
ssh <node> "df -h"
ssh <node> "free -m"
```

**Remediation Steps:**
1. Check node conditions in describe output
2. Verify kubelet is running: `systemctl status kubelet`
3. Check kubelet logs: `journalctl -u kubelet`
4. For disk pressure: clean up unused images/containers
5. For memory pressure: identify and stop memory-hogging processes
6. Restart kubelet if crashed: `systemctl restart kubelet`
7. Check container runtime: `systemctl status docker` or `systemctl status containerd`
8. Verify network connectivity to API server

**Prevention:**
- Monitor node resources with alerts
- Implement log rotation and image cleanup
- Set up node problem detector
- Use resource quotas to prevent resource exhaustion
- Regular maintenance windows for updates

---

### Disk Pressure

**Symptoms:**
- Node condition `DiskPressure` is True
- Pod evictions may occur
- Node may become NotReady

**Common Causes:**
1. Docker/containerd image cache filling disk
2. Container logs consuming space
3. Ephemeral storage usage by pods
4. System logs filling up

**Diagnostic Commands:**
```bash
kubectl describe node <node-name> | grep -A 10 Conditions
ssh <node> "df -h"
ssh <node> "du -sh /var/lib/docker/*"
ssh <node> "du -sh /var/lib/containerd/*"
ssh <node> "docker system df"
```

**Remediation Steps:**
1. Clean up unused images: `docker image prune -a`
2. Clean up stopped containers: `docker container prune`
3. Clean up volumes: `docker volume prune`
4. Rotate and compress logs
5. Check for pods with excessive ephemeral storage usage
6. Expand disk if consistently running out of space
7. Configure kubelet garbage collection parameters (see the sketch below)
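
A sketch of the kubelet garbage-collection knobs from step 7, expressed as KubeletConfiguration fields (on kubeadm nodes the file is typically /var/lib/kubelet/config.yaml; the thresholds are illustrative, and the kubelet must be restarted afterwards):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 85   # start image GC when disk usage exceeds 85%
imageGCLowThresholdPercent: 80    # collect until usage drops below 80%
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "15%"
```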

**Prevention:**
- Set imagefs.available threshold in kubelet config
- Implement automated image pruning
- Use log rotation for container logs
- Monitor disk usage with alerts
- Set ephemeral-storage limits on pods
- Size nodes appropriately for workload

---

## Networking Issues

### Pod-to-Pod Communication Failure

**Symptoms:**
- Services can't reach other services
- Connection timeouts between pods
- DNS resolution may or may not work

**Common Causes:**
1. Network policy blocking traffic
2. CNI plugin issues
3. Firewall rules blocking traffic
4. Service misconfiguration
5. Pod CIDR exhaustion

**Diagnostic Commands:**
```bash
kubectl get networkpolicies --all-namespaces
kubectl exec -it <pod> -- ping <target-pod-ip>
kubectl exec -it <pod> -- nslookup <service-name>
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
```

**Remediation Steps:**
1. Test basic connectivity: exec into pod and ping target
2. Check network policies: look for policies affecting source/dest namespaces (see the sketch below)
3. Verify service has endpoints: `kubectl get endpoints`
4. Check if pod labels match service selector
5. Verify CNI plugin pods are healthy (usually in kube-system)
6. Check node-level firewall rules
7. Verify pod CIDR hasn't been exhausted
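
An illustrative policy to reason about while auditing (names and labels are placeholders): it admits ingress to `app: backend` pods only from `app: frontend` pods in the same namespace, so once a policy selects a pod, everything else is dropped:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
```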

**Prevention:**
- Document network policies and their intent
- Use namespace labels for network policy management
- Monitor CNI plugin health
- Regularly audit network policies
- Implement network policy testing in CI/CD

---

### Service Not Accessible

**Symptoms:**
- Cannot connect to service
- LoadBalancer stuck in Pending
- NodePort not accessible

**Common Causes:**
1. Service has no endpoints (no matching pods)
2. Pods not passing readiness checks
3. LoadBalancer controller not working
4. Cloud provider integration issues
5. Port conflicts

**Diagnostic Commands:**
```bash
kubectl get svc <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
kubectl get pods -l <service-selector> -n <namespace>
kubectl logs -l <service-selector> -n <namespace>
```

**Remediation Steps:**
1. Verify service type and ports are correct
2. Check if service has endpoints: `kubectl get endpoints`
3. If no endpoints: verify pod selector matches pod labels (see the sketch below)
4. Check pod readiness: `kubectl describe pod`
5. For LoadBalancer: check cloud provider controller logs
6. For NodePort: verify node firewall allows the port
7. Test from within cluster first, then external access
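
A quick selector-to-labels cross-check for step 3; the jsonpath output is the label map the service expects:

```bash
# Print the service's selector
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'

# Feed the selector back as key=value pairs; an empty result means no pod
# carries the labels the service is selecting on
kubectl get pods -n <namespace> -l <key>=<value> --show-labels
```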

**Prevention:**
- Use meaningful service selectors and pod labels
- Implement proper readiness probes
- Test services in staging before production
- Monitor service endpoint counts
- Document external access requirements

---

## Storage Issues

### PVC Pending

**Symptoms:**
- PersistentVolumeClaim stuck in `Pending` state
- Pod can't start due to volume not available

**Common Causes:**
1. No PV matches PVC requirements
2. StorageClass doesn't exist or is misconfigured
3. Dynamic provisioner not working
4. Insufficient permissions for provisioner
5. Volume capacity exhausted in storage backend

**Diagnostic Commands:**
```bash
kubectl describe pvc <pvc-name> -n <namespace>
kubectl get pv
kubectl get storageclass
kubectl describe storageclass <storage-class-name>
kubectl get pods -n <provisioner-namespace>
```

**Remediation Steps:**
1. Check PVC events for specific error message
2. Verify StorageClass exists: `kubectl get sc`
3. Check if dynamic provisioner pod is running
4. For static provisioning: ensure PV exists with matching size/access mode
5. Verify provisioner has correct cloud credentials/permissions
6. Check storage backend capacity
7. Review StorageClass parameters for typos

**Prevention:**
- Use dynamic provisioning with reliable StorageClass
- Monitor storage backend capacity
- Set up alerts for PVC binding failures
- Test storage provisioning in non-production first
- Document storage requirements and limitations

---

### Volume Mount Failures

**Symptoms:**
- Pod fails to start with volume mount errors
- Events show mounting failures
- Container create errors related to volumes

**Common Causes:**
1. Volume already mounted to different node (RWO with multi-attach)
2. Volume doesn't exist
3. Insufficient permissions
4. Node can't reach storage backend
5. Filesystem issues on volume

**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
kubectl describe pv <pv-name>
kubectl get volumeattachments
```

**Remediation Steps:**
1. Check pod events for specific mount error
2. For RWO volumes: ensure pod is scheduled to node with volume attached
3. Verify PVC is bound to a PV
4. Check node can reach storage backend (cloud/NFS/iSCSI)
5. For CSI volumes: check CSI driver pods are healthy
6. Delete and recreate pod if volume is stuck in multi-attach state
7. Check filesystem on volume if accessible

**Prevention:**
- Use ReadWriteMany for multi-pod access scenarios
- Implement pod disruption budgets to prevent scheduling conflicts
- Monitor volume attachment status
- Use StatefulSets for stateful workloads with volumes
- Regular backup and restore testing

---

## Resource Issues

### Resource Quota Exceeded

**Symptoms:**
- New pods fail to schedule
- Error: "exceeded quota"
- ResourceQuota limits preventing resource allocation

**Common Causes:**
1. Namespace resource quota exceeded
2. Not enough resource budget available
3. Resource requests not specified on pods
4. Quota misconfiguration

**Diagnostic Commands:**
```bash
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
kubectl get pods -n <namespace> -o json | jq '.items[].spec.containers[].resources'
kubectl describe namespace <namespace>
```

**Remediation Steps:**
1. Check current quota usage: `kubectl describe resourcequota`
2. Identify pods consuming resources
3. Either increase quota limits or reduce resource requests
4. Delete unused resources to free up quota
5. Optimize pod resource requests based on actual usage
6. Consider splitting workloads across namespaces

**Prevention:**
- Set quotas based on actual usage patterns
- Monitor quota usage with alerts
- Right-size pod resource requests
- Implement automatic cleanup of completed jobs/pods
- Regular quota review and adjustment

---

### CPU Throttling

**Symptoms:**
- Application performance degradation
- High CPU throttling metrics
- Services responding slowly despite available CPU

**Common Causes:**
1. CPU limits set too low
2. Burst workloads hitting limits
3. Noisy neighbor effects
4. CPU limits set without understanding workload

**Diagnostic Commands:**
```bash
kubectl top pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat  # cgroup v1 (use /sys/fs/cgroup/cpu.stat on cgroup v2); look for nr_throttled and throttled_time
```

**Remediation Steps:**
1. Check current CPU usage vs limits
2. Review throttling metrics if available
3. Increase CPU limits if application legitimately needs more
4. Remove CPU limits if workload is bursty (keep requests)
5. Use HPA if load varies significantly
6. Profile application to identify CPU-intensive operations

**Prevention:**
- Set CPU requests based on average usage
- Set CPU limits with 50-100% headroom for bursts
- Use HPA for variable workloads
- Monitor CPU throttling metrics
- Regular performance testing and profiling

---

## Security Issues

### Image Vulnerability

**Symptoms:**
- Security scanner reports vulnerabilities
- Compliance violations
- Known CVEs in running images

**Diagnostic Commands:**
```bash
kubectl get pods -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u
trivy image <image-name>
```

**Remediation Steps:**
1. Identify vulnerable images with scanner
2. Update base images to patched versions
3. Rebuild application images with updated dependencies
4. Update deployments with new image tags
5. Implement admission controller to block vulnerable images

**Prevention:**
- Scan images in CI/CD pipeline
- Regular base image updates
- Use minimal base images
- Implement admission controllers (OPA, Kyverno)
- Automated image updates and testing

---

### RBAC Permission Denied

**Symptoms:**
- Users or service accounts can't perform operations
- "forbidden" errors in logs
- API calls fail with 403 errors

**Common Causes:**
1. Missing Role or ClusterRole binding
2. Overly restrictive RBAC policies
3. Wrong service account in pod
4. Namespace-scoped role for cluster-wide resource

**Diagnostic Commands:**
```bash
kubectl auth can-i <verb> <resource> --as=<user/sa>
kubectl get rolebindings -n <namespace>
kubectl get clusterrolebindings
kubectl describe role <role-name> -n <namespace>
kubectl describe serviceaccount <sa-name> -n <namespace>
```

**Remediation Steps:**
1. Identify exact permission needed from error message
2. Check what user/SA can do: `kubectl auth can-i --list`
3. Create appropriate Role/ClusterRole with needed permissions (see the sketch below)
4. Create RoleBinding/ClusterRoleBinding
5. Verify service account is correctly set in pod spec
6. Test with `kubectl auth can-i` before deploying
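
A minimal sketch for steps 3 and 4: read-only pod access for one service account in one namespace; every name here is a placeholder:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: <namespace>
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: <namespace>
subjects:
  - kind: ServiceAccount
    name: <sa-name>
    namespace: <namespace>
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```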

**Prevention:**
- Follow principle of least privilege
- Use namespace-scoped roles where possible
- Document RBAC policies and their purpose
- Regular RBAC audits
- Use pre-defined roles when possible

---

This reference covers the most common Kubernetes issues. For each issue, always:
1. Gather information (describe, logs, events)
2. Form a hypothesis based on the symptoms
3. Test the hypothesis with targeted diagnostics
4. Apply remediation
5. Verify the fix
6. Document for future reference

skills/references/helm_troubleshooting.md

# Helm Troubleshooting Guide

Comprehensive guide for troubleshooting Helm releases, charts, and deployments.

## Table of Contents

1. [Helm Release Issues](#helm-release-issues)
2. [Chart Installation Failures](#chart-installation-failures)
3. [Upgrade and Rollback Problems](#upgrade-and-rollback-problems)
4. [Values and Configuration](#values-and-configuration)
5. [Chart Dependencies](#chart-dependencies)
6. [Hooks and Lifecycle](#hooks-and-lifecycle)
7. [Repository Issues](#repository-issues)

---

## Helm Release Issues

### Release Stuck in Pending-Install/Pending-Upgrade

**Symptoms:**
- Release shows status `pending-install` or `pending-upgrade`
- New installations or upgrades hang indefinitely
- `helm list` shows release but resources not created

**Diagnostic Commands:**
```bash
# Check release status
helm list -n <namespace>
helm status <release-name> -n <namespace>

# Check release history
helm history <release-name> -n <namespace>

# Get detailed release information
kubectl get secrets -n <namespace> -l owner=helm,status=pending-install

# Check helm operation pods/jobs
kubectl get pods -n <namespace> -l app.kubernetes.io/managed-by=Helm
```

**Common Causes:**
1. Previous installation failed and wasn't cleaned up
2. Helm hooks stuck or failed
3. Kubernetes resources can't be created (RBAC, quotas)
4. Timeout during installation

**Resolution:**

```bash
# Check for stuck hooks
kubectl get jobs -n <namespace>
kubectl get pods -n <namespace> -l "helm.sh/hook"  # matches only if the chart labels its hooks; helm.sh/hook itself is an annotation

# Delete stuck hooks
kubectl delete job <hook-job> -n <namespace>

# Rollback to previous version
helm rollback <release-name> -n <namespace>

# Force delete release (last resort)
helm delete <release-name> -n <namespace> --no-hooks

# Clean up secrets
kubectl delete secret -n <namespace> -l owner=helm,name=<release-name>
```

### Release Shows as Deployed but Resources Missing

**Symptoms:**
- `helm list` shows release as deployed
- Expected pods/services not running
- Resources partially created

**Investigation:**
```bash
# Get manifest from release
helm get manifest <release-name> -n <namespace>

# Compare with what's actually deployed
helm get manifest <release-name> -n <namespace> | kubectl diff -f -

# Check helm values used
helm get values <release-name> -n <namespace>

# Check for resource creation errors
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

**Resolution:**
```bash
# Reapply the release
helm upgrade <release-name> <chart> -n <namespace> --reuse-values

# Or force recreate
helm upgrade <release-name> <chart> -n <namespace> --force
```

---

## Chart Installation Failures

### "Release Already Exists" Error

**Symptoms:**
```
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
```

**Resolution:**
```bash
# Check existing releases
helm list -n <namespace> --all

# Check for failed releases
helm list -n <namespace> --failed

# Uninstall existing release
helm uninstall <release-name> -n <namespace>

# Or use different release name
helm install <new-release-name> <chart> -n <namespace>
```

### "Invalid Chart" Error

**Symptoms:**
```
Error: INSTALLATION FAILED: chart requires kubeVersion: >=1.20.0 which is incompatible with Kubernetes v1.19.0
```

**Investigation:**
```bash
# Check chart requirements
helm show chart <chart-name>

# Check Kubernetes version
kubectl version

# Inspect chart contents
helm pull <chart> --untar
cat <chart>/Chart.yaml
```

**Resolution:**
- Upgrade Kubernetes cluster to meet requirements
- Use compatible chart version: `helm install <release> <chart> --version <compatible-version>`
- Modify Chart.yaml kubeVersion requirement (not recommended)

### Template Rendering Errors

**Symptoms:**
```
Error: INSTALLATION FAILED: template: <chart>/templates/deployment.yaml:10:4:
executing "<chart>/templates/deployment.yaml" at <.Values.invalid.path>:
nil pointer evaluating interface {}.path
```

**Investigation:**
```bash
# Render templates locally
helm template <release-name> <chart> -n <namespace>

# Render with your values
helm template <release-name> <chart> -f values.yaml -n <namespace>

# Debug mode
helm install <release-name> <chart> -n <namespace> --debug --dry-run
```

**Resolution:**
1. Check values.yaml for missing required fields
2. Verify template syntax in chart
3. Use `helm lint` to validate chart

```bash
# Lint chart
helm lint <chart-directory>

# Lint with values
helm lint <chart-directory> -f values.yaml
```

---

## Upgrade and Rollback Problems

### Upgrade Fails with "Rendered Manifests Contain Errors"

**Symptoms:**
```
Error: UPGRADE FAILED: unable to build kubernetes objects from release manifest
```

**Investigation:**
```bash
# Dry run upgrade
helm upgrade <release-name> <chart> -n <namespace> --dry-run --debug

# Compare differences
helm diff upgrade <release-name> <chart> -n <namespace> # requires helm-diff plugin
```

**Resolution:**
```bash
# Fix values.yaml and retry
helm upgrade <release-name> <chart> -n <namespace> -f fixed-values.yaml

# Skip tests if test hooks failing
helm upgrade <release-name> <chart> -n <namespace> --no-hooks

# Force upgrade
helm upgrade <release-name> <chart> -n <namespace> --force
```

### Rollback Fails

**Symptoms:**
```
Error: ROLLBACK FAILED: release <release-name> failed, and has been rolled back due to atomic being set
```

**Investigation:**
```bash
# Check release history
helm history <release-name> -n <namespace>

# Get specific revision manifest
helm get manifest <release-name> --revision <revision-number> -n <namespace>
```

**Resolution:**
```bash
# Rollback to specific working revision
helm rollback <release-name> <revision-number> -n <namespace>

# If rollback fails, uninstall and reinstall
helm uninstall <release-name> -n <namespace>
helm install <release-name> <chart> -n <namespace> -f values.yaml

# Clean up failed rollback
kubectl get secrets -n <namespace> -l owner=helm,status=pending-rollback
kubectl delete secret <secret-name> -n <namespace>
```

### "Immutable Field" Error During Upgrade

**Symptoms:**
```
Error: UPGRADE FAILED: Service "myapp" is invalid: spec.clusterIP: Invalid value: "": field is immutable
```

**Common immutable fields:**
- Service `clusterIP`
- StatefulSet `volumeClaimTemplates`
- PVC `storageClassName`

**Resolution:**
```bash
# Option 1: Delete and recreate resource
kubectl delete service <service-name> -n <namespace>
helm upgrade <release-name> <chart> -n <namespace>

# Option 2: Use --force to recreate resources
helm upgrade <release-name> <chart> -n <namespace> --force

# Option 3: Manually patch the resource
kubectl patch service <service-name> -n <namespace> --type='json' \
  -p='[{"op": "remove", "path": "/spec/clusterIP"}]'
```

---

## Values and Configuration

### Values Not Applied

**Symptoms:**
- Deployed resources don't reflect values in values.yaml
- Default chart values used instead of custom values

**Investigation:**
```bash
# Check what values were used in deployment
helm get values <release-name> -n <namespace>

# Check all values (including defaults)
helm get values <release-name> -n <namespace> --all

# Test rendering with your values
helm template <release-name> <chart> -f values.yaml -n <namespace> | less
```

**Resolution:**
```bash
# Ensure values file is specified correctly
helm upgrade <release-name> <chart> -n <namespace> -f values.yaml

# Use multiple values files (later files override earlier)
helm upgrade <release-name> <chart> -n <namespace> \
  -f values-common.yaml \
  -f values-prod.yaml

# Set specific values via CLI
helm upgrade <release-name> <chart> -n <namespace> \
  --set image.tag=v2.0.0 \
  --set replicas=5
```

### Values File Parsing Errors

**Symptoms:**
```
Error: INSTALLATION FAILED: YAML parse error
```

**Investigation:**
```bash
# Validate YAML syntax
yamllint values.yaml

# Or use Python
python3 -c 'import yaml; yaml.safe_load(open("values.yaml"))'

# Check for tabs (YAML doesn't allow tabs)
grep -n $'\t' values.yaml
```

**Resolution:**
- Fix YAML syntax errors
- Replace tabs with spaces
- Ensure proper indentation
- Quote special characters in strings

### Secret Values Not Working

**Symptoms:**
- Secrets not created or contain wrong values
- Base64 encoding issues

**Investigation:**
```bash
# Check secret in manifest
helm get manifest <release-name> -n <namespace> | grep -A 10 "kind: Secret"

# Decode secret
kubectl get secret <secret-name> -n <namespace> -o json | \
  jq '.data | map_values(@base64d)'
```

**Resolution:**
```yaml
# Use proper secret format in values.yaml
secrets:
  password: "mySecretPassword" # plain value; the chart template must base64-encode it

# Or pre-encode if template expects it
secrets:
  password: "bXlTZWNyZXRQYXNzd29yZA==" # Already base64 encoded
```
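
Which form is right depends on the chart's template. A sketch of the template side, assuming the value arrives plain (`b64enc` is a standard Sprig function available in Helm templates):

```yaml
# templates/secret.yaml (names are placeholders)
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Release.Name }}-credentials
type: Opaque
data:
  password: {{ .Values.secrets.password | b64enc | quote }}
```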

---

## Chart Dependencies

### Dependency Update Fails

**Symptoms:**
```
Error: An error occurred while checking for chart dependencies
```

**Investigation:**
```bash
# Check Chart.yaml dependencies
cat Chart.yaml

# List current dependencies
helm dependency list <chart-directory>

# Check repository access
helm repo list
helm repo update
```

**Resolution:**
```bash
# Update dependencies
helm dependency update <chart-directory>

# Build dependencies (downloads to charts/)
helm dependency build <chart-directory>

# Add missing repositories
helm repo add <repo-name> <repo-url>
helm repo update
```

### Dependency Version Conflicts

**Symptoms:**
```
Error: found in Chart.yaml, but missing in charts/ directory
```

**Resolution:**
```bash
# Clean dependencies
rm -rf <chart-directory>/charts/*
rm -f <chart-directory>/Chart.lock

# Rebuild
helm dependency update <chart-directory>

# Verify
helm dependency list <chart-directory>
```

### Subchart Values Not Applied

**Investigation:**
```bash
# Check subchart values in parent chart
cat values.yaml | grep -A 20 <subchart-name>

# Render to see what values subchart receives
helm template <release-name> <chart> -f values.yaml | grep -A 50 "# Source: <subchart-name>"
```

**Resolution:**
```yaml
# In parent chart's values.yaml, nest subchart values under subchart name:
postgresql: # Subchart name
  auth:
    username: myuser
    password: mypass
    database: mydb
  primary:
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
```

---

## Hooks and Lifecycle

### Pre/Post Hooks Failing

**Symptoms:**
- Installation/upgrade hangs waiting for hooks
- Hook jobs fail
- Release stuck in pending state

**Investigation:**
```bash
# List hooks
kubectl get jobs -n <namespace> -l "helm.sh/hook"

# Check hook status
kubectl describe job <hook-job-name> -n <namespace>

# Get hook logs
kubectl logs -n <namespace> -l "helm.sh/hook=pre-install"
kubectl logs -n <namespace> -l "helm.sh/hook=post-install"
```

**Resolution:**
```bash
# Delete failed hooks
kubectl delete job -n <namespace> -l "helm.sh/hook"

# Retry without hooks
helm upgrade <release-name> <chart> -n <namespace> --no-hooks

# Or skip hooks during install
helm install <release-name> <chart> -n <namespace> --no-hooks
```

### Hook Cleanup Issues

**Symptoms:**
- Hook resources remain after installation
- Accumulating failed hook jobs

**Investigation:**
```bash
# Check hook deletion policy (hooks are stored separately from the manifest)
helm get hooks <release-name> -n <namespace> | grep -B 5 "helm.sh/hook-delete-policy"

# List remaining hooks
kubectl get all -n <namespace> -l "helm.sh/hook"
```

**Resolution:**
```bash
# Manual cleanup
kubectl delete jobs,pods -n <namespace> -l "helm.sh/hook"

# Update chart template to include proper hook-delete-policy:
# metadata:
#   annotations:
#     "helm.sh/hook": pre-install
#     "helm.sh/hook-delete-policy": hook-succeeded,hook-failed
```

---

## Repository Issues

### Unable to Get Chart from Repository

**Symptoms:**
```
Error: failed to download "<chart-name>"
```

**Investigation:**
```bash
# Check repository configuration
helm repo list

# Update repositories
helm repo update

# Search for chart
helm search repo <chart-name> --versions

# Test repository access
curl -I <repo-url>/index.yaml
```

**Resolution:**
```bash
# Remove and re-add repository
helm repo remove <repo-name>
helm repo add <repo-name> <repo-url>
helm repo update

# For private repos, configure credentials
helm repo add <repo-name> <repo-url> \
  --username=<username> \
  --password=<password>

# Or use OCI registry
helm pull oci://registry.example.com/charts/<chart-name> --version 1.0.0
```

### Chart Version Not Found

**Symptoms:**
```
Error: chart "<chart-name>" version "1.2.3" not found
```

**Investigation:**
```bash
# List available versions
helm search repo <chart-name> --versions

# Check if specific version exists
helm show chart <repo-name>/<chart-name> --version 1.2.3
```

**Resolution:**
```bash
# Use available version
helm install <release-name> <repo-name>/<chart-name> --version <available-version>

# Or use latest
helm install <release-name> <repo-name>/<chart-name>
```

---

## Debugging Tools and Commands

### Essential Helm Commands

```bash
# Get release information
helm list -n <namespace> --all
helm status <release-name> -n <namespace>
helm history <release-name> -n <namespace>

# Get release content
helm get values <release-name> -n <namespace>
helm get manifest <release-name> -n <namespace>
helm get hooks <release-name> -n <namespace>
helm get notes <release-name> -n <namespace>

# Debugging
helm install <release-name> <chart> --debug --dry-run -n <namespace>
helm template <release-name> <chart> --debug -n <namespace>

# Testing
helm test <release-name> -n <namespace>
helm lint <chart-directory>
```

### Useful Plugins

```bash
# Install helm-diff plugin
helm plugin install https://github.com/databus23/helm-diff

# Compare releases
helm diff upgrade <release-name> <chart> -n <namespace>

# Install helm-secrets plugin
helm plugin install https://github.com/jkroepke/helm-secrets

# Use encrypted values
helm secrets install <release-name> <chart> -f secrets.yaml -n <namespace>
```

### Helm Environment Issues

**Check Helm configuration:**
```bash
# Helm version
helm version

# Kubernetes context
kubectl config current-context

# Helm environment
helm env

# Cache location
helm env | grep CACHE
```

---

## Best Practices

### Release Management
- Use descriptive release names
- Always specify namespace explicitly
- Use `--atomic` flag for safer upgrades (rolls back on failure)
- Keep release history manageable: `helm history <release> -n <namespace> --max 10`

### Values Management
- Use multiple values files for different environments
- Version control your values files
- Use `helm template` to preview changes before applying
- Document required values in chart README

### Chart Development
- Always run `helm lint` before packaging
- Test charts in multiple environments
- Use semantic versioning for charts
- Implement proper hooks with deletion policies

### Troubleshooting Workflow
1. Check release status: `helm status <release> -n <namespace>`
2. Check history: `helm history <release> -n <namespace>`
3. Get values: `helm get values <release> -n <namespace>`
4. Check manifest: `helm get manifest <release> -n <namespace>`
5. Check Kubernetes events: `kubectl get events -n <namespace>`
6. Check pod logs: `kubectl logs <pod> -n <namespace>`
7. Check hooks: `kubectl get jobs -n <namespace> -l helm.sh/hook`

---

## Quick Reference

### Common Flags

```bash
# Installation/Upgrade
--atomic          # Rollback on failure
--wait            # Wait for resources to be ready
--timeout 10m     # Set timeout (default 5m)
--force           # Force update by deleting and recreating resources
--cleanup-on-fail # Delete resources on failed install

# Debugging
--debug           # Enable verbose output
--dry-run         # Simulate operation
--no-hooks        # Skip hooks

# Values
-f values.yaml    # Use values file
--set key=value   # Set value via command line
--reuse-values    # Reuse values from previous release
```

### Typical Rescue Commands

```bash
# Release stuck? Force delete and reinstall
helm uninstall <release> -n <namespace> --no-hooks
kubectl delete secret -n <namespace> -l owner=helm,name=<release>
helm install <release> <chart> -n <namespace> -f values.yaml

# Upgrade failed? Rollback
helm rollback <release> 0 -n <namespace> # 0 = previous revision

# Can't rollback? Force upgrade (--recreate-pods is Helm 2 only)
helm upgrade <release> <chart> -n <namespace> --force

# Complete cleanup
helm uninstall <release> -n <namespace>
kubectl delete namespace <namespace> # If dedicated namespace
```

skills/references/incident_response.md

# Kubernetes Incident Response Playbook

This playbook provides structured procedures for responding to Kubernetes incidents.

## Incident Response Framework

### 1. Detection Phase
- Identify the incident (alerts, user reports, monitoring)
- Determine severity level
- Initiate incident response

### 2. Triage Phase
- Assess impact and scope
- Gather initial diagnostic data
- Determine if immediate action is needed

### 3. Investigation Phase
- Collect comprehensive diagnostics
- Identify root cause
- Document findings

### 4. Resolution Phase
- Apply remediation
- Verify fix
- Monitor for recurrence

### 5. Post-Incident Phase
- Document incident
- Conduct blameless post-mortem
- Implement preventive measures

---

## Severity Levels

### SEV-1: Critical
- Complete service outage
- Data loss or corruption
- Security breach
- Impact: All users affected
- Response: Immediate, all-hands

### SEV-2: High
- Major functionality degraded
- Significant performance impact
- Impact: Large subset of users
- Response: Within 15 minutes

### SEV-3: Medium
- Minor functionality impaired
- Workaround available
- Impact: Some users affected
- Response: Within 1 hour

### SEV-4: Low
- Cosmetic issues
- Negligible impact
- Impact: Minimal
- Response: During business hours

---

## Common Incident Scenarios

### Scenario 1: Complete Cluster Outage

**Symptoms:**
- All services unreachable
- kubectl commands timing out
- API server not responding

**Immediate Actions:**
1. Verify the scope (single cluster or multi-cluster)
2. Check API server status and logs
3. Check control plane nodes
4. Verify network connectivity to control plane
5. Check etcd cluster health

**Investigation Steps:**
```bash
# Check control plane pods
kubectl get pods -n kube-system

# Check etcd
kubectl exec -it etcd-<node> -n kube-system -- etcdctl endpoint health

# Check API server logs
journalctl -u kube-apiserver -n 100  # systemd-managed setups; on kubeadm the API server is a static pod (use crictl logs on the node)

# Check control plane node resources
ssh <control-plane-node> "top"
```
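
If the bare `etcdctl` call fails with TLS errors, the client usually needs the API version and certificate flags. A fuller invocation, assuming a kubeadm-style certificate layout under /etc/kubernetes/pki/etcd:

```bash
kubectl exec -it etcd-<node> -n kube-system -- sh -c \
  'ETCDCTL_API=3 etcdctl \
     --endpoints=https://127.0.0.1:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/server.crt \
     --key=/etc/kubernetes/pki/etcd/server.key \
     endpoint health'
```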

**Common Causes:**
- etcd cluster failure
- API server OOM/crash
- Control plane network partition
- Certificate expiration
- Cloud provider outage

**Resolution Paths:**
1. etcd issue: Restore from backup or rebuild cluster
2. API server issue: Restart API server pods/service
3. Network: Fix routing, security groups, or DNS
4. Certificates: Renew certificates (`kubeadm certs renew all`)

---

### Scenario 2: Service Degradation

**Symptoms:**
- Increased latency or error rates
- Some requests failing
- Intermittent issues

**Immediate Actions:**
1. Check service metrics and logs
2. Verify pod health and count
3. Check for recent deployments
4. Review resource utilization

**Investigation Steps:**
```bash
# Check service endpoints
kubectl get endpoints <service> -n <namespace>

# Check pod status
kubectl get pods -l <service-selector> -n <namespace>

# Review recent changes
kubectl rollout history deployment/<name> -n <namespace>

# Check resource usage
kubectl top pods -n <namespace>

# Get recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```

**Common Causes:**
- Insufficient replicas
- Pod restarts/crashes
- Resource contention
- Bad deployment
- External dependency failure

**Resolution Paths:**
1. Scale up replicas if needed
2. Rollback bad deployment (see the sketch below)
3. Increase resources if constrained
4. Fix configuration issues
5. Implement circuit breaker for external deps
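
A sketch of the rollback path from step 2, using standard kubectl rollout commands:

```bash
# Roll the deployment back to the previous revision and watch it settle
kubectl rollout undo deployment/<name> -n <namespace>
kubectl rollout status deployment/<name> -n <namespace>

# Or target a specific revision found via `kubectl rollout history`
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=<revision>
```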

---

### Scenario 3: Node Failure

**Symptoms:**
- Node reported as NotReady
- Pods being evicted from node
- High node resource utilization

**Immediate Actions:**
1. Identify affected node
2. Check impact (which pods running on node)
3. Determine if pods need immediate migration
4. Assess if node is recoverable

**Investigation Steps:**
```bash
# Get node status
kubectl get nodes

# Describe the problem node
kubectl describe node <node-name>

# Check pods on the node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

# SSH to node and check
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
ssh <node> "docker ps" # or "crictl ps" on containerd nodes
ssh <node> "df -h"
ssh <node> "free -m"
```

**Common Causes:**
- Kubelet failure
- Disk full
- Memory exhaustion
- Network issues
- Hardware failure

**Resolution Paths:**
1. Recoverable: Fix issue (clean disk, restart services)
2. Not recoverable: Cordon, drain, and replace node (see the sketch below)
3. For critical pods: Manually reschedule if necessary
4. Update monitoring and alerting based on findings
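
The cordon/drain sequence from step 2, using standard kubectl commands:

```bash
# Stop new pods from landing on the node, then evict what is running there
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# After repairing or replacing the node, let it schedule pods again
kubectl uncordon <node-name>
```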

---

### Scenario 4: Storage Issues

**Symptoms:**
- PVCs stuck in Pending
- Pods can't start due to volume issues
- Data access failures

**Immediate Actions:**
1. Identify affected PVCs/PVs
2. Check storage backend health
3. Verify provisioner status
4. Assess data integrity risk

**Investigation Steps:**
```bash
# Check PVC status
kubectl get pvc --all-namespaces

# Describe pending PVC
kubectl describe pvc <pvc-name> -n <namespace>

# Check PV status
kubectl get pv

# Check storage class
kubectl get storageclass

# Check provisioner
kubectl get pods -n <storage-namespace>

# Check volume attachments
kubectl get volumeattachments
```

**Common Causes:**
- Storage backend failure/full
- Provisioner issues
- Network to storage backend
- Volume attachment limits reached
- Corrupted volume

**Resolution Paths:**
1. Fix storage backend issues
2. Restart provisioner if needed
3. Manually provision PV if dynamic provisioning failed
4. Delete and recreate if volume corrupted
5. Restore from backup if data lost

---

### Scenario 5: Security Incident

**Symptoms:**
- Unauthorized access detected
- Suspicious pod behavior
- Security alerts triggered
- Unusual network traffic

**Immediate Actions:**
1. Assess severity and scope
2. Isolate affected resources
3. Preserve evidence
4. Engage security team

**Investigation Steps:**
```bash
# Check recent RBAC changes
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json

# Audit pod security contexts
kubectl get pods --all-namespaces -o json | jq '.items[].spec.securityContext'

# Check for privileged pods
kubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.containers[].securityContext.privileged==true)'

# Review service accounts
kubectl get serviceaccounts --all-namespaces

# Get audit logs
cat /var/log/kubernetes/audit/audit.log | grep <suspicious-activity>
```

**Common Causes:**
- Compromised credentials
- Vulnerable container image
- Misconfigured RBAC
- Exposed secrets
- Supply chain attack

**Resolution Paths:**
1. Isolate: Network policies, cordon nodes (see the sketch below)
2. Investigate: Audit logs, pod logs, network flows
3. Remediate: Rotate credentials, patch vulnerabilities
4. Restore: From known-good state if needed
5. Prevent: Enhanced security policies, monitoring
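
A quarantine sketch for step 1; it only takes effect on CNIs that enforce NetworkPolicy, and the label and names are placeholders. With no ingress or egress rules listed, selected pods are cut off in both directions:

```yaml
# First label the suspect pod:
#   kubectl label pod <pod> -n <namespace> quarantine=true
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-deny-all
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes:
    - Ingress
    - Egress
```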

---

## Diagnostic Commands Cheat Sheet

### Quick Health Check
```bash
# Overall cluster health
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running

# Component status (older clusters)
kubectl get componentstatuses

# Recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
```

### Pod Diagnostics
```bash
# Pod details
kubectl describe pod <pod> -n <namespace>
kubectl get pod <pod> -n <namespace> -o yaml

# Logs
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl logs <pod> -c <container> -n <namespace>

# Interactive debugging
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl debug <pod> -it --image=busybox -n <namespace>
```

### Node Diagnostics
```bash
# Node details
kubectl describe node <node>
kubectl get node <node> -o yaml

# Resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
```

### Service & Network Diagnostics
```bash
# Service details
kubectl describe svc <service> -n <namespace>
kubectl get endpoints <service> -n <namespace>

# Network policies
kubectl get networkpolicies --all-namespaces

# DNS testing
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot
# Then: nslookup <service>.<namespace>.svc.cluster.local
```

### Storage Diagnostics
```bash
# PVC and PV status
kubectl get pvc,pv --all-namespaces

# Storage class
kubectl get storageclass
kubectl describe storageclass <storage-class>

# Volume attachments
kubectl get volumeattachments
```

---

## Communication During Incidents

### Internal Communication
- Use dedicated incident channel
- Regular status updates (every 30 min)
- Clear roles (incident commander, scribe, experts)
- Document all actions taken

### External Communication
- Status page updates
- Customer notifications
- Clear expected resolution time
- Updates on progress

### Post-Incident Communication
- Incident report
- Root cause analysis
- Remediation steps taken
- Prevention measures

---

## Post-Incident Review Template

### Incident Summary
- Date and time
- Duration
- Severity
- Services affected
- User impact

### Timeline
- Detection time
- Response time
- Resolution time
- Key events during incident

### Root Cause
- What happened
- Why it happened
- Contributing factors

### Resolution
- What fixed the issue
- Who fixed it
- How long it took

### Lessons Learned
- What went well
- What could be improved
- Action items with owners

### Prevention
- Technical changes
- Process improvements
- Monitoring enhancements
- Documentation updates

---

## Best Practices

### Prevention
- Regular cluster audits
- Proactive monitoring and alerting
- Capacity planning
- Regular disaster recovery drills
- Automated backups
- Security scanning and policies

### Preparedness
- Document runbooks
- Practice incident response
- Keep contact lists updated
- Maintain up-to-date diagrams
- Pre-provision debugging tools

### Response
- Follow structured approach
- Document everything
- Communicate clearly
- Don't panic
- Think before acting
- Preserve evidence

### Recovery
- Verify fix thoroughly
- Monitor for recurrence
- Update documentation
- Conduct post-mortem
- Implement preventive measures

skills/references/performance_troubleshooting.md


# Kubernetes Performance Troubleshooting

Systematic approach to diagnosing and resolving Kubernetes performance issues.

## Table of Contents

1. [High Latency Issues](#high-latency-issues)
2. [CPU Performance](#cpu-performance)
3. [Memory Performance](#memory-performance)
4. [Network Performance](#network-performance)
5. [Storage I/O Performance](#storage-io-performance)
6. [Application-Level Metrics](#application-level-metrics)
7. [Cluster-Wide Performance](#cluster-wide-performance)

---

## High Latency Issues

### Symptoms
- Slow API response times
- Increased request latency
- Timeouts
- Degraded user experience

### Investigation Workflow

**1. Identify the layer with latency:**

```bash
# Check pod resource usage (and service mesh metrics if using Istio/Linkerd)
kubectl top pods -n <namespace>

# Check ingress controller metrics
kubectl logs -n ingress-nginx <ingress-controller-pod> | grep "request_time"

# Check application logs for slow requests
kubectl logs <pod-name> -n <namespace> | grep -i "slow\|timeout\|latency"
```

**2. Profile application performance:**

```bash
# Get pod metrics
kubectl top pod <pod-name> -n <namespace>

# Review configured CPU limits (throttling occurs when usage hits the limit)
kubectl get pod <pod-name> -n <namespace> -o json | \
  jq '.spec.containers[].resources'

# Exec into pod and check application-specific metrics
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Then: curl localhost:8080/metrics (if Prometheus metrics available)
```

**3. Check dependencies:**

```bash
# Test connectivity to downstream services
kubectl exec -it <pod-name> -n <namespace> -- \
  curl -w "@curl-format.txt" -o /dev/null -s http://backend-service

# curl-format.txt content:
#      time_namelookup: %{time_namelookup}\n
#         time_connect: %{time_connect}\n
#      time_appconnect: %{time_appconnect}\n
#     time_pretransfer: %{time_pretransfer}\n
#        time_redirect: %{time_redirect}\n
#   time_starttransfer: %{time_starttransfer}\n
#           time_total: %{time_total}\n
```
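
Since `curl` inside the pod reads the format file from its own filesystem, one way to stage it is shown below (paths are illustrative):

```bash
# Create the timing template locally
cat > curl-format.txt <<'EOF'
   time_namelookup: %{time_namelookup}\n
      time_connect: %{time_connect}\n
time_starttransfer: %{time_starttransfer}\n
        time_total: %{time_total}\n
EOF

# Copy it into the pod, then run the timed probe
kubectl cp curl-format.txt <namespace>/<pod-name>:/tmp/curl-format.txt
kubectl exec -it <pod-name> -n <namespace> -- \
  curl -w "@/tmp/curl-format.txt" -o /dev/null -s http://backend-service
```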

### Common Causes and Solutions

**CPU Throttling:**
```yaml
# Increase CPU limits or remove limits for bursty workloads
resources:
  requests:
    cpu: "500m"    # What the pod typically needs
  limits:
    cpu: "2000m"   # Burst capacity (or remove for unlimited)
```
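
To confirm throttling is actually happening (rather than inferring it from the limits), the kernel's CPU stats inside the container are authoritative; the cgroup path differs between v1 and v2:

```bash
# cgroup v2: rising nr_throttled / throttled_usec means the limit is biting
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat

# cgroup v1 equivalent
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu/cpu.stat
```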

**Insufficient Replicas:**
```bash
# Scale up deployment
kubectl scale deployment <deployment-name> -n <namespace> --replicas=5

# Or enable HPA
kubectl autoscale deployment <deployment-name> -n <namespace> \
  --cpu-percent=70 \
  --min=2 \
  --max=10
```

**Slow Dependencies:**
```yaml
# Implement circuit breakers and timeouts in the application,
# or use service mesh policies (Istio example):
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-circuit-breaker
spec:
  host: backend-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```

---

## CPU Performance

### Symptoms
- High CPU usage
- Throttling
- Slow processing
- Queue buildup

### Investigation Commands

```bash
# Check CPU usage
kubectl top nodes
kubectl top pods -n <namespace>

# Review CPU requests/limits (throttling risk)
kubectl get pod <pod-name> -n <namespace> -o json | \
  jq '.spec.containers[].resources'

# Get detailed CPU metrics (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | jq

# Check container-level CPU from the node (Docker runtime; SSH to node)
ssh <node> "docker stats --no-stream"
```
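
On clusters whose runtime is containerd rather than Docker (the default on most current distributions), `crictl` gives the equivalent per-container view:

```bash
# Per-container CPU and memory from a containerd node (may require sudo)
ssh <node> "crictl stats"
```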

### Advanced CPU Profiling

**Enable CPU profiling in application:**

```bash
# For Go applications with pprof
kubectl port-forward <pod-name> 6060:6060 -n <namespace>

# Capture CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof

# Analyze with pprof
go tool pprof -http=:8080 cpu.prof
```

**For Java applications:**

```bash
# Use async-profiler (assumes the profiler script is present in the image)
kubectl exec -it <pod-name> -n <namespace> -- \
  /profiler.sh -d 30 -f /tmp/flamegraph.html 1

# Copy flamegraph
kubectl cp <namespace>/<pod-name>:/tmp/flamegraph.html ./flamegraph.html
```

### Solutions

**Vertical Scaling:**
```yaml
resources:
  requests:
    cpu: "1000m"   # Increased from 500m
  limits:
    cpu: "2000m"   # Increased from 1000m
```

**Horizontal Scaling:**
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
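
After applying, it's worth watching the autoscaler converge before trusting it under load:

```bash
# Current target vs. actual utilization and replica count
kubectl get hpa app-hpa -n <namespace> --watch

# Scaling events and any metric-collection errors
kubectl describe hpa app-hpa -n <namespace>
```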

**Remove CPU Limits for Bursty Workloads:**
```yaml
# Allow bursting to available CPU
resources:
  requests:
    cpu: "500m"
  # No limits - can use all available CPU
```

---

## Memory Performance

### Symptoms
- OOMKilled pods
- Memory leaks
- Slow garbage collection
- Swap usage (if enabled)

### Investigation Commands

```bash
# Check memory usage
kubectl top nodes
kubectl top pods -n <namespace>

# Check memory limits and requests
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits\|Requests"

# Check OOM kills
kubectl get pods -n <namespace> -o json | \
  jq '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | .metadata.name'

# Detailed memory breakdown (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | \
  jq '.containers[] | {name, usage: .usage.memory}'
```
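
Node-level OOM kills also surface as events; a loose grep is a pragmatic filter, since the exact reason string varies by kubelet version:

```bash
# Recent OOM-related events in the namespace
kubectl get events -n <namespace> --sort-by=.lastTimestamp | grep -i oom
```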

### Memory Profiling

**Heap dump for Java:**
```bash
# Capture heap dump (process 1 inside the container)
kubectl exec <pod-name> -n <namespace> -- \
  jmap -dump:format=b,file=/tmp/heapdump.hprof 1

# Copy heap dump
kubectl cp <namespace>/<pod-name>:/tmp/heapdump.hprof ./heapdump.hprof

# Analyze with Eclipse MAT or VisualVM
```

**Memory profiling for Go:**
```bash
# Capture heap profile
kubectl port-forward <pod-name> 6060:6060 -n <namespace>
curl http://localhost:6060/debug/pprof/heap > heap.prof

# Analyze
go tool pprof -http=:8080 heap.prof
```

### Solutions

**Increase Memory Limits:**
```yaml
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "2Gi"  # Increased from 1Gi
```

**Optimize Application:**
- Fix memory leaks
- Implement connection pooling
- Optimize caching strategies
- Tune garbage collection (see the sketch below for JVM workloads)
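
For JVM services, a common garbage-collection fix is container-aware heap sizing via `MaxRAMPercentage`, so the heap tracks the pod's memory limit rather than the node's RAM; a sketch using an environment variable (deployment name is illustrative):

```bash
# Cap the JVM heap at 75% of the container memory limit
kubectl set env deployment/<deployment-name> -n <namespace> \
  JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0"
```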

**Use Memory-Optimized Node Pools:**
```yaml
# Node affinity for memory-intensive workloads
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: workload-type
          operator: In
          values:
          - memory-optimized
```

---

## Network Performance

### Symptoms
- High network latency
- Packet loss
- Connection timeouts
- Bandwidth saturation

### Investigation Commands

```bash
# Check pod network statistics
kubectl exec <pod-name> -n <namespace> -- netstat -s

# Test network performance between pods
# Deploy netperf
kubectl run netperf-client --image=networkstatic/netperf --rm -it -- /bin/bash

# From the client, run:
netperf -H <target-pod-ip> -t TCP_STREAM
netperf -H <target-pod-ip> -t TCP_RR   # Request-response latency

# Check DNS resolution time
kubectl exec <pod-name> -n <namespace> -- \
  time nslookup service-name.namespace.svc.cluster.local

# Check service mesh overhead (if using Istio)
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  curl -s localhost:15000/stats | grep "http.inbound\|http.outbound"
```

### Check Network Policies

```bash
# List network policies
kubectl get networkpolicies -n <namespace>

# Check if policy is blocking traffic
kubectl describe networkpolicy <policy-name> -n <namespace>

# Temporarily remove policies to test (in non-production)
kubectl delete networkpolicy <policy-name> -n <namespace>
```
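
Deleting a policy loses its definition; a less destructive test is to overlay a permissive allow-all policy for the namespace and see whether the timeouts disappear (name is illustrative):

```yaml
# Temporary allow-all policy for debugging; remove after the test
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: debug-allow-all
  namespace: <namespace>
spec:
  podSelector: {}   # All pods in the namespace
  ingress:
  - {}              # Allow all ingress
  egress:
  - {}              # Allow all egress
  policyTypes:
  - Ingress
  - Egress
```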

### Solutions

**DNS Optimization:**
```bash
# Use CoreDNS caching
# Increase CoreDNS replicas
kubectl scale deployment coredns -n kube-system --replicas=5

# Or use NodeLocal DNSCache
# https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
```
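
Another frequent DNS win is lowering `ndots`, so single-label external lookups don't walk every search domain first; a pod-level sketch:

```yaml
# Pod spec fragment: fewer search-path expansions per lookup
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"
```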

**Optimize Service Mesh:**
```yaml
# Pod annotations: reduce Istio sidecar resources if over-provisioned
sidecar.istio.io/proxyCPU: "100m"
sidecar.istio.io/proxyMemory: "128Mi"

# Or disable injection for internal, trusted services
sidecar.istio.io/inject: "false"
```

**Use HostNetwork for Network-Intensive Pods:**
```yaml
# Use with caution - bypasses pod networking
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
```

**Enable Bandwidth Limits (QoS):**
```yaml
# Requires a CNI with the bandwidth plugin enabled
metadata:
  annotations:
    kubernetes.io/ingress-bandwidth: "10M"
    kubernetes.io/egress-bandwidth: "10M"
```

---

## Storage I/O Performance

### Symptoms
- Slow read/write operations
- High I/O wait
- Application timeouts during disk operations
- Database performance issues

### Investigation Commands

```bash
# Check I/O metrics on the node
ssh <node> "iostat -x 1 10"

# Check disk usage
kubectl exec <pod-name> -n <namespace> -- df -h

# Check I/O wait from pod
kubectl exec <pod-name> -n <namespace> -- top

# Test storage performance (writes 1 GiB; ensure /data has space)
kubectl exec <pod-name> -n <namespace> -- \
  dd if=/dev/zero of=/data/test bs=1M count=1024 conv=fdatasync

# Check PV performance class
kubectl get pv <pv-name> -o yaml | grep storageClassName
kubectl describe storageclass <storage-class-name>
```

### Storage Benchmarking

**Deploy fio for benchmarking:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fio-benchmark
spec:
  containers:
  - name: fio
    image: ljishen/fio
    command: ["/bin/sh", "-c"]
    args:
    - |
      fio --name=seqread --rw=read --bs=1M --size=1G --runtime=60 --filename=/data/test
      fio --name=seqwrite --rw=write --bs=1M --size=1G --runtime=60 --filename=/data/test
      fio --name=randread --rw=randread --bs=4k --size=1G --runtime=60 --filename=/data/test
      fio --name=randwrite --rw=randwrite --bs=4k --size=1G --runtime=60 --filename=/data/test
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
```
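
Run it against a PVC from the storage class under test and read the results from the pod log (filename is illustrative):

```bash
# Assumes the manifest above is saved as fio-benchmark.yaml
kubectl apply -f fio-benchmark.yaml

# Stream results (give the pod a moment to schedule and start)
kubectl logs -f fio-benchmark

# Clean up when done
kubectl delete pod fio-benchmark
```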

### Solutions

**Use Higher Performance Storage Class:**
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: high-performance-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3   # or io2, premium-rwo (GKE), etc.
  resources:
    requests:
      storage: 100Gi
```

**Provision IOPS (AWS EBS io2):**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io2-high-iops
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iops: "10000"
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer
```

**Use Local NVMe for Ultra-Low Latency:**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme
  local:
    path: /mnt/disks/nvme0n1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-with-nvme
```

---

## Application-Level Metrics

### Expose Prometheus Metrics

**Add metrics endpoint to application:**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: app-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: myapp
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
```
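
Note that the `prometheus.io/*` annotations are only honored if the Prometheus scrape configuration looks for them; the Prometheus Operator uses ServiceMonitor objects instead. Either way, the endpoint itself is easy to verify:

```bash
# Confirm the application actually serves metrics
kubectl port-forward svc/app-metrics 8080:8080 -n <namespace>
curl -s http://localhost:8080/metrics | head
```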

### Key Metrics to Monitor

**Application metrics:**
- Request rate
- Request latency (p50, p95, p99)
- Error rate
- Active connections
- Queue depth
- Cache hit rate

**Example Prometheus queries:**
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Request rate
sum(rate(http_requests_total[5m]))
```

### Distributed Tracing

**Implement OpenTelemetry:**
```yaml
# Deploy Jaeger (all-in-one is for development/testing, not production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:latest
        ports:
        - containerPort: 16686   # UI
        - containerPort: 14268   # Collector
```
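
Applications then export spans to the collector; with recent all-in-one images that accept OTLP (an assumption worth checking for your Jaeger version), a Service named `jaeger` exposing port 4317 plus one env var is enough for many OpenTelemetry SDKs:

```bash
# Point the app's OpenTelemetry SDK at the Jaeger OTLP endpoint
kubectl set env deployment/<app-deployment> -n <namespace> \
  OTEL_EXPORTER_OTLP_ENDPOINT="http://jaeger:4317"
```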

**Instrument application:**
- Add OpenTelemetry SDK to application
- Configure trace export to Jaeger
- Analyze end-to-end request traces to identify bottlenecks

---

## Cluster-Wide Performance

### Cluster Resource Utilization

```bash
# Overall cluster capacity
kubectl top nodes

# Total resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Resource requests vs limits
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.spec.containers[].resources)"'
```
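
Sorting makes hot spots pop out immediately:

```bash
# Busiest nodes and pods first
kubectl top nodes --sort-by=cpu
kubectl top pods --all-namespaces --sort-by=memory
```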

### Control Plane Performance

```bash
# Check API server latency
kubectl get --raw /metrics | grep apiserver_request_duration_seconds

# Check etcd performance
kubectl exec -it -n kube-system etcd-<node> -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  check perf

# Controller manager metrics
kubectl get --raw /metrics | grep workqueue_depth
```

### Scheduler Performance

```bash
# Check scheduler latency
kubectl get --raw /metrics | grep scheduler_scheduling_duration_seconds

# Check pending pods
kubectl get pods --all-namespaces --field-selector status.phase=Pending

# Scheduler logs
kubectl logs -n kube-system kube-scheduler-<node>
```

### Solutions for Cluster-Wide Issues

**Scale Control Plane:**
- Add more control plane nodes
- Increase API server replicas
- Tune etcd (increase memory, use SSD)

**Optimize Scheduling:**
- Use pod priority and preemption
- Implement pod topology spread constraints (see the sketch below)
- Use node affinity/anti-affinity appropriately
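
A minimal topology spread sketch that keeps replicas of one app balanced across nodes (label values are illustrative):

```yaml
# Pod spec fragment: spread replicas evenly across nodes
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: myapp
```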

**Resource Management:**
- Set appropriate resource requests and limits
- Use LimitRanges and ResourceQuotas
- Implement VerticalPodAutoscaler for right-sizing

---

## Performance Optimization Checklist

### Application Level
- [ ] Implement connection pooling
- [ ] Enable response caching
- [ ] Optimize database queries
- [ ] Use async/non-blocking I/O
- [ ] Implement circuit breakers
- [ ] Profile and optimize hot paths

### Kubernetes Level
- [ ] Set appropriate resource requests/limits
- [ ] Use HPA for auto-scaling
- [ ] Implement readiness/liveness probes correctly
- [ ] Use anti-affinity for high availability
- [ ] Optimize container image size
- [ ] Use multi-stage builds

### Infrastructure Level
- [ ] Use appropriate instance/node types
- [ ] Enable cluster autoscaling
- [ ] Use high-performance storage classes
- [ ] Optimize network topology
- [ ] Implement monitoring and alerting
- [ ] Run regular performance testing

---

## Monitoring Tools

**Essential tools:**
- **Prometheus + Grafana**: Metrics and dashboards
- **Jaeger/Zipkin**: Distributed tracing
- **kube-state-metrics**: Kubernetes object metrics
- **node-exporter**: Node-level metrics
- **cAdvisor**: Container metrics
- **kubectl-flame**: CPU profiling

**Commercial options:**
- Datadog
- New Relic
- Dynatrace
- Elastic APM