skills/references/common_issues.md

# Common Kubernetes Issues and Solutions

This reference provides detailed information about common Kubernetes issues, their root causes, and remediation steps.

## Table of Contents

1. [Pod Issues](#pod-issues)
2. [Node Issues](#node-issues)
3. [Networking Issues](#networking-issues)
4. [Storage Issues](#storage-issues)
5. [Resource Issues](#resource-issues)
6. [Security Issues](#security-issues)

---

## Pod Issues

### ImagePullBackOff / ErrImagePull

**Symptoms:**
- Pod stuck in `ImagePullBackOff` or `ErrImagePull` state
- Events show image pull errors

**Common Causes:**
1. Image doesn't exist or tag is wrong
2. Registry authentication failure
3. Network issues reaching registry
4. Rate limiting from registry
5. Private registry without imagePullSecrets

**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name>
```

**Remediation Steps:**
1. Verify image name and tag: `docker pull <image>` from local machine
2. Check imagePullSecrets exist: `kubectl get secrets -n <namespace>`
3. Verify secret has correct registry credentials
4. Check if registry is accessible from cluster
5. For rate limiting: implement imagePullSecrets with an authenticated account (see the sketch below)
6. Update deployment with correct image path
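
A minimal sketch of wiring up registry credentials; the secret name `regcred`, the registry URL, and the choice to patch the default service account are placeholders to adapt:

```bash
# Create a docker-registry secret (credentials here are placeholders)
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=<username> \
  --docker-password=<password> \
  -n <namespace>

# Attach it to the namespace's default service account so new pods use it,
# or reference it under imagePullSecrets in the pod spec instead
kubectl patch serviceaccount default -n <namespace> \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'
```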

**Prevention:**
- Use specific image tags instead of `latest`
- Implement image pull secrets for private registries
- Set up local registry or cache to reduce external pulls
- Use image validation in CI/CD pipeline

---

### CrashLoopBackOff

**Symptoms:**
- Pod repeatedly crashes and restarts
- Increasing restart count
- Container exits shortly after starting

**Common Causes:**
1. Application error on startup
2. Missing environment variables or config
3. Resource limits too restrictive
4. Liveness probe too aggressive
5. Missing dependencies (DB, cache, etc.)
6. Command/args misconfiguration

**Diagnostic Commands:**
```bash
kubectl logs <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous
kubectl describe pod <pod-name> -n <namespace>
kubectl get pod <pod-name> -n <namespace> -o yaml
```

**Remediation Steps:**
1. Check current logs: `kubectl logs <pod-name>`
2. Check previous container logs: `kubectl logs <pod-name> --previous`
3. Look for startup errors, stack traces, or missing config messages
4. Verify environment variables are set correctly
5. Check if external dependencies are accessible
6. Review and adjust resource limits if OOMKilled
7. Adjust liveness probe initialDelaySeconds if failing too early (see the probe sketch below)
8. Verify command and args in pod spec
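
A hedged probe sketch for step 7; the endpoint, port, and timings are assumptions to tune per application:

```yaml
livenessProbe:
  httpGet:
    path: /healthz          # assumes the app exposes an HTTP health endpoint
    port: 8080
  initialDelaySeconds: 30   # give the app time to start before the first check
  periodSeconds: 10
  failureThreshold: 3       # restart only after three consecutive failures
```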

**Prevention:**
- Implement proper application health checks and readiness
- Use init containers for dependency checks
- Set appropriate resource requests/limits based on profiling
- Configure liveness probes with sufficient delay
- Add retry logic and graceful degradation to applications

---

### Pending Pods

**Symptoms:**
- Pod stuck in `Pending` state
- Not scheduled to any node

**Common Causes:**
1. Insufficient cluster resources (CPU/memory)
2. Node selectors/affinity rules can't be satisfied
3. No nodes match taints/tolerations
4. PersistentVolume not available
5. Resource quotas exceeded
6. Scheduler not running or misconfigured

**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl get nodes -o wide
kubectl describe nodes
kubectl get pv,pvc -n <namespace>
kubectl get resourcequota -n <namespace>
```

**Remediation Steps:**
1. Check pod events for scheduling failure reason
2. Verify node resources: `kubectl describe nodes | grep -A 5 "Allocated resources"`
3. Check if pod has node selectors: verify nodes have required labels
4. Review taints on nodes and tolerations on pod (see the sketch below)
5. For PVC issues: verify PV exists and is in `Available` state
6. Check namespace resource quota: `kubectl describe resourcequota -n <namespace>`
7. If no resources: scale cluster or reduce resource requests
8. If affinity issue: adjust affinity rules or add matching nodes
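
Two quick ways to surface scheduling mismatches for steps 3 and 4; `custom-columns` and `--show-labels` are standard kubectl options:

```bash
# List each node's taint keys in one view
kubectl get nodes -o custom-columns='NAME:.metadata.name,TAINTS:.spec.taints[*].key'

# Show node labels to compare against the pod's nodeSelector/affinity terms
kubectl get nodes --show-labels
```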

**Prevention:**
- Set appropriate resource requests (not just limits)
- Monitor cluster capacity and scale proactively
- Use pod disruption budgets to prevent total unavailability
- Implement cluster autoscaling
- Use multiple node pools for different workload types

---

### OOMKilled Pods

**Symptoms:**
- Pod terminates with exit code 137
- Container status shows `OOMKilled` reason
- Frequent restarts due to memory exhaustion

**Common Causes:**
1. Memory limit set too low
2. Memory leak in application
3. Unexpected load increase
4. No memory limits (using node's memory)

**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> --previous -n <namespace>
kubectl top pod <pod-name> -n <namespace>
```

**Remediation Steps:**
1. Confirm OOMKilled in pod events or status
2. Check memory usage before crash if metrics available
3. Review application logs for memory-intensive operations
4. Increase memory limit if application legitimately needs more
5. Profile application to identify memory leaks
6. Implement memory limits with requests = limits for guaranteed QoS (see the sketch below)
7. Consider horizontal scaling instead of vertical
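
A sketch of the Guaranteed QoS pattern from step 6; the sizes are placeholders that should come from profiling (both CPU and memory requests must equal their limits for the Guaranteed class):

```yaml
resources:
  requests:
    memory: "512Mi"
    cpu: "500m"
  limits:
    memory: "512Mi"   # equal to the request
    cpu: "500m"       # equal to the request
```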

**Prevention:**
- Profile applications to determine actual memory needs
- Set memory requests based on normal usage, limits with headroom
- Implement application-level memory monitoring
- Use horizontal pod autoscaling based on memory metrics
- Regular load testing to understand resource requirements

---

## Node Issues

### Node NotReady

**Symptoms:**
- Node status shows `NotReady`
- Pods on node may be evicted
- Node may be cordoned automatically

**Common Causes:**
1. Kubelet stopped or crashed
2. Network connectivity issues
3. Disk pressure
4. Memory pressure
5. PID pressure
6. Container runtime issues

**Diagnostic Commands:**
```bash
kubectl describe node <node-name>
kubectl get nodes -o wide
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
ssh <node> "df -h"
ssh <node> "free -m"
```

**Remediation Steps:**
1. Check node conditions in describe output
2. Verify kubelet is running: `systemctl status kubelet`
3. Check kubelet logs: `journalctl -u kubelet`
4. For disk pressure: clean up unused images/containers
5. For memory pressure: identify and stop memory-hogging processes
6. Restart kubelet if crashed: `systemctl restart kubelet`
7. Check container runtime: `systemctl status docker` or `systemctl status containerd`
8. Verify network connectivity to API server

**Prevention:**
- Monitor node resources with alerts
- Implement log rotation and image cleanup
- Set up node problem detector
- Use resource quotas to prevent resource exhaustion
- Regular maintenance windows for updates

---

### Disk Pressure

**Symptoms:**
- Node condition `DiskPressure` is True
- Pod evictions may occur
- Node may become NotReady

**Common Causes:**
1. Docker/containerd image cache filling disk
2. Container logs consuming space
3. Ephemeral storage usage by pods
4. System logs filling up

**Diagnostic Commands:**
```bash
kubectl describe node <node-name> | grep -A 10 Conditions
ssh <node> "df -h"
ssh <node> "du -sh /var/lib/docker/*"
ssh <node> "du -sh /var/lib/containerd/*"
ssh <node> "docker system df"
```

**Remediation Steps:**
1. Clean up unused images: `docker image prune -a`
2. Clean up stopped containers: `docker container prune`
3. Clean up volumes: `docker volume prune`
4. Rotate and compress logs
5. Check for pods with excessive ephemeral storage usage
6. Expand disk if consistently running out of space
7. Configure kubelet garbage collection parameters (see the sketch below)
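
A sketch of the kubelet garbage-collection knobs from step 7, expressed as KubeletConfiguration fields (on kubeadm nodes the file is typically /var/lib/kubelet/config.yaml; the thresholds are illustrative, and the kubelet must be restarted afterwards):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
imageGCHighThresholdPercent: 85   # start image GC when disk usage exceeds 85%
imageGCLowThresholdPercent: 80    # collect until usage drops below 80%
evictionHard:
  nodefs.available: "10%"
  imagefs.available: "15%"
```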

**Prevention:**
- Set imagefs.available threshold in kubelet config
- Implement automated image pruning
- Use log rotation for container logs
- Monitor disk usage with alerts
- Set ephemeral-storage limits on pods
- Size nodes appropriately for workload

---

## Networking Issues

### Pod-to-Pod Communication Failure

**Symptoms:**
- Services can't reach other services
- Connection timeouts between pods
- DNS resolution may or may not work

**Common Causes:**
1. Network policy blocking traffic
2. CNI plugin issues
3. Firewall rules blocking traffic
4. Service misconfiguration
5. Pod CIDR exhaustion

**Diagnostic Commands:**
```bash
kubectl get networkpolicies --all-namespaces
kubectl exec -it <pod> -- ping <target-pod-ip>
kubectl exec -it <pod> -- nslookup <service-name>
kubectl get svc -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
```

**Remediation Steps:**
1. Test basic connectivity: exec into pod and ping target
2. Check network policies: look for policies affecting source/dest namespaces (see the sketch below)
3. Verify service has endpoints: `kubectl get endpoints`
4. Check if pod labels match service selector
5. Verify CNI plugin pods are healthy (usually in kube-system)
6. Check node-level firewall rules
7. Verify pod CIDR hasn't been exhausted
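
An illustrative policy to reason about while auditing (names and labels are placeholders): it admits ingress to `app: backend` pods only from `app: frontend` pods in the same namespace, so once a policy selects a pod, everything else is dropped:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-backend
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
```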

**Prevention:**
- Document network policies and their intent
- Use namespace labels for network policy management
- Monitor CNI plugin health
- Regularly audit network policies
- Implement network policy testing in CI/CD

---

### Service Not Accessible

**Symptoms:**
- Cannot connect to service
- LoadBalancer stuck in Pending
- NodePort not accessible

**Common Causes:**
1. Service has no endpoints (no matching pods)
2. Pods not passing readiness checks
3. LoadBalancer controller not working
4. Cloud provider integration issues
5. Port conflicts

**Diagnostic Commands:**
```bash
kubectl get svc <service-name> -n <namespace>
kubectl describe svc <service-name> -n <namespace>
kubectl get endpoints <service-name> -n <namespace>
kubectl get pods -l <service-selector> -n <namespace>
kubectl logs -l <service-selector> -n <namespace>
```

**Remediation Steps:**
1. Verify service type and ports are correct
2. Check if service has endpoints: `kubectl get endpoints`
3. If no endpoints: verify pod selector matches pod labels (see the sketch below)
4. Check pod readiness: `kubectl describe pod`
5. For LoadBalancer: check cloud provider controller logs
6. For NodePort: verify node firewall allows the port
7. Test from within cluster first, then external access
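
A quick selector-to-labels cross-check for step 3; the jsonpath output is the label map the service expects:

```bash
# Print the service's selector
kubectl get svc <service-name> -n <namespace> -o jsonpath='{.spec.selector}'

# Feed the selector back as key=value pairs; an empty result means no pod
# carries the labels the service is selecting on
kubectl get pods -n <namespace> -l <key>=<value> --show-labels
```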

**Prevention:**
- Use meaningful service selectors and pod labels
- Implement proper readiness probes
- Test services in staging before production
- Monitor service endpoint counts
- Document external access requirements

---

## Storage Issues

### PVC Pending

**Symptoms:**
- PersistentVolumeClaim stuck in `Pending` state
- Pod can't start due to volume not available

**Common Causes:**
1. No PV matches PVC requirements
2. StorageClass doesn't exist or is misconfigured
3. Dynamic provisioner not working
4. Insufficient permissions for provisioner
5. Volume capacity exhausted in storage backend

**Diagnostic Commands:**
```bash
kubectl describe pvc <pvc-name> -n <namespace>
kubectl get pv
kubectl get storageclass
kubectl describe storageclass <storage-class-name>
kubectl get pods -n <provisioner-namespace>
```

**Remediation Steps:**
1. Check PVC events for specific error message
2. Verify StorageClass exists: `kubectl get sc`
3. Check if dynamic provisioner pod is running
4. For static provisioning: ensure PV exists with matching size/access mode
5. Verify provisioner has correct cloud credentials/permissions
6. Check storage backend capacity
7. Review StorageClass parameters for typos

**Prevention:**
- Use dynamic provisioning with reliable StorageClass
- Monitor storage backend capacity
- Set up alerts for PVC binding failures
- Test storage provisioning in non-production first
- Document storage requirements and limitations

---

### Volume Mount Failures

**Symptoms:**
- Pod fails to start with volume mount errors
- Events show mounting failures
- Container create errors related to volumes

**Common Causes:**
1. Volume already mounted to different node (RWO with multi-attach)
2. Volume doesn't exist
3. Insufficient permissions
4. Node can't reach storage backend
5. Filesystem issues on volume

**Diagnostic Commands:**
```bash
kubectl describe pod <pod-name> -n <namespace>
kubectl describe pvc <pvc-name> -n <namespace>
kubectl describe pv <pv-name>
kubectl get volumeattachments
```

**Remediation Steps:**
1. Check pod events for specific mount error
2. For RWO volumes: ensure pod is scheduled to node with volume attached
3. Verify PVC is bound to a PV
4. Check node can reach storage backend (cloud/NFS/iSCSI)
5. For CSI volumes: check CSI driver pods are healthy
6. Delete and recreate pod if volume is stuck in multi-attach state
7. Check filesystem on volume if accessible

**Prevention:**
- Use ReadWriteMany for multi-pod access scenarios
- Implement pod disruption budgets to prevent scheduling conflicts
- Monitor volume attachment status
- Use StatefulSets for stateful workloads with volumes
- Regular backup and restore testing

---

## Resource Issues

### Resource Quota Exceeded

**Symptoms:**
- New pods fail to schedule
- Error: "exceeded quota"
- ResourceQuota limits preventing resource allocation

**Common Causes:**
1. Namespace resource quota exceeded
2. Not enough resource budget available
3. Resource requests not specified on pods
4. Quota misconfiguration

**Diagnostic Commands:**
```bash
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
kubectl get pods -n <namespace> -o json | jq '.items[].spec.containers[].resources'
kubectl describe namespace <namespace>
```

**Remediation Steps:**
1. Check current quota usage: `kubectl describe resourcequota`
2. Identify pods consuming resources
3. Either increase quota limits or reduce resource requests
4. Delete unused resources to free up quota
5. Optimize pod resource requests based on actual usage
6. Consider splitting workloads across namespaces

**Prevention:**
- Set quotas based on actual usage patterns
- Monitor quota usage with alerts
- Right-size pod resource requests
- Implement automatic cleanup of completed jobs/pods
- Regular quota review and adjustment

---

### CPU Throttling

**Symptoms:**
- Application performance degradation
- High CPU throttling metrics
- Services responding slowly despite available CPU

**Common Causes:**
1. CPU limits set too low
2. Burst workloads hitting limits
3. Noisy neighbor effects
4. CPU limits set without understanding workload

**Diagnostic Commands:**
```bash
kubectl top pod <pod-name> -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 Limits
kubectl exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.stat  # cgroup v1 (use /sys/fs/cgroup/cpu.stat on cgroup v2); look for nr_throttled and throttled_time
```

**Remediation Steps:**
1. Check current CPU usage vs limits
2. Review throttling metrics if available
3. Increase CPU limits if application legitimately needs more
4. Remove CPU limits if workload is bursty (keep requests)
5. Use HPA if load varies significantly
6. Profile application to identify CPU-intensive operations

**Prevention:**
- Set CPU requests based on average usage
- Set CPU limits with 50-100% headroom for bursts
- Use HPA for variable workloads
- Monitor CPU throttling metrics
- Regular performance testing and profiling

---

## Security Issues

### Image Vulnerability

**Symptoms:**
- Security scanner reports vulnerabilities
- Compliance violations
- Known CVEs in running images

**Diagnostic Commands:**
```bash
kubectl get pods -o jsonpath='{.items[*].spec.containers[*].image}' | tr ' ' '\n' | sort -u
trivy image <image-name>
```

**Remediation Steps:**
1. Identify vulnerable images with scanner
2. Update base images to patched versions
3. Rebuild application images with updated dependencies
4. Update deployments with new image tags
5. Implement admission controller to block vulnerable images

**Prevention:**
- Scan images in CI/CD pipeline
- Regular base image updates
- Use minimal base images
- Implement admission controllers (OPA, Kyverno)
- Automated image updates and testing

---

### RBAC Permission Denied

**Symptoms:**
- Users or service accounts can't perform operations
- "forbidden" errors in logs
- API calls fail with 403 errors

**Common Causes:**
1. Missing Role or ClusterRole binding
2. Overly restrictive RBAC policies
3. Wrong service account in pod
4. Namespace-scoped role for cluster-wide resource

**Diagnostic Commands:**
```bash
kubectl auth can-i <verb> <resource> --as=<user/sa>
kubectl get rolebindings -n <namespace>
kubectl get clusterrolebindings
kubectl describe role <role-name> -n <namespace>
kubectl describe serviceaccount <sa-name> -n <namespace>
```

**Remediation Steps:**
1. Identify exact permission needed from error message
2. Check what user/SA can do: `kubectl auth can-i --list`
3. Create appropriate Role/ClusterRole with needed permissions (see the sketch below)
4. Create RoleBinding/ClusterRoleBinding
5. Verify service account is correctly set in pod spec
6. Test with `kubectl auth can-i` before deploying
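
A minimal sketch for steps 3 and 4: read-only pod access for one service account in one namespace; every name here is a placeholder:

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: pod-reader
  namespace: <namespace>
rules:
  - apiGroups: [""]
    resources: ["pods", "pods/log"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: pod-reader-binding
  namespace: <namespace>
subjects:
  - kind: ServiceAccount
    name: <sa-name>
    namespace: <namespace>
roleRef:
  kind: Role
  name: pod-reader
  apiGroup: rbac.authorization.k8s.io
```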

**Prevention:**
- Follow principle of least privilege
- Use namespace-scoped roles where possible
- Document RBAC policies and their purpose
- Regular RBAC audits
- Use pre-defined roles when possible

---

This reference covers the most common Kubernetes issues. For each issue, always:
1. Gather information (describe, logs, events)
2. Form a hypothesis based on the symptoms
3. Test the hypothesis with targeted diagnostics
4. Apply remediation
5. Verify the fix
6. Document for future reference

skills/references/helm_troubleshooting.md

# Helm Troubleshooting Guide

Comprehensive guide for troubleshooting Helm releases, charts, and deployments.

## Table of Contents

1. [Helm Release Issues](#helm-release-issues)
2. [Chart Installation Failures](#chart-installation-failures)
3. [Upgrade and Rollback Problems](#upgrade-and-rollback-problems)
4. [Values and Configuration](#values-and-configuration)
5. [Chart Dependencies](#chart-dependencies)
6. [Hooks and Lifecycle](#hooks-and-lifecycle)
7. [Repository Issues](#repository-issues)

---

## Helm Release Issues

### Release Stuck in Pending-Install/Pending-Upgrade

**Symptoms:**
- Release shows status `pending-install` or `pending-upgrade`
- New installations or upgrades hang indefinitely
- `helm list` shows release but resources not created

**Diagnostic Commands:**
```bash
# Check release status
helm list -n <namespace>
helm status <release-name> -n <namespace>

# Check release history
helm history <release-name> -n <namespace>

# Get detailed release information
kubectl get secrets -n <namespace> -l owner=helm,status=pending-install

# Check helm operation pods/jobs
kubectl get pods -n <namespace> -l app.kubernetes.io/managed-by=Helm
```

**Common Causes:**
1. Previous installation failed and wasn't cleaned up
2. Helm hooks stuck or failed
3. Kubernetes resources can't be created (RBAC, quotas)
4. Timeout during installation

**Resolution:**

```bash
# Check for stuck hooks
kubectl get jobs -n <namespace>
kubectl get pods -n <namespace> -l "helm.sh/hook"  # matches only if the chart labels its hooks; helm.sh/hook itself is an annotation

# Delete stuck hooks
kubectl delete job <hook-job> -n <namespace>

# Rollback to previous version
helm rollback <release-name> -n <namespace>

# Force delete release (last resort)
helm delete <release-name> -n <namespace> --no-hooks

# Clean up secrets
kubectl delete secret -n <namespace> -l owner=helm,name=<release-name>
```

### Release Shows as Deployed but Resources Missing

**Symptoms:**
- `helm list` shows release as deployed
- Expected pods/services not running
- Resources partially created

**Investigation:**
```bash
# Get manifest from release
helm get manifest <release-name> -n <namespace>

# Compare with what's actually deployed
helm get manifest <release-name> -n <namespace> | kubectl diff -f -

# Check helm values used
helm get values <release-name> -n <namespace>

# Check for resource creation errors
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
```

**Resolution:**
```bash
# Reapply the release
helm upgrade <release-name> <chart> -n <namespace> --reuse-values

# Or force recreate
helm upgrade <release-name> <chart> -n <namespace> --force
```

---

## Chart Installation Failures

### "Release Already Exists" Error

**Symptoms:**
```
Error: INSTALLATION FAILED: cannot re-use a name that is still in use
```

**Resolution:**
```bash
# Check existing releases
helm list -n <namespace> --all

# Check for failed releases
helm list -n <namespace> --failed

# Uninstall existing release
helm uninstall <release-name> -n <namespace>

# Or use different release name
helm install <new-release-name> <chart> -n <namespace>
```

### "Invalid Chart" Error

**Symptoms:**
```
Error: INSTALLATION FAILED: chart requires kubeVersion: >=1.20.0 which is incompatible with Kubernetes v1.19.0
```

**Investigation:**
```bash
# Check chart requirements
helm show chart <chart-name>

# Check Kubernetes version
kubectl version

# Inspect chart contents
helm pull <chart> --untar
cat <chart>/Chart.yaml
```

**Resolution:**
- Upgrade Kubernetes cluster to meet requirements
- Use compatible chart version: `helm install <release> <chart> --version <compatible-version>`
- Modify Chart.yaml kubeVersion requirement (not recommended)

### Template Rendering Errors

**Symptoms:**
```
Error: INSTALLATION FAILED: template: <chart>/templates/deployment.yaml:10:4:
executing "<chart>/templates/deployment.yaml" at <.Values.invalid.path>:
nil pointer evaluating interface {}.path
```

**Investigation:**
```bash
# Render templates locally
helm template <release-name> <chart> -n <namespace>

# Render with your values
helm template <release-name> <chart> -f values.yaml -n <namespace>

# Debug mode
helm install <release-name> <chart> -n <namespace> --debug --dry-run
```

**Resolution:**
1. Check values.yaml for missing required fields
2. Verify template syntax in chart
3. Use `helm lint` to validate chart

```bash
# Lint chart
helm lint <chart-directory>

# Lint with values
helm lint <chart-directory> -f values.yaml
```

---

## Upgrade and Rollback Problems

### Upgrade Fails with "Rendered Manifests Contain Errors"

**Symptoms:**
```
Error: UPGRADE FAILED: unable to build kubernetes objects from release manifest
```

**Investigation:**
```bash
# Dry run upgrade
helm upgrade <release-name> <chart> -n <namespace> --dry-run --debug

# Compare differences
helm diff upgrade <release-name> <chart> -n <namespace> # requires helm-diff plugin
```

**Resolution:**
```bash
# Fix values.yaml and retry
helm upgrade <release-name> <chart> -n <namespace> -f fixed-values.yaml

# Skip tests if test hooks failing
helm upgrade <release-name> <chart> -n <namespace> --no-hooks

# Force upgrade
helm upgrade <release-name> <chart> -n <namespace> --force
```

### Rollback Fails

**Symptoms:**
```
Error: ROLLBACK FAILED: release <release-name> failed, and has been rolled back due to atomic being set
```

**Investigation:**
```bash
# Check release history
helm history <release-name> -n <namespace>

# Get specific revision manifest
helm get manifest <release-name> --revision <revision-number> -n <namespace>
```

**Resolution:**
```bash
# Rollback to specific working revision
helm rollback <release-name> <revision-number> -n <namespace>

# If rollback fails, uninstall and reinstall
helm uninstall <release-name> -n <namespace>
helm install <release-name> <chart> -n <namespace> -f values.yaml

# Clean up failed rollback
kubectl get secrets -n <namespace> -l owner=helm,status=pending-rollback
kubectl delete secret <secret-name> -n <namespace>
```

### "Immutable Field" Error During Upgrade

**Symptoms:**
```
Error: UPGRADE FAILED: Service "myapp" is invalid: spec.clusterIP: Invalid value: "": field is immutable
```

**Common immutable fields:**
- Service `clusterIP`
- StatefulSet `volumeClaimTemplates`
- PVC `storageClassName`

**Resolution:**
```bash
# Option 1: Delete and recreate resource
kubectl delete service <service-name> -n <namespace>
helm upgrade <release-name> <chart> -n <namespace>

# Option 2: Use --force to recreate resources
helm upgrade <release-name> <chart> -n <namespace> --force

# Option 3: Manually patch the resource
kubectl patch service <service-name> -n <namespace> --type='json' \
  -p='[{"op": "remove", "path": "/spec/clusterIP"}]'
```

---

## Values and Configuration

### Values Not Applied

**Symptoms:**
- Deployed resources don't reflect values in values.yaml
- Default chart values used instead of custom values

**Investigation:**
```bash
# Check what values were used in deployment
helm get values <release-name> -n <namespace>

# Check all values (including defaults)
helm get values <release-name> -n <namespace> --all

# Test rendering with your values
helm template <release-name> <chart> -f values.yaml -n <namespace> | less
```

**Resolution:**
```bash
# Ensure values file is specified correctly
helm upgrade <release-name> <chart> -n <namespace> -f values.yaml

# Use multiple values files (later files override earlier)
helm upgrade <release-name> <chart> -n <namespace> \
  -f values-common.yaml \
  -f values-prod.yaml

# Set specific values via CLI
helm upgrade <release-name> <chart> -n <namespace> \
  --set image.tag=v2.0.0 \
  --set replicas=5
```

### Values File Parsing Errors

**Symptoms:**
```
Error: INSTALLATION FAILED: YAML parse error
```

**Investigation:**
```bash
# Validate YAML syntax
yamllint values.yaml

# Or use Python
python3 -c 'import yaml; yaml.safe_load(open("values.yaml"))'

# Check for tabs (YAML doesn't allow tabs)
grep -n $'\t' values.yaml
```

**Resolution:**
- Fix YAML syntax errors
- Replace tabs with spaces
- Ensure proper indentation
- Quote special characters in strings

### Secret Values Not Working

**Symptoms:**
- Secrets not created or contain wrong values
- Base64 encoding issues

**Investigation:**
```bash
# Check secret in manifest
helm get manifest <release-name> -n <namespace> | grep -A 10 "kind: Secret"

# Decode secret
kubectl get secret <secret-name> -n <namespace> -o json | \
  jq '.data | map_values(@base64d)'
```

**Resolution:**
```yaml
# Use proper secret format in values.yaml
secrets:
  password: "mySecretPassword" # plain value; the chart template must base64-encode it

# Or pre-encode if template expects it
secrets:
  password: "bXlTZWNyZXRQYXNzd29yZA==" # Already base64 encoded
```
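
Which form is right depends on the chart's template. A sketch of the template side, assuming the value arrives plain (`b64enc` is a standard Sprig function available in Helm templates):

```yaml
# templates/secret.yaml (names are placeholders)
apiVersion: v1
kind: Secret
metadata:
  name: {{ .Release.Name }}-credentials
type: Opaque
data:
  password: {{ .Values.secrets.password | b64enc | quote }}
```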

---

## Chart Dependencies

### Dependency Update Fails

**Symptoms:**
```
Error: An error occurred while checking for chart dependencies
```

**Investigation:**
```bash
# Check Chart.yaml dependencies
cat Chart.yaml

# List current dependencies
helm dependency list <chart-directory>

# Check repository access
helm repo list
helm repo update
```

**Resolution:**
```bash
# Update dependencies
helm dependency update <chart-directory>

# Build dependencies (downloads to charts/)
helm dependency build <chart-directory>

# Add missing repositories
helm repo add <repo-name> <repo-url>
helm repo update
```

### Dependency Version Conflicts

**Symptoms:**
```
Error: found in Chart.yaml, but missing in charts/ directory
```

**Resolution:**
```bash
# Clean dependencies
rm -rf <chart-directory>/charts/*
rm -f <chart-directory>/Chart.lock

# Rebuild
helm dependency update <chart-directory>

# Verify
helm dependency list <chart-directory>
```

### Subchart Values Not Applied

**Investigation:**
```bash
# Check subchart values in parent chart
cat values.yaml | grep -A 20 <subchart-name>

# Render to see what values subchart receives
helm template <release-name> <chart> -f values.yaml | grep -A 50 "# Source: <subchart-name>"
```

**Resolution:**
```yaml
# In parent chart's values.yaml, nest subchart values under subchart name:
postgresql: # Subchart name
  auth:
    username: myuser
    password: mypass
    database: mydb
  primary:
    resources:
      requests:
        memory: "256Mi"
        cpu: "250m"
```

---

## Hooks and Lifecycle

### Pre/Post Hooks Failing

**Symptoms:**
- Installation/upgrade hangs waiting for hooks
- Hook jobs fail
- Release stuck in pending state

**Investigation:**
```bash
# List hooks
kubectl get jobs -n <namespace> -l "helm.sh/hook"

# Check hook status
kubectl describe job <hook-job-name> -n <namespace>

# Get hook logs
kubectl logs -n <namespace> -l "helm.sh/hook=pre-install"
kubectl logs -n <namespace> -l "helm.sh/hook=post-install"
```

**Resolution:**
```bash
# Delete failed hooks
kubectl delete job -n <namespace> -l "helm.sh/hook"

# Retry without hooks
helm upgrade <release-name> <chart> -n <namespace> --no-hooks

# Or skip hooks during install
helm install <release-name> <chart> -n <namespace> --no-hooks
```

### Hook Cleanup Issues

**Symptoms:**
- Hook resources remain after installation
- Accumulating failed hook jobs

**Investigation:**
```bash
# Check hook deletion policy (hooks are stored separately from the manifest)
helm get hooks <release-name> -n <namespace> | grep -B 5 "helm.sh/hook-delete-policy"

# List remaining hooks
kubectl get all -n <namespace> -l "helm.sh/hook"
```

**Resolution:**
```bash
# Manual cleanup
kubectl delete jobs,pods -n <namespace> -l "helm.sh/hook"

# Update chart template to include proper hook-delete-policy:
# metadata:
#   annotations:
#     "helm.sh/hook": pre-install
#     "helm.sh/hook-delete-policy": hook-succeeded,hook-failed
```

---

## Repository Issues

### Unable to Get Chart from Repository

**Symptoms:**
```
Error: failed to download "<chart-name>"
```

**Investigation:**
```bash
# Check repository configuration
helm repo list

# Update repositories
helm repo update

# Search for chart
helm search repo <chart-name> --versions

# Test repository access
curl -I <repo-url>/index.yaml
```

**Resolution:**
```bash
# Remove and re-add repository
helm repo remove <repo-name>
helm repo add <repo-name> <repo-url>
helm repo update

# For private repos, configure credentials
helm repo add <repo-name> <repo-url> \
  --username=<username> \
  --password=<password>

# Or use OCI registry
helm pull oci://registry.example.com/charts/<chart-name> --version 1.0.0
```

### Chart Version Not Found

**Symptoms:**
```
Error: chart "<chart-name>" version "1.2.3" not found
```

**Investigation:**
```bash
# List available versions
helm search repo <chart-name> --versions

# Check if specific version exists
helm show chart <repo-name>/<chart-name> --version 1.2.3
```

**Resolution:**
```bash
# Use available version
helm install <release-name> <repo-name>/<chart-name> --version <available-version>

# Or use latest
helm install <release-name> <repo-name>/<chart-name>
```

---

## Debugging Tools and Commands

### Essential Helm Commands

```bash
# Get release information
helm list -n <namespace> --all
helm status <release-name> -n <namespace>
helm history <release-name> -n <namespace>

# Get release content
helm get values <release-name> -n <namespace>
helm get manifest <release-name> -n <namespace>
helm get hooks <release-name> -n <namespace>
helm get notes <release-name> -n <namespace>

# Debugging
helm install <release-name> <chart> --debug --dry-run -n <namespace>
helm template <release-name> <chart> --debug -n <namespace>

# Testing
helm test <release-name> -n <namespace>
helm lint <chart-directory>
```

### Useful Plugins

```bash
# Install helm-diff plugin
helm plugin install https://github.com/databus23/helm-diff

# Compare releases
helm diff upgrade <release-name> <chart> -n <namespace>

# Install helm-secrets plugin
helm plugin install https://github.com/jkroepke/helm-secrets

# Use encrypted values
helm secrets install <release-name> <chart> -f secrets.yaml -n <namespace>
```

### Helm Environment Issues

**Check Helm configuration:**
```bash
# Helm version
helm version

# Kubernetes context
kubectl config current-context

# Helm environment
helm env

# Cache location
helm env | grep CACHE
```

---

## Best Practices

### Release Management
- Use descriptive release names
- Always specify namespace explicitly
- Use `--atomic` flag for safer upgrades (rolls back on failure)
- Keep release history manageable: `helm history <release> -n <namespace> --max 10`

### Values Management
- Use multiple values files for different environments
- Version control your values files
- Use `helm template` to preview changes before applying
- Document required values in chart README

### Chart Development
- Always run `helm lint` before packaging
- Test charts in multiple environments
- Use semantic versioning for charts
- Implement proper hooks with deletion policies

### Troubleshooting Workflow
1. Check release status: `helm status <release> -n <namespace>`
2. Check history: `helm history <release> -n <namespace>`
3. Get values: `helm get values <release> -n <namespace>`
4. Check manifest: `helm get manifest <release> -n <namespace>`
5. Check Kubernetes events: `kubectl get events -n <namespace>`
6. Check pod logs: `kubectl logs <pod> -n <namespace>`
7. Check hooks: `kubectl get jobs -n <namespace> -l helm.sh/hook`

---

## Quick Reference

### Common Flags

```bash
# Installation/Upgrade
--atomic          # Rollback on failure
--wait            # Wait for resources to be ready
--timeout 10m     # Set timeout (default 5m)
--force           # Force update by deleting and recreating resources
--cleanup-on-fail # Delete resources on failed install

# Debugging
--debug           # Enable verbose output
--dry-run         # Simulate operation
--no-hooks        # Skip hooks

# Values
-f values.yaml    # Use values file
--set key=value   # Set value via command line
--reuse-values    # Reuse values from previous release
```

### Typical Rescue Commands

```bash
# Release stuck? Force delete and reinstall
helm uninstall <release> -n <namespace> --no-hooks
kubectl delete secret -n <namespace> -l owner=helm,name=<release>
helm install <release> <chart> -n <namespace> -f values.yaml

# Upgrade failed? Rollback
helm rollback <release> 0 -n <namespace> # 0 = previous revision

# Can't rollback? Force upgrade (--recreate-pods is Helm 2 only)
helm upgrade <release> <chart> -n <namespace> --force

# Complete cleanup
helm uninstall <release> -n <namespace>
kubectl delete namespace <namespace> # If dedicated namespace
```

skills/references/incident_response.md

# Kubernetes Incident Response Playbook

This playbook provides structured procedures for responding to Kubernetes incidents.

## Incident Response Framework

### 1. Detection Phase
- Identify the incident (alerts, user reports, monitoring)
- Determine severity level
- Initiate incident response

### 2. Triage Phase
- Assess impact and scope
- Gather initial diagnostic data
- Determine if immediate action is needed

### 3. Investigation Phase
- Collect comprehensive diagnostics
- Identify root cause
- Document findings

### 4. Resolution Phase
- Apply remediation
- Verify fix
- Monitor for recurrence

### 5. Post-Incident Phase
- Document incident
- Conduct blameless post-mortem
- Implement preventive measures

---

## Severity Levels

### SEV-1: Critical
- Complete service outage
- Data loss or corruption
- Security breach
- Impact: All users affected
- Response: Immediate, all-hands

### SEV-2: High
- Major functionality degraded
- Significant performance impact
- Impact: Large subset of users
- Response: Within 15 minutes

### SEV-3: Medium
- Minor functionality impaired
- Workaround available
- Impact: Some users affected
- Response: Within 1 hour

### SEV-4: Low
- Cosmetic issues
- Negligible impact
- Impact: Minimal
- Response: During business hours

---

## Common Incident Scenarios

### Scenario 1: Complete Cluster Outage

**Symptoms:**
- All services unreachable
- kubectl commands timing out
- API server not responding

**Immediate Actions:**
1. Verify the scope (single cluster or multi-cluster)
2. Check API server status and logs
3. Check control plane nodes
4. Verify network connectivity to control plane
5. Check etcd cluster health

**Investigation Steps:**
```bash
# Check control plane pods
kubectl get pods -n kube-system

# Check etcd
kubectl exec -it etcd-<node> -n kube-system -- etcdctl endpoint health

# Check API server logs
journalctl -u kube-apiserver -n 100  # systemd-managed setups; on kubeadm the API server is a static pod (use crictl logs on the node)

# Check control plane node resources
ssh <control-plane-node> "top"
```
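
If the bare `etcdctl` call fails with TLS errors, the client usually needs the API version and certificate flags. A fuller invocation, assuming a kubeadm-style certificate layout under /etc/kubernetes/pki/etcd:

```bash
kubectl exec -it etcd-<node> -n kube-system -- sh -c \
  'ETCDCTL_API=3 etcdctl \
     --endpoints=https://127.0.0.1:2379 \
     --cacert=/etc/kubernetes/pki/etcd/ca.crt \
     --cert=/etc/kubernetes/pki/etcd/server.crt \
     --key=/etc/kubernetes/pki/etcd/server.key \
     endpoint health'
```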

**Common Causes:**
- etcd cluster failure
- API server OOM/crash
- Control plane network partition
- Certificate expiration
- Cloud provider outage

**Resolution Paths:**
1. etcd issue: Restore from backup or rebuild cluster
2. API server issue: Restart API server pods/service
3. Network: Fix routing, security groups, or DNS
4. Certificates: Renew certificates (`kubeadm certs renew all`)

---

### Scenario 2: Service Degradation

**Symptoms:**
- Increased latency or error rates
- Some requests failing
- Intermittent issues

**Immediate Actions:**
1. Check service metrics and logs
2. Verify pod health and count
3. Check for recent deployments
4. Review resource utilization

**Investigation Steps:**
```bash
# Check service endpoints
kubectl get endpoints <service> -n <namespace>

# Check pod status
kubectl get pods -l <service-selector> -n <namespace>

# Review recent changes
kubectl rollout history deployment/<name> -n <namespace>

# Check resource usage
kubectl top pods -n <namespace>

# Get recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```

**Common Causes:**
- Insufficient replicas
- Pod restarts/crashes
- Resource contention
- Bad deployment
- External dependency failure

**Resolution Paths:**
1. Scale up replicas if needed
2. Rollback bad deployment (see the sketch below)
3. Increase resources if constrained
4. Fix configuration issues
5. Implement circuit breaker for external deps
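
A sketch of the rollback path from step 2, using standard kubectl rollout commands:

```bash
# Roll the deployment back to the previous revision and watch it settle
kubectl rollout undo deployment/<name> -n <namespace>
kubectl rollout status deployment/<name> -n <namespace>

# Or target a specific revision found via `kubectl rollout history`
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=<revision>
```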

---

### Scenario 3: Node Failure

**Symptoms:**
- Node reported as NotReady
- Pods being evicted from node
- High node resource utilization

**Immediate Actions:**
1. Identify affected node
2. Check impact (which pods running on node)
3. Determine if pods need immediate migration
4. Assess if node is recoverable

**Investigation Steps:**
```bash
# Get node status
kubectl get nodes

# Describe the problem node
kubectl describe node <node-name>

# Check pods on the node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

# SSH to node and check
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
ssh <node> "docker ps" # or "crictl ps" on containerd nodes
ssh <node> "df -h"
ssh <node> "free -m"
```

**Common Causes:**
- Kubelet failure
- Disk full
- Memory exhaustion
- Network issues
- Hardware failure

**Resolution Paths:**
1. Recoverable: Fix issue (clean disk, restart services)
2. Not recoverable: Cordon, drain, and replace node (see the sketch below)
3. For critical pods: Manually reschedule if necessary
4. Update monitoring and alerting based on findings
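
The cordon/drain sequence from step 2, using standard kubectl commands:

```bash
# Stop new pods from landing on the node, then evict what is running there
kubectl cordon <node-name>
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# After repairing or replacing the node, let it schedule pods again
kubectl uncordon <node-name>
```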

---

### Scenario 4: Storage Issues

**Symptoms:**
- PVCs stuck in Pending
- Pods can't start due to volume issues
- Data access failures

**Immediate Actions:**
1. Identify affected PVCs/PVs
2. Check storage backend health
3. Verify provisioner status
4. Assess data integrity risk

**Investigation Steps:**
```bash
# Check PVC status
kubectl get pvc --all-namespaces

# Describe pending PVC
kubectl describe pvc <pvc-name> -n <namespace>

# Check PV status
kubectl get pv

# Check storage class
kubectl get storageclass

# Check provisioner
kubectl get pods -n <storage-namespace>

# Check volume attachments
kubectl get volumeattachments
```

**Common Causes:**
- Storage backend failure/full
- Provisioner issues
- Network to storage backend
- Volume attachment limits reached
- Corrupted volume

**Resolution Paths:**
1. Fix storage backend issues
2. Restart provisioner if needed
3. Manually provision PV if dynamic provisioning failed
4. Delete and recreate if volume corrupted
5. Restore from backup if data lost

---

### Scenario 5: Security Incident

**Symptoms:**
- Unauthorized access detected
- Suspicious pod behavior
- Security alerts triggered
- Unusual network traffic

**Immediate Actions:**
1. Assess severity and scope
2. Isolate affected resources
3. Preserve evidence
4. Engage security team

**Investigation Steps:**
```bash
# Check recent RBAC changes
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json

# Audit pod security contexts
kubectl get pods --all-namespaces -o json | jq '.items[].spec.securityContext'

# Check for privileged pods
kubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.containers[].securityContext.privileged==true)'

# Review service accounts
kubectl get serviceaccounts --all-namespaces

# Get audit logs
cat /var/log/kubernetes/audit/audit.log | grep <suspicious-activity>
```

**Common Causes:**
- Compromised credentials
- Vulnerable container image
- Misconfigured RBAC
- Exposed secrets
- Supply chain attack

**Resolution Paths:**
1. Isolate: Network policies, cordon nodes (see the sketch below)
2. Investigate: Audit logs, pod logs, network flows
3. Remediate: Rotate credentials, patch vulnerabilities
4. Restore: From known-good state if needed
5. Prevent: Enhanced security policies, monitoring
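
A quarantine sketch for step 1; it only takes effect on CNIs that enforce NetworkPolicy, and the label and names are placeholders. With no ingress or egress rules listed, selected pods are cut off in both directions:

```yaml
# First label the suspect pod:
#   kubectl label pod <pod> -n <namespace> quarantine=true
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-deny-all
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      quarantine: "true"
  policyTypes:
    - Ingress
    - Egress
```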

---

## Diagnostic Commands Cheat Sheet

### Quick Health Check
```bash
# Overall cluster health
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running

# Component status (older clusters)
kubectl get componentstatuses

# Recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
```

### Pod Diagnostics
```bash
# Pod details
kubectl describe pod <pod> -n <namespace>
kubectl get pod <pod> -n <namespace> -o yaml

# Logs
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl logs <pod> -c <container> -n <namespace>

# Interactive debugging
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl debug <pod> -it --image=busybox -n <namespace>
```

### Node Diagnostics
```bash
# Node details
kubectl describe node <node>
kubectl get node <node> -o yaml

# Resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
```

### Service & Network Diagnostics
```bash
# Service details
kubectl describe svc <service> -n <namespace>
kubectl get endpoints <service> -n <namespace>

# Network policies
kubectl get networkpolicies --all-namespaces

# DNS testing
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot
# Then: nslookup <service>.<namespace>.svc.cluster.local
```

### Storage Diagnostics
```bash
# PVC and PV status
kubectl get pvc,pv --all-namespaces

# Storage class
kubectl get storageclass
kubectl describe storageclass <storage-class>

# Volume attachments
kubectl get volumeattachments
```

---

## Communication During Incidents

### Internal Communication
- Use dedicated incident channel
- Regular status updates (every 30 min)
- Clear roles (incident commander, scribe, experts)
- Document all actions taken

### External Communication
- Status page updates
- Customer notifications
- Clear expected resolution time
- Updates on progress

### Post-Incident Communication
- Incident report
- Root cause analysis
- Remediation steps taken
- Prevention measures

---

## Post-Incident Review Template

### Incident Summary
- Date and time
- Duration
- Severity
- Services affected
- User impact

### Timeline
- Detection time
- Response time
- Resolution time
- Key events during incident

### Root Cause
- What happened
- Why it happened
- Contributing factors

### Resolution
- What fixed the issue
- Who fixed it
- How long it took

### Lessons Learned
- What went well
- What could be improved
- Action items with owners

### Prevention
- Technical changes
- Process improvements
- Monitoring enhancements
- Documentation updates

---

## Best Practices

### Prevention
- Regular cluster audits
- Proactive monitoring and alerting
- Capacity planning
- Regular disaster recovery drills
- Automated backups
- Security scanning and policies

### Preparedness
- Document runbooks
- Practice incident response
- Keep contact lists updated
- Maintain up-to-date diagrams
- Pre-provision debugging tools

### Response
- Follow structured approach
- Document everything
- Communicate clearly
- Don't panic
- Think before acting
- Preserve evidence

### Recovery
- Verify fix thoroughly
- Monitor for recurrence
- Update documentation
- Conduct post-mortem
- Implement preventive measures

skills/references/performance_troubleshooting.md


# Kubernetes Performance Troubleshooting

Systematic approach to diagnosing and resolving Kubernetes performance issues.

## Table of Contents

1. [High Latency Issues](#high-latency-issues)
2. [CPU Performance](#cpu-performance)
3. [Memory Performance](#memory-performance)
4. [Network Performance](#network-performance)
5. [Storage I/O Performance](#storage-io-performance)
6. [Application-Level Metrics](#application-level-metrics)
7. [Cluster-Wide Performance](#cluster-wide-performance)

---

## High Latency Issues

### Symptoms
- Slow API response times
- Increased request latency
- Timeouts
- Degraded user experience

### Investigation Workflow

**1. Identify the layer with latency:**

```bash
# Check pod resource usage (and service mesh metrics if using Istio/Linkerd)
kubectl top pods -n <namespace>

# Check ingress controller metrics
kubectl logs -n ingress-nginx <ingress-controller-pod> | grep "request_time"

# Check application logs for slow requests
kubectl logs <pod-name> -n <namespace> | grep -i "slow\|timeout\|latency"
```

**2. Profile application performance:**

```bash
# Get pod metrics
kubectl top pod <pod-name> -n <namespace>

# Review configured CPU limits (throttling occurs when usage hits the limit)
kubectl get pod <pod-name> -n <namespace> -o json | \
  jq '.spec.containers[].resources'

# Exec into pod and check application-specific metrics
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Then: curl localhost:8080/metrics (if Prometheus metrics available)
```

**3. Check dependencies:**

```bash
# Test connectivity to downstream services
kubectl exec -it <pod-name> -n <namespace> -- \
  curl -w "@curl-format.txt" -o /dev/null -s http://backend-service

# curl-format.txt content:
#      time_namelookup: %{time_namelookup}\n
#         time_connect: %{time_connect}\n
#      time_appconnect: %{time_appconnect}\n
#     time_pretransfer: %{time_pretransfer}\n
#        time_redirect: %{time_redirect}\n
#   time_starttransfer: %{time_starttransfer}\n
#           time_total: %{time_total}\n
```
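
Since `curl` inside the pod reads the format file from its own filesystem, one way to stage it is shown below (paths are illustrative):

```bash
# Create the timing template locally
cat > curl-format.txt <<'EOF'
   time_namelookup: %{time_namelookup}\n
      time_connect: %{time_connect}\n
time_starttransfer: %{time_starttransfer}\n
        time_total: %{time_total}\n
EOF

# Copy it into the pod, then run the timed probe
kubectl cp curl-format.txt <namespace>/<pod-name>:/tmp/curl-format.txt
kubectl exec -it <pod-name> -n <namespace> -- \
  curl -w "@/tmp/curl-format.txt" -o /dev/null -s http://backend-service
```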

### Common Causes and Solutions

**CPU Throttling:**
```yaml
# Increase CPU limits or remove limits for bursty workloads
resources:
  requests:
    cpu: "500m"    # What the pod typically needs
  limits:
    cpu: "2000m"   # Burst capacity (or remove for unlimited)
```
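
To confirm throttling is actually happening (rather than inferring it from the limits), the kernel's CPU stats inside the container are authoritative; the cgroup path differs between v1 and v2:

```bash
# cgroup v2: rising nr_throttled / throttled_usec means the limit is biting
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu.stat

# cgroup v1 equivalent
kubectl exec <pod-name> -n <namespace> -- cat /sys/fs/cgroup/cpu/cpu.stat
```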

**Insufficient Replicas:**
```bash
# Scale up deployment
kubectl scale deployment <deployment-name> -n <namespace> --replicas=5

# Or enable HPA
kubectl autoscale deployment <deployment-name> -n <namespace> \
  --cpu-percent=70 \
  --min=2 \
  --max=10
```

**Slow Dependencies:**
```yaml
# Implement circuit breakers and timeouts in the application,
# or use service mesh policies (Istio example):
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: backend-circuit-breaker
spec:
  host: backend-service
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 100
    outlierDetection:
      consecutiveErrors: 5
      interval: 30s
      baseEjectionTime: 30s
```

---

## CPU Performance

### Symptoms
- High CPU usage
- Throttling
- Slow processing
- Queue buildup

### Investigation Commands

```bash
# Check CPU usage
kubectl top nodes
kubectl top pods -n <namespace>

# Review CPU requests/limits (throttling risk)
kubectl get pod <pod-name> -n <namespace> -o json | \
  jq '.spec.containers[].resources'

# Get detailed CPU metrics (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | jq

# Check container-level CPU from the node (Docker runtime; SSH to node)
ssh <node> "docker stats --no-stream"
```
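
On clusters whose runtime is containerd rather than Docker (the default on most current distributions), `crictl` gives the equivalent per-container view:

```bash
# Per-container CPU and memory from a containerd node (may require sudo)
ssh <node> "crictl stats"
```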

### Advanced CPU Profiling

**Enable CPU profiling in application:**

```bash
# For Go applications with pprof
kubectl port-forward <pod-name> 6060:6060 -n <namespace>

# Capture CPU profile
curl http://localhost:6060/debug/pprof/profile?seconds=30 > cpu.prof

# Analyze with pprof
go tool pprof -http=:8080 cpu.prof
```

**For Java applications:**

```bash
# Use async-profiler (assumes the profiler script is present in the image)
kubectl exec -it <pod-name> -n <namespace> -- \
  /profiler.sh -d 30 -f /tmp/flamegraph.html 1

# Copy flamegraph
kubectl cp <namespace>/<pod-name>:/tmp/flamegraph.html ./flamegraph.html
```

### Solutions

**Vertical Scaling:**
```yaml
resources:
  requests:
    cpu: "1000m"   # Increased from 500m
  limits:
    cpu: "2000m"   # Increased from 1000m
```

**Horizontal Scaling:**
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
```
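
After applying, it's worth watching the autoscaler converge before trusting it under load:

```bash
# Current target vs. actual utilization and replica count
kubectl get hpa app-hpa -n <namespace> --watch

# Scaling events and any metric-collection errors
kubectl describe hpa app-hpa -n <namespace>
```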

**Remove CPU Limits for Bursty Workloads:**
```yaml
# Allow bursting to available CPU
resources:
  requests:
    cpu: "500m"
  # No limits - can use all available CPU
```

---

## Memory Performance

### Symptoms
- OOMKilled pods
- Memory leaks
- Slow garbage collection
- Swap usage (if enabled)

### Investigation Commands

```bash
# Check memory usage
kubectl top nodes
kubectl top pods -n <namespace>

# Check memory limits and requests
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Limits\|Requests"

# Check OOM kills
kubectl get pods -n <namespace> -o json | \
  jq '.items[] | select(.status.containerStatuses[]?.lastState.terminated.reason == "OOMKilled") | .metadata.name'

# Detailed memory breakdown (requires metrics-server)
kubectl get --raw "/apis/metrics.k8s.io/v1beta1/namespaces/<namespace>/pods/<pod-name>" | \
  jq '.containers[] | {name, usage: .usage.memory}'
```
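
Node-level OOM kills also surface as events; a loose grep is a pragmatic filter, since the exact reason string varies by kubelet version:

```bash
# Recent OOM-related events in the namespace
kubectl get events -n <namespace> --sort-by=.lastTimestamp | grep -i oom
```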

### Memory Profiling

**Heap dump for Java:**
```bash
# Capture heap dump (process 1 inside the container)
kubectl exec <pod-name> -n <namespace> -- \
  jmap -dump:format=b,file=/tmp/heapdump.hprof 1

# Copy heap dump
kubectl cp <namespace>/<pod-name>:/tmp/heapdump.hprof ./heapdump.hprof

# Analyze with Eclipse MAT or VisualVM
```

**Memory profiling for Go:**
```bash
# Capture heap profile
kubectl port-forward <pod-name> 6060:6060 -n <namespace>
curl http://localhost:6060/debug/pprof/heap > heap.prof

# Analyze
go tool pprof -http=:8080 heap.prof
```

### Solutions

**Increase Memory Limits:**
```yaml
resources:
  requests:
    memory: "512Mi"
  limits:
    memory: "2Gi"  # Increased from 1Gi
```

**Optimize Application:**
- Fix memory leaks
- Implement connection pooling
- Optimize caching strategies
- Tune garbage collection (see the sketch below for JVM workloads)
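
For JVM services, a common garbage-collection fix is container-aware heap sizing via `MaxRAMPercentage`, so the heap tracks the pod's memory limit rather than the node's RAM; a sketch using an environment variable (deployment name is illustrative):

```bash
# Cap the JVM heap at 75% of the container memory limit
kubectl set env deployment/<deployment-name> -n <namespace> \
  JAVA_TOOL_OPTIONS="-XX:MaxRAMPercentage=75.0"
```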

**Use Memory-Optimized Node Pools:**
```yaml
# Node affinity for memory-intensive workloads
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: workload-type
          operator: In
          values:
          - memory-optimized
```

---

## Network Performance

### Symptoms
- High network latency
- Packet loss
- Connection timeouts
- Bandwidth saturation

### Investigation Commands

```bash
# Check pod network statistics
kubectl exec <pod-name> -n <namespace> -- netstat -s

# Test network performance between pods
# Deploy netperf
kubectl run netperf-client --image=networkstatic/netperf --rm -it -- /bin/bash

# From the client, run:
netperf -H <target-pod-ip> -t TCP_STREAM
netperf -H <target-pod-ip> -t TCP_RR   # Request-response latency

# Check DNS resolution time
kubectl exec <pod-name> -n <namespace> -- \
  time nslookup service-name.namespace.svc.cluster.local

# Check service mesh overhead (if using Istio)
kubectl exec <pod-name> -n <namespace> -c istio-proxy -- \
  curl -s localhost:15000/stats | grep "http.inbound\|http.outbound"
```

### Check Network Policies

```bash
# List network policies
kubectl get networkpolicies -n <namespace>

# Check if policy is blocking traffic
kubectl describe networkpolicy <policy-name> -n <namespace>

# Temporarily remove policies to test (in non-production)
kubectl delete networkpolicy <policy-name> -n <namespace>
```
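
Deleting a policy loses its definition; a less destructive test is to overlay a permissive allow-all policy for the namespace and see whether the timeouts disappear (name is illustrative):

```yaml
# Temporary allow-all policy for debugging; remove after the test
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: debug-allow-all
  namespace: <namespace>
spec:
  podSelector: {}   # All pods in the namespace
  ingress:
  - {}              # Allow all ingress
  egress:
  - {}              # Allow all egress
  policyTypes:
  - Ingress
  - Egress
```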

### Solutions

**DNS Optimization:**
```bash
# Use CoreDNS caching
# Increase CoreDNS replicas
kubectl scale deployment coredns -n kube-system --replicas=5

# Or use NodeLocal DNSCache
# https://kubernetes.io/docs/tasks/administer-cluster/nodelocaldns/
```
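
Another frequent DNS win is lowering `ndots`, so single-label external lookups don't walk every search domain first; a pod-level sketch:

```yaml
# Pod spec fragment: fewer search-path expansions per lookup
spec:
  dnsConfig:
    options:
    - name: ndots
      value: "2"
```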

**Optimize Service Mesh:**
```yaml
# Pod annotations: reduce Istio sidecar resources if over-provisioned
sidecar.istio.io/proxyCPU: "100m"
sidecar.istio.io/proxyMemory: "128Mi"

# Or disable injection for internal, trusted services
sidecar.istio.io/inject: "false"
```

**Use HostNetwork for Network-Intensive Pods:**
```yaml
# Use with caution - bypasses pod networking
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet
```

**Enable Bandwidth Limits (QoS):**
```yaml
# Requires a CNI with the bandwidth plugin enabled
metadata:
  annotations:
    kubernetes.io/ingress-bandwidth: "10M"
    kubernetes.io/egress-bandwidth: "10M"
```

---

## Storage I/O Performance

### Symptoms
- Slow read/write operations
- High I/O wait
- Application timeouts during disk operations
- Database performance issues

### Investigation Commands

```bash
# Check I/O metrics on the node
ssh <node> "iostat -x 1 10"

# Check disk usage
kubectl exec <pod-name> -n <namespace> -- df -h

# Check I/O wait from pod
kubectl exec <pod-name> -n <namespace> -- top

# Test storage performance (writes 1 GiB; ensure /data has space)
kubectl exec <pod-name> -n <namespace> -- \
  dd if=/dev/zero of=/data/test bs=1M count=1024 conv=fdatasync

# Check PV performance class
kubectl get pv <pv-name> -o yaml | grep storageClassName
kubectl describe storageclass <storage-class-name>
```

### Storage Benchmarking

**Deploy fio for benchmarking:**
```yaml
apiVersion: v1
kind: Pod
metadata:
  name: fio-benchmark
spec:
  containers:
  - name: fio
    image: ljishen/fio
    command: ["/bin/sh", "-c"]
    args:
    - |
      fio --name=seqread --rw=read --bs=1M --size=1G --runtime=60 --filename=/data/test
      fio --name=seqwrite --rw=write --bs=1M --size=1G --runtime=60 --filename=/data/test
      fio --name=randread --rw=randread --bs=4k --size=1G --runtime=60 --filename=/data/test
      fio --name=randwrite --rw=randwrite --bs=4k --size=1G --runtime=60 --filename=/data/test
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: test-pvc
```
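
Run it against a PVC from the storage class under test and read the results from the pod log (filename is illustrative):

```bash
# Assumes the manifest above is saved as fio-benchmark.yaml
kubectl apply -f fio-benchmark.yaml

# Stream results (give the pod a moment to schedule and start)
kubectl logs -f fio-benchmark

# Clean up when done
kubectl delete pod fio-benchmark
```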

### Solutions

**Use Higher Performance Storage Class:**
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: high-performance-pvc
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3   # or io2, premium-rwo (GKE), etc.
  resources:
    requests:
      storage: 100Gi
```

**Provision IOPS (AWS EBS io2):**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: io2-high-iops
provisioner: ebs.csi.aws.com
parameters:
  type: io2
  iops: "10000"
  fsType: ext4
volumeBindingMode: WaitForFirstConsumer
```

**Use Local NVMe for Ultra-Low Latency:**
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-nvme
  local:
    path: /mnt/disks/nvme0n1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - node-with-nvme
```

---

## Application-Level Metrics

### Expose Prometheus Metrics

**Add metrics endpoint to application:**
```yaml
apiVersion: v1
kind: Service
metadata:
  name: app-metrics
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"
spec:
  selector:
    app: myapp
  ports:
  - name: metrics
    port: 8080
    targetPort: 8080
```
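
Note that the `prometheus.io/*` annotations are only honored if the Prometheus scrape configuration looks for them; the Prometheus Operator uses ServiceMonitor objects instead. Either way, the endpoint itself is easy to verify:

```bash
# Confirm the application actually serves metrics
kubectl port-forward svc/app-metrics 8080:8080 -n <namespace>
curl -s http://localhost:8080/metrics | head
```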

### Key Metrics to Monitor

**Application metrics:**
- Request rate
- Request latency (p50, p95, p99)
- Error rate
- Active connections
- Queue depth
- Cache hit rate

**Example Prometheus queries:**
```promql
# P95 latency
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Request rate
sum(rate(http_requests_total[5m]))
```

### Distributed Tracing

**Implement OpenTelemetry:**
```yaml
# Deploy Jaeger (all-in-one is for development/testing, not production)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
      - name: jaeger
        image: jaegertracing/all-in-one:latest
        ports:
        - containerPort: 16686   # UI
        - containerPort: 14268   # Collector
```
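
Applications then export spans to the collector; with recent all-in-one images that accept OTLP (an assumption worth checking for your Jaeger version), a Service named `jaeger` exposing port 4317 plus one env var is enough for many OpenTelemetry SDKs:

```bash
# Point the app's OpenTelemetry SDK at the Jaeger OTLP endpoint
kubectl set env deployment/<app-deployment> -n <namespace> \
  OTEL_EXPORTER_OTLP_ENDPOINT="http://jaeger:4317"
```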

**Instrument application:**
- Add OpenTelemetry SDK to application
- Configure trace export to Jaeger
- Analyze end-to-end request traces to identify bottlenecks

---

## Cluster-Wide Performance

### Cluster Resource Utilization

```bash
# Overall cluster capacity
kubectl top nodes

# Total resources
kubectl describe nodes | grep -A 5 "Allocated resources"

# Resource requests vs limits
kubectl get pods --all-namespaces -o json | \
  jq -r '.items[] | "\(.metadata.namespace)/\(.metadata.name) \(.spec.containers[].resources)"'
```
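
Sorting makes hot spots pop out immediately:

```bash
# Busiest nodes and pods first
kubectl top nodes --sort-by=cpu
kubectl top pods --all-namespaces --sort-by=memory
```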

### Control Plane Performance

```bash
# Check API server latency
kubectl get --raw /metrics | grep apiserver_request_duration_seconds

# Check etcd performance
kubectl exec -it -n kube-system etcd-<node> -- \
  etcdctl --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  check perf

# Controller manager metrics
kubectl get --raw /metrics | grep workqueue_depth
```

### Scheduler Performance

```bash
# Check scheduler latency
kubectl get --raw /metrics | grep scheduler_scheduling_duration_seconds

# Check pending pods
kubectl get pods --all-namespaces --field-selector status.phase=Pending

# Scheduler logs
kubectl logs -n kube-system kube-scheduler-<node>
```

### Solutions for Cluster-Wide Issues

**Scale Control Plane:**
- Add more control plane nodes
- Increase API server replicas
- Tune etcd (increase memory, use SSD)

**Optimize Scheduling:**
- Use pod priority and preemption
- Implement pod topology spread constraints (see the sketch below)
- Use node affinity/anti-affinity appropriately
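
A minimal topology spread sketch that keeps replicas of one app balanced across nodes (label values are illustrative):

```yaml
# Pod spec fragment: spread replicas evenly across nodes
topologySpreadConstraints:
- maxSkew: 1
  topologyKey: kubernetes.io/hostname
  whenUnsatisfiable: ScheduleAnyway
  labelSelector:
    matchLabels:
      app: myapp
```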

**Resource Management:**
- Set appropriate resource requests and limits
- Use LimitRanges and ResourceQuotas
- Implement VerticalPodAutoscaler for right-sizing

---

## Performance Optimization Checklist

### Application Level
- [ ] Implement connection pooling
- [ ] Enable response caching
- [ ] Optimize database queries
- [ ] Use async/non-blocking I/O
- [ ] Implement circuit breakers
- [ ] Profile and optimize hot paths

### Kubernetes Level
- [ ] Set appropriate resource requests/limits
- [ ] Use HPA for auto-scaling
- [ ] Implement readiness/liveness probes correctly
- [ ] Use anti-affinity for high availability
- [ ] Optimize container image size
- [ ] Use multi-stage builds

### Infrastructure Level
- [ ] Use appropriate instance/node types
- [ ] Enable cluster autoscaling
- [ ] Use high-performance storage classes
- [ ] Optimize network topology
- [ ] Implement monitoring and alerting
- [ ] Run regular performance testing

---

## Monitoring Tools

**Essential tools:**
- **Prometheus + Grafana**: Metrics and dashboards
- **Jaeger/Zipkin**: Distributed tracing
- **kube-state-metrics**: Kubernetes object metrics
- **node-exporter**: Node-level metrics
- **cAdvisor**: Container metrics
- **kubectl-flame**: CPU profiling

**Commercial options:**
- Datadog
- New Relic
- Dynatrace
- Elastic APM