# Kubernetes Incident Response Playbook

This playbook provides structured procedures for responding to Kubernetes incidents.

## Incident Response Framework

### 1. Detection Phase
- Identify the incident (alerts, user reports, monitoring)
- Determine severity level
- Initiate incident response

### 2. Triage Phase
- Assess impact and scope
- Gather initial diagnostic data
- Determine if immediate action is needed

### 3. Investigation Phase
- Collect comprehensive diagnostics
- Identify root cause
- Document findings

### 4. Resolution Phase
- Apply remediation
- Verify fix
- Monitor for recurrence

### 5. Post-Incident Phase
- Document incident
- Conduct blameless post-mortem
- Implement preventive measures

---
## Severity Levels

### SEV-1: Critical
- Complete service outage
- Data loss or corruption
- Security breach
- Impact: All users affected
- Response: Immediate, all-hands

### SEV-2: High
- Major functionality degraded
- Significant performance impact
- Impact: Large subset of users
- Response: Within 15 minutes

### SEV-3: Medium
- Minor functionality impaired
- Workaround available
- Impact: Some users affected
- Response: Within 1 hour

### SEV-4: Low
- Cosmetic issues
- Negligible impact
- Impact: Minimal
- Response: During business hours

---
## Common Incident Scenarios
### Scenario 1: Complete Cluster Outage

**Symptoms:**
- All services unreachable
- kubectl commands timing out
- API server not responding

**Immediate Actions:**
1. Verify the scope (single cluster or multi-cluster)
2. Check API server status and logs
3. Check control plane nodes
4. Verify network connectivity to the control plane
5. Check etcd cluster health

**Investigation Steps:**
```bash
# Check control plane pods
kubectl get pods -n kube-system

# Check etcd (etcdctl typically also needs --cacert/--cert/--key flags)
kubectl exec -it etcd-<node> -n kube-system -- etcdctl endpoint health

# Check API server logs (systemd-managed control planes;
# on static-pod control planes, use crictl logs or /var/log/pods on the node)
journalctl -u kube-apiserver -n 100

# Check control plane node resources
ssh <control-plane-node> "top"
```

**Common Causes:**
- etcd cluster failure
- API server OOM/crash
- Control plane network partition
- Certificate expiration
- Cloud provider outage

**Resolution Paths:**
1. etcd issue: Restore from backup or rebuild cluster
2. API server issue: Restart API server pods/service
3. Network: Fix routing, security groups, or DNS
4. Certificates: Renew certificates (`kubeadm certs renew all`); see the sketch below
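
If expired control plane certificates are the cause, the following sketch shows one recovery path. It assumes a kubeadm-managed cluster (v1.20+, where the `kubeadm certs` subcommand is stable) with a static-pod control plane; paths and commands may differ on other distributions.

```bash
# See which certificates are expired
kubeadm certs check-expiration

# Renew all control plane certificates
kubeadm certs renew all

# Force kubelet to recreate the control plane static pods
# so they pick up the renewed certificates
mkdir -p /tmp/k8s-manifests
mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests/
sleep 20
mv /tmp/k8s-manifests/*.yaml /etc/kubernetes/manifests/
```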
---
### Scenario 2: Service Degradation

**Symptoms:**
- Increased latency or error rates
- Some requests failing
- Intermittent issues

**Immediate Actions:**
1. Check service metrics and logs
2. Verify pod health and count
3. Check for recent deployments
4. Review resource utilization

**Investigation Steps:**
```bash
# Check service endpoints
kubectl get endpoints <service> -n <namespace>

# Check pod status
kubectl get pods -l <service-selector> -n <namespace>

# Review recent changes
kubectl rollout history deployment/<name> -n <namespace>

# Check resource usage
kubectl top pods -n <namespace>

# Get recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```

**Common Causes:**
- Insufficient replicas
- Pod restarts/crashes
- Resource contention
- Bad deployment
- External dependency failure

**Resolution Paths:**
1. Scale up replicas if needed (see the sketch below)
2. Roll back the bad deployment
3. Increase resources if constrained
4. Fix configuration issues
5. Implement a circuit breaker for external dependencies
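
For the two most common fixes, rollback and scaling, a quick sketch using the same `<name>`/`<namespace>` placeholders as above:

```bash
# Roll back a bad deployment to the previous revision
kubectl rollout undo deployment/<name> -n <namespace>

# ...or to a specific revision from `kubectl rollout history`
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=2

# Watch the rollout until it completes
kubectl rollout status deployment/<name> -n <namespace>

# Scale up replicas if the service is under-provisioned
kubectl scale deployment/<name> -n <namespace> --replicas=5
```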
---
### Scenario 3: Node Failure

**Symptoms:**
- Node reported as NotReady
- Pods being evicted from node
- High node resource utilization

**Immediate Actions:**
1. Identify affected node
2. Check impact (which pods are running on the node)
3. Determine if pods need immediate migration
4. Assess if node is recoverable

**Investigation Steps:**
```bash
# Get node status
kubectl get nodes

# Describe the problem node
kubectl describe node <node-name>

# Check pods on the node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

# SSH to node and check
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
ssh <node> "crictl ps"   # containerd/CRI-O nodes; "docker ps" on Docker nodes
ssh <node> "df -h"
ssh <node> "free -m"
```

**Common Causes:**
- Kubelet failure
- Disk full
- Memory exhaustion
- Network issues
- Hardware failure

**Resolution Paths:**
1. Recoverable: Fix issue (clean disk, restart services)
2. Not recoverable: Cordon, drain, and replace node (see the sketch below)
3. For critical pods: Manually reschedule if necessary
4. Update monitoring and alerting based on findings
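
A minimal drain-and-replace sketch. Note that `--delete-emptydir-data` is the flag on recent kubectl versions (older releases call it `--delete-local-data`), and any data in emptyDir volumes is lost:

```bash
# Stop new pods from scheduling onto the node
kubectl cordon <node-name>

# Evict workloads; DaemonSet pods are skipped rather than evicted
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Once workloads have rescheduled, remove the node from the cluster
kubectl delete node <node-name>

# If the node was recovered instead, allow scheduling again
kubectl uncordon <node-name>
```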
---
### Scenario 4: Storage Issues

**Symptoms:**
- PVCs stuck in Pending
- Pods can't start due to volume issues
- Data access failures

**Immediate Actions:**
1. Identify affected PVCs/PVs
2. Check storage backend health
3. Verify provisioner status
4. Assess data integrity risk

**Investigation Steps:**
```bash
# Check PVC status
kubectl get pvc --all-namespaces

# Describe pending PVC
kubectl describe pvc <pvc-name> -n <namespace>

# Check PV status
kubectl get pv

# Check storage class
kubectl get storageclass

# Check provisioner
kubectl get pods -n <storage-namespace>

# Check volume attachments
kubectl get volumeattachments
```

**Common Causes:**
- Storage backend failure or full capacity
- Provisioner issues
- Network problems between nodes and the storage backend
- Volume attachment limits reached
- Corrupted volume

**Resolution Paths:**
1. Fix storage backend issues
2. Restart provisioner if needed
3. Manually provision a PV if dynamic provisioning failed (see the sketch below)
4. Delete and recreate if volume corrupted
5. Restore from backup if data lost
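
If the provisioner is down and a workload cannot wait, a PV can be created by hand and left for the pending PVC to bind. A minimal sketch, assuming an NFS backend; the volume name, capacity, server, and path are illustrative placeholders:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv-001
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: <storage-class>  # must match the pending PVC
  nfs:                               # illustrative backend; substitute your CSI driver stanza
    server: <nfs-server>
    path: /exports/data
EOF
```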
---
### Scenario 5: Security Incident

**Symptoms:**
- Unauthorized access detected
- Suspicious pod behavior
- Security alerts triggered
- Unusual network traffic

**Immediate Actions:**
1. Assess severity and scope
2. Isolate affected resources
3. Preserve evidence
4. Engage security team

**Investigation Steps:**
```bash
# Check recent RBAC changes
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json

# Audit pod security contexts
kubectl get pods --all-namespaces -o json | jq '.items[].spec.securityContext'

# Check for privileged pods
kubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.containers[].securityContext.privileged==true)'

# Review service accounts
kubectl get serviceaccounts --all-namespaces

# Search the audit logs (path depends on the API server's --audit-log-path flag)
grep <suspicious-activity> /var/log/kubernetes/audit/audit.log
```

**Common Causes:**
- Compromised credentials
- Vulnerable container image
- Misconfigured RBAC
- Exposed secrets
- Supply chain attack

**Resolution Paths:**
1. Isolate: Network policies, cordon nodes (see the quarantine sketch below)
2. Investigate: Audit logs, pod logs, network flows
3. Remediate: Rotate credentials, patch vulnerabilities
4. Restore: From known-good state if needed
5. Prevent: Enhanced security policies, monitoring
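
To isolate a suspect workload without destroying evidence, capture its state first, then apply a deny-all NetworkPolicy. A minimal sketch; the `app: <suspect-app>` label is an illustrative placeholder, and enforcement requires a CNI plugin that supports NetworkPolicy:

```bash
# Preserve evidence before changing anything
kubectl get pod <pod> -n <namespace> -o yaml > evidence-pod.yaml
kubectl logs <pod> -n <namespace> --all-containers > evidence-logs.txt

# Quarantine: deny all ingress and egress for the suspect workload
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      app: <suspect-app>  # label of the compromised workload
  policyTypes:
    - Ingress
    - Egress
EOF
```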
---
## Diagnostic Commands Cheat Sheet

### Quick Health Check
```bash
# Overall cluster health
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running

# Component status (deprecated since v1.19; older clusters only)
kubectl get componentstatuses

# Recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
```

### Pod Diagnostics
```bash
# Pod details
kubectl describe pod <pod> -n <namespace>
kubectl get pod <pod> -n <namespace> -o yaml

# Logs
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl logs <pod> -c <container> -n <namespace>

# Interactive debugging
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl debug <pod> -it --image=busybox -n <namespace>
```

### Node Diagnostics
```bash
# Node details
kubectl describe node <node>
kubectl get node <node> -o yaml

# Resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
```

### Service & Network Diagnostics
```bash
# Service details
kubectl describe svc <service> -n <namespace>
kubectl get endpoints <service> -n <namespace>

# Network policies
kubectl get networkpolicies --all-namespaces

# DNS testing
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot
# Then: nslookup <service>.<namespace>.svc.cluster.local
```

### Storage Diagnostics
```bash
# PVC and PV status
kubectl get pvc,pv --all-namespaces

# Storage class
kubectl get storageclass
kubectl describe storageclass <storage-class>

# Volume attachments
kubectl get volumeattachments
```

---
## Communication During Incidents

### Internal Communication
- Use dedicated incident channel
- Regular status updates (every 30 min)
- Clear roles (incident commander, scribe, experts)
- Document all actions taken

### External Communication
- Status page updates
- Customer notifications
- Clear expected resolution time
- Updates on progress

### Post-Incident Communication
- Incident report
- Root cause analysis
- Remediation steps taken
- Prevention measures

---
## Post-Incident Review Template

### Incident Summary
- Date and time
- Duration
- Severity
- Services affected
- User impact

### Timeline
- Detection time
- Response time
- Resolution time
- Key events during incident

### Root Cause
- What happened
- Why it happened
- Contributing factors

### Resolution
- What fixed the issue
- Who fixed it
- How long it took

### Lessons Learned
- What went well
- What could be improved
- Action items with owners

### Prevention
- Technical changes
- Process improvements
- Monitoring enhancements
- Documentation updates

---
## Best Practices

### Prevention
- Regular cluster audits
- Proactive monitoring and alerting
- Capacity planning
- Regular disaster recovery drills
- Automated backups (see the etcd snapshot sketch below)
- Security scanning and policies
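
Automated backups should at minimum cover etcd. A minimal snapshot sketch; the endpoint and certificate paths assume kubeadm defaults and may differ on your distribution:

```bash
# Take an etcd snapshot (run on a control plane node)
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is readable
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-$(date +%F).db
```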

### Preparedness
- Document runbooks
- Practice incident response
- Keep contact lists updated
- Maintain up-to-date diagrams
- Pre-provision debugging tools

### Response
- Follow structured approach
- Document everything
- Communicate clearly
- Don't panic
- Think before acting
- Preserve evidence

### Recovery
- Verify fix thoroughly
- Monitor for recurrence
- Update documentation
- Conduct post-mortem
- Implement preventive measures
|