# Kubernetes Incident Response Playbook

This playbook provides structured procedures for responding to Kubernetes incidents.

## Incident Response Framework

### 1. Detection Phase
- Identify the incident (alerts, user reports, monitoring)
- Determine severity level
- Initiate incident response

### 2. Triage Phase
- Assess impact and scope
- Gather initial diagnostic data
- Determine if immediate action is needed

### 3. Investigation Phase
- Collect comprehensive diagnostics
- Identify root cause
- Document findings

### 4. Resolution Phase
- Apply remediation
- Verify fix
- Monitor for recurrence

### 5. Post-Incident Phase
- Document incident
- Conduct blameless post-mortem
- Implement preventive measures

---
## Severity Levels

### SEV-1: Critical
- Complete service outage
- Data loss or corruption
- Security breach
- Impact: All users affected
- Response: Immediate, all-hands

### SEV-2: High
- Major functionality degraded
- Significant performance impact
- Impact: Large subset of users
- Response: Within 15 minutes

### SEV-3: Medium
- Minor functionality impaired
- Workaround available
- Impact: Some users affected
- Response: Within 1 hour

### SEV-4: Low
- Cosmetic issues
- Negligible impact
- Impact: Minimal
- Response: During business hours

---
## Common Incident Scenarios
### Scenario 1: Complete Cluster Outage

**Symptoms:**
- All services unreachable
- kubectl commands timing out
- API server not responding

**Immediate Actions:**
1. Verify the scope (single cluster or multi-cluster)
2. Check API server status and logs
3. Check control plane nodes
4. Verify network connectivity to the control plane
5. Check etcd cluster health

**Investigation Steps:**
```bash
# Check control plane pods
kubectl get pods -n kube-system

# Check etcd (etcdctl typically also needs --cacert/--cert/--key flags)
kubectl exec -it etcd-<node> -n kube-system -- etcdctl endpoint health

# Check API server logs (systemd-managed control planes;
# on static-pod control planes, use crictl logs or /var/log/pods on the node)
journalctl -u kube-apiserver -n 100

# Check control plane node resources
ssh <control-plane-node> "top"
```

**Common Causes:**
- etcd cluster failure
- API server OOM/crash
- Control plane network partition
- Certificate expiration
- Cloud provider outage

**Resolution Paths:**
1. etcd issue: Restore from backup or rebuild cluster
2. API server issue: Restart API server pods/service
3. Network: Fix routing, security groups, or DNS
4. Certificates: Renew certificates (`kubeadm certs renew all`); see the sketch below
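
If expired control plane certificates are the cause, the following sketch shows one recovery path. It assumes a kubeadm-managed cluster (v1.20+, where the `kubeadm certs` subcommand is stable) with a static-pod control plane; paths and commands may differ on other distributions.

```bash
# See which certificates are expired
kubeadm certs check-expiration

# Renew all control plane certificates
kubeadm certs renew all

# Force kubelet to recreate the control plane static pods
# so they pick up the renewed certificates
mkdir -p /tmp/k8s-manifests
mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests/
sleep 20
mv /tmp/k8s-manifests/*.yaml /etc/kubernetes/manifests/
```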
---
### Scenario 2: Service Degradation

**Symptoms:**
- Increased latency or error rates
- Some requests failing
- Intermittent issues

**Immediate Actions:**
1. Check service metrics and logs
2. Verify pod health and count
3. Check for recent deployments
4. Review resource utilization

**Investigation Steps:**
```bash
# Check service endpoints
kubectl get endpoints <service> -n <namespace>

# Check pod status
kubectl get pods -l <service-selector> -n <namespace>

# Review recent changes
kubectl rollout history deployment/<name> -n <namespace>

# Check resource usage
kubectl top pods -n <namespace>

# Get recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
```

**Common Causes:**
- Insufficient replicas
- Pod restarts/crashes
- Resource contention
- Bad deployment
- External dependency failure

**Resolution Paths:**
1. Scale up replicas if needed (see the sketch below)
2. Roll back the bad deployment
3. Increase resources if constrained
4. Fix configuration issues
5. Implement a circuit breaker for external dependencies
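
For the two most common fixes, rollback and scaling, a quick sketch using the same `<name>`/`<namespace>` placeholders as above:

```bash
# Roll back a bad deployment to the previous revision
kubectl rollout undo deployment/<name> -n <namespace>

# ...or to a specific revision from `kubectl rollout history`
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=2

# Watch the rollout until it completes
kubectl rollout status deployment/<name> -n <namespace>

# Scale up replicas if the service is under-provisioned
kubectl scale deployment/<name> -n <namespace> --replicas=5
```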
---
### Scenario 3: Node Failure

**Symptoms:**
- Node reported as NotReady
- Pods being evicted from node
- High node resource utilization

**Immediate Actions:**
1. Identify affected node
2. Check impact (which pods are running on the node)
3. Determine if pods need immediate migration
4. Assess if node is recoverable

**Investigation Steps:**
```bash
# Get node status
kubectl get nodes

# Describe the problem node
kubectl describe node <node-name>

# Check pods on the node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>

# SSH to node and check
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
ssh <node> "crictl ps"   # containerd/CRI-O nodes; "docker ps" on Docker nodes
ssh <node> "df -h"
ssh <node> "free -m"
```

**Common Causes:**
- Kubelet failure
- Disk full
- Memory exhaustion
- Network issues
- Hardware failure

**Resolution Paths:**
1. Recoverable: Fix issue (clean disk, restart services)
2. Not recoverable: Cordon, drain, and replace node (see the sketch below)
3. For critical pods: Manually reschedule if necessary
4. Update monitoring and alerting based on findings
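
A minimal drain-and-replace sketch. Note that `--delete-emptydir-data` is the flag on recent kubectl versions (older releases call it `--delete-local-data`), and any data in emptyDir volumes is lost:

```bash
# Stop new pods from scheduling onto the node
kubectl cordon <node-name>

# Evict workloads; DaemonSet pods are skipped rather than evicted
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# Once workloads have rescheduled, remove the node from the cluster
kubectl delete node <node-name>

# If the node was recovered instead, allow scheduling again
kubectl uncordon <node-name>
```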
---
### Scenario 4: Storage Issues

**Symptoms:**
- PVCs stuck in Pending
- Pods can't start due to volume issues
- Data access failures

**Immediate Actions:**
1. Identify affected PVCs/PVs
2. Check storage backend health
3. Verify provisioner status
4. Assess data integrity risk

**Investigation Steps:**
```bash
# Check PVC status
kubectl get pvc --all-namespaces

# Describe pending PVC
kubectl describe pvc <pvc-name> -n <namespace>

# Check PV status
kubectl get pv

# Check storage class
kubectl get storageclass

# Check provisioner
kubectl get pods -n <storage-namespace>

# Check volume attachments
kubectl get volumeattachments
```

**Common Causes:**
- Storage backend failure or full capacity
- Provisioner issues
- Network problems between nodes and the storage backend
- Volume attachment limits reached
- Corrupted volume

**Resolution Paths:**
1. Fix storage backend issues
2. Restart provisioner if needed
3. Manually provision a PV if dynamic provisioning failed (see the sketch below)
4. Delete and recreate if volume corrupted
5. Restore from backup if data lost
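
If the provisioner is down and a workload cannot wait, a PV can be created by hand and left for the pending PVC to bind. A minimal sketch, assuming an NFS backend; the volume name, capacity, server, and path are illustrative placeholders:

```bash
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: manual-pv-001
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  storageClassName: <storage-class>  # must match the pending PVC
  nfs:                               # illustrative backend; substitute your CSI driver stanza
    server: <nfs-server>
    path: /exports/data
EOF
```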
---
### Scenario 5: Security Incident

**Symptoms:**
- Unauthorized access detected
- Suspicious pod behavior
- Security alerts triggered
- Unusual network traffic

**Immediate Actions:**
1. Assess severity and scope
2. Isolate affected resources
3. Preserve evidence
4. Engage security team

**Investigation Steps:**
```bash
# Check recent RBAC changes
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json

# Audit pod security contexts
kubectl get pods --all-namespaces -o json | jq '.items[].spec.securityContext'

# Check for privileged pods
kubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.containers[].securityContext.privileged==true)'

# Review service accounts
kubectl get serviceaccounts --all-namespaces

# Search the audit logs (path depends on the API server's --audit-log-path flag)
grep <suspicious-activity> /var/log/kubernetes/audit/audit.log
```

**Common Causes:**
- Compromised credentials
- Vulnerable container image
- Misconfigured RBAC
- Exposed secrets
- Supply chain attack

**Resolution Paths:**
1. Isolate: Network policies, cordon nodes (see the quarantine sketch below)
2. Investigate: Audit logs, pod logs, network flows
3. Remediate: Rotate credentials, patch vulnerabilities
4. Restore: From known-good state if needed
5. Prevent: Enhanced security policies, monitoring
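
To isolate a suspect workload without destroying evidence, capture its state first, then apply a deny-all NetworkPolicy. A minimal sketch; the `app: <suspect-app>` label is an illustrative placeholder, and enforcement requires a CNI plugin that supports NetworkPolicy:

```bash
# Preserve evidence before changing anything
kubectl get pod <pod> -n <namespace> -o yaml > evidence-pod.yaml
kubectl logs <pod> -n <namespace> --all-containers > evidence-logs.txt

# Quarantine: deny all ingress and egress for the suspect workload
kubectl apply -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine
  namespace: <namespace>
spec:
  podSelector:
    matchLabels:
      app: <suspect-app>  # label of the compromised workload
  policyTypes:
    - Ingress
    - Egress
EOF
```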
---
## Diagnostic Commands Cheat Sheet

### Quick Health Check
```bash
# Overall cluster health
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running

# Component status (deprecated since v1.19; older clusters only)
kubectl get componentstatuses

# Recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
```

### Pod Diagnostics
```bash
# Pod details
kubectl describe pod <pod> -n <namespace>
kubectl get pod <pod> -n <namespace> -o yaml

# Logs
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl logs <pod> -c <container> -n <namespace>

# Interactive debugging
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl debug <pod> -it --image=busybox -n <namespace>
```

### Node Diagnostics
```bash
# Node details
kubectl describe node <node>
kubectl get node <node> -o yaml

# Resource usage
kubectl top nodes
kubectl top pods --all-namespaces

# Node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
```

### Service & Network Diagnostics
```bash
# Service details
kubectl describe svc <service> -n <namespace>
kubectl get endpoints <service> -n <namespace>

# Network policies
kubectl get networkpolicies --all-namespaces

# DNS testing
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot
# Then: nslookup <service>.<namespace>.svc.cluster.local
```

### Storage Diagnostics
```bash
# PVC and PV status
kubectl get pvc,pv --all-namespaces

# Storage class
kubectl get storageclass
kubectl describe storageclass <storage-class>

# Volume attachments
kubectl get volumeattachments
```

---
## Communication During Incidents

### Internal Communication
- Use dedicated incident channel
- Regular status updates (every 30 min)
- Clear roles (incident commander, scribe, experts)
- Document all actions taken

### External Communication
- Status page updates
- Customer notifications
- Clear expected resolution time
- Updates on progress

### Post-Incident Communication
- Incident report
- Root cause analysis
- Remediation steps taken
- Prevention measures

---
## Post-Incident Review Template

### Incident Summary
- Date and time
- Duration
- Severity
- Services affected
- User impact

### Timeline
- Detection time
- Response time
- Resolution time
- Key events during incident

### Root Cause
- What happened
- Why it happened
- Contributing factors

### Resolution
- What fixed the issue
- Who fixed it
- How long it took

### Lessons Learned
- What went well
- What could be improved
- Action items with owners

### Prevention
- Technical changes
- Process improvements
- Monitoring enhancements
- Documentation updates

---
## Best Practices

### Prevention
- Regular cluster audits
- Proactive monitoring and alerting
- Capacity planning
- Regular disaster recovery drills
- Automated backups (see the etcd snapshot sketch below)
- Security scanning and policies
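
Automated backups should at minimum cover etcd. A minimal snapshot sketch; the endpoint and certificate paths assume kubeadm defaults and may differ on your distribution:

```bash
# Take an etcd snapshot (run on a control plane node)
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot is readable
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-$(date +%F).db
```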

### Preparedness
- Document runbooks
- Practice incident response
- Keep contact lists updated
- Maintain up-to-date diagrams
- Pre-provision debugging tools

### Response
- Follow structured approach
- Document everything
- Communicate clearly
- Don't panic
- Think before acting
- Preserve evidence

### Recovery
- Verify fix thoroughly
- Monitor for recurrence
- Update documentation
- Conduct post-mortem
- Implement preventive measures
|