Kubernetes Incident Response Playbook
This playbook provides structured procedures for responding to Kubernetes incidents.
Incident Response Framework
1. Detection Phase
- Identify the incident (alerts, user reports, monitoring)
- Determine severity level
- Initiate incident response
2. Triage Phase
- Assess impact and scope
- Gather initial diagnostic data
- Determine whether immediate action is needed
3. Investigation Phase
- Collect comprehensive diagnostics
- Identify root cause
- Document findings
4. Resolution Phase
- Apply remediation
- Verify fix
- Monitor for recurrence
5. Post-Incident Phase
- Document incident
- Conduct blameless post-mortem
- Implement preventive measures
Severity Levels
SEV-1: Critical
- Complete service outage
- Data loss or corruption
- Security breach
- Impact: All users affected
- Response: Immediate, all-hands
SEV-2: High
- Major functionality degraded
- Significant performance impact
- Impact: Large subset of users
- Response: Within 15 minutes
SEV-3: Medium
- Minor functionality impaired
- Workaround available
- Impact: Some users affected
- Response: Within 1 hour
SEV-4: Low
- Cosmetic issues
- Negligible impact
- Impact: Minimal
- Response: During business hours
Common Incident Scenarios
Scenario 1: Complete Cluster Outage
Symptoms:
- All services unreachable
- kubectl commands timing out
- API server not responding
Immediate Actions:
- Verify the scope (single cluster or multi-cluster)
- Check API server status and logs
- Check control plane nodes
- Verify network connectivity to control plane
- Check etcd cluster health
Investigation Steps:
# Check control plane pods
kubectl get pods -n kube-system
# Check etcd health (certificate flags and paths below assume kubeadm defaults)
kubectl exec -it etcd-<node> -n kube-system -- etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key \
  endpoint health
# Check API server logs (journalctl applies to systemd-managed installs;
# on kubeadm static-pod control planes use crictl logs on the node instead)
journalctl -u kube-apiserver -n 100
# Check control plane node resources (batch mode so top works over ssh)
ssh <control-plane-node> "top -b -n 1 | head -20"
Common Causes:
- etcd cluster failure
- API server OOM/crash
- Control plane network partition
- Certificate expiration
- Cloud provider outage
Resolution Paths:
- etcd issue: Restore from backup or rebuild cluster
- API server issue: Restart API server pods/service
- Network: Fix routing, security groups, or DNS
- Certificates: Renew certificates (kubeadm certs renew all; see the sketch below)
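If certificate expiration is the cause, a minimal check-and-renew sketch for a kubeadm-managed control plane (subcommands assume kubeadm v1.20+; older releases use kubeadm alpha certs):
# Check which certificates have expired or are close to expiry
kubeadm certs check-expiration
# Renew all kubeadm-managed certificates
kubeadm certs renew all
# Restart the static control plane pods so they pick up the renewed certs
# (moving the manifests out and back makes kubelet recreate the pods)
mkdir -p /tmp/k8s-manifests-backup
mv /etc/kubernetes/manifests/*.yaml /tmp/k8s-manifests-backup/
sleep 30
mv /tmp/k8s-manifests-backup/*.yaml /etc/kubernetes/manifests/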
Scenario 2: Service Degradation
Symptoms:
- Increased latency or error rates
- Some requests failing
- Intermittent issues
Immediate Actions:
- Check service metrics and logs
- Verify pod health and count
- Check for recent deployments
- Review resource utilization
Investigation Steps:
# Check service endpoints
kubectl get endpoints <service> -n <namespace>
# Check pod status
kubectl get pods -l <service-selector> -n <namespace>
# Review recent changes
kubectl rollout history deployment/<name> -n <namespace>
# Check resource usage
kubectl top pods -n <namespace>
# Get recent events
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
Common Causes:
- Insufficient replicas
- Pod restarts/crashes
- Resource contention
- Bad deployment
- External dependency failure
Resolution Paths:
- Scale up replicas if needed
- Roll back the bad deployment (see the command sketch after this list)
- Increase resources if constrained
- Fix configuration issues
- Implement circuit breaker for external deps
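For the scale-up and rollback paths, a command sketch (deployment name, namespace, revision, and replica count are placeholders to adapt):
# Roll back to the previous revision of a bad deployment
kubectl rollout undo deployment/<name> -n <namespace>
# ...or to a specific revision from the rollout history
kubectl rollout undo deployment/<name> -n <namespace> --to-revision=2
# Scale out if the service is simply under-provisioned
kubectl scale deployment/<name> -n <namespace> --replicas=5
# Watch the rollout until it completes
kubectl rollout status deployment/<name> -n <namespace>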
Scenario 3: Node Failure
Symptoms:
- Node reported as NotReady
- Pods being evicted from node
- High node resource utilization
Immediate Actions:
- Identify affected node
- Check impact (which pods are running on the node)
- Determine if pods need immediate migration
- Assess if node is recoverable
Investigation Steps:
# Get node status
kubectl get nodes
# Describe the problem node
kubectl describe node <node-name>
# Check pods on the node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
# SSH to node and check
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
ssh <node> "docker ps" # or containerd
ssh <node> "df -h"
ssh <node> "free -m"
Common Causes:
- Kubelet failure
- Disk full
- Memory exhaustion
- Network issues
- Hardware failure
Resolution Paths:
- Recoverable: Fix issue (clean disk, restart services)
- Not recoverable: Cordon, drain, and replace the node (sketched below)
- For critical pods: Manually reschedule if necessary
- Update monitoring and alerting based on findings
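For an unrecoverable node, the standard cordon/drain/replace sequence looks like this (older kubectl releases name the last drain flag --delete-local-data):
# Stop new pods from being scheduled on the node
kubectl cordon <node-name>
# Evict existing pods, ignoring DaemonSets and discarding emptyDir data
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
# Remove the node object once workloads have rescheduled and a
# replacement node (provisioned via your infrastructure tooling) has joined
kubectl delete node <node-name>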
Scenario 4: Storage Issues
Symptoms:
- PVCs stuck in Pending
- Pods can't start due to volume issues
- Data access failures
Immediate Actions:
- Identify affected PVCs/PVs
- Check storage backend health
- Verify provisioner status
- Assess data integrity risk
Investigation Steps:
# Check PVC status
kubectl get pvc --all-namespaces
# Describe pending PVC
kubectl describe pvc <pvc-name> -n <namespace>
# Check PV status
kubectl get pv
# Check storage class
kubectl get storageclass
# Check provisioner
kubectl get pods -n <storage-namespace>
# Check volume attachments
kubectl get volumeattachments
Common Causes:
- Storage backend failure/full
- Provisioner issues
- Network problems reaching the storage backend
- Volume attachment limits reached
- Corrupted volume
Resolution Paths:
- Fix storage backend issues
- Restart the provisioner if needed (see the sketch after this list)
- Manually provision PV if dynamic provisioning failed
- Delete and recreate if volume corrupted
- Restore from backup if data lost
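If data on a bound volume is at risk, a hedged sketch for protecting it and nudging dynamic provisioning (the PV name, provisioner deployment, and namespaces are placeholders for whatever CSI driver the cluster runs):
# Keep the backing volume even if its PVC gets deleted during recovery
kubectl patch pv <pv-name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
# Restart the CSI/provisioner deployment if dynamic provisioning is stuck
kubectl rollout restart deployment/<provisioner-deployment> -n <storage-namespace>
# Re-check whether pending PVCs now bind
kubectl get pvc -n <namespace> -w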
Scenario 5: Security Incident
Symptoms:
- Unauthorized access detected
- Suspicious pod behavior
- Security alerts triggered
- Unusual network traffic
Immediate Actions:
- Assess severity and scope
- Isolate affected resources
- Preserve evidence
- Engage security team
Investigation Steps:
# Check recent RBAC changes
kubectl get rolebindings,clusterrolebindings --all-namespaces -o json
# Audit pod security contexts
kubectl get pods --all-namespaces -o json | jq '.items[].spec.securityContext'
# Check for privileged pods
kubectl get pods --all-namespaces -o json | jq '.items[] | select(.spec.containers[].securityContext.privileged==true)'
# Review service accounts
kubectl get serviceaccounts --all-namespaces
# Review audit logs (the path depends on the API server's audit configuration)
grep <suspicious-activity> /var/log/kubernetes/audit/audit.log
Common Causes:
- Compromised credentials
- Vulnerable container image
- Misconfigured RBAC
- Exposed secrets
- Supply chain attack
Resolution Paths:
- Isolate: Network policies (see the sketch after this list), cordon nodes
- Investigate: Audit logs, pod logs, network flows
- Remediate: Rotate credentials, patch vulnerabilities
- Restore: From known-good state if needed
- Prevent: Enhanced security policies, monitoring
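One way to isolate a suspect workload while preserving it for forensics is a deny-all NetworkPolicy scoped to its labels; a minimal sketch, assuming the pods carry an app=<suspect-app> label and the cluster's CNI enforces NetworkPolicy:
# Quarantine the suspect pods without deleting them
kubectl apply -n <namespace> -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-suspect
spec:
  podSelector:
    matchLabels:
      app: <suspect-app>
  policyTypes:
  - Ingress
  - Egress
EOF
Because the policy lists both policyTypes but no ingress or egress rules, all traffic to and from the selected pods is dropped while the pods keep running for evidence collection.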
Diagnostic Commands Cheat Sheet
Quick Health Check
# Overall cluster health
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
# Component status (older clusters)
kubectl get componentstatuses
# Recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
Pod Diagnostics
# Pod details
kubectl describe pod <pod> -n <namespace>
kubectl get pod <pod> -n <namespace> -o yaml
# Logs
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl logs <pod> -c <container> -n <namespace>
# Interactive debugging
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl debug <pod> -it --image=busybox -n <namespace>
Node Diagnostics
# Node details
kubectl describe node <node>
kubectl get node <node> -o yaml
# Resource usage
kubectl top nodes
kubectl top pods --all-namespaces
# Node conditions
kubectl get nodes -o json | jq '.items[].status.conditions'
Service & Network Diagnostics
# Service details
kubectl describe svc <service> -n <namespace>
kubectl get endpoints <service> -n <namespace>
# Network policies
kubectl get networkpolicies --all-namespaces
# DNS testing
kubectl run tmp-shell --rm -i --tty --image nicolaka/netshoot
# Then: nslookup <service>.<namespace>.svc.cluster.local
Storage Diagnostics
# PVC and PV status
kubectl get pvc,pv --all-namespaces
# Storage class
kubectl get storageclass
kubectl describe storageclass <storage-class>
# Volume attachments
kubectl get volumeattachments
Communication During Incidents
Internal Communication
- Use dedicated incident channel
- Regular status updates (every 30 min)
- Clear roles (incident commander, scribe, experts)
- Document all actions taken
External Communication
- Status page updates
- Customer notifications
- Clear expected resolution time
- Updates on progress
Post-Incident Communication
- Incident report
- Root cause analysis
- Remediation steps taken
- Prevention measures
Post-Incident Review Template
Incident Summary
- Date and time
- Duration
- Severity
- Services affected
- User impact
Timeline
- Detection time
- Response time
- Resolution time
- Key events during incident
Root Cause
- What happened
- Why it happened
- Contributing factors
Resolution
- What fixed the issue
- Who fixed it
- How long it took
Lessons Learned
- What went well
- What could be improved
- Action items with owners
Prevention
- Technical changes
- Process improvements
- Monitoring enhancements
- Documentation updates
Best Practices
Prevention
- Regular cluster audits
- Proactive monitoring and alerting
- Capacity planning
- Regular disaster recovery drills
- Automated backups (see the etcd snapshot sketch after this list)
- Security scanning and policies
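For the automated-backup item above, a minimal etcd snapshot sketch taken on a control plane node (certificate paths assume kubeadm defaults; adjust for your install):
# Take an etcd snapshot
ETCDCTL_API=3 etcdctl snapshot save /var/backups/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-$(date +%F).db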
Preparedness
- Document runbooks
- Practice incident response
- Keep contact lists updated
- Maintain up-to-date diagrams
- Pre-provision debugging tools
Response
- Follow structured approach
- Document everything
- Communicate clearly
- Don't panic
- Think before acting
- Preserve evidence
Recovery
- Verify fix thoroughly
- Monitor for recurrence
- Update documentation
- Conduct post-mortem
- Implement preventive measures