# Playbook: Service Down
## Symptoms
- Service not responding
- Health check failures
- 502 Bad Gateway or 503 Service Unavailable
- Users can't access application
- Monitoring alert: "Service down", "Health check failed"
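To confirm the outage from the host itself before diagnosing, a quick local check (assuming the service exposes a `/health` endpoint and listens on port 80; adjust for your stack):
```bash
# Confirm the outage locally (assumes a /health endpoint; adjust the URL)
curl -sS -o /dev/null -w "HTTP %{http_code}\n" --max-time 5 http://localhost/health
# Check whether anything is listening on the expected port (80 here)
ss -tlnp | grep ':80 ' || echo "nothing listening on :80"
```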
## Severity
- **SEV1** - Production service completely unavailable
## Diagnosis
### Step 1: Check Service Status
```bash
# Check if service is running (systemd)
systemctl status nginx
systemctl status application
systemctl status postgresql
# Check process
ps aux | grep nginx
pidof nginx
# Example output:
# nginx.service - nginx web server
# Active: inactive (dead) ← SERVICE IS DOWN
```
---
### Step 2: Check Why Service Stopped
**Check Service Logs** (systemd):
```bash
# Last 50 lines of service logs
journalctl -u nginx -n 50
# Tail logs in real-time
journalctl -u nginx -f
# Look for:
# - Exit code (0 = normal, non-zero = error)
# - Error messages
# - Crash reason
```
**Check Application Logs**:
```bash
# Check application error log
tail -100 /var/log/application/error.log
# Look for:
# - Exception/error before crash
# - Stack trace
# - "Fatal error", "Segmentation fault"
```
**Check System Logs**:
```bash
# Check for OOM (Out of Memory) killer
dmesg | grep -i "out of memory\|oom\|killed process"
# Example:
# Out of memory: Killed process 1234 (node) total-vm:8388608kB, ...
# ↑ OOM Killer terminated the application
# Check kernel errors
dmesg | tail -50
# Check syslog (Debian/Ubuntu; use /var/log/messages on RHEL)
grep -iE "error|segfault" /var/log/syslog
```
---
### Step 3: Identify Root Cause
**Common causes**:
| Symptom | Root Cause |
|---------|------------|
| "Out of memory" in dmesg | OOM Killer (memory leak, insufficient memory) |
| "Segmentation fault" | Application bug (crash) |
| "Address already in use" | Port already bound |
| "Connection refused" to database | Database down |
| "No such file or directory" | Missing config file |
| "Permission denied" | Wrong file permissions |
| Exit code 137 | Killed by SIGKILL (128 + 9), usually the OOM Killer |
| Exit code 139 | Segmentation fault (128 + 11, SIGSEGV) |
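To map a crash onto this table, check how systemd recorded the process exit (using `application` as a placeholder unit name):
```bash
# The Main PID line shows how the process ended, e.g.:
#   Main PID: 1234 (code=exited, status=1/FAILURE)
#   Main PID: 1234 (code=killed, signal=KILL)   ← often the OOM Killer
systemctl status application --no-pager | grep -E "Main PID|Active"
# systemd also logs the exit reason to the journal
journalctl -u application -n 200 --no-pager | grep -E "code=exited|code=killed"
```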
---
## Mitigation
### Immediate (Now - 5 min)
**Option A: Restart Service**
```bash
# Restart service
systemctl restart nginx
# Check if started successfully
systemctl status nginx
# Test endpoint
curl http://localhost
# Impact: Service restored
# Risk: Low (if root cause not addressed, may crash again)
```
**Option B: Fix Configuration Error** (if config issue)
```bash
# Test configuration before restarting (tools vary by service)
nginx -t                                   # nginx
apachectl configtest                       # Apache httpd
haproxy -c -f /etc/haproxy/haproxy.cfg     # HAProxy
# (not every service has a config-check command)
# If config error, check recent changes (assumes config is version-controlled, e.g. via etckeeper)
git diff HEAD~1 /etc/nginx/nginx.conf
# Revert to the last working config
git checkout HEAD~1 -- /etc/nginx/nginx.conf
# Restart
systemctl restart nginx
```
**Option C: Free Up Resources** (if OOM)
```bash
# Check memory usage
free -h
# Identify memory-heavy processes
ps aux --sort=-%mem | head -10
# Kill non-critical memory hogs (prefer plain kill/SIGTERM; -9 only as a last resort)
kill <PID>
# Free page cache (requires root; harmless, but drops warm caches)
sync && echo 3 > /proc/sys/vm/drop_caches
# Restart service
systemctl restart application
```
**Option D: Change Port** (if port conflict)
```bash
# Check what's using port
lsof -i :80
# Example:
# apache2 1234 root 4u IPv4 12345 0t0 TCP *:80 (LISTEN)
# ↑ Apache using port 80
# Stop conflicting service
systemctl stop apache2
# Start intended service
systemctl start nginx
```
---
### Short-term (5 min - 1 hour)
**Option A: Fix Crash Bug** (if application bug)
```bash
# Check stack trace in logs
tail -100 /var/log/application/error.log
# Identify the line causing the crash
# Example: TypeError: Cannot read properties of undefined at PaymentService.js:42
# Deploy hotfix OR revert to previous version
git checkout <previous-working-commit>
npm run build && pm2 restart all
# Impact: Bug fixed, service stable
# Risk: Medium (need proper testing)
```
**Option B: Increase Memory** (if OOM)
```bash
# Short-term: add a 2 GB swap file
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Long-term: Resize instance
# AWS: Change instance type (t3.medium → t3.large)
# Azure: Resize VM
# Impact: More memory available
# Risk: Medium (swap is slow, instance resize has downtime)
```
**Option C: Enable Auto-Restart** (systemd)
```bash
# Edit the unit file: /etc/systemd/system/application.service
# (systemd treats trailing text as part of the value, so keep comments on their own lines)

[Unit]
# Allow at most 5 restart attempts within 60 seconds
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
# Auto-restart on any failure, waiting 10s between attempts
Restart=always
RestartSec=10

# Then reload systemd
systemctl daemon-reload
# Impact: Service auto-restarts on crash
# Risk: Low (but doesn't fix root cause)
```
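To confirm the directives took effect after the reload (same placeholder unit name):
```bash
# Lint the unit file for syntax problems
systemd-analyze verify /etc/systemd/system/application.service
# Confirm the restart policy systemd actually loaded
systemctl show -p Restart,RestartUSec application
# Expected: Restart=always, RestartUSec=10s
```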
**Option D: Route Traffic to Backup** (if multi-instance)
```bash
# If using load balancer:
# 1. Remove failed instance from LB
# 2. Traffic goes to healthy instances
# AWS:
aws elbv2 deregister-targets \
  --target-group-arn <arn> \
  --targets Id=i-1234567890abcdef0
# Impact: Users see working instance
# Risk: Low (other instances handle load)
```
---
### Long-term (1 hour+)
- [ ] Fix root cause (memory leak, bug, etc.)
- [ ] Add health check monitoring (see the sketch after this list)
- [ ] Enable auto-restart (systemd)
- [ ] Set up redundancy (multiple instances)
- [ ] Add load balancer (distribute traffic)
- [ ] Increase memory/CPU (if resource issue)
- [ ] Add alerting (service down, health check fail)
- [ ] Add E2E test (smoke test after deploy)
- [ ] Review deployment process (how did bug reach prod?)
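As a stopgap until proper monitoring exists, a cron-driven probe can restart the service and raise an alert. A minimal sketch, assuming a `/health` endpoint, the placeholder unit name `application`, and a hypothetical `WEBHOOK_URL` for notifications:
```bash
#!/usr/bin/env bash
# health-probe.sh - run from cron, e.g.: * * * * * /usr/local/bin/health-probe.sh
# Assumptions: unit 'application', a /health endpoint, WEBHOOK_URL set to your alerting hook
set -u
WEBHOOK_URL="${WEBHOOK_URL:-}"  # hypothetical alert webhook; set in the environment

if ! curl -sf --max-time 5 http://localhost/health > /dev/null; then
  systemctl restart application
  if [ -n "$WEBHOOK_URL" ]; then
    curl -sf -X POST -H 'Content-Type: application/json' \
      -d "{\"text\":\"health check failed on $(hostname); restarted application\"}" \
      "$WEBHOOK_URL"
  fi
fi
```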
---
## Root Cause Analysis
**For each incident, determine**:
1. **What failed?** (nginx, application, database)
2. **Why did it fail?** (OOM, bug, config error)
3. **What triggered it?** (deploy, traffic spike, external event)
4. **How to prevent?** (fix bug, add monitoring, increase capacity)
---
## Escalation
**Escalate to developer if**:
- Application crash due to bug
- Need code fix
**Escalate to platform team if**:
- Platform/framework issue
- Infrastructure problem
**Escalate to on-call manager if**:
- Can't restore service in 30 min
- Need additional resources
---
## Prevention Checklist
- [ ] Health check monitoring (alert on failure)
- [ ] Auto-restart (systemd Restart=always)
- [ ] Redundancy (multiple instances behind LB)
- [ ] Resource monitoring (CPU, memory alerts)
- [ ] Graceful degradation (circuit breakers, fallbacks)
- [ ] Smoke tests after deploy (see the sketch after this list)
- [ ] Rollback plan (blue-green, canary)
- [ ] Chaos engineering (test failure scenarios)
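For the smoke-test item above, even a couple of curl assertions run right after deploy will catch a dead service before users do. A sketch against illustrative endpoints (replace with your real routes):
```bash
#!/usr/bin/env bash
# smoke-test.sh - run immediately after each deploy; exits non-zero on failure
set -e
BASE_URL="${BASE_URL:-http://localhost}"  # adjust for your environment

# Fail if any critical endpoint doesn't return 2xx
curl -sf --max-time 5 "$BASE_URL/health" > /dev/null
curl -sf --max-time 5 "$BASE_URL/"       > /dev/null

echo "smoke test passed"
```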
---
## Related Runbooks
- [03-memory-leak.md](03-memory-leak.md) - If OOM caused crash
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure troubleshooting
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Application diagnostics
---
## Post-Incident
After resolving:
- [ ] Create post-mortem (MANDATORY for SEV1)
- [ ] Timeline with all events
- [ ] Root cause analysis
- [ ] Action items (prevent recurrence)
- [ ] Update runbook if needed
- [ ] Share learnings with team
---
## Useful Commands Reference
```bash
# Service status
systemctl status <service>
systemctl restart <service>
journalctl -u <service> -n 50
# Process check
ps aux | grep <process>
pidof <process>
# Check OOM
dmesg | grep -i "out of memory\|oom"
# Check port usage
lsof -i :<port>
ss -tlnp | grep ":<port>"   # or netstat -tlnp on older systems
# Test config
nginx -t
apachectl configtest
# Health check
curl http://localhost/health
```