334 lines
7.0 KiB
Markdown
334 lines
7.0 KiB
Markdown
# Playbook: Service Down
|
|
|
|
## Symptoms
|
|
|
|
- Service not responding
|
|
- Health check failures
|
|
- 502 Bad Gateway or 503 Service Unavailable
|
|
- Users can't access application
|
|
- Monitoring alert: "Service down", "Health check failed"
|
|
|
|
## Severity
|
|
|
|
- **SEV1** - Production service completely unavailable
|
|
|
|
## Diagnosis
|
|
|
|
### Step 1: Check Service Status
|
|
|
|
```bash
|
|
# Check if service is running (systemd)
|
|
systemctl status nginx
|
|
systemctl status application
|
|
systemctl status postgresql
|
|
|
|
# Check process
|
|
ps aux | grep nginx
|
|
pidof nginx
|
|
|
|
# Example output:
|
|
# nginx.service - nginx web server
|
|
# Active: inactive (dead) ← SERVICE IS DOWN
|
|
```
|
|
|
|
---
|
|
|
|
### Step 2: Check Why Service Stopped
|
|
|
|
**Check Service Logs** (systemd):
|
|
```bash
|
|
# Last 50 lines of service logs
|
|
journalctl -u nginx -n 50
|
|
|
|
# Tail logs in real-time
|
|
journalctl -u nginx -f
|
|
|
|
# Look for:
|
|
# - Exit code (0 = normal, non-zero = error)
|
|
# - Error messages
|
|
# - Crash reason
|
|
```
|
|
|
|
**Check Application Logs**:
|
|
```bash
|
|
# Check application error log
|
|
tail -100 /var/log/application/error.log
|
|
|
|
# Look for:
|
|
# - Exception/error before crash
|
|
# - Stack trace
|
|
# - "Fatal error", "Segmentation fault"
|
|
```
|
|
|
|
**Check System Logs**:
|
|
```bash
|
|
# Check for OOM (Out of Memory) killer
|
|
dmesg | grep -i "out of memory\|oom\|killed process"
|
|
|
|
# Example:
|
|
# Out of memory: Killed process 1234 (node) total-vm:8GB
|
|
# ↑ OOM Killer terminated application
|
|
|
|
# Check kernel errors
|
|
dmesg | tail -50
|
|
|
|
# Check syslog
|
|
grep "error\|segfault" /var/log/syslog
|
|
```
|
|
|
|
---
|
|
|
|
### Step 3: Identify Root Cause
|
|
|
|
**Common causes**:
|
|
|
|
| Symptom | Root Cause |
|
|
|---------|------------|
|
|
| "Out of memory" in dmesg | OOM Killer (memory leak, insufficient memory) |
|
|
| "Segmentation fault" | Application bug (crash) |
|
|
| "Address already in use" | Port already bound |
|
|
| "Connection refused" to database | Database down |
|
|
| "No such file or directory" | Missing config file |
|
|
| "Permission denied" | Wrong file permissions |
|
|
| Exit code 137 | Killed by OOM Killer |
|
|
| Exit code 139 | Segmentation fault |
|
|
|
|
---
|
|
|
|
## Mitigation
|
|
|
|
### Immediate (Now - 5 min)
|
|
|
|
**Option A: Restart Service**
|
|
```bash
|
|
# Restart service
|
|
systemctl restart nginx
|
|
|
|
# Check if started successfully
|
|
systemctl status nginx
|
|
|
|
# Test endpoint
|
|
curl http://localhost
|
|
|
|
# Impact: Service restored
|
|
# Risk: Low (if root cause not addressed, may crash again)
|
|
```
|
|
|
|
**Option B: Fix Configuration Error** (if config issue)
|
|
```bash
|
|
# Test configuration
|
|
nginx -t # nginx
|
|
postgresql --help # postgres
|
|
|
|
# If config error, check recent changes
|
|
git diff HEAD~1 /etc/nginx/nginx.conf
|
|
|
|
# Revert to working config
|
|
git checkout HEAD~1 /etc/nginx/nginx.conf
|
|
|
|
# Restart
|
|
systemctl restart nginx
|
|
```
|
|
|
|
**Option C: Free Up Resources** (if OOM)
|
|
```bash
|
|
# Check memory usage
|
|
free -h
|
|
|
|
# Kill memory-heavy processes (non-critical)
|
|
kill -9 <PID>
|
|
|
|
# Free page cache
|
|
sync && echo 3 > /proc/sys/vm/drop_caches
|
|
|
|
# Restart service
|
|
systemctl restart application
|
|
```
|
|
|
|
**Option D: Change Port** (if port conflict)
|
|
```bash
|
|
# Check what's using port
|
|
lsof -i :80
|
|
|
|
# Example:
|
|
# apache2 1234 root 4u IPv4 12345 0t0 TCP *:80 (LISTEN)
|
|
# ↑ Apache using port 80
|
|
|
|
# Stop conflicting service
|
|
systemctl stop apache2
|
|
|
|
# Start intended service
|
|
systemctl start nginx
|
|
```
|
|
|
|
---
|
|
|
|
### Short-term (5 min - 1 hour)
|
|
|
|
**Option A: Fix Crash Bug** (if application bug)
|
|
```bash
|
|
# Check stack trace in logs
|
|
tail -100 /var/log/application/error.log
|
|
|
|
# Identify line causing crash
|
|
# Example: NullPointerException at PaymentService.java:42
|
|
|
|
# Deploy hotfix OR revert to previous version
|
|
git checkout <previous-working-commit>
|
|
npm run build && pm2 restart all
|
|
|
|
# Impact: Bug fixed, service stable
|
|
# Risk: Medium (need proper testing)
|
|
```
|
|
|
|
**Option B: Increase Memory** (if OOM)
|
|
```bash
|
|
# Short-term: Increase swap
|
|
dd if=/dev/zero of=/swapfile bs=1M count=2048
|
|
mkswap /swapfile
|
|
swapon /swapfile
|
|
|
|
# Long-term: Resize instance
|
|
# AWS: Change instance type (t3.medium → t3.large)
|
|
# Azure: Resize VM
|
|
|
|
# Impact: More memory available
|
|
# Risk: Medium (swap is slow, instance resize has downtime)
|
|
```
|
|
|
|
**Option C: Enable Auto-Restart** (systemd)
|
|
```bash
|
|
# Edit service file
|
|
# /etc/systemd/system/application.service
|
|
|
|
[Service]
|
|
Restart=always # Auto-restart on failure
|
|
RestartSec=10 # Wait 10s before restart
|
|
StartLimitBurst=5 # Max 5 restarts
|
|
StartLimitIntervalSec=60 # In 60 seconds
|
|
|
|
# Reload systemd
|
|
systemctl daemon-reload
|
|
|
|
# Impact: Service auto-restarts on crash
|
|
# Risk: Low (but doesn't fix root cause)
|
|
```
|
|
|
|
**Option D: Route Traffic to Backup** (if multi-instance)
|
|
```bash
|
|
# If using load balancer:
|
|
# 1. Remove failed instance from LB
|
|
# 2. Traffic goes to healthy instances
|
|
|
|
# AWS:
|
|
aws elbv2 deregister-targets \
|
|
--target-group-arn <arn> \
|
|
--targets Id=i-1234567890abcdef0
|
|
|
|
# Impact: Users see working instance
|
|
# Risk: Low (other instances handle load)
|
|
```
|
|
|
|
---
|
|
|
|
### Long-term (1 hour+)
|
|
|
|
- [ ] Fix root cause (memory leak, bug, etc.)
|
|
- [ ] Add health check monitoring
|
|
- [ ] Enable auto-restart (systemd)
|
|
- [ ] Set up redundancy (multiple instances)
|
|
- [ ] Add load balancer (distribute traffic)
|
|
- [ ] Increase memory/CPU (if resource issue)
|
|
- [ ] Add alerting (service down, health check fail)
|
|
- [ ] Add E2E test (smoke test after deploy)
|
|
- [ ] Review deployment process (how did bug reach prod?)
|
|
|
|
---
|
|
|
|
## Root Cause Analysis
|
|
|
|
**For each incident, determine**:
|
|
|
|
1. **What failed?** (nginx, application, database)
|
|
2. **Why did it fail?** (OOM, bug, config error)
|
|
3. **What triggered it?** (deploy, traffic spike, external event)
|
|
4. **How to prevent?** (fix bug, add monitoring, increase capacity)
|
|
|
|
---
|
|
|
|
## Escalation
|
|
|
|
**Escalate to developer if**:
|
|
- Application crash due to bug
|
|
- Need code fix
|
|
|
|
**Escalate to platform team if**:
|
|
- Platform/framework issue
|
|
- Infrastructure problem
|
|
|
|
**Escalate to on-call manager if**:
|
|
- Can't restore service in 30 min
|
|
- Need additional resources
|
|
|
|
---
|
|
|
|
## Prevention Checklist
|
|
|
|
- [ ] Health check monitoring (alert on failure)
|
|
- [ ] Auto-restart (systemd Restart=always)
|
|
- [ ] Redundancy (multiple instances behind LB)
|
|
- [ ] Resource monitoring (CPU, memory alerts)
|
|
- [ ] Graceful degradation (circuit breakers, fallbacks)
|
|
- [ ] Smoke tests after deploy
|
|
- [ ] Rollback plan (blue-green, canary)
|
|
- [ ] Chaos engineering (test failure scenarios)
|
|
|
|
---
|
|
|
|
## Related Runbooks
|
|
|
|
- [03-memory-leak.md](03-memory-leak.md) - If OOM caused crash
|
|
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure troubleshooting
|
|
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Application diagnostics
|
|
|
|
---
|
|
|
|
## Post-Incident
|
|
|
|
After resolving:
|
|
- [ ] Create post-mortem (MANDATORY for SEV1)
|
|
- [ ] Timeline with all events
|
|
- [ ] Root cause analysis
|
|
- [ ] Action items (prevent recurrence)
|
|
- [ ] Update runbook if needed
|
|
- [ ] Share learnings with team
|
|
|
|
---
|
|
|
|
## Useful Commands Reference
|
|
|
|
```bash
|
|
# Service status
|
|
systemctl status <service>
|
|
systemctl restart <service>
|
|
journalctl -u <service> -n 50
|
|
|
|
# Process check
|
|
ps aux | grep <process>
|
|
pidof <process>
|
|
|
|
# Check OOM
|
|
dmesg | grep -i "out of memory\|oom"
|
|
|
|
# Check port usage
|
|
lsof -i :<port>
|
|
netstat -tlnp | grep <port>
|
|
|
|
# Test config
|
|
nginx -t
|
|
postgresql --help
|
|
|
|
# Health check
|
|
curl http://localhost/health
|
|
```
|