Playbook: Service Down
Symptoms
- Service not responding
- Health check failures
- 502 Bad Gateway or 503 Service Unavailable
- Users can't access application
- Monitoring alert: "Service down", "Health check failed"
Severity
- SEV1 - Production service completely unavailable
Diagnosis
Step 1: Check Service Status
# Check if service is running (systemd)
systemctl status nginx
systemctl status application
systemctl status postgresql
# Check process
ps aux | grep nginx
pidof nginx
# Example output:
# nginx.service - nginx web server
# Active: inactive (dead) ← SERVICE IS DOWN
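If several units are involved, a quick sweep can be faster than checking one by one. A minimal sketch; the unit names below are examples, not a fixed list:
# Check each critical unit in one pass; is-active returns non-zero when the unit is not running
for svc in nginx application postgresql; do
    systemctl is-active --quiet "$svc" && echo "$svc: running" || echo "$svc: DOWN"
done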
Step 2: Check Why Service Stopped
Check Service Logs (systemd):
# Last 50 lines of service logs
journalctl -u nginx -n 50
# Tail logs in real-time
journalctl -u nginx -f
# Look for:
# - Exit code (0 = normal, non-zero = error)
# - Error messages
# - Crash reason
Check Application Logs:
# Check application error log
tail -100 /var/log/application/error.log
# Look for:
# - Exception/error before crash
# - Stack trace
# - "Fatal error", "Segmentation fault"
Check System Logs:
# Check for OOM (Out of Memory) killer
dmesg | grep -i "out of memory\|oom\|killed process"
# Example:
# Out of memory: Killed process 1234 (node) total-vm:8388608kB ...
# ↑ OOM Killer terminated the application
# Check kernel errors
dmesg | tail -50
# Check syslog
grep "error\|segfault" /var/log/syslog
Step 3: Identify Root Cause
Common causes:
| Symptom | Root Cause |
|---|---|
| "Out of memory" in dmesg | OOM Killer (memory leak, insufficient memory) |
| "Segmentation fault" | Application bug (crash) |
| "Address already in use" | Port already bound |
| "Connection refused" to database | Database down |
| "No such file or directory" | Missing config file |
| "Permission denied" | Wrong file permissions |
| Exit code 137 | Killed by OOM Killer |
| Exit code 139 | Segmentation fault |
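Exit codes above 128 encode a fatal signal (exit code minus 128 is the signal number), which is how the 137 and 139 rows above are derived. kill -l translates the signal number to its name:
# Translate a fatal-signal exit code: signal number = exit code - 128
kill -l 9     # KILL  -> exit code 137 (typical of the OOM Killer)
kill -l 11    # SEGV  -> exit code 139 (segmentation fault)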
Mitigation
Immediate (Now - 5 min)
Option A: Restart Service
# Restart service
systemctl restart nginx
# Check if started successfully
systemctl status nginx
# Test endpoint
curl http://localhost
# Impact: Service restored
# Risk: Low (if root cause not addressed, may crash again)
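A restart can mask a crash loop, so it is worth watching the endpoint for a short while afterwards. A rough sketch; the URL and interval are arbitrary:
# Poll the service every 10s for one minute and print the HTTP status code
for i in $(seq 1 6); do
    curl -s -o /dev/null -w "%{http_code}\n" http://localhost/
    sleep 10
done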
Option B: Fix Configuration Error (if config issue)
# Test configuration (nginx has a built-in check)
nginx -t
# PostgreSQL has no equivalent one-shot config test; compare postgresql.conf against a known-good copy
# If /etc is version-controlled (e.g. with etckeeper), check recent changes
git diff HEAD~1 /etc/nginx/nginx.conf
# Revert to the last known-good config
git checkout HEAD~1 /etc/nginx/nginx.conf
# Restart
systemctl restart nginx
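For services that support live reload, chaining the config test with the reload avoids restarting on a broken config at all (nginx shown as an example):
# Reload only if the config parses cleanly; a failed test leaves the running config untouched
nginx -t && systemctl reload nginx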
Option C: Free Up Resources (if OOM)
# Check memory usage
free -h
# Kill memory-heavy, non-critical processes (prefer a plain kill/SIGTERM; use -9 only if the process won't exit)
kill -9 <PID>
# Free page cache (rarely needed; the kernel reclaims cache on its own under memory pressure)
sync && echo 3 > /proc/sys/vm/drop_caches
# Restart service
systemctl restart application
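Before killing anything, it helps to see which processes actually hold the memory. A quick way to rank them:
# Top 10 processes by resident memory (first output line is the header)
ps aux --sort=-%mem | head -11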
Option D: Change Port (if port conflict)
# Check what's using port
lsof -i :80
# Example:
# apache2 1234 root 4u IPv4 12345 0t0 TCP *:80 (LISTEN)
# ↑ Apache using port 80
# Stop conflicting service
systemctl stop apache2
# Start intended service
systemctl start nginx
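If the conflicting service must keep running, the alternative is to move the intended service to a free port instead. A hedged nginx sketch; 8080 and the config path are arbitrary examples:
# Change the listen directive in the site config, e.g. in /etc/nginx/sites-enabled/default:
#   listen 8080;
# Validate, restart, and confirm the new port answers
nginx -t && systemctl restart nginx
curl -I http://localhost:8080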
Short-term (5 min - 1 hour)
Option A: Fix Crash Bug (if application bug)
# Check stack trace in logs
tail -100 /var/log/application/error.log
# Identify line causing crash
# Example: NullPointerException at PaymentService.java:42
# Deploy hotfix OR revert to previous version
git checkout <previous-working-commit>
npm run build && pm2 restart all
# Impact: Bug fixed, service stable
# Risk: Medium (need proper testing)
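If the offending commit is known, reverting just that commit keeps history intact and is easier to audit than checking out an old tree (placeholder hash; build/restart commands mirror the example above):
# Revert only the bad commit, then rebuild and restart
git revert <bad-commit-sha>
npm run build && pm2 restart all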
Option B: Increase Memory (if OOM)
# Short-term: add a 2 GB swapfile
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
# Long-term: Resize instance
# AWS: Change instance type (t3.medium → t3.large)
# Azure: Resize VM
# Impact: More memory available
# Risk: Medium (swap is slow, instance resize has downtime)
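If the swapfile should survive a reboot, add it to /etc/fstab as well (standard entry; the path matches the example above):
# Persist the swapfile across reboots
echo '/swapfile none swap sw 0 0' >> /etc/fstab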
Option C: Enable Auto-Restart (systemd)
# Edit the unit file: /etc/systemd/system/application.service
# (systemd does not support trailing comments on directive lines, so keep comments on their own lines)
[Unit]
# Allow at most 5 restart attempts within 60 seconds
StartLimitIntervalSec=60
StartLimitBurst=5

[Service]
# Auto-restart whenever the service exits; wait 10s before each attempt
Restart=always
RestartSec=10
# Reload systemd
systemctl daemon-reload
# Impact: Service auto-restarts on crash
# Risk: Low (but doesn't fix root cause)
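Instead of editing the unit file in place, a drop-in override keeps the change separate from the packaged unit; systemd reloads units automatically when the editor closes:
# Create /etc/systemd/system/application.service.d/override.conf interactively
systemctl edit application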
Option D: Route Traffic to Backup (if multi-instance)
# If using load balancer:
# 1. Remove failed instance from LB
# 2. Traffic goes to healthy instances
# AWS:
aws elbv2 deregister-targets \
--target-group-arn <arn> \
--targets Id=i-1234567890abcdef0
# Impact: Users see working instance
# Risk: Low (other instances handle load)
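Before relying on the remaining instances, confirm the target group still reports them as healthy (same ARN placeholder as above):
# List the health state of every target in the group
aws elbv2 describe-target-health --target-group-arn <arn>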
Long-term (1 hour+)
- Fix root cause (memory leak, bug, etc.)
- Add health check monitoring
- Enable auto-restart (systemd)
- Set up redundancy (multiple instances)
- Add load balancer (distribute traffic)
- Increase memory/CPU (if resource issue)
- Add alerting (service down, health check fail)
- Add E2E test (smoke test after deploy; see the sketch after this list)
- Review deployment process (how did bug reach prod?)
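A minimal post-deploy smoke test, as referenced in the list above, can be a single HTTP check. The endpoint and expected status are assumptions about the application:
#!/usr/bin/env bash
# smoke-test.sh - fail the deploy if the health endpoint does not return 200
set -euo pipefail
code=$(curl -s -o /dev/null -w "%{http_code}" http://localhost/health)
if [ "$code" != "200" ]; then
    echo "Smoke test failed: HTTP $code"
    exit 1
fi
echo "Smoke test passed"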
Root Cause Analysis
For each incident, determine:
- What failed? (nginx, application, database)
- Why did it fail? (OOM, bug, config error)
- What triggered it? (deploy, traffic spike, external event)
- How to prevent? (fix bug, add monitoring, increase capacity)
Escalation
Escalate to developer if:
- Application crash due to bug
- Need code fix
Escalate to platform team if:
- Platform/framework issue
- Infrastructure problem
Escalate to on-call manager if:
- Can't restore service in 30 min
- Need additional resources
Prevention Checklist
- Health check monitoring (alert on failure)
- Auto-restart (systemd Restart=always)
- Redundancy (multiple instances behind LB)
- Resource monitoring (CPU, memory alerts)
- Graceful degradation (circuit breakers, fallbacks)
- Smoke tests after deploy
- Rollback plan (blue-green, canary)
- Chaos engineering (test failure scenarios)
Related Runbooks
- 03-memory-leak.md - If OOM caused crash
- ../modules/infrastructure.md - Infrastructure troubleshooting
- ../modules/backend-diagnostics.md - Application diagnostics
Post-Incident
After resolving:
- Create post-mortem (MANDATORY for SEV1)
- Timeline with all events
- Root cause analysis
- Action items (prevent recurrence)
- Update runbook if needed
- Share learnings with team
Useful Commands Reference
# Service status
systemctl status <service>
systemctl restart <service>
journalctl -u <service> -n 50
# Process check
ps aux | grep <process>
pidof <process>
# Check OOM
dmesg | grep -i "out of memory\|oom"
# Check port usage
lsof -i :<port>
ss -tlnp | grep <port>     # or netstat -tlnp on older systems
# Test config
nginx -t
# (PostgreSQL has no equivalent one-shot config test)
# Health check
curl http://localhost/health