# Playbook: Service Down

## Symptoms

- Service not responding
- Health check failures
- 502 Bad Gateway or 503 Service Unavailable
- Users can't access the application
- Monitoring alert: "Service down", "Health check failed"

## Severity

- **SEV1** - Production service completely unavailable

## Diagnosis

### Step 1: Check Service Status

```bash
# Check if service is running (systemd)
systemctl status nginx
systemctl status application
systemctl status postgresql

# Check process
ps aux | grep nginx
pidof nginx

# Example output:
# nginx.service - nginx web server
#    Active: inactive (dead)   ← SERVICE IS DOWN
```

---

### Step 2: Check Why Service Stopped

**Check Service Logs** (systemd):

```bash
# Last 50 lines of service logs
journalctl -u nginx -n 50

# Tail logs in real-time
journalctl -u nginx -f

# Look for:
# - Exit code (0 = normal, non-zero = error)
# - Error messages
# - Crash reason
```

**Check Application Logs**:

```bash
# Check application error log
tail -100 /var/log/application/error.log

# Look for:
# - Exception/error before crash
# - Stack trace
# - "Fatal error", "Segmentation fault"
```

**Check System Logs**:

```bash
# Check for OOM (Out of Memory) killer
dmesg | grep -i "out of memory\|oom\|killed process"

# Example:
# Out of memory: Killed process 1234 (node) total-vm:8GB
# ↑ OOM Killer terminated the application

# Check kernel errors
dmesg | tail -50

# Check syslog
grep "error\|segfault" /var/log/syslog
```

---

### Step 3: Identify Root Cause

**Common causes**:

| Symptom | Root Cause |
|---------|------------|
| "Out of memory" in dmesg | OOM Killer (memory leak, insufficient memory) |
| "Segmentation fault" | Application bug (crash) |
| "Address already in use" | Port already bound |
| "Connection refused" to database | Database down |
| "No such file or directory" | Missing config file |
| "Permission denied" | Wrong file permissions |
| Exit code 137 | Killed by OOM Killer |
| Exit code 139 | Segmentation fault |

---

## Mitigation

### Immediate (Now - 5 min)

**Option A: Restart Service**

```bash
# Restart service
systemctl restart nginx

# Check if it started successfully
systemctl status nginx

# Test endpoint
curl http://localhost

# Impact: Service restored
# Risk: Low (if root cause not addressed, may crash again)
```

**Option B: Fix Configuration Error** (if config issue)

```bash
# Test configuration
nginx -t           # nginx
postgres --help    # postgres

# If config error, check recent changes
git diff HEAD~1 /etc/nginx/nginx.conf

# Revert to working config
git checkout HEAD~1 /etc/nginx/nginx.conf

# Restart
systemctl restart nginx
```

**Option C: Free Up Resources** (if OOM)

```bash
# Check memory usage
free -h

# Kill memory-heavy processes (non-critical)
kill -9 <pid>

# Free page cache
sync && echo 3 > /proc/sys/vm/drop_caches

# Restart service
systemctl restart application
```

**Option D: Change Port** (if port conflict)

```bash
# Check what's using the port
lsof -i :80

# Example:
# apache2  1234  root  4u  IPv4  12345  0t0  TCP *:80 (LISTEN)
# ↑ Apache using port 80

# Stop conflicting service
systemctl stop apache2

# Start intended service
systemctl start nginx
```

---

### Short-term (5 min - 1 hour)

**Option A: Fix Crash Bug** (if application bug)

```bash
# Check stack trace in logs
tail -100 /var/log/application/error.log

# Identify line causing crash
# Example: NullPointerException at PaymentService.java:42

# Deploy hotfix OR revert to previous version
git checkout <previous-release-tag>
npm run build && pm2 restart all

# Impact: Bug fixed, service stable
# Risk: Medium (needs proper testing)
```
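After deploying the hotfix or reverting, verify that the service stays healthy for more than a single check before closing out the mitigation. A minimal verification sketch, assuming the application exposes a `/health` endpoint on localhost; the URL, attempt count, and interval are placeholders to adjust:

```bash
#!/usr/bin/env bash
# Poll the health endpoint repeatedly; exit non-zero if any check fails.
URL="http://localhost/health"   # placeholder: your service's health endpoint
ATTEMPTS=10                     # number of consecutive checks
INTERVAL=6                      # seconds between checks

for i in $(seq 1 "$ATTEMPTS"); do
  code=$(curl -s -o /dev/null -w '%{http_code}' "$URL")
  if [ "$code" != "200" ]; then
    echo "Check $i/$ATTEMPTS failed (HTTP $code) - service not stable yet"
    exit 1
  fi
  echo "Check $i/$ATTEMPTS OK"
  sleep "$INTERVAL"
done
echo "Service returned 200 for $ATTEMPTS consecutive checks"
```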
**Option B: Increase Memory** (if OOM)

```bash
# Short-term: add swap
dd if=/dev/zero of=/swapfile bs=1M count=2048
chmod 600 /swapfile    # swap files should not be world-readable
mkswap /swapfile
swapon /swapfile

# Long-term: resize instance
# AWS: change instance type (t3.medium → t3.large)
# Azure: resize VM

# Impact: More memory available
# Risk: Medium (swap is slow, instance resize has downtime)
```

**Option C: Enable Auto-Restart** (systemd)

```bash
# Edit the service file
# /etc/systemd/system/application.service

[Service]
Restart=always              # Auto-restart on failure
RestartSec=10               # Wait 10s before restart
StartLimitBurst=5           # Max 5 restarts
StartLimitIntervalSec=60    # In 60 seconds

# Reload systemd
systemctl daemon-reload

# Impact: Service auto-restarts on crash
# Risk: Low (but doesn't fix the root cause)
```

**Option D: Route Traffic to Backup** (if multi-instance)

```bash
# If using a load balancer:
# 1. Remove the failed instance from the LB
# 2. Traffic goes to healthy instances

# AWS:
aws elbv2 deregister-targets \
  --target-group-arn <target-group-arn> \
  --targets Id=i-1234567890abcdef0

# Impact: Users see working instances
# Risk: Low (other instances handle the load)
```

---

### Long-term (1 hour+)

- [ ] Fix root cause (memory leak, bug, etc.)
- [ ] Add health check monitoring
- [ ] Enable auto-restart (systemd)
- [ ] Set up redundancy (multiple instances)
- [ ] Add load balancer (distribute traffic)
- [ ] Increase memory/CPU (if resource issue)
- [ ] Add alerting (service down, health check fail)
- [ ] Add E2E test (smoke test after deploy)
- [ ] Review deployment process (how did the bug reach prod?)

---

## Root Cause Analysis

**For each incident, determine**:

1. **What failed?** (nginx, application, database)
2. **Why did it fail?** (OOM, bug, config error)
3. **What triggered it?** (deploy, traffic spike, external event)
4. **How to prevent it?** (fix bug, add monitoring, increase capacity)

---

## Escalation

**Escalate to developer if**:
- Application crashed due to a bug
- A code fix is needed

**Escalate to platform team if**:
- Platform/framework issue
- Infrastructure problem

**Escalate to on-call manager if**:
- Can't restore service in 30 min
- Additional resources are needed

---

## Prevention Checklist

- [ ] Health check monitoring (alert on failure)
- [ ] Auto-restart (systemd Restart=always)
- [ ] Redundancy (multiple instances behind LB)
- [ ] Resource monitoring (CPU, memory alerts)
- [ ] Graceful degradation (circuit breakers, fallbacks)
- [ ] Smoke tests after deploy
- [ ] Rollback plan (blue-green, canary)
- [ ] Chaos engineering (test failure scenarios)

---

## Related Runbooks

- [03-memory-leak.md](03-memory-leak.md) - If OOM caused the crash
- [../modules/infrastructure.md](../modules/infrastructure.md) - Infrastructure troubleshooting
- [../modules/backend-diagnostics.md](../modules/backend-diagnostics.md) - Application diagnostics

---

## Post-Incident

After resolving:

- [ ] Create post-mortem (MANDATORY for SEV1)
- [ ] Timeline with all events
- [ ] Root cause analysis
- [ ] Action items (prevent recurrence)
- [ ] Update runbook if needed
- [ ] Share learnings with team

---

## Useful Commands Reference

```bash
# Service status
systemctl status <service>
systemctl restart <service>
journalctl -u <service> -n 50

# Process check
ps aux | grep <process>
pidof <process>

# Check OOM
dmesg | grep -i "out of memory\|oom"

# Check port usage
lsof -i :<port>
netstat -tlnp | grep <port>

# Test config
nginx -t
postgres --help

# Health check
curl http://localhost/health
```
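As a stopgap until proper monitoring and redundancy from the Prevention Checklist are in place, a small cron-driven watchdog can catch a down service early and restart it with an audit trail. A minimal sketch, assuming a systemd-managed unit and a local `/health` endpoint; the unit name, URL, and log path are placeholders:

```bash
#!/usr/bin/env bash
# watchdog.sh - restart the service if the health check fails.
# Run as root from cron, e.g.: */2 * * * * /usr/local/bin/watchdog.sh
SERVICE="application"              # placeholder: systemd unit to watch
URL="http://localhost/health"      # placeholder: health endpoint
LOG="/var/log/watchdog.log"        # placeholder: where to record restarts

# -f makes curl fail on HTTP errors (>= 400); --max-time bounds the check.
if ! curl -sf --max-time 5 "$URL" > /dev/null; then
  echo "$(date -Is) health check failed, restarting $SERVICE" >> "$LOG"
  systemctl restart "$SERVICE"
fi
```

This does not replace real alerting: it hides recurring crashes unless someone reviews the log, so keep the health-check alert from the Prevention Checklist as the primary signal.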