# How to Create a Deployment Procedure Specification Deployment procedures document step-by-step instructions for deploying systems to production, including prerequisites, procedures, rollback, and troubleshooting. ## Quick Start ```bash # 1. Create a new deployment procedure scripts/generate-spec.sh deployment-procedure deploy-001-descriptive-slug # 2. Open and fill in the file # (The file will be created at: docs/specs/deployment-procedure/deploy-001-descriptive-slug.md) # 3. Fill in steps and checklists, then validate: scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-001-descriptive-slug.md # 4. Fix issues and check completeness: scripts/check-completeness.sh docs/specs/deployment-procedure/deploy-001-descriptive-slug.md ``` ## When to Write a Deployment Procedure Use a Deployment Procedure when you need to: - Document how to deploy a new service or component - Ensure consistent, repeatable deployments - Provide runbooks for operations teams - Document rollback procedures for failures - Enable any team member to deploy safely - Create an audit trail of deployments ## Research Phase ### 1. Research Related Specifications Find what you're deploying: ```bash # Find component specs grep -r "component" docs/specs/ --include="*.md" # Find design documents that mention infrastructure grep -r "design\|infrastructure" docs/specs/ --include="*.md" # Find existing deployment procedures grep -r "deploy" docs/specs/ --include="*.md" ``` ### 2. Understand Your Infrastructure - What's the deployment target? (Kubernetes, serverless, VMs) - What infrastructure does this component need? - What access/permissions are required? - What monitoring must be in place? ### 3. Review Past Deployments - How have similar components been deployed? - What issues arose? How were they resolved? - What worked well? What didn't? - Any patterns or templates to follow? ## Structure & Content Guide ### Title & Metadata - **Title**: "Export Service Deployment to Production", "Database Migration", etc. - **Component**: What's being deployed - **Target**: Production, staging, canary, etc. - **Owner**: Team responsible for deployment ### Prerequisites Section Document what must be done before deployment: ```markdown # Export Service Production Deployment ## Prerequisites ### Infrastructure Requirements - [ ] AWS resources provisioned (see [CMP-001] for details) - [ ] ElastiCache Redis cluster (export-service-queue) - [ ] RDS PostgreSQL instance (export-db) - [ ] S3 bucket (export-files-prod) - [ ] IAM roles and policies configured - [ ] Kubernetes cluster accessible - [ ] kubectl configured with production cluster context - [ ] Deployment manifests reviewed by tech lead - [ ] Namespace `export-service-prod` created ### Code & Build Requirements - [ ] All code merged to main branch - [ ] Code reviewed by 2+ senior engineers - [ ] All tests passing - [ ] Unit tests (90%+ coverage) - [ ] Integration tests - [ ] Load tests pass at target throughput - [ ] Docker image built and pushed to ECR - [ ] Image tagged with version (e.g., v1.2.3) - [ ] Image scanned for vulnerabilities - [ ] Image verified to work (manual test in staging) ### Team & Access Requirements - [ ] Deployment lead identified (typically tech lead or on-call eng) - [ ] Access verified for: - [ ] AWS console (ECR, S3, CloudWatch) - [ ] Kubernetes cluster (kubectl access) - [ ] Database (for running migrations if needed) - [ ] Monitoring/alerting system (Grafana, PagerDuty) - [ ] Communication channel open (Slack, war room) - [ ] Runbook reviewed by both eng and ops team ### Pre-Deployment Verification Checklist - [ ] Staging deployment successful (deployed 24+ hours ago, stable) - [ ] Monitoring in place and verified working - [ ] Rollback plan reviewed and tested - [ ] Emergency contacts identified - [ ] Stakeholders notified of deployment window - [ ] Change log prepared (what's new in this version) ### Data/Database Requirements - [ ] Database schema compatible with new version - [ ] Backward compatible (no breaking changes) - [ ] Migrations tested in staging - [ ] Rollback plan for migrations documented - [ ] No data conflicts or corruption risks - [ ] Backup created (if applicable) ### Approval Checklist - [ ] Tech Lead: Code and approach approved - [ ] Product Owner: Feature approved, ready for launch - [ ] Operations Lead: Deployment plan reviewed - [ ] Security: Security review passed (if applicable) ``` ### Deployment Steps Section Provide step-by-step instructions: ```markdown ## Deployment Procedure ### Pre-Deployment (Validation Phase) **Step 1: Verify Prerequisites** - Command: Run pre-deployment checklist above - Verify: All items checked ✓ - If any fail: Stop deployment, resolve issues - Time: ~15 minutes **Step 2: Create Deployment Record** - Document: Who is deploying, when, what version - Command: Log in to deployment tracking system - Entry: ``` Deployment: export-service Version: v1.2.3 Environment: production Deployed By: Alice Smith Time: 2024-01-15 14:30 UTC Change Summary: Added bulk export feature, fixed queue processing ``` - Time: ~5 minutes ### Deployment Phase **Step 3: Tag Database Migration (if applicable)** - Check: Are there schema changes in this version? - If YES: ```bash # SSH to database server ssh -i ~/.ssh/prod.pem admin@db.example.com # Run migrations psql -U export_service -d export_service -c \ "ALTER TABLE exports ADD COLUMN retry_count INT DEFAULT 0;" # Verify migration psql -U export_service -d export_service -c \ "SELECT column_name FROM information_schema.columns WHERE table_name='exports';" ``` - If NO: Skip this step - Verify: All migrations complete without errors - Time: ~10 minutes **Step 4: Deploy to Kubernetes** - Verify: You're deploying to PRODUCTION cluster ```bash kubectl config current-context # Should output: arn:aws:eks:us-east-1:123456789:cluster/prod ``` - If wrong context: STOP, switch to correct cluster - Deploy new image version: ```bash # Update deployment with new image kubectl set image deployment/export-service \ export-service=123456789.dkr.ecr.us-east-1.amazonaws.com/export-service:v1.2.3 \ -n export-service-prod ``` - Verify: Deployment triggered ```bash kubectl rollout status deployment/export-service -n export-service-prod ``` - Wait: For all pods to become ready (typically 2-3 minutes) - Output should show: `deployment "export-service" successfully rolled out` - Time: ~5 minutes **Step 5: Verify Deployment Health** - Check: Pod status ```bash kubectl get pods -n export-service-prod ``` - All pods should show `Running` status - If any show `CrashLoopBackOff`: Stop deployment, investigate - Check: Service endpoints ```bash kubectl get svc export-service -n export-service-prod ``` - Should show external IP/load balancer endpoint - Check: Logs for errors ```bash kubectl logs -n export-service-prod -l app=export-service --tail=50 ``` - Should show startup logs, no ERROR level messages - If errors present: Check Step 6 for rollback - Check: Health endpoints ```bash curl https://api.example.com/health ``` - Should return 200 OK - If not: Service may still be starting (wait 30s and retry) - Time: ~5 minutes ### Post-Deployment (Verification Phase) **Step 6: Monitor Metrics** - Open: Grafana dashboard for export-service - Check: Key metrics for 5 minutes - Request latency: Should be stable (< 100ms p95) - Error rate: Should remain < 0.1% - CPU/Memory: Should be within normal ranges - Queue depth: Should process jobs smoothly - Look for: Any sudden spikes or anomalies - If anomalies: Proceed to rollback (Step 8) - Time: ~5 minutes **Step 7: Functional Testing** - Manual test: Create export via API ```bash curl -X POST https://api.example.com/exports \ -H "Authorization: Bearer $TOKEN" \ -H "Content-Type: application/json" \ -d '{ "format": "csv", "data_types": ["users"] }' ``` - Response: Should return 201 Created with export_id - Check status: ```bash curl https://api.example.com/exports/{export_id} \ -H "Authorization: Bearer $TOKEN" ``` - Verify: Status transitions from queued → processing → completed - Download: Successfully download export file - Verify: File contents correct - If any step fails: Proceed to rollback (Step 8) - Time: ~5 minutes **Step 8: Notify Stakeholders** - Update: Deployment tracking system ``` Status: DEPLOYED Completion Time: 14:45 UTC Health: ✓ All checks passed Metrics: ✓ Stable Functional Tests: ✓ Passed ``` - Announce: Slack to #product-eng ``` @channel Export Service v1.2.3 deployed to production. New feature: Bulk data exports now available. Status: Monitoring. ``` - Notify: On-call engineer (monitoring for 2 hours post-deployment) ### Rollback Procedure (If Issues Found) **Step 8: Rollback (Only if Step 6 or 7 fail)** - Decision: Is deployment safe to continue? - YES → All checks pass, monitoring is good → Release complete - NO → Issues found → Proceed with rollback - Execute rollback: ```bash # Revert to previous version kubectl rollout undo deployment/export-service -n export-service-prod # Verify rollback in progress kubectl rollout status deployment/export-service -n export-service-prod # Wait for rollback to complete ``` - Verify rollback successful: ```bash # Check current image kubectl describe deployment export-service -n export-service-prod | grep Image # Should show previous version (e.g., v1.2.2) # Verify service responding curl https://api.example.com/health ``` - Notify: Update stakeholders ``` @channel Deployment rolled back due to [specific reason]. Current version: v1.2.2 (stable) Investigating issue. Will retry deployment tomorrow. ``` - Document: Root cause analysis - What went wrong? - Why wasn't it caught in staging? - How do we prevent this next time? - Time: ~10 minutes ``` ### Success Criteria Section ```markdown ## Deployment Success Criteria The deployment is successful if ALL of these are true: ### Technical Criteria - [ ] All pods running and healthy (0 CrashLoopBackOff) - [ ] Service responding to health checks (200 OK) - [ ] Metrics showing normal values (no spikes) - [ ] Error rate < 0.1% (< 1 error per 1000 requests) - [ ] Response latency p95 < 100ms - [ ] No errors in application logs ### Functional Criteria - [ ] Export API responds to requests - [ ] Export jobs queue successfully - [ ] Jobs process and complete - [ ] Files upload to S3 correctly - [ ] Users can download exported files - [ ] File contents verified correct ### Operational Criteria - [ ] Monitoring active and receiving metrics - [ ] Alerting working (test alert fired) - [ ] Logs aggregated and searchable - [ ] Runbook tested and functional - [ ] Team confident in operating system ``` ### Monitoring & Alerting Section ```markdown ## Monitoring Setup ### Critical Alerts (Page on-call) - Service down (health check fails) - Error rate > 1% for 5 minutes - Response latency p95 > 500ms for 5 minutes - Queue depth > 1000 for 10 minutes ### Warning Alerts (Slack notification) - Error rate > 0.5% for 5 minutes - CPU > 80% for 10 minutes - Memory > 85% for 10 minutes - Export job timeout increasing ### Dashboard - Service: export-service-prod - Metrics: Latency, errors, throughput, queue depth - Time range: Last 24 hours by default - Alerts: Show current alert status ``` ### Troubleshooting Section ```markdown ## Troubleshooting Common Issues ### Issue: Pods stuck in CrashLoopBackOff **Symptoms**: Pods repeatedly crash and restart **Diagnosis**: ```bash # Check logs for errors kubectl logs -n export-service-prod ``` **Common Causes**: - Configuration error (check environment variables) - Database connection failed (check credentials) - Out of memory (check resource limits) **Fix**: Review logs, check prerequisites, rollback if unclear ### Issue: Response latency spiking **Symptoms**: p95 latency > 200ms, users report slow exports **Diagnosis**: ```bash # Check queue depth kubectl exec -it -n export-service-prod \ -- redis-cli -h redis.example.com LLEN export-queue ``` **Common Causes**: - Too many concurrent exports (queue backlog) - Database slow (check queries, indexes) - Network issues (check connectivity) **Fix**: Scale up workers, check database performance, verify network ### Issue: Export jobs failing **Symptoms**: Job status shows `failed`, users can't export **Diagnosis**: ```bash # Check worker logs kubectl logs -n export-service-prod -l app=export-service ``` **Common Causes**: - S3 upload failing (check permissions, bucket exists) - Database query error (schema mismatch) - User doesn't have data to export **Fix**: Review logs, verify S3 access, check schema version ### Issue: Database migration failed **Symptoms**: Service won't start after deployment **Diagnosis**: ```bash # Check migration logs psql -U export_service -d export_service -c \ "SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 5;" ``` **Recovery**: 1. Identify failed migration 2. Rollback deployment (revert to previous version) 3. Debug migration issue in staging 4. Retry deployment after fix ``` ### Post-Deployment Actions Section ```markdown ## After Deployment ### Immediate (Next 2 hours) - [ ] On-call engineer monitoring - [ ] Check metrics every 15 minutes - [ ] Monitor error rate and latency - [ ] Watch for user-reported issues in #support ### Short-term (Next 24 hours) - [ ] Review deployment metrics - [ ] Collect feedback from users - [ ] Document any issues encountered - [ ] Update runbook if needed ### Follow-up (Next week) - [ ] Post-mortem if issues occurred - [ ] Update deployment procedure based on lessons learned - [ ] Plan performance improvements if needed - [ ] Update documentation if system behavior changed ``` ## Writing Tips ### Be Precise and Detailed - Exact commands to run (copy-paste ready) - Specific values (versions, endpoints, timeouts) - Expected outputs for verification - Time estimates for each step ### Think About Edge Cases - What if something is already deployed? - What if a prerequisite is missing? - What if deployment partially succeeds? - What if rollback is needed? ### Make Rollback Easy - Document rollback procedure clearly - Test rollback before using in production - Make rollback faster than forward deployment - Have quick communication plan for failures ### Document Monitoring - What metrics indicate health? - What should we watch during deployment? - What thresholds trigger alerts? - How do we validate success? ### Link to Related Specs - Reference component specs: `[CMP-001]` - Reference design documents: `[DES-001]` - Reference operations runbooks ## Validation & Fixing Issues ### Run the Validator ```bash scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-001-your-spec.md ``` ### Common Issues & Fixes **Issue**: "Prerequisites section incomplete" - **Fix**: Add all required infrastructure, code, access, and approvals **Issue**: "Step-by-step procedures lack detail" - **Fix**: Add actual commands, expected output, time estimates **Issue**: "No rollback procedure" - **Fix**: Document how to revert deployment if issues arise **Issue**: "Monitoring and troubleshooting missing" - **Fix**: Add success criteria, monitoring setup, and troubleshooting guide ## Decision-Making Framework When writing a deployment procedure: 1. **Prerequisites**: What must be true before we start? - Infrastructure ready? - Code reviewed and tested? - Team trained? - Approvals gotten? 2. **Procedure**: What are the exact steps? - Simple, repeatable steps? - Verification at each step? - Estimated timing? 3. **Safety**: How do we prevent/catch issues? - Verification steps after each phase? - Rollback procedure? - Quick failure detection? 4. **Communication**: Who needs to know what? - Stakeholders notified? - On-call monitoring? - Escalation path? 5. **Learning**: How do we improve next time? - Monitoring enabled? - Runbook updated? - Issues documented? ## Next Steps 1. **Create the spec**: `scripts/generate-spec.sh deployment-procedure deploy-XXX-slug` 2. **Research**: Find component specs and existing procedures 3. **Document prerequisites**: What must be true before deployment? 4. **Write procedures**: Step-by-step, with commands and verification 5. **Plan rollback**: How do we undo this if needed? 6. **Validate**: `scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-XXX-slug.md` 7. **Test procedure**: Walk through it in staging environment 8. **Get team review** before using in production