Files
2025-11-30 08:45:31 +08:00

16 KiB

How to Create a Deployment Procedure Specification

Deployment procedures document step-by-step instructions for deploying systems to production, including prerequisites, procedures, rollback, and troubleshooting.

Quick Start

# 1. Create a new deployment procedure
scripts/generate-spec.sh deployment-procedure deploy-001-descriptive-slug

# 2. Open and fill in the file
# (The file will be created at: docs/specs/deployment-procedure/deploy-001-descriptive-slug.md)

# 3. Fill in steps and checklists, then validate:
scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-001-descriptive-slug.md

# 4. Fix issues and check completeness:
scripts/check-completeness.sh docs/specs/deployment-procedure/deploy-001-descriptive-slug.md

When to Write a Deployment Procedure

Use a Deployment Procedure when you need to:

  • Document how to deploy a new service or component
  • Ensure consistent, repeatable deployments
  • Provide runbooks for operations teams
  • Document rollback procedures for failures
  • Enable any team member to deploy safely
  • Create an audit trail of deployments

Research Phase

Find what you're deploying:

# Find component specs
grep -r "component" docs/specs/ --include="*.md"

# Find design documents that mention infrastructure
grep -r "design\|infrastructure" docs/specs/ --include="*.md"

# Find existing deployment procedures
grep -r "deploy" docs/specs/ --include="*.md"

2. Understand Your Infrastructure

  • What's the deployment target? (Kubernetes, serverless, VMs)
  • What infrastructure does this component need?
  • What access/permissions are required?
  • What monitoring must be in place?

3. Review Past Deployments

  • How have similar components been deployed?
  • What issues arose? How were they resolved?
  • What worked well? What didn't?
  • Any patterns or templates to follow?

Structure & Content Guide

Title & Metadata

  • Title: "Export Service Deployment to Production", "Database Migration", etc.
  • Component: What's being deployed
  • Target: Production, staging, canary, etc.
  • Owner: Team responsible for deployment

Prerequisites Section

Document what must be done before deployment:

# Export Service Production Deployment

## Prerequisites

### Infrastructure Requirements
- [ ] AWS resources provisioned (see [CMP-001] for details)
  - [ ] ElastiCache Redis cluster (export-service-queue)
  - [ ] RDS PostgreSQL instance (export-db)
  - [ ] S3 bucket (export-files-prod)
  - [ ] IAM roles and policies configured
- [ ] Kubernetes cluster accessible
  - [ ] kubectl configured with production cluster context
  - [ ] Deployment manifests reviewed by tech lead
  - [ ] Namespace `export-service-prod` created

### Code & Build Requirements
- [ ] All code merged to main branch
- [ ] Code reviewed by 2+ senior engineers
- [ ] All tests passing
  - [ ] Unit tests (90%+ coverage)
  - [ ] Integration tests
  - [ ] Load tests pass at target throughput
- [ ] Docker image built and pushed to ECR
  - [ ] Image tagged with version (e.g., v1.2.3)
  - [ ] Image scanned for vulnerabilities
  - [ ] Image verified to work (manual test in staging)

### Team & Access Requirements
- [ ] Deployment lead identified (typically tech lead or on-call eng)
- [ ] Access verified for:
  - [ ] AWS console (ECR, S3, CloudWatch)
  - [ ] Kubernetes cluster (kubectl access)
  - [ ] Database (for running migrations if needed)
  - [ ] Monitoring/alerting system (Grafana, PagerDuty)
- [ ] Communication channel open (Slack, war room)
- [ ] Runbook reviewed by both eng and ops team

### Pre-Deployment Verification Checklist
- [ ] Staging deployment successful (deployed 24+ hours ago, stable)
- [ ] Monitoring in place and verified working
- [ ] Rollback plan reviewed and tested
- [ ] Emergency contacts identified
- [ ] Stakeholders notified of deployment window
- [ ] Change log prepared (what's new in this version)

### Data/Database Requirements
- [ ] Database schema compatible with new version
  - [ ] Backward compatible (no breaking changes)
  - [ ] Migrations tested in staging
  - [ ] Rollback plan for migrations documented
- [ ] No data conflicts or corruption risks
- [ ] Backup created (if applicable)

### Approval Checklist
- [ ] Tech Lead: Code and approach approved
- [ ] Product Owner: Feature approved, ready for launch
- [ ] Operations Lead: Deployment plan reviewed
- [ ] Security: Security review passed (if applicable)

Deployment Steps Section

Provide step-by-step instructions:

## Deployment Procedure

### Pre-Deployment (Validation Phase)

**Step 1: Verify Prerequisites**
- Command: Run pre-deployment checklist above
- Verify: All items checked ✓
- If any fail: Stop deployment, resolve issues
- Time: ~15 minutes

**Step 2: Create Deployment Record**
- Document: Who is deploying, when, what version
- Command: Log in to deployment tracking system
- Entry:

Deployment: export-service Version: v1.2.3 Environment: production Deployed By: Alice Smith Time: 2024-01-15 14:30 UTC Change Summary: Added bulk export feature, fixed queue processing

- Time: ~5 minutes

### Deployment Phase

**Step 3: Tag Database Migration (if applicable)**
- Check: Are there schema changes in this version?
- If YES:
```bash
# SSH to database server
ssh -i ~/.ssh/prod.pem admin@db.example.com

# Run migrations
psql -U export_service -d export_service -c \
  "ALTER TABLE exports ADD COLUMN retry_count INT DEFAULT 0;"

# Verify migration
psql -U export_service -d export_service -c \
  "SELECT column_name FROM information_schema.columns WHERE table_name='exports';"
  • If NO: Skip this step
  • Verify: All migrations complete without errors
  • Time: ~10 minutes

Step 4: Deploy to Kubernetes

  • Verify: You're deploying to PRODUCTION cluster
    kubectl config current-context
    # Should output: arn:aws:eks:us-east-1:123456789:cluster/prod
    
  • If wrong context: STOP, switch to correct cluster
  • Deploy new image version:
    # Update deployment with new image
    kubectl set image deployment/export-service \
      export-service=123456789.dkr.ecr.us-east-1.amazonaws.com/export-service:v1.2.3 \
      -n export-service-prod
    
  • Verify: Deployment triggered
    kubectl rollout status deployment/export-service -n export-service-prod
    
  • Wait: For all pods to become ready (typically 2-3 minutes)
  • Output should show: deployment "export-service" successfully rolled out
  • Time: ~5 minutes

Step 5: Verify Deployment Health

  • Check: Pod status

    kubectl get pods -n export-service-prod
    
    • All pods should show Running status
    • If any show CrashLoopBackOff: Stop deployment, investigate
  • Check: Service endpoints

    kubectl get svc export-service -n export-service-prod
    
    • Should show external IP/load balancer endpoint
  • Check: Logs for errors

    kubectl logs -n export-service-prod -l app=export-service --tail=50
    
    • Should show startup logs, no ERROR level messages
    • If errors present: Check Step 6 for rollback
  • Check: Health endpoints

    curl https://api.example.com/health
    
    • Should return 200 OK
    • If not: Service may still be starting (wait 30s and retry)
  • Time: ~5 minutes

Post-Deployment (Verification Phase)

Step 6: Monitor Metrics

  • Open: Grafana dashboard for export-service
  • Check: Key metrics for 5 minutes
    • Request latency: Should be stable (< 100ms p95)
    • Error rate: Should remain < 0.1%
    • CPU/Memory: Should be within normal ranges
    • Queue depth: Should process jobs smoothly
  • Look for: Any sudden spikes or anomalies
  • If anomalies: Proceed to rollback (Step 8)
  • Time: ~5 minutes

Step 7: Functional Testing

  • Manual test: Create export via API
    curl -X POST https://api.example.com/exports \
      -H "Authorization: Bearer $TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "format": "csv",
        "data_types": ["users"]
      }'
    
  • Response: Should return 201 Created with export_id
  • Check status:
    curl https://api.example.com/exports/{export_id} \
      -H "Authorization: Bearer $TOKEN"
    
  • Verify: Status transitions from queued → processing → completed
  • Download: Successfully download export file
  • Verify: File contents correct
  • If any step fails: Proceed to rollback (Step 8)
  • Time: ~5 minutes

Step 8: Notify Stakeholders

  • Update: Deployment tracking system
    Status: DEPLOYED
    Completion Time: 14:45 UTC
    Health: ✓ All checks passed
    Metrics: ✓ Stable
    Functional Tests: ✓ Passed
    
  • Announce: Slack to #product-eng
    @channel Export Service v1.2.3 deployed to production.
    New feature: Bulk data exports now available.
    Status: Monitoring.
    
  • Notify: On-call engineer (monitoring for 2 hours post-deployment)

Rollback Procedure (If Issues Found)

Step 8: Rollback (Only if Step 6 or 7 fail)

  • Decision: Is deployment safe to continue?

    • YES → All checks pass, monitoring is good → Release complete
    • NO → Issues found → Proceed with rollback
  • Execute rollback:

    # Revert to previous version
    kubectl rollout undo deployment/export-service -n export-service-prod
    
    # Verify rollback in progress
    kubectl rollout status deployment/export-service -n export-service-prod
    
    # Wait for rollback to complete
    
  • Verify rollback successful:

    # Check current image
    kubectl describe deployment export-service -n export-service-prod | grep Image
    
    # Should show previous version (e.g., v1.2.2)
    
    # Verify service responding
    curl https://api.example.com/health
    
  • Notify: Update stakeholders

    @channel Deployment rolled back due to [specific reason].
    Current version: v1.2.2 (stable)
    Investigating issue. Will retry deployment tomorrow.
    
  • Document: Root cause analysis

    • What went wrong?
    • Why wasn't it caught in staging?
    • How do we prevent this next time?
  • Time: ~10 minutes


### Success Criteria Section

```markdown
## Deployment Success Criteria

The deployment is successful if ALL of these are true:

### Technical Criteria
- [ ] All pods running and healthy (0 CrashLoopBackOff)
- [ ] Service responding to health checks (200 OK)
- [ ] Metrics showing normal values (no spikes)
- [ ] Error rate < 0.1% (< 1 error per 1000 requests)
- [ ] Response latency p95 < 100ms
- [ ] No errors in application logs

### Functional Criteria
- [ ] Export API responds to requests
- [ ] Export jobs queue successfully
- [ ] Jobs process and complete
- [ ] Files upload to S3 correctly
- [ ] Users can download exported files
- [ ] File contents verified correct

### Operational Criteria
- [ ] Monitoring active and receiving metrics
- [ ] Alerting working (test alert fired)
- [ ] Logs aggregated and searchable
- [ ] Runbook tested and functional
- [ ] Team confident in operating system

Monitoring & Alerting Section

## Monitoring Setup

### Critical Alerts (Page on-call)
- Service down (health check fails)
- Error rate > 1% for 5 minutes
- Response latency p95 > 500ms for 5 minutes
- Queue depth > 1000 for 10 minutes

### Warning Alerts (Slack notification)
- Error rate > 0.5% for 5 minutes
- CPU > 80% for 10 minutes
- Memory > 85% for 10 minutes
- Export job timeout increasing

### Dashboard
- Service: export-service-prod
- Metrics: Latency, errors, throughput, queue depth
- Time range: Last 24 hours by default
- Alerts: Show current alert status

Troubleshooting Section

## Troubleshooting Common Issues

### Issue: Pods stuck in CrashLoopBackOff
**Symptoms**: Pods repeatedly crash and restart
**Diagnosis**:
```bash
# Check logs for errors
kubectl logs <pod-name> -n export-service-prod

Common Causes:

  • Configuration error (check environment variables)
  • Database connection failed (check credentials)
  • Out of memory (check resource limits) Fix: Review logs, check prerequisites, rollback if unclear

Issue: Response latency spiking

Symptoms: p95 latency > 200ms, users report slow exports Diagnosis:

# Check queue depth
kubectl exec -it <worker-pod> -n export-service-prod \
  -- redis-cli -h redis.example.com LLEN export-queue

Common Causes:

  • Too many concurrent exports (queue backlog)
  • Database slow (check queries, indexes)
  • Network issues (check connectivity) Fix: Scale up workers, check database performance, verify network

Issue: Export jobs failing

Symptoms: Job status shows failed, users can't export Diagnosis:

# Check worker logs
kubectl logs -n export-service-prod -l app=export-service

Common Causes:

  • S3 upload failing (check permissions, bucket exists)
  • Database query error (schema mismatch)
  • User doesn't have data to export Fix: Review logs, verify S3 access, check schema version

Issue: Database migration failed

Symptoms: Service won't start after deployment Diagnosis:

# Check migration logs
psql -U export_service -d export_service -c \
  "SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 5;"

Recovery:

  1. Identify failed migration
  2. Rollback deployment (revert to previous version)
  3. Debug migration issue in staging
  4. Retry deployment after fix

### Post-Deployment Actions Section

```markdown
## After Deployment

### Immediate (Next 2 hours)
- [ ] On-call engineer monitoring
- [ ] Check metrics every 15 minutes
- [ ] Monitor error rate and latency
- [ ] Watch for user-reported issues in #support

### Short-term (Next 24 hours)
- [ ] Review deployment metrics
- [ ] Collect feedback from users
- [ ] Document any issues encountered
- [ ] Update runbook if needed

### Follow-up (Next week)
- [ ] Post-mortem if issues occurred
- [ ] Update deployment procedure based on lessons learned
- [ ] Plan performance improvements if needed
- [ ] Update documentation if system behavior changed

Writing Tips

Be Precise and Detailed

  • Exact commands to run (copy-paste ready)
  • Specific values (versions, endpoints, timeouts)
  • Expected outputs for verification
  • Time estimates for each step

Think About Edge Cases

  • What if something is already deployed?
  • What if a prerequisite is missing?
  • What if deployment partially succeeds?
  • What if rollback is needed?

Make Rollback Easy

  • Document rollback procedure clearly
  • Test rollback before using in production
  • Make rollback faster than forward deployment
  • Have quick communication plan for failures

Document Monitoring

  • What metrics indicate health?
  • What should we watch during deployment?
  • What thresholds trigger alerts?
  • How do we validate success?
  • Reference component specs: [CMP-001]
  • Reference design documents: [DES-001]
  • Reference operations runbooks

Validation & Fixing Issues

Run the Validator

scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-001-your-spec.md

Common Issues & Fixes

Issue: "Prerequisites section incomplete"

  • Fix: Add all required infrastructure, code, access, and approvals

Issue: "Step-by-step procedures lack detail"

  • Fix: Add actual commands, expected output, time estimates

Issue: "No rollback procedure"

  • Fix: Document how to revert deployment if issues arise

Issue: "Monitoring and troubleshooting missing"

  • Fix: Add success criteria, monitoring setup, and troubleshooting guide

Decision-Making Framework

When writing a deployment procedure:

  1. Prerequisites: What must be true before we start?

    • Infrastructure ready?
    • Code reviewed and tested?
    • Team trained?
    • Approvals gotten?
  2. Procedure: What are the exact steps?

    • Simple, repeatable steps?
    • Verification at each step?
    • Estimated timing?
  3. Safety: How do we prevent/catch issues?

    • Verification steps after each phase?
    • Rollback procedure?
    • Quick failure detection?
  4. Communication: Who needs to know what?

    • Stakeholders notified?
    • On-call monitoring?
    • Escalation path?
  5. Learning: How do we improve next time?

    • Monitoring enabled?
    • Runbook updated?
    • Issues documented?

Next Steps

  1. Create the spec: scripts/generate-spec.sh deployment-procedure deploy-XXX-slug
  2. Research: Find component specs and existing procedures
  3. Document prerequisites: What must be true before deployment?
  4. Write procedures: Step-by-step, with commands and verification
  5. Plan rollback: How do we undo this if needed?
  6. Validate: scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-XXX-slug.md
  7. Test procedure: Walk through it in staging environment
  8. Get team review before using in production