Initial commit
skills/spec-author/guides/deployment-procedure.md
# How to Create a Deployment Procedure Specification

A deployment procedure documents the step-by-step instructions for deploying a system to production, including prerequisites, deployment steps, rollback, and troubleshooting.

## Quick Start

```bash
# 1. Create a new deployment procedure
scripts/generate-spec.sh deployment-procedure deploy-001-descriptive-slug

# 2. Open and fill in the file
# (The file will be created at: docs/specs/deployment-procedure/deploy-001-descriptive-slug.md)

# 3. Fill in steps and checklists, then validate:
scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-001-descriptive-slug.md

# 4. Fix issues and check completeness:
scripts/check-completeness.sh docs/specs/deployment-procedure/deploy-001-descriptive-slug.md
```

## When to Write a Deployment Procedure

Use a Deployment Procedure when you need to:
- Document how to deploy a new service or component
- Ensure consistent, repeatable deployments
- Provide runbooks for operations teams
- Document rollback procedures for failures
- Enable any team member to deploy safely
- Create an audit trail of deployments

## Research Phase

### 1. Research Related Specifications
Find what you're deploying:

```bash
# Find component specs
grep -r "component" docs/specs/ --include="*.md"

# Find design documents that mention infrastructure
# (-E enables extended regex, so "|" works as alternation)
grep -rE "design|infrastructure" docs/specs/ --include="*.md"

# Find existing deployment procedures
grep -r "deploy" docs/specs/ --include="*.md"
```

### 2. Understand Your Infrastructure
- What's the deployment target? (Kubernetes, serverless, VMs)
- What infrastructure does this component need?
- What access/permissions are required?
- What monitoring must be in place?

### 3. Review Past Deployments
- How have similar components been deployed?
- What issues arose? How were they resolved?
- What worked well? What didn't?
- Any patterns or templates to follow?

## Structure & Content Guide

### Title & Metadata
- **Title**: "Export Service Deployment to Production", "Database Migration", etc.
- **Component**: What's being deployed
- **Target**: Production, staging, canary, etc.
- **Owner**: Team responsible for deployment

### Prerequisites Section

Document what must be done before deployment:

```markdown
# Export Service Production Deployment

## Prerequisites

### Infrastructure Requirements
- [ ] AWS resources provisioned (see [CMP-001] for details)
  - [ ] ElastiCache Redis cluster (export-service-queue)
  - [ ] RDS PostgreSQL instance (export-db)
  - [ ] S3 bucket (export-files-prod)
  - [ ] IAM roles and policies configured
- [ ] Kubernetes cluster accessible
  - [ ] kubectl configured with production cluster context
  - [ ] Deployment manifests reviewed by tech lead
  - [ ] Namespace `export-service-prod` created

### Code & Build Requirements
- [ ] All code merged to main branch
- [ ] Code reviewed by 2+ senior engineers
- [ ] All tests passing
  - [ ] Unit tests (90%+ coverage)
  - [ ] Integration tests
  - [ ] Load tests pass at target throughput
- [ ] Docker image built and pushed to ECR
  - [ ] Image tagged with version (e.g., v1.2.3)
  - [ ] Image scanned for vulnerabilities
  - [ ] Image verified to work (manual test in staging)

### Team & Access Requirements
- [ ] Deployment lead identified (typically tech lead or on-call eng)
- [ ] Access verified for:
  - [ ] AWS console (ECR, S3, CloudWatch)
  - [ ] Kubernetes cluster (kubectl access)
  - [ ] Database (for running migrations if needed)
  - [ ] Monitoring/alerting system (Grafana, PagerDuty)
- [ ] Communication channel open (Slack, war room)
- [ ] Runbook reviewed by both eng and ops team

### Pre-Deployment Verification Checklist
- [ ] Staging deployment successful (deployed 24+ hours ago, stable)
- [ ] Monitoring in place and verified working
- [ ] Rollback plan reviewed and tested
- [ ] Emergency contacts identified
- [ ] Stakeholders notified of deployment window
- [ ] Change log prepared (what's new in this version)

### Data/Database Requirements
- [ ] Database schema compatible with new version
  - [ ] Backward compatible (no breaking changes)
  - [ ] Migrations tested in staging
  - [ ] Rollback plan for migrations documented
- [ ] No data conflicts or corruption risks
- [ ] Backup created (if applicable)

### Approval Checklist
- [ ] Tech Lead: Code and approach approved
- [ ] Product Owner: Feature approved, ready for launch
- [ ] Operations Lead: Deployment plan reviewed
- [ ] Security: Security review passed (if applicable)
```
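A checklist like this can also be gated mechanically before Step 1 of the procedure. The following is a minimal sketch, assuming the spec keeps the `- [ ]` / `- [x]` checkbox syntax shown in the template; the helper name `count_unchecked` is hypothetical and not part of the spec tooling.

```bash
# Hypothetical helper: count unchecked "- [ ]" items in a spec file.
# Assumes checkboxes use the "- [ ]" syntax, optionally indented.
count_unchecked() {
  local spec_file="$1"
  # grep -c prints the number of matching lines. It exits non-zero when
  # that number is 0, so "|| true" keeps callers that use "set -e" alive.
  grep -c -E '^[[:space:]]*- \[ \]' "$spec_file" || true
}
```

A deployment script could then refuse to continue unless `count_unchecked <spec-file>` prints `0`.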

### Deployment Steps Section

Provide step-by-step instructions:

````markdown
## Deployment Procedure

### Pre-Deployment (Validation Phase)

**Step 1: Verify Prerequisites**
- Command: Run pre-deployment checklist above
- Verify: All items checked ✓
- If any fail: Stop deployment, resolve issues
- Time: ~15 minutes

**Step 2: Create Deployment Record**
- Document: Who is deploying, when, what version
- Command: Log in to deployment tracking system
- Entry:
  ```
  Deployment: export-service
  Version: v1.2.3
  Environment: production
  Deployed By: Alice Smith
  Time: 2024-01-15 14:30 UTC
  Change Summary: Added bulk export feature, fixed queue processing
  ```
- Time: ~5 minutes

### Deployment Phase

**Step 3: Run Database Migrations (if applicable)**
- Check: Are there schema changes in this version?
- If YES:
  ```bash
  # SSH to database server
  ssh -i ~/.ssh/prod.pem admin@db.example.com

  # Run migrations
  psql -U export_service -d export_service -c \
    "ALTER TABLE exports ADD COLUMN retry_count INT DEFAULT 0;"

  # Verify migration
  psql -U export_service -d export_service -c \
    "SELECT column_name FROM information_schema.columns WHERE table_name='exports';"
  ```
- If NO: Skip this step
- Verify: All migrations complete without errors
- Time: ~10 minutes

**Step 4: Deploy to Kubernetes**
- Verify: You're deploying to PRODUCTION cluster
  ```bash
  kubectl config current-context
  # Should output: arn:aws:eks:us-east-1:123456789012:cluster/prod
  ```
- If wrong context: STOP, switch to correct cluster
- Deploy new image version:
  ```bash
  # Update deployment with new image
  kubectl set image deployment/export-service \
    export-service=123456789012.dkr.ecr.us-east-1.amazonaws.com/export-service:v1.2.3 \
    -n export-service-prod
  ```
- Verify: Deployment triggered
  ```bash
  kubectl rollout status deployment/export-service -n export-service-prod
  ```
- Wait: For all pods to become ready (typically 2-3 minutes)
- Output should show: `deployment "export-service" successfully rolled out`
- Time: ~5 minutes

**Step 5: Verify Deployment Health**
- Check: Pod status
  ```bash
  kubectl get pods -n export-service-prod
  ```
  - All pods should show `Running` status
  - If any show `CrashLoopBackOff`: Stop deployment, investigate

- Check: Service endpoints
  ```bash
  kubectl get svc export-service -n export-service-prod
  ```
  - Should show external IP/load balancer endpoint

- Check: Logs for errors
  ```bash
  kubectl logs -n export-service-prod -l app=export-service --tail=50
  ```
  - Should show startup logs, no ERROR level messages
  - If errors present: Proceed to rollback (Step 9)

- Check: Health endpoints
  ```bash
  curl https://api.example.com/health
  ```
  - Should return 200 OK
  - If not: Service may still be starting (wait 30s and retry)

- Time: ~5 minutes

### Post-Deployment (Verification Phase)

**Step 6: Monitor Metrics**
- Open: Grafana dashboard for export-service
- Check: Key metrics for 5 minutes
  - Request latency: Should be stable (< 100ms p95)
  - Error rate: Should remain < 0.1%
  - CPU/Memory: Should be within normal ranges
  - Queue depth: Should process jobs smoothly
- Look for: Any sudden spikes or anomalies
- If anomalies: Proceed to rollback (Step 9)
- Time: ~5 minutes

**Step 7: Functional Testing**
- Manual test: Create export via API
  ```bash
  curl -X POST https://api.example.com/exports \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{
      "format": "csv",
      "data_types": ["users"]
    }'
  ```
- Response: Should return 201 Created with export_id
- Check status:
  ```bash
  curl https://api.example.com/exports/{export_id} \
    -H "Authorization: Bearer $TOKEN"
  ```
- Verify: Status transitions from queued → processing → completed
- Download: Successfully download export file
- Verify: File contents correct
- If any step fails: Proceed to rollback (Step 9)
- Time: ~5 minutes

**Step 8: Notify Stakeholders**
- Update: Deployment tracking system
  ```
  Status: DEPLOYED
  Completion Time: 14:45 UTC
  Health: ✓ All checks passed
  Metrics: ✓ Stable
  Functional Tests: ✓ Passed
  ```
- Announce: Slack to #product-eng
  ```
  @channel Export Service v1.2.3 deployed to production.
  New feature: Bulk data exports now available.
  Status: Monitoring.
  ```
- Notify: On-call engineer (monitoring for 2 hours post-deployment)

### Rollback Procedure (If Issues Found)

**Step 9: Rollback (Only if Step 5, 6, or 7 fails)**
- Decision: Is deployment safe to continue?
  - YES → All checks pass, monitoring is good → Release complete
  - NO → Issues found → Proceed with rollback

- Execute rollback:
  ```bash
  # Revert to previous version
  kubectl rollout undo deployment/export-service -n export-service-prod

  # Watch the rollback; this command blocks until it completes
  kubectl rollout status deployment/export-service -n export-service-prod
  ```

- Verify rollback successful:
  ```bash
  # Check current image
  kubectl describe deployment export-service -n export-service-prod | grep Image

  # Should show previous version (e.g., v1.2.2)

  # Verify service responding
  curl https://api.example.com/health
  ```

- Notify: Update stakeholders
  ```
  @channel Deployment rolled back due to [specific reason].
  Current version: v1.2.2 (stable)
  Investigating issue. Will retry deployment tomorrow.
  ```

- Document: Root cause analysis
  - What went wrong?
  - Why wasn't it caught in staging?
  - How do we prevent this next time?

- Time: ~10 minutes
````
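Two of the manual checks in the template, confirming the cluster context before deploying and retrying the health check while the service starts, are easy to get wrong under pressure. The sketch below shows one way to wrap them in small shell functions. The names `require_context` and `retry` are hypothetical helpers, and the expected context value is whatever your production cluster reports.

```bash
# Hypothetical guard: refuse to continue when the current context is not
# the expected one. Both values are passed in, so nothing is hard-coded.
require_context() {
  local expected="$1" actual="$2"
  if [ "$actual" != "$expected" ]; then
    echo "STOP: context is '$actual', expected '$expected'" >&2
    return 1
  fi
}

# Hypothetical retry wrapper: run a command until it succeeds, making at
# most $1 attempts and sleeping $2 seconds between attempts.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}
```

With these in place, the context check could read `require_context "$PROD_CONTEXT" "$(kubectl config current-context)"`, where `PROD_CONTEXT` is an assumed variable holding the expected production context, and the health check could read `retry 5 30 curl -fsS -o /dev/null https://api.example.com/health`.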

### Success Criteria Section

```markdown
## Deployment Success Criteria

The deployment is successful if ALL of these are true:

### Technical Criteria
- [ ] All pods running and healthy (0 CrashLoopBackOff)
- [ ] Service responding to health checks (200 OK)
- [ ] Metrics showing normal values (no spikes)
- [ ] Error rate < 0.1% (< 1 error per 1000 requests)
- [ ] Response latency p95 < 100ms
- [ ] No errors in application logs

### Functional Criteria
- [ ] Export API responds to requests
- [ ] Export jobs queue successfully
- [ ] Jobs process and complete
- [ ] Files upload to S3 correctly
- [ ] Users can download exported files
- [ ] File contents verified correct

### Operational Criteria
- [ ] Monitoring active and receiving metrics
- [ ] Alerting working (test alert fired)
- [ ] Logs aggregated and searchable
- [ ] Runbook tested and functional
- [ ] Team confident in operating system
```
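Numeric criteria such as the error-rate and latency thresholds in this template can be evaluated by a script rather than by eye. The sketch below is one possible check, assuming the two values have already been read from your monitoring system (how they are fetched is out of scope here). The function name `meets_success_criteria` is hypothetical.

```bash
# Hypothetical check: succeed only if both metrics meet the example
# criteria (error rate below 0.1 percent, p95 latency below 100 ms).
# Arguments: error rate in percent, p95 latency in milliseconds.
meets_success_criteria() {
  local error_rate_pct="$1" p95_ms="$2"
  # awk performs the floating-point comparison; exit status 0 means
  # both criteria are met.
  awk -v e="$error_rate_pct" -v l="$p95_ms" \
    'BEGIN { exit !((e + 0 < 0.1) && (l + 0 < 100)) }'
}
```

For example, `meets_success_criteria 0.05 80` succeeds, while `meets_success_criteria 0.05 100` fails because the latency criterion is a strict less-than.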

### Monitoring & Alerting Section

```markdown
## Monitoring Setup

### Critical Alerts (Page on-call)
- Service down (health check fails)
- Error rate > 1% for 5 minutes
- Response latency p95 > 500ms for 5 minutes
- Queue depth > 1000 for 10 minutes

### Warning Alerts (Slack notification)
- Error rate > 0.5% for 5 minutes
- CPU > 80% for 10 minutes
- Memory > 85% for 10 minutes
- Export job timeout increasing

### Dashboard
- Service: export-service-prod
- Metrics: Latency, errors, throughput, queue depth
- Time range: Last 24 hours by default
- Alerts: Show current alert status
```
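To make the two alert tiers concrete, the sketch below classifies a single error-rate sample against the example thresholds in the template. It is an illustration only: the function name `error_rate_severity` is hypothetical, and the "for 5 minutes" duration condition, which a real alerting system would evaluate, is not modeled.

```bash
# Hypothetical classifier for the example error-rate thresholds.
# Prints "critical" (above 1 percent), "warning" (above 0.5 percent),
# or "ok". Argument: error rate in percent.
error_rate_severity() {
  awk -v e="$1" 'BEGIN {
    if (e + 0 > 1) {
      print "critical"
    } else if (e + 0 > 0.5) {
      print "warning"
    } else {
      print "ok"
    }
  }'
}
```

The critical tier is checked first so that a sample above 1% is never reported as a mere warning.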

### Troubleshooting Section

````markdown
## Troubleshooting Common Issues

### Issue: Pods stuck in CrashLoopBackOff
**Symptoms**: Pods repeatedly crash and restart
**Diagnosis**:
```bash
# Check logs for errors
kubectl logs <pod-name> -n export-service-prod
```
**Common Causes**:
- Configuration error (check environment variables)
- Database connection failed (check credentials)
- Out of memory (check resource limits)
**Fix**: Review logs, check prerequisites, rollback if unclear

### Issue: Response latency spiking
**Symptoms**: p95 latency > 200ms, users report slow exports
**Diagnosis**:
```bash
# Check queue depth
kubectl exec -it <worker-pod> -n export-service-prod \
  -- redis-cli -h redis.example.com LLEN export-queue
```
**Common Causes**:
- Too many concurrent exports (queue backlog)
- Database slow (check queries, indexes)
- Network issues (check connectivity)
**Fix**: Scale up workers, check database performance, verify network

### Issue: Export jobs failing
**Symptoms**: Job status shows `failed`, users can't export
**Diagnosis**:
```bash
# Check worker logs
kubectl logs -n export-service-prod -l app=export-service
```
**Common Causes**:
- S3 upload failing (check permissions, bucket exists)
- Database query error (schema mismatch)
- User doesn't have data to export
**Fix**: Review logs, verify S3 access, check schema version

### Issue: Database migration failed
**Symptoms**: Service won't start after deployment
**Diagnosis**:
```bash
# Check migration logs
psql -U export_service -d export_service -c \
  "SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 5;"
```
**Recovery**:
1. Identify failed migration
2. Rollback deployment (revert to previous version)
3. Debug migration issue in staging
4. Retry deployment after fix
````
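When many pods are running, picking out the unhealthy ones by eye is slow. The filter below is a small sketch that reads `kubectl get pods` output and lists the pods that need attention. It assumes the default column order (NAME, READY, STATUS, RESTARTS, AGE), and the function name `unhealthy_pods` is hypothetical.

```bash
# Hypothetical filter: read "kubectl get pods" output on stdin and print
# the names of pods whose STATUS column is anything other than "Running".
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  # NR > 1 skips the header row; $3 is the STATUS column.
  awk 'NR > 1 && $3 != "Running" { print $1 }'
}
```

Typical use would be `kubectl get pods -n export-service-prod | unhealthy_pods`. Note that pods in a `Completed` state are also listed, which may or may not be what you want for job-style workloads.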

### Post-Deployment Actions Section

```markdown
## After Deployment

### Immediate (Next 2 hours)
- [ ] On-call engineer monitoring
- [ ] Check metrics every 15 minutes
- [ ] Monitor error rate and latency
- [ ] Watch for user-reported issues in #support

### Short-term (Next 24 hours)
- [ ] Review deployment metrics
- [ ] Collect feedback from users
- [ ] Document any issues encountered
- [ ] Update runbook if needed

### Follow-up (Next week)
- [ ] Post-mortem if issues occurred
- [ ] Update deployment procedure based on lessons learned
- [ ] Plan performance improvements if needed
- [ ] Update documentation if system behavior changed
```
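A recurring item such as "check metrics every 15 minutes" can be backed by a simple scheduled check so that nobody has to watch the clock. The sketch below runs any check command a fixed number of times at a fixed interval and stops at the first failure. The name `monitor_window` is hypothetical, and the check command itself is whatever your team already trusts.

```bash
# Hypothetical scheduler: run a check command $1 times, $2 seconds apart,
# and stop at the first failing check.
monitor_window() {
  local checks="$1" interval="$2" i=1
  shift 2
  while [ "$i" -le "$checks" ]; do
    if ! "$@"; then
      echo "check $i of $checks failed" >&2
      return 1
    fi
    # Sleep between checks, but not after the last one.
    if [ "$i" -lt "$checks" ]; then
      sleep "$interval"
    fi
    i=$((i + 1))
  done
}
```

For instance, `monitor_window 9 900 curl -fsS -o /dev/null https://api.example.com/health` would probe the health endpoint nine times, 15 minutes apart, covering a two-hour window.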

## Writing Tips

### Be Precise and Detailed
- Exact commands to run (copy-paste ready)
- Specific values (versions, endpoints, timeouts)
- Expected outputs for verification
- Time estimates for each step

### Think About Edge Cases
- What if something is already deployed?
- What if a prerequisite is missing?
- What if deployment partially succeeds?
- What if rollback is needed?

### Make Rollback Easy
- Document rollback procedure clearly
- Test rollback before using in production
- Make rollback faster than forward deployment
- Have quick communication plan for failures

### Document Monitoring
- What metrics indicate health?
- What should we watch during deployment?
- What thresholds trigger alerts?
- How do we validate success?

### Link to Related Specs
- Reference component specs: `[CMP-001]`
- Reference design documents: `[DES-001]`
- Reference operations runbooks

## Validation & Fixing Issues

### Run the Validator
```bash
scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-001-your-spec.md
```
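The validator and the completeness check from the Quick Start are usually run back to back, so it can help to chain them. The wrapper below is a minimal sketch, assuming both scripts exit with a non-zero status when they find problems (this guide does not state their exit codes). The function name `check_spec` is hypothetical.

```bash
# Hypothetical wrapper: run the validator, then the completeness check,
# stopping at the first failure. The scripts directory defaults to
# "scripts", matching the paths used in this guide.
check_spec() {
  local spec="$1" scripts_dir="${2:-scripts}"
  "$scripts_dir/validate-spec.sh" "$spec" &&
    "$scripts_dir/check-completeness.sh" "$spec"
}
```

Running `check_spec docs/specs/deployment-procedure/deploy-001-your-spec.md` from the repository root would then report the first failing stage through its exit status.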

### Common Issues & Fixes

**Issue**: "Prerequisites section incomplete"
- **Fix**: Add all required infrastructure, code, access, and approvals

**Issue**: "Step-by-step procedures lack detail"
- **Fix**: Add actual commands, expected output, time estimates

**Issue**: "No rollback procedure"
- **Fix**: Document how to revert deployment if issues arise

**Issue**: "Monitoring and troubleshooting missing"
- **Fix**: Add success criteria, monitoring setup, and troubleshooting guide

## Decision-Making Framework

When writing a deployment procedure:

1. **Prerequisites**: What must be true before we start?
   - Infrastructure ready?
   - Code reviewed and tested?
   - Team trained?
   - Approvals obtained?

2. **Procedure**: What are the exact steps?
   - Simple, repeatable steps?
   - Verification at each step?
   - Estimated timing?

3. **Safety**: How do we prevent/catch issues?
   - Verification steps after each phase?
   - Rollback procedure?
   - Quick failure detection?

4. **Communication**: Who needs to know what?
   - Stakeholders notified?
   - On-call monitoring?
   - Escalation path?

5. **Learning**: How do we improve next time?
   - Monitoring enabled?
   - Runbook updated?
   - Issues documented?

## Next Steps

1. **Create the spec**: `scripts/generate-spec.sh deployment-procedure deploy-XXX-slug`
2. **Research**: Find component specs and existing procedures
3. **Document prerequisites**: What must be true before deployment?
4. **Write procedures**: Step-by-step, with commands and verification
5. **Plan rollback**: How do we undo this if needed?
6. **Validate**: `scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-XXX-slug.md`
7. **Test procedure**: Walk through it in staging environment
8. **Get team review** before using in production