Initial commit

2025-11-30 08:45:31 +08:00
commit ca9b85ccda
35 changed files with 10784 additions and 0 deletions
--- a/skills/spec-author/guides/deployment-procedure.md
+++ b/skills/spec-author/guides/deployment-procedure.md
@@ -0,0 +1,561 @@
+# How to Create a Deployment Procedure Specification
+
+Deployment procedures document step-by-step instructions for deploying systems to production, including prerequisites, procedures, rollback, and troubleshooting.
+
+## Quick Start
+
+```bash
+# 1. Create a new deployment procedure
+scripts/generate-spec.sh deployment-procedure deploy-001-descriptive-slug
+
+# 2. Open and fill in the file
+# (The file will be created at: docs/specs/deployment-procedure/deploy-001-descriptive-slug.md)
+
+# 3. Fill in steps and checklists, then validate:
+scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-001-descriptive-slug.md
+
+# 4. Fix issues and check completeness:
+scripts/check-completeness.sh docs/specs/deployment-procedure/deploy-001-descriptive-slug.md
+```
+
+## When to Write a Deployment Procedure
+
+Use a Deployment Procedure when you need to:
+- Document how to deploy a new service or component
+- Ensure consistent, repeatable deployments
+- Provide runbooks for operations teams
+- Document rollback procedures for failures
+- Enable any team member to deploy safely
+- Create an audit trail of deployments
+
+## Research Phase
+
+### 1. Research Related Specifications
+Find what you're deploying:
+
+```bash
+# Find component specs
+grep -r "component" docs/specs/ --include="*.md"
+
+# Find design documents that mention infrastructure
+grep -r "design\|infrastructure" docs/specs/ --include="*.md"
+
+# Find existing deployment procedures
+grep -r "deploy" docs/specs/ --include="*.md"
+```
+
+### 2. Understand Your Infrastructure
+- What's the deployment target? (Kubernetes, serverless, VMs)
+- What infrastructure does this component need?
+- What access/permissions are required?
+- What monitoring must be in place?
+
+### 3. Review Past Deployments
+- How have similar components been deployed?
+- What issues arose? How were they resolved?
+- What worked well? What didn't?
+- Any patterns or templates to follow?
+
+## Structure & Content Guide
+
+### Title & Metadata
+- **Title**: "Export Service Deployment to Production", "Database Migration", etc.
+- **Component**: What's being deployed
+- **Target**: Production, staging, canary, etc.
+- **Owner**: Team responsible for deployment
+
+### Prerequisites Section
+
+Document what must be done before deployment:
+
+```markdown
+# Export Service Production Deployment
+
+## Prerequisites
+
+### Infrastructure Requirements
+- [ ] AWS resources provisioned (see [CMP-001] for details)
+  - [ ] ElastiCache Redis cluster (export-service-queue)
+  - [ ] RDS PostgreSQL instance (export-db)
+  - [ ] S3 bucket (export-files-prod)
+  - [ ] IAM roles and policies configured
+- [ ] Kubernetes cluster accessible
+  - [ ] kubectl configured with production cluster context
+  - [ ] Deployment manifests reviewed by tech lead
+  - [ ] Namespace `export-service-prod` created
+
+### Code & Build Requirements
+- [ ] All code merged to main branch
+- [ ] Code reviewed by 2+ senior engineers
+- [ ] All tests passing
+  - [ ] Unit tests (90%+ coverage)
+  - [ ] Integration tests
+  - [ ] Load tests pass at target throughput
+- [ ] Docker image built and pushed to ECR
+  - [ ] Image tagged with version (e.g., v1.2.3)
+  - [ ] Image scanned for vulnerabilities
+  - [ ] Image verified to work (manual test in staging)
+
+### Team & Access Requirements
+- [ ] Deployment lead identified (typically tech lead or on-call eng)
+- [ ] Access verified for:
+  - [ ] AWS console (ECR, S3, CloudWatch)
+  - [ ] Kubernetes cluster (kubectl access)
+  - [ ] Database (for running migrations if needed)
+  - [ ] Monitoring/alerting system (Grafana, PagerDuty)
+- [ ] Communication channel open (Slack, war room)
+- [ ] Runbook reviewed by both eng and ops team
+
+### Pre-Deployment Verification Checklist
+- [ ] Staging deployment successful (deployed 24+ hours ago, stable)
+- [ ] Monitoring in place and verified working
+- [ ] Rollback plan reviewed and tested
+- [ ] Emergency contacts identified
+- [ ] Stakeholders notified of deployment window
+- [ ] Change log prepared (what's new in this version)
+
+### Data/Database Requirements
+- [ ] Database schema compatible with new version
+  - [ ] Backward compatible (no breaking changes)
+  - [ ] Migrations tested in staging
+  - [ ] Rollback plan for migrations documented
+- [ ] No data conflicts or corruption risks
+- [ ] Backup created (if applicable)
+
+### Approval Checklist
+- [ ] Tech Lead: Code and approach approved
+- [ ] Product Owner: Feature approved, ready for launch
+- [ ] Operations Lead: Deployment plan reviewed
+- [ ] Security: Security review passed (if applicable)
+```
+
+### Deployment Steps Section
+
+Provide step-by-step instructions:
+
+```markdown
+## Deployment Procedure
+
+### Pre-Deployment (Validation Phase)
+
+**Step 1: Verify Prerequisites**
+- Command: Run pre-deployment checklist above
+- Verify: All items checked ✓
+- If any fail: Stop deployment, resolve issues
+- Time: ~15 minutes
+
+**Step 2: Create Deployment Record**
+- Document: Who is deploying, when, what version
+- Command: Log in to deployment tracking system
+- Entry:
+  ```
+  Deployment: export-service
+  Version: v1.2.3
+  Environment: production
+  Deployed By: Alice Smith
+  Time: 2024-01-15 14:30 UTC
+  Change Summary: Added bulk export feature, fixed queue processing
+  ```
+- Time: ~5 minutes
+
+### Deployment Phase
+
+**Step 3: Tag Database Migration (if applicable)**
+- Check: Are there schema changes in this version?
+- If YES:
+  ```bash
+  # SSH to database server
+  ssh -i ~/.ssh/prod.pem admin@db.example.com
+
+  # Run migrations
+  psql -U export_service -d export_service -c \
+    "ALTER TABLE exports ADD COLUMN retry_count INT DEFAULT 0;"
+
+  # Verify migration
+  psql -U export_service -d export_service -c \
+    "SELECT column_name FROM information_schema.columns WHERE table_name='exports';"
+  ```
+- If NO: Skip this step
+- Verify: All migrations complete without errors
+- Time: ~10 minutes
+
+**Step 4: Deploy to Kubernetes**
+- Verify: You're deploying to PRODUCTION cluster
+  ```bash
+  kubectl config current-context
+  # Should output: arn:aws:eks:us-east-1:123456789:cluster/prod
+  ```
+- If wrong context: STOP, switch to correct cluster
+- Deploy new image version:
+  ```bash
+  # Update deployment with new image
+  kubectl set image deployment/export-service \
+    export-service=123456789.dkr.ecr.us-east-1.amazonaws.com/export-service:v1.2.3 \
+    -n export-service-prod
+  ```
+- Verify: Deployment triggered
+  ```bash
+  kubectl rollout status deployment/export-service -n export-service-prod
+  ```
+- Wait: For all pods to become ready (typically 2-3 minutes)
+- Output should show: `deployment "export-service" successfully rolled out`
+- Time: ~5 minutes
+
+**Step 5: Verify Deployment Health**
+- Check: Pod status
+  ```bash
+  kubectl get pods -n export-service-prod
+  ```
+  - All pods should show `Running` status
+  - If any show `CrashLoopBackOff`: Stop deployment, investigate
+
+- Check: Service endpoints
+  ```bash
+  kubectl get svc export-service -n export-service-prod
+  ```
+  - Should show external IP/load balancer endpoint
+
+- Check: Logs for errors
+  ```bash
+  kubectl logs -n export-service-prod -l app=export-service --tail=50
+  ```
+  - Should show startup logs, no ERROR level messages
+  - If errors present: Check Step 6 for rollback
+
+- Check: Health endpoints
+  ```bash
+  curl https://api.example.com/health
+  ```
+  - Should return 200 OK
+  - If not: Service may still be starting (wait 30s and retry)
+
+- Time: ~5 minutes
+
+### Post-Deployment (Verification Phase)
+
+**Step 6: Monitor Metrics**
+- Open: Grafana dashboard for export-service
+- Check: Key metrics for 5 minutes
+  - Request latency: Should be stable (< 100ms p95)
+  - Error rate: Should remain < 0.1%
+  - CPU/Memory: Should be within normal ranges
+  - Queue depth: Should process jobs smoothly
+- Look for: Any sudden spikes or anomalies
+- If anomalies: Proceed to rollback (Step 8)
+- Time: ~5 minutes
+
+**Step 7: Functional Testing**
+- Manual test: Create export via API
+  ```bash
+  curl -X POST https://api.example.com/exports \
+    -H "Authorization: Bearer $TOKEN" \
+    -H "Content-Type: application/json" \
+    -d '{
+      "format": "csv",
+      "data_types": ["users"]
+    }'
+  ```
+- Response: Should return 201 Created with export_id
+- Check status:
+  ```bash
+  curl https://api.example.com/exports/{export_id} \
+    -H "Authorization: Bearer $TOKEN"
+  ```
+- Verify: Status transitions from queued → processing → completed
+- Download: Successfully download export file
+- Verify: File contents correct
+- If any step fails: Proceed to rollback (Step 8)
+- Time: ~5 minutes
+
+**Step 8: Notify Stakeholders**
+- Update: Deployment tracking system
+  ```
+  Status: DEPLOYED
+  Completion Time: 14:45 UTC
+  Health: ✓ All checks passed
+  Metrics: ✓ Stable
+  Functional Tests: ✓ Passed
+  ```
+- Announce: Slack to #product-eng
+  ```
+  @channel Export Service v1.2.3 deployed to production.
+  New feature: Bulk data exports now available.
+  Status: Monitoring.
+  ```
+- Notify: On-call engineer (monitoring for 2 hours post-deployment)
+
+### Rollback Procedure (If Issues Found)
+
+**Step 8: Rollback (Only if Step 6 or 7 fail)**
+- Decision: Is deployment safe to continue?
+  - YES → All checks pass, monitoring is good → Release complete
+  - NO → Issues found → Proceed with rollback
+
+- Execute rollback:
+  ```bash
+  # Revert to previous version
+  kubectl rollout undo deployment/export-service -n export-service-prod
+
+  # Verify rollback in progress
+  kubectl rollout status deployment/export-service -n export-service-prod
+
+  # Wait for rollback to complete
+  ```
+
+- Verify rollback successful:
+  ```bash
+  # Check current image
+  kubectl describe deployment export-service -n export-service-prod | grep Image
+
+  # Should show previous version (e.g., v1.2.2)
+
+  # Verify service responding
+  curl https://api.example.com/health
+  ```
+
+- Notify: Update stakeholders
+  ```
+  @channel Deployment rolled back due to [specific reason].
+  Current version: v1.2.2 (stable)
+  Investigating issue. Will retry deployment tomorrow.
+  ```
+
+- Document: Root cause analysis
+  - What went wrong?
+  - Why wasn't it caught in staging?
+  - How do we prevent this next time?
+
+- Time: ~10 minutes
+```
+
+### Success Criteria Section
+
+```markdown
+## Deployment Success Criteria
+
+The deployment is successful if ALL of these are true:
+
+### Technical Criteria
+- [ ] All pods running and healthy (0 CrashLoopBackOff)
+- [ ] Service responding to health checks (200 OK)
+- [ ] Metrics showing normal values (no spikes)
+- [ ] Error rate < 0.1% (< 1 error per 1000 requests)
+- [ ] Response latency p95 < 100ms
+- [ ] No errors in application logs
+
+### Functional Criteria
+- [ ] Export API responds to requests
+- [ ] Export jobs queue successfully
+- [ ] Jobs process and complete
+- [ ] Files upload to S3 correctly
+- [ ] Users can download exported files
+- [ ] File contents verified correct
+
+### Operational Criteria
+- [ ] Monitoring active and receiving metrics
+- [ ] Alerting working (test alert fired)
+- [ ] Logs aggregated and searchable
+- [ ] Runbook tested and functional
+- [ ] Team confident in operating system
+```
+
+### Monitoring & Alerting Section
+
+```markdown
+## Monitoring Setup
+
+### Critical Alerts (Page on-call)
+- Service down (health check fails)
+- Error rate > 1% for 5 minutes
+- Response latency p95 > 500ms for 5 minutes
+- Queue depth > 1000 for 10 minutes
+
+### Warning Alerts (Slack notification)
+- Error rate > 0.5% for 5 minutes
+- CPU > 80% for 10 minutes
+- Memory > 85% for 10 minutes
+- Export job timeout increasing
+
+### Dashboard
+- Service: export-service-prod
+- Metrics: Latency, errors, throughput, queue depth
+- Time range: Last 24 hours by default
+- Alerts: Show current alert status
+```
+
+### Troubleshooting Section
+
+```markdown
+## Troubleshooting Common Issues
+
+### Issue: Pods stuck in CrashLoopBackOff
+**Symptoms**: Pods repeatedly crash and restart
+**Diagnosis**:
+```bash
+# Check logs for errors
+kubectl logs <pod-name> -n export-service-prod
+```
+**Common Causes**:
+- Configuration error (check environment variables)
+- Database connection failed (check credentials)
+- Out of memory (check resource limits)
+**Fix**: Review logs, check prerequisites, rollback if unclear
+
+### Issue: Response latency spiking
+**Symptoms**: p95 latency > 200ms, users report slow exports
+**Diagnosis**:
+```bash
+# Check queue depth
+kubectl exec -it <worker-pod> -n export-service-prod \
+  -- redis-cli -h redis.example.com LLEN export-queue
+```
+**Common Causes**:
+- Too many concurrent exports (queue backlog)
+- Database slow (check queries, indexes)
+- Network issues (check connectivity)
+**Fix**: Scale up workers, check database performance, verify network
+
+### Issue: Export jobs failing
+**Symptoms**: Job status shows `failed`, users can't export
+**Diagnosis**:
+```bash
+# Check worker logs
+kubectl logs -n export-service-prod -l app=export-service
+```
+**Common Causes**:
+- S3 upload failing (check permissions, bucket exists)
+- Database query error (schema mismatch)
+- User doesn't have data to export
+**Fix**: Review logs, verify S3 access, check schema version
+
+### Issue: Database migration failed
+**Symptoms**: Service won't start after deployment
+**Diagnosis**:
+```bash
+# Check migration logs
+psql -U export_service -d export_service -c \
+  "SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 5;"
+```
+**Recovery**:
+1. Identify failed migration
+2. Rollback deployment (revert to previous version)
+3. Debug migration issue in staging
+4. Retry deployment after fix
+```
+
+### Post-Deployment Actions Section
+
+```markdown
+## After Deployment
+
+### Immediate (Next 2 hours)
+- [ ] On-call engineer monitoring
+- [ ] Check metrics every 15 minutes
+- [ ] Monitor error rate and latency
+- [ ] Watch for user-reported issues in #support
+
+### Short-term (Next 24 hours)
+- [ ] Review deployment metrics
+- [ ] Collect feedback from users
+- [ ] Document any issues encountered
+- [ ] Update runbook if needed
+
+### Follow-up (Next week)
+- [ ] Post-mortem if issues occurred
+- [ ] Update deployment procedure based on lessons learned
+- [ ] Plan performance improvements if needed
+- [ ] Update documentation if system behavior changed
+```
+
+## Writing Tips
+
+### Be Precise and Detailed
+- Exact commands to run (copy-paste ready)
+- Specific values (versions, endpoints, timeouts)
+- Expected outputs for verification
+- Time estimates for each step
+
+### Think About Edge Cases
+- What if something is already deployed?
+- What if a prerequisite is missing?
+- What if deployment partially succeeds?
+- What if rollback is needed?
+
+### Make Rollback Easy
+- Document rollback procedure clearly
+- Test rollback before using in production
+- Make rollback faster than forward deployment
+- Have quick communication plan for failures
+
+### Document Monitoring
+- What metrics indicate health?
+- What should we watch during deployment?
+- What thresholds trigger alerts?
+- How do we validate success?
+
+### Link to Related Specs
+- Reference component specs: `[CMP-001]`
+- Reference design documents: `[DES-001]`
+- Reference operations runbooks
+
+## Validation & Fixing Issues
+
+### Run the Validator
+```bash
+scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-001-your-spec.md
+```
+
+### Common Issues & Fixes
+
+**Issue**: "Prerequisites section incomplete"
+- **Fix**: Add all required infrastructure, code, access, and approvals
+
+**Issue**: "Step-by-step procedures lack detail"
+- **Fix**: Add actual commands, expected output, time estimates
+
+**Issue**: "No rollback procedure"
+- **Fix**: Document how to revert deployment if issues arise
+
+**Issue**: "Monitoring and troubleshooting missing"
+- **Fix**: Add success criteria, monitoring setup, and troubleshooting guide
+
+## Decision-Making Framework
+
+When writing a deployment procedure:
+
+1. **Prerequisites**: What must be true before we start?
+   - Infrastructure ready?
+   - Code reviewed and tested?
+   - Team trained?
+   - Approvals gotten?
+
+2. **Procedure**: What are the exact steps?
+   - Simple, repeatable steps?
+   - Verification at each step?
+   - Estimated timing?
+
+3. **Safety**: How do we prevent/catch issues?
+   - Verification steps after each phase?
+   - Rollback procedure?
+   - Quick failure detection?
+
+4. **Communication**: Who needs to know what?
+   - Stakeholders notified?
+   - On-call monitoring?
+   - Escalation path?
+
+5. **Learning**: How do we improve next time?
+   - Monitoring enabled?
+   - Runbook updated?
+   - Issues documented?
+
+## Next Steps
+
+1. **Create the spec**: `scripts/generate-spec.sh deployment-procedure deploy-XXX-slug`
+2. **Research**: Find component specs and existing procedures
+3. **Document prerequisites**: What must be true before deployment?
+4. **Write procedures**: Step-by-step, with commands and verification
+5. **Plan rollback**: How do we undo this if needed?
+6. **Validate**: `scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-XXX-slug.md`
+7. **Test procedure**: Walk through it in staging environment
+8. **Get team review** before using in production