zhongwei/gh-onezerocompany-claude-project-basics

Zhongwei Li ca9b85ccda Initial commit

2025-11-30 08:45:31 +08:00

How to Create a Deployment Procedure Specification

A deployment procedure documents the step-by-step instructions for deploying a system to production, including prerequisites, deployment steps, rollback, and troubleshooting.

Quick Start

# 1. Create a new deployment procedure
scripts/generate-spec.sh deployment-procedure deploy-001-descriptive-slug

# 2. Open and fill in the file
# (The file will be created at: docs/specs/deployment-procedure/deploy-001-descriptive-slug.md)

# 3. Fill in steps and checklists, then validate:
scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-001-descriptive-slug.md

# 4. Fix issues and check completeness:
scripts/check-completeness.sh docs/specs/deployment-procedure/deploy-001-descriptive-slug.md
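
The Quick Start commands all depend on the slug and the path derived from it. The sketch below shows one way to check a slug before generating a spec. The function names are hypothetical, and the three-digit `deploy-NNN-slug` rule is an assumption read off the examples in this guide, not a documented rule of the repository scripts.

```bash
# Hypothetical helper: checks a slug against the deploy-NNN-slug shape
# used in the examples (the three-digit rule is an assumption).
valid_deploy_slug() {
  case "$1" in
    deploy-[0-9][0-9][0-9]-?*) ;;
    *) return 1 ;;
  esac
  # Reject anything other than lowercase letters, digits, and hyphens.
  case "$1" in
    *[!a-z0-9-]*) return 1 ;;
  esac
  return 0
}

# Hypothetical helper: builds the spec path shown in step 2.
spec_path() {
  printf 'docs/specs/deployment-procedure/%s.md\n' "$1"
}

# Usage (assumes the Quick Start scripts exist in the repository):
#   slug=deploy-001-descriptive-slug
#   valid_deploy_slug "$slug" && scripts/validate-spec.sh "$(spec_path "$slug")"
```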

When to Write a Deployment Procedure

Use a Deployment Procedure when you need to:

Document how to deploy a new service or component
Ensure consistent, repeatable deployments
Provide runbooks for operations teams
Document rollback procedures for failures
Enable any team member to deploy safely
Create an audit trail of deployments

Research Phase

1. Research Related Specifications

Find what you're deploying:

# Find component specs
grep -r "component" docs/specs/ --include="*.md"

# Find design documents that mention infrastructure
grep -r "design\|infrastructure" docs/specs/ --include="*.md"

# Find existing deployment procedures
grep -r "deploy" docs/specs/ --include="*.md"

2. Understand Your Infrastructure

What's the deployment target? (Kubernetes, serverless, VMs)
What infrastructure does this component need?
What access/permissions are required?
What monitoring must be in place?

3. Review Past Deployments

How have similar components been deployed?
What issues arose? How were they resolved?
What worked well? What didn't?
Any patterns or templates to follow?

Structure & Content Guide

Title & Metadata

Title: "Export Service Deployment to Production", "Database Migration", etc.
Component: What's being deployed
Target: Production, staging, canary, etc.
Owner: Team responsible for deployment

Prerequisites Section

Document what must be done before deployment:

# Export Service Production Deployment

## Prerequisites

### Infrastructure Requirements
- [ ] AWS resources provisioned (see [CMP-001] for details)
  - [ ] ElastiCache Redis cluster (export-service-queue)
  - [ ] RDS PostgreSQL instance (export-db)
  - [ ] S3 bucket (export-files-prod)
  - [ ] IAM roles and policies configured
- [ ] Kubernetes cluster accessible
  - [ ] kubectl configured with production cluster context
  - [ ] Deployment manifests reviewed by tech lead
  - [ ] Namespace `export-service-prod` created

### Code & Build Requirements
- [ ] All code merged to main branch
- [ ] Code reviewed by 2+ senior engineers
- [ ] All tests passing
  - [ ] Unit tests (90%+ coverage)
  - [ ] Integration tests
  - [ ] Load tests pass at target throughput
- [ ] Docker image built and pushed to ECR
  - [ ] Image tagged with version (e.g., v1.2.3)
  - [ ] Image scanned for vulnerabilities
  - [ ] Image verified to work (manual test in staging)

### Team & Access Requirements
- [ ] Deployment lead identified (typically tech lead or on-call eng)
- [ ] Access verified for:
  - [ ] AWS console (ECR, S3, CloudWatch)
  - [ ] Kubernetes cluster (kubectl access)
  - [ ] Database (for running migrations if needed)
  - [ ] Monitoring/alerting system (Grafana, PagerDuty)
- [ ] Communication channel open (Slack, war room)
- [ ] Runbook reviewed by both eng and ops team

### Pre-Deployment Verification Checklist
- [ ] Staging deployment successful (deployed 24+ hours ago, stable)
- [ ] Monitoring in place and verified working
- [ ] Rollback plan reviewed and tested
- [ ] Emergency contacts identified
- [ ] Stakeholders notified of deployment window
- [ ] Change log prepared (what's new in this version)

### Data/Database Requirements
- [ ] Database schema compatible with new version
  - [ ] Backward compatible (no breaking changes)
  - [ ] Migrations tested in staging
  - [ ] Rollback plan for migrations documented
- [ ] No data conflicts or corruption risks
- [ ] Backup created (if applicable)

### Approval Checklist
- [ ] Tech Lead: Code and approach approved
- [ ] Product Owner: Feature approved, ready for launch
- [ ] Operations Lead: Deployment plan reviewed
- [ ] Security: Security review passed (if applicable)
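
A checklist like the one above can also be checked mechanically before a deployment starts. The following is a minimal sketch, assuming the prerequisites live in a Markdown file that uses `- [ ]` for open items. The function names are illustrative and are not part of the repository scripts.

```bash
# Prints every unchecked "- [ ]" item with its line number.
# grep exits 0 when at least one open item is found.
unchecked_items() {
  grep -nF -e '- [ ]' "$1"
}

# Succeeds only when the checklist file is readable and has no open items.
ready_to_deploy() {
  if [ ! -r "$1" ]; then
    echo "cannot read checklist: $1" >&2
    return 2
  fi
  if unchecked_items "$1"; then
    echo "Prerequisites incomplete: resolve the items listed above." >&2
    return 1
  fi
  echo "All prerequisites checked."
}
```

Running `ready_to_deploy` on the spec file as the first action of Step 1 gives the "stop if any item fails" rule a concrete exit code.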

Deployment Steps Section

Provide step-by-step instructions:

## Deployment Procedure

### Pre-Deployment (Validation Phase)

**Step 1: Verify Prerequisites**
- Command: Run pre-deployment checklist above
- Verify: All items checked ✓
- If any fail: Stop deployment, resolve issues
- Time: ~15 minutes

**Step 2: Create Deployment Record**
- Document: Who is deploying, when, what version
- Command: Log in to deployment tracking system
- Entry:
  - Deployment: export-service
  - Version: v1.2.3
  - Environment: production
  - Deployed By: Alice Smith
  - Time: 2024-01-15 14:30 UTC
  - Change Summary: Added bulk export feature, fixed queue processing

- Time: ~5 minutes
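
If the tracking system accepts plain text, the record from Step 2 could be generated rather than typed by hand. This is a sketch only: `deployment_record` is a hypothetical helper, and the field layout simply mirrors the example entry.

```bash
# Prints a deployment record using the fields of the example entry.
# Arguments: service, version, environment, deployer, change summary.
deployment_record() {
  printf 'Deployment: %s\n' "$1"
  printf 'Version: %s\n' "$2"
  printf 'Environment: %s\n' "$3"
  printf 'Deployed By: %s\n' "$4"
  # UTC timestamp in the same shape as the example (2024-01-15 14:30 UTC).
  printf 'Time: %s\n' "$(date -u '+%Y-%m-%d %H:%M UTC')"
  printf 'Change Summary: %s\n' "$5"
}

# Usage:
#   deployment_record export-service v1.2.3 production "Alice Smith" \
#     "Added bulk export feature, fixed queue processing"
```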

### Deployment Phase

**Step 3: Run Database Migration (if applicable)**
- Check: Are there schema changes in this version?
- If YES:
```bash
# SSH to database server
ssh -i ~/.ssh/prod.pem admin@db.example.com

# Run migrations (on the database server)
psql -U export_service -d export_service -c \
  "ALTER TABLE exports ADD COLUMN retry_count INT DEFAULT 0;"

# Verify migration
psql -U export_service -d export_service -c \
  "SELECT column_name FROM information_schema.columns WHERE table_name='exports';"
```
- If NO: Skip this step
- Verify: All migrations complete without errors
- Time: ~10 minutes

**Step 4: Deploy to Kubernetes**
- Verify: You're deploying to the PRODUCTION cluster
```bash
kubectl config current-context
# Should output: arn:aws:eks:us-east-1:123456789:cluster/prod
```
- If wrong context: STOP, switch to correct cluster
- Deploy new image version:
```bash
# Update deployment with new image
kubectl set image deployment/export-service \
  export-service=123456789.dkr.ecr.us-east-1.amazonaws.com/export-service:v1.2.3 \
  -n export-service-prod
```
- Verify: Deployment triggered
```bash
kubectl rollout status deployment/export-service -n export-service-prod
```
- Wait: For all pods to become ready (typically 2-3 minutes)
- Output should show: deployment "export-service" successfully rolled out
- Time: ~5 minutes

**Step 5: Verify Deployment Health**
- Check: Pod status
```bash
kubectl get pods -n export-service-prod
```
  - All pods should show Running status
  - If any show CrashLoopBackOff: Stop deployment, investigate
- Check: Service endpoints
```bash
kubectl get svc export-service -n export-service-prod
```
  - Should show external IP/load balancer endpoint
- Check: Logs for errors
```bash
kubectl logs -n export-service-prod -l app=export-service --tail=50
```
  - Should show startup logs, no ERROR level messages
  - If errors present: Proceed to rollback (Step 9)
- Check: Health endpoints
```bash
curl https://api.example.com/health
```
  - Should return 200 OK
  - If not: Service may still be starting (wait 30s and retry)
- Time: ~5 minutes

### Post-Deployment (Verification Phase)

**Step 6: Monitor Metrics**
- Open: Grafana dashboard for export-service
- Check: Key metrics for 5 minutes
  - Request latency: Should be stable (< 100ms p95)
  - Error rate: Should remain < 0.1%
  - CPU/Memory: Should be within normal ranges
  - Queue depth: Should process jobs smoothly
- Look for: Any sudden spikes or anomalies
- If anomalies: Proceed to rollback (Step 9)
- Time: ~5 minutes

**Step 7: Functional Testing**
- Manual test: Create export via API
```bash
curl -X POST https://api.example.com/exports \
  -H "Authorization: Bearer $TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "format": "csv",
    "data_types": ["users"]
  }'
```
- Response: Should return 201 Created with export_id
- Check status:
```bash
curl https://api.example.com/exports/{export_id} \
  -H "Authorization: Bearer $TOKEN"
```
- Verify: Status transitions from queued → processing → completed
- Download: Successfully download export file
- Verify: File contents correct
- If any step fails: Proceed to rollback (Step 9)
- Time: ~5 minutes

**Step 8: Notify Stakeholders**
- Update: Deployment tracking system
```text
Status: DEPLOYED
Completion Time: 14:45 UTC
Health: ✓ All checks passed
Metrics: ✓ Stable
Functional Tests: ✓ Passed
```
- Announce: Slack to #product-eng
```text
@channel Export Service v1.2.3 deployed to production.
New feature: Bulk data exports now available.
Status: Monitoring.
```
- Notify: On-call engineer (monitoring for 2 hours post-deployment)

### Rollback Procedure (If Issues Found)

**Step 9: Rollback (only if Step 5, 6, or 7 fails)**
- Decision: Is deployment safe to continue?
  - YES → All checks pass, monitoring is good → Release complete
  - NO → Issues found → Proceed with rollback
- Execute rollback:
```bash
# Revert to previous version
kubectl rollout undo deployment/export-service -n export-service-prod

# Watch the rollback; this command blocks until it completes
kubectl rollout status deployment/export-service -n export-service-prod
```
- Verify rollback successful:
```bash
# Check current image
kubectl describe deployment export-service -n export-service-prod | grep Image

# Should show previous version (e.g., v1.2.2)

# Verify service responding
curl https://api.example.com/health
```
- Notify: Update stakeholders
```text
@channel Deployment rolled back due to [specific reason].
Current version: v1.2.2 (stable)
Investigating issue. Will retry deployment tomorrow.
```
- Document: Root cause analysis
  - What went wrong?
  - Why wasn't it caught in staging?
  - How do we prevent this next time?
- Time: ~10 minutes

Success Criteria Section

## Deployment Success Criteria

The deployment is successful if ALL of these are true:

### Technical Criteria
- [ ] All pods running and healthy (0 CrashLoopBackOff)
- [ ] Service responding to health checks (200 OK)
- [ ] Metrics showing normal values (no spikes)
- [ ] Error rate < 0.1% (< 1 error per 1000 requests)
- [ ] Response latency p95 < 100ms
- [ ] No errors in application logs

### Functional Criteria
- [ ] Export API responds to requests
- [ ] Export jobs queue successfully
- [ ] Jobs process and complete
- [ ] Files upload to S3 correctly
- [ ] Users can download exported files
- [ ] File contents verified correct

### Operational Criteria
- [ ] Monitoring active and receiving metrics
- [ ] Alerting working (test alert fired)
- [ ] Logs aggregated and searchable
- [ ] Runbook tested and functional
- [ ] Team confident in operating system
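
The error-rate criterion above (< 0.1%, or fewer than 1 error per 1000 requests) is easy to misjudge by eye. The sketch below shows one way to evaluate it from raw counts. The function name and its arguments are hypothetical, and where the counts come from (logs, metrics API) is left open.

```bash
# Succeeds when errors/total, as a percentage, is strictly below max_pct.
# awk does the floating-point comparison that sh arithmetic cannot.
# Usage: error_rate_below <errors> <total_requests> <max_pct>
error_rate_below() {
  awk -v e="$1" -v t="$2" -v max="$3" 'BEGIN {
    if (t <= 0) exit 2    # no traffic: the rate is undefined
    exit (((e / t) * 100 < max) ? 0 : 1)
  }'
}

# Usage:
#   error_rate_below 3 50000 0.1 && echo "error rate within criteria"
```

Exit status 2 is kept separate from 1 so that "no traffic yet" is not mistaken for either a pass or a failure.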

Monitoring & Alerting Section

## Monitoring Setup

### Critical Alerts (Page on-call)
- Service down (health check fails)
- Error rate > 1% for 5 minutes
- Response latency p95 > 500ms for 5 minutes
- Queue depth > 1000 for 10 minutes

### Warning Alerts (Slack notification)
- Error rate > 0.5% for 5 minutes
- CPU > 80% for 10 minutes
- Memory > 85% for 10 minutes
- Export job timeout increasing

### Dashboard
- Service: export-service-prod
- Metrics: Latency, errors, throughput, queue depth
- Time range: Last 24 hours by default
- Alerts: Show current alert status
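
The two error-rate thresholds above can be read as a simple classification. The following sketch makes that explicit. It is illustrative only: the function name is hypothetical, and the "for 5 minutes" duration condition, which a real alerting system would enforce, is not modeled.

```bash
# Maps an error rate (in percent) to the alert level from the lists above:
# above 1% pages on-call, above 0.5% posts a warning, otherwise nothing fires.
error_rate_alert_level() {
  awk -v rate="$1" 'BEGIN {
    if (rate > 1)        print "critical"
    else if (rate > 0.5) print "warning"
    else                 print "ok"
  }'
}

# Usage:
#   error_rate_alert_level 0.7
```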

Troubleshooting Section

## Troubleshooting Common Issues

### Issue: Pods stuck in CrashLoopBackOff
**Symptoms**: Pods repeatedly crash and restart
**Diagnosis**:
```bash
# Check logs for errors (add --previous to read the last crashed container)
kubectl logs <pod-name> -n export-service-prod
```
**Common Causes**:
- Configuration error (check environment variables)
- Database connection failed (check credentials)
- Out of memory (check resource limits)

**Fix**: Review logs, check prerequisites, rollback if unclear

### Issue: Response latency spiking
**Symptoms**: p95 latency > 200ms, users report slow exports
**Diagnosis**:
```bash
# Check queue depth
kubectl exec -it <worker-pod> -n export-service-prod \
  -- redis-cli -h redis.example.com LLEN export-queue
```
**Common Causes**:
- Too many concurrent exports (queue backlog)
- Database slow (check queries, indexes)
- Network issues (check connectivity)

**Fix**: Scale up workers, check database performance, verify network

### Issue: Export jobs failing
**Symptoms**: Job status shows failed, users can't export
**Diagnosis**:
```bash
# Check worker logs
kubectl logs -n export-service-prod -l app=export-service
```
**Common Causes**:
- S3 upload failing (check permissions, bucket exists)
- Database query error (schema mismatch)
- User doesn't have data to export

**Fix**: Review logs, verify S3 access, check schema version

### Issue: Database migration failed
**Symptoms**: Service won't start after deployment
**Diagnosis**:
```bash
# Check migration logs
psql -U export_service -d export_service -c \
  "SELECT * FROM schema_migrations ORDER BY version DESC LIMIT 5;"
```
**Recovery**:
1. Identify failed migration
2. Rollback deployment (revert to previous version)
3. Debug migration issue in staging
4. Retry deployment after fix


Post-Deployment Actions Section

## After Deployment

### Immediate (Next 2 hours)
- [ ] On-call engineer monitoring
- [ ] Check metrics every 15 minutes
- [ ] Monitor error rate and latency
- [ ] Watch for user-reported issues in #support

### Short-term (Next 24 hours)
- [ ] Review deployment metrics
- [ ] Collect feedback from users
- [ ] Document any issues encountered
- [ ] Update runbook if needed

### Follow-up (Next week)
- [ ] Post-mortem if issues occurred
- [ ] Update deployment procedure based on lessons learned
- [ ] Plan performance improvements if needed
- [ ] Update documentation if system behavior changed

Writing Tips

Be Precise and Detailed

Exact commands to run (copy-paste ready)
Specific values (versions, endpoints, timeouts)
Expected outputs for verification
Time estimates for each step
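
One way to make "expected outputs for verification" enforceable is to wrap a command so that a mismatch stops the procedure. The sketch below assumes a POSIX shell. `expect_output` is a hypothetical helper, and the context ARN in the usage note is the placeholder from the example procedure.

```bash
# Runs a command and compares its output with the value the procedure
# expects; prints both on mismatch so the operator can stop safely.
# Usage: expect_output "<expected>" command [args...]
expect_output() {
  expected=$1
  shift
  actual=$("$@") || { echo "command failed: $*" >&2; return 1; }
  if [ "$actual" != "$expected" ]; then
    echo "expected: $expected" >&2
    echo "actual:   $actual" >&2
    return 1
  fi
}

# Usage:
#   expect_output "arn:aws:eks:us-east-1:123456789:cluster/prod" \
#     kubectl config current-context || exit 1
```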

Think About Edge Cases

What if something is already deployed?
What if a prerequisite is missing?
What if deployment partially succeeds?
What if rollback is needed?
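
One edge case the example procedure already names is a service that is still starting when the health check runs ("wait 30s and retry"). A bounded retry keeps that wait from becoming open-ended. This is a minimal sketch with a hypothetical `retry` helper, and the attempt count and delay in the usage note are illustrative values.

```bash
# Retries a command up to max_attempts times, sleeping delay seconds
# between attempts. Returns 0 on the first success, 1 if all attempts fail.
# Usage: retry <max_attempts> <delay_seconds> command [args...]
retry() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "gave up after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
}

# Usage (curl -f turns HTTP error statuses into a non-zero exit code):
#   retry 5 30 curl -fsS https://api.example.com/health
```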

Make Rollback Easy

Document rollback procedure clearly
Test rollback before using in production
Make rollback faster than forward deployment
Have quick communication plan for failures
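
Rollback is easiest when the procedure decides when to run it. The sketch below pairs a deploy step with a verify step and falls back to rollback if either fails. All names are hypothetical, and the three arguments are expected to be shell functions or commands supplied by the procedure author.

```bash
# Runs deploy, then verify; if either fails, runs rollback and reports failure.
# Usage: deploy_or_rollback <deploy_cmd> <verify_cmd> <rollback_cmd>
deploy_or_rollback() {
  deploy_cmd=$1
  verify_cmd=$2
  rollback_cmd=$3
  if "$deploy_cmd" && "$verify_cmd"; then
    echo "deployment verified"
    return 0
  fi
  echo "deployment failed verification, rolling back" >&2
  "$rollback_cmd"
  return 1
}

# Usage (function bodies reuse commands from the example procedure):
#   do_verify()   { kubectl rollout status deployment/export-service -n export-service-prod; }
#   do_rollback() { kubectl rollout undo deployment/export-service -n export-service-prod; }
#   deploy_or_rollback do_deploy do_verify do_rollback
```

Rehearsing this wrapper in staging with a deliberately failing verify step is one way to test the rollback path before it is needed in production.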

Document Monitoring

What metrics indicate health?
What should we watch during deployment?
What thresholds trigger alerts?
How do we validate success?

Link to Related Specs

Reference component specs: [CMP-001]
Reference design documents: [DES-001]
Reference operations runbooks

Validation & Fixing Issues

Run the Validator

scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-001-your-spec.md

Common Issues & Fixes

Issue: "Prerequisites section incomplete"

Fix: Add all required infrastructure, code, access, and approvals

Issue: "Step-by-step procedures lack detail"

Fix: Add actual commands, expected output, time estimates

Issue: "No rollback procedure"

Fix: Document how to revert deployment if issues arise

Issue: "Monitoring and troubleshooting missing"

Fix: Add success criteria, monitoring setup, and troubleshooting guide
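
Most of the issues above come down to a missing section, so a quick local check before running the validator can save a round trip. The sketch below is an assumption-based approximation: the keyword list is inferred from the issues listed here and does not reproduce the actual rules of `scripts/validate-spec.sh`.

```bash
# Prints each expected heading keyword that is missing from a spec file and
# returns non-zero if any are missing. The keyword list is an assumption.
missing_sections() {
  file=$1
  status=0
  for keyword in Prerequisites Rollback Monitoring Troubleshooting; do
    # Match the keyword only on Markdown heading lines, case-insensitively.
    if ! grep -qi "^#.*$keyword" "$file"; then
      echo "missing section: $keyword"
      status=1
    fi
  done
  return "$status"
}

# Usage:
#   missing_sections docs/specs/deployment-procedure/deploy-001-your-spec.md
```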

Decision-Making Framework

When writing a deployment procedure:

Prerequisites: What must be true before we start?
- Infrastructure ready?
- Code reviewed and tested?
- Team trained?
- Approvals obtained?
Procedure: What are the exact steps?
- Simple, repeatable steps?
- Verification at each step?
- Estimated timing?
Safety: How do we prevent/catch issues?
- Verification steps after each phase?
- Rollback procedure?
- Quick failure detection?
Communication: Who needs to know what?
- Stakeholders notified?
- On-call monitoring?
- Escalation path?
Learning: How do we improve next time?
- Monitoring enabled?
- Runbook updated?
- Issues documented?

Next Steps

Create the spec: scripts/generate-spec.sh deployment-procedure deploy-XXX-slug
Research: Find component specs and existing procedures
Document prerequisites: What must be true before deployment?
Write procedures: Step-by-step, with commands and verification
Plan rollback: How do we undo this if needed?
Validate: scripts/validate-spec.sh docs/specs/deployment-procedure/deploy-XXX-slug.md
Test procedure: Walk through it in staging environment
Get team review before using in production
