Initial commit
This commit is contained in:
669
commands/hotfix-deployer.md
Normal file
669
commands/hotfix-deployer.md
Normal file
@@ -0,0 +1,669 @@
|
||||
---
|
||||
allowed-tools: Bash, Read, Write, Edit, Grep, Glob
|
||||
description: ClaudeForge emergency hotfix deployment specialist for critical production issues and rapid incident response.
|
||||
---
|
||||
|
||||
# ClaudeForge Hotfix Deployer
|
||||
|
||||
ClaudeForge intelligent hotfix deployment system that manages emergency production fixes with automated branching strategies, deployment workflows, rollback procedures, and comprehensive incident response protocols.
|
||||
|
||||
## Purpose
|
||||
|
||||
Transform emergency production fixes from chaotic manual processes to systematic, auditable, and reversible deployment workflows that minimize downtime, ensure code quality, and maintain production stability during critical incidents.
|
||||
|
||||
## Features
|
||||
|
||||
- **Hotfix Branching**: Automated Git branching strategies for emergency fixes
|
||||
- **Emergency Deployment**: Fast-track deployment pipelines with safety checks
|
||||
- **Rollback Procedures**: One-command rollback with state preservation
|
||||
- **Incident Response**: Structured incident management and communication workflows
|
||||
- **Change Validation**: Pre-deployment testing and validation automation
|
||||
- **Audit Logging**: Complete audit trail for compliance and post-mortems
|
||||
- **Team Coordination**: Automated notifications and stakeholder communication
|
||||
- **Post-Deployment Monitoring**: Real-time health checks and rollback triggers
|
||||
|
||||
## Usage
|
||||
|
||||
```bash
|
||||
/hotfix-deployer [command] [options]
|
||||
```
|
||||
|
||||
Target: $ARGUMENTS (if specified, otherwise analyze current production state)
|
||||
|
||||
### Hotfix Commands
|
||||
|
||||
**Create Hotfix Branch:**
|
||||
```bash
|
||||
/hotfix-deployer create --issue=PROD-2847 --severity=critical --from=production
|
||||
```
|
||||
Creates emergency hotfix branch with:
|
||||
- Branch naming convention (hotfix/PROD-2847-description)
|
||||
- Production baseline identification
|
||||
- Automatic issue tracker linking
|
||||
- Team notification via Slack/Teams
|
||||
- Incident timeline initialization
|
||||
- Emergency approval workflow trigger
|
||||
|
||||
**Deploy Hotfix:**
|
||||
```bash
|
||||
/hotfix-deployer deploy --target=production --verify --notify-stakeholders
|
||||
```
|
||||
Executes controlled hotfix deployment:
|
||||
- Pre-deployment health checks
|
||||
- Database migration validation
|
||||
- Configuration consistency check
|
||||
- Blue-green deployment strategy
|
||||
- Canary release with gradual rollout
|
||||
- Automated smoke tests
|
||||
- Real-time monitoring dashboard
|
||||
- Stakeholder notification system
|
||||
|
||||
**Rollback Deployment:**
|
||||
```bash
|
||||
/hotfix-deployer rollback --version=previous --preserve-data --immediate
|
||||
```
|
||||
Performs emergency rollback with:
|
||||
- Instant traffic switching
|
||||
- Database state preservation
|
||||
- Session continuity maintenance
|
||||
- Cache invalidation
|
||||
- CDN purge coordination
|
||||
- DNS propagation handling
|
||||
- Service health verification
|
||||
- Incident escalation procedures
|
||||
|
||||
### Incident Management
|
||||
|
||||
**Incident Declaration:**
|
||||
```bash
|
||||
/hotfix-deployer incident create --severity=sev1 --title="Payment Gateway Down" --impact=high
|
||||
```
|
||||
Initiates incident response with:
|
||||
- Severity level classification (SEV0-SEV3)
|
||||
- Automatic team assembly and notification
|
||||
- Incident commander assignment
|
||||
- War room creation (Zoom/Slack)
|
||||
- Status page update automation
|
||||
- Customer communication templates
|
||||
- Executive escalation if needed
|
||||
- Timeline documentation system
|
||||
|
||||
**Status Updates:**
|
||||
```bash
|
||||
/hotfix-deployer incident update --status=investigating --eta="15 minutes" --broadcast
|
||||
```
|
||||
Manages incident communication:
|
||||
- Regular status update intervals
|
||||
- Multi-channel broadcasting (Slack, email, status page)
|
||||
- Stakeholder-specific messaging
|
||||
- Customer-facing communication
|
||||
- Internal team coordination
|
||||
- Progress tracking and timeline
|
||||
- Evidence collection for post-mortem
|
||||
|
||||
**Incident Resolution:**
|
||||
```bash
|
||||
/hotfix-deployer incident resolve --root-cause="null pointer" --action-items="add validation"
|
||||
```
|
||||
Closes incident with:
|
||||
- Resolution confirmation and validation
|
||||
- Root cause documentation
|
||||
- Action item assignment
|
||||
- Post-mortem scheduling
|
||||
- Customer notification
|
||||
- Status page resolution
|
||||
- Monitoring alert tuning
|
||||
- Knowledge base article creation
|
||||
|
||||
### Validation and Testing
|
||||
|
||||
**Pre-Deployment Validation:**
|
||||
```bash
|
||||
/hotfix-deployer validate --environment=staging --test-suite=critical-path
|
||||
```
|
||||
Runs validation suite including:
|
||||
- Critical path smoke tests
|
||||
- Integration test subset
|
||||
- Database migration dry-run
|
||||
- Performance regression checks
|
||||
- Security vulnerability scan
|
||||
- Configuration validation
|
||||
- Dependency conflict detection
|
||||
- Backward compatibility verification
|
||||
|
||||
**Production Verification:**
|
||||
```bash
|
||||
/hotfix-deployer verify --endpoints=critical --duration=5m --auto-rollback
|
||||
```
|
||||
Monitors deployment health:
|
||||
- Health endpoint polling
|
||||
- Error rate monitoring
|
||||
- Response time tracking
|
||||
- Database connection pool status
|
||||
- External service dependencies
|
||||
- User session validation
|
||||
- Transaction success rates
|
||||
- Automatic rollback triggers
|
||||
|
||||
## Branching Strategies
|
||||
|
||||
### Git Flow Hotfix Model
|
||||
|
||||
**Hotfix Branch Creation:**
|
||||
```bash
|
||||
# Create from production tag
|
||||
git checkout -b hotfix/2.1.1 v2.1.0
|
||||
git push -u origin hotfix/2.1.1
|
||||
|
||||
# Make critical fix
|
||||
git add .
|
||||
git commit -m "hotfix: fix payment processing null pointer exception
|
||||
|
||||
- Add null check for payment method
|
||||
- Add defensive validation
|
||||
- Add error logging
|
||||
|
||||
Resolves: PROD-2847
|
||||
Severity: Critical"
|
||||
|
||||
# Tag hotfix version
|
||||
git tag -a v2.1.1 -m "Hotfix: Payment processing fix"
|
||||
git push origin v2.1.1
|
||||
```
|
||||
|
||||
**Merge Strategy:**
|
||||
```bash
|
||||
# Merge to production
|
||||
git checkout production
|
||||
git merge --no-ff hotfix/2.1.1
|
||||
git push origin production
|
||||
|
||||
# Merge to develop
|
||||
git checkout develop
|
||||
git merge --no-ff hotfix/2.1.1
|
||||
git push origin develop
|
||||
|
||||
# Clean up hotfix branch
|
||||
git branch -d hotfix/2.1.1
|
||||
git push origin --delete hotfix/2.1.1
|
||||
```
|
||||
|
||||
### Trunk-Based Hotfix Model
|
||||
|
||||
**Direct Production Fix:**
|
||||
```bash
|
||||
# Create hotfix from production
|
||||
git checkout production
|
||||
git pull origin production
|
||||
git checkout -b hotfix/payment-fix
|
||||
|
||||
# Make fix and fast-forward
|
||||
git add .
|
||||
git commit -m "hotfix: payment processing fix"
|
||||
|
||||
# Deploy directly
|
||||
git checkout production
|
||||
git merge --ff-only hotfix/payment-fix
|
||||
git push origin production
|
||||
|
||||
# Cherry-pick to main
|
||||
git checkout main
|
||||
git cherry-pick <hotfix-commit-sha>
|
||||
git push origin main
|
||||
```
|
||||
|
||||
## Deployment Workflows
|
||||
|
||||
### Blue-Green Deployment
|
||||
|
||||
**Implementation:**
|
||||
```yaml
|
||||
# Kubernetes blue-green deployment
|
||||
apiVersion: v1
|
||||
kind: Service
|
||||
metadata:
|
||||
name: app-service
|
||||
spec:
|
||||
selector:
|
||||
app: myapp
|
||||
version: blue # Switch to 'green' for deployment
|
||||
ports:
|
||||
- port: 80
|
||||
targetPort: 8080
|
||||
|
||||
---
|
||||
# Blue deployment
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: app-blue
|
||||
spec:
|
||||
replicas: 3
|
||||
selector:
|
||||
matchLabels:
|
||||
app: myapp
|
||||
version: blue
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: myapp
|
||||
version: blue
|
||||
spec:
|
||||
containers:
|
||||
- name: app
|
||||
image: myapp:v2.1.0
|
||||
ports:
|
||||
- containerPort: 8080
|
||||
|
||||
---
|
||||
# Green deployment (new version)
|
||||
apiVersion: apps/v1
|
||||
kind: Deployment
|
||||
metadata:
|
||||
name: app-green
|
||||
spec:
|
||||
replicas: 3
|
||||
selector:
|
||||
matchLabels:
|
||||
app: myapp
|
||||
version: green
|
||||
template:
|
||||
metadata:
|
||||
labels:
|
||||
app: myapp
|
||||
version: green
|
||||
spec:
|
||||
containers:
|
||||
- name: app
|
||||
image: myapp:v2.1.1 # New hotfix version
|
||||
ports:
|
||||
- containerPort: 8080
|
||||
```
|
||||
|
||||
**Deployment Script:**
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Blue-green deployment script
|
||||
|
||||
CURRENT_VERSION=$(kubectl get service app-service -o jsonpath='{.spec.selector.version}')
|
||||
NEW_VERSION=$([ "$CURRENT_VERSION" = "blue" ] && echo "green" || echo "blue")
|
||||
|
||||
echo "Current version: $CURRENT_VERSION"
|
||||
echo "Deploying to: $NEW_VERSION"
|
||||
|
||||
# Deploy new version
|
||||
kubectl apply -f deployment-$NEW_VERSION.yaml
|
||||
|
||||
# Wait for pods to be ready
|
||||
kubectl wait --for=condition=ready pod -l version=$NEW_VERSION --timeout=300s
|
||||
|
||||
# Run smoke tests
|
||||
./run-smoke-tests.sh $NEW_VERSION
|
||||
|
||||
if [ $? -eq 0 ]; then
|
||||
echo "Smoke tests passed. Switching traffic..."
|
||||
kubectl patch service app-service -p '{"spec":{"selector":{"version":"'$NEW_VERSION'"}}}'
|
||||
echo "Deployment successful!"
|
||||
else
|
||||
echo "Smoke tests failed. Rolling back..."
|
||||
kubectl delete deployment app-$NEW_VERSION
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
### Canary Deployment
|
||||
|
||||
**Progressive Rollout:**
|
||||
```yaml
|
||||
# Istio VirtualService for canary
|
||||
apiVersion: networking.istio.io/v1beta1
|
||||
kind: VirtualService
|
||||
metadata:
|
||||
name: app-canary
|
||||
spec:
|
||||
hosts:
|
||||
- app.example.com
|
||||
http:
|
||||
- match:
|
||||
- headers:
|
||||
canary:
|
||||
exact: "true"
|
||||
route:
|
||||
- destination:
|
||||
host: app-service
|
||||
subset: v2-1-1
|
||||
- route:
|
||||
- destination:
|
||||
host: app-service
|
||||
subset: v2-1-0
|
||||
weight: 90
|
||||
- destination:
|
||||
host: app-service
|
||||
subset: v2-1-1
|
||||
weight: 10 # Start with 10% traffic
|
||||
```
|
||||
|
||||
**Automated Canary Script:**
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Progressive canary deployment
|
||||
|
||||
WEIGHTS=(10 25 50 75 100)
|
||||
CANARY_DURATION=300 # 5 minutes per stage
|
||||
|
||||
for WEIGHT in "${WEIGHTS[@]}"; do
|
||||
echo "Deploying canary with ${WEIGHT}% traffic..."
|
||||
|
||||
# Update traffic split
|
||||
kubectl apply -f - <<EOF
|
||||
apiVersion: networking.istio.io/v1beta1
|
||||
kind: VirtualService
|
||||
metadata:
|
||||
name: app-canary
|
||||
spec:
|
||||
hosts:
|
||||
- app.example.com
|
||||
http:
|
||||
- route:
|
||||
- destination:
|
||||
host: app-service
|
||||
subset: v2-1-0
|
||||
weight: $((100 - WEIGHT))
|
||||
- destination:
|
||||
host: app-service
|
||||
subset: v2-1-1
|
||||
weight: $WEIGHT
|
||||
EOF
|
||||
|
||||
echo "Monitoring for ${CANARY_DURATION} seconds..."
|
||||
sleep $CANARY_DURATION
|
||||
|
||||
# Check error rates
|
||||
ERROR_RATE=$(prometheus-query "rate(http_requests_total{status=~'5..'}[5m])")
|
||||
|
||||
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
|
||||
echo "Error rate too high! Rolling back..."
|
||||
kubectl apply -f virtualservice-stable.yaml
|
||||
exit 1
|
||||
fi
|
||||
|
||||
echo "Stage ${WEIGHT}% successful"
|
||||
done
|
||||
|
||||
echo "Canary deployment complete!"
|
||||
```
|
||||
|
||||
## Rollback Procedures
|
||||
|
||||
### Automated Rollback
|
||||
|
||||
**Kubernetes Rollback:**
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Immediate rollback script
|
||||
|
||||
echo "Initiating emergency rollback..."
|
||||
|
||||
# Rollback deployment
|
||||
kubectl rollout undo deployment/app-deployment
|
||||
|
||||
# Wait for rollback to complete
|
||||
kubectl rollout status deployment/app-deployment --timeout=300s
|
||||
|
||||
# Verify health
|
||||
HEALTH=$(curl -s -o /dev/null -w "%{http_code}" https://app.example.com/health)
|
||||
|
||||
if [ "$HEALTH" = "200" ]; then
|
||||
echo "Rollback successful. Application healthy."
|
||||
|
||||
# Notify team
|
||||
curl -X POST https://hooks.slack.com/services/XXX \
|
||||
-H 'Content-Type: application/json' \
|
||||
-d '{
|
||||
"text": "🔄 Emergency rollback completed successfully",
|
||||
"attachments": [{
|
||||
"color": "good",
|
||||
"text": "Application health verified. Production stable."
|
||||
}]
|
||||
}'
|
||||
else
|
||||
echo "Rollback failed! Health check returned: $HEALTH"
|
||||
echo "Escalating to on-call engineer..."
|
||||
./escalate-incident.sh
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
**Database Rollback:**
|
||||
```bash
|
||||
#!/bin/bash
|
||||
# Database migration rollback
|
||||
|
||||
echo "Rolling back database migrations..."
|
||||
|
||||
# Get current migration version
|
||||
CURRENT_VERSION=$(psql -tAc "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1")
|
||||
echo "Current migration: $CURRENT_VERSION"
|
||||
|
||||
# Rollback last migration
|
||||
npm run migrate:rollback
|
||||
|
||||
# Verify database integrity
|
||||
npm run db:verify
|
||||
|
||||
if [ $? -eq 0 ]; then
|
||||
echo "Database rollback successful"
|
||||
else
|
||||
echo "Database rollback failed! Manual intervention required."
|
||||
echo "Creating database snapshot for recovery..."
|
||||
pg_dump dbname > emergency-backup-$(date +%Y%m%d-%H%M%S).sql
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
## Incident Response Playbooks
|
||||
|
||||
### SEV1 Incident Playbook
|
||||
|
||||
**Critical Service Outage:**
|
||||
```markdown
|
||||
## SEV1: Complete Service Outage
|
||||
|
||||
### Immediate Actions (0-5 minutes)
|
||||
1. [ ] Declare SEV1 incident in #incidents channel
|
||||
2. [ ] Page on-call engineer and engineering manager
|
||||
3. [ ] Update status page: "Investigating service disruption"
|
||||
4. [ ] Assemble incident response team in war room
|
||||
5. [ ] Begin incident timeline documentation
|
||||
|
||||
### Investigation Phase (5-15 minutes)
|
||||
1. [ ] Check monitoring dashboards for anomalies
|
||||
2. [ ] Review recent deployments and changes
|
||||
3. [ ] Analyze error logs and stack traces
|
||||
4. [ ] Test critical user journeys manually
|
||||
5. [ ] Identify root cause hypothesis
|
||||
|
||||
### Mitigation Phase (15-30 minutes)
|
||||
1. [ ] Implement emergency fix or rollback
|
||||
2. [ ] Deploy hotfix following fast-track process
|
||||
3. [ ] Verify fix in staging environment
|
||||
4. [ ] Deploy to production with monitoring
|
||||
5. [ ] Confirm service restoration
|
||||
|
||||
### Recovery Phase (30+ minutes)
|
||||
1. [ ] Update status page: "Service restored"
|
||||
2. [ ] Send customer communication
|
||||
3. [ ] Continue monitoring for 2 hours
|
||||
4. [ ] Schedule post-mortem within 48 hours
|
||||
5. [ ] Document action items and learnings
|
||||
```
|
||||
|
||||
### Communication Templates
|
||||
|
||||
**Initial Notification:**
|
||||
```
|
||||
🚨 INCIDENT DECLARED: SEV1
|
||||
|
||||
Title: Payment Gateway Down
|
||||
Impact: All payment processing unavailable
|
||||
Started: 2024-03-15 14:32 UTC
|
||||
Status: Investigating
|
||||
|
||||
We are aware of an issue affecting payment processing.
|
||||
Our team is actively investigating and will provide updates every 15 minutes.
|
||||
|
||||
Incident Commander: @john.doe
|
||||
War Room: https://zoom.us/incident-123
|
||||
|
||||
#incident #sev1
|
||||
```
|
||||
|
||||
**Status Update:**
|
||||
```
|
||||
📊 INCIDENT UPDATE: SEV1 (15 min)
|
||||
|
||||
Status: Identified root cause
|
||||
Root Cause: Database connection pool exhaustion
|
||||
Action: Deploying hotfix to increase pool size
|
||||
ETA: 10 minutes
|
||||
|
||||
Next update in 15 minutes.
|
||||
```
|
||||
|
||||
**Resolution Notice:**
|
||||
```
|
||||
✅ INCIDENT RESOLVED: SEV1
|
||||
|
||||
Title: Payment Gateway Down
|
||||
Duration: 47 minutes
|
||||
Resolution: Hotfix deployed (v2.1.1)
|
||||
|
||||
Service is fully restored. We will continue monitoring for the next 2 hours.
|
||||
A detailed post-mortem will be shared within 48 hours.
|
||||
|
||||
Thank you for your patience.
|
||||
```
|
||||
|
||||
## CI/CD Integration
|
||||
|
||||
### GitHub Actions Hotfix Pipeline
|
||||
|
||||
```yaml
|
||||
name: Emergency Hotfix Deployment
|
||||
|
||||
on:
|
||||
push:
|
||||
branches:
|
||||
- 'hotfix/**'
|
||||
|
||||
env:
|
||||
SEVERITY: critical
|
||||
SKIP_TESTS: false
|
||||
|
||||
jobs:
|
||||
validate:
|
||||
runs-on: ubuntu-latest
|
||||
timeout-minutes: 5
|
||||
steps:
|
||||
- uses: actions/checkout@v3
|
||||
|
||||
- name: Notify deployment start
|
||||
uses: slackapi/slack-github-action@v1
|
||||
with:
|
||||
payload: |
|
||||
{
|
||||
"text": "🔥 Hotfix deployment started: ${{ github.ref_name }}",
|
||||
"channel": "#deployments"
|
||||
}
|
||||
|
||||
- name: Fast security scan
|
||||
run: npm audit --audit-level=high
|
||||
|
||||
- name: Run critical tests only
|
||||
run: npm run test:critical
|
||||
|
||||
deploy-staging:
|
||||
needs: validate
|
||||
runs-on: ubuntu-latest
|
||||
environment: staging
|
||||
steps:
|
||||
- uses: actions/checkout@v3
|
||||
|
||||
- name: Deploy to staging
|
||||
run: |
|
||||
npm run deploy:staging
|
||||
npm run smoke-test:staging
|
||||
|
||||
deploy-production:
|
||||
needs: deploy-staging
|
||||
runs-on: ubuntu-latest
|
||||
environment: production
|
||||
steps:
|
||||
- name: Create deployment backup
|
||||
run: kubectl create backup production-pre-hotfix-$(date +%s)
|
||||
|
||||
- name: Deploy hotfix
|
||||
run: |
|
||||
npm run deploy:production
|
||||
sleep 30 # Allow for pod startup
|
||||
|
||||
- name: Health check
|
||||
run: |
|
||||
for i in {1..10}; do
|
||||
if curl -f https://api.example.com/health; then
|
||||
echo "Health check passed"
|
||||
exit 0
|
||||
fi
|
||||
sleep 5
|
||||
done
|
||||
echo "Health check failed"
|
||||
exit 1
|
||||
|
||||
- name: Automatic rollback on failure
|
||||
if: failure()
|
||||
run: |
|
||||
echo "Deployment failed. Initiating rollback..."
|
||||
npm run rollback:production
|
||||
kubectl restore backup production-pre-hotfix-*
|
||||
|
||||
- name: Notify deployment result
|
||||
if: always()
|
||||
uses: slackapi/slack-github-action@v1
|
||||
with:
|
||||
payload: |
|
||||
{
|
||||
"text": "${{ job.status == 'success' && '✅ Hotfix deployed successfully' || '❌ Hotfix deployment failed - rolled back' }}",
|
||||
"channel": "#incidents"
|
||||
}
|
||||
```
|
||||
|
||||
### Monitoring and Alerting
|
||||
|
||||
**Prometheus Alert Rules:**
|
||||
```yaml
|
||||
groups:
|
||||
- name: hotfix-monitoring
|
||||
interval: 10s
|
||||
rules:
|
||||
- alert: HighErrorRateAfterDeploy
|
||||
expr: rate(http_requests_total{status=~"5.."}[1m]) > 0.05
|
||||
for: 2m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "High error rate detected after deployment"
|
||||
description: "Error rate is {{ $value }} (threshold: 0.05)"
|
||||
|
||||
- alert: ResponseTimeRegression
|
||||
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
|
||||
for: 3m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "P95 response time regression"
|
||||
description: "P95 latency is {{ $value }}s (threshold: 2s)"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
**ClaudeForge Hotfix Deployer** - Enterprise-grade emergency deployment system with automated workflows, comprehensive rollback procedures, and structured incident response for production stability.
|
||||
Reference in New Issue
Block a user