Files
gh-claudeforge-marketplace-…/commands/hotfix-deployer.md
2025-11-29 18:13:29 +08:00

670 lines
16 KiB
Markdown

---
allowed-tools: Bash, Read, Write, Edit, Grep, Glob
description: ClaudeForge emergency hotfix deployment specialist for critical production issues and rapid incident response.
---
# ClaudeForge Hotfix Deployer
ClaudeForge intelligent hotfix deployment system that manages emergency production fixes with automated branching strategies, deployment workflows, rollback procedures, and comprehensive incident response protocols.
## Purpose
Transform emergency production fixes from chaotic manual processes to systematic, auditable, and reversible deployment workflows that minimize downtime, ensure code quality, and maintain production stability during critical incidents.
## Features
- **Hotfix Branching**: Automated Git branching strategies for emergency fixes
- **Emergency Deployment**: Fast-track deployment pipelines with safety checks
- **Rollback Procedures**: One-command rollback with state preservation
- **Incident Response**: Structured incident management and communication workflows
- **Change Validation**: Pre-deployment testing and validation automation
- **Audit Logging**: Complete audit trail for compliance and post-mortems
- **Team Coordination**: Automated notifications and stakeholder communication
- **Post-Deployment Monitoring**: Real-time health checks and rollback triggers
## Usage
```bash
/hotfix-deployer [command] [options]
```
Target: $ARGUMENTS (if specified, otherwise analyze current production state)
### Hotfix Commands
**Create Hotfix Branch:**
```bash
/hotfix-deployer create --issue=PROD-2847 --severity=critical --from=production
```
Creates emergency hotfix branch with:
- Branch naming convention (hotfix/PROD-2847-description)
- Production baseline identification
- Automatic issue tracker linking
- Team notification via Slack/Teams
- Incident timeline initialization
- Emergency approval workflow trigger
**Deploy Hotfix:**
```bash
/hotfix-deployer deploy --target=production --verify --notify-stakeholders
```
Executes controlled hotfix deployment:
- Pre-deployment health checks
- Database migration validation
- Configuration consistency check
- Blue-green deployment strategy
- Canary release with gradual rollout
- Automated smoke tests
- Real-time monitoring dashboard
- Stakeholder notification system
**Rollback Deployment:**
```bash
/hotfix-deployer rollback --version=previous --preserve-data --immediate
```
Performs emergency rollback with:
- Instant traffic switching
- Database state preservation
- Session continuity maintenance
- Cache invalidation
- CDN purge coordination
- DNS propagation handling
- Service health verification
- Incident escalation procedures
### Incident Management
**Incident Declaration:**
```bash
/hotfix-deployer incident create --severity=sev1 --title="Payment Gateway Down" --impact=high
```
Initiates incident response with:
- Severity level classification (SEV0-SEV3)
- Automatic team assembly and notification
- Incident commander assignment
- War room creation (Zoom/Slack)
- Status page update automation
- Customer communication templates
- Executive escalation if needed
- Timeline documentation system
**Status Updates:**
```bash
/hotfix-deployer incident update --status=investigating --eta="15 minutes" --broadcast
```
Manages incident communication:
- Regular status update intervals
- Multi-channel broadcasting (Slack, email, status page)
- Stakeholder-specific messaging
- Customer-facing communication
- Internal team coordination
- Progress tracking and timeline
- Evidence collection for post-mortem
**Incident Resolution:**
```bash
/hotfix-deployer incident resolve --root-cause="null pointer" --action-items="add validation"
```
Closes incident with:
- Resolution confirmation and validation
- Root cause documentation
- Action item assignment
- Post-mortem scheduling
- Customer notification
- Status page resolution
- Monitoring alert tuning
- Knowledge base article creation
### Validation and Testing
**Pre-Deployment Validation:**
```bash
/hotfix-deployer validate --environment=staging --test-suite=critical-path
```
Runs validation suite including:
- Critical path smoke tests
- Integration test subset
- Database migration dry-run
- Performance regression checks
- Security vulnerability scan
- Configuration validation
- Dependency conflict detection
- Backward compatibility verification
**Production Verification:**
```bash
/hotfix-deployer verify --endpoints=critical --duration=5m --auto-rollback
```
Monitors deployment health:
- Health endpoint polling
- Error rate monitoring
- Response time tracking
- Database connection pool status
- External service dependencies
- User session validation
- Transaction success rates
- Automatic rollback triggers
## Branching Strategies
### Git Flow Hotfix Model
**Hotfix Branch Creation:**
```bash
# Create from production tag
git checkout -b hotfix/2.1.1 v2.1.0
git push -u origin hotfix/2.1.1
# Make critical fix
git add .
git commit -m "hotfix: fix payment processing null pointer exception
- Add null check for payment method
- Add defensive validation
- Add error logging
Resolves: PROD-2847
Severity: Critical"
# Tag hotfix version
git tag -a v2.1.1 -m "Hotfix: Payment processing fix"
git push origin v2.1.1
```
**Merge Strategy:**
```bash
# Merge to production
git checkout production
git merge --no-ff hotfix/2.1.1
git push origin production
# Merge to develop
git checkout develop
git merge --no-ff hotfix/2.1.1
git push origin develop
# Clean up hotfix branch
git branch -d hotfix/2.1.1
git push origin --delete hotfix/2.1.1
```
### Trunk-Based Hotfix Model
**Direct Production Fix:**
```bash
# Create hotfix from production
git checkout production
git pull origin production
git checkout -b hotfix/payment-fix
# Make fix and fast-forward
git add .
git commit -m "hotfix: payment processing fix"
# Deploy directly
git checkout production
git merge --ff-only hotfix/payment-fix
git push origin production
# Cherry-pick to main
git checkout main
git cherry-pick <hotfix-commit-sha>
git push origin main
```
## Deployment Workflows
### Blue-Green Deployment
**Implementation:**
```yaml
# Kubernetes blue-green deployment
apiVersion: v1
kind: Service
metadata:
name: app-service
spec:
selector:
app: myapp
version: blue # Switch to 'green' for deployment
ports:
- port: 80
targetPort: 8080
---
# Blue deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-blue
spec:
replicas: 3
selector:
matchLabels:
app: myapp
version: blue
template:
metadata:
labels:
app: myapp
version: blue
spec:
containers:
- name: app
image: myapp:v2.1.0
ports:
- containerPort: 8080
---
# Green deployment (new version)
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-green
spec:
replicas: 3
selector:
matchLabels:
app: myapp
version: green
template:
metadata:
labels:
app: myapp
version: green
spec:
containers:
- name: app
image: myapp:v2.1.1 # New hotfix version
ports:
- containerPort: 8080
```
**Deployment Script:**
```bash
#!/bin/bash
# Blue-green deployment script
CURRENT_VERSION=$(kubectl get service app-service -o jsonpath='{.spec.selector.version}')
NEW_VERSION=$([ "$CURRENT_VERSION" = "blue" ] && echo "green" || echo "blue")
echo "Current version: $CURRENT_VERSION"
echo "Deploying to: $NEW_VERSION"
# Deploy new version
kubectl apply -f deployment-$NEW_VERSION.yaml
# Wait for pods to be ready
kubectl wait --for=condition=ready pod -l version=$NEW_VERSION --timeout=300s
# Run smoke tests
./run-smoke-tests.sh $NEW_VERSION
if [ $? -eq 0 ]; then
echo "Smoke tests passed. Switching traffic..."
kubectl patch service app-service -p '{"spec":{"selector":{"version":"'$NEW_VERSION'"}}}'
echo "Deployment successful!"
else
echo "Smoke tests failed. Rolling back..."
kubectl delete deployment app-$NEW_VERSION
exit 1
fi
```
### Canary Deployment
**Progressive Rollout:**
```yaml
# Istio VirtualService for canary
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: app-canary
spec:
hosts:
- app.example.com
http:
- match:
- headers:
canary:
exact: "true"
route:
- destination:
host: app-service
subset: v2-1-1
- route:
- destination:
host: app-service
subset: v2-1-0
weight: 90
- destination:
host: app-service
subset: v2-1-1
weight: 10 # Start with 10% traffic
```
**Automated Canary Script:**
```bash
#!/bin/bash
# Progressive canary deployment
WEIGHTS=(10 25 50 75 100)
CANARY_DURATION=300 # 5 minutes per stage
for WEIGHT in "${WEIGHTS[@]}"; do
echo "Deploying canary with ${WEIGHT}% traffic..."
# Update traffic split
kubectl apply -f - <<EOF
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
name: app-canary
spec:
hosts:
- app.example.com
http:
- route:
- destination:
host: app-service
subset: v2-1-0
weight: $((100 - WEIGHT))
- destination:
host: app-service
subset: v2-1-1
weight: $WEIGHT
EOF
echo "Monitoring for ${CANARY_DURATION} seconds..."
sleep $CANARY_DURATION
# Check error rates
ERROR_RATE=$(prometheus-query "rate(http_requests_total{status=~'5..'}[5m])")
if (( $(echo "$ERROR_RATE > 0.01" | bc -l) )); then
echo "Error rate too high! Rolling back..."
kubectl apply -f virtualservice-stable.yaml
exit 1
fi
echo "Stage ${WEIGHT}% successful"
done
echo "Canary deployment complete!"
```
## Rollback Procedures
### Automated Rollback
**Kubernetes Rollback:**
```bash
#!/bin/bash
# Immediate rollback script
echo "Initiating emergency rollback..."
# Rollback deployment
kubectl rollout undo deployment/app-deployment
# Wait for rollback to complete
kubectl rollout status deployment/app-deployment --timeout=300s
# Verify health
HEALTH=$(curl -s -o /dev/null -w "%{http_code}" https://app.example.com/health)
if [ "$HEALTH" = "200" ]; then
echo "Rollback successful. Application healthy."
# Notify team
curl -X POST https://hooks.slack.com/services/XXX \
-H 'Content-Type: application/json' \
-d '{
"text": "🔄 Emergency rollback completed successfully",
"attachments": [{
"color": "good",
"text": "Application health verified. Production stable."
}]
}'
else
echo "Rollback failed! Health check returned: $HEALTH"
echo "Escalating to on-call engineer..."
./escalate-incident.sh
exit 1
fi
```
**Database Rollback:**
```bash
#!/bin/bash
# Database migration rollback
echo "Rolling back database migrations..."
# Get current migration version
CURRENT_VERSION=$(psql -tAc "SELECT version FROM schema_migrations ORDER BY version DESC LIMIT 1")
echo "Current migration: $CURRENT_VERSION"
# Rollback last migration
npm run migrate:rollback
# Verify database integrity
npm run db:verify
if [ $? -eq 0 ]; then
echo "Database rollback successful"
else
echo "Database rollback failed! Manual intervention required."
echo "Creating database snapshot for recovery..."
pg_dump dbname > emergency-backup-$(date +%Y%m%d-%H%M%S).sql
exit 1
fi
```
## Incident Response Playbooks
### SEV1 Incident Playbook
**Critical Service Outage:**
```markdown
## SEV1: Complete Service Outage
### Immediate Actions (0-5 minutes)
1. [ ] Declare SEV1 incident in #incidents channel
2. [ ] Page on-call engineer and engineering manager
3. [ ] Update status page: "Investigating service disruption"
4. [ ] Assemble incident response team in war room
5. [ ] Begin incident timeline documentation
### Investigation Phase (5-15 minutes)
1. [ ] Check monitoring dashboards for anomalies
2. [ ] Review recent deployments and changes
3. [ ] Analyze error logs and stack traces
4. [ ] Test critical user journeys manually
5. [ ] Identify root cause hypothesis
### Mitigation Phase (15-30 minutes)
1. [ ] Implement emergency fix or rollback
2. [ ] Deploy hotfix following fast-track process
3. [ ] Verify fix in staging environment
4. [ ] Deploy to production with monitoring
5. [ ] Confirm service restoration
### Recovery Phase (30+ minutes)
1. [ ] Update status page: "Service restored"
2. [ ] Send customer communication
3. [ ] Continue monitoring for 2 hours
4. [ ] Schedule post-mortem within 48 hours
5. [ ] Document action items and learnings
```
### Communication Templates
**Initial Notification:**
```
🚨 INCIDENT DECLARED: SEV1
Title: Payment Gateway Down
Impact: All payment processing unavailable
Started: 2024-03-15 14:32 UTC
Status: Investigating
We are aware of an issue affecting payment processing.
Our team is actively investigating and will provide updates every 15 minutes.
Incident Commander: @john.doe
War Room: https://zoom.us/incident-123
#incident #sev1
```
**Status Update:**
```
📊 INCIDENT UPDATE: SEV1 (15 min)
Status: Identified root cause
Root Cause: Database connection pool exhaustion
Action: Deploying hotfix to increase pool size
ETA: 10 minutes
Next update in 15 minutes.
```
**Resolution Notice:**
```
✅ INCIDENT RESOLVED: SEV1
Title: Payment Gateway Down
Duration: 47 minutes
Resolution: Hotfix deployed (v2.1.1)
Service is fully restored. We will continue monitoring for the next 2 hours.
A detailed post-mortem will be shared within 48 hours.
Thank you for your patience.
```
## CI/CD Integration
### GitHub Actions Hotfix Pipeline
```yaml
name: Emergency Hotfix Deployment
on:
push:
branches:
- 'hotfix/**'
env:
SEVERITY: critical
SKIP_TESTS: false
jobs:
validate:
runs-on: ubuntu-latest
timeout-minutes: 5
steps:
- uses: actions/checkout@v3
- name: Notify deployment start
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "🔥 Hotfix deployment started: ${{ github.ref_name }}",
"channel": "#deployments"
}
- name: Fast security scan
run: npm audit --audit-level=high
- name: Run critical tests only
run: npm run test:critical
deploy-staging:
needs: validate
runs-on: ubuntu-latest
environment: staging
steps:
- uses: actions/checkout@v3
- name: Deploy to staging
run: |
npm run deploy:staging
npm run smoke-test:staging
deploy-production:
needs: deploy-staging
runs-on: ubuntu-latest
environment: production
steps:
- name: Create deployment backup
run: kubectl create backup production-pre-hotfix-$(date +%s)
- name: Deploy hotfix
run: |
npm run deploy:production
sleep 30 # Allow for pod startup
- name: Health check
run: |
for i in {1..10}; do
if curl -f https://api.example.com/health; then
echo "Health check passed"
exit 0
fi
sleep 5
done
echo "Health check failed"
exit 1
- name: Automatic rollback on failure
if: failure()
run: |
echo "Deployment failed. Initiating rollback..."
npm run rollback:production
kubectl restore backup production-pre-hotfix-*
- name: Notify deployment result
if: always()
uses: slackapi/slack-github-action@v1
with:
payload: |
{
"text": "${{ job.status == 'success' && '✅ Hotfix deployed successfully' || '❌ Hotfix deployment failed - rolled back' }}",
"channel": "#incidents"
}
```
### Monitoring and Alerting
**Prometheus Alert Rules:**
```yaml
groups:
- name: hotfix-monitoring
interval: 10s
rules:
- alert: HighErrorRateAfterDeploy
expr: rate(http_requests_total{status=~"5.."}[1m]) > 0.05
for: 2m
labels:
severity: critical
annotations:
summary: "High error rate detected after deployment"
description: "Error rate is {{ $value }} (threshold: 0.05)"
- alert: ResponseTimeRegression
expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 2
for: 3m
labels:
severity: warning
annotations:
summary: "P95 response time regression"
description: "P95 latency is {{ $value }}s (threshold: 2s)"
```
---
**ClaudeForge Hotfix Deployer** - Enterprise-grade emergency deployment system with automated workflows, comprehensive rollback procedures, and structured incident response for production stability.