# Operational Acceptance Documentation

## Overview

Prepare complete acceptance packages for production deployment. Core principle: **Document readiness, risks, and acceptance criteria for informed go-live decisions**.

**Key insight**: Acceptance documentation enables stakeholders to make informed risk decisions about production deployment.

## When to Use

Load this skill when:
- Preparing systems for production launch
- Seeking executive go-live approval
- Completing operational handover
- Authorizing government/defense systems

**Signs you need this**:
- "How do I get approval to launch?"
- Preparing a production readiness checklist
- Creating a go-live approval package
- Handing over operations to a support team

**Don't use for**:
- Development/staging deployments
- Internal-only tools (unless high-risk)

## Production Readiness Checklist

### Infrastructure Readiness

```markdown
## Infrastructure Readiness

### Compute Resources
- [ ] Production servers provisioned (6x API servers, 2x database servers)
- [ ] Auto-scaling configured (scale 2-20 instances based on CPU >70%)
- [ ] Load balancer configured with health checks
- [ ] SSL/TLS certificates installed and valid (expires 2025-12-01)

### Storage
- [ ] Database provisioned (PostgreSQL 14, 500GB storage)
- [ ] Database backups configured (automated hourly backups, 30-day retention)
- [ ] Backup restoration tested (RTO: 1 hour, RPO: 1 hour)

### Network
- [ ] VPC configured with public/private subnets
- [ ] Firewall rules implemented (allow HTTPS 443, deny all other inbound)
- [ ] DNS configured (api.example.com → load balancer)

### Monitoring and Logging
- [ ] Application metrics instrumented (Prometheus)
- [ ] Logs centralized (CloudWatch Logs, 90-day retention)
- [ ] Dashboards created ([Grafana dashboard link])
- [ ] Alerts configured (error rate, latency, uptime)

### Security
- [ ] Secrets stored in secrets manager (not environment variables)
- [ ] TLS 1.3 enforced
- [ ] Authentication implemented (MFA for admins)
- [ ] Security scan completed (no HIGH/CRITICAL findings)

**Infrastructure Status**: ✅ READY (all criteria met)
```

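The status line at the end of the checklist can be derived from the checkboxes rather than typed by hand. The following is a minimal sketch, assuming the checklist uses GitHub-style task list items (`- [ ]` and `- [x]`); the function name `checklist_status` and the rule "READY only when every item is checked" are illustrative assumptions, not part of the template.

```python
import re

# Matches task list items such as "- [ ] text" or "- [x] text".
TASK_ITEM = re.compile(r"^\s*[-*]\s+\[( |x|X)\]\s+(.*)$")


def checklist_status(markdown_text: str) -> dict:
    """Summarize a Markdown task list: counts plus the unchecked items."""
    done = 0
    open_items = []
    for line in markdown_text.splitlines():
        match = TASK_ITEM.match(line)
        if not match:
            continue  # headings, prose, and blank lines are ignored
        if match.group(1).lower() == "x":
            done += 1
        else:
            open_items.append(match.group(2).strip())
    total = done + len(open_items)
    # Assumed rule: READY needs at least one item and no open items.
    status = "READY" if total > 0 and not open_items else "NOT READY"
    return {"status": status, "done": done, "total": total, "open": open_items}


sample = """\
### Storage
- [x] Database provisioned (PostgreSQL 14, 500GB storage)
- [x] Database backups configured (automated hourly backups, 30-day retention)
- [ ] Backup restoration tested (RTO: 1 hour, RPO: 1 hour)
"""
result = checklist_status(sample)
print(result["status"], f'{result["done"]}/{result["total"]}')
```

Listing the open items alongside the status gives approvers the specific gaps instead of a bare "not ready".
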
## Operational Readiness Checklist

```markdown
## Operational Readiness

### Monitoring Coverage
- [ ] **Availability**: Uptime monitoring ([UptimeRobot link])
- [ ] **Performance**: Latency tracking (p50, p95, p99)
- [ ] **Errors**: Error rate monitoring (<0.1% threshold)
- [ ] **Business metrics**: User signups, API calls, revenue

**Success criteria**: All critical metrics have dashboards + alerts

### Alerting Configuration
- [ ] **P1 alerts** (PagerDuty): Service down, error rate >1%, security incident
- [ ] **P2 alerts** (Slack #ops): Elevated errors >0.5%, latency >500ms
- [ ] **P3 alerts** (Email): Performance degradation, capacity warnings

**Success criteria**: Alerts tested and verified to fire correctly

### Backup and Recovery
- [ ] **Backup procedure**: Automated hourly PostgreSQL dumps to S3
- [ ] **Backup testing**: Restored from backup on 2024-03-10 (successful)
- [ ] **Recovery time**: 1-hour RTO verified
- [ ] **Recovery point**: 1-hour RPO (acceptable data loss)

**Success criteria**: Restore from backup completes within RTO

### Runbooks and Documentation
- [ ] **Incident response runbooks**: Database outage, API errors, security incidents
- [ ] **Operational procedures**: Deployment, rollback, scaling
- [ ] **Architecture documentation**: System diagram, data flows, integrations
- [ ] **API documentation**: Endpoint reference, authentication guide

**Success criteria**: On-call engineer can respond to P1 incident using runbooks alone

**Operational Status**: ✅ READY (all criteria met)
```

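The P1 and P2 thresholds in the alerting checklist can be written down as a small routing function, which also makes "alerts tested and verified to fire correctly" something you can check in a unit test. This is a sketch under stated assumptions: the `Snapshot` type and its field names are hypothetical, the checklist does not say which latency percentile the 500ms threshold applies to, and P3 is left out because the checklist gives no numeric threshold for it.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Snapshot:
    """Hypothetical metrics sample; field names are illustrative."""
    service_up: bool
    error_rate_pct: float          # errors as a percentage of requests
    latency_ms: float              # percentile is not specified in the checklist
    security_incident: bool = False


def classify_alert(s: Snapshot) -> Optional[Tuple[str, str]]:
    """Return (priority, channel) for the most severe matching rule, else None."""
    # P1: service down, error rate >1%, or a security incident.
    if not s.service_up or s.error_rate_pct > 1.0 or s.security_incident:
        return ("P1", "PagerDuty")
    # P2: elevated errors >0.5% or latency >500ms.
    if s.error_rate_pct > 0.5 or s.latency_ms > 500:
        return ("P2", "Slack #ops")
    # P3 (performance degradation, capacity warnings) has no numeric
    # threshold in the checklist, so it is not modeled here.
    return None


print(classify_alert(Snapshot(service_up=True, error_rate_pct=0.7, latency_ms=120)))
```

Checking P1 before P2 means a sample that crosses both thresholds pages the on-call engineer rather than only posting to Slack.
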
## Test and Evaluation Documentation

### Test Summary Report

```markdown
# Test Summary Report: Customer Portal Launch

## Test Objectives
1. Verify functional requirements (user registration, login, profile management)
2. Validate performance requirements (p95 latency <200ms, support 1000 concurrent users)
3. Confirm security requirements (authentication, authorization, data encryption)

## Test Methodology

### Functional Testing
- **Unit tests**: 487 tests, 100% pass rate
- **Integration tests**: 156 tests, 100% pass rate
- **End-to-end tests**: 45 scenarios, 44 passed, 1 defect (LOW severity, workaround available)

### Performance Testing
- **Load test**: 1000 concurrent users, 10,000 requests/min
  - p50 latency: 45ms ✅
  - p95 latency: 180ms ✅ (target: <200ms)
  - p99 latency: 350ms ✅ (target: <500ms)
  - Error rate: 0.02% ✅ (target: <0.1%)

### Security Testing
- **Vulnerability scan**: Nessus scan completed 2024-03-15
  - Critical: 0 ✅
  - High: 0 ✅
  - Medium: 3 (remediated)
  - Low: 8 (accepted risk)
- **Penetration test**: External pentest completed 2024-03-18
  - HIGH findings: 1 (SQL injection, fixed on 2024-03-19)
  - MEDIUM findings: 2 (remediated)

## Defect Summary

| Defect ID | Severity | Description | Status | Disposition |
|-----------|----------|-------------|--------|-------------|
| DEF-001 | LOW | Profile image upload fails for files >10MB | Open | Workaround: Resize before upload (documented) |
| DEF-002 | MEDIUM | Password reset email delayed 5-10 minutes | Fixed | Fixed on 2024-03-20 |
| DEF-003 | HIGH | SQL injection in /api/users | Fixed | Fixed on 2024-03-19, re-tested |

## Test Completion Criteria

- [x] All HIGH/CRITICAL defects fixed
- [x] All MEDIUM defects fixed or have workarounds
- [x] LOW defects documented (1 open, workaround available)
- [x] Performance requirements met
- [x] Security requirements met (no open HIGH/CRITICAL findings)

**Test Status**: ✅ PASSED (all criteria met, 1 open LOW defect accepted)
```

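The completion criteria in the report can be evaluated directly from the defect table and the load-test figures, which keeps the status line consistent with the data behind it. The sketch below is one possible encoding: the `Defect` type, the `completion_criteria` function, and the decision to treat "Fixed" as the only closed status are assumptions for illustration. The thresholds (p95 <200ms, error rate <0.1%) come from the report.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Defect:
    """Hypothetical record mirroring one row of the defect summary table."""
    defect_id: str
    severity: str            # "LOW", "MEDIUM", "HIGH", or "CRITICAL"
    status: str              # "Open" or "Fixed"
    has_workaround: bool = False


def completion_criteria(defects: List[Defect], p95_ms: float,
                        error_rate_pct: float) -> Dict[str, bool]:
    """Evaluate the report's completion criteria; True means the criterion is met."""
    open_defects = [d for d in defects if d.status != "Fixed"]
    return {
        "high_critical_fixed": not any(
            d.severity in ("HIGH", "CRITICAL") for d in open_defects),
        "medium_fixed_or_workaround": all(
            d.has_workaround for d in open_defects if d.severity == "MEDIUM"),
        "performance_met": p95_ms < 200 and error_rate_pct < 0.1,
    }


defects = [
    Defect("DEF-001", "LOW", "Open", has_workaround=True),
    Defect("DEF-002", "MEDIUM", "Fixed"),
    Defect("DEF-003", "HIGH", "Fixed"),
]
criteria = completion_criteria(defects, p95_ms=180, error_rate_pct=0.02)
print("PASSED" if all(criteria.values()) else "FAILED", criteria)
```

Returning each criterion separately, rather than a single boolean, lets the report show which criterion blocked a pass.
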
## Go-Live Approval Package

### Executive Summary

```markdown
# Go-Live Approval Request: Customer Portal

## System Overview
**System Name**: Customer Portal
**Purpose**: Enable customers to self-serve account management, reducing support tickets by 40%
**Business Value**: $2M annual revenue enabler (enterprise customers require self-service portal)
**Target Launch**: 2024-04-01

## Readiness Status

### Infrastructure: ✅ READY
- All production servers provisioned and tested
- Auto-scaling configured
- Backups automated and tested (1-hour RTO/RPO)

### Operations: ✅ READY
- Monitoring and alerting configured
- Runbooks complete
- On-call rotation staffed (3 SREs, 2 backend engineers)

### Testing: ✅ PASSED
- Functional tests: unit and integration 100% pass; end-to-end 44 of 45 scenarios passed (1 LOW defect with workaround)
- Performance tests: p95 latency 180ms (target: <200ms)
- Security tests: 0 open HIGH/CRITICAL findings (1 HIGH pentest finding fixed and re-tested)

### Security Authorization: ✅ AUTHORIZED
- ATO granted on 2024-03-25 (valid for 3 years)
- POA&M with 2 LOW-risk items (tracked, non-blocking)

## Residual Risks

### Risk 1: Performance Degradation Above 1000 Users (MEDIUM)
**Description**: Load testing validated 1000 concurrent users. Performance above 1000 users is untested.
**Mitigation**:
- Auto-scaling configured to add capacity at 70% CPU
- Gradual rollout plan (100 users week 1, 500 week 2, all users week 4)
- Performance monitoring with alerts at 800ms latency threshold
**Accepted by**: CTO on 2024-03-28

### Risk 2: Profile Image Upload Limitation (LOW)
**Description**: Images >10MB fail to upload (DEF-001)
**Mitigation**:
- Workaround documented in user help center
- Fix planned for v1.1 release (2024-05-01)
**Accepted by**: Product Manager on 2024-03-28

## Launch Criteria

### Success Metrics
**Immediate (Week 1)**:
- Uptime: >99% (target: 99.9%)
- Error rate: <0.5% (target: <0.1%)
- p95 latency: <300ms (target: <200ms)

**Medium-term (Month 1)**:
- User adoption: 30% of customers use portal
- Support ticket reduction: 20% decrease

### Abort Criteria
**Immediate rollback if**:
- Uptime drops below 95% for >1 hour
- Error rate exceeds 5%
- Data breach or security incident
- Critical functionality broken for >50% of users

### Monitoring Plan
- **Real-time**: Grafana dashboard monitored by on-call
- **Daily**: Morning standup reviews previous 24 hours
- **Weekly**: Executive summary report (metrics vs targets)

## Rollback Plan

**Trigger**: Any abort criterion met

**Rollback Procedure** (target: 30 minutes):
1. Enable maintenance page
2. Scale production deployment to 0 replicas
3. Restore database from pre-launch backup (if data changes occurred)
4. Re-enable previous customer support workflow
5. Communicate to customers via email

**Testing**: Rollback procedure tested in staging on 2024-03-27 (successful, 25-minute duration)

## Recommendation

**Status**: ✅ RECOMMENDED FOR LAUNCH (pending the approvals listed here)

All readiness criteria met. Residual risks identified and accepted by stakeholders. Launch criteria defined with clear success metrics and abort criteria. Rollback plan tested and ready.

**Requested Approval**: Executive Go-Live Approval

**Approvals Required**:
- [ ] VP Engineering (technical readiness)
- [ ] CTO (security and risk acceptance)
- [ ] VP Product (business value and user experience)
- [ ] CFO (budget and revenue impact)
```

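Abort criteria are easiest to act on under pressure when they are unambiguous enough to evaluate mechanically. As a rough sketch, the four criteria from the approval package could be encoded as follows; the `LaunchSnapshot` type and its field names are hypothetical, and how each value is measured (for example, the window used for uptime) is left to the monitoring stack.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LaunchSnapshot:
    """Hypothetical post-launch health sample; field names are illustrative."""
    minutes_uptime_below_95: float     # how long uptime has been under 95%
    error_rate_pct: float              # errors as a percentage of requests
    security_incident: bool            # data breach or security incident reported
    users_critical_broken_pct: float   # share of users hitting broken critical features


def triggered_abort_criteria(s: LaunchSnapshot) -> List[str]:
    """Return every abort criterion the snapshot meets; empty means keep going."""
    triggered = []
    if s.minutes_uptime_below_95 > 60:
        triggered.append("Uptime below 95% for more than 1 hour")
    if s.error_rate_pct > 5:
        triggered.append("Error rate exceeds 5%")
    if s.security_incident:
        triggered.append("Data breach or security incident")
    if s.users_critical_broken_pct > 50:
        triggered.append("Critical functionality broken for >50% of users")
    return triggered


snapshot = LaunchSnapshot(minutes_uptime_below_95=0, error_rate_pct=6.2,
                          security_incident=False, users_critical_broken_pct=10)
reasons = triggered_abort_criteria(snapshot)
print("ROLLBACK" if reasons else "CONTINUE", reasons)
```

Because the rollback trigger is "any abort criterion met", a non-empty list is enough to start the rollback procedure, and the list itself doubles as the reason recorded in the incident log.
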
## Operational Handover Checklist

```markdown
# Operational Handover: Customer Portal

## Knowledge Transfer

### Documentation Delivered
- [x] Architecture documentation (`/docs/architecture.md`)
- [x] API reference (`/docs/api-reference.md`)
- [x] Runbooks (`/runbooks/` - 12 runbooks)
- [x] Deployment procedures (`/docs/deployment.md`)
- [x] Troubleshooting guide (`/docs/troubleshooting.md`)

### Training Completed
- [x] On-call training (2024-03-20): 3 SREs, 2 backend engineers
- [x] Runbook walkthrough (2024-03-22): All on-call staff
- [x] Incident response drill (2024-03-25): Simulated database outage, responded successfully

### Handoff Meeting
- **Date**: 2024-03-28
- **Attendees**: Development team (6), Operations team (5), Product (2)
- **Agenda**:
  1. System overview and architecture
  2. Common issues and troubleshooting
  3. Escalation paths and contact information
  4. Q&A session
- **Outcome**: ✅ Operations team confident in supporting the system

## Support Model

### On-Call Rotation
**Primary On-Call**: Rotating weekly schedule (3 SREs)
**Backup On-Call**: Backend engineer (2-person rotation)

**Schedule**: https://pagerduty.example.com/schedules/customer-portal

### Escalation Paths

**P1 (Critical)**:
1. Primary on-call (page immediately)
2. If no response in 5 min → Backup on-call
3. If no resolution in 30 min → Incident commander
4. If ongoing after 1 hour → VP Engineering

**P2 (High)**:
1. Primary on-call (page)
2. If no response in 15 min → Backup on-call
3. If no resolution in 4 hours → Team lead

**Contacts**:
- Primary on-call: [PagerDuty link]
- Incident commander: John Doe (+1-555-0100)
- Team lead: Jane Smith (+1-555-0200)
- VP Engineering: Bob Johnson (+1-555-0300)

### SLA Commitments
**Uptime**: 99.9% (measured monthly)
**Performance**: p95 latency <200ms
**Support Response**:
- P1: 15-minute response, 4-hour resolution target
- P2: 2-hour response, 1-day resolution target
- P3: Next business day response

## Acceptance Criteria Met

- [x] All documentation delivered and reviewed
- [x] Operations team trained
- [x] Incident response drill successful
- [x] On-call rotation staffed
- [x] Escalation paths defined
- [x] SLA commitments documented

**Handover Status**: ✅ COMPLETE

**Signed Off**:
- Development Team Lead: John Doe (2024-03-28)
- Operations Team Lead: Jane Smith (2024-03-28)
```

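The escalation paths in the handover document amount to a small time-based policy, and writing it as data makes it easy to review during the handoff meeting. The sketch below is illustrative only: the `ESCALATION` table layout and the `engaged_roles` function are assumptions, and it treats each time limit as reached at exactly that many minutes, which the handover document does not specify.

```python
from typing import List

# Minutes until each escalation step, taken from the P1 and P2 paths.
# "no_response" applies only while the page is unacknowledged;
# "unresolved" steps apply for as long as the incident is still open.
ESCALATION = {
    "P1": {"no_response": 5,
           "unresolved": [(30, "Incident commander"), (60, "VP Engineering")]},
    "P2": {"no_response": 15,
           "unresolved": [(240, "Team lead")]},
}


def engaged_roles(priority: str, minutes_elapsed: float,
                  acknowledged: bool) -> List[str]:
    """List who should be involved in an open incident at this point in time."""
    policy = ESCALATION[priority]
    roles = ["Primary on-call"]  # always paged first
    if not acknowledged and minutes_elapsed >= policy["no_response"]:
        roles.append("Backup on-call")
    for threshold, role in policy["unresolved"]:
        if minutes_elapsed >= threshold:
            roles.append(role)
    return roles


print(engaged_roles("P1", minutes_elapsed=45, acknowledged=True))
```

Keeping the thresholds in one table means a change to the escalation policy is a one-line edit that can be diffed against the signed-off handover document.
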
## Cross-References

**Use WITH this skill**:
- `ordis/security-architect/security-authorization-and-accreditation` - For government/defense ATO requirements
- `muna/technical-writer/itil-and-governance-documentation` - For RFC and service documentation

## Real-World Impact

**Systems using operational acceptance documentation**:
- **Customer Portal Launch**: Go-live approval package enabled same-day executive approval (vs 1-week review cycle). Clear risk acceptance + rollback plan gave confidence to approve.
- **Government System**: Complete readiness checklist (infrastructure, operations, testing, security authorization) passed IRAP assessment on first attempt. Assessor: "Most comprehensive readiness documentation in 5 years".
- **Operational Handover**: Training + runbooks + incident drill enabled junior SRE to respond to P1 database outage successfully within 45 minutes (first week post-handover).

**Key lesson**: **Comprehensive acceptance documentation (readiness, risks, criteria, handover) enables informed go-live decisions and smooth operational transitions.**