Initial commit

2025-11-30 08:59:40 +08:00
commit 57bed020cd
12 changed files with 4775 additions and 0 deletions
--- a/skills/using-technical-writer/itil-and-governance-documentation.md
+++ b/skills/using-technical-writer/itil-and-governance-documentation.md
@@ -0,0 +1,413 @@
+
+# ITIL and Governance Documentation
+
+## Overview
+
+Document systems for formal IT service management environments. Core principle: **Governance documentation enables controlled change and accountability**.
+
+**Key insight**: Enterprise operations require structured documentation for change management, service delivery, and business continuity.
+
+## When to Use
+
+Load this skill when:
+- Working in ITIL/ITSM environments
+- Preparing change requests (RFC)
+- Documenting enterprise services
+- Creating disaster recovery plans
+
+**Symptoms you need this**:
+- "How do I document a production change?"
+- Creating service catalog entries
+- Writing DR/BCP documentation
+- Formal change advisory board (CAB) processes
+
+**Don't use for**:
+- Informal/startup environments (too heavyweight)
+- Quick fixes without governance
+
+## Change Request (RFC) Documentation
+
+### Pattern: Request for Change
+
+```markdown
+# RFC-1234: Database Schema Migration
+
+## Change Type
+**NORMAL** (Standard | Normal | Emergency)
+
+## Summary
+Migrate user authentication schema from legacy format to OAuth2-compatible format. Adds `oauth_provider` and `oauth_token` columns to `users` table.
+
+## Business Justification
+- Enable SSO integration with enterprise identity providers
+- Customer requirement for Fortune 500 clients
+- Revenue impact: Unblocks $500k in contracts
+
+## Impact Analysis
+
+### Affected Services
+- User Authentication Service (primary impact)
+- API Gateway (configuration change required)
+- Mobile App (token format change)
+
+### Affected Users/Teams
+- All active users (15,000 users) - No visible impact
+- Engineering team - Deployment coordination required
+- Customer support - FAQs updated
+
+### Dependencies
+- Requires database maintenance window
+- OAuth provider configuration must be complete
+- Mobile app v2.5+ must be deployed first (rollout complete as of 2024-03-10)
+
+### Risk Assessment
+| Risk | Likelihood | Impact | Mitigation |
+|------|------------|--------|------------|
+| Migration fails mid-process | Low | High | Transaction-based migration, automatic rollback |
+| Users locked out post-migration | Medium | High | Keep legacy auth fallback for 30 days |
+| Performance degradation | Low | Medium | Load testing completed, indexes added |
+
+**Overall Risk**: MEDIUM
+
+## Implementation Plan
+
+### Pre-Implementation (1 day before)
+1. **2024-03-19 10:00**: Send user notification ("scheduled maintenance 2024-03-20 02:00-04:00 UTC, no action required")
+2. **2024-03-19 14:00**: Final QA testing in staging environment
+3. **2024-03-19 16:00**: Freeze code deployments (change freeze begins)
+
+### Implementation Window (2024-03-20 02:00-04:00 UTC)
+**Duration**: 2 hours
+**Teams required**: 2x SRE, 1x Database Admin, 1x Backend Engineer
+
+1. **02:00**: Enable maintenance mode
+   \```bash
+   kubectl scale deployment/api-gateway --replicas=0
+   \```
+   - Success: No active API requests
+
+2. **02:05**: Backup database
+   \```bash
+   pg_dump production_db > backup-pre-migration-20240320.sql
+   aws s3 cp backup-pre-migration-20240320.sql s3://backups/
+   \```
+   - Success: Backup file uploaded (verify size ~5GB)
+
+3. **02:15**: Run migration
+   \```bash
+   psql production_db < migration-20240320-oauth-schema.sql
+   \```
+   - Success: "ALTER TABLE" messages, no errors
+   - Rollback if: Any ERROR message → Restore from backup
+
+4. **02:45**: Deploy new API version
+   \```bash
+   kubectl set image deployment/api-gateway api=registry/api:v2.6.0
+   kubectl scale deployment/api-gateway --replicas=6
+   kubectl rollout status deployment/api-gateway
+   \```
+   - Success: "successfully rolled out"
+
+5. **03:00**: Smoke tests
+   \```bash
+   # Test legacy auth (backward compatibility)
+   curl -X POST https://api.example.com/auth/legacy -d '{"user":"test@example.com","pass":"fake_password"}'
+   # Expected: 200 OK, token returned
+
+   # Test OAuth auth (new feature)
+   curl -X POST https://api.example.com/auth/oauth -d '{"provider":"google","code":"fake_code"}'
+   # Expected: 200 OK, token returned
+   \```
+
+6. **03:15**: Monitor for 15 minutes
+   - Check error rate dashboard: [link]
+   - Check authentication success rate: [link]
+   - Expected: Error rate <0.1%, auth success >99%
+
+7. **03:30**: Disable maintenance mode
+   \```bash
+   # Restore normal operation
+   \```
+
+8. **03:45**: Post-deployment verification
+   - Verify mobile app can authenticate
+   - Verify web app can authenticate
+   - Check customer support queue (any authentication issues?)
+
+### Post-Implementation
+1. **2024-03-20 08:00**: Team standup - review deployment, any issues?
+2. **2024-03-20 09:00**: Send "maintenance complete" notification
+3. **2024-03-21**: Monitor for 24 hours, watch for late-breaking issues
+4. **2024-03-27**: Remove legacy auth fallback (7 days post-migration)
+
+## Rollback Plan
+
+### Trigger Conditions
+Rollback if ANY of these occur:
+- Migration script errors
+- Auth success rate drops below 95%
+- Critical bugs reported by >5 users
+- Performance degradation >20%
+
+### Rollback Procedure (30 minutes)
+1. **Enable maintenance mode** (same as step 1 above)
+
+2. **Restore database from backup**:
+   \```bash
+   aws s3 cp s3://backups/backup-pre-migration-20240320.sql ./
+   psql production_db < backup-pre-migration-20240320.sql
+   \```
+   - Duration: ~15 minutes for 5GB database
+
+3. **Revert to previous API version**:
+   \```bash
+   kubectl set image deployment/api-gateway api=registry/api:v2.5.9
+   kubectl rollout restart deployment/api-gateway
+   \```
+
+4. **Verify rollback success**: Run smoke tests (same as implementation step 5)
+
+5. **Disable maintenance mode**
+
+6. **Post-rollback**: Investigate failure, update RFC with findings, reschedule
+
+## Testing Plan
+
+### Pre-Production Testing (Completed)
+- [x] Unit tests pass (2024-03-15)
+- [x] Integration tests pass (2024-03-16)
+- [x] Load testing: 10,000 concurrent users, <200ms latency (2024-03-17)
+- [x] Staging deployment: Migration successful (2024-03-18)
+
+### Production Verification
+- Smoke tests (included in implementation plan step 5)
+- 24-hour monitoring period
+- User acceptance: No critical issues reported within 7 days
+
+## Approval Chain
+
+| Role | Name | Approval Date | Status |
+|------|------|---------------|--------|
+| Backend Engineer | John Doe | 2024-03-14 | ✅ Approved |
+| Database Admin | Jane Smith | 2024-03-14 | ✅ Approved |
+| SRE Lead | Bob Johnson | 2024-03-15 | ✅ Approved |
+| Product Manager | Alice Williams | 2024-03-15 | ✅ Approved |
+| Change Advisory Board (CAB) | CAB Chair | 2024-03-18 | ✅ Approved |
+
+**Change Authorized by**: CAB Chair (Alice Johnson) on 2024-03-18
+
+## Communication Plan
+
+| Audience | Message | Channel | Timing |
+|----------|---------|---------|--------|
+| All users | "Scheduled maintenance 2024-03-20 02:00-04:00 UTC. No action required." | Email, status page | 1 day before |
+| Engineering team | Implementation details, on-call requirements | Slack #engineering | 1 day before |
+| Customer support | FAQ updates, escalation path | Support portal | 1 day before |
+| Executive team | Change summary, business impact | Email | 1 day before |
+| All users | "Maintenance complete. New SSO features available." | Email, status page | After completion |
+```
+
+
+## Service Documentation
+
+### Service Catalog Entry
+
+```markdown
+# Service: User Authentication API
+
+## Service Overview
+**Service ID**: SVC-0042
+**Service Name**: User Authentication API
+**Description**: Provides authentication and authorization for all company applications. Supports password-based login, OAuth SSO, and API token management.
+
+## Service Owner
+**Team**: Platform Security Team
+**Primary Contact**: security-team@example.com
+**Escalation**: VP Engineering (vp-eng@example.com)
+
+## Service Details
+
+### Purpose
+Enable secure user authentication across all company products.
+
+### Scope
+**Included**:
+- User login (password, OAuth)
+- Session management
+- API token generation/validation
+- MFA enforcement
+
+**Not Included**:
+- User registration (owned by User Management Service)
+- Password reset (owned by User Management Service)
+
+### Users
+- **Primary Users**: All application users (15,000 active users)
+- **Internal Users**: API Gateway, Mobile App, Web App
+
+## Service Level Agreement (SLA)
+
+### Availability
+**Target**: 99.9% uptime
+**Measurement**: Monthly uptime excluding planned maintenance
+**Planned Maintenance**: Max 4 hours/month, communicated 7 days in advance
+
+### Performance
+**Target**: 95th percentile latency <100ms
+**Measurement**: API response time from request to auth token response
+
+### Support Hours
+**P1 (Critical)**: 24/7/365, 15-minute response time
+**P2 (High)**: Business hours (9am-5pm PT), 2-hour response time
+**P3/P4**: Business hours, next business day response
+
+## Operational Level Agreements (OLA)
+
+### Database Team OLA
+**Commitment**: Database backup every 6 hours, 30-day retention
+**Response Time**: 1-hour response for database issues affecting auth service
+
+### Network Team OLA
+**Commitment**: 99.95% network availability for auth service subnet
+**Response Time**: 30-minute response for network issues
+
+## Monitoring and Alerts
+
+**Dashboard**: https://grafana.example.com/d/auth-service
+**Key Metrics**:
+- Uptime: Target 99.9%
+- Latency (p95): Target <100ms
+- Error rate: Target <0.1%
+- Auth success rate: Target >99%
+
+**Alerts**:
+- P1: Service down (5-minute window), latency >500ms, error rate >1%
+- P2: Elevated errors (>0.5%), auth success rate <95%
+
+## Dependencies
+- **Database**: PostgreSQL (SVC-0010)
+- **Cache**: Redis (SVC-0015)
+- **Secrets Management**: Vault (SVC-0020)
+
+## Disaster Recovery
+**RTO**: 1 hour (service restored within 1 hour of total failure)
+**RPO**: 15 minutes (max 15 minutes of data loss acceptable)
+**DR Site**: us-west-2 (failover from us-east-1)
+**Testing**: Quarterly DR drills
+```
+
+
+## Business Continuity Documentation
+
+### Disaster Recovery Plan
+
+```markdown
+# Disaster Recovery Plan: User Authentication Service
+
+## RTO and RPO
+
+**RTO (Recovery Time Objective)**: 1 hour
+- Time to restore service after total regional failure
+
+**RPO (Recovery Point Objective)**: 15 minutes
+- Maximum acceptable data loss (last database backup)
+
+## Disaster Scenarios
+
+### Scenario 1: Complete Regional Outage (AWS us-east-1)
+**Trigger**: All us-east-1 availability zones unavailable
+**Impact**: PRIMARY - Service completely unavailable
+**Recovery**: Failover to us-west-2 (DR region)
+
+### Scenario 2: Database Corruption
+**Trigger**: Database integrity check fails, data corruption detected
+**Impact**: CRITICAL - Service degraded or unavailable
+**Recovery**: Restore from most recent backup (15-minute data loss)
+
+### Scenario 3: Critical Security Incident
+**Trigger**: Confirmed breach, attacker has system access
+**Impact**: CRITICAL - Service must be taken offline immediately
+**Recovery**: Forensic analysis, patch vulnerabilities, restore from known-good state
+
+## Failover Procedure (Scenario 1: Regional Outage)
+
+### Prerequisites
+- DR region (us-west-2) database in standby replication mode
+- DR region compute resources pre-provisioned (scaled to zero, can scale up instantly)
+
+### Failover Steps (45 minutes)
+
+1. **Declare disaster** (5 minutes)
+   - Incident commander confirms regional outage
+   - Notify stakeholders: "Initiating failover to DR region"
+
+2. **Promote DR database to primary** (10 minutes)
+   \```bash
+   # SSH to DR database server
+   ssh db-dr.us-west-2.internal
+
+   # Promote standby to primary
+   sudo -u postgres pg_ctl promote -D /var/lib/postgresql/data
+
+   # Verify promotion
+   psql -c "SELECT pg_is_in_recovery();"  # Should return 'f' (not in recovery = primary)
+   \```
+
+3. **Scale up DR compute** (10 minutes)
+   \```bash
+   # Scale API servers from 0 to 6 replicas
+   kubectl --context us-west-2 scale deployment/auth-api --replicas=6
+
+   # Wait for readiness
+   kubectl --context us-west-2 rollout status deployment/auth-api
+   \```
+
+4. **Update DNS to DR region** (10 minutes)
+   \```bash
+   # Update Route53 to point to DR load balancer
+   aws route53 change-resource-record-sets --hosted-zone-id Z123 --change-batch file://dns-failover.json
+
+   # Verify DNS propagation (may take 5-10 minutes for full propagation)
+   dig auth-api.example.com
+   \```
+
+5. **Verify service health** (5 minutes)
+   - Smoke test: Authenticate test user
+   - Check monitoring dashboard: Error rate, latency
+   - Success criteria: Error rate <1%, latency <200ms
+
+6. **Communicate completion** (5 minutes)
+   - Status page: "Service restored, operating from backup region"
+   - Engineering team: "Failover complete, monitor for issues"
+   - Customer support: "Service operational, escalate any issues to incident channel"
+
+### Failback Procedure (After Primary Region Restored)
+
+**Wait period**: 24 hours of stable DR operation before failback
+
+1. Restore primary region database from DR backup
+2. Establish replication DR → Primary (reverse direction)
+3. During maintenance window: Failover back to primary (same procedure as above)
+
+## Testing Schedule
+
+**DR Drills**: Quarterly (January, April, July, October)
+**Procedure**: Simulate regional failure, execute failover, measure RTO/RPO
+**Success Criteria**: Complete failover within 1-hour RTO, data loss <15 minutes
+```
+
+
+## Cross-References
+
+**Use WITH this skill**:
+- `muna/technical-writer/documentation-structure` - RFC and service docs follow structured patterns
+- `muna/technical-writer/incident-response-documentation` - DR procedures are incident responses
+
+## Real-World Impact
+
+**Governance documentation using these patterns**:
+- **Database Migration RFC**: Change Advisory Board approved in first review (vs 3-review avg) due to comprehensive impact analysis, rollback plan, and testing documentation.
+- **Service Catalog**: SLA documentation enabled proactive capacity planning. When auth service approached 99.9% SLA threshold, automatic scaling prevented SLA breach.
+- **DR Plan**: Quarterly DR drill revealed 2-hour RTO vs documented 1-hour. Plan updated with pre-provisioned DR resources, actual RTO now 45 minutes.
+
+**Key lesson**: **Structured governance documentation (RFC, SLA, DR plans) enables controlled change, accountability, and measurable service delivery.**