# Chain Spec Risk Metrics Template

## Workflow

Copy this checklist and track your progress:

```
Chain Spec Risk Metrics Progress:
- [ ] Step 1: Gather initiative context
- [ ] Step 2: Write comprehensive specification
- [ ] Step 3: Conduct premortem and build risk register
- [ ] Step 4: Define success metrics and instrumentation
- [ ] Step 5: Validate completeness and deliver
```

**Step 1: Gather initiative context** - Collect goal, constraints, stakeholders, baseline, desired outcomes. See [Context Questions](#context-questions).

**Step 2: Write comprehensive specification** - Document scope, approach, requirements, dependencies, timeline. See [Quick Template](#quick-template) and [Specification Guidance](#specification-guidance).

**Step 3: Conduct premortem and build risk register** - Run the failure imagination exercise, categorize risks, assign mitigations and owners. See [Premortem Process](#premortem-process) and [Risk Register Template](#risk-register-template).

**Step 4: Define success metrics and instrumentation** - Identify leading/lagging indicators, baselines, targets, counter-metrics. See [Metrics Template](#metrics-template) and [Instrumentation Guidance](#instrumentation-guidance).

**Step 5: Validate completeness and deliver** - Self-check with the rubric (≥3.5 average) and ensure all components align. See [Quality Checklist](#quality-checklist).

## Quick Template

```markdown
# [Initiative Name]: Spec, Risks, Metrics

## 1. Specification

### 1.1 Executive Summary
- **Goal**: [What are you building/changing and why?]
- **Timeline**: [When does this need to be delivered?]
- **Stakeholders**: [Who cares about this initiative?]
- **Success Criteria**: [What does done look like?]

### 1.2 Current State (Baseline)
- [Describe the current state/system]
- [Key metrics/data points about current state]
- [Pain points or limitations driving this initiative]

### 1.3 Proposed Approach
- **Architecture/Design**: [High-level approach]
- **Implementation Plan**: [Phases, milestones, dependencies]
- **Technology Choices**: [Tools, frameworks, platforms with rationale]

### 1.4 Scope
- **In Scope**: [What this initiative includes]
- **Out of Scope**: [What is explicitly excluded]
- **Future Considerations**: [What might come later]

### 1.5 Requirements

**Functional Requirements:**
- [Feature/capability 1]
- [Feature/capability 2]
- [Feature/capability 3]

**Non-Functional Requirements:**
- **Performance**: [Latency, throughput, scalability targets]
- **Reliability**: [Uptime, error rates, recovery time]
- **Security**: [Authentication, authorization, data protection]
- **Compliance**: [Regulatory, policy, audit requirements]

### 1.6 Dependencies & Assumptions
- **Dependencies**: [External teams, systems, resources needed]
- **Assumptions**: [What we're assuming is true]
- **Constraints**: [Budget, time, resource, technical limitations]

### 1.7 Timeline & Milestones

| Milestone | Date | Deliverable |
|-----------|------|-------------|
| [Phase 1] | [Date] | [What's delivered] |
| [Phase 2] | [Date] | [What's delivered] |
| [Phase 3] | [Date] | [What's delivered] |

## 2. Risk Analysis

### 2.1 Premortem Summary

**Exercise Prompt**: "It's [X months] from now. This initiative failed catastrophically. What went wrong?"

**Top failure modes identified:**
1. [Failure mode 1]
2. [Failure mode 2]
3. [Failure mode 3]
4. [Failure mode 4]
5. [Failure mode 5]

### 2.2 Risk Register

| Risk ID | Risk Description | Category | Likelihood | Impact | Risk Score | Mitigation Strategy | Owner | Status |
|---------|-----------------|----------|------------|--------|-----------|---------------------|-------|--------|
| R1 | [Specific failure scenario] | Technical | High | High | 9 | [How you'll prevent/reduce] | [Name] | Open |
| R2 | [Specific failure scenario] | Operational | Medium | High | 6 | [How you'll prevent/reduce] | [Name] | Open |
| R3 | [Specific failure scenario] | Organizational | Medium | Medium | 4 | [How you'll prevent/reduce] | [Name] | Open |
| R4 | [Specific failure scenario] | External | Low | High | 3 | [How you'll prevent/reduce] | [Name] | Open |

**Risk Scoring**: Low=1, Medium=2, High=3. Risk Score = Likelihood × Impact.

**Risk Categories**:
- **Technical**: Architecture, code quality, infrastructure, performance
- **Operational**: Processes, runbooks, support, operations
- **Organizational**: Resources, skills, alignment, communication
- **External**: Market, vendors, regulation, dependencies

### 2.3 Risk Mitigation Timeline
- **Pre-Launch**: [Risks to mitigate before going live]
- **Launch Window**: [Monitoring and safeguards during launch]
- **Post-Launch**: [Ongoing monitoring and response]

## 3. Success Metrics

### 3.1 Leading Indicators (Early Signals)

| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Predictive metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |

**Examples**: Deployment frequency, code review cycle time, test coverage, incident detection time

### 3.2 Lagging Indicators (Outcome Measures)

| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Outcome metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |

**Examples**: Uptime, user adoption rate, revenue impact, customer satisfaction score

### 3.3 Counter-Metrics (What We Won't Sacrifice)

| Metric | Threshold | Monitoring Method | Escalation Trigger |
|--------|-----------|-------------------|-------------------|
| [Protection metric 1] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 2] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 3] | [Minimum acceptable] | [How monitored] | [When to escalate] |

**Examples**: Code quality (test coverage > 80%), team morale (happiness score > 7/10), security posture (no critical vulnerabilities), user privacy (zero data leaks)

### 3.4 Success Criteria Summary

**Must-haves (hard requirements)**:
- [Critical success criterion 1]
- [Critical success criterion 2]
- [Critical success criterion 3]

**Should-haves (target achievements)**:
- [Desired outcome 1]
- [Desired outcome 2]
- [Desired outcome 3]

**Nice-to-haves (stretch goals)**:
- [Aspirational outcome 1]
- [Aspirational outcome 2]
```
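If you track the register in code rather than a spreadsheet, the scoring rule from section 2.2 is easy to automate. A minimal sketch in Python, assuming risks are kept as structured records; the `Risk` class and sample rows are illustrative, not part of the template:

```python
from dataclasses import dataclass

# Ordinal values from section 2.2: Low=1, Medium=2, High=3.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}

@dataclass
class Risk:
    risk_id: str
    description: str
    category: str    # Technical / Operational / Organizational / External
    likelihood: str  # Low / Medium / High
    impact: str      # Low / Medium / High
    owner: str
    status: str = "Open"

    @property
    def score(self) -> int:
        # Risk Score = Likelihood × Impact (1-9 scale).
        return LEVELS[self.likelihood] * LEVELS[self.impact]

# Illustrative entries; real descriptions should be specific failure scenarios.
register = [
    Risk("R1", "Migration loses FK relationships", "Technical", "High", "High", "Dana"),
    Risk("R4", "Vendor deprecates v1 API mid-launch", "External", "Low", "High", "Sam"),
]

# Highest-scoring risks first; scores of 6-9 need detailed mitigation plans.
for r in sorted(register, key=lambda r: r.score, reverse=True):
    print(f"{r.risk_id} score={r.score} [{r.category}] {r.description} -> {r.owner}")
```

Keeping the register as data also makes it trivial to regenerate the markdown table in section 2.2 and to diff risk status between reviews.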
## Context Questions

- **Basics**: What are you building/changing? Why now? Who are the stakeholders?
- **Constraints**: When does this need to be delivered? What's the budget? What resources are available?
- **Current State**: What exists today? What's the baseline? What are the pain points?
- **Desired Outcomes**: What does success look like? What metrics would you track? What are you most worried about?
- **Scope**: Greenfield/migration/enhancement? Single or multi-phase? Who is the primary user?

## Specification Guidance

### Scope Definition

**In Scope** should be specific and concrete:
- ✅ "Migrate user authentication service to OAuth 2.0 with PKCE"
- ❌ "Improve authentication"

**Out of Scope** prevents scope creep:
- List explicitly what won't be included in this phase
- Reference future work that might address excluded items
- Example: "Out of Scope: Social login (Google/GitHub) - deferred to Phase 2"

### Requirements Best Practices

**Functional Requirements**: What the system must do
- Use "must" for requirements, "should" for preferences
- Be specific: "System must handle 10K requests/sec" not "System should be fast"
- Include acceptance criteria: How will you verify this requirement is met?

**Non-Functional Requirements**: How well the system must perform
- **Performance**: Quantify with numbers (latency < 200ms, throughput > 1000 req/s)
- **Reliability**: Define uptime SLAs, error budgets, MTTR targets
- **Security**: Specify authentication, authorization, encryption requirements
- **Scalability**: Define growth expectations (users, data, traffic)

### Timeline & Milestones

Make milestones concrete and verifiable:
- ✅ "Milestone 1 (Mar 15): Auth service deployed to staging, 100% auth tests passing"
- ❌ "Milestone 1: Complete most of auth work"

Include buffer for unknowns (see the sketch below):
- 80% planned work, 20% buffer for issues/tech debt
- Identify critical path and dependencies clearly
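The 80/20 split is simple arithmetic, but it is easy to apply backwards: if planned work should fill 80% of the allocation, the total is planned effort divided by 0.8, not planned effort plus 20% of itself. A minimal sketch, assuming effort is tracked in engineer-days; the numbers are illustrative:

```python
def buffered_estimate(planned_days: float, planned_fraction: float = 0.8) -> float:
    """Total allocation such that planned work fills `planned_fraction` of it."""
    return planned_days / planned_fraction

# 16 planned engineer-days -> allocate 20 days total (4 days of buffer).
# Note 16 * 1.2 = 19.2 would leave only a 16.7% buffer, not 20%.
total = buffered_estimate(16)
print(f"allocate {total:.0f} days ({total - 16:.0f} days buffer)")
```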
## Premortem Process

### Step 1: Set the Scene

Frame the failure scenario clearly:

> "It's [6/12/24] months from now - [Date]. We launched [initiative name] and it failed catastrophically. [Stakeholders] are upset. The team is demoralized. What went wrong?"

Choose the timeframe based on the initiative:
- Quick launch: 3-6 months
- Major migration: 12-24 months
- Strategic change: 24+ months

### Step 2: Brainstorm Failures (Independent)

Have each stakeholder privately write 3-5 specific failure modes:
- Be specific: "Database migration lost 10K user records" not "data issues"
- Think from your domain: Engineers focus on technical, PMs on product, Ops on operational
- No filtering: List even unlikely scenarios

### Step 3: Share and Cluster

Collect all failure modes and group similar ones:
- **Technical failures**: System design, code bugs, infrastructure
- **Operational failures**: Runbooks missing, wrong escalation, poor monitoring
- **Organizational failures**: Lack of skills, poor communication, misaligned incentives
- **External failures**: Vendor issues, market changes, regulatory changes

### Step 4: Vote and Prioritize

For each cluster, assess:
- **Likelihood**: How probable is this? (Low 10-30%, Medium 30-60%, High 60%+)
- **Impact**: How bad would it be? (Low = annoying, Medium = painful, High = catastrophic)
- **Risk Score**: Likelihood × Impact (1-9 scale)

Focus mitigation on high-risk items (score 6-9).

### Step 5: Convert to Risk Register

For each significant failure mode:
1. Reframe as a risk: "Risk that [specific scenario] happens"
2. Categorize (Technical/Operational/Organizational/External)
3. Assign an owner (who monitors and responds)
4. Define mitigation (how to reduce likelihood or impact)
5. Track status (Open/Mitigated/Accepted/Closed)
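Steps 3 and 4 largely reduce to tallying and ranking, so they can be run mechanically over the raw brainstorm output. A minimal sketch, assuming each submitted failure mode has already been hand-labeled with a cluster name during Step 3; the labels and entries are illustrative:

```python
from collections import Counter

# (cluster label, specific failure mode) pairs gathered in Steps 2-3.
submissions = [
    ("Technical", "Migration script drops FK constraints"),
    ("Technical", "Replica lag corrupts read-after-write flows"),
    ("Operational", "No runbook for a failed cutover"),
    ("Technical", "Load test misses a p99 regression"),
    ("External", "Vendor rate-limits the bulk export API"),
]

# Step 3: group by cluster and count how often each theme was raised.
votes = Counter(label for label, _ in submissions)

# Step 4: review clusters in order of how many stakeholders raised them;
# frequently raised themes are candidates for High likelihood ratings.
for label, count in votes.most_common():
    print(f"{label}: raised {count}x")
    for lbl, mode in submissions:
        if lbl == label:
            print(f"  - {mode}")
```

How often a theme was raised is only a rough proxy for likelihood; the final Low/Medium/High ratings should still come from the group discussion in Step 4.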
## Risk Register Template

### Risk Statement Format

Good risk statements are specific and measurable:
- ✅ "Risk that database migration script fails to preserve foreign key relationships, causing data integrity errors in 15% of records"
- ❌ "Risk of data issues"

### Mitigation Strategies

**Reduce Likelihood** (prevent the risk):
- Example: "Implement dry-run migration on production snapshot; verify all FK relationships before live migration"

**Reduce Impact** (limit the damage):
- Example: "Create rollback script tested on staging; set up real-time monitoring for FK violations; keep read replica as backup"

**Accept Risk** (consciously choose not to mitigate):
- For low-impact or very-low-likelihood risks
- Document why it's acceptable: "Accept risk of third-party API rate limiting during launch (low likelihood, workaround available)"

**Transfer Risk** (shift to vendor/insurance):
- Example: "Use managed database service with automated backups and point-in-time recovery (transfers operational risk to vendor)"

### Risk Owners

Each risk needs a clear owner who:
- Monitors early warning signals
- Executes mitigation plans
- Escalates if the risk materializes
- Updates risk status regularly

The owner is not necessarily the person who fixes the risk, but the person accountable for tracking it.

## Metrics Template

### Leading Indicators (Early Signals)

These predict future success before lagging metrics move:
- **Engineering**: Deployment frequency, build time, code review cycle time, test coverage
- **Product**: Feature adoption rate, activation rate, time-to-value, engagement trends
- **Operations**: Incident detection time (MTTD), runbook execution rate, alert accuracy

Choose 3-5 leading indicators that:
1. Predict lagging outcomes (validated correlation)
2. Can be measured frequently (daily/weekly)
3. Can be acted on quickly (short feedback loop)

### Lagging Indicators (Outcomes)

These measure actual success but appear later:
- **Reliability**: Uptime, error rate, MTTR, SLA compliance
- **Performance**: p50/p95/p99 latency, throughput, response time
- **Business**: Revenue, user growth, retention, NPS, cost savings
- **User**: Activation rate, feature adoption, satisfaction score

Choose 3-5 lagging indicators that:
1. Directly measure initiative goals
2. Are measurable and objective
3. Matter to stakeholders

### Counter-Metrics (Guardrails)

What success does NOT mean sacrificing:
- If optimizing for speed → Counter-metric: Code quality (test coverage, bug rate)
- If optimizing for growth → Counter-metric: Costs (infrastructure spend, CAC)
- If optimizing for features → Counter-metric: Technical debt (cycle time, deployment frequency)

Choose 2-3 counter-metrics to:
1. Prevent gaming the system
2. Protect long-term sustainability
3. Maintain team/user trust

## Instrumentation Guidance

**Baseline**: Measure the current state before launch (✅ "p99 latency: 500ms" ❌ "API is slow"). Without a baseline, you can't measure improvement.

**Targets**: Make them specific ("reduce p99 from 500ms to 200ms"), achievable (grounded in industry benchmarks), time-bound ("by Q2 end"), and meaningful (tied to outcomes).

**Measurement**: Document the data source, calculation method, measurement frequency, and who can access the metrics.

**Tracking Cadence**: Real-time (system health), daily (operations), weekly (product), monthly (business).
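Counter-metrics only protect anything if something actually checks them. A minimal guardrail sketch using the example thresholds above; the metric names and current values are illustrative, and in practice the snapshot would come from your monitoring system rather than a hard-coded dict:

```python
# (metric, comparison, threshold) triples from the counter-metrics table.
GUARDRAILS = [
    ("test_coverage_pct", "min", 80.0),      # code quality floor
    ("team_happiness", "min", 7.0),          # morale floor (1-10 survey)
    ("critical_vulnerabilities", "max", 0),  # security posture ceiling
]

def check_guardrails(current: dict[str, float]) -> list[str]:
    """Return escalation messages for any counter-metric past its threshold."""
    breaches = []
    for metric, kind, threshold in GUARDRAILS:
        value = current[metric]
        if (kind == "min" and value < threshold) or (kind == "max" and value > threshold):
            breaches.append(f"ESCALATE: {metric}={value} breaches {kind} threshold {threshold}")
    return breaches

# Illustrative snapshot; real values would come from dashboards or exporters.
snapshot = {"test_coverage_pct": 76.0, "team_happiness": 7.8, "critical_vulnerabilities": 0}
for msg in check_guardrails(snapshot):
    print(msg)
```

Wiring a check like this into CI or a scheduled job turns the escalation-trigger column of the counter-metrics table into something enforced rather than aspirational.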
## Quality Checklist

Before delivering, verify:

**Specification Complete:**
- [ ] Goal, stakeholders, timeline clearly stated
- [ ] Current state baseline documented with data
- [ ] Scope (in/out/future) explicitly defined
- [ ] Requirements are specific and measurable
- [ ] Dependencies and assumptions listed
- [ ] Timeline has concrete milestones

**Risks Comprehensive:**
- [ ] Premortem conducted (failure imagination exercise)
- [ ] Risks span technical, operational, organizational, external
- [ ] Each risk has likelihood, impact, score
- [ ] High-risk items (score 6-9) have detailed mitigation plans
- [ ] Every risk has an assigned owner
- [ ] Risk register is prioritized by score

**Metrics Measurable:**
- [ ] 3-5 leading indicators (early signals) defined
- [ ] 3-5 lagging indicators (outcomes) defined
- [ ] 2-3 counter-metrics (guardrails) defined
- [ ] Each metric has baseline, target, measurement method
- [ ] Metrics have clear owners and tracking cadence
- [ ] Success criteria (must/should/nice-to-have) documented

**Integration:**
- [ ] Risks map to specs (e.g., technical risks tied to architecture choices)
- [ ] Metrics validate risk mitigations (e.g., measure whether a mitigation worked)
- [ ] Specs enable metrics (e.g., instrumentation built into the design)
- [ ] All three components tell a coherent story

**Rubric Validation:**
- [ ] Self-assessed with rubric ≥ 3.5 average across all criteria
- [ ] Stakeholders can act on this artifact
- [ ] Gaps and unknowns explicitly acknowledged
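The ≥3.5 rubric gate is a plain average, but writing it down removes any ambiguity about what passes. A minimal sketch, assuming a 1-5 score per criterion; the criterion names are illustrative, since the rubric itself is defined outside this template:

```python
# Illustrative self-assessment on a 1-5 scale per rubric criterion.
scores = {
    "specification_completeness": 4,
    "risk_coverage": 4,
    "metric_measurability": 3,
    "integration_coherence": 4,
}

average = sum(scores.values()) / len(scores)
print(f"rubric average: {average:.2f}")
if average >= 3.5:
    print("PASS: ready to deliver")
else:
    # Name the weakest criterion so revision effort is targeted.
    weakest = min(scores, key=scores.get)
    print(f"REVISE: below the 3.5 gate; start with '{weakest}'")
```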