# Chain Spec Risk Metrics Template
## Workflow
Copy this checklist and track your progress:
```
Chain Spec Risk Metrics Progress:
- [ ] Step 1: Gather initiative context
- [ ] Step 2: Write comprehensive specification
- [ ] Step 3: Conduct premortem and build risk register
- [ ] Step 4: Define success metrics and instrumentation
- [ ] Step 5: Validate completeness and deliver
```
**Step 1: Gather initiative context** - Collect goal, constraints, stakeholders, baseline, desired outcomes. See [Context Questions](#context-questions).
**Step 2: Write comprehensive specification** - Document scope, approach, requirements, dependencies, timeline. See [Quick Template](#quick-template) and [Specification Guidance](#specification-guidance).
**Step 3: Conduct premortem and build risk register** - Run failure imagination exercise, categorize risks, assign mitigations/owners. See [Premortem Process](#premortem-process) and [Risk Register Template](#risk-register-template).
**Step 4: Define success metrics and instrumentation** - Identify leading/lagging indicators, baselines, targets, counter-metrics. See [Metrics Template](#metrics-template) and [Instrumentation Guidance](#instrumentation-guidance).
**Step 5: Validate completeness and deliver** - Self-check with rubric (≥3.5 average), ensure all components align. See [Quality Checklist](#quality-checklist).
## Quick Template
```markdown
# [Initiative Name]: Spec, Risks, Metrics
## 1. Specification
### 1.1 Executive Summary
- **Goal**: [What are you building/changing and why?]
- **Timeline**: [When does this need to be delivered?]
- **Stakeholders**: [Who cares about this initiative?]
- **Success Criteria**: [What does done look like?]
### 1.2 Current State (Baseline)
- [Describe the current state/system]
- [Key metrics/data points about current state]
- [Pain points or limitations driving this initiative]
### 1.3 Proposed Approach
- **Architecture/Design**: [High-level approach]
- **Implementation Plan**: [Phases, milestones, dependencies]
- **Technology Choices**: [Tools, frameworks, platforms with rationale]
### 1.4 Scope
- **In Scope**: [What this initiative includes]
- **Out of Scope**: [What is explicitly excluded]
- **Future Considerations**: [What might come later]
### 1.5 Requirements
**Functional Requirements:**
- [Feature/capability 1]
- [Feature/capability 2]
- [Feature/capability 3]
**Non-Functional Requirements:**
- **Performance**: [Latency, throughput, scalability targets]
- **Reliability**: [Uptime, error rates, recovery time]
- **Security**: [Authentication, authorization, data protection]
- **Compliance**: [Regulatory, policy, audit requirements]
### 1.6 Dependencies & Assumptions
- **Dependencies**: [External teams, systems, resources needed]
- **Assumptions**: [What we're assuming is true]
- **Constraints**: [Budget, time, resource, technical limitations]
### 1.7 Timeline & Milestones
| Milestone | Date | Deliverable |
|-----------|------|-------------|
| [Phase 1] | [Date] | [What's delivered] |
| [Phase 2] | [Date] | [What's delivered] |
| [Phase 3] | [Date] | [What's delivered] |
## 2. Risk Analysis
### 2.1 Premortem Summary
**Exercise Prompt**: "It's [X months] from now. This initiative failed catastrophically. What went wrong?"
**Top failure modes identified:**
1. [Failure mode 1]
2. [Failure mode 2]
3. [Failure mode 3]
4. [Failure mode 4]
5. [Failure mode 5]
### 2.2 Risk Register
| Risk ID | Risk Description | Category | Likelihood | Impact | Risk Score | Mitigation Strategy | Owner | Status |
|---------|-----------------|----------|------------|--------|-----------|---------------------|-------|--------|
| R1 | [Specific failure scenario] | Technical | High | High | 9 | [How you'll prevent/reduce] | [Name] | Open |
| R2 | [Specific failure scenario] | Operational | Medium | High | 6 | [How you'll prevent/reduce] | [Name] | Open |
| R3 | [Specific failure scenario] | Organizational | Medium | Medium | 4 | [How you'll prevent/reduce] | [Name] | Open |
| R4 | [Specific failure scenario] | External | Low | High | 3 | [How you'll prevent/reduce] | [Name] | Open |
**Risk Scoring**: Low=1, Medium=2, High=3. Risk Score = Likelihood × Impact.
**Risk Categories**:
- **Technical**: Architecture, code quality, infrastructure, performance
- **Operational**: Processes, runbooks, support, operations
- **Organizational**: Resources, skills, alignment, communication
- **External**: Market, vendors, regulation, dependencies
### 2.3 Risk Mitigation Timeline
- **Pre-Launch**: [Risks to mitigate before going live]
- **Launch Window**: [Monitoring and safeguards during launch]
- **Post-Launch**: [Ongoing monitoring and response]
## 3. Success Metrics
### 3.1 Leading Indicators (Early Signals)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Predictive metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
**Examples**: Deployment frequency, code review cycle time, test coverage, incident detection time
### 3.2 Lagging Indicators (Outcome Measures)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Outcome metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
**Examples**: Uptime, user adoption rate, revenue impact, customer satisfaction score
### 3.3 Counter-Metrics (What We Won't Sacrifice)
| Metric | Threshold | Monitoring Method | Escalation Trigger |
|--------|-----------|-------------------|-------------------|
| [Protection metric 1] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 2] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 3] | [Minimum acceptable] | [How monitored] | [When to escalate] |
**Examples**: Code quality (test coverage > 80%), team morale (happiness score > 7/10), security posture (no critical vulnerabilities), user privacy (zero data leaks)
### 3.4 Success Criteria Summary
**Must-haves (hard requirements)**:
- [Critical success criterion 1]
- [Critical success criterion 2]
- [Critical success criterion 3]
**Should-haves (target achievements)**:
- [Desired outcome 1]
- [Desired outcome 2]
- [Desired outcome 3]
**Nice-to-haves (stretch goals)**:
- [Aspirational outcome 1]
- [Aspirational outcome 2]
```
## Context Questions
**Basics**: What are you building/changing? Why now? Who are the stakeholders?
**Constraints**: When does this need to be delivered? What's the budget? What resources are available?
**Current State**: What exists today? What's the baseline? What are the pain points?
**Desired Outcomes**: What does success look like? What metrics would you track? What are you most worried about?
**Scope**: Greenfield/migration/enhancement? Single or multi-phase? Who is the primary user?
## Specification Guidance
### Scope Definition
**In Scope** should be specific and concrete:
- ✅ "Migrate user authentication service to OAuth 2.0 with PKCE"
- ❌ "Improve authentication"
**Out of Scope** prevents scope creep:
- List explicitly what won't be included in this phase
- Reference future work that might address excluded items
- Example: "Out of Scope: Social login (Google/GitHub) - deferred to Phase 2"
### Requirements Best Practices
**Functional Requirements**: What the system must do
- Use "must" for requirements, "should" for preferences
- Be specific: "System must handle 10K requests/sec" not "System should be fast"
- Include acceptance criteria: How will you verify this requirement is met?
**Non-Functional Requirements**: How well the system must perform
- **Performance**: Quantify with numbers (latency < 200ms, throughput > 1000 req/s)
- **Reliability**: Define uptime SLAs, error budgets, MTTR targets
- **Security**: Specify authentication, authorization, encryption requirements
- **Scalability**: Define growth expectations (users, data, traffic)
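Quantified NFRs can double as acceptance checks. The sketch below (Python) restates the "p99 latency < 200ms" example as an executable test; the function names, sample values, and 200 ms default are illustrative assumptions, not part of this template:
```python
import math

def p99(samples_ms: list[float]) -> float:
    """Nearest-rank 99th percentile of latency samples, in milliseconds."""
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(0.99 * len(ordered)))
    return ordered[rank - 1]

def meets_latency_nfr(samples_ms: list[float], target_ms: float = 200.0) -> bool:
    """Acceptance check for 'p99 latency < 200ms' (target_ms is an assumed example)."""
    return p99(samples_ms) < target_ms

# Illustrative measurements only; real checks would pull from your telemetry.
measured = [120.0, 150.0, 180.0, 190.0, 210.0, 95.0, 160.0]
print(p99(measured), meets_latency_nfr(measured))  # 210.0 False
```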
### Timeline & Milestones
Make milestones concrete and verifiable:
- ✅ "Milestone 1 (Mar 15): Auth service deployed to staging, 100% auth tests passing"
- ❌ "Milestone 1: Complete most of auth work"
Include buffer for unknowns:
- 80% planned work, 20% buffer for issues/tech debt
- Identify critical path and dependencies clearly
## Premortem Process
### Step 1: Set the Scene
Frame the failure scenario clearly:
> "It's [6/12/24] months from now - [Date]. We launched [initiative name] and it failed catastrophically. [Stakeholders] are upset. The team is demoralized. What went wrong?"
Choose timeframe based on initiative:
- Quick launch: 3-6 months
- Major migration: 12-24 months
- Strategic change: 24+ months
### Step 2: Brainstorm Failures (Independent)
Have each stakeholder privately write 3-5 specific failure modes:
- Be specific: "Database migration lost 10K user records" not "data issues"
- Think from your domain: engineers focus on technical failures, PMs on product failures, Ops on operational failures
- No filtering: List even unlikely scenarios
### Step 3: Share and Cluster
Collect all failure modes and group similar ones:
- **Technical failures**: System design, code bugs, infrastructure
- **Operational failures**: Runbooks missing, wrong escalation, poor monitoring
- **Organizational failures**: Lack of skills, poor communication, misaligned incentives
- **External failures**: Vendor issues, market changes, regulatory changes
### Step 4: Vote and Prioritize
For each cluster, assess:
- **Likelihood**: How probable is this? (Low 10-30%, Medium 30-60%, High 60%+)
- **Impact**: How bad would it be? (Low = annoying, Medium = painful, High = catastrophic)
- **Risk Score**: Likelihood × Impact (1-9 scale)
Focus mitigation on High-risk items (score 6-9).
### Step 5: Convert to Risk Register
For each significant failure mode:
1. Reframe as a risk: "Risk that [specific scenario] happens"
2. Categorize (Technical/Operational/Organizational/External)
3. Assign owner (who monitors and responds)
4. Define mitigation (how to reduce likelihood or impact)
5. Track status (Open/Mitigated/Accepted/Closed)
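If the register lives in code or is exported from a tracker, the five steps above map onto a small data structure. A minimal Python sketch using the Low=1/Medium=2/High=3 scale from Step 4; the `Risk` class, field names, and example owners are illustrative assumptions, not prescribed by this template:
```python
from dataclasses import dataclass

LEVELS = {"Low": 1, "Medium": 2, "High": 3}

@dataclass
class Risk:
    risk_id: str
    description: str   # "Risk that [specific scenario] happens"
    category: str      # Technical / Operational / Organizational / External
    likelihood: str    # Low / Medium / High
    impact: str        # Low / Medium / High
    mitigation: str
    owner: str
    status: str = "Open"   # Open / Mitigated / Accepted / Closed

    @property
    def score(self) -> int:
        # Risk Score = Likelihood x Impact, on the 1-9 scale from Step 4.
        return LEVELS[self.likelihood] * LEVELS[self.impact]

register = [
    Risk("R1", "Risk that the migration script drops FK relationships",
         "Technical", "High", "High",
         "Dry-run migration on a production snapshot", "Dana"),
    Risk("R4", "Risk that the vendor API rate-limits during launch",
         "External", "Low", "High",
         "Accept: low likelihood, workaround available", "Sam", "Accepted"),
]

# Prioritize by score, highest first.
for risk in sorted(register, key=lambda r: r.score, reverse=True):
    print(f"{risk.risk_id} ({risk.category}) score={risk.score} "
          f"owner={risk.owner} [{risk.status}]")
```
Sorting by score reproduces the prioritization in the Risk Register table: anything scoring 6-9 needs a detailed mitigation plan and an owner before launch.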
## Risk Register Template
### Risk Statement Format
Good risk statements are specific and measurable:
- ✅ "Risk that database migration script fails to preserve foreign key relationships, causing data integrity errors in 15% of records"
- ❌ "Risk of data issues"
### Mitigation Strategies
**Reduce Likelihood** (prevent the risk):
- Example: "Implement dry-run migration on production snapshot; verify all FK relationships before live migration"
**Reduce Impact** (limit the damage):
- Example: "Create rollback script tested on staging; set up real-time monitoring for FK violations; keep read replica as backup"
**Accept Risk** (consciously choose not to mitigate):
- For low-impact or very-low-likelihood risks
- Document why it's acceptable: "Accept risk of 3rd-party API rate limiting during launch (low likelihood, workaround available)"
**Transfer Risk** (shift to vendor/insurance):
- Example: "Use managed database service with automated backups and point-in-time recovery (transfers operational risk to vendor)"
### Risk Owners
Each risk needs a clear owner who:
- Monitors early warning signals
- Executes mitigation plans
- Escalates if risk materializes
- Updates risk status regularly
The owner is not necessarily the person who fixes it, but the person accountable for tracking it.
## Metrics Template
### Leading Indicators (Early Signals)
These predict future success before lagging metrics move:
- **Engineering**: Deployment frequency, build time, code review cycle time, test coverage
- **Product**: Feature adoption rate, activation rate, time-to-value, engagement trends
- **Operations**: Incident detection time (MTTD), runbook execution rate, alert accuracy
Choose 3-5 leading indicators that:
1. Predict lagging outcomes (validated correlation)
2. Can be measured frequently (daily/weekly)
3. You can act on quickly (short feedback loop)
### Lagging Indicators (Outcomes)
These measure actual success but appear later:
- **Reliability**: Uptime, error rate, MTTR, SLA compliance
- **Performance**: p50/p95/p99 latency, throughput, response time
- **Business**: Revenue, user growth, retention, NPS, cost savings
- **User**: Activation rate, feature adoption, satisfaction score
Choose 3-5 lagging indicators that:
1. Directly measure initiative goals
2. Are measurable and objective
3. Stakeholders care about
### Counter-Metrics (Guardrails)
What success does NOT mean sacrificing:
- If optimizing for speed → Counter-metric: Code quality (test coverage, bug rate)
- If optimizing for growth → Counter-metric: Costs (infrastructure spend, CAC)
- If optimizing for features → Counter-metric: Technical debt (cycle time, deployment frequency)
Choose 2-3 counter-metrics to:
1. Prevent gaming the system
2. Protect long-term sustainability
3. Maintain team/user trust
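Counter-metrics only protect you if a breach gets noticed. A minimal Python sketch of a guardrail check, reusing the coverage and morale examples above; the metric names, floors, and escalation targets are illustrative assumptions:
```python
# Illustrative guardrail check: escalate when a counter-metric falls below
# its floor. Names, floors, and escalation targets are assumed examples.
GUARDRAILS = {
    # metric name: (minimum acceptable value, who to escalate to)
    "test_coverage_pct": (80.0, "eng-lead"),
    "team_happiness_score": (7.0, "manager"),
}

def check_guardrails(current: dict[str, float]) -> list[str]:
    """Return escalation messages for any counter-metric below its floor."""
    alerts = []
    for name, (floor, escalate_to) in GUARDRAILS.items():
        value = current.get(name)
        if value is not None and value < floor:
            alerts.append(f"{name}={value} below floor {floor}; escalate to {escalate_to}")
    return alerts

print(check_guardrails({"test_coverage_pct": 76.0, "team_happiness_score": 8.1}))
```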
## Instrumentation Guidance
**Baseline**: Measure current state before launch (✅ "p99 latency: 500ms" ❌ "API is slow"). Without a baseline, you can't measure improvement.
**Targets**: Make them specific ("reduce p99 from 500ms to 200ms"), achievable (grounded in industry benchmarks), time-bound ("by Q2 end"), meaningful (tied to outcomes).
**Measurement**: Document data source, calculation method, measurement frequency, and who can access the metrics.
**Tracking Cadence**: Real-time (system health), daily (operations), weekly (product), monthly (business).
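Keeping baseline, target, source, and cadence in one record prevents them from drifting apart. A minimal Python sketch of such a definition, reusing the p99 latency example above; the `MetricSpec` fields, owner, and data-source strings are illustrative assumptions:
```python
# Illustrative metric definition capturing the instrumentation fields above.
from dataclasses import dataclass

@dataclass
class MetricSpec:
    name: str
    baseline: float   # measured before launch, e.g. p99 latency in ms
    target: float     # specific, time-bound goal value
    source: str       # where the data comes from
    method: str       # how the value is calculated
    cadence: str      # real-time / daily / weekly / monthly
    owner: str

    def progress(self, current: float) -> float:
        """Fraction of the baseline-to-target gap closed so far."""
        gap = self.target - self.baseline
        return 1.0 if gap == 0 else (current - self.baseline) / gap

p99_latency = MetricSpec(
    name="checkout_p99_latency_ms",
    baseline=500.0, target=200.0,
    source="APM traces", method="p99 over 5-minute windows",
    cadence="real-time", owner="Dana",
)
print(f"{p99_latency.progress(380.0):.0%} of the way to target")  # 40%
```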
## Quality Checklist
Before delivering, verify:
**Specification Complete:**
- [ ] Goal, stakeholders, timeline clearly stated
- [ ] Current state baseline documented with data
- [ ] Scope (in/out/future) explicitly defined
- [ ] Requirements are specific and measurable
- [ ] Dependencies and assumptions listed
- [ ] Timeline has concrete milestones
**Risks Comprehensive:**
- [ ] Premortem conducted (failure imagination exercise)
- [ ] Risks span technical, operational, organizational, external
- [ ] Each risk has likelihood, impact, score
- [ ] High-risk items (6-9 score) have detailed mitigation plans
- [ ] Every risk has an assigned owner
- [ ] Risk register is prioritized by score
**Metrics Measurable:**
- [ ] 3-5 leading indicators (early signals) defined
- [ ] 3-5 lagging indicators (outcomes) defined
- [ ] 2-3 counter-metrics (guardrails) defined
- [ ] Each metric has baseline, target, measurement method
- [ ] Metrics have clear owners and tracking cadence
- [ ] Success criteria (must/should/nice-to-have) documented
**Integration:**
- [ ] Risks map to specs (e.g., technical risks tied to architecture choices)
- [ ] Metrics validate risk mitigations (e.g., measure if mitigation worked)
- [ ] Specs enable metrics (e.g., instrumentation built into design)
- [ ] All three components tell a coherent story
**Rubric Validation:**
- [ ] Self-assessed with rubric ≥ 3.5 average across all criteria
- [ ] Stakeholders can act on this artifact
- [ ] Gaps and unknowns explicitly acknowledged