# Chain Spec Risk Metrics Template

## Workflow

Copy this checklist and track your progress:

```
Chain Spec Risk Metrics Progress:
- [ ] Step 1: Gather initiative context
- [ ] Step 2: Write comprehensive specification
- [ ] Step 3: Conduct premortem and build risk register
- [ ] Step 4: Define success metrics and instrumentation
- [ ] Step 5: Validate completeness and deliver
```

Step 1: Gather initiative context - Collect goal, constraints, stakeholders, baseline, desired outcomes. See Context Questions.

Step 2: Write comprehensive specification - Document scope, approach, requirements, dependencies, timeline. See Quick Template and Specification Guidance.

Step 3: Conduct premortem and build risk register - Run failure imagination exercise, categorize risks, assign mitigations/owners. See Premortem Process and Risk Register Template.

Step 4: Define success metrics - Identify leading/lagging indicators, baselines, targets, counter-metrics. See Metrics Template and Instrumentation Guidance.

Step 5: Validate and deliver - Self-check with rubric (≥3.5 average), ensure all components align. See Quality Checklist.

## Quick Template

```markdown
# [Initiative Name]: Spec, Risks, Metrics

## 1. Specification

### 1.1 Executive Summary
- **Goal**: [What are you building/changing and why?]
- **Timeline**: [When does this need to be delivered?]
- **Stakeholders**: [Who cares about this initiative?]
- **Success Criteria**: [What does done look like?]

### 1.2 Current State (Baseline)
- [Describe the current state/system]
- [Key metrics/data points about current state]
- [Pain points or limitations driving this initiative]

### 1.3 Proposed Approach
- **Architecture/Design**: [High-level approach]
- **Implementation Plan**: [Phases, milestones, dependencies]
- **Technology Choices**: [Tools, frameworks, platforms with rationale]

### 1.4 Scope
- **In Scope**: [What this initiative includes]
- **Out of Scope**: [What is explicitly excluded]
- **Future Considerations**: [What might come later]

### 1.5 Requirements
**Functional Requirements:**
- [Feature/capability 1]
- [Feature/capability 2]
- [Feature/capability 3]

**Non-Functional Requirements:**
- **Performance**: [Latency, throughput, scalability targets]
- **Reliability**: [Uptime, error rates, recovery time]
- **Security**: [Authentication, authorization, data protection]
- **Compliance**: [Regulatory, policy, audit requirements]

### 1.6 Dependencies & Assumptions
- **Dependencies**: [External teams, systems, resources needed]
- **Assumptions**: [What we're assuming is true]
- **Constraints**: [Budget, time, resource, technical limitations]

### 1.7 Timeline & Milestones
| Milestone | Date | Deliverable |
|-----------|------|-------------|
| [Phase 1] | [Date] | [What's delivered] |
| [Phase 2] | [Date] | [What's delivered] |
| [Phase 3] | [Date] | [What's delivered] |

## 2. Risk Analysis

### 2.1 Premortem Summary
**Exercise Prompt**: "It's [X months] from now. This initiative failed catastrophically. What went wrong?"

**Top failure modes identified:**
1. [Failure mode 1]
2. [Failure mode 2]
3. [Failure mode 3]
4. [Failure mode 4]
5. [Failure mode 5]

### 2.2 Risk Register

| Risk ID | Risk Description | Category | Likelihood | Impact | Risk Score | Mitigation Strategy | Owner | Status |
|---------|-----------------|----------|------------|--------|-----------|---------------------|-------|--------|
| R1 | [Specific failure scenario] | Technical | High | High | 9 | [How you'll prevent/reduce] | [Name] | Open |
| R2 | [Specific failure scenario] | Operational | Medium | High | 6 | [How you'll prevent/reduce] | [Name] | Open |
| R3 | [Specific failure scenario] | Organizational | Medium | Medium | 4 | [How you'll prevent/reduce] | [Name] | Open |
| R4 | [Specific failure scenario] | External | Low | High | 3 | [How you'll prevent/reduce] | [Name] | Open |

**Risk Scoring**: Low=1, Medium=2, High=3. Risk Score = Likelihood × Impact.

**Risk Categories**:
- **Technical**: Architecture, code quality, infrastructure, performance
- **Operational**: Processes, runbooks, support, operations
- **Organizational**: Resources, skills, alignment, communication
- **External**: Market, vendors, regulation, dependencies

### 2.3 Risk Mitigation Timeline
- **Pre-Launch**: [Risks to mitigate before going live]
- **Launch Window**: [Monitoring and safeguards during launch]
- **Post-Launch**: [Ongoing monitoring and response]

## 3. Success Metrics

### 3.1 Leading Indicators (Early Signals)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Predictive metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |

**Examples**: Deployment frequency, code review cycle time, test coverage, incident detection time

### 3.2 Lagging Indicators (Outcome Measures)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Outcome metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |

**Examples**: Uptime, user adoption rate, revenue impact, customer satisfaction score

### 3.3 Counter-Metrics (What We Won't Sacrifice)
| Metric | Threshold | Monitoring Method | Escalation Trigger |
|--------|-----------|-------------------|-------------------|
| [Protection metric 1] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 2] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 3] | [Minimum acceptable] | [How monitored] | [When to escalate] |

**Examples**: Code quality (test coverage > 80%), team morale (happiness score > 7/10), security posture (no critical vulnerabilities), user privacy (zero data leaks)

### 3.4 Success Criteria Summary
**Must-haves (hard requirements)**:
- [Critical success criterion 1]
- [Critical success criterion 2]
- [Critical success criterion 3]

**Should-haves (target achievements)**:
- [Desired outcome 1]
- [Desired outcome 2]
- [Desired outcome 3]

**Nice-to-haves (stretch goals)**:
- [Aspirational outcome 1]
- [Aspirational outcome 2]
```

## Context Questions

Basics: What are you building/changing? Why now? Who are the stakeholders?

Constraints: When does this need to be delivered? What's the budget? What resources are available?

Current State: What exists today? What's the baseline? What are the pain points?

Desired Outcomes: What does success look like? What metrics would you track? What are you most worried about?

Scope: Greenfield/migration/enhancement? Single or multi-phase? Who is the primary user?

## Specification Guidance

### Scope Definition

In Scope should be specific and concrete:

  • "Migrate user authentication service to OAuth 2.0 with PKCE"
  • "Improve authentication"

Out of Scope prevents scope creep:

- List explicitly what won't be included in this phase
- Reference future work that might address excluded items
- Example: "Out of Scope: Social login (Google/GitHub) - deferred to Phase 2"

### Requirements Best Practices

Functional Requirements: What the system must do

  • Use "must" for requirements, "should" for preferences
  • Be specific: "System must handle 10K requests/sec" not "System should be fast"
  • Include acceptance criteria: How will you verify this requirement is met?

Non-Functional Requirements: How well the system must perform

- Performance: Quantify with numbers (latency < 200ms, throughput > 1000 req/s)
- Reliability: Define uptime SLAs, error budgets, MTTR targets
- Security: Specify authentication, authorization, encryption requirements
- Scalability: Define growth expectations (users, data, traffic)
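
As an illustration, non-functional targets like these can be captured as checkable constants; the numbers and names in this sketch are placeholders, not recommendations from this template.

```python
# Hypothetical non-functional targets expressed as constants so measured values
# can be asserted against them (all numbers are placeholders).
P99_LATENCY_MS_MAX = 200       # Performance: p99 latency < 200 ms
THROUGHPUT_RPS_MIN = 1000      # Performance: > 1000 requests/sec
UPTIME_PCT_MIN = 99.9          # Reliability: uptime SLA
ERROR_RATE_PCT_MAX = 0.1       # Reliability: error budget

def verify_nfrs(p99_ms: float, rps: float, uptime_pct: float, error_pct: float) -> None:
    """Raise AssertionError if any measured value misses its target."""
    assert p99_ms <= P99_LATENCY_MS_MAX, f"p99 {p99_ms}ms exceeds {P99_LATENCY_MS_MAX}ms"
    assert rps >= THROUGHPUT_RPS_MIN, f"throughput {rps} rps below {THROUGHPUT_RPS_MIN} rps"
    assert uptime_pct >= UPTIME_PCT_MIN, f"uptime {uptime_pct}% below {UPTIME_PCT_MIN}%"
    assert error_pct <= ERROR_RATE_PCT_MAX, f"error rate {error_pct}% exceeds {ERROR_RATE_PCT_MAX}%"

verify_nfrs(p99_ms=180, rps=1200, uptime_pct=99.95, error_pct=0.05)  # passes silently
```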

Timeline & Milestones

Make milestones concrete and verifiable:

  • "Milestone 1 (Mar 15): Auth service deployed to staging, 100% auth tests passing"
  • "Milestone 1: Complete most of auth work"

Include buffer for unknowns:

- 80% planned work, 20% buffer for issues/tech debt
- Identify critical path and dependencies clearly

## Premortem Process

### Step 1: Set the Scene

Frame the failure scenario clearly:

"It's [6/12/24] months from now - [Date]. We launched [initiative name] and it failed catastrophically. [Stakeholders] are upset. The team is demoralized. What went wrong?"

Choose timeframe based on initiative:

- Quick launch: 3-6 months
- Major migration: 12-24 months
- Strategic change: 24+ months

### Step 2: Brainstorm Failures (Independent)

Have each stakeholder privately write 3-5 specific failure modes:

- Be specific: "Database migration lost 10K user records" not "data issues"
- Think from your domain: Engineers focus on technical, PMs on product, Ops on operational
- No filtering: List even unlikely scenarios

### Step 3: Share and Cluster

Collect all failure modes and group similar ones:

- Technical failures: System design, code bugs, infrastructure
- Operational failures: Runbooks missing, wrong escalation, poor monitoring
- Organizational failures: Lack of skills, poor communication, misaligned incentives
- External failures: Vendor issues, market changes, regulatory changes

### Step 4: Vote and Prioritize

For each cluster, assess:

- Likelihood: How probable is this? (Low 10-30%, Medium 30-60%, High 60%+)
- Impact: How bad would it be? (Low = annoying, Medium = painful, High = catastrophic)
- Risk Score: Likelihood × Impact (1-9 scale)

Focus mitigation on High-risk items (score 6-9).
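
As a minimal sketch of the scoring step, assuming the Low=1 / Medium=2 / High=3 scale above, clustered failure modes can be scored and ranked in a few lines; the example clusters are hypothetical.

```python
# Minimal premortem scoring sketch using the Low=1, Medium=2, High=3 scale.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}

def risk_score(likelihood: str, impact: str) -> int:
    """Risk Score = Likelihood x Impact, each rated Low/Medium/High."""
    return LEVELS[likelihood] * LEVELS[impact]

# Hypothetical clustered failure modes: (description, likelihood, impact).
clusters = [
    ("Migration script corrupts FK relationships", "Medium", "High"),
    ("On-call runbook missing for new service", "High", "Medium"),
    ("Vendor API deprecates endpoint mid-rollout", "Low", "High"),
]

# Rank highest score first; mitigation effort goes to scores 6-9.
for desc, likelihood, impact in sorted(
    clusters, key=lambda c: risk_score(c[1], c[2]), reverse=True
):
    print(f"{risk_score(likelihood, impact)}  {desc}  ({likelihood} x {impact})")
```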

### Step 5: Convert to Risk Register

For each significant failure mode:

  1. Reframe as a risk: "Risk that [specific scenario] happens"
  2. Categorize (Technical/Operational/Organizational/External)
  3. Assign owner (who monitors and responds)
  4. Define mitigation (how to reduce likelihood or impact)
  5. Track status (Open/Mitigated/Accepted/Closed)
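
To make the conversion concrete, one lightweight representation is a typed record per risk whose fields mirror the register columns above; everything in this sketch (field names, the example entry) is illustrative rather than prescribed.

```python
# Illustrative risk-register entry mirroring the register columns above.
from dataclasses import dataclass

LEVELS = {"Low": 1, "Medium": 2, "High": 3}

@dataclass
class RiskEntry:
    risk_id: str
    description: str       # "Risk that [specific scenario] happens"
    category: str          # Technical / Operational / Organizational / External
    likelihood: str        # Low / Medium / High
    impact: str            # Low / Medium / High
    mitigation: str
    owner: str
    status: str = "Open"   # Open / Mitigated / Accepted / Closed

    @property
    def score(self) -> int:
        return LEVELS[self.likelihood] * LEVELS[self.impact]

# Hypothetical entry converted from a premortem failure mode.
r1 = RiskEntry(
    risk_id="R1",
    description="Risk that the migration script drops foreign key relationships",
    category="Technical",
    likelihood="High",
    impact="High",
    mitigation="Dry-run migration on a production snapshot; verify FK integrity",
    owner="Data platform lead",
)
print(r1.risk_id, r1.score, r1.status)  # -> R1 9 Open
```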

## Risk Register Template

### Risk Statement Format

Good risk statements are specific and measurable:

  • "Risk that database migration script fails to preserve foreign key relationships, causing data integrity errors in 15% of records"
  • "Risk of data issues"

### Mitigation Strategies

Reduce Likelihood (prevent the risk):

  • Example: "Implement dry-run migration on production snapshot; verify all FK relationships before live migration"

Reduce Impact (limit the damage):

  • Example: "Create rollback script tested on staging; set up real-time monitoring for FK violations; keep read replica as backup"

Accept Risk (consciously choose not to mitigate):

- For low-impact or very-low-likelihood risks
- Document why it's acceptable: "Accept risk of 3rd-party API rate limiting during launch (low likelihood, workaround available)"

Transfer Risk (shift to vendor/insurance):

  • Example: "Use managed database service with automated backups and point-in-time recovery (transfers operational risk to vendor)"

### Risk Owners

Each risk needs a clear owner who:

- Monitors early warning signals
- Executes mitigation plans
- Escalates if risk materializes
- Updates risk status regularly

Not necessarily the person who fixes it, but the person accountable for tracking it.

## Metrics Template

### Leading Indicators (Early Signals)

These predict future success before lagging metrics move:

- Engineering: Deployment frequency, build time, code review cycle time, test coverage
- Product: Feature adoption rate, activation rate, time-to-value, engagement trends
- Operations: Incident detection time, MTTD, runbook execution rate, alert accuracy

Choose 3-5 leading indicators that:

  1. Predict lagging outcomes (validated correlation)
  2. Can be measured frequently (daily/weekly)
  3. You can act on quickly (short feedback loop)
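
As a sketch of how a chosen indicator can be tracked against the table columns above (baseline, target, measurement method, cadence, owner), the class and example values below are assumptions for illustration.

```python
# Illustrative indicator definition mirroring the metrics table columns.
from dataclasses import dataclass

@dataclass
class MetricDefinition:
    name: str
    baseline: float
    target: float
    measurement_method: str    # data source / query used to compute the value
    cadence: str               # e.g. "daily", "weekly"
    owner: str
    higher_is_better: bool = True

    def on_track(self, current: float) -> bool:
        """True once the current value has reached the target."""
        return current >= self.target if self.higher_is_better else current <= self.target

# Hypothetical leading indicator: deployment frequency.
deploy_freq = MetricDefinition(
    name="Deployments per week",
    baseline=2.0,
    target=10.0,
    measurement_method="Count of CI/CD production deploy events",
    cadence="weekly",
    owner="Platform team",
)
print(deploy_freq.on_track(6.0))  # -> False: above baseline but below target
```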

### Lagging Indicators (Outcomes)

These measure actual success but appear later:

- Reliability: Uptime, error rate, MTTR, SLA compliance
- Performance: p50/p95/p99 latency, throughput, response time
- Business: Revenue, user growth, retention, NPS, cost savings
- User: Activation rate, feature adoption, satisfaction score

Choose 3-5 lagging indicators that:

  1. Directly measure initiative goals
  2. Are measurable and objective
  3. Matter to stakeholders

### Counter-Metrics (Guardrails)

What success does NOT mean sacrificing:

- If optimizing for speed → Counter-metric: Code quality (test coverage, bug rate)
- If optimizing for growth → Counter-metric: Costs (infrastructure spend, CAC)
- If optimizing for features → Counter-metric: Technical debt (cycle time, deployment frequency)

Choose 2-3 counter-metrics to:

  1. Prevent gaming the system
  2. Protect long-term sustainability
  3. Maintain team/user trust
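
A counter-metric can be treated as a threshold plus an escalation action; this hedged sketch shows one way to check guardrails, with hypothetical metric names, thresholds, and escalation text.

```python
# Hypothetical guardrail checks: each counter-metric has a minimum or maximum
# acceptable value and an escalation note for when it is breached.
GUARDRAILS = {
    "test_coverage_pct": {"min": 80, "escalate": "Notify eng lead; hold the release"},
    "team_happiness_score": {"min": 7, "escalate": "Raise in the next retro"},
    "critical_vulnerabilities": {"max": 0, "escalate": "Page security on-call"},
}

def breached_guardrails(measurements: dict) -> list[str]:
    """Return escalation notes for any counter-metric outside its threshold."""
    alerts = []
    for name, rule in GUARDRAILS.items():
        value = measurements.get(name)
        if value is None:
            continue  # unmeasured guardrails are reported elsewhere in this sketch
        if ("min" in rule and value < rule["min"]) or ("max" in rule and value > rule["max"]):
            alerts.append(f"{name}={value}: {rule['escalate']}")
    return alerts

print(breached_guardrails({"test_coverage_pct": 72, "critical_vulnerabilities": 0}))
# -> ['test_coverage_pct=72: Notify eng lead; hold the release']
```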

## Instrumentation Guidance

Baseline: Measure current state before launch ("p99 latency: 500ms", not "API is slow"). Without a baseline, you can't measure improvement.
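
One way to establish a numeric baseline is to compute it from raw samples collected before launch; this sketch assumes latency samples in milliseconds are already available from logs or tracing.

```python
# Sketch: derive a latency baseline (p50/p95/p99, in ms) from collected samples.
from statistics import quantiles

samples_ms = [120, 135, 150, 180, 210, 250, 320, 410, 480, 500]  # placeholder data

# quantiles(..., n=100) returns the 1st..99th percentile cut points.
cuts = quantiles(samples_ms, n=100)
baseline = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
print(baseline)  # record these values as the pre-launch baseline
```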

Targets: Make them specific ("reduce p99 from 500ms to 200ms"), achievable (industry benchmarks), time-bound ("by Q2 end"), meaningful (tied to outcomes).

Measurement: Document data source, calculation method, measurement frequency, and who can access the metrics.

Tracking Cadence: Real-time (system health), daily (operations), weekly (product), monthly (business).

## Quality Checklist

Before delivering, verify:

Specification Complete:

- Goal, stakeholders, timeline clearly stated
- Current state baseline documented with data
- Scope (in/out/future) explicitly defined
- Requirements are specific and measurable
- Dependencies and assumptions listed
- Timeline has concrete milestones

Risks Comprehensive:

- Premortem conducted (failure imagination exercise)
- Risks span technical, operational, organizational, external
- Each risk has likelihood, impact, score
- High-risk items (6-9 score) have detailed mitigation plans
- Every risk has an assigned owner
- Risk register is prioritized by score

Metrics Measurable:

- 3-5 leading indicators (early signals) defined
- 3-5 lagging indicators (outcomes) defined
- 2-3 counter-metrics (guardrails) defined
- Each metric has baseline, target, measurement method
- Metrics have clear owners and tracking cadence
- Success criteria (must/should/nice-to-have) documented

Integration:

- Risks map to specs (e.g., technical risks tied to architecture choices)
- Metrics validate risk mitigations (e.g., measure if mitigation worked)
- Specs enable metrics (e.g., instrumentation built into design)
- All three components tell a coherent story

Rubric Validation:

- Self-assessed with rubric ≥ 3.5 average across all criteria
- Stakeholders can act on this artifact
- Gaps and unknowns explicitly acknowledged