# Chain Spec Risk Metrics Template

## Workflow

Copy this checklist and track your progress:

```
Chain Spec Risk Metrics Progress:
- [ ] Step 1: Gather initiative context
- [ ] Step 2: Write comprehensive specification
- [ ] Step 3: Conduct premortem and build risk register
- [ ] Step 4: Define success metrics and instrumentation
- [ ] Step 5: Validate completeness and deliver
```

Step 1: Gather initiative context - Collect goal, constraints, stakeholders, baseline, desired outcomes. See Context Questions.

Step 2: Write comprehensive specification - Document scope, approach, requirements, dependencies, timeline. See Quick Template and Specification Guidance.

Step 3: Conduct premortem and build risk register - Run failure imagination exercise, categorize risks, assign mitigations/owners. See Premortem Process and Risk Register Template.

Step 4: Define success metrics - Identify leading/lagging indicators, baselines, targets, counter-metrics. See Metrics Template and Instrumentation Guidance.

Step 5: Validate and deliver - Self-check with rubric (≥3.5 average), ensure all components align. See Quality Checklist.

## Quick Template

```markdown
# [Initiative Name]: Spec, Risks, Metrics

## 1. Specification

### 1.1 Executive Summary
- **Goal**: [What are you building/changing and why?]
- **Timeline**: [When does this need to be delivered?]
- **Stakeholders**: [Who cares about this initiative?]
- **Success Criteria**: [What does done look like?]

### 1.2 Current State (Baseline)
- [Describe the current state/system]
- [Key metrics/data points about current state]
- [Pain points or limitations driving this initiative]

### 1.3 Proposed Approach
- **Architecture/Design**: [High-level approach]
- **Implementation Plan**: [Phases, milestones, dependencies]
- **Technology Choices**: [Tools, frameworks, platforms with rationale]

### 1.4 Scope
- **In Scope**: [What this initiative includes]
- **Out of Scope**: [What is explicitly excluded]
- **Future Considerations**: [What might come later]

### 1.5 Requirements
**Functional Requirements:**
- [Feature/capability 1]
- [Feature/capability 2]
- [Feature/capability 3]

**Non-Functional Requirements:**
- **Performance**: [Latency, throughput, scalability targets]
- **Reliability**: [Uptime, error rates, recovery time]
- **Security**: [Authentication, authorization, data protection]
- **Compliance**: [Regulatory, policy, audit requirements]

### 1.6 Dependencies & Assumptions
- **Dependencies**: [External teams, systems, resources needed]
- **Assumptions**: [What we're assuming is true]
- **Constraints**: [Budget, time, resource, technical limitations]

### 1.7 Timeline & Milestones
| Milestone | Date | Deliverable |
|-----------|------|-------------|
| [Phase 1] | [Date] | [What's delivered] |
| [Phase 2] | [Date] | [What's delivered] |
| [Phase 3] | [Date] | [What's delivered] |

## 2. Risk Analysis

### 2.1 Premortem Summary
**Exercise Prompt**: "It's [X months] from now. This initiative failed catastrophically. What went wrong?"

**Top failure modes identified:**
1. [Failure mode 1]
2. [Failure mode 2]
3. [Failure mode 3]
4. [Failure mode 4]
5. [Failure mode 5]

### 2.2 Risk Register

| Risk ID | Risk Description | Category | Likelihood | Impact | Risk Score | Mitigation Strategy | Owner | Status |
|---------|-----------------|----------|------------|--------|-----------|---------------------|-------|--------|
| R1 | [Specific failure scenario] | Technical | High | High | 9 | [How you'll prevent/reduce] | [Name] | Open |
| R2 | [Specific failure scenario] | Operational | Medium | High | 6 | [How you'll prevent/reduce] | [Name] | Open |
| R3 | [Specific failure scenario] | Organizational | Medium | Medium | 4 | [How you'll prevent/reduce] | [Name] | Open |
| R4 | [Specific failure scenario] | External | Low | High | 3 | [How you'll prevent/reduce] | [Name] | Open |

**Risk Scoring**: Low=1, Medium=2, High=3. Risk Score = Likelihood × Impact.

**Risk Categories**:
- **Technical**: Architecture, code quality, infrastructure, performance
- **Operational**: Processes, runbooks, support, operations
- **Organizational**: Resources, skills, alignment, communication
- **External**: Market, vendors, regulation, dependencies

### 2.3 Risk Mitigation Timeline
- **Pre-Launch**: [Risks to mitigate before going live]
- **Launch Window**: [Monitoring and safeguards during launch]
- **Post-Launch**: [Ongoing monitoring and response]

## 3. Success Metrics

### 3.1 Leading Indicators (Early Signals)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Predictive metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |

**Examples**: Deployment frequency, code review cycle time, test coverage, incident detection time

### 3.2 Lagging Indicators (Outcome Measures)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Outcome metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |

**Examples**: Uptime, user adoption rate, revenue impact, customer satisfaction score

### 3.3 Counter-Metrics (What We Won't Sacrifice)
| Metric | Threshold | Monitoring Method | Escalation Trigger |
|--------|-----------|-------------------|-------------------|
| [Protection metric 1] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 2] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 3] | [Minimum acceptable] | [How monitored] | [When to escalate] |

**Examples**: Code quality (test coverage > 80%), team morale (happiness score > 7/10), security posture (no critical vulnerabilities), user privacy (zero data leaks)

### 3.4 Success Criteria Summary
**Must-haves (hard requirements)**:
- [Critical success criterion 1]
- [Critical success criterion 2]
- [Critical success criterion 3]

**Should-haves (target achievements)**:
- [Desired outcome 1]
- [Desired outcome 2]
- [Desired outcome 3]

**Nice-to-haves (stretch goals)**:
- [Aspirational outcome 1]
- [Aspirational outcome 2]
```

## Context Questions

Basics: What are you building/changing? Why now? Who are the stakeholders?

Constraints: When does this need to be delivered? What's the budget? What resources are available?

Current State: What exists today? What's the baseline? What are the pain points?

Desired Outcomes: What does success look like? What metrics would you track? What are you most worried about?

Scope: Greenfield/migration/enhancement? Single or multi-phase? Who is the primary user?

## Specification Guidance

### Scope Definition

In Scope should be specific and concrete:

  • "Migrate user authentication service to OAuth 2.0 with PKCE"
  • "Improve authentication"

Out of Scope prevents scope creep:

- List explicitly what won't be included in this phase
- Reference future work that might address excluded items
- Example: "Out of Scope: Social login (Google/GitHub) - deferred to Phase 2"

### Requirements Best Practices

Functional Requirements: What the system must do

  • Use "must" for requirements, "should" for preferences
  • Be specific: "System must handle 10K requests/sec" not "System should be fast"
  • Include acceptance criteria: How will you verify this requirement is met?

Non-Functional Requirements: How well the system must perform

- Performance: Quantify with numbers (latency < 200ms, throughput > 1000 req/s)
- Reliability: Define uptime SLAs, error budgets, MTTR targets
- Security: Specify authentication, authorization, encryption requirements
- Scalability: Define growth expectations (users, data, traffic)
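
As an illustration, non-functional targets like these can be captured as checkable constants; the numbers and names in this sketch are placeholders, not recommendations from this template.

```python
# Hypothetical non-functional targets expressed as constants so measured values
# can be asserted against them (all numbers are placeholders).
P99_LATENCY_MS_MAX = 200       # Performance: p99 latency < 200 ms
THROUGHPUT_RPS_MIN = 1000      # Performance: > 1000 requests/sec
UPTIME_PCT_MIN = 99.9          # Reliability: uptime SLA
ERROR_RATE_PCT_MAX = 0.1       # Reliability: error budget

def verify_nfrs(p99_ms: float, rps: float, uptime_pct: float, error_pct: float) -> None:
    """Raise AssertionError if any measured value misses its target."""
    assert p99_ms <= P99_LATENCY_MS_MAX, f"p99 {p99_ms}ms exceeds {P99_LATENCY_MS_MAX}ms"
    assert rps >= THROUGHPUT_RPS_MIN, f"throughput {rps} rps below {THROUGHPUT_RPS_MIN} rps"
    assert uptime_pct >= UPTIME_PCT_MIN, f"uptime {uptime_pct}% below {UPTIME_PCT_MIN}%"
    assert error_pct <= ERROR_RATE_PCT_MAX, f"error rate {error_pct}% exceeds {ERROR_RATE_PCT_MAX}%"

verify_nfrs(p99_ms=180, rps=1200, uptime_pct=99.95, error_pct=0.05)  # passes silently
```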

Timeline & Milestones

Make milestones concrete and verifiable:

  • "Milestone 1 (Mar 15): Auth service deployed to staging, 100% auth tests passing"
  • "Milestone 1: Complete most of auth work"

Include buffer for unknowns:

- 80% planned work, 20% buffer for issues/tech debt
- Identify critical path and dependencies clearly

## Premortem Process

### Step 1: Set the Scene

Frame the failure scenario clearly:

"It's [6/12/24] months from now - [Date]. We launched [initiative name] and it failed catastrophically. [Stakeholders] are upset. The team is demoralized. What went wrong?"

Choose timeframe based on initiative:

- Quick launch: 3-6 months
- Major migration: 12-24 months
- Strategic change: 24+ months

### Step 2: Brainstorm Failures (Independent)

Have each stakeholder privately write 3-5 specific failure modes:

- Be specific: "Database migration lost 10K user records" not "data issues"
- Think from your domain: Engineers focus on technical, PMs on product, Ops on operational
- No filtering: List even unlikely scenarios

### Step 3: Share and Cluster

Collect all failure modes and group similar ones:

- Technical failures: System design, code bugs, infrastructure
- Operational failures: Runbooks missing, wrong escalation, poor monitoring
- Organizational failures: Lack of skills, poor communication, misaligned incentives
- External failures: Vendor issues, market changes, regulatory changes

### Step 4: Vote and Prioritize

For each cluster, assess:

- Likelihood: How probable is this? (Low 10-30%, Medium 30-60%, High 60%+)
- Impact: How bad would it be? (Low = annoying, Medium = painful, High = catastrophic)
- Risk Score: Likelihood × Impact (1-9 scale)

Focus mitigation on High-risk items (score 6-9).
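
As a minimal sketch of the scoring step, assuming the Low=1 / Medium=2 / High=3 scale above, clustered failure modes can be scored and ranked in a few lines; the example clusters are hypothetical.

```python
# Minimal premortem scoring sketch using the Low=1, Medium=2, High=3 scale.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}

def risk_score(likelihood: str, impact: str) -> int:
    """Risk Score = Likelihood x Impact, each rated Low/Medium/High."""
    return LEVELS[likelihood] * LEVELS[impact]

# Hypothetical clustered failure modes: (description, likelihood, impact).
clusters = [
    ("Migration script corrupts FK relationships", "Medium", "High"),
    ("On-call runbook missing for new service", "High", "Medium"),
    ("Vendor API deprecates endpoint mid-rollout", "Low", "High"),
]

# Rank highest score first; mitigation effort goes to scores 6-9.
for desc, likelihood, impact in sorted(
    clusters, key=lambda c: risk_score(c[1], c[2]), reverse=True
):
    print(f"{risk_score(likelihood, impact)}  {desc}  ({likelihood} x {impact})")
```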

### Step 5: Convert to Risk Register

For each significant failure mode:

  1. Reframe as a risk: "Risk that [specific scenario] happens"
  2. Categorize (Technical/Operational/Organizational/External)
  3. Assign owner (who monitors and responds)
  4. Define mitigation (how to reduce likelihood or impact)
  5. Track status (Open/Mitigated/Accepted/Closed)
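
To make the conversion concrete, one lightweight representation is a typed record per risk whose fields mirror the register columns above; everything in this sketch (field names, the example entry) is illustrative rather than prescribed.

```python
# Illustrative risk-register entry mirroring the register columns above.
from dataclasses import dataclass

LEVELS = {"Low": 1, "Medium": 2, "High": 3}

@dataclass
class RiskEntry:
    risk_id: str
    description: str       # "Risk that [specific scenario] happens"
    category: str          # Technical / Operational / Organizational / External
    likelihood: str        # Low / Medium / High
    impact: str            # Low / Medium / High
    mitigation: str
    owner: str
    status: str = "Open"   # Open / Mitigated / Accepted / Closed

    @property
    def score(self) -> int:
        return LEVELS[self.likelihood] * LEVELS[self.impact]

# Hypothetical entry converted from a premortem failure mode.
r1 = RiskEntry(
    risk_id="R1",
    description="Risk that the migration script drops foreign key relationships",
    category="Technical",
    likelihood="High",
    impact="High",
    mitigation="Dry-run migration on a production snapshot; verify FK integrity",
    owner="Data platform lead",
)
print(r1.risk_id, r1.score, r1.status)  # -> R1 9 Open
```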

## Risk Register Template

### Risk Statement Format

Good risk statements are specific and measurable:

  • "Risk that database migration script fails to preserve foreign key relationships, causing data integrity errors in 15% of records"
  • "Risk of data issues"

### Mitigation Strategies

Reduce Likelihood (prevent the risk):

  • Example: "Implement dry-run migration on production snapshot; verify all FK relationships before live migration"

Reduce Impact (limit the damage):

  • Example: "Create rollback script tested on staging; set up real-time monitoring for FK violations; keep read replica as backup"

Accept Risk (consciously choose not to mitigate):

- For low-impact or very-low-likelihood risks
- Document why it's acceptable: "Accept risk of 3rd-party API rate limiting during launch (low likelihood, workaround available)"

Transfer Risk (shift to vendor/insurance):

  • Example: "Use managed database service with automated backups and point-in-time recovery (transfers operational risk to vendor)"

### Risk Owners

Each risk needs a clear owner who:

- Monitors early warning signals
- Executes mitigation plans
- Escalates if risk materializes
- Updates risk status regularly

Not necessarily the person who fixes it, but the person accountable for tracking it.

## Metrics Template

### Leading Indicators (Early Signals)

These predict future success before lagging metrics move:

- Engineering: Deployment frequency, build time, code review cycle time, test coverage
- Product: Feature adoption rate, activation rate, time-to-value, engagement trends
- Operations: Incident detection time, MTTD, runbook execution rate, alert accuracy

Choose 3-5 leading indicators that:

  1. Predict lagging outcomes (validated correlation)
  2. Can be measured frequently (daily/weekly)
  3. You can act on quickly (short feedback loop)
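
As a sketch of how a chosen indicator can be tracked against the table columns above (baseline, target, measurement method, cadence, owner), the class and example values below are assumptions for illustration.

```python
# Illustrative indicator definition mirroring the metrics table columns.
from dataclasses import dataclass

@dataclass
class MetricDefinition:
    name: str
    baseline: float
    target: float
    measurement_method: str    # data source / query used to compute the value
    cadence: str               # e.g. "daily", "weekly"
    owner: str
    higher_is_better: bool = True

    def on_track(self, current: float) -> bool:
        """True once the current value has reached the target."""
        return current >= self.target if self.higher_is_better else current <= self.target

# Hypothetical leading indicator: deployment frequency.
deploy_freq = MetricDefinition(
    name="Deployments per week",
    baseline=2.0,
    target=10.0,
    measurement_method="Count of CI/CD production deploy events",
    cadence="weekly",
    owner="Platform team",
)
print(deploy_freq.on_track(6.0))  # -> False: above baseline but below target
```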

### Lagging Indicators (Outcomes)

These measure actual success but appear later:

- Reliability: Uptime, error rate, MTTR, SLA compliance
- Performance: p50/p95/p99 latency, throughput, response time
- Business: Revenue, user growth, retention, NPS, cost savings
- User: Activation rate, feature adoption, satisfaction score

Choose 3-5 lagging indicators that:

  1. Directly measure initiative goals
  2. Are measurable and objective
  3. Matter to stakeholders

### Counter-Metrics (Guardrails)

What success does NOT mean sacrificing:

- If optimizing for speed → Counter-metric: Code quality (test coverage, bug rate)
- If optimizing for growth → Counter-metric: Costs (infrastructure spend, CAC)
- If optimizing for features → Counter-metric: Technical debt (cycle time, deployment frequency)

Choose 2-3 counter-metrics to:

  1. Prevent gaming the system
  2. Protect long-term sustainability
  3. Maintain team/user trust
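
A counter-metric can be treated as a threshold plus an escalation action; this hedged sketch shows one way to check guardrails, with hypothetical metric names, thresholds, and escalation text.

```python
# Hypothetical guardrail checks: each counter-metric has a minimum or maximum
# acceptable value and an escalation note for when it is breached.
GUARDRAILS = {
    "test_coverage_pct": {"min": 80, "escalate": "Notify eng lead; hold the release"},
    "team_happiness_score": {"min": 7, "escalate": "Raise in the next retro"},
    "critical_vulnerabilities": {"max": 0, "escalate": "Page security on-call"},
}

def breached_guardrails(measurements: dict) -> list[str]:
    """Return escalation notes for any counter-metric outside its threshold."""
    alerts = []
    for name, rule in GUARDRAILS.items():
        value = measurements.get(name)
        if value is None:
            continue  # unmeasured guardrails are reported elsewhere in this sketch
        if ("min" in rule and value < rule["min"]) or ("max" in rule and value > rule["max"]):
            alerts.append(f"{name}={value}: {rule['escalate']}")
    return alerts

print(breached_guardrails({"test_coverage_pct": 72, "critical_vulnerabilities": 0}))
# -> ['test_coverage_pct=72: Notify eng lead; hold the release']
```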

## Instrumentation Guidance

Baseline: Measure current state before launch ("p99 latency: 500ms", not "API is slow"). Without a baseline, you can't measure improvement.
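
One way to establish a numeric baseline is to compute it from raw samples collected before launch; this sketch assumes latency samples in milliseconds are already available from logs or tracing.

```python
# Sketch: derive a latency baseline (p50/p95/p99, in ms) from collected samples.
from statistics import quantiles

samples_ms = [120, 135, 150, 180, 210, 250, 320, 410, 480, 500]  # placeholder data

# quantiles(..., n=100) returns the 1st..99th percentile cut points.
cuts = quantiles(samples_ms, n=100)
baseline = {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
print(baseline)  # record these values as the pre-launch baseline
```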

Targets: Make them specific ("reduce p99 from 500ms to 200ms"), achievable (industry benchmarks), time-bound ("by Q2 end"), meaningful (tied to outcomes).

Measurement: Document data source, calculation method, measurement frequency, and who can access the metrics.

Tracking Cadence: Real-time (system health), daily (operations), weekly (product), monthly (business).

## Quality Checklist

Before delivering, verify:

Specification Complete:

- Goal, stakeholders, timeline clearly stated
- Current state baseline documented with data
- Scope (in/out/future) explicitly defined
- Requirements are specific and measurable
- Dependencies and assumptions listed
- Timeline has concrete milestones

Risks Comprehensive:

- Premortem conducted (failure imagination exercise)
- Risks span technical, operational, organizational, external
- Each risk has likelihood, impact, score
- High-risk items (6-9 score) have detailed mitigation plans
- Every risk has an assigned owner
- Risk register is prioritized by score

Metrics Measurable:

- 3-5 leading indicators (early signals) defined
- 3-5 lagging indicators (outcomes) defined
- 2-3 counter-metrics (guardrails) defined
- Each metric has baseline, target, measurement method
- Metrics have clear owners and tracking cadence
- Success criteria (must/should/nice-to-have) documented

Integration:

- Risks map to specs (e.g., technical risks tied to architecture choices)
- Metrics validate risk mitigations (e.g., measure if mitigation worked)
- Specs enable metrics (e.g., instrumentation built into design)
- All three components tell a coherent story

Rubric Validation:

- Self-assessed with rubric ≥ 3.5 average across all criteria
- Stakeholders can act on this artifact
- Gaps and unknowns explicitly acknowledged