# Chain Spec Risk Metrics Template
## Workflow
Copy this checklist and track your progress:
```
Chain Spec Risk Metrics Progress:
- [ ] Step 1: Gather initiative context
- [ ] Step 2: Write comprehensive specification
- [ ] Step 3: Conduct premortem and build risk register
- [ ] Step 4: Define success metrics and instrumentation
- [ ] Step 5: Validate completeness and deliver
```
**Step 1: Gather initiative context** - Collect goal, constraints, stakeholders, baseline, desired outcomes. See [Context Questions](#context-questions).
**Step 2: Write comprehensive specification** - Document scope, approach, requirements, dependencies, timeline. See [Quick Template](#quick-template) and [Specification Guidance](#specification-guidance).
**Step 3: Conduct premortem and build risk register** - Run failure imagination exercise, categorize risks, assign mitigations/owners. See [Premortem Process](#premortem-process) and [Risk Register Template](#risk-register-template).
**Step 4: Define success metrics and instrumentation** - Identify leading/lagging indicators, baselines, targets, counter-metrics. See [Metrics Template](#metrics-template) and [Instrumentation Guidance](#instrumentation-guidance).
**Step 5: Validate completeness and deliver** - Self-check with rubric (≥3.5 average), ensure all components align. See [Quality Checklist](#quality-checklist).
## Quick Template
```markdown
# [Initiative Name]: Spec, Risks, Metrics
## 1. Specification
### 1.1 Executive Summary
- **Goal**: [What are you building/changing and why?]
- **Timeline**: [When does this need to be delivered?]
- **Stakeholders**: [Who cares about this initiative?]
- **Success Criteria**: [What does done look like?]
### 1.2 Current State (Baseline)
- [Describe the current state/system]
- [Key metrics/data points about current state]
- [Pain points or limitations driving this initiative]
### 1.3 Proposed Approach
- **Architecture/Design**: [High-level approach]
- **Implementation Plan**: [Phases, milestones, dependencies]
- **Technology Choices**: [Tools, frameworks, platforms with rationale]
### 1.4 Scope
- **In Scope**: [What this initiative includes]
- **Out of Scope**: [What is explicitly excluded]
- **Future Considerations**: [What might come later]
### 1.5 Requirements
**Functional Requirements:**
- [Feature/capability 1]
- [Feature/capability 2]
- [Feature/capability 3]
**Non-Functional Requirements:**
- **Performance**: [Latency, throughput, scalability targets]
- **Reliability**: [Uptime, error rates, recovery time]
- **Security**: [Authentication, authorization, data protection]
- **Compliance**: [Regulatory, policy, audit requirements]
### 1.6 Dependencies & Assumptions
- **Dependencies**: [External teams, systems, resources needed]
- **Assumptions**: [What we're assuming is true]
- **Constraints**: [Budget, time, resource, technical limitations]
### 1.7 Timeline & Milestones
| Milestone | Date | Deliverable |
|-----------|------|-------------|
| [Phase 1] | [Date] | [What's delivered] |
| [Phase 2] | [Date] | [What's delivered] |
| [Phase 3] | [Date] | [What's delivered] |
## 2. Risk Analysis
### 2.1 Premortem Summary
**Exercise Prompt**: "It's [X months] from now. This initiative failed catastrophically. What went wrong?"
**Top failure modes identified:**
1. [Failure mode 1]
2. [Failure mode 2]
3. [Failure mode 3]
4. [Failure mode 4]
5. [Failure mode 5]
### 2.2 Risk Register
| Risk ID | Risk Description | Category | Likelihood | Impact | Risk Score | Mitigation Strategy | Owner | Status |
|---------|-----------------|----------|------------|--------|-----------|---------------------|-------|--------|
| R1 | [Specific failure scenario] | Technical | High | High | 9 | [How you'll prevent/reduce] | [Name] | Open |
| R2 | [Specific failure scenario] | Operational | Medium | High | 6 | [How you'll prevent/reduce] | [Name] | Open |
| R3 | [Specific failure scenario] | Organizational | Medium | Medium | 4 | [How you'll prevent/reduce] | [Name] | Open |
| R4 | [Specific failure scenario] | External | Low | High | 3 | [How you'll prevent/reduce] | [Name] | Open |
**Risk Scoring**: Low=1, Medium=2, High=3. Risk Score = Likelihood × Impact.
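To make the scoring arithmetic concrete, here is a minimal Python sketch of a register entry and its score; the risks, owners, and field names are illustrative, not part of the template:
```python
# Minimal sketch of the Likelihood x Impact scoring; all values below are illustrative.
from dataclasses import dataclass

LEVELS = {"Low": 1, "Medium": 2, "High": 3}

@dataclass
class Risk:
    risk_id: str
    description: str
    category: str    # Technical / Operational / Organizational / External
    likelihood: str  # Low / Medium / High
    impact: str      # Low / Medium / High
    owner: str
    status: str = "Open"

    @property
    def score(self) -> int:
        # Risk Score = Likelihood x Impact, giving a 1-9 scale
        return LEVELS[self.likelihood] * LEVELS[self.impact]

register = [
    Risk("R1", "Migration script drops FK relationships", "Technical", "High", "High", "Dana"),
    Risk("R4", "Vendor API rate limits during launch", "External", "Low", "High", "Sam"),
]

# Prioritize the register by score, highest first; scores of 6-9 get detailed mitigation plans
for risk in sorted(register, key=lambda r: r.score, reverse=True):
    print(f"{risk.risk_id}: score {risk.score} ({risk.likelihood} x {risk.impact}), owner {risk.owner}")
```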
**Risk Categories**:
- **Technical**: Architecture, code quality, infrastructure, performance
- **Operational**: Processes, runbooks, support, operations
- **Organizational**: Resources, skills, alignment, communication
- **External**: Market, vendors, regulation, dependencies
### 2.3 Risk Mitigation Timeline
- **Pre-Launch**: [Risks to mitigate before going live]
- **Launch Window**: [Monitoring and safeguards during launch]
- **Post-Launch**: [Ongoing monitoring and response]
## 3. Success Metrics
### 3.1 Leading Indicators (Early Signals)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Predictive metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
**Examples**: Deployment frequency, code review cycle time, test coverage, incident detection time
### 3.2 Lagging Indicators (Outcome Measures)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Outcome metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
**Examples**: Uptime, user adoption rate, revenue impact, customer satisfaction score
### 3.3 Counter-Metrics (What We Won't Sacrifice)
| Metric | Threshold | Monitoring Method | Escalation Trigger |
|--------|-----------|-------------------|-------------------|
| [Protection metric 1] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 2] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 3] | [Minimum acceptable] | [How monitored] | [When to escalate] |
**Examples**: Code quality (test coverage > 80%), team morale (happiness score > 7/10), security posture (no critical vulnerabilities), user privacy (zero data leaks)
### 3.4 Success Criteria Summary
**Must-haves (hard requirements)**:
- [Critical success criterion 1]
- [Critical success criterion 2]
- [Critical success criterion 3]
**Should-haves (target achievements)**:
- [Desired outcome 1]
- [Desired outcome 2]
- [Desired outcome 3]
**Nice-to-haves (stretch goals)**:
- [Aspirational outcome 1]
- [Aspirational outcome 2]
```
## Context Questions
**Basics**: What are you building/changing? Why now? Who are the stakeholders?
**Constraints**: When does this need to be delivered? What's the budget? What resources are available?
**Current State**: What exists today? What's the baseline? What are the pain points?
**Desired Outcomes**: What does success look like? What metrics would you track? What are you most worried about?
**Scope**: Greenfield/migration/enhancement? Single or multi-phase? Who is the primary user?
## Specification Guidance
### Scope Definition
**In Scope** should be specific and concrete:
- ✅ "Migrate user authentication service to OAuth 2.0 with PKCE"
- ❌ "Improve authentication"
**Out of Scope** prevents scope creep:
- List explicitly what won't be included in this phase
- Reference future work that might address excluded items
- Example: "Out of Scope: Social login (Google/GitHub) - deferred to Phase 2"
### Requirements Best Practices
**Functional Requirements**: What the system must do
- Use "must" for requirements, "should" for preferences
- Be specific: "System must handle 10K requests/sec" not "System should be fast"
- Include acceptance criteria: How will you verify this requirement is met?
**Non-Functional Requirements**: How well the system must perform
- **Performance**: Quantify with numbers (latency < 200ms, throughput > 1000 req/s)
- **Reliability**: Define uptime SLAs, error budgets, MTTR targets
- **Security**: Specify authentication, authorization, encryption requirements
- **Scalability**: Define growth expectations (users, data, traffic)
### Timeline & Milestones
Make milestones concrete and verifiable:
- ✅ "Milestone 1 (Mar 15): Auth service deployed to staging, 100% auth tests passing"
- ❌ "Milestone 1: Complete most of auth work"
Include buffer for unknowns:
- 80% planned work, 20% buffer for issues/tech debt (e.g., 8 weeks of planned work implies roughly a 10-week timeline)
- Identify critical path and dependencies clearly
## Premortem Process
### Step 1: Set the Scene
Frame the failure scenario clearly:
> "It's [6/12/24] months from now - [Date]. We launched [initiative name] and it failed catastrophically. [Stakeholders] are upset. The team is demoralized. What went wrong?"
Choose timeframe based on initiative:
- Quick launch: 3-6 months
- Major migration: 12-24 months
- Strategic change: 24+ months
### Step 2: Brainstorm Failures (Independent)
Have each stakeholder privately write 3-5 specific failure modes:
- Be specific: "Database migration lost 10K user records" not "data issues"
- Think from your domain: engineers focus on technical failures, PMs on product failures, Ops on operational failures
- No filtering: List even unlikely scenarios
### Step 3: Share and Cluster
Collect all failure modes and group similar ones:
- **Technical failures**: System design, code bugs, infrastructure
- **Operational failures**: Runbooks missing, wrong escalation, poor monitoring
- **Organizational failures**: Lack of skills, poor communication, misaligned incentives
- **External failures**: Vendor issues, market changes, regulatory changes
### Step 4: Vote and Prioritize
For each cluster, assess:
- **Likelihood**: How probable is this? (Low 10-30%, Medium 30-60%, High 60%+)
- **Impact**: How bad would it be? (Low = annoying, Medium = painful, High = catastrophic)
- **Risk Score**: Likelihood × Impact (1-9 scale)
Focus mitigation on High-risk items (score 6-9).
### Step 5: Convert to Risk Register
For each significant failure mode:
1. Reframe as a risk: "Risk that [specific scenario] happens"
2. Categorize (Technical/Operational/Organizational/External)
3. Assign owner (who monitors and responds)
4. Define mitigation (how to reduce likelihood or impact)
5. Track status (Open/Mitigated/Accepted/Closed)
## Risk Register Template
### Risk Statement Format
Good risk statements are specific and measurable:
- ✅ "Risk that database migration script fails to preserve foreign key relationships, causing data integrity errors in 15% of records"
- ❌ "Risk of data issues"
### Mitigation Strategies
**Reduce Likelihood** (prevent the risk):
- Example: "Implement dry-run migration on production snapshot; verify all FK relationships before live migration"
**Reduce Impact** (limit the damage):
- Example: "Create rollback script tested on staging; set up real-time monitoring for FK violations; keep read replica as backup"
**Accept Risk** (consciously choose not to mitigate):
- For low-impact or very-low-likelihood risks
- Document why it's acceptable: "Accept risk of 3rd-party API rate limiting during launch (low likelihood, workaround available)"
**Transfer Risk** (shift to vendor/insurance):
- Example: "Use managed database service with automated backups and point-in-time recovery (transfers operational risk to vendor)"
### Risk Owners
Each risk needs a clear owner who:
- Monitors early warning signals
- Executes mitigation plans
- Escalates if risk materializes
- Updates risk status regularly
The risk owner is not necessarily the person who fixes the issue, but the person accountable for tracking it.
## Metrics Template
### Leading Indicators (Early Signals)
These predict future success before lagging metrics move:
- **Engineering**: Deployment frequency, build time, code review cycle time, test coverage
- **Product**: Feature adoption rate, activation rate, time-to-value, engagement trends
- **Operations**: Incident detection time (MTTD), runbook execution rate, alert accuracy
Choose 3-5 leading indicators that:
1. Predict lagging outcomes (validated correlation)
2. Can be measured frequently (daily/weekly)
3. You can act on quickly (short feedback loop)
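As an illustration of a leading indicator that can be measured on a short cadence, here is a minimal Python sketch that buckets deployment timestamps into deploys per week; the timestamps are made up, and in practice the data would come from your CI/CD system:
```python
# Illustrative sketch: deployment frequency (deploys per week) as a leading indicator.
from collections import Counter
from datetime import datetime

deploy_times = [
    datetime(2025, 3, 3, 14, 0),
    datetime(2025, 3, 4, 9, 30),
    datetime(2025, 3, 10, 16, 45),
    datetime(2025, 3, 12, 11, 15),
]

# Bucket deploys by ISO year/week so the metric can be tracked on a weekly cadence
per_week = Counter(t.isocalendar()[:2] for t in deploy_times)
for (year, week), count in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {count} deploys")
```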
### Lagging Indicators (Outcomes)
These measure actual success but appear later:
- **Reliability**: Uptime, error rate, MTTR, SLA compliance
- **Performance**: p50/p95/p99 latency, throughput, response time
- **Business**: Revenue, user growth, retention, NPS, cost savings
- **User**: Activation rate, feature adoption, satisfaction score
Choose 3-5 lagging indicators that:
1. Directly measure initiative goals
2. Are measurable and objective
3. Matter to stakeholders
### Counter-Metrics (Guardrails)
What success does NOT mean sacrificing:
- If optimizing for speed → Counter-metric: Code quality (test coverage, bug rate)
- If optimizing for growth → Counter-metric: Costs (infrastructure spend, CAC)
- If optimizing for features → Counter-metric: Technical debt (cycle time, deployment frequency)
Choose 2-3 counter-metrics to:
1. Prevent gaming the system
2. Protect long-term sustainability
3. Maintain team/user trust
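A guardrail only works if someone checks it. Here is a minimal Python sketch of a threshold check using the "minimum acceptable" convention from the counter-metrics table; the metric names, current values, and thresholds are examples, not prescriptions:
```python
# Illustrative guardrail check: flag counter-metrics that fall below their minimum threshold.
counter_metrics = {
    "test_coverage_pct": {"current": 83.0, "minimum": 80.0},
    "team_happiness": {"current": 6.5, "minimum": 7.0},
}

def breached_guardrails(metrics: dict) -> list[str]:
    """Return the counter-metrics currently below their minimum acceptable value."""
    return [name for name, m in metrics.items() if m["current"] < m["minimum"]]

for name in breached_guardrails(counter_metrics):
    # In practice this would notify the metric owner per the escalation trigger column
    print(f"ESCALATE: {name} is below its minimum acceptable threshold")
```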
## Instrumentation Guidance
**Baseline**: Measure current state before launch (✅ "p99 latency: 500ms" ❌ "API is slow"). Without baseline, you can't measure improvement.
**Targets**: Make them specific ("reduce p99 from 500ms to 200ms"), achievable (industry benchmarks), time-bound ("by Q2 end"), meaningful (tied to outcomes).
**Measurement**: Document data source, calculation method, measurement frequency, and who can access the metrics.
**Tracking Cadence**: Real-time (system health), daily (operations), weekly (product), monthly (business).
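One way to capture a numeric baseline is to compute percentiles directly from raw samples. The sketch below uses fabricated latency data to produce the p50/p95/p99 values that would populate the Baseline column; the sample generation is an assumption for the example:
```python
# Minimal sketch for recording a latency baseline before launch; sample data is fabricated.
import random
import statistics

random.seed(0)
latencies_ms = [random.lognormvariate(5.0, 0.6) for _ in range(10_000)]  # stand-in for real samples

# statistics.quantiles with n=100 yields the 1st..99th percentile cut points
pct = statistics.quantiles(latencies_ms, n=100)
baseline = {"p50_ms": pct[49], "p95_ms": pct[94], "p99_ms": pct[98]}
print({name: round(value, 1) for name, value in baseline.items()})
```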
## Quality Checklist
Before delivering, verify:
**Specification Complete:**
- [ ] Goal, stakeholders, timeline clearly stated
- [ ] Current state baseline documented with data
- [ ] Scope (in/out/future) explicitly defined
- [ ] Requirements are specific and measurable
- [ ] Dependencies and assumptions listed
- [ ] Timeline has concrete milestones
**Risks Comprehensive:**
- [ ] Premortem conducted (failure imagination exercise)
- [ ] Risks span technical, operational, organizational, external
- [ ] Each risk has likelihood, impact, score
- [ ] High-risk items (6-9 score) have detailed mitigation plans
- [ ] Every risk has an assigned owner
- [ ] Risk register is prioritized by score
**Metrics Measurable:**
- [ ] 3-5 leading indicators (early signals) defined
- [ ] 3-5 lagging indicators (outcomes) defined
- [ ] 2-3 counter-metrics (guardrails) defined
- [ ] Each metric has baseline, target, measurement method
- [ ] Metrics have clear owners and tracking cadence
- [ ] Success criteria (must/should/nice-to-have) documented
**Integration:**
- [ ] Risks map to specs (e.g., technical risks tied to architecture choices)
- [ ] Metrics validate risk mitigations (e.g., measure if mitigation worked)
- [ ] Specs enable metrics (e.g., instrumentation built into design)
- [ ] All three components tell a coherent story
**Rubric Validation:**
- [ ] Self-assessed with rubric ≥ 3.5 average across all criteria
- [ ] Stakeholders can act on this artifact
- [ ] Gaps and unknowns explicitly acknowledged