# Chain Spec Risk Metrics Template
## Workflow
Copy this checklist and track your progress:
```
Chain Spec Risk Metrics Progress:
- [ ] Step 1: Gather initiative context
- [ ] Step 2: Write comprehensive specification
- [ ] Step 3: Conduct premortem and build risk register
- [ ] Step 4: Define success metrics and instrumentation
- [ ] Step 5: Validate completeness and deliver
```
**Step 1: Gather initiative context** - Collect goal, constraints, stakeholders, baseline, desired outcomes. See [Context Questions](#context-questions).
**Step 2: Write comprehensive specification** - Document scope, approach, requirements, dependencies, timeline. See [Quick Template](#quick-template) and [Specification Guidance](#specification-guidance).
**Step 3: Conduct premortem and build risk register** - Run failure imagination exercise, categorize risks, assign mitigations/owners. See [Premortem Process](#premortem-process) and [Risk Register Template](#risk-register-template).
**Step 4: Define success metrics and instrumentation** - Identify leading/lagging indicators, baselines, targets, counter-metrics. See [Metrics Template](#metrics-template) and [Instrumentation Guidance](#instrumentation-guidance).
**Step 5: Validate completeness and deliver** - Self-check with rubric (≥3.5 average), ensure all components align. See [Quality Checklist](#quality-checklist).
## Quick Template
```markdown
# [Initiative Name]: Spec, Risks, Metrics
## 1. Specification
### 1.1 Executive Summary
- **Goal**: [What are you building/changing and why?]
- **Timeline**: [When does this need to be delivered?]
- **Stakeholders**: [Who cares about this initiative?]
- **Success Criteria**: [What does done look like?]
### 1.2 Current State (Baseline)
- [Describe the current state/system]
- [Key metrics/data points about current state]
- [Pain points or limitations driving this initiative]
### 1.3 Proposed Approach
- **Architecture/Design**: [High-level approach]
- **Implementation Plan**: [Phases, milestones, dependencies]
- **Technology Choices**: [Tools, frameworks, platforms with rationale]
### 1.4 Scope
- **In Scope**: [What this initiative includes]
- **Out of Scope**: [What is explicitly excluded]
- **Future Considerations**: [What might come later]
### 1.5 Requirements
**Functional Requirements:**
- [Feature/capability 1]
- [Feature/capability 2]
- [Feature/capability 3]
**Non-Functional Requirements:**
- **Performance**: [Latency, throughput, scalability targets]
- **Reliability**: [Uptime, error rates, recovery time]
- **Security**: [Authentication, authorization, data protection]
- **Compliance**: [Regulatory, policy, audit requirements]
### 1.6 Dependencies & Assumptions
- **Dependencies**: [External teams, systems, resources needed]
- **Assumptions**: [What we're assuming is true]
- **Constraints**: [Budget, time, resource, technical limitations]
### 1.7 Timeline & Milestones
| Milestone | Date | Deliverable |
|-----------|------|-------------|
| [Phase 1] | [Date] | [What's delivered] |
| [Phase 2] | [Date] | [What's delivered] |
| [Phase 3] | [Date] | [What's delivered] |
## 2. Risk Analysis
### 2.1 Premortem Summary
**Exercise Prompt**: "It's [X months] from now. This initiative failed catastrophically. What went wrong?"
**Top failure modes identified:**
1. [Failure mode 1]
2. [Failure mode 2]
3. [Failure mode 3]
4. [Failure mode 4]
5. [Failure mode 5]
### 2.2 Risk Register
| Risk ID | Risk Description | Category | Likelihood | Impact | Risk Score | Mitigation Strategy | Owner | Status |
|---------|-----------------|----------|------------|--------|-----------|---------------------|-------|--------|
| R1 | [Specific failure scenario] | Technical | High | High | 9 | [How you'll prevent/reduce] | [Name] | Open |
| R2 | [Specific failure scenario] | Operational | Medium | High | 6 | [How you'll prevent/reduce] | [Name] | Open |
| R3 | [Specific failure scenario] | Organizational | Medium | Medium | 4 | [How you'll prevent/reduce] | [Name] | Open |
| R4 | [Specific failure scenario] | External | Low | High | 3 | [How you'll prevent/reduce] | [Name] | Open |
**Risk Scoring**: Low=1, Medium=2, High=3. Risk Score = Likelihood × Impact.
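To make the scoring arithmetic concrete, here is a minimal Python sketch of a register entry and its score; the risks, owners, and field names are illustrative, not part of the template:
```python
# Minimal sketch of the Likelihood x Impact scoring; all values below are illustrative.
from dataclasses import dataclass

LEVELS = {"Low": 1, "Medium": 2, "High": 3}

@dataclass
class Risk:
    risk_id: str
    description: str
    category: str    # Technical / Operational / Organizational / External
    likelihood: str  # Low / Medium / High
    impact: str      # Low / Medium / High
    owner: str
    status: str = "Open"

    @property
    def score(self) -> int:
        # Risk Score = Likelihood x Impact, giving a 1-9 scale
        return LEVELS[self.likelihood] * LEVELS[self.impact]

register = [
    Risk("R1", "Migration script drops FK relationships", "Technical", "High", "High", "Dana"),
    Risk("R4", "Vendor API rate limits during launch", "External", "Low", "High", "Sam"),
]

# Prioritize the register by score, highest first; scores of 6-9 get detailed mitigation plans
for risk in sorted(register, key=lambda r: r.score, reverse=True):
    print(f"{risk.risk_id}: score {risk.score} ({risk.likelihood} x {risk.impact}), owner {risk.owner}")
```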
**Risk Categories**:
- **Technical**: Architecture, code quality, infrastructure, performance
- **Operational**: Processes, runbooks, support, operations
- **Organizational**: Resources, skills, alignment, communication
- **External**: Market, vendors, regulation, dependencies
### 2.3 Risk Mitigation Timeline
- **Pre-Launch**: [Risks to mitigate before going live]
- **Launch Window**: [Monitoring and safeguards during launch]
- **Post-Launch**: [Ongoing monitoring and response]
## 3. Success Metrics
### 3.1 Leading Indicators (Early Signals)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Predictive metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
**Examples**: Deployment frequency, code review cycle time, test coverage, incident detection time
### 3.2 Lagging Indicators (Outcome Measures)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Outcome metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
**Examples**: Uptime, user adoption rate, revenue impact, customer satisfaction score
### 3.3 Counter-Metrics (What We Won't Sacrifice)
| Metric | Threshold | Monitoring Method | Escalation Trigger |
|--------|-----------|-------------------|-------------------|
| [Protection metric 1] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 2] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 3] | [Minimum acceptable] | [How monitored] | [When to escalate] |
**Examples**: Code quality (test coverage > 80%), team morale (happiness score > 7/10), security posture (no critical vulnerabilities), user privacy (zero data leaks)
### 3.4 Success Criteria Summary
**Must-haves (hard requirements)**:
- [Critical success criterion 1]
- [Critical success criterion 2]
- [Critical success criterion 3]
**Should-haves (target achievements)**:
- [Desired outcome 1]
- [Desired outcome 2]
- [Desired outcome 3]
**Nice-to-haves (stretch goals)**:
- [Aspirational outcome 1]
- [Aspirational outcome 2]
```
## Context Questions
**Basics**: What are you building/changing? Why now? Who are the stakeholders?
**Constraints**: When does this need to be delivered? What's the budget? What resources are available?
**Current State**: What exists today? What's the baseline? What are the pain points?
**Desired Outcomes**: What does success look like? What metrics would you track? What are you most worried about?
**Scope**: Greenfield/migration/enhancement? Single or multi-phase? Who is the primary user?
## Specification Guidance
### Scope Definition
**In Scope** should be specific and concrete:
- ✅ "Migrate user authentication service to OAuth 2.0 with PKCE"
- ❌ "Improve authentication"
**Out of Scope** prevents scope creep:
- List explicitly what won't be included in this phase
- Reference future work that might address excluded items
- Example: "Out of Scope: Social login (Google/GitHub) - deferred to Phase 2"
### Requirements Best Practices
**Functional Requirements**: What the system must do
- Use "must" for requirements, "should" for preferences
- Be specific: "System must handle 10K requests/sec" not "System should be fast"
- Include acceptance criteria: How will you verify this requirement is met?
**Non-Functional Requirements**: How well the system must perform
- **Performance**: Quantify with numbers (latency < 200ms, throughput > 1000 req/s)
- **Reliability**: Define uptime SLAs, error budgets, MTTR targets
- **Security**: Specify authentication, authorization, encryption requirements
- **Scalability**: Define growth expectations (users, data, traffic)
### Timeline & Milestones
Make milestones concrete and verifiable:
- ✅ "Milestone 1 (Mar 15): Auth service deployed to staging, 100% auth tests passing"
- ❌ "Milestone 1: Complete most of auth work"
Include buffer for unknowns:
- 80% planned work, 20% buffer for issues/tech debt (e.g., 8 weeks of planned work implies roughly a 10-week timeline)
- Identify critical path and dependencies clearly
## Premortem Process
### Step 1: Set the Scene
Frame the failure scenario clearly:
> "It's [6/12/24] months from now - [Date]. We launched [initiative name] and it failed catastrophically. [Stakeholders] are upset. The team is demoralized. What went wrong?"
Choose timeframe based on initiative:
- Quick launch: 3-6 months
- Major migration: 12-24 months
- Strategic change: 24+ months
### Step 2: Brainstorm Failures (Independent)
Have each stakeholder privately write 3-5 specific failure modes:
- Be specific: "Database migration lost 10K user records" not "data issues"
- Think from your domain: engineers focus on technical failures, PMs on product failures, Ops on operational failures
- No filtering: List even unlikely scenarios
### Step 3: Share and Cluster
Collect all failure modes and group similar ones:
- **Technical failures**: System design, code bugs, infrastructure
- **Operational failures**: Runbooks missing, wrong escalation, poor monitoring
- **Organizational failures**: Lack of skills, poor communication, misaligned incentives
- **External failures**: Vendor issues, market changes, regulatory changes
### Step 4: Vote and Prioritize
For each cluster, assess:
- **Likelihood**: How probable is this? (Low 10-30%, Medium 30-60%, High 60%+)
- **Impact**: How bad would it be? (Low = annoying, Medium = painful, High = catastrophic)
- **Risk Score**: Likelihood × Impact (1-9 scale)
Focus mitigation on High-risk items (score 6-9).
### Step 5: Convert to Risk Register
For each significant failure mode:
1. Reframe as a risk: "Risk that [specific scenario] happens"
2. Categorize (Technical/Operational/Organizational/External)
3. Assign owner (who monitors and responds)
4. Define mitigation (how to reduce likelihood or impact)
5. Track status (Open/Mitigated/Accepted/Closed)
## Risk Register Template
### Risk Statement Format
Good risk statements are specific and measurable:
- ✅ "Risk that database migration script fails to preserve foreign key relationships, causing data integrity errors in 15% of records"
- ❌ "Risk of data issues"
### Mitigation Strategies
**Reduce Likelihood** (prevent the risk):
- Example: "Implement dry-run migration on production snapshot; verify all FK relationships before live migration"
**Reduce Impact** (limit the damage):
- Example: "Create rollback script tested on staging; set up real-time monitoring for FK violations; keep read replica as backup"
**Accept Risk** (consciously choose not to mitigate):
- For low-impact or very-low-likelihood risks
- Document why it's acceptable: "Accept risk of 3rd-party API rate limiting during launch (low likelihood, workaround available)"
**Transfer Risk** (shift to vendor/insurance):
- Example: "Use managed database service with automated backups and point-in-time recovery (transfers operational risk to vendor)"
### Risk Owners
Each risk needs a clear owner who:
- Monitors early warning signals
- Executes mitigation plans
- Escalates if risk materializes
- Updates risk status regularly
The risk owner is not necessarily the person who fixes the issue, but the person accountable for tracking it.
## Metrics Template
### Leading Indicators (Early Signals)
These predict future success before lagging metrics move:
- **Engineering**: Deployment frequency, build time, code review cycle time, test coverage
- **Product**: Feature adoption rate, activation rate, time-to-value, engagement trends
- **Operations**: Incident detection time (MTTD), runbook execution rate, alert accuracy
Choose 3-5 leading indicators that:
1. Predict lagging outcomes (validated correlation)
2. Can be measured frequently (daily/weekly)
3. You can act on quickly (short feedback loop)
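As an illustration of a leading indicator that can be measured on a short cadence, here is a minimal Python sketch that buckets deployment timestamps into deploys per week; the timestamps are made up, and in practice the data would come from your CI/CD system:
```python
# Illustrative sketch: deployment frequency (deploys per week) as a leading indicator.
from collections import Counter
from datetime import datetime

deploy_times = [
    datetime(2025, 3, 3, 14, 0),
    datetime(2025, 3, 4, 9, 30),
    datetime(2025, 3, 10, 16, 45),
    datetime(2025, 3, 12, 11, 15),
]

# Bucket deploys by ISO year/week so the metric can be tracked on a weekly cadence
per_week = Counter(t.isocalendar()[:2] for t in deploy_times)
for (year, week), count in sorted(per_week.items()):
    print(f"{year}-W{week:02d}: {count} deploys")
```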
### Lagging Indicators (Outcomes)
These measure actual success but appear later:
- **Reliability**: Uptime, error rate, MTTR, SLA compliance
- **Performance**: p50/p95/p99 latency, throughput, response time
- **Business**: Revenue, user growth, retention, NPS, cost savings
- **User**: Activation rate, feature adoption, satisfaction score
Choose 3-5 lagging indicators that:
1. Directly measure initiative goals
2. Are measurable and objective
3. Matter to stakeholders
### Counter-Metrics (Guardrails)
What success does NOT mean sacrificing:
- If optimizing for speed → Counter-metric: Code quality (test coverage, bug rate)
- If optimizing for growth → Counter-metric: Costs (infrastructure spend, CAC)
- If optimizing for features → Counter-metric: Technical debt (cycle time, deployment frequency)
Choose 2-3 counter-metrics to:
1. Prevent gaming the system
2. Protect long-term sustainability
3. Maintain team/user trust
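A guardrail only works if someone checks it. Here is a minimal Python sketch of a threshold check using the "minimum acceptable" convention from the counter-metrics table; the metric names, current values, and thresholds are examples, not prescriptions:
```python
# Illustrative guardrail check: flag counter-metrics that fall below their minimum threshold.
counter_metrics = {
    "test_coverage_pct": {"current": 83.0, "minimum": 80.0},
    "team_happiness": {"current": 6.5, "minimum": 7.0},
}

def breached_guardrails(metrics: dict) -> list[str]:
    """Return the counter-metrics currently below their minimum acceptable value."""
    return [name for name, m in metrics.items() if m["current"] < m["minimum"]]

for name in breached_guardrails(counter_metrics):
    # In practice this would notify the metric owner per the escalation trigger column
    print(f"ESCALATE: {name} is below its minimum acceptable threshold")
```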
## Instrumentation Guidance
**Baseline**: Measure current state before launch (✅ "p99 latency: 500ms" ❌ "API is slow"). Without baseline, you can't measure improvement.
**Targets**: Make them specific ("reduce p99 from 500ms to 200ms"), achievable (industry benchmarks), time-bound ("by Q2 end"), meaningful (tied to outcomes).
**Measurement**: Document data source, calculation method, measurement frequency, and who can access the metrics.
**Tracking Cadence**: Real-time (system health), daily (operations), weekly (product), monthly (business).
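One way to capture a numeric baseline is to compute percentiles directly from raw samples. The sketch below uses fabricated latency data to produce the p50/p95/p99 values that would populate the Baseline column; the sample generation is an assumption for the example:
```python
# Minimal sketch for recording a latency baseline before launch; sample data is fabricated.
import random
import statistics

random.seed(0)
latencies_ms = [random.lognormvariate(5.0, 0.6) for _ in range(10_000)]  # stand-in for real samples

# statistics.quantiles with n=100 yields the 1st..99th percentile cut points
pct = statistics.quantiles(latencies_ms, n=100)
baseline = {"p50_ms": pct[49], "p95_ms": pct[94], "p99_ms": pct[98]}
print({name: round(value, 1) for name, value in baseline.items()})
```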
## Quality Checklist
Before delivering, verify:
**Specification Complete:**
- [ ] Goal, stakeholders, timeline clearly stated
- [ ] Current state baseline documented with data
- [ ] Scope (in/out/future) explicitly defined
- [ ] Requirements are specific and measurable
- [ ] Dependencies and assumptions listed
- [ ] Timeline has concrete milestones
**Risks Comprehensive:**
- [ ] Premortem conducted (failure imagination exercise)
- [ ] Risks span technical, operational, organizational, external
- [ ] Each risk has likelihood, impact, score
- [ ] High-risk items (6-9 score) have detailed mitigation plans
- [ ] Every risk has an assigned owner
- [ ] Risk register is prioritized by score
**Metrics Measurable:**
- [ ] 3-5 leading indicators (early signals) defined
- [ ] 3-5 lagging indicators (outcomes) defined
- [ ] 2-3 counter-metrics (guardrails) defined
- [ ] Each metric has baseline, target, measurement method
- [ ] Metrics have clear owners and tracking cadence
- [ ] Success criteria (must/should/nice-to-have) documented
**Integration:**
- [ ] Risks map to specs (e.g., technical risks tied to architecture choices)
- [ ] Metrics validate risk mitigations (e.g., measure if mitigation worked)
- [ ] Specs enable metrics (e.g., instrumentation built into design)
- [ ] All three components tell a coherent story
**Rubric Validation:**
- [ ] Self-assessed with rubric ≥ 3.5 average across all criteria
- [ ] Stakeholders can act on this artifact
- [ ] Gaps and unknowns explicitly acknowledged