Initial commit

Zhongwei Li
2025-11-30 08:38:26 +08:00
commit 41d9f6b189
304 changed files with 98322 additions and 0 deletions


@@ -0,0 +1,155 @@
---
name: chain-spec-risk-metrics
description: Use when planning high-stakes initiatives (migrations, launches, strategic changes) that require clear specifications, proactive risk identification (premortem/register), and measurable success criteria. Invoke when user mentions "plan this migration", "launch strategy", "implementation roadmap", "what could go wrong", "how do we measure success", or when high-impact decisions need comprehensive planning with risk mitigation and instrumentation.
---
# Chain Spec Risk Metrics
## Table of Contents
- [Purpose](#purpose)
- [When to Use This Skill](#when-to-use-this-skill)
- [When NOT to Use This Skill](#when-not-to-use-this-skill)
- [What Is It?](#what-is-it)
- [Workflow](#workflow)
- [Common Patterns](#common-patterns)
- [Guardrails](#guardrails)
- [Quick Reference](#quick-reference)
## Purpose
This skill helps you create comprehensive plans for high-stakes initiatives by chaining together three critical components: clear specifications, proactive risk analysis, and measurable success metrics. It ensures initiatives are well-defined, risks are anticipated and mitigated, and progress can be objectively tracked.
## When to Use This Skill
Use this skill when you need to:
- **Plan complex implementations** - Migrations, infrastructure changes, system redesigns requiring detailed specs
- **Launch new initiatives** - Products, features, programs that need risk assessment and success measurement
- **Make high-stakes decisions** - Strategic choices where failure modes must be identified and monitored
- **Coordinate cross-functional work** - Initiatives requiring clear specifications for alignment and risk transparency
- **Request comprehensive planning** - User asks "plan this migration", "create implementation roadmap", "what could go wrong?"
- **Establish accountability** - Need clear success criteria and risk owners for governance
**Trigger phrases:**
- "Plan this [migration/launch/implementation]"
- "Create a roadmap for..."
- "What could go wrong with..."
- "How do we measure success for..."
- "Write a spec that includes risks and metrics"
- "Comprehensive plan with risk mitigation"
## When NOT to Use This Skill
Skip this skill when:
- **Quick decisions** - Low-stakes choices don't need full premortem treatment
- **Specifications only** - If user just needs a spec without risk/metrics analysis (use one-pager-prd or adr-architecture instead)
- **Risk analysis only** - If focused solely on identifying risks (use project-risk-register or postmortem instead)
- **Metrics only** - If just defining KPIs (use metrics-tree instead)
- **Already decided and executing** - Use postmortem or reviews-retros-reflection for retrospectives
- **Brainstorming alternatives** - Use brainstorm-diverge-converge to generate options first
## What Is It?
Chain Spec Risk Metrics is a meta-skill that combines three complementary techniques into a comprehensive planning artifact:
1. **Specification** - Define what you're building/changing with clarity (scope, requirements, approach, timeline)
2. **Risk Analysis** - Identify what could go wrong through premortem ("imagine we failed - why?") and create risk register with mitigations
3. **Success Metrics** - Define measurable outcomes to track progress and validate success
**Quick example:**
> **Initiative:** Migrate monolith to microservices
>
> **Spec:** Decompose into 5 services (auth, user, order, inventory, payment), API gateway, shared data patterns
>
> **Risks:**
> - Data consistency issues between services (High) → Implement saga pattern with compensation
> - Performance degradation from network hops (Medium) → Load test with production traffic patterns
>
> **Metrics:**
> - Deployment frequency (target: 10+ per week, baseline: 2 per week)
> - API p99 latency (budget: < 200ms even with added network hops; baseline: 150ms)
> - Mean time to recovery (target: < 30min, baseline: 2 hours)
## Workflow
Copy this checklist and track your progress:
```
Chain Spec Risk Metrics Progress:
- [ ] Step 1: Gather initiative context
- [ ] Step 2: Write comprehensive specification
- [ ] Step 3: Conduct premortem and build risk register
- [ ] Step 4: Define success metrics and instrumentation
- [ ] Step 5: Validate completeness and deliver
```
**Step 1: Gather initiative context**
Ask user for the initiative goal, constraints (time/budget/resources), stakeholders, current state (baseline), and desired outcomes. Clarify whether this is a greenfield build, migration, enhancement, or strategic change. See [resources/template.md](resources/template.md) for full context questions.
**Step 2: Write comprehensive specification**
Create detailed specification covering scope (what's in/out), approach (architecture/methodology), requirements (functional/non-functional), dependencies, timeline, and success criteria. For standard initiatives use [resources/template.md](resources/template.md); for complex multi-phase programs see [resources/methodology.md](resources/methodology.md) for decomposition techniques.
**Step 3: Conduct premortem and build risk register**
Run premortem exercise: "Imagine 12 months from now this initiative failed spectacularly. What went wrong?" Identify risks across technical, operational, organizational, and external dimensions. For each risk document likelihood, impact, mitigation strategy, and owner. See [Premortem Technique](#premortem-technique) and [Risk Register Structure](#risk-register-structure) sections, or [resources/methodology.md](resources/methodology.md) for advanced risk assessment methods.
**Step 4: Define success metrics and instrumentation**
Identify leading indicators (early signals), lagging indicators (outcome measures), and counter-metrics (what you're NOT willing to sacrifice). Specify current baseline, target values, measurement method, and tracking cadence for each metric. See [Metrics Framework](#metrics-framework) and use [resources/template.md](resources/template.md) for standard structure.
**Step 5: Validate completeness and deliver**
Self-check the complete artifact using [resources/evaluators/rubric_chain_spec_risk_metrics.json](resources/evaluators/rubric_chain_spec_risk_metrics.json). Ensure specification is clear and actionable, risks are comprehensive with mitigations, metrics measure actual success, and all three components reinforce each other. Minimum standard: Average score ≥ 3.5 across all criteria.
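A minimal self-check sketch (TypeScript), assuming the rubric JSON has already been parsed and that you assign yourself one score per criterion; the `criteria` and `minimum_average_score` fields mirror the rubric file, but the scoring object and its values are illustrative.
```typescript
// Hypothetical self-check: average the per-criterion scores and compare them
// against the rubric's minimum_average_score (3.5 for this skill).
interface Rubric {
  criteria: { name: string }[];
  minimum_average_score: number;
}

function passesRubric(rubric: Rubric, scores: Record<string, number>): boolean {
  const values = rubric.criteria.map((c) => scores[c.name] ?? 0);
  const average = values.reduce((sum, v) => sum + v, 0) / values.length;
  return average >= rubric.minimum_average_score;
}

// Illustrative usage with made-up self-assessment scores:
// passesRubric(parsedRubric, { "Specification Clarity": 4, "Risk Quantification": 3.5 /* ... */ });
```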
## Common Patterns
### Premortem Technique
1. **Set the scene**: "It's [6/12/24] months from now. This initiative failed catastrophically."
2. **Brainstorm failure causes**: Each stakeholder writes 3-5 reasons why it failed (independently first)
3. **Cluster and prioritize**: Group similar failures, vote on likelihood and impact
4. **Convert to risk register**: Each failure mode becomes a risk with mitigation plan
### Risk Register Structure
For each identified risk, document:
- **Risk description**: Specific failure mode (not vague "project delay")
- **Category**: Technical, operational, organizational, external
- **Likelihood**: Low/Medium/High (or probability %)
- **Impact**: Low/Medium/High (or cost estimate)
- **Mitigation strategy**: What you'll do to reduce likelihood or impact
- **Owner**: Who monitors and responds to this risk
- **Status**: Open, Mitigated, Accepted, Closed
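A sketch of how a register entry can be represented and scored (TypeScript); the field names and the 1–3 level mapping are assumptions for illustration, not a required schema.
```typescript
type Level = "Low" | "Medium" | "High";

interface RiskEntry {
  id: string;
  description: string;          // specific failure mode, not "project delay"
  category: "Technical" | "Operational" | "Organizational" | "External";
  likelihood: Level;
  impact: Level;
  mitigation: string;
  owner: string;
  status: "Open" | "Mitigated" | "Accepted" | "Closed";
}

const levelScore: Record<Level, number> = { Low: 1, Medium: 2, High: 3 };

// Score ranges 1-9; items scoring 6-9 get detailed mitigation plans first.
const riskScore = (r: RiskEntry): number =>
  levelScore[r.likelihood] * levelScore[r.impact];
```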
### Metrics Framework
**Leading indicators** (predict future success):
- Deployment frequency, code review velocity, incident detection time
**Lagging indicators** (measure outcomes):
- Uptime, user adoption, revenue impact, customer satisfaction
**Counter-metrics** (what you're NOT willing to sacrifice):
- Code quality, team morale, security posture, user privacy
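A small sketch of capturing all three metric types in one structure (TypeScript); the field names and example values are illustrative assumptions.
```typescript
interface MetricDefinition {
  name: string;
  kind: "leading" | "lagging" | "counter";   // counter = guardrail you won't sacrifice
  baseline: number;
  target: number;
  unit: string;                               // e.g. "deploys/week", "ms", "%"
  measurementMethod: string;
  cadence: "real-time" | "daily" | "weekly" | "monthly";
  owner: string;
}

const deploymentFrequency: MetricDefinition = {
  name: "Deployment frequency",
  kind: "leading",
  baseline: 2,
  target: 10,
  unit: "deploys/week",
  measurementMethod: "CI/CD pipeline events",
  cadence: "weekly",
  owner: "Tech Lead",
};
```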
## Guardrails
- **Don't skip any component** - Spec without risks = blind spots; risks without metrics = unvalidated mitigations
- **Be specific in specifications** - "Improve performance" is not a spec; "Reduce p99 API latency from 500ms to 200ms" is
- **Quantify risks** - Use likelihood × impact scores to prioritize; don't treat all risks equally
- **Make metrics measurable** - "Better UX" is not measurable; "Increase checkout completion from 67% to 75%" is
- **Assign owners** - Every risk and metric needs a clear owner who monitors and acts
- **State assumptions explicitly** - Document what you're assuming about resources, timelines, dependencies
- **Include counter-metrics** - Always define what success does NOT mean sacrificing
- **Update as you learn** - This is a living document; revisit after milestones to update risks/metrics
## Quick Reference
| Component | When to Use | Resource |
|-----------|-------------|----------|
| **Template** | Standard initiatives with known patterns | [resources/template.md](resources/template.md) |
| **Methodology** | Complex multi-phase programs, novel risks | [resources/methodology.md](resources/methodology.md) |
| **Examples** | See what good looks like | [resources/examples/](resources/examples/) |
| **Rubric** | Validate before delivering | [resources/evaluators/rubric_chain_spec_risk_metrics.json](resources/evaluators/rubric_chain_spec_risk_metrics.json) |


@@ -0,0 +1,279 @@
{
"criteria": [
{
"name": "Specification Clarity",
"description": "Is the initiative goal, scope, approach, and timeline clearly defined and actionable?",
"scoring": {
"1": "Vague goal ('improve system') with no clear scope or timeline. Stakeholders can't act on this.",
"2": "General goal stated but scope unclear (what's in/out?). Timeline missing or unrealistic.",
"3": "Goal, scope, timeline stated but lacks detail. Approach mentioned but not explained. Acceptable for low-stakes.",
"4": "Clear goal, explicit scope (in/out), realistic timeline with milestones. Approach well-explained. Good for medium-stakes.",
"5": "Crystal clear goal tied to business outcome. Precise scope with rationale. Detailed approach with diagrams. Timeline has buffer and dependencies. Exemplary for high-stakes."
},
"red_flags": [
"Spec says 'improve performance' without quantifying what that means",
"No 'out of scope' section (scope creep likely)",
"Timeline has no buffer or dependencies identified",
"Approach section just lists technology choices without explaining why"
]
},
{
"name": "Specification Completeness",
"description": "Are all necessary components covered (current state, requirements, dependencies, assumptions)?",
"scoring": {
"1": "Major sections missing. No baseline, no requirements, or no dependencies documented.",
"2": "Some sections present but incomplete. Requirements exist but vague ('system should be fast').",
"3": "All major sections present. Requirements specific but could be more detailed. Acceptable for low-stakes.",
"4": "Comprehensive: baseline with data, specific requirements, dependencies and assumptions explicit. Good for medium-stakes.",
"5": "Exhaustive: current state with metrics, functional + non-functional requirements with acceptance criteria, all dependencies mapped, assumptions validated. Exemplary for high-stakes."
},
"red_flags": [
"No current state baseline (can't measure improvement)",
"Requirements mix functional and non-functional without clear categories",
"No assumptions stated (hidden risks)",
"Dependencies mentioned but not explicitly called out in own section"
]
},
{
"name": "Risk Analysis Comprehensiveness",
"description": "Are risks identified across all dimensions (technical, operational, organizational, external)?",
"scoring": {
"1": "No risks identified, or only 1-2 obvious risks listed. Major blind spots.",
"2": "3-5 risks identified but all in one category (e.g., only technical). Missing organizational, external risks.",
"3": "5-10 risks covering technical and operational. Some organizational risks. Acceptable for low-stakes.",
"4": "10-15 risks across all four categories. Premortem conducted. Covers non-obvious risks. Good for medium-stakes.",
"5": "15+ risks identified through structured premortem. All categories covered with specific failure modes. Includes low-probability/high-impact risks. Exemplary for high-stakes."
},
"red_flags": [
"Risk register is just a list of vague concerns ('project might be delayed')",
"All risks are technical (missing organizational, external)",
"No premortem conducted (risks are just obvious failure modes)",
"Low-probability/high-impact risks ignored (e.g., key person leaves)"
]
},
{
"name": "Risk Quantification",
"description": "Are risks scored by likelihood and impact, with clear prioritization?",
"scoring": {
"1": "No risk scoring. Can't tell which risks are most important.",
"2": "Risks listed but no likelihood/impact assessment. Unclear which to prioritize.",
"3": "Likelihood and impact assessed (Low/Med/High) for each risk. Priority clear. Acceptable for low-stakes.",
"4": "Likelihood (%) and impact (cost/time) quantified. Risk scores calculated. Top risks prioritized. Good for medium-stakes.",
"5": "Quantitative risk analysis: probability distributions, expected loss, mitigation cost-benefit. Risks ranked by expected value. Exemplary for high-stakes."
},
"red_flags": [
"All risks marked 'High' (no actual prioritization)",
"Likelihood/impact inconsistent (50% likelihood but marked 'Low'?)",
"No risk score or priority ranking (can't tell what to focus on)",
"Mitigation cost not compared to expected loss (over-mitigating low-risk items)"
]
},
{
"name": "Risk Mitigation Depth",
"description": "Does each high-priority risk have specific, actionable mitigation strategies with owners?",
"scoring": {
"1": "No mitigation strategies. Just a list of risks.",
"2": "Generic mitigations ('monitor closely', 'be careful'). Not actionable.",
"3": "Specific mitigations for top 5 risks. Owners assigned. Acceptable for low-stakes.",
"4": "Detailed mitigations for all high-risk items (score 6-9). Clear actions, owners, status tracking. Good for medium-stakes.",
"5": "Comprehensive mitigations with preventive + detective + corrective controls. Cost-benefit analysis. Rollback plans. Continuous monitoring. Exemplary for high-stakes."
},
"red_flags": [
"Mitigation is vague ('increase testing') without specifics",
"No owners assigned to risks (accountability missing)",
"High-risk items have no mitigation plan",
"Mitigations are all preventive (no detective/corrective controls if prevention fails)"
]
},
{
"name": "Metrics Measurability",
"description": "Are metrics specific, measurable, with clear baselines, targets, and measurement methods?",
"scoring": {
"1": "No metrics, or metrics are unmeasurable ('better UX', 'improved reliability').",
"2": "Metrics stated but no baseline or target ('track uptime'). Can't assess success.",
"3": "3-5 metrics with baselines and targets. Measurement method implied but not explicit. Acceptable for low-stakes.",
"4": "5-10 metrics with baselines, targets, measurement methods, tracking cadence, owners. Good for medium-stakes.",
"5": "Comprehensive metrics framework (leading/lagging/counter). All metrics SMART (specific, measurable, achievable, relevant, time-bound). Instrumentation plan. Exemplary for high-stakes."
},
"red_flags": [
"Metric is subjective ('improved user experience') without quantification",
"No baseline (can't measure improvement)",
"Target is vague ('reduce latency') without number",
"Measurement method missing (how will you actually track this?)"
]
},
{
"name": "Leading/Lagging Balance",
"description": "Are there both leading indicators (early signals) and lagging indicators (outcomes)?",
"scoring": {
"1": "Only lagging indicators (outcomes). No early warning signals.",
"2": "Mostly lagging. One or two leading indicators but not well-chosen.",
"3": "2-3 leading indicators (predict outcomes) and 3-5 lagging (measure outcomes). Acceptable for low-stakes.",
"4": "Balanced: 3-5 leading indicators that predict lagging outcomes. Tracking cadence matches (leading daily/weekly, lagging monthly). Good for medium-stakes.",
"5": "Sophisticated framework: leading indicators validated to predict lagging. Includes counter-metrics to prevent gaming. Dashboard with real-time leading, periodic lagging. Exemplary for high-stakes."
},
"red_flags": [
"All metrics are outcomes (no early signals of trouble)",
"Leading indicators don't actually predict lagging (no validated correlation)",
"No counter-metrics (risk of gaming the system)",
"Tracking cadence wrong (measuring strategic metrics daily creates noise)"
]
},
{
"name": "Integration: Spec↔Risk↔Metrics",
"description": "Do the three components reinforce each other (specs enable metrics, risks map to specs, metrics validate mitigations)?",
"scoring": {
"1": "Components are disconnected. Metrics don't relate to spec goals. Risks don't map to spec decisions.",
"2": "Weak connections. Some overlap but mostly independent documents.",
"3": "Moderate integration. Risks reference spec sections. Some metrics measure risk mitigations. Acceptable for low-stakes.",
"4": "Strong integration. Major spec decisions have corresponding risks. High-risk items have metrics to detect issues. Metrics align with spec goals. Good for medium-stakes.",
"5": "Seamless integration. Specs include instrumentation for metrics. Risks mapped to specific spec choices with rationale. Metrics validate both spec assumptions and risk mitigations. Traceability matrix. Exemplary for high-stakes."
},
"red_flags": [
"Metrics don't align with spec goals (tracking unrelated things)",
"Spec makes technology choice but risks don't assess that choice",
"High-risk items have no corresponding metrics to detect if risk is materializing",
"Spec doesn't include instrumentation needed to collect metrics"
]
},
{
"name": "Actionability",
"description": "Can stakeholders act on this artifact? Are owners, timelines, and next steps clear?",
"scoring": {
"1": "No clear next steps. No owners assigned. Stakeholders can't act on this.",
"2": "Some next steps but vague ('start planning'). Owners missing or unclear.",
"3": "Next steps clear for immediate phase. Owners assigned to risks and metrics. Acceptable for low-stakes.",
"4": "Clear action plan with milestones, owners, dependencies. Stakeholders know what to do and when. Good for medium-stakes.",
"5": "Detailed execution plan with phase gates, decision points, escalation paths. RACI matrix for key activities. Stakeholders empowered to execute autonomously. Exemplary for high-stakes."
},
"red_flags": [
"No owners assigned to risks or metrics (accountability vacuum)",
"Timeline exists but no clear milestones or dependencies",
"Next steps are vague ('continue planning', 'monitor situation')",
"Unclear decision authority (who approves phase transitions?)"
]
},
{
"name": "Realism and Feasibility",
"description": "Is the plan realistic given constraints (time, budget, team)? Are assumptions validated?",
"scoring": {
"1": "Unrealistic plan (6-month timeline for 2-year project). Assumptions unvalidated. Will fail.",
"2": "Overly optimistic. Timeline has no buffer. Assumes best-case scenario throughout.",
"3": "Mostly realistic. Timeline includes some buffer (10-20%). Key assumptions stated. Acceptable for low-stakes.",
"4": "Realistic timeline with 20-30% buffer. Assumptions validated or explicitly called out as needs validation. Good for medium-stakes.",
"5": "Conservative timeline with 30%+ buffer and contingency plans. All assumptions validated or mitigated. Three-point estimates for uncertain items. Exemplary for high-stakes."
},
"red_flags": [
"Timeline has no buffer (assumes everything goes perfectly)",
"Assumes team has skills they don't have (no training plan)",
"Budget doesn't include contingency (cost overruns likely)",
"Critical assumptions not validated ('we assume API will handle 10K req/s' - did you test this?)"
]
}
],
"stakes_guidance": {
"low_stakes": {
"description": "Initiative < 1 eng-month, reversible, limited impact. Examples: Small feature, internal tool, process tweak.",
"target_score": 3.0,
"required_criteria": [
"Specification Clarity ≥ 3",
"Risk Analysis Comprehensiveness ≥ 3",
"Metrics Measurability ≥ 3"
],
"optional_criteria": [
"Risk Quantification (can use Low/Med/High)",
"Leading/Lagging Balance (3 metrics sufficient)"
]
},
"medium_stakes": {
"description": "Initiative 1-6 months, affects multiple teams, significant impact. Examples: Service migration, product launch, infrastructure change.",
"target_score": 3.5,
"required_criteria": [
"All criteria ≥ 3",
"Specification Completeness ≥ 4",
"Risk Mitigation Depth ≥ 4",
"Metrics Measurability ≥ 4"
],
"recommended": [
"Conduct premortem for risk analysis",
"Include counter-metrics to prevent gaming",
"Assign owners to all high-risk items and metrics"
]
},
"high_stakes": {
"description": "Initiative 6+ months, company-wide, strategic/existential impact. Examples: Architecture overhaul, market expansion, regulatory compliance.",
"target_score": 4.0,
"required_criteria": [
"All criteria ≥ 4",
"Risk Quantification ≥ 4 (use quantitative analysis)",
"Integration ≥ 4 (traceability matrix recommended)",
"Actionability ≥ 4 (detailed execution plan)"
],
"recommended": [
"Quantitative risk analysis (expected value, cost-benefit)",
"Advanced metrics frameworks (HEART, North Star, SLI/SLO)",
"Continuous validation loop (update risks/metrics monthly)",
"External review (architect, security, compliance)"
]
}
},
"common_failure_modes": [
{
"failure_mode": "Spec Without Risks",
"symptoms": "Detailed specification but no risk analysis. Assumes everything will go as planned.",
"consequences": "Blindsided by preventable failures. No mitigation plans when issues arise.",
"fix": "Run 30-minute premortem: 'Imagine this failed - why?' Identify top 10 risks and mitigate."
},
{
"failure_mode": "Risk Theater",
"symptoms": "50+ risks listed but no prioritization, mitigation, or owners. Just documenting everything that could go wrong.",
"consequences": "Analysis paralysis. Team can't focus. Risks aren't actually managed.",
"fix": "Score risks by likelihood × impact. Focus on top 10 (score 6-9). Assign owners and specific mitigations."
},
{
"failure_mode": "Vanity Metrics",
"symptoms": "Tracking activity metrics ('features shipped', 'lines of code') instead of outcome metrics ('user value', 'revenue').",
"consequences": "Team optimizes for the wrong thing. Looks busy but doesn't deliver value.",
"fix": "For each metric ask: 'If this goes up, are users/business better off?' Replace vanity with value metrics."
},
{
"failure_mode": "Plan and Forget",
"symptoms": "Beautiful spec/risk/metrics doc created then never referenced again.",
"consequences": "Doc becomes stale. Risks materialize but aren't detected. Metrics drift from goals.",
"fix": "Schedule monthly reviews. Update risks (new ones, status changes). Track metrics in team rituals (sprint reviews, all-hands)."
},
{
"failure_mode": "Premature Precision",
"symptoms": "Overconfident estimates: 'Migration will take exactly 47 days and cost $487,234.19'.",
"consequences": "False confidence. When reality diverges, team loses trust in planning.",
"fix": "Use ranges (30-60 days, $400-600K). State confidence levels (50%, 90%). Build in buffer (20-30%)."
},
{
"failure_mode": "Disconnected Components",
"symptoms": "Specs, risks, and metrics are separate documents that don't reference each other.",
"consequences": "Metrics don't validate spec assumptions. Risks aren't mitigated by spec choices. Incoherent plan.",
"fix": "Explicitly map: major spec decisions → corresponding risks → metrics that detect risk. Ensure traceability."
},
{
"failure_mode": "No Counter-Metrics",
"symptoms": "Optimizing for single metric without guardrails (e.g., 'ship faster!' without quality threshold).",
"consequences": "Gaming the system. Ship faster but quality tanks. Optimize costs but reliability suffers.",
"fix": "For each primary metric, define counter-metric: what you're NOT willing to sacrifice. Monitor both."
},
{
"failure_mode": "Analysis Paralysis",
"symptoms": "Spent 3 months planning, creating perfect spec/risks/metrics, haven't started building.",
"consequences": "Opportunity cost. Market moves on. Team demoralized by lack of progress.",
"fix": "Time-box planning (1-2 weeks for most initiatives). Embrace uncertainty. Learn by doing. Update plan as you learn."
}
],
"scale": 5,
"minimum_average_score": 3.5,
"interpretation": {
"1.0-2.0": "Inadequate. Major gaps in spec, risks, or metrics. Do not proceed. Revise artifact.",
"2.0-3.0": "Needs improvement. Some components present but incomplete or vague. Acceptable only for very low-stakes initiatives. Revise before proceeding with medium/high-stakes.",
"3.0-3.5": "Acceptable for low-stakes initiatives. Core components present with sufficient detail. For medium-stakes, strengthen risk analysis and metrics.",
"3.5-4.0": "Good. Ready for medium-stakes initiatives. Comprehensive spec, proactive risk management, measurable success criteria. For high-stakes, add quantitative analysis and continuous validation.",
"4.0-5.0": "Excellent. Ready for high-stakes initiatives. Exemplary planning with detailed execution plan, quantitative risk analysis, sophisticated metrics, and strong integration."
}
}


@@ -0,0 +1,346 @@
# Microservices Migration: Spec, Risks, Metrics
Complete worked example showing how to chain specification, risk analysis, and success metrics for a complex migration initiative.
## 1. Specification
### 1.1 Executive Summary
**Goal**: Migrate our monolithic e-commerce platform to microservices architecture to enable independent team velocity, improve reliability through service isolation, and reduce deployment risk.
**Timeline**: 18-month program (Q1 2024 - Q2 2025)
- Phase 1 (Q1-Q2 2024): Foundation + Auth service
- Phase 2 (Q3-Q4 2024): User, Order, Inventory services
- Phase 3 (Q1-Q2 2025): Payment, Notification services + monolith sunset
**Stakeholders**:
- **Exec Sponsor**: CTO (accountable for initiative success)
- **Engineering**: 3 teams (15 engineers total)
- **Product**: Head of Product (feature velocity impact)
- **Operations**: SRE team (operational complexity)
- **Finance**: CFO (infrastructure cost impact)
**Success Criteria**: Independent service deployments, MTTR <30 min, 99.95% uptime, no customer-facing regressions.
### 1.2 Current State (Baseline)
**Architecture**: Ruby on Rails monolith (250K LOC) serving all e-commerce functions.
**Current Metrics**:
- Deployments: 2 per week (all-or-nothing deploys)
- Deployment time: 45 minutes (build + test + deploy)
- Mean time to recovery (MTTR): 2 hours (requires full rollback)
- P99 API latency: 450ms (slower than target)
- Uptime: 99.8% (below SLA of 99.9%)
- Developer velocity: 3 weeks average for feature to production
**Pain Points**:
- Any code change requires full system deployment (high risk)
- One team's bug can take down entire platform
- Database hot spots limit scaling (users table has 50M rows)
- Hard to onboard new engineers (entire codebase is context)
- A/B testing is difficult (can't target specific services)
**Previous Attempts**: Tried service-oriented architecture in 2021 but stopped due to data consistency issues and operational complexity.
### 1.3 Proposed Approach
**Target Architecture**:
```
[API Gateway / Envoy]
  ├──→ [Auth Service]         ──→ [Auth DB]
  ├──→ [User Service]         ──→ [User DB]
  ├──→ [Order Service]        ──→ [Order DB]
  ├──→ [Inventory Service]    ──→ [Inventory DB]
  ├──→ [Payment Service]      ──→ [Payment DB]
  └──→ [Notification Service] ──→ [Notification Queue]
```
**Service Boundaries** (based on DDD):
1. **Auth Service**: Authentication, authorization, session management
2. **User Service**: User profiles, preferences, account management
3. **Order Service**: Order creation, status, history
4. **Inventory Service**: Product catalog, stock levels, reservations
5. **Payment Service**: Payment processing, refunds, wallet
6. **Notification Service**: Email, SMS, push notifications
**Technology Stack**:
- **Language**: Node.js (team expertise, async I/O for APIs)
- **API Gateway**: Envoy (service mesh, observability)
- **Databases**: PostgreSQL per service (team expertise, ACID guarantees)
- **Messaging**: RabbitMQ (reliable delivery, team familiarity)
- **Observability**: OpenTelemetry → DataDog (centralized logging, metrics, tracing)
**Data Migration Strategy**:
- **Phase 1**: Read from monolith DB, write to both (dual-write pattern)
- **Phase 2**: Read from service DB, write to both (validate consistency)
- **Phase 3**: Cut over fully to service DB, decommission monolith tables
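A minimal sketch of the Phase 1 dual-write path (TypeScript), assuming simple repository interfaces for the monolith and service databases; the interfaces and logging call are placeholders, not the project's actual code.
```typescript
interface UserRecord { id: string; email: string; }

interface UserRepository {
  upsert(user: UserRecord): Promise<void>;
}

// Phase 1: the monolith DB remains the source of truth; the service DB gets a
// shadow write. Failures on the shadow path are logged for the reconciliation
// job rather than failing the request.
async function dualWriteUser(
  monolithDb: UserRepository,
  serviceDb: UserRepository,
  user: UserRecord,
): Promise<void> {
  await monolithDb.upsert(user);     // must succeed
  try {
    await serviceDb.upsert(user);    // best effort during Phase 1
  } catch (err) {
    console.warn("dual-write divergence candidate", { userId: user.id, err });
  }
}
```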
**Deployment Strategy**:
- Canary releases: 1% → 10% → 50% → 100% traffic per service
- Feature flags for gradual rollout within traffic tiers
- Automated rollback on error rate spike (>0.5%)
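A sketch of the automated rollback gate between canary stages (TypeScript); the traffic, metrics, and rollback hooks are placeholders rather than a specific deployment tool's API.
```typescript
const CANARY_STAGES = [0.01, 0.1, 0.5, 1.0];   // 1% → 10% → 50% → 100%
const ERROR_RATE_LIMIT = 0.005;                 // abort if error rate exceeds 0.5%

async function promoteCanary(
  setTrafficShare: (share: number) => Promise<void>,
  observeErrorRate: () => Promise<number>,
  rollback: () => Promise<void>,
): Promise<void> {
  for (const share of CANARY_STAGES) {
    await setTrafficShare(share);
    const errorRate = await observeErrorRate();   // measured over a soak period
    if (errorRate > ERROR_RATE_LIMIT) {
      await rollback();
      throw new Error(`Canary aborted at ${share * 100}% traffic (error rate ${errorRate})`);
    }
  }
}
```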
### 1.4 Scope
**In Scope (18 months)**:
- Extract 6 services from monolith with independent deployability
- Implement API gateway for routing and observability
- Set up per-service CI/CD pipelines
- Migrate data to per-service databases
- Establish SLOs for each service (99.95% uptime, <200ms p99)
- Train engineers on microservices patterns and operational practices
**Out of Scope**:
- Rewrite service internals (lift-and-shift code from monolith)
- Multi-region deployment (deferred to 2026)
- GraphQL federation (REST APIs sufficient for now)
- Service mesh full rollout (Envoy at gateway only, not sidecar)
- Real-time event streaming (async via RabbitMQ sufficient)
**Future Considerations** (post-Q2 2025):
- Replatform services to Go or Rust for performance
- Implement CQRS/Event Sourcing for Order/Inventory
- Multi-region active-active deployment
- GraphQL federation layer for frontend
### 1.5 Requirements
**Functional Requirements**:
- **FR-001**: Each service must be independently deployable without affecting others
- **FR-002**: API Gateway must route requests to appropriate service based on URL path
- **FR-003**: Services must communicate asynchronously for non-blocking operations (e.g., send notification after order)
- **FR-004**: All services must implement health checks (liveness, readiness)
- **FR-005**: Data consistency must be maintained across service boundaries (eventual consistency acceptable for non-critical paths)
**Non-Functional Requirements**:
- **Performance**:
- P50 API latency: <100ms (target: 50ms improvement from baseline)
- P99 API latency: <200ms (target: 250ms improvement)
- API Gateway overhead: <10ms added latency
- **Reliability**:
- Per-service uptime: 99.95% (vs. 99.8% current)
- MTTR: <30 minutes (vs. 2 hours current)
- Zero-downtime deployments (vs. maintenance windows)
- **Scalability**:
- Each service must handle 10K req/s independently
- Database connections pooled (max 100 per service)
- Horizontal scaling via Kubernetes (3-10 replicas per service)
- **Security**:
- Service-to-service auth via mTLS
- API Gateway enforces rate limiting (100 req/min per user)
- Secrets managed via Vault (no hardcoded credentials)
- **Operability**:
- Centralized logging (all services → DataDog)
- Distributed tracing (OpenTelemetry)
- Automated alerts on SLO violations
### 1.6 Dependencies & Assumptions
**Dependencies**:
- Kubernetes cluster provisioned by Infrastructure team (ready by Jan 2024)
- DataDog enterprise license approved (ready by Feb 2024)
- RabbitMQ cluster available (SRE team owns, ready by Mar 2024)
- Engineers complete microservices training (2-week program, Jan 2024)
**Assumptions**:
- No major product launches during migration (to reduce risk overlap)
- Database schema changes can be coordinated across monolith and services
- Existing test coverage is sufficient (80% for critical paths)
- Customer traffic grows <20% during migration period
**Constraints**:
- Budget: $500K (infrastructure + tooling + training)
- Team: 15 engineers (no additional headcount)
- Timeline: 18 months firm (customer commitment for improved reliability)
- Compliance: Must maintain PCI-DSS for payment service
### 1.7 Timeline & Milestones
| Milestone | Date | Deliverable | Success Criteria |
|-----------|------|-------------|------------------|
| **M0: Foundation** | Mar 2024 | API Gateway deployed, observability in place | Gateway routes traffic, 100% tracing coverage |
| **M1: Auth Service** | Jun 2024 | Auth extracted, deployed independently | Zero auth regressions, <200ms p99 latency |
| **M2: User + Order** | Sep 2024 | User and Order services live | Independent deploys, data consistency validated |
| **M3: Inventory** | Dec 2024 | Inventory service live | Stock reservations work, no overselling |
| **M4: Payment** | Mar 2025 | Payment service live (PCI-DSS compliant) | Payment success rate >99.9%, audit passed |
| **M5: Notification** | May 2025 | Notification service live | Email/SMS delivery >99%, queue processed <1min |
| **M6: Monolith Sunset** | Jun 2025 | Monolith decommissioned | All traffic via services, monolith DB read-only |
## 2. Risk Analysis
### 2.1 Premortem Summary
**Exercise Prompt**: "It's June 2025. We launched microservices migration and it failed catastrophically. Engineering morale is low, customers are experiencing outages, and the CTO is considering rolling back to the monolith. What went wrong?"
**Top Failure Modes Identified** (from cross-functional team premortem):
1. **Data Consistency Nightmare** (Engineering): Dual-write bugs caused order/inventory mismatches, overselling inventory, customer complaints.
2. **Distributed System Cascading Failures** (SRE): One slow service (Payment) caused timeouts in all upstream services, bringing down entire platform.
3. **Operational Complexity Overwhelmed Team** (SRE): Too many services to monitor, alert fatigue, SRE team couldn't keep up with incidents.
4. **Performance Degradation** (Engineering): Network hops between services added latency, checkout flow is now slower than monolith.
5. **Data Migration Errors** (Engineering): Production migration script had bugs, lost 50K user records, required emergency restore.
6. **Team Skill Gaps** (Management): Engineers lacked distributed systems expertise, made common mistakes (distributed transactions, thundering herd).
7. **Cost Overruns** (Finance): Per-service databases and infrastructure cost 3× more than budgeted, CFO halted project.
8. **Feature Velocity Dropped** (Product): Cross-service changes require coordinating 3 teams, slowing down product velocity.
9. **Security Vulnerabilities** (Security): Service-to-service auth misconfigured, unauthorized service access, data breach.
10. **Incomplete Migration** (Management): Ran out of time, stuck with half-migrated state (some services live, monolith still critical).
### 2.2 Risk Register
| Risk ID | Risk Description | Category | Likelihood | Impact | Score | Mitigation Strategy | Owner | Status |
|---------|-----------------|----------|------------|--------|-------|---------------------|-------|--------|
| **R1** | Data inconsistency between monolith DB and service DBs during dual-write phase, leading to customer-facing bugs (order not found, wrong inventory) | Technical | High (60%) | High | 9 | **Mitigation**: (1) Implement reconciliation job to detect mismatches, (2) Extensive integration tests for dual-write paths, (3) Shadow mode: write to both but read from monolith initially to validate, (4) Automated rollback if mismatch rate >0.1% | Tech Lead - Data | Open |
| **R2** | Cascading failures: slow/failing service causes timeouts in all dependent services, total outage | Technical | Medium (40%) | High | 6 | **Mitigation**: (1) Circuit breakers on all service clients (fail fast), (2) Bulkhead pattern (isolate thread pools per service), (3) Timeout tuning (aggressive timeouts <1sec), (4) Graceful degradation (fallback to cached data) | SRE Lead | Open |
| **R3** | Operational complexity overwhelms SRE team: too many alerts, incidents, runbooks to manage | Operational | High (70%) | Medium | 6 | **Mitigation**: (1) Standardize observability across services (common dashboards), (2) SLO-based alerting only (eliminate noise), (3) Automated remediation for common issues, (4) Hire 2 additional SREs (approved) | SRE Manager | Open |
| **R4** | Performance degradation: added network latency from service calls makes checkout slower than monolith | Technical | Medium (50%) | Medium | 4 | **Mitigation**: (1) Baseline perf tests on monolith, (2) Set p99 latency budget per service (<50ms), (3) Async where possible (notification, inventory reservation), (4) Load tests at 2× expected traffic | Perf Engineer | Open |
| **R5** | Data migration script errors cause data loss in production | Technical | Low (20%) | High | 3 | **Mitigation**: (1) Dry-run migration on prod snapshot, (2) Automated validation (row counts, checksums), (3) Incremental migration (batch of 10K users at a time), (4) Keep monolith DB as source of truth during migration, (5) Point-in-time recovery tested | DB Admin | Open |
| **R6** | Team lacks microservices expertise, makes preventable mistakes (no circuit breakers, distributed transactions, etc.) | Organizational | Medium (50%) | Medium | 4 | **Mitigation**: (1) Mandatory 2-week microservices training (Jan 2024), (2) Architecture review board for service designs, (3) Pair programming with experienced engineers, (4) Reference implementation as template | Engineering Manager | Open |
| **R7** | Infrastructure costs exceed budget: per-service DBs, K8s overhead, observability tooling cost 3× estimate | Organizational | Medium (40%) | Medium | 4 | **Mitigation**: (1) Detailed cost model before each phase, (2) Right-size instances (start small, scale up), (3) Use managed services (RDS) to reduce ops cost, (4) Monthly cost reviews with Finance | Finance Partner | Open |
| **R8** | Feature velocity drops: cross-service changes require coordination, slowing product development | Organizational | High (60%) | Low | 3 | **Mitigation**: (1) Design services with clear boundaries (minimize cross-service changes), (2) Establish API contracts early, (3) Service teams own full stack (no handoffs), (4) Track deployment frequency as leading indicator | Product Manager | Accepted |
| **R9** | Service-to-service auth misconfigured, allowing unauthorized access or data leaks | External | Low (15%) | High | 3 | **Mitigation**: (1) mTLS enforced at gateway and between services, (2) Security audit before each service goes live, (3) Penetration test after full migration, (4) Principle of least privilege (services can only access what they need) | Security Lead | Open |
| **R10** | Migration incomplete by deadline: stuck in half-migrated state, technical debt accumulates | Organizational | Medium (40%) | High | 6 | **Mitigation**: (1) Phased rollout with hard cutover dates, (2) Executive sponsorship to protect time, (3) No feature work during final 2 months (migration focus), (4) Rollback plan if timeline slips | Program Manager | Open |
### 2.3 Risk Mitigation Timeline
**Pre-Launch (Jan-Feb 2024)**:
- R6: Complete microservices training for all engineers
- R7: Finalize cost model and get CFO sign-off
- R3: Hire additional SREs, onboard them
**Phase 1 (Mar-Jun 2024 - Auth Service)**:
- R1: Validate dual-write reconciliation with Auth DB
- R2: Implement circuit breakers in Auth service client
- R4: Baseline latency tests, set p99 budgets
- R5: Dry-run migration on prod snapshot
**Phase 2 (Jul-Dec 2024 - User, Order, Inventory)**:
- R1: Reconciliation jobs running for all 3 services
- R2: Bulkhead pattern implemented across services
- R4: Load tests at 2× traffic before each service launch
**Phase 3 (Jan-Jun 2025 - Payment, Notification, Sunset)**:
- R9: Security audit and pen test before Payment goes live
- R10: Hard cutover date for monolith sunset, no slippage
**Continuous (Throughout)**:
- R3: Monthly SRE team health check (alert fatigue, runbook gaps)
- R7: Monthly cost reviews vs. budget
- R8: Track feature velocity every sprint
## 3. Success Metrics
### 3.1 Leading Indicators (Early Signals)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| **Deployment Frequency** (per service) | 2/week (monolith) | 10+/week (per service) | Git tag count in CI/CD | Weekly | Tech Lead |
| **Build + Test Time** | 25 min (monolith) | <10 min (per service) | CI/CD pipeline duration | Per build | DevOps |
| **Code Review Cycle Time** | 2 days | <1 day | GitHub PR metrics | Weekly | Engineering Manager |
| **Test Coverage** (per service) | 80% (monolith) | >85% (per service) | Jest coverage report | Per commit | QA Lead |
| **Incident Detection Time** (MTTD) | 15 min | <5 min | DataDog alert → Slack | Per incident | SRE |
**Rationale**: These predict future success. If deployment frequency increases early, it validates independent deployability. If incident detection improves, observability is working.
### 3.2 Lagging Indicators (Outcome Measures)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| **System Uptime** (SLO) | 99.8% | 99.95% | DataDog uptime monitor | Daily | SRE Lead |
| **API p99 Latency** | 450ms | <200ms | OpenTelemetry traces | Real-time dashboard | Perf Engineer |
| **Mean Time to Recovery** (MTTR) | 2 hours | <30 min | Incident timeline analysis | Per incident | SRE Lead |
| **Customer-Impacting Incidents** | 8/month | <3/month | PagerDuty severity 1 & 2 | Monthly | Engineering Manager |
| **Feature Velocity** (stories/sprint) | 12 stories | >15 stories | Jira velocity report | Per sprint | Product Manager |
| **Infrastructure Cost** | $50K/month | <$70K/month | AWS billing dashboard | Monthly | Finance |
**Rationale**: These measure actual outcomes. Uptime and MTTR validate reliability improvements. Latency and velocity validate performance and productivity gains.
### 3.3 Counter-Metrics (What We Won't Sacrifice)
| Metric | Threshold | Monitoring Method | Escalation Trigger |
|--------|-----------|-------------------|-------------------|
| **Code Quality** (bug rate) | <5 bugs/sprint | Jira bug tickets | >10 bugs/sprint → halt features, fix bugs |
| **Team Morale** (happiness score) | >7/10 | Quarterly eng survey | <6/10 → leadership 1:1s, workload adjustment |
| **Security Posture** (critical vulns) | 0 critical | Snyk security scans | Any critical vuln → immediate fix before ship |
| **Data Integrity** (order accuracy) | >99.99% | Reconciliation jobs | <99.9% → halt migrations, investigate |
| **Customer Satisfaction** (NPS) | >40 | Quarterly NPS survey | <30 → customer interviews, rollback if needed |
**Rationale**: Prevent gaming the system. Don't ship faster at the expense of quality. Don't improve latency by cutting security. Don't optimize costs by burning out the team.
### 3.4 Success Criteria Summary
**Must-Haves** (hard requirements):
- All 6 services independently deployable by Jun 2025
- 99.95% uptime maintained throughout migration
- Zero data loss during migrations
- PCI-DSS compliance for Payment service
- Infrastructure cost <$70K/month
**Should-Haves** (target achievements):
- P99 latency <200ms (250ms improvement from baseline)
- MTTR <30 min (90-minute improvement)
- Deployment frequency >10/week per service (5× improvement)
- Feature velocity >15 stories/sprint (25% improvement)
- Customer-impacting incidents <3/month (63% reduction)
**Nice-to-Haves** (stretch goals):
- Multi-region deployment capability (deferred to 2026)
- GraphQL federation layer (deferred)
- Event streaming for real-time analytics (deferred)
- Team self-rates microservices maturity >8/10 (vs. 4/10 today)
## 4. Self-Assessment
Evaluated using `rubric_chain_spec_risk_metrics.json`:
### Specification Quality
- **Clarity** (4.5/5): Goal, scope, timeline, approach clearly stated with diagrams
- **Completeness** (4.0/5): All sections covered; could add more detail on monolith sunset plan
- **Feasibility** (4.0/5): 18-month timeline is aggressive but achievable with current team; phased approach mitigates risk
### Risk Analysis Quality
- **Comprehensiveness** (4.5/5): 10 risks across technical, operational, organizational; premortem surfaced non-obvious risks (team skill gaps, cost overruns)
- **Quantification** (3.5/5): Likelihood/impact scored; could add $ impact estimates for high-risk items
- **Mitigation Depth** (4.5/5): Each high-risk item has detailed mitigation with specific actions (circuit breakers, reconciliation jobs, etc.)
### Metrics Quality
- **Measurability** (5.0/5): All metrics have clear baseline, target, measurement method, cadence
- **Leading/Lagging Balance** (4.5/5): Good mix of early signals (deployment frequency, MTTD) and outcomes (uptime, MTTR)
- **Counter-Metrics** (4.0/5): Explicit guardrails (quality, morale, security); prevents optimization myopia
### Integration
- **Spec→Risk Mapping** (4.5/5): Major spec decisions (dual-write, service boundaries) have corresponding risks (R1, R8)
- **Risk→Metrics Mapping** (4.5/5): High-risk items tracked by metrics (R1 → data integrity counter-metric, R2 → MTTR)
- **Coherence** (4.5/5): All three components tell consistent story of complex migration with proactive risk management
### Actionability
- **Stakeholder Clarity** (4.5/5): Clear owners for each risk, metric; stakeholders can act on this document
- **Timeline Realism** (4.0/5): 18-month timeline is aggressive; includes buffer (80% work, 20% buffer in each phase)
**Overall Average**: 4.3/5 ✓ (Exceeds 3.5 minimum standard)
**Strengths**:
- Comprehensive risk analysis with specific, actionable mitigations
- Clear metrics with baselines, targets, and measurement methods
- Phased rollout reduces big-bang risk
**Areas for Improvement**:
- Add quantitative cost impact for high-risk items (e.g., R1 data inconsistency could cost $X in customer refunds)
- More detail on monolith sunset plan (how to decommission safely)
- Consider adding "reverse premortem" (what would make this wildly successful?) to identify opportunities
**Recommendation**: Proceed with migration. Risk mitigation strategies are sound. Metrics will provide early warning if initiative is off track. Schedule quarterly reviews to update risks/metrics as we learn.


@@ -0,0 +1,393 @@
# Chain Spec Risk Metrics Methodology
Advanced techniques for complex multi-phase programs, novel risks, and sophisticated metric frameworks.
## Workflow
Copy this checklist and track your progress:
```
Chain Spec Risk Metrics Progress:
- [ ] Step 1: Gather initiative context
- [ ] Step 2: Write comprehensive specification
- [ ] Step 3: Conduct premortem and build risk register
- [ ] Step 4: Define success metrics and instrumentation
- [ ] Step 5: Validate completeness and deliver
```
**Step 1: Gather initiative context** - Collect goal, constraints, stakeholders, baseline. Use template questions plus assess complexity using [5. Complexity Decision Matrix](#5-complexity-decision-matrix).
**Step 2: Write comprehensive specification** - For multi-phase programs use [1. Advanced Specification Techniques](#1-advanced-specification-techniques) including phased rollout strategies and requirements traceability matrix.
**Step 3: Conduct premortem and build risk register** - Apply [2. Advanced Risk Assessment](#2-advanced-risk-assessment) methods including quantitative analysis, fault tree analysis, and premortem variations for comprehensive risk identification.
**Step 4: Define success metrics** - Use [3. Advanced Metrics Frameworks](#3-advanced-metrics-frameworks) such as HEART, AARRR, North Star, or SLI/SLO depending on initiative type.
**Step 5: Validate and deliver** - Ensure integration using [4. Integration Best Practices](#4-integration-best-practices) and check for [6. Anti-Patterns](#6-anti-patterns) before delivering.
## 1. Advanced Specification Techniques
### Multi-Phase Program Decomposition
**When**: Initiative is too large to execute in one phase (>6 months, >10 people, multiple systems).
**Approach**: Break into phases with clear value delivery at each stage.
**Phase Structure**:
- **Phase 0 (Discovery)**: Research, prototyping, validate assumptions
- Deliverable: Technical spike, proof of concept, feasibility report
- Metrics: Prototype success rate, assumption validation rate
- **Phase 1 (Foundation)**: Core infrastructure, no user-facing features yet
- Deliverable: Platform/APIs deployed, instrumentation in place
- Metrics: System stability, API latency, error rates
- **Phase 2 (Alpha)**: Limited rollout to internal users or pilot customers
- Deliverable: Feature complete for target use case, internal feedback
- Metrics: User activation, time-to-value, critical bugs
- **Phase 3 (Beta)**: Broader rollout, feature complete, gather feedback
- Deliverable: Production-ready with known limitations
- Metrics: Adoption rate, support tickets, performance under load
- **Phase 4 (GA)**: General availability, full feature set, scaled operations
- Deliverable: Fully launched, documented, supported
- Metrics: Market penetration, revenue, NPS, operational costs
**Phase Dependencies**:
- Document what each phase depends on (previous phase completion, external dependencies)
- Define phase gates (criteria to proceed to next phase)
- Include rollback plans if a phase fails
### Phased Rollout Strategies
**Canary Deployment**:
- Roll out to 1% → 10% → 50% → 100% of traffic/users
- Monitor metrics at each stage before expanding
- Automatic rollback if error rates spike
**Ring Deployment**:
- Ring 0: Internal employees (catch obvious bugs)
- Ring 1: Pilot customers (friendly users, willing to provide feedback)
- Ring 2: Early adopters (power users, high tolerance)
- Ring 3: General availability (all users)
**Feature Flags**:
- Deploy code but keep feature disabled
- Gradually enable for user segments
- A/B test impact before full rollout
**Geographic Rollout**:
- Region 1 (smallest/lowest risk) → Region 2 → Region 3 → Global
- Allows testing localization, compliance, performance in stages
### Requirements Traceability Matrix
For complex initiatives, map requirements → design → implementation → tests → risks → metrics.
| Requirement ID | Requirement | Design Doc | Implementation | Test Cases | Related Risks | Success Metric |
|----------------|-------------|------------|----------------|------------|---------------|----------------|
| REQ-001 | Auth via OAuth 2.0 | Design-Auth.md | auth-service/ | test/auth/ | R3, R7 | Auth success rate > 99.9% |
| REQ-002 | API p99 < 200ms | Design-Perf.md | gateway/ | test/perf/ | R1, R5 | p99 latency metric |
**Benefits**:
- Ensures nothing is forgotten (every requirement has tests, metrics, risk mitigation)
- Helps with impact analysis (if a risk materializes, which requirements are affected?)
- Useful for audit/compliance (trace from business need → implementation → validation)
## 2. Advanced Risk Assessment
### Quantitative Risk Analysis
**When**: High-stakes initiatives where "Low/Medium/High" risk scoring is insufficient.
**Approach**: Assign probabilities (%) and impact ($$) to risks, calculate expected loss.
**Example**:
- **Risk**: Database migration fails, requiring full rollback
- **Probability**: 15% (based on similar past migrations)
- **Impact**: $500K (≈30 engineer-weeks of rollback and rework effort plus estimated customer churn)
- **Expected Loss**: 15% × $500K = $75K
- **Mitigation Cost**: $30K (comprehensive testing, dry-run on prod snapshot)
- **Decision**: Invest $30K to mitigate (expected savings $45K)
**Three-Point Estimation** for uncertainty:
- **Optimistic**: Best-case scenario (10th percentile)
- **Most Likely**: Expected case (50th percentile)
- **Pessimistic**: Worst-case scenario (90th percentile)
- **Expected Value**: (Optimistic + 4×Most Likely + Pessimistic) / 6
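The two calculations above as runnable arithmetic (TypeScript); the expected-loss figures come from the example, while the 30/45/90-day PERT inputs are illustrative.
```typescript
// Expected loss = probability × impact; compare against the mitigation cost.
const expectedLoss = (probability: number, impact: number): number =>
  probability * impact;

// Three-point (PERT) estimate: (optimistic + 4 × most likely + pessimistic) / 6.
const pertEstimate = (optimistic: number, mostLikely: number, pessimistic: number): number =>
  (optimistic + 4 * mostLikely + pessimistic) / 6;

console.log(expectedLoss(0.15, 500_000));  // 75000 → a $30K mitigation is worth it
console.log(pertEstimate(30, 45, 90));     // 50 days for a hypothetical 30/45/90-day estimate
```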
### Fault Tree Analysis
**When**: Analyzing how multiple small failures combine to cause catastrophic outcome.
**Approach**: Work backward from catastrophic failure to root causes using logic gates.
**Example**: "Customer data breach"
```
[Customer Data Breach]
└─ OR (any branch causes the breach)
   ├─ [Auth bypass]
   │  └─ AND (both required)
   │     ├─ [SQL injection]
   │     └─ [No input validation]
   └─ [DB exposed to internet]
```
**Insights**:
- Identify single points of failure (only one thing has to go wrong)
- Reveal defense-in-depth opportunities (add redundant protections)
- Prioritize mitigations (block root causes that appear in multiple paths)
### Bow-Tie Risk Diagrams
**When**: Complex risks with multiple causes and multiple consequences.
**Structure**:
```
[Causes] → [Preventive Controls] → [RISK EVENT] → [Mitigative Controls] → [Consequences]
```
**Example**: Risk = "Production database unavailable"
**Left side (Causes + Prevention)**:
- Cause 1: Hardware failure → Prevention: Redundant instances, health checks
- Cause 2: Human error (bad migration) → Prevention: Dry-run on snapshot, peer review
- Cause 3: DDoS attack → Prevention: Rate limiting, WAF
**Right side (Mitigation + Consequences)**:
- Consequence 1: Users can't access app → Mitigation: Read replica for degraded mode
- Consequence 2: Revenue loss → Mitigation: SLA credits, customer communication plan
- Consequence 3: Reputational damage → Mitigation: PR plan, status page transparency
**Benefits**: Shows full lifecycle of risk (how to prevent, how to respond if it happens anyway).
### Premortem Variations
**Reverse Premortem** ("It succeeded wildly - how?"):
- Identifies what must go RIGHT for success
- Reveals critical success factors often overlooked
- Example: "We succeeded because we secured executive sponsorship early and maintained it throughout"
**Pre-Parade** (best-case scenario):
- What would make this initiative exceed expectations?
- Identifies opportunities to amplify impact
- Example: "If we also integrate with Salesforce, we could unlock enterprise market"
**Lateral Premortem** (stakeholder-specific):
- Run separate premortems for each stakeholder group
- Engineering: "It failed because of technical reasons..."
- Product: "It failed because users didn't adopt it..."
- Operations: "It failed because we couldn't support it at scale..."
## 3. Advanced Metrics Frameworks
### HEART Framework (Google)
For user-facing products, track:
- **Happiness**: User satisfaction (NPS, CSAT surveys)
- **Engagement**: Level of user involvement (DAU/MAU, sessions/user)
- **Adoption**: New users accepting product (% of target users activated)
- **Retention**: Rate users come back (7-day/30-day retention)
- **Task Success**: Efficiency completing tasks (completion rate, time on task, error rate)
**Application**:
- Leading: Adoption rate, task success rate
- Lagging: Retention, engagement over time
- Counter-metric: Happiness (don't sacrifice UX for engagement)
### AARRR Pirate Metrics
For growth-focused initiatives:
- **Acquisition**: Users discover product (traffic sources, signup rate)
- **Activation**: Users have good first experience (onboarding completion, time-to-aha)
- **Retention**: Users return (D1/D7/D30 retention)
- **Revenue**: Users pay (conversion rate, ARPU, LTV)
- **Referral**: Users bring others (viral coefficient, NPS)
**Application**:
- Identify bottleneck stage (where most users drop off)
- Focus initiative on improving that stage
- Track funnel conversion at each stage
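A small sketch of the bottleneck check (TypeScript); the stage counts are made up for illustration.
```typescript
const funnel = [
  { stage: "Acquisition", users: 10_000 },
  { stage: "Activation", users: 4_000 },
  { stage: "Retention", users: 1_800 },
  { stage: "Revenue", users: 600 },
  { stage: "Referral", users: 150 },
];

// Conversion from each stage to the next; the lowest rate is the bottleneck.
const conversions = funnel.slice(1).map((step, i) => ({
  from: funnel[i].stage,
  to: step.stage,
  rate: step.users / funnel[i].users,
}));

const bottleneck = conversions.reduce((worst, c) => (c.rate < worst.rate ? c : worst));
console.log(bottleneck);   // { from: "Revenue", to: "Referral", rate: 0.25 }
```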
### North Star + Supporting Metrics
**North Star Metric**: Single metric that best captures value delivered to customers.
- Examples: Weekly active users (social app), time-to-insight (analytics), API calls/week (platform)
**Supporting Metrics**: Inputs that drive the North Star.
- Example NSM: Weekly Active Users
- Supporting: New user signups, activation rate, feature engagement, retention rate
**Application**:
- All initiatives tie back to moving North Star or supporting metrics
- Prevents metric myopia (optimizing one metric at expense of others)
- Aligns team around common goal
### SLI/SLO/SLA Framework
For reliability-focused initiatives:
- **SLI (Service Level Indicator)**: What you measure (latency, error rate, throughput)
- **SLO (Service Level Objective)**: Target for SLI (p99 latency < 200ms, error rate < 0.1%)
- **SLA (Service Level Agreement)**: Consequence if SLO not met (customer credits, escalation)
**Error Budget**:
- If SLO is 99.9% uptime, error budget is 0.1% downtime (43 minutes/month)
- Can "spend" error budget on risky deployments
- If budget exhausted, halt feature work and focus on reliability
**Application**:
- Define SLIs/SLOs for each major component
- Track burn rate (how fast you're consuming error budget)
- Use as gate for deployment (don't deploy if error budget exhausted)
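The error-budget arithmetic above as a small sketch (TypeScript), assuming a 30-day window; a burn rate above 1 means the budget will run out before the window ends.
```typescript
const MINUTES_PER_WINDOW = 30 * 24 * 60;   // 30-day window ≈ 43,200 minutes

// A 99.9% SLO leaves 0.1% of the window as error budget (≈ 43 minutes/month).
function errorBudgetMinutes(sloTarget: number): number {
  return (1 - sloTarget) * MINUTES_PER_WINDOW;
}

// Burn rate: fraction of budget consumed relative to fraction of window elapsed.
function burnRate(downtimeMinutes: number, elapsedMinutes: number, sloTarget: number): number {
  return (downtimeMinutes / errorBudgetMinutes(sloTarget)) / (elapsedMinutes / MINUTES_PER_WINDOW);
}

console.log(errorBudgetMinutes(0.999));           // ≈ 43.2
console.log(burnRate(20, 10 * 24 * 60, 0.999));   // ≈ 1.39 → on pace to exhaust the budget early
```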
### Metrics Decomposition Trees
**When**: Complex metric needs to be broken down into actionable components.
**Example**: Increase revenue
```
Revenue
├─ New Customer Revenue
│ ├─ New Customers (leads × conversion rate)
│ └─ Average Deal Size (features × pricing tier)
└─ Existing Customer Revenue
├─ Expansion (upsell rate × expansion ARR)
└─ Retention (renewal rate × existing ARR)
```
**Application**:
- Identify which leaf nodes to focus on
- Each leaf becomes a metric to track
- Reveals non-obvious leverage points (e.g., increasing renewal rate might have bigger impact than new customer acquisition)
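One way to operationalize a decomposition tree is to roll leaf values up programmatically. A sketch with invented ARR figures follows; leaf formulas such as leads × conversion rate are collapsed into single numbers for brevity, and all names are hypothetical:

```python
# A sketch of the revenue tree above as nested nodes, assuming leaf values are
# supplied by your analytics pipeline; all figures are made up for illustration.
from dataclasses import dataclass, field

@dataclass
class MetricNode:
    name: str
    value: float | None = None          # set on leaves only
    children: list["MetricNode"] = field(default_factory=list)

    def total(self) -> float:
        if not self.children:
            return self.value or 0.0
        return sum(child.total() for child in self.children)

revenue = MetricNode("revenue", children=[
    MetricNode("new_customer_revenue", children=[
        MetricNode("new_customers_arr", 1_200_000.0),
    ]),
    MetricNode("existing_customer_revenue", children=[
        MetricNode("expansion_arr", 400_000.0),
        MetricNode("retained_arr", 2_600_000.0),
    ]),
])
print(f"total revenue: ${revenue.total():,.0f}")  # $4,200,000
```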
## 4. Integration Best Practices
### Specs → Risks Mapping
**Principle**: Every major specification decision should have corresponding risks identified.
**Example**:
- **Spec**: "Use MongoDB for user data storage"
- **Risks**:
- R1: MongoDB performance degrades above 10M documents (mitigation: sharding strategy)
- R2: Team lacks MongoDB expertise (mitigation: training, hire consultant)
- R3: Data model changes require migration (mitigation: schema versioning)
**Implementation**:
- Review each spec section and ask "What could go wrong with this choice?"
- Ensure alternative approaches are considered with their risks
- Document why chosen approach is best despite risks
### Risks → Metrics Mapping
**Principle**: Each high-impact risk should have a metric that detects if it's materializing.
**Example**:
- **Risk**: "Database performance degrades under load"
- **Metrics**:
- Leading: Query time p95 (early warning before user impact)
- Lagging: User-reported latency complaints
- Counter-metric: Don't over-optimize for read speed at the expense of write consistency
**Implementation**:
- For each risk score 6-9, define early warning metric
- Set up alerts/dashboards to monitor these metrics
- Define escalation thresholds (when to invoke mitigation plan)
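As a sketch of such an escalation threshold, assuming the database-performance risk above and made-up latency cut-offs (the metric source and escalation hook are left out):

```python
# A hedged sketch of an early-warning check for the database-performance risk;
# thresholds and escalation levels are assumptions to adapt to your context.
WARN_P95_MS = 250      # start investigating
PAGE_P95_MS = 400      # invoke the mitigation plan / escalate

def check_query_latency(p95_ms: float) -> str:
    """Map an observed query-time p95 to an escalation level."""
    if p95_ms >= PAGE_P95_MS:
        return "page"   # risk is materializing: execute the mitigation plan
    if p95_ms >= WARN_P95_MS:
        return "warn"   # leading indicator drifting: review with the risk owner
    return "ok"

print(check_query_latency(180))  # ok
print(check_query_latency(310))  # warn
```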
### Metrics → Specs Mapping
**Principle**: Specifications should include instrumentation to enable metric collection.
**Example**:
- **Metric**: "API p99 latency < 200ms"
- **Spec Requirements**:
- Distributed tracing (OpenTelemetry) for end-to-end latency
- Per-endpoint latency bucketing (identify slow endpoints)
- Client-side RUM (real user monitoring) for user-perceived latency
**Implementation**:
- Review metrics and ask "What instrumentation is needed?"
- Add observability requirements to spec (logging, metrics, tracing)
- Include instrumentation in acceptance criteria
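A minimal tracing sketch for the latency example, assuming the `opentelemetry-api` package is installed and an SDK/exporter is configured elsewhere; the service name, endpoint, and handler are hypothetical:

```python
# Minimal OpenTelemetry tracing sketch: each request becomes a span, so
# per-endpoint latency can be bucketed downstream and compared against
# the p99 < 200ms objective. SDK/exporter setup is omitted.
from opentelemetry import trace

tracer = trace.get_tracer("order-service")

def handle_get_order(order_id: str) -> dict:
    with tracer.start_as_current_span("GET /orders/{id}") as span:
        span.set_attribute("http.route", "/orders/{id}")
        span.set_attribute("order.id", order_id)
        return {"id": order_id, "status": "shipped"}  # placeholder work
```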
### Continuous Validation Loop
**Pattern**: Spec → Implement → Measure → Validate Risks → Update Spec
**Steps**:
1. **Initial Spec**: Document approach, risks, metrics
2. **Phase 1 Implementation**: Build and deploy
3. **Measure Reality**: Collect actual metrics vs. targets
4. **Validate Risk Mitigations**: Did mitigations work? New risks emerged?
5. **Update Spec**: Revise for next phase based on learnings
**Example**:
- **Phase 1 Spec**: Expected 10K req/s with a single instance
- **Reality**: Actual 3K req/s (bottleneck in DB queries)
- **Risk Update**: Add R8: Query optimization needed for target load
- **Metric Update**: Add query execution time to dashboards
- **Phase 2 Spec**: Refactor queries, add read replicas, target 12K req/s
## 5. Complexity Decision Matrix
Use this matrix to decide when to use this skill vs. simpler approaches:
| Initiative Characteristics | Recommended Approach |
|---------------------------|----------------------|
| **Low Stakes** (< 1 eng-month, reversible, no user impact) | Lightweight spec, simple checklist, skip formal risk register |
| **Medium Stakes** (1-3 months, some teams, moderate impact) | Use template.md: Full spec + premortem + 3-5 metrics |
| **High Stakes** (3-6 months, cross-team, high impact) | Use template.md + methodology: Quantitative risk analysis, comprehensive metrics |
| **Strategic** (6+ months, company-level, existential) | Use methodology + external review: Fault tree, SLI/SLOs, continuous validation |
**Heuristics**:
- If failure would cost >$100K or 6+ eng-months, use full methodology
- If initiative affects >1000 users or >3 teams, use at least template
- If uncertainty is high, invest more in risk analysis and phased rollout
- If metrics are complex or novel, use advanced frameworks (HEART, North Star)
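These heuristics can be encoded as a rough helper; the thresholds below mirror the matrix but remain judgment calls, not hard rules, and the function name is illustrative:

```python
# Rough encoding of the decision matrix above; thresholds are assumptions.
def recommended_approach(eng_months: float, est_cost_usd: float,
                         teams: int, users_affected: int) -> str:
    if eng_months >= 6 or est_cost_usd >= 100_000:
        return "full methodology + external review"
    if eng_months >= 3 or teams > 3 or users_affected > 1_000:
        return "template + methodology (quantitative risks, comprehensive metrics)"
    if eng_months >= 1:
        return "template.md: spec + premortem + 3-5 metrics"
    return "lightweight spec + checklist"

print(recommended_approach(eng_months=4, est_cost_usd=80_000, teams=2, users_affected=5_000))
```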
## 6. Anti-Patterns
**Spec Only (No Risks/Metrics)**:
- Symptom: Detailed spec but no discussion of what could go wrong or how to measure success
- Fix: Run a quick premortem (15 min) and define at least 3 must-track metrics
**Risk Theater (Long List, No Mitigations)**:
- Symptom: 50+ risks identified but no prioritization or mitigation plans
- Fix: Score risks, focus on top 10, assign owners and mitigations
**Vanity Metrics (Measure Activity, Not Outcomes)**:
- Symptom: Tracking "features shipped" instead of "user value delivered"
- Fix: For each metric ask "If this goes up, are users/business better off?"
**Plan and Forget (No Updates)**:
- Symptom: Beautiful spec/risk/metrics doc created then never referenced
- Fix: Schedule monthly reviews, update risks/metrics, track in team rituals
**Premature Precision (Overconfident Estimates)**:
- Symptom: "Migration will take exactly 47 days and cost $487K"
- Fix: Use ranges (30-60 days, $400-600K), state confidence levels, build in buffer
**Analysis Paralysis (Perfect Planning)**:
- Symptom: Spent 3 months planning, haven't started building
- Fix: Time-box planning (1-2 weeks for most initiatives), embrace uncertainty, learn by doing
## 7. Templates for Common Initiative Types
### Migration (System/Platform/Data)
- **Spec Focus**: Current vs. target architecture, migration path, rollback plan, validation
- **Risk Focus**: Data loss, downtime, performance regression, failed migration
- **Metrics Focus**: Migration progress %, data integrity, system performance, rollback capability
### Launch (Product/Feature)
- **Spec Focus**: User stories, UX flows, technical design, launch checklist
- **Risk Focus**: Low adoption, technical bugs, scalability issues, competitive response
- **Metrics Focus**: Activation, engagement, retention, revenue impact, support tickets
### Infrastructure Change
- **Spec Focus**: Architecture diagram, capacity planning, runbooks, disaster recovery
- **Risk Focus**: Outages, cost overruns, security vulnerabilities, operational complexity
- **Metrics Focus**: Uptime, latency, costs, incident count, MTTR
### Process Change (Organizational)
- **Spec Focus**: Current vs. new process, roles/responsibilities, training plan, timeline
- **Risk Focus**: Adoption resistance, productivity drop, key person dependency, communication gaps
- **Metrics Focus**: Process compliance, cycle time, employee satisfaction, quality metrics
For a complete worked example, see [examples/microservices-migration.md](../examples/microservices-migration.md).


@@ -0,0 +1,369 @@
# Chain Spec Risk Metrics Template
## Workflow
Copy this checklist and track your progress:
```
Chain Spec Risk Metrics Progress:
- [ ] Step 1: Gather initiative context
- [ ] Step 2: Write comprehensive specification
- [ ] Step 3: Conduct premortem and build risk register
- [ ] Step 4: Define success metrics and instrumentation
- [ ] Step 5: Validate completeness and deliver
```
**Step 1: Gather initiative context** - Collect goal, constraints, stakeholders, baseline, desired outcomes. See [Context Questions](#context-questions).
**Step 2: Write comprehensive specification** - Document scope, approach, requirements, dependencies, timeline. See [Quick Template](#quick-template) and [Specification Guidance](#specification-guidance).
**Step 3: Conduct premortem and build risk register** - Run failure imagination exercise, categorize risks, assign mitigations/owners. See [Premortem Process](#premortem-process) and [Risk Register Template](#risk-register-template).
**Step 4: Define success metrics** - Identify leading/lagging indicators, baselines, targets, counter-metrics. See [Metrics Template](#metrics-template) and [Instrumentation Guidance](#instrumentation-guidance).
**Step 5: Validate and deliver** - Self-check with rubric (≥3.5 average), ensure all components align. See [Quality Checklist](#quality-checklist).
## Quick Template
```markdown
# [Initiative Name]: Spec, Risks, Metrics
## 1. Specification
### 1.1 Executive Summary
- **Goal**: [What are you building/changing and why?]
- **Timeline**: [When does this need to be delivered?]
- **Stakeholders**: [Who cares about this initiative?]
- **Success Criteria**: [What does done look like?]
### 1.2 Current State (Baseline)
- [Describe the current state/system]
- [Key metrics/data points about current state]
- [Pain points or limitations driving this initiative]
### 1.3 Proposed Approach
- **Architecture/Design**: [High-level approach]
- **Implementation Plan**: [Phases, milestones, dependencies]
- **Technology Choices**: [Tools, frameworks, platforms with rationale]
### 1.4 Scope
- **In Scope**: [What this initiative includes]
- **Out of Scope**: [What is explicitly excluded]
- **Future Considerations**: [What might come later]
### 1.5 Requirements
**Functional Requirements:**
- [Feature/capability 1]
- [Feature/capability 2]
- [Feature/capability 3]
**Non-Functional Requirements:**
- **Performance**: [Latency, throughput, scalability targets]
- **Reliability**: [Uptime, error rates, recovery time]
- **Security**: [Authentication, authorization, data protection]
- **Compliance**: [Regulatory, policy, audit requirements]
### 1.6 Dependencies & Assumptions
- **Dependencies**: [External teams, systems, resources needed]
- **Assumptions**: [What we're assuming is true]
- **Constraints**: [Budget, time, resource, technical limitations]
### 1.7 Timeline & Milestones
| Milestone | Date | Deliverable |
|-----------|------|-------------|
| [Phase 1] | [Date] | [What's delivered] |
| [Phase 2] | [Date] | [What's delivered] |
| [Phase 3] | [Date] | [What's delivered] |
## 2. Risk Analysis
### 2.1 Premortem Summary
**Exercise Prompt**: "It's [X months] from now. This initiative failed catastrophically. What went wrong?"
**Top failure modes identified:**
1. [Failure mode 1]
2. [Failure mode 2]
3. [Failure mode 3]
4. [Failure mode 4]
5. [Failure mode 5]
### 2.2 Risk Register
| Risk ID | Risk Description | Category | Likelihood | Impact | Risk Score | Mitigation Strategy | Owner | Status |
|---------|-----------------|----------|------------|--------|-----------|---------------------|-------|--------|
| R1 | [Specific failure scenario] | Technical | High | High | 9 | [How you'll prevent/reduce] | [Name] | Open |
| R2 | [Specific failure scenario] | Operational | Medium | High | 6 | [How you'll prevent/reduce] | [Name] | Open |
| R3 | [Specific failure scenario] | Organizational | Medium | Medium | 4 | [How you'll prevent/reduce] | [Name] | Open |
| R4 | [Specific failure scenario] | External | Low | High | 3 | [How you'll prevent/reduce] | [Name] | Open |
**Risk Scoring**: Low=1, Medium=2, High=3. Risk Score = Likelihood × Impact.
**Risk Categories**:
- **Technical**: Architecture, code quality, infrastructure, performance
- **Operational**: Processes, runbooks, support, operations
- **Organizational**: Resources, skills, alignment, communication
- **External**: Market, vendors, regulation, dependencies
### 2.3 Risk Mitigation Timeline
- **Pre-Launch**: [Risks to mitigate before going live]
- **Launch Window**: [Monitoring and safeguards during launch]
- **Post-Launch**: [Ongoing monitoring and response]
## 3. Success Metrics
### 3.1 Leading Indicators (Early Signals)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Predictive metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Predictive metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
**Examples**: Deployment frequency, code review cycle time, test coverage, incident detection time
### 3.2 Lagging Indicators (Outcome Measures)
| Metric | Baseline | Target | Measurement Method | Tracking Cadence | Owner |
|--------|----------|--------|-------------------|------------------|-------|
| [Outcome metric 1] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 2] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
| [Outcome metric 3] | [Current value] | [Goal value] | [How measured] | [How often] | [Name] |
**Examples**: Uptime, user adoption rate, revenue impact, customer satisfaction score
### 3.3 Counter-Metrics (What We Won't Sacrifice)
| Metric | Threshold | Monitoring Method | Escalation Trigger |
|--------|-----------|-------------------|-------------------|
| [Protection metric 1] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 2] | [Minimum acceptable] | [How monitored] | [When to escalate] |
| [Protection metric 3] | [Minimum acceptable] | [How monitored] | [When to escalate] |
**Examples**: Code quality (test coverage > 80%), team morale (happiness score > 7/10), security posture (no critical vulnerabilities), user privacy (zero data leaks)
### 3.4 Success Criteria Summary
**Must-haves (hard requirements)**:
- [Critical success criterion 1]
- [Critical success criterion 2]
- [Critical success criterion 3]
**Should-haves (target achievements)**:
- [Desired outcome 1]
- [Desired outcome 2]
- [Desired outcome 3]
**Nice-to-haves (stretch goals)**:
- [Aspirational outcome 1]
- [Aspirational outcome 2]
```
## Context Questions
**Basics**: What are you building/changing? Why now? Who are the stakeholders?
**Constraints**: When does this need to be delivered? What's the budget? What resources are available?
**Current State**: What exists today? What's the baseline? What are the pain points?
**Desired Outcomes**: What does success look like? What metrics would you track? What are you most worried about?
**Scope**: Greenfield/migration/enhancement? Single or multi-phase? Who is the primary user?
## Specification Guidance
### Scope Definition
**In Scope** should be specific and concrete:
- ✅ "Migrate user authentication service to OAuth 2.0 with PKCE"
- ❌ "Improve authentication"
**Out of Scope** prevents scope creep:
- List explicitly what won't be included in this phase
- Reference future work that might address excluded items
- Example: "Out of Scope: Social login (Google/GitHub) - deferred to Phase 2"
### Requirements Best Practices
**Functional Requirements**: What the system must do
- Use "must" for requirements, "should" for preferences
- Be specific: "System must handle 10K requests/sec" not "System should be fast"
- Include acceptance criteria: How will you verify this requirement is met?
**Non-Functional Requirements**: How well the system must perform
- **Performance**: Quantify with numbers (latency < 200ms, throughput > 1000 req/s)
- **Reliability**: Define uptime SLAs, error budgets, MTTR targets
- **Security**: Specify authentication, authorization, encryption requirements
- **Scalability**: Define growth expectations (users, data, traffic)
### Timeline & Milestones
Make milestones concrete and verifiable:
- ✅ "Milestone 1 (Mar 15): Auth service deployed to staging, 100% auth tests passing"
- ❌ "Milestone 1: Complete most of auth work"
Include buffer for unknowns:
- 80% planned work, 20% buffer for issues/tech debt
- Identify critical path and dependencies clearly
## Premortem Process
### Step 1: Set the Scene
Frame the failure scenario clearly:
> "It's [6/12/24] months from now - [Date]. We launched [initiative name] and it failed catastrophically. [Stakeholders] are upset. The team is demoralized. What went wrong?"
Choose timeframe based on initiative:
- Quick launch: 3-6 months
- Major migration: 12-24 months
- Strategic change: 24+ months
### Step 2: Brainstorm Failures (Independent)
Have each stakeholder privately write 3-5 specific failure modes:
- Be specific: "Database migration lost 10K user records" not "data issues"
- Think from your domain: Engineers focus on technical, PMs on product, Ops on operational
- No filtering: List even unlikely scenarios
### Step 3: Share and Cluster
Collect all failure modes and group similar ones:
- **Technical failures**: System design, code bugs, infrastructure
- **Operational failures**: Runbooks missing, wrong escalation, poor monitoring
- **Organizational failures**: Lack of skills, poor communication, misaligned incentives
- **External failures**: Vendor issues, market changes, regulatory changes
### Step 4: Vote and Prioritize
For each cluster, assess:
- **Likelihood**: How probable is this? (Low 10-30%, Medium 30-60%, High 60%+)
- **Impact**: How bad would it be? (Low = annoying, Medium = painful, High = catastrophic)
- **Risk Score**: Likelihood × Impact (1-9 scale)
Focus mitigation on High-risk items (score 6-9).
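A small sketch of this scoring, using the Low/Medium/High = 1/2/3 scale from the risk register; the example risks are illustrative only:

```python
# Score and rank premortem clusters with the Low/Medium/High = 1/2/3 scale.
LEVELS = {"low": 1, "medium": 2, "high": 3}

def risk_score(likelihood: str, impact: str) -> int:
    return LEVELS[likelihood] * LEVELS[impact]

risks = [
    ("migration script corrupts FK relationships", "medium", "high"),
    ("team lacks MongoDB expertise",               "high",   "medium"),
    ("vendor API rate limits during launch",       "low",    "medium"),
]
scored = sorted(
    ((desc, risk_score(l, i)) for desc, l, i in risks),
    key=lambda pair: pair[1],
    reverse=True,
)
for desc, score in scored:
    flag = "MITIGATE" if score >= 6 else "monitor"
    print(f"[{flag}] {score}: {desc}")
```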
### Step 5: Convert to Risk Register
For each significant failure mode:
1. Reframe as a risk: "Risk that [specific scenario] happens"
2. Categorize (Technical/Operational/Organizational/External)
3. Assign owner (who monitors and responds)
4. Define mitigation (how to reduce likelihood or impact)
5. Track status (Open/Mitigated/Accepted/Closed)
## Risk Register Template
### Risk Statement Format
Good risk statements are specific and measurable:
- ✅ "Risk that database migration script fails to preserve foreign key relationships, causing data integrity errors in 15% of records"
- ❌ "Risk of data issues"
### Mitigation Strategies
**Reduce Likelihood** (prevent the risk):
- Example: "Implement dry-run migration on production snapshot; verify all FK relationships before live migration"
**Reduce Impact** (limit the damage):
- Example: "Create rollback script tested on staging; set up real-time monitoring for FK violations; keep read replica as backup"
**Accept Risk** (consciously choose not to mitigate):
- For low-impact or very-low-likelihood risks
- Document why it's acceptable: "Accept risk of 3rd-party API rate limiting during launch (low likelihood, workaround available)"
**Transfer Risk** (shift to vendor/insurance):
- Example: "Use managed database service with automated backups and point-in-time recovery (transfers operational risk to vendor)"
### Risk Owners
Each risk needs a clear owner who:
- Monitors early warning signals
- Executes mitigation plans
- Escalates if risk materializes
- Updates risk status regularly
Not necessarily the person who fixes it, but the person accountable for tracking it.
## Metrics Template
### Leading Indicators (Early Signals)
These predict future success before lagging metrics move:
- **Engineering**: Deployment frequency, build time, code review cycle time, test coverage
- **Product**: Feature adoption rate, activation rate, time-to-value, engagement trends
- **Operations**: Incident detection time (MTTD), runbook execution rate, alert accuracy
Choose 3-5 leading indicators that:
1. Predict lagging outcomes (validated correlation)
2. Can be measured frequently (daily/weekly)
3. Can be acted on quickly (short feedback loop)
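Where historical data exists, the "validated correlation" in item 1 can be checked with a quick calculation; a sketch with invented sample periods:

```python
# Sanity-check that a candidate leading indicator tracks the lagging outcome,
# assuming a few historical periods of both; sample numbers are invented.
from statistics import correlation  # Python 3.10+

weekly_activation_rate = [0.31, 0.35, 0.33, 0.40, 0.42, 0.45]   # leading
next_month_retention   = [0.22, 0.24, 0.23, 0.27, 0.29, 0.31]   # lagging

r = correlation(weekly_activation_rate, next_month_retention)
print(f"correlation: {r:.2f}")  # close to 1.0 suggests a useful leading signal
```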
### Lagging Indicators (Outcomes)
These measure actual success but appear later:
- **Reliability**: Uptime, error rate, MTTR, SLA compliance
- **Performance**: p50/p95/p99 latency, throughput, response time
- **Business**: Revenue, user growth, retention, NPS, cost savings
- **User**: Activation rate, feature adoption, satisfaction score
Choose 3-5 lagging indicators that:
1. Directly measure initiative goals
2. Are measurable and objective
3. Matter to stakeholders
### Counter-Metrics (Guardrails)
What success does NOT mean sacrificing:
- If optimizing for speed → Counter-metric: Code quality (test coverage, bug rate)
- If optimizing for growth → Counter-metric: Costs (infrastructure spend, CAC)
- If optimizing for features → Counter-metric: Technical debt (cycle time, deployment frequency)
Choose 2-3 counter-metrics to:
1. Prevent gaming the system
2. Protect long-term sustainability
3. Maintain team/user trust
## Instrumentation Guidance
**Baseline**: Measure the current state before launch (✅ "p99 latency: 500ms" ❌ "API is slow"). Without a baseline, you can't measure improvement.
**Targets**: Make them specific ("reduce p99 from 500ms to 200ms"), achievable (industry benchmarks), time-bound ("by Q2 end"), meaningful (tied to outcomes).
**Measurement**: Document data source, calculation method, measurement frequency, and who can access the metrics.
**Tracking Cadence**: Real-time (system health), daily (operations), weekly (product), monthly (business).
## Quality Checklist
Before delivering, verify:
**Specification Complete:**
- [ ] Goal, stakeholders, timeline clearly stated
- [ ] Current state baseline documented with data
- [ ] Scope (in/out/future) explicitly defined
- [ ] Requirements are specific and measurable
- [ ] Dependencies and assumptions listed
- [ ] Timeline has concrete milestones
**Risks Comprehensive:**
- [ ] Premortem conducted (failure imagination exercise)
- [ ] Risks span technical, operational, organizational, external
- [ ] Each risk has likelihood, impact, score
- [ ] High-risk items (6-9 score) have detailed mitigation plans
- [ ] Every risk has an assigned owner
- [ ] Risk register is prioritized by score
**Metrics Measurable:**
- [ ] 3-5 leading indicators (early signals) defined
- [ ] 3-5 lagging indicators (outcomes) defined
- [ ] 2-3 counter-metrics (guardrails) defined
- [ ] Each metric has baseline, target, measurement method
- [ ] Metrics have clear owners and tracking cadence
- [ ] Success criteria (must/should/nice-to-have) documented
**Integration:**
- [ ] Risks map to specs (e.g., technical risks tied to architecture choices)
- [ ] Metrics validate risk mitigations (e.g., measure if mitigation worked)
- [ ] Specs enable metrics (e.g., instrumentation built into design)
- [ ] All three components tell a coherent story
**Rubric Validation:**
- [ ] Self-assessed with rubric ≥ 3.5 average across all criteria
- [ ] Stakeholders can act on this artifact
- [ ] Gaps and unknowns explicitly acknowledged