Initial commit
commands/analyse-problem.md
---
|
||||
description: Comprehensive A3 one-page problem analysis with root cause and action plan
|
||||
argument-hint: Optional problem description to document
|
||||
---
|
||||
|
||||
# A3 Problem Analysis
|
||||
|
||||
Apply A3 problem-solving format for comprehensive, single-page problem documentation and resolution planning.
|
||||
|
||||
## Description
|
||||
Structured one-page analysis format covering: Background, Current Condition, Goal, Root Cause Analysis, Countermeasures, Implementation Plan, and Follow-up. Named after the A3 paper size, it emphasizes concise, complete documentation.
|
||||
|
||||
## Usage
|
||||
`/analyse-problem [problem_description]`
|
||||
|
||||
## Variables
|
||||
- PROBLEM: Issue to analyze (default: prompt for input)
|
||||
- OUTPUT_FORMAT: markdown or text (default: markdown)
|
||||
|
||||
## Steps
|
||||
1. **Background**: Why this problem matters (context, business impact)
|
||||
2. **Current Condition**: What's happening now (data, metrics, examples)
|
||||
3. **Goal/Target**: What success looks like (specific, measurable)
|
||||
4. **Root Cause Analysis**: Why problem exists (use 5 Whys or Fishbone)
|
||||
5. **Countermeasures**: Proposed solutions addressing root causes
|
||||
6. **Implementation Plan**: Who, what, when, how
|
||||
7. **Follow-up**: How to verify success and prevent recurrence
|
||||
|
||||
## A3 Template
|
||||
|
||||
```
|
||||
═══════════════════════════════════════════════════════════════
|
||||
A3 PROBLEM ANALYSIS
|
||||
═══════════════════════════════════════════════════════════════
|
||||
|
||||
TITLE: [Concise problem statement]
|
||||
OWNER: [Person responsible]
|
||||
DATE: [YYYY-MM-DD]
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 1. BACKGROUND (Why this matters) │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ [Context, impact, urgency, who's affected] │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 2. CURRENT CONDITION (What's happening) │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ [Facts, data, metrics, examples - no opinions] │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 3. GOAL/TARGET (What success looks like) │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ [Specific, measurable, time-bound targets] │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 4. ROOT CAUSE ANALYSIS (Why problem exists) │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ [5 Whys, Fishbone, data analysis] │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 5. COUNTERMEASURES (Solutions addressing root causes) │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ [Specific actions, not vague intentions] │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 6. IMPLEMENTATION PLAN (Who, What, When) │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ [Timeline, responsibilities, dependencies, milestones] │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 7. FOLLOW-UP (Verification & Prevention) │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ [Success metrics, monitoring plan, review dates] │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
═══════════════════════════════════════════════════════════════
|
||||
```
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: Database Connection Pool Exhaustion
|
||||
|
||||
```
|
||||
═══════════════════════════════════════════════════════════════
|
||||
A3 PROBLEM ANALYSIS
|
||||
═══════════════════════════════════════════════════════════════
|
||||
|
||||
TITLE: API Downtime Due to Connection Pool Exhaustion
|
||||
OWNER: Backend Team Lead
|
||||
DATE: 2024-11-14
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 1. BACKGROUND │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ • API goes down 2-3x per week during peak hours │
|
||||
│ • Affects 10,000+ users, average 15min downtime │
|
||||
│ • Revenue impact: ~$5K per incident │
|
||||
│ • Customer satisfaction score dropped from 4.5 to 3.8 │
|
||||
│ • Started 3 weeks ago after traffic increased 40% │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 2. CURRENT CONDITION │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ Observations: │
|
||||
│ • Connection pool size: 10 (unchanged since launch) │
|
||||
│ • Peak concurrent users: 500 (was 300 three weeks ago) │
|
||||
│ • Average request time: 200ms (was 150ms) │
|
||||
│ • Connections leaked: ~2 per hour (never released) │
|
||||
│ • Error: "Connection pool exhausted" in logs │
|
||||
│ │
|
||||
│ Pattern: │
|
||||
│ • Occurs at 2pm-4pm daily (peak traffic) │
|
||||
│ • Gradual degradation over 30 minutes │
|
||||
│ • Recovery requires app restart │
|
||||
│ • Long-running queries block pool (some 30+ seconds) │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 3. GOAL/TARGET │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ • Zero downtime due to connection exhaustion │
|
||||
│ • Support 1000 concurrent users (2x current peak) │
|
||||
│ • All connections released within 5 seconds │
|
||||
│ • Achieve within 1 week │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 4. ROOT CAUSE ANALYSIS │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ 5 Whys: │
|
||||
│ Problem: Connection pool exhausted │
|
||||
│ Why 1: All 10 connections in use, none available │
|
||||
│ Why 2: Connections not released after requests │
|
||||
│ Why 3: Error handling doesn't close connections │
|
||||
│ Why 4: Try-catch blocks missing .finally() │
|
||||
│ Why 5: No code review checklist for resource cleanup │
|
||||
│ │
|
||||
│ Contributing factors: │
|
||||
│ • Pool size too small for current load │
|
||||
│ • No connection timeout configured (hangs forever) │
|
||||
│ • Slow queries hold connections longer │
|
||||
│ • No monitoring/alerting on pool metrics │
|
||||
│ │
|
||||
│ ROOT CAUSE: Systematic issue with resource cleanup + │
|
||||
│ insufficient pool sizing │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 5. COUNTERMEASURES │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ Immediate (This Week): │
|
||||
│ 1. Audit all DB code, add .finally() for connection release │
|
||||
│ 2. Increase pool size: 10 → 30 │
|
||||
│ 3. Add connection timeout: 10 seconds │
|
||||
│ 4. Add pool monitoring & alerts (>80% used) │
|
||||
│ │
|
||||
│ Short-term (2 Weeks): │
|
||||
│ 5. Optimize slow queries (add indexes) │
|
||||
│ 6. Implement connection pooling best practices doc │
|
||||
│ 7. Add automated test for connection leaks │
|
||||
│ │
|
||||
│ Long-term (1 Month): │
|
||||
│ 8. Migrate to connection pool library with auto-release │
|
||||
│ 9. Add linter rule detecting missing .finally() │
|
||||
│ 10. Create PR checklist for resource management │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 6. IMPLEMENTATION PLAN │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ Week 1 (Nov 14-18): │
|
||||
│ • Day 1-2: Audit & fix connection leaks [Dev Team] │
|
||||
│ • Day 2: Increase pool size, add timeout [DevOps] │
|
||||
│ • Day 3: Set up monitoring [SRE] │
|
||||
│ • Day 4: Test under load [QA] │
|
||||
│ • Day 5: Deploy to production [DevOps] │
|
||||
│ │
|
||||
│ Week 2 (Nov 21-25): │
|
||||
│ • Optimize identified slow queries [DB Team] │
|
||||
│ • Write best practices doc [Tech Writer + Dev Lead] │
|
||||
│ • Create connection leak test [QA Team] │
|
||||
│ │
|
||||
│ Week 3-4 (Nov 28 - Dec 9): │
|
||||
│ • Evaluate connection pool libraries [Dev Team] │
|
||||
│ • Add linter rules [Dev Lead] │
|
||||
│ • Update PR template [Dev Lead] │
|
||||
│ │
|
||||
│ Dependencies: None blocking Week 1 fixes │
|
||||
│ Resources: 2 developers, 1 DevOps, 1 SRE │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 7. FOLLOW-UP │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ Success Metrics: │
|
||||
│ • Zero downtime incidents (monitor 4 weeks) │
|
||||
│ • Pool usage stays <80% during peak │
|
||||
│ • No connection leaks detected │
|
||||
│ • Response time <200ms p95 │
|
||||
│ │
|
||||
│ Monitoring: │
|
||||
│ • Daily: Check pool usage dashboard │
|
||||
│ • Weekly: Review connection leak alerts │
|
||||
│ • Bi-weekly: Team retrospective on progress │
|
||||
│ │
|
||||
│ Review Dates: │
|
||||
│ • Week 1 (Nov 18): Verify immediate fixes effective │
|
||||
│ • Week 2 (Nov 25): Assess optimization impact │
|
||||
│ • Week 4 (Dec 9): Final review, close A3 │
|
||||
│ │
|
||||
│ Prevention: │
|
||||
│ • Add connection handling to onboarding │
|
||||
│ • Monthly audit of resource management code │
|
||||
│ • Include pool metrics in SRE runbook │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
═══════════════════════════════════════════════════════════════
|
||||
```
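
Countermeasure 1 in this example ("add `.finally()` for connection release") is the kind of change the audit would produce. Below is a minimal sketch assuming a node-postgres-style pool, shown with a `try`/`finally` block playing the same role; the `findUserById` helper, table, and columns are illustrative, not from the incident:

```typescript
import { Pool } from 'pg';

// Pool sizing and timeout mirror the countermeasures above (30 connections,
// 10-second connection timeout); exact options depend on the driver in use.
const pool = new Pool({ max: 30, connectionTimeoutMillis: 10_000 });

export async function findUserById(id: string) {
  const client = await pool.connect();
  try {
    const result = await client.query(
      'SELECT id, name FROM users WHERE id = $1',
      [id],
    );
    return result.rows[0] ?? null;
  } finally {
    // Runs on success and on error, so the connection is always returned
    // to the pool instead of leaking (~2/hour in the analysis above).
    client.release();
  }
}
```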
|
||||
|
||||
### Example 2: Security Vulnerability in Production
|
||||
|
||||
```
|
||||
═══════════════════════════════════════════════════════════════
|
||||
A3 PROBLEM ANALYSIS
|
||||
═══════════════════════════════════════════════════════════════
|
||||
|
||||
TITLE: Critical SQL Injection Vulnerability
|
||||
OWNER: Security Team Lead
|
||||
DATE: 2024-11-14
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 1. BACKGROUND │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ • Critical security vulnerability reported by researcher │
|
||||
│ • SQL injection in user search endpoint │
|
||||
│ • Potential data breach affecting 100K+ user records │
|
||||
│ • CVSS score: 9.8 (Critical) │
|
||||
│ • Vulnerability exists in production for 6 months │
|
||||
│ • Similar issue found in 2 other endpoints (scanning) │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 2. CURRENT CONDITION │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ Vulnerable Code: │
|
||||
│ • /api/users/search endpoint uses string concatenation │
|
||||
│ • Input: search query (user-provided, not sanitized) │
|
||||
│ • Pattern: `SELECT * FROM users WHERE name = '${input}'` │
|
||||
│ │
|
||||
│ Scope: │
|
||||
│ • 3 endpoints vulnerable (search, filter, export) │
|
||||
│ • All use same unsafe pattern │
|
||||
│ • No parameterized queries │
|
||||
│ • No input validation layer │
|
||||
│ │
|
||||
│ Risk Assessment: │
|
||||
│ • Exploitable from public internet │
|
||||
│ • No evidence of exploitation (logs checked) │
|
||||
│ • Similar code in admin panel (higher privilege) │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 3. GOAL/TARGET │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ • Patch all SQL injection vulnerabilities within 24 hours │
|
||||
│ • Zero SQL injection vulnerabilities in codebase │
|
||||
│ • Prevent similar issues in future code │
|
||||
│ • Verify no unauthorized access occurred │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 4. ROOT CAUSE ANALYSIS │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ 5 Whys: │
|
||||
│ Problem: SQL injection vulnerability in production │
|
||||
│ Why 1: User input concatenated directly into SQL │
|
||||
│ Why 2: Developer wasn't aware of SQL injection risks │
|
||||
│ Why 3: No security training for new developers │
|
||||
│ Why 4: Security not part of onboarding checklist │
|
||||
│ Why 5: Security team not involved in development process │
|
||||
│ │
|
||||
│ Contributing Factors (Fishbone): │
|
||||
│ • Process: No security code review │
|
||||
│ • Technology: ORM not used consistently │
|
||||
│ • People: Knowledge gap in secure coding │
|
||||
│ • Methods: No SAST tools in CI/CD │
|
||||
│ │
|
||||
│ ROOT CAUSE: Security not integrated into development │
|
||||
│ process, training gap │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 5. COUNTERMEASURES │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ Immediate (24 Hours): │
|
||||
│ 1. Patch all 3 vulnerable endpoints │
|
||||
│ 2. Deploy hotfix to production │
|
||||
│ 3. Scan codebase for similar patterns │
|
||||
│ 4. Review access logs for exploitation attempts │
|
||||
│ │
|
||||
│ Short-term (1 Week): │
|
||||
│ 5. Replace all raw SQL with parameterized queries │
|
||||
│ 6. Add input validation middleware │
|
||||
│ 7. Set up SAST tool in CI (Snyk/SonarQube) │
|
||||
│ 8. Security team review of all data access code │
|
||||
│ │
|
||||
│ Long-term (1 Month): │
|
||||
│ 9. Mandatory security training for all developers │
|
||||
│ 10. Add security review to PR process │
|
||||
│ 11. Migrate to ORM for all database access │
|
||||
│ 12. Implement security champion program │
|
||||
│ 13. Quarterly security audits │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 6. IMPLEMENTATION PLAN │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ Hour 0-4 (Emergency Response): │
|
||||
│ • Write & test patches [Security + Senior Dev] │
|
||||
│ • Emergency PR review [CTO + Tech Lead] │
|
||||
│ • Deploy to staging [DevOps] │
|
||||
│ │
|
||||
│ Hour 4-24 (Production Deploy): │
|
||||
│ • Deploy hotfix [DevOps + On-call] │
|
||||
│ • Monitor for issues [SRE Team] │
|
||||
│ • Scan logs for exploitation [Security Team] │
|
||||
│ • Notify stakeholders [Security Lead + CEO] │
|
||||
│ │
|
||||
│ Day 2-7: │
|
||||
│ • Full codebase remediation [Dev Team] │
|
||||
│ • SAST tool setup [DevOps + Security] │
|
||||
│ • Security review [External Auditor] │
|
||||
│ │
|
||||
│ Week 2-4: │
|
||||
│ • Security training program [Security + HR] │
|
||||
│ • Process improvements [Engineering Leadership] │
|
||||
│ │
|
||||
│ Dependencies: External auditor availability (Week 2) │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
┌─────────────────────────────────────────────────────────────┐
|
||||
│ 7. FOLLOW-UP │
|
||||
├─────────────────────────────────────────────────────────────┤
|
||||
│ Success Metrics: │
|
||||
│ • Zero SQL injection vulnerabilities (verified by scan) │
|
||||
│ • 100% of PRs pass SAST checks │
|
||||
│ • 100% developer security training completion │
|
||||
│ • No unauthorized access detected in log analysis │
|
||||
│ │
|
||||
│ Verification: │
|
||||
│ • Day 1: Verify patch deployed, vulnerability closed │
|
||||
│ • Week 1: External security audit confirms fixes │
|
||||
│ • Week 2: SAST tool catching similar issues │
|
||||
│ • Month 1: Training completion, process adoption │
|
||||
│ │
|
||||
│ Prevention: │
|
||||
│ • SAST tools block vulnerable code in CI │
|
||||
│ • Security review required for data access code │
|
||||
│ • Quarterly penetration testing │
|
||||
│ • Annual security training refresh │
|
||||
│ │
|
||||
│ Incident Report: │
|
||||
│ • Post-mortem meeting: Nov 16 │
|
||||
│ • Document lessons learned │
|
||||
│ • Share with engineering org │
|
||||
└─────────────────────────────────────────────────────────────┘
|
||||
|
||||
═══════════════════════════════════════════════════════════════
|
||||
```
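
Countermeasure 5 ("replace all raw SQL with parameterized queries") might look like this minimal sketch, assuming a node-postgres-style `query(text, values)` API; the function and column names are illustrative:

```typescript
import { Pool } from 'pg';

const pool = new Pool();

// Vulnerable pattern from the analysis: user input is spliced into the
// SQL text, so `name` can carry arbitrary SQL.
//   pool.query(`SELECT * FROM users WHERE name = '${name}'`)

// Parameterized version: the value is sent as a bound parameter ($1)
// and is never interpreted as SQL, whatever it contains.
export function searchUsers(name: string) {
  return pool.query(
    'SELECT id, name, email FROM users WHERE name = $1',
    [name],
  );
}
```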
|
||||
|
||||
## Notes
|
||||
- A3 forces concise, complete thinking (fits on one page)
|
||||
- Use data and facts, not opinions or blame
|
||||
- Root cause analysis is critical—use `/why` or `/cause-and-effect`
|
||||
- Countermeasures must address root causes, not symptoms
|
||||
- Implementation plan needs clear ownership and timelines
|
||||
- Follow-up ensures sustainable improvement
|
||||
- A3 becomes historical record for organizational learning
|
||||
- Update A3 as situation evolves (living document until closed)
|
||||
- Consider A3 for: incidents, recurring issues, major improvements
|
||||
- Overkill for: small bugs, one-line fixes, trivial issues
|
||||
|
||||
commands/analyse.md
|
||||
---
|
||||
description: Auto-selects best Kaizen method (Gemba Walk, Value Stream, or Muda) for target
|
||||
argument-hint: Optional target description (e.g., code, workflow, or inefficiencies)
|
||||
---
|
||||
|
||||
# Smart Analysis
|
||||
|
||||
Intelligently select and apply the most appropriate Kaizen analysis technique based on what you're analyzing.
|
||||
|
||||
## Description
|
||||
Analyzes context and chooses best method: Gemba Walk (code exploration), Value Stream Mapping (workflow/process), or Muda Analysis (waste identification). Guides you through the selected technique.
|
||||
|
||||
## Usage
|
||||
`/analyse [target_description]`
|
||||
|
||||
Examples:
|
||||
- `/analyse authentication implementation`
|
||||
- `/analyse deployment workflow`
|
||||
- `/analyse codebase for inefficiencies`
|
||||
|
||||
## Variables
|
||||
- TARGET: What to analyze (default: prompt for input)
|
||||
- METHOD: Override auto-selection (gemba, vsm, muda)
|
||||
|
||||
## Method Selection Logic
|
||||
|
||||
**Gemba Walk** → When analyzing:
|
||||
- Code implementation (how feature actually works)
|
||||
- Gap between documentation and reality
|
||||
- Understanding unfamiliar codebase areas
|
||||
- Actual vs. assumed architecture
|
||||
|
||||
**Value Stream Mapping** → When analyzing:
|
||||
- Workflows and processes (CI/CD, deployment, development)
|
||||
- Bottlenecks in multi-stage pipelines
|
||||
- Handoffs between teams/systems
|
||||
- Time spent in each process stage
|
||||
|
||||
**Muda (Waste Analysis)** → When analyzing:
|
||||
- Code quality and efficiency
|
||||
- Technical debt
|
||||
- Over-engineering or duplication
|
||||
- Resource utilization
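
One way the auto-selection could work is a simple keyword heuristic over the target description. This is a hypothetical sketch: the cue words and the `selectMethod` name are assumptions, not the command's actual logic:

```typescript
type Method = 'gemba' | 'vsm' | 'muda';

export function selectMethod(target: string): Method {
  const t = target.toLowerCase();
  // Workflows, pipelines, and handoffs point at Value Stream Mapping.
  if (/(workflow|process|pipeline|deployment|handoff|cycle time)/.test(t)) return 'vsm';
  // Waste, debt, and duplication point at Muda analysis.
  if (/(waste|inefficien|debt|duplicat|over-engineer|cleanup)/.test(t)) return 'muda';
  // Default: go and see the code (Gemba Walk).
  return 'gemba';
}

// selectMethod('authentication implementation')   -> 'gemba'
// selectMethod('deployment workflow')             -> 'vsm'
// selectMethod('codebase for inefficiencies')     -> 'muda'
```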
|
||||
|
||||
## Steps
|
||||
1. Understand what's being analyzed
|
||||
2. Determine best method (or use specified method)
|
||||
3. Explain why this method fits
|
||||
4. Guide through the analysis
|
||||
5. Present findings with actionable insights
|
||||
|
||||
---
|
||||
|
||||
## Method 1: Gemba Walk
|
||||
|
||||
"Go and see" the actual code to understand reality vs. assumptions.
|
||||
|
||||
### When to Use
|
||||
- Understanding how feature actually works
|
||||
- Code archaeology (legacy systems)
|
||||
- Finding gaps between docs and implementation
|
||||
- Exploring unfamiliar areas before changes
|
||||
|
||||
### Process
|
||||
1. **Define scope**: What code area to explore
|
||||
2. **State assumptions**: What you think it does
|
||||
3. **Observe reality**: Read actual code
|
||||
4. **Document findings**:
|
||||
- Entry points
|
||||
- Actual data flow
|
||||
- Surprises (differs from assumptions)
|
||||
- Hidden dependencies
|
||||
- Undocumented behavior
|
||||
5. **Identify gaps**: Documentation vs. reality
|
||||
6. **Recommend**: Update docs, refactor, or accept
|
||||
|
||||
### Example: Authentication System Gemba Walk
|
||||
|
||||
```
|
||||
SCOPE: User authentication flow
|
||||
|
||||
ASSUMPTIONS (Before):
|
||||
• JWT tokens stored in localStorage
|
||||
• Single sign-on via OAuth only
|
||||
• Session expires after 1 hour
|
||||
• Password reset via email link
|
||||
|
||||
GEMBA OBSERVATIONS (Actual Code):
|
||||
|
||||
Entry Point: /api/auth/login (routes/auth.ts:45)
|
||||
├─> AuthService.authenticate() (services/auth.ts:120)
|
||||
├─> UserRepository.findByEmail() (db/users.ts:67)
|
||||
├─> bcrypt.compare() (services/auth.ts:145)
|
||||
└─> TokenService.generate() (services/token.ts:34)
|
||||
|
||||
Actual Flow:
|
||||
1. Login credentials → POST /api/auth/login
|
||||
2. Password hashed with bcrypt (10 rounds)
|
||||
3. JWT generated with 24hr expiry (NOT 1 hour!)
|
||||
4. Token stored in httpOnly cookie (NOT localStorage)
|
||||
5. Refresh token in separate cookie (15 days)
|
||||
6. Session data in Redis (30 days TTL)
|
||||
|
||||
SURPRISES:
|
||||
✗ OAuth not implemented (commented out code found)
|
||||
✗ Password reset is manual (admin intervention)
|
||||
✗ Three different session storage mechanisms:
|
||||
- Redis for session data
|
||||
- Database for "remember me"
|
||||
- Cookies for tokens
|
||||
✗ Legacy endpoint /auth/legacy still active (no auth!)
|
||||
✗ Admin users bypass rate limiting (security issue)
|
||||
|
||||
GAPS:
|
||||
• Documentation says OAuth, code doesn't have it
|
||||
• Session expiry inconsistent (docs: 1hr, code: 24hr)
|
||||
• Legacy endpoint not documented (security risk)
|
||||
• No mention of "remember me" in docs
|
||||
|
||||
RECOMMENDATIONS:
|
||||
1. HIGH: Secure or remove /auth/legacy endpoint
|
||||
2. HIGH: Document actual session expiry (24hr)
|
||||
3. MEDIUM: Clean up or implement OAuth
|
||||
4. MEDIUM: Consolidate session storage (choose one)
|
||||
5. LOW: Add rate limiting for admin users
|
||||
```
|
||||
|
||||
### Example: CI/CD Pipeline Gemba Walk
|
||||
|
||||
```
|
||||
SCOPE: Build and deployment pipeline
|
||||
|
||||
ASSUMPTIONS:
|
||||
• Automated tests run on every commit
|
||||
• Deploy to staging automatic
|
||||
• Production deploy requires approval
|
||||
|
||||
GEMBA OBSERVATIONS:
|
||||
|
||||
Actual Pipeline (.github/workflows/main.yml):
|
||||
1. On push to main:
|
||||
├─> Lint (2 min)
|
||||
├─> Unit tests (5 min) [SKIPPED if "[skip-tests]" in commit]
|
||||
├─> Build Docker image (15 min)
|
||||
└─> Deploy to staging (3 min)
|
||||
|
||||
2. Manual trigger for production:
|
||||
├─> Run integration tests (20 min) [ONLY for production!]
|
||||
├─> Security scan (10 min)
|
||||
└─> Deploy to production (5 min)
|
||||
|
||||
SURPRISES:
|
||||
✗ Unit tests can be skipped with commit message flag
|
||||
✗ Integration tests ONLY run for production deploy
|
||||
✗ Staging deployed without integration tests
|
||||
✗ No rollback mechanism (manual kubectl commands)
|
||||
✗ Secrets loaded from .env file (not secrets manager)
|
||||
✗ Old "hotfix" branch bypasses all checks
|
||||
|
||||
GAPS:
|
||||
• Staging and production have different test coverage
|
||||
• Documentation doesn't mention test skip flag
|
||||
• Rollback process not documented or automated
|
||||
• Security scan results not enforced (warning only)
|
||||
|
||||
RECOMMENDATIONS:
|
||||
1. CRITICAL: Remove test skip flag capability
|
||||
2. CRITICAL: Migrate secrets to secrets manager
|
||||
3. HIGH: Run integration tests on staging too
|
||||
4. HIGH: Delete or secure hotfix branch
|
||||
5. MEDIUM: Add automated rollback capability
|
||||
6. MEDIUM: Make security scan blocking
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Method 2: Value Stream Mapping
|
||||
|
||||
Map workflow stages, measure time/waste, identify bottlenecks.
|
||||
|
||||
### When to Use
|
||||
- Process optimization (CI/CD, deployment, code review)
|
||||
- Understanding multi-stage workflows
|
||||
- Finding delays and handoffs
|
||||
- Improving cycle time
|
||||
|
||||
### Process
|
||||
1. **Identify start and end**: Where process begins and ends
|
||||
2. **Map all steps**: Including waiting/handoff time
|
||||
3. **Measure each step**:
|
||||
- Processing time (work happening)
|
||||
- Waiting time (idle, blocked)
|
||||
- Who/what performs step
|
||||
4. **Calculate metrics**:
|
||||
- Total lead time
|
||||
- Value-add time vs. waste time
|
||||
- % efficiency (value-add / total time)
|
||||
5. **Identify bottlenecks**: Longest steps, most waiting
|
||||
6. **Design future state**: Optimized flow
|
||||
7. **Plan improvements**: How to achieve future state
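
A small illustrative sketch of step 4 (type and function names are assumptions); the feature-development example below uses exactly these formulas:

```typescript
interface Step {
  name: string;
  processingDays: number; // value-add time: work actually happening
  waitingDays: number;    // waste time: idle, blocked, in a queue
}

export function vsmMetrics(steps: Step[]) {
  const valueAdd = steps.reduce((sum, s) => sum + s.processingDays, 0);
  const waste = steps.reduce((sum, s) => sum + s.waitingDays, 0);
  const leadTime = valueAdd + waste;
  return { leadTime, valueAdd, waste, efficiency: valueAdd / leadTime };
}

// For the current-state map below: leadTime 22.5, valueAdd 11.5,
// waste 11, efficiency ~0.51 (51%).
```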
|
||||
|
||||
### Example: Feature Development Value Stream Map
|
||||
|
||||
```
|
||||
CURRENT STATE: Feature request → Production
|
||||
|
||||
Step 1: Requirements Gathering
|
||||
├─ Processing: 2 days (meetings, writing spec)
|
||||
├─ Waiting: 3 days (stakeholder review)
|
||||
└─ Owner: Product Manager
|
||||
|
||||
Step 2: Design
|
||||
├─ Processing: 1 day (mockups, architecture)
|
||||
├─ Waiting: 2 days (design review, feedback)
|
||||
└─ Owner: Designer + Architect
|
||||
|
||||
Step 3: Development
|
||||
├─ Processing: 5 days (coding)
|
||||
├─ Waiting: 2 days (PR review queue)
|
||||
└─ Owner: Developer
|
||||
|
||||
Step 4: Code Review
|
||||
├─ Processing: 0.5 days (review)
|
||||
├─ Waiting: 1 day (back-and-forth changes)
|
||||
└─ Owner: Senior Developer
|
||||
|
||||
Step 5: QA Testing
|
||||
├─ Processing: 2 days (manual testing)
|
||||
├─ Waiting: 1 day (bug fixes, retest)
|
||||
└─ Owner: QA Engineer
|
||||
|
||||
Step 6: Staging Deployment
|
||||
├─ Processing: 0.5 days (deploy, smoke test)
|
||||
├─ Waiting: 2 days (stakeholder UAT)
|
||||
└─ Owner: DevOps
|
||||
|
||||
Step 7: Production Deployment
|
||||
├─ Processing: 0.5 days (deploy, monitor)
|
||||
├─ Waiting: 0 days
|
||||
└─ Owner: DevOps
|
||||
|
||||
───────────────────────────────────────
|
||||
METRICS:
|
||||
Total Lead Time: 22.5 days
|
||||
Value-Add Time: 11.5 days (work)
|
||||
Waste Time: 11 days (waiting)
|
||||
Efficiency: 51%
|
||||
|
||||
BOTTLENECKS:
|
||||
1. Requirements review wait (3 days)
|
||||
2. Development time (5 days)
|
||||
3. Stakeholder UAT wait (2 days)
|
||||
4. PR review queue (2 days)
|
||||
|
||||
WASTE ANALYSIS:
|
||||
• Waiting for reviews/approvals: 9 days (82% of waste)
|
||||
• Rework due to unclear requirements: ~1 day
|
||||
• Manual testing time: 2 days
|
||||
|
||||
FUTURE STATE DESIGN:
|
||||
|
||||
Changes:
|
||||
1. Async requirements approval (stakeholders have 24hr SLA)
|
||||
2. Split large features into smaller increments
|
||||
3. Automated testing replaces manual QA
|
||||
4. PR review SLA: 4 hours max
|
||||
5. Continuous deployment to staging (no approval)
|
||||
6. Feature flags for production rollout (no wait)
|
||||
|
||||
Projected Future State:
|
||||
Total Lead Time: 9 days (60% reduction)
|
||||
Value-Add Time: 8 days
|
||||
Waste Time: 1 day
|
||||
Efficiency: 89%
|
||||
|
||||
IMPLEMENTATION PLAN:
|
||||
Week 1: Set review SLAs, add feature flags
|
||||
Week 2: Automate test suite
|
||||
Week 3: Enable continuous staging deployment
|
||||
Week 4: Train team on incremental delivery
|
||||
```
|
||||
|
||||
### Example: Incident Response Value Stream Map
|
||||
|
||||
```
|
||||
CURRENT STATE: Incident detected → Resolution
|
||||
|
||||
Step 1: Detection
|
||||
├─ Processing: 0 min (automated alert)
|
||||
├─ Waiting: 15 min (until someone sees alert)
|
||||
└─ System: Monitoring tool
|
||||
|
||||
Step 2: Triage
|
||||
├─ Processing: 10 min (assess severity)
|
||||
├─ Waiting: 20 min (find right person)
|
||||
└─ Owner: On-call engineer
|
||||
|
||||
Step 3: Investigation
|
||||
├─ Processing: 45 min (logs, debugging)
|
||||
├─ Waiting: 30 min (access to production, gather context)
|
||||
└─ Owner: Engineer + SRE
|
||||
|
||||
Step 4: Fix Development
|
||||
├─ Processing: 60 min (write fix)
|
||||
├─ Waiting: 15 min (code review)
|
||||
└─ Owner: Engineer
|
||||
|
||||
Step 5: Deployment
|
||||
├─ Processing: 10 min (hotfix deploy)
|
||||
├─ Waiting: 5 min (verification)
|
||||
└─ Owner: SRE
|
||||
|
||||
Step 6: Post-Incident
|
||||
├─ Processing: 20 min (update status, notify)
|
||||
├─ Waiting: 0 min
|
||||
└─ Owner: Engineer
|
||||
|
||||
───────────────────────────────────────
|
||||
METRICS:
|
||||
Total Lead Time: 230 min (3h 50min)
|
||||
Value-Add Time: 145 min
|
||||
Waste Time: 85 min (37%)
|
||||
|
||||
BOTTLENECKS:
|
||||
1. Finding right person (20 min)
|
||||
2. Gaining production access (30 min)
|
||||
3. Investigation time (45 min)
|
||||
|
||||
IMPROVEMENTS:
|
||||
1. Slack integration for alerts (reduce detection wait)
|
||||
2. Auto-assign by service owner (no hunt for person)
|
||||
3. Pre-approved prod access for on-call (reduce wait)
|
||||
4. Runbooks for common incidents (faster investigation)
|
||||
5. Automated rollback for deployment incidents
|
||||
|
||||
Projected improvement: 230min → 120min (48% faster)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Method 3: Muda (Waste Analysis)
|
||||
|
||||
Identify seven types of waste in code and development processes.
|
||||
|
||||
### When to Use
|
||||
- Code quality audits
|
||||
- Technical debt assessment
|
||||
- Process efficiency improvements
|
||||
- Identifying over-engineering
|
||||
|
||||
### The 7 Types of Waste (Applied to Software)
|
||||
|
||||
**1. Overproduction**: Building more than needed
|
||||
- Features no one uses
|
||||
- Overly complex solutions
|
||||
- Premature optimization
|
||||
- Unnecessary abstractions
|
||||
|
||||
**2. Waiting**: Idle time
|
||||
- Build/test/deploy time
|
||||
- Code review delays
|
||||
- Waiting for dependencies
|
||||
- Blocked by other teams
|
||||
|
||||
**3. Transportation**: Moving things around
|
||||
- Unnecessary data transformations
|
||||
- API layers with no value add
|
||||
- Copying data between systems
|
||||
- Repeated serialization/deserialization
|
||||
|
||||
**4. Over-processing**: Doing more than necessary
|
||||
- Excessive logging
|
||||
- Redundant validations
|
||||
- Over-normalized databases
|
||||
- Unnecessary computation
|
||||
|
||||
**5. Inventory**: Work in progress
|
||||
- Unmerged branches
|
||||
- Half-finished features
|
||||
- Untriaged bugs
|
||||
- Undeployed code
|
||||
|
||||
**6. Motion**: Unnecessary movement
|
||||
- Context switching
|
||||
- Meetings without purpose
|
||||
- Manual deployments
|
||||
- Repetitive tasks
|
||||
|
||||
**7. Defects**: Rework and bugs
|
||||
- Production bugs
|
||||
- Technical debt
|
||||
- Flaky tests
|
||||
- Incomplete features
|
||||
|
||||
### Process
|
||||
1. **Define scope**: Codebase area or process
|
||||
2. **Examine for each waste type**
|
||||
3. **Quantify impact** (time, complexity, cost)
|
||||
4. **Prioritize by impact**
|
||||
5. **Propose elimination strategies**
|
||||
|
||||
### Example: API Codebase Waste Analysis
|
||||
|
||||
```
|
||||
SCOPE: REST API backend (50K LOC)
|
||||
|
||||
1. OVERPRODUCTION
|
||||
Found:
|
||||
• 15 API endpoints with zero usage (last 90 days)
|
||||
• Generic "framework" built for "future flexibility" (unused)
|
||||
• Premature microservices split (2 services, could be 1)
|
||||
• Feature flags for 12 features (10 fully rolled out, flags kept)
|
||||
|
||||
Impact: 8K LOC maintained for no reason
|
||||
Recommendation: Delete unused endpoints, remove stale flags
|
||||
|
||||
2. WAITING
|
||||
Found:
|
||||
• CI pipeline: 45 min (slow Docker builds)
|
||||
• PR review time: avg 2 days
|
||||
• Deployment to staging: manual, takes 1 hour
|
||||
|
||||
Impact: 2.5 days wasted per feature
|
||||
Recommendation: Cache Docker layers, PR review SLA, automate staging
|
||||
|
||||
3. TRANSPORTATION
|
||||
Found:
|
||||
• Data transformed 4 times between DB and API response:
|
||||
DB → ORM → Service → DTO → Serializer
|
||||
• Request/response logged 3 times (middleware, handler, service)
|
||||
• Files uploaded → S3 → CloudFront → Local cache (unnecessary)
|
||||
|
||||
Impact: 200ms avg response time overhead
|
||||
Recommendation: Reduce transformation layers, consolidate logging
|
||||
|
||||
4. OVER-PROCESSING
|
||||
Found:
|
||||
• Every request validates auth token (even cached)
|
||||
• Database queries fetch all columns (SELECT *)
|
||||
• JSON responses include full object graphs (nested 5 levels)
|
||||
• Logs every database query in production (verbose)
|
||||
|
||||
Impact: 40% higher database load, 3x log storage
|
||||
Recommendation: Cache auth checks, selective fields, trim responses
|
||||
|
||||
5. INVENTORY
|
||||
Found:
|
||||
• 23 open PRs (8 abandoned, 6+ months old)
|
||||
• 5 feature branches unmerged (completed but not deployed)
|
||||
• 147 open bugs (42 duplicates, 60 not reproducible)
|
||||
• 12 hotfix commits not backported to main
|
||||
|
||||
Impact: Context overhead, merge conflicts, lost work
|
||||
Recommendation: Close stale PRs, bug triage, deploy pending features
|
||||
|
||||
6. MOTION
|
||||
Found:
|
||||
• Developers switch between 4 tools for one deployment
|
||||
• Manual database migrations (error-prone, slow)
|
||||
• Environment config spread across 6 files
|
||||
• Copy-paste secrets to .env files
|
||||
|
||||
Impact: 30min per deployment, frequent mistakes
|
||||
Recommendation: Unified deployment tool, automate migrations
|
||||
|
||||
7. DEFECTS
|
||||
Found:
|
||||
• 12 production bugs per month
|
||||
• 15% flaky test rate (wasted retry time)
|
||||
• Technical debt in auth module (refactor needed)
|
||||
• Incomplete error handling (crashes instead of graceful)
|
||||
|
||||
Impact: Customer complaints, rework, downtime
|
||||
Recommendation: Stabilize tests, refactor auth, add error boundaries
|
||||
|
||||
───────────────────────────────────────
|
||||
SUMMARY
|
||||
|
||||
Total Waste Identified:
|
||||
• Code: 8K LOC doing nothing
|
||||
• Time: 2.5 days per feature
|
||||
• Performance: 200ms overhead per request
|
||||
• Effort: 30min per deployment
|
||||
|
||||
Priority Fixes (by impact):
|
||||
1. HIGH: Automate deployments (reduces Motion + Waiting)
|
||||
2. HIGH: Fix flaky tests (reduces Defects)
|
||||
3. MEDIUM: Remove unused code (reduces Overproduction)
|
||||
4. MEDIUM: Optimize data transformations (reduces Transportation)
|
||||
5. LOW: Triage bug backlog (reduces Inventory)
|
||||
|
||||
Estimated Recovery:
|
||||
• 20% faster feature delivery
|
||||
• 50% fewer production issues
|
||||
• 30% less operational overhead
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Notes
|
||||
- Method selection is contextual—choose what fits best
|
||||
- Can combine methods (Gemba Walk → Muda Analysis)
|
||||
- Start with Gemba Walk when unfamiliar with area
|
||||
- Use VSM for process optimization
|
||||
- Use Muda for efficiency and cleanup
|
||||
- All methods should lead to actionable improvements
|
||||
- Document findings for organizational learning
|
||||
- Consider using `/analyse-problem` (A3) for comprehensive documentation of findings
|
||||
|
||||
commands/cause-and-effect.md
|
||||
---
|
||||
description: Systematic Fishbone analysis exploring problem causes across six categories
|
||||
argument-hint: Optional problem description to analyze
|
||||
---
|
||||
|
||||
# Cause and Effect Analysis
|
||||
|
||||
Apply Fishbone (Ishikawa) diagram analysis to systematically explore all potential causes of a problem across multiple categories.
|
||||
|
||||
## Description
|
||||
Systematically examine potential causes across six categories: People, Process, Technology, Environment, Methods, and Materials. Creates a structured "fishbone" view identifying contributing factors.
|
||||
|
||||
## Usage
|
||||
`/cause-and-effect [problem_description]`
|
||||
|
||||
## Variables
|
||||
- PROBLEM: Issue to analyze (default: prompt for input)
|
||||
- CATEGORIES: Categories to explore (default: all six)
|
||||
|
||||
## Steps
|
||||
1. State the problem clearly (the "head" of the fish)
|
||||
2. For each category, brainstorm potential causes:
|
||||
- **People**: Skills, training, communication, team dynamics
|
||||
- **Process**: Workflows, procedures, standards, reviews
|
||||
- **Technology**: Tools, infrastructure, dependencies, configuration
|
||||
- **Environment**: Workspace, deployment targets, external factors
|
||||
- **Methods**: Approaches, patterns, architectures, practices
|
||||
- **Materials**: Data, dependencies, third-party services, resources
|
||||
3. For each potential cause, ask "why" to dig deeper
|
||||
4. Identify which causes are contributing vs. root causes
|
||||
5. Prioritize causes by impact and likelihood
|
||||
6. Propose solutions for highest-priority causes
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: API Response Latency
|
||||
|
||||
```
|
||||
Problem: API responses take 3+ seconds (target: <500ms)
|
||||
|
||||
PEOPLE
|
||||
├─ Team unfamiliar with performance optimization
|
||||
├─ No one owns performance monitoring
|
||||
└─ Frontend team doesn't understand backend constraints
|
||||
|
||||
PROCESS
|
||||
├─ No performance testing in CI/CD
|
||||
├─ No SLA defined for response times
|
||||
└─ Performance regression not caught in code review
|
||||
|
||||
TECHNOLOGY
|
||||
├─ Database queries not optimized
|
||||
│ └─ Why: No query analysis tools in place
|
||||
├─ N+1 queries in ORM
|
||||
│ └─ Why: Eager loading not configured
|
||||
├─ No caching layer
|
||||
│ └─ Why: Redis not in tech stack
|
||||
└─ Synchronous external API calls
|
||||
└─ Why: No async architecture in place
|
||||
|
||||
ENVIRONMENT
|
||||
├─ Production uses smaller database instance than needed
|
||||
├─ No CDN for static assets
|
||||
└─ Single region deployment (high latency for distant users)
|
||||
|
||||
METHODS
|
||||
├─ REST API design requires multiple round trips
|
||||
├─ No pagination on large datasets
|
||||
└─ Full object serialization instead of selective fields
|
||||
|
||||
MATERIALS
|
||||
├─ Large JSON payloads (unnecessary data)
|
||||
├─ Uncompressed responses
|
||||
└─ Third-party API (payment gateway) is slow
|
||||
└─ Why: Free tier with rate limiting
|
||||
|
||||
ROOT CAUSES:
|
||||
- No performance requirements defined (Process)
|
||||
- Missing performance monitoring tooling (Technology)
|
||||
- Architecture doesn't support caching/async (Methods)
|
||||
|
||||
SOLUTIONS (Priority Order):
|
||||
1. Add database indexes (quick win, high impact)
|
||||
2. Implement Redis caching layer (medium effort, high impact)
|
||||
3. Make external API calls async with webhooks (high effort, high impact)
|
||||
4. Define and monitor performance SLAs (low effort, prevents regression)
|
||||
```
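
Solution 2 (a caching layer) could take roughly this shape. The sketch keeps the store behind a small interface so it stays library-agnostic - in practice a Redis client would implement `CacheStore`; all names here are illustrative assumptions:

```typescript
interface CacheStore {
  get(key: string): Promise<string | null>;
  set(key: string, value: string, ttlSeconds: number): Promise<void>;
}

// Cache-aside: serve from the cache when possible; otherwise hit the
// slow source once (database, payment gateway) and remember the result.
export async function cached<T>(
  cache: CacheStore,
  key: string,
  ttlSeconds: number,
  load: () => Promise<T>,
): Promise<T> {
  const hit = await cache.get(key);
  if (hit !== null) return JSON.parse(hit) as T;

  const value = await load();
  await cache.set(key, JSON.stringify(value), ttlSeconds);
  return value;
}

// Usage: cached(store, `user:${id}`, 300, () => fetchUserFromDb(id))
```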
|
||||
|
||||
### Example 2: Flaky Test Suite
|
||||
|
||||
```
|
||||
Problem: 15% of test runs fail, passing on retry
|
||||
|
||||
PEOPLE
|
||||
├─ Test-writing skills vary across team
|
||||
├─ New developers copy existing flaky patterns
|
||||
└─ No one assigned to fix flaky tests
|
||||
|
||||
PROCESS
|
||||
├─ Flaky tests marked as "known issue" and ignored
|
||||
├─ No policy against merging with flaky tests
|
||||
└─ Test failures don't block deployments
|
||||
|
||||
TECHNOLOGY
|
||||
├─ Race conditions in async test setup
|
||||
├─ Tests share global state
|
||||
├─ Test database not isolated per test
|
||||
├─ setTimeout used instead of proper waiting
|
||||
└─ CI environment inconsistent (different CPU/memory)
|
||||
|
||||
ENVIRONMENT
|
||||
├─ CI runner under heavy load
|
||||
├─ Network timing varies (external API mocks flaky)
|
||||
└─ Timezone differences between local and CI
|
||||
|
||||
METHODS
|
||||
├─ Integration tests not properly isolated
|
||||
├─ No retry logic for legitimate timing issues
|
||||
└─ Tests depend on execution order
|
||||
|
||||
MATERIALS
|
||||
├─ Test data fixtures overlap
|
||||
├─ Shared test database polluted
|
||||
└─ Mock data doesn't match production patterns
|
||||
|
||||
ROOT CAUSES:
|
||||
- No test isolation strategy (Methods + Technology)
|
||||
- Process accepts flaky tests (Process)
|
||||
- Async timing not handled properly (Technology)
|
||||
|
||||
SOLUTIONS:
|
||||
1. Implement per-test database isolation (high impact)
|
||||
2. Replace setTimeout with proper async/await patterns (medium impact)
|
||||
3. Add pre-commit hook blocking flaky test patterns (prevents new issues)
|
||||
4. Enforce policy: flaky test = block merge (process change)
|
||||
```
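
Solution 2 from this example (replace `setTimeout` with proper async/await) might look like the sketch below; `seedDatabase`, `db`, and the polling helper are illustrative assumptions:

```typescript
// Flaky pattern: fire-and-forget seeding plus a fixed sleep; the test
// races the seed and fails whenever the runner is slow.
//   seedDatabase();
//   await new Promise((resolve) => setTimeout(resolve, 500));

// Deterministic pattern: await the work, and poll for an observable
// condition instead of guessing a delay.
export async function waitFor(
  condition: () => Promise<boolean>,
  timeoutMs = 5000,
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return;
    await new Promise((resolve) => setTimeout(resolve, 50));
  }
  throw new Error('waitFor: condition not met before timeout');
}

// In test setup, assuming seedDatabase() returns a Promise:
//   await seedDatabase();
//   await waitFor(async () => (await db.count('users')) > 0);
```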
|
||||
|
||||
### Example 3: Feature Takes 3 Months Instead of 3 Weeks
|
||||
|
||||
```
|
||||
Problem: Simple CRUD feature took 12 weeks vs. 3 week estimate
|
||||
|
||||
PEOPLE
|
||||
├─ Developer unfamiliar with codebase
|
||||
├─ Key architect on vacation during critical phase
|
||||
└─ Designer changed requirements mid-development
|
||||
|
||||
PROCESS
|
||||
├─ Requirements not finalized before starting
|
||||
├─ No code review for first 6 weeks (large diff)
|
||||
├─ Multiple rounds of design revision
|
||||
└─ QA started late (found issues in week 10)
|
||||
|
||||
TECHNOLOGY
|
||||
├─ Codebase has high coupling (change ripple effects)
|
||||
├─ No automated tests (manual testing slow)
|
||||
├─ Legacy code required refactoring first
|
||||
└─ Development environment setup took 2 weeks
|
||||
|
||||
ENVIRONMENT
|
||||
├─ Staging environment broken for 3 weeks
|
||||
├─ Production data needed for testing (compliance delay)
|
||||
└─ Dependencies blocked by another team
|
||||
|
||||
METHODS
|
||||
├─ No incremental delivery (big bang approach)
|
||||
├─ Over-engineering (added future features "while we're at it")
|
||||
└─ No design doc (discovered issues during implementation)
|
||||
|
||||
MATERIALS
|
||||
├─ Third-party API changed during development
|
||||
├─ Production data model different than staging
|
||||
└─ Missing design assets (waited for designer)
|
||||
|
||||
ROOT CAUSES:
|
||||
- No requirements lock-down before start (Process)
|
||||
- Architecture prevents incremental changes (Technology)
|
||||
- Big bang approach vs. iterative (Methods)
|
||||
- Development environment not automated (Technology)
|
||||
|
||||
SOLUTIONS:
|
||||
1. Require design doc + finalized requirements before starting (Process)
|
||||
2. Implement feature flags for incremental delivery (Methods)
|
||||
3. Automate dev environment setup (Technology)
|
||||
4. Refactor high-coupling areas (Technology, long-term)
|
||||
```
|
||||
|
||||
## Notes
|
||||
- Fishbone reveals systemic issues across domains
|
||||
- Multiple causes often combine to create problems
|
||||
- Don't stop at first cause in each category—dig deeper
|
||||
- Some causes span multiple categories (mark them)
|
||||
- Root causes usually in Process or Methods (not just Technology)
|
||||
- Use with `/why` command for deeper analysis of specific causes
|
||||
- Prioritize solutions by: impact × feasibility ÷ effort
|
||||
- Address root causes, not just symptoms
|
||||
|
||||
commands/plan-do-check-act.md
|
||||
---
|
||||
description: Iterative PDCA cycle for systematic experimentation and continuous improvement
|
||||
argument-hint: Optional improvement goal or problem to address
|
||||
---
|
||||
|
||||
# Plan-Do-Check-Act (PDCA)
|
||||
|
||||
Apply PDCA cycle for continuous improvement through iterative problem-solving and process optimization.
|
||||
|
||||
## Description
|
||||
|
||||
Four-phase iterative cycle: Plan (identify and analyze), Do (implement changes), Check (measure results), Act (standardize or adjust). Enables systematic experimentation and improvement.
|
||||
|
||||
## Usage
|
||||
|
||||
`/plan-do-check-act [improvement_goal]`
|
||||
|
||||
## Variables
|
||||
|
||||
- GOAL: Improvement target or problem to address (default: prompt for input)
|
||||
- CYCLE_NUMBER: Which PDCA iteration (default: 1)
|
||||
|
||||
## Steps
|
||||
|
||||
### Phase 1: PLAN
|
||||
|
||||
1. Define the problem or improvement goal
|
||||
2. Analyze current state (baseline metrics)
|
||||
3. Identify root causes (use `/why` or `/cause-and-effect`)
|
||||
4. Develop hypothesis: "If we change X, Y will improve"
|
||||
5. Design experiment: what to change, how to measure success
|
||||
6. Set success criteria (measurable targets)
|
||||
|
||||
### Phase 2: DO
|
||||
|
||||
1. Implement the planned change (small scale first)
|
||||
2. Document what was actually done
|
||||
3. Record any deviations from plan
|
||||
4. Collect data throughout implementation
|
||||
5. Note unexpected observations
|
||||
|
||||
### Phase 3: CHECK
|
||||
|
||||
1. Measure results against success criteria
|
||||
2. Compare to baseline (before vs. after)
|
||||
3. Analyze data: did hypothesis hold?
|
||||
4. Identify what worked and what didn't
|
||||
5. Document learnings and insights
|
||||
|
||||
### Phase 4: ACT
|
||||
|
||||
1. **If successful**: Standardize the change
|
||||
- Update documentation
|
||||
- Train team
|
||||
- Create checklist/automation
|
||||
- Monitor for regression
|
||||
2. **If unsuccessful**: Learn and adjust
|
||||
- Understand why it failed
|
||||
- Refine hypothesis
|
||||
- Start new PDCA cycle with adjusted plan
|
||||
3. **If partially successful**:
|
||||
- Standardize what worked
|
||||
- Plan next cycle for remaining issues
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: Reducing Build Time
|
||||
|
||||
```
|
||||
CYCLE 1
|
||||
───────
|
||||
PLAN:
|
||||
Problem: Docker build takes 45 minutes
|
||||
Current State: Full rebuild every time, no layer caching
|
||||
Root Cause: Package manager cache not preserved between builds
|
||||
Hypothesis: Caching dependencies will reduce build to <10 minutes
|
||||
Change: Add layer caching for package.json + node_modules
|
||||
Success Criteria: Build time <10 minutes on unchanged dependencies
|
||||
|
||||
DO:
|
||||
- Restructured Dockerfile: COPY package*.json before src files
|
||||
- Added .dockerignore for node_modules
|
||||
- Configured CI cache for Docker layers
|
||||
- Tested on 3 builds
|
||||
|
||||
CHECK:
|
||||
Results:
|
||||
- Unchanged dependencies: 8 minutes ✓ (was 45)
|
||||
- Changed dependencies: 12 minutes (was 45)
|
||||
- Fresh builds: 45 minutes (same, expected)
|
||||
Analysis: 82% reduction on cached builds, hypothesis confirmed
|
||||
|
||||
ACT:
|
||||
Standardize:
|
||||
✓ Merged Dockerfile changes
|
||||
✓ Updated CI pipeline config
|
||||
✓ Documented in README
|
||||
✓ Added build time monitoring
|
||||
|
||||
New Problem: 12 minutes still slow when deps change
|
||||
→ Start CYCLE 2
|
||||
|
||||
|
||||
CYCLE 2
|
||||
───────
|
||||
PLAN:
|
||||
Problem: Build still 12 min when dependencies change
|
||||
Current State: npm install rebuilds all packages
|
||||
Root Cause: Some packages compile from source
|
||||
Hypothesis: Pre-built binaries will reduce to <5 minutes
|
||||
Change: Use npm ci instead of install, configure binary mirrors
|
||||
Success Criteria: Build <5 minutes on dependency changes
|
||||
|
||||
DO:
|
||||
- Changed to npm ci (uses package-lock.json)
|
||||
- Added .npmrc with binary mirror configs
|
||||
- Tested across 5 dependency updates
|
||||
|
||||
CHECK:
|
||||
Results:
|
||||
- Dependency changes: 4.5 minutes ✓ (was 12)
|
||||
- Compilation errors reduced to 0 (was 3)
|
||||
Analysis: npm ci faster + more reliable, hypothesis confirmed
|
||||
|
||||
ACT:
|
||||
Standardize:
|
||||
✓ Use npm ci everywhere (local + CI)
|
||||
✓ Committed .npmrc
|
||||
✓ Updated developer onboarding docs
|
||||
|
||||
Total improvement: 45min → 4.5min (90% reduction)
|
||||
✓ PDCA complete, monitor for 2 weeks
|
||||
```
|
||||
|
||||
### Example 2: Reducing Production Bugs
|
||||
|
||||
```
|
||||
CYCLE 1
|
||||
───────
|
||||
PLAN:
|
||||
Problem: 8 production bugs per month
|
||||
Current State: Manual testing only, no automated tests
|
||||
Root Cause: Regressions not caught before release
|
||||
Hypothesis: Adding integration tests will reduce bugs by 50%
|
||||
Change: Implement integration test suite for critical paths
|
||||
Success Criteria: <4 bugs per month after 1 month
|
||||
|
||||
DO:
|
||||
Week 1-2: Wrote integration tests for:
|
||||
- User authentication flow
|
||||
- Payment processing
|
||||
- Data export
|
||||
Week 3: Set up CI to run tests
|
||||
Week 4: Team training on test writing
|
||||
Coverage: 3 critical paths (was 0)
|
||||
|
||||
CHECK:
|
||||
Results after 1 month:
|
||||
- Production bugs: 6 (was 8)
|
||||
- Bugs caught in CI: 4
|
||||
- Test failures (false positives): 2
|
||||
Analysis: 25% reduction, not 50% target
|
||||
Insight: Bugs are in areas without tests yet
|
||||
|
||||
ACT:
|
||||
Partially successful:
|
||||
✓ Keep existing tests (prevented 4 bugs)
|
||||
✓ Fix flaky tests
|
||||
|
||||
Adjust for CYCLE 2:
|
||||
- Expand test coverage to all user flows
|
||||
- Add tests for bug-prone areas
|
||||
→ Start CYCLE 2
|
||||
|
||||
|
||||
CYCLE 2
|
||||
───────
|
||||
PLAN:
|
||||
Problem: Still 6 bugs/month, need <4
|
||||
Current State: 3 critical paths tested, 12 paths total
|
||||
Root Cause: UI interaction bugs not covered by integration tests
|
||||
Hypothesis: E2E tests for all user flows will reach <4 bugs
|
||||
Change: Add E2E tests for remaining 9 flows
|
||||
Success Criteria: <4 bugs per month, 80% coverage
|
||||
|
||||
DO:
|
||||
Week 1-3: Added E2E tests for all user flows
|
||||
Week 4: Set up visual regression testing
|
||||
Coverage: 12/12 user flows (was 3/12)
|
||||
|
||||
CHECK:
|
||||
Results after 1 month:
|
||||
- Production bugs: 3 ✓ (was 6)
|
||||
- Bugs caught in CI: 8 (was 4)
|
||||
- Test maintenance time: 3 hours/week
|
||||
Analysis: Target achieved! 62% reduction from baseline
|
||||
|
||||
ACT:
|
||||
Standardize:
|
||||
✓ Made tests required for all PRs
|
||||
✓ Added test checklist to PR template
|
||||
✓ Scheduled weekly test review
|
||||
✓ Created runbook for test maintenance
|
||||
|
||||
Monitor: Track bug rate and test effectiveness monthly
|
||||
✓ PDCA complete
|
||||
```
|
||||
|
||||
### Example 3: Improving Code Review Speed
|
||||
|
||||
```
|
||||
PLAN:
|
||||
Problem: PRs take 3 days average to merge
|
||||
Current State: Manual review, no automation
|
||||
Root Cause: Reviewers wait to see if CI passes before reviewing
|
||||
Hypothesis: Auto-review + faster CI will reduce to <1 day
|
||||
Change: Add automated checks + split long CI jobs
|
||||
Success Criteria: Average time to merge <1 day (8 hours)
|
||||
|
||||
DO:
|
||||
- Set up automated linter checks (fail fast)
|
||||
- Split test suite into parallel jobs
|
||||
- Added PR template with self-review checklist
|
||||
- CI time: 45min → 15min
|
||||
- Tracked PR merge time for 2 weeks
|
||||
|
||||
CHECK:
|
||||
Results:
|
||||
- Average time to merge: 1.5 days (was 3)
|
||||
- Time waiting for CI: 15min (was 45min)
|
||||
- Time waiting for review: 1.3 days (was 2+ days)
|
||||
Analysis: CI faster, but review still bottleneck
|
||||
|
||||
ACT:
|
||||
Partially successful:
|
||||
✓ Keep fast CI improvements
|
||||
|
||||
Insight: Real bottleneck is reviewer availability, not CI
|
||||
Adjust for new PDCA:
|
||||
- Focus on reviewer availability/notification
|
||||
- Consider rotating review assignments
|
||||
→ Start new PDCA cycle with different hypothesis
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Start with small, measurable changes (not big overhauls)
|
||||
- PDCA is iterative—multiple cycles normal
|
||||
- Failed experiments are learning opportunities
|
||||
- Document everything: easier to see patterns across cycles
|
||||
- Success criteria must be measurable (not subjective)
|
||||
- Phase 4 "Act" determines next cycle or completion
|
||||
- If stuck after 3 cycles, revisit root cause analysis
|
||||
- PDCA works for technical and process improvements
|
||||
- Use `/analyse-problem` (A3) for comprehensive documentation
|
||||
commands/root-cause-tracing.md
|
||||
---
|
||||
name: root-cause-tracing
|
||||
description: Use when errors occur deep in execution and you need to trace back to find the original trigger - systematically traces bugs backward through call stack, adding instrumentation when needed, to identify source of invalid data or incorrect behavior
|
||||
---
|
||||
|
||||
# Root Cause Tracing
|
||||
|
||||
## Overview
|
||||
|
||||
Bugs often manifest deep in the call stack (git init in wrong directory, file created in wrong location, database opened with wrong path). Your instinct is to fix where the error appears, but that's treating a symptom.
|
||||
|
||||
**Core principle:** Trace backward through the call chain until you find the original trigger, then fix at the source.
|
||||
|
||||
## When to Use
|
||||
|
||||
```dot
|
||||
digraph when_to_use {
|
||||
"Bug appears deep in stack?" [shape=diamond];
|
||||
"Can trace backwards?" [shape=diamond];
|
||||
"Fix at symptom point" [shape=box];
|
||||
"Trace to original trigger" [shape=box];
|
||||
"BETTER: Also add defense-in-depth" [shape=box];
|
||||
|
||||
"Bug appears deep in stack?" -> "Can trace backwards?" [label="yes"];
|
||||
"Can trace backwards?" -> "Trace to original trigger" [label="yes"];
|
||||
"Can trace backwards?" -> "Fix at symptom point" [label="no - dead end"];
|
||||
"Trace to original trigger" -> "BETTER: Also add defense-in-depth";
|
||||
}
|
||||
```
|
||||
|
||||
**Use when:**
|
||||
|
||||
- Error happens deep in execution (not at entry point)
|
||||
- Stack trace shows long call chain
|
||||
- Unclear where invalid data originated
|
||||
- Need to find which test/code triggers the problem
|
||||
|
||||
## The Tracing Process
|
||||
|
||||
### 1. Observe the Symptom
|
||||
|
||||
```
|
||||
Error: git init failed in /Users/jesse/project/packages/core
|
||||
```
|
||||
|
||||
### 2. Find Immediate Cause
|
||||
|
||||
**What code directly causes this?**
|
||||
|
||||
```typescript
|
||||
await execFileAsync('git', ['init'], { cwd: projectDir });
|
||||
```
|
||||
|
||||
### 3. Ask: What Called This?
|
||||
|
||||
```typescript
|
||||
WorktreeManager.createSessionWorktree(projectDir, sessionId)
|
||||
→ called by Session.initializeWorkspace()
|
||||
→ called by Session.create()
|
||||
→ called by test at Project.create()
|
||||
```
|
||||
|
||||
### 4. Keep Tracing Up
|
||||
|
||||
**What value was passed?**
|
||||
|
||||
- `projectDir = ''` (empty string!)
|
||||
- Empty string as `cwd` resolves to `process.cwd()`
|
||||
- That's the source code directory!
|
||||
|
||||
### 5. Find Original Trigger
|
||||
|
||||
**Where did empty string come from?**
|
||||
|
||||
```typescript
|
||||
const context = setupCoreTest(); // Returns { tempDir: '' }
|
||||
Project.create('name', context.tempDir); // Accessed before beforeEach!
|
||||
```
|
||||
|
||||
## Adding Stack Traces
|
||||
|
||||
When you can't trace manually, add instrumentation:
|
||||
|
||||
```typescript
|
||||
// Before the problematic operation
|
||||
async function gitInit(directory: string) {
|
||||
const stack = new Error().stack;
|
||||
console.error('DEBUG git init:', {
|
||||
directory,
|
||||
cwd: process.cwd(),
|
||||
nodeEnv: process.env.NODE_ENV,
|
||||
stack,
|
||||
});
|
||||
|
||||
await execFileAsync('git', ['init'], { cwd: directory });
|
||||
}
|
||||
```
|
||||
|
||||
**Critical:** Use `console.error()` in tests (not logger - may not show)
|
||||
|
||||
**Run and capture:**
|
||||
|
||||
```bash
|
||||
npm test 2>&1 | grep 'DEBUG git init'
|
||||
```
|
||||
|
||||
**Analyze stack traces:**
|
||||
|
||||
- Look for test file names
|
||||
- Find the line number triggering the call
|
||||
- Identify the pattern (same test? same parameter?)
|
||||
|
||||
## Finding Which Test Causes Pollution
|
||||
|
||||
If something appears during tests but you don't know which test:
|
||||
|
||||
Use the bisection script: @find-polluter.sh
|
||||
|
||||
```bash
|
||||
./find-polluter.sh '.git' 'src/**/*.test.ts'
|
||||
```
|
||||
|
||||
Runs tests one-by-one, stops at first polluter. See script for usage.
|
||||
|
||||
## Real Example: Empty projectDir
|
||||
|
||||
**Symptom:** `.git` created in `packages/core/` (source code)
|
||||
|
||||
**Trace chain:**
|
||||
|
||||
1. `git init` runs in `process.cwd()` ← empty cwd parameter
|
||||
2. WorktreeManager called with empty projectDir
|
||||
3. Session.create() passed empty string
|
||||
4. Test accessed `context.tempDir` before beforeEach
|
||||
5. setupCoreTest() returns `{ tempDir: '' }` initially
|
||||
|
||||
**Root cause:** Top-level variable initialization accessing empty value
|
||||
|
||||
**Fix:** Made tempDir a getter that throws if accessed before beforeEach
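
A minimal sketch of what that guarded getter could look like, assuming a vitest-style runner; the helper shape is reconstructed from the trace above, not the actual implementation:

```typescript
import { beforeEach } from 'vitest';
import { mkdtemp } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Before: the helper returned { tempDir: '' }, so a test reading
// tempDir at module scope silently got an empty string, which later
// became the cwd for `git init`.
// After: reading tempDir before beforeEach has run fails loudly.
export function setupCoreTest() {
  let tempDir: string | undefined;

  beforeEach(async () => {
    tempDir = await mkdtemp(join(tmpdir(), 'core-test-'));
  });

  return {
    get tempDir(): string {
      if (!tempDir) {
        throw new Error('tempDir accessed before beforeEach ran');
      }
      return tempDir;
    },
  };
}
```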
|
||||
|
||||
**Also added defense-in-depth:**
|
||||
|
||||
- Layer 1: Project.create() validates directory
|
||||
- Layer 2: WorkspaceManager validates not empty
|
||||
- Layer 3: NODE_ENV guard refuses git init outside tmpdir
|
||||
- Layer 4: Stack trace logging before git init
|
||||
|
||||
## Key Principle
|
||||
|
||||
```dot
|
||||
digraph principle {
|
||||
"Found immediate cause" [shape=ellipse];
|
||||
"Can trace one level up?" [shape=diamond];
|
||||
"Trace backwards" [shape=box];
|
||||
"Is this the source?" [shape=diamond];
|
||||
"Fix at source" [shape=box];
|
||||
"Add validation at each layer" [shape=box];
|
||||
"Bug impossible" [shape=doublecircle];
|
||||
"NEVER fix just the symptom" [shape=octagon, style=filled, fillcolor=red, fontcolor=white];
|
||||
|
||||
"Found immediate cause" -> "Can trace one level up?";
|
||||
"Can trace one level up?" -> "Trace backwards" [label="yes"];
|
||||
"Can trace one level up?" -> "NEVER fix just the symptom" [label="no"];
|
||||
"Trace backwards" -> "Is this the source?";
|
||||
"Is this the source?" -> "Trace backwards" [label="no - keeps going"];
|
||||
"Is this the source?" -> "Fix at source" [label="yes"];
|
||||
"Fix at source" -> "Add validation at each layer";
|
||||
"Add validation at each layer" -> "Bug impossible";
|
||||
}
|
||||
```
|
||||
|
||||
**NEVER fix just where the error appears.** Trace back to find the original trigger.
|
||||
|
||||
## Stack Trace Tips
|
||||
|
||||
- **In tests:** Use `console.error()`, not the logger - the logger may be suppressed
- **Before the operation:** Log before the dangerous operation, not after it fails
- **Include context:** Directory, cwd, environment variables, timestamps
- **Capture the stack:** `new Error().stack` shows the complete call chain
|
||||
|
||||
## Real-World Impact
|
||||
|
||||
From debugging session (2025-10-03):
|
||||
|
||||
- Found root cause through 5-level trace
|
||||
- Fixed at source (getter validation)
|
||||
- Added 4 layers of defense
|
||||
- 1847 tests passed, zero pollution
|
||||
commands/why.md
|
||||
---
|
||||
description: Iterative Five Whys root cause analysis drilling from symptoms to fundamentals
|
||||
argument-hint: Optional issue or symptom description
|
||||
---
|
||||
|
||||
# Five Whys Analysis
|
||||
|
||||
Apply Five Whys root cause analysis to investigate issues by iteratively asking "why" to drill from symptoms to root causes.
|
||||
|
||||
## Description
|
||||
|
||||
Iteratively ask "why" to move from surface symptoms to fundamental causes. Identifies systemic issues rather than quick fixes.
|
||||
|
||||
## Usage
|
||||
|
||||
`/why [issue_description]`
|
||||
|
||||
## Variables
|
||||
|
||||
- ISSUE: Problem or symptom to analyze (default: prompt for input)
|
||||
- DEPTH: Number of "why" iterations (default: 5, adjust as needed)
|
||||
|
||||
## Steps
|
||||
|
||||
1. State the problem clearly
|
||||
2. Ask "Why did this happen?" and document the answer
|
||||
3. For that answer, ask "Why?" again
|
||||
4. Continue until reaching root cause (usually 5 iterations)
|
||||
5. Validate by working backwards: root cause → symptom
|
||||
6. Explore branches if multiple causes emerge
|
||||
7. Propose solutions addressing root causes, not symptoms
|
||||
|
||||
## Examples
|
||||
|
||||
### Example 1: Production Bug
|
||||
|
||||
```
|
||||
Problem: Users see 500 error on checkout
|
||||
Why 1: Payment service throws exception
|
||||
Why 2: Request timeout after 30 seconds
|
||||
Why 3: Database query takes 45 seconds
|
||||
Why 4: Missing index on transactions table
|
||||
Why 5: Index creation wasn't in migration scripts
|
||||
Root Cause: Migration review process doesn't check query performance
|
||||
|
||||
Solution: Add query performance checks to migration PR template
|
||||
```
|
||||
|
||||
### Example 2: CI/CD Pipeline Failures
|
||||
|
||||
```
|
||||
Problem: E2E tests fail intermittently
|
||||
Why 1: Race condition in async test setup
|
||||
Why 2: Test doesn't wait for database seed completion
|
||||
Why 3: Seed function doesn't return promise
|
||||
Why 4: TypeScript didn't catch missing return type
|
||||
Why 5: strict mode not enabled in test config
|
||||
Root Cause: Inconsistent TypeScript config between src and tests
|
||||
|
||||
Solution: Unify TypeScript config, enable strict mode everywhere
|
||||
```
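
Why 3 in this chain (the seed function doesn't return a promise) is easy to picture with a small sketch; `Db`, `db`, and `fixtures` are illustrative assumptions:

```typescript
interface Db {
  insert(table: string, rows: unknown[]): Promise<void>;
}

declare const db: Db;                         // provided by the test harness (assumed)
declare const fixtures: { users: unknown[] }; // assumed fixture data

// Flaky version: the insert is a floating promise, so setup "finishes"
// before the rows exist and the E2E test races the seed.
//   function seedUsers() {
//     db.insert('users', fixtures.users);
//   }

// Fixed: make the seed async and await the insert, so setup can
// `await seedUsers()`; with strict TypeScript config, missing awaits
// on Promise-returning setup steps are far easier to catch.
export async function seedUsers(): Promise<void> {
  await db.insert('users', fixtures.users);
}

// beforeEach(async () => { await seedUsers(); });
```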
|
||||
|
||||
### Example 3: Multi-Branch Analysis
|
||||
|
||||
```
|
||||
Problem: Feature deployment takes 2 hours
|
||||
|
||||
Branch A (Build):
|
||||
Why 1: Docker build takes 90 minutes
|
||||
Why 2: No layer caching
|
||||
Why 3: Dependencies reinstalled every time
|
||||
Why 4: Cache invalidated by timestamp in Dockerfile
|
||||
Root Cause A: Dockerfile uses current timestamp for versioning
|
||||
|
||||
Branch B (Tests):
|
||||
Why 1: Test suite takes 30 minutes
|
||||
Why 2: Integration tests run sequentially
|
||||
Why 3: Test runner config has maxWorkers: 1
|
||||
Why 4: Previous developer disabled parallelism due to flaky tests
|
||||
Root Cause B: Flaky tests masked by disabling parallelism
|
||||
|
||||
Solutions:
|
||||
A) Remove timestamp from Dockerfile, use git SHA
|
||||
B) Fix flaky tests, re-enable parallel test execution
|
||||
```
|
||||
|
||||
## Notes
|
||||
|
||||
- Don't stop at symptoms; keep digging for systemic issues
- Stop when you hit systemic/process causes, not just technical details - the magic isn't in exactly 5 whys
- Multiple root causes are common - explore different branches separately
- If "human error" appears, keep digging: why was the error possible?
- Consider both technical and process-related causes
- Document every "why" for future reference
- Root causes usually involve missing validation, missing docs, unclear process, or missing automation
- Test solutions: implement → verify the symptom is resolved → monitor for recurrence