Initial commit

Zhongwei Li
2025-11-30 08:42:54 +08:00
commit 8aa65dc62a
10 changed files with 2457 additions and 0 deletions

15
.claude-plugin/plugin.json Normal file

@@ -0,0 +1,15 @@
{
"name": "kaizen",
"description": "Inspired by Japanese continuous improvement philosophy, Agile and Lean development practices. Introduces commands for analysis of root cause of issues and problems, including 5 Whys, Cause and Effect Analysis, and other techniques.",
"version": "1.0.0",
"author": {
"name": "Vlad Goncharov",
"email": "vlad.goncharov@neolab.finance"
},
"skills": [
"./skills"
],
"commands": [
"./commands"
]
}

3
README.md Normal file

@@ -0,0 +1,3 @@
# kaizen
Inspired by Japanese continuous improvement philosophy, Agile and Lean development practices. Introduces commands for analysis of root cause of issues and problems, including 5 Whys, Cause and Effect Analysis, and other techniques.

389
commands/analyse-problem.md Normal file

@@ -0,0 +1,389 @@
---
description: Comprehensive A3 one-page problem analysis with root cause and action plan
argument-hint: Optional problem description to document
---
# A3 Problem Analysis
Apply A3 problem-solving format for comprehensive, single-page problem documentation and resolution planning.
## Description
Structured one-page analysis format covering: Background, Current Condition, Goal, Root Cause Analysis, Countermeasures, Implementation Plan, and Follow-up. Named after A3 paper size; emphasizes concise, complete documentation.
## Usage
`/analyse-problem [problem_description]`
## Variables
- PROBLEM: Issue to analyze (default: prompt for input)
- OUTPUT_FORMAT: markdown or text (default: markdown)
## Steps
1. **Background**: Why this problem matters (context, business impact)
2. **Current Condition**: What's happening now (data, metrics, examples)
3. **Goal/Target**: What success looks like (specific, measurable)
4. **Root Cause Analysis**: Why problem exists (use 5 Whys or Fishbone)
5. **Countermeasures**: Proposed solutions addressing root causes
6. **Implementation Plan**: Who, what, when, how
7. **Follow-up**: How to verify success and prevent recurrence
## A3 Template
```
═══════════════════════════════════════════════════════════════
A3 PROBLEM ANALYSIS
═══════════════════════════════════════════════════════════════
TITLE: [Concise problem statement]
OWNER: [Person responsible]
DATE: [YYYY-MM-DD]
┌─────────────────────────────────────────────────────────────┐
│ 1. BACKGROUND (Why this matters) │
├─────────────────────────────────────────────────────────────┤
│ [Context, impact, urgency, who's affected] │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. CURRENT CONDITION (What's happening) │
├─────────────────────────────────────────────────────────────┤
│ [Facts, data, metrics, examples - no opinions] │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. GOAL/TARGET (What success looks like) │
├─────────────────────────────────────────────────────────────┤
│ [Specific, measurable, time-bound targets] │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 4. ROOT CAUSE ANALYSIS (Why problem exists) │
├─────────────────────────────────────────────────────────────┤
│ [5 Whys, Fishbone, data analysis] │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 5. COUNTERMEASURES (Solutions addressing root causes) │
├─────────────────────────────────────────────────────────────┤
│ [Specific actions, not vague intentions] │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 6. IMPLEMENTATION PLAN (Who, What, When) │
├─────────────────────────────────────────────────────────────┤
│ [Timeline, responsibilities, dependencies, milestones] │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 7. FOLLOW-UP (Verification & Prevention) │
├─────────────────────────────────────────────────────────────┤
│ [Success metrics, monitoring plan, review dates] │
└─────────────────────────────────────────────────────────────┘
═══════════════════════════════════════════════════════════════
```
## Examples
### Example 1: Database Connection Pool Exhaustion
```
═══════════════════════════════════════════════════════════════
A3 PROBLEM ANALYSIS
═══════════════════════════════════════════════════════════════
TITLE: API Downtime Due to Connection Pool Exhaustion
OWNER: Backend Team Lead
DATE: 2024-11-14
┌─────────────────────────────────────────────────────────────┐
│ 1. BACKGROUND │
├─────────────────────────────────────────────────────────────┤
│ • API goes down 2-3x per week during peak hours │
│ • Affects 10,000+ users, average 15min downtime │
│ • Revenue impact: ~$5K per incident │
│ • Customer satisfaction score dropped from 4.5 to 3.8 │
│ • Started 3 weeks ago after traffic increased 40% │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. CURRENT CONDITION │
├─────────────────────────────────────────────────────────────┤
│ Observations: │
│ • Connection pool size: 10 (unchanged since launch) │
│ • Peak concurrent users: 500 (was 300 three weeks ago) │
│ • Average request time: 200ms (was 150ms) │
│ • Connections leaked: ~2 per hour (never released) │
│ • Error: "Connection pool exhausted" in logs │
│ │
│ Pattern: │
│ • Occurs at 2pm-4pm daily (peak traffic) │
│ • Gradual degradation over 30 minutes │
│ • Recovery requires app restart │
│ • Long-running queries block pool (some 30+ seconds) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. GOAL/TARGET │
├─────────────────────────────────────────────────────────────┤
│ • Zero downtime due to connection exhaustion │
│ • Support 1000 concurrent users (2x current peak) │
│ • All connections released within 5 seconds │
│ • Achieve within 1 week │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 4. ROOT CAUSE ANALYSIS │
├─────────────────────────────────────────────────────────────┤
│ 5 Whys: │
│ Problem: Connection pool exhausted │
│ Why 1: All 10 connections in use, none available │
│ Why 2: Connections not released after requests │
│ Why 3: Error handling doesn't close connections │
│ Why 4: Try-catch blocks missing .finally() │
│ Why 5: No code review checklist for resource cleanup │
│ │
│ Contributing factors: │
│ • Pool size too small for current load │
│ • No connection timeout configured (hangs forever) │
│ • Slow queries hold connections longer │
│ • No monitoring/alerting on pool metrics │
│ │
│ ROOT CAUSE: Systematic issue with resource cleanup + │
│ insufficient pool sizing │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 5. COUNTERMEASURES │
├─────────────────────────────────────────────────────────────┤
│ Immediate (This Week): │
│ 1. Audit all DB code, add .finally() for connection release │
│ 2. Increase pool size: 10 → 30 │
│ 3. Add connection timeout: 10 seconds │
│ 4. Add pool monitoring & alerts (>80% used) │
│ │
│ Short-term (2 Weeks): │
│ 5. Optimize slow queries (add indexes) │
│ 6. Implement connection pooling best practices doc │
│ 7. Add automated test for connection leaks │
│ │
│ Long-term (1 Month): │
│ 8. Migrate to connection pool library with auto-release │
│ 9. Add linter rule detecting missing .finally() │
│ 10. Create PR checklist for resource management │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 6. IMPLEMENTATION PLAN │
├─────────────────────────────────────────────────────────────┤
│ Week 1 (Nov 14-18): │
│ • Day 1-2: Audit & fix connection leaks [Dev Team] │
│ • Day 2: Increase pool size, add timeout [DevOps] │
│ • Day 3: Set up monitoring [SRE] │
│ • Day 4: Test under load [QA] │
│ • Day 5: Deploy to production [DevOps] │
│ │
│ Week 2 (Nov 21-25): │
│ • Optimize identified slow queries [DB Team] │
│ • Write best practices doc [Tech Writer + Dev Lead] │
│ • Create connection leak test [QA Team] │
│ │
│ Week 3-4 (Nov 28 - Dec 9): │
│ • Evaluate connection pool libraries [Dev Team] │
│ • Add linter rules [Dev Lead] │
│ • Update PR template [Dev Lead] │
│ │
│ Dependencies: None blocking Week 1 fixes │
│ Resources: 2 developers, 1 DevOps, 1 SRE │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 7. FOLLOW-UP │
├─────────────────────────────────────────────────────────────┤
│ Success Metrics: │
│ • Zero downtime incidents (monitor 4 weeks) │
│ • Pool usage stays <80% during peak │
│ • No connection leaks detected │
│ • Response time <200ms p95 │
│ │
│ Monitoring: │
│ • Daily: Check pool usage dashboard │
│ • Weekly: Review connection leak alerts │
│ • Bi-weekly: Team retrospective on progress │
│ │
│ Review Dates: │
│ • Week 1 (Nov 18): Verify immediate fixes effective │
│ • Week 2 (Nov 25): Assess optimization impact │
│ • Week 4 (Dec 9): Final review, close A3 │
│ │
│ Prevention: │
│ • Add connection handling to onboarding │
│ • Monthly audit of resource management code │
│ • Include pool metrics in SRE runbook │
└─────────────────────────────────────────────────────────────┘
═══════════════════════════════════════════════════════════════
```
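Countermeasure 1 above calls for adding `.finally()` so connections are always released. A minimal sketch of that pattern, using a hypothetical pool interface defined only for illustration (an async `try/finally` block gives the same guarantee as a promise `.finally()` chain):
```typescript
// Hypothetical pool interface, defined only for this illustration.
interface PooledConnection {
  query(sql: string, params?: unknown[]): Promise<unknown[]>;
  release(): void;
}
interface Pool {
  connect(): Promise<PooledConnection>;
}

// Leak-prone: if query() throws, release() never runs and the
// connection stays checked out until the pool is exhausted.
async function findUserLeaky(pool: Pool, id: string) {
  const conn = await pool.connect();
  const rows = await conn.query('SELECT * FROM users WHERE id = $1', [id]);
  conn.release();
  return rows;
}

// Safe: the finally block releases the connection on success,
// on error, and on early return.
async function findUser(pool: Pool, id: string) {
  const conn = await pool.connect();
  try {
    return await conn.query('SELECT * FROM users WHERE id = $1', [id]);
  } finally {
    conn.release();
  }
}
```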
### Example 2: Security Vulnerability in Production
```
═══════════════════════════════════════════════════════════════
A3 PROBLEM ANALYSIS
═══════════════════════════════════════════════════════════════
TITLE: Critical SQL Injection Vulnerability
OWNER: Security Team Lead
DATE: 2024-11-14
┌─────────────────────────────────────────────────────────────┐
│ 1. BACKGROUND │
├─────────────────────────────────────────────────────────────┤
│ • Critical security vulnerability reported by researcher │
│ • SQL injection in user search endpoint │
│ • Potential data breach affecting 100K+ user records │
│ • CVSS score: 9.8 (Critical) │
│ • Vulnerability exists in production for 6 months │
│ • Similar issue found in 2 other endpoints (scanning) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 2. CURRENT CONDITION │
├─────────────────────────────────────────────────────────────┤
│ Vulnerable Code: │
│ • /api/users/search endpoint uses string concatenation │
│ • Input: search query (user-provided, not sanitized) │
│ • Pattern: `SELECT * FROM users WHERE name = '${input}'` │
│ │
│ Scope: │
│ • 3 endpoints vulnerable (search, filter, export) │
│ • All use same unsafe pattern │
│ • No parameterized queries │
│ • No input validation layer │
│ │
│ Risk Assessment: │
│ • Exploitable from public internet │
│ • No evidence of exploitation (logs checked) │
│ • Similar code in admin panel (higher privilege) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 3. GOAL/TARGET │
├─────────────────────────────────────────────────────────────┤
│ • Patch all SQL injection vulnerabilities within 24 hours │
│ • Zero SQL injection vulnerabilities in codebase │
│ • Prevent similar issues in future code │
│ • Verify no unauthorized access occurred │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 4. ROOT CAUSE ANALYSIS │
├─────────────────────────────────────────────────────────────┤
│ 5 Whys: │
│ Problem: SQL injection vulnerability in production │
│ Why 1: User input concatenated directly into SQL │
│ Why 2: Developer wasn't aware of SQL injection risks │
│ Why 3: No security training for new developers │
│ Why 4: Security not part of onboarding checklist │
│ Why 5: Security team not involved in development process │
│ │
│ Contributing Factors (Fishbone): │
│ • Process: No security code review │
│ • Technology: ORM not used consistently │
│ • People: Knowledge gap in secure coding │
│ • Methods: No SAST tools in CI/CD │
│ │
│ ROOT CAUSE: Security not integrated into development │
│ process, training gap │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 5. COUNTERMEASURES │
├─────────────────────────────────────────────────────────────┤
│ Immediate (24 Hours): │
│ 1. Patch all 3 vulnerable endpoints │
│ 2. Deploy hotfix to production │
│ 3. Scan codebase for similar patterns │
│ 4. Review access logs for exploitation attempts │
│ │
│ Short-term (1 Week): │
│ 5. Replace all raw SQL with parameterized queries │
│ 6. Add input validation middleware │
│ 7. Set up SAST tool in CI (Snyk/SonarQube) │
│ 8. Security team review of all data access code │
│ │
│ Long-term (1 Month): │
│ 9. Mandatory security training for all developers │
│ 10. Add security review to PR process │
│ 11. Migrate to ORM for all database access │
│ 12. Implement security champion program │
│ 13. Quarterly security audits │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 6. IMPLEMENTATION PLAN │
├─────────────────────────────────────────────────────────────┤
│ Hour 0-4 (Emergency Response): │
│ • Write & test patches [Security + Senior Dev] │
│ • Emergency PR review [CTO + Tech Lead] │
│ • Deploy to staging [DevOps] │
│ │
│ Hour 4-24 (Production Deploy): │
│ • Deploy hotfix [DevOps + On-call] │
│ • Monitor for issues [SRE Team] │
│ • Scan logs for exploitation [Security Team] │
│ • Notify stakeholders [Security Lead + CEO] │
│ │
│ Day 2-7: │
│ • Full codebase remediation [Dev Team] │
│ • SAST tool setup [DevOps + Security] │
│ • Security review [External Auditor] │
│ │
│ Week 2-4: │
│ • Security training program [Security + HR] │
│ • Process improvements [Engineering Leadership] │
│ │
│ Dependencies: External auditor availability (Week 2) │
└─────────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────────┐
│ 7. FOLLOW-UP │
├─────────────────────────────────────────────────────────────┤
│ Success Metrics: │
│ • Zero SQL injection vulnerabilities (verified by scan) │
│ • 100% of PRs pass SAST checks │
│ • 100% developer security training completion │
│ • No unauthorized access detected in log analysis │
│ │
│ Verification: │
│ • Day 1: Verify patch deployed, vulnerability closed │
│ • Week 1: External security audit confirms fixes │
│ • Week 2: SAST tool catching similar issues │
│ • Month 1: Training completion, process adoption │
│ │
│ Prevention: │
│ • SAST tools block vulnerable code in CI │
│ • Security review required for data access code │
│ • Quarterly penetration testing │
│ • Annual security training refresh │
│ │
│ Incident Report: │
│ • Post-mortem meeting: Nov 16 │
│ • Document lessons learned │
│ • Share with engineering org │
└─────────────────────────────────────────────────────────────┘
═══════════════════════════════════════════════════════════════
```
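Countermeasure 5 replaces the string-concatenation pattern shown in section 2 with parameterized queries. A minimal sketch with an illustrative driver interface (not a specific library; real drivers such as pg or mysql2 expose a similar placeholder-based API):
```typescript
// Illustrative driver interface; placeholder syntax varies by driver.
interface Db {
  query(sql: string, params?: unknown[]): Promise<unknown[]>;
}

// Vulnerable: user input becomes part of the SQL text, so an input
// like "' OR '1'='1" changes the meaning of the query.
async function searchUsersUnsafe(db: Db, name: string) {
  return db.query(`SELECT * FROM users WHERE name = '${name}'`);
}

// Safe: the input travels as a bound parameter, never as SQL text,
// so it cannot alter the query structure.
async function searchUsers(db: Db, name: string) {
  return db.query('SELECT * FROM users WHERE name = $1', [name]);
}
```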
## Notes
- A3 forces concise, complete thinking (fits on one page)
- Use data and facts, not opinions or blame
- Root cause analysis is critical—use `/why` or `/cause-and-effect`
- Countermeasures must address root causes, not symptoms
- Implementation plan needs clear ownership and timelines
- Follow-up ensures sustainable improvement
- A3 becomes historical record for organizational learning
- Update A3 as situation evolves (living document until closed)
- Consider A3 for: incidents, recurring issues, major improvements
- Overkill for: small bugs, one-line fixes, trivial issues

509
commands/analyse.md Normal file

@@ -0,0 +1,509 @@
---
description: Auto-selects best Kaizen method (Gemba Walk, Value Stream, or Muda) for target
argument-hint: Optional target description (e.g., code, workflow, or inefficiencies)
---
# Smart Analysis
Intelligently select and apply the most appropriate Kaizen analysis technique based on what you're analyzing.
## Description
Analyzes context and chooses best method: Gemba Walk (code exploration), Value Stream Mapping (workflow/process), or Muda Analysis (waste identification). Guides you through the selected technique.
## Usage
`/analyse [target_description]`
Examples:
- `/analyse authentication implementation`
- `/analyse deployment workflow`
- `/analyse codebase for inefficiencies`
## Variables
- TARGET: What to analyze (default: prompt for input)
- METHOD: Override auto-selection (gemba, vsm, muda)
## Method Selection Logic
**Gemba Walk** → When analyzing:
- Code implementation (how feature actually works)
- Gap between documentation and reality
- Understanding unfamiliar codebase areas
- Actual vs. assumed architecture
**Value Stream Mapping** → When analyzing:
- Workflows and processes (CI/CD, deployment, development)
- Bottlenecks in multi-stage pipelines
- Handoffs between teams/systems
- Time spent in each process stage
**Muda (Waste Analysis)** → When analyzing:
- Code quality and efficiency
- Technical debt
- Over-engineering or duplication
- Resource utilization
## Steps
1. Understand what's being analyzed
2. Determine best method (or use specified method)
3. Explain why this method fits
4. Guide through the analysis
5. Present findings with actionable insights
---
## Method 1: Gemba Walk
"Go and see" the actual code to understand reality vs. assumptions.
### When to Use
- Understanding how feature actually works
- Code archaeology (legacy systems)
- Finding gaps between docs and implementation
- Exploring unfamiliar areas before changes
### Process
1. **Define scope**: What code area to explore
2. **State assumptions**: What you think it does
3. **Observe reality**: Read actual code
4. **Document findings**:
- Entry points
- Actual data flow
- Surprises (differs from assumptions)
- Hidden dependencies
- Undocumented behavior
5. **Identify gaps**: Documentation vs. reality
6. **Recommend**: Update docs, refactor, or accept
### Example: Authentication System Gemba Walk
```
SCOPE: User authentication flow
ASSUMPTIONS (Before):
• JWT tokens stored in localStorage
• Single sign-on via OAuth only
• Session expires after 1 hour
• Password reset via email link
GEMBA OBSERVATIONS (Actual Code):
Entry Point: /api/auth/login (routes/auth.ts:45)
├─> AuthService.authenticate() (services/auth.ts:120)
├─> UserRepository.findByEmail() (db/users.ts:67)
├─> bcrypt.compare() (services/auth.ts:145)
└─> TokenService.generate() (services/token.ts:34)
Actual Flow:
1. Login credentials → POST /api/auth/login
2. Password hashed with bcrypt (10 rounds)
3. JWT generated with 24hr expiry (NOT 1 hour!)
4. Token stored in httpOnly cookie (NOT localStorage)
5. Refresh token in separate cookie (15 days)
6. Session data in Redis (30 days TTL)
SURPRISES:
✗ OAuth not implemented (commented out code found)
✗ Password reset is manual (admin intervention)
✗ Three different session storage mechanisms:
- Redis for session data
- Database for "remember me"
- Cookies for tokens
✗ Legacy endpoint /auth/legacy still active (no auth!)
✗ Admin users bypass rate limiting (security issue)
GAPS:
• Documentation says OAuth, code doesn't have it
• Session expiry inconsistent (docs: 1hr, code: 24hr)
• Legacy endpoint not documented (security risk)
• No mention of "remember me" in docs
RECOMMENDATIONS:
1. HIGH: Secure or remove /auth/legacy endpoint
2. HIGH: Document actual session expiry (24hr)
3. MEDIUM: Clean up or implement OAuth
4. MEDIUM: Consolidate session storage (choose one)
5. LOW: Add rate limiting for admin users
```
### Example: CI/CD Pipeline Gemba Walk
```
SCOPE: Build and deployment pipeline
ASSUMPTIONS:
• Automated tests run on every commit
• Deploy to staging automatic
• Production deploy requires approval
GEMBA OBSERVATIONS:
Actual Pipeline (.github/workflows/main.yml):
1. On push to main:
├─> Lint (2 min)
├─> Unit tests (5 min) [SKIPPED if "[skip-tests]" in commit]
├─> Build Docker image (15 min)
└─> Deploy to staging (3 min)
2. Manual trigger for production:
├─> Run integration tests (20 min) [ONLY for production!]
├─> Security scan (10 min)
└─> Deploy to production (5 min)
SURPRISES:
✗ Unit tests can be skipped with commit message flag
✗ Integration tests ONLY run for production deploy
✗ Staging deployed without integration tests
✗ No rollback mechanism (manual kubectl commands)
✗ Secrets loaded from .env file (not secrets manager)
✗ Old "hotfix" branch bypasses all checks
GAPS:
• Staging and production have different test coverage
• Documentation doesn't mention test skip flag
• Rollback process not documented or automated
• Security scan results not enforced (warning only)
RECOMMENDATIONS:
1. CRITICAL: Remove test skip flag capability
2. CRITICAL: Migrate secrets to secrets manager
3. HIGH: Run integration tests on staging too
4. HIGH: Delete or secure hotfix branch
5. MEDIUM: Add automated rollback capability
6. MEDIUM: Make security scan blocking
```
---
## Method 2: Value Stream Mapping
Map workflow stages, measure time/waste, identify bottlenecks.
### When to Use
- Process optimization (CI/CD, deployment, code review)
- Understanding multi-stage workflows
- Finding delays and handoffs
- Improving cycle time
### Process
1. **Identify start and end**: Where process begins and ends
2. **Map all steps**: Including waiting/handoff time
3. **Measure each step**:
- Processing time (work happening)
- Waiting time (idle, blocked)
- Who/what performs step
4. **Calculate metrics**:
- Total lead time
- Value-add time vs. waste time
- % efficiency (value-add / total time)
5. **Identify bottlenecks**: Longest steps, most waiting
6. **Design future state**: Optimized flow
7. **Plan improvements**: How to achieve future state
### Example: Feature Development Value Stream Map
```
CURRENT STATE: Feature request → Production
Step 1: Requirements Gathering
├─ Processing: 2 days (meetings, writing spec)
├─ Waiting: 3 days (stakeholder review)
└─ Owner: Product Manager
Step 2: Design
├─ Processing: 1 day (mockups, architecture)
├─ Waiting: 2 days (design review, feedback)
└─ Owner: Designer + Architect
Step 3: Development
├─ Processing: 5 days (coding)
├─ Waiting: 2 days (PR review queue)
└─ Owner: Developer
Step 4: Code Review
├─ Processing: 0.5 days (review)
├─ Waiting: 1 day (back-and-forth changes)
└─ Owner: Senior Developer
Step 5: QA Testing
├─ Processing: 2 days (manual testing)
├─ Waiting: 1 day (bug fixes, retest)
└─ Owner: QA Engineer
Step 6: Staging Deployment
├─ Processing: 0.5 days (deploy, smoke test)
├─ Waiting: 2 days (stakeholder UAT)
└─ Owner: DevOps
Step 7: Production Deployment
├─ Processing: 0.5 days (deploy, monitor)
├─ Waiting: 0 days
└─ Owner: DevOps
───────────────────────────────────────
METRICS:
Total Lead Time: 22.5 days
Value-Add Time: 11.5 days (work)
Waste Time: 11 days (waiting)
Efficiency: 51%
BOTTLENECKS:
1. Requirements review wait (3 days)
2. Development time (5 days)
3. Stakeholder UAT wait (2 days)
4. PR review queue (2 days)
WASTE ANALYSIS:
• Waiting for reviews/approvals: 9 days (82% of waste)
• Rework due to unclear requirements: ~1 day
• Manual testing time: 2 days
FUTURE STATE DESIGN:
Changes:
1. Async requirements approval (stakeholders have 24hr SLA)
2. Split large features into smaller increments
3. Automated testing replaces manual QA
4. PR review SLA: 4 hours max
5. Continuous deployment to staging (no approval)
6. Feature flags for production rollout (no wait)
Projected Future State:
Total Lead Time: 9 days (60% reduction)
Value-Add Time: 8 days
Waste Time: 1 day
Efficiency: 89%
IMPLEMENTATION PLAN:
Week 1: Set review SLAs, add feature flags
Week 2: Automate test suite
Week 3: Enable continuous staging deployment
Week 4: Train team on incremental delivery
```
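The metrics block above follows directly from the per-step numbers. A small sketch of that arithmetic, with the step durations copied from the current-state map (the types are illustrative):
```typescript
interface Step {
  name: string;
  processingDays: number; // value-add time
  waitingDays: number;    // waste time
}

const steps: Step[] = [
  { name: 'Requirements', processingDays: 2, waitingDays: 3 },
  { name: 'Design', processingDays: 1, waitingDays: 2 },
  { name: 'Development', processingDays: 5, waitingDays: 2 },
  { name: 'Code Review', processingDays: 0.5, waitingDays: 1 },
  { name: 'QA Testing', processingDays: 2, waitingDays: 1 },
  { name: 'Staging Deployment', processingDays: 0.5, waitingDays: 2 },
  { name: 'Production Deployment', processingDays: 0.5, waitingDays: 0 },
];

const valueAdd = steps.reduce((sum, s) => sum + s.processingDays, 0); // 11.5
const waste = steps.reduce((sum, s) => sum + s.waitingDays, 0);       // 11
const leadTime = valueAdd + waste;                                    // 22.5
const efficiency = Math.round((valueAdd / leadTime) * 100);           // 51%

console.log({ leadTime, valueAdd, waste, efficiency });
```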
### Example: Incident Response Value Stream Map
```
CURRENT STATE: Incident detected → Resolution
Step 1: Detection
├─ Processing: 0 min (automated alert)
├─ Waiting: 15 min (until someone sees alert)
└─ System: Monitoring tool
Step 2: Triage
├─ Processing: 10 min (assess severity)
├─ Waiting: 20 min (find right person)
└─ Owner: On-call engineer
Step 3: Investigation
├─ Processing: 45 min (logs, debugging)
├─ Waiting: 30 min (access to production, gather context)
└─ Owner: Engineer + SRE
Step 4: Fix Development
├─ Processing: 60 min (write fix)
├─ Waiting: 15 min (code review)
└─ Owner: Engineer
Step 5: Deployment
├─ Processing: 10 min (hotfix deploy)
├─ Waiting: 5 min (verification)
└─ Owner: SRE
Step 6: Post-Incident
├─ Processing: 20 min (update status, notify)
├─ Waiting: 0 min
└─ Owner: Engineer
───────────────────────────────────────
METRICS:
Total Lead Time: 230 min (3h 50min)
Value-Add Time: 145 min
Waste Time: 85 min (37%)
BOTTLENECKS:
1. Finding right person (20 min)
2. Gaining production access (30 min)
3. Investigation time (45 min)
IMPROVEMENTS:
1. Slack integration for alerts (reduce detection wait)
2. Auto-assign by service owner (no hunt for person)
3. Pre-approved prod access for on-call (reduce wait)
4. Runbooks for common incidents (faster investigation)
5. Automated rollback for deployment incidents
Projected improvement: 230min → 120min (48% faster)
```
---
## Method 3: Muda (Waste Analysis)
Identify seven types of waste in code and development processes.
### When to Use
- Code quality audits
- Technical debt assessment
- Process efficiency improvements
- Identifying over-engineering
### The 7 Types of Waste (Applied to Software)
**1. Overproduction**: Building more than needed
- Features no one uses
- Overly complex solutions
- Premature optimization
- Unnecessary abstractions
**2. Waiting**: Idle time
- Build/test/deploy time
- Code review delays
- Waiting for dependencies
- Blocked by other teams
**3. Transportation**: Moving things around
- Unnecessary data transformations
- API layers with no value add
- Copying data between systems
- Repeated serialization/deserialization
**4. Over-processing**: Doing more than necessary
- Excessive logging
- Redundant validations
- Over-normalized databases
- Unnecessary computation
**5. Inventory**: Work in progress
- Unmerged branches
- Half-finished features
- Untriaged bugs
- Undeployed code
**6. Motion**: Unnecessary movement
- Context switching
- Meetings without purpose
- Manual deployments
- Repetitive tasks
**7. Defects**: Rework and bugs
- Production bugs
- Technical debt
- Flaky tests
- Incomplete features
### Process
1. **Define scope**: Codebase area or process
2. **Examine for each waste type**
3. **Quantify impact** (time, complexity, cost)
4. **Prioritize by impact**
5. **Propose elimination strategies**
### Example: API Codebase Waste Analysis
```
SCOPE: REST API backend (50K LOC)
1. OVERPRODUCTION
Found:
• 15 API endpoints with zero usage (last 90 days)
• Generic "framework" built for "future flexibility" (unused)
• Premature microservices split (2 services, could be 1)
• Feature flags for 12 features (10 fully rolled out, flags kept)
Impact: 8K LOC maintained for no reason
Recommendation: Delete unused endpoints, remove stale flags
2. WAITING
Found:
• CI pipeline: 45 min (slow Docker builds)
• PR review time: avg 2 days
• Deployment to staging: manual, takes 1 hour
Impact: 2.5 days wasted per feature
Recommendation: Cache Docker layers, PR review SLA, automate staging
3. TRANSPORTATION
Found:
• Data transformed 4 times between DB and API response:
DB → ORM → Service → DTO → Serializer
• Request/response logged 3 times (middleware, handler, service)
• Files uploaded → S3 → CloudFront → Local cache (unnecessary)
Impact: 200ms avg response time overhead
Recommendation: Reduce transformation layers, consolidate logging
4. OVER-PROCESSING
Found:
• Every request validates auth token (even cached)
• Database queries fetch all columns (SELECT *)
• JSON responses include full object graphs (nested 5 levels)
• Logs every database query in production (verbose)
Impact: 40% higher database load, 3x log storage
Recommendation: Cache auth checks, selective fields, trim responses
5. INVENTORY
Found:
• 23 open PRs (8 abandoned, 6+ months old)
• 5 feature branches unmerged (completed but not deployed)
• 147 open bugs (42 duplicates, 60 not reproducible)
• 12 hotfix commits not backported to main
Impact: Context overhead, merge conflicts, lost work
Recommendation: Close stale PRs, bug triage, deploy pending features
6. MOTION
Found:
• Developers switch between 4 tools for one deployment
• Manual database migrations (error-prone, slow)
• Environment config spread across 6 files
• Copy-paste secrets to .env files
Impact: 30min per deployment, frequent mistakes
Recommendation: Unified deployment tool, automate migrations
7. DEFECTS
Found:
• 12 production bugs per month
• 15% flaky test rate (wasted retry time)
• Technical debt in auth module (refactor needed)
• Incomplete error handling (crashes instead of graceful)
Impact: Customer complaints, rework, downtime
Recommendation: Stabilize tests, refactor auth, add error boundaries
───────────────────────────────────────
SUMMARY
Total Waste Identified:
• Code: 8K LOC doing nothing
• Time: 2.5 days per feature
• Performance: 200ms overhead per request
• Effort: 30min per deployment
Priority Fixes (by impact):
1. HIGH: Automate deployments (reduces Motion + Waiting)
2. HIGH: Fix flaky tests (reduces Defects)
3. MEDIUM: Remove unused code (reduces Overproduction)
4. MEDIUM: Optimize data transformations (reduces Transportation)
5. LOW: Triage bug backlog (reduces Inventory)
Estimated Recovery:
• 20% faster feature delivery
• 50% fewer production issues
• 30% less operational overhead
```
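One of the smaller fixes recommended under Over-processing, caching auth checks, can be as simple as memoizing a verified token for a short window. A minimal sketch, where `verifyToken` stands in for whatever verification the codebase actually performs and the TTL is an assumption to weigh against security requirements:
```typescript
interface Claims {
  userId: string;
  expiresAt: number; // epoch milliseconds
}

// Stand-in for the real (expensive) verification, e.g. a JWT
// signature check; replace with the project's actual implementation.
async function verifyToken(token: string): Promise<Claims> {
  return { userId: token.slice(0, 8), expiresAt: Date.now() + 3_600_000 };
}

const CACHE_TTL_MS = 60_000; // assumption: 1 minute of staleness is acceptable
const cache = new Map<string, { claims: Claims; cachedAt: number }>();

// Verify once, then reuse the result for a short window instead of
// re-validating the same token on every request. A production version
// would also bound the cache size and evict expired entries.
async function verifyTokenCached(token: string): Promise<Claims> {
  const now = Date.now();
  const hit = cache.get(token);
  if (hit && now - hit.cachedAt < CACHE_TTL_MS && hit.claims.expiresAt > now) {
    return hit.claims;
  }
  const claims = await verifyToken(token);
  cache.set(token, { claims, cachedAt: now });
  return claims;
}
```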
---
## Notes
- Method selection is contextual—choose what fits best
- Can combine methods (Gemba Walk → Muda Analysis)
- Start with Gemba Walk when unfamiliar with area
- Use VSM for process optimization
- Use Muda for efficiency and cleanup
- All methods should lead to actionable improvements
- Document findings for organizational learning
- Consider using `/analyse-problem` (A3) for comprehensive documentation of findings

197
commands/cause-and-effect.md Normal file

@@ -0,0 +1,197 @@
---
description: Systematic Fishbone analysis exploring problem causes across six categories
argument-hint: Optional problem description to analyze
---
# Cause and Effect Analysis
Apply Fishbone (Ishikawa) diagram analysis to systematically explore all potential causes of a problem across multiple categories.
## Description
Systematically examine potential causes across six categories: People, Process, Technology, Environment, Methods, and Materials. Creates structured "fishbone" view identifying contributing factors.
## Usage
`/cause-and-effect [problem_description]`
## Variables
- PROBLEM: Issue to analyze (default: prompt for input)
- CATEGORIES: Categories to explore (default: all six)
## Steps
1. State the problem clearly (the "head" of the fish)
2. For each category, brainstorm potential causes:
- **People**: Skills, training, communication, team dynamics
- **Process**: Workflows, procedures, standards, reviews
- **Technology**: Tools, infrastructure, dependencies, configuration
- **Environment**: Workspace, deployment targets, external factors
- **Methods**: Approaches, patterns, architectures, practices
- **Materials**: Data, dependencies, third-party services, resources
3. For each potential cause, ask "why" to dig deeper
4. Identify which causes are contributing vs. root causes
5. Prioritize causes by impact and likelihood
6. Propose solutions for highest-priority causes
## Examples
### Example 1: API Response Latency
```
Problem: API responses take 3+ seconds (target: <500ms)
PEOPLE
├─ Team unfamiliar with performance optimization
├─ No one owns performance monitoring
└─ Frontend team doesn't understand backend constraints
PROCESS
├─ No performance testing in CI/CD
├─ No SLA defined for response times
└─ Performance regression not caught in code review
TECHNOLOGY
├─ Database queries not optimized
│ └─ Why: No query analysis tools in place
├─ N+1 queries in ORM
│ └─ Why: Eager loading not configured
├─ No caching layer
│ └─ Why: Redis not in tech stack
└─ Synchronous external API calls
└─ Why: No async architecture in place
ENVIRONMENT
├─ Production uses smaller database instance than needed
├─ No CDN for static assets
└─ Single region deployment (high latency for distant users)
METHODS
├─ REST API design requires multiple round trips
├─ No pagination on large datasets
└─ Full object serialization instead of selective fields
MATERIALS
├─ Large JSON payloads (unnecessary data)
├─ Uncompressed responses
└─ Third-party API (payment gateway) is slow
└─ Why: Free tier with rate limiting
ROOT CAUSES:
- No performance requirements defined (Process)
- Missing performance monitoring tooling (Technology)
- Architecture doesn't support caching/async (Methods)
SOLUTIONS (Priority Order):
1. Add database indexes (quick win, high impact)
2. Implement Redis caching layer (medium effort, high impact)
3. Make external API calls async with webhooks (high effort, high impact)
4. Define and monitor performance SLAs (low effort, prevents regression)
```
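The N+1 query cause listed under Technology is easiest to see side by side with its eager-loading fix. A minimal sketch using an illustrative `db.query` interface and made-up table names (most ORMs express the second version as eager loading or a join):
```typescript
// Illustrative database interface and table names.
interface Db {
  query(sql: string, params?: unknown[]): Promise<any[]>;
}

// N+1: one query for the orders, then one more query per order.
// 100 orders means 101 round trips to the database.
async function loadOrdersNPlusOne(db: Db, userId: string) {
  const orders = await db.query('SELECT * FROM orders WHERE user_id = $1', [userId]);
  for (const order of orders) {
    order.items = await db.query(
      'SELECT * FROM order_items WHERE order_id = $1',
      [order.id]
    );
  }
  return orders;
}

// Eager loading: a constant number of queries regardless of result size
// (one join here; an ORM's include/with option produces the same shape).
async function loadOrders(db: Db, userId: string) {
  return db.query(
    `SELECT o.*, i.id AS item_id, i.sku, i.quantity
       FROM orders o
       LEFT JOIN order_items i ON i.order_id = o.id
      WHERE o.user_id = $1`,
    [userId]
  );
}
```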
### Example 2: Flaky Test Suite
```
Problem: 15% of test runs fail, passing on retry
PEOPLE
├─ Test-writing skills vary across team
├─ New developers copy existing flaky patterns
└─ No one assigned to fix flaky tests
PROCESS
├─ Flaky tests marked as "known issue" and ignored
├─ No policy against merging with flaky tests
└─ Test failures don't block deployments
TECHNOLOGY
├─ Race conditions in async test setup
├─ Tests share global state
├─ Test database not isolated per test
├─ setTimeout used instead of proper waiting
└─ CI environment inconsistent (different CPU/memory)
ENVIRONMENT
├─ CI runner under heavy load
├─ Network timing varies (external API mocks flaky)
└─ Timezone differences between local and CI
METHODS
├─ Integration tests not properly isolated
├─ No retry logic for legitimate timing issues
└─ Tests depend on execution order
MATERIALS
├─ Test data fixtures overlap
├─ Shared test database polluted
└─ Mock data doesn't match production patterns
ROOT CAUSES:
- No test isolation strategy (Methods + Technology)
- Process accepts flaky tests (Process)
- Async timing not handled properly (Technology)
SOLUTIONS:
1. Implement per-test database isolation (high impact)
2. Replace setTimeout with proper async/await patterns (medium impact)
3. Add pre-commit hook blocking flaky test patterns (prevents new issues)
4. Enforce policy: flaky test = block merge (process change)
```
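Solution 2, replacing `setTimeout` with proper async patterns, usually means waiting for a condition instead of a fixed delay. A minimal, framework-agnostic sketch (`isSeeded` in the usage comment is a hypothetical readiness check):
```typescript
// Flaky: guesses how long seeding takes; fails on a slow CI runner
// and wastes time on a fast one.
async function waitForSeedFlaky(): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, 2000));
}

// Deterministic: poll an explicit readiness condition with a timeout,
// so the test continues as soon as the state is ready and fails loudly
// if it never becomes ready.
async function waitFor(
  condition: () => Promise<boolean>,
  timeoutMs = 5000,
  intervalMs = 50
): Promise<void> {
  const deadline = Date.now() + timeoutMs;
  while (Date.now() < deadline) {
    if (await condition()) return;
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  throw new Error(`Condition not met within ${timeoutMs}ms`);
}

// Usage in a test setup (isSeeded is hypothetical):
// await waitFor(() => isSeeded());
```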
### Example 3: Feature Takes 3 Months Instead of 3 Weeks
```
Problem: Simple CRUD feature took 12 weeks vs. 3 week estimate
PEOPLE
├─ Developer unfamiliar with codebase
├─ Key architect on vacation during critical phase
└─ Designer changed requirements mid-development
PROCESS
├─ Requirements not finalized before starting
├─ No code review for first 6 weeks (large diff)
├─ Multiple rounds of design revision
└─ QA started late (found issues in week 10)
TECHNOLOGY
├─ Codebase has high coupling (change ripple effects)
├─ No automated tests (manual testing slow)
├─ Legacy code required refactoring first
└─ Development environment setup took 2 weeks
ENVIRONMENT
├─ Staging environment broken for 3 weeks
├─ Production data needed for testing (compliance delay)
└─ Dependencies blocked by another team
METHODS
├─ No incremental delivery (big bang approach)
├─ Over-engineering (added future features "while we're at it")
└─ No design doc (discovered issues during implementation)
MATERIALS
├─ Third-party API changed during development
├─ Production data model different than staging
└─ Missing design assets (waited for designer)
ROOT CAUSES:
- No requirements lock-down before start (Process)
- Architecture prevents incremental changes (Technology)
- Big bang approach vs. iterative (Methods)
- Development environment not automated (Technology)
SOLUTIONS:
1. Require design doc + finalized requirements before starting (Process)
2. Implement feature flags for incremental delivery (Methods)
3. Automate dev environment setup (Technology)
4. Refactor high-coupling areas (Technology, long-term)
```
## Notes
- Fishbone reveals systemic issues across domains
- Multiple causes often combine to create problems
- Don't stop at first cause in each category—dig deeper
- Some causes span multiple categories (mark them)
- Root causes usually in Process or Methods (not just Technology)
- Use with `/why` command for deeper analysis of specific causes
- Prioritize solutions by: impact × feasibility ÷ effort (see the sketch after this list)
- Address root causes, not just symptoms
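The prioritization rule above can be made concrete with a small scoring sketch; the 1-5 scores below are illustrative, not measured:
```typescript
interface Cause {
  name: string;
  impact: number;      // 1-5: how much fixing it would improve the problem
  feasibility: number; // 1-5: how realistic the fix is
  effort: number;      // 1-5: relative cost to implement
}

// Higher score = address first.
const score = (c: Cause) => (c.impact * c.feasibility) / c.effort;
const prioritize = (causes: Cause[]) =>
  [...causes].sort((a, b) => score(b) - score(a));

// Illustrative scores for the latency causes from Example 1:
console.log(
  prioritize([
    { name: 'Add database indexes', impact: 4, feasibility: 5, effort: 1 },
    { name: 'Redis caching layer', impact: 5, feasibility: 4, effort: 3 },
    { name: 'Async external API calls', impact: 5, feasibility: 3, effort: 4 },
  ]).map((c) => c.name)
);
```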

255
commands/plan-do-check-act.md Normal file

@@ -0,0 +1,255 @@
---
description: Iterative PDCA cycle for systematic experimentation and continuous improvement
argument-hint: Optional improvement goal or problem to address
---
# Plan-Do-Check-Act (PDCA)
Apply PDCA cycle for continuous improvement through iterative problem-solving and process optimization.
## Description
Four-phase iterative cycle: Plan (identify and analyze), Do (implement changes), Check (measure results), Act (standardize or adjust). Enables systematic experimentation and improvement.
## Usage
`/plan-do-check-act [improvement_goal]`
## Variables
- GOAL: Improvement target or problem to address (default: prompt for input)
- CYCLE_NUMBER: Which PDCA iteration (default: 1)
## Steps
### Phase 1: PLAN
1. Define the problem or improvement goal
2. Analyze current state (baseline metrics)
3. Identify root causes (use `/why` or `/cause-and-effect`)
4. Develop hypothesis: "If we change X, Y will improve"
5. Design experiment: what to change, how to measure success
6. Set success criteria (measurable targets)
### Phase 2: DO
1. Implement the planned change (small scale first)
2. Document what was actually done
3. Record any deviations from plan
4. Collect data throughout implementation
5. Note unexpected observations
### Phase 3: CHECK
1. Measure results against success criteria
2. Compare to baseline (before vs. after)
3. Analyze data: did hypothesis hold?
4. Identify what worked and what didn't
5. Document learnings and insights
### Phase 4: ACT
1. **If successful**: Standardize the change
- Update documentation
- Train team
- Create checklist/automation
- Monitor for regression
2. **If unsuccessful**: Learn and adjust
- Understand why it failed
- Refine hypothesis
- Start new PDCA cycle with adjusted plan
3. **If partially successful**:
- Standardize what worked
- Plan next cycle for remaining issues
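If cycles are tracked in tooling rather than prose, a typed record keeps iterations comparable. A minimal sketch mirroring the four phases above (field names are illustrative):
```typescript
type Outcome = 'successful' | 'partially-successful' | 'unsuccessful';

interface PdcaCycle {
  cycleNumber: number;
  plan: {
    problem: string;
    baseline: string;        // current-state metrics
    rootCause: string;
    hypothesis: string;      // "If we change X, Y will improve"
    successCriteria: string; // measurable target
  };
  do: {
    changesMade: string[];
    deviationsFromPlan: string[];
  };
  check: {
    results: string;
    hypothesisHeld: boolean;
    learnings: string[];
  };
  act: {
    outcome: Outcome;
    standardized: string[];   // docs updated, automation added, etc.
    nextCycleNumber?: number; // set when another iteration is planned
  };
}
```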
## Examples
### Example 1: Reducing Build Time
```
CYCLE 1
───────
PLAN:
Problem: Docker build takes 45 minutes
Current State: Full rebuild every time, no layer caching
Root Cause: Package manager cache not preserved between builds
Hypothesis: Caching dependencies will reduce build to <10 minutes
Change: Add layer caching for package.json + node_modules
Success Criteria: Build time <10 minutes on unchanged dependencies
DO:
- Restructured Dockerfile: COPY package*.json before src files
- Added .dockerignore for node_modules
- Configured CI cache for Docker layers
- Tested on 3 builds
CHECK:
Results:
- Unchanged dependencies: 8 minutes ✓ (was 45)
- Changed dependencies: 12 minutes (was 45)
- Fresh builds: 45 minutes (same, expected)
Analysis: 82% reduction on cached builds, hypothesis confirmed
ACT:
Standardize:
✓ Merged Dockerfile changes
✓ Updated CI pipeline config
✓ Documented in README
✓ Added build time monitoring
New Problem: 12 minutes still slow when deps change
→ Start CYCLE 2
CYCLE 2
───────
PLAN:
Problem: Build still 12 min when dependencies change
Current State: npm install rebuilds all packages
Root Cause: Some packages compile from source
Hypothesis: Pre-built binaries will reduce to <5 minutes
Change: Use npm ci instead of install, configure binary mirrors
Success Criteria: Build <5 minutes on dependency changes
DO:
- Changed to npm ci (uses package-lock.json)
- Added .npmrc with binary mirror configs
- Tested across 5 dependency updates
CHECK:
Results:
- Dependency changes: 4.5 minutes ✓ (was 12)
- Compilation errors reduced to 0 (was 3)
Analysis: npm ci faster + more reliable, hypothesis confirmed
ACT:
Standardize:
✓ Use npm ci everywhere (local + CI)
✓ Committed .npmrc
✓ Updated developer onboarding docs
Total improvement: 45min → 4.5min (90% reduction)
✓ PDCA complete, monitor for 2 weeks
```
### Example 2: Reducing Production Bugs
```
CYCLE 1
───────
PLAN:
Problem: 8 production bugs per month
Current State: Manual testing only, no automated tests
Root Cause: Regressions not caught before release
Hypothesis: Adding integration tests will reduce bugs by 50%
Change: Implement integration test suite for critical paths
Success Criteria: <4 bugs per month after 1 month
DO:
Week 1-2: Wrote integration tests for:
- User authentication flow
- Payment processing
- Data export
Week 3: Set up CI to run tests
Week 4: Team training on test writing
Coverage: 3 critical paths (was 0)
CHECK:
Results after 1 month:
- Production bugs: 6 (was 8)
- Bugs caught in CI: 4
- Test failures (false positives): 2
Analysis: 25% reduction, not 50% target
Insight: Bugs are in areas without tests yet
ACT:
Partially successful:
✓ Keep existing tests (prevented 4 bugs)
✓ Fix flaky tests
Adjust for CYCLE 2:
- Expand test coverage to all user flows
- Add tests for bug-prone areas
→ Start CYCLE 2
CYCLE 2
───────
PLAN:
Problem: Still 6 bugs/month, need <4
Current State: 3 critical paths tested, 12 paths total
Root Cause: UI interaction bugs not covered by integration tests
Hypothesis: E2E tests for all user flows will reach <4 bugs
Change: Add E2E tests for remaining 9 flows
Success Criteria: <4 bugs per month, 80% coverage
DO:
Week 1-3: Added E2E tests for all user flows
Week 4: Set up visual regression testing
Coverage: 12/12 user flows (was 3/12)
CHECK:
Results after 1 month:
- Production bugs: 3 ✓ (was 6)
- Bugs caught in CI: 8 (was 4)
- Test maintenance time: 3 hours/week
Analysis: Target achieved! 62% reduction from baseline
ACT:
Standardize:
✓ Made tests required for all PRs
✓ Added test checklist to PR template
✓ Scheduled weekly test review
✓ Created runbook for test maintenance
Monitor: Track bug rate and test effectiveness monthly
✓ PDCA complete
```
### Example 3: Improving Code Review Speed
```
PLAN:
Problem: PRs take 3 days average to merge
Current State: Manual review, no automation
Root Cause: Reviewers wait to see if CI passes before reviewing
Hypothesis: Auto-review + faster CI will reduce to <1 day
Change: Add automated checks + split long CI jobs
Success Criteria: Average time to merge <1 day (8 hours)
DO:
- Set up automated linter checks (fail fast)
- Split test suite into parallel jobs
- Added PR template with self-review checklist
- CI time: 45min → 15min
- Tracked PR merge time for 2 weeks
CHECK:
Results:
- Average time to merge: 1.5 days (was 3)
- Time waiting for CI: 15min (was 45min)
- Time waiting for review: 1.3 days (was 2+ days)
Analysis: CI faster, but review still bottleneck
ACT:
Partially successful:
✓ Keep fast CI improvements
Insight: Real bottleneck is reviewer availability, not CI
Adjust for new PDCA:
- Focus on reviewer availability/notification
- Consider rotating review assignments
→ Start new PDCA cycle with different hypothesis
```
## Notes
- Start with small, measurable changes (not big overhauls)
- PDCA is iterative—multiple cycles normal
- Failed experiments are learning opportunities
- Document everything: easier to see patterns across cycles
- Success criteria must be measurable (not subjective)
- Phase 4 "Act" determines next cycle or completion
- If stuck after 3 cycles, revisit root cause analysis
- PDCA works for technical and process improvements
- Use `/analyse-problem` (A3) for comprehensive documentation

188
commands/root-cause-tracing.md Normal file

@@ -0,0 +1,188 @@
---
name: root-cause-tracing
description: Use when errors occur deep in execution and you need to trace back to find the original trigger - systematically traces bugs backward through call stack, adding instrumentation when needed, to identify source of invalid data or incorrect behavior
---
# Root Cause Tracing
## Overview
Bugs often manifest deep in the call stack (git init in wrong directory, file created in wrong location, database opened with wrong path). Your instinct is to fix where the error appears, but that's treating a symptom.
**Core principle:** Trace backward through the call chain until you find the original trigger, then fix at the source.
## When to Use
```dot
digraph when_to_use {
"Bug appears deep in stack?" [shape=diamond];
"Can trace backwards?" [shape=diamond];
"Fix at symptom point" [shape=box];
"Trace to original trigger" [shape=box];
"BETTER: Also add defense-in-depth" [shape=box];
"Bug appears deep in stack?" -> "Can trace backwards?" [label="yes"];
"Can trace backwards?" -> "Trace to original trigger" [label="yes"];
"Can trace backwards?" -> "Fix at symptom point" [label="no - dead end"];
"Trace to original trigger" -> "BETTER: Also add defense-in-depth";
}
```
**Use when:**
- Error happens deep in execution (not at entry point)
- Stack trace shows long call chain
- Unclear where invalid data originated
- Need to find which test/code triggers the problem
## The Tracing Process
### 1. Observe the Symptom
```
Error: git init failed in /Users/jesse/project/packages/core
```
### 2. Find Immediate Cause
**What code directly causes this?**
```typescript
await execFileAsync('git', ['init'], { cwd: projectDir });
```
### 3. Ask: What Called This?
```typescript
WorktreeManager.createSessionWorktree(projectDir, sessionId)
called by Session.initializeWorkspace()
called by Session.create()
called by test at Project.create()
```
### 4. Keep Tracing Up
**What value was passed?**
- `projectDir = ''` (empty string!)
- Empty string as `cwd` resolves to `process.cwd()`
- That's the source code directory!
### 5. Find Original Trigger
**Where did empty string come from?**
```typescript
const context = setupCoreTest(); // Returns { tempDir: '' }
Project.create('name', context.tempDir); // Accessed before beforeEach!
```
## Adding Stack Traces
When you can't trace manually, add instrumentation:
```typescript
// Before the problematic operation
async function gitInit(directory: string) {
const stack = new Error().stack;
console.error('DEBUG git init:', {
directory,
cwd: process.cwd(),
nodeEnv: process.env.NODE_ENV,
stack,
});
await execFileAsync('git', ['init'], { cwd: directory });
}
```
**Critical:** Use `console.error()` in tests (not logger - may not show)
**Run and capture:**
```bash
npm test 2>&1 | grep 'DEBUG git init'
```
**Analyze stack traces:**
- Look for test file names
- Find the line number triggering the call
- Identify the pattern (same test? same parameter?)
## Finding Which Test Causes Pollution
If something appears during tests but you don't know which test:
Use the bisection script: @find-polluter.sh
```bash
./find-polluter.sh '.git' 'src/**/*.test.ts'
```
Runs tests one-by-one, stops at first polluter. See script for usage.
## Real Example: Empty projectDir
**Symptom:** `.git` created in `packages/core/` (source code)
**Trace chain:**
1. `git init` runs in `process.cwd()` ← empty cwd parameter
2. WorktreeManager called with empty projectDir
3. Session.create() passed empty string
4. Test accessed `context.tempDir` before beforeEach
5. setupCoreTest() returns `{ tempDir: '' }` initially
**Root cause:** Top-level variable initialization accessing empty value
**Fix:** Made tempDir a getter that throws if accessed before beforeEach (sketched below)
**Also added defense-in-depth:**
- Layer 1: Project.create() validates directory
- Layer 2: WorkspaceManager validates not empty
- Layer 3: NODE_ENV guard refuses git init outside tmpdir
- Layer 4: Stack trace logging before git init
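A minimal sketch of the getter-based fix; the helper is reconstructed for illustration and the real `setupCoreTest()` may differ:
```typescript
import { mkdtempSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Reconstructed for illustration; the real setupCoreTest() may differ.
function setupCoreTest() {
  let tempDir: string | undefined;

  return {
    // Call this from the test framework's beforeEach hook.
    createTempDir() {
      tempDir = mkdtempSync(join(tmpdir(), 'core-test-'));
    },
    // Throws instead of silently returning '' when a test (or top-level
    // code) reads the value before beforeEach has run.
    get tempDir(): string {
      if (!tempDir) {
        throw new Error(
          'tempDir accessed before beforeEach - move this access inside the test'
        );
      }
      return tempDir;
    },
  };
}
```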
## Key Principle
```dot
digraph principle {
"Found immediate cause" [shape=ellipse];
"Can trace one level up?" [shape=diamond];
"Trace backwards" [shape=box];
"Is this the source?" [shape=diamond];
"Fix at source" [shape=box];
"Add validation at each layer" [shape=box];
"Bug impossible" [shape=doublecircle];
"NEVER fix just the symptom" [shape=octagon, style=filled, fillcolor=red, fontcolor=white];
"Found immediate cause" -> "Can trace one level up?";
"Can trace one level up?" -> "Trace backwards" [label="yes"];
"Can trace one level up?" -> "NEVER fix just the symptom" [label="no"];
"Trace backwards" -> "Is this the source?";
"Is this the source?" -> "Trace backwards" [label="no - keeps going"];
"Is this the source?" -> "Fix at source" [label="yes"];
"Fix at source" -> "Add validation at each layer";
"Add validation at each layer" -> "Bug impossible";
}
```
**NEVER fix just where the error appears.** Trace back to find the original trigger.
## Stack Trace Tips
**In tests:** Use `console.error()` not logger - logger may be suppressed
**Before operation:** Log before the dangerous operation, not after it fails
**Include context:** Directory, cwd, environment variables, timestamps
**Capture stack:** `new Error().stack` shows complete call chain
## Real-World Impact
From debugging session (2025-10-03):
- Found root cause through 5-level trace
- Fixed at source (getter validation)
- Added 4 layers of defense
- 1847 tests passed, zero pollution

99
commands/why.md Normal file

@@ -0,0 +1,99 @@
---
description: Iterative Five Whys root cause analysis drilling from symptoms to fundamentals
argument-hint: Optional issue or symptom description
---
# Five Whys Analysis
Apply Five Whys root cause analysis to investigate issues by iteratively asking "why" to drill from symptoms to root causes.
## Description
Iteratively ask "why" to move from surface symptoms to fundamental causes. Identifies systemic issues rather than quick fixes.
## Usage
`/why [issue_description]`
## Variables
- ISSUE: Problem or symptom to analyze (default: prompt for input)
- DEPTH: Number of "why" iterations (default: 5, adjust as needed)
## Steps
1. State the problem clearly
2. Ask "Why did this happen?" and document the answer
3. For that answer, ask "Why?" again
4. Continue until reaching root cause (usually 5 iterations)
5. Validate by working backwards: root cause → symptom
6. Explore branches if multiple causes emerge
7. Propose solutions addressing root causes, not symptoms
## Examples
### Example 1: Production Bug
```
Problem: Users see 500 error on checkout
Why 1: Payment service throws exception
Why 2: Request timeout after 30 seconds
Why 3: Database query takes 45 seconds
Why 4: Missing index on transactions table
Why 5: Index creation wasn't in migration scripts
Root Cause: Migration review process doesn't check query performance
Solution: Add query performance checks to migration PR template
```
### Example 2: CI/CD Pipeline Failures
```
Problem: E2E tests fail intermittently
Why 1: Race condition in async test setup
Why 2: Test doesn't wait for database seed completion
Why 3: Seed function doesn't return promise
Why 4: TypeScript didn't catch missing return type
Why 5: strict mode not enabled in test config
Root Cause: Inconsistent TypeScript config between src and tests
Solution: Unify TypeScript config, enable strict mode everywhere
```
### Example 3: Multi-Branch Analysis
```
Problem: Feature deployment takes 2 hours
Branch A (Build):
Why 1: Docker build takes 90 minutes
Why 2: No layer caching
Why 3: Dependencies reinstalled every time
Why 4: Cache invalidated by timestamp in Dockerfile
Root Cause A: Dockerfile uses current timestamp for versioning
Branch B (Tests):
Why 1: Test suite takes 30 minutes
Why 2: Integration tests run sequentially
Why 3: Test runner config has maxWorkers: 1
Why 4: Previous developer disabled parallelism due to flaky tests
Root Cause B: Flaky tests masked by disabling parallelism
Solutions:
A) Remove timestamp from Dockerfile, use git SHA
B) Fix flaky tests, re-enable parallel test execution
```
## Notes
- Don't stop at symptoms; keep digging for systemic issues
- Multiple root causes are common - explore branches separately
- Document each "why" for future reference
- Consider both technical and process-related causes
- The magic isn't in exactly 5 whys - stop when you reach the true root cause, which is usually a systemic or process issue rather than a technical detail
- If "human error" appears, keep digging: why was the error possible?
- Root causes usually involve missing validation, missing docs, unclear process, or missing automation
- Test solutions: implement → verify symptom resolved → monitor for recurrence

69
plugin.lock.json Normal file

@@ -0,0 +1,69 @@
{
"$schema": "internal://schemas/plugin.lock.v1.json",
"pluginId": "gh:NeoLabHQ/context-engineering-kit:plugins/kaizen",
"normalized": {
"repo": null,
"ref": "refs/tags/v20251128.0",
"commit": "8657e6de448c17e6b4602ffd032b8bcde67e2670",
"treeHash": "8f8e55aecbe9ecd417c7091a217231e2f6ee68009ffc17e2fb3d0dc0c82be14e",
"generatedAt": "2025-11-28T10:12:11.355390Z",
"toolVersion": "publish_plugins.py@0.2.0"
},
"origin": {
"remote": "git@github.com:zhongweili/42plugin-data.git",
"branch": "master",
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
},
"manifest": {
"name": "kaizen",
"description": "Inspired by Japanese continuous improvement philosophy, Agile and Lean development practices. Introduces commands for analysis of root cause of issues and problems, including 5 Whys, Cause and Effect Analysis, and other techniques.",
"version": "1.0.0"
},
"content": {
"files": [
{
"path": "README.md",
"sha256": "29752d48c20936f65c802c65d78e352f23721a314ff9c1e35b52c9dacad09c48"
},
{
"path": ".claude-plugin/plugin.json",
"sha256": "06991cd325a4fae965b1092dbb7f60cf06c3e6449834784e7b6c3e49bba12585"
},
{
"path": "commands/analyse-problem.md",
"sha256": "487dc9cf6e8dba0bfcaa1c1cf057e82a25afe2f7b4a87b536f2d55aebfc15e03"
},
{
"path": "commands/cause-and-effect.md",
"sha256": "8add54200271bba4d0ec30acfcd613e0ae05c81d53b67ff66d641978977840a3"
},
{
"path": "commands/plan-do-check-act.md",
"sha256": "88783462f563f71a6c36c462403e7b046fe4e13862a243513895a916dfb00821"
},
{
"path": "commands/analyse.md",
"sha256": "410c7505cf4698efca6aa30de9eca8c14812ca79030c8f7e5412ca27b48bfa7c"
},
{
"path": "commands/why.md",
"sha256": "cd1219aa596841809e631eb3e24b196d84eef665f4936d967f300300598e5a30"
},
{
"path": "commands/root-cause-tracing.md",
"sha256": "2d59c187253c0df17abbf2dddf159f4a42f18841fd7ed5b7baa62c1b9ae885b6"
},
{
"path": "skills/kaizen/SKILL.md",
"sha256": "1a300eb3c93aa3f8307e034cc5f45731076ae478f78c049e23f06ad2a0c98626"
}
],
"dirSha256": "8f8e55aecbe9ecd417c7091a217231e2f6ee68009ffc17e2fb3d0dc0c82be14e"
},
"security": {
"scannedAt": null,
"scannerVersion": null,
"flags": []
}
}

733
skills/kaizen/SKILL.md Normal file

@@ -0,0 +1,733 @@
---
name: kaizen
description: Use when implementing or refactoring code, architecting or designing systems, improving processes and workflows, or handling errors and validation. Provides techniques to avoid over-engineering and to apply iterative improvements.
---
# Kaizen: Continuous Improvement
Apply a continuous improvement mindset: suggest small iterative improvements, favor error-proof designs, follow established patterns, and avoid over-engineering. Applied automatically to guide quality and simplicity.
## Overview
Small improvements, continuously. Error-proof by design. Follow what works. Build only what's needed.
**Core principle:** Many small improvements beat one big change. Prevent errors at design time, not with fixes.
## When to Use
**Always applied for:**
- Code implementation and refactoring
- Architecture and design decisions
- Process and workflow improvements
- Error handling and validation
**Philosophy:** Quality through incremental progress and prevention, not perfection through massive effort.
## The Four Pillars
### 1. Continuous Improvement (Kaizen)
Small, frequent improvements compound into major gains.
#### Principles
**Incremental over revolutionary:**
- Make smallest viable change that improves quality
- One improvement at a time
- Verify each change before next
- Build momentum through small wins
**Always leave code better:**
- Fix small issues as you encounter them
- Refactor while you work (within scope)
- Update outdated comments
- Remove dead code when you see it
**Iterative refinement:**
- First version: make it work
- Second pass: make it clear
- Third pass: make it efficient
- Don't try all three at once
<Good>
```typescript
// Iteration 1: Make it work
const calculateTotal = (items: Item[]) => {
let total = 0;
for (let i = 0; i < items.length; i++) {
total += items[i].price * items[i].quantity;
}
return total;
};
// Iteration 2: Make it clear (refactor)
const calculateTotal = (items: Item[]): number => {
return items.reduce((total, item) => {
return total + (item.price * item.quantity);
}, 0);
};
// Iteration 3: Make it robust (add validation)
const calculateTotal = (items: Item[]): number => {
if (!items?.length) return 0;
return items.reduce((total, item) => {
if (item.price < 0 || item.quantity < 0) {
throw new Error('Price and quantity must be non-negative');
}
return total + (item.price * item.quantity);
}, 0);
};
```
Each step is complete, tested, and working
</Good>
<Bad>
```typescript
// Trying to do everything at once
const calculateTotal = (items: Item[]): number => {
// Validate, optimize, add features, handle edge cases all together
if (!items?.length) return 0;
const validItems = items.filter(item => {
if (item.price < 0) throw new Error('Negative price');
if (item.quantity < 0) throw new Error('Negative quantity');
return item.quantity > 0; // Also filtering zero quantities
});
// Plus caching, plus logging, plus currency conversion...
return validItems.reduce(...); // Too many concerns at once
};
```
Overwhelming, error-prone, hard to verify
</Bad>
#### In Practice
**When implementing features:**
1. Start with simplest version that works
2. Add one improvement (error handling, validation, etc.)
3. Test and verify
4. Repeat if time permits
5. Don't try to make it perfect immediately
**When refactoring:**
- Fix one smell at a time
- Commit after each improvement
- Keep tests passing throughout
- Stop when "good enough" (diminishing returns)
**When reviewing code:**
- Suggest incremental improvements (not rewrites)
- Prioritize: critical → important → nice-to-have
- Focus on highest-impact changes first
- Accept "better than before" even if not perfect
### 2. Poka-Yoke (Error Proofing)
Design systems that prevent errors at compile/design time, not runtime.
#### Principles
**Make errors impossible:**
- Type system catches mistakes
- Compiler enforces contracts
- Invalid states unrepresentable
- Errors caught early (shifted left, before production)
**Design for safety:**
- Fail fast and loudly
- Provide helpful error messages
- Make correct path obvious
- Make incorrect path difficult
**Defense in layers:**
1. Type system (compile time)
2. Validation (runtime, early)
3. Guards (preconditions)
4. Error boundaries (graceful degradation, sketched below)
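The subsections below cover the first three layers. The fourth layer has no example of its own, so here is a minimal sketch of an error boundary: contain the failure, log context, and degrade to a safe default (the `withFallback` helper and the usage names are illustrative, not part of this skill's API).
```typescript
// Error boundary for async work: keep a non-critical failure from crashing the flow.
const withFallback = async <T>(
  operation: () => Promise<T>,
  fallback: T,
  context: string
): Promise<T> => {
  try {
    return await operation();
  } catch (err) {
    console.error(`[${context}] operation failed, using fallback`, err);
    return fallback; // Graceful degradation instead of propagating the crash
  }
};

// Usage sketch: recommendations are nice-to-have, so an empty list is acceptable
// const recommendations = await withFallback(() => fetchRecommendations(userId), [], 'recommendations');
```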
#### Type System Error Proofing
<Good>
```typescript
// Bad: string status can be any value
type OrderBad = {
status: string; // Can be "pending", "PENDING", "pnding", anything!
total: number;
};
// Good: Only valid states possible
type OrderStatus = 'pending' | 'processing' | 'shipped' | 'delivered';
type Order = {
status: OrderStatus;
total: number;
};
// Better: States with associated data
type Order =
| { status: 'pending'; createdAt: Date }
| { status: 'processing'; startedAt: Date; estimatedCompletion: Date }
| { status: 'shipped'; trackingNumber: string; shippedAt: Date }
| { status: 'delivered'; deliveredAt: Date; signature: string };
// Now impossible to have shipped without trackingNumber
```
Type system prevents entire classes of errors
</Good>
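A companion technique worth a short sketch: an exhaustive `switch` over the discriminated union above turns a forgotten case into a compile error (`describeOrder` is an illustrative name, not from the original).
```typescript
const describeOrder = (order: Order): string => {
  switch (order.status) {
    case 'pending':
      return `Created ${order.createdAt.toISOString()}`;
    case 'processing':
      return `Started ${order.startedAt.toISOString()}`;
    case 'shipped':
      return `Tracking ${order.trackingNumber}`;
    case 'delivered':
      return `Signed by ${order.signature}`;
    default: {
      // If a new status is added to Order, this assignment stops compiling.
      const unhandled: never = order;
      return unhandled;
    }
  }
};
```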
<Good>
```typescript
// Make invalid states unrepresentable
type NonEmptyArray<T> = [T, ...T[]];
const firstItem = <T>(items: NonEmptyArray<T>): T => {
return items[0]; // Always safe, never undefined!
};
// Caller must prove array is non-empty
const items: number[] = [1, 2, 3];
if (items.length > 0) {
firstItem(items as NonEmptyArray<number>); // Safe
}
```
Function signature guarantees safety
</Good>
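Building on the block above, a small refinement (a sketch, assuming the same `NonEmptyArray` type): a user-defined type guard lets the compiler narrow the array and removes the `as` cast.
```typescript
const isNonEmpty = <T>(arr: T[]): arr is NonEmptyArray<T> => arr.length > 0;

if (isNonEmpty(items)) {
  firstItem(items); // Narrowed to NonEmptyArray<number>; no assertion needed
}
```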
#### Validation Error Proofing
<Good>
```typescript
// Bad: validation after use
const processPayment = (amount: number) => {
const fee = amount * 0.03; // Used before validation!
if (amount <= 0) throw new Error('Invalid amount');
// ...
};
// Good: Validate immediately
const processPayment = (amount: number) => {
if (amount <= 0) {
throw new Error('Payment amount must be positive');
}
if (amount > 10000) {
throw new Error('Payment exceeds maximum allowed');
}
const fee = amount * 0.03;
// ... now safe to use
};
// Better: Validation at boundary with branded type
type PositiveNumber = number & { readonly __brand: 'PositiveNumber' };
const validatePositive = (n: number): PositiveNumber => {
if (n <= 0) throw new Error('Must be positive');
return n as PositiveNumber;
};
const processPayment = (amount: PositiveNumber) => {
// amount is guaranteed positive, no need to check
const fee = amount * 0.03;
};
// Validate at system boundary
const handlePaymentRequest = (req: Request) => {
const amount = validatePositive(req.body.amount); // Validate once
processPayment(amount); // Use everywhere safely
};
```
Validate once at boundary, safe everywhere else
</Good>
#### Guards and Preconditions
<Good>
```typescript
// Early returns prevent deeply nested code
const processUser = (user: User | null) => {
if (!user) {
logger.error('User not found');
return;
}
if (!user.email) {
logger.error('User email missing');
return;
}
if (!user.isActive) {
logger.info('User inactive, skipping');
return;
}
// Main logic here, guaranteed user is valid and active
sendEmail(user.email, 'Welcome!');
};
```
Guards make assumptions explicit and enforced
</Good>
#### Configuration Error Proofing
<Good>
```typescript
// Bad: optional config with unsafe defaults
type ConfigBad = {
apiKey?: string;
timeout?: number;
};
const client = new APIClient({ timeout: 5000 }); // apiKey missing!
// Good: Required config, fails early
type Config = {
apiKey: string;
timeout: number;
};
const loadConfig = (): Config => {
const apiKey = process.env.API_KEY;
if (!apiKey) {
throw new Error('API_KEY environment variable required');
}
return {
apiKey,
timeout: 5000,
};
};
// App fails at startup if config invalid, not during request
const config = loadConfig();
const client = new APIClient(config);
```
Fail at startup, not in production
</Good>
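If the project already standardizes on a schema validator such as zod (an assumption, not a requirement), the same fail-at-startup idea can be written declaratively; the environment variable names here are illustrative.
```typescript
import { z } from 'zod';

// Declare the expected shape once; parse() throws at startup on missing or malformed values.
const ConfigSchema = z.object({
  apiKey: z.string().min(1, 'API_KEY is required'),
  timeout: z.number().int().positive(),
});

// Equivalent to loadConfig() above: the app refuses to start on bad config.
const config = ConfigSchema.parse({
  apiKey: process.env.API_KEY,
  timeout: Number(process.env.TIMEOUT ?? 5000),
});
```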
#### In Practice
**When designing APIs:**
- Use types to constrain inputs
- Make invalid states unrepresentable
- Return Result<T, E> instead of throwing
- Document preconditions in types
**When handling errors:**
- Validate at system boundaries
- Use guards for preconditions
- Fail fast with clear messages
- Log context for debugging
**When configuring:**
- Required over optional with defaults
- Validate all config at startup
- Fail deployment if config invalid
- Don't allow partial configurations
### 3. Standardized Work
Follow established patterns. Document what works. Make good practices easy to follow.
#### Principles
**Consistency over cleverness:**
- Follow existing codebase patterns
- Don't reinvent solved problems
- New pattern only if significantly better
- Team agreement on new patterns
**Documentation lives with code:**
- README for setup and architecture
- CLAUDE.md for AI coding conventions
- Comments for "why", not "what"
- Examples for complex patterns
**Automate standards:**
- Linters enforce style
- Type checks enforce contracts
- Tests verify behavior
- CI/CD enforces quality gates
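One minimal way to wire these gates together, sketched as a Node script; the commands are examples, so substitute whatever the project actually runs.
```typescript
// scripts/quality-gate.ts - run the automated standards and fail loudly if any gate breaks.
import { execSync } from 'node:child_process';

const gates = [
  { name: 'types', cmd: 'tsc --noEmit' },      // type checks enforce contracts
  { name: 'lint', cmd: 'eslint .' },           // linter enforces style
  { name: 'tests', cmd: 'npm test --silent' }, // tests verify behavior
];

for (const gate of gates) {
  try {
    execSync(gate.cmd, { stdio: 'inherit' });
  } catch {
    console.error(`Quality gate failed: ${gate.name}`);
    process.exit(1); // Non-zero exit blocks the merge in CI
  }
}
console.log('All quality gates passed');
```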
#### Following Patterns
<Good>
```typescript
// Existing codebase pattern for API clients
class UserAPIClient {
async getUser(id: string): Promise<User> {
return this.fetch(`/users/${id}`);
}
}
// New code follows the same pattern
class OrderAPIClient {
async getOrder(id: string): Promise<Order> {
return this.fetch(`/orders/${id}`);
}
}
```
Consistency makes codebase predictable
</Good>
<Bad>
```typescript
// Existing pattern uses classes
class UserAPIClient { /* ... */ }
// New code introduces different pattern without discussion
const getOrder = async (id: string): Promise<Order> => {
// Breaking consistency "because I prefer functions"
};
```
Inconsistency creates confusion
</Bad>
#### Error Handling Patterns
<Good>
```typescript
// Project standard: Result type for recoverable errors
type Result<T, E> = { ok: true; value: T } | { ok: false; error: E };
// All services follow this pattern
const fetchUser = async (id: string): Promise<Result<User, Error>> => {
try {
const user = await db.users.findById(id);
if (!user) {
return { ok: false, error: new Error('User not found') };
}
return { ok: true, value: user };
} catch (err) {
return { ok: false, error: err as Error };
}
};
// Callers use consistent pattern
const result = await fetchUser('123');
if (!result.ok) {
logger.error('Failed to fetch user', result.error);
return;
}
const user = result.value; // Type-safe!
```
Standard pattern across codebase
</Good>
#### Documentation Standards
<Good>
```typescript
/**
* Retries an async operation with exponential backoff.
*
* Why: Network requests fail temporarily; retrying improves reliability
* When to use: External API calls, database operations
* When not to use: User input validation, internal function calls
*
* @example
* const result = await retry(
* () => fetch('https://api.example.com/data'),
* { maxAttempts: 3, baseDelay: 1000 }
* );
*/
const retry = async <T>(
  operation: () => Promise<T>,
  options: RetryOptions
): Promise<T> => {
  // A minimal sketch of the documented behavior (the original left the body elided)
  let lastError: unknown;
  for (let attempt = 1; attempt <= options.maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (err) {
      lastError = err;
      if (attempt < options.maxAttempts) {
        // Exponential backoff: wait baseDelay, then 2x, 4x, ...
        await new Promise(resolve => setTimeout(resolve, options.baseDelay * 2 ** (attempt - 1)));
      }
    }
  }
  throw lastError;
};

type RetryOptions = { maxAttempts: number; baseDelay: number }; // shape assumed by the sketch above
```
Documents why, when, and how
</Good>
#### In Practice
**Before adding new patterns:**
- Search codebase for similar problems solved
- Check CLAUDE.md for project conventions
- Discuss with team if breaking from pattern
- Update docs when introducing new pattern
**When writing code:**
- Match existing file structure
- Use same naming conventions
- Follow same error handling approach
- Import from same locations
**When reviewing:**
- Check consistency with existing code
- Point to examples in codebase
- Suggest aligning with standards
- Update CLAUDE.md if new standard emerges
### 4. Just-In-Time (JIT)
Build what's needed now. No more, no less. Avoid premature optimization and over-engineering.
#### Principles
**YAGNI (You Aren't Gonna Need It):**
- Implement only current requirements
- No "just in case" features
- No "we might need this later" code
- Delete speculation
**Simplest thing that works:**
- Start with straightforward solution
- Add complexity only when needed
- Refactor when requirements change
- Don't anticipate future needs
**Optimize when measured:**
- No premature optimization
- Profile before optimizing
- Measure impact of changes
- Accept "good enough" performance
#### YAGNI in Action
<Good>
```typescript
// Current requirement: Log errors to console
const logError = (error: Error) => {
console.error(error.message);
};
```
Simple, meets current need
</Good>
<Bad>
```typescript
// Over-engineered for "future needs"
interface LogTransport {
write(level: LogLevel, message: string, meta?: LogMetadata): Promise<void>;
}
class ConsoleTransport implements LogTransport { /* ... */ }
class FileTransport implements LogTransport { /* ... */ }
class RemoteTransport implements LogTransport { /* ... */ }
class Logger {
private transports: LogTransport[] = [];
private queue: LogEntry[] = [];
private rateLimiter: RateLimiter;
private formatter: LogFormatter;
// 200 lines of code for "maybe we'll need it"
}
const logError = (error: Error) => {
Logger.getInstance().log('error', error.message);
};
```
Building for imaginary future requirements
</Bad>
**When to add complexity:**
- Current requirement demands it
- Pain points identified through use
- Measured performance issues
- Multiple use cases emerged
<Good>
```typescript
// Start simple
const formatCurrency = (amount: number): string => {
return `$${amount.toFixed(2)}`;
};
// Requirement evolves: support multiple currencies
const formatCurrency = (amount: number, currency: 'USD' | 'EUR' | 'GBP'): string => {
  const symbols = { USD: '$', EUR: '€', GBP: '£' };
  return `${symbols[currency]}${amount.toFixed(2)}`;
};
// Requirement evolves: support localization
const formatCurrency = (amount: number, locale: string): string => {
return new Intl.NumberFormat(locale, {
style: 'currency',
currency: locale === 'en-US' ? 'USD' : 'EUR',
}).format(amount);
};
```
Complexity added only when needed
</Good>
#### Premature Abstraction
<Bad>
```typescript
// One use case, but building generic framework
abstract class BaseCRUDService<T> {
abstract getAll(): Promise<T[]>;
abstract getById(id: string): Promise<T>;
abstract create(data: Partial<T>): Promise<T>;
abstract update(id: string, data: Partial<T>): Promise<T>;
abstract delete(id: string): Promise<void>;
}
class GenericRepository<T> { /* 300 lines */ }
class QueryBuilder<T> { /* 200 lines */ }
// ... building entire ORM for single table
```
Massive abstraction for uncertain future
</Bad>
<Good>
```typescript
// Simple functions for current needs
const getUsers = async (): Promise<User[]> => {
return db.query('SELECT * FROM users');
};
const getUserById = async (id: string): Promise<User | null> => {
return db.query('SELECT * FROM users WHERE id = $1', [id]);
};
// When pattern emerges across multiple entities, then abstract
```
Abstract only when pattern proven across 3+ cases
</Good>
#### Performance Optimization
<Good>
```typescript
// Current: Simple approach
const filterActiveUsers = (users: User[]): User[] => {
return users.filter(user => user.isActive);
};
// Benchmark shows: 50ms for 1000 users (acceptable)
// ✓ Ship it, no optimization needed
// Later: After profiling shows this is bottleneck
// Then optimize with indexed lookup or caching
```
Optimize based on measurement, not assumptions
</Good>
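A number like the "50ms for 1000 users" above can come from a very small harness; here is a sketch using `performance.now()` (the `generateTestUsers` fixture is hypothetical).
```typescript
import { performance } from 'node:perf_hooks';

declare function generateTestUsers(count: number): User[]; // hypothetical test-data helper

const users = generateTestUsers(1000);
const start = performance.now();
for (let i = 0; i < 100; i++) {
  filterActiveUsers(users);
}
const elapsedMs = performance.now() - start;
console.log(`avg ${(elapsedMs / 100).toFixed(2)}ms per call`);
// Only if this number is actually a problem does optimization work start.
```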
<Bad>
```typescript
// Premature optimization
const filterActiveUsers = (users: User[]): User[] => {
// "This might be slow, so let's cache and index"
const cache = new WeakMap();
const indexed = buildBTreeIndex(users, 'isActive');
// 100 lines of optimization code
// Adds complexity, harder to maintain
// No evidence it was needed
};
```
Complex solution for unmeasured problem
</Bad>
#### In Practice
**When implementing:**
- Solve the immediate problem
- Use straightforward approach
- Resist "what if" thinking
- Delete speculative code
**When optimizing:**
- Profile first, optimize second
- Measure before and after
- Document why optimization needed
- Keep the simple version in tests as a reference (see the sketch below)
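A sketch of that last point, assuming a Jest-style test runner: keep the obvious implementation as an oracle and assert that the optimized one agrees with it (`filterActiveUsersFast` and `sampleUsers` are illustrative names).
```typescript
// The straightforward version stays as the source of truth.
const filterActiveUsersSimple = (users: User[]): User[] =>
  users.filter(user => user.isActive);

// The optimized version exists only because profiling justified it (body omitted here).
declare const filterActiveUsersFast: (users: User[]) => User[];
declare const sampleUsers: User[]; // fixture provided elsewhere

test('optimized filter matches the simple reference', () => {
  expect(filterActiveUsersFast(sampleUsers)).toEqual(filterActiveUsersSimple(sampleUsers));
});
```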
**When abstracting:**
- Wait for 3+ similar cases (Rule of Three)
- Make abstraction as simple as possible
- Prefer duplication over wrong abstraction
- Refactor when pattern clear
## Integration with Commands
The Kaizen skill guides how you work. The commands provide structured analysis:
- **`/why`**: Root cause analysis (5 Whys)
- **`/cause-and-effect`**: Multi-factor analysis (Fishbone)
- **`/plan-do-check-act`**: Iterative improvement cycles
- **`/analyse-problem`**: Comprehensive documentation (A3)
- **`/analyse`**: Smart method selection (Gemba/VSM/Muda)
Use the commands for structured problem-solving; apply the skill in day-to-day development.
## Red Flags
**Violating Continuous Improvement:**
- "I'll refactor it later" (never happens)
- Leaving code worse than you found it
- Big bang rewrites instead of incremental
**Violating Poka-Yoke:**
- "Users should just be careful"
- Validation after use instead of before
- Optional config with no validation
**Violating Standardized Work:**
- "I prefer to do it my way"
- Not checking existing patterns
- Ignoring project conventions
**Violating Just-In-Time:**
- "We might need this someday"
- Building frameworks before using them
- Optimizing without measuring
## Remember
**Kaizen is about:**
- Small improvements continuously
- Preventing errors by design
- Following proven patterns
- Building only what's needed
**Not about:**
- Perfection on first try
- Massive refactoring projects
- Clever abstractions
- Premature optimization
**Mindset:** Good enough today, better tomorrow. Repeat.