Initial commit

2025-11-30 08:45:31 +08:00
commit ca9b85ccda
35 changed files with 10784 additions and 0 deletions
--- a/skills/spec-author/guides/design-document.md
+++ b/skills/spec-author/guides/design-document.md
@@ -0,0 +1,503 @@
+# How to Create a Design Document
+
+Design Documents provide the detailed architectural and technical design for a system, component, or significant feature. They answer "How will we build this?" after business and technical requirements have been defined.
+
+## Quick Start
+
+```bash
+# 1. Create a new design document
+scripts/generate-spec.sh design-document des-001-descriptive-slug
+
+# 2. Open the generated file
+# (The file will be created at: docs/specs/design-document/des-001-descriptive-slug.md)
+
+# 3. Fill in the sections, then validate:
+scripts/validate-spec.sh docs/specs/design-document/des-001-descriptive-slug.md
+
+# 4. Fix issues and check completeness:
+scripts/check-completeness.sh docs/specs/design-document/des-001-descriptive-slug.md
+```
+
+## When to Write a Design Document
+
+Use a Design Document when you need to:
+- Define system architecture or redesign existing components
+- Document major technical decisions and trade-offs
+- Provide a blueprint for implementation teams
+- Enable architectural review before coding begins
+- Create shared understanding of complex systems
+
+## Research Phase
+
+### 1. Research Related Specifications
+Find upstream specs that inform your design:
+
+```bash
+# Find related business requirements
+grep -ri "brd" docs/specs/ --include="*.md"
+
+# Find related technical requirements
+grep -ri "prd\|technical" docs/specs/ --include="*.md"
+
+# Find existing design patterns or similar designs
+grep -ri "design\|architecture" docs/specs/ --include="*.md"
+```
+
+### 2. Research External Documentation
+Research existing architectures and patterns:
+
+- Look up similar systems: "How do other companies solve this problem?"
+- Research technologies and frameworks you're planning to use
+- Review relevant design patterns or architecture styles
+- Check for security, performance, or scalability best practices
+
+Use tools to fetch external docs:
+```bash
+# Research the latest on your chosen technologies
+# Example: Research distributed system patterns
+# Example: Research microservices architecture best practices
+```
+
+### 3. Review Existing Codebase & Architecture
+- What patterns does your codebase already follow?
+- What technologies are you already using?
+- How are similar features currently implemented?
+- What architectural decisions have been made previously?
+
+Ask: "Are we extending existing patterns or introducing new ones?"
+
+### 4. Understand Constraints
+- What are the performance requirements? (latency, throughput)
+- What scalability targets exist?
+- What security constraints apply?
+- What infrastructure or budget constraints apply?
+- How much experience does the team have with the chosen technologies?
+
+## Structure & Content Guide
+
+### Title & Metadata
+- **Title**: "Microservices Architecture for User Service" or similar
+- **Type**: Architecture | System Design | RFC | Technology Choice
+- **Status**: Draft | Under Review | Accepted | Rejected
+- **Version**: 1.0.0 (increment for significant revisions)
+
+### Executive Summary
+Write 3-4 sentences that answer:
+- What problem does this solve?
+- What's the proposed solution?
+- What are the key tradeoffs?
+
+Example:
+```
+This design proposes a microservices architecture to scale our user service.
+We'll split user management, authentication, and profile service into separate
+deployable services. This trades some operational complexity for independent
+scaling and development velocity. Key trade-off: eventual consistency vs.
+immediate consistency in cross-service operations.
+```
+
+### Problem Statement
+Describe the current state and limitations:
+
+```
+Current monolithic architecture handles all user operations in a single service,
+causing:
+- Bottleneck: User service becomes bottleneck for entire system
+- Scaling: Must scale entire service even if only auth needs capacity
+- Deployment: Changes in one area risk entire user service
+- Velocity: Teams block each other during development
+
+This design solves these issues by enabling independent scaling and deployment.
+```
+
+### Goals & Success Criteria
+
+**Primary Goals** (3-5 goals)
+- Increase deployment frequency to support multiple deployments per day
+- Enable independent scaling of auth and profile services
+- Reduce time to market for new user features
+
+**Success Criteria** (specific, measurable)
+1. Auth service can scale independently to handle 10k requests/sec
+2. Profile service deployment doesn't impact auth service
+3. System reduces MTTR for user service incidents by 50%
+4. Teams can deploy independently without coordination
+5. P95 latency remains under 200ms across service boundaries
+
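A criterion such as the P95 latency bound above can be checked mechanically once latency samples exist. The following is a minimal sketch, assuming a nearest-rank percentile and made-up sample data; `percentile` and `meets_latency_criterion` are illustrative names, not part of the spec tooling.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_latency_criterion(samples_ms, pct=95, limit_ms=200):
    """True when the chosen percentile stays under the limit (200 ms mirrors criterion 5)."""
    return percentile(samples_ms, pct) < limit_ms


# Hypothetical measurements in milliseconds, for illustration only.
latencies_ms = [120, 95, 180, 210, 150, 130, 90, 110, 170, 140,
                100, 105, 115, 125, 135, 145, 155, 165, 175, 185]
```

Calling `meets_latency_criterion(latencies_ms)` on this sample set reports whether the criterion holds; a single slow outlier does not fail a P95 check, which is one reason to state the percentile explicitly.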
+### Context & Background
+Explain why this design is needed now:
+
+```
+Over the past 6 months, we've experienced:
+- Auth service saturated at 5k requests/sec during peak hours
+- Authentication changes blocked by profile service deployments
+- High operational burden managing single monolithic service
+
+Recent customer requests for higher throughput have revealed these bottlenecks.
+This design addresses the most urgent scaling constraint (auth service).
+```
+
+### Proposed Solution
+
+#### High-Level Overview
+Provide a diagram showing major components and data flow:
+
+```
+┌─────────────┐
+│   Client    │
+└──────┬──────┘
+       │
+       ├─→ [API Gateway]
+       │         │
+       ├─→ [Auth Service] - JWT validation, user login
+       │
+       ├─→ [Profile Service] - User profile, preferences
+       │
+       └─→ [Data Layer]
+            ├─ User DB (master)
+            ├─ Cache (Redis)
+            └─ Message Queue (RabbitMQ)
+```
+
+Explain how components interact:
+```
+Client sends request to API Gateway, which routes based on endpoint.
+Auth service handles login/JWT operations. Profile service handles profile
+reads/writes. Both services read user data from the shared database, and
+changes propagate between them through the message queue (eventual consistency).
+```
+
+#### Architecture Components
+For each major component:
+
+**Auth Service**
+- **Purpose**: Handles authentication, token generation, validation
+- **Technology**: Node.js with Express, Redis for session storage
+- **Key Responsibilities**:
+  - User login/logout
+  - JWT token generation and validation
+  - Session management
+  - Password reset flows
+- **Interactions**: Calls User DB for credential validation, publishes events to queue
+
+**Profile Service**
+- **Purpose**: Manages user profile data and preferences
+- **Technology**: Node.js with Express, PostgreSQL for user data
+- **Key Responsibilities**:
+  - Read/write user profile information
+  - Manage user preferences
+  - Handle profile search and filtering
+- **Interactions**: Consumes user events from queue, calls shared User DB
+
+**API Gateway**
+- **Purpose**: Single entry point, routing, authentication enforcement
+- **Technology**: Nginx or API Gateway (e.g., Kong)
+- **Key Responsibilities**:
+  - Route requests to appropriate service
+  - Enforce API authentication
+  - Rate limiting
+  - Request/response transformation
+- **Interactions**: Routes to Auth and Profile services
+
+### Design Decisions
+
+For each significant decision, document:
+
+#### Decision 1: Microservices vs. Monolith
+- **Decision**: Adopt microservices architecture
+- **Rationale**:
+  - Independent scaling needed (auth bottleneck at 5k req/sec)
+  - Team velocity: Can deploy auth changes independently
+  - Loose coupling enables faster iteration
+- **Alternatives Considered**:
+  - Monolith optimization: Caching, database optimization (rejected: can't solve scaling bottleneck)
+  - Modular monolith: Improves structure but doesn't enable independent scaling
+- **Impact**:
+  - Gain: Independent scaling, deployment, team velocity
+  - Accept: Distributed system complexity, operational overhead, eventual consistency
+
+#### Decision 2: Synchronous vs. Asynchronous Communication
+- **Decision**: Use message queue for eventual consistency
+- **Rationale**:
+  - Profile updates don't need to be immediately consistent across auth service
+  - Reduces coupling: Auth service doesn't wait for profile service
+  - Improves resilience: Profile service failure doesn't affect auth
+- **Alternatives Considered**:
+  - Synchronous REST calls: Simpler but tight coupling, availability issues
+  - Event sourcing: Over-engineered for current needs
+- **Impact**:
+  - Gain: Resilience, reduced coupling, independent scaling
+  - Accept: Eventual consistency, operational complexity (message queue)
+
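When a decision accepts eventual consistency, it can help to show reviewers what that means in practice. The sketch below is an assumption-laden illustration: an in-memory `queue.Queue` stands in for the real broker, and `auth_change_email`, `profile_consume_pending`, and the event shape are hypothetical.

```python
import queue

events = queue.Queue()  # stand-in for the message broker (assumption)
profile_view = {}       # Profile service's local copy of user emails


def auth_change_email(user_id, email):
    """Auth side: publish the change and return without waiting for consumers."""
    events.put({"type": "user.email_changed", "user_id": user_id, "email": email})
    return "accepted"


def profile_consume_pending():
    """Profile side: apply every queued event and return how many were handled."""
    handled = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return handled
        profile_view[event["user_id"]] = event["email"]
        handled += 1
```

Between the publish and the next consume, `profile_view` is stale. That window is the "Accept: Eventual consistency" line of the decision made concrete.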
+### Technology Stack
+
+**Language & Runtime**
+- Node.js 18 LTS - Rationale: Existing expertise, good async support
+- Express - Lightweight, flexible framework the team knows
+
+**Data Layer**
+- PostgreSQL (primary database) - Reliable, ACID transactions for user data
+- Redis (cache layer) - Session storage, auth token cache
+
+**Infrastructure**
+- Kubernetes for orchestration - Running multiple services at scale
+- Docker for containerization - Consistent deployment
+
+**Key Libraries/Frameworks**
+- Express (v4.18) - HTTP framework
+- jsonwebtoken - JWT token handling
+- @aws-sdk/* (AWS SDK for JavaScript v3 packages) - For future AWS integration
+- Jest - Testing framework
+
+### Data Model & Storage
+
+**Storage Strategy**
+- **Primary Database**: PostgreSQL with user table containing:
+  - id, email, password_hash, created_at, updated_at
+  - One-to-many relationship with user_preferences
+- **Caching**: Redis stores JWT token metadata and session info with 1-hour TTL
+- **Data Retention**: User data retained indefinitely; sessions cleaned up after TTL
+
+**Schema Overview**
+```
+Users Table:
+- id (primary key)
+- email (unique index)
+- password_hash
+- created_at
+- updated_at
+
+User Preferences:
+- id
+- user_id (foreign key)
+- key (e.g., theme, language)
+- value
+```
+
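A schema overview is easier to review when the constraints are executable. Here is a minimal sketch of the two tables above, using an in-memory SQLite database as a stand-in for PostgreSQL; the column types, `NOT NULL` choices, and the `create_schema` name are assumptions, since the overview only lists column names.

```python
import sqlite3


def create_schema(conn):
    """Create the users and user_preferences tables from the schema overview."""
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
    conn.executescript(
        """
        CREATE TABLE users (
            id            INTEGER PRIMARY KEY,
            email         TEXT NOT NULL UNIQUE,      -- unique index on email
            password_hash TEXT NOT NULL,
            created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            updated_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        );
        CREATE TABLE user_preferences (
            id      INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),  -- one user, many preferences
            "key"   TEXT NOT NULL,                          -- e.g., theme, language
            value   TEXT
        );
        """
    )
```

Running `create_schema(sqlite3.connect(":memory:"))` gives reviewers something to poke at: duplicate emails and preferences for unknown users are both rejected.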
+### API & Integration Points
+
+**External Dependencies**
+- Integrates with existing Payment Service for billing
+- Consumes events from Billing Service (subscription changes)
+- Publishes user events to event bus for downstream services
+
+**Key Endpoints** (reference full API spec):
+- POST /auth/login - User login
+- POST /auth/logout - User logout
+- GET /profile - Fetch user profile
+- PUT /profile - Update user profile
+
+(See [API-001] for complete endpoint specifications)
+
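The endpoint list implies a routing rule at the gateway. As a rough sketch (the `ROUTES` table and `route` function are hypothetical, and a real gateway such as Nginx or Kong would express this in its own configuration), prefix-based routing looks like this:

```python
# Path prefix -> backing service, derived from the key endpoints listed above.
ROUTES = {
    "/auth": "auth-service",
    "/profile": "profile-service",
}


def route(path):
    """Return the service that owns the path, or None when no prefix matches."""
    for prefix, service in ROUTES.items():
        # Match the prefix itself or anything below it, but not "/authx".
        if path == prefix or path.startswith(prefix + "/"):
            return service
    return None
```

For example, `route("/auth/login")` resolves to the auth service while `route("/billing")` returns `None`, which the gateway would turn into a 404.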
+### Trade-offs
+
+**Accepting**
+- Operational complexity: Must manage multiple services, deployments, monitoring
+- Eventual consistency: Changes propagate through message queue, not immediate
+- Debugging complexity: Cross-service issues harder to debug
+
+**Gaining**
+- Independent scaling: Auth service can scale without scaling profile service
+- Team autonomy: Teams can deploy independently without coordination
+- Failure isolation: Auth service failure doesn't take down profile service
+- Development velocity: Faster iteration, less blocking
+
+### Implementation
+
+**Approach**: Phased migration - Extract services incrementally without big-bang rewrite
+
+**Phases**:
+1. **Phase 1 (Week 1-2)**: Extract Auth Service
+   - Deliverables: Auth service running in parallel, API Gateway routing auth requests
+   - Testing: Canary traffic (10%) to new service
+
+2. **Phase 2 (Week 3-4)**: Migrate Auth Traffic
+   - Deliverables: 100% auth traffic on new service, rollback plan tested
+   - Verification: Auth latency, error rates compared to baseline
+
+3. **Phase 3 (Week 5-6)**: Extract Profile Service
+   - Deliverables: Profile service independent, event queue running
+   - Testing: Data consistency verification across message queue
+
+**Migration Strategy**:
+- Run both monolith and microservices in parallel initially
+- Use API Gateway to route traffic, allow A/B testing
+- Maintain ability to rollback quickly if issues arise
+- Monitor closely for latency/error rate increases
+
+(See [PLN-001] for detailed implementation roadmap)
+
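The canary step in Phase 1 depends on splitting traffic in a stable way. One possible approach, sketched here under the assumption that requests carry a user ID, is to hash the ID into 100 buckets so each user consistently lands on the same side; `route_auth_request` and the backend names are illustrative.

```python
import hashlib


def route_auth_request(user_id: str, canary_percent: int = 10) -> str:
    """Send a fixed share of users to the new service, the rest to the monolith."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return "auth-service" if bucket < canary_percent else "monolith"
```

Raising `canary_percent` from 10 to 100 covers the Phase 2 migration, and setting it back to 0 is the rollback path.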
+### Performance & Scalability
+
+**Performance Targets**
+- **Latency**: Auth service p95 < 100ms, p99 < 200ms
+- **Throughput**: Auth service handles 10k requests/second
+- **Availability**: 99.9% uptime for auth service
+
+**Scalability Strategy**
+- **Scaling Approach**: Horizontal - Add more auth service instances behind load balancer
+- **Bottlenecks**: Database connection pool size (limit 100 connections per service instance)
+  - Mitigation: PgBouncer connection pooling, read replicas for read operations
+- **Auto-scaling**: Kubernetes HPA scales auth service from 3 to 20 replicas based on CPU
+
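It is worth sanity-checking the scaling numbers before review. The sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, clamped to the 3 to 20 range above. The 70% CPU target is an assumption (the example does not state one), and the real controller also applies a tolerance and stabilization behavior that this omits. The second function shows why the connection pool is listed as a bottleneck.

```python
import math


def desired_replicas(current_replicas, current_cpu_pct, target_cpu_pct=70,
                     min_replicas=3, max_replicas=20):
    """Replica count from the HPA ratio formula, clamped to the configured bounds."""
    raw = math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)
    return max(min_replicas, min(max_replicas, raw))


def worst_case_db_connections(replicas, pool_size_per_instance=100):
    """Upper bound on database connections if every instance fills its pool."""
    return replicas * pool_size_per_instance
```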
+**Monitoring & Observability**
+- **Metrics**: Request latency (p50, p95, p99), error rate, service availability
+- **Alerting**: Alert if auth latency p95 > 150ms, error rate > 0.5%
+- **Logging**: Structured JSON logs with request ID for tracing across services
+
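To make the logging and alerting bullets concrete, here is a small sketch: one JSON object per log line with a shared request ID, plus a check against the alert thresholds above. The field names and the `new_request_id`, `log_line`, and `should_alert` helpers are assumptions chosen for illustration.

```python
import json
import uuid
from datetime import datetime, timezone


def new_request_id():
    """Generate an ID at the edge and pass it along with every downstream call."""
    return str(uuid.uuid4())


def log_line(service, message, request_id, **fields):
    """Render one structured log record as a single JSON line."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service": service,
        "request_id": request_id,
        "message": message,
        **fields,
    }
    return json.dumps(record, sort_keys=True)


def should_alert(p95_ms, error_rate, p95_limit_ms=150, error_rate_limit=0.005):
    """Alert when p95 latency exceeds 150 ms or the error rate exceeds 0.5%."""
    return p95_ms > p95_limit_ms or error_rate > error_rate_limit
```

Filtering logs from both services on the same `request_id` reconstructs a single request's path, which offsets some of the cross-service debugging cost listed under trade-offs.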
+### Security
+
+**Authentication**
+- JWT tokens issued by Auth Service, validated by API Gateway
+- Token expiration: 1 hour, refresh tokens for extended sessions
+
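If reviewers are unfamiliar with how a signed token with a 1-hour expiry works, a short illustration can help. The following is only a sketch of the HS256 mechanics using the Python standard library; the example stack above would use `jsonwebtoken` in Node.js, production code should rely on a maintained JWT library, and `issue_token` and `validate_token` are hypothetical names.

```python
import base64
import hashlib
import hmac
import json
import time

TOKEN_TTL_SECONDS = 3600  # 1-hour expiration, as stated above


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def _sign(signing_input: str, secret: bytes) -> str:
    return _b64url(hmac.new(secret, signing_input.encode("utf-8"), hashlib.sha256).digest())


def issue_token(user_id, secret: bytes, now=None):
    """Build header.payload.signature with an exp claim one hour after issue."""
    now = int(time.time()) if now is None else now
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode("utf-8"))
    claims = {"sub": user_id, "iat": now, "exp": now + TOKEN_TTL_SECONDS}
    payload = _b64url(json.dumps(claims).encode("utf-8"))
    return f"{header}.{payload}.{_sign(header + '.' + payload, secret)}"


def validate_token(token, secret: bytes, now=None):
    """Return the claims if the signature matches and the token is unexpired, else None."""
    now = int(time.time()) if now is None else now
    try:
        header, payload, signature = token.split(".")
    except ValueError:
        return None  # not three dot-separated parts
    expected = _sign(header + "." + payload, secret)
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    claims = json.loads(_b64url_decode(payload))
    return claims if now < claims["exp"] else None
```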
+**Authorization**
+- Role-based access control (RBAC) enforced at API Gateway
+- Profile service doesn't repeat auth checks (trusts gateway)
+
+**Data Protection**
+- **Encryption at Rest**: PostgreSQL database encryption enabled
+- **Encryption in Transit**: TLS 1.3 for all service-to-service communication
+- **PII Handling**: Passwords hashed with bcrypt (cost factor 12)
+
+**Secrets Management**
+- Database credentials stored in Kubernetes secrets
+- JWT signing key rotated quarterly
+- Environment-based secret injection at runtime
+
+**Compliance**
+- GDPR: User data can be exported via profile service
+- SOC2: Audit logging enabled for user data access
+
+### Dependencies & Assumptions
+
+**Dependencies**
+- PostgreSQL database must be highly available (RTO 1 hour)
+- Redis cache can tolerate data loss (non-critical)
+- API Gateway (Nginx) must be deployed and operational
+- Message queue (RabbitMQ) must be running
+
+**Assumptions**
+- Auth service will handle up to 10k requests/second (based on growth projections)
+- User data size remains < 100GB (current: 5GB)
+- Network latency between services < 10ms (co-located data center)
+
+### Open Questions
+
+- [ ] Should we use gRPC for service-to-service communication instead of REST?
+  - **Status**: Under investigation - benchmarking against REST
+- [ ] How do we handle shared user data updates if both services write to DB?
+  - **Status**: Deferred to Phase 3 - will use event sourcing pattern
+- [ ] Which message queue should we use (RabbitMQ vs. Kafka)?
+  - **Status**: RabbitMQ chosen, but revisit if we need audit trail of all changes
+
+### Approvals
+
+**Technical Review**
+- Lead Backend Engineer - TBD
+
+**Architecture Review**
+- VP Engineering - TBD
+
+**Security Review**
+- Security Team - TBD
+
+**Approved By**
+- TBD
+
+## Writing Tips
+
+### Use Diagrams Effectively
+- ASCII art is fine for design docs (easy to version control)
+- Show data flow and component interactions
+- Label arrows with what data/requests are flowing
+
+### Be Explicit About Trade-offs
+- Don't just say "microservices is better"
+- Say "We're trading operational complexity for independent scaling because this addresses our 5k req/sec bottleneck"
+
+### Link to Other Specs
+- Reference related business requirements: `[BRD-001]`
+- Reference technical requirements: `[PRD-001]`
+- Reference data models: `[DATA-001]`
+- Reference API contracts: `[API-001]`
+
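Cross-references are only useful if they stay accurate, so it can be handy to list every spec ID a document mentions. A minimal sketch, assuming IDs follow the uppercase-prefix and three-digit pattern shown above (`SPEC_REF` and `find_spec_refs` are illustrative names, not part of the provided scripts):

```python
import re

# Matches bracketed IDs such as [BRD-001] or [API-001] (pattern is an assumption).
SPEC_REF = re.compile(r"\[([A-Z]+-\d{3})\]")


def find_spec_refs(markdown_text):
    """Return the distinct spec IDs referenced in the text, sorted alphabetically."""
    return sorted(set(SPEC_REF.findall(markdown_text)))
```

Running this over a draft and comparing the result with the files under `docs/specs/` is a quick way to spot references to specs that do not exist yet.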
+### Document Rationale
+- Each decision needs a "why"
+- Explain what alternatives were considered and why they were rejected
+- This helps future developers understand the context
+
+### Be Specific About Performance
+- Not: "Must be performant"
+- Yes: "p95 latency under 100ms, p99 under 200ms, supporting 10k requests/second"
+
+### Consider the Whole System
+- Security implications
+- Operational/monitoring requirements
+- Data consistency model
+- Failure modes and recovery
+- Future scalability
+
+## Validation & Fixing Issues
+
+### Run the Validator
+```bash
+scripts/validate-spec.sh docs/specs/design-document/des-001-your-spec.md
+```
+
+### Common Issues & Fixes
+
+**Issue**: "Missing Proposed Solution section"
+- **Fix**: Add detailed architecture components, design decisions, tech stack
+
+**Issue**: "TODO items in Architecture Components (4 items)"
+- **Fix**: Complete descriptions for all components (purpose, technology, responsibilities)
+
+**Issue**: "No Trade-offs documented"
+- **Fix**: Explicitly document what you're accepting and what you're gaining
+
+**Issue**: "Missing Performance & Scalability targets"
+- **Fix**: Add specific latency, throughput, and availability targets
+
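To see why these messages appear, it may help to picture the kind of check involved. The sketch below is a hypothetical illustration and does not describe how `scripts/validate-spec.sh` actually works; the `REQUIRED_SECTIONS` list is taken from the section names in this guide, and the function names are made up.

```python
import re

# Section headings this guide treats as essential (assumed list, for illustration).
REQUIRED_SECTIONS = [
    "Executive Summary",
    "Problem Statement",
    "Goals & Success Criteria",
    "Proposed Solution",
    "Trade-offs",
    "Performance & Scalability",
]


def missing_sections(markdown_text):
    """Return required section names that never appear as a Markdown heading."""
    headings = {
        match.group(1)
        for match in re.finditer(r"^#{1,6}[ \t]+(.+?)[ \t]*$", markdown_text, re.MULTILINE)
    }
    return [name for name in REQUIRED_SECTIONS if name not in headings]


def count_todos(markdown_text):
    """Count leftover TODO markers anywhere in the document."""
    return len(re.findall(r"\bTODO\b", markdown_text))
```

An empty result from `missing_sections` and a zero from `count_todos` are reasonable signals that a draft is ready for the real validator.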
+### Check Completeness
+```bash
+scripts/check-completeness.sh docs/specs/design-document/des-001-your-spec.md
+```
+
+## Decision-Making Framework
+
+As you write the design doc, work through:
+
+1. **Problem**: What are we designing for?
+   - Specific pain points or constraints?
+   - Performance targets, scalability requirements?
+
+2. **Options**: What architectural approaches could work?
+   - Monolith vs. distributed?
+   - Synchronous vs. asynchronous?
+   - Technology choices?
+
+3. **Evaluation**: How do options compare?
+   - Which best addresses the problem?
+   - What are the trade-offs?
+   - What does the team have experience with?
+
+4. **Decision**: Which approach wins and why?
+   - What assumptions must hold?
+   - What trade-offs are we accepting?
+
+5. **Implementation**: How do we build/migrate to this?
+   - Big bang or incremental?
+   - Parallel running period?
+   - Rollback plan?
+
+## Next Steps
+
+1. **Create the spec**: `scripts/generate-spec.sh design-document des-XXX-slug`
+2. **Research**: Find related specs and understand architecture context
+3. **Sketch**: Draw architecture diagrams before writing detailed components
+4. **Fill in sections** using this guide
+5. **Validate**: `scripts/validate-spec.sh docs/specs/design-document/des-XXX-slug.md`
+6. **Get architectural review** before implementation begins
+7. **Update related specs**: Create or update technical requirements and implementation plans