Initial commit
skills/spec-author/guides/design-document.md
# How to Create a Design Document

Design Documents provide the detailed architectural and technical design for a system, component, or significant feature. They answer "How will we build this?" after business and technical requirements have been defined.

## Quick Start

```bash
# 1. Create a new design document
scripts/generate-spec.sh design-document des-001-descriptive-slug

# 2. Open and fill in the file
# (The file will be created at: docs/specs/design-document/des-001-descriptive-slug.md)

# 3. Fill in the sections, then validate:
scripts/validate-spec.sh docs/specs/design-document/des-001-descriptive-slug.md

# 4. Fix issues and check completeness:
scripts/check-completeness.sh docs/specs/design-document/des-001-descriptive-slug.md
```

## When to Write a Design Document

Use a Design Document when you need to:
- Define system architecture or redesign existing components
- Document major technical decisions and trade-offs
- Provide a blueprint for implementation teams
- Enable architectural review before coding begins
- Create shared understanding of complex systems

## Research Phase

### 1. Research Related Specifications
Find upstream specs that inform your design:

```bash
# Find related business requirements (case-insensitive, so "BRD-001" also matches)
grep -ri "brd" docs/specs/ --include="*.md"

# Find related technical requirements
grep -riE "prd|technical" docs/specs/ --include="*.md"

# Find existing design patterns or similar designs
grep -riE "design|architecture" docs/specs/ --include="*.md"
```

### 2. Research External Documentation
Research existing architectures and patterns:

- Look up similar systems: "How do other companies solve this problem?"
- Research technologies and frameworks you're planning to use
- Review relevant design patterns or architecture styles
- Check for security, performance, or scalability best practices

Use tools to fetch external docs:
```bash
# Research the latest on your chosen technologies
# Example: Research distributed system patterns
# Example: Research microservices architecture best practices
```

### 3. Review Existing Codebase & Architecture
- What patterns does your codebase already follow?
- What technologies are you already using?
- How are similar features currently implemented?
- What architectural decisions have been made previously?

Ask: "Are we extending existing patterns or introducing new ones?"

### 4. Understand Constraints
- What are the performance requirements? (latency, throughput)
- What scalability targets exist?
- What security constraints apply?
- What infrastructure or budget constraints apply?
- How much expertise does the team have with the chosen technologies?

## Structure & Content Guide

### Title & Metadata
- **Title**: "Microservices Architecture for User Service" or similar
- **Type**: Architecture | System Design | RFC | Technology Choice
- **Status**: Draft | Under Review | Accepted | Rejected
- **Version**: 1.0.0 (increment for significant revisions)

### Executive Summary
Write 3-4 sentences that answer:
- What problem does this solve?
- What's the proposed solution?
- What are the key trade-offs?

Example:
```
This design proposes a microservices architecture to scale our user service.
We'll split user management, authentication, and profile service into separate
deployable services. This trades some operational complexity for independent
scaling and development velocity. Key trade-off: eventual consistency vs.
immediate consistency in cross-service operations.
```

### Problem Statement
Describe the current state and limitations:

```
Current monolithic architecture handles all user operations in a single service,
causing:
- Bottleneck: User service becomes bottleneck for entire system
- Scaling: Must scale entire service even if only auth needs capacity
- Deployment: Changes in one area risk entire user service
- Velocity: Teams block each other during development

This design solves these issues by enabling independent scaling and deployment.
```

### Goals & Success Criteria

**Primary Goals** (3-5 goals)
- Increase deployment frequency to support multiple deployments per day
- Enable independent scaling of auth and profile services
- Reduce time to market for new user features

**Success Criteria** (specific, measurable)
1. Auth service can scale independently to handle 10k requests/sec
2. Profile service deployment doesn't impact auth service
3. System reduces MTTR for user service incidents by 50%
4. Teams can deploy independently without coordination
5. P95 latency remains under 200ms across service boundaries
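
A criterion such as the P95 latency target is only measurable if someone can compute it from real samples. The following is a minimal sketch, assuming latency samples are available in milliseconds; the helper names `percentile` and `meets_latency_target` are hypothetical, and the nearest-rank method used here may differ slightly from what a given monitoring system reports.

```python
import math


def percentile(samples, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    # 1-based rank of the smallest value that covers p percent of the samples.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def meets_latency_target(samples_ms, p=95, limit_ms=200):
    """True if the p-th percentile latency stays under the limit."""
    return percentile(samples_ms, p) < limit_ms


if __name__ == "__main__":
    # Hypothetical samples: mostly fast requests with a slow tail.
    samples = [40] * 90 + [150] * 8 + [450] * 2
    print(percentile(samples, 95), meets_latency_target(samples))
```

Writing the check down this way also forces the document to say which percentile, which limit, and which traffic the criterion applies to.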

### Context & Background
Explain why this work is needed now:

```
Over the past 6 months, we've experienced:
- Auth service saturated at 5k requests/sec during peak hours
- Authentication changes blocked by profile service deployments
- High operational burden managing single monolithic service

Recent customer requests for higher throughput have revealed these bottlenecks.
This design addresses the most urgent scaling constraint (auth service).
```

### Proposed Solution

#### High-Level Overview
Provide a diagram showing major components and data flow:

```
┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ├─→ [API Gateway]
       │        │
       ├─→ [Auth Service] - JWT validation, user login
       │
       ├─→ [Profile Service] - User profile, preferences
       │
       └─→ [Data Layer]
            ├─ User DB (master)
            ├─ Cache (Redis)
            └─ Message Queue (RabbitMQ)
```

Explain how components interact:
```
Client sends request to API Gateway, which routes based on endpoint.
Auth service handles login/JWT operations. Profile service handles profile
reads/writes. Both services consume user data from shared database with
eventual consistency via message queue.
```
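
The phrase "routes based on endpoint" can be made concrete with a small routing table. This is a sketch under assumptions: the path prefixes mirror the example endpoints used elsewhere in this guide (`/auth/...`, `/profile`), while the service names and the `route` helper are hypothetical. A real gateway such as Nginx or Kong would express the same idea in its own configuration.

```python
# Hypothetical routing table: path prefix -> backend service name.
ROUTES = {
    "/auth": "auth-service",
    "/profile": "profile-service",
}


def route(path):
    """Return the backend for a request path, or None if no prefix matches."""
    for prefix, service in ROUTES.items():
        # Match the prefix itself or anything below it, but not "/authx".
        if path == prefix or path.startswith(prefix + "/"):
            return service
    return None


if __name__ == "__main__":
    for path in ("/auth/login", "/profile", "/unknown"):
        print(path, "->", route(path))
```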

#### Architecture Components
For each major component:

**Auth Service**
- **Purpose**: Handles authentication, token generation, validation
- **Technology**: Node.js with Express, Redis for session storage
- **Key Responsibilities**:
  - User login/logout
  - JWT token generation and validation
  - Session management
  - Password reset flows
- **Interactions**: Calls User DB for credential validation, publishes events to queue

**Profile Service**
- **Purpose**: Manages user profile data and preferences
- **Technology**: Node.js with Express, PostgreSQL for user data
- **Key Responsibilities**:
  - Read/write user profile information
  - Manage user preferences
  - Handle profile search and filtering
- **Interactions**: Consumes user events from queue, calls shared User DB

**API Gateway**
- **Purpose**: Single entry point, routing, authentication enforcement
- **Technology**: Nginx or a dedicated API gateway (e.g., Kong)
- **Key Responsibilities**:
  - Route requests to appropriate service
  - Enforce API authentication
  - Rate limiting
  - Request/response transformation
- **Interactions**: Routes to Auth and Profile services

### Design Decisions

For each significant decision, document:

#### Decision 1: Microservices vs. Monolith
- **Decision**: Adopt microservices architecture
- **Rationale**:
  - Independent scaling needed (auth bottleneck at 5k req/sec)
  - Team velocity: Can deploy auth changes independently
  - Loose coupling enables faster iteration
- **Alternatives Considered**:
  - Monolith optimization: Caching, database optimization (rejected: can't solve scaling bottleneck)
  - Modular monolith: Improves structure but doesn't enable independent scaling
- **Impact**:
  - Gain: Independent scaling, deployment, team velocity
  - Accept: Distributed system complexity, operational overhead, eventual consistency

#### Decision 2: Synchronous vs. Asynchronous Communication
- **Decision**: Use asynchronous communication via a message queue, accepting eventual consistency
- **Rationale**:
  - Profile updates don't need to be immediately visible to the auth service
  - Reduces coupling: Auth service doesn't wait for profile service
  - Improves resilience: Profile service failure doesn't affect auth
- **Alternatives Considered**:
  - Synchronous REST calls: Simpler but tight coupling, availability issues
  - Event sourcing: Over-engineered for current needs
- **Impact**:
  - Gain: Resilience, reduced coupling, independent scaling
  - Accept: Eventual consistency, operational complexity (message queue)
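
A decision like Decision 2 is easier to review when the consistency window is visible. The sketch below is an illustration only: `MessageQueue` is an in-memory stand-in for a broker such as RabbitMQ, and the class names, the `email_changed` event, and its fields are all hypothetical. It shows the auth side publishing and returning immediately, and the profile side catching up only when it consumes the event.

```python
from collections import deque


class MessageQueue:
    """In-memory stand-in for a message broker (illustration only)."""

    def __init__(self):
        self._events = deque()

    def publish(self, event):
        self._events.append(event)

    def drain(self):
        """Yield pending events in the order they were published."""
        while self._events:
            yield self._events.popleft()


class AuthService:
    def __init__(self, queue):
        self._queue = queue

    def change_email(self, user_id, email):
        # Publish and return immediately; there is no call to the profile service.
        self._queue.publish({"type": "email_changed", "user_id": user_id, "email": email})


class ProfileService:
    def __init__(self):
        self.emails = {}

    def handle(self, event):
        if event["type"] == "email_changed":
            self.emails[event["user_id"]] = event["email"]


if __name__ == "__main__":
    queue = MessageQueue()
    auth, profile = AuthService(queue), ProfileService()
    auth.change_email(1, "new@example.com")
    print(profile.emails)  # the profile view has not caught up yet
    for event in queue.drain():
        profile.handle(event)
    print(profile.emails)  # the change is visible once the event is consumed
```

The gap between the two prints is the "Accept: Eventual consistency" line in practice, and the design document should state how long that gap is allowed to be.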

### Technology Stack

**Language & Runtime**
- Node.js 18 LTS - Rationale: Existing expertise, good async support
- Express - Lightweight, flexible framework the team knows

**Data Layer**
- PostgreSQL (primary database) - Reliable, ACID transactions for user data
- Redis (cache layer) - Session storage, auth token cache

**Infrastructure**
- Kubernetes for orchestration - Running multiple services at scale
- Docker for containerization - Consistent deployment

**Key Libraries/Frameworks**
- Express (v4.18) - HTTP framework
- jsonwebtoken - JWT token handling
- @aws-sdk - AWS SDK for future integration
- Jest - Testing framework

### Data Model & Storage

**Storage Strategy**
- **Primary Database**: PostgreSQL with user table containing:
  - id, email, password_hash, created_at, updated_at
  - One-to-many relationship with user_preferences
- **Caching**: Redis stores JWT token metadata and session info with 1-hour TTL
- **Data Retention**: User data retained indefinitely; sessions cleaned up after TTL

**Schema Overview**
```
Users Table:
- id (primary key)
- email (unique index)
- password_hash
- created_at
- updated_at

User Preferences:
- id
- user_id (foreign key)
- key (e.g., theme, language)
- value
```
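
A schema overview can be sanity-checked by turning it into runnable DDL. The sketch below assumes SQLite (from the Python standard library) as a stand-in for PostgreSQL so that it runs anywhere; the column types, defaults, and the `create_database` helper are assumptions, and real PostgreSQL DDL would differ in types and syntax details.

```python
import sqlite3

# Column names follow the schema overview; types and defaults are assumptions.
SCHEMA = """
CREATE TABLE users (
    id            INTEGER PRIMARY KEY,
    email         TEXT NOT NULL UNIQUE,
    password_hash TEXT NOT NULL,
    created_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at    TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE user_preferences (
    id      INTEGER PRIMARY KEY,
    user_id INTEGER NOT NULL REFERENCES users(id),
    "key"   TEXT NOT NULL,
    "value" TEXT NOT NULL
);
"""


def create_database():
    """Create an in-memory database with the two example tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn


if __name__ == "__main__":
    conn = create_database()
    conn.execute(
        "INSERT INTO users (email, password_hash) VALUES (?, ?)",
        ("a@example.com", "<bcrypt hash>"),
    )
    conn.executemany(
        'INSERT INTO user_preferences (user_id, "key", "value") VALUES (?, ?, ?)',
        [(1, "theme", "dark"), (1, "language", "en")],
    )
    rows = conn.execute(
        'SELECT "key", "value" FROM user_preferences WHERE user_id = 1 ORDER BY "key"'
    ).fetchall()
    print(rows)
```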

### API & Integration Points

**External Dependencies**
- Integrates with existing Payment Service for billing
- Consumes events from Billing Service (subscription changes)
- Publishes user events to event bus for downstream services

**Key Endpoints** (reference full API spec):
- POST /auth/login - User login
- POST /auth/logout - User logout
- GET /profile - Fetch user profile
- PUT /profile - Update user profile

(See [API-001] for complete endpoint specifications)

### Trade-offs

**Accepting**
- Operational complexity: Must manage multiple services, deployments, monitoring
- Eventual consistency: Changes propagate through message queue, not immediate
- Debugging complexity: Cross-service issues harder to debug

**Gaining**
- Independent scaling: Auth service can scale without scaling profile service
- Team autonomy: Teams can deploy independently without coordination
- Failure isolation: Auth service failure doesn't take down profile service
- Development velocity: Faster iteration, less blocking

### Implementation

**Approach**: Phased migration - Extract services incrementally without big-bang rewrite

**Phases**:
1. **Phase 1 (Week 1-2)**: Extract Auth Service
   - Deliverables: Auth service running in parallel, API Gateway routing auth requests
   - Testing: Canary traffic (10%) to new service

2. **Phase 2 (Week 3-4)**: Migrate Auth Traffic
   - Deliverables: 100% auth traffic on new service, rollback plan tested
   - Verification: Auth latency, error rates compared to baseline

3. **Phase 3 (Week 5-6)**: Extract Profile Service
   - Deliverables: Profile service independent, event queue running
   - Testing: Data consistency verification across message queue

**Migration Strategy**:
- Run both monolith and microservices in parallel initially
- Use API Gateway to route traffic, allow A/B testing
- Maintain ability to rollback quickly if issues arise
- Monitor closely for latency/error rate increases

(See [PLN-001] for detailed implementation roadmap)
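
The canary step in Phase 1 depends on splitting traffic in a repeatable way. One possible approach, sketched here as an assumption rather than as the method this guide prescribes, is to hash a stable user identifier into 100 buckets so that the same user always lands on the same backend. The function names and backend labels are hypothetical.

```python
import hashlib


def canary_bucket(user_id):
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(str(user_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_backend(user_id, canary_percent):
    """Send roughly canary_percent of users to the new service, the rest to the monolith."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "auth-service" if canary_bucket(user_id) < canary_percent else "monolith"


if __name__ == "__main__":
    sent = sum(choose_backend(uid, 10) == "auth-service" for uid in range(10_000))
    print(f"{sent} of 10000 users routed to the new service")
```

Setting `canary_percent` back to 0 sends every user to the monolith again, which is the kind of quick rollback the migration strategy asks for.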

### Performance & Scalability

**Performance Targets**
- **Latency**: Auth service p95 < 100ms, p99 < 200ms
- **Throughput**: Auth service handles 10k requests/second
- **Availability**: 99.9% uptime for auth service

**Scalability Strategy**
- **Scaling Approach**: Horizontal - Add more auth service instances behind load balancer
- **Bottlenecks**: Database connection pool size (limit 100 connections per service instance)
  - Mitigation: PgBouncer connection pooling, read replicas for read operations
- **Auto-scaling**: Kubernetes HPA scales auth service from 3 to 20 replicas based on CPU
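
It helps to check that the bottleneck and auto-scaling numbers fit together. With the example figures above, 3 to 20 replicas at up to 100 connections each means between 300 and 2,000 database connections. The sketch below makes that arithmetic explicit; the helper names, the 500-connection database limit, and the 50 reserved connections in the usage example are hypothetical values, not recommendations.

```python
def peak_connections(replicas, pool_size_per_instance):
    """Worst-case number of database connections opened by all replicas."""
    return replicas * pool_size_per_instance


def max_replicas_within_budget(db_max_connections, pool_size_per_instance, reserved=0):
    """Largest replica count whose pools fit under the database connection limit."""
    return (db_max_connections - reserved) // pool_size_per_instance


if __name__ == "__main__":
    print(peak_connections(3, 100), peak_connections(20, 100))
    # Hypothetical database limit of 500 connections, 50 kept for admin and migrations.
    print(max_replicas_within_budget(500, 100, reserved=50))
```

If the worst case exceeds what the database accepts, the mitigation line needs to say how a pooler such as PgBouncer or a smaller per-instance pool closes the gap.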

**Monitoring & Observability**
- **Metrics**: Request latency (p50, p95, p99), error rate, service availability
- **Alerting**: Alert if auth latency p95 > 150ms, error rate > 0.5%
- **Logging**: Structured JSON logs with request ID for tracing across services
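
Alert thresholds are less ambiguous when written as an executable rule. The sketch below encodes the two example thresholds from the alerting bullet; the `AlertThresholds` and `evaluate_alerts` names and the alert messages are hypothetical, and in practice these rules would live in the monitoring system rather than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertThresholds:
    p95_latency_ms: float = 150.0  # alert if auth p95 latency exceeds this
    error_rate: float = 0.005      # alert if more than 0.5% of requests fail


def evaluate_alerts(p95_latency_ms, requests, errors, thresholds=AlertThresholds()):
    """Return the list of alerts triggered by one measurement window."""
    alerts = []
    if p95_latency_ms > thresholds.p95_latency_ms:
        alerts.append("auth latency p95 above threshold")
    error_rate = errors / requests if requests else 0.0
    if error_rate > thresholds.error_rate:
        alerts.append("error rate above threshold")
    return alerts


if __name__ == "__main__":
    print(evaluate_alerts(p95_latency_ms=120, requests=10_000, errors=10))
    print(evaluate_alerts(p95_latency_ms=180, requests=10_000, errors=100))
```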

### Security

**Authentication**
- JWT tokens issued by Auth Service, validated by API Gateway
- Token expiration: 1 hour, refresh tokens for extended sessions
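
The issue-then-validate flow with a 1-hour expiry can be illustrated with the Python standard library alone. This is a teaching sketch under assumptions, not production code: it hand-rolls an HS256-signed token, the `issue_token` and `verify_token` helpers are hypothetical, and a real service would use a maintained JWT library (the example stack in this guide lists `jsonwebtoken`).

```python
import base64
import hashlib
import hmac
import json
import time

TOKEN_TTL_SECONDS = 3600  # 1 hour, matching the expiration policy above


def _b64url(data):
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text):
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_token(user_id, secret, now=None):
    """Create an HS256-signed token with iat/exp claims."""
    now = int(time.time()) if now is None else int(now)
    header = {"alg": "HS256", "typ": "JWT"}
    payload = {"sub": str(user_id), "iat": now, "exp": now + TOKEN_TTL_SECONDS}
    signing_input = ".".join(
        _b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, payload)
    )
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_token(token, secret, now=None):
    """Return the payload if the signature is valid and the token is unexpired, else None."""
    now = int(time.time()) if now is None else int(now)
    try:
        header_b64, payload_b64, signature_b64 = token.split(".")
        signing_input = f"{header_b64}.{payload_b64}".encode("ascii")
        expected = hmac.new(secret, signing_input, hashlib.sha256).digest()
        # Constant-time comparison; the algorithm is fixed and never read from the header.
        if not hmac.compare_digest(expected, _b64url_decode(signature_b64)):
            return None
        payload = json.loads(_b64url_decode(payload_b64))
    except (ValueError, UnicodeError):
        return None
    if not isinstance(payload, dict):
        return None
    return payload if now < payload.get("exp", 0) else None
```

The `now` parameter exists only so that expiry can be tested deterministically; callers would normally omit it.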

**Authorization**
- Role-based access control (RBAC) enforced at API Gateway
- Profile service doesn't repeat auth checks (trusts gateway)

**Data Protection**
- **Encryption at Rest**: PostgreSQL database encryption enabled
- **Encryption in Transit**: TLS 1.3 for all service-to-service communication
- **PII Handling**: Passwords hashed with bcrypt (cost factor 12)

**Secrets Management**
- Database credentials stored in Kubernetes secrets
- JWT signing key rotated quarterly
- Environment-based secret injection at runtime

**Compliance**
- GDPR: User data can be exported via profile service
- SOC2: Audit logging enabled for user data access

### Dependencies & Assumptions

**Dependencies**
- PostgreSQL database must be highly available (RTO 1 hour)
- Redis cache can tolerate data loss (non-critical)
- API Gateway (Nginx) must be deployed and operational
- Message queue (RabbitMQ) must be running

**Assumptions**
- Auth service will handle up to 10k requests/second (based on growth projections)
- User data size remains < 100GB (current: 5GB)
- Network latency between services < 10ms (co-located data center)

### Open Questions

- [ ] Should we use gRPC for service-to-service communication instead of REST?
  - **Status**: Under investigation - benchmarking against REST
- [ ] How do we handle shared user data updates if both services write to DB?
  - **Status**: Deferred to Phase 3 - will use event sourcing pattern
- [ ] What message queue (RabbitMQ vs. Kafka)?
  - **Status**: RabbitMQ chosen, but revisit if we need audit trail of all changes

### Approvals

**Technical Review**
- Lead Backend Engineer - TBD

**Architecture Review**
- VP Engineering - TBD

**Security Review**
- Security Team - TBD

**Approved By**
- TBD

## Writing Tips

### Use Diagrams Effectively
- ASCII art is fine for design docs (easy to version control)
- Show data flow and component interactions
- Label arrows with what data/requests are flowing

### Be Explicit About Trade-offs
- Don't just say "microservices is better"
- Say "We're trading operational complexity for independent scaling because this addresses our 5k req/sec bottleneck"

### Link to Other Specs
- Reference related business requirements: `[BRD-001]`
- Reference technical requirements: `[PRD-001]`
- Reference data models: `[DATA-001]`
- Reference API contracts: `[API-001]`
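
Because spec references follow a regular `[PREFIX-NNN]` shape, they can be listed mechanically to check that a design document links to the specs it should. This is a minimal sketch; the regular expression and the `extract_spec_refs` helper are assumptions based on the four example IDs above, not part of the repository's scripts.

```python
import re

# Assumed ID shape: uppercase prefix, hyphen, three digits, in square brackets.
SPEC_REF = re.compile(r"\[([A-Z]+-\d{3})\]")


def extract_spec_refs(text):
    """Return the unique spec IDs referenced in a document, sorted alphabetically."""
    return sorted(set(SPEC_REF.findall(text)))


if __name__ == "__main__":
    sample = "See [API-001] for endpoints and [PLN-001] for the roadmap; [API-001] again."
    print(extract_spec_refs(sample))
```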

### Document Rationale
- Each decision needs a "why"
- Explain what alternatives were considered and why they were rejected
- This helps future developers understand the context

### Be Specific About Performance
- Not: "Must be performant"
- Yes: "p95 latency under 100ms, p99 under 200ms, supporting 10k requests/second"

### Consider the Whole System
- Security implications
- Operational/monitoring requirements
- Data consistency model
- Failure modes and recovery
- Future scalability

## Validation & Fixing Issues

### Run the Validator
```bash
scripts/validate-spec.sh docs/specs/design-document/des-001-your-spec.md
```

### Common Issues & Fixes

**Issue**: "Missing Proposed Solution section"
- **Fix**: Add detailed architecture components, design decisions, tech stack

**Issue**: "TODO items in Architecture Components (4 items)"
- **Fix**: Complete descriptions for all components (purpose, technology, responsibilities)

**Issue**: "No Trade-offs documented"
- **Fix**: Explicitly document what you're accepting and what you're gaining

**Issue**: "Missing Performance & Scalability targets"
- **Fix**: Add specific latency, throughput, and availability targets

### Check Completeness
```bash
scripts/check-completeness.sh docs/specs/design-document/des-001-your-spec.md
```

## Decision-Making Framework

As you write the design doc, work through:

1. **Problem**: What are we designing for?
   - Specific pain points or constraints?
   - Performance targets, scalability requirements?

2. **Options**: What architectural approaches could work?
   - Monolith vs. distributed?
   - Synchronous vs. asynchronous?
   - Technology choices?

3. **Evaluation**: How do options compare?
   - Which best addresses the problem?
   - What are the trade-offs?
   - What does the team have experience with?

4. **Decision**: Which approach wins and why?
   - What assumptions must hold?
   - What trade-offs are we accepting?

5. **Implementation**: How do we build/migrate to this?
   - Big bang or incremental?
   - Parallel running period?
   - Rollback plan?

## Next Steps

1. **Create the spec**: `scripts/generate-spec.sh design-document des-XXX-slug`
2. **Research**: Find related specs and understand architecture context
3. **Sketch**: Draw architecture diagrams before writing detailed components
4. **Fill in sections** using this guide
5. **Validate**: `scripts/validate-spec.sh docs/specs/design-document/des-XXX-slug.md`
6. **Get architectural review** before implementation begins
7. **Update related specs**: Create or update technical requirements and implementation plans