# How to Create a Design Document

Design Documents provide the detailed architectural and technical design for a system, component, or significant feature. They answer "How will we build this?" after business and technical requirements have been defined.

## Quick Start

```bash
# 1. Create a new design document
scripts/generate-spec.sh design-document des-001-descriptive-slug

# 2. Open and fill in the file
# (The file will be created at: docs/specs/design-document/des-001-descriptive-slug.md)

# 3. Fill in the sections, then validate:
scripts/validate-spec.sh docs/specs/design-document/des-001-descriptive-slug.md

# 4. Fix issues and check completeness:
scripts/check-completeness.sh docs/specs/design-document/des-001-descriptive-slug.md
```

## When to Write a Design Document

Use a Design Document when you need to:

- Define system architecture or redesign existing components
- Document major technical decisions and trade-offs
- Provide a blueprint for implementation teams
- Enable architectural review before coding begins
- Create shared understanding of complex systems

## Research Phase

### 1. Research Related Specifications

Find upstream specs that inform your design:

```bash
# Find related business requirements
grep -r "brd" docs/specs/ --include="*.md"

# Find related technical requirements
grep -r "prd\|technical" docs/specs/ --include="*.md"

# Find existing design patterns or similar designs
grep -r "design\|architecture" docs/specs/ --include="*.md"
```

### 2. Research External Documentation

Research existing architectures and patterns:

- Look up similar systems: "How do other companies solve this problem?"
- Research technologies and frameworks you're planning to use
- Review relevant design patterns or architecture styles
- Check for security, performance, or scalability best practices

Use tools to fetch external docs:

```bash
# Research the latest on your chosen technologies
# Example: Research distributed system patterns
# Example: Research microservices architecture best practices
```

### 3. Review Existing Codebase & Architecture

- What patterns does your codebase already follow?
- What technologies are you already using?
- How are similar features currently implemented?
- What architectural decisions have been made previously?

Ask: "Are we extending existing patterns or introducing new ones?"

### 4. Understand Constraints

- What are the performance requirements (latency, throughput)?
- What scalability targets exist?
- What security constraints apply?
- What infrastructure or budget constraints exist?
- What is the team's expertise with the chosen technologies?

## Structure & Content Guide

### Title & Metadata

- **Title**: "Microservices Architecture for User Service" or similar
- **Type**: Architecture | System Design | RFC | Technology Choice
- **Status**: Draft | Under Review | Accepted | Rejected
- **Version**: 1.0.0 (increment for significant revisions)

### Executive Summary

Write 3-4 sentences that answer:

- What problem does this solve?
- What's the proposed solution?
- What are the key trade-offs?

Example:

```
This design proposes a microservices architecture to scale our user service.
We'll split user management, authentication, and profile service into separate
deployable services. This trades some operational complexity for independent
scaling and development velocity. Key trade-off: eventual consistency vs.
immediate consistency in cross-service operations.
```

### Problem Statement

Describe the current state and its limitations:

```
Current monolithic architecture handles all user operations in a single service,
causing:
- Bottleneck: User service becomes a bottleneck for the entire system
- Scaling: Must scale the entire service even if only auth needs capacity
- Deployment: Changes in one area risk the entire user service
- Velocity: Teams block each other during development

This design solves these issues by enabling independent scaling and deployment.
```

### Goals & Success Criteria

**Primary Goals** (3-5 goals)

- Increase deployment frequency to enable multiple daily deployments
- Enable independent scaling of auth and profile services
- Reduce time to market for new user features

**Success Criteria** (specific, measurable)

1. Auth service can scale independently to handle 10k requests/sec
2. Profile service deployment doesn't impact auth service
3. System reduces MTTR for user service incidents by 50%
4. Teams can deploy independently without coordination
5. P95 latency remains under 200ms across service boundaries

### Context & Background

Explain why now:

```
Over the past 6 months, we've experienced:
- Auth service saturated at 5k requests/sec during peak hours
- Authentication changes blocked by profile service deployments
- High operational burden managing a single monolithic service

Recent customer requests for higher throughput have revealed these bottlenecks.
This design addresses the most urgent scaling constraint (auth service).
```

### Proposed Solution

#### High-Level Overview

Provide a diagram showing major components and data flow:

```
┌────────┐     ┌─────────────┐
│ Client │ ──→ │ API Gateway │
└────────┘     └──────┬──────┘
                      │
                      ├─→ [Auth Service]    - JWT validation, user login
                      │
                      ├─→ [Profile Service] - user profile, preferences
                      │
                      └─→ [Data Layer]
                           ├─ User DB (master)
                           ├─ Cache (Redis)
                           └─ Message Queue (RabbitMQ)
```

Explain how components interact:

```
Client sends request to API Gateway, which routes based on endpoint.
Auth service handles login/JWT operations. Profile service handles profile
reads/writes. Both services consume user data from a shared database with
eventual consistency via message queue.
```

#### Architecture Components

For each major component:

**Auth Service**

- **Purpose**: Handles authentication, token generation, validation
- **Technology**: Node.js with Express, Redis for session storage
- **Key Responsibilities**:
  - User login/logout
  - JWT token generation and validation
  - Session management
  - Password reset flows
- **Interactions**: Calls User DB for credential validation, publishes events to queue

**Profile Service**

- **Purpose**: Manages user profile data and preferences
- **Technology**: Node.js with Express, PostgreSQL for user data
- **Key Responsibilities**:
  - Read/write user profile information
  - Manage user preferences
  - Handle profile search and filtering
- **Interactions**: Consumes user events from queue, calls shared User DB

**API Gateway**

- **Purpose**: Single entry point, routing, authentication enforcement
- **Technology**: Nginx or an API gateway such as Kong
- **Key Responsibilities**:
  - Route requests to the appropriate service
  - Enforce API authentication
  - Rate limiting
  - Request/response transformation
- **Interactions**: Routes to Auth and Profile services

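The Auth Service's core responsibility above — issuing and validating JWTs — can be sketched as follows. The design names the `jsonwebtoken` library; this stand-alone HS256 sketch uses only Node's `crypto` module so the token shape and expiry check are visible. Function names are illustrative, not part of the design:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Base64url-encode a string or buffer (the encoding JWT uses).
const b64url = (data: Buffer | string): string =>
  Buffer.from(data).toString("base64url");

// Issue an HS256 token: header.payload.signature.
function signToken(payload: object, secret: string): string {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  const sig = createHmac("sha256", secret)
    .update(`${header}.${body}`)
    .digest("base64url");
  return `${header}.${body}.${sig}`;
}

// Validate signature and expiry; returns the claims or null.
function verifyToken(token: string, secret: string): Record<string, unknown> | null {
  const [header, body, sig] = token.split(".");
  if (!header || !body || !sig) return null;
  const expected = createHmac("sha256", secret)
    .update(`${header}.${body}`)
    .digest("base64url");
  const a = Buffer.from(sig);
  const b = Buffer.from(expected);
  if (a.length !== b.length || !timingSafeEqual(a, b)) return null;
  const claims = JSON.parse(Buffer.from(body, "base64url").toString());
  // exp is a Unix timestamp in seconds (1-hour tokens per the Security section).
  if (typeof claims.exp === "number" && claims.exp < Date.now() / 1000) return null;
  return claims;
}
```

In this design it is the API Gateway, not each downstream service, that performs validation — which is what lets the Profile Service skip repeated auth checks.
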
### Design Decisions

For each significant decision, document:

#### Decision 1: Microservices vs. Monolith

- **Decision**: Adopt microservices architecture
- **Rationale**:
  - Independent scaling needed (auth bottleneck at 5k req/sec)
  - Team velocity: Can deploy auth changes independently
  - Loose coupling enables faster iteration
- **Alternatives Considered**:
  - Monolith optimization: Caching, database optimization (rejected: can't solve the scaling bottleneck)
  - Modular monolith: Improves structure but doesn't enable independent scaling
- **Impact**:
  - Gain: Independent scaling, deployment, team velocity
  - Accept: Distributed system complexity, operational overhead, eventual consistency

#### Decision 2: Synchronous vs. Asynchronous Communication

- **Decision**: Use a message queue for eventual consistency
- **Rationale**:
  - Profile updates don't need to be immediately consistent with the auth service
  - Reduces coupling: Auth service doesn't wait for profile service
  - Improves resilience: Profile service failure doesn't affect auth
- **Alternatives Considered**:
  - Synchronous REST calls: Simpler but tight coupling, availability issues
  - Event sourcing: Over-engineered for current needs
- **Impact**:
  - Gain: Resilience, reduced coupling, independent scaling
  - Accept: Eventual consistency, operational complexity (message queue)

### Technology Stack

**Language & Runtime**

- Node.js 18 LTS - Rationale: Existing expertise, good async support
- Express - Lightweight, flexible framework the team knows

**Data Layer**

- PostgreSQL (primary database) - Reliable, ACID transactions for user data
- Redis (cache layer) - Session storage, auth token cache

**Infrastructure**

- Kubernetes for orchestration - Running multiple services at scale
- Docker for containerization - Consistent deployment

**Key Libraries/Frameworks**

- Express (v4.18) - HTTP framework
- jsonwebtoken - JWT token handling
- @aws-sdk - AWS SDK for future integration
- Jest - Testing framework

### Data Model & Storage

**Storage Strategy**

- **Primary Database**: PostgreSQL with user table containing:
  - id, email, password_hash, created_at, updated_at
  - One-to-many relationship with user_preferences
- **Caching**: Redis stores JWT token metadata and session info with 1-hour TTL
- **Data Retention**: User data retained indefinitely; sessions cleaned up after TTL

**Schema Overview**

```
Users Table:
- id (primary key)
- email (unique index)
- password_hash
- created_at
- updated_at

User Preferences:
- id
- user_id (foreign key)
- key (e.g., theme, language)
- value
```

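The 1-hour session TTL in the storage strategy is enforced by Redis in production. An in-memory sketch with an injectable clock (class and method names are assumptions, not part of the design) makes the expiry contract the services code against concrete:

```typescript
// In-memory stand-in for the Redis session cache; Redis evicts on TTL
// for us in production, but the contract is the same.
class SessionCache {
  private store = new Map<string, { value: string; expiresAt: number }>();

  // now() is injectable so expiry is testable without waiting an hour.
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  set(key: string, value: string): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key: string): string | null {
    const entry = this.store.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key); // lazy eviction on read
      return null;
    }
    return entry.value;
  }
}
```
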
### API & Integration Points

**External Dependencies**

- Integrates with existing Payment Service for billing
- Consumes events from Billing Service (subscription changes)
- Publishes user events to event bus for downstream services

**Key Endpoints** (reference full API spec):

- POST /auth/login - User login
- POST /auth/logout - User logout
- GET /profile - Fetch user profile
- PUT /profile - Update user profile

(See [API-001] for complete endpoint specifications)

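The gateway's routing of the endpoints listed above can be sketched as a pure function. In the actual design this lives in Nginx/Kong configuration rather than application code; the function and service names are illustrative:

```typescript
// Which backend handles a given request, per the endpoint list above.
type Service = "auth-service" | "profile-service";

function routeRequest(method: string, path: string): Service | null {
  // All /auth/* operations go to the Auth Service.
  if (path.startsWith("/auth/")) return "auth-service";
  // GET /profile and PUT /profile go to the Profile Service.
  if (path === "/profile" && (method === "GET" || method === "PUT"))
    return "profile-service";
  // Anything else is unrouted (the gateway would return 404).
  return null;
}
```
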
### Trade-offs

**Accepting**

- Operational complexity: Must manage multiple services, deployments, monitoring
- Eventual consistency: Changes propagate through the message queue, not immediately
- Debugging complexity: Cross-service issues are harder to debug

**Gaining**

- Independent scaling: Auth service can scale without scaling profile service
- Team autonomy: Teams can deploy independently without coordination
- Failure isolation: Auth service failure doesn't take down profile service
- Development velocity: Faster iteration, less blocking

### Implementation

**Approach**: Phased migration - extract services incrementally without a big-bang rewrite

**Phases**:

1. **Phase 1 (Weeks 1-2)**: Extract Auth Service
   - Deliverables: Auth service running in parallel, API Gateway routing auth requests
   - Testing: Canary traffic (10%) to the new service

2. **Phase 2 (Weeks 3-4)**: Migrate Auth Traffic
   - Deliverables: 100% auth traffic on the new service, rollback plan tested
   - Verification: Auth latency and error rates compared to baseline

3. **Phase 3 (Weeks 5-6)**: Extract Profile Service
   - Deliverables: Profile service independent, event queue running
   - Testing: Data consistency verification across the message queue

**Migration Strategy**:

- Run both monolith and microservices in parallel initially
- Use the API Gateway to route traffic and allow A/B testing
- Maintain the ability to roll back quickly if issues arise
- Monitor closely for latency/error rate increases

(See [PLN-001] for detailed implementation roadmap)

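Phase 1's 10% canary can be made sticky per user by hashing the user ID, so a given user consistently lands on the same backend during migration. A sketch — the helper name and bucketing scheme are assumptions, not part of the plan:

```typescript
import { createHash } from "node:crypto";

// Deterministic canary split: hash the user ID into one of 100 buckets,
// so the same user always gets the same answer for a given percentage.
function useCanary(userId: string, canaryPercent: number): boolean {
  const digest = createHash("sha256").update(userId).digest();
  const bucket = digest.readUInt16BE(0) % 100; // 0..99
  return bucket < canaryPercent;
}
```

Raising `canaryPercent` from 10 toward 100 is then the Phase 2 migration knob, and dropping it back to 0 is the rollback.
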
### Performance & Scalability

**Performance Targets**

- **Latency**: Auth service p95 < 100ms, p99 < 200ms
- **Throughput**: Auth service handles 10k requests/second
- **Availability**: 99.9% uptime for auth service

**Scalability Strategy**

- **Scaling Approach**: Horizontal - add more auth service instances behind a load balancer
- **Bottlenecks**: Database connection pool size (limit 100 connections per service instance)
  - Mitigation: PgBouncer connection pooling, read replicas for read operations
- **Auto-scaling**: Kubernetes HPA scales auth service from 3 to 20 replicas based on CPU

**Monitoring & Observability**

- **Metrics**: Request latency (p50, p95, p99), error rate, service availability
- **Alerting**: Alert if auth latency p95 > 150ms or error rate > 0.5%
- **Logging**: Structured JSON logs with request ID for tracing across services

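The p95/p99 targets and the p95 > 150ms alert above imply computing percentiles over a window of latency samples. A nearest-rank sketch (enough for alerting; the function name is illustrative):

```typescript
// Nearest-rank percentile over a window of latency samples (ms).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: the smallest value such that p% of samples are <= it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

An alerting rule would then be a comparison like `percentile(window, 95) > 150`.
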
### Security

**Authentication**

- JWT tokens issued by Auth Service, validated by API Gateway
- Token expiration: 1 hour, with refresh tokens for extended sessions

**Authorization**

- Role-based access control (RBAC) enforced at the API Gateway
- Profile service doesn't repeat auth checks (trusts the gateway)

**Data Protection**

- **Encryption at Rest**: PostgreSQL database encryption enabled
- **Encryption in Transit**: TLS 1.3 for all service-to-service communication
- **PII Handling**: Passwords hashed with bcrypt (cost factor 12)

**Secrets Management**

- Database credentials stored in Kubernetes secrets
- JWT signing key rotated quarterly
- Environment-based secret injection at runtime

**Compliance**

- GDPR: User data can be exported via profile service
- SOC2: Audit logging enabled for user data access

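The PII-handling line specifies bcrypt with cost factor 12. The sketch below substitutes Node's built-in scrypt purely so it runs without dependencies — the salted-hash-plus-constant-time-compare shape is the same, but this is not the bcrypt the design calls for:

```typescript
import { randomBytes, scryptSync, timingSafeEqual } from "node:crypto";

// NOTE: the design specifies bcrypt (cost 12); stdlib scrypt is used here
// only so the sketch is dependency-free. The shape is identical: random
// salt, slow KDF, constant-time comparison.
function hashPassword(password: string): string {
  const salt = randomBytes(16);
  const hash = scryptSync(password, salt, 32);
  return `${salt.toString("hex")}:${hash.toString("hex")}`;
}

function verifyPassword(password: string, stored: string): boolean {
  const [saltHex, hashHex] = stored.split(":");
  const candidate = scryptSync(password, Buffer.from(saltHex, "hex"), 32);
  return timingSafeEqual(candidate, Buffer.from(hashHex, "hex"));
}
```
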
### Dependencies & Assumptions

**Dependencies**

- PostgreSQL database must be highly available (RTO 1 hour)
- Redis cache can tolerate data loss (non-critical)
- API Gateway (Nginx) must be deployed and operational
- Message queue (RabbitMQ) must be running

**Assumptions**

- Auth service will handle up to 10k requests/second (based on growth projections)
- User data size remains < 100GB (current: 5GB)
- Network latency between services < 10ms (co-located data center)

### Open Questions

- [ ] Should we use gRPC for service-to-service communication instead of REST?
  - **Status**: Under investigation - benchmarking against REST
- [ ] How do we handle shared user data updates if both services write to the DB?
  - **Status**: Deferred to Phase 3 - will use event sourcing pattern
- [ ] Which message queue (RabbitMQ vs. Kafka)?
  - **Status**: RabbitMQ chosen, but revisit if we need an audit trail of all changes

### Approvals

**Technical Review**

- Lead Backend Engineer - TBD

**Architecture Review**

- VP Engineering - TBD

**Security Review**

- Security Team - TBD

**Approved By**

- TBD

## Writing Tips

### Use Diagrams Effectively

- ASCII art is fine for design docs (easy to version control)
- Show data flow and component interactions
- Label arrows with what data/requests are flowing

### Be Explicit About Trade-offs

- Don't just say "microservices are better"
- Say "We're trading operational complexity for independent scaling because this addresses our 5k req/sec bottleneck"

### Link to Other Specs

- Reference related business requirements: `[BRD-001]`
- Reference technical requirements: `[PRD-001]`
- Reference data models: `[DATA-001]`
- Reference API contracts: `[API-001]`

### Document Rationale

- Each decision needs a "why"
- Explain what alternatives were considered and why they were rejected
- This helps future developers understand the context

### Be Specific About Performance

- Not: "Must be performant"
- Yes: "p95 latency under 100ms, p99 under 200ms, supporting 10k requests/second"

### Consider the Whole System

- Security implications
- Operational/monitoring requirements
- Data consistency model
- Failure modes and recovery
- Future scalability

## Validation & Fixing Issues

### Run the Validator

```bash
scripts/validate-spec.sh docs/specs/design-document/des-001-your-spec.md
```

### Common Issues & Fixes

**Issue**: "Missing Proposed Solution section"

- **Fix**: Add detailed architecture components, design decisions, tech stack

**Issue**: "TODO items in Architecture Components (4 items)"

- **Fix**: Complete descriptions for all components (purpose, technology, responsibilities)

**Issue**: "No Trade-offs documented"

- **Fix**: Explicitly document what you're accepting and what you're gaining

**Issue**: "Missing Performance & Scalability targets"

- **Fix**: Add specific latency, throughput, and availability targets

### Check Completeness

```bash
scripts/check-completeness.sh docs/specs/design-document/des-001-your-spec.md
```

## Decision-Making Framework

As you write the design doc, work through:

1. **Problem**: What are we designing for?
   - Specific pain points or constraints?
   - Performance targets, scalability requirements?

2. **Options**: What architectural approaches could work?
   - Monolith vs. distributed?
   - Synchronous vs. asynchronous?
   - Technology choices?

3. **Evaluation**: How do options compare?
   - Which best addresses the problem?
   - What are the trade-offs?
   - What does the team have experience with?

4. **Decision**: Which approach wins and why?
   - What assumptions must hold?
   - What trade-offs are we accepting?

5. **Implementation**: How do we build/migrate to this?
   - Big bang or incremental?
   - Parallel running period?
   - Rollback plan?

## Next Steps

1. **Create the spec**: `scripts/generate-spec.sh design-document des-XXX-slug`
2. **Research**: Find related specs and understand architecture context
3. **Sketch**: Draw architecture diagrams before writing detailed components
4. **Fill in sections** using this guide
5. **Validate**: `scripts/validate-spec.sh docs/specs/design-document/des-XXX-slug.md`
6. **Get architectural review** before implementation begins
7. **Update related specs**: Create or update technical requirements and implementation plans