# How to Create a Design Document

Design Documents provide the detailed architectural and technical design for a system, component, or significant feature. They answer "How will we build this?" after business and technical requirements have been defined.

## Quick Start

```bash
# 1. Create a new design document
scripts/generate-spec.sh design-document des-001-descriptive-slug

# 2. Open and fill in the file
# (The file will be created at: docs/specs/design-document/des-001-descriptive-slug.md)

# 3. Fill in the sections, then validate:
scripts/validate-spec.sh docs/specs/design-document/des-001-descriptive-slug.md

# 4. Fix issues and check completeness:
scripts/check-completeness.sh docs/specs/design-document/des-001-descriptive-slug.md
```

## When to Write a Design Document

Use a Design Document when you need to:

- Define system architecture or redesign existing components
- Document major technical decisions and trade-offs
- Provide a blueprint for implementation teams
- Enable architectural review before coding begins
- Create shared understanding of complex systems

## Research Phase

### 1. Research Related Specifications

Find upstream specs that inform your design:

```bash
# Find related business requirements
grep -r "brd" docs/specs/ --include="*.md"

# Find related technical requirements
grep -r "prd\|technical" docs/specs/ --include="*.md"

# Find existing design patterns or similar designs
grep -r "design\|architecture" docs/specs/ --include="*.md"
```

### 2. Research External Documentation

Research existing architectures and patterns:

- Look up similar systems: "How do other companies solve this problem?"
- Research technologies and frameworks you're planning to use
- Review relevant design patterns or architecture styles
- Check for security, performance, or scalability best practices

Use tools to fetch external docs:

```bash
# Research the latest on your chosen technologies
# Example: Research distributed system patterns
# Example: Research microservices architecture best practices
```

### 3. Review Existing Codebase & Architecture

- What patterns does your codebase already follow?
- What technologies are you already using?
- How are similar features currently implemented?
- What architectural decisions have been made previously?

Ask: "Are we extending existing patterns or introducing new ones?"

### 4. Understand Constraints

- What are the performance requirements (latency, throughput)?
- What scalability targets exist?
- What security constraints apply?
- What infrastructure or budget constraints exist?
- What is the team's expertise with the chosen technologies?

## Structure & Content Guide

### Title & Metadata

- **Title**: "Microservices Architecture for User Service" or similar
- **Type**: Architecture | System Design | RFC | Technology Choice
- **Status**: Draft | Under Review | Accepted | Rejected
- **Version**: 1.0.0 (increment for significant revisions)

### Executive Summary

Write 3-4 sentences that answer:

- What problem does this solve?
- What's the proposed solution?
- What are the key trade-offs?

Example:

```
This design proposes a microservices architecture to scale our user service.
We'll split user management, authentication, and profile service into separate
deployable services. This trades some operational complexity for independent
scaling and development velocity. Key trade-off: eventual consistency vs.
immediate consistency in cross-service operations.
```

### Problem Statement

Describe the current state and its limitations:

```
Current monolithic architecture handles all user operations in a single service,
causing:
- Bottleneck: User service becomes a bottleneck for the entire system
- Scaling: Must scale the entire service even if only auth needs capacity
- Deployment: Changes in one area risk the entire user service
- Velocity: Teams block each other during development

This design solves these issues by enabling independent scaling and deployment.
```

### Goals & Success Criteria

**Primary Goals** (3-5 goals)

- Increase deployment frequency to enable multiple daily deployments
- Enable independent scaling of auth and profile services
- Reduce time to market for new user features

**Success Criteria** (specific, measurable)

1. Auth service can scale independently to handle 10k requests/sec
2. Profile service deployment doesn't impact auth service
3. System reduces MTTR for user service incidents by 50%
4. Teams can deploy independently without coordination
5. P95 latency remains under 200ms across service boundaries

### Context & Background

Explain why now:

```
Over the past 6 months, we've experienced:
- Auth service saturated at 5k requests/sec during peak hours
- Authentication changes blocked by profile service deployments
- High operational burden managing a single monolithic service

Recent customer requests for higher throughput have revealed these bottlenecks.
This design addresses the most urgent scaling constraint (auth service).
```

### Proposed Solution

#### High-Level Overview

Provide a diagram showing major components and data flow:

```
┌────────┐     ┌─────────────┐
│ Client │ ──→ │ API Gateway │
└────────┘     └──────┬──────┘
                      │
                      ├─→ [Auth Service]    - JWT validation, user login
                      │
                      ├─→ [Profile Service] - user profile, preferences
                      │
                      └─→ [Data Layer]
                           ├─ User DB (master)
                           ├─ Cache (Redis)
                           └─ Message Queue (RabbitMQ)
```

Explain how components interact:

```
Client sends request to API Gateway, which routes based on endpoint.
Auth service handles login/JWT operations. Profile service handles profile
reads/writes. Both services consume user data from a shared database with
eventual consistency via message queue.
```

#### Architecture Components

For each major component:

**Auth Service**

- **Purpose**: Handles authentication, token generation, validation
- **Technology**: Node.js with Express, Redis for session storage
- **Key Responsibilities**:
  - User login/logout
  - JWT token generation and validation
  - Session management
  - Password reset flows
- **Interactions**: Calls User DB for credential validation, publishes events to queue

**Profile Service**

- **Purpose**: Manages user profile data and preferences
- **Technology**: Node.js with Express, PostgreSQL for user data
- **Key Responsibilities**:
  - Read/write user profile information
  - Manage user preferences
  - Handle profile search and filtering
- **Interactions**: Consumes user events from queue, calls shared User DB

**API Gateway**

- **Purpose**: Single entry point, routing, authentication enforcement
- **Technology**: Nginx or an API gateway such as Kong
- **Key Responsibilities**:
  - Route requests to the appropriate service
  - Enforce API authentication
  - Rate limiting
  - Request/response transformation
- **Interactions**: Routes to Auth and Profile services

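The Auth Service's core responsibility above — issuing and validating JWTs — can be sketched as follows. The design names the `jsonwebtoken` library; this stand-alone HS256 sketch uses only Node's `crypto` module so the token shape and expiry check are visible. Function names are illustrative, not part of the design:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Base64url-encode a string or buffer (the encoding JWT uses).
const b64url = (data: Buffer | string): string =>
  Buffer.from(data).toString("base64url");

// Issue an HS256 token: header.payload.signature.
function signToken(payload: object, secret: string): string {
  const header = b64url(JSON.stringify({ alg: "HS256", typ: "JWT" }));
  const body = b64url(JSON.stringify(payload));
  const sig = createHmac("sha256", secret)
    .update(`${header}.${body}`)
    .digest("base64url");
  return `${header}.${body}.${sig}`;
}

// Validate signature and expiry; returns the claims or null.
function verifyToken(token: string, secret: string): Record<string, unknown> | null {
  const [header, body, sig] = token.split(".");
  if (!header || !body || !sig) return null;
  const expected = createHmac("sha256", secret)
    .update(`${header}.${body}`)
    .digest("base64url");
  const a = Buffer.from(sig);
  const b = Buffer.from(expected);
  if (a.length !== b.length || !timingSafeEqual(a, b)) return null;
  const claims = JSON.parse(Buffer.from(body, "base64url").toString());
  // exp is a Unix timestamp in seconds (1-hour tokens per the Security section).
  if (typeof claims.exp === "number" && claims.exp < Date.now() / 1000) return null;
  return claims;
}
```

In this design it is the API Gateway, not each downstream service, that performs validation — which is what lets the Profile Service skip repeated auth checks.
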
### Design Decisions

For each significant decision, document:

#### Decision 1: Microservices vs. Monolith

- **Decision**: Adopt microservices architecture
- **Rationale**:
  - Independent scaling needed (auth bottleneck at 5k req/sec)
  - Team velocity: Can deploy auth changes independently
  - Loose coupling enables faster iteration
- **Alternatives Considered**:
  - Monolith optimization: Caching, database optimization (rejected: can't solve the scaling bottleneck)
  - Modular monolith: Improves structure but doesn't enable independent scaling
- **Impact**:
  - Gain: Independent scaling, deployment, team velocity
  - Accept: Distributed system complexity, operational overhead, eventual consistency

#### Decision 2: Synchronous vs. Asynchronous Communication

- **Decision**: Use a message queue for eventual consistency
- **Rationale**:
  - Profile updates don't need to be immediately consistent with the auth service
  - Reduces coupling: Auth service doesn't wait for profile service
  - Improves resilience: Profile service failure doesn't affect auth
- **Alternatives Considered**:
  - Synchronous REST calls: Simpler but tight coupling, availability issues
  - Event sourcing: Over-engineered for current needs
- **Impact**:
  - Gain: Resilience, reduced coupling, independent scaling
  - Accept: Eventual consistency, operational complexity (message queue)

### Technology Stack

**Language & Runtime**

- Node.js 18 LTS - Rationale: Existing expertise, good async support
- Express - Lightweight, flexible framework the team knows

**Data Layer**

- PostgreSQL (primary database) - Reliable, ACID transactions for user data
- Redis (cache layer) - Session storage, auth token cache

**Infrastructure**

- Kubernetes for orchestration - Running multiple services at scale
- Docker for containerization - Consistent deployment

**Key Libraries/Frameworks**

- Express (v4.18) - HTTP framework
- jsonwebtoken - JWT token handling
- @aws-sdk - AWS SDK for future integration
- Jest - Testing framework

### Data Model & Storage

**Storage Strategy**

- **Primary Database**: PostgreSQL with user table containing:
  - id, email, password_hash, created_at, updated_at
  - One-to-many relationship with user_preferences
- **Caching**: Redis stores JWT token metadata and session info with 1-hour TTL
- **Data Retention**: User data retained indefinitely; sessions cleaned up after TTL

**Schema Overview**

```
Users Table:
- id (primary key)
- email (unique index)
- password_hash
- created_at
- updated_at

User Preferences:
- id
- user_id (foreign key)
- key (e.g., theme, language)
- value
```

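The 1-hour session TTL in the storage strategy is enforced by Redis in production. An in-memory sketch with an injectable clock (class and method names are assumptions, not part of the design) makes the expiry contract the services code against concrete:

```typescript
// In-memory stand-in for the Redis session cache; Redis evicts on TTL
// for us in production, but the contract is the same.
class SessionCache {
  private store = new Map<string, { value: string; expiresAt: number }>();

  // now() is injectable so expiry is testable without waiting an hour.
  constructor(private ttlMs: number, private now: () => number = Date.now) {}

  set(key: string, value: string): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key: string): string | null {
    const entry = this.store.get(key);
    if (!entry) return null;
    if (entry.expiresAt <= this.now()) {
      this.store.delete(key); // lazy eviction on read
      return null;
    }
    return entry.value;
  }
}
```
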
### API & Integration Points

**External Dependencies**

- Integrates with existing Payment Service for billing
- Consumes events from Billing Service (subscription changes)
- Publishes user events to event bus for downstream services

**Key Endpoints** (reference full API spec):

- POST /auth/login - User login
- POST /auth/logout - User logout
- GET /profile - Fetch user profile
- PUT /profile - Update user profile

(See [API-001] for complete endpoint specifications)

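The gateway's routing of the endpoints listed above can be sketched as a pure function. In the actual design this lives in Nginx/Kong configuration rather than application code; the function and service names are illustrative:

```typescript
// Which backend handles a given request, per the endpoint list above.
type Service = "auth-service" | "profile-service";

function routeRequest(method: string, path: string): Service | null {
  // All /auth/* operations go to the Auth Service.
  if (path.startsWith("/auth/")) return "auth-service";
  // GET /profile and PUT /profile go to the Profile Service.
  if (path === "/profile" && (method === "GET" || method === "PUT"))
    return "profile-service";
  // Anything else is unrouted (the gateway would return 404).
  return null;
}
```
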
### Trade-offs

**Accepting**

- Operational complexity: Must manage multiple services, deployments, monitoring
- Eventual consistency: Changes propagate through the message queue, not immediately
- Debugging complexity: Cross-service issues are harder to debug

**Gaining**

- Independent scaling: Auth service can scale without scaling profile service
- Team autonomy: Teams can deploy independently without coordination
- Failure isolation: Auth service failure doesn't take down profile service
- Development velocity: Faster iteration, less blocking

### Implementation

**Approach**: Phased migration - extract services incrementally without a big-bang rewrite

**Phases**:

1. **Phase 1 (Weeks 1-2)**: Extract Auth Service
   - Deliverables: Auth service running in parallel, API Gateway routing auth requests
   - Testing: Canary traffic (10%) to the new service

2. **Phase 2 (Weeks 3-4)**: Migrate Auth Traffic
   - Deliverables: 100% auth traffic on the new service, rollback plan tested
   - Verification: Auth latency and error rates compared to baseline

3. **Phase 3 (Weeks 5-6)**: Extract Profile Service
   - Deliverables: Profile service independent, event queue running
   - Testing: Data consistency verification across the message queue

**Migration Strategy**:

- Run both monolith and microservices in parallel initially
- Use the API Gateway to route traffic and allow A/B testing
- Maintain the ability to roll back quickly if issues arise
- Monitor closely for latency/error rate increases

(See [PLN-001] for detailed implementation roadmap)

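Phase 1's 10% canary can be made sticky per user by hashing the user ID, so a given user consistently lands on the same backend during migration. A sketch — the helper name and bucketing scheme are assumptions, not part of the plan:

```typescript
import { createHash } from "node:crypto";

// Deterministic canary split: hash the user ID into one of 100 buckets,
// so the same user always gets the same answer for a given percentage.
function useCanary(userId: string, canaryPercent: number): boolean {
  const digest = createHash("sha256").update(userId).digest();
  const bucket = digest.readUInt16BE(0) % 100; // 0..99
  return bucket < canaryPercent;
}
```

Raising `canaryPercent` from 10 toward 100 is then the Phase 2 migration knob, and dropping it back to 0 is the rollback.
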
### Performance & Scalability

**Performance Targets**

- **Latency**: Auth service p95 < 100ms, p99 < 200ms
- **Throughput**: Auth service handles 10k requests/second
- **Availability**: 99.9% uptime for auth service

**Scalability Strategy**

- **Scaling Approach**: Horizontal - add more auth service instances behind a load balancer
- **Bottlenecks**: Database connection pool size (limit 100 connections per service instance)
  - Mitigation: PgBouncer connection pooling, read replicas for read operations
- **Auto-scaling**: Kubernetes HPA scales auth service from 3 to 20 replicas based on CPU

**Monitoring & Observability**

- **Metrics**: Request latency (p50, p95, p99), error rate, service availability
- **Alerting**: Alert if auth latency p95 > 150ms or error rate > 0.5%
- **Logging**: Structured JSON logs with request ID for tracing across services

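The p95/p99 targets and the p95 > 150ms alert above imply computing percentiles over a window of latency samples. A nearest-rank sketch (enough for alerting; the function name is illustrative):

```typescript
// Nearest-rank percentile over a window of latency samples (ms).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: the smallest value such that p% of samples are <= it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}
```

An alerting rule would then be a comparison like `percentile(window, 95) > 150`.
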
### Security

**Authentication**

- JWT tokens issued by Auth Service, validated by API Gateway
- Token expiration: 1 hour, with refresh tokens for extended sessions

**Authorization**

- Role-based access control (RBAC) enforced at the API Gateway
- Profile service doesn't repeat auth checks (trusts the gateway)

**Data Protection**

- **Encryption at Rest**: PostgreSQL database encryption enabled
- **Encryption in Transit**: TLS 1.3 for all service-to-service communication
- **PII Handling**: Passwords hashed with bcrypt (cost factor 12)

**Secrets Management**

- Database credentials stored in Kubernetes secrets
- JWT signing key rotated quarterly
- Environment-based secret injection at runtime

**Compliance**

- GDPR: User data can be exported via profile service
- SOC2: Audit logging enabled for user data access

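The PII-handling line specifies bcrypt with cost factor 12. The sketch below substitutes Node's built-in scrypt purely so it runs without dependencies — the salted-hash-plus-constant-time-compare shape is the same, but this is not the bcrypt the design calls for:

```typescript
import { randomBytes, scryptSync, timingSafeEqual } from "node:crypto";

// NOTE: the design specifies bcrypt (cost 12); stdlib scrypt is used here
// only so the sketch is dependency-free. The shape is identical: random
// salt, slow KDF, constant-time comparison.
function hashPassword(password: string): string {
  const salt = randomBytes(16);
  const hash = scryptSync(password, salt, 32);
  return `${salt.toString("hex")}:${hash.toString("hex")}`;
}

function verifyPassword(password: string, stored: string): boolean {
  const [saltHex, hashHex] = stored.split(":");
  const candidate = scryptSync(password, Buffer.from(saltHex, "hex"), 32);
  return timingSafeEqual(candidate, Buffer.from(hashHex, "hex"));
}
```
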
### Dependencies & Assumptions

**Dependencies**

- PostgreSQL database must be highly available (RTO 1 hour)
- Redis cache can tolerate data loss (non-critical)
- API Gateway (Nginx) must be deployed and operational
- Message queue (RabbitMQ) must be running

**Assumptions**

- Auth service will handle up to 10k requests/second (based on growth projections)
- User data size remains < 100GB (current: 5GB)
- Network latency between services < 10ms (co-located data center)

### Open Questions

- [ ] Should we use gRPC for service-to-service communication instead of REST?
  - **Status**: Under investigation - benchmarking against REST
- [ ] How do we handle shared user data updates if both services write to the DB?
  - **Status**: Deferred to Phase 3 - will use event sourcing pattern
- [ ] Which message queue (RabbitMQ vs. Kafka)?
  - **Status**: RabbitMQ chosen, but revisit if we need an audit trail of all changes

### Approvals

**Technical Review**

- Lead Backend Engineer - TBD

**Architecture Review**

- VP Engineering - TBD

**Security Review**

- Security Team - TBD

**Approved By**

- TBD

## Writing Tips

### Use Diagrams Effectively

- ASCII art is fine for design docs (easy to version control)
- Show data flow and component interactions
- Label arrows with what data/requests are flowing

### Be Explicit About Trade-offs

- Don't just say "microservices are better"
- Say "We're trading operational complexity for independent scaling because this addresses our 5k req/sec bottleneck"

### Link to Other Specs

- Reference related business requirements: `[BRD-001]`
- Reference technical requirements: `[PRD-001]`
- Reference data models: `[DATA-001]`
- Reference API contracts: `[API-001]`

### Document Rationale

- Each decision needs a "why"
- Explain what alternatives were considered and why they were rejected
- This helps future developers understand the context

### Be Specific About Performance

- Not: "Must be performant"
- Yes: "p95 latency under 100ms, p99 under 200ms, supporting 10k requests/second"

### Consider the Whole System

- Security implications
- Operational/monitoring requirements
- Data consistency model
- Failure modes and recovery
- Future scalability

## Validation & Fixing Issues

### Run the Validator

```bash
scripts/validate-spec.sh docs/specs/design-document/des-001-your-spec.md
```

### Common Issues & Fixes

**Issue**: "Missing Proposed Solution section"

- **Fix**: Add detailed architecture components, design decisions, tech stack

**Issue**: "TODO items in Architecture Components (4 items)"

- **Fix**: Complete descriptions for all components (purpose, technology, responsibilities)

**Issue**: "No Trade-offs documented"

- **Fix**: Explicitly document what you're accepting and what you're gaining

**Issue**: "Missing Performance & Scalability targets"

- **Fix**: Add specific latency, throughput, and availability targets

### Check Completeness

```bash
scripts/check-completeness.sh docs/specs/design-document/des-001-your-spec.md
```

## Decision-Making Framework

As you write the design doc, work through:

1. **Problem**: What are we designing for?
   - Specific pain points or constraints?
   - Performance targets, scalability requirements?

2. **Options**: What architectural approaches could work?
   - Monolith vs. distributed?
   - Synchronous vs. asynchronous?
   - Technology choices?

3. **Evaluation**: How do options compare?
   - Which best addresses the problem?
   - What are the trade-offs?
   - What does the team have experience with?

4. **Decision**: Which approach wins and why?
   - What assumptions must hold?
   - What trade-offs are we accepting?

5. **Implementation**: How do we build/migrate to this?
   - Big bang or incremental?
   - Parallel running period?
   - Rollback plan?

## Next Steps

1. **Create the spec**: `scripts/generate-spec.sh design-document des-XXX-slug`
2. **Research**: Find related specs and understand architecture context
3. **Sketch**: Draw architecture diagrams before writing detailed components
4. **Fill in sections** using this guide
5. **Validate**: `scripts/validate-spec.sh docs/specs/design-document/des-XXX-slug.md`
6. **Get architectural review** before implementation begins
7. **Update related specs**: Create or update technical requirements and implementation plans