How to Create a Design Document

Design Documents provide the detailed architectural and technical design for a system, component, or significant feature. They answer "How will we build this?" after business and technical requirements have been defined.

Quick Start

# 1. Create a new design document
scripts/generate-spec.sh design-document des-001-descriptive-slug

# 2. Open and fill in the file
# (The file will be created at: docs/specs/design-document/des-001-descriptive-slug.md)

# 3. Fill in the sections, then validate:
scripts/validate-spec.sh docs/specs/design-document/des-001-descriptive-slug.md

# 4. Fix issues and check completeness:
scripts/check-completeness.sh docs/specs/design-document/des-001-descriptive-slug.md

When to Write a Design Document

Use a Design Document when you need to:

  • Define system architecture or redesign existing components
  • Document major technical decisions and trade-offs
  • Provide a blueprint for implementation teams
  • Enable architectural review before coding begins
  • Create shared understanding of complex systems

Research Phase

1. Check Related Internal Specs

Find upstream specs that inform your design:

# Find related business requirements
grep -r "brd" docs/specs/ --include="*.md"

# Find related technical requirements
grep -r "prd\|technical" docs/specs/ --include="*.md"

# Find existing design patterns or similar designs
grep -r "design\|architecture" docs/specs/ --include="*.md"

2. Research External Documentation

Research existing architectures and patterns:

  • Look up similar systems: "How do other companies solve this problem?"
  • Research technologies and frameworks you're planning to use
  • Review relevant design patterns or architecture styles
  • Check for security, performance, or scalability best practices

Use tools to fetch external docs:

# Research the latest on your chosen technologies
# Example: Research distributed system patterns
# Example: Research microservices architecture best practices

3. Review Existing Codebase & Architecture

  • What patterns does your codebase already follow?
  • What technologies are you already using?
  • How are similar features currently implemented?
  • What architectural decisions have been made previously?

Ask: "Are we extending existing patterns or introducing new ones?"

4. Understand Constraints

  • What are the performance requirements? (latency, throughput)
  • What scalability targets exist?
  • What security constraints apply?
  • What infrastructure/budget constraints?
  • Team expertise with chosen technologies?

Structure & Content Guide

Title & Metadata

  • Title: "Microservices Architecture for User Service" or similar
  • Type: Architecture | System Design | RFC | Technology Choice
  • Status: Draft | Under Review | Accepted | Rejected
  • Version: 1.0.0 (increment for significant revisions)

Executive Summary

Write 3-4 sentences that answer:

  • What problem does this solve?
  • What's the proposed solution?
  • What are the key trade-offs?

Example:

This design proposes a microservices architecture to scale our user service.
We'll split user management, authentication, and profile handling into
separately deployable services. This trades some operational complexity for independent
scaling and development velocity. Key trade-off: eventual consistency vs.
immediate consistency in cross-service operations.

Problem Statement

Describe the current state and limitations:

Current monolithic architecture handles all user operations in a single service,
causing:
- Bottleneck: The user service is the bottleneck for the entire system
- Scaling: The entire service must scale even if only auth needs capacity
- Deployment: Changes in one area put the entire user service at risk
- Velocity: Teams block each other during development

This design solves these issues by enabling independent scaling and deployment.

Goals & Success Criteria

Primary Goals (3-5 goals)

  • Increase deployment frequency to enable multiple daily deployments
  • Enable independent scaling of auth and profile services
  • Reduce time to market for new user features

Success Criteria (specific, measurable)

  1. Auth service can scale independently to handle 10k requests/sec
  2. Profile service deployment doesn't impact auth service
  3. System reduces MTTR for user service incidents by 50%
  4. Teams can deploy independently without coordination
  5. P95 latency remains under 200ms across service boundaries

Context & Background

Explain the "why now":

Over the past 6 months, we've experienced:
- Auth service saturated at 5k requests/sec during peak hours
- Authentication changes blocked by profile service deployments
- High operational burden managing single monolithic service

Recent customer requests for higher throughput have revealed these bottlenecks.
This design addresses the most urgent scaling constraint (auth service).

Proposed Solution

High-Level Overview

Provide a diagram showing major components and data flow:

┌─────────────┐
│   Client    │
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ API Gateway │
└──────┬──────┘
       │
       ├─→ [Auth Service]    - JWT validation, user login
       │
       └─→ [Profile Service] - User profile, preferences

Both services use a shared data layer:

[Data Layer]
 ├─ User DB (master)
 ├─ Cache (Redis)
 └─ Message Queue (RabbitMQ)

Explain how components interact:

The client sends requests to the API Gateway, which routes them based on endpoint.
The Auth Service handles login/JWT operations. The Profile Service handles profile
reads and writes. Both services consume user data from a shared database, with
eventual consistency maintained via the message queue.
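
To make the routing concrete: the design specifies Nginx or Kong at the gateway, but the routing rules can be sketched in TypeScript/Express with http-proxy-middleware. This is an illustration only; the auth-service/profile-service hostnames and ports are assumptions.

```typescript
// Illustrative gateway routing sketch; not the Nginx/Kong config the design calls for.
import express from "express";
import { createProxyMiddleware } from "http-proxy-middleware";

const gateway = express();

// /auth/* traffic goes to the Auth Service (hostname/port assumed)
gateway.use(
  "/auth",
  createProxyMiddleware({ target: "http://auth-service:3000", changeOrigin: true })
);

// /profile/* traffic goes to the Profile Service (hostname/port assumed)
gateway.use(
  "/profile",
  createProxyMiddleware({ target: "http://profile-service:3000", changeOrigin: true })
);

gateway.listen(8080, () => console.log("gateway listening on :8080"));
```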

Architecture Components

For each major component:

Auth Service

  • Purpose: Handles authentication, token generation, validation
  • Technology: Node.js with Express, Redis for session storage
  • Key Responsibilities:
    • User login/logout
    • JWT token generation and validation
    • Session management
    • Password reset flows
  • Interactions: Calls User DB for credential validation, publishes events to queue (see the login sketch below)
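
A minimal sketch of the login flow described above: bcrypt credential check plus a 1-hour JWT, matching the Security section. The findUserByEmail helper and the JWT_SECRET environment variable are assumptions, not part of the design.

```typescript
// Minimal Auth Service login sketch; error handling is trimmed for brevity.
import express from "express";
import bcrypt from "bcrypt";
import jwt from "jsonwebtoken";

const app = express();
app.use(express.json());

const JWT_SECRET = process.env.JWT_SECRET ?? "dev-only-secret";

// Assumed data-access helper; a real version would query the User DB.
async function findUserByEmail(
  _email: string
): Promise<{ id: string; email: string; password_hash: string } | null> {
  return null;
}

app.post("/auth/login", async (req, res) => {
  const { email, password } = req.body;
  const user = await findUserByEmail(email);
  if (!user || !(await bcrypt.compare(password, user.password_hash))) {
    return res.status(401).json({ error: "invalid credentials" });
  }
  // Token lifetime matches the 1-hour expiry in the Security section.
  const token = jwt.sign({ sub: user.id, email: user.email }, JWT_SECRET, {
    expiresIn: "1h",
  });
  res.json({ token });
});

app.listen(3000);
```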

Profile Service

  • Purpose: Manages user profile data and preferences
  • Technology: Node.js with Express, PostgreSQL for user data
  • Key Responsibilities:
    • Read/write user profile information
    • Manage user preferences
    • Handle profile search and filtering
  • Interactions: Consumes user events from queue, calls shared User DB

API Gateway

  • Purpose: Single entry point, routing, authentication enforcement
  • Technology: Nginx or a dedicated API gateway (e.g., Kong)
  • Key Responsibilities:
    • Route requests to appropriate service
    • Enforce API authentication
    • Rate limiting
    • Request/response transformation
  • Interactions: Routes to Auth and Profile services

Design Decisions

For each significant decision, document:

Decision 1: Microservices vs. Monolith

  • Decision: Adopt microservices architecture
  • Rationale:
    • Independent scaling needed (auth bottleneck at 5k req/sec)
    • Team velocity: Can deploy auth changes independently
    • Loose coupling enables faster iteration
  • Alternatives Considered:
    • Monolith optimization: Caching, database optimization (rejected: can't solve scaling bottleneck)
    • Modular monolith: Improves structure but doesn't enable independent scaling
  • Impact:
    • Gain: Independent scaling, deployment, team velocity
    • Accept: Distributed system complexity, operational overhead, eventual consistency

Decision 2: Synchronous vs. Asynchronous Communication

  • Decision: Use message queue for eventual consistency
  • Rationale:
    • Profile updates don't need to be immediately consistent with the auth service
    • Reduces coupling: Auth service doesn't wait for profile service
    • Improves resilience: Profile service failure doesn't affect auth
  • Alternatives Considered:
    • Synchronous REST calls: Simpler but tight coupling, availability issues
    • Event sourcing: Over-engineered for current needs
  • Impact:
    • Gain: Resilience, reduced coupling, independent scaling
    • Accept: Eventual consistency, operational complexity (message queue); see the event-flow sketch below
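
To make the asynchronous flow concrete, here is a minimal sketch of the publish/consume pattern this decision implies, using amqplib against RabbitMQ. The queue name, connection URL, and event shape are assumptions.

```typescript
// Sketch of the asynchronous event flow from Decision 2 (amqplib / RabbitMQ).
import amqp from "amqplib";

const QUEUE = "user-events"; // assumed queue name

// Auth Service side: publish an event instead of calling Profile Service directly.
export async function publishUserUpdated(userId: string) {
  const conn = await amqp.connect(process.env.AMQP_URL ?? "amqp://localhost");
  const ch = await conn.createChannel();
  await ch.assertQueue(QUEUE, { durable: true });
  ch.sendToQueue(
    QUEUE,
    Buffer.from(JSON.stringify({ type: "user.updated", userId })),
    { persistent: true }
  );
  await ch.close();
  await conn.close();
}

// Profile Service side: consume events and apply changes eventually.
export async function consumeUserEvents() {
  const conn = await amqp.connect(process.env.AMQP_URL ?? "amqp://localhost");
  const ch = await conn.createChannel();
  await ch.assertQueue(QUEUE, { durable: true });
  await ch.consume(QUEUE, (msg) => {
    if (!msg) return;
    const event = JSON.parse(msg.content.toString());
    // ...apply the update to the Profile Service's view of the user...
    console.log("applying event", event.type);
    ch.ack(msg);
  });
}
```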

Technology Stack

Language & Runtime

  • Node.js 18 LTS - Rationale: Existing expertise, good async support
  • Express - Lightweight, flexible framework the team knows

Data Layer

  • PostgreSQL (primary database) - Reliable, ACID transactions for user data
  • Redis (cache layer) - Session storage, auth token cache

Infrastructure

  • Kubernetes for orchestration - Running multiple services at scale
  • Docker for containerization - Consistent deployment

Key Libraries/Frameworks

  • Express (v4.18) - HTTP framework
  • jsonwebtoken - JWT token handling
  • @aws-sdk - AWS SDK for future integration
  • Jest - Testing framework

Data Model & Storage

Storage Strategy

  • Primary Database: PostgreSQL with user table containing:
    • id, email, password_hash, created_at, updated_at
    • One-to-many relationship with user_preferences
  • Caching: Redis stores JWT token metadata and session info with a 1-hour TTL (see the caching sketch below)
  • Data Retention: User data retained indefinitely; sessions cleaned up after TTL
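
A minimal sketch of the caching strategy above, using node-redis v4. The key naming scheme and the stored session shape are assumptions.

```typescript
// Session/token metadata cached in Redis with a 1-hour TTL.
import { createClient } from "redis";

const redis = createClient({ url: process.env.REDIS_URL ?? "redis://localhost:6379" });

export async function cacheSession(tokenId: string, session: { userId: string }) {
  if (!redis.isOpen) await redis.connect();
  // EX: 3600 gives the 1-hour TTL from the Storage Strategy above.
  await redis.set(`session:${tokenId}`, JSON.stringify(session), { EX: 3600 });
}

export async function getSession(tokenId: string) {
  if (!redis.isOpen) await redis.connect();
  const raw = await redis.get(`session:${tokenId}`);
  return raw ? (JSON.parse(raw) as { userId: string }) : null;
}
```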

Schema Overview

Users Table:
- id (primary key)
- email (unique index)
- password_hash
- created_at
- updated_at

User Preferences:
- id
- user_id (foreign key)
- key (e.g., theme, language)
- value
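
For illustration only, the two tables can be mirrored as TypeScript shapes to make the one-to-many relationship explicit. Column types are assumptions; the authoritative schema lives in the database migrations.

```typescript
// Illustrative shapes mirroring the schema above.
interface User {
  id: string;            // primary key
  email: string;         // unique index
  password_hash: string;
  created_at: Date;
  updated_at: Date;
}

interface UserPreference {
  id: string;
  user_id: string;       // foreign key -> User.id (one user, many preferences)
  key: string;           // e.g., "theme", "language"
  value: string;
}
```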

API & Integration Points

External Dependencies

  • Integrates with existing Payment Service for billing
  • Consumes events from Billing Service (subscription changes)
  • Publishes user events to event bus for downstream services

Key Endpoints (reference full API spec):

  • POST /auth/login - User login
  • POST /auth/logout - User logout
  • GET /profile - Fetch user profile
  • PUT /profile - Update user profile

(See [API-001] for complete endpoint specifications)

Trade-offs

Accepting

  • Operational complexity: Must manage multiple services, deployments, monitoring
  • Eventual consistency: Changes propagate through message queue, not immediate
  • Debugging complexity: Cross-service issues harder to debug

Gaining

  • Independent scaling: Auth service can scale without scaling profile service
  • Team autonomy: Teams can deploy independently without coordination
  • Failure isolation: Auth service failure doesn't take down profile service
  • Development velocity: Faster iteration, less blocking

Implementation

Approach: Phased migration - Extract services incrementally without a big-bang rewrite

Phases:

  1. Phase 1 (Week 1-2): Extract Auth Service

    • Deliverables: Auth service running in parallel, API Gateway routing auth requests
    • Testing: Canary traffic (10%) to new service
  2. Phase 2 (Week 3-4): Migrate Auth Traffic

    • Deliverables: 100% auth traffic on new service, rollback plan tested
    • Verification: Auth latency, error rates compared to baseline
  3. Phase 3 (Week 5-6): Extract Profile Service

    • Deliverables: Profile service independent, event queue running
    • Testing: Data consistency verification across message queue

Migration Strategy:

  • Run both monolith and microservices in parallel initially
  • Use API Gateway to route traffic, allow A/B testing
  • Maintain ability to rollback quickly if issues arise
  • Monitor closely for latency/error rate increases

(See [PLN-001] for detailed implementation roadmap)

Performance & Scalability

Performance Targets

  • Latency: Auth service p95 < 100ms, p99 < 200ms
  • Throughput: Auth service handles 10k requests/second
  • Availability: 99.9% uptime for auth service

Scalability Strategy

  • Scaling Approach: Horizontal - Add more auth service instances behind load balancer
  • Bottlenecks: Database connection pool size (limit 100 connections per service instance)
    • Mitigation: PgBouncer connection pooling, read replicas for read operations (see the pooling sketch below)
  • Auto-scaling: Kubernetes HPA scales auth service from 3 to 20 replicas based on CPU
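
As a rough sketch of the connection-pool mitigation, each service instance can cap its own pool with the pg library, with PgBouncer still sitting in front of PostgreSQL. The pool size and the query are assumptions.

```typescript
// Bounded per-instance connection pool with the pg library.
import { Pool } from "pg";

export const pool = new Pool({
  connectionString: process.env.DATABASE_URL,
  max: 10,                   // per-instance cap, well below the 100-connection limit
  idleTimeoutMillis: 30_000, // release idle connections promptly
});

export async function getUserById(id: string) {
  const { rows } = await pool.query("SELECT id, email FROM users WHERE id = $1", [id]);
  return rows[0] ?? null;
}
```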

Monitoring & Observability

  • Metrics: Request latency (p50, p95, p99), error rate, service availability
  • Alerting: Alert if auth latency p95 > 150ms, error rate > 0.5%
  • Logging: Structured JSON logs with a request ID for tracing across services (see the middleware sketch below)
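
A minimal sketch of the logging approach: an Express middleware that propagates a request ID and emits one structured JSON line per request. The x-request-id header name is an assumed convention.

```typescript
// Structured JSON request logging with a propagated request ID.
import express from "express";
import { randomUUID } from "node:crypto";

export function requestLogging(app: express.Express) {
  app.use((req, res, next) => {
    // Reuse the caller's request ID if present so traces span services.
    const requestId = req.header("x-request-id") ?? randomUUID();
    res.setHeader("x-request-id", requestId);
    res.on("finish", () => {
      console.log(
        JSON.stringify({
          ts: new Date().toISOString(),
          requestId,
          method: req.method,
          path: req.path,
          status: res.statusCode,
        })
      );
    });
    next();
  });
}
```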

Security

Authentication

  • JWT tokens issued by Auth Service, validated by API Gateway
  • Token expiration: 1 hour, refresh tokens for extended sessions

Authorization

  • Role-based access control (RBAC) enforced at the API Gateway (see the sketch below)
  • Profile service doesn't repeat auth checks (trusts gateway)
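
A minimal sketch of gateway-side RBAC: verify the JWT issued by the Auth Service and check a role claim before forwarding. The "roles" claim name and the role values are assumptions.

```typescript
// Gateway-side RBAC check before proxying to downstream services.
import type { Request, Response, NextFunction } from "express";
import jwt from "jsonwebtoken";

const JWT_SECRET = process.env.JWT_SECRET ?? "dev-only-secret";

export function requireRole(role: string) {
  return (req: Request, res: Response, next: NextFunction) => {
    const token = req.header("authorization")?.replace(/^Bearer /, "");
    if (!token) return res.status(401).json({ error: "missing token" });
    try {
      const claims = jwt.verify(token, JWT_SECRET) as { roles?: string[] };
      if (!claims.roles?.includes(role)) {
        return res.status(403).json({ error: "insufficient role" });
      }
      next();
    } catch {
      res.status(401).json({ error: "invalid or expired token" });
    }
  };
}

// Usage at the gateway, e.g.: app.put("/profile", requireRole("user"), proxyToProfile)
// where proxyToProfile is whatever handler forwards to the Profile Service.
```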

Data Protection

  • Encryption at Rest: PostgreSQL database encryption enabled
  • Encryption in Transit: TLS 1.3 for all service-to-service communication
  • PII Handling: Passwords hashed with bcrypt (cost factor 12)

Secrets Management

  • Database credentials stored in Kubernetes secrets
  • JWT signing key rotated quarterly
  • Environment-based secret injection at runtime

Compliance

  • GDPR: User data can be exported via profile service
  • SOC2: Audit logging enabled for user data access

Dependencies & Assumptions

Dependencies

  • PostgreSQL database must be highly available (RTO 1 hour)
  • Redis cache can tolerate data loss (non-critical)
  • API Gateway (Nginx) must be deployed and operational
  • Message queue (RabbitMQ) must be running

Assumptions

  • Auth service will handle up to 10k requests/second (based on growth projections)
  • User data size remains < 100GB (current: 5GB)
  • Network latency between services < 10ms (co-located data center)

Open Questions

  • Should we use gRPC for service-to-service communication instead of REST?
    • Status: Under investigation - benchmarking against REST
  • How do we handle shared user data updates if both services write to DB?
    • Status: Deferred to Phase 3 - will use event sourcing pattern
  • Which message queue (RabbitMQ vs. Kafka)?
    • Status: RabbitMQ chosen, but revisit if we need audit trail of all changes

Approvals

Technical Review

  • Lead Backend Engineer - TBD

Architecture Review

  • VP Engineering - TBD

Security Review

  • Security Team - TBD

Approved By

  • TBD

## Writing Tips

### Use Diagrams Effectively
- ASCII art is fine for design docs (easy to version control)
- Show data flow and component interactions
- Label arrows with what data/requests are flowing

### Be Explicit About Trade-offs
- Don't just say "microservices is better"
- Say "We're trading operational complexity for independent scaling because this addresses our 5k req/sec bottleneck"

### Link to Other Specs
- Reference related business requirements: `[BRD-001]`
- Reference technical requirements: `[PRD-001]`
- Reference data models: `[DATA-001]`
- Reference API contracts: `[API-001]`

### Document Rationale
- Each decision needs a "why"
- Explain what alternatives were considered and why they were rejected
- This helps future developers understand the context

### Be Specific About Performance
- Not: "Must be performant"
- Yes: "p95 latency under 100ms, p99 under 200ms, supporting 10k requests/second"

### Consider the Whole System
- Security implications
- Operational/monitoring requirements
- Data consistency model
- Failure modes and recovery
- Future scalability

## Validation & Fixing Issues

### Run the Validator
```bash
scripts/validate-spec.sh docs/specs/design-document/des-001-your-spec.md
```

Common Issues & Fixes

Issue: "Missing Proposed Solution section"

  • Fix: Add detailed architecture components, design decisions, tech stack

Issue: "TODO items in Architecture Components (4 items)"

  • Fix: Complete descriptions for all components (purpose, technology, responsibilities)

Issue: "No Trade-offs documented"

  • Fix: Explicitly document what you're accepting and what you're gaining

Issue: "Missing Performance & Scalability targets"

  • Fix: Add specific latency, throughput, and availability targets

Check Completeness

scripts/check-completeness.sh docs/specs/design-document/des-001-your-spec.md

Decision-Making Framework

As you write the design doc, work through:

  1. Problem: What are we designing for?

    • Specific pain points or constraints?
    • Performance targets, scalability requirements?
  2. Options: What architectural approaches could work?

    • Monolith vs. distributed?
    • Synchronous vs. asynchronous?
    • Technology choices?
  3. Evaluation: How do options compare?

    • Which best addresses the problem?
    • What are the trade-offs?
    • What does the team have experience with?
  4. Decision: Which approach wins and why?

    • What assumptions must hold?
    • What trade-offs are we accepting?
  5. Implementation: How do we build/migrate to this?

    • Big bang or incremental?
    • Parallel running period?
    • Rollback plan?

Next Steps

  1. Create the spec: scripts/generate-spec.sh design-document des-XXX-slug
  2. Research: Find related specs and understand architecture context
  3. Sketch: Draw architecture diagrams before writing detailed components
  4. Fill in sections using this guide
  5. Validate: scripts/validate-spec.sh docs/specs/design-document/des-XXX-slug.md
  6. Get architectural review before implementation begins
  7. Update related specs: Create or update technical requirements and implementation plans