12 KiB
12 KiB
Example: System Architecture Documentation with Mermaid Diagrams
Complete workflow for creating comprehensive system architecture documentation for a distributed Grey Haven application.
Context
Project: Multi-Tenant SaaS Platform (TanStack Start + Cloudflare Workers + FastAPI + PostgreSQL) Problem: New developers taking 3-4 weeks to understand system architecture, high onboarding cost Goal: Create comprehensive architecture documentation that reduces onboarding time to <1 week
Initial State:
- No architecture documentation
- Tribal knowledge spread across 8 senior developers
- New hires asking same questions repeatedly
- 3-4 weeks until new developer productive
- Architecture decisions not documented (ADRs missing)
Step 1: System Overview with Mermaid
High-Level Architecture Diagram
graph TB
subgraph "Client Layer"
Browser[Web Browser]
Mobile[Mobile App]
end
subgraph "Edge Layer (Cloudflare Workers)"
Gateway[API Gateway]
Auth[Auth Service]
Cache[KV Cache]
end
subgraph "Application Layer"
Frontend[TanStack Start<br/>React 19]
Backend[FastAPI Backend<br/>Python 3.12]
end
subgraph "Data Layer"
PostgreSQL[(PostgreSQL<br/>PlanetScale)]
Redis[(Redis Cache<br/>Upstash)]
S3[(R2 Object Storage<br/>Cloudflare)]
end
subgraph "External Services"
Stripe[Stripe<br/>Payments]
SendGrid[SendGrid<br/>Email]
DataDog[DataDog<br/>Monitoring]
end
Browser --> Gateway
Mobile --> Gateway
Gateway --> Auth
Gateway --> Frontend
Gateway --> Backend
Auth --> Cache
Frontend --> PostgreSQL
Backend --> PostgreSQL
Backend --> Redis
Backend --> S3
Backend --> Stripe
Backend --> SendGrid
Backend -.telemetry.-> DataDog
Step 2: Request Flow Sequence Diagrams
User Authentication Flow
sequenceDiagram
actor User
participant Browser
participant Gateway as API Gateway<br/>(Cloudflare Worker)
participant Auth as Auth Service<br/>(Cloudflare Worker)
participant KV as KV Cache
participant DB as PostgreSQL
User->>Browser: Enter email/password
Browser->>Gateway: POST /auth/login
Gateway->>Auth: Validate credentials
Auth->>DB: Query user by email
DB-->>Auth: User record
alt Valid Credentials
Auth->>Auth: Hash password & verify
Auth->>Auth: Generate JWT token
Auth->>KV: Store session (token -> user_id)
KV-->>Auth: OK
Auth-->>Gateway: {token, user}
Gateway-->>Browser: 200 OK {token, user}
Browser->>Browser: Store token in localStorage
Browser-->>User: Redirect to dashboard
else Invalid Credentials
Auth-->>Gateway: 401 Unauthorized
Gateway-->>Browser: {error: "INVALID_CREDENTIALS"}
Browser-->>User: Show error message
end
Multi-Tenant Data Access Flow
sequenceDiagram
participant Client
participant Gateway
participant Backend as FastAPI Backend
participant DB as PostgreSQL<br/>(Row-Level Security)
Client->>Gateway: GET /api/orders<br/>Authorization: Bearer <token>
Gateway->>Gateway: Validate JWT token
Gateway->>Gateway: Extract tenant_id from token
Gateway->>Backend: Forward request<br/>X-Tenant-ID: tenant_123
Backend->>Backend: Set session context<br/>SET app.tenant_id = 'tenant_123'
Backend->>DB: SELECT * FROM orders<br/>(RLS automatically filters by tenant)
Note over DB: Row-Level Security Policy:<br/>CREATE POLICY tenant_isolation ON orders<br/>FOR SELECT USING (tenant_id = current_setting('app.tenant_id'))
DB-->>Backend: Orders for tenant_123 only
Backend-->>Gateway: {orders: [...]}
Gateway-->>Client: 200 OK {orders: [...]}
Step 3: Data Flow Diagram
Order Processing Data Flow
flowchart LR
User[User Creates Order] --> Validation[Validate Order Data]
Validation --> Stock{Check Stock<br/>Availability}
Stock -->|Insufficient| Error[Return 400 Error]
Stock -->|Available| Reserve[Reserve Inventory]
Reserve --> Payment[Process Payment<br/>via Stripe]
Payment -->|Failed| Release[Release Reservation]
Release --> Error
Payment -->|Success| CreateOrder[Create Order<br/>in Database]
CreateOrder --> Queue[Queue Email<br/>Confirmation]
Queue --> Cache[Invalidate<br/>User Cache]
Cache --> Success[Return Order]
Success --> Async[Async: Send Email<br/>via SendGrid]
Success --> Metrics[Update Metrics<br/>in DataDog]
Step 4: Database Schema ER Diagram
erDiagram
TENANT ||--o{ USER : has
TENANT ||--o{ ORDER : has
USER ||--o{ ORDER : places
ORDER ||--|{ ORDER_ITEM : contains
PRODUCT ||--o{ ORDER_ITEM : included_in
TENANT ||--o{ PRODUCT : owns
TENANT {
uuid id PK
string name
string subdomain UK
timestamp created_at
}
USER {
uuid id PK
uuid tenant_id FK
string email UK
string hashed_password
string role
timestamp created_at
}
PRODUCT {
uuid id PK
uuid tenant_id FK
string name
decimal price
int stock
}
ORDER {
uuid id PK
uuid tenant_id FK
uuid user_id FK
decimal subtotal
decimal tax
decimal total
string status
timestamp created_at
}
ORDER_ITEM {
uuid id PK
uuid order_id FK
uuid product_id FK
int quantity
decimal unit_price
}
Step 5: Deployment Architecture
graph TB
subgraph "Development"
DevBranch[Feature Branch]
DevEnv[Dev Environment<br/>Cloudflare Preview]
end
subgraph "Staging"
MainBranch[Main Branch]
StageEnv[Staging Environment<br/>staging.greyhaven.com]
StageDB[(Staging PostgreSQL)]
end
subgraph "Production"
Release[Release Tag]
ProdWorkers[Cloudflare Workers<br/>300+ Datacenters]
ProdDB[(Production PostgreSQL<br/>PlanetScale)]
ProdCache[(Redis Cache<br/>Upstash)]
end
DevBranch -->|git push| CI1[GitHub Actions]
CI1 -->|Deploy| DevEnv
DevBranch -->|PR Merged| MainBranch
MainBranch -->|Deploy| CI2[GitHub Actions]
CI2 -->|Run Tests| TestSuite
TestSuite -->|Success| StageEnv
StageEnv --> StageDB
MainBranch -->|git tag v1.0.0| Release
Release -->|Deploy| CI3[GitHub Actions]
CI3 -->|Canary 10%| ProdWorkers
CI3 -->|Monitor 10 min| Metrics
Metrics -->|Success| FullDeploy[100% Rollout]
FullDeploy --> ProdWorkers
ProdWorkers --> ProdDB
ProdWorkers --> ProdCache
Step 6: State Machine Diagram for Order Status
stateDiagram-v2
[*] --> Pending: Order Created
Pending --> Processing: Payment Confirmed
Pending --> Cancelled: Payment Failed
Processing --> Shipped: Fulfillment Complete
Processing --> Cancelled: Out of Stock
Shipped --> Delivered: Tracking Confirmed
Shipped --> Returned: Customer Return
Delivered --> Returned: Return Requested
Returned --> Refunded: Return Approved
Cancelled --> [*]
Delivered --> [*]
Refunded --> [*]
note right of Pending
Inventory reserved
Payment processing
end note
note right of Processing
Items picked
Preparing shipment
end note
note right of Shipped
Tracking number assigned
In transit
end note
Step 7: Architecture Decision Records (ADRs)
ADR-001: Choose Cloudflare Workers for Edge Computing
# ADR-001: Use Cloudflare Workers for API Gateway and Auth
**Date**: 2024-01-15
**Status**: Accepted
**Decision Makers**: Engineering Team
## Context
We need an edge computing platform for API gateway, authentication, and caching that:
- Provides global low latency (<50ms p95)
- Scales automatically without management
- Integrates with our CDN infrastructure
- Supports multi-tenant architecture
## Decision
We will use Cloudflare Workers for edge computing with KV for session storage.
## Alternatives Considered
1. **AWS Lambda@Edge**: Good performance but vendor lock-in, higher cost
2. **Traditional Load Balancer**: Single region, no edge caching
3. **Self-hosted Edge Nodes**: Complex deployment, maintenance overhead
## Consequences
**Positive**:
- Global deployment (300+ datacenters) with <50ms latency worldwide
- Auto-scaling to zero cost when idle
- Built-in DDoS protection and WAF
- KV storage for session caching (sub-millisecond reads)
- 1ms CPU time limit forces efficient code
**Negative**:
- 1ms CPU time limit requires careful optimization
- Cold starts (though <10ms typically)
- Limited to JavaScript/TypeScript/Rust/Python (via Pyodide)
- No native PostgreSQL driver (must use HTTP-based client)
## Implementation
- API Gateway: Handles routing, CORS, rate limiting
- Auth Service: JWT validation, session management (KV)
- Cache Layer: API response caching (KV + Cache API)
## Monitoring
- Worker CPU time (aim for <500μs p95)
- KV cache hit rate (aim for >95%)
- Edge response time (aim for <50ms p95)
ADR-002: PostgreSQL with Row-Level Security for Multi-Tenancy
# ADR-002: PostgreSQL Row-Level Security (RLS) for Multi-Tenant Isolation
**Date**: 2024-01-20
**Status**: Accepted
## Context
Multi-tenant SaaS requires strict data isolation. Accidental cross-tenant data access would be a critical security breach.
## Decision
Use PostgreSQL Row-Level Security (RLS) policies to enforce tenant isolation at the database level.
## Implementation
```sql
-- Enable RLS on all tables
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;
-- Create policy that filters by session tenant_id
CREATE POLICY tenant_isolation ON orders
FOR ALL
USING (tenant_id = current_setting('app.tenant_id', true)::uuid);
-- Application sets tenant context per request
SET app.tenant_id = '<tenant_id_from_jwt>';
Consequences
Positive:
- Database-level enforcement (cannot be bypassed by application bugs)
- Automatic filtering on all queries (including ORMs)
- Performance: RLS uses indexes efficiently
Negative:
- Requires setting session context per connection
- Slightly more complex query plans
Monitoring
- Weekly audit: Check for tables missing RLS
- Quarterly penetration test: Attempt cross-tenant access
## Results
### Before
- No architecture documentation
- 3-4 weeks until new developer productive
- 15+ hours/week answering architecture questions
- Architecture decisions lost to time
- Difficult to identify bottlenecks
### After
- Comprehensive architecture docs with 8 Mermaid diagrams
- 5 Architecture Decision Records documenting key choices
- Documentation in Git (versioned, reviewed)
- Interactive diagrams (clickable, navigable)
### Improvements
- Onboarding time: 3-4 weeks → 4-5 days (75% reduction)
- Architecture questions: 15 hrs/week → 2 hrs/week (87% reduction)
- New developer productivity: Week 4 → Week 1
- Time to understand data flow: 2 weeks → 1 day
### Developer Feedback
- "The sequence diagrams made auth flow crystal clear"
- "ERD diagram helped me understand relationships immediately"
- "ADRs answered 'why did we choose X?' questions"
## Key Lessons
1. **Mermaid Diagrams**: Version-controlled, reviewable, always up-to-date
2. **Multiple Perspectives**: System, sequence, data flow, deployment diagrams all needed
3. **ADRs are Critical**: "Why" is as important as "what"
4. **Progressive Disclosure**: Overview first, then drill into details
5. **Keep Diagrams Simple**: One concept per diagram, not everything at once
## Prevention Measures
**Implemented**:
- [x] All architecture docs in Git (versioned)
- [x] Mermaid diagrams (not static images)
- [x] ADR template for all major decisions
- [x] Onboarding checklist includes reading architecture docs
**Ongoing**:
- [ ] Auto-generate diagrams from code (infrastructure as code)
- [ ] Quarterly architecture review (docs up-to-date?)
- [ ] New ADR for every major technical decision
---
Related: [openapi-generation.md](openapi-generation.md) | [coverage-validation.md](coverage-validation.md) | [Return to INDEX](INDEX.md)