zhongwei/gh-greyhaven-ai-claude-code-config-grey-haven-plugins-developer-experience

Zhongwei Li be476a3fea Initial commit

2025-11-29 18:29:15 +08:00

Example: System Architecture Documentation with Mermaid Diagrams

Complete workflow for creating comprehensive system architecture documentation for a distributed Grey Haven application.

Context

Project: Multi-Tenant SaaS Platform (TanStack Start + Cloudflare Workers + FastAPI + PostgreSQL)

Problem: New developers taking 3-4 weeks to understand system architecture, high onboarding cost

Goal: Create comprehensive architecture documentation that reduces onboarding time to <1 week

Initial State:

No architecture documentation
Tribal knowledge spread across 8 senior developers
New hires asking same questions repeatedly
3-4 weeks until new developer productive
Architecture decisions not documented (ADRs missing)

Step 1: System Overview with Mermaid

High-Level Architecture Diagram

graph TB
    subgraph "Client Layer"
        Browser[Web Browser]
        Mobile[Mobile App]
    end

    subgraph "Edge Layer (Cloudflare Workers)"
        Gateway[API Gateway]
        Auth[Auth Service]
        Cache[KV Cache]
    end

    subgraph "Application Layer"
        Frontend[TanStack Start<br/>React 19]
        Backend[FastAPI Backend<br/>Python 3.12]
    end

    subgraph "Data Layer"
        PostgreSQL[(PostgreSQL<br/>PlanetScale)]
        Redis[(Redis Cache<br/>Upstash)]
        S3[(R2 Object Storage<br/>Cloudflare)]
    end

    subgraph "External Services"
        Stripe[Stripe<br/>Payments]
        SendGrid[SendGrid<br/>Email]
        DataDog[DataDog<br/>Monitoring]
    end

    Browser --> Gateway
    Mobile --> Gateway
    Gateway --> Auth
    Gateway --> Frontend
    Gateway --> Backend
    Auth --> Cache
    Frontend --> PostgreSQL
    Backend --> PostgreSQL
    Backend --> Redis
    Backend --> S3
    Backend --> Stripe
    Backend --> SendGrid
    Backend -.telemetry.-> DataDog

Step 2: Request Flow Sequence Diagrams

User Authentication Flow

sequenceDiagram
    actor User
    participant Browser
    participant Gateway as API Gateway<br/>(Cloudflare Worker)
    participant Auth as Auth Service<br/>(Cloudflare Worker)
    participant KV as KV Cache
    participant DB as PostgreSQL

    User->>Browser: Enter email/password
    Browser->>Gateway: POST /auth/login
    Gateway->>Auth: Validate credentials
    Auth->>DB: Query user by email
    DB-->>Auth: User record
    Auth->>Auth: Verify password against stored hash

    alt Valid Credentials
        Auth->>Auth: Generate JWT token
        Auth->>KV: Store session (token -> user_id)
        KV-->>Auth: OK
        Auth-->>Gateway: {token, user}
        Gateway-->>Browser: 200 OK {token, user}
        Browser->>Browser: Store token in localStorage
        Browser-->>User: Redirect to dashboard
    else Invalid Credentials
        Auth-->>Gateway: 401 Unauthorized
        Gateway-->>Browser: {error: "INVALID_CREDENTIALS"}
        Browser-->>User: Show error message
    end
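
The login sequence above can be sketched in Python using only the standard library. This is a minimal illustration, not the project's implementation: the in-memory `USERS` and `SESSIONS` dictionaries stand in for PostgreSQL and the KV cache, and `SECRET_KEY`, `register`, `make_jwt`, and `login` are hypothetical names introduced here.

```python
import base64
import hashlib
import hmac
import json
import os
import time

# Hypothetical stand-ins: the real system uses PostgreSQL and Workers KV.
SECRET_KEY = b"demo-secret-change-me"
USERS: dict[str, dict] = {}     # email -> user record (stands in for PostgreSQL)
SESSIONS: dict[str, str] = {}   # token -> user_id (stands in for the KV cache)


def hash_password(password: str, salt: bytes) -> bytes:
    """Derive a password hash with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


def register(email: str, password: str, user_id: str) -> None:
    """Store a salted password hash, never the plain password."""
    salt = os.urandom(16)
    USERS[email] = {"id": user_id, "salt": salt, "hash": hash_password(password, salt)}


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def make_jwt(claims: dict) -> str:
    """Build an HS256-signed JWT from header, payload, and signature."""
    header = _b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = _b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = hmac.new(SECRET_KEY, signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{_b64url(signature)}"


def login(email: str, password: str) -> tuple[int, dict]:
    """Return (status, body) following the two branches of the diagram."""
    user = USERS.get(email)
    # Unknown user and wrong password produce the same error on purpose.
    if user is None or not hmac.compare_digest(
        hash_password(password, user["salt"]), user["hash"]
    ):
        return 401, {"error": "INVALID_CREDENTIALS"}
    token = make_jwt({"sub": user["id"], "iat": int(time.time())})
    SESSIONS[token] = user["id"]  # "Store session (token -> user_id)"
    return 200, {"token": token, "user": {"id": user["id"], "email": email}}
```

Calling `register("a@example.com", "s3cret", "user_1")` and then `login("a@example.com", "s3cret")` returns status 200 with a three-part token, while a wrong password returns 401 with the `INVALID_CREDENTIALS` error.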

Multi-Tenant Data Access Flow

sequenceDiagram
    participant Client
    participant Gateway
    participant Backend as FastAPI Backend
    participant DB as PostgreSQL<br/>(Row-Level Security)

    Client->>Gateway: GET /api/orders<br/>Authorization: Bearer <token>
    Gateway->>Gateway: Validate JWT token
    Gateway->>Gateway: Extract tenant_id from token
    Gateway->>Backend: Forward request<br/>X-Tenant-ID: tenant_123
    
    Backend->>Backend: Set session context<br/>SET app.tenant_id = 'tenant_123'
    Backend->>DB: SELECT * FROM orders<br/>(RLS automatically filters by tenant)
    
    Note over DB: Row-Level Security Policy:<br/>CREATE POLICY tenant_isolation ON orders<br/>FOR SELECT USING (tenant_id = current_setting('app.tenant_id')::uuid)
    
    DB-->>Backend: Orders for tenant_123 only
    Backend-->>Gateway: {orders: [...]}
    Gateway-->>Client: 200 OK {orders: [...]}
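
A minimal sketch of the backend half of this flow, assuming a DB-API style cursor with `%s` placeholders (as psycopg uses). The `Cursor` protocol, the function names, and the `MISSING_TENANT` error code are hypothetical. PostgreSQL's `SET` does not accept bind parameters, so the sketch uses the built-in `set_config` function; passing `true` as its third argument scopes the value to the current transaction, which helps when connections are pooled.

```python
from typing import Any, Protocol


class Cursor(Protocol):
    """Minimal subset of a DB-API cursor used by this sketch."""

    def execute(self, sql: str, params: tuple = ()) -> Any: ...

    def fetchall(self) -> list: ...


def fetch_orders_for_tenant(cursor: Cursor, tenant_id: str) -> list:
    """Set the tenant context, then let RLS filter the query."""
    # Transaction-local setting: it reverts when the transaction ends.
    cursor.execute("SELECT set_config('app.tenant_id', %s, true)", (tenant_id,))
    # No WHERE tenant_id = ... here; the RLS policy applies the filter.
    cursor.execute("SELECT * FROM orders")
    return cursor.fetchall()


def handle_get_orders(cursor: Cursor, headers: dict[str, str]) -> tuple[int, dict]:
    """Handle GET /api/orders using the X-Tenant-ID header set by the gateway."""
    tenant_id = headers.get("X-Tenant-ID")
    if not tenant_id:
        # Hypothetical error shape; the source does not define this case.
        return 400, {"error": "MISSING_TENANT"}
    return 200, {"orders": fetch_orders_for_tenant(cursor, tenant_id)}
```

Both statements must run inside the same transaction for the transaction-local setting to be visible to the query.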

Step 3: Data Flow Diagram

Order Processing Data Flow

flowchart LR
    User[User Creates Order] --> Validation[Validate Order Data]
    Validation --> Stock{Check Stock<br/>Availability}
    
    Stock -->|Insufficient| Error[Return 400 Error]
    Stock -->|Available| Reserve[Reserve Inventory]
    
    Reserve --> Payment[Process Payment<br/>via Stripe]
    Payment -->|Failed| Release[Release Reservation]
    Release --> Error
    
    Payment -->|Success| CreateOrder[Create Order<br/>in Database]
    CreateOrder --> Queue[Queue Email<br/>Confirmation]
    Queue --> Cache[Invalidate<br/>User Cache]
    Cache --> Success[Return Order]
    
    Success --> Async[Async: Send Email<br/>via SendGrid]
    Success --> Metrics[Update Metrics<br/>in DataDog]
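
The reserve, pay, and release path in this flowchart is a compensation pattern: if payment fails, the earlier reservation is undone. A minimal in-memory sketch follows; `Inventory`, `create_order`, the `charge` callback, the error codes, and the 201 status are illustrative assumptions rather than the project's API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Inventory:
    """In-memory stand-in for the stock table."""

    stock: dict[str, int]
    reserved: dict[str, int] = field(default_factory=dict)

    def reserve(self, sku: str, qty: int) -> bool:
        """Reserve qty units if enough unreserved stock remains."""
        available = self.stock.get(sku, 0) - self.reserved.get(sku, 0)
        if qty > available:
            return False
        self.reserved[sku] = self.reserved.get(sku, 0) + qty
        return True

    def release(self, sku: str, qty: int) -> None:
        """Undo a reservation (the compensation step)."""
        self.reserved[sku] = self.reserved.get(sku, 0) - qty


def create_order(
    inventory: Inventory,
    charge: Callable[[str, int], bool],
    sku: str,
    qty: int,
    email_queue: list,
) -> tuple[int, dict]:
    """Validate, reserve stock, charge, then create the order."""
    if qty <= 0:
        return 400, {"error": "INVALID_ORDER"}
    if not inventory.reserve(sku, qty):
        return 400, {"error": "INSUFFICIENT_STOCK"}
    if not charge(sku, qty):
        inventory.release(sku, qty)  # payment failed: release the reservation
        return 400, {"error": "PAYMENT_FAILED"}
    order = {"sku": sku, "quantity": qty}
    email_queue.append(order)  # confirmation email is sent asynchronously later
    return 201, {"order": order}
```

Passing `charge` as a callback keeps the payment provider out of the sketch, so the failure branch can be exercised with a function that simply returns `False`.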

Step 4: Database Schema ER Diagram

erDiagram
    TENANT ||--o{ USER : has
    TENANT ||--o{ ORDER : has
    USER ||--o{ ORDER : places
    ORDER ||--|{ ORDER_ITEM : contains
    PRODUCT ||--o{ ORDER_ITEM : included_in
    TENANT ||--o{ PRODUCT : owns
    
    TENANT {
        uuid id PK
        string name
        string subdomain UK
        timestamp created_at
    }
    
    USER {
        uuid id PK
        uuid tenant_id FK
        string email UK
        string hashed_password
        string role
        timestamp created_at
    }
    
    PRODUCT {
        uuid id PK
        uuid tenant_id FK
        string name
        decimal price
        int stock
    }
    
    ORDER {
        uuid id PK
        uuid tenant_id FK
        uuid user_id FK
        decimal subtotal
        decimal tax
        decimal total
        string status
        timestamp created_at
    }
    
    ORDER_ITEM {
        uuid id PK
        uuid order_id FK
        uuid product_id FK
        int quantity
        decimal unit_price
    }
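
The `ORDER` columns `subtotal`, `tax`, and `total` can be derived from the `ORDER_ITEM` rows. The sketch below shows one way to do that with `Decimal` so that money values are not subject to binary floating-point error. The flat `tax_rate` parameter and half-up rounding to cents are assumptions; the schema does not say how tax is computed.

```python
from dataclasses import dataclass
from decimal import ROUND_HALF_UP, Decimal


@dataclass(frozen=True)
class OrderItem:
    """Mirrors the ORDER_ITEM entity (id and order_id omitted for brevity)."""

    product_id: str
    quantity: int
    unit_price: Decimal


def order_totals(items: list[OrderItem], tax_rate: Decimal) -> dict[str, Decimal]:
    """Compute subtotal, tax, and total for an order, rounded to cents."""
    cent = Decimal("0.01")
    subtotal = sum((item.unit_price * item.quantity for item in items), Decimal("0"))
    subtotal = subtotal.quantize(cent, rounding=ROUND_HALF_UP)
    tax = (subtotal * tax_rate).quantize(cent, rounding=ROUND_HALF_UP)
    return {"subtotal": subtotal, "tax": tax, "total": subtotal + tax}
```

For two items at 19.99 and one at 5.00 with a hypothetical 8% rate, this gives a subtotal of 44.98, tax of 3.60, and a total of 48.58.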

Step 5: Deployment Architecture

graph TB
    subgraph "Development"
        DevBranch[Feature Branch]
        DevEnv[Dev Environment<br/>Cloudflare Preview]
    end

    subgraph "Staging"
        MainBranch[Main Branch]
        StageEnv[Staging Environment<br/>staging.greyhaven.com]
        StageDB[(Staging PostgreSQL)]
    end

    subgraph "Production"
        Release[Release Tag]
        ProdWorkers[Cloudflare Workers<br/>300+ Datacenters]
        ProdDB[(Production PostgreSQL<br/>PlanetScale)]
        ProdCache[(Redis Cache<br/>Upstash)]
    end

    DevBranch -->|git push| CI1[GitHub Actions]
    CI1 -->|Deploy| DevEnv
    
    DevBranch -->|PR Merged| MainBranch
    MainBranch -->|Deploy| CI2[GitHub Actions]
    CI2 -->|Run Tests| TestSuite
    TestSuite -->|Success| StageEnv
    StageEnv --> StageDB
    
    MainBranch -->|git tag v1.0.0| Release
    Release -->|Deploy| CI3[GitHub Actions]
    CI3 -->|Canary 10%| ProdWorkers
    CI3 -->|Monitor 10 min| Metrics
    Metrics -->|Success| FullDeploy[100% Rollout]
    FullDeploy --> ProdWorkers
    ProdWorkers --> ProdDB
    ProdWorkers --> ProdCache
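
The canary step in this pipeline needs a rule for deciding whether to continue to the 100% rollout after the monitoring window. The source does not state the rule, so the following is only a sketch under assumed thresholds: `WindowStats`, `canary_decision`, `max_ratio`, `min_requests`, and the 0.1% error-rate floor are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request and error counts collected during the monitoring window."""

    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_decision(
    canary: WindowStats,
    baseline: WindowStats,
    max_ratio: float = 2.0,
    min_requests: int = 100,
) -> str:
    """Return "wait", "promote", or "rollback" for the canary deployment."""
    if canary.requests < min_requests:
        return "wait"  # not enough traffic to judge yet
    # Allow up to max_ratio times the baseline error rate, with a small floor
    # so that a baseline with zero errors does not force a rollback.
    allowed = max(baseline.error_rate * max_ratio, 0.001)
    return "promote" if canary.error_rate <= allowed else "rollback"
```

Comparing against the baseline rather than a fixed number keeps the gate meaningful when the overall error rate drifts for reasons unrelated to the release.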

Step 6: State Machine Diagram for Order Status

stateDiagram-v2
    [*] --> Pending: Order Created
    Pending --> Processing: Payment Confirmed
    Pending --> Cancelled: Payment Failed
    
    Processing --> Shipped: Fulfillment Complete
    Processing --> Cancelled: Out of Stock
    
    Shipped --> Delivered: Tracking Confirmed
    Shipped --> Returned: Customer Return
    
    Delivered --> Returned: Return Requested
    Returned --> Refunded: Return Approved
    
    Cancelled --> [*]
    Delivered --> [*]
    Refunded --> [*]
    
    note right of Pending
        Inventory reserved
        Payment processing
    end note
    
    note right of Processing
        Items picked
        Preparing shipment
    end note
    
    note right of Shipped
        Tracking number assigned
        In transit
    end note
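
The state diagram maps directly onto a transition table that application code can enforce. A minimal sketch, with lower-case state names and the `transition` helper chosen here for illustration:

```python
# Allowed transitions, copied from the state diagram above.
TRANSITIONS: dict[str, set[str]] = {
    "pending": {"processing", "cancelled"},
    "processing": {"shipped", "cancelled"},
    "shipped": {"delivered", "returned"},
    "delivered": {"returned"},
    "returned": {"refunded"},
    "cancelled": set(),
    "refunded": set(),
}

# States with an arrow to the end marker in the diagram.
END_STATES = {"cancelled", "delivered", "refunded"}


def transition(current: str, target: str) -> str:
    """Return the new status, or raise ValueError if the move is not allowed."""
    if current not in TRANSITIONS:
        raise ValueError(f"unknown status: {current}")
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot move from {current} to {target}")
    return target
```

For example, `transition("pending", "processing")` returns `"processing"`, while `transition("delivered", "shipped")` raises `ValueError`.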

Step 7: Architecture Decision Records (ADRs)

ADR-001: Choose Cloudflare Workers for Edge Computing

# ADR-001: Use Cloudflare Workers for API Gateway and Auth

**Date**: 2024-01-15
**Status**: Accepted
**Decision Makers**: Engineering Team

## Context

We need an edge computing platform for API gateway, authentication, and caching that:
- Provides global low latency (<50ms p95)
- Scales automatically without management
- Integrates with our CDN infrastructure
- Supports multi-tenant architecture

## Decision

We will use Cloudflare Workers for edge computing with KV for session storage.

## Alternatives Considered

1. **AWS Lambda@Edge**: Good performance but vendor lock-in, higher cost
2. **Traditional Load Balancer**: Single region, no edge caching
3. **Self-hosted Edge Nodes**: Complex deployment, maintenance overhead

## Consequences

**Positive**:
- Global deployment (300+ datacenters) with <50ms latency worldwide
- Scales automatically, with no cost when idle
- Built-in DDoS protection and WAF
- KV storage for session caching (low-latency reads for frequently accessed keys)
- Tight per-request CPU limits (10ms on the Workers Free plan) force efficient code

**Negative**:
- The same CPU limits require careful optimization
- Cold starts (though <10ms typically)
- Limited to JavaScript/TypeScript, Rust (via WebAssembly), and Python (via Pyodide)
- PostgreSQL access requires a Workers-compatible driver or an HTTP-based client

## Implementation

- API Gateway: Handles routing, CORS, rate limiting
- Auth Service: JWT validation, session management (KV)
- Cache Layer: API response caching (KV + Cache API)

## Monitoring

- Worker CPU time (aim for <500μs p95)
- KV cache hit rate (aim for >95%)
- Edge response time (aim for <50ms p95)
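
The three monitoring targets in ADR-001 can be checked mechanically once samples are available. The sketch below uses the nearest-rank definition of a percentile; `percentile` and `check_targets` are hypothetical helpers, and how the samples are collected is outside its scope.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value covering pct% of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceiling of pct * n / 100
    return ordered[max(rank, 1) - 1]


def check_targets(
    cpu_us: list[float],
    edge_ms: list[float],
    kv_hits: int,
    kv_misses: int,
) -> dict[str, bool]:
    """Compare measurements with the targets listed under Monitoring."""
    lookups = kv_hits + kv_misses
    hit_rate = kv_hits / lookups if lookups else 0.0
    return {
        "cpu_p95_under_500us": percentile(cpu_us, 95) < 500,
        "kv_hit_rate_over_95pct": hit_rate > 0.95,
        "edge_p95_under_50ms": percentile(edge_ms, 95) < 50,
    }
```

A result with any `False` value indicates which target was missed, which makes the function easy to wire into an alert or a CI check.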

ADR-002: PostgreSQL with Row-Level Security for Multi-Tenancy

# ADR-002: PostgreSQL Row-Level Security (RLS) for Multi-Tenant Isolation

**Date**: 2024-01-20
**Status**: Accepted

## Context

Multi-tenant SaaS requires strict data isolation. Accidental cross-tenant data access would be a critical security breach.

## Decision

Use PostgreSQL Row-Level Security (RLS) policies to enforce tenant isolation at the database level.

## Implementation

```sql
-- Enable RLS on each tenant-scoped table (orders shown here)
ALTER TABLE orders ENABLE ROW LEVEL SECURITY;

-- Create policy that filters by session tenant_id
CREATE POLICY tenant_isolation ON orders
    FOR ALL
    USING (tenant_id = current_setting('app.tenant_id', true)::uuid);

-- Application sets tenant context per request
SET app.tenant_id = '<tenant_id_from_jwt>';
```

## Consequences

**Positive**:
- Database-level enforcement (application query bugs cannot bypass it, as long as the application role neither owns the tables nor has BYPASSRLS)
- Automatic filtering on all queries (including ORMs)
- Performance: RLS predicates can use an index on tenant_id

**Negative**:
- Requires setting the tenant context on every request, which needs care when connections are pooled
- Slightly more complex query plans

## Monitoring

- Weekly audit: Check for tables missing RLS
- Quarterly penetration test: Attempt cross-tenant access
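
The weekly audit for tables missing RLS could be automated along these lines. This is a sketch under assumptions: it reads the `relrowsecurity` flag from the `pg_class` catalog for ordinary tables in the `public` schema, and `RLS_AUDIT_SQL`, `tables_missing_rls`, and the `exempt` parameter are names introduced here. Running the query is left to whichever database client the project uses.

```python
# Lists every ordinary table in the public schema with its RLS flag.
RLS_AUDIT_SQL = """
SELECT c.relname, c.relrowsecurity
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relkind = 'r'
"""


def tables_missing_rls(
    rows: list[tuple[str, bool]],
    exempt: frozenset[str] = frozenset(),
) -> list[str]:
    """Return table names with RLS disabled, skipping explicitly exempt ones."""
    return sorted(
        name for name, rls_enabled in rows if not rls_enabled and name not in exempt
    )
```

Given rows such as `[("orders", True), ("products", False)]`, the function returns `["products"]`; a non-empty result would fail the audit.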


## Results

### Before

- No architecture documentation
- 3-4 weeks until new developer productive
- 15+ hours/week answering architecture questions
- Architecture decisions lost to time
- Difficult to identify bottlenecks

### After

- Comprehensive architecture docs with 8 Mermaid diagrams
- 5 Architecture Decision Records documenting key choices
- Documentation in Git (versioned, reviewed)
- Interactive diagrams (clickable, navigable)

### Improvements

- Onboarding time: 3-4 weeks → 4-5 days (75% reduction)
- Architecture questions: 15 hrs/week → 2 hrs/week (87% reduction)
- New developer productivity: Week 4 → Week 1
- Time to understand data flow: 2 weeks → 1 day

### Developer Feedback

- "The sequence diagrams made auth flow crystal clear"
- "ERD diagram helped me understand relationships immediately"
- "ADRs answered 'why did we choose X?' questions"

## Key Lessons

1. **Mermaid Diagrams**: Version-controlled and reviewable, so they can be updated in the same change as the code they describe
2. **Multiple Perspectives**: System, sequence, data flow, deployment diagrams all needed
3. **ADRs are Critical**: "Why" is as important as "what"
4. **Progressive Disclosure**: Overview first, then drill into details
5. **Keep Diagrams Simple**: One concept per diagram, not everything at once

## Prevention Measures

**Implemented**:
- [x] All architecture docs in Git (versioned)
- [x] Mermaid diagrams (not static images)
- [x] ADR template for all major decisions
- [x] Onboarding checklist includes reading architecture docs

**Ongoing**:
- [ ] Auto-generate diagrams from code (infrastructure as code)
- [ ] Quarterly architecture review (docs up-to-date?)
- [ ] New ADR for every major technical decision
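
As a starting point for the auto-generation item, a diagram can be rendered from a plain data structure rather than written by hand. The sketch below is an assumption about how that might look, not an existing tool: `to_mermaid` and its arguments are hypothetical, and a real version would read the nodes and edges from infrastructure definitions.

```python
def to_mermaid(
    nodes: dict[str, str],
    edges: list[tuple[str, str]],
    direction: str = "TB",
) -> str:
    """Render node labels and directed edges as a Mermaid graph definition."""
    lines = [f"graph {direction}"]
    for node_id, label in nodes.items():
        lines.append(f"    {node_id}[{label}]")
    for source, target in edges:
        # Reject edges that mention undeclared nodes to catch typos early.
        if source not in nodes or target not in nodes:
            raise KeyError(f"edge references unknown node: {source} --> {target}")
        lines.append(f"    {source} --> {target}")
    return "\n".join(lines)
```

Calling `to_mermaid({"Browser": "Web Browser", "Gateway": "API Gateway"}, [("Browser", "Gateway")])` produces a four-line definition: the `graph TB` header, the two node declarations, and the `Browser --> Gateway` edge.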

---

Related: [openapi-generation.md](openapi-generation.md) | [coverage-validation.md](coverage-validation.md) | [Return to INDEX](INDEX.md)
