gh-tachyon-beep-skillpacks-…/skills/using-system-archaeologist/documenting-system-architecture.md
2025-11-30 08:59:22 +08:00

Documenting System Architecture

Purpose

Synthesize subsystem catalogs and architecture diagrams into final, stakeholder-ready architecture reports that serve multiple audiences through clear structure, comprehensive navigation, and actionable findings.

When to Use

  • Coordinator delegates final report generation from validated artifacts
  • Have 02-subsystem-catalog.md and 03-diagrams.md as inputs
  • Task specifies writing to 04-final-report.md
  • Need to produce executive-readable architecture documentation
  • Output represents deliverable for stakeholders

Core Principle: Synthesis Over Concatenation

Good reports synthesize information into insights. Poor reports concatenate source documents.

Your goal: Create a coherent narrative with extracted patterns, concerns, and recommendations - not a copy-paste of inputs.

Document Structure

Required Sections

1. Front Matter

  • Document title
  • Version number
  • Analysis date
  • Classification (if needed)

2. Table of Contents

  • Multi-level hierarchy (H2, H3, H4)
  • Anchor links to all major sections
  • Quick navigation for readers

3. Executive Summary (2-3 paragraphs)

  • High-level system overview
  • Key architectural patterns
  • Major concerns and confidence assessment
  • Should be readable standalone by leadership

4. System Overview

  • Purpose and scope
  • Technology stack
  • System context (external dependencies)

5. Architecture Diagrams

  • Embed all diagrams from 03-diagrams.md
  • Add contextual analysis after each diagram
  • Cross-reference to subsystem catalog

6. Subsystem Catalog

  • One detailed entry per subsystem
  • Synthesize from 02-subsystem-catalog.md (don't just copy)
  • Add cross-references to diagrams and findings

7. Key Findings

  • Architectural Patterns: Identified across subsystems
  • Technical Concerns: Extracted from catalog concerns
  • Recommendations: Actionable next steps with priorities

8. Appendices

  • Methodology: How analysis was performed
  • Confidence Levels: Rationale for confidence ratings
  • Assumptions & Limitations: What you inferred, what's missing

Synthesis Strategies

Pattern Identification

Look across subsystems for recurring patterns:

From catalog observations:

  • Subsystem A: "Dependency injection for testability"
  • Subsystem B: "All external services injected"
  • Subsystem C: "Injected dependencies for testing"

Synthesize into pattern:

### Dependency Injection Pattern

**Observed in**: Authentication Service, API Gateway, User Service

**Description**: External dependencies are injected rather than directly instantiated, enabling test isolation and loose coupling.

**Benefits**:
- Testability: Mock dependencies in unit tests
- Flexibility: Swap implementations without code changes
- Loose coupling: Services depend on interfaces, not concrete implementations

**Trade-offs**:
- Initial complexity: Requires dependency wiring infrastructure
- Runtime overhead: Minimal (dependency resolution at startup)
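The grouping step above can be partially mechanized. Below is a minimal Python sketch that surfaces candidate patterns by matching catalog observations against pattern keywords; the subsystem names, notes, and keyword list are illustrative assumptions, and real synthesis still requires human judgment on whether a match is a genuine pattern.

```python
# Hypothetical sketch: group catalog observations that share a keyword
# into candidate patterns. Keywords and data are illustrative only.
from collections import defaultdict

observations = {
    "Authentication Service": "Dependency injection for testability",
    "API Gateway": "All external services injected",
    "User Service": "Injected dependencies for testing",
    "Data Service": "Repository pattern isolates persistence",
}

# Keywords that hint at a named pattern (not exhaustive)
pattern_keywords = {
    "Dependency Injection": ("inject",),
    "Repository Pattern": ("repository",),
}

candidates = defaultdict(list)
for subsystem, note in observations.items():
    for pattern, keywords in pattern_keywords.items():
        if any(k in note.lower() for k in keywords):
            candidates[pattern].append(subsystem)

for pattern, subsystems in candidates.items():
    print(f"{pattern}: observed in {', '.join(subsystems)}")
```

A reviewer then confirms each candidate and writes the benefits/trade-offs narrative by hand, as in the example above.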

Concern Extraction

Find concerns buried in catalog entries:

Catalog entries:

  • API Gateway: "Rate limiter uses in-memory storage (doesn't scale horizontally)"
  • Database Layer: "Connection pool max size hardcoded (should be configurable)"
  • Data Service: "Large analytics queries can cause database load spikes"

Synthesize into findings:

## Technical Concerns

### 1. Rate Limiter Scalability Issue

**Severity**: Medium
**Affected Subsystem**: [API Gateway](#api-gateway)

**Issue**: In-memory rate limiting prevents horizontal scaling. If multiple gateway instances run, each maintains separate counters, allowing clients to exceed intended limits by distributing requests across instances.

**Impact**:
- Cannot scale gateway horizontally without distributed rate limiting
- Potential for rate limit bypass under load balancing
- Inconsistent rate limit enforcement

**Remediation**:
1. **Immediate** (next sprint): Document limitation, add monitoring alerts
2. **Short-term** (next quarter): Migrate to Redis-backed rate limiter
3. **Validation**: Test rate limiting with multiple gateway instances

**Priority**: High (blocks horizontal scaling)
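To make sure no concern stays buried, the extraction pass can be scripted. This is a hedged sketch assuming catalog entries use `##` headings per subsystem and a `**Concerns**` block with `- ` bullets; adapt the markers to the actual catalog format.

```python
# Illustrative sketch: pull "Concerns" bullets out of catalog-style
# markdown so each one can be elevated to the Key Findings section.
import re

catalog_text = """\
## API Gateway
**Concerns**:
- Rate limiter uses in-memory storage (doesn't scale horizontally)

## Database Layer
**Concerns**:
- Connection pool max size hardcoded (should be configurable)
"""

concerns = []
subsystem, in_concerns = None, False
for line in catalog_text.splitlines():
    heading = re.match(r"##\s+(.*)", line)
    if heading:
        subsystem, in_concerns = heading.group(1), False
    elif line.startswith("**Concerns**"):
        in_concerns = True
    elif in_concerns and line.startswith("- "):
        concerns.append((subsystem, line[2:]))

for name, concern in concerns:
    print(f"[{name}] {concern}")
```

Each extracted tuple becomes the seed of a finding; severity, impact, and remediation are added during synthesis.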

Recommendation Prioritization

Prioritize recommendations using severity scoring, impact assessment, and timeline buckets:

Severity Scoring (for each concern/recommendation)

Critical:

  • Blocks deployment or core functionality
  • Security vulnerability (data exposure, injection, auth bypass)
  • Data corruption or loss risk
  • Service outage potential
  • Examples: SQL injection, hardcoded credentials, unhandled critical exceptions

High:

  • Significant maintainability impact
  • High effort to modify or extend
  • Frequent source of bugs
  • Performance degradation under load
  • Examples: God objects, extreme duplication, shotgun surgery, N+1 queries

Medium:

  • Moderate maintainability concern
  • Refactoring beneficial but not urgent
  • Technical debt accumulation
  • Examples: Long functions, missing documentation, inconsistent error handling

Low:

  • Minor quality improvement
  • Cosmetic or style issues
  • Nice-to-have enhancements
  • Examples: Magic numbers, verbose naming, minor duplication

Impact Assessment Matrix

Use 2-dimensional scoring: Severity × Frequency

| Severity | High Frequency | Medium Frequency | Low Frequency |
|----------|----------------|------------------|---------------|
| Critical | P1 - Fix immediately | P1 - Fix immediately | P2 - Fix ASAP |
| High | P2 - Fix ASAP | P2 - Fix ASAP | P3 - Plan for sprint |
| Medium | P3 - Plan for sprint | P4 - Backlog | P4 - Backlog |
| Low | P4 - Backlog | P4 - Backlog | P5 - Optional |

Frequency assessment:

  • High: Affects core user workflows, used constantly, blocking development
  • Medium: Affects some workflows, occasional impact, periodic friction
  • Low: Edge case, rarely encountered, minimal operational impact
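The severity-by-frequency matrix can be encoded directly so every concern gets a consistent priority. A minimal Python sketch (labels assumed to match the table exactly):

```python
# Direct encoding of the Severity x Frequency matrix.
# Keys are (severity, frequency) label pairs from the table.
PRIORITY_MATRIX = {
    ("Critical", "High"): "P1", ("Critical", "Medium"): "P1", ("Critical", "Low"): "P2",
    ("High", "High"): "P2",     ("High", "Medium"): "P2",     ("High", "Low"): "P3",
    ("Medium", "High"): "P3",   ("Medium", "Medium"): "P4",   ("Medium", "Low"): "P4",
    ("Low", "High"): "P4",      ("Low", "Medium"): "P4",      ("Low", "Low"): "P5",
}

def priority(severity: str, frequency: str) -> str:
    """Look up the priority bucket for a concern."""
    return PRIORITY_MATRIX[(severity, frequency)]

print(priority("Critical", "Low"))   # P2: critical but rarely hit
print(priority("Medium", "High"))    # P3: moderate but constant friction
```

Encoding the matrix as data keeps the scoring auditable: anyone can check a recommendation's priority against the table.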

Timeline Buckets

Immediate (This Week / Next Sprint):

  • P1 priorities (Critical issues regardless of frequency)
  • Security vulnerabilities
  • Blocking deployment or development
  • Quick wins (high impact, low effort)

Short-Term (1-3 Months / Next Quarter):

  • P2 priorities (High severity, or Critical severity with low frequency)
  • Significant maintainability improvements
  • Performance optimizations
  • Breaking circular dependencies

Medium-Term (3-6 Months):

  • P3 priorities (Medium severity with high frequency, or High severity with low frequency)
  • Architectural refactoring
  • Technical debt paydown
  • System-wide improvements

Long-Term (6-12+ Months):

  • P4-P5 priorities (Low severity, backlog items)
  • Nice-to-have improvements
  • Experimental optimizations
  • Deferred enhancements

Prioritized Recommendation Format

## Recommendations

### Immediate (This Week / Next Sprint) - P1

**1. Fix Rate Limiter Scalability Vulnerability**
- **Severity:** Critical (blocks horizontal scaling)
- **Frequency:** High (affects all gateway scaling attempts)
- **Priority:** P1
- **Impact:** Cannot scale API gateway, potential rate limit bypass
- **Effort:** Medium (2-3 days migration to Redis)
- **Action:**
  1. Document current limitation in ops runbook (Day 1)
  2. Add monitoring for rate limit violations (Day 1)
  3. Migrate to Redis-backed rate limiter (Days 2-3)
  4. Validate with load testing (Day 3)

**2. Remove Hardcoded Database Credentials**
- **Severity:** Critical (security vulnerability)
- **Frequency:** Low (only affects DB config rotation)
- **Priority:** P1
- **Impact:** Credentials exposed in source control, rotation requires code deployment
- **Effort:** Low (< 1 day)
- **Action:**
  1. Move credentials to environment variables
  2. Update deployment configs
  3. Rotate compromised credentials

### Short-Term (1-3 Months / Next Quarter) - P2

**3. Extract Common Validation Framework**
- **Severity:** High (high duplication, shotgun surgery for validation changes)
- **Frequency:** High (every new API endpoint)
- **Priority:** P2
- **Impact:** 3 duplicate validation implementations, 15% code duplication
- **Effort:** Medium (1 week to extract + migrate)
- **Action:**
  1. Design validation framework API (2 days)
  2. Implement core framework (2 days)
  3. Migrate existing validators (2 days)
  4. Document validation patterns (1 day)

**4. Externalize Database Pool Configuration**
- **Severity:** High (hardcoded limits cause connection exhaustion)
- **Frequency:** Medium (impacts under load spikes)
- **Priority:** P2
- **Impact:** Connection pool exhaustion during traffic spikes
- **Effort:** Low (2 days)
- **Action:**
  1. Move pool config to environment variables
  2. Add runtime pool size adjustment
  3. Document tuning guidelines

### Medium-Term (3-6 Months) - P3

**5. Break User ↔ Notification Circular Dependency**
- **Severity:** Medium (architectural coupling)
- **Frequency:** Medium (affects both subsystem modifications)
- **Priority:** P3
- **Impact:** Difficult to modify either service independently
- **Effort:** High (2-3 weeks, requires event bus introduction)
- **Action:**
  1. Design event bus architecture (1 week)
  2. Implement notification via events (1 week)
  3. Migrate user service to publish events (3 days)
  4. Remove direct dependency (2 days)

**6. Add Docstrings to Public API (27% → 90% coverage)**
- **Severity:** Medium (maintainability concern)
- **Frequency:** Medium (affects onboarding, API understanding)
- **Priority:** P3
- **Impact:** Poor API discoverability, onboarding friction
- **Effort:** Medium (2-3 weeks distributed work)
- **Action:**
  1. Establish docstring standard (1 day)
  2. Document public APIs in batches (2 weeks)
  3. Add pre-commit hook to enforce (1 day)

### Long-Term (6-12+ Months) - P4-P5

**7. Evaluate Circuit Breaker Effectiveness**
- **Severity:** Low (optimization opportunity)
- **Frequency:** Low (affects only failure scenarios)
- **Priority:** P4
- **Impact:** Potential false positives, could improve resilience
- **Effort:** Medium (1 week testing + analysis)
- **Action:** Load testing + monitoring analysis when capacity allows

**8. Extract Magic Numbers to Configuration**
- **Severity:** Low (code quality improvement)
- **Frequency:** Low (rarely needs changing)
- **Priority:** P5
- **Impact:** Minor maintainability improvement
- **Effort:** Low (2-3 days)
- **Action:** Backlog item, tackle during related refactoring

Priority Summary Table

Include summary table for quick scanning:

## Priority Summary

| Priority | Count | Severity Distribution | Total Effort |
|----------|-------|----------------------|--------------|
| **P1** (Immediate) | 2 | Critical: 2 | 4 days |
| **P2** (Short-term) | 2 | High: 2 | 2.5 weeks |
| **P3** (Medium-term) | 2 | Medium: 2 | 5-6 weeks |
| **P4-P5** (Long-term) | 2 | Low: 2 | 2 weeks |
| **Total** | 8 | - | ~10 weeks |

**Recommended sprint allocation:**
- Sprint 1: P1 items (4 days) + start P2.3 validation framework
- Sprint 2: Complete P2.3 + P2.4 database pool config
- Quarter 2: P3 items (architectural improvements)
- Backlog: P4-P5 items (opportunistic improvements)
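The summary table is most trustworthy when derived from the detailed recommendation entries rather than tallied by hand. A sketch, with illustrative data matching the eight recommendations above:

```python
# Sketch: derive the Priority Summary counts from the recommendation
# list so the table stays consistent with the detailed entries.
from collections import Counter

recommendations = [
    {"priority": "P1", "severity": "Critical"},
    {"priority": "P1", "severity": "Critical"},
    {"priority": "P2", "severity": "High"},
    {"priority": "P2", "severity": "High"},
    {"priority": "P3", "severity": "Medium"},
    {"priority": "P3", "severity": "Medium"},
    {"priority": "P4", "severity": "Low"},
    {"priority": "P5", "severity": "Low"},
]

counts = Counter(r["priority"] for r in recommendations)
print("| Priority | Count |")
print("|----------|-------|")
for p in sorted(counts):
    print(f"| {p} | {counts[p]} |")
```

Effort totals can be aggregated the same way if each record carries an effort estimate field.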

Cross-Referencing Strategy

Subsystem → Diagram:

## Authentication Service

[...subsystem details...]

**Component Architecture**: See [Authentication Service Components](#auth-service-components) diagram

**Dependencies**: [API Gateway](#api-gateway), [Database Layer](#database-layer)

Diagram → Subsystem:

### Authentication Service Components

[...diagram...]

**Description**: This component diagram shows internal structure of the Authentication Service. For additional operational details, see [Authentication Service](#authentication-service) in the subsystem catalog.

Finding → Subsystem:

### Rate Limiter Scalability Issue

**Affected Subsystem**: [API Gateway](#api-gateway)

[...concern details...]

Navigation Patterns

Table of contents with anchor links:

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [System Overview](#system-overview)
   - [Purpose and Scope](#purpose-and-scope)
   - [Technology Stack](#technology-stack)
3. [Architecture Diagrams](#architecture-diagrams)
   - [Level 1: Context](#level-1-context)
   - [Level 2: Container](#level-2-container)
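Anchor links only work if the slugs match how the renderer derives them from headings. This sketch approximates GitHub-style slugging (lowercase, punctuation stripped, spaces to hyphens); edge cases such as duplicate headings, which get numeric suffixes, are omitted.

```python
# Approximate GitHub-style heading anchors: lowercase, drop
# punctuation, convert whitespace runs to single hyphens.
import re

def anchor(heading: str) -> str:
    slug = heading.strip().lower()
    slug = re.sub(r"[^\w\s-]", "", slug)   # drop punctuation like ":" or "&"
    return re.sub(r"\s+", "-", slug)       # spaces -> hyphens

print(anchor("Executive Summary"))   # executive-summary
print(anchor("Level 1: Context"))    # level-1-context
```

Generating the TOC with a helper like this avoids dead links when section titles contain colons or other punctuation.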

Multi-Audience Considerations

Executive Audience

What they need:

  • Executive summary ONLY (should be self-contained)
  • High-level patterns and risks
  • Business impact of concerns
  • Clear recommendations with timelines

Document design:

  • Put executive summary first
  • Make it readable standalone (no forward references)
  • Focus on "why this matters" over "how it works"

Architect Audience

What they need:

  • System overview + architecture diagrams + key findings
  • Pattern analysis with trade-offs
  • Dependency relationships
  • Design decisions and rationale

Document design:

  • System overview explains context
  • Diagrams show structure at multiple levels
  • Findings synthesize patterns and concerns
  • Cross-references enable non-linear reading

Engineer Audience

What they need:

  • Subsystem catalog with technical details
  • Component diagrams showing internal structure
  • Technology stack specifics
  • File references and entry points

Document design:

  • Detailed subsystem catalog
  • Component-level diagrams
  • Technology stack section with versions/frameworks
  • Code/file references where available

Operations Audience

What they need:

  • Technical concerns with remediation
  • Dependency mapping
  • Confidence levels (what's validated vs assumed)
  • Recommendations with priorities

Document design:

  • Technical concerns section up front
  • Clear remediation steps
  • Appendix with assumptions/limitations
  • Prioritized recommendations

Optional Enhancements

Visual Aids

Subsystem Quick Reference Table:

## Appendix D: Subsystem Quick Reference

| Subsystem | Location | Confidence | Key Concerns | Dependencies |
|-----------|----------|------------|--------------|--------------|
| API Gateway | /src/gateway/ | High | Rate limiter scalability | Auth, User, Data, Logging |
| Auth Service | /src/services/auth/ | High | None | Database, Cache, Logging |
| User Service | /src/services/users/ | High | None | Database, Cache, Notification |

Pattern Summary Matrix:

## Architectural Patterns Summary

| Pattern | Subsystems Using | Benefits | Trade-offs |
|---------|------------------|----------|------------|
| Dependency Injection | Auth, Gateway, User | Testability, flexibility | Initial complexity |
| Repository Pattern | User, Data | Data access abstraction | Extra layer |
| Circuit Breaker | Gateway | Fault isolation | False positives |

Reading Guide

## How to Read This Document

**For Executives** (5 minutes):
- Read [Executive Summary](#executive-summary) only
- Optionally skim [Recommendations](#recommendations)

**For Architects** (30 minutes):
- Read [Executive Summary](#executive-summary)
- Read [System Overview](#system-overview)
- Review [Architecture Diagrams](#architecture-diagrams)
- Read [Key Findings](#key-findings)

**For Engineers** (1 hour):
- Read [System Overview](#system-overview)
- Study [Architecture Diagrams](#architecture-diagrams) (all levels)
- Read [Subsystem Catalog](#subsystem-catalog) for relevant services
- Review [Technical Concerns](#technical-concerns)

**For Operations** (45 minutes):
- Read [Executive Summary](#executive-summary)
- Study [Technical Concerns](#technical-concerns)
- Review [Recommendations](#recommendations)
- Read [Appendix C: Assumptions and Limitations](#appendix-c-assumptions-and-limitations)

Glossary

## Appendix E: Glossary

**Circuit Breaker**: Fault tolerance pattern that prevents cascading failures by temporarily blocking requests to failing services.

**Dependency Injection**: Design pattern where dependencies are provided to components rather than constructed internally, enabling testability and loose coupling.

**Repository Pattern**: Data access abstraction that separates business logic from data persistence concerns.

**Optimistic Locking**: Concurrency control technique assuming conflicts are rare, using version checks rather than locks.

Success Criteria

You succeeded when:

  • Executive summary (2-3 paragraphs) distills key information
  • Table of contents provides multi-level navigation
  • Cross-references (30+) enable non-linear reading
  • Patterns synthesized (not just listed from catalog)
  • Concerns extracted and prioritized
  • Recommendations actionable with timelines
  • Diagrams integrated with contextual analysis
  • Appendices document methodology, confidence, assumptions
  • Professional structure (document metadata, clear hierarchy)
  • Written to 04-final-report.md
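Several of these criteria are mechanically checkable before handing the report to a validator. A hedged self-check sketch; the section titles and the 30-link threshold follow this guide, so adjust them if your report template differs.

```python
# Sketch: verify a few structural success criteria on the final report.
import re

def check_report(text: str) -> dict:
    """Return pass/fail for TOC, executive summary, and cross-ref count."""
    return {
        "has_toc": "Table of Contents" in text,
        "has_exec_summary": "Executive Summary" in text,
        # Markdown anchor links like [API Gateway](#api-gateway)
        "cross_refs_ok": len(re.findall(r"\[[^\]]+\]\(#[^)]+\)", text)) >= 30,
    }

# Usage (path per the workflow's naming convention):
# results = check_report(open("04-final-report.md", encoding="utf-8").read())
```

A failing check is a prompt to revisit the section, not a guarantee of quality: synthesis depth still needs human review.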

You failed when:

  • Simple concatenation of source documents
  • No executive summary or it requires reading full document
  • Missing table of contents
  • No cross-references between sections
  • Patterns just copied from catalog (not synthesized)
  • Concerns buried without extraction
  • Recommendations vague or unprioritized
  • Diagrams pasted without context
  • Missing appendices

Best Practices from Baseline Testing

What Works

  • Comprehensive synthesis - Identify patterns, extract concerns, create narrative
  • Professional structure - Document metadata, TOC, clear hierarchy, appendices
  • Multi-level navigation - 20+ TOC entries, 40+ cross-references
  • Executive summary - Self-contained 2-3 paragraph distillation
  • Actionable findings - Concerns with severity/impact/remediation, recommendations with timelines
  • Transparency - Confidence levels, assumptions, limitations documented
  • Diagram integration - Embedded with contextual analysis and cross-refs
  • Multi-audience - Executive summary + technical depth + appendices

Synthesis Patterns

Pattern identification:

  • Look across multiple subsystems for recurring themes
  • Group by pattern name (e.g., "Repository Pattern")
  • Document which subsystems use it
  • Explain benefits and trade-offs

Concern extraction:

  • Find concerns in subsystem catalog entries
  • Elevate to Key Findings section
  • Add severity, impact, remediation
  • Prioritize by timeline (immediate/short/long)

Recommendation structure:

  • Group by timeline
  • Specific actions (not vague suggestions)
  • Validation steps
  • Priority indicators

Integration with Workflow

This skill is typically invoked as:

  1. Coordinator completes and validates subsystem catalog
  2. Coordinator completes and validates architecture diagrams
  3. Coordinator writes task specification for final report
  4. YOU read both source documents systematically
  5. YOU synthesize patterns, extract concerns, create recommendations
  6. YOU build professional report structure with navigation
  7. YOU write to 04-final-report.md
  8. Validator (optional) checks for synthesis quality, navigation, completeness

Your role: Transform analysis artifacts into stakeholder-ready documentation through synthesis, organization, and professional presentation.