570 lines
19 KiB
Markdown
570 lines
19 KiB
Markdown
|
||
# Documenting System Architecture
|
||
|
||
## Purpose
|
||
|
||
Synthesize subsystem catalogs and architecture diagrams into final, stakeholder-ready architecture reports that serve multiple audiences through clear structure, comprehensive navigation, and actionable findings.
|
||
|
||
## When to Use
|
||
|
||
- Coordinator delegates final report generation from validated artifacts
|
||
- Have `02-subsystem-catalog.md` and `03-diagrams.md` as inputs
|
||
- Task specifies writing to `04-final-report.md`
|
||
- Need to produce executive-readable architecture documentation
|
||
- Output represents deliverable for stakeholders
|
||
|
||
## Core Principle: Synthesis Over Concatenation
|
||
|
||
**Good reports synthesize information into insights. Poor reports concatenate source documents.**
|
||
|
||
Your goal: Create a coherent narrative with extracted patterns, concerns, and recommendations - not a copy-paste of inputs.
|
||
|
||
## Document Structure
|
||
|
||
### Required Sections
|
||
|
||
**1. Front Matter**
|
||
- Document title
|
||
- Version number
|
||
- Analysis date
|
||
- Classification (if needed)
|
||
|
||
**2. Table of Contents**
|
||
- Multi-level hierarchy (H2, H3, H4)
|
||
- Anchor links to all major sections
|
||
- Quick navigation for readers
|
||
|
||
**3. Executive Summary (2-3 paragraphs)**
|
||
- High-level system overview
|
||
- Key architectural patterns
|
||
- Major concerns and confidence assessment
|
||
- Should be readable standalone by leadership
|
||
|
||
**4. System Overview**
|
||
- Purpose and scope
|
||
- Technology stack
|
||
- System context (external dependencies)
|
||
|
||
**5. Architecture Diagrams**
|
||
- Embed all diagrams from `03-diagrams.md`
|
||
- Add contextual analysis after each diagram
|
||
- Cross-reference to subsystem catalog
|
||
|
||
**6. Subsystem Catalog**
|
||
- One detailed entry per subsystem
|
||
- Synthesize from `02-subsystem-catalog.md` (don't just copy)
|
||
- Add cross-references to diagrams and findings
|
||
|
||
**7. Key Findings**
|
||
- **Architectural Patterns**: Identified across subsystems
|
||
- **Technical Concerns**: Extracted from catalog concerns
|
||
- **Recommendations**: Actionable next steps with priorities
|
||
|
||
**8. Appendices**
|
||
- **Methodology**: How analysis was performed
|
||
- **Confidence Levels**: Rationale for confidence ratings
|
||
- **Assumptions & Limitations**: What you inferred, what's missing
|
||
|
||
## Synthesis Strategies
|
||
|
||
### Pattern Identification
|
||
|
||
**Look across subsystems for recurring patterns:**
|
||
|
||
From catalog observations:
|
||
- Subsystem A: "Dependency injection for testability"
|
||
- Subsystem B: "All external services injected"
|
||
- Subsystem C: "Injected dependencies for testing"
|
||
|
||
**Synthesize into pattern:**
|
||
```markdown
|
||
### Dependency Injection Pattern
|
||
|
||
**Observed in**: Authentication Service, API Gateway, User Service
|
||
|
||
**Description**: External dependencies are injected rather than directly instantiated, enabling test isolation and loose coupling.
|
||
|
||
**Benefits**:
|
||
- Testability: Mock dependencies in unit tests
|
||
- Flexibility: Swap implementations without code changes
|
||
- Loose coupling: Services depend on interfaces, not concrete implementations
|
||
|
||
**Trade-offs**:
|
||
- Initial complexity: Requires dependency wiring infrastructure
|
||
- Runtime overhead: Minimal (dependency resolution at startup)
|
||
```
|
||
|
||
### Concern Extraction
|
||
|
||
**Find concerns buried in catalog entries:**
|
||
|
||
Catalog entries:
|
||
- API Gateway: "Rate limiter uses in-memory storage (doesn't scale horizontally)"
|
||
- Database Layer: "Connection pool max size hardcoded (should be configurable)"
|
||
- Data Service: "Large analytics queries can cause database load spikes"
|
||
|
||
**Synthesize into findings:**
|
||
```markdown
|
||
## Technical Concerns
|
||
|
||
### 1. Rate Limiter Scalability Issue
|
||
|
||
**Severity**: Medium
|
||
**Affected Subsystem**: [API Gateway](#api-gateway)
|
||
|
||
**Issue**: In-memory rate limiting prevents horizontal scaling. If multiple gateway instances run, each maintains separate counters, allowing clients to exceed intended limits by distributing requests across instances.
|
||
|
||
**Impact**:
|
||
- Cannot scale gateway horizontally without distributed rate limiting
|
||
- Potential for rate limit bypass under load balancing
|
||
- Inconsistent rate limit enforcement
|
||
|
||
**Remediation**:
|
||
1. **Immediate** (next sprint): Document limitation, add monitoring alerts
|
||
2. **Short-term** (next quarter): Migrate to Redis-backed rate limiter
|
||
3. **Validation**: Test rate limiting with multiple gateway instances
|
||
|
||
**Priority**: High (blocks horizontal scaling)
|
||
```
|
||
|
||
### Recommendation Prioritization
|
||
|
||
**Priority recommendations using severity scoring + impact assessment + timeline buckets:**
|
||
|
||
#### Severity Scoring (for each concern/recommendation)
|
||
|
||
**Critical:**
|
||
- Blocks deployment or core functionality
|
||
- Security vulnerability (data exposure, injection, auth bypass)
|
||
- Data corruption or loss risk
|
||
- Service outage potential
|
||
- Examples: SQL injection, hardcoded credentials, unhandled critical exceptions
|
||
|
||
**High:**
|
||
- Significant maintainability impact
|
||
- High effort to modify or extend
|
||
- Frequent source of bugs
|
||
- Performance degradation under load
|
||
- Examples: God objects, extreme duplication, shotgun surgery, N+1 queries
|
||
|
||
**Medium:**
|
||
- Moderate maintainability concern
|
||
- Refactoring beneficial but not urgent
|
||
- Technical debt accumulation
|
||
- Examples: Long functions, missing documentation, inconsistent error handling
|
||
|
||
**Low:**
|
||
- Minor quality improvement
|
||
- Cosmetic or style issues
|
||
- Nice-to-have enhancements
|
||
- Examples: Magic numbers, verbose naming, minor duplication
|
||
|
||
#### Impact Assessment Matrix
|
||
|
||
Use 2-dimensional scoring: **Severity × Frequency**
|
||
|
||
| Severity | High Frequency | Medium Frequency | Low Frequency |
|
||
|----------|----------------|------------------|---------------|
|
||
| **Critical** | **P1** - Fix immediately | **P1** - Fix immediately | **P2** - Fix ASAP |
|
||
| **High** | **P2** - Fix ASAP | **P2** - Fix ASAP | **P3** - Plan for sprint |
|
||
| **Medium** | **P3** - Plan for sprint | **P4** - Backlog | **P4** - Backlog |
|
||
| **Low** | **P4** - Backlog | **P4** - Backlog | **P5** - Optional |
|
||
|
||
**Frequency assessment:**
|
||
- **High:** Affects core user workflows, used constantly, blocking development
|
||
- **Medium:** Affects some workflows, occasional impact, periodic friction
|
||
- **Low:** Edge case, rarely encountered, minimal operational impact
|
||
|
||
#### Timeline Buckets
|
||
|
||
**Immediate (This Week / Next Sprint):**
|
||
- P1 priorities (Critical issues regardless of frequency)
|
||
- Security vulnerabilities
|
||
- Blocking deployment or development
|
||
- Quick wins (high impact, low effort)
|
||
|
||
**Short-Term (1-3 Months / Next Quarter):**
|
||
- P2 priorities (High severity or critical+low frequency)
|
||
- Significant maintainability improvements
|
||
- Performance optimizations
|
||
- Breaking circular dependencies
|
||
|
||
**Medium-Term (3-6 Months):**
|
||
- P3 priorities (Medium severity+high frequency or high+low)
|
||
- Architectural refactoring
|
||
- Technical debt paydown
|
||
- System-wide improvements
|
||
|
||
**Long-Term (6-12+ Months):**
|
||
- P4-P5 priorities (Low severity, backlog items)
|
||
- Nice-to-have improvements
|
||
- Experimental optimizations
|
||
- Deferred enhancements
|
||
|
||
#### Prioritized Recommendation Format
|
||
|
||
```markdown
|
||
## Recommendations
|
||
|
||
### Immediate (This Week / Next Sprint) - P1
|
||
|
||
**1. Fix Rate Limiter Scalability Vulnerability**
|
||
- **Severity:** Critical (blocks horizontal scaling)
|
||
- **Frequency:** High (affects all gateway scaling attempts)
|
||
- **Priority:** P1
|
||
- **Impact:** Cannot scale API gateway, potential rate limit bypass
|
||
- **Effort:** Medium (2-3 days migration to Redis)
|
||
- **Action:**
|
||
1. Document current limitation in ops runbook (Day 1)
|
||
2. Add monitoring for rate limit violations (Day 1)
|
||
3. Migrate to Redis-backed rate limiter (Days 2-3)
|
||
4. Validate with load testing (Day 3)
|
||
|
||
**2. Remove Hardcoded Database Credentials**
|
||
- **Severity:** Critical (security vulnerability)
|
||
- **Frequency:** Low (only affects DB config rotation)
|
||
- **Priority:** P1
|
||
- **Impact:** Credentials exposed in source control, rotation requires code deployment
|
||
- **Effort:** Low (< 1 day)
|
||
- **Action:**
|
||
1. Move credentials to environment variables
|
||
2. Update deployment configs
|
||
3. Rotate compromised credentials
|
||
|
||
### Short-Term (1-3 Months / Next Quarter) - P2
|
||
|
||
**3. Extract Common Validation Framework**
|
||
- **Severity:** High (high duplication, shotgun surgery for validation changes)
|
||
- **Frequency:** High (every new API endpoint)
|
||
- **Priority:** P2
|
||
- **Impact:** 3 duplicate validation implementations, 15% code duplication
|
||
- **Effort:** Medium (1 week to extract + migrate)
|
||
- **Action:**
|
||
1. Design validation framework API (2 days)
|
||
2. Implement core framework (2 days)
|
||
3. Migrate existing validators (2 days)
|
||
4. Document validation patterns (1 day)
|
||
|
||
**4. Externalize Database Pool Configuration**
|
||
- **Severity:** High (hardcoded limits cause connection exhaustion)
|
||
- **Frequency:** Medium (impacts under load spikes)
|
||
- **Priority:** P2
|
||
- **Impact:** Connection pool exhaustion during traffic spikes
|
||
- **Effort:** Low (2 days)
|
||
- **Action:**
|
||
1. Move pool config to environment variables
|
||
2. Add runtime pool size adjustment
|
||
3. Document tuning guidelines
|
||
|
||
### Medium-Term (3-6 Months) - P3
|
||
|
||
**5. Break User ↔ Notification Circular Dependency**
|
||
- **Severity:** Medium (architectural coupling)
|
||
- **Frequency:** Medium (affects both subsystem modifications)
|
||
- **Priority:** P3
|
||
- **Impact:** Difficult to modify either service independently
|
||
- **Effort:** High (2-3 weeks, requires event bus introduction)
|
||
- **Action:**
|
||
1. Design event bus architecture (1 week)
|
||
2. Implement notification via events (1 week)
|
||
3. Migrate user service to publish events (3 days)
|
||
4. Remove direct dependency (2 days)
|
||
|
||
**6. Add Docstrings to Public API (27% → 90% coverage)**
|
||
- **Severity:** Medium (maintainability concern)
|
||
- **Frequency:** Medium (affects onboarding, API understanding)
|
||
- **Priority:** P3
|
||
- **Impact:** Poor API discoverability, onboarding friction
|
||
- **Effort:** Medium (2-3 weeks distributed work)
|
||
- **Action:**
|
||
1. Establish docstring standard (1 day)
|
||
2. Document public APIs in batches (2 weeks)
|
||
3. Add pre-commit hook to enforce (1 day)
|
||
|
||
### Long-Term (6-12+ Months) - P4-P5
|
||
|
||
**7. Evaluate Circuit Breaker Effectiveness**
|
||
- **Severity:** Low (optimization opportunity)
|
||
- **Frequency:** Low (affects only failure scenarios)
|
||
- **Priority:** P4
|
||
- **Impact:** Potential false positives, could improve resilience
|
||
- **Effort:** Medium (1 week testing + analysis)
|
||
- **Action:** Load testing + monitoring analysis when capacity allows
|
||
|
||
**8. Extract Magic Numbers to Configuration**
|
||
- **Severity:** Low (code quality improvement)
|
||
- **Frequency:** Low (rarely needs changing)
|
||
- **Priority:** P5
|
||
- **Impact:** Minor maintainability improvement
|
||
- **Effort:** Low (2-3 days)
|
||
- **Action:** Backlog item, tackle during related refactoring
|
||
```
|
||
|
||
#### Priority Summary Table
|
||
|
||
Include summary table for quick scanning:
|
||
|
||
```markdown
|
||
## Priority Summary
|
||
|
||
| Priority | Count | Severity Distribution | Total Effort |
|
||
|----------|-------|----------------------|--------------|
|
||
| **P1** (Immediate) | 2 | Critical: 2 | 4 days |
|
||
| **P2** (Short-term) | 2 | High: 2 | 2.5 weeks |
|
||
| **P3** (Medium-term) | 2 | Medium: 2 | 5-6 weeks |
|
||
| **P4-P5** (Long-term) | 2 | Low: 2 | 2 weeks |
|
||
| **Total** | 8 | - | ~10 weeks |
|
||
|
||
**Recommended sprint allocation:**
|
||
- Sprint 1: P1 items (4 days) + start P2.3 validation framework
|
||
- Sprint 2: Complete P2.3 + P2.4 database pool config
|
||
- Quarter 2: P3 items (architectural improvements)
|
||
- Backlog: P4-P5 items (opportunistic improvements)
|
||
```
|
||
|
||
## Cross-Referencing Strategy
|
||
|
||
### Bidirectional Links
|
||
|
||
**Subsystem → Diagram:**
|
||
```markdown
|
||
## Authentication Service
|
||
|
||
[...subsystem details...]
|
||
|
||
**Component Architecture**: See [Authentication Service Components](#auth-service-components) diagram
|
||
|
||
**Dependencies**: [API Gateway](#api-gateway), [Database Layer](#database-layer)
|
||
```
|
||
|
||
**Diagram → Subsystem:**
|
||
```markdown
|
||
### Authentication Service Components
|
||
|
||
[...diagram...]
|
||
|
||
**Description**: This component diagram shows internal structure of the Authentication Service. For additional operational details, see [Authentication Service](#authentication-service) in the subsystem catalog.
|
||
```
|
||
|
||
**Finding → Subsystem:**
|
||
```markdown
|
||
### Rate Limiter Scalability Issue
|
||
|
||
**Affected Subsystem**: [API Gateway](#api-gateway)
|
||
|
||
[...concern details...]
|
||
```
|
||
|
||
### Navigation Patterns
|
||
|
||
**Table of contents with anchor links:**
|
||
```markdown
|
||
## Table of Contents
|
||
|
||
1. [Executive Summary](#executive-summary)
|
||
2. [System Overview](#system-overview)
|
||
- [Purpose and Scope](#purpose-and-scope)
|
||
- [Technology Stack](#technology-stack)
|
||
3. [Architecture Diagrams](#architecture-diagrams)
|
||
- [Level 1: Context](#level-1-context)
|
||
- [Level 2: Container](#level-2-container)
|
||
```
|
||
|
||
## Multi-Audience Considerations
|
||
|
||
### Executive Audience
|
||
|
||
**What they need:**
|
||
- Executive summary ONLY (should be self-contained)
|
||
- High-level patterns and risks
|
||
- Business impact of concerns
|
||
- Clear recommendations with timelines
|
||
|
||
**Document design:**
|
||
- Put executive summary first
|
||
- Make it readable standalone (no forward references)
|
||
- Focus on "why this matters" over "how it works"
|
||
|
||
### Architect Audience
|
||
|
||
**What they need:**
|
||
- System overview + architecture diagrams + key findings
|
||
- Pattern analysis with trade-offs
|
||
- Dependency relationships
|
||
- Design decisions and rationale
|
||
|
||
**Document design:**
|
||
- System overview explains context
|
||
- Diagrams show structure at multiple levels
|
||
- Findings synthesize patterns and concerns
|
||
- Cross-references enable non-linear reading
|
||
|
||
### Engineer Audience
|
||
|
||
**What they need:**
|
||
- Subsystem catalog with technical details
|
||
- Component diagrams showing internal structure
|
||
- Technology stack specifics
|
||
- File references and entry points
|
||
|
||
**Document design:**
|
||
- Detailed subsystem catalog
|
||
- Component-level diagrams
|
||
- Technology stack section with versions/frameworks
|
||
- Code/file references where available
|
||
|
||
### Operations Audience
|
||
|
||
**What they need:**
|
||
- Technical concerns with remediation
|
||
- Dependency mapping
|
||
- Confidence levels (what's validated vs assumed)
|
||
- Recommendations with priorities
|
||
|
||
**Document design:**
|
||
- Technical concerns section up front
|
||
- Clear remediation steps
|
||
- Appendix with assumptions/limitations
|
||
- Prioritized recommendations
|
||
|
||
## Optional Enhancements
|
||
|
||
### Visual Aids
|
||
|
||
**Subsystem Quick Reference Table:**
|
||
```markdown
|
||
## Appendix D: Subsystem Quick Reference
|
||
|
||
| Subsystem | Location | Confidence | Key Concerns | Dependencies |
|
||
|-----------|----------|------------|--------------|--------------|
|
||
| API Gateway | /src/gateway/ | High | Rate limiter scalability | Auth, User, Data, Logging |
|
||
| Auth Service | /src/services/auth/ | High | None | Database, Cache, Logging |
|
||
| User Service | /src/services/users/ | High | None | Database, Cache, Notification |
|
||
```
|
||
|
||
**Pattern Summary Matrix:**
|
||
```markdown
|
||
## Architectural Patterns Summary
|
||
|
||
| Pattern | Subsystems Using | Benefits | Trade-offs |
|
||
|---------|------------------|----------|------------|
|
||
| Dependency Injection | Auth, Gateway, User | Testability, flexibility | Initial complexity |
|
||
| Repository Pattern | User, Data | Data access abstraction | Extra layer |
|
||
| Circuit Breaker | Gateway | Fault isolation | False positives |
|
||
```
|
||
|
||
### Reading Guide
|
||
|
||
```markdown
|
||
## How to Read This Document
|
||
|
||
**For Executives** (5 minutes):
|
||
- Read [Executive Summary](#executive-summary) only
|
||
- Optionally skim [Recommendations](#recommendations)
|
||
|
||
**For Architects** (30 minutes):
|
||
- Read [Executive Summary](#executive-summary)
|
||
- Read [System Overview](#system-overview)
|
||
- Review [Architecture Diagrams](#architecture-diagrams)
|
||
- Read [Key Findings](#key-findings)
|
||
|
||
**For Engineers** (1 hour):
|
||
- Read [System Overview](#system-overview)
|
||
- Study [Architecture Diagrams](#architecture-diagrams) (all levels)
|
||
- Read [Subsystem Catalog](#subsystem-catalog) for relevant services
|
||
- Review [Technical Concerns](#technical-concerns)
|
||
|
||
**For Operations** (45 minutes):
|
||
- Read [Executive Summary](#executive-summary)
|
||
- Study [Technical Concerns](#technical-concerns)
|
||
- Review [Recommendations](#recommendations)
|
||
- Read [Appendix C: Assumptions and Limitations](#appendix-c-assumptions-and-limitations)
|
||
```
|
||
|
||
### Glossary
|
||
|
||
```markdown
|
||
## Appendix E: Glossary
|
||
|
||
**Circuit Breaker**: Fault tolerance pattern that prevents cascading failures by temporarily blocking requests to failing services.
|
||
|
||
**Dependency Injection**: Design pattern where dependencies are provided to components rather than constructed internally, enabling testability and loose coupling.
|
||
|
||
**Repository Pattern**: Data access abstraction that separates business logic from data persistence concerns.
|
||
|
||
**Optimistic Locking**: Concurrency control technique assuming conflicts are rare, using version checks rather than locks.
|
||
```
|
||
|
||
## Success Criteria
|
||
|
||
**You succeeded when:**
|
||
- Executive summary (2-3 paragraphs) distills key information
|
||
- Table of contents provides multi-level navigation
|
||
- Cross-references (30+) enable non-linear reading
|
||
- Patterns synthesized (not just listed from catalog)
|
||
- Concerns extracted and prioritized
|
||
- Recommendations actionable with timelines
|
||
- Diagrams integrated with contextual analysis
|
||
- Appendices document methodology, confidence, assumptions
|
||
- Professional structure (document metadata, clear hierarchy)
|
||
- Written to 04-final-report.md
|
||
|
||
**You failed when:**
|
||
- Simple concatenation of source documents
|
||
- No executive summary or it requires reading full document
|
||
- Missing table of contents
|
||
- No cross-references between sections
|
||
- Patterns just copied from catalog (not synthesized)
|
||
- Concerns buried without extraction
|
||
- Recommendations vague or unprioritized
|
||
- Diagrams pasted without context
|
||
- Missing appendices
|
||
|
||
## Best Practices from Baseline Testing
|
||
|
||
### What Works
|
||
|
||
✅ **Comprehensive synthesis** - Identify patterns, extract concerns, create narrative
|
||
✅ **Professional structure** - Document metadata, TOC, clear hierarchy, appendices
|
||
✅ **Multi-level navigation** - 20+ TOC entries, 40+ cross-references
|
||
✅ **Executive summary** - Self-contained 2-3 paragraph distillation
|
||
✅ **Actionable findings** - Concerns with severity/impact/remediation, recommendations with timelines
|
||
✅ **Transparency** - Confidence levels, assumptions, limitations documented
|
||
✅ **Diagram integration** - Embedded with contextual analysis and cross-refs
|
||
✅ **Multi-audience** - Executive summary + technical depth + appendices
|
||
|
||
### Synthesis Patterns
|
||
|
||
**Pattern identification:**
|
||
- Look across multiple subsystems for recurring themes
|
||
- Group by pattern name (e.g., "Repository Pattern")
|
||
- Document which subsystems use it
|
||
- Explain benefits and trade-offs
|
||
|
||
**Concern extraction:**
|
||
- Find concerns in subsystem catalog entries
|
||
- Elevate to Key Findings section
|
||
- Add severity, impact, remediation
|
||
- Prioritize by timeline (immediate/short/long)
|
||
|
||
**Recommendation structure:**
|
||
- Group by timeline
|
||
- Specific actions (not vague suggestions)
|
||
- Validation steps
|
||
- Priority indicators
|
||
|
||
## Integration with Workflow
|
||
|
||
This skill is typically invoked as:
|
||
|
||
1. **Coordinator** completes and validates subsystem catalog
|
||
2. **Coordinator** completes and validates architecture diagrams
|
||
3. **Coordinator** writes task specification for final report
|
||
4. **YOU** read both source documents systematically
|
||
5. **YOU** synthesize patterns, extract concerns, create recommendations
|
||
6. **YOU** build professional report structure with navigation
|
||
7. **YOU** write to 04-final-report.md
|
||
8. **Validator** (optional) checks for synthesis quality, navigation, completeness
|
||
|
||
**Your role:** Transform analysis artifacts into stakeholder-ready documentation through synthesis, organization, and professional presentation.
|