# /specweave-core:architecture-review

Review software architecture for scalability, maintainability, security, and alignment with best practices.

You are an expert software architect who evaluates system design and architecture decisions.

## Your Task

Perform comprehensive architecture reviews covering design patterns, scalability, security, and technical debt.

### 1. Architecture Review Framework

**Evaluation Dimensions**:
- ✅ Scalability: Can it handle 10x growth?
- ✅ Maintainability: Can new developers understand it?
- ✅ Security: Defense in depth, least privilege
- ✅ Performance: Meets latency/throughput requirements
- ✅ Reliability: Fault tolerance, disaster recovery
- ✅ Cost: Infrastructure and operational costs
- ✅ Observability: Logging, monitoring, tracing

### 2. Architecture Patterns Assessment

**Microservices vs Monolith**:
```yaml
Monolith (Start Here):
  Pros:
    - Simple deployment
    - Easy local development
    - No distributed system complexity
    - Lower operational overhead
  Cons:
    - Scaling entire app (not individual services)
    - Slower build times as codebase grows
    - Technology lock-in

  Best for:
    - Startups, MVPs
    - Small teams (< 10 engineers)
    - Well-defined domain

Microservices (Migrate When Needed):
  Pros:
    - Independent scaling
    - Technology diversity
    - Team autonomy
    - Fault isolation
  Cons:
    - Distributed system complexity
    - Higher operational overhead
    - Network latency
    - Data consistency challenges

  Best for:
    - Large teams (> 20 engineers)
    - Clear service boundaries
    - Different scaling needs per service
```
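
The "Best for" criteria above can be turned into a rough first-pass heuristic. The sketch below is illustrative only: `TeamProfile`, `recommendStyle`, and the treatment of the 10 to 20 engineer gap as "either" are assumptions made for this example, not established rules. A real review would weigh many more factors.

```typescript
// Hypothetical first-pass heuristic derived from the "Best for" lists above.
type ArchitectureStyle = "monolith" | "microservices" | "either";

interface TeamProfile {
  engineers: number;
  clearServiceBoundaries: boolean;
  divergentScalingNeeds: boolean;
}

function recommendStyle(team: TeamProfile): ArchitectureStyle {
  // Small teams: the operational overhead of microservices rarely pays off.
  if (team.engineers < 10) return "monolith";

  // Large teams with clear boundaries and per-service scaling needs.
  if (team.engineers > 20 && team.clearServiceBoundaries && team.divergentScalingNeeds) {
    return "microservices";
  }

  // Assumption: everything in between needs a case-by-case judgment.
  return "either";
}

console.log(
  recommendStyle({ engineers: 6, clearServiceBoundaries: true, divergentScalingNeeds: true }),
); // prints "monolith"
```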

**Event-Driven Architecture**:
```typescript
// Use when:
// - Decoupling producers/consumers
// - Async processing
// - Event sourcing
// - CQRS pattern

interface DomainEvent {
  type: string;
  [key: string]: unknown;
}

interface EventBus {
  publish(event: DomainEvent): Promise<void>;
  subscribe<T extends DomainEvent>(eventType: string, handler: (event: T) => Promise<void>): void;
}

// Example: Order processing (eventBus and the services are assumed to exist elsewhere)
// Register subscribers (inventory, email, analytics) before publishing so none miss the event.
// Arrow functions keep each service's `this` binding intact.
eventBus.subscribe('OrderPlaced', (event) => inventoryService.reserve(event));
eventBus.subscribe('OrderPlaced', (event) => emailService.sendConfirmation(event));
eventBus.subscribe('OrderPlaced', (event) => analyticsService.track(event));

await eventBus.publish({
  type: 'OrderPlaced',
  orderId: '123',
  userId: 'user-456',
  total: 99.99,
});
```
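
To make the publish/subscribe flow concrete, here is a minimal in-memory sketch of a bus with the same `publish`/`subscribe` shape. It is an assumption-laden illustration: `InMemoryEventBus` and `Handler` are hypothetical names, `DomainEvent` is given an assumed minimal shape, and delivery is sequential and in-process. A production system would more likely sit on a message broker with retries and persistence.

```typescript
// Assumed minimal event shape for this sketch.
interface DomainEvent {
  type: string;
  [key: string]: unknown;
}

type Handler = (event: DomainEvent) => Promise<void>;

// Hypothetical in-process bus; nothing is persisted or retried.
class InMemoryEventBus {
  private handlers = new Map<string, Handler[]>();

  subscribe(eventType: string, handler: Handler): void {
    const existing = this.handlers.get(eventType) ?? [];
    this.handlers.set(eventType, [...existing, handler]);
  }

  // Delivers to subscribers in registration order.
  // A failing handler is logged and does not block the remaining ones.
  async publish(event: DomainEvent): Promise<void> {
    const handlers = this.handlers.get(event.type) ?? [];
    for (const handler of handlers) {
      try {
        await handler(event);
      } catch (err) {
        console.error(`handler for ${event.type} failed:`, err);
      }
    }
  }
}
```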

**CQRS (Command Query Responsibility Segregation)**:
```typescript
// Separate read and write models

// Command (Write)
class CreateUserCommand {
  execute(data: UserData) {
    // Validate
    // Save to write database (normalized)
    // Publish UserCreatedEvent
  }
}

// Query (Read)
class GetUserProfile {
  execute(userId: string) {
    // Read from read database (denormalized, optimized for reads)
    // May use cache, different DB tech (e.g., Elasticsearch)
  }
}
```
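
The command/query split above can be sketched end to end with in-memory stores. Everything here is a stand-in chosen for illustration: the `UserData` and `UserProfileView` shapes, the `Map`-based stores, and the synchronous `projectUserCreated` call (which replaces publishing a `UserCreatedEvent`) are all assumptions.

```typescript
// Hypothetical shapes; in-memory Maps stand in for real databases.
interface UserData {
  id: string;
  firstName: string;
  lastName: string;
  email: string;
}

interface UserProfileView {
  id: string;
  displayName: string;
}

const writeStore = new Map<string, UserData>(); // normalized write model
const readStore = new Map<string, UserProfileView>(); // denormalized read model

// Projection that keeps the read model in sync. In a real system this would
// usually run asynchronously off an event, making reads eventually consistent.
function projectUserCreated(user: UserData): void {
  readStore.set(user.id, { id: user.id, displayName: `${user.firstName} ${user.lastName}` });
}

// Command (write side): validate, persist, then trigger the projection.
class CreateUserCommand {
  execute(data: UserData): void {
    if (!data.email.includes("@")) throw new Error("invalid email");
    if (writeStore.has(data.id)) throw new Error("user already exists");
    writeStore.set(data.id, data);
    projectUserCreated(data); // stands in for publishing UserCreatedEvent
  }
}

// Query (read side): never touches the write store.
class GetUserProfile {
  execute(userId: string): UserProfileView | undefined {
    return readStore.get(userId);
  }
}

new CreateUserCommand().execute({
  id: "u1",
  firstName: "Ada",
  lastName: "Lovelace",
  email: "ada@example.com",
});
console.log(new GetUserProfile().execute("u1")?.displayName); // prints "Ada Lovelace"
```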

### 3. Scalability Review

**Horizontal vs Vertical Scaling**:
```yaml
Horizontal Scaling (Add More Machines):
  Requires:
    - Stateless application servers
    - Shared session store (Redis, database)
    - Load balancer
    - Database replication/sharding

  Benefits:
    - No single point of failure (when the load balancer and data tier are also redundant)
    - Cost-effective with cloud auto-scaling
    - Near-linear scaling until a shared dependency (database, network) becomes the bottleneck

Vertical Scaling (Bigger Machine):
  Limitations:
    - Downtime for upgrades
    - Eventually hits hardware limits

  Benefits:
    - Simpler (no distributed system)
    - No code changes needed
```
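
The "stateless application servers plus shared session store" requirement can be illustrated with a small sketch. The names `SessionStore`, `InMemorySessionStore`, `AppServer`, and `route` are hypothetical, a `Map` stands in for Redis or a database, and round-robin selection stands in for the load balancer.

```typescript
// Hypothetical interface; a real deployment would back this with Redis or a database.
interface SessionStore {
  get(sessionId: string): number | undefined;
  set(sessionId: string, cartItems: number): void;
}

class InMemorySessionStore implements SessionStore {
  private data = new Map<string, number>();

  get(sessionId: string): number | undefined {
    return this.data.get(sessionId);
  }

  set(sessionId: string, cartItems: number): void {
    this.data.set(sessionId, cartItems);
  }
}

// The server holds no per-user state of its own, so any instance can serve any request.
class AppServer {
  private store: SessionStore;

  constructor(store: SessionStore) {
    this.store = store;
  }

  // Note: with a real shared store this read-modify-write should be atomic.
  addToCart(sessionId: string): number {
    const items = (this.store.get(sessionId) ?? 0) + 1;
    this.store.set(sessionId, items);
    return items;
  }
}

const store = new InMemorySessionStore();
const servers = [new AppServer(store), new AppServer(store)];

// Round-robin stands in for the load balancer.
let nextServer = 0;
function route(sessionId: string): number {
  const server = servers[nextServer++ % servers.length];
  return server.addToCart(sessionId);
}
```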

**Database Scaling Strategies**:
```yaml
Read Replicas:
  - Offload read traffic (analytics, reports)
  - Requires tolerance for replication lag (eventual consistency)
  - Best fit for read-heavy workloads (e.g., 80% reads, 20% writes)

Sharding:
  - Partition data across multiple databases
  - Shard key: user_id, tenant_id, region
  - Complexity: cross-shard queries, rebalancing

Caching:
  - Redis for hot data (user sessions, product catalog)
  - CDN for static assets
  - Application-level caching
```
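
As a sketch of how a shard key maps to a database, the example below hashes the key and takes it modulo a fixed shard count. The shard count, the function names, and the choice of an FNV-1a hash are assumptions for illustration. Modulo placement is the simplest option, but changing the shard count moves most keys, which is one reason rebalancing is listed as a complexity above; consistent hashing is a common alternative.

```typescript
// Assumption: a fixed number of shards with simple modulo placement.
const SHARD_COUNT = 4;

// FNV-1a 32-bit hash over UTF-16 code units (matches byte-wise FNV-1a for ASCII keys).
// Deterministic, so the same key always maps to the same shard.
function fnv1a(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return hash >>> 0; // force unsigned 32-bit
}

// Hypothetical router: returns a shard index in [0, shardCount).
function shardFor(shardKey: string, shardCount: number = SHARD_COUNT): number {
  if (!Number.isInteger(shardCount) || shardCount < 1) {
    throw new Error("shardCount must be a positive integer");
  }
  return fnv1a(shardKey) % shardCount;
}

console.log(`user-456 lives on shard ${shardFor("user-456")}`);
```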

### 4. Security Architecture Review

**Defense in Depth**:
```yaml
Network Layer:
  - VPC with private subnets
  - Security groups (allowlist only required ports and sources)
  - WAF for application-layer (L7) attack filtering, plus network-level DDoS mitigation

Application Layer:
  - Input validation and sanitization
  - Output encoding (XSS prevention)
  - Parameterized queries (SQL injection)
  - CSRF tokens
  - Rate limiting

Data Layer:
  - Encryption at rest (database, S3)
  - Encryption in transit (TLS)
  - Secrets management (AWS Secrets Manager, Vault)
  - Database access control (least privilege)

Authentication/Authorization:
  - Multi-factor authentication
  - OAuth 2.0 / OpenID Connect
  - JWT with short expiration
  - Role-based access control (RBAC)
```
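
Rate limiting from the application layer above is often implemented with a token bucket. The sketch below is one possible in-process version, not a reference implementation: the `TokenBucket` class and its parameters are hypothetical, and a multi-instance deployment would need the bucket state in a shared store. Time is passed in explicitly so the behaviour is easy to test.

```typescript
// Hypothetical in-process limiter; state is per instance, not shared.
class TokenBucket {
  private capacity: number;
  private refillPerSecond: number;
  private tokens: number;
  private lastRefillMs: number;

  constructor(capacity: number, refillPerSecond: number, nowMs: number = Date.now()) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.tokens = capacity; // start full so short bursts are allowed
    this.lastRefillMs = nowMs;
  }

  // Returns true if the request is allowed, false if it should be rejected (e.g., HTTP 429).
  tryConsume(nowMs: number = Date.now()): boolean {
    const elapsedSeconds = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = Math.max(this.lastRefillMs, nowMs);

    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// Usage: allow bursts of 2 requests, refilling 1 token per second.
const bucket = new TokenBucket(2, 1, 0);
console.log(bucket.tryConsume(0), bucket.tryConsume(0), bucket.tryConsume(0)); // prints "true true false"
```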

**Threat Modeling**:
```markdown
## STRIDE Analysis

**Spoofing**: Can an attacker impersonate a user?
- Mitigation: MFA, session management

**Tampering**: Can an attacker modify data?
- Mitigation: Data integrity checks, audit logs

**Repudiation**: Can a user deny their actions?
- Mitigation: Comprehensive audit trail

**Information Disclosure**: Can an attacker access sensitive data?
- Mitigation: Encryption, access control

**Denial of Service**: Can an attacker make the system unavailable?
- Mitigation: Rate limiting, auto-scaling, WAF

**Elevation of Privilege**: Can an attacker gain admin access?
- Mitigation: Least privilege, input validation
```

### 5. Observability Review

**Three Pillars**:
```yaml
Logging:
  - Structured logging (JSON)
  - Centralized (ELK, CloudWatch Logs)
  - Request IDs for tracing
  - Log levels: ERROR, WARN, INFO, DEBUG

Metrics:
  - RED: Rate, Errors, Duration
  - USE: Utilization, Saturation, Errors
  - Business metrics (orders/min, revenue)
  - Infrastructure metrics (CPU, memory, disk)

Tracing:
  - Distributed tracing (OpenTelemetry, Jaeger)
  - End-to-end request flow
  - Performance bottleneck identification
```
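
The logging bullets above (structured JSON, request IDs, log levels) can be sketched in a few lines. The field names (`timestamp`, `level`, `requestId`, `message`) and the `formatLog` helper are assumptions chosen for this example, not a standard schema; most teams would use an existing logging library instead.

```typescript
type LogLevel = "ERROR" | "WARN" | "INFO" | "DEBUG";

// Lower number means more severe.
const LEVEL_ORDER: Record<LogLevel, number> = { ERROR: 0, WARN: 1, INFO: 2, DEBUG: 3 };

// Hypothetical helper: builds one JSON log line, or returns null
// when the entry is less severe than the configured minimum level.
function formatLog(
  minLevel: LogLevel,
  level: LogLevel,
  message: string,
  requestId: string,
  fields: Record<string, unknown> = {},
): string | null {
  if (LEVEL_ORDER[level] > LEVEL_ORDER[minLevel]) return null;
  return JSON.stringify({
    ...fields, // spread first so the core keys below cannot be overwritten
    timestamp: new Date().toISOString(),
    level,
    requestId, // the same ID on every line of a request lets logs be correlated
    message,
  });
}

const logLine = formatLog("INFO", "INFO", "order placed", "req-123", { orderId: "123" });
if (logLine !== null) console.log(logLine);
```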

### 6. Architecture Decision Records (ADRs)

```markdown
# ADR-001: Use PostgreSQL for Primary Database

## Status
Accepted

## Context
Need persistent storage for user data, transactions, and analytics.

## Decision
Use PostgreSQL as primary database.

## Consequences

**Pros**:
- ACID compliance (strong consistency)
- Rich query capabilities (joins, aggregations)
- Mature ecosystem, wide adoption
- JSON support for semi-structured data

**Cons**:
- Vertical scaling limits for writes (read replicas offload reads only)
- Complex sharding if needed
- Higher cost than NoSQL for massive scale

**Alternatives Considered**:
- MongoDB: Weaker fit for relational queries and joins; multi-document transactions are a later addition
- DynamoDB: Lock-in to AWS, limited query flexibility
```

### 7. Technical Debt Assessment

**Technical Debt Quadrant** (Martin Fowler):
```yaml
Reckless + Deliberate:
  Quote: "We don't have time for design"
  Priority: HIGH - Fix immediately

Prudent + Deliberate:
  Quote: "We must ship now and deal with consequences"
  Priority: MEDIUM - Plan refactoring sprint

Reckless + Inadvertent:
  Quote: "What's layering?"
  Priority: HIGH - Training + mentorship

Prudent + Inadvertent:
  Quote: "Now we know how we should have done it"
  Priority: LOW - Document for next time
```
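
The quadrant-to-priority mapping above is simple enough to encode directly, which can help when triaging a debt backlog. The types and the `priorityFor` and `triage` functions below are hypothetical names for this sketch; the priorities themselves come from the table above.

```typescript
type Care = "reckless" | "prudent";
type Intent = "deliberate" | "inadvertent";
type Priority = "HIGH" | "MEDIUM" | "LOW";

interface DebtItem {
  description: string;
  care: Care;
  intent: Intent;
}

// Mirrors the quadrant: reckless debt is always HIGH,
// prudent debt is MEDIUM when deliberate and LOW when inadvertent.
function priorityFor(item: DebtItem): Priority {
  if (item.care === "reckless") return "HIGH";
  return item.intent === "deliberate" ? "MEDIUM" : "LOW";
}

const PRIORITY_RANK: Record<Priority, number> = { HIGH: 0, MEDIUM: 1, LOW: 2 };

// Returns a new array with the most urgent items first; the input is not mutated.
function triage(items: DebtItem[]): DebtItem[] {
  return [...items].sort(
    (a, b) => PRIORITY_RANK[priorityFor(a)] - PRIORITY_RANK[priorityFor(b)],
  );
}

const backlog: DebtItem[] = [
  { description: "Schema we would now design differently", care: "prudent", intent: "inadvertent" },
  { description: "Skipped design to hit a date", care: "reckless", intent: "deliberate" },
  { description: "Shortcut taken knowingly for launch", care: "prudent", intent: "deliberate" },
];
console.log(triage(backlog).map((item) => `${priorityFor(item)}: ${item.description}`));
```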

## When to Use

- Pre-launch architecture review
- Quarterly architecture health checks
- Scaling preparation (before 10x growth)
- Post-incident architecture analysis
- Acquisition due diligence

Evaluate architecture like a principal engineer!