gh-tachyon-beep-skillpacks-…/skills/using-web-backend/microservices-architecture.md
2025-11-30 08:59:27 +08:00


# Microservices Architecture
## Overview
**Microservices architecture specialist covering service boundaries, communication patterns, data consistency, and operational concerns.**
**Core principle**: Microservices decompose applications into independently deployable services organized around business capabilities - enabling team autonomy and technology diversity at the cost of operational complexity and distributed system challenges.
## When to Use This Skill
Use when encountering:
- **Service boundaries**: Defining service scope, applying domain-driven design
- **Monolith decomposition**: Strategies for splitting existing systems
- **Data consistency**: Sagas, event sourcing, eventual consistency patterns
- **Communication**: Sync (REST/gRPC) vs async (events/messages)
- **API gateways**: Routing, authentication, rate limiting
- **Service discovery**: Registry patterns, DNS, configuration
- **Resilience**: Circuit breakers, retries, timeouts, bulkheads
- **Observability**: Distributed tracing, logging aggregation, metrics
- **Deployment**: Containers, orchestration, blue-green deployments
**Do NOT use for**:
- Monolithic architectures (microservices aren't always better)
- Single-team projects < 5 services (overhead exceeds benefits)
- Simple CRUD applications (microservices add unnecessary complexity)
## When NOT to Use Microservices
**Stay monolithic if**:
- Team < 10 engineers
- Domain is not well understood yet
- Strong consistency required everywhere
- Network latency is critical
- You can't invest in observability/DevOps infrastructure
**Microservices require**: Mature DevOps, monitoring, distributed systems expertise, organizational support.
## Service Boundary Patterns (Domain-Driven Design)
### 1. Bounded Contexts
**Pattern: One microservice = One bounded context**
```
❌ Too fine-grained (anemic services):
- UserService (just CRUD)
- OrderService (just CRUD)
- PaymentService (just CRUD)
✅ Business capability alignment:
- CustomerManagementService (user profiles, preferences, history)
- OrderFulfillmentService (order lifecycle, inventory, shipping)
- PaymentProcessingService (payment, billing, invoicing, refunds)
```
**Identifying boundaries**:
1. **Ubiquitous language** - The same term meaning different things across teams (or different terms for the same concept) signals different contexts
2. **Change patterns** - Services that change together should stay together
3. **Team ownership** - One team should own one service
4. **Data autonomy** - Each service owns its data, no shared databases
### 2. Strategic DDD Patterns
| Pattern | Use When | Example |
|---------|----------|---------|
| **Separate Ways** | Contexts are independent | Analytics service, main app service |
| **Partnership** | Teams must collaborate closely | Order + Inventory services |
| **Customer-Supplier** | Upstream/downstream relationship | Payment gateway (upstream) → Order service |
| **Conformist** | Accept upstream model as-is | Third-party API integration |
| **Anti-Corruption Layer** | Isolate from legacy/external systems | ACL between new microservices and legacy monolith |
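As an illustration of the Anti-Corruption Layer row, here is a minimal sketch of a translator that keeps a legacy record format out of the new service's domain model. The legacy field names (`CUST_NO`, `FNAME`, `LNAME`, `STAT_CD`), the status codes, and the `Customer` model are all hypothetical and exist only for this example.

```python
from dataclasses import dataclass

# Hypothetical legacy status codes; assumed for illustration only.
LEGACY_STATUS_MAP = {"A": "active", "S": "suspended", "C": "closed"}


@dataclass(frozen=True)
class Customer:
    """Domain model used inside the new service."""
    customer_id: str
    full_name: str
    status: str


class LegacyCustomerTranslator:
    """Anti-corruption layer: the only code that knows the legacy format."""

    def to_domain(self, legacy: dict) -> Customer:
        return Customer(
            customer_id=str(legacy["CUST_NO"]),
            full_name=f'{legacy["FNAME"].strip()} {legacy["LNAME"].strip()}',
            # Unknown legacy codes map to an explicit "unknown" status
            status=LEGACY_STATUS_MAP.get(legacy["STAT_CD"], "unknown"),
        )


customer = LegacyCustomerTranslator().to_domain(
    {"CUST_NO": 1042, "FNAME": "Ada ", "LNAME": "Lovelace", "STAT_CD": "A"}
)
```

If the legacy format changes, only the translator needs to change; the rest of the service keeps working against `Customer`.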
### 3. Service Sizing Guidelines
**Too small (Nanoservices)**:
- Excessive network calls
- Distributed monolith
- Coordination overhead exceeds benefits
**Too large (Minimonoliths)**:
- Multiple teams modifying same service
- Mixed deployment frequencies
- Tight coupling re-emerges
**Right size indicators**:
- Single team can own it
- Deployable independently
- Changes don't ripple to other services
- Clear business capability
- 100-10,000 LOC (highly variable)
## Communication Patterns
### Synchronous Communication
**REST APIs**:
```python
# Order service calling Payment service
async def create_order(order: Order):
    # Synchronous REST call: the order waits for the payment result
    payment = await payment_service.charge(
        amount=order.total,
        customer_id=order.customer_id
    )
    if payment.status == "success":
        order.status = "confirmed"
        await db.save(order)
        return order
    else:
        raise PaymentFailedException()
```
**Pros**: Simple, request-response, easy to debug
**Cons**: Tight coupling, availability dependency, latency cascades
**gRPC**:
```python
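# Proto definition (order.proto), shown as comments so the block stays valid Python:
#   service OrderService {
#     rpc CreateOrder (OrderRequest) returns (OrderResponse);
#   }

# Implementation
class OrderServicer(order_pb2_grpc.OrderServiceServicer):
    async def CreateOrder(self, request, context):
        # Type-safe, efficient binary protocol
        payment = await payment_stub.Charge(
            PaymentRequest(amount=request.total)
        )
        # save_order is a hypothetical helper that persists the order and returns it
        order = await save_order(request, payment)
        return OrderResponse(order_id=order.id)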
```
**Pros**: Type-safe, efficient, streaming support
**Cons**: HTTP/2 required, less human-readable, proto dependencies
### Asynchronous Communication
**Event-Driven (Pub/Sub)**:
```python
# Order service publishes event
await event_bus.publish("order.created", {
    "order_id": order.id,
    "customer_id": order.customer_id,
    "total": order.total
})

# Inventory service subscribes
@event_bus.subscribe("order.created")
async def reserve_inventory(event):
    await inventory.reserve(event["order_id"])
    await event_bus.publish("inventory.reserved", {...})

# Notification service subscribes
@event_bus.subscribe("order.created")
async def send_confirmation(event):
    await email.send_order_confirmation(event)
```
**Pros**: Loose coupling, services independent, scalable
**Cons**: Eventual consistency, harder to trace, ordering challenges
**Message Queues (Point-to-Point)**:
```python
# Producer
await queue.send("payment-processing", {
    "order_id": order.id,
    "amount": order.total
})

# Consumer
@queue.consumer("payment-processing")
async def process_payment(message):
    result = await payment_gateway.charge(message["amount"])
    if result.success:
        await message.ack()
    else:
        # Requeue so the message is retried
        await message.nack(requeue=True)
```
**Pros**: Reliable delivery (typically at-least-once, via acknowledgements), work distribution, retry handling
**Cons**: Queue becomes bottleneck, requires message broker
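Because a consumer that requeues on failure can receive the same message more than once, handlers usually need to be idempotent. The following is a minimal sketch of one way to do that, assuming each message carries a unique `message_id` field (a hypothetical field, not part of the example above) and using an in-memory set where a real service would need a durable store.

```python
class IdempotentConsumer:
    """Wraps a handler so that redelivered messages are processed only once."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # in-memory for the sketch; a real service needs durable storage

    def handle(self, message: dict) -> bool:
        message_id = message["message_id"]  # hypothetical unique ID per message
        if message_id in self._seen:
            return False  # duplicate delivery: skip the side effect
        self._handler(message)
        # Record the ID only after the handler succeeded, so failures can be retried
        self._seen.add(message_id)
        return True


charged = []
consumer = IdempotentConsumer(lambda m: charged.append(m["amount"]))
first = consumer.handle({"message_id": "m-1", "amount": 50})
duplicate = consumer.handle({"message_id": "m-1", "amount": 50})
```

Here the second delivery of `m-1` is ignored, so the customer is charged once even though the broker delivered the message twice.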
### Communication Pattern Decision Matrix
| Scenario | Pattern | Why |
|----------|---------|-----|
| User-facing request/response | Sync (REST/gRPC) | Low latency, immediate feedback |
| Background processing | Async (queue) | Don't block user, retry support |
| Cross-service notifications | Async (pub/sub) | Loose coupling, multiple consumers |
| Real-time updates | WebSocket/SSE | Streaming push (WebSocket is bidirectional, SSE is server-to-client only) |
| Data replication | Event sourcing | Audit trail, rebuild state |
| High throughput | Async (messaging) | Buffer spikes, backpressure |
## Data Consistency Patterns
### 1. Saga Pattern (Distributed Transactions)
**Choreography (Event-Driven)**:
```python
# Order Service
async def create_order(order):
    order.status = "pending"
    await db.save(order)
    await events.publish("order.created", order)

# Payment Service
@events.subscribe("order.created")
async def handle_order(event):
    try:
        await charge_customer(event["total"])
        await events.publish("payment.completed", event)
    except PaymentError:
        await events.publish("payment.failed", event)

# Inventory Service
@events.subscribe("payment.completed")
async def reserve_items(event):
    try:
        await reserve(event["items"])
        await events.publish("inventory.reserved", event)
    except InventoryError:
        await events.publish("inventory.failed", event)

# Order Service (Compensation)
@events.subscribe("payment.failed")
async def cancel_order(event):
    order = await db.get(event["order_id"])
    order.status = "cancelled"
    await db.save(order)

@events.subscribe("inventory.failed")
async def refund_payment(event):
    await payment.refund(event["order_id"])
    await cancel_order(event)
```
**Orchestration (Coordinator)**:
```python
class OrderSaga:
    def __init__(self, order):
        self.order = order
        self.completed_steps = []

    async def execute(self):
        try:
            # Step 1: Reserve inventory
            await self.reserve_inventory()
            self.completed_steps.append("inventory")
            # Step 2: Process payment
            await self.process_payment()
            self.completed_steps.append("payment")
            # Step 3: Confirm order
            await self.confirm_order()
        except Exception:
            # Compensate in reverse order
            await self.compensate()
            raise

    async def compensate(self):
        for step in reversed(self.completed_steps):
            if step == "inventory":
                await inventory_service.release(self.order.id)
            elif step == "payment":
                await payment_service.refund(self.order.id)
```
**Choreography vs Orchestration**:
| Aspect | Choreography | Orchestration |
|--------|--------------|---------------|
| Coordination | Decentralized (events) | Centralized (orchestrator) |
| Coupling | Loose | Tight to orchestrator |
| Complexity | Distributed across services | Concentrated in orchestrator |
| Tracing | Harder (follow events) | Easier (single coordinator) |
| Failure handling | Implicit (event handlers) | Explicit (orchestrator logic) |
| Best for | Simple workflows | Complex workflows |
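To make the orchestration column concrete, here is a small self-contained sketch of a generic saga runner with stand-in steps. `SagaStep`, `run_saga`, `succeed`, and `decline` are hypothetical names for this example; real steps would call other services.

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class SagaStep:
    name: str
    action: Callable[[], Awaitable[None]]
    compensation: Callable[[], Awaitable[None]]


async def run_saga(steps, log):
    """Run steps in order; on failure, compensate completed steps in reverse."""
    completed = []
    for step in steps:
        try:
            await step.action()
        except Exception:
            for done in reversed(completed):
                await done.compensation()
                log.append(f"undo:{done.name}")
            return False
        completed.append(step)
        log.append(f"done:{step.name}")
    return True


async def succeed():
    return None


async def decline():
    raise RuntimeError("payment declined")


log = []
ok = asyncio.run(run_saga(
    [SagaStep("inventory", succeed, succeed), SagaStep("payment", decline, succeed)],
    log,
))
```

In this run the inventory step completes, the payment step fails, and the runner releases the inventory again, which is the explicit failure handling the table attributes to orchestration.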
### 2. Event Sourcing
**Pattern: Store events, not state**
```python
# Traditional approach (storing state)
class Order:
    id: int
    status: str  # "pending" → "confirmed" → "shipped"
    total: float

# Event sourcing (storing events)
class OrderCreated(Event):
    order_id: int
    total: float

class OrderConfirmed(Event):
    order_id: int

class OrderShipped(Event):
    order_id: int

# Rebuild state from events
def rebuild_order(order_id):
    events = event_store.get_events(order_id)
    order = Order()
    for event in events:
        # Apply each event to rebuild state (Order.apply() is assumed, not shown here)
        order.apply(event)
    return order
```
**Pros**: Complete audit trail, time travel, event replay
**Cons**: Complexity, eventual consistency, schema evolution challenges
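The rebuild step can be sketched end to end with an in-memory event list. This is only an illustration: `OrderState`, its `apply` method, and the `rebuild` helper are names assumed for the example, and a real system would load events from an event store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderCreated:
    order_id: int
    total: float


@dataclass(frozen=True)
class OrderConfirmed:
    order_id: int


@dataclass(frozen=True)
class OrderShipped:
    order_id: int


@dataclass
class OrderState:
    order_id: int = 0
    status: str = "new"
    total: float = 0.0

    def apply(self, event):
        """Mutate state according to the event type."""
        if isinstance(event, OrderCreated):
            self.order_id, self.total, self.status = event.order_id, event.total, "pending"
        elif isinstance(event, OrderConfirmed):
            self.status = "confirmed"
        elif isinstance(event, OrderShipped):
            self.status = "shipped"


def rebuild(events):
    state = OrderState()
    for event in events:
        state.apply(event)
    return state


history = [OrderCreated(7, 99.5), OrderConfirmed(7), OrderShipped(7)]
current = rebuild(history)
as_of_confirmation = rebuild(history[:2])  # "time travel": replay only a prefix
```

Replaying the full history yields the current state, while replaying a prefix yields the state at an earlier point in time.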
### 3. CQRS (Command Query Responsibility Segregation)
**Separate read and write models**:
```python
# Write model (commands)
class CreateOrder:
    async def execute(self, data):  # async, because the body awaits
        order = Order(**data)
        await db.save(order)
        await event_bus.publish("order.created", order)

# Read model (projections)
class OrderReadModel:
    # Denormalized for fast reads
    def __init__(self):
        self.cache = {}

    # Assumes the event bus can bind this handler to a read-model instance
    @event_bus.subscribe("order.created")
    async def on_order_created(self, event):
        self.cache[event["order_id"]] = {
            "id": event["order_id"],
            "customer_name": await get_customer_name(event["customer_id"]),
            "status": "pending",
            "total": event["total"]
        }

    def get_order(self, order_id):
        return self.cache.get(order_id)  # Fast read, no joins
```
**Use when**: Read/write patterns differ significantly (e.g., analytics dashboards)
## Resilience Patterns
### 1. Circuit Breaker
```python
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
async def call_payment_service(amount):
    response = await http.post("http://payment-service/charge", json={"amount": amount})
    if response.status >= 500:
        raise PaymentServiceError()
    return response.json()

# Circuit states:
# CLOSED → normal operation
# OPEN → fails fast after threshold
# HALF_OPEN → test if service recovered
```
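The three states can be illustrated with a hand-rolled breaker. This is a simplified sketch for explanation only, not how the `circuitbreaker` library is implemented; `SimpleCircuitBreaker`, `CircuitOpenError`, and the injectable `clock` parameter are assumptions made so the example can run without waiting.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class SimpleCircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.recovery_timeout:
            return "HALF_OPEN"
        return "OPEN"

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            raise CircuitOpenError("failing fast")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call in HALF_OPEN, or too many failures, (re)opens the circuit
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and resets the failure count
        self._failures = 0
        self._opened_at = None
        return result


def flaky():
    raise ConnectionError("payment-service unavailable")


now = [0.0]  # fake clock so the example does not need to sleep
breaker = SimpleCircuitBreaker(failure_threshold=2, recovery_timeout=30, clock=lambda: now[0])
states = []
for _ in range(2):
    try:
        breaker.call(flaky)
    except ConnectionError:
        pass
states.append(breaker.state)  # threshold reached
now[0] = 31.0
states.append(breaker.state)  # recovery timeout elapsed
breaker.call(lambda: "ok")
states.append(breaker.state)  # trial call succeeded
```

After the run, `states` holds the progression OPEN, HALF_OPEN, CLOSED.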
### 2. Retry with Exponential Backoff
```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def call_with_retry(url):
    return await http.get(url)

# Up to 3 attempts in total, so at most 2 waits between them.
# Waits grow exponentially and are clamped to the 2s-10s range.
```
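To show how capped exponential backoff schedules are computed, here is a small standalone sketch. It uses the common `base * 2**n` formula with an optional "full jitter" variant; this is an illustrative formula and not necessarily the exact one `tenacity` applies, and `backoff_delays` is a hypothetical helper.

```python
import random


def backoff_delays(attempts, base=1.0, cap=10.0, jitter=False, rng=random.random):
    """Return the wait before each retry: base * 2**n, capped at `cap`.

    With jitter=True each wait is scaled by a random factor in [0, 1)
    so that many clients do not retry at the same instant.
    """
    delays = []
    for n in range(attempts - 1):  # no wait after the final attempt
        delay = min(cap, base * 2 ** n)
        if jitter:
            delay = rng() * delay
        delays.append(delay)
    return delays


plain = backoff_delays(attempts=5, base=2.0, cap=10.0)
jittered = backoff_delays(attempts=5, base=2.0, cap=10.0, jitter=True)
```

With five attempts, a 2-second base, and a 10-second cap, `plain` is `[2.0, 4.0, 8.0, 10.0]`; the jittered waits are random but never exceed the corresponding plain value.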
### 3. Timeout
```python
import asyncio

async def call_with_timeout(url):
    try:
        return await asyncio.wait_for(
            http.get(url),
            timeout=5.0  # 5 second timeout
        )
    except asyncio.TimeoutError:
        return {"error": "Service timeout"}
```
### 4. Bulkhead
**Isolate resources to prevent cascade failures**:
```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Separate thread pools for different services
payment_pool = ThreadPoolExecutor(max_workers=10)
inventory_pool = ThreadPoolExecutor(max_workers=5)

async def call_payment():
    # Run the blocking client call in the payment-only pool
    return await asyncio.get_running_loop().run_in_executor(
        payment_pool,
        payment_service.call
    )

# If payment service is slow, it only exhausts payment_pool,
# inventory calls still work
```
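For services that already use async clients, the same isolation can be sketched with a semaphore per dependency instead of a thread pool. The `Bulkhead` class and `slow_call` stand-in below are hypothetical and only demonstrate that concurrency to one dependency stays capped.

```python
import asyncio


class Bulkhead:
    """Caps the number of concurrent calls to one dependency."""

    def __init__(self, max_concurrent):
        self._semaphore = asyncio.Semaphore(max_concurrent)
        self.in_flight = 0
        self.peak = 0  # highest concurrency observed, for demonstration

    async def run(self, coro_func, *args):
        async with self._semaphore:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
            try:
                return await coro_func(*args)
            finally:
                self.in_flight -= 1


async def slow_call(i):
    await asyncio.sleep(0.01)  # stand-in for a slow payment-service call
    return i


async def demo():
    payment_bulkhead = Bulkhead(max_concurrent=2)
    results = await asyncio.gather(
        *(payment_bulkhead.run(slow_call, i) for i in range(6))
    )
    return results, payment_bulkhead.peak


results, peak = asyncio.run(demo())
```

All six calls complete, but no more than two are ever in flight against the slow dependency at once.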
## API Gateway Pattern
**Centralized entry point for client requests**:
```
Client → API Gateway → [Order, Payment, Inventory services]
```
**Responsibilities**:
- Routing requests to services
- Authentication/authorization
- Rate limiting
- Request/response transformation
- Caching
- Logging/monitoring
**Example (Kong, AWS API Gateway, Nginx)**:
```yaml
# API Gateway config (illustrative pseudo-config; real syntax differs per gateway)
routes:
  - path: /orders
    service: order-service
    auth: jwt
    ratelimit: 100/minute
  - path: /payments
    service: payment-service
    auth: oauth2
    ratelimit: 50/minute
```
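A per-minute limit such as `100/minute` is often enforced with a token bucket. The following is a minimal sketch of that idea, not the algorithm of any specific gateway; `TokenBucket` and its injectable `clock` are assumptions for the example.

```python
import time


class TokenBucket:
    """Allows bursts up to `rate_per_minute`, refilling continuously over time."""

    def __init__(self, rate_per_minute, clock=time.monotonic):
        self.capacity = float(rate_per_minute)
        self.tokens = float(rate_per_minute)
        self.refill_per_second = rate_per_minute / 60.0
        self._clock = clock
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Refill proportionally to elapsed time, never above capacity
        elapsed = now - self._last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self._last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


now = [0.0]  # fake clock so the example runs instantly
bucket = TokenBucket(rate_per_minute=3, clock=lambda: now[0])
burst = [bucket.allow() for _ in range(4)]
now[0] = 30.0  # half a minute later, 1.5 tokens have been refilled
after_refill = bucket.allow()
```

The first three requests pass, the fourth is rejected, and a request 30 seconds later passes again once tokens have been refilled.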
**Backend for Frontend (BFF) Pattern**:
```
Web Client → Web BFF → Services
Mobile App → Mobile BFF → Services
```
Each client type gets its own gateway, optimized for that client's needs.
## Service Discovery
### 1. Client-Side Discovery
```python
# Service registry (Consul, Eureka)
registry = ServiceRegistry("http://consul:8500")
# Client looks up service
instances = registry.get_instances("payment-service")
instance = load_balancer.choose(instances)
response = await http.get(f"http://{instance.host}:{instance.port}/charge")
```
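The lookup-then-choose flow can be sketched without a real registry. `InMemoryRegistry`, `RoundRobinBalancer`, and the instance addresses below are hypothetical stand-ins; a production client would query Consul or Eureka and also filter out unhealthy instances.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    host: str
    port: int


class InMemoryRegistry:
    """Stand-in for a service registry."""

    def __init__(self):
        self._services = {}

    def register(self, name, instance):
        self._services.setdefault(name, []).append(instance)

    def get_instances(self, name):
        return list(self._services.get(name, []))


class RoundRobinBalancer:
    """Cycles through the instances of each service in turn."""

    def __init__(self):
        self._counters = {}

    def choose(self, name, instances):
        if not instances:
            raise LookupError(f"no instances registered for {name}")
        index = self._counters.get(name, 0)
        self._counters[name] = index + 1
        return instances[index % len(instances)]


registry = InMemoryRegistry()
registry.register("payment-service", Instance("10.0.0.1", 8080))
registry.register("payment-service", Instance("10.0.0.2", 8080))

balancer = RoundRobinBalancer()
chosen_hosts = [
    balancer.choose("payment-service", registry.get_instances("payment-service")).host
    for _ in range(3)
]
```

Three consecutive lookups alternate between the two registered instances and then wrap around to the first.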
### 2. Server-Side Discovery (Load Balancer)
```
Client → Load Balancer → [Service Instance 1, Instance 2, Instance 3]
```
**DNS-based**: Kubernetes services, AWS ELB
## Observability
### Distributed Tracing
```python
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

async def create_order(order):
    with tracer.start_as_current_span("create-order") as span:
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.total", order.total)
        # Inject the current trace context into outgoing headers
        # so the trace propagates to the payment service
        headers = {}
        inject(headers)
        payment = await payment_service.charge(
            amount=order.total,
            headers=headers  # assumes the client forwards these headers
        )
        span.add_event("payment-completed")
        return order
```
**Tools**: Jaeger, Zipkin, AWS X-Ray, Datadog APM
### Log Aggregation
**Structured logging with correlation IDs**:
```python
import logging
import uuid

logger = logging.getLogger(__name__)

async def handle_request(request):
    # Reuse the caller's correlation ID if present, otherwise start a new one
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())
    logger.info("Processing request", extra={
        "correlation_id": correlation_id,
        "service": "order-service",
        "user_id": request.user_id
    })
```
**Tools**: ELK stack (Elasticsearch, Logstash, Kibana), Splunk, Datadog
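Passing `extra=` on every log call is easy to forget. One possible alternative, sketched below using only the standard library, stores the correlation ID in a `contextvars.ContextVar` and attaches it to every record through a logging filter. The names `correlation_id_var`, `CorrelationIdFilter`, and `JsonFormatter` are assumptions for this example.

```python
import contextvars
import io
import json
import logging

# Holds the correlation ID for the current request context
correlation_id_var = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copies the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emits one JSON object per log line for easy aggregation."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "correlation_id": record.correlation_id,
        })


stream = io.StringIO()  # stands in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.addFilter(CorrelationIdFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("order-service-demo")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False

correlation_id_var.set("req-123")  # normally set once per incoming request
logger.info("Processing request")
logged = json.loads(stream.getvalue())
```

Every log line emitted while handling the request then carries the same `correlation_id`, which is what makes cross-service searches in the aggregation tool possible.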
## Monolith Decomposition Strategies
### 1. Strangler Fig Pattern
**Gradually replace monolith with microservices**:
```
Phase 1: Monolith handles everything
Phase 2: Extract service, proxy some requests to it
Phase 3: More services extracted, proxy more requests
Phase 4: Monolith retired
```
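The proxy in phases 2 and 3 is essentially a routing table that grows as services are extracted. Here is a minimal sketch of that routing decision; `StranglerRouter` and the URLs are hypothetical, and a real deployment would usually put this logic in a gateway or reverse proxy.

```python
class StranglerRouter:
    """Routes extracted path prefixes to new services, everything else to the monolith."""

    def __init__(self, monolith_url):
        self._monolith_url = monolith_url
        self._routes = {}

    def extract(self, path_prefix, service_url):
        """Register a path prefix that is now served by a new service."""
        self._routes[path_prefix.rstrip("/")] = service_url

    def resolve(self, path):
        matches = [
            prefix for prefix in self._routes
            if path == prefix or path.startswith(prefix + "/")
        ]
        if matches:
            # The longest matching prefix wins
            return self._routes[max(matches, key=len)]
        return self._monolith_url


router = StranglerRouter("http://monolith")
before = router.resolve("/payments/42")           # Phase 1: monolith handles everything
router.extract("/payments", "http://payment-service")
after = router.resolve("/payments/42")            # Phase 2: extracted path is proxied
untouched = router.resolve("/orders/7")           # not yet extracted
```

Each extraction adds one route, and the monolith can be retired once no path falls through to it.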
### 2. Branch by Abstraction
1. Create abstraction layer in monolith
2. Implement new service
3. Gradually migrate code behind abstraction
4. Remove old implementation
5. Extract as microservice
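The steps above can be sketched with an interface and two implementations selected by a flag. All names here (`TaxCalculator`, `LegacyTaxCalculator`, `TaxServiceClient`, `make_tax_calculator`) and the flat 20% rate are hypothetical; the new implementation computes locally as a stand-in for a call to the extracted service.

```python
from abc import ABC, abstractmethod


class TaxCalculator(ABC):
    """Step 1: the abstraction that monolith code depends on."""

    @abstractmethod
    def tax_for(self, amount_cents: int) -> int:
        ...


class LegacyTaxCalculator(TaxCalculator):
    """Existing in-monolith implementation."""

    def tax_for(self, amount_cents: int) -> int:
        return amount_cents * 20 // 100


class TaxServiceClient(TaxCalculator):
    """Step 2: new implementation; a stand-in for a call to the new service."""

    def tax_for(self, amount_cents: int) -> int:
        return amount_cents * 20 // 100


def make_tax_calculator(use_new_service: bool) -> TaxCalculator:
    """Step 3: a flag migrates callers gradually behind the abstraction."""
    return TaxServiceClient() if use_new_service else LegacyTaxCalculator()


old_result = make_tax_calculator(use_new_service=False).tax_for(1000)
new_result = make_tax_calculator(use_new_service=True).tax_for(1000)
```

Comparing `old_result` and `new_result` for the same input is a simple way to verify the new implementation before removing the old one.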
### 3. Extract by Bounded Context
Priority order:
1. Services with clear boundaries (authentication, payments)
2. Services changing frequently
3. Services with different scaling needs
4. Services with technology mismatches (e.g., Java monolith, Python ML service)
## Anti-Patterns
| Anti-Pattern | Why Bad | Fix |
|--------------|---------|-----|
| **Distributed Monolith** | Services share database, deploy together | One DB per service, independent deployment |
| **Nanoservices** | Too fine-grained, excessive network calls | Merge related services, follow DDD |
| **Shared Database** | Tight coupling, schema changes break multiple services | Database per service |
| **Synchronous Chains** | A→B→C→D, latency adds up, cascading failures | Async events, parallelize where possible |
| **Chatty Services** | N+1 calls, excessive network overhead | Batch APIs, caching, coarser boundaries |
| **No Circuit Breakers** | Cascading failures bring down system | Circuit breakers + timeouts + retries |
| **No Distributed Tracing** | Impossible to debug cross-service issues | OpenTelemetry, correlation IDs |
## Cross-References
**Related skills**:
- **Message queues** → `message-queues` (RabbitMQ, Kafka patterns)
- **REST APIs** → `rest-api-design` (service interface design)
- **gRPC** → Check if gRPC skill exists
- **Security** → `ordis-security-architect` (service-to-service auth, zero trust)
- **Database** → `database-integration` (per-service databases, migrations)
- **Testing** → `api-testing` (contract testing, integration testing)
## Further Reading
- **Building Microservices** by Sam Newman
- **Domain-Driven Design** by Eric Evans
- **Release It!** by Michael Nygard (resilience patterns)
- **Microservices Patterns** by Chris Richardson