gh-tachyon-beep-skillpacks-…/skills/using-web-backend/microservices-architecture.md


# Microservices Architecture

## Overview

**Microservices architecture specialist covering service boundaries, communication patterns, data consistency, and operational concerns.**

**Core principle**: Microservices decompose applications into independently deployable services organized around business capabilities - enabling team autonomy and technology diversity at the cost of operational complexity and distributed system challenges.

## When to Use This Skill

Use when encountering:

- **Service boundaries**: Defining service scope, applying domain-driven design
- **Monolith decomposition**: Strategies for splitting existing systems
- **Data consistency**: Sagas, event sourcing, eventual consistency patterns
- **Communication**: Sync (REST/gRPC) vs async (events/messages)
- **API gateways**: Routing, authentication, rate limiting
- **Service discovery**: Registry patterns, DNS, configuration
- **Resilience**: Circuit breakers, retries, timeouts, bulkheads
- **Observability**: Distributed tracing, logging aggregation, metrics
- **Deployment**: Containers, orchestration, blue-green deployments

**Do NOT use for**:
- Monolithic architectures (microservices aren't always better)
- Single-team projects < 5 services (overhead exceeds benefits)
- Simple CRUD applications (microservices add unnecessary complexity)

## When NOT to Use Microservices

**Stay monolithic if**:
- Team < 10 engineers
- Domain is not well understood yet
- Strong consistency required everywhere
- Network latency is critical
- You can't invest in observability/DevOps infrastructure

**Microservices require**: Mature DevOps, monitoring, distributed systems expertise, organizational support.

## Service Boundary Patterns (Domain-Driven Design)

### 1. Bounded Contexts

**Pattern: One microservice = One bounded context**

```
❌ Too fine-grained (anemic services):
- UserService (just CRUD)
- OrderService (just CRUD)
- PaymentService (just CRUD)

✅ Business capability alignment:
- CustomerManagementService (user profiles, preferences, history)
- OrderFulfillmentService (order lifecycle, inventory, shipping)
- PaymentProcessingService (payment, billing, invoicing, refunds)
```

**Identifying boundaries**:
1. **Ubiquitous language** - Different terms for same concept = different contexts
2. **Change patterns** - Services that change together should stay together
3. **Team ownership** - One team should own one service
4. **Data autonomy** - Each service owns its data, no shared databases

### 2. Strategic DDD Patterns

| Pattern | Use When | Example |
|---------|----------|---------|
| **Separate Ways** | Contexts are independent | Analytics service, main app service |
| **Partnership** | Teams must collaborate closely | Order + Inventory services |
| **Customer-Supplier** | Upstream/downstream relationship | Payment gateway (upstream) → Order service |
| **Conformist** | Accept upstream model as-is | Third-party API integration |
| **Anti-Corruption Layer** | Isolate from legacy/external systems | ACL between new microservices and legacy monolith |

### 3. Service Sizing Guidelines

**Too small (Nanoservices)**:
- Excessive network calls
- Distributed monolith
- Coordination overhead exceeds benefits

**Too large (Minimonoliths)**:
- Multiple teams modifying same service
- Mixed deployment frequencies
- Tight coupling re-emerges

**Right size indicators**:
- Single team can own it
- Deployable independently
- Changes don't ripple to other services
- Clear business capability
- 100-10,000 LOC (highly variable)

## Communication Patterns

### Synchronous Communication

**REST APIs**:

```python
# Order service calling Payment service
async def create_order(order: Order):
    # Synchronous REST call
    payment = await payment_service.charge(
        amount=order.total,
        customer_id=order.customer_id
    )

    if payment.status == "success":
        order.status = "confirmed"
        await db.save(order)
        return order
    else:
        raise PaymentFailedException()
```

**Pros**: Simple, request-response, easy to debug
**Cons**: Tight coupling, availability dependency, latency cascades

**gRPC**:

```python
# Proto definition
service OrderService {
    rpc CreateOrder (OrderRequest) returns (OrderResponse);
}

# Implementation
class OrderServicer(order_pb2_grpc.OrderServiceServicer):
    async def CreateOrder(self, request, context):
        # Type-safe, efficient binary protocol
        payment = await payment_stub.Charge(
            PaymentRequest(amount=request.total)
        )
        return OrderResponse(order_id=order.id)
```

**Pros**: Type-safe, efficient, streaming support
**Cons**: HTTP/2 required, less human-readable, proto dependencies

### Asynchronous Communication

**Event-Driven (Pub/Sub)**:

```python
# Order service publishes event
await event_bus.publish("order.created", {
    "order_id": order.id,
    "customer_id": customer.id,
    "total": order.total
})

# Inventory service subscribes
@event_bus.subscribe("order.created")
async def reserve_inventory(event):
    await inventory.reserve(event["order_id"])
    await event_bus.publish("inventory.reserved", {...})

# Notification service subscribes
@event_bus.subscribe("order.created")
async def send_confirmation(event):
    await email.send_order_confirmation(event)
```

**Pros**: Loose coupling, services independent, scalable
**Cons**: Eventual consistency, harder to trace, ordering challenges

**Message Queues (Point-to-Point)**:

```python
# Producer
await queue.send("payment-processing", {
    "order_id": order.id,
    "amount": order.total
})

# Consumer
@queue.consumer("payment-processing")
async def process_payment(message):
    result = await payment_gateway.charge(message["amount"])
    if result.success:
        await message.ack()
    else:
        await message.nack(requeue=True)
```

**Pros**: Guaranteed delivery, work distribution, retry handling
**Cons**: Queue becomes bottleneck, requires message broker

### Communication Pattern Decision Matrix

| Scenario | Pattern | Why |
|----------|---------|-----|
| User-facing request/response | Sync (REST/gRPC) | Low latency, immediate feedback |
| Background processing | Async (queue) | Don't block user, retry support |
| Cross-service notifications | Async (pub/sub) | Loose coupling, multiple consumers |
| Real-time updates | WebSocket/SSE | Bidirectional, streaming |
| Data replication | Event sourcing | Audit trail, rebuild state |
| High throughput | Async (messaging) | Buffer spikes, backpressure |

## Data Consistency Patterns

### 1. Saga Pattern (Distributed Transactions)

**Choreography (Event-Driven)**:

```python
# Order Service
async def create_order(order):
    order.status = "pending"
    await db.save(order)
    await events.publish("order.created", order)

# Payment Service
@events.subscribe("order.created")
async def handle_order(event):
    try:
        await charge_customer(event["total"])
        await events.publish("payment.completed", event)
    except PaymentError:
        await events.publish("payment.failed", event)

# Inventory Service
@events.subscribe("payment.completed")
async def reserve_items(event):
    try:
        await reserve(event["items"])
        await events.publish("inventory.reserved", event)
    except InventoryError:
        await events.publish("inventory.failed", event)

# Order Service (Compensation)
@events.subscribe("payment.failed")
async def cancel_order(event):
    order = await db.get(event["order_id"])
    order.status = "cancelled"
    await db.save(order)

@events.subscribe("inventory.failed")
async def refund_payment(event):
    await payment.refund(event["order_id"])
    await cancel_order(event)
```

**Orchestration (Coordinator)**:

```python
class OrderSaga:
    def __init__(self, order):
        self.order = order
        self.completed_steps = []

    async def execute(self):
        try:
            # Step 1: Reserve inventory
            await self.reserve_inventory()
            self.completed_steps.append("inventory")

            # Step 2: Process payment
            await self.process_payment()
            self.completed_steps.append("payment")

            # Step 3: Confirm order
            await self.confirm_order()

        except Exception as e:
            # Compensate in reverse order
            await self.compensate()
            raise

    async def compensate(self):
        for step in reversed(self.completed_steps):
            if step == "inventory":
                await inventory_service.release(self.order.id)
            elif step == "payment":
                await payment_service.refund(self.order.id)
```

**Choreography vs Orchestration**:

| Aspect | Choreography | Orchestration |
|--------|--------------|---------------|
| Coordination | Decentralized (events) | Centralized (orchestrator) |
| Coupling | Loose | Tight to orchestrator |
| Complexity | Distributed across services | Concentrated in orchestrator |
| Tracing | Harder (follow events) | Easier (single coordinator) |
| Failure handling | Implicit (event handlers) | Explicit (orchestrator logic) |
| Best for | Simple workflows | Complex workflows |

### 2. Event Sourcing

**Pattern: Store events, not state**

```python
# Traditional approach (storing state)
class Order:
    id: int
    status: str  # "pending" → "confirmed" → "shipped"
    total: float

# Event sourcing (storing events)
class OrderCreated(Event):
    order_id: int
    total: float

class OrderConfirmed(Event):
    order_id: int

class OrderShipped(Event):
    order_id: int

# Rebuild state from events
def rebuild_order(order_id):
    events = event_store.get_events(order_id)
    order = Order()
    for event in events:
        order.apply(event)  # Apply each event to rebuild state
    return order
```

**Pros**: Complete audit trail, time travel, event replay
**Cons**: Complexity, eventual consistency, schema evolution challenges

### 3. CQRS (Command Query Responsibility Segregation)

**Separate read and write models**:

```python
# Write model (commands)
class CreateOrder:
    def execute(self, data):
        order = Order(**data)
        await db.save(order)
        await event_bus.publish("order.created", order)

# Read model (projections)
class OrderReadModel:
    # Denormalized for fast reads
    def __init__(self):
        self.cache = {}

    @event_bus.subscribe("order.created")
    async def on_order_created(self, event):
        self.cache[event["order_id"]] = {
            "id": event["order_id"],
            "customer_name": await get_customer_name(event["customer_id"]),
            "status": "pending",
            "total": event["total"]
        }

    def get_order(self, order_id):
        return self.cache.get(order_id)  # Fast read, no joins
```

**Use when**: Read/write patterns differ significantly (e.g., analytics dashboards)

## Resilience Patterns

### 1. Circuit Breaker

```python
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
async def call_payment_service(amount):
    response = await http.post("http://payment-service/charge", json={"amount": amount})
    if response.status >= 500:
        raise PaymentServiceError()
    return response.json()

# Circuit states:
# CLOSED → normal operation
# OPEN → fails fast after threshold
# HALF_OPEN → test if service recovered
```

### 2. Retry with Exponential Backoff

```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def call_with_retry(url):
    return await http.get(url)

# Retries: 2s → 4s → 8s
```

### 3. Timeout

```python
import asyncio

async def call_with_timeout(url):
    try:
        return await asyncio.wait_for(
            http.get(url),
            timeout=5.0  # 5 second timeout
        )
    except asyncio.TimeoutError:
        return {"error": "Service timeout"}
```

### 4. Bulkhead

**Isolate resources to prevent cascade failures**:

```python
# Separate thread pools for different services
payment_pool = ThreadPoolExecutor(max_workers=10)
inventory_pool = ThreadPoolExecutor(max_workers=5)

async def call_payment():
    return await asyncio.get_event_loop().run_in_executor(
        payment_pool,
        payment_service.call
    )

# If payment service is slow, it only exhausts payment_pool,
# inventory calls still work
```

## API Gateway Pattern

**Centralized entry point for client requests**:

```
Client → API Gateway → [Order, Payment, Inventory services]
```

**Responsibilities**:
- Routing requests to services
- Authentication/authorization
- Rate limiting
- Request/response transformation
- Caching
- Logging/monitoring

**Example (Kong, AWS API Gateway, Nginx)**:

```yaml
# API Gateway config
routes:
  - path: /orders
    service: order-service
    auth: jwt
    ratelimit: 100/minute

  - path: /payments
    service: payment-service
    auth: oauth2
    ratelimit: 50/minute
```

**Backend for Frontend (BFF) Pattern**:

```
Web Client → Web BFF → Services
Mobile App → Mobile BFF → Services
```

Each client type has optimized gateway.

## Service Discovery

### 1. Client-Side Discovery

```python
# Service registry (Consul, Eureka)
registry = ServiceRegistry("http://consul:8500")

# Client looks up service
instances = registry.get_instances("payment-service")
instance = load_balancer.choose(instances)
response = await http.get(f"http://{instance.host}:{instance.port}/charge")
```

### 2. Server-Side Discovery (Load Balancer)

```
Client → Load Balancer → [Service Instance 1, Instance 2, Instance 3]
```

**DNS-based**: Kubernetes services, AWS ELB

## Observability

### Distributed Tracing

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

async def create_order(order):
    with tracer.start_as_current_span("create-order") as span:
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.total", order.total)

        # Trace propagates to payment service
        payment = await payment_service.charge(
            amount=order.total,
            trace_context=span.context
        )

        span.add_event("payment-completed")
        return order
```

**Tools**: Jaeger, Zipkin, AWS X-Ray, Datadog APM

### Log Aggregation

**Structured logging with correlation IDs**:

```python
import logging
import uuid

logger = logging.getLogger(__name__)

async def handle_request(request):
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())

    logger.info("Processing request", extra={
        "correlation_id": correlation_id,
        "service": "order-service",
        "user_id": request.user_id
    })
```

**Tools**: ELK stack (Elasticsearch, Logstash, Kibana), Splunk, Datadog

## Monolith Decomposition Strategies

### 1. Strangler Fig Pattern

**Gradually replace monolith with microservices**:

```
Phase 1: Monolith handles everything
Phase 2: Extract service, proxy some requests to it
Phase 3: More services extracted, proxy more requests
Phase 4: Monolith retired
```

### 2. Branch by Abstraction

1. Create abstraction layer in monolith
2. Implement new service
3. Gradually migrate code behind abstraction
4. Remove old implementation
5. Extract as microservice

### 3. Extract by Bounded Context

Priority order:
1. Services with clear boundaries (authentication, payments)
2. Services changing frequently
3. Services with different scaling needs
4. Services with technology mismatches (e.g., Java monolith, Python ML service)

## Anti-Patterns

| Anti-Pattern | Why Bad | Fix |
|--------------|---------|-----|
| **Distributed Monolith** | Services share database, deploy together | One DB per service, independent deployment |
| **Nanoservices** | Too fine-grained, excessive network calls | Merge related services, follow DDD |
| **Shared Database** | Tight coupling, schema changes break multiple services | Database per service |
| **Synchronous Chains** | A→B→C→D, latency adds up, cascading failures | Async events, parallelize where possible |
| **Chatty Services** | N+1 calls, excessive network overhead | Batch APIs, caching, coarser boundaries |
| **No Circuit Breakers** | Cascading failures bring down system | Circuit breakers + timeouts + retries |
| **No Distributed Tracing** | Impossible to debug cross-service issues | OpenTelemetry, correlation IDs |

## Cross-References

**Related skills**:
- **Message queues** → `message-queues` (RabbitMQ, Kafka patterns)
- **REST APIs** → `rest-api-design` (service interface design)
- **gRPC** → Check if gRPC skill exists
- **Security** → `ordis-security-architect` (service-to-service auth, zero trust)
- **Database** → `database-integration` (per-service databases, migrations)
- **Testing** → `api-testing` (contract testing, integration testing)

## Further Reading

- **Building Microservices** by Sam Newman
- **Domain-Driven Design** by Eric Evans
- **Release It!** by Michael Nygard (resilience patterns)
- **Microservices Patterns** by Chris Richardson