# Microservices Architecture

## Overview

**Microservices architecture specialist covering service boundaries, communication patterns, data consistency, and operational concerns.**

**Core principle**: Microservices decompose applications into independently deployable services organized around business capabilities - enabling team autonomy and technology diversity at the cost of operational complexity and distributed system challenges.

## When to Use This Skill

Use when encountering:

- **Service boundaries**: Defining service scope, applying domain-driven design
- **Monolith decomposition**: Strategies for splitting existing systems
- **Data consistency**: Sagas, event sourcing, eventual consistency patterns
- **Communication**: Sync (REST/gRPC) vs async (events/messages)
- **API gateways**: Routing, authentication, rate limiting
- **Service discovery**: Registry patterns, DNS, configuration
- **Resilience**: Circuit breakers, retries, timeouts, bulkheads
- **Observability**: Distributed tracing, logging aggregation, metrics
- **Deployment**: Containers, orchestration, blue-green deployments

**Do NOT use for**:

- Monolithic architectures (microservices aren't always better)
- Single-team projects with fewer than 5 services (overhead exceeds benefits)
- Simple CRUD applications (microservices add unnecessary complexity)

## When NOT to Use Microservices

**Stay monolithic if**:

- Team < 10 engineers
- Domain is not well understood yet
- Strong consistency required everywhere
- Network latency is critical
- You can't invest in observability/DevOps infrastructure

**Microservices require**: Mature DevOps, monitoring, distributed systems expertise, organizational support.

## Service Boundary Patterns (Domain-Driven Design)

### 1. Bounded Contexts

**Pattern: One microservice = One bounded context**

```
❌ Too fine-grained (anemic services):
- UserService (just CRUD)
- OrderService (just CRUD)
- PaymentService (just CRUD)

✅ Business capability alignment:
- CustomerManagementService (user profiles, preferences, history)
- OrderFulfillmentService (order lifecycle, inventory, shipping)
- PaymentProcessingService (payment, billing, invoicing, refunds)
```

**Identifying boundaries**:

1. **Ubiquitous language** - The same term meaning different things (or different terms for the same concept) signals different contexts
2. **Change patterns** - Code that changes together should stay in the same service
3. **Team ownership** - Each service should be owned by exactly one team
4. **Data autonomy** - Each service owns its data, no shared databases

### 2. Strategic DDD Patterns

| Pattern | Use When | Example |
|---------|----------|---------|
| **Separate Ways** | Contexts are independent | Analytics service, main app service |
| **Partnership** | Teams must collaborate closely | Order + Inventory services |
| **Customer-Supplier** | Upstream/downstream relationship | Payment gateway (upstream) → Order service |
| **Conformist** | Accept upstream model as-is | Third-party API integration |
| **Anti-Corruption Layer** | Isolate from legacy/external systems | ACL between new microservices and legacy monolith |

### 3. Service Sizing Guidelines

**Too small (Nanoservices)**:

- Excessive network calls
- Distributed monolith
- Coordination overhead exceeds benefits

**Too large (Minimonoliths)**:

- Multiple teams modifying same service
- Mixed deployment frequencies
- Tight coupling re-emerges

**Right size indicators**:

- Single team can own it
- Deployable independently
- Changes don't ripple to other services
- Clear business capability
- 100-10,000 LOC (highly variable)

## Communication Patterns

### Synchronous Communication

**REST APIs**:

```python
# Order service calling Payment service
async def create_order(order: Order):
    # Request-response call: the order waits on the payment service
    payment = await payment_service.charge(
        amount=order.total,
        customer_id=order.customer_id
    )

    if payment.status == "success":
        order.status = "confirmed"
        await db.save(order)
        return order
    else:
        raise PaymentFailedException()
```

**Pros**: Simple, request-response, easy to debug
**Cons**: Tight coupling, availability dependency, latency cascades

**gRPC**:

```python
# Proto definition (kept as comments so this block stays valid Python):
#   service OrderService {
#     rpc CreateOrder (OrderRequest) returns (OrderResponse);
#   }

# Implementation
class OrderServicer(order_pb2_grpc.OrderServiceServicer):
    async def CreateOrder(self, request, context):
        # Type-safe, efficient binary protocol
        payment = await payment_stub.Charge(
            PaymentRequest(amount=request.total)
        )
        # save_order is a hypothetical persistence helper
        order = await save_order(request, payment)
        return OrderResponse(order_id=order.id)
```

**Pros**: Type-safe, efficient, streaming support
**Cons**: HTTP/2 required, less human-readable, proto dependencies

### Asynchronous Communication

**Event-Driven (Pub/Sub)**:

```python
# Order service publishes event
await event_bus.publish("order.created", {
    "order_id": order.id,
    "customer_id": order.customer_id,
    "total": order.total
})

# Inventory service subscribes
@event_bus.subscribe("order.created")
async def reserve_inventory(event):
    await inventory.reserve(event["order_id"])
    await event_bus.publish("inventory.reserved", {...})

# Notification service subscribes
@event_bus.subscribe("order.created")
async def send_confirmation(event):
    await email.send_order_confirmation(event)
```

**Pros**: Loose coupling, services independent, scalable
**Cons**: Eventual consistency, harder to trace, ordering challenges

**Message Queues (Point-to-Point)**:

```python
# Producer
await queue.send("payment-processing", {
    "order_id": order.id,
    "amount": order.total
})

# Consumer
@queue.consumer("payment-processing")
async def process_payment(message):
    result = await payment_gateway.charge(message["amount"])
    if result.success:
        await message.ack()
    else:
        # Requeue for another attempt; cap retries or use a
        # dead-letter queue to avoid redelivering forever
        await message.nack(requeue=True)
```

**Pros**: At-least-once delivery (with acks), work distribution, retry handling
**Cons**: Queue can become a bottleneck, requires message broker

### Communication Pattern Decision Matrix

| Scenario | Pattern | Why |
|----------|---------|-----|
| User-facing request/response | Sync (REST/gRPC) | Low latency, immediate feedback |
| Background processing | Async (queue) | Don't block user, retry support |
| Cross-service notifications | Async (pub/sub) | Loose coupling, multiple consumers |
| Real-time updates | WebSocket/SSE | Streaming push (WebSocket is bidirectional, SSE is server-to-client only) |
| Data replication | Event sourcing | Audit trail, rebuild state |
| High throughput | Async (messaging) | Buffer spikes, backpressure |

## Data Consistency Patterns

### 1. Saga Pattern (Distributed Transactions)

**Choreography (Event-Driven)**:

```python
# Order Service
async def create_order(order):
    order.status = "pending"
    await db.save(order)
    await events.publish("order.created", order)

# Payment Service
@events.subscribe("order.created")
async def handle_order(event):
    try:
        await charge_customer(event["total"])
        await events.publish("payment.completed", event)
    except PaymentError:
        await events.publish("payment.failed", event)

# Inventory Service
@events.subscribe("payment.completed")
async def reserve_items(event):
    try:
        await reserve(event["items"])
        await events.publish("inventory.reserved", event)
    except InventoryError:
        await events.publish("inventory.failed", event)

# Order Service (Compensation)
@events.subscribe("payment.failed")
async def cancel_order(event):
    order = await db.get(event["order_id"])
    order.status = "cancelled"
    await db.save(order)

@events.subscribe("inventory.failed")
async def refund_payment(event):
    # Payment already succeeded, so undo it before cancelling
    await payment.refund(event["order_id"])
    await cancel_order(event)
```

**Orchestration (Coordinator)**:

```python
class OrderSaga:
    def __init__(self, order):
        self.order = order
        self.completed_steps = []

    async def execute(self):
        try:
            # Step 1: Reserve inventory
            await self.reserve_inventory()
            self.completed_steps.append("inventory")

            # Step 2: Process payment
            await self.process_payment()
            self.completed_steps.append("payment")

            # Step 3: Confirm order
            await self.confirm_order()
        except Exception:
            # Compensate in reverse order, then re-raise
            await self.compensate()
            raise

    async def compensate(self):
        for step in reversed(self.completed_steps):
            if step == "inventory":
                await inventory_service.release(self.order.id)
            elif step == "payment":
                await payment_service.refund(self.order.id)
```

**Choreography vs Orchestration**:

| Aspect | Choreography | Orchestration |
|--------|--------------|---------------|
| Coordination | Decentralized (events) | Centralized (orchestrator) |
| Coupling | Loose | Tight to orchestrator |
| Complexity | Distributed across services | Concentrated in orchestrator |
| Tracing | Harder (follow events) | Easier (single coordinator) |
| Failure handling | Implicit (event handlers) | Explicit (orchestrator logic) |
| Best for | Simple workflows | Complex workflows |

### 2. Event Sourcing

**Pattern: Store events, not state**

```python
# Traditional approach (storing state)
class Order:
    id: int
    status: str  # "pending" → "confirmed" → "shipped"
    total: float

# Event sourcing (storing events)
class OrderCreated(Event):
    order_id: int
    total: float

class OrderConfirmed(Event):
    order_id: int

class OrderShipped(Event):
    order_id: int

# Rebuild state from events
def rebuild_order(order_id):
    events = event_store.get_events(order_id)
    order = Order()
    for event in events:
        # Assumes Order defines apply() to update its fields per event type
        order.apply(event)  # Apply each event to rebuild state
    return order
```

**Pros**: Complete audit trail, time travel, event replay
**Cons**: Complexity, eventual consistency, schema evolution challenges

### 3. CQRS (Command Query Responsibility Segregation)

**Separate read and write models**:

```python
# Write model (commands)
class CreateOrder:
    async def execute(self, data):  # must be async: the body awaits
        order = Order(**data)
        await db.save(order)
        await event_bus.publish("order.created", order)

# Read model (projections)
class OrderReadModel:
    # Denormalized for fast reads
    def __init__(self):
        self.cache = {}

    @event_bus.subscribe("order.created")
    async def on_order_created(self, event):
        self.cache[event["order_id"]] = {
            "id": event["order_id"],
            "customer_name": await get_customer_name(event["customer_id"]),
            "status": "pending",
            "total": event["total"]
        }

    def get_order(self, order_id):
        return self.cache.get(order_id)  # Fast read, no joins
```

**Use when**: Read/write patterns differ significantly (e.g., analytics dashboards)

## Resilience Patterns

### 1. Circuit Breaker

```python
from circuitbreaker import circuit

@circuit(failure_threshold=5, recovery_timeout=60)
async def call_payment_service(amount):
    response = await http.post("http://payment-service/charge", json={"amount": amount})
    if response.status >= 500:
        raise PaymentServiceError()
    return response.json()

# Circuit states:
# CLOSED → normal operation
# OPEN → fails fast after threshold
# HALF_OPEN → test if service recovered
```

### 2. Retry with Exponential Backoff

```python
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10)
)
async def call_with_retry(url):
    return await http.get(url)

# 3 attempts total, so at most 2 waits between them;
# each wait grows exponentially and is clamped to the 2-10s range
```

### 3. Timeout

```python
import asyncio

async def call_with_timeout(url):
    try:
        return await asyncio.wait_for(
            http.get(url),
            timeout=5.0  # 5 second timeout
        )
    except asyncio.TimeoutError:
        return {"error": "Service timeout"}
```

### 4. Bulkhead

**Isolate resources to prevent cascade failures**:

```python
import asyncio
from concurrent.futures import ThreadPoolExecutor

# Separate thread pools for different services
payment_pool = ThreadPoolExecutor(max_workers=10)
inventory_pool = ThreadPoolExecutor(max_workers=5)

async def call_payment():
    # payment_service.call is a blocking function run in the payment pool
    return await asyncio.get_running_loop().run_in_executor(
        payment_pool,
        payment_service.call
    )

# If payment service is slow, it only exhausts payment_pool,
# inventory calls still work
```

## API Gateway Pattern

**Centralized entry point for client requests**:

```
Client → API Gateway → [Order, Payment, Inventory services]
```

**Responsibilities**:

- Routing requests to services
- Authentication/authorization
- Rate limiting
- Request/response transformation
- Caching
- Logging/monitoring

**Example tools**: Kong, AWS API Gateway, Nginx. The config below is illustrative and does not follow the syntax of any specific gateway:

```yaml
# API Gateway config (illustrative)
routes:
  - path: /orders
    service: order-service
    auth: jwt
    ratelimit: 100/minute

  - path: /payments
    service: payment-service
    auth: oauth2
    ratelimit: 50/minute
```

**Backend for Frontend (BFF) Pattern**:

```
Web Client → Web BFF → Services
Mobile App → Mobile BFF → Services
```

Each client type gets its own gateway, optimized for that client's needs.

## Service Discovery

### 1. Client-Side Discovery

```python
# Service registry (Consul, Eureka)
registry = ServiceRegistry("http://consul:8500")

# Client looks up service
instances = registry.get_instances("payment-service")
instance = load_balancer.choose(instances)
response = await http.post(f"http://{instance.host}:{instance.port}/charge")
```

### 2. Server-Side Discovery (Load Balancer)

```
Client → Load Balancer → [Service Instance 1, Instance 2, Instance 3]
```

**DNS-based**: Kubernetes services, AWS ELB

## Observability

### Distributed Tracing

```python
from opentelemetry import trace
from opentelemetry.propagate import inject

tracer = trace.get_tracer(__name__)

async def create_order(order):
    with tracer.start_as_current_span("create-order") as span:
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.total", order.total)

        # Propagate the trace to the payment service via request headers
        headers = {}
        inject(headers)  # writes trace context for the current span
        payment = await payment_service.charge(
            amount=order.total,
            headers=headers
        )

        span.add_event("payment-completed")
        return order
```

**Tools**: Jaeger, Zipkin, AWS X-Ray, Datadog APM

### Log Aggregation

**Structured logging with correlation IDs**:

```python
import logging
import uuid

logger = logging.getLogger(__name__)

async def handle_request(request):
    # Reuse the caller's correlation ID, or start a new one
    correlation_id = request.headers.get("X-Correlation-ID") or str(uuid.uuid4())

    logger.info("Processing request", extra={
        "correlation_id": correlation_id,
        "service": "order-service",
        "user_id": request.user_id
    })
```

**Tools**: ELK stack (Elasticsearch, Logstash, Kibana), Splunk, Datadog

## Monolith Decomposition Strategies

### 1. Strangler Fig Pattern

**Gradually replace monolith with microservices**:

```
Phase 1: Monolith handles everything
Phase 2: Extract service, proxy some requests to it
Phase 3: More services extracted, proxy more requests
Phase 4: Monolith retired
```

### 2. Branch by Abstraction

1. Create abstraction layer in monolith
2. Implement new service
3. Gradually migrate code behind abstraction
4. Remove old implementation
5. Extract as microservice

### 3. Extract by Bounded Context

Priority order:

1. Services with clear boundaries (authentication, payments)
2. Services changing frequently
3. Services with different scaling needs
4. Services with technology mismatches (e.g., Java monolith, Python ML service)

## Anti-Patterns

| Anti-Pattern | Why Bad | Fix |
|--------------|---------|-----|
| **Distributed Monolith** | Services share database, deploy together | One DB per service, independent deployment |
| **Nanoservices** | Too fine-grained, excessive network calls | Merge related services, follow DDD |
| **Shared Database** | Tight coupling, schema changes break multiple services | Database per service |
| **Synchronous Chains** | A→B→C→D, latency adds up, cascading failures | Async events, parallelize where possible |
| **Chatty Services** | N+1 calls, excessive network overhead | Batch APIs, caching, coarser boundaries |
| **No Circuit Breakers** | Cascading failures bring down system | Circuit breakers + timeouts + retries |
| **No Distributed Tracing** | Impossible to debug cross-service issues | OpenTelemetry, correlation IDs |

## Cross-References

**Related skills**:

- **Message queues** → `message-queues` (RabbitMQ, Kafka patterns)
- **REST APIs** → `rest-api-design` (service interface design)
- **gRPC** → Check if gRPC skill exists
- **Security** → `ordis-security-architect` (service-to-service auth, zero trust)
- **Database** → `database-integration` (per-service databases, migrations)
- **Testing** → `api-testing` (contract testing, integration testing)

## Further Reading

- **Building Microservices** by Sam Newman
- **Domain-Driven Design** by Eric Evans
- **Release It!** by Michael Nygard (resilience patterns)
- **Microservices Patterns** by Chris Richardson