# Message Queues

## Overview

**Message queue specialist covering technology selection, reliability patterns, ordering guarantees, schema evolution, and production operations.**

**Core principle**: Message queues decouple producers from consumers, enabling async processing, load leveling, and resilience - but they require careful design for reliability, ordering, monitoring, and operational excellence.

## When to Use This Skill

Use when encountering:

- **Technology selection**: RabbitMQ vs Kafka vs SQS vs SNS
- **Reliability**: Guaranteed delivery, acknowledgments, retries, DLQ
- **Ordering**: Partition keys, FIFO queues, ordered processing
- **Scaling**: Consumer groups, parallelism, backpressure
- **Schema evolution**: Message versioning, Avro, Protobuf
- **Monitoring**: Lag tracking, alerting, distributed tracing
- **Advanced patterns**: Outbox, saga, CQRS, event sourcing
- **Security**: Encryption, IAM, Kafka authentication
- **Testing**: Local testing, chaos engineering, load testing

**Do NOT use for**:
- Request/response APIs → Use REST or GraphQL instead
- Strong consistency required → Use database transactions
- Real-time streaming analytics → Use a dedicated stream-processing skill or framework (e.g., Kafka Streams, Flink)

## Technology Selection Matrix

| Factor | RabbitMQ | Apache Kafka | AWS SQS | AWS SNS |
|--------|----------|--------------|---------|---------|
| **Use Case** | Task queues, routing | Event streaming, logs | Simple queues | Pub/sub fanout |
| **Throughput** | 10k-50k msg/s | 100k+ msg/s | Nearly unlimited (std), 300 msg/s (FIFO, ~3k with batching) | 100k+ msg/s |
| **Ordering** | Queue-level | Partition-level (strong) | FIFO queues only | None |
| **Persistence** | Durable queues | Log-based (default) | Managed | Ephemeral (SNS → SQS for durability) |
| **Retention** | Until consumed | Days to weeks | 4 days default, 14 days max | None (delivery only) |
| **Routing** | Exchanges (topic, fanout, headers) | Topics only | None | Topic-based filtering |
| **Message size** | Up to 128 MB | ~1 MB default (configurable) | 256 KB | 256 KB |
| **Ops complexity** | Medium (clustering) | High (partitions, replication) | Low (managed) | Low (managed) |
| **Cost** | EC2 self-hosted | Self-hosted or MSK | Pay-per-request | Pay-per-request |

### Decision Tree

```
Are you on AWS and need simple async processing?
  → Yes → **AWS SQS** (start simple)
  → No  → Continue...

Do you need event replay or stream processing?
  → Yes → **Kafka** (log-based, replayable)
  → No  → Continue...

Do you need complex routing (topic exchange, headers)?
  → Yes → **RabbitMQ** (rich exchange types)
  → No  → Continue...

Do you need pub/sub fanout to multiple subscribers?
  → Yes → **SNS** (or Kafka topics with multiple consumer groups)
  → No  → **SQS** or **RabbitMQ** for task queues
```

### Migration Path

| Current State | Next Step | Why |
|---------------|-----------|-----|
| No queue | Start with SQS (if AWS) or RabbitMQ | Lowest operational complexity |
| SQS FIFO → approaching 1k+ msg/s | Consider Kafka or more message groups | FIFO throughput is limited (300 msg/s, ~3k with batching) |
| RabbitMQ → Event sourcing needed | Migrate to Kafka | Kafka's log retention enables replay |
| Kafka → Simple task queue | Consider RabbitMQ or SQS | Kafka is overkill for simple queues |

## Reliability Patterns

### Acknowledgment Modes

| Mode | When Ack Sent | Reliability | Performance | Use Case |
|------|---------------|-------------|-------------|----------|
| **Auto-ack** | On receive | Low (lost on crash) | High | Logs, analytics, best-effort |
| **Manual ack (after processing)** | After success | High (at-least-once) | Medium | Standard production pattern |
| **Transactional** | In transaction | Highest (exactly-once) | Low | Financial, critical data |

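The **Transactional** row corresponds to Kafka's exactly-once consume-transform-produce loop. A minimal sketch using the `confluent_kafka` client (a different client than the `kafka-python` examples elsewhere in this document; broker address, topic names, and the `transactional.id` are illustrative):

```python
from confluent_kafka import Consumer, Producer, TopicPartition

consumer = Consumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'order-processor',
    'enable.auto.commit': False,
    'isolation.level': 'read_committed'  # only see committed transactional writes
})
consumer.subscribe(['orders'])

producer = Producer({
    'bootstrap.servers': 'kafka:9092',
    'transactional.id': 'order-processor-1'  # stable id per processor instance
})
producer.init_transactions()

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue

    producer.begin_transaction()
    producer.produce('orders-enriched', value=msg.value())
    # Commit the consumed offset inside the same transaction:
    # the output record and the offset commit succeed or fail together
    producer.send_offsets_to_transaction(
        [TopicPartition(msg.topic(), msg.partition(), msg.offset() + 1)],
        consumer.consumer_group_metadata()
    )
    producer.commit_transaction()
```
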
### At-Least-Once Delivery Pattern

**SQS**:

```python
# WRONG: Delete before processing
message = sqs.receive_message(QueueUrl=queue_url)['Messages'][0]
sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message['ReceiptHandle'])
process(message['Body'])  # ❌ If this fails, message is lost

# CORRECT: Process, then delete
message = sqs.receive_message(
    QueueUrl=queue_url,
    VisibilityTimeout=300  # 5 minutes to process
)['Messages'][0]

try:
    process(json.loads(message['Body']))
    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message['ReceiptHandle'])
except Exception as e:
    # Message becomes visible again after timeout
    logger.error(f"Processing failed, will retry: {e}")
```

**Kafka**:

```python
from kafka import KafkaConsumer

# WRONG: Auto-commit before processing
consumer = KafkaConsumer(
    'orders',
    enable_auto_commit=True,  # ❌ Commits offset before processing
    auto_commit_interval_ms=5000
)

for msg in consumer:
    process(msg.value)  # Crash here = message lost

# CORRECT: Manual commit after processing
consumer = KafkaConsumer(
    'orders',
    enable_auto_commit=False
)

for msg in consumer:
    try:
        process(msg.value)
        consumer.commit()  # ✓ Commit only after success
    except Exception as e:
        logger.error(f"Processing failed, will retry: {e}")
        # Don't commit - message will be reprocessed
```

**RabbitMQ**:

```python
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

def callback(ch, method, properties, body):
    try:
        process(json.loads(body))
        ch.basic_ack(delivery_tag=method.delivery_tag)  # ✓ Ack after success
    except Exception as e:
        logger.error(f"Processing failed: {e}")
        ch.basic_nack(delivery_tag=method.delivery_tag, requeue=True)  # Requeue

channel.basic_consume(
    queue='orders',
    on_message_callback=callback,
    auto_ack=False  # ✓ Manual acknowledgment
)

channel.start_consuming()
```

### Idempotency (Critical for At-Least-Once)

Since at-least-once delivery means duplicates are possible, **all processing must be idempotent**:

```python
# Pattern 1: Database unique constraint
def process_order(order_id, data):
    db.execute(
        "INSERT INTO orders (id, user_id, amount, created_at) "
        "VALUES (%s, %s, %s, NOW()) "
        "ON CONFLICT (id) DO NOTHING",  # Idempotent
        (order_id, data['user_id'], data['amount'])
    )

# Pattern 2: Distributed lock (Redis)
def process_order_with_lock(order_id, data):
    lock_key = f"lock:order:{order_id}"

    # Try to acquire lock (60s TTL)
    if not redis.set(lock_key, "1", nx=True, ex=60):
        logger.info(f"Order {order_id} already being processed")
        return  # Duplicate, skip

    try:
        # Process order
        create_order(data)
        charge_payment(data['amount'])
    finally:
        redis.delete(lock_key)

# Pattern 3: Idempotency key table
def process_with_idempotency_key(message_id, data):
    with db.transaction():
        # Check if already processed
        result = db.execute(
            "SELECT 1 FROM processed_messages WHERE message_id = %s FOR UPDATE",
            (message_id,)
        )

        if result:
            return  # Already processed

        # Process + record atomically
        process_order(data)
        db.execute(
            "INSERT INTO processed_messages (message_id, processed_at) VALUES (%s, NOW())",
            (message_id,)
        )
```

## Ordering Guarantees

### Kafka: Partition-Level Ordering

**Kafka guarantees ordering within a partition**, not across partitions.

```python
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['kafka:9092'],
    key_serializer=str.encode,
    value_serializer=lambda v: json.dumps(v).encode()
)

# ✓ Partition key ensures ordering
def publish_order_event(user_id, event_type, data):
    producer.send(
        'orders',
        key=str(user_id),  # All events for a given user_id go to the same partition
        value={
            'event_type': event_type,
            'user_id': user_id,
            'data': data,
            'timestamp': time.time()
        }
    )

# All of user 123's events hash to the same partition → strict ordering
publish_order_event(123, 'order_placed', {...})
publish_order_event(123, 'payment_processed', {...})
publish_order_event(123, 'shipped', {...})
```

**Partition count determines max parallelism**:

```
Topic: orders (4 partitions)
Consumer group: order-processors

2 consumers → Each processes 2 partitions
4 consumers → Each processes 1 partition (max parallelism)
5 consumers → 1 consumer idle (wasted)

Rule: partition_count >= max_consumers_needed
```

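A minimal consumer-group sketch with `kafka-python`, assuming the four-partition `orders` topic above and the `process()` handler from earlier examples: run several copies of this process with the same `group_id` and the broker rebalances partitions across them, preserving per-key ordering.

```python
import json

from kafka import KafkaConsumer

# Every process that joins group 'order-processors' is assigned a share of the
# topic's partitions; adding processes (up to the partition count) adds
# parallelism without breaking per-key ordering.
consumer = KafkaConsumer(
    'orders',
    bootstrap_servers=['kafka:9092'],
    group_id='order-processors',
    enable_auto_commit=False,
    value_deserializer=lambda v: json.loads(v.decode())
)

for msg in consumer:
    process(msg.value)  # per-partition order is preserved
    consumer.commit()
```
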
### SQS FIFO: MessageGroupId Ordering

```python
import json

import boto3

sqs = boto3.client('sqs')

# FIFO queue guarantees ordering per MessageGroupId
sqs.send_message(
    QueueUrl='orders.fifo',
    MessageBody=json.dumps(event),
    MessageGroupId=f"user-{user_id}",  # Like a Kafka partition key
    MessageDeduplicationId=f"{event_id}-{timestamp}"  # Prevent duplicates
)

# FIFO throughput limit: 300 msg/s per queue without batching (~3k with batching,
# higher with high-throughput mode)
# Multiple MessageGroupIds allow parallel consumption across groups
```

### RabbitMQ: Single Consumer Ordering

```python
# RabbitMQ guarantees ordering only with a single consumer per queue
channel.basic_qos(prefetch_count=1)  # Process one at a time

channel.basic_consume(
    queue='orders',
    on_message_callback=callback,
    auto_ack=False
)

# Multiple consumers break ordering unless messages are sharded with
# consistent hashing (see the sketch below)
```

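One way to keep ordering while scaling out consumers is RabbitMQ's consistent-hash exchange (`rabbitmq_consistent_hash_exchange` plugin): each entity's routing key always hashes to the same queue, so a single consumer per queue sees that entity's messages in order. A sketch, assuming the plugin is enabled; exchange and queue names are illustrative:

```python
import json

import pika

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

# 'x-consistent-hash' exchanges route by hashing the message's routing key
channel.exchange_declare(exchange='orders-hash', exchange_type='x-consistent-hash')

for queue in ('orders-shard-1', 'orders-shard-2'):
    channel.queue_declare(queue=queue, durable=True)
    # For this exchange type the binding key is a numeric weight, not a pattern
    channel.queue_bind(exchange='orders-hash', queue=queue, routing_key='1')

# All of user 123's messages hash to the same shard queue,
# so one consumer on that queue sees them in order
channel.basic_publish(
    exchange='orders-hash',
    routing_key='user-123',  # hashed to pick the shard
    body=json.dumps({'user_id': 123, 'event': 'order_placed'})
)
```
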
## Dead Letter Queues (DLQ)

### Retry Strategy with Exponential Backoff

**SQS with DLQ**:

```python
# Infrastructure setup
main_queue = sqs.create_queue(
    QueueName='orders',
    Attributes={
        'RedrivePolicy': json.dumps({
            'deadLetterTargetArn': dlq_arn,
            'maxReceiveCount': '3'  # After 3 failures → DLQ
        }),
        'VisibilityTimeout': '300'
    }
)

# Consumer with retry logic
def process_with_retry(message):
    attempt = int(message.attributes.get('ApproximateReceiveCount', 1))

    try:
        process_order(json.loads(message.body))
        message.delete()

    except RetriableError as e:
        # Exponential backoff capped at 5 minutes: 20s, 40s, 80s, ...
        # (ApproximateReceiveCount starts at 1)
        backoff = min(300, 2 ** attempt * 10)
        message.change_visibility(VisibilityTimeout=backoff)
        logger.warning(f"Retriable error (attempt {attempt}), retry in {backoff}s")

    except PermanentError as e:
        # Send to DLQ immediately
        logger.error(f"Permanent error: {e}")
        send_to_dlq(message, error=str(e))
        message.delete()

# Error classification
class RetriableError(Exception):
    """Network timeout, rate limit, DB unavailable"""
    pass

class PermanentError(Exception):
    """Invalid data, missing field, business rule violation"""
    pass
```

**Kafka DLQ Pattern**:

```python
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer('orders', group_id='processor', enable_auto_commit=False)
retry_producer = KafkaProducer(bootstrap_servers=['kafka:9092'])
dlq_producer = KafkaProducer(bootstrap_servers=['kafka:9092'])

def process_with_dlq(message):
    headers = dict(message.headers or [])
    retry_count = int(headers.get('retry_count', b'0'))

    try:
        process_order(message.value)
        consumer.commit()

    except RetriableError as e:
        if retry_count < 3:
            # Send to a retry topic with delay
            delay_minutes = 2 ** retry_count  # 1min, 2min, 4min
            retry_producer.send(
                f'orders-retry-{delay_minutes}min',
                value=message.value,
                headers=[('retry_count', str(retry_count + 1).encode())]
            )
        else:
            # Max retries → DLQ
            dlq_producer.send(
                'orders-dlq',
                value=message.value,
                headers=[('error', str(e).encode()),
                         ('retry_count', str(retry_count).encode())]
            )
        consumer.commit()  # Don't reprocess from the main topic

    except PermanentError as e:
        # Immediate DLQ
        dlq_producer.send('orders-dlq', value=message.value,
                          headers=[('error', str(e).encode())])
        consumer.commit()
```

### DLQ Monitoring & Recovery

```python
# Alert on DLQ depth
def check_dlq_depth():
    attrs = sqs.get_queue_attributes(
        QueueUrl=dlq_url,
        AttributeNames=['ApproximateNumberOfMessages']
    )
    depth = int(attrs['Attributes']['ApproximateNumberOfMessages'])

    if depth > 10:
        alert(f"DLQ has {depth} messages - investigate!")

# Manual recovery
def replay_from_dlq():
    """Fix root cause, then replay"""
    messages = dlq.receive_messages(MaxNumberOfMessages=10)

    for msg in messages:
        data = json.loads(msg.body)

        # Fix data issue
        if 'customer_email' not in data:
            data['customer_email'] = lookup_email(data['user_id'])

        # Replay to main queue
        main_queue.send_message(MessageBody=json.dumps(data))
        msg.delete()
```

## Message Schema Evolution

### Versioning Strategies

**Pattern 1: Version field in message**:

```python
# v1 message
{
    "version": "1.0",
    "order_id": "123",
    "amount": 99.99
}

# v2 message (added currency)
{
    "version": "2.0",
    "order_id": "123",
    "amount": 99.99,
    "currency": "USD"
}

# Consumer handles both versions
def process_order(message):
    if message['version'] == "1.0":
        amount = message['amount']
        currency = "USD"  # Default for v1
    elif message['version'] == "2.0":
        amount = message['amount']
        currency = message['currency']
    else:
        raise ValueError(f"Unsupported version: {message['version']}")
```

**Pattern 2: Apache Avro (Kafka best practice)**:

```python
from confluent_kafka import avro
from confluent_kafka.avro import AvroProducer, AvroConsumer

# Define schema ('currency' has a default → backward compatible)
value_schema = avro.loads('''
{
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"}
    ]
}
''')

# Producer
producer = AvroProducer({
    'bootstrap.servers': 'kafka:9092',
    'schema.registry.url': 'http://schema-registry:8081'
}, default_value_schema=value_schema)

producer.produce(topic='orders', value={
    'order_id': '123',
    'amount': 99.99,
    'currency': 'USD'
})

# Consumer automatically validates schema
consumer = AvroConsumer({
    'bootstrap.servers': 'kafka:9092',
    'group.id': 'processor',
    'schema.registry.url': 'http://schema-registry:8081'
})
```

**Avro Schema Evolution Rules**:

| Change | Compatible? | Notes |
|--------|-------------|-------|
| Add field with default | ✓ Backward compatible | Old consumers ignore new field |
| Remove field | ✓ Forward compatible | New consumers must handle missing field |
| Rename field | ❌ Breaking | Requires migration |
| Change field type | ❌ Breaking | Requires new topic or migration |

**Pattern 3: Protobuf (alternative to Avro)**:

```protobuf
syntax = "proto3";

message Order {
  string order_id = 1;
  double amount = 2;
  string currency = 3;  // New field, backward compatible
}
```

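A sketch of how producers and consumers might use the generated bindings, assuming `protoc --python_out=.` has produced an `order_pb2` module from the schema above (module and topic names are illustrative):

```python
from kafka import KafkaConsumer, KafkaProducer

import order_pb2  # generated by: protoc --python_out=. order.proto

producer = KafkaProducer(bootstrap_servers=['kafka:9092'])

order = order_pb2.Order(order_id='123', amount=99.99, currency='USD')
producer.send('orders', value=order.SerializeToString())  # compact binary payload

consumer = KafkaConsumer('orders', bootstrap_servers=['kafka:9092'])
for msg in consumer:
    parsed = order_pb2.Order.FromString(msg.value)
    # Consumers compiled against the old schema simply ignore the unknown
    # 'currency' field → backward compatible
    print(parsed.order_id, parsed.amount)
```
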
### Schema Registry (Kafka)

```
Producer → Schema Registry (validate) → Kafka
Consumer → Kafka → Schema Registry (deserialize)

Benefits:
- Centralized schema management
- Automatic validation
- Schema evolution enforcement
- Type safety
```

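In practice, a CI step can ask the registry whether a proposed schema is compatible before producers are deployed. A sketch against the Confluent Schema Registry REST API (registry URL and subject name are illustrative):

```python
import json

import requests

REGISTRY = 'http://schema-registry:8081'  # illustrative
SUBJECT = 'orders-value'

new_schema = json.dumps({
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "order_id", "type": "string"},
        {"name": "amount", "type": "double"},
        {"name": "currency", "type": "string", "default": "USD"},
    ],
})

# 1. Check compatibility against the latest registered version
resp = requests.post(
    f'{REGISTRY}/compatibility/subjects/{SUBJECT}/versions/latest',
    headers={'Content-Type': 'application/vnd.schemaregistry.v1+json'},
    json={'schema': new_schema},
)
if not resp.json().get('is_compatible', False):
    raise SystemExit('Schema change is not backward compatible - aborting deploy')

# 2. Register the new version
requests.post(
    f'{REGISTRY}/subjects/{SUBJECT}/versions',
    headers={'Content-Type': 'application/vnd.schemaregistry.v1+json'},
    json={'schema': new_schema},
)
```
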
## Monitoring & Observability

### Key Metrics

| Metric | Alert Threshold | Why It Matters |
|--------|----------------|----------------|
| **Queue depth** | > 1,000 messages (or > 5 min of backlog) | Consumers can't keep up |
| **Consumer lag** (Kafka) | > 100k messages or > 5 min | Consumers falling behind |
| **DLQ depth** | > 10 | Messages failing repeatedly |
| **Processing time p99** | > 5 seconds | Slow processing blocks the queue |
| **Error rate** | > 5% | Widespread failures |
| **Redelivery rate** | > 10% | Idempotency issues or transient errors |

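These metrics are easiest to act on when exported to your monitoring system. A minimal sketch that publishes SQS queue depth as Prometheus gauges using `prometheus_client` (queue URL, metric names, and scrape interval are illustrative):

```python
import time

import boto3
from prometheus_client import Gauge, start_http_server

sqs = boto3.client('sqs')
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/orders'  # illustrative

queue_depth = Gauge('sqs_queue_depth', 'Approximate visible messages', ['queue'])
in_flight = Gauge('sqs_messages_in_flight', 'Approximate messages not visible', ['queue'])

def scrape():
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=['ApproximateNumberOfMessages',
                        'ApproximateNumberOfMessagesNotVisible']
    )['Attributes']
    queue_depth.labels(queue='orders').set(int(attrs['ApproximateNumberOfMessages']))
    in_flight.labels(queue='orders').set(int(attrs['ApproximateNumberOfMessagesNotVisible']))

if __name__ == '__main__':
    start_http_server(9108)  # Prometheus scrapes this endpoint
    while True:
        scrape()
        time.sleep(60)
```
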
### Consumer Lag Monitoring (Kafka)

```python
from kafka import KafkaAdminClient, KafkaConsumer, TopicPartition

admin = KafkaAdminClient(bootstrap_servers=['kafka:9092'])

def check_consumer_lag(group_id, topic):
    # Get committed offsets for the group
    committed = admin.list_consumer_group_offsets(group_id)

    # Get latest offsets (high-water marks)
    consumer = KafkaConsumer(bootstrap_servers=['kafka:9092'])
    partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]
    latest = consumer.end_offsets(partitions)

    # Calculate lag
    total_lag = 0
    for partition in partitions:
        committed_offset = committed[partition].offset
        latest_offset = latest[partition]
        lag = latest_offset - committed_offset
        total_lag += lag

        if lag > 10000:
            alert(f"Partition {partition.partition} lag: {lag}")

    return total_lag

# Alert if total lag > 100k
if check_consumer_lag('order-processor', 'orders') > 100000:
    alert("Consumer lag critical!")
```

### Distributed Tracing Across Queues

```python
from opentelemetry import trace
from opentelemetry.propagate import inject, extract

tracer = trace.get_tracer(__name__)

# Producer: Inject trace context
def publish_with_trace(topic, message):
    with tracer.start_as_current_span("publish-order") as span:
        headers = {}
        inject(headers)  # Inject trace context into the carrier dict

        producer.send(
            topic,
            value=message,
            headers=[(k, v.encode()) for k, v in headers.items()]  # Kafka header values must be bytes
        )

# Consumer: Extract trace context
def consume_with_trace(message):
    carrier = {k: v.decode() for k, v in (message.headers or [])}
    context = extract(carrier)

    with tracer.start_as_current_span("process-order", context=context) as span:
        process_order(message.value)
        span.set_attribute("order.id", message.value['order_id'])

# Trace spans: API → Producer → Queue → Consumer → DB
# Shows end-to-end latency including queue wait time
```

## Backpressure & Circuit Breakers

### Rate Limiting Consumers

```python
import time
from collections import deque

class RateLimitedConsumer:
    def __init__(self, max_per_second=100):
        self.max_per_second = max_per_second
        self.requests = deque()

    def consume(self, message):
        now = time.time()

        # Remove requests older than 1 second
        while self.requests and self.requests[0] < now - 1:
            self.requests.popleft()

        # Check rate limit
        if len(self.requests) >= self.max_per_second:
            sleep_time = 1 - (now - self.requests[0])
            time.sleep(sleep_time)

        self.requests.append(time.time())
        process(message)
```

### Circuit Breaker for Downstream Dependencies

```python
import requests
from circuitbreaker import circuit, CircuitBreakerError

@circuit(failure_threshold=5, recovery_timeout=60)
def call_payment_service(order_id, amount):
    response = requests.post(
        'https://payment-service/charge',
        json={'order_id': order_id, 'amount': amount},
        timeout=5
    )

    if response.status_code >= 500:
        raise ServiceUnavailableError()

    return response.json()

def process_order(message):
    try:
        result = call_payment_service(message['order_id'], message['amount'])
        # ... continue processing
    except CircuitBreakerError:
        # Circuit open - don't overwhelm the failing service
        logger.warning("Payment service circuit open, requeueing message")
        raise RetriableError("Circuit breaker open")
```

## Advanced Patterns

### Outbox Pattern (Reliable Publishing)

**Problem**: How to atomically update the database AND publish a message?

```python
# ❌ WRONG: Dual write (can fail between DB and queue)
def create_order(data):
    db.execute("INSERT INTO orders (...) VALUES (...)")
    producer.send('orders', data)  # ❌ If this fails, DB is updated but no event is published

# ✓ CORRECT: Outbox pattern
def create_order_with_outbox(data):
    with db.transaction():
        # 1. Insert order
        db.execute("INSERT INTO orders (id, user_id, amount) VALUES (%s, %s, %s)",
                   (data['id'], data['user_id'], data['amount']))

        # 2. Insert into outbox (same transaction)
        db.execute("INSERT INTO outbox (event_type, payload) VALUES (%s, %s)",
                   ('order.created', json.dumps(data)))

    # A separate process reads the outbox and publishes

# Outbox processor (separate worker)
def process_outbox():
    while True:
        events = db.execute("SELECT * FROM outbox WHERE published_at IS NULL LIMIT 10")

        for event in events:
            try:
                producer.send(event['event_type'], json.loads(event['payload']))
                db.execute("UPDATE outbox SET published_at = NOW() WHERE id = %s", (event['id'],))
            except Exception as e:
                logger.error(f"Failed to publish event {event['id']}: {e}")
                # Will retry on next iteration

        time.sleep(1)
```

### Saga Pattern (Distributed Transactions)

See the `microservices-architecture` skill for full saga patterns (choreography vs orchestration).

**Quick reference for a message-based saga**:

```python
import uuid

# Order saga coordinator publishes commands
def create_order_saga(order_data):
    saga_id = str(uuid.uuid4())

    # Step 1: Reserve inventory
    producer.send('inventory-commands', {
        'command': 'reserve',
        'saga_id': saga_id,
        'order_id': order_data['order_id'],
        'items': order_data['items']
    })

    # Inventory service responds on 'inventory-events'
    # If success → proceed to step 2
    # If failure → compensate (cancel order)
```

## Security

### Message Encryption

**SQS**: Server-side encryption (SSE) with KMS

```python
sqs.create_queue(
    QueueName='orders-encrypted',
    Attributes={
        'KmsMasterKeyId': 'alias/my-key',  # AWS KMS key
        'KmsDataKeyReusePeriodSeconds': '300'
    }
)
```

**Kafka**: Encryption in transit + at rest

```python
# SSL/TLS for in-transit encryption
producer = KafkaProducer(
    bootstrap_servers=['kafka:9093'],
    security_protocol='SSL',
    ssl_cafile='/path/to/ca-cert',
    ssl_certfile='/path/to/client-cert',
    ssl_keyfile='/path/to/client-key'
)

# Encryption at rest (Kafka broker config)
# log.dirs=/encrypted-volume  # Use encrypted EBS volumes
```

### Authentication & Authorization

**SQS**: IAM policies

```json
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {"AWS": "arn:aws:iam::123456789012:role/OrderService"},
    "Action": ["sqs:SendMessage"],
    "Resource": "arn:aws:sqs:us-east-1:123456789012:orders"
  }]
}
```

**Kafka**: SASL/SCRAM authentication

```python
producer = KafkaProducer(
    bootstrap_servers=['kafka:9093'],
    security_protocol='SASL_SSL',
    sasl_mechanism='SCRAM-SHA-512',
    sasl_plain_username='order-service',
    sasl_plain_password='secret'
)
```

**Kafka ACLs** (authorization):

```bash
# Grant order-service permission to write to the orders topic
kafka-acls --bootstrap-server kafka:9092 --add \
  --allow-principal User:order-service \
  --operation Write \
  --topic orders
```

## Testing Strategies

### Local Testing

**LocalStack for SQS/SNS**:

```yaml
# docker-compose.yml
services:
  localstack:
    image: localstack/localstack
    environment:
      - SERVICES=sqs,sns
```

```python
# Test code
import boto3

sqs = boto3.client(
    'sqs',
    endpoint_url='http://localhost:4566',  # LocalStack
    region_name='us-east-1'
)

queue_url = sqs.create_queue(QueueName='test-orders')['QueueUrl']
sqs.send_message(QueueUrl=queue_url, MessageBody='test')
```

**Kafka in Docker**:

```yaml
# docker-compose.yml
services:
  zookeeper:
    image: confluentinc/cp-zookeeper:latest
    environment:
      ZOOKEEPER_CLIENT_PORT: 2181

  kafka:
    image: confluentinc/cp-kafka:latest
    ports:
      - "9092:9092"
    environment:
      KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
```

### Integration Testing

```python
import json

import pytest
from kafka import KafkaConsumer, KafkaProducer
from testcontainers.kafka import KafkaContainer

@pytest.fixture
def kafka():
    with KafkaContainer() as kafka:
        yield kafka.get_bootstrap_server()

def test_order_processing(kafka):
    producer = KafkaProducer(bootstrap_servers=kafka)
    consumer = KafkaConsumer('orders', bootstrap_servers=kafka, auto_offset_reset='earliest')

    # Publish message
    producer.send('orders', value=b'{"order_id": "123"}')
    producer.flush()

    # Consume and verify
    message = next(consumer)
    assert json.loads(message.value)['order_id'] == '123'
```

### Chaos Engineering

```python
import subprocess
import time

# Test consumer failure recovery
def test_consumer_crash_recovery():
    # Start consumer
    consumer_process = subprocess.Popen(['python', 'consumer.py'])
    time.sleep(2)

    # Publish message
    producer.send('orders', value=test_order)
    producer.flush()

    # Kill consumer mid-processing
    consumer_process.kill()

    # Restart consumer
    consumer_process = subprocess.Popen(['python', 'consumer.py'])
    time.sleep(5)

    # Verify the message was reprocessed without duplicating data (idempotency!)
    assert db.execute("SELECT COUNT(*) FROM orders WHERE id = %s", (test_order['id'],))[0] == 1
```

## Anti-Patterns

| Anti-Pattern | Why Bad | Fix |
|--------------|---------|-----|
| **Auto-ack before processing** | Messages lost on crash | Manual ack after processing |
| **No idempotency** | Duplicates cause data corruption | Unique constraints, locks, or idempotency keys |
| **No DLQ** | Poison messages block the queue | Configure DLQ with maxReceiveCount |
| **No monitoring** | Can't detect consumer lag or failures | Monitor lag, depth, error rate |
| **Synchronous message processing** | Low throughput | Batch processing, parallel consumers |
| **Large messages** | Exceed queue limits, slow to transfer | Store in S3, send a reference in the message (see the sketch below) |
| **No schema versioning** | Breaking changes break consumers | Use Avro/Protobuf with a schema registry |
| **Shared consumer instances** | Race conditions, duplicate processing | Use consumer groups (Kafka) or visibility timeout (SQS) |

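For the **Large messages** row, the usual fix is the claim-check pattern: store the payload in object storage and enqueue only a reference. A sketch with S3 and SQS (bucket and queue names are illustrative; `process_order` is the handler from earlier examples):

```python
import json
import uuid

import boto3

s3 = boto3.client('s3')
sqs = boto3.client('sqs')
BUCKET = 'order-payloads'  # illustrative
QUEUE_URL = 'https://sqs.us-east-1.amazonaws.com/123456789012/orders'  # illustrative

def publish_large_order(order):
    # Producer: store the full payload in S3, enqueue a small reference
    key = f'orders/{uuid.uuid4()}.json'
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(order).encode())
    sqs.send_message(QueueUrl=QUEUE_URL,
                     MessageBody=json.dumps({'s3_bucket': BUCKET, 's3_key': key}))

def handle(message_body):
    # Consumer: resolve the reference back into the payload
    ref = json.loads(message_body)
    obj = s3.get_object(Bucket=ref['s3_bucket'], Key=ref['s3_key'])
    order = json.loads(obj['Body'].read())
    process_order(order)
```
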
## Technology-Specific Patterns

### RabbitMQ Exchanges

```python
# Topic exchange for routing
channel.exchange_declare(exchange='orders', exchange_type='topic')

# Bind queues with patterns
channel.queue_bind(exchange='orders', queue='us-orders', routing_key='order.us.*')
channel.queue_bind(exchange='orders', queue='eu-orders', routing_key='order.eu.*')

# Publish with routing key
channel.basic_publish(
    exchange='orders',
    routing_key='order.us.california',  # Goes to us-orders queue
    body=json.dumps(order)
)

# Fanout exchange for pub/sub
channel.exchange_declare(exchange='analytics', exchange_type='fanout')
# All bound queues receive every message
```

### Kafka Connect (Data Integration)

```json
{
  "name": "mysql-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:mysql://localhost:3306/mydb",
    "table.whitelist": "orders",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "mysql-"
  }
}
```

**Use cases**:
- Stream DB changes to Kafka (CDC)
- Sink Kafka to Elasticsearch, S3, databases
- No custom code needed for common integrations (see the deployment sketch below)

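Connector configs like the one above are typically submitted to a Connect worker's REST API. A sketch (the Connect worker URL is illustrative):

```python
import json

import requests

CONNECT_URL = 'http://kafka-connect:8083'  # illustrative

connector = {
    "name": "mysql-source",
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:mysql://localhost:3306/mydb",
        "table.whitelist": "orders",
        "mode": "incrementing",
        "incrementing.column.name": "id",
        "topic.prefix": "mysql-",
    },
}

# Create the connector
resp = requests.post(f'{CONNECT_URL}/connectors',
                     headers={'Content-Type': 'application/json'},
                     data=json.dumps(connector))
resp.raise_for_status()

# Check its status later
print(requests.get(f'{CONNECT_URL}/connectors/mysql-source/status').json())
```
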
## Batching Optimizations

### Batch Size Tuning

```python
from concurrent.futures import ThreadPoolExecutor

# SQS batch receiving (up to 10 messages)
response = sqs.receive_message(
    QueueUrl=queue_url,
    MaxNumberOfMessages=10,  # Fetch up to 10 at once
    WaitTimeSeconds=20  # Long polling (reduces empty receives)
)
messages = response.get('Messages', [])

# Process in parallel
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = [executor.submit(process, msg) for msg in messages]
    for future in futures:
        future.result()

# Kafka batch consuming
consumer = KafkaConsumer(
    'orders',
    max_poll_records=500,  # Fetch up to 500 messages per poll
    fetch_min_bytes=1024  # Wait for at least 1KB before returning
)

while True:
    records = consumer.poll(timeout_ms=1000)  # {TopicPartition: [messages]}
    for tp, messages in records.items():
        batch_process(messages)  # Process the batch at once
```

**Batch size tradeoffs**:

| Batch Size | Throughput | Latency | Memory |
|------------|------------|---------|--------|
| 1 | Low | Low | Low |
| 10-100 | Medium | Medium | Medium |
| 500+ | High | High | High |

**Recommendation**: Start with 10-100, and increase for higher throughput if latency allows.

## Cross-References

**Related skills**:
- **Microservices communication** → `microservices-architecture` (saga, event-driven)
- **FastAPI async** → `fastapi-development` (consuming queues in FastAPI)
- **REST vs async** → `rest-api-design` (when to use queues vs HTTP)
- **Security** → `ordis-security-architect` (encryption, IAM, compliance)
- **Testing** → `api-testing` (integration testing strategies)

## Further Reading

- **Enterprise Integration Patterns** by Gregor Hohpe (message patterns)
- **Designing Data-Intensive Applications** by Martin Kleppmann (Kafka internals)
- **RabbitMQ in Action** by Alvaro Videla and Jason J. W. Williams
- **Kafka: The Definitive Guide** by Neha Narkhede, Gwen Shapira, and Todd Palino
- **AWS SQS Best Practices**: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-best-practices.html