---
name: kafka-architecture
description: Expert knowledge of Apache Kafka architecture, cluster design, capacity planning, partitioning strategies, replication, and high availability. Auto-activates on keywords kafka architecture, cluster sizing, partition strategy, replication factor, kafka ha, kafka scalability, broker count, topic design, kafka performance, kafka capacity planning.
---
# Kafka Architecture & Design Expert
Comprehensive knowledge of Apache Kafka architecture patterns, cluster design principles, and production best practices for building resilient, scalable event streaming platforms.
## Core Architecture Concepts
### Kafka Cluster Components
**Brokers**:
- Individual Kafka servers that store and serve data
- Each broker handles thousands of partitions
- Typical: 3-10 brokers per cluster (small), 10-100+ (large enterprises)
**Controller**:
- One broker elected as controller (via KRaft or ZooKeeper)
- Manages partition leaders and replica assignments
- Failure triggers automatic re-election
**Topics**:
- Logical channels for message streams
- Divided into partitions for parallelism
- Can have different retention policies per topic
**Partitions**:
- Ordered, immutable sequence of records
- Unit of parallelism (each partition is consumed by at most one consumer in a group)
- Distributed across brokers for load balancing
**Replicas**:
- Copies of partitions across multiple brokers
- 1 leader replica (serves reads/writes)
- N-1 follower replicas (replication only)
- In-Sync Replicas (ISR): Followers caught up with leader
### KRaft vs ZooKeeper Mode
**KRaft Mode** (Recommended, Kafka 3.3+):
```yaml
Cluster Metadata:
- Stored in Kafka itself (no external ZooKeeper)
- Metadata topic: __cluster_metadata
- Controller quorum (3 or 5 nodes)
- Faster failover (<1s vs 10-30s)
- Simplified operations
```
**ZooKeeper Mode** (Legacy; deprecated since Kafka 3.5, removed in 4.0):
```yaml
External Coordination:
- Requires separate ZooKeeper ensemble (3-5 nodes)
- Stores cluster metadata, configs, ACLs
- Slower failover (10-30 seconds)
- More complex to operate
```
**Migration**: ZooKeeper → KRaft migration supported in Kafka 3.6+
## Cluster Sizing Guidelines
### Small Cluster (Development/Testing)
```yaml
Configuration:
Brokers: 3
Partitions per broker: ~100-500
Total partitions: 300-1500
Replication factor: 3
Hardware:
- CPU: 4-8 cores
- RAM: 8-16 GB
- Disk: 500 GB - 1 TB SSD
- Network: 1 Gbps
Use Cases:
- Development environments
- Low-volume production (<10 MB/s)
- Proof of concepts
- Single datacenter
Example Workload:
- 50 topics
- 5-10 partitions per topic
- 1 million messages/day
- 7-day retention
```
### Medium Cluster (Standard Production)
```yaml
Configuration:
Brokers: 6-12
Partitions per broker: 500-2000
Total partitions: 3K-24K
Replication factor: 3
Hardware:
- CPU: 16-32 cores
- RAM: 64-128 GB
- Disk: 2-8 TB NVMe SSD
- Network: 10 Gbps
Use Cases:
- Standard production workloads
- Multi-team environments
- Regional deployments
- Up to 500 MB/s throughput
Example Workload:
- 200-500 topics
- 10-50 partitions per topic
- 100 million messages/day
- 30-day retention
```
### Large Cluster (High-Scale Production)
```yaml
Configuration:
Brokers: 20-100+
Partitions per broker: 2000-4000
Total partitions: 40K-400K+
Replication factor: 3
Hardware:
- CPU: 32-64 cores
- RAM: 128-256 GB
- Disk: 8-20 TB NVMe SSD
- Network: 25-100 Gbps
Use Cases:
- Large enterprises
- Multi-region deployments
- Event-driven architectures
- 1+ GB/s throughput
Example Workload:
- 1000+ topics
- 50-200 partitions per topic
- 1+ billion messages/day
- 90-365 day retention
```
### Kafka Streams / Exactly-Once Semantics (EOS) Clusters
```yaml
Configuration:
Brokers: 6-12+ (same as standard, but more control plane load)
Partitions per broker: 500-1500 (fewer due to transaction overhead)
Total partitions: 3K-18K
Replication factor: 3
Hardware:
- CPU: 16-32 cores (more CPU for transactions)
- RAM: 64-128 GB
- Disk: 4-12 TB NVMe SSD (more for transaction logs)
- Network: 10-25 Gbps
Special Considerations:
- More brokers due to transaction coordinator load
- Lower partition count per broker (transactions = more overhead)
- Higher disk IOPS for transaction logs
- min.insync.replicas=2 strongly recommended for EOS durability
- acks=all required for producers
Use Cases:
- Stream processing with exactly-once guarantees
- Financial transactions
- Event sourcing with strict ordering
- Multi-step workflows requiring atomicity
```
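The disk figures above follow from sustained ingress × retention × replication factor, split across brokers with headroom to spare. A minimal TypeScript sketch of that back-of-envelope arithmetic (the 40% headroom and the sample inputs are illustrative assumptions, not fixed rules):
```typescript
// Rough disk sizing: ingress × retention × replication, split across brokers,
// with headroom left for rebalances, growth, and open segments. Inputs are illustrative.
interface SizingInput {
  writeMBps: number;          // sustained producer ingress for the whole cluster
  retentionDays: number;
  replicationFactor: number;
  brokers: number;
  headroom: number;           // fraction of disk to keep free, e.g. 0.4
}

function diskPerBrokerGB(input: SizingInput): number {
  const retainedSeconds = input.retentionDays * 24 * 60 * 60;
  const totalGB = (input.writeMBps * retainedSeconds * input.replicationFactor) / 1024;
  return Math.ceil(totalGB / input.brokers / (1 - input.headroom));
}

// Example: 5 MB/s sustained, 7-day retention, RF=3, 6 brokers, 40% headroom.
console.log(diskPerBrokerGB({
  writeMBps: 5,
  retentionDays: 7,
  replicationFactor: 3,
  brokers: 6,
  headroom: 0.4,
})); // ≈ 2461 GB (~2.5 TB) per broker
```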
## Partitioning Strategy
### How Many Partitions?
**Formula**:
```
Partitions = max(
  Target Write Throughput / Per-Partition Write Throughput,
  Target Read Throughput / Per-Partition Read Throughput,
  Number of Consumers (for parallelism)
) × Growth Factor (2-3x)
Single Partition Limits:
- Write throughput: ~10-50 MB/s
- Read throughput: ~30-100 MB/s
- Message rate: ~10K-100K msg/s
```
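A minimal TypeScript sketch of this calculation; the default per-partition limits are assumptions taken from the conservative end of the ranges above, and the helper name is illustrative. It reproduces the high-throughput example that follows.
```typescript
// Partition count = max(write-bound, read-bound, consumer parallelism) × growth factor.
function recommendedPartitions(opts: {
  writeMBps: number;            // target produce rate for the topic
  readMBps: number;             // aggregate consume rate across consumer groups
  maxConsumersInGroup: number;  // desired parallelism of the largest consumer group
  perPartitionWriteMBps?: number;
  perPartitionReadMBps?: number;
  growthFactor?: number;
}): number {
  const writeCap = opts.perPartitionWriteMBps ?? 20; // conservative end of 10-50 MB/s
  const readCap = opts.perPartitionReadMBps ?? 40;   // conservative end of 30-100 MB/s
  const growth = opts.growthFactor ?? 3;

  const base = Math.max(
    Math.ceil(opts.writeMBps / writeCap),
    Math.ceil(opts.readMBps / readCap),
    opts.maxConsumersInGroup,
  );
  return base * growth;
}

// High-throughput example: 200 MB/s writes, 500 MB/s reads, up to 10 consumers.
console.log(recommendedPartitions({ writeMBps: 200, readMBps: 500, maxConsumersInGroup: 10 }));
// max(10, 13, 10) × 3 = 39 → round up to 40-50 partitions
```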
**Examples**:
**High Throughput Topic** (Logs, Events):
```yaml
Requirements:
- Write: 200 MB/s
- Read: 500 MB/s (multiple consumers)
- Expected growth: 3x in 1 year
Calculation:
Write partitions: 200 MB/s ÷ 20 MB/s = 10
Read partitions: 500 MB/s ÷ 40 MB/s = 13
Growth factor: 13 × 3 = 39
Recommendation: 40-50 partitions
```
**Low-Latency Topic** (Commands, Requests):
```yaml
Requirements:
- Write: 5 MB/s
- Read: 10 MB/s
- Latency: <10ms p99
- Order preservation: By user ID
Calculation:
Throughput partitions: 5 MB/s ÷ 20 MB/s = 1
Parallelism: 4 (allow several consumers in the group and spread load)
Recommendation: 4-6 partitions (keyed by user ID)
```
**Dead Letter Queue**:
```yaml
Recommendation: 1-3 partitions
Reason: Low volume, order less important
```
### Partition Key Selection
**Good Keys** (High Cardinality, Even Distribution):
```yaml
✅ User ID (UUIDs):
- Millions of unique values
- Even distribution
- Example: "user-123e4567-e89b-12d3-a456-426614174000"
✅ Device ID (IoT):
- Unique per device
- Natural sharding
- Example: "device-sensor-001-zone-a"
✅ Order ID (E-commerce):
- Unique per transaction
- Even temporal distribution
- Example: "order-2024-11-15-abc123"
```
**Bad Keys** (Low Cardinality, Hotspots):
```yaml
❌ Country Code:
- Only ~200 values
- Uneven (US, CN >> others)
- Creates partition hotspots
❌ Boolean Flags:
- Only 2 values (true/false)
- Severe imbalance
❌ Date (YYYY-MM-DD):
- All today's traffic → 1 partition
- Temporal hotspot
```
**Compound Keys** (Best of Both):
```yaml
✅ Country + User ID:
- Partition by country for locality
- Sub-partition by user for distribution
- Example: "US:user-123" → hash("US:user-123")
✅ Tenant + Event Type + Timestamp:
- Multi-tenant isolation
- Event type grouping
- Temporal ordering
```
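The Java client's default partitioner hashes the serialized key (murmur2) and takes the result modulo the partition count, so every record with the same compound key lands on the same partition. A toy TypeScript illustration of that idea; the hash below is a stand-in for readability, not Kafka's actual algorithm:
```typescript
// Illustrative only: a compound key keeps related records on one partition while
// high key cardinality spreads load evenly across all partitions.
function toPartition(key: string, numPartitions: number): number {
  let hash = 0;
  for (const ch of key) {
    hash = (hash * 31 + ch.charCodeAt(0)) | 0; // simple 32-bit rolling hash
  }
  return Math.abs(hash) % numPartitions;
}

// Compound key: country prefix for locality + user ID for distribution.
const key = "US:user-123e4567-e89b-12d3-a456-426614174000";
console.log(toPartition(key, 48)); // all events for this user land on the same partition
```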
## Replication & High Availability
### Replication Factor Guidelines
```yaml
Development:
Replication Factor: 1
Reason: Fast, no durability needed
Production (Standard):
Replication Factor: 3
Reason: Balance durability vs cost
Tolerates: 1 broker failure with writes still accepted (min.insync.replicas=2); data survives 2 failures
Production (Critical):
Replication Factor: 5
Reason: Maximum durability
Tolerates: 2 broker failures with writes still accepted (min.insync.replicas=3); data survives 4 failures
Use Cases: Financial transactions, audit logs
Multi-Datacenter:
Replication Factor: 3 per cluster (6 copies across both DCs)
Reason: DC-level fault tolerance
Requires: MirrorMaker 2 or Confluent Replicator
```
### min.insync.replicas
**Configuration**:
```yaml
min.insync.replicas=2:
- At least 2 replicas must acknowledge writes
- Typical for replication.factor=3
- Prevents data loss if 1 broker fails
min.insync.replicas=1:
- Only leader must acknowledge (dangerous!)
- Use only for non-critical topics
min.insync.replicas=3:
- At least 3 replicas must acknowledge
- For replication.factor=5 (critical systems)
```
**Rule**: `min.insync.replicas ≤ replication.factor - 1` (to allow 1 replica failure)
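For example, a topic following the standard production guidance (replication.factor=3, min.insync.replicas=2) could be created as sketched below with the kafkajs admin client; the client choice, topic name, and broker address are assumptions for illustration.
```typescript
// Sketch: create a topic with RF=3 and min.insync.replicas=2 via the kafkajs admin API.
import { Kafka } from 'kafkajs';

async function createOrdersTopic(): Promise<void> {
  const kafka = new Kafka({ clientId: 'topic-admin', brokers: ['broker1:9092'] });
  const admin = kafka.admin();
  await admin.connect();
  try {
    await admin.createTopics({
      topics: [{
        topic: 'orders',
        numPartitions: 12,
        replicationFactor: 3,
        configEntries: [
          { name: 'min.insync.replicas', value: '2' },
          { name: 'retention.ms', value: '604800000' }, // 7 days
        ],
      }],
    });
  } finally {
    await admin.disconnect();
  }
}

createOrdersTopic().catch(console.error);
```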
### Rack Awareness
```yaml
Configuration:
broker.rack=rack1 # Broker 1
broker.rack=rack2 # Broker 2
broker.rack=rack3 # Broker 3
Benefit:
- Replicas spread across racks
- Survives rack-level failures (power, network)
- Example: Topic with RF=3 → 1 replica per rack
Placement:
Leader: rack1
Follower 1: rack2
Follower 2: rack3
```
## Retention Strategies
### Time-Based Retention
```yaml
Short-Term (Events, Logs):
retention.ms: 86400000 # 1 day
Use Cases: Real-time analytics, monitoring
Medium-Term (Transactions):
retention.ms: 604800000 # 7 days
Use Cases: Standard business events
Long-Term (Audit, Compliance):
retention.ms: 31536000000 # 365 days
Use Cases: Regulatory requirements, event sourcing
Infinite (Event Sourcing):
retention.ms: -1 # Forever
cleanup.policy: compact
Use Cases: Source of truth, state rebuilding
```
### Size-Based Retention
```yaml
retention.bytes: 10737418240 # 10 GB per partition
Combined (Time OR Size):
retention.ms: 604800000 # 7 days
retention.bytes: 107374182400 # 100 GB
# Whichever limit is reached first
```
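Retention can also be changed on an existing topic. A sketch using the kafkajs admin API to apply the combined time-or-size policy above (client choice, topic name, and broker address are assumptions):
```typescript
// Sketch: set retention.ms and retention.bytes on an existing topic with kafkajs.
import { Kafka, ConfigResourceTypes } from 'kafkajs';

async function setRetention(topic: string): Promise<void> {
  const admin = new Kafka({ clientId: 'retention-admin', brokers: ['broker1:9092'] }).admin();
  await admin.connect();
  try {
    await admin.alterConfigs({
      validateOnly: false,
      resources: [{
        type: ConfigResourceTypes.TOPIC,
        name: topic,
        configEntries: [
          { name: 'retention.ms', value: '604800000' },       // 7 days
          { name: 'retention.bytes', value: '107374182400' },  // 100 GB
        ],
      }],
    });
  } finally {
    await admin.disconnect();
  }
}

setRetention('orders').catch(console.error);
```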
### Compaction (Log Compaction)
```yaml
cleanup.policy: compact
How It Works:
- Keeps only latest value per key
- Deletes old versions
- The active segment is never compacted; older, closed segments are compacted in the background
Use Cases:
- Database changelogs (CDC)
- User profile updates
- Configuration management
- State stores
Example:
Before Compaction:
user:123 → {name: "Alice", v:1}
user:123 → {name: "Alice", v:2, email: "alice@ex.com"}
user:123 → {name: "Alice A.", v:3}
After Compaction:
user:123 → {name: "Alice A.", v:3} # Latest only
```
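A sketch of how a producer interacts with a compacted topic, written against kafkajs (an assumption; names are placeholders): only the latest value per key survives compaction, and a null value acts as a tombstone that eventually removes the key entirely.
```typescript
// Sketch: keyed updates and a tombstone on a compacted topic (cleanup.policy=compact).
import { Kafka } from 'kafkajs';

async function updateProfiles(): Promise<void> {
  const producer = new Kafka({ clientId: 'profile-writer', brokers: ['broker1:9092'] }).producer();
  await producer.connect();
  await producer.send({
    topic: 'user-profiles', // assumed to be configured with cleanup.policy=compact
    messages: [
      { key: 'user:123', value: JSON.stringify({ name: 'Alice A.', v: 3 }) }, // latest state wins
      { key: 'user:456', value: null },                                       // tombstone: deletes the key
    ],
  });
  await producer.disconnect();
}

updateProfiles().catch(console.error);
```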
## Performance Optimization
### Broker Configuration
```yaml
# Network threads (handle client connections)
num.network.threads: 8 # Increase for high connection count
# I/O threads (disk operations)
num.io.threads: 16 # Set to number of disks × 2
# Replica fetcher threads
num.replica.fetchers: 4 # Increase for many partitions
# Socket buffer sizes
socket.send.buffer.bytes: 1048576 # 1 MB
socket.receive.buffer.bytes: 1048576 # 1 MB
# Log flush (default: OS handles flushing)
log.flush.interval.messages: 10000 # Flush every 10K messages
log.flush.interval.ms: 1000 # Or every 1 second
```
### Producer Optimization
```yaml
High Throughput:
batch.size: 65536 # 64 KB
linger.ms: 100 # Wait 100ms for batching
compression.type: lz4 # Fast compression
acks: 1 # Leader only
Low Latency:
batch.size: 16384 # 16 KB (default)
linger.ms: 0 # Send immediately
compression.type: none
acks: 1
Durability (Exactly-Once):
batch.size: 16384
linger.ms: 10
compression.type: lz4
acks: all
enable.idempotence: true
transactional.id: "producer-1"
```
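The durability profile above maps onto a transactional, idempotent producer. A sketch using the kafkajs client (an assumption; IDs and broker addresses are placeholders):
```typescript
// Sketch: idempotent + transactional producer, matching acks=all / enable.idempotence=true.
import { Kafka } from 'kafkajs';

async function produceTransactionally(): Promise<void> {
  const kafka = new Kafka({ clientId: 'payments-service', brokers: ['broker1:9092'] });
  const producer = kafka.producer({
    idempotent: true,               // implies acks=all on the wire
    transactionalId: 'producer-1',
    maxInFlightRequests: 1,
  });
  await producer.connect();

  const txn = await producer.transaction();
  try {
    await txn.send({
      topic: 'payments',
      // compression.type=lz4 would need an extra codec package with kafkajs, so it is omitted here
      messages: [{ key: 'order-123', value: JSON.stringify({ amount: 42 }) }],
    });
    await txn.commit();  // all-or-nothing across the messages in the transaction
  } catch (err) {
    await txn.abort();
    throw err;
  } finally {
    await producer.disconnect();
  }
}

produceTransactionally().catch(console.error);
```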
### Consumer Optimization
```yaml
High Throughput:
fetch.min.bytes: 1048576 # 1 MB
fetch.max.wait.ms: 500 # Wait 500ms to accumulate
Low Latency:
fetch.min.bytes: 1 # Immediate fetch
fetch.max.wait.ms: 100 # Short wait
Max Parallelism:
# Deploy consumers = number of partitions
# More consumers than partitions = idle consumers
```
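The same fetch tuning expressed as kafkajs consumer options (an assumption; group and topic names are placeholders): minBytes and maxWaitTimeInMs correspond to fetch.min.bytes and fetch.max.wait.ms.
```typescript
// Sketch: throughput-oriented consumer that accumulates ~1 MB (or waits 500 ms) per fetch.
import { Kafka } from 'kafkajs';

async function runBatchConsumer(): Promise<void> {
  const kafka = new Kafka({ clientId: 'analytics', brokers: ['broker1:9092'] });
  const consumer = kafka.consumer({
    groupId: 'analytics-batch',
    minBytes: 1048576,       // fetch.min.bytes: wait for ~1 MB per fetch
    maxWaitTimeInMs: 500,    // fetch.max.wait.ms: or 500 ms, whichever comes first
  });
  await consumer.connect();
  await consumer.subscribe({ topics: ['events'], fromBeginning: false });
  await consumer.run({
    eachBatch: async ({ batch }) => {
      // process batch.messages in bulk for throughput
      console.log(`partition ${batch.partition}: ${batch.messages.length} records`);
    },
  });
}

runBatchConsumer().catch(console.error);
```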
## Multi-Datacenter Patterns
### Active-Passive (Disaster Recovery)
```yaml
Architecture:
Primary DC: Full Kafka cluster
Secondary DC: Replica cluster (MirrorMaker 2)
Configuration:
- Producers → Primary only
- Consumers → Primary only
- MirrorMaker 2: Primary → Secondary (async replication)
Failover:
1. Detect primary failure
2. Switch producers/consumers to secondary
3. Promote secondary to primary
Recovery Time: 5-30 minutes (manual)
Data Loss: Potential (async replication lag)
```
### Active-Active (Geo-Replication)
```yaml
Architecture:
DC1: Kafka cluster (region A)
DC2: Kafka cluster (region B)
Bidirectional replication via MirrorMaker 2
Configuration:
- Producers → Nearest DC
- Consumers → Nearest DC or both
- Conflict resolution: Last-write-wins or custom
Challenges:
- Duplicate messages (at-least-once delivery)
- Ordering across DCs not guaranteed
- Circular replication prevention
Use Cases:
- Global applications
- Regional compliance (GDPR)
- Load distribution
```
### Stretch Cluster (Synchronous Replication)
```yaml
Architecture:
Single Kafka cluster spanning 2 DCs
Rack awareness: DC1 = rack1, DC2 = rack2
Configuration:
min.insync.replicas: 2
replication.factor: 4 (2 per DC)
acks: all
Requirements:
- Low latency between DCs (<10ms)
- High bandwidth link (10+ Gbps)
- Dedicated fiber
Trade-offs:
Pros: Synchronous replication, zero data loss
Cons: Latency penalty, network dependency
```
## Monitoring & Observability
### Key Metrics
**Broker Metrics**:
```yaml
UnderReplicatedPartitions:
Alert: > 0 for > 5 minutes
Indicates: Replica lag, broker failure
OfflinePartitionsCount:
Alert: > 0
Indicates: No leader elected (critical!)
ActiveControllerCount:
Alert: != 1 (should be exactly 1)
Indicates: Split brain or no controller
RequestHandlerAvgIdlePercent:
Alert: < 20%
Indicates: Broker CPU saturation
```
**Topic Metrics**:
```yaml
MessagesInPerSec:
Monitor: Throughput trends
Alert: Sudden drops (producer failure)
BytesInPerSec / BytesOutPerSec:
Monitor: Network utilization
Alert: Approaching NIC limits
RecordsLagMax (Consumer):
Alert: > 10000 or growing
Indicates: Consumer can't keep up
```
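Consumer lag can also be approximated outside JMX by comparing the latest offset per partition with the group's committed offset. A sketch with the kafkajs admin API (an assumption; response shapes vary slightly across client versions):
```typescript
// Sketch: approximate RecordsLagMax per partition = latest offset - committed group offset.
import { Kafka } from 'kafkajs';

async function reportLag(topic: string, groupId: string): Promise<void> {
  const admin = new Kafka({ clientId: 'lag-check', brokers: ['broker1:9092'] }).admin();
  await admin.connect();
  try {
    const latest = await admin.fetchTopicOffsets(topic);                   // high watermark per partition
    const committed = await admin.fetchOffsets({ groupId, topics: [topic] });

    const committedByPartition = new Map(
      committed
        .find((t) => t.topic === topic)!
        .partitions.map((p) => [p.partition, Number(p.offset)]),          // -1 means no commit yet
    );

    for (const p of latest) {
      const groupOffset = Math.max(committedByPartition.get(p.partition) ?? 0, 0);
      const lag = Number(p.offset) - groupOffset;                          // records not yet consumed
      console.log(`partition ${p.partition}: lag=${lag}`);
    }
  } finally {
    await admin.disconnect();
  }
}

reportLag('orders', 'orders-readers').catch(console.error);
```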
**Disk Metrics**:
```yaml
LogSegmentSize:
Monitor: Disk usage trends
Alert: > 80% capacity
LogFlushRateAndTimeMs:
Monitor: Disk write latency
Alert: > 100ms p99 (slow disk)
```
## Security Patterns
### Authentication & Authorization
```yaml
SASL/SCRAM-SHA-512:
- Industry standard
- User/password authentication
- Credentials stored in cluster metadata (ZooKeeper or the KRaft metadata log)
ACLs (Access Control Lists):
- Per-topic, per-group permissions
- Operations: READ, WRITE, CREATE, DELETE, ALTER
- Example:
    bin/kafka-acls.sh --add \
      --allow-principal User:alice \
      --operation READ \
      --topic orders
mTLS (Mutual TLS):
- Certificate-based auth
- Strong cryptographic identity
- Best for service-to-service
```
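Client-side, SASL/SCRAM over TLS looks like the following kafkajs configuration (an assumption; endpoints and credentials are placeholders):
```typescript
// Sketch: kafkajs client authenticating with SCRAM-SHA-512 over a TLS listener.
import { Kafka } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'orders-service',
  brokers: ['broker1:9093'],   // TLS listener
  ssl: true,
  sasl: {
    mechanism: 'scram-sha-512',
    username: 'alice',
    password: process.env.KAFKA_PASSWORD ?? '',
  },
});

// With the ACL above, this principal may READ from "orders" but nothing else.
const consumer = kafka.consumer({ groupId: 'orders-readers' });
```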
## Integration with SpecWeave
**Automatic Architecture Detection**:
```typescript
import { ClusterSizingCalculator } from './lib/utils/sizing';

const calculator = new ClusterSizingCalculator();
const recommendation = calculator.calculate({
  throughputMBps: 200,
  retentionDays: 30,
  replicationFactor: 3,
  topicCount: 100
});

console.log(recommendation);
// {
//   brokers: 8,
//   partitionsPerBroker: 1500,
//   diskPerBroker: 6000 GB,
//   ramPerBroker: 64 GB
// }
```
**SpecWeave Commands**:
- `/specweave-kafka:deploy` - Validates cluster sizing before deployment
- `/specweave-kafka:monitor-setup` - Configures metrics for key indicators
## Related Skills
- `/specweave-kafka:kafka-mcp-integration` - MCP server setup
- `/specweave-kafka:kafka-cli-tools` - CLI operations
## External Links
- [Kafka Documentation - Architecture](https://kafka.apache.org/documentation/#design)
- [Confluent - Kafka Sizing](https://www.confluent.io/blog/how-to-choose-the-number-of-topics-partitions-in-a-kafka-cluster/)
- [KRaft Mode Overview](https://kafka.apache.org/documentation/#kraft)
- [LinkedIn Engineering - Kafka at Scale](https://engineering.linkedin.com/kafka/running-kafka-scale)