| name | description |
|---|---|
| kafka-architecture | Expert knowledge of Apache Kafka architecture, cluster design, capacity planning, partitioning strategies, replication, and high availability. Auto-activates on keywords kafka architecture, cluster sizing, partition strategy, replication factor, kafka ha, kafka scalability, broker count, topic design, kafka performance, kafka capacity planning. |
Kafka Architecture & Design Expert
Comprehensive knowledge of Apache Kafka architecture patterns, cluster design principles, and production best practices for building resilient, scalable event streaming platforms.
Core Architecture Concepts
Kafka Cluster Components
Brokers:
- Individual Kafka servers that store and serve data
- Each broker handles thousands of partitions
- Typical: 3-10 brokers per cluster (small), 10-100+ (large enterprises)
Controller:
- One broker elected as controller (via KRaft or ZooKeeper)
- Manages partition leaders and replica assignments
- Failure triggers automatic re-election
Topics:
- Logical channels for message streams
- Divided into partitions for parallelism
- Can have different retention policies per topic
Partitions:
- Ordered, immutable sequence of records
- Unit of parallelism (each partition is consumed by at most one consumer per group)
- Distributed across brokers for load balancing
Replicas:
- Copies of partitions across multiple brokers
- 1 leader replica (serves reads/writes)
- N-1 follower replicas (replication only)
- In-Sync Replicas (ISR): Followers caught up with leader
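The leader/follower/ISR roles are easy to inspect from a client. A minimal sketch with the Java AdminClient, assuming a topic named `orders` and a broker at `localhost:9092` (both illustrative), that prints each partition's leader, replicas, and ISR:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;
import java.util.List;
import java.util.Properties;

public class DescribeReplicas {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption: local broker

        try (Admin admin = Admin.create(props)) {
            TopicDescription desc = admin.describeTopics(List.of("orders"))
                    .allTopicNames().get().get("orders");
            // One line per partition: leader serves reads/writes, followers replicate, ISR = caught-up replicas
            desc.partitions().forEach(p ->
                    System.out.printf("partition=%d leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```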
KRaft vs ZooKeeper Mode
KRaft Mode (Recommended, Kafka 3.3+):
Cluster Metadata:
- Stored in Kafka itself (no external ZooKeeper)
- Metadata topic: __cluster_metadata
- Controller quorum (3 or 5 nodes)
- Faster failover (<1s vs 10-30s)
- Simplified operations
ZooKeeper Mode (Legacy, removed in Kafka 4.0):
External Coordination:
- Requires separate ZooKeeper ensemble (3-5 nodes)
- Stores cluster metadata, configs, ACLs
- Slower failover (10-30 seconds)
- More complex to operate
Migration: ZooKeeper → KRaft migration is supported in Kafka 3.6+
Cluster Sizing Guidelines
Small Cluster (Development/Testing)
Configuration:
Brokers: 3
Partitions per broker: ~100-500
Total partitions: 300-1500
Replication factor: 3
Hardware:
- CPU: 4-8 cores
- RAM: 8-16 GB
- Disk: 500 GB - 1 TB SSD
- Network: 1 Gbps
Use Cases:
- Development environments
- Low-volume production (<10 MB/s)
- Proof of concepts
- Single datacenter
Example Workload:
- 50 topics
- 5-10 partitions per topic
- 1 million messages/day
- 7-day retention
Medium Cluster (Standard Production)
Configuration:
Brokers: 6-12
Partitions per broker: 500-2000
Total partitions: 3K-24K
Replication factor: 3
Hardware:
- CPU: 16-32 cores
- RAM: 64-128 GB
- Disk: 2-8 TB NVMe SSD
- Network: 10 Gbps
Use Cases:
- Standard production workloads
- Multi-team environments
- Regional deployments
- Up to 500 MB/s throughput
Example Workload:
- 200-500 topics
- 10-50 partitions per topic
- 100 million messages/day
- 30-day retention
Large Cluster (High-Scale Production)
Configuration:
Brokers: 20-100+
Partitions per broker: 2000-4000
Total partitions: 40K-400K+
Replication factor: 3
Hardware:
- CPU: 32-64 cores
- RAM: 128-256 GB
- Disk: 8-20 TB NVMe SSD
- Network: 25-100 Gbps
Use Cases:
- Large enterprises
- Multi-region deployments
- Event-driven architectures
- 1+ GB/s throughput
Example Workload:
- 1000+ topics
- 50-200 partitions per topic
- 1+ billion messages/day
- 90-365 day retention
Kafka Streams / Exactly-Once Semantics (EOS) Clusters
Configuration:
Brokers: 6-12+ (same as standard, but more control plane load)
Partitions per broker: 500-1500 (fewer due to transaction overhead)
Total partitions: 3K-18K
Replication factor: 3
Hardware:
- CPU: 16-32 cores (more CPU for transactions)
- RAM: 64-128 GB
- Disk: 4-12 TB NVMe SSD (more for transaction logs)
- Network: 10-25 Gbps
Special Considerations:
- More brokers due to transaction coordinator load
- Lower partition count per broker (transactions = more overhead)
- Higher disk IOPS for transaction logs
- min.insync.replicas=2 mandatory for EOS
- acks=all required for producers
Use Cases:
- Stream processing with exactly-once guarantees
- Financial transactions
- Event sourcing with strict ordering
- Multi-step workflows requiring atomicity
Partitioning Strategy
How Many Partitions?
Formula:
Partitions = max(
  Write Throughput ÷ Per-Partition Write Throughput,
  Read Throughput ÷ Per-Partition Read Throughput,
  Number of Consumers (for parallelism)
) × Growth Factor (2-3x)
Single Partition Limits:
- Write throughput: ~10-50 MB/s
- Read throughput: ~30-100 MB/s
- Message rate: ~10K-100K msg/s
Examples:
High Throughput Topic (Logs, Events):
Requirements:
- Write: 200 MB/s
- Read: 500 MB/s (multiple consumers)
- Expected growth: 3x in 1 year
Calculation:
Write partitions: 200 MB/s ÷ 20 MB/s = 10
Read partitions: 500 MB/s ÷ 40 MB/s = 12.5 → 13
Growth: max(10, 13) × 3 = 39
Recommendation: 40-50 partitions
Low-Latency Topic (Commands, Requests):
Requirements:
- Write: 5 MB/s
- Read: 10 MB/s
- Latency: <10ms p99
- Order preservation: By user ID
Calculation:
Throughput partitions: 5 MB/s ÷ 20 MB/s = 0.25 → 1
Parallelism: 4 (for redundancy)
Recommendation: 4-6 partitions (keyed by user ID)
Dead Letter Queue:
Recommendation: 1-3 partitions
Reason: Low volume, order less important
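As a rough illustration of the formula, here is a small hypothetical helper (the per-partition throughput numbers are the assumptions from this section, not measured values) that reproduces the high-throughput example above:

```java
public class PartitionSizer {
    /**
     * Partitions = max(write/perPartitionWrite, read/perPartitionRead, consumers) × growthFactor.
     * Per-partition throughput should come from your own benchmarks, not these defaults.
     */
    static int recommendedPartitions(double writeMBps, double readMBps, int consumers,
                                     double perPartitionWriteMBps, double perPartitionReadMBps,
                                     double growthFactor) {
        double base = Math.ceil(Math.max(
                Math.max(writeMBps / perPartitionWriteMBps, readMBps / perPartitionReadMBps),
                consumers));
        return (int) Math.ceil(base * growthFactor);
    }

    public static void main(String[] args) {
        // High-throughput example from above: 200 MB/s writes, 500 MB/s reads, 10 consumers, 3x growth
        System.out.println(recommendedPartitions(200, 500, 10, 20, 40, 3)); // -> 39 (recommend 40-50)
    }
}
```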
Partition Key Selection
Good Keys (High Cardinality, Even Distribution):
✅ User ID (UUIDs):
- Millions of unique values
- Even distribution
- Example: "user-123e4567-e89b-12d3-a456-426614174000"
✅ Device ID (IoT):
- Unique per device
- Natural sharding
- Example: "device-sensor-001-zone-a"
✅ Order ID (E-commerce):
- Unique per transaction
- Even temporal distribution
- Example: "order-2024-11-15-abc123"
Bad Keys (Low Cardinality, Hotspots):
❌ Country Code:
- Only ~200 values
- Uneven (US, CN >> others)
- Creates partition hotspots
❌ Boolean Flags:
- Only 2 values (true/false)
- Severe imbalance
❌ Date (YYYY-MM-DD):
- All today's traffic → 1 partition
- Temporal hotspot
Compound Keys (Best of Both):
✅ Country + User ID:
- Partition by country for locality
- Sub-partition by user for distribution
- Example: "US:user-123" → hash("US:user-123")
✅ Tenant + Event Type + Timestamp:
- Multi-tenant isolation
- Event type grouping
- Temporal ordering
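A sketch of keyed production with a compound key, assuming a topic named `user-events` and a local broker (both illustrative). The default partitioner hashes the serialized key (murmur2), so every event for a given `country:userId` pair lands on the same partition and stays ordered:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class CompoundKeyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Compound key: country for locality, user ID for even distribution
            String key = "US:user-123e4567-e89b-12d3-a456-426614174000";
            producer.send(new ProducerRecord<>("user-events", key, "{\"event\":\"login\"}"));
            producer.flush();
        }
    }
}
```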
Replication & High Availability
Replication Factor Guidelines
Development:
Replication Factor: 1
Reason: Fast, no durability needed
Production (Standard):
Replication Factor: 3
Reason: Balance durability vs cost
Tolerates: 1 broker failure without losing write availability (with min.insync.replicas=2 and acks=all)
Production (Critical):
Replication Factor: 5
Reason: Maximum durability
Tolerates: 2 broker failures without losing write availability (with min.insync.replicas=3 and acks=all)
Use Cases: Financial transactions, audit logs
Multi-Datacenter:
Replication Factor: 3 per DC (6 total)
Reason: DC-level fault tolerance
Requires: MirrorMaker 2 or Confluent Replicator
min.insync.replicas
Configuration:
min.insync.replicas=2:
- At least 2 replicas must acknowledge writes
- Typical for replication.factor=3
- Prevents data loss if 1 broker fails
min.insync.replicas=1:
- Only leader must acknowledge (dangerous!)
- Use only for non-critical topics
min.insync.replicas=3:
- At least 3 replicas must acknowledge
- For replication.factor=5 (critical systems)
Rule: min.insync.replicas ≤ replication.factor - 1 (to allow 1 replica failure)
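A sketch creating a topic with the standard production settings above (replication factor 3, min.insync.replicas=2); the topic name and partition count are illustrative:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class CreateDurableTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption

        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("payments", 12, (short) 3)       // 12 partitions, RF=3
                    .configs(Map.of("min.insync.replicas", "2"));          // with acks=all, writes need 2 replicas
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```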
Rack Awareness
Configuration:
broker.rack=rack1 # Broker 1
broker.rack=rack2 # Broker 2
broker.rack=rack3 # Broker 3
Benefit:
- Replicas spread across racks
- Survives rack-level failures (power, network)
- Example: Topic with RF=3 → 1 replica per rack
Placement:
Leader: rack1
Follower 1: rack2
Follower 2: rack3
Retention Strategies
Time-Based Retention
Short-Term (Events, Logs):
retention.ms: 86400000 # 1 day
Use Cases: Real-time analytics, monitoring
Medium-Term (Transactions):
retention.ms: 604800000 # 7 days
Use Cases: Standard business events
Long-Term (Audit, Compliance):
retention.ms: 31536000000 # 365 days
Use Cases: Regulatory requirements, event sourcing
Infinite (Event Sourcing):
retention.ms: -1 # Forever
cleanup.policy: compact
Use Cases: Source of truth, state rebuilding
Size-Based Retention
retention.bytes: 10737418240 # 10 GB per partition
Combined (Time OR Size):
retention.ms: 604800000 # 7 days
retention.bytes: 107374182400 # 100 GB
# Whichever limit is reached first
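Retention can also be changed on an existing topic without downtime. A sketch using incremental config updates (values mirror the 7-day / 100 GB example above; the topic name is illustrative):

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SetRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");
            List<AlterConfigOp> ops = List.of(
                    new AlterConfigOp(new ConfigEntry("retention.ms", "604800000"), AlterConfigOp.OpType.SET),      // 7 days
                    new AlterConfigOp(new ConfigEntry("retention.bytes", "107374182400"), AlterConfigOp.OpType.SET) // 100 GB
            );
            admin.incrementalAlterConfigs(Map.of(topic, ops)).all().get();
        }
    }
}
```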
Compaction (Log Compaction)
cleanup.policy: compact
How It Works:
- Keeps only latest value per key
- Deletes old versions
- Preserves full history initially, compacts later
Use Cases:
- Database changelogs (CDC)
- User profile updates
- Configuration management
- State stores
Example:
Before Compaction:
user:123 → {name: "Alice", v:1}
user:123 → {name: "Alice", v:2, email: "alice@ex.com"}
user:123 → {name: "Alice A.", v:3}
After Compaction:
user:123 → {name: "Alice A.", v:3} # Latest only
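A sketch of what compaction operates on, assuming a compacted topic named `user-profiles` (illustrative): repeated sends under the same key, where only the latest value survives compaction, and a null-value tombstone that deletes the key entirely:

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class CompactedTopicWriter {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Compaction keeps only the last value seen for each key
            producer.send(new ProducerRecord<>("user-profiles", "user:123", "{\"name\":\"Alice\",\"v\":1}"));
            producer.send(new ProducerRecord<>("user-profiles", "user:123", "{\"name\":\"Alice A.\",\"v\":3}"));
            // Sending a null value for a key ("tombstone") marks it for deletion during compaction:
            // producer.send(new ProducerRecord<>("user-profiles", "user:123", null));
            producer.flush();
        }
    }
}
```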
Performance Optimization
Broker Configuration
# Network threads (handle client connections)
num.network.threads: 8 # Increase for high connection count
# I/O threads (disk operations)
num.io.threads: 16 # Set to number of disks × 2
# Replica fetcher threads
num.replica.fetchers: 4 # Increase for many partitions
# Socket buffer sizes
socket.send.buffer.bytes: 1048576 # 1 MB
socket.receive.buffer.bytes: 1048576 # 1 MB
# Log flush (by default Kafka leaves flushing to the OS page cache and relies on replication for durability;
# set these only if you need explicit flush intervals)
log.flush.interval.messages: 10000 # Flush every 10K messages
log.flush.interval.ms: 1000 # Or every 1 second
Producer Optimization
High Throughput:
batch.size: 65536 # 64 KB
linger.ms: 100 # Wait 100ms for batching
compression.type: lz4 # Fast compression
acks: 1 # Leader only
Low Latency:
batch.size: 16384 # 16 KB (default)
linger.ms: 0 # Send immediately
compression.type: none
acks: 1
Durability (Exactly-Once):
batch.size: 16384
linger.ms: 10
compression.type: lz4
acks: all
enable.idempotence: true
transactional.id: "producer-1"
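The durability profile maps directly onto the Java producer API. A minimal exactly-once sketch (topic name, record, and transactional ID are illustrative):

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import java.util.Properties;

public class ExactlyOnceProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.ACKS_CONFIG, "all");                    // wait for all in-sync replicas
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");     // no duplicates on retry
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "producer-1"); // stable ID enables transactions
        props.put(ProducerConfig.LINGER_MS_CONFIG, "10");
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            producer.beginTransaction();
            try {
                producer.send(new ProducerRecord<>("payments", "order-123", "{\"amount\":42}"));
                producer.commitTransaction();   // atomic visibility for read_committed consumers
            } catch (Exception e) {
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```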
Consumer Optimization
High Throughput:
fetch.min.bytes: 1048576 # 1 MB
fetch.max.wait.ms: 500 # Wait 500ms to accumulate
Low Latency:
fetch.min.bytes: 1 # Immediate fetch
fetch.max.wait.ms: 100 # Short wait
Max Parallelism:
# Deploy consumers = number of partitions
# More consumers than partitions = idle consumers
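A high-throughput consumer sketch wiring in the fetch settings above (topic, group ID, and broker address are assumptions):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class BatchingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "analytics");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1048576");   // wait for 1 MB per fetch...
        props.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "500");     // ...or 500 ms, whichever comes first

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```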
Multi-Datacenter Patterns
Active-Passive (Disaster Recovery)
Architecture:
Primary DC: Full Kafka cluster
Secondary DC: Replica cluster (MirrorMaker 2)
Configuration:
- Producers → Primary only
- Consumers → Primary only
- MirrorMaker 2: Primary → Secondary (async replication)
Failover:
1. Detect primary failure
2. Switch producers/consumers to secondary
3. Promote secondary to primary
Recovery Time: 5-30 minutes (manual)
Data Loss: Potential (async replication lag)
Active-Active (Geo-Replication)
Architecture:
DC1: Kafka cluster (region A)
DC2: Kafka cluster (region B)
Bidirectional replication via MirrorMaker 2
Configuration:
- Producers → Nearest DC
- Consumers → Nearest DC or both
- Conflict resolution: Last-write-wins or custom
Challenges:
- Duplicate messages (at-least-once delivery)
- Ordering across DCs not guaranteed
- Circular replication prevention
Use Cases:
- Global applications
- Regional compliance (GDPR)
- Load distribution
Stretch Cluster (Synchronous Replication)
Architecture:
Single Kafka cluster spanning 2 DCs
Rack awareness: DC1 = rack1, DC2 = rack2
Configuration:
min.insync.replicas: 2
replication.factor: 4 (2 per DC)
acks: all
Requirements:
- Low latency between DCs (<10ms)
- High bandwidth link (10+ Gbps)
- Dedicated fiber
Trade-offs:
Pros: Synchronous replication, zero data loss
Cons: Latency penalty, network dependency
Monitoring & Observability
Key Metrics
Broker Metrics:
UnderReplicatedPartitions:
Alert: > 0 for > 5 minutes
Indicates: Replica lag, broker failure
OfflinePartitionsCount:
Alert: > 0
Indicates: No leader elected (critical!)
ActiveControllerCount:
Alert: != 1 (should be exactly 1)
Indicates: Split brain or no controller
RequestHandlerAvgIdlePercent:
Alert: < 20%
Indicates: Broker CPU saturation
Topic Metrics:
MessagesInPerSec:
Monitor: Throughput trends
Alert: Sudden drops (producer failure)
BytesInPerSec / BytesOutPerSec:
Monitor: Network utilization
Alert: Approaching NIC limits
RecordsLagMax (Consumer):
Alert: > 10000 or growing
Indicates: Consumer can't keep up
Disk Metrics:
LogSegmentSize:
Monitor: Disk usage trends
Alert: > 80% capacity
LogFlushRateAndTimeMs:
Monitor: Disk write latency
Alert: > 100ms p99 (slow disk)
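Consumer lag (RecordsLagMax above) can also be computed directly by comparing committed offsets with log-end offsets. A sketch assuming a consumer group named `analytics` and a local broker:

```java
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult.ListOffsetsResultInfo;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;

public class ConsumerLag {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumption

        try (Admin admin = Admin.create(props)) {
            // Committed offsets for the group
            Map<TopicPartition, OffsetAndMetadata> committed =
                    admin.listConsumerGroupOffsets("analytics").partitionsToOffsetAndMetadata().get();

            // Latest (log-end) offsets for the same partitions
            Map<TopicPartition, ListOffsetsResultInfo> latest = admin.listOffsets(
                    committed.keySet().stream()
                            .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()))).all().get();

            // Lag = log-end offset - committed offset, per partition
            committed.forEach((tp, offset) ->
                    System.out.printf("%s lag=%d%n", tp, latest.get(tp).offset() - offset.offset()));
        }
    }
}
```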
Security Patterns
Authentication & Authorization
SASL/SCRAM-SHA-512:
- Industry standard
- User/password authentication
- Stored in ZooKeeper/KRaft
ACLs (Access Control Lists):
- Per-topic, per-group permissions
- Operations: READ, WRITE, CREATE, DELETE, ALTER
- Example:
bin/kafka-acls.sh --bootstrap-server localhost:9092 --add \
  --allow-principal User:alice \
  --operation READ \
  --topic orders
mTLS (Mutual TLS):
- Certificate-based auth
- Strong cryptographic identity
- Best for service-to-service
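Client-side, SASL/SCRAM authentication over TLS is a handful of properties shared by producers, consumers, and admin clients. A sketch (broker address, username, password, and truststore path are placeholders):

```java
import java.util.Properties;

public class SecureClientConfig {
    static Properties scramClientProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9093");            // assumption: SASL_SSL listener
        props.put("security.protocol", "SASL_SSL");                // TLS for encryption + SASL for auth
        props.put("sasl.mechanism", "SCRAM-SHA-512");
        props.put("sasl.jaas.config",
                "org.apache.kafka.common.security.scram.ScramLoginModule required "
                + "username=\"alice\" password=\"alice-secret\";");
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks"); // placeholder path
        props.put("ssl.truststore.password", "changeit");
        return props;
    }
}
```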
Integration with SpecWeave
Automatic Architecture Detection:
import { ClusterSizingCalculator } from './lib/utils/sizing';
const calculator = new ClusterSizingCalculator();
const recommendation = calculator.calculate({
throughputMBps: 200,
retentionDays: 30,
replicationFactor: 3,
topicCount: 100
});
console.log(recommendation);
// {
// brokers: 8,
// partitionsPerBroker: 1500,
// diskPerBroker: 6000 GB,
// ramPerBroker: 64 GB
// }
SpecWeave Commands:
- /specweave-kafka:deploy - Validates cluster sizing before deployment
- /specweave-kafka:monitor-setup - Configures metrics for key indicators
Related Skills
- /specweave-kafka:kafka-mcp-integration - MCP server setup
- /specweave-kafka:kafka-cli-tools - CLI operations