---
name: kafka-architecture
description: Expert knowledge of Apache Kafka architecture, cluster design, capacity planning, partitioning strategies, replication, and high availability. Auto-activates on keywords kafka architecture, cluster sizing, partition strategy, replication factor, kafka ha, kafka scalability, broker count, topic design, kafka performance, kafka capacity planning.
---

# Kafka Architecture & Design Expert

Comprehensive knowledge of Apache Kafka architecture patterns, cluster design principles, and production best practices for building resilient, scalable event streaming platforms.

## Core Architecture Concepts

### Kafka Cluster Components

**Brokers**:
- Individual Kafka servers that store and serve data
- Each broker can handle thousands of partitions
- Typical: 3-10 brokers per cluster (small), 10-100+ (large enterprises)

**Controller**:
- One broker elected as controller (via KRaft or ZooKeeper)
- Manages partition leaders and replica assignments
- Failure triggers automatic re-election

**Topics**:
- Logical channels for message streams
- Divided into partitions for parallelism
- Can have different retention policies per topic

**Partitions**:
- Ordered, immutable sequence of records
- Unit of parallelism (within a consumer group, each partition is consumed by at most one consumer)
- Distributed across brokers for load balancing

**Replicas**:
- Copies of partitions across multiple brokers
- 1 leader replica (serves reads/writes)
- N-1 follower replicas (replication only)
- In-Sync Replicas (ISR): followers caught up with the leader

### KRaft vs ZooKeeper Mode

**KRaft Mode** (Recommended, Kafka 3.3+):
```yaml
Cluster Metadata:
  - Stored in Kafka itself (no external ZooKeeper)
  - Metadata topic: __cluster_metadata
  - Controller quorum (3 or 5 nodes)
  - Faster failover (<1s vs 10-30s)
  - Simplified operations
```

**ZooKeeper Mode** (Legacy; removed in Kafka 4.0):
```yaml
External Coordination:
  - Requires separate ZooKeeper ensemble (3-5 nodes)
  - Stores cluster metadata, configs, ACLs
  - Slower failover (10-30 seconds)
  - More complex to operate
```

**Migration**: ZooKeeper → KRaft migration supported in Kafka 3.6+

## Cluster Sizing Guidelines

### Small Cluster (Development/Testing)

```yaml
Configuration:
  Brokers: 3
  Partitions per broker: ~100-500
  Total partitions: 300-1500
  Replication factor: 3

Hardware:
  - CPU: 4-8 cores
  - RAM: 8-16 GB
  - Disk: 500 GB - 1 TB SSD
  - Network: 1 Gbps

Use Cases:
  - Development environments
  - Low-volume production (<10 MB/s)
  - Proof of concepts
  - Single datacenter

Example Workload:
  - 50 topics
  - 5-10 partitions per topic
  - 1 million messages/day
  - 7-day retention
```

### Medium Cluster (Standard Production)

```yaml
Configuration:
  Brokers: 6-12
  Partitions per broker: 500-2000
  Total partitions: 3K-24K
  Replication factor: 3

Hardware:
  - CPU: 16-32 cores
  - RAM: 64-128 GB
  - Disk: 2-8 TB NVMe SSD
  - Network: 10 Gbps

Use Cases:
  - Standard production workloads
  - Multi-team environments
  - Regional deployments
  - Up to 500 MB/s throughput

Example Workload:
  - 200-500 topics
  - 10-50 partitions per topic
  - 100 million messages/day
  - 30-day retention
```

### Large Cluster (High-Scale Production)

```yaml
Configuration:
  Brokers: 20-100+
  Partitions per broker: 2000-4000
  Total partitions: 40K-400K+
  Replication factor: 3

Hardware:
  - CPU: 32-64 cores
  - RAM: 128-256 GB
  - Disk: 8-20 TB NVMe SSD
  - Network: 25-100 Gbps

Use Cases:
  - Large enterprises
  - Multi-region deployments
  - Event-driven architectures
  - 1+ GB/s throughput

Example Workload:
  - 1000+ topics
  - 50-200 partitions per topic
  - 1+ billion messages/day
  - 90-365 day retention
```
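To turn these tables into a first-pass estimate, here is a minimal, self-contained TypeScript sketch (not part of any Kafka or SpecWeave API). The per-broker throughput and disk constants are illustrative assumptions; replace them with numbers from your own benchmarks.

```typescript
// Rough first-pass cluster sizing heuristic. All constants are assumptions, not benchmarks.
interface SizingInput {
  writeThroughputMBps: number; // sustained producer ingress
  retentionDays: number;
  replicationFactor: number;
}

interface SizingEstimate {
  brokers: number;
  totalStorageTB: number;
}

// Assumed per-broker limits; tune these from your own load tests.
const BROKER_INGRESS_MBPS = 75; // sustained write MB/s one broker comfortably handles
const BROKER_DISK_TB = 8;       // usable disk per broker
const DISK_HEADROOM = 0.6;      // keep ~40% free for compaction, rebalancing, growth

function estimateCluster(input: SizingInput): SizingEstimate {
  const { writeThroughputMBps, retentionDays, replicationFactor } = input;

  // Retained data = ingress (MB/s) * seconds retained * replication factor, converted to TB.
  const retainedTB =
    (writeThroughputMBps * 86_400 * retentionDays * replicationFactor) / 1_000_000;

  // Brokers needed for throughput (replication traffic multiplies ingress).
  const brokersForThroughput = Math.ceil(
    (writeThroughputMBps * replicationFactor) / BROKER_INGRESS_MBPS
  );

  // Brokers needed for storage, with headroom.
  const brokersForStorage = Math.ceil(retainedTB / (BROKER_DISK_TB * DISK_HEADROOM));

  // Never go below the replication factor: each replica needs its own broker.
  const brokers = Math.max(brokersForThroughput, brokersForStorage, replicationFactor);

  return { brokers, totalStorageTB: Math.round(retainedTB * 10) / 10 };
}

// Example: 20 MB/s of writes, 7-day retention, RF=3.
console.log(estimateCluster({ writeThroughputMBps: 20, retentionDays: 7, replicationFactor: 3 }));
// → { brokers: 8, totalStorageTB: 36.3 }
```

Storage, not throughput, usually dominates once retention exceeds a few days, which is why long-retention clusters in the tables above carry much larger disks per broker.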
### Kafka Streams / Exactly-Once Semantics (EOS) Clusters

```yaml
Configuration:
  Brokers: 6-12+ (same as standard, but more control-plane load)
  Partitions per broker: 500-1500 (fewer due to transaction overhead)
  Total partitions: 3K-18K
  Replication factor: 3

Hardware:
  - CPU: 16-32 cores (more CPU for transactions)
  - RAM: 64-128 GB
  - Disk: 4-12 TB NVMe SSD (more for transaction logs)
  - Network: 10-25 Gbps

Special Considerations:
  - More brokers due to transaction coordinator load
  - Lower partition count per broker (transactions add overhead)
  - Higher disk IOPS for transaction logs
  - min.insync.replicas=2 strongly recommended for durable EOS
  - acks=all required for producers

Use Cases:
  - Stream processing with exactly-once guarantees
  - Financial transactions
  - Event sourcing with strict ordering
  - Multi-step workflows requiring atomicity
```

## Partitioning Strategy

### How Many Partitions?

**Formula**:
```
Partitions = max(
  Target Write Throughput / Per-Partition Write Throughput,
  Target Read Throughput / Per-Partition Read Throughput,
  Number of Consumers (for parallelism)
) × Growth Factor (2-3x)

Single Partition Limits:
- Write throughput: ~10-50 MB/s
- Read throughput: ~30-100 MB/s
- Message rate: ~10K-100K msg/s
```

**Examples**:

**High Throughput Topic** (Logs, Events):
```yaml
Requirements:
  - Write: 200 MB/s
  - Read: 500 MB/s (multiple consumers)
  - Expected growth: 3x in 1 year

Calculation:
  Write partitions: 200 MB/s ÷ 20 MB/s = 10
  Read partitions: 500 MB/s ÷ 40 MB/s = 13
  Growth factor: 13 × 3 = 39

Recommendation: 40-50 partitions
```

**Low-Latency Topic** (Commands, Requests):
```yaml
Requirements:
  - Write: 5 MB/s
  - Read: 10 MB/s
  - Latency: <10ms p99
  - Order preservation: by user ID

Calculation:
  Throughput partitions: 5 MB/s ÷ 20 MB/s = 1
  Parallelism: 4 (for consumer parallelism and headroom)

Recommendation: 4-6 partitions (keyed by user ID)
```

**Dead Letter Queue**:
```yaml
Recommendation: 1-3 partitions
Reason: Low volume, ordering less important
```

### Partition Key Selection

**Good Keys** (High Cardinality, Even Distribution):
```yaml
✅ User ID (UUIDs):
  - Millions of unique values
  - Even distribution
  - Example: "user-123e4567-e89b-12d3-a456-426614174000"

✅ Device ID (IoT):
  - Unique per device
  - Natural sharding
  - Example: "device-sensor-001-zone-a"

✅ Order ID (E-commerce):
  - Unique per transaction
  - Even temporal distribution
  - Example: "order-2024-11-15-abc123"
```

**Bad Keys** (Low Cardinality, Hotspots):
```yaml
❌ Country Code:
  - Only ~200 values
  - Uneven (US, CN >> others)
  - Creates partition hotspots

❌ Boolean Flags:
  - Only 2 values (true/false)
  - Severe imbalance

❌ Date (YYYY-MM-DD):
  - All of today's traffic → 1 partition
  - Temporal hotspot
```

**Compound Keys** (Best of Both):
```yaml
✅ Country + User ID:
  - All events for a given country + user land on one partition (ordering preserved)
  - User component restores cardinality for even distribution
  - Example: "US:user-123" → hash("US:user-123")

✅ Tenant + Event Type + Timestamp:
  - Multi-tenant isolation
  - Event type grouping
  - Temporal ordering
```

## Replication & High Availability

### Replication Factor Guidelines

```yaml
Development:
  Replication Factor: 1
  Reason: Fast, no durability needed

Production (Standard):
  Replication Factor: 3
  Reason: Balance durability vs cost
  Tolerates: up to 2 broker failures without data loss; 1 failure while still
             accepting acks=all writes (with min.insync.replicas=2)

Production (Critical):
  Replication Factor: 5
  Reason: Maximum durability
  Tolerates: up to 4 broker failures without data loss; 2 failures while still
             accepting acks=all writes (with min.insync.replicas=3)
  Use Cases: Financial transactions, audit logs

Multi-Datacenter:
  Replication Factor: 3 within each DC's cluster
  Reason: DC-level fault tolerance
  Requires: MirrorMaker 2 or Confluent Replicator
```
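The partition formula and the replication guidelines above come together when topics are created programmatically. Below is a minimal sketch assuming the kafkajs client; the topic name, broker address, client ID, consumer count, and per-partition throughput constants are illustrative, not prescribed values (min.insync.replicas is discussed in the next subsection).

```typescript
import { Kafka } from 'kafkajs';

// Per-partition throughput assumptions; benchmark your own hardware.
const PER_PARTITION_WRITE_MBPS = 20;
const PER_PARTITION_READ_MBPS = 40;

function recommendedPartitions(opts: {
  writeMBps: number;
  readMBps: number;
  consumerCount: number;
  growthFactor?: number; // 2-3x is typical
}): number {
  const { writeMBps, readMBps, consumerCount, growthFactor = 3 } = opts;
  const base = Math.max(
    Math.ceil(writeMBps / PER_PARTITION_WRITE_MBPS),
    Math.ceil(readMBps / PER_PARTITION_READ_MBPS),
    consumerCount
  );
  return base * growthFactor;
}

async function createOrdersTopic(): Promise<void> {
  // Broker address and names are placeholders.
  const kafka = new Kafka({ clientId: 'topic-provisioner', brokers: ['broker-1:9092'] });
  const admin = kafka.admin();
  await admin.connect();
  try {
    await admin.createTopics({
      waitForLeaders: true,
      topics: [
        {
          topic: 'orders',
          // 200 MB/s writes, 500 MB/s reads, 12 consumers → 39 partitions (matches the worked example).
          numPartitions: recommendedPartitions({ writeMBps: 200, readMBps: 500, consumerCount: 12 }),
          replicationFactor: 3, // "Production (Standard)" guideline
          configEntries: [
            { name: 'min.insync.replicas', value: '2' },
            { name: 'retention.ms', value: '604800000' }, // 7 days
          ],
        },
      ],
    });
  } finally {
    await admin.disconnect();
  }
}

createOrdersTopic().catch(console.error);
```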
### min.insync.replicas

**Configuration**:
```yaml
min.insync.replicas=2:
  - At least 2 replicas must acknowledge writes
  - Typical for replication.factor=3
  - Prevents data loss if 1 broker fails

min.insync.replicas=1:
  - Only the leader must acknowledge (dangerous!)
  - Use only for non-critical topics

min.insync.replicas=3:
  - At least 3 replicas must acknowledge
  - For replication.factor=5 (critical systems)
```

**Rule**: `min.insync.replicas ≤ replication.factor - 1`, so that acks=all writes keep flowing when 1 replica is down.

### Rack Awareness

```yaml
Configuration:
  broker.rack=rack1  # Broker 1
  broker.rack=rack2  # Broker 2
  broker.rack=rack3  # Broker 3

Benefit:
  - Replicas spread across racks
  - Survives rack-level failures (power, network)
  - Example: Topic with RF=3 → 1 replica per rack

Placement:
  Leader: rack1
  Follower 1: rack2
  Follower 2: rack3
```

## Retention Strategies

### Time-Based Retention

```yaml
Short-Term (Events, Logs):
  retention.ms: 86400000  # 1 day
  Use Cases: Real-time analytics, monitoring

Medium-Term (Transactions):
  retention.ms: 604800000  # 7 days
  Use Cases: Standard business events

Long-Term (Audit, Compliance):
  retention.ms: 31536000000  # 365 days
  Use Cases: Regulatory requirements, event sourcing

Infinite (Event Sourcing):
  retention.ms: -1  # Forever
  cleanup.policy: compact
  Use Cases: Source of truth, state rebuilding
```

### Size-Based Retention

```yaml
retention.bytes: 10737418240  # 10 GB per partition

Combined (Time OR Size):
  retention.ms: 604800000        # 7 days
  retention.bytes: 107374182400  # 100 GB
  # Whichever limit is reached first triggers deletion
```

### Compaction (Log Compaction)

```yaml
cleanup.policy: compact

How It Works:
  - Keeps only the latest value per key
  - Deletes older versions during compaction
  - The active segment is never compacted, so recent history survives until compaction runs

Use Cases:
  - Database changelogs (CDC)
  - User profile updates
  - Configuration management
  - State stores

Example:
  Before Compaction:
    user:123 → {name: "Alice", v:1}
    user:123 → {name: "Alice", v:2, email: "alice@ex.com"}
    user:123 → {name: "Alice A.", v:3}

  After Compaction:
    user:123 → {name: "Alice A.", v:3}  # Latest only
```

## Performance Optimization

### Broker Configuration

```yaml
# Network threads (handle client connections)
num.network.threads: 8  # Increase for high connection count

# I/O threads (disk operations)
num.io.threads: 16  # Roughly number of disks × 2

# Replica fetcher threads
num.replica.fetchers: 4  # Increase for many partitions

# Socket buffer sizes
socket.send.buffer.bytes: 1048576     # 1 MB
socket.receive.buffer.bytes: 1048576  # 1 MB

# Log flush (by default Kafka leaves flushing to the OS page cache;
# set explicit bounds only if you need them)
log.flush.interval.messages: 10000  # Flush every 10K messages
log.flush.interval.ms: 1000         # Or every 1 second
```

### Producer Optimization

```yaml
High Throughput:
  batch.size: 65536      # 64 KB
  linger.ms: 100         # Wait up to 100ms for batching
  compression.type: lz4  # Fast compression
  acks: 1                # Leader only

Low Latency:
  batch.size: 16384      # 16 KB (default)
  linger.ms: 0           # Send immediately
  compression.type: none
  acks: 1

Durability (Exactly-Once):
  batch.size: 16384
  linger.ms: 10
  compression.type: lz4
  acks: all
  enable.idempotence: true
  transactional.id: "producer-1"
```

### Consumer Optimization

```yaml
High Throughput:
  fetch.min.bytes: 1048576  # 1 MB
  fetch.max.wait.ms: 500    # Wait up to 500ms to accumulate data

Low Latency:
  fetch.min.bytes: 1        # Fetch as soon as any data is available
  fetch.max.wait.ms: 100    # Short wait

Max Parallelism:
  # Deploy as many consumers as there are partitions
  # More consumers than partitions = idle consumers
```
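As a client-side companion to the durability profile above, here is a minimal sketch assuming the kafkajs client. Broker address, client ID, and topic are placeholders; note that kafkajs bundles only GZIP compression, so the LZ4 setting recommended above would require an additional codec package.

```typescript
import { Kafka, CompressionTypes } from 'kafkajs';

// Illustrative client setup; broker list and IDs are placeholders.
const kafka = new Kafka({ clientId: 'orders-service', brokers: ['broker-1:9092'] });

// Durability-oriented producer: idempotence on, acknowledgements from all in-sync replicas.
const producer = kafka.producer({ idempotent: true });

async function publishOrder(orderId: string, payload: object): Promise<void> {
  await producer.connect();
  try {
    await producer.send({
      topic: 'orders',
      acks: -1, // -1 == acks=all (required when idempotence is enabled)
      // kafkajs ships GZIP; LZ4 needs a plug-in codec.
      compression: CompressionTypes.GZIP,
      messages: [
        {
          key: orderId, // keyed so all events for one order stay ordered (see partition key guidance)
          value: JSON.stringify(payload),
        },
      ],
    });
  } finally {
    await producer.disconnect();
  }
}

publishOrder('order-2024-11-15-abc123', { status: 'created' }).catch(console.error);
```

For throughput-oriented producers, the same client would instead raise the batching and linger settings and drop back to acks=1, mirroring the "High Throughput" profile above.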
## Multi-Datacenter Patterns

### Active-Passive (Disaster Recovery)

```yaml
Architecture:
  Primary DC: Full Kafka cluster
  Secondary DC: Replica cluster (MirrorMaker 2)

Configuration:
  - Producers → Primary only
  - Consumers → Primary only
  - MirrorMaker 2: Primary → Secondary (async replication)

Failover:
  1. Detect primary failure
  2. Switch producers/consumers to secondary
  3. Promote secondary to primary

Recovery Time: 5-30 minutes (manual)
Data Loss: Possible (async replication lag)
```

### Active-Active (Geo-Replication)

```yaml
Architecture:
  DC1: Kafka cluster (region A)
  DC2: Kafka cluster (region B)
  Bidirectional replication via MirrorMaker 2

Configuration:
  - Producers → Nearest DC
  - Consumers → Nearest DC or both
  - Conflict resolution: Last-write-wins or custom

Challenges:
  - Duplicate messages (at-least-once delivery)
  - Ordering across DCs not guaranteed
  - Circular replication must be prevented

Use Cases:
  - Global applications
  - Regional compliance (GDPR)
  - Load distribution
```

### Stretch Cluster (Synchronous Replication)

```yaml
Architecture:
  Single Kafka cluster spanning 2 DCs
  Rack awareness: DC1 = rack1, DC2 = rack2

Configuration:
  replication.factor: 4    # 2 per DC
  min.insync.replicas: 3   # guarantees at least one acknowledgement from the remote DC
  acks: all

Requirements:
  - Low latency between DCs (<10ms)
  - High bandwidth link (10+ Gbps)
  - Dedicated fiber

Trade-offs:
  Pros: Synchronous replication, zero data loss
  Cons: Latency penalty, network dependency
```

## Monitoring & Observability

### Key Metrics

**Broker Metrics**:
```yaml
UnderReplicatedPartitions:
  Alert: > 0 for > 5 minutes
  Indicates: Replica lag or broker failure

OfflinePartitionsCount:
  Alert: > 0
  Indicates: No leader elected (critical!)

ActiveControllerCount:
  Alert: != 1 summed across the cluster (should be exactly 1)
  Indicates: Split brain or no controller

RequestHandlerAvgIdlePercent:
  Alert: < 20%
  Indicates: Broker CPU saturation
```

**Topic Metrics**:
```yaml
MessagesInPerSec:
  Monitor: Throughput trends
  Alert: Sudden drops (producer failure)

BytesInPerSec / BytesOutPerSec:
  Monitor: Network utilization
  Alert: Approaching NIC limits

RecordsLagMax (Consumer):
  Alert: > 10000 or growing
  Indicates: Consumers can't keep up
```

**Disk Metrics**:
```yaml
LogSegmentSize:
  Monitor: Disk usage trends
  Alert: > 80% capacity

LogFlushRateAndTimeMs:
  Monitor: Disk write latency
  Alert: > 100ms p99 (slow disk)
```

## Security Patterns

### Authentication & Authorization

```yaml
SASL/SCRAM-SHA-512:
  - Industry standard
  - Username/password authentication
  - Credentials stored in ZooKeeper (legacy) or the KRaft metadata log

ACLs (Access Control Lists):
  - Per-topic, per-group permissions
  - Operations: READ, WRITE, CREATE, DELETE, ALTER
  - Example:
      bin/kafka-acls.sh --add \
        --bootstrap-server localhost:9092 \
        --allow-principal User:alice \
        --operation READ \
        --topic orders

mTLS (Mutual TLS):
  - Certificate-based auth
  - Strong cryptographic identity
  - Best for service-to-service
```
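To show what the SASL/SCRAM pattern above looks like from the client side, here is a minimal sketch assuming kafkajs 2.x; the broker address, listener port, username, group ID, topic, and environment variable name are all placeholders.

```typescript
import { Kafka, logLevel } from 'kafkajs';

// SASL/SCRAM-SHA-512 over TLS; all identifiers are illustrative.
const kafka = new Kafka({
  clientId: 'orders-audit-service',
  brokers: ['broker-1.internal:9093'], // the TLS/SASL listener, not the plaintext port
  ssl: true,                           // SCRAM credentials should never travel in cleartext
  sasl: {
    mechanism: 'scram-sha-512',
    username: 'alice',
    password: process.env.KAFKA_PASSWORD ?? '',
  },
  logLevel: logLevel.INFO,
});

// With ACLs enabled, this consumer needs READ on both the topic and its consumer group.
const consumer = kafka.consumer({ groupId: 'orders-audit' });

async function run(): Promise<void> {
  await consumer.connect();
  await consumer.subscribe({ topics: ['orders'], fromBeginning: false });
  await consumer.run({
    eachMessage: async ({ topic, partition, message }) => {
      console.log(
        `${topic}[${partition}] key=${message.key?.toString()} value=${message.value?.toString()}`
      );
    },
  });
}

run().catch(console.error);
```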
## Integration with SpecWeave

**Automatic Architecture Detection**:
```typescript
import { ClusterSizingCalculator } from './lib/utils/sizing';

const calculator = new ClusterSizingCalculator();
const recommendation = calculator.calculate({
  throughputMBps: 200,
  retentionDays: 30,
  replicationFactor: 3,
  topicCount: 100
});

console.log(recommendation);
// {
//   brokers: 8,
//   partitionsPerBroker: 1500,
//   diskPerBroker: 6000,  // GB
//   ramPerBroker: 64      // GB
// }
```

**SpecWeave Commands**:
- `/specweave-kafka:deploy` - Validates cluster sizing before deployment
- `/specweave-kafka:monitor-setup` - Configures metrics for key indicators

## Related Skills

- `/specweave-kafka:kafka-mcp-integration` - MCP server setup
- `/specweave-kafka:kafka-cli-tools` - CLI operations

## External Links

- [Kafka Documentation - Architecture](https://kafka.apache.org/documentation/#design)
- [Confluent - Kafka Sizing](https://www.confluent.io/blog/how-to-choose-the-number-of-topics-partitions-in-a-kafka-cluster/)
- [KRaft Mode Overview](https://kafka.apache.org/documentation/#kraft)
- [LinkedIn Engineering - Kafka at Scale](https://engineering.linkedin.com/kafka/running-kafka-scale)