---
name: kafka-architecture
description: Expert knowledge of Apache Kafka architecture, cluster design, capacity planning, partitioning strategies, replication, and high availability. Auto-activates on keywords kafka architecture, cluster sizing, partition strategy, replication factor, kafka ha, kafka scalability, broker count, topic design, kafka performance, kafka capacity planning.
---

# Kafka Architecture & Design Expert

Comprehensive knowledge of Apache Kafka architecture patterns, cluster design principles, and production best practices for building resilient, scalable event streaming platforms.

## Core Architecture Concepts

### Kafka Cluster Components

**Brokers**:
- Individual Kafka servers that store and serve data
- Each broker can handle thousands of partitions
- Typical: 3-10 brokers per cluster (small), 10-100+ (large enterprises)

**Controller**:
- One broker elected as controller (via KRaft or ZooKeeper)
- Manages partition leaders and replica assignments
- Failure triggers automatic re-election

**Topics**:
- Logical channels for message streams
- Divided into partitions for parallelism
- Retention policies can differ per topic

**Partitions**:
- Ordered, immutable sequence of records
- Unit of parallelism (each partition is consumed by at most one consumer within a group)
- Distributed across brokers for load balancing

**Replicas**:
- Copies of partitions across multiple brokers
- 1 leader replica (serves reads and writes)
- N-1 follower replicas (replication only)
- In-Sync Replicas (ISR): followers fully caught up with the leader
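
These components can be inspected from any admin client. The sketch below (assuming the Node.js `kafkajs` client, a broker at `localhost:9092`, and an existing `orders` topic) lists the brokers, the elected controller, and each partition's leader, replica set, and ISR:

```typescript
// Inspection sketch (kafkajs assumed; broker address and topic name are placeholders).
import { Kafka } from 'kafkajs';

async function inspectCluster(): Promise<void> {
  const kafka = new Kafka({ clientId: 'arch-inspector', brokers: ['localhost:9092'] });
  const admin = kafka.admin();
  await admin.connect();

  // Brokers and the currently elected controller
  const { brokers, controller, clusterId } = await admin.describeCluster();
  console.log(`cluster ${clusterId}: ${brokers.length} brokers, controller = node ${controller}`);

  // Per-partition leader, replica set, and in-sync replicas for one topic
  const { topics } = await admin.fetchTopicMetadata({ topics: ['orders'] });
  for (const p of topics[0].partitions) {
    console.log(`partition ${p.partitionId}: leader=${p.leader} replicas=[${p.replicas}] isr=[${p.isr}]`);
  }

  await admin.disconnect();
}

inspectCluster().catch(console.error);
```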

### KRaft vs ZooKeeper Mode

**KRaft Mode** (Recommended, Kafka 3.3+):
```yaml
Cluster Metadata:
  - Stored in Kafka itself (no external ZooKeeper)
  - Metadata topic: __cluster_metadata
  - Controller quorum (3 or 5 nodes)
  - Faster failover (<1s vs 10-30s)
  - Simplified operations
```

**ZooKeeper Mode** (Legacy, deprecated since Kafka 3.5, removed in 4.0):
```yaml
External Coordination:
  - Requires separate ZooKeeper ensemble (3-5 nodes)
  - Stores cluster metadata, configs, ACLs
  - Slower failover (10-30 seconds)
  - More complex to operate
```

**Migration**: ZooKeeper → KRaft migration supported in Kafka 3.6+

## Cluster Sizing Guidelines

### Small Cluster (Development/Testing)

```yaml
Configuration:
  Brokers: 3
  Partitions per broker: ~100-500
  Total partitions: 300-1500
  Replication factor: 3

Hardware:
  - CPU: 4-8 cores
  - RAM: 8-16 GB
  - Disk: 500 GB - 1 TB SSD
  - Network: 1 Gbps

Use Cases:
  - Development environments
  - Low-volume production (<10 MB/s)
  - Proof of concepts
  - Single datacenter

Example Workload:
  - 50 topics
  - 5-10 partitions per topic
  - 1 million messages/day
  - 7-day retention
```

### Medium Cluster (Standard Production)

```yaml
Configuration:
  Brokers: 6-12
  Partitions per broker: 500-2000
  Total partitions: 3K-24K
  Replication factor: 3

Hardware:
  - CPU: 16-32 cores
  - RAM: 64-128 GB
  - Disk: 2-8 TB NVMe SSD
  - Network: 10 Gbps

Use Cases:
  - Standard production workloads
  - Multi-team environments
  - Regional deployments
  - Up to 500 MB/s throughput

Example Workload:
  - 200-500 topics
  - 10-50 partitions per topic
  - 100 million messages/day
  - 30-day retention
```

### Large Cluster (High-Scale Production)

```yaml
Configuration:
  Brokers: 20-100+
  Partitions per broker: 2000-4000
  Total partitions: 40K-400K+
  Replication factor: 3

Hardware:
  - CPU: 32-64 cores
  - RAM: 128-256 GB
  - Disk: 8-20 TB NVMe SSD
  - Network: 25-100 Gbps

Use Cases:
  - Large enterprises
  - Multi-region deployments
  - Event-driven architectures
  - 1+ GB/s throughput

Example Workload:
  - 1000+ topics
  - 50-200 partitions per topic
  - 1+ billion messages/day
  - 90-365 day retention
```

### Kafka Streams / Exactly-Once Semantics (EOS) Clusters

```yaml
Configuration:
  Brokers: 6-12+ (same as standard, but more control-plane load)
  Partitions per broker: 500-1500 (fewer due to transaction overhead)
  Total partitions: 3K-18K
  Replication factor: 3

Hardware:
  - CPU: 16-32 cores (extra CPU for transactions)
  - RAM: 64-128 GB
  - Disk: 4-12 TB NVMe SSD (extra space for transaction logs)
  - Network: 10-25 Gbps

Special Considerations:
  - More brokers due to transaction coordinator load
  - Lower partition count per broker (transactions add overhead)
  - Higher disk IOPS for transaction logs
  - min.insync.replicas=2 strongly recommended for durable EOS
  - acks=all required for producers

Use Cases:
  - Stream processing with exactly-once guarantees
  - Financial transactions
  - Event sourcing with strict ordering
  - Multi-step workflows requiring atomicity
```
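
On the client side, EOS producers must be idempotent, use `acks=all`, and wrap related writes in a transaction. A minimal sketch, assuming the Node.js `kafkajs` client and illustrative topic and transactional-id names:

```typescript
// Transactional producer sketch (kafkajs assumed; topic names and the
// transactional id are illustrative). Idempotence plus a stable transactional
// id gives exactly-once writes; the Kafka protocol requires acks=all for
// idempotent producers.
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'payments', brokers: ['localhost:9092'] });
const producer = kafka.producer({
  idempotent: true,
  transactionalId: 'payments-tx-1',
  maxInFlightRequests: 1,
});

async function transferAtomically(): Promise<void> {
  await producer.connect();
  const txn = await producer.transaction();
  try {
    // Both sends commit or abort together.
    await txn.send({ topic: 'payments.debits', messages: [{ key: 'acct-1', value: '-100' }] });
    await txn.send({ topic: 'payments.credits', messages: [{ key: 'acct-2', value: '+100' }] });
    await txn.commit();
  } catch (err) {
    await txn.abort();
    throw err;
  } finally {
    await producer.disconnect();
  }
}

transferAtomically().catch(console.error);
```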

## Partitioning Strategy

### How Many Partitions?

**Formula**:
```
Partitions = max(
  Target Throughput / Single Partition Throughput,
  Number of Consumers (for parallelism)
) × Future Growth Factor (2-3x)

Single Partition Limits:
  - Write throughput: ~10-50 MB/s
  - Read throughput: ~30-100 MB/s
  - Message rate: ~10K-100K msg/s
```

**Examples**:

**High Throughput Topic** (Logs, Events):
```yaml
Requirements:
  - Write: 200 MB/s
  - Read: 500 MB/s (multiple consumers)
  - Expected growth: 3x in 1 year

Calculation:
  Write partitions: 200 MB/s ÷ 20 MB/s = 10
  Read partitions: 500 MB/s ÷ 40 MB/s = 13
  Growth factor: 13 × 3 = 39

Recommendation: 40-50 partitions
```

**Low-Latency Topic** (Commands, Requests):
```yaml
Requirements:
  - Write: 5 MB/s
  - Read: 10 MB/s
  - Latency: <10ms p99
  - Order preservation: By user ID

Calculation:
  Throughput partitions: 5 MB/s ÷ 20 MB/s = 1
  Parallelism: 4 (headroom for consumer parallelism and redundancy)

Recommendation: 4-6 partitions (keyed by user ID)
```

**Dead Letter Queue**:
```yaml
Recommendation: 1-3 partitions
Reason: Low volume, ordering less important
```
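
The arithmetic above is straightforward to encode. A small plain-TypeScript sketch (the per-partition throughput figures are this document's conservative assumptions, not hard limits):

```typescript
// Partition-count estimate following the formula above.
interface SizingInput {
  writeMBps: number;     // expected produce throughput
  readMBps: number;      // aggregate consume throughput across consumer groups
  minConsumers: number;  // desired parallelism in the largest consumer group
  growthFactor: number;  // headroom, typically 2-3
}

function estimatePartitions(
  input: SizingInput,
  perPartitionWriteMBps = 20,
  perPartitionReadMBps = 40,
): number {
  const forWrites = Math.ceil(input.writeMBps / perPartitionWriteMBps);
  const forReads = Math.ceil(input.readMBps / perPartitionReadMBps);
  const base = Math.max(forWrites, forReads, input.minConsumers);
  return base * input.growthFactor;
}

// High-throughput example above: 200 MB/s in, 500 MB/s out, 3x growth → 39
console.log(estimatePartitions({ writeMBps: 200, readMBps: 500, minConsumers: 10, growthFactor: 3 }));
```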

### Partition Key Selection

**Good Keys** (High Cardinality, Even Distribution):
```yaml
✅ User ID (UUIDs):
  - Millions of unique values
  - Even distribution
  - Example: "user-123e4567-e89b-12d3-a456-426614174000"

✅ Device ID (IoT):
  - Unique per device
  - Natural sharding
  - Example: "device-sensor-001-zone-a"

✅ Order ID (E-commerce):
  - Unique per transaction
  - Even temporal distribution
  - Example: "order-2024-11-15-abc123"
```

**Bad Keys** (Low Cardinality, Hotspots):
```yaml
❌ Country Code:
  - Only ~200 values
  - Uneven (US, CN >> others)
  - Creates partition hotspots

❌ Boolean Flags:
  - Only 2 values (true/false)
  - Severe imbalance

❌ Date (YYYY-MM-DD):
  - All of today's traffic → 1 partition
  - Temporal hotspot
```

**Compound Keys** (Best of Both):
```yaml
✅ Country + User ID:
  - Country prefix groups related keys (custom partitioners can exploit it)
  - User ID supplies the cardinality for even distribution
  - Example: "US:user-123" → hash("US:user-123")

✅ Tenant + Event Type + Timestamp:
  - Multi-tenant isolation
  - Event type grouping
  - Temporal ordering
```
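
With the default partitioner, the partition is chosen by hashing the record key, so the compound key is built on the producer side. A brief sketch, assuming the Node.js `kafkajs` client and illustrative topic and field names:

```typescript
// Producing with a compound key (kafkajs assumed). The default partitioner
// hashes the key, so all events for one (country, user) pair land in the same
// partition while the high-cardinality user id keeps distribution even.
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'checkout', brokers: ['localhost:9092'] });
const producer = kafka.producer();

async function publishOrder(country: string, userId: string, payload: object): Promise<void> {
  await producer.connect();
  await producer.send({
    topic: 'orders',                 // assumed topic name
    messages: [{
      key: `${country}:${userId}`,   // e.g. "US:user-123e4567-..."
      value: JSON.stringify(payload),
    }],
  });
  await producer.disconnect();
}
```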

## Replication & High Availability

### Replication Factor Guidelines

```yaml
Development:
  Replication Factor: 1
  Reason: Fast, no durability needed

Production (Standard):
  Replication Factor: 3
  Reason: Balances durability vs cost
  Tolerates: 1 broker failure with no loss of write availability (min.insync.replicas=2); committed data survives 2 failures

Production (Critical):
  Replication Factor: 5
  Reason: Maximum durability
  Tolerates: 2 broker failures with no loss of write availability (min.insync.replicas=3); committed data survives 4 failures
  Use Cases: Financial transactions, audit logs

Multi-Datacenter:
  Replication Factor: 3 per DC (6 total)
  Reason: DC-level fault tolerance
  Requires: MirrorMaker 2 or Confluent Replicator
```

### min.insync.replicas

**Configuration**:
```yaml
min.insync.replicas=2:
  - At least 2 replicas must acknowledge writes
  - Typical for replication.factor=3
  - Prevents data loss if 1 broker fails

min.insync.replicas=1:
  - Only the leader must acknowledge (dangerous!)
  - Use only for non-critical topics

min.insync.replicas=3:
  - At least 3 replicas must acknowledge
  - For replication.factor=5 (critical systems)
```

**Rule**: `min.insync.replicas ≤ replication.factor - 1` (to keep the partition writable after 1 replica failure)
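
Replication factor is fixed when a topic is created, and `min.insync.replicas` is set as a topic (or broker default) config. A sketch of creating a topic with the standard production pairing, assuming the `kafkajs` admin client and an illustrative topic name:

```typescript
// Create a topic with replication.factor=3 and min.insync.replicas=2, the
// standard production pairing described above. (kafkajs admin client assumed.)
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'topic-admin', brokers: ['localhost:9092'] });

async function createCriticalTopic(): Promise<void> {
  const admin = kafka.admin();
  await admin.connect();
  await admin.createTopics({
    topics: [{
      topic: 'audit.events',          // assumed topic name
      numPartitions: 12,
      replicationFactor: 3,
      configEntries: [{ name: 'min.insync.replicas', value: '2' }],
    }],
  });
  await admin.disconnect();
}

createCriticalTopic().catch(console.error);
```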

### Rack Awareness

```yaml
Configuration:
  broker.rack=rack1   # Broker 1
  broker.rack=rack2   # Broker 2
  broker.rack=rack3   # Broker 3

Benefit:
  - Replicas spread across racks
  - Survives rack-level failures (power, network)
  - Example: Topic with RF=3 → 1 replica per rack

Placement:
  Leader: rack1
  Follower 1: rack2
  Follower 2: rack3
```

## Retention Strategies

### Time-Based Retention

```yaml
Short-Term (Events, Logs):
  retention.ms: 86400000       # 1 day
  Use Cases: Real-time analytics, monitoring

Medium-Term (Transactions):
  retention.ms: 604800000      # 7 days
  Use Cases: Standard business events

Long-Term (Audit, Compliance):
  retention.ms: 31536000000    # 365 days
  Use Cases: Regulatory requirements, event sourcing

Infinite (Event Sourcing):
  retention.ms: -1             # Forever
  cleanup.policy: compact
  Use Cases: Source of truth, state rebuilding
```

### Size-Based Retention

```yaml
retention.bytes: 10737418240   # 10 GB per partition

Combined (Time OR Size):
  retention.ms: 604800000        # 7 days
  retention.bytes: 107374182400  # 100 GB
  # Whichever limit is reached first triggers deletion
```

### Compaction (Log Compaction)

```yaml
cleanup.policy: compact

How It Works:
  - Keeps only the latest value per key
  - Deletes older versions
  - Only closed segments are compacted; the active segment keeps recent history

Use Cases:
  - Database changelogs (CDC)
  - User profile updates
  - Configuration management
  - State stores

Example:
  Before Compaction:
    user:123 → {name: "Alice", v:1}
    user:123 → {name: "Alice", v:2, email: "alice@ex.com"}
    user:123 → {name: "Alice A.", v:3}

  After Compaction:
    user:123 → {name: "Alice A.", v:3}   # Latest only
```
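
On a compacted topic, the latest record per key is retained, and producing a `null` value (a tombstone) eventually removes the key altogether. A brief sketch, assuming `kafkajs` and a topic already created with `cleanup.policy=compact`:

```typescript
// Producing to a compacted changelog topic (kafkajs assumed; the topic is
// presumed to exist with cleanup.policy=compact). The last value per key wins;
// a null value is a tombstone that eventually deletes the key.
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'profiles', brokers: ['localhost:9092'] });
const producer = kafka.producer();

async function updateProfile(): Promise<void> {
  await producer.connect();
  // Latest value for the key; earlier versions are removed by compaction.
  await producer.send({
    topic: 'user-profiles.changelog',
    messages: [{ key: 'user:123', value: JSON.stringify({ name: 'Alice A.', v: 3 }) }],
  });
  // Tombstone: marks user:123 for deletion once compaction runs.
  await producer.send({
    topic: 'user-profiles.changelog',
    messages: [{ key: 'user:123', value: null }],
  });
  await producer.disconnect();
}

updateProfile().catch(console.error);
```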

## Performance Optimization

### Broker Configuration

```yaml
# Network threads (handle client connections)
num.network.threads: 8          # Increase for high connection count

# I/O threads (disk operations)
num.io.threads: 16              # Set to number of disks × 2

# Replica fetcher threads
num.replica.fetchers: 4         # Increase for many partitions

# Socket buffer sizes
socket.send.buffer.bytes: 1048576      # 1 MB
socket.receive.buffer.bytes: 1048576   # 1 MB

# Log flush (default: the OS handles flushing)
log.flush.interval.messages: 10000     # Flush every 10K messages
log.flush.interval.ms: 1000            # Or every 1 second
```

### Producer Optimization

```yaml
High Throughput:
  batch.size: 65536        # 64 KB
  linger.ms: 100           # Wait up to 100ms for batching
  compression.type: lz4    # Fast compression
  acks: 1                  # Leader only

Low Latency:
  batch.size: 16384        # 16 KB (default)
  linger.ms: 0             # Send immediately
  compression.type: none
  acks: 1

Durability (Exactly-Once):
  batch.size: 16384
  linger.ms: 10
  compression.type: lz4
  acks: all
  enable.idempotence: true
  transactional.id: "producer-1"
```

### Consumer Optimization

```yaml
High Throughput:
  fetch.min.bytes: 1048576   # 1 MB
  fetch.max.wait.ms: 500     # Wait up to 500ms to accumulate data

Low Latency:
  fetch.min.bytes: 1         # Fetch immediately
  fetch.max.wait.ms: 100     # Short wait

Max Parallelism:
  # Deploy as many consumers as there are partitions
  # More consumers than partitions = idle consumers
```

## Multi-Datacenter Patterns

### Active-Passive (Disaster Recovery)

```yaml
Architecture:
  Primary DC: Full Kafka cluster
  Secondary DC: Replica cluster (MirrorMaker 2)

Configuration:
  - Producers → Primary only
  - Consumers → Primary only
  - MirrorMaker 2: Primary → Secondary (async replication)

Failover:
  1. Detect primary failure
  2. Switch producers/consumers to secondary
  3. Promote secondary to primary

Recovery Time: 5-30 minutes (manual)
Data Loss: Potential (async replication lag)
```

### Active-Active (Geo-Replication)

```yaml
Architecture:
  DC1: Kafka cluster (region A)
  DC2: Kafka cluster (region B)
  Bidirectional replication via MirrorMaker 2

Configuration:
  - Producers → Nearest DC
  - Consumers → Nearest DC or both
  - Conflict resolution: Last-write-wins or custom

Challenges:
  - Duplicate messages (at-least-once delivery)
  - Ordering across DCs not guaranteed
  - Circular replication prevention

Use Cases:
  - Global applications
  - Regional compliance (GDPR)
  - Load distribution
```

### Stretch Cluster (Synchronous Replication)

```yaml
Architecture:
  Single Kafka cluster spanning 2 DCs
  Rack awareness: DC1 = rack1, DC2 = rack2

Configuration:
  min.insync.replicas: 2
  replication.factor: 4 (2 per DC)
  acks: all

Requirements:
  - Low latency between DCs (<10ms)
  - High-bandwidth link (10+ Gbps)
  - Dedicated fiber

Trade-offs:
  Pros: Synchronous replication, zero data loss
  Cons: Latency penalty, network dependency
```

## Monitoring & Observability

### Key Metrics

**Broker Metrics**:
```yaml
UnderReplicatedPartitions:
  Alert: > 0 for > 5 minutes
  Indicates: Replica lag or broker failure

OfflinePartitionsCount:
  Alert: > 0
  Indicates: No leader elected (critical!)

ActiveControllerCount:
  Alert: != 1 (should be exactly 1)
  Indicates: Split brain or no controller

RequestHandlerAvgIdlePercent:
  Alert: < 20%
  Indicates: Broker CPU saturation
```

**Topic Metrics**:
```yaml
MessagesInPerSec:
  Monitor: Throughput trends
  Alert: Sudden drops (producer failure)

BytesInPerSec / BytesOutPerSec:
  Monitor: Network utilization
  Alert: Approaching NIC limits

RecordsLagMax (Consumer):
  Alert: > 10000 or growing
  Indicates: Consumer can't keep up
```
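
Consumer lag can also be spot-checked programmatically: lag is the log end offset (high watermark) minus the group's committed offset, per partition. A sketch assuming a recent `kafkajs` version (2.x, where `fetchOffsets` accepts `{ groupId, topics }`) and illustrative group/topic names:

```typescript
// Consumer-lag spot check: lag = high watermark - committed offset per partition.
// (kafkajs admin API assumed; group and topic names are illustrative.)
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'lag-checker', brokers: ['localhost:9092'] });

async function checkLag(groupId: string, topic: string): Promise<void> {
  const admin = kafka.admin();
  await admin.connect();

  const endOffsets = await admin.fetchTopicOffsets(topic);                 // high watermarks
  const [committed] = await admin.fetchOffsets({ groupId, topics: [topic] });

  for (const { partition, offset } of committed.partitions) {
    const end = endOffsets.find((p) => p.partition === partition);
    const lag = Number(end?.high ?? 0) - Number(offset);
    console.log(`${topic}[${partition}] lag=${lag}`);
    // Alert if lag exceeds your threshold (e.g. > 10000) or keeps growing.
  }

  await admin.disconnect();
}

checkLag('orders-consumer', 'orders').catch(console.error);
```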

**Disk Metrics**:
```yaml
LogSegmentSize:
  Monitor: Disk usage trends
  Alert: > 80% capacity

LogFlushRateAndTimeMs:
  Monitor: Disk write latency
  Alert: > 100ms p99 (slow disk)
```

## Security Patterns

### Authentication & Authorization

```yaml
SASL/SCRAM-SHA-512:
  - Industry standard
  - Username/password authentication
  - Credentials stored in ZooKeeper or the KRaft metadata log

ACLs (Access Control Lists):
  - Per-topic, per-group permissions
  - Operations: READ, WRITE, CREATE, DELETE, ALTER
  - Example:
      bin/kafka-acls.sh --add \
        --bootstrap-server localhost:9092 \
        --allow-principal User:alice \
        --operation READ \
        --topic orders

mTLS (Mutual TLS):
  - Certificate-based authentication
  - Strong cryptographic identity
  - Best for service-to-service
```
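
On the client side, SASL/SCRAM over TLS is configured at connection time. A minimal sketch, assuming the `kafkajs` client; the broker address and credentials are placeholders:

```typescript
// Client connection using SASL/SCRAM-SHA-512 over TLS (kafkajs assumed;
// broker address and credentials are placeholders -- load secrets from a
// secret manager rather than hard-coding them).
import { Kafka, logLevel } from 'kafkajs';

const kafka = new Kafka({
  clientId: 'orders-service',
  brokers: ['kafka-1.internal:9093'],
  ssl: true,                          // TLS for encryption in transit
  sasl: {
    mechanism: 'scram-sha-512',
    username: process.env.KAFKA_USERNAME ?? 'alice',
    password: process.env.KAFKA_PASSWORD ?? '',
  },
  logLevel: logLevel.INFO,
});

// Any producer/consumer created from this client authenticates as User:alice,
// so the ACL example above (READ on topic "orders") applies to it.
const consumer = kafka.consumer({ groupId: 'orders-service' });
```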

## Integration with SpecWeave

**Automatic Architecture Detection**:
```typescript
import { ClusterSizingCalculator } from './lib/utils/sizing';

const calculator = new ClusterSizingCalculator();
const recommendation = calculator.calculate({
  throughputMBps: 200,
  retentionDays: 30,
  replicationFactor: 3,
  topicCount: 100
});

console.log(recommendation);
// {
//   brokers: 8,
//   partitionsPerBroker: 1500,
//   diskPerBroker: 6000 GB,
//   ramPerBroker: 64 GB
// }
```

**SpecWeave Commands**:
- `/specweave-kafka:deploy` - Validates cluster sizing before deployment
- `/specweave-kafka:monitor-setup` - Configures metrics for key indicators

## Related Skills

- `/specweave-kafka:kafka-mcp-integration` - MCP server setup
- `/specweave-kafka:kafka-cli-tools` - CLI operations

## External Links

- [Kafka Documentation - Architecture](https://kafka.apache.org/documentation/#design)
- [Confluent - Kafka Sizing](https://www.confluent.io/blog/how-to-choose-the-number-of-topics-partitions-in-a-kafka-cluster/)
- [KRaft Mode Overview](https://kafka.apache.org/documentation/#kraft)
- [LinkedIn Engineering - Kafka at Scale](https://engineering.linkedin.com/kafka/running-kafka-scale)