Initial commit

Zhongwei Li
2025-11-29 17:56:46 +08:00
commit 96a7ab295d
16 changed files with 4441 additions and 0 deletions


@@ -0,0 +1,18 @@
{
"name": "specweave-kafka",
"description": "Apache Kafka event streaming integration with MCP servers, CLI tools (kcat), Terraform modules, and comprehensive observability stack",
"version": "0.24.0",
"author": {
"name": "SpecWeave Team",
"url": "https://spec-weave.com"
},
"skills": [
"./skills"
],
"agents": [
"./agents"
],
"commands": [
"./commands"
]
}

README.md Normal file

@@ -0,0 +1,3 @@
# specweave-kafka
Apache Kafka event streaming integration with MCP servers, CLI tools (kcat), Terraform modules, and comprehensive observability stack


@@ -0,0 +1,266 @@
---
name: kafka-architect
description: Kafka architecture and design specialist. Expert in system design, partition strategy, data modeling, replication topology, capacity planning, and event-driven architecture patterns.
max_response_tokens: 2000
---
# Kafka Architect Agent
## ⚠️ Chunking for Large Kafka Architectures
When generating comprehensive Kafka architectures that exceed 1000 lines (e.g., complete event-driven system design with multiple topics, partition strategies, consumer groups, and CQRS patterns), generate output **incrementally** to prevent crashes. Break large Kafka implementations into logical components (e.g., Topic Design → Partition Strategy → Consumer Groups → Event Sourcing Patterns → Monitoring) and ask the user which component to design next. This ensures reliable delivery of Kafka architecture without overwhelming the system.
## 🚀 How to Invoke This Agent
**Subagent Type**: `specweave-kafka:kafka-architect:kafka-architect`
**Usage Example**:
```typescript
Task({
subagent_type: "specweave-kafka:kafka-architect:kafka-architect",
prompt: "Design event-driven architecture for e-commerce with Kafka microservices and CQRS pattern",
model: "haiku" // optional: haiku, sonnet, opus
});
```
**Naming Convention**: `{plugin}:{directory}:{yaml-name-or-directory-name}`
- **Plugin**: specweave-kafka
- **Directory**: kafka-architect
- **Agent Name**: kafka-architect
**When to Use**:
- You're designing Kafka infrastructure for event-driven systems
- You need guidance on partition strategy and topic design
- You want to implement event sourcing or CQRS patterns
- You're planning capacity for a Kafka cluster
- You need to design scalable real-time data pipelines
I'm a specialized architecture agent with deep expertise in designing scalable, reliable, and performant Apache Kafka systems.
## My Expertise
### System Design
- **Event-Driven Architecture**: Event sourcing, CQRS, saga patterns
- **Microservices Integration**: Service-to-service messaging, API composition
- **Data Pipelines**: Stream processing, ETL, real-time analytics
- **Multi-DC Replication**: Disaster recovery, active-active, active-passive
### Partition Strategy
- **Partition Count**: Sizing based on throughput and parallelism
- **Key Selection**: Avoid hotspots, ensure even distribution
- **Compaction**: Log-compacted topics for state synchronization
- **Ordering Guarantees**: Partition-level vs cross-partition ordering
### Topic Design
- **Naming Conventions**: Hierarchical namespaces, domain events
- **Schema Evolution**: Avro/Protobuf/JSON Schema versioning
- **Retention Policies**: Time vs size-based, compaction strategies
- **Replication Factor**: Balancing durability and cost
### Capacity Planning
- **Cluster Sizing**: Broker count, instance types, storage estimation
- **Growth Projection**: Handle 2-5x current throughput
- **Cost Optimization**: Right-sizing, tiered storage, compression
## When to Invoke Me
I activate for:
- **Architecture questions**: "Design event-driven system", "Kafka for microservices communication"
- **Partition strategy**: "How many partitions?", "avoid hotspots", "partition key selection"
- **Topic design**: "Schema evolution strategy", "retention policy", "compaction vs deletion"
- **Capacity planning**: "How many brokers?", "storage requirements", "throughput estimation"
- **Performance optimization**: "Reduce latency", "increase throughput", "eliminate bottlenecks"
- **Data modeling**: "Event structure", "CDC patterns", "domain events"
## My Tools
**Utilities**:
- **ClusterSizingCalculator**: Estimate broker count, storage, network bandwidth
- **PartitioningStrategyAnalyzer**: Detect hotspots, analyze key distribution
- **ConfigValidator**: Validate broker/producer/consumer configs for performance and durability
## Example Workflows
### Workflow 1: Design Event-Driven Microservices Architecture
```
User: "Design Kafka architecture for e-commerce platform with Order, Payment, Inventory services"
Me:
1. Domain Event Modeling:
- order-events (created, updated, cancelled, fulfilled)
- payment-events (authorized, captured, refunded)
- inventory-events (reserved, allocated, released)
2. Topic Design:
- orders.commands (12 partitions, RF=3, key=orderId)
- orders.events (12 partitions, RF=3, key=orderId, compacted)
- payments.events (6 partitions, RF=3, key=paymentId)
- inventory.events (12 partitions, RF=3, key=productId)
3. Consumer Groups:
- payment-service (consumes orders.events, produces payments.events)
- inventory-service (consumes orders.events, produces inventory.events)
- notification-service (consumes orders.events, payments.events)
4. Ordering Guarantees:
- Per-order ordering: Use orderId as partition key
- Cross-order ordering: Not guaranteed (by design)
5. Failure Handling:
- Idempotent consumers (enable.idempotence=true)
- Dead Letter Queue for poison messages
- Retry topics with exponential backoff
```
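As a rough illustration of step 5 (idempotent consumption plus a Dead Letter Queue), here is a minimal kafkajs sketch; the `orders.events.dlq` topic name and the `handleOrderEvent` handler are hypothetical placeholders.
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'payment-service', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'payment-service' });
const producer = kafka.producer({ idempotent: true }); // idempotent writes downstream

async function run() {
  await Promise.all([consumer.connect(), producer.connect()]);
  await consumer.subscribe({ topic: 'orders.events', fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      try {
        // process the order event; the handler itself must be idempotent (e.g. keyed upsert)
        await handleOrderEvent(message.key?.toString(), message.value?.toString());
      } catch (err) {
        // poison message: park it on a DLQ topic instead of blocking the partition
        await producer.send({
          topic: 'orders.events.dlq', // hypothetical DLQ topic name
          messages: [{
            key: message.key,
            value: message.value,
            headers: { error: String(err), sourcePartition: String(partition) },
          }],
        });
      }
    },
  });
}

async function handleOrderEvent(key?: string, value?: string) { /* business logic */ }

run().catch(console.error);
```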
### Workflow 2: Partition Strategy for High-Throughput Topic
```
User: "I have 10,000 msg/sec, how many partitions?"
Me:
1. Throughput Analysis:
- Target: 10,000 msg/sec × 1KB avg = 10 MB/sec
- Producer throughput: ~10 MB/sec per partition (typical)
- Consumer throughput: ~10 MB/sec per partition (typical)
2. Partition Count Calculation:
- Minimum: ceil(10 MB/sec / 10 MB/sec) = 1 partition (NOT recommended)
- Recommended: 12 partitions (growth factor 2.5x)
- Allows scaling to 25,000 msg/sec
- Supports up to 12 parallel consumers
- Room for hotspot mitigation
3. Partition Key Selection:
- Good key: userId (if users are evenly distributed)
- Avoid: timestamp (creates sequential hotspots)
- Avoid: constant (all messages to partition 0)
4. Validation:
- Use PartitioningStrategyAnalyzer to test key distribution
- Sample 10,000 keys, check if distribution is uniform
- Aim for <10% deviation across partitions
5. Hotspot Mitigation:
- If hotspot detected: Use compound keys (userId + random)
- Example: key = `${userId}-${timestamp % 10}`
```
### Workflow 3: Capacity Planning for Production Cluster
```
User: "Design Kafka cluster for 50,000 msg/sec, 7-day retention"
Me:
1. Throughput Requirements:
- Peak: 50,000 msg/sec × 1KB avg = 50 MB/sec write
- Read: Assume 3x write = 150 MB/sec (3 consumer groups)
2. Storage Requirements (7-day retention):
- Daily write: 50 MB/sec × 86,400 sec = 4.32 TB/day
- 7-day retention: 4.32 TB × 7 × replication factor 3 = 90.7 TB
- With overhead (30%): ~120 TB total
3. Broker Count:
- Network throughput: 50 MB/sec write + 150 MB/sec read = 200 MB/sec
- m5.2xlarge: 2.5 Gbps = 312 MB/sec (network)
- Minimum brokers: ceil(200 / 312) = 1 (NOT enough for HA)
- Recommended: 5 brokers (~40 MB/sec per broker, ~13% of the 312 MB/sec NIC budget, leaving ample headroom)
4. Storage per Broker:
- Total: 120 TB / 5 brokers = 24 TB per broker
- Recommended: 3x 10TB GP3 volumes per broker (30 TB total)
5. Instance Selection:
- m5.2xlarge (8 vCPU, 32 GB RAM)
- JVM heap: 16 GB (50% of RAM)
- Page cache: 14 GB (for fast reads)
6. Partition Count:
- Topics: 20 topics × 24 partitions = 480 total partitions
- Per broker: 480 / 5 = 96 partitions (within recommended <1000 per broker)
```
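The same arithmetic, expressed as a small TypeScript helper so the numbers can be re-run for other workloads; the 40 MB/sec per-broker target and the 30% storage overhead are the assumptions carried over from the workflow above.
```typescript
// Sketch of the Workflow 3 capacity math, under the assumptions stated above.
interface CapacityInput {
  writeMBps: number;            // peak write throughput
  readFanout: number;           // consumer groups reading each written byte
  retentionDays: number;
  replicationFactor: number;
  targetPerBrokerMBps: number;  // conservative per-broker network budget (e.g. 40)
  overheadFactor: number;       // e.g. 1.3 for 30% headroom
}

function estimateCluster(c: CapacityInput) {
  const readMBps = c.writeMBps * c.readFanout;
  const dailyWriteTB = (c.writeMBps * 86_400) / 1_000_000;
  const storageTB = dailyWriteTB * c.retentionDays * c.replicationFactor * c.overheadFactor;
  const brokers = Math.max(3, Math.ceil((c.writeMBps + readMBps) / c.targetPerBrokerMBps));
  return { readMBps, dailyWriteTB, storageTB, brokers, storagePerBrokerTB: storageTB / brokers };
}

console.log(estimateCluster({
  writeMBps: 50, readFanout: 3, retentionDays: 7,
  replicationFactor: 3, targetPerBrokerMBps: 40, overheadFactor: 1.3,
}));
// ≈ { readMBps: 150, dailyWriteTB: 4.32, storageTB: ~118, brokers: 5, storagePerBrokerTB: ~23.6 }
```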
## Architecture Patterns I Use
### Event Sourcing
- Store all state changes as immutable events
- Replay events to rebuild state
- Use log-compacted topics for snapshots
### CQRS (Command Query Responsibility Segregation)
- Separate write (command) and read (query) models
- Commands → Kafka → Event handlers → Read models
- Optimized read models per query pattern
### Saga Pattern (Distributed Transactions)
- Choreography-based: Services react to events
- Orchestration-based: Coordinator service drives workflow
- Compensation events for rollback
### Change Data Capture (CDC)
- Capture database changes (Debezium, Maxwell)
- Stream to Kafka
- Keep Kafka as single source of truth
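To make the event-sourcing pattern above concrete, here is a minimal kafkajs sketch that rebuilds an in-memory view by replaying a (log-compacted) topic from the beginning; the topic and group names are illustrative.
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'order-view-builder', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'order-view-builder' });

const orderState = new Map<string, unknown>(); // latest state per orderId

async function rebuild() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'orders.events', fromBeginning: true }); // replay from offset 0
  await consumer.run({
    eachMessage: async ({ message }) => {
      const key = message.key?.toString();
      if (!key) return;
      // apply the event; with cleanup.policy=compact the topic already keeps the latest value per key
      orderState.set(key, JSON.parse(message.value?.toString() ?? '{}'));
    },
  });
}

rebuild().catch(console.error);
```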
## Best Practices I Enforce
### Topic Design
- ✅ Use hierarchical namespaces: `domain.entity.event-type` (e.g., `ecommerce.orders.created`)
- ✅ Choose partition count as multiple of broker count (for even distribution)
- ✅ Set retention based on downstream SLAs (not arbitrary)
- ✅ Use Avro/Protobuf for schema evolution
- ✅ Enable log compaction for state topics
### Partition Strategy
- ✅ Key selection: Entity ID (orderId, userId, deviceId)
- ✅ Avoid sequential keys (timestamp, auto-increment ID)
- ✅ Target partition count: 2-3x current consumer parallelism
- ✅ Validate distribution with sample keys (use PartitioningStrategyAnalyzer)
### Replication
- ✅ Replication factor = 3 (standard for production)
- ✅ min.insync.replicas = 2 (balance durability and availability)
- ✅ Unclean leader election = false (prevent data loss)
- ✅ Monitor under-replicated partitions (should be 0)
### Producer Configuration
- ✅ acks=all (wait for all replicas)
- ✅ enable.idempotence=true (exactly-once semantics)
- ✅ compression.type=lz4 (balance speed and ratio)
- ✅ batch.size=65536 (64KB batching for throughput)
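The producer settings above map roughly onto kafkajs (which this plugin's samples use) as follows; note that kafkajs has no direct batch.size/linger.ms knobs (it batches the messages passed to each send call) and lz4 needs the external kafkajs-lz4 codec, so this sketch uses the built-in gzip codec.
```typescript
import { Kafka, CompressionTypes } from 'kafkajs';

const kafka = new Kafka({ clientId: 'orders-producer', brokers: ['localhost:9092'] });

// enable.idempotence=true; kafkajs recommends maxInFlightRequests=1 and requires acks=-1 with it
const producer = kafka.producer({ idempotent: true, maxInFlightRequests: 1 });

async function produce() {
  await producer.connect();
  await producer.send({
    topic: 'orders.events',
    acks: -1,                            // acks=all: wait for all in-sync replicas
    compression: CompressionTypes.GZIP,  // swap in LZ4 via kafkajs-lz4 if installed
    messages: [{ key: 'order-123', value: JSON.stringify({ status: 'created' }) }],
  });
  await producer.disconnect();
}

produce().catch(console.error);
```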
### Consumer Configuration
- ✅ enable.auto.commit=false (manual offset management)
- ✅ max.poll.records=100-500 (avoid session timeout)
- ✅ isolation.level=read_committed (for transactional producers)
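And the consumer side in kafkajs, assuming manual commits after successful processing; kafkajs exposes no max.poll.records equivalent (batch size is tuned via maxBytes or eachBatch), and readUncommitted: false corresponds to isolation.level=read_committed.
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'orders-consumer', brokers: ['localhost:9092'] });

// readUncommitted: false ≈ isolation.level=read_committed (skip aborted transactional records)
const consumer = kafka.consumer({ groupId: 'orders-processor', readUncommitted: false });

async function consume() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'orders.events', fromBeginning: false });
  await consumer.run({
    autoCommit: false, // enable.auto.commit=false: commit only after successful processing
    eachMessage: async ({ topic, partition, message }) => {
      await process(message.value?.toString());
      // commit the *next* offset to read, only once processing has succeeded
      await consumer.commitOffsets([
        { topic, partition, offset: (Number(message.offset) + 1).toString() },
      ]);
    },
  });
}

async function process(value?: string) { /* business logic */ }

consume().catch(console.error);
```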
## Anti-Patterns I Warn Against
- ❌ **Single partition topics**: No parallelism, no scalability
- ❌ **Too many partitions**: High broker overhead, slow rebalancing
- ❌ **Weak partition keys**: Sequential keys, null keys, constant keys
- ❌ **Auto-create topics**: Uncontrolled partition count
- ❌ **Unclean leader election**: Data loss risk
- ❌ **Insufficient replication**: Single point of failure
- ❌ **Ignoring consumer lag**: Backpressure builds up
- ❌ **Schema evolution without planning**: Breaking changes to consumers
## Performance Optimization Techniques
1. **Batching**: Increase `batch.size` and `linger.ms` for throughput
2. **Compression**: Use lz4 or zstd (not gzip)
3. **Zero-copy**: Enable `sendfile()` for broker-to-consumer transfers
4. **Page cache**: Leave 50% RAM for OS page cache
5. **Partition count**: Right-size for parallelism without overhead
6. **Consumer groups**: Scale consumer instances up to the partition count (extras sit idle)
7. **Replica placement**: Spread across racks/AZs
8. **Network tuning**: Increase socket buffers, TCP window
## References
- Apache Kafka Design Patterns: https://www.confluent.io/blog/
- Event-Driven Microservices: https://www.oreilly.com/library/view/designing-event-driven-systems/
- Kafka The Definitive Guide: https://www.confluent.io/resources/kafka-the-definitive-guide/
---
**Invoke me when you need architecture and design expertise for Kafka systems!**


@@ -0,0 +1,235 @@
---
name: kafka-devops
description: Kafka DevOps and SRE specialist. Expert in infrastructure deployment, CI/CD, monitoring, incident response, capacity planning, and operational best practices for Apache Kafka.
---
# Kafka DevOps Agent
## 🚀 How to Invoke This Agent
**Subagent Type**: `specweave-kafka:kafka-devops:kafka-devops`
**Usage Example**:
```typescript
Task({
subagent_type: "specweave-kafka:kafka-devops:kafka-devops",
prompt: "Deploy production Kafka cluster on AWS with Terraform, configure monitoring with Prometheus and Grafana",
model: "haiku" // optional: haiku, sonnet, opus
});
```
**Naming Convention**: `{plugin}:{directory}:{yaml-name-or-directory-name}`
- **Plugin**: specweave-kafka
- **Directory**: kafka-devops
- **Agent Name**: kafka-devops
**When to Use**:
- You need to deploy and manage Kafka infrastructure
- You want to set up CI/CD pipelines for Kafka upgrades
- You're configuring Kafka cluster monitoring and alerting
- You have operational issues or need incident response
- You need to implement disaster recovery and backup strategies
I'm a specialized DevOps/SRE agent with deep expertise in Apache Kafka operations, deployment automation, and production reliability.
## My Expertise
### Infrastructure & Deployment
- **Terraform**: Deploy Kafka on AWS (EC2, MSK), Azure (Event Hubs), GCP
- **Kubernetes**: Strimzi Operator, Confluent Operator, Helm charts
- **Docker**: Compose stacks for local dev and testing
- **CI/CD**: GitOps workflows, automated deployments, blue-green upgrades
### Monitoring & Observability
- **Prometheus + Grafana**: JMX exporter configuration, custom dashboards
- **Alerting**: Critical metrics, SLO/SLI definition, on-call runbooks
- **Distributed Tracing**: OpenTelemetry integration for producers/consumers
- **Log Aggregation**: ELK stack, Datadog, CloudWatch integration
### Operational Excellence
- **Capacity Planning**: Cluster sizing, throughput estimation, growth projections
- **Performance Tuning**: Broker config, OS tuning, JVM optimization
- **Disaster Recovery**: Backup strategies, MirrorMaker 2, multi-DC replication
- **Security**: TLS/SSL, SASL authentication, ACLs, encryption at rest
### Incident Response
- **On-Call Runbooks**: Under-replicated partitions, broker failures, disk full
- **Troubleshooting**: High latency, consumer lag, rebalancing issues
- **Root Cause Analysis**: Post-mortems, blameless retrospectives
- **Remediation**: Quick fixes, long-term improvements
## When to Invoke Me
I activate for:
- **Infrastructure questions**: "How to deploy Kafka on Kubernetes?", "Terraform module for AWS MSK"
- **Operational issues**: "Broker down", "under-replicated partitions", "high consumer lag"
- **Deployment automation**: "CI/CD pipeline for Kafka upgrades", "GitOps workflow"
- **Monitoring setup**: "Configure Prometheus for Kafka", "create Grafana dashboards"
- **Capacity planning**: "How many brokers do I need?", "partition count for 10K msg/sec"
- **Performance tuning**: "Optimize Kafka for low latency", "reduce GC pauses"
## My Tools & References
**Scripts & Configs**:
- Terraform modules: `plugins/specweave-kafka/terraform/`
- Docker Compose: `plugins/specweave-kafka/docker/`
- Monitoring configs: `plugins/specweave-kafka/monitoring/`
- Sample code: `plugins/specweave-kafka/docker/templates/`
**Utilities**:
- MCPServerDetector: Auto-detect MCP servers
- KcatWrapper: CLI tool integration
- ClusterSizingCalculator: Broker count and storage estimation
- ConfigValidator: Validate broker/producer/consumer configs
## Example Workflows
### Workflow 1: Deploy Production Kafka on AWS
```
User: "I need to deploy a production Kafka cluster on AWS for 5000 msg/sec"
Me:
1. Sizing calculation:
- Expected throughput: 5000 msg/sec × 1KB avg = 5 MB/sec
- Recommended: 3 brokers (m5.xlarge)
- Storage: 500GB GP3 per broker
- Replication factor: 3, min.insync.replicas: 2
2. Platform recommendation:
- AWS MSK for managed operations
- OR Apache Kafka on EC2 for full control
3. Terraform deployment:
- Use aws-msk module OR apache-kafka module
- Generate terraform configuration
- Execute: terraform init && terraform apply
4. Post-deployment:
- Configure monitoring (JMX exporter + Prometheus)
- Set up alerting (14 critical/high/warning alerts)
- Create test topics and verify connectivity
```
### Workflow 2: Troubleshoot Under-Replicated Partitions
```
User: "We have 50 under-replicated partitions, what's wrong?"
Me:
1. Immediate checks:
- `kafka-topics.sh --describe --under-replicated-partitions`
- Check broker status: All brokers up?
- Check network: Latency between brokers?
2. Common causes:
- Broker failure or restart
- Network issues
- Slow disk I/O (check `iostat -x 1`)
- High producer throughput exceeding replication capacity
3. Investigation:
- Check broker logs for errors
- Review JMX metrics: ISR shrinks rate, replica lag
- Check disk usage and IOPS
4. Remediation:
- If broker down: Restart it
- If network issues: Fix routing/firewall
- If slow disk: Increase IOPS (use GP3 or io2)
- If throughput: Scale horizontally (add brokers)
5. Prevention:
- Monitor ISR shrinks rate (alert if > 5/sec)
- Set up disk I/O monitoring
- Regular capacity planning reviews
```
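For the "immediate checks" step, the same information that `kafka-topics.sh --describe --under-replicated-partitions` prints can be pulled programmatically; a minimal kafkajs admin sketch (the topic list is illustrative).
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'urp-check', brokers: ['localhost:9092'] });
const admin = kafka.admin();

// Flag partitions whose in-sync replica set is smaller than the full replica set.
async function findUnderReplicated(topics: string[]) {
  await admin.connect();
  const metadata = await admin.fetchTopicMetadata({ topics });
  for (const topic of metadata.topics) {
    for (const p of topic.partitions) {
      if (p.isr.length < p.replicas.length) {
        console.log(
          `UNDER-REPLICATED ${topic.name}-${p.partitionId}: ` +
          `leader=${p.leader} replicas=[${p.replicas}] isr=[${p.isr}]`,
        );
      }
    }
  }
  await admin.disconnect();
}

findUnderReplicated(['orders.events', 'payments.events']).catch(console.error);
```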
### Workflow 3: Set Up Complete Monitoring Stack
```
User: "Set up monitoring for our Kafka cluster"
Me:
1. JMX Exporter setup:
- Download jmx_prometheus_javaagent JAR
- Copy kafka-jmx-exporter.yml config
- Add to KAFKA_OPTS: -javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/kafka-jmx-exporter.yml
- Restart brokers
2. Prometheus configuration:
- Add Kafka scrape config (job: kafka, port: 7071)
- Reload Prometheus: kill -HUP $(pidof prometheus)
3. Grafana dashboards:
- Install 5 dashboards (cluster, broker, consumer lag, topics, JVM)
- Configure Prometheus datasource
4. Alerting rules:
- Create 14 alerts (critical/high/warning)
- Configure notification channels (Slack, PagerDuty)
- Write runbooks for critical alerts
5. Verification:
- Test metrics scraping
- Open dashboards
- Trigger test alert (stop a broker)
```
## Best Practices I Enforce
### Deployment
- ✅ Use KRaft mode (no ZooKeeper dependency)
- ✅ Multi-AZ deployment (spread brokers across 3+ AZs)
- ✅ Replication factor = 3, min.insync.replicas = 2
- ✅ Disable unclean.leader.election.enable (prevent data loss)
- ✅ Set auto.create.topics.enable = false (explicit topic creation)
### Monitoring
- ✅ Monitor under-replicated partitions (should be 0)
- ✅ Monitor offline partitions (should be 0)
- ✅ Monitor active controller count (should be exactly 1)
- ✅ Track consumer lag per group
- ✅ Alert on ISR shrinks rate (>5/sec = issue)
### Performance
- ✅ Use SSD storage (GP3 or better)
- ✅ Tune JVM heap (50% of RAM, max 32GB)
- ✅ Use G1GC for garbage collection
- ✅ Increase num.network.threads and num.io.threads
- ✅ Enable compression (lz4 for balance of speed and ratio)
### Security
- ✅ Enable TLS/SSL encryption in transit
- ✅ Use SASL authentication (SCRAM-SHA-512)
- ✅ Implement ACLs for topic/group access
- ✅ Rotate credentials regularly
- ✅ Enable encryption at rest (for sensitive data)
## Common Incidents I Handle
1. **Under-Replicated Partitions** → Check broker health, network, disk I/O
2. **High Consumer Lag** → Scale consumers, optimize processing logic
3. **Broker Out of Disk** → Reduce retention, expand volumes
4. **High GC Time** → Increase heap, tune GC parameters
5. **Connection Refused** → Check security groups, SASL config, TLS certificates
6. **Leader Election Storm** → Disable auto leader rebalancing, check network stability
7. **Offline Partitions** → Identify failed brokers, restart safely
8. **ISR Shrinks** → Investigate replication lag, disk I/O, network latency
## Runbooks
For critical alerts, I reference these runbooks:
- Under-Replicated Partitions: `monitoring/prometheus/kafka-alerts.yml` (Alert 1)
- Offline Partitions: `monitoring/prometheus/kafka-alerts.yml` (Alert 2)
- No Active Controller: `monitoring/prometheus/kafka-alerts.yml` (Alert 3)
- High Consumer Lag: `monitoring/prometheus/kafka-alerts.yml` (Alert 6)
## References
- Apache Kafka Documentation: https://kafka.apache.org/documentation/
- Confluent Best Practices: https://docs.confluent.io/platform/current/
- Strimzi Docs: https://strimzi.io/docs/
- Prometheus JMX Exporter: https://github.com/prometheus/jmx_exporter
---
**Invoke me when you need DevOps/SRE expertise for Kafka deployment, monitoring, or incident response!**


@@ -0,0 +1,292 @@
---
name: kafka-observability
description: Kafka observability and monitoring specialist. Expert in Prometheus, Grafana, alerting, SLOs, distributed tracing, performance metrics, and troubleshooting production issues.
---
# Kafka Observability Agent
## 🚀 How to Invoke This Agent
**Subagent Type**: `specweave-kafka:kafka-observability:kafka-observability`
**Usage Example**:
```typescript
Task({
subagent_type: "specweave-kafka:kafka-observability:kafka-observability",
prompt: "Set up Kafka monitoring with Prometheus JMX exporter and create Grafana dashboards with alerting rules",
model: "haiku" // optional: haiku, sonnet, opus
});
```
**Naming Convention**: `{plugin}:{directory}:{yaml-name-or-directory-name}`
- **Plugin**: specweave-kafka
- **Directory**: kafka-observability
- **Agent Name**: kafka-observability
**When to Use**:
- You need to set up monitoring for Kafka clusters
- You want to configure alerting for critical Kafka metrics
- You're troubleshooting high latency, consumer lag, or performance issues
- You need to analyze Kafka performance bottlenecks
- You're implementing SLOs for Kafka availability and latency
I'm a specialized observability agent with deep expertise in monitoring, alerting, and troubleshooting Apache Kafka in production.
## My Expertise
### Monitoring Infrastructure
- **Prometheus + Grafana**: JMX exporter, custom dashboards, recording rules
- **Metrics Collection**: Broker, topic, consumer, JVM, OS metrics
- **Distributed Tracing**: OpenTelemetry integration for end-to-end visibility
- **Log Aggregation**: ELK, Datadog, CloudWatch integration
### Alerting & SLOs
- **Alert Design**: Critical vs warning, actionable alerts, reduce noise
- **SLO Definition**: Availability, latency, throughput targets
- **On-Call Runbooks**: Step-by-step remediation for common incidents
- **Escalation Policies**: When to page, when to auto-remediate
### Performance Analysis
- **Latency Profiling**: Produce latency, fetch latency, end-to-end latency
- **Throughput Optimization**: Identify bottlenecks, scale appropriately
- **Resource Utilization**: CPU, memory, disk I/O, network bandwidth
- **Consumer Lag Analysis**: Root cause analysis, scaling recommendations
## When to Invoke Me
I activate for:
- **Monitoring setup**: "Configure Prometheus for Kafka", "create Grafana dashboards"
- **Alert configuration**: "Set up critical alerts", "SLO for 99.9% availability"
- **Troubleshooting**: "High latency", "consumer lag spiking", "broker CPU at 100%"
- **Performance analysis**: "Why is Kafka slow?", "optimize throughput", "reduce latency"
- **Incident response**: "Under-replicated partitions", "offline partitions", "broker down"
## My Tools
**Dashboards**:
- kafka-cluster-overview: Cluster health, throughput, ISR changes
- kafka-broker-metrics: CPU, memory, network, request handlers
- kafka-consumer-lag: Lag per group/topic, offset tracking
- kafka-topic-metrics: Partition count, replication, log size
- kafka-jvm-metrics: Heap, GC, threads, file descriptors
**Alerting Rules** (14 alerts):
- CRITICAL: Under-replicated partitions, offline partitions, no controller, unclean leader elections
- HIGH: Consumer lag, ISR shrinks, leader election rate
- WARNING: CPU, memory, GC time, disk usage, file descriptors
## Example Workflows
### Workflow 1: Set Up Complete Monitoring Stack
```
User: "Set up monitoring for Kafka cluster on Kubernetes"
Me:
1. JMX Exporter Configuration:
- Create ConfigMap with kafka-jmx-exporter.yml
- Add to Kafka CR: metricsConfig.type=jmxPrometheusExporter
- Metrics exposed on port 9404 (Strimzi default)
2. Prometheus Integration:
- Create PodMonitor for Kafka pods
- Scrape interval: 30s
- Target: strimzi.io/kind=Kafka pods
3. Grafana Dashboard Provisioning:
- Create ConfigMaps for 5 dashboards
- Label with grafana_dashboard=1
- Auto-import within 60 seconds
4. Alerting Rules:
- Deploy PrometheusRule with 14 alerts
- Configure Alertmanager for Slack/PagerDuty
- Test: Stop a broker, verify alert fires
5. Verification:
- Open Grafana (kubectl port-forward 3000:80)
- Check metrics: kafka_server_broker_topic_metrics_bytesin_total
- View dashboards: Cluster Overview, Broker Metrics, Consumer Lag
```
### Workflow 2: Troubleshoot High Consumer Lag
```
User: "Consumer group 'orders-processor' has 50K messages lag, why?"
Me:
1. Immediate Metrics Check:
- kafka_consumergroup_lag{consumergroup="orders-processor"} = 50,000
- kafka_consumergroup_current_offset (where are consumers?)
- kafka_consumergroup_log_end_offset (where are producers?)
2. Consumer Health Check:
- kafka_consumergroup_members = ? (how many active consumers?)
- Partition count = 12
- Expected: 12 consumers (1 per partition)
- If < 12 → Scale up consumers
3. Processing Time Analysis:
- Check application metrics: message processing time
- If > 100ms → Optimize processing logic
- If database calls → Add caching, connection pooling
4. Producer Rate vs Consumer Rate:
- Producer: rate(kafka_topic_partition_current_offset[5m]) = 500 msg/sec
- Consumer: rate(kafka_consumergroup_current_offset[5m]) = 200 msg/sec
- Gap: 300 msg/sec → Consumers too slow!
5. Remediation Options:
a. Scale consumers: 12 → 24 (but limited by 12 partitions)
b. Increase partitions: 12 → 24 (allows more consumers)
c. Optimize processing: Reduce per-message latency
d. Batch processing: Process 10 messages at once
6. Long-term Fix:
- Increase partitions (requires re-partitioning)
- Optimize consumer code (reduce latency)
- Add more consumer instances
- Monitor: kafka_consumergroup_lag < 1000 (SLO)
```
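If Prometheus metrics are not at hand, the same lag numbers can be derived directly from the brokers; a minimal sketch assuming the kafkajs v2 admin API, with the group and topic names from this example.
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'lag-check', brokers: ['localhost:9092'] });
const admin = kafka.admin();

// Lag per partition = log-end offset (producer side) minus committed offset (consumer side).
async function consumerLag(groupId: string, topic: string) {
  await admin.connect();
  const [committed] = await admin.fetchOffsets({ groupId, topics: [topic] }); // kafkajs v2 signature
  const latest = await admin.fetchTopicOffsets(topic);

  let total = 0;
  for (const end of latest) {
    const c = committed.partitions.find((p) => p.partition === end.partition);
    const committedOffset = c && c.offset !== '-1' ? Number(c.offset) : 0; // '-1' = no commit yet
    const lag = Number(end.high) - committedOffset;
    total += lag;
    console.log(`${topic}-${end.partition}: lag=${lag}`);
  }
  console.log(`total lag for ${groupId}: ${total}`);
  await admin.disconnect();
}

consumerLag('orders-processor', 'orders.events').catch(console.error);
```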
### Workflow 3: Investigate Broker Performance Degradation
```
User: "Broker CPU spiked to 100%, latency increased, what's wrong?"
Me:
1. Metrics Timeline Analysis:
- os_process_cpu_load{instance="kafka-broker-2"} = 1.0 (100%)
- kafka_network_request_metrics_totaltime_total{request="Produce"} spike
- kafka_server_request_handler_avg_idle_percent = 0.05 (95% busy!)
2. Correlation Check (find root cause):
- kafka_server_broker_topic_metrics_messagesin_total → No spike
- kafka_log_flush_time_ms_p99 → Spike from 10ms to 500ms (disk I/O issue!)
- iostat (via node exporter) → Disk queue depth = 50 (saturation)
3. Root Cause Identified: Disk I/O Saturation
- Likely cause: Log flush taking too long
- Check: log.flush.interval.messages and log.flush.interval.ms
4. Immediate Mitigation:
- Check disk health: SMART errors?
- Check IOPS limits: GP2 exhausted? Upgrade to GP3
- Increase provisioned IOPS: 3000 → 10,000
5. Configuration Tuning:
- Increase log.flush.interval.messages (flush less frequently)
- Reduce log.segment.bytes (smaller segments = less data per flush)
- Use faster storage class (io2 for critical production)
6. Monitoring:
- Set alert: kafka_log_flush_time_ms_p99 > 100ms for 5m
- Track: iostat iowait% < 20% (SLO)
```
## Critical Metrics I Monitor
### Cluster Health
- `kafka_controller_active_controller_count` = 1 (exactly one)
- `kafka_server_replica_manager_under_replicated_partitions` = 0
- `kafka_controller_offline_partitions_count` = 0
- `kafka_controller_unclean_leader_elections_total` = 0
### Broker Performance
- `os_process_cpu_load` < 0.8 (80% CPU)
- `jvm_memory_heap_used_bytes / jvm_memory_heap_max_bytes` < 0.85 (85% heap)
- `kafka_server_request_handler_avg_idle_percent` > 0.3 (30% idle)
- `os_open_file_descriptors / os_max_file_descriptors` < 0.8 (80% FD)
### Throughput & Latency
- `kafka_server_broker_topic_metrics_bytesin_total` (bytes in/sec)
- `kafka_server_broker_topic_metrics_bytesout_total` (bytes out/sec)
- `kafka_network_request_metrics_totaltime_total{request="Produce"}` (produce latency)
- `kafka_network_request_metrics_totaltime_total{request="FetchConsumer"}` (fetch latency)
### Consumer Lag
- `kafka_consumergroup_lag` < 1000 messages (SLO)
- `rate(kafka_consumergroup_current_offset[5m])` = consumer throughput
- `rate(kafka_topic_partition_current_offset[5m])` = producer throughput
### JVM Health
- `jvm_gc_collection_time_ms_total` < 500ms/sec (GC time)
- `jvm_threads_count` < 500 (thread count)
- `rate(jvm_gc_collection_count_total[5m])` < 1/sec (GC frequency)
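A quick way to spot-check the cluster-health gauges above outside Grafana is the Prometheus HTTP API; a small sketch (Node 18+ for the global fetch; the metric names assume the JMX-exporter mapping used in this plugin).
```typescript
const PROMETHEUS_URL = process.env.PROMETHEUS_URL ?? 'http://localhost:9090';

async function instantQuery(query: string): Promise<number | null> {
  const res = await fetch(`${PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(query)}`);
  const body = await res.json() as {
    status: string;
    data: { result: Array<{ value: [number, string] }> };
  };
  if (body.status !== 'success' || body.data.result.length === 0) return null;
  return Number(body.data.result[0].value[1]);
}

async function healthCheck() {
  // [label, PromQL query, predicate that means "healthy"]
  const checks: Array<[string, string, (v: number) => boolean]> = [
    ['active controllers', 'sum(kafka_controller_active_controller_count)', (v) => v === 1],
    ['under-replicated partitions', 'sum(kafka_server_replica_manager_under_replicated_partitions)', (v) => v === 0],
    ['offline partitions', 'sum(kafka_controller_offline_partitions_count)', (v) => v === 0],
  ];
  for (const [name, query, ok] of checks) {
    const value = await instantQuery(query);
    console.log(`${name}: ${value} ${value !== null && ok(value) ? 'OK' : 'ALERT'}`);
  }
}

healthCheck().catch(console.error);
```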
## Alerting Best Practices
### Alert Severity Levels
**CRITICAL** (Page On-Call Immediately):
- Under-replicated partitions > 0 for 5 minutes
- Offline partitions > 0 for 1 minute
- No active controller for 1 minute
- Unclean leader elections > 0
**HIGH** (Notify During Business Hours):
- Consumer lag > 10,000 messages for 10 minutes
- ISR shrinks > 5/sec for 5 minutes
- Leader election rate > 0.5/sec for 5 minutes
**WARNING** (Create Ticket, Investigate Next Day):
- CPU usage > 80% for 5 minutes
- Heap memory > 85% for 5 minutes
- GC time > 500ms/sec for 5 minutes
- Disk usage > 85% for 5 minutes
### Alert Design Principles
- ✅ **Actionable**: Alert must require human intervention
- ✅ **Specific**: Include exact metric value and threshold
- ✅ **Runbook**: Link to step-by-step remediation guide
- ✅ **Context**: Include related metrics for correlation
- ✅ **Avoid Noise**: Don't alert on normal fluctuations
## SLO Definitions
### Example SLOs for Kafka
```yaml
# Availability SLO
- objective: "99.9% of produce requests succeed"
measurement: success_rate(kafka_network_request_metrics_totaltime_total{request="Produce"})
target: 0.999
# Latency SLO
- objective: "p99 produce latency < 100ms"
measurement: histogram_quantile(0.99, kafka_network_request_metrics_totaltime_total{request="Produce"})
target: 0.1 # 100ms
# Consumer Lag SLO
- objective: "95% of consumer groups have lag < 1000 messages"
measurement: count(kafka_consumergroup_lag < 1000) / count(kafka_consumergroup_lag)
target: 0.95
```
## Troubleshooting Decision Tree
```
High Latency Detected
├─ Check Broker CPU
│ └─ High (>80%) → Scale horizontally, optimize config
├─ Check Disk I/O
│ └─ High (iowait >20%) → Upgrade storage (GP3/io2), tune flush settings
├─ Check Network
│ └─ High RTT → Check inter-broker network, increase socket buffers
├─ Check GC Time
│ └─ High (>500ms/sec) → Increase heap, tune GC (G1GC)
└─ Check Request Handler Idle %
└─ Low (<30%) → Increase num.network.threads, num.io.threads
```
## References
- Prometheus JMX Exporter: https://github.com/prometheus/jmx_exporter
- Grafana Dashboards: `plugins/specweave-kafka/monitoring/grafana/dashboards/`
- Alerting Rules: `plugins/specweave-kafka/monitoring/prometheus/kafka-alerts.yml`
- Kafka Metrics Guide: https://kafka.apache.org/documentation/#monitoring
---
**Invoke me when you need observability, monitoring, alerting, or performance troubleshooting expertise!**

commands/deploy.md Normal file

@@ -0,0 +1,99 @@
---
name: specweave-kafka:deploy
description: Deploy Apache Kafka cluster using Terraform (Apache Kafka, AWS MSK, or Azure Event Hubs). Guides platform selection, sizing, and deployment.
---
# Deploy Kafka Cluster
Deploy Apache Kafka using Infrastructure as Code (Terraform).
## What This Command Does
1. **Platform Selection**: Helps you choose the right Kafka platform
2. **Cluster Sizing**: Calculates broker count, instance types, storage
3. **Terraform Generation**: Creates or uses existing Terraform modules
4. **Deployment**: Guides through terraform init/plan/apply
5. **Verification**: Tests cluster connectivity and basic operations
## Interactive Workflow
I'll ask you a few questions to determine the best deployment approach:
### Question 1: Which platform?
- **Apache Kafka** (self-hosted on AWS EC2, KRaft mode)
- **AWS MSK** (managed Kafka service)
- **Azure Event Hubs** (Kafka-compatible API)
### Question 2: What's your use case?
- **Development/Testing** (1 broker, small instance)
- **Staging** (3 brokers, medium instances)
- **Production** (3-5 brokers, large instances, multi-AZ)
### Question 3: Expected throughput?
- Messages per second (peak)
- Average message size
- Retention period (hours/days)
Based on your answers, I'll:
- ✅ Recommend broker count and instance types
- ✅ Calculate storage requirements
- ✅ Generate Terraform configuration
- ✅ Guide deployment
## Example Usage
```bash
# Start deployment wizard
/specweave-kafka:deploy
# I'll activate kafka-iac-deployment skill and guide you through:
# 1. Platform selection
# 2. Sizing calculation (using ClusterSizingCalculator)
# 3. Terraform module selection (apache-kafka, aws-msk, or azure-event-hubs)
# 4. Deployment execution
# 5. Post-deployment verification
```
## What Gets Created
**Apache Kafka Deployment** (AWS EC2):
- 3-5 EC2 instances (m5.xlarge or larger)
- EBS volumes (GP3, 100Gi+ per broker)
- Security groups (SASL_SSL on port 9093)
- IAM roles for S3 backups
- CloudWatch alarms
- Load balancer (optional)
**AWS MSK Deployment**:
- MSK cluster (3-6 brokers)
- VPC, subnets, security groups
- IAM authentication
- CloudWatch monitoring
- Auto-scaling (optional)
**Azure Event Hubs Deployment**:
- Event Hubs namespace (Premium SKU)
- Event hubs (topics)
- Private endpoints
- Auto-inflate enabled
- Zone redundancy
## Prerequisites
- Terraform 1.5+ installed
- AWS CLI (for AWS deployments) or Azure CLI (for Azure)
- Appropriate cloud credentials configured
- VPC and subnets created (if deploying to cloud)
## Post-Deployment
After deployment succeeds, I'll:
1. ✅ Output bootstrap servers
2. ✅ Provide connection examples
3. ✅ Suggest running `/specweave-kafka:monitor-setup` for Prometheus + Grafana
4. ✅ Suggest testing with `/specweave-kafka:dev-env` locally
---
**Skills Activated**: kafka-iac-deployment, kafka-architecture
**Related Commands**: /specweave-kafka:monitor-setup, /specweave-kafka:dev-env

commands/dev-env.md Normal file

@@ -0,0 +1,176 @@
---
name: specweave-kafka:dev-env
description: Set up local Kafka development environment using Docker Compose. Includes Kafka (KRaft mode), Schema Registry, Kafka UI, Prometheus, and Grafana.
---
# Set Up Local Kafka Dev Environment
Spin up a complete local Kafka development environment with one command.
## What This Command Does
1. **Docker Compose Selection**: Choose Kafka or Redpanda
2. **Service Configuration**: Kafka + Schema Registry + UI + Monitoring
3. **Environment Setup**: Generate docker-compose.yml
4. **Start Services**: `docker-compose up -d`
5. **Verification**: Test cluster and provide connection details
## Two Options Available
### Option 1: Apache Kafka (KRaft Mode)
**Services**:
- ✅ Kafka broker (KRaft mode, no ZooKeeper)
- ✅ Schema Registry (Avro schemas)
- ✅ Kafka UI (web interface, port 8080)
- ✅ Prometheus (metrics, port 9090)
- ✅ Grafana (dashboards, port 3000)
**Use When**: Testing Apache Kafka specifically, need Schema Registry
### Option 2: Redpanda (3-Node Cluster)
**Services**:
- ✅ Redpanda (3 brokers, Kafka-compatible)
- ✅ Redpanda Console (web UI, port 8080)
- ✅ Prometheus (metrics, port 9090)
- ✅ Grafana (dashboards, port 3000)
**Use When**: Testing high-performance alternative, need multi-broker cluster locally
## Example Usage
```bash
# Start dev environment setup
/specweave-kafka:dev-env
# I'll ask:
# 1. Which stack? (Kafka or Redpanda)
# 2. Where to create files? (current directory or specify path)
# 3. Custom ports? (use defaults or customize)
# Then I'll:
# - Generate docker-compose.yml
# - Start all services
# - Wait for health checks
# - Provide connection details
# - Open Kafka UI in browser
```
## What Gets Created
**Directory Structure**:
```
./kafka-dev/
├── docker-compose.yml # Main compose file
├── .env # Environment variables
├── data/ # Persistent volumes
│ ├── kafka/
│ ├── prometheus/
│ └── grafana/
└── config/
├── prometheus.yml # Prometheus config
└── grafana/ # Dashboard provisioning
```
**Services Running**:
- Kafka: localhost:9092 (plaintext) or localhost:9093 (SASL_SSL)
- Schema Registry: localhost:8081
- Kafka UI: http://localhost:8080
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
## Connection Examples
**After setup, connect with**:
### Producer (Node.js):
```javascript
const { Kafka } = require('kafkajs');
const kafka = new Kafka({
clientId: 'my-app',
brokers: ['localhost:9092']
});
const producer = kafka.producer();
await producer.connect();
await producer.send({
topic: 'test-topic',
messages: [{ value: 'Hello Kafka!' }]
});
```
### Consumer (Python):
```python
from kafka import KafkaConsumer
consumer = KafkaConsumer(
'test-topic',
bootstrap_servers=['localhost:9092'],
group_id='my-group',
auto_offset_reset='earliest'
)
for message in consumer:
print(f"Received: {message.value}")
```
### kcat (CLI):
```bash
# Produce message
echo "Hello Kafka" | kcat -P -b localhost:9092 -t test-topic
# Consume messages
kcat -C -b localhost:9092 -t test-topic -o beginning
```
## Sample Producer/Consumer
I'll also create sample code templates:
- `producer-nodejs.js` - Production-ready Node.js producer
- `consumer-nodejs.js` - Production-ready Node.js consumer
- `producer-python.py` - Python producer with error handling
- `consumer-python.py` - Python consumer with DLQ
## Prerequisites
- Docker 20+ installed
- Docker Compose v2+
- 4GB+ free RAM (for Redpanda 3-node cluster)
- Ports available: 8080, 8081, 9090, 9092, 9093, 3000
## Post-Setup
After environment starts, I'll:
1. ✅ Open Kafka UI in browser (http://localhost:8080)
2. ✅ Create a test topic via UI
3. ✅ Show producer/consumer examples
4. ✅ Provide kcat commands for testing
5. ✅ Show Grafana dashboards (http://localhost:3000)
## Useful Commands
```bash
# Start environment
docker-compose up -d
# Stop environment
docker-compose down
# Stop and remove data
docker-compose down -v
# View logs
docker-compose logs -f kafka
# Restart Kafka only
docker-compose restart kafka
# Check health
docker-compose ps
```
---
**Skills Activated**: kafka-cli-tools
**Docker Compose Location**: `plugins/specweave-kafka/docker/`
**Sample Code**: `plugins/specweave-kafka/docker/templates/`

commands/mcp-configure.md Normal file

@@ -0,0 +1,101 @@
---
name: specweave-kafka:mcp-configure
description: Configure MCP (Model Context Protocol) server for Kafka integration. Auto-detects and configures kanapuli, tuannvm, Joel-hanson, or Confluent MCP servers.
---
# Configure Kafka MCP Server
Set up MCP (Model Context Protocol) server integration for natural language Kafka operations.
## What This Command Does
1. **MCP Server Detection**: Auto-detect installed MCP servers
2. **Server Ranking**: Recommend best server for your needs
3. **Configuration**: Generate Claude Desktop config
4. **Testing**: Verify MCP server connectivity
5. **Usage Guide**: Show natural language examples
## Supported MCP Servers
| Server | Language | Features | Best For |
|--------|----------|----------|----------|
| **Confluent Official** | - | Natural language, Flink SQL, Enterprise | Production + Confluent Cloud |
| **tuannvm/kafka-mcp-server** | Go | Advanced SASL (SCRAM-SHA-256/512) | Security-focused deployments |
| **kanapuli/mcp-kafka** | Node.js | Basic operations, SASL_PLAINTEXT | Quick start, dev environments |
| **Joel-hanson/kafka-mcp-server** | Python | Claude Desktop integration | Desktop AI workflows |
## Example Usage
```bash
# Start MCP configuration wizard
/specweave-kafka:mcp-configure
# I'll:
# 1. Detect installed MCP servers (npm, go, pip, CLI)
# 2. Rank servers (Confluent > tuannvm > kanapuli > Joel-hanson)
# 3. Generate Claude Desktop config (~/.claude/settings.json)
# 4. Test connection to Kafka
# 5. Show natural language examples
```
## What Gets Configured
**Claude Desktop Config** (`~/.claude/settings.json`):
```json
{
"mcpServers": {
"kafka": {
"command": "npx",
"args": ["mcp-kafka"],
"env": {
"KAFKA_BROKERS": "localhost:9092",
"KAFKA_SASL_USERNAME": "admin",
"KAFKA_SASL_PASSWORD": "admin-secret"
}
}
}
}
```
## Natural Language Examples
After MCP is configured, you can use natural language with Claude:
```
You: "List all Kafka topics"
Claude: [Uses MCP to call listTopics()]
Output: user-events, order-events, payment-events
You: "Create a topic called 'analytics' with 12 partitions and RF=3"
Claude: [Uses MCP to call createTopic()]
Output: Topic 'analytics' created successfully
You: "What's the consumer lag for group 'orders-consumer'?"
Claude: [Uses MCP to call getConsumerGroupOffsets()]
Output: Total lag: 1,234 messages across 6 partitions
You: "Send a test message to 'user-events' topic"
Claude: [Uses MCP to call produceMessage()]
Output: Message sent to partition 3, offset 12345
```
## Prerequisites
- Node.js 18+ (for kanapuli or Joel-hanson)
- Go 1.20+ (for tuannvm)
- Confluent Cloud account (for Confluent MCP)
- Kafka cluster accessible from your machine
## Post-Configuration
After MCP is configured, I'll:
1. ✅ Restart Claude Desktop (required for MCP changes)
2. ✅ Test MCP server with simple command
3. ✅ Show 10+ natural language examples
4. ✅ Provide troubleshooting tips if connection fails
---
**Skills Activated**: kafka-mcp-integration
**Related Commands**: /specweave-kafka:deploy, /specweave-kafka:dev-env
**MCP Docs**: https://modelcontextprotocol.io/

commands/monitor-setup.md Normal file

@@ -0,0 +1,96 @@
---
name: specweave-kafka:monitor-setup
description: Set up comprehensive Kafka monitoring with Prometheus + Grafana. Configures JMX exporter, dashboards, and alerting rules.
---
# Set Up Kafka Monitoring
Configure comprehensive monitoring for your Kafka cluster using Prometheus and Grafana.
## What This Command Does
1. **JMX Exporter Setup**: Configure Prometheus JMX exporter for Kafka brokers
2. **Prometheus Configuration**: Add Kafka scrape targets
3. **Grafana Dashboards**: Install 5 pre-built dashboards
4. **Alerting Rules**: Configure 14 critical/high/warning alerts
5. **Verification**: Test metrics collection and dashboard access
## Interactive Workflow
I'll detect your environment and guide setup:
### Environment Detection
- **Kubernetes** (Strimzi/Confluent Operator) → Use PodMonitor
- **Docker Compose** → Add Prometheus + Grafana services
- **VM/Bare Metal** → Configure JMX exporter JAR
### Question 1: Where is Kafka running?
- Kubernetes (Strimzi)
- Docker Compose
- VMs/EC2 instances
### Question 2: Prometheus already installed?
- Yes → Just add Kafka scrape config
- No → Install Prometheus + Grafana stack
## Example Usage
```bash
# Start monitoring setup wizard
/specweave-kafka:monitor-setup
# I'll activate kafka-observability skill and:
# 1. Detect your environment
# 2. Configure JMX exporter (port 7071)
# 3. Set up Prometheus scraping
# 4. Install 5 Grafana dashboards
# 5. Configure 14 alerting rules
# 6. Verify metrics collection
```
## What Gets Configured
**JMX Exporter** (Kafka brokers):
- Metrics endpoint on port 7071
- 50+ critical Kafka metrics exported
- Broker, topic, consumer lag, JVM metrics
**Prometheus Scraping**:
```yaml
scrape_configs:
- job_name: 'kafka'
static_configs:
- targets: ['kafka-0:7071', 'kafka-1:7071', 'kafka-2:7071']
```
**5 Grafana Dashboards**:
1. **Cluster Overview** - Health, throughput, ISR changes
2. **Broker Metrics** - CPU, memory, network, request handlers
3. **Consumer Lag** - Lag per group/topic, offset tracking
4. **Topic Metrics** - Partition count, replication, log size
5. **JVM Metrics** - Heap, GC, threads, file descriptors
**14 Alerting Rules**:
- CRITICAL: Under-replicated partitions, offline partitions, no controller
- HIGH: Consumer lag, ISR shrinks, leader elections
- WARNING: CPU, memory, GC time, disk usage
## Prerequisites
- Kafka cluster running (self-hosted or K8s)
- Prometheus installed (or will be installed)
- Grafana installed (or will be installed)
## Post-Setup
After setup completes, I'll:
1. ✅ Provide Grafana URL and credentials
2. ✅ Show how to access dashboards
3. ✅ Explain critical alerts
4. ✅ Suggest testing alerts by stopping a broker
---
**Skills Activated**: kafka-observability
**Related Commands**: /specweave-kafka:deploy
**Dashboard Locations**: `plugins/specweave-kafka/monitoring/grafana/dashboards/`

plugin.lock.json Normal file

@@ -0,0 +1,93 @@
{
"$schema": "internal://schemas/plugin.lock.v1.json",
"pluginId": "gh:anton-abyzov/specweave:plugins/specweave-kafka",
"normalized": {
"repo": null,
"ref": "refs/tags/v20251128.0",
"commit": "681f5a385e57731c64ef4b212d55544d87144203",
"treeHash": "72b53317c0e203fe3777e89c5d9028138ef4f5c010ddbb40e9aefcb98071116a",
"generatedAt": "2025-11-28T10:13:51.728305Z",
"toolVersion": "publish_plugins.py@0.2.0"
},
"origin": {
"remote": "git@github.com:zhongweili/42plugin-data.git",
"branch": "master",
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
},
"manifest": {
"name": "specweave-kafka",
"description": "Apache Kafka event streaming integration with MCP servers, CLI tools (kcat), Terraform modules, and comprehensive observability stack",
"version": "0.24.0"
},
"content": {
"files": [
{
"path": "README.md",
"sha256": "afb48227ea28ac5877048fa4fac0d0cfcb2f1ae6623286c0cd19bbd756530daa"
},
{
"path": "agents/kafka-devops/AGENT.md",
"sha256": "409e6d56102b053bc596eca97f1c693faf10dd70aede9d3cc97fbc896b68a9d9"
},
{
"path": "agents/kafka-observability/AGENT.md",
"sha256": "0693bcd35ef65f33ebf9e46cf8310bda8001a436d3e6bd6313e2aed6ec54fcc8"
},
{
"path": "agents/kafka-architect/AGENT.md",
"sha256": "f0bb437f1f6f912b8e1afba948f56b2146f4785ba6f1f79512bb50584608b8ea"
},
{
"path": ".claude-plugin/plugin.json",
"sha256": "52b4935226fb2716fdd2cec74187ce12252cafc46326a83c64aaf93fb2027cde"
},
{
"path": "commands/monitor-setup.md",
"sha256": "d18a3a37122d04ccb4f7a99286e07af58929b11162288a1db755d44626da58b8"
},
{
"path": "commands/deploy.md",
"sha256": "82c246ae4d7043e5da67dc6402bcc58f181b078f21ca67619efd266701894ad4"
},
{
"path": "commands/mcp-configure.md",
"sha256": "0e2144e29ab332a925535d81753094a426fa46670778fbcd70c802235b024266"
},
{
"path": "commands/dev-env.md",
"sha256": "c2cd943d4d1a3a2b05e7321976672594569b25f9d12151213b8b68081b3fe861"
},
{
"path": "skills/kafka-kubernetes/SKILL.md",
"sha256": "64da4d3d9cdbe7061d9e0254c7be4a531a9356f43494251958d24aa66622eb53"
},
{
"path": "skills/kafka-mcp-integration/SKILL.md",
"sha256": "e132fabf52ebad6a2b57e490a0c7738f6ca70ded6f6f879b8db719de66c17e0d"
},
{
"path": "skills/kafka-architecture/SKILL.md",
"sha256": "326e0a3de8c26ce4b36bfe76fc507adca6741b4078bae5214f04039561d84bfd"
},
{
"path": "skills/kafka-observability/SKILL.md",
"sha256": "c3b4b19fbac43f0fdba91009efe717f55348c66a26dce0c6dfb721a7c57d0817"
},
{
"path": "skills/kafka-iac-deployment/SKILL.md",
"sha256": "fc82a9a7990d1c8ca8ab4f9b8845cd53fdc46767a11e95432b517368fff000de"
},
{
"path": "skills/kafka-cli-tools/SKILL.md",
"sha256": "7658d0dfabdb1cf1fa5c83619f41acf2daff28597cdeb35b94e4ce484218d4e0"
}
],
"dirSha256": "72b53317c0e203fe3777e89c5d9028138ef4f5c010ddbb40e9aefcb98071116a"
},
"security": {
"scannedAt": null,
"scannerVersion": null,
"flags": []
}
}


@@ -0,0 +1,647 @@
---
name: kafka-architecture
description: Expert knowledge of Apache Kafka architecture, cluster design, capacity planning, partitioning strategies, replication, and high availability. Auto-activates on keywords kafka architecture, cluster sizing, partition strategy, replication factor, kafka ha, kafka scalability, broker count, topic design, kafka performance, kafka capacity planning.
---
# Kafka Architecture & Design Expert
Comprehensive knowledge of Apache Kafka architecture patterns, cluster design principles, and production best practices for building resilient, scalable event streaming platforms.
## Core Architecture Concepts
### Kafka Cluster Components
**Brokers**:
- Individual Kafka servers that store and serve data
- Each broker handles thousands of partitions
- Typical: 3-10 brokers per cluster (small), 10-100+ (large enterprises)
**Controller**:
- One broker elected as controller (via KRaft or ZooKeeper)
- Manages partition leaders and replica assignments
- Failure triggers automatic re-election
**Topics**:
- Logical channels for message streams
- Divided into partitions for parallelism
- Can have different retention policies per topic
**Partitions**:
- Ordered, immutable sequence of records
- Unit of parallelism (1 partition = 1 consumer in a group)
- Distributed across brokers for load balancing
**Replicas**:
- Copies of partitions across multiple brokers
- 1 leader replica (serves reads/writes)
- N-1 follower replicas (replication only)
- In-Sync Replicas (ISR): Followers caught up with leader
### KRaft vs ZooKeeper Mode
**KRaft Mode** (Recommended, Kafka 3.3+):
```yaml
Cluster Metadata:
- Stored in Kafka itself (no external ZooKeeper)
- Metadata topic: __cluster_metadata
- Controller quorum (3 or 5 nodes)
- Faster failover (<1s vs 10-30s)
- Simplified operations
```
**ZooKeeper Mode** (Legacy, deprecated since 3.5 and removed in Kafka 4.0):
```yaml
External Coordination:
- Requires separate ZooKeeper ensemble (3-5 nodes)
- Stores cluster metadata, configs, ACLs
- Slower failover (10-30 seconds)
- More complex to operate
```
**Migration**: ZooKeeper → KRaft migration supported in Kafka 3.6+
## Cluster Sizing Guidelines
### Small Cluster (Development/Testing)
```yaml
Configuration:
Brokers: 3
Partitions per broker: ~100-500
Total partitions: 300-1500
Replication factor: 3
Hardware:
- CPU: 4-8 cores
- RAM: 8-16 GB
- Disk: 500 GB - 1 TB SSD
- Network: 1 Gbps
Use Cases:
- Development environments
- Low-volume production (<10 MB/s)
- Proof of concepts
- Single datacenter
Example Workload:
- 50 topics
- 5-10 partitions per topic
- 1 million messages/day
- 7-day retention
```
### Medium Cluster (Standard Production)
```yaml
Configuration:
Brokers: 6-12
Partitions per broker: 500-2000
Total partitions: 3K-24K
Replication factor: 3
Hardware:
- CPU: 16-32 cores
- RAM: 64-128 GB
- Disk: 2-8 TB NVMe SSD
- Network: 10 Gbps
Use Cases:
- Standard production workloads
- Multi-team environments
- Regional deployments
- Up to 500 MB/s throughput
Example Workload:
- 200-500 topics
- 10-50 partitions per topic
- 100 million messages/day
- 30-day retention
```
### Large Cluster (High-Scale Production)
```yaml
Configuration:
Brokers: 20-100+
Partitions per broker: 2000-4000
Total partitions: 40K-400K+
Replication factor: 3
Hardware:
- CPU: 32-64 cores
- RAM: 128-256 GB
- Disk: 8-20 TB NVMe SSD
- Network: 25-100 Gbps
Use Cases:
- Large enterprises
- Multi-region deployments
- Event-driven architectures
- 1+ GB/s throughput
Example Workload:
- 1000+ topics
- 50-200 partitions per topic
- 1+ billion messages/day
- 90-365 day retention
```
### Kafka Streams / Exactly-Once Semantics (EOS) Clusters
```yaml
Configuration:
Brokers: 6-12+ (same as standard, but more control plane load)
Partitions per broker: 500-1500 (fewer due to transaction overhead)
Total partitions: 3K-18K
Replication factor: 3
Hardware:
- CPU: 16-32 cores (more CPU for transactions)
- RAM: 64-128 GB
- Disk: 4-12 TB NVMe SSD (more for transaction logs)
- Network: 10-25 Gbps
Special Considerations:
- More brokers due to transaction coordinator load
- Lower partition count per broker (transactions = more overhead)
- Higher disk IOPS for transaction logs
- min.insync.replicas=2 mandatory for EOS
- acks=all required for producers
Use Cases:
- Stream processing with exactly-once guarantees
- Financial transactions
- Event sourcing with strict ordering
- Multi-step workflows requiring atomicity
```
## Partitioning Strategy
### How Many Partitions?
**Formula**:
```
Partitions = max(
Target Throughput / Single Partition Throughput,
Number of Consumers (for parallelism),
Future Growth Factor (2-3x)
)
Single Partition Limits:
- Write throughput: ~10-50 MB/s
- Read throughput: ~30-100 MB/s
- Message rate: ~10K-100K msg/s
```
**Examples**:
**High Throughput Topic** (Logs, Events):
```yaml
Requirements:
- Write: 200 MB/s
- Read: 500 MB/s (multiple consumers)
- Expected growth: 3x in 1 year
Calculation:
Write partitions: 200 MB/s ÷ 20 MB/s = 10
Read partitions: 500 MB/s ÷ 40 MB/s = 13
Growth factor: 13 × 3 = 39
Recommendation: 40-50 partitions
```
**Low-Latency Topic** (Commands, Requests):
```yaml
Requirements:
- Write: 5 MB/s
- Read: 10 MB/s
- Latency: <10ms p99
- Order preservation: By user ID
Calculation:
Throughput partitions: 5 MB/s ÷ 20 MB/s = 1
Parallelism: 4 (for redundancy)
Recommendation: 4-6 partitions (keyed by user ID)
```
**Dead Letter Queue**:
```yaml
Recommendation: 1-3 partitions
Reason: Low volume, order less important
```
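The formula above as a small TypeScript helper, using the conservative per-partition throughput figures quoted in this skill as defaults (~20 MB/s write, ~40 MB/s read).
```typescript
// Partition-count recommendation following the formula above; defaults are assumptions
// taken from the per-partition throughput ranges quoted in this skill.
interface PartitionInput {
  writeMBps: number;
  readMBps: number;
  consumerParallelism: number;  // consumers you want running in one group
  growthFactor: number;         // e.g. 2-3x headroom
  writePerPartitionMBps?: number;
  readPerPartitionMBps?: number;
}

function recommendPartitions(p: PartitionInput): number {
  const writePerPartition = p.writePerPartitionMBps ?? 20;
  const readPerPartition = p.readPerPartitionMBps ?? 40;
  const base = Math.max(
    Math.ceil(p.writeMBps / writePerPartition),
    Math.ceil(p.readMBps / readPerPartition),
    p.consumerParallelism,
  );
  return Math.ceil(base * p.growthFactor);
}

// High-throughput example above: 200 MB/s write, 500 MB/s read, 3x growth → ~39 partitions
console.log(recommendPartitions({ writeMBps: 200, readMBps: 500, consumerParallelism: 10, growthFactor: 3 }));
```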
### Partition Key Selection
**Good Keys** (High Cardinality, Even Distribution):
```yaml
✅ User ID (UUIDs):
- Millions of unique values
- Even distribution
- Example: "user-123e4567-e89b-12d3-a456-426614174000"
✅ Device ID (IoT):
- Unique per device
- Natural sharding
- Example: "device-sensor-001-zone-a"
✅ Order ID (E-commerce):
- Unique per transaction
- Even temporal distribution
- Example: "order-2024-11-15-abc123"
```
**Bad Keys** (Low Cardinality, Hotspots):
```yaml
❌ Country Code:
- Only ~200 values
- Uneven (US, CN >> others)
- Creates partition hotspots
❌ Boolean Flags:
- Only 2 values (true/false)
- Severe imbalance
❌ Date (YYYY-MM-DD):
- All today's traffic → 1 partition
- Temporal hotspot
```
**Compound Keys** (Best of Both):
```yaml
✅ Country + User ID:
- Partition by country for locality
- Sub-partition by user for distribution
- Example: "US:user-123" → hash("US:user-123")
✅ Tenant + Event Type + Timestamp:
- Multi-tenant isolation
- Event type grouping
- Temporal ordering
```
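To check a candidate key for hotspots before rollout, hash a sample of real keys and measure the skew across partitions; a minimal sketch using FNV-1a as a stand-in for the client's murmur2 partitioner (the goal is estimating skew, not reproducing exact assignments).
```typescript
// Measure how evenly a sample of candidate keys spreads across partitions.
function fnv1a(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

function distributionSkew(keys: string[], partitions: number) {
  const counts = new Array<number>(partitions).fill(0);
  for (const key of keys) counts[fnv1a(key) % partitions]++;
  const expected = keys.length / partitions;
  const maxDeviation = Math.max(...counts.map((c) => Math.abs(c - expected) / expected));
  return { counts, maxDeviationPct: Math.round(maxDeviation * 100) };
}

// Sample 10,000 synthetic user IDs; aim for <10% deviation across 12 partitions
const sample = Array.from({ length: 10_000 }, (_, i) => `user-${i}-${Math.random().toString(36).slice(2)}`);
console.log(distributionSkew(sample, 12));
```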
## Replication & High Availability
### Replication Factor Guidelines
```yaml
Development:
Replication Factor: 1
Reason: Fast, no durability needed
Production (Standard):
Replication Factor: 3
Reason: Balance durability vs cost
Tolerates: 1 broker failure with no loss of write availability (min.insync.replicas=2); data survives up to 2 failures
Production (Critical):
Replication Factor: 5
Reason: Maximum durability
Tolerates: 2 broker failures with no loss of write availability (min.insync.replicas=3); data survives up to 4 failures
Use Cases: Financial transactions, audit logs
Multi-Datacenter:
Replication Factor: 3 per DC (6 total)
Reason: DC-level fault tolerance
Requires: MirrorMaker 2 or Confluent Replicator
```
### min.insync.replicas
**Configuration**:
```yaml
min.insync.replicas=2:
- At least 2 replicas must acknowledge writes
- Typical for replication.factor=3
- Prevents data loss if 1 broker fails
min.insync.replicas=1:
- Only leader must acknowledge (dangerous!)
- Use only for non-critical topics
min.insync.replicas=3:
- At least 3 replicas must acknowledge
- For replication.factor=5 (critical systems)
```
**Rule**: `min.insync.replicas ≤ replication.factor - 1` (to allow 1 replica failure)
### Rack Awareness
```yaml
Configuration:
broker.rack=rack1 # Broker 1
broker.rack=rack2 # Broker 2
broker.rack=rack3 # Broker 3
Benefit:
- Replicas spread across racks
- Survives rack-level failures (power, network)
- Example: Topic with RF=3 → 1 replica per rack
Placement:
Leader: rack1
Follower 1: rack2
Follower 2: rack3
```
## Retention Strategies
### Time-Based Retention
```yaml
Short-Term (Events, Logs):
retention.ms: 86400000 # 1 day
Use Cases: Real-time analytics, monitoring
Medium-Term (Transactions):
retention.ms: 604800000 # 7 days
Use Cases: Standard business events
Long-Term (Audit, Compliance):
retention.ms: 31536000000 # 365 days
Use Cases: Regulatory requirements, event sourcing
Infinite (Event Sourcing):
retention.ms: -1 # Forever
cleanup.policy: compact
Use Cases: Source of truth, state rebuilding
```
### Size-Based Retention
```yaml
retention.bytes: 10737418240 # 10 GB per partition
Combined (Time OR Size):
retention.ms: 604800000 # 7 days
retention.bytes: 107374182400 # 100 GB
# Whichever limit is reached first
```
### Compaction (Log Compaction)
```yaml
cleanup.policy: compact
How It Works:
- Keeps only latest value per key
- Deletes old versions
- Preserves full history initially, compacts later
Use Cases:
- Database changelogs (CDC)
- User profile updates
- Configuration management
- State stores
Example:
Before Compaction:
user:123 → {name: "Alice", v:1}
user:123 → {name: "Alice", v:2, email: "alice@ex.com"}
user:123 → {name: "Alice A.", v:3}
After Compaction:
user:123 → {name: "Alice A.", v:3} # Latest only
```
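Retention and compaction are set per topic at creation time (or later via the admin alterConfigs API); a kafkajs admin sketch creating one time/size-retained topic and one compacted topic, where the `user.profiles` topic name is illustrative.
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'topic-admin', brokers: ['localhost:9092'] });
const admin = kafka.admin();

async function createTopics() {
  await admin.connect();
  await admin.createTopics({
    waitForLeaders: true,
    topics: [
      {
        // time-retained event topic: 7 days OR 100 GB per partition, whichever comes first
        topic: 'orders.events',
        numPartitions: 12,
        replicationFactor: 3,
        configEntries: [
          { name: 'retention.ms', value: '604800000' },
          { name: 'retention.bytes', value: '107374182400' },
          { name: 'min.insync.replicas', value: '2' },
        ],
      },
      {
        // compacted state topic: keep only the latest value per key
        topic: 'user.profiles',
        numPartitions: 12,
        replicationFactor: 3,
        configEntries: [
          { name: 'cleanup.policy', value: 'compact' },
          { name: 'min.insync.replicas', value: '2' },
        ],
      },
    ],
  });
  await admin.disconnect();
}

createTopics().catch(console.error);
```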
## Performance Optimization
### Broker Configuration
```yaml
# Network threads (handle client connections)
num.network.threads: 8 # Increase for high connection count
# I/O threads (disk operations)
num.io.threads: 16 # Set to number of disks × 2
# Replica fetcher threads
num.replica.fetchers: 4 # Increase for many partitions
# Socket buffer sizes
socket.send.buffer.bytes: 1048576 # 1 MB
socket.receive.buffer.bytes: 1048576 # 1 MB
# Log flush (default: OS handles flushing)
log.flush.interval.messages: 10000 # Flush every 10K messages
log.flush.interval.ms: 1000 # Or every 1 second
```
### Producer Optimization
```yaml
High Throughput:
batch.size: 65536 # 64 KB
linger.ms: 100 # Wait 100ms for batching
compression.type: lz4 # Fast compression
acks: 1 # Leader only
Low Latency:
batch.size: 16384 # 16 KB (default)
linger.ms: 0 # Send immediately
compression.type: none
acks: 1
Durability (Exactly-Once):
batch.size: 16384
linger.ms: 10
compression.type: lz4
acks: all
enable.idempotence: true
transactional.id: "producer-1"
```
### Consumer Optimization
```yaml
High Throughput:
fetch.min.bytes: 1048576 # 1 MB
fetch.max.wait.ms: 500 # Wait 500ms to accumulate
Low Latency:
fetch.min.bytes: 1 # Immediate fetch
fetch.max.wait.ms: 100 # Short wait
Max Parallelism:
# Deploy consumers = number of partitions
# More consumers than partitions = idle consumers
```
## Multi-Datacenter Patterns
### Active-Passive (Disaster Recovery)
```yaml
Architecture:
Primary DC: Full Kafka cluster
Secondary DC: Replica cluster (MirrorMaker 2)
Configuration:
- Producers → Primary only
- Consumers → Primary only
- MirrorMaker 2: Primary → Secondary (async replication)
Failover:
1. Detect primary failure
2. Switch producers/consumers to secondary
3. Promote secondary to primary
Recovery Time: 5-30 minutes (manual)
Data Loss: Potential (async replication lag)
```
### Active-Active (Geo-Replication)
```yaml
Architecture:
DC1: Kafka cluster (region A)
DC2: Kafka cluster (region B)
Bidirectional replication via MirrorMaker 2
Configuration:
- Producers → Nearest DC
- Consumers → Nearest DC or both
- Conflict resolution: Last-write-wins or custom
Challenges:
- Duplicate messages (at-least-once delivery)
- Ordering across DCs not guaranteed
- Circular replication prevention
Use Cases:
- Global applications
- Regional compliance (GDPR)
- Load distribution
```
### Stretch Cluster (Synchronous Replication)
```yaml
Architecture:
Single Kafka cluster spanning 2 DCs
Rack awareness: DC1 = rack1, DC2 = rack2
Configuration:
min.insync.replicas: 2
replication.factor: 4 (2 per DC)
acks: all
Requirements:
- Low latency between DCs (<10ms)
- High bandwidth link (10+ Gbps)
- Dedicated fiber
Trade-offs:
Pros: Synchronous replication, zero data loss
Cons: Latency penalty, network dependency
```
## Monitoring & Observability
### Key Metrics
**Broker Metrics**:
```yaml
UnderReplicatedPartitions:
Alert: > 0 for > 5 minutes
Indicates: Replica lag, broker failure
OfflinePartitionsCount:
Alert: > 0
Indicates: No leader elected (critical!)
ActiveControllerCount:
Alert: != 1 (should be exactly 1)
Indicates: Split brain or no controller
RequestHandlerAvgIdlePercent:
Alert: < 20%
Indicates: Broker CPU saturation
```
**Topic Metrics**:
```yaml
MessagesInPerSec:
Monitor: Throughput trends
Alert: Sudden drops (producer failure)
BytesInPerSec / BytesOutPerSec:
Monitor: Network utilization
Alert: Approaching NIC limits
RecordsLagMax (Consumer):
Alert: > 10000 or growing
Indicates: Consumer can't keep up
```
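Consumer lag can also be approximated from the client side. A minimal sketch (assuming kafkajs v2 and an existing consumer group) compares each partition's log-end offset with the group's committed offset:

```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'lag-checker', brokers: ['localhost:9092'] });

// Lag per partition = log-end offset (producer side) - committed offset (consumer side).
async function printLag(groupId: string, topic: string): Promise<void> {
  const admin = kafka.admin();
  await admin.connect();
  try {
    const endOffsets = await admin.fetchTopicOffsets(topic); // [{ partition, offset, high, low }]
    const committed = await admin.fetchOffsets({ groupId, topics: [topic] });
    const byPartition = new Map(
      committed[0].partitions.map((p) => [p.partition, BigInt(p.offset)] as [number, bigint]),
    );
    for (const { partition, offset } of endOffsets) {
      // -1 means no committed offset yet; treat it as 0 for a rough estimate.
      const committedOffset = byPartition.get(partition) ?? 0n;
      const lag = BigInt(offset) - (committedOffset < 0n ? 0n : committedOffset);
      console.log(`partition ${partition}: lag=${lag}`);
    }
  } finally {
    await admin.disconnect();
  }
}

printLag('my-group', 'my-topic').catch(console.error);
```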
**Disk Metrics**:
```yaml
LogSegmentSize:
Monitor: Disk usage trends
Alert: > 80% capacity
LogFlushRateAndTimeMs:
Monitor: Disk write latency
Alert: > 100ms p99 (slow disk)
```
## Security Patterns
### Authentication & Authorization
```yaml
SASL/SCRAM-SHA-512:
- Industry standard
- User/password authentication
- Stored in ZooKeeper/KRaft
ACLs (Access Control Lists):
- Per-topic, per-group permissions
- Operations: READ, WRITE, CREATE, DELETE, ALTER
- Example:
bin/kafka-acls.sh --add \
--allow-principal User:alice \
--operation READ \
--topic orders
mTLS (Mutual TLS):
- Certificate-based auth
- Strong cryptographic identity
- Best for service-to-service
```
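The same grant can be scripted. A hedged sketch using the kafkajs admin API (the cluster must have an authorizer enabled, and the admin client must authenticate as a principal allowed to manage ACLs; names mirror the kafka-acls.sh example above):

```typescript
import {
  Kafka,
  AclResourceTypes,
  AclOperationTypes,
  AclPermissionTypes,
  ResourcePatternTypes,
} from 'kafkajs';

const kafka = new Kafka({ clientId: 'acl-admin', brokers: ['localhost:9092'] });

// Grants User:alice READ on the "orders" topic, like the kafka-acls.sh example above.
async function grantRead(): Promise<void> {
  const admin = kafka.admin();
  await admin.connect();
  try {
    await admin.createAcls({
      acl: [{
        resourceType: AclResourceTypes.TOPIC,
        resourceName: 'orders',
        resourcePatternType: ResourcePatternTypes.LITERAL,
        principal: 'User:alice',
        host: '*',
        operation: AclOperationTypes.READ,
        permissionType: AclPermissionTypes.ALLOW,
      }],
    });
  } finally {
    await admin.disconnect();
  }
}

grantRead().catch(console.error);
```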
## Integration with SpecWeave
**Automatic Architecture Detection**:
```typescript
import { ClusterSizingCalculator } from './lib/utils/sizing';
const calculator = new ClusterSizingCalculator();
const recommendation = calculator.calculate({
throughputMBps: 200,
retentionDays: 30,
replicationFactor: 3,
topicCount: 100
});
console.log(recommendation);
// {
// brokers: 8,
// partitionsPerBroker: 1500,
// diskPerBroker: 6000 GB,
// ramPerBroker: 64 GB
// }
```
**SpecWeave Commands**:
- `/specweave-kafka:deploy` - Validates cluster sizing before deployment
- `/specweave-kafka:monitor-setup` - Configures metrics for key indicators
## Related Skills
- `/specweave-kafka:kafka-mcp-integration` - MCP server setup
- `/specweave-kafka:kafka-cli-tools` - CLI operations
## External Links
- [Kafka Documentation - Architecture](https://kafka.apache.org/documentation/#design)
- [Confluent - Kafka Sizing](https://www.confluent.io/blog/how-to-choose-the-number-of-topics-partitions-in-a-kafka-cluster/)
- [KRaft Mode Overview](https://kafka.apache.org/documentation/#kraft)
- [LinkedIn Engineering - Kafka at Scale](https://engineering.linkedin.com/kafka/running-kafka-scale)

View File

@@ -0,0 +1,433 @@
---
name: kafka-cli-tools
description: Expert knowledge of Kafka CLI tools (kcat, kcli, kaf, kafkactl). Auto-activates on keywords kcat, kafkacat, kcli, kaf, kafkactl, kafka cli, kafka command line, produce message, consume topic, list topics, kafka metadata. Provides command examples, installation guides, and tool comparisons.
---
# Kafka CLI Tools Expert
Comprehensive knowledge of modern Kafka CLI tools for production operations, development, and troubleshooting.
## Supported CLI Tools
### 1. kcat (kafkacat) - The Swiss Army Knife
**Installation**:
```bash
# macOS
brew install kcat
# Ubuntu/Debian (packaged as kafkacat on older releases, kcat on newer ones)
apt-get install kafkacat

# From source
git clone https://github.com/edenhill/kcat.git
cd kcat
./configure && make && sudo make install
```
**Core Operations**:
**Produce Messages**:
```bash
# Simple produce
echo "Hello Kafka" | kcat -P -b localhost:9092 -t my-topic
# Produce with key (key:value format)
echo "user123:Login event" | kcat -P -b localhost:9092 -t events -K:
# Produce from file
cat events.json | kcat -P -b localhost:9092 -t events
# Produce with headers
echo "msg" | kcat -P -b localhost:9092 -t my-topic -H "source=app1" -H "version=1.0"
# Produce with compression
echo "data" | kcat -P -b localhost:9092 -t my-topic -z gzip
# Produce with acks=all
echo "critical-data" | kcat -P -b localhost:9092 -t my-topic -X acks=all
```
**Consume Messages**:
```bash
# Consume from beginning
kcat -C -b localhost:9092 -t my-topic -o beginning
# Consume from end (latest)
kcat -C -b localhost:9092 -t my-topic -o end
# Consume specific partition
kcat -C -b localhost:9092 -t my-topic -p 0 -o beginning
# Consume with consumer group
kcat -C -b localhost:9092 -G my-group my-topic
# Consume N messages and exit
kcat -C -b localhost:9092 -t my-topic -c 10
# Custom format (topic:partition:offset:key:value)
kcat -C -b localhost:9092 -t my-topic -f 'Topic: %t, Partition: %p, Offset: %o, Key: %k, Value: %s\n'
# JSON output
kcat -C -b localhost:9092 -t my-topic -J
```
**Metadata & Admin**:
```bash
# List all topics
kcat -L -b localhost:9092
# Get topic metadata (JSON)
kcat -L -b localhost:9092 -t my-topic -J
# Query topic offsets
kcat -Q -b localhost:9092 -t my-topic
# Check broker health
kcat -L -b localhost:9092 | grep "broker\|topic"
```
**SASL/SSL Authentication**:
```bash
# SASL/PLAINTEXT
kcat -b localhost:9092 \
-X security.protocol=SASL_PLAINTEXT \
-X sasl.mechanism=PLAIN \
-X sasl.username=admin \
-X sasl.password=admin-secret \
-L
# SASL/SSL
kcat -b localhost:9093 \
-X security.protocol=SASL_SSL \
-X sasl.mechanism=SCRAM-SHA-256 \
-X sasl.username=admin \
-X sasl.password=admin-secret \
-X ssl.ca.location=/path/to/ca-cert \
-L
# mTLS (mutual TLS)
kcat -b localhost:9093 \
-X security.protocol=SSL \
-X ssl.ca.location=/path/to/ca-cert \
-X ssl.certificate.location=/path/to/client-cert.pem \
-X ssl.key.location=/path/to/client-key.pem \
-L
```
### 2. kcli - Kubernetes-Native Kafka CLI
**Installation**:
```bash
# Install via krew (Kubernetes plugin manager)
kubectl krew install kcli
# Or download binary
curl -LO https://github.com/cswank/kcli/releases/latest/download/kcli-linux-amd64
chmod +x kcli-linux-amd64
sudo mv kcli-linux-amd64 /usr/local/bin/kcli
```
**Kubernetes Integration**:
```bash
# Connect to Kafka running in k8s
kcli --context my-cluster --namespace kafka
# Produce to topic in k8s
echo "msg" | kcli produce --topic my-topic --brokers kafka-broker:9092
# Consume from k8s Kafka
kcli consume --topic my-topic --brokers kafka-broker:9092 --from-beginning
# List topics in k8s cluster
kcli topics list --brokers kafka-broker:9092
```
**Best For**:
- Kubernetes-native deployments
- Helmfile/Kustomize workflows
- GitOps with ArgoCD/Flux
### 3. kaf - Modern Terminal UI
**Installation**:
```bash
# macOS
brew install kaf
# Linux (via snap)
snap install kaf
# From source
go install github.com/birdayz/kaf/cmd/kaf@latest
```
**Interactive Features**:
```bash
# Configure cluster
kaf config add-cluster local --brokers localhost:9092
# Use cluster
kaf config use-cluster local
# Interactive topic browsing (TUI)
kaf topics
# Interactive consume (arrow keys to navigate)
kaf consume my-topic
# Produce interactively
kaf produce my-topic
# Consumer group management
kaf groups
kaf group describe my-group
kaf group reset my-group --topic my-topic --offset earliest
# Schema Registry integration
kaf schemas
kaf schema get my-schema
```
**Best For**:
- Development workflows
- Quick topic exploration
- Consumer group debugging
- Schema Registry management
### 4. kafkactl - Advanced Admin Tool
**Installation**:
```bash
# macOS
brew install deviceinsight/packages/kafkactl
# Linux
curl -L https://github.com/deviceinsight/kafkactl/releases/latest/download/kafkactl_linux_amd64 -o kafkactl
chmod +x kafkactl
sudo mv kafkactl /usr/local/bin/
# Via Docker
docker run --rm -it deviceinsight/kafkactl:latest
```
**Advanced Operations**:
```bash
# Configure context
kafkactl config add-context local --brokers localhost:9092
# Topic management
kafkactl create topic my-topic --partitions 3 --replication-factor 2
kafkactl alter topic my-topic --config retention.ms=86400000
kafkactl delete topic my-topic
# Consumer group operations
kafkactl describe consumer-group my-group
kafkactl reset consumer-group my-group --topic my-topic --offset earliest
kafkactl delete consumer-group my-group
# ACL management
kafkactl create acl --allow --principal User:alice --operation READ --topic my-topic
kafkactl list acls
# Quota management
kafkactl alter client-quota --user alice --producer-byte-rate 1048576
# Reassign partitions
kafkactl alter partition --topic my-topic --partition 0 --replicas 1,2,3
```
**Best For**:
- Production cluster management
- ACL administration
- Partition reassignment
- Quota management
## Tool Comparison Matrix
| Feature | kcat | kcli | kaf | kafkactl |
|---------|------|------|-----|----------|
| **Installation** | Easy | Medium | Easy | Easy |
| **Produce** | ✅ Advanced | ✅ Basic | ✅ Interactive | ✅ Basic |
| **Consume** | ✅ Advanced | ✅ Basic | ✅ Interactive | ✅ Basic |
| **Metadata** | ✅ JSON | ✅ Basic | ✅ TUI | ✅ Detailed |
| **TUI** | ❌ | ❌ | ✅ | ✅ Limited |
| **Admin** | ❌ | ❌ | ⚠️ Limited | ✅ Advanced |
| **SASL/SSL** | ✅ | ✅ | ✅ | ✅ |
| **K8s Native** | ❌ | ✅ | ❌ | ❌ |
| **Schema Reg** | ❌ | ❌ | ✅ | ❌ |
| **ACLs** | ❌ | ❌ | ❌ | ✅ |
| **Quotas** | ❌ | ❌ | ❌ | ✅ |
| **Best For** | Scripting, ops | Kubernetes | Development | Production admin |
## Common Patterns
### 1. Topic Creation with Optimal Settings
```bash
# Using kafkactl (recommended for production)
kafkactl create topic orders \
--partitions 12 \
--replication-factor 3 \
--config retention.ms=604800000 \
--config compression.type=lz4 \
--config min.insync.replicas=2
# Verify with kcat
kcat -L -b localhost:9092 -t orders -J | jq '.topics[0]'
```
### 2. Dead Letter Queue Pattern
```bash
# Produce failed message to DLQ
echo "failed-msg" | kcat -P -b localhost:9092 -t orders-dlq \
-H "original-topic=orders" \
-H "error=DeserializationException" \
-H "timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Monitor DLQ
kcat -C -b localhost:9092 -t orders-dlq -f 'Headers: %h\nValue: %s\n\n'
```
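The same pattern from application code, as a hedged kafkajs sketch (topic and header names mirror the kcat example; in real code the producer would be connected once per process, not per message):

```typescript
import { Kafka, KafkaMessage } from 'kafkajs';

const kafka = new Kafka({ clientId: 'order-consumer', brokers: ['localhost:9092'] });
const dlqProducer = kafka.producer();

// Forward a message that failed processing to the dead letter queue,
// preserving the original payload and attaching error metadata as headers.
async function sendToDlq(original: KafkaMessage, error: Error): Promise<void> {
  await dlqProducer.connect();
  await dlqProducer.send({
    topic: 'orders-dlq',
    messages: [{
      key: original.key,
      value: original.value,
      headers: {
        'original-topic': 'orders',
        'error': error.name,
        'timestamp': new Date().toISOString(),
      },
    }],
  });
  await dlqProducer.disconnect();
}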
### 3. Consumer Group Lag Monitoring
```bash
# Using kafkactl
kafkactl describe consumer-group my-app | grep LAG
# Using kcat (metadata and offsets only; kcat does not compute lag itself)
kcat -L -b localhost:9092 -J | jq '.topics[] | select(.topic=="my-topic") | .partitions[]'
# Using kaf (interactive)
kaf groups
# Then select group to see lag in TUI
```
### 4. Multi-Cluster Replication Testing
```bash
# Produce to source cluster
echo "test" | kcat -P -b source-kafka:9092 -t replicated-topic
# Consume from target cluster
kcat -C -b target-kafka:9092 -t replicated-topic -o end -c 1
# Compare offsets
kcat -Q -b source-kafka:9092 -t replicated-topic
kcat -Q -b target-kafka:9092 -t replicated-topic
```
### 5. Performance Testing
```bash
# Produce 10,000 messages with kcat
seq 1 10000 | kcat -P -b localhost:9092 -t perf-test
# Consume and measure throughput
time kcat -C -b localhost:9092 -t perf-test -c 10000 -o beginning > /dev/null
# Test with compression
seq 1 10000 | kcat -P -b localhost:9092 -t perf-test -z lz4
```
## Troubleshooting
### Connection Issues
```bash
# Test broker connectivity
kcat -L -b localhost:9092
# Check SSL/TLS connection
openssl s_client -connect localhost:9093 -showcerts
# Verify SASL authentication
kcat -b localhost:9092 \
-X security.protocol=SASL_PLAINTEXT \
-X sasl.mechanism=PLAIN \
-X sasl.username=admin \
-X sasl.password=wrong-password \
-L
# Should fail with authentication error
```
### Message Not Appearing
```bash
# Check topic exists
kcat -L -b localhost:9092 | grep my-topic
# Check partition count
kcat -L -b localhost:9092 -t my-topic -J | jq '.topics[0].partitions | length'
# Query all partition offsets
kcat -Q -b localhost:9092 -t my-topic
# Check the last message in each partition (adjust the range to your partition count)
for i in {0..11}; do
  echo "Partition $i:"
  kcat -C -b localhost:9092 -t my-topic -p $i -o -1 -c 1 -e
done
```
### Consumer Group Stuck
```bash
# Check consumer group state
kafkactl describe consumer-group my-app
# Reset to beginning
kafkactl reset consumer-group my-app --topic my-topic --offset earliest
# Reset to specific offset
kafkactl reset consumer-group my-app --topic my-topic --partition 0 --offset 12345
# Delete consumer group (all consumers must be stopped first)
kafkactl delete consumer-group my-app
```
## Integration with SpecWeave
**Automatic CLI Tool Detection**:
SpecWeave auto-detects installed CLI tools and recommends the best tool for each operation:
```typescript
import { CLIToolDetector } from './lib/cli/detector';
const detector = new CLIToolDetector();
const available = await detector.detectAll();
// Recommended tool for produce operation
if (available.includes('kcat')) {
console.log('Use kcat for produce (fastest)');
} else if (available.includes('kaf')) {
console.log('Use kaf for produce (interactive)');
}
```
**SpecWeave Commands**:
- `/specweave-kafka:dev-env` - Uses Docker Compose + kcat for local testing
- `/specweave-kafka:monitor-setup` - Sets up kcat-based lag monitoring
- `/specweave-kafka:mcp-configure` - Validates CLI tools are installed
## Security Best Practices
1. **Never hardcode credentials** - Use environment variables or secrets management
2. **Use SSL/TLS in production** - Configure `-X security.protocol=SASL_SSL`
3. **Prefer SCRAM over PLAIN** - Use `-X sasl.mechanism=SCRAM-SHA-256`
4. **Rotate credentials regularly** - Update passwords and certificates
5. **Least privilege** - Grant only necessary ACLs to users
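A minimal client-side sketch applying practices 1-3 (assuming kafkajs; environment variable names are illustrative):

```typescript
import { Kafka } from 'kafkajs';

// Credentials come from the environment (or a secrets manager), never from source control.
const { KAFKA_BROKERS, KAFKA_USERNAME, KAFKA_PASSWORD } = process.env;
if (!KAFKA_BROKERS || !KAFKA_USERNAME || !KAFKA_PASSWORD) {
  throw new Error('KAFKA_BROKERS, KAFKA_USERNAME and KAFKA_PASSWORD must be set');
}

export const kafka = new Kafka({
  clientId: 'secure-app',
  brokers: KAFKA_BROKERS.split(','),
  ssl: true,                    // TLS in transit (SASL_SSL)
  sasl: {
    mechanism: 'scram-sha-256', // prefer SCRAM over PLAIN
    username: KAFKA_USERNAME,
    password: KAFKA_PASSWORD,
  },
});
```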
## Related Skills
- `/specweave-kafka:kafka-mcp-integration` - MCP server setup and configuration
- `/specweave-kafka:kafka-architecture` - Cluster design and sizing
## External Links
- [kcat GitHub](https://github.com/edenhill/kcat)
- [kcli GitHub](https://github.com/cswank/kcli)
- [kaf GitHub](https://github.com/birdayz/kaf)
- [kafkactl GitHub](https://github.com/deviceinsight/kafkactl)
- [Apache Kafka Documentation](https://kafka.apache.org/documentation/)

View File

@@ -0,0 +1,449 @@
---
name: kafka-iac-deployment
description: Infrastructure as Code (IaC) deployment expert for Apache Kafka. Guides Terraform deployments across Apache Kafka (KRaft mode), AWS MSK, Azure Event Hubs. Activates for terraform, iac, infrastructure as code, deploy kafka, provision kafka, aws msk, azure event hubs, kafka infrastructure, terraform modules, cloud deployment, kafka deployment automation.
---
# Kafka Infrastructure as Code (IaC) Deployment
Expert guidance for deploying Apache Kafka using Terraform across multiple platforms.
## When to Use This Skill
I activate when you need help with:
- **Terraform deployments**: "Deploy Kafka with Terraform", "provision Kafka cluster"
- **Platform selection**: "Should I use AWS MSK or self-hosted Kafka?", "compare Kafka platforms"
- **Infrastructure planning**: "How to size Kafka infrastructure", "Kafka on AWS vs Azure"
- **IaC automation**: "Automate Kafka deployment", "CI/CD for Kafka infrastructure"
## What I Know
### Available Terraform Modules
This plugin provides 3 production-ready Terraform modules:
#### 1. **Apache Kafka (Self-Hosted, KRaft Mode)**
- **Location**: `plugins/specweave-kafka/terraform/apache-kafka/`
- **Platform**: AWS EC2 (can adapt to other clouds)
- **Architecture**: KRaft mode (no ZooKeeper dependency)
- **Features**:
- Multi-broker cluster (3-5 brokers recommended)
- Security groups with SASL_SSL
- IAM roles for S3 backups
- CloudWatch metrics and alarms
- Auto-scaling group support
- Custom VPC and subnet configuration
- **Use When**:
- ✅ You need full control over Kafka configuration
- ✅ Running Kafka 3.6+ (KRaft mode)
- ✅ Want to avoid ZooKeeper operational overhead
- ✅ Multi-cloud or hybrid deployments
- **Variables**:
```hcl
module "kafka" {
source = "../../plugins/specweave-kafka/terraform/apache-kafka"
environment = "production"
broker_count = 3
kafka_version = "3.7.0"
instance_type = "m5.xlarge"
vpc_id = var.vpc_id
subnet_ids = var.subnet_ids
domain = "example.com"
enable_s3_backups = true
enable_monitoring = true
}
```
#### 2. **AWS MSK (Managed Streaming for Kafka)**
- **Location**: `plugins/specweave-kafka/terraform/aws-msk/`
- **Platform**: AWS Managed Service
- **Features**:
- Fully managed Kafka service
- IAM authentication + SASL/SCRAM
- Auto-scaling (provisioned throughput)
- Built-in monitoring (CloudWatch)
- Multi-AZ deployment
- Encryption in transit and at rest
- **Use When**:
- ✅ You want AWS to manage Kafka operations
- ✅ Need tight AWS integration (IAM, KMS, CloudWatch)
- ✅ Prefer operational simplicity over cost
- ✅ Running in AWS VPC
- **Variables**:
```hcl
module "msk" {
source = "../../plugins/specweave-kafka/terraform/aws-msk"
cluster_name = "my-kafka-cluster"
kafka_version = "3.6.0"
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.large"
vpc_id = var.vpc_id
subnet_ids = var.private_subnet_ids
enable_iam_auth = true
enable_scram_auth = false
enable_auto_scaling = true
}
```
#### 3. **Azure Event Hubs (Kafka API)**
- **Location**: `plugins/specweave-kafka/terraform/azure-event-hubs/`
- **Platform**: Azure Managed Service
- **Features**:
- Kafka 1.0+ protocol support
- Auto-inflate (elastic scaling)
- Premium SKU for high throughput
- Zone redundancy
- Private endpoints (VNet integration)
- Event capture to Azure Storage
- **Use When**:
- ✅ Running on Azure cloud
- ✅ Need Kafka-compatible API without Kafka operations
- ✅ Want serverless scaling (auto-inflate)
- ✅ Integrating with Azure ecosystem
- **Variables**:
```hcl
module "event_hubs" {
source = "../../plugins/specweave-kafka/terraform/azure-event-hubs"
namespace_name = "my-event-hub-ns"
resource_group_name = var.resource_group_name
location = "eastus"
sku = "Premium"
capacity = 1
kafka_enabled = true
auto_inflate_enabled = true
maximum_throughput_units = 20
}
```
## Platform Selection Decision Tree
```
Need Kafka deployment? START HERE:
├─ Running on AWS?
│ ├─ YES → Want managed service?
│ │ ├─ YES → Use AWS MSK module (terraform/aws-msk)
│ │ └─ NO → Use Apache Kafka module (terraform/apache-kafka)
│ └─ NO → Continue...
├─ Running on Azure?
│ ├─ YES → Use Azure Event Hubs module (terraform/azure-event-hubs)
│ └─ NO → Continue...
├─ Multi-cloud or hybrid?
│ └─ YES → Use Apache Kafka module (most portable)
├─ Need maximum control?
│ └─ YES → Use Apache Kafka module
└─ Default → Use Apache Kafka module (self-hosted, KRaft mode)
```
## Deployment Workflows
### Workflow 1: Deploy Self-Hosted Kafka (Apache Kafka Module)
**Scenario**: You want full control over Kafka on AWS EC2
```bash
# 1. Create Terraform configuration
cat > main.tf <<EOF
module "kafka_cluster" {
source = "../../plugins/specweave-kafka/terraform/apache-kafka"
environment = "production"
broker_count = 3
kafka_version = "3.7.0"
instance_type = "m5.xlarge"
vpc_id = "vpc-12345678"
subnet_ids = ["subnet-abc", "subnet-def", "subnet-ghi"]
domain = "kafka.example.com"
enable_s3_backups = true
enable_monitoring = true
tags = {
Project = "MyApp"
Environment = "Production"
}
}
output "broker_endpoints" {
value = module.kafka_cluster.broker_endpoints
}
EOF
# 2. Initialize Terraform
terraform init
# 3. Plan deployment (review what will be created)
terraform plan
# 4. Apply (create infrastructure)
terraform apply
# 5. Get broker endpoints
terraform output broker_endpoints
# Output: ["kafka-0.kafka.example.com:9093", "kafka-1.kafka.example.com:9093", ...]
```
### Workflow 2: Deploy AWS MSK (Managed Service)
**Scenario**: You want AWS to manage Kafka operations
```bash
# 1. Create Terraform configuration
cat > main.tf <<EOF
module "msk_cluster" {
source = "../../plugins/specweave-kafka/terraform/aws-msk"
cluster_name = "my-msk-cluster"
kafka_version = "3.6.0"
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.large"
vpc_id = var.vpc_id
subnet_ids = var.private_subnet_ids
enable_iam_auth = true
enable_auto_scaling = true
tags = {
Project = "MyApp"
}
}
output "bootstrap_brokers" {
value = module.msk_cluster.bootstrap_brokers_sasl_iam
}
EOF
# 2. Deploy
terraform init && terraform apply
# 3. Configure IAM authentication
# (module outputs IAM policy, attach to your application role)
```
### Workflow 3: Deploy Azure Event Hubs (Kafka API)
**Scenario**: You're on Azure and want Kafka-compatible API
```bash
# 1. Create Terraform configuration
cat > main.tf <<EOF
module "event_hubs" {
source = "../../plugins/specweave-kafka/terraform/azure-event-hubs"
namespace_name = "my-kafka-namespace"
resource_group_name = "my-resource-group"
location = "eastus"
sku = "Premium"
capacity = 1
kafka_enabled = true
auto_inflate_enabled = true
maximum_throughput_units = 20
# Create hubs (topics) for your use case
hubs = [
{ name = "user-events", partitions = 12 },
{ name = "order-events", partitions = 6 },
{ name = "payment-events", partitions = 3 }
]
}
output "connection_string" {
value = module.event_hubs.connection_string
sensitive = true
}
EOF
# 2. Deploy
terraform init && terraform apply
# 3. Get connection details
terraform output connection_string
```
## Infrastructure Sizing Recommendations
### Small Environment (Dev/Test)
```hcl
# Self-hosted: 1 broker, m5.large
broker_count = 1
instance_type = "m5.large"
# AWS MSK: 1 broker per AZ, kafka.m5.large
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.large"
# Azure Event Hubs: Basic SKU
sku = "Basic"
capacity = 1
```
### Medium Environment (Staging/Production)
```hcl
# Self-hosted: 3 brokers, m5.xlarge
broker_count = 3
instance_type = "m5.xlarge"
# AWS MSK: 3 brokers, kafka.m5.xlarge
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.xlarge"
# Azure Event Hubs: Standard SKU with auto-inflate
sku = "Standard"
capacity = 2
auto_inflate_enabled = true
maximum_throughput_units = 10
```
### Large Environment (High-Throughput Production)
```hcl
# Self-hosted: 5+ brokers, m5.2xlarge or m5.4xlarge
broker_count = 5
instance_type = "m5.2xlarge"
# AWS MSK: 6+ brokers, kafka.m5.2xlarge, auto-scaling
number_of_broker_nodes = 6
broker_node_instance_type = "kafka.m5.2xlarge"
enable_auto_scaling = true
# Azure Event Hubs: Premium SKU with zone redundancy
sku = "Premium"
capacity = 4
zone_redundant = true
maximum_throughput_units = 20
```
## Best Practices
### Security Best Practices
1. **Always use encryption in transit**
- Self-hosted: Enable SASL_SSL listener
- AWS MSK: Set `encryption_in_transit_client_broker = "TLS"`
- Azure Event Hubs: HTTPS/TLS enabled by default
2. **Use IAM authentication (when possible)**
- AWS MSK: `enable_iam_auth = true`
- Azure Event Hubs: Managed identities
3. **Network isolation**
- Deploy in private subnets
- Use security groups/NSGs restrictively
- Azure: Enable private endpoints for Premium SKU
### High Availability Best Practices
1. **Multi-AZ deployment**
- Self-hosted: Distribute brokers across 3+ AZs
- AWS MSK: Automatically multi-AZ
- Azure Event Hubs: Enable `zone_redundant = true` (Premium)
2. **Replication factor = 3**
- Self-hosted: `default.replication.factor=3`
- AWS MSK: Configured automatically
- Azure Event Hubs: N/A (fully managed)
3. **min.insync.replicas = 2**
- Ensures durability even if 1 broker fails
### Cost Optimization
1. **Right-size instances**
- Use ClusterSizingCalculator utility (in kafka-architecture skill)
- Start small, scale up based on metrics
2. **Auto-scaling (where available)**
- AWS MSK: `enable_auto_scaling = true`
- Azure Event Hubs: `auto_inflate_enabled = true`
3. **Retention policies**
- Set `log.retention.hours` based on actual needs (default: 168 hours = 7 days)
- Shorter retention = lower storage costs
## Monitoring Integration
All modules integrate with monitoring:
### Self-Hosted Kafka
- CloudWatch metrics (via JMX Exporter)
- Prometheus + Grafana dashboards (see kafka-observability skill)
- Custom CloudWatch alarms
### AWS MSK
- Built-in CloudWatch metrics
- Enhanced monitoring available
- Integration with CloudWatch Alarms
### Azure Event Hubs
- Built-in Azure Monitor metrics
- Diagnostic logs to Log Analytics
- Integration with Azure Alerts
## Troubleshooting
### "Terraform destroy fails on security groups"
**Cause**: Resources using security groups still exist
**Fix**:
```bash
# 1. Find dependent resources
aws ec2 describe-network-interfaces --filters "Name=group-id,Values=sg-12345678"
# 2. Delete dependent resources first
# 3. Retry terraform destroy
```
### "AWS MSK cluster takes 20+ minutes to create"
**Cause**: MSK provisioning is inherently slow (AWS behavior)
**Fix**: This is expected; budget 20-30 minutes in automation pipelines and use `-auto-approve` to skip the interactive prompt:
```bash
terraform apply -auto-approve
```
### "Azure Event Hubs: Connection refused"
**Cause**: Kafka protocol not enabled OR incorrect connection string
**Fix**:
1. Verify `kafka_enabled = true` in Terraform
2. Use Kafka connection string (not Event Hubs connection string)
3. Check firewall rules (Premium SKU supports private endpoints)
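For reference, a hedged kafkajs sketch of the Kafka-protocol connection to Event Hubs (the SASL username is literally `$ConnectionString`; the namespace and environment variable names are illustrative):

```typescript
import { Kafka } from 'kafkajs';

// Event Hubs exposes the Kafka protocol on port 9093 of the namespace endpoint.
// The connection string comes from the Terraform output and stays out of source control.
const connectionString = process.env.EVENTHUB_CONNECTION_STRING;
if (!connectionString) {
  throw new Error('EVENTHUB_CONNECTION_STRING must be set');
}

const kafka = new Kafka({
  clientId: 'eventhubs-client',
  brokers: ['my-kafka-namespace.servicebus.windows.net:9093'],
  ssl: true,
  sasl: {
    mechanism: 'plain',
    username: '$ConnectionString', // literal value required by Event Hubs
    password: connectionString,
  },
});

async function smokeTest(): Promise<void> {
  const admin = kafka.admin();
  await admin.connect();
  console.log(await admin.listTopics()); // hubs appear as Kafka topics
  await admin.disconnect();
}

smokeTest().catch(console.error);
```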
## Integration with Other Skills
- **kafka-architecture**: For cluster sizing and partitioning strategy
- **kafka-observability**: For Prometheus + Grafana setup after deployment
- **kafka-kubernetes**: For deploying Kafka on Kubernetes (alternative to Terraform)
- **kafka-cli-tools**: For testing deployed clusters with kcat
## Quick Reference Commands
```bash
# Terraform workflow
terraform init # Initialize modules
terraform plan # Preview changes
terraform apply # Create infrastructure
terraform output # Get outputs (endpoints, etc.)
terraform destroy # Delete infrastructure
# AWS MSK specific
aws kafka list-clusters # List MSK clusters
aws kafka describe-cluster --cluster-arn <arn> # Get cluster details
# Azure Event Hubs specific
az eventhubs namespace list # List namespaces
az eventhubs eventhub list --namespace-name <name> --resource-group <rg> # List hubs
```
---
**Next Steps After Deployment**:
1. Use **kafka-observability** skill to set up Prometheus + Grafana monitoring
2. Use **kafka-cli-tools** skill to test cluster with kcat
3. Deploy your producer/consumer applications
4. Monitor cluster health and performance

View File

@@ -0,0 +1,667 @@
---
name: kafka-kubernetes
description: Kubernetes deployment expert for Apache Kafka. Guides K8s deployments using Helm charts, operators (Strimzi, Confluent), StatefulSets, and production best practices. Activates for kubernetes, k8s, helm, kafka on kubernetes, strimzi, confluent operator, kafka operator, statefulset, kafka helm chart, k8s deployment, kubernetes kafka, deploy kafka to k8s.
---
# Kafka on Kubernetes Deployment
Expert guidance for deploying Apache Kafka on Kubernetes using industry-standard tools.
## When to Use This Skill
I activate when you need help with:
- **Kubernetes deployments**: "Deploy Kafka on Kubernetes", "run Kafka in K8s", "Kafka Helm chart"
- **Operator selection**: "Strimzi vs Confluent Operator", "which Kafka operator to use"
- **StatefulSet patterns**: "Kafka StatefulSet best practices", "persistent volumes for Kafka"
- **Production K8s**: "Production-ready Kafka on K8s", "Kafka high availability in Kubernetes"
## What I Know
### Deployment Options Comparison
| Approach | Difficulty | Production-Ready | Best For |
|----------|-----------|------------------|----------|
| **Strimzi Operator** | Easy | ✅ Yes | Self-managed Kafka on K8s, CNCF project |
| **Confluent Operator** | Medium | ✅ Yes | Enterprise features, Confluent ecosystem |
| **Bitnami Helm Chart** | Easy | ⚠️ Mostly | Quick dev/staging environments |
| **Custom StatefulSet** | Hard | ⚠️ Requires expertise | Full control, custom requirements |
**Recommendation**: **Strimzi Operator** for most production use cases (CNCF project, active community, KRaft support)
## Deployment Approach 1: Strimzi Operator (Recommended)
**Strimzi** is a CNCF project providing Kubernetes operators for Apache Kafka.
### Features
- ✅ KRaft mode support (Kafka 3.6+, no ZooKeeper)
- ✅ Declarative Kafka management (CRDs)
- ✅ Automatic rolling upgrades
- ✅ Built-in monitoring (Prometheus metrics)
- ✅ Mirror Maker 2 for replication
- ✅ Kafka Connect integration
- ✅ User and topic management via CRDs
### Installation (Helm)
```bash
# 1. Add Strimzi Helm repository
helm repo add strimzi https://strimzi.io/charts/
helm repo update
# 2. Create namespace
kubectl create namespace kafka
# 3. Install Strimzi Operator
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
--namespace kafka \
--set watchNamespaces="{kafka}" \
--version 0.39.0
# 4. Verify operator is running
kubectl get pods -n kafka
# Output: strimzi-cluster-operator-... Running
```
### Deploy Kafka Cluster (KRaft Mode)
```yaml
# kafka-cluster.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
name: kafka-pool
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
replicas: 3
roles:
- controller
- broker
storage:
type: jbod
volumes:
- id: 0
type: persistent-claim
size: 100Gi
class: fast-ssd
deleteClaim: false
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: my-kafka-cluster
namespace: kafka
annotations:
strimzi.io/kraft: enabled
strimzi.io/node-pools: enabled
spec:
kafka:
version: 3.7.0
metadataVersion: 3.7-IV4
replicas: 3
listeners:
- name: plain
port: 9092
type: internal
tls: false
- name: tls
port: 9093
type: internal
tls: true
authentication:
type: tls
- name: external
port: 9094
type: loadbalancer
tls: true
authentication:
type: tls
config:
default.replication.factor: 3
min.insync.replicas: 2
offsets.topic.replication.factor: 3
transaction.state.log.replication.factor: 3
transaction.state.log.min.isr: 2
auto.create.topics.enable: false
log.retention.hours: 168
log.segment.bytes: 1073741824
compression.type: lz4
resources:
requests:
memory: 4Gi
cpu: "2"
limits:
memory: 8Gi
cpu: "4"
jvmOptions:
-Xms: 2048m
-Xmx: 4096m
metricsConfig:
type: jmxPrometheusExporter
valueFrom:
configMapKeyRef:
name: kafka-metrics
key: kafka-metrics-config.yml
```
```bash
# Apply Kafka cluster
kubectl apply -f kafka-cluster.yaml
# Wait for cluster to be ready (5-10 minutes)
kubectl wait kafka/my-kafka-cluster --for=condition=Ready --timeout=600s -n kafka
# Check status
kubectl get kafka -n kafka
# Output: my-kafka-cluster 3.7.0 3 True
```
### Create Topics (Declaratively)
```yaml
# kafka-topics.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
name: user-events
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
partitions: 12
replicas: 3
config:
retention.ms: 604800000 # 7 days
segment.bytes: 1073741824
compression.type: lz4
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
name: order-events
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
partitions: 6
replicas: 3
config:
retention.ms: 2592000000 # 30 days
min.insync.replicas: 2
```
```bash
# Apply topics
kubectl apply -f kafka-topics.yaml
# Verify topics created
kubectl get kafkatopics -n kafka
```
### Create Users (Declaratively)
```yaml
# kafka-users.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: my-producer
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
authentication:
type: tls
authorization:
type: simple
acls:
- resource:
type: topic
name: user-events
patternType: literal
operations: [Write, Describe]
- resource:
type: topic
name: order-events
patternType: literal
operations: [Write, Describe]
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: my-consumer
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
authentication:
type: tls
authorization:
type: simple
acls:
- resource:
type: topic
name: user-events
patternType: literal
operations: [Read, Describe]
- resource:
type: group
name: my-consumer-group
patternType: literal
operations: [Read]
```
```bash
# Apply users
kubectl apply -f kafka-users.yaml
# Get user credentials (TLS certificates)
kubectl get secret my-producer -n kafka -o jsonpath='{.data.user\.crt}' | base64 -d > producer.crt
kubectl get secret my-producer -n kafka -o jsonpath='{.data.user\.key}' | base64 -d > producer.key
kubectl get secret my-kafka-cluster-cluster-ca-cert -n kafka -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt
```
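A sketch of how an application would use those extracted files with a TLS-authenticated client (assuming kafkajs; the bootstrap address shown is the standard Strimzi `<cluster>-kafka-bootstrap` service on the TLS listener and may differ in your setup):

```typescript
import { readFileSync } from 'fs';
import { Kafka } from 'kafkajs';

// Certificates extracted from the Strimzi-generated secrets above.
const kafka = new Kafka({
  clientId: 'my-producer',
  brokers: ['my-kafka-cluster-kafka-bootstrap.kafka.svc:9093'],
  ssl: {
    ca: [readFileSync('ca.crt', 'utf-8')],
    cert: readFileSync('producer.crt', 'utf-8'),
    key: readFileSync('producer.key', 'utf-8'),
  },
});

async function produce(): Promise<void> {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'user-events',
    messages: [{ key: 'user-123', value: JSON.stringify({ event: 'login' }) }],
  });
  await producer.disconnect();
}

produce().catch(console.error);
```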
## Deployment Approach 2: Confluent Operator
**Confluent for Kubernetes (CFK)** provides enterprise-grade Kafka management.
### Features
- ✅ Full Confluent Platform (Kafka, Schema Registry, ksqlDB, Connect)
- ✅ Hybrid deployments (K8s + on-prem)
- ✅ Rolling upgrades with zero downtime
- ✅ Multi-region replication
- ✅ Advanced security (RBAC, encryption)
- ⚠️ Requires Confluent Platform license (paid)
### Installation
```bash
# 1. Add Confluent Helm repository
helm repo add confluentinc https://packages.confluent.io/helm
helm repo update
# 2. Create namespace
kubectl create namespace confluent
# 3. Install Confluent Operator
helm install confluent-operator confluentinc/confluent-for-kubernetes \
--namespace confluent \
--version 0.921.11
# 4. Verify
kubectl get pods -n confluent
```
### Deploy Kafka Cluster
```yaml
# kafka-cluster-confluent.yaml
apiVersion: platform.confluent.io/v1beta1
kind: Kafka
metadata:
name: kafka
namespace: confluent
spec:
replicas: 3
image:
application: confluentinc/cp-server:7.6.0
init: confluentinc/confluent-init-container:2.7.0
dataVolumeCapacity: 100Gi
storageClass:
name: fast-ssd
metricReporter:
enabled: true
listeners:
internal:
authentication:
type: plain
tls:
enabled: true
external:
authentication:
type: plain
tls:
enabled: true
dependencies:
zookeeper:
endpoint: zookeeper.confluent.svc.cluster.local:2181
podTemplate:
resources:
requests:
memory: 4Gi
cpu: 2
limits:
memory: 8Gi
cpu: 4
```
```bash
# Apply Kafka cluster
kubectl apply -f kafka-cluster-confluent.yaml
# Wait for cluster
kubectl wait kafka/kafka --for=condition=Ready --timeout=600s -n confluent
```
## Deployment Approach 3: Bitnami Helm Chart (Dev/Staging)
**Bitnami Helm Chart** is simple but less suitable for production.
### Installation
```bash
# 1. Add Bitnami repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
# 2. Install Kafka (KRaft mode)
helm install kafka bitnami/kafka \
--namespace kafka \
--create-namespace \
--set kraft.enabled=true \
--set controller.replicaCount=3 \
--set broker.replicaCount=3 \
--set persistence.size=100Gi \
--set persistence.storageClass=fast-ssd \
--set metrics.kafka.enabled=true \
--set metrics.jmx.enabled=true
# 3. Get bootstrap servers
export KAFKA_BOOTSTRAP=$(kubectl get svc kafka -n kafka -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'):9092
```
**Limitations**:
- ⚠️ Less production-ready than Strimzi/Confluent
- ⚠️ Limited declarative topic/user management
- ⚠️ Fewer advanced features (no MirrorMaker 2, limited RBAC)
## Production Best Practices
### 1. Storage Configuration
**Use SSD-backed storage classes** for Kafka logs:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
provisioner: ebs.csi.aws.com # AWS EBS CSI driver (pd.csi.storage.gke.io on GKE)
parameters:
  type: gp3          # AWS EBS GP3 (or io2 for extreme performance)
  iops: "3000"       # gp3 takes absolute iops/throughput, not iopsPerGB
  throughput: "125"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```
**Kafka storage requirements**:
- **Min IOPS**: 3000+ per broker
- **Min Throughput**: 125 MB/s per broker
- **Persistent**: Use `deleteClaim: false` (keep volumes when the Kafka cluster resource is deleted)
### 2. Resource Limits
```yaml
resources:
requests:
memory: 4Gi
cpu: "2"
limits:
memory: 8Gi
cpu: "4"
jvmOptions:
-Xms: 2048m # Initial heap (50% of memory request)
-Xmx: 4096m # Max heap (50% of memory limit, leave room for OS cache)
```
**Sizing guidelines**:
- **Small (dev)**: 2 CPU, 4Gi memory
- **Medium (staging)**: 4 CPU, 8Gi memory
- **Large (production)**: 8 CPU, 16Gi memory
### 3. Pod Disruption Budgets
Ensure high availability during K8s upgrades:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: kafka-pdb
namespace: kafka
spec:
maxUnavailable: 1
selector:
matchLabels:
app.kubernetes.io/name: kafka
```
### 4. Affinity Rules
**Spread brokers across availability zones**:
```yaml
spec:
kafka:
template:
pod:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: strimzi.io/name
operator: In
values:
- my-kafka-cluster-kafka
topologyKey: topology.kubernetes.io/zone
```
### 5. Network Policies
**Restrict access to Kafka brokers**:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: kafka-network-policy
namespace: kafka
spec:
podSelector:
matchLabels:
strimzi.io/name: my-kafka-cluster-kafka
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: my-producer
- podSelector:
matchLabels:
app: my-consumer
ports:
- protocol: TCP
port: 9092
- protocol: TCP
port: 9093
```
## Monitoring Integration
### Prometheus + Grafana Setup
Strimzi provides built-in Prometheus metrics exporter:
```yaml
# kafka-metrics-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: kafka-metrics
namespace: kafka
data:
kafka-metrics-config.yml: |
# Use JMX Exporter config from:
# plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml
lowercaseOutputName: true
lowercaseOutputLabelNames: true
whitelistObjectNames:
- "kafka.server:type=BrokerTopicMetrics,name=*"
# ... (copy from kafka-jmx-exporter.yml)
```
```bash
# Apply metrics config
kubectl apply -f kafka-metrics-configmap.yaml
# Install Prometheus Operator (if not already installed)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
# Create PodMonitor for Kafka
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: kafka-metrics
namespace: kafka
spec:
selector:
matchLabels:
strimzi.io/kind: Kafka
podMetricsEndpoints:
- port: tcp-prometheus
interval: 30s
EOF
# Access Grafana dashboards (from kafka-observability skill)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Open: http://localhost:3000
# Dashboards: Kafka Cluster Overview, Broker Metrics, Consumer Lag, Topic Metrics, JVM Metrics
```
## Troubleshooting
### "Pods stuck in Pending state"
**Cause**: Insufficient resources or storage class not found
**Fix**:
```bash
# Check events
kubectl describe pod kafka-my-kafka-cluster-0 -n kafka
# Check storage class exists
kubectl get storageclass
# If missing, create fast-ssd storage class (see Production Best Practices above)
```
### "Kafka broker not ready after 10 minutes"
**Cause**: Slow storage provisioning or resource limits too low
**Fix**:
```bash
# Check broker logs
kubectl logs kafka-my-kafka-cluster-0 -n kafka
# Common issues:
# 1. Low IOPS on storage → Use GP3 or better
# 2. Low memory → Increase resources.requests.memory
# 3. KRaft quorum not formed → Check all brokers are running
```
### "Cannot connect to Kafka from outside K8s"
**Cause**: External listener not configured
**Fix**:
```yaml
# Add external listener (Strimzi)
spec:
kafka:
listeners:
- name: external
port: 9094
type: loadbalancer
tls: true
authentication:
type: tls
# Get external bootstrap server
kubectl get kafka my-kafka-cluster -n kafka -o jsonpath='{.status.listeners[?(@.name=="external")].bootstrapServers}'
```
## Scaling Operations
### Horizontal Scaling (Add Brokers)
```bash
# Strimzi: Update KafkaNodePool replicas
kubectl patch kafkanodepool kafka-pool -n kafka --type='json' \
-p='[{"op": "replace", "path": "/spec/replicas", "value": 5}]'
# Confluent: Update Kafka CR
kubectl patch kafka kafka -n confluent --type='json' \
-p='[{"op": "replace", "path": "/spec/replicas", "value": 5}]'
# Wait for new brokers
kubectl rollout status statefulset/kafka-my-kafka-cluster-kafka -n kafka
```
### Vertical Scaling (Change Resources)
```bash
# Update resources in Kafka CR
kubectl patch kafka my-kafka-cluster -n kafka --type='json' \
-p='[
{"op": "replace", "path": "/spec/kafka/resources/requests/memory", "value": "8Gi"},
{"op": "replace", "path": "/spec/kafka/resources/requests/cpu", "value": "4"}
]'
# Rolling restart will happen automatically
```
## Integration with Other Skills
- **kafka-iac-deployment**: Alternative to K8s (use Terraform for cloud-managed Kafka)
- **kafka-observability**: Set up Prometheus + Grafana dashboards for K8s Kafka
- **kafka-architecture**: Cluster sizing and partitioning strategy
- **kafka-cli-tools**: Test K8s Kafka cluster with kcat
## Quick Reference Commands
```bash
# Strimzi
kubectl get kafka -n kafka # List Kafka clusters
kubectl get kafkatopics -n kafka # List topics
kubectl get kafkausers -n kafka # List users
kubectl logs kafka-my-kafka-cluster-0 -n kafka # Check broker logs
# Confluent
kubectl get kafka -n confluent # List Kafka clusters
kubectl get schemaregistry -n confluent # List Schema Registry
kubectl get ksqldb -n confluent # List ksqlDB
# Port-forward for testing
kubectl port-forward -n kafka svc/my-kafka-cluster-kafka-bootstrap 9092:9092
```
---
**Next Steps After K8s Deployment**:
1. Use **kafka-observability** skill to verify Prometheus metrics and Grafana dashboards
2. Use **kafka-cli-tools** skill to test cluster with kcat
3. Deploy your producer/consumer applications to K8s
4. Set up GitOps for declarative topic/user management (ArgoCD, Flux)

View File

@@ -0,0 +1,290 @@
---
name: kafka-mcp-integration
description: MCP server integration for Kafka operations. Auto-activates on keywords kafka mcp, mcp server, mcp configure, mcp setup, kanapuli, tuannvm, confluent mcp, kafka integration. Provides configuration examples and connection guidance for all 4 MCP servers.
---
# Kafka MCP Server Integration
Expert knowledge for integrating SpecWeave with Kafka MCP (Model Context Protocol) servers. Supports 4 MCP server implementations with auto-detection and configuration guidance.
---
> **Code-First Recommendation**: For most Kafka automation tasks, [writing code is better than MCP](https://www.anthropic.com/engineering/code-execution-with-mcp) (98% token reduction). Use **kafkajs** or **kafka-node** directly:
>
> ```typescript
> import { Kafka } from 'kafkajs';
> const kafka = new Kafka({ brokers: ['localhost:9092'] });
> const producer = kafka.producer();
> await producer.connect();
> await producer.send({ topic: 'events', messages: [{ value: 'Hello' }] });
> ```
>
> **When MCP IS useful**: Quick interactive debugging, topic exploration, Claude Desktop integration.
>
> **When to use code instead**: CI/CD pipelines, test automation, production scripts, anything that should be committed and reusable.
---
## Supported MCP Servers
### 1. kanapuli/mcp-kafka (Node.js)
**Installation**:
```bash
npm install -g mcp-kafka
```
**Capabilities**:
- Authentication: SASL_PLAINTEXT, PLAINTEXT
- Operations: produce, consume, list-topics, describe-topic, get-offsets
- Best for: Basic Kafka operations, quick prototyping
**Configuration Example**:
```json
{
"mcpServers": {
"kafka": {
"command": "npx",
"args": ["mcp-kafka"],
"env": {
"KAFKA_BROKERS": "localhost:9092",
"KAFKA_SASL_MECHANISM": "plain",
"KAFKA_SASL_USERNAME": "user",
"KAFKA_SASL_PASSWORD": "password"
}
}
}
}
```
### 2. tuannvm/kafka-mcp-server (Go)
**Installation**:
```bash
go install github.com/tuannvm/kafka-mcp-server@latest
```
**Capabilities**:
- Authentication: SASL_SCRAM_SHA_256, SASL_SCRAM_SHA_512, SASL_SSL, PLAINTEXT
- Operations: All CRUD operations, consumer group management, offset management
- Best for: Production use, advanced SASL authentication
**Configuration Example**:
```json
{
"mcpServers": {
"kafka": {
"command": "kafka-mcp-server",
"args": [
"--brokers", "localhost:9092",
"--sasl-mechanism", "SCRAM-SHA-256",
"--sasl-username", "admin",
"--sasl-password", "admin-secret"
]
}
}
}
```
### 3. Joel-hanson/kafka-mcp-server (Python)
**Installation**:
```bash
pip install kafka-mcp-server
```
**Capabilities**:
- Authentication: SASL_PLAINTEXT, PLAINTEXT, SSL
- Operations: produce, consume, list-topics, describe-topic
- Best for: Claude Desktop integration, Python ecosystem
**Configuration Example**:
```json
{
"mcpServers": {
"kafka": {
"command": "python",
"args": ["-m", "kafka_mcp_server"],
"env": {
"KAFKA_BOOTSTRAP_SERVERS": "localhost:9092"
}
}
}
}
```
### 4. Confluent Official MCP (Enterprise)
**Installation**:
```bash
confluent plugin install mcp-server
```
**Capabilities**:
- Authentication: OAuth, SASL_SCRAM, API Keys
- Operations: All Kafka operations, Schema Registry, ksqlDB, Flink SQL
- Advanced: Natural language interface, AI-powered query generation
- Best for: Confluent Cloud, enterprise deployments
**Configuration Example**:
```json
{
"mcpServers": {
"kafka": {
"command": "confluent",
"args": ["mcp", "start"],
"env": {
"CONFLUENT_CLOUD_API_KEY": "your-api-key",
"CONFLUENT_CLOUD_API_SECRET": "your-api-secret"
}
}
}
}
```
## Auto-Detection
SpecWeave can auto-detect installed MCP servers:
```bash
/specweave-kafka:mcp-configure
```
This command:
1. Scans for installed MCP servers (npm, go, pip, confluent CLI)
2. Checks which servers are currently running
3. Ranks servers by capabilities (Confluent > tuannvm > kanapuli > Joel-hanson)
4. Generates recommended configuration
5. Tests connection
## Quick Start
### Option 1: Auto-Configure (Recommended)
```bash
/specweave-kafka:mcp-configure
```
Interactive wizard guides you through:
- MCP server selection (or auto-detect)
- Broker URL configuration
- Authentication setup
- Connection testing
### Option 2: Manual Configuration
1. **Install preferred MCP server** (see installation commands above)
2. **Create `.mcp.json` configuration**:
```json
{
"serverType": "tuannvm",
"brokerUrls": ["localhost:9092"],
"authentication": {
"mechanism": "SASL/SCRAM-SHA-256",
"username": "admin",
"password": "admin-secret"
}
}
```
3. **Test connection**:
```bash
# Via MCP server CLI
kafka-mcp-server test-connection
# Or via SpecWeave
node -e "import('./dist/lib/mcp/detector.js').then(async ({ MCPServerDetector }) => {
const detector = new MCPServerDetector();
const result = await detector.detectAll();
console.log(JSON.stringify(result, null, 2));
});"
```
## MCP Server Comparison
| Feature | kanapuli | tuannvm | Joel-hanson | Confluent |
|---------|----------|---------|-------------|-----------|
| **Language** | Node.js | Go | Python | Official CLI |
| **SASL_PLAINTEXT** | ✅ | ✅ | ✅ | ✅ |
| **SCRAM-SHA-256** | ❌ | ✅ | ❌ | ✅ |
| **SCRAM-SHA-512** | ❌ | ✅ | ❌ | ✅ |
| **mTLS/SSL** | ❌ | ✅ | ✅ | ✅ |
| **OAuth** | ❌ | ❌ | ❌ | ✅ |
| **Consumer Groups** | ❌ | ✅ | ❌ | ✅ |
| **Offset Mgmt** | ❌ | ✅ | ❌ | ✅ |
| **Schema Registry** | ❌ | ❌ | ❌ | ✅ |
| **ksqlDB** | ❌ | ❌ | ❌ | ✅ |
| **Flink SQL** | ❌ | ❌ | ❌ | ✅ |
| **AI/NL Interface** | ❌ | ❌ | ❌ | ✅ |
| **Best For** | Prototyping | Production | Desktop | Enterprise |
## Troubleshooting
### MCP Server Not Detected
```bash
# Check if MCP server installed
npm list -g mcp-kafka # kanapuli
which kafka-mcp-server # tuannvm
pip show kafka-mcp-server # Joel-hanson
confluent version # Confluent
```
### Connection Refused
- Verify Kafka broker is running: `kcat -L -b localhost:9092`
- Check firewall rules
- Validate broker URL (correct host:port)
### Authentication Failed
- Double-check credentials (username, password, API keys)
- Verify SASL mechanism matches broker configuration
- Check broker logs for authentication errors
### Operations Not Working
- Ensure MCP server supports the operation (see comparison table)
- Check broker ACLs (permissions for the authenticated user)
- Verify topic exists: `/specweave-kafka:mcp-configure list-topics`
## Operations via MCP
Once configured, you can perform Kafka operations via MCP:
```typescript
import { MCPServerDetector } from './lib/mcp/detector';
const detector = new MCPServerDetector();
const result = await detector.detectAll();
// Use recommended server
if (result.recommended) {
console.log(`Using ${result.recommended} MCP server`);
console.log(`Reason: ${result.rankingReason}`);
}
```
## Security Best Practices
1. **Never commit credentials** - Use environment variables or secrets manager
2. **Use strongest auth** - Prefer SCRAM-SHA-512 > SCRAM-SHA-256 > PLAINTEXT
3. **Enable TLS/SSL** - Encrypt communication with broker
4. **Rotate credentials** - Regularly update passwords and API keys
5. **Least privilege** - Grant only necessary ACLs to MCP server user
## Related Commands
- `/specweave-kafka:mcp-configure` - Interactive MCP server setup
- `/specweave-kafka:dev-env start` - Start local Kafka for testing
- `/specweave-kafka:deploy` - Deploy production Kafka cluster
## External Links
- [kanapuli/mcp-kafka](https://github.com/kanapuli/mcp-kafka)
- [tuannvm/kafka-mcp-server](https://github.com/tuannvm/kafka-mcp-server)
- [Joel-hanson/kafka-mcp-server](https://github.com/Joel-hanson/kafka-mcp-server)
- [Confluent MCP Documentation](https://docs.confluent.io/platform/current/mcp/)
- [MCP Protocol Specification](https://modelcontextprotocol.org/)

View File

@@ -0,0 +1,576 @@
---
name: kafka-observability
description: Kafka monitoring and observability expert. Guides Prometheus + Grafana setup, JMX metrics, alerting rules, and dashboard configuration. Activates for kafka monitoring, prometheus, grafana, kafka metrics, jmx exporter, kafka observability, monitoring setup, kafka dashboards, alerting, kafka performance monitoring, metrics collection.
---
# Kafka Monitoring & Observability
Expert guidance for implementing comprehensive monitoring and observability for Apache Kafka using Prometheus and Grafana.
## When to Use This Skill
I activate when you need help with:
- **Monitoring setup**: "Set up Kafka monitoring", "configure Prometheus for Kafka", "Grafana dashboards for Kafka"
- **Metrics collection**: "Kafka JMX metrics", "export Kafka metrics to Prometheus"
- **Alerting**: "Kafka alerting rules", "alert on under-replicated partitions", "critical Kafka metrics"
- **Troubleshooting**: "Monitor Kafka performance", "track consumer lag", "broker health monitoring"
## What I Know
### Available Monitoring Components
This plugin provides a complete monitoring stack:
#### 1. **Prometheus JMX Exporter Configuration**
- **Location**: `plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml`
- **Purpose**: Export Kafka JMX metrics to Prometheus format
- **Metrics Exported**:
- Broker topic metrics (bytes in/out, messages in, request rate)
- Replica manager (under-replicated partitions, ISR shrinks/expands)
- Controller metrics (active controller, offline partitions, leader elections)
- Request metrics (produce/fetch latency)
- Log metrics (flush rate, flush latency)
- JVM metrics (heap, GC, threads, file descriptors)
#### 2. **Grafana Dashboards** (5 Dashboards)
- **Location**: `plugins/specweave-kafka/monitoring/grafana/dashboards/`
- **Dashboards**:
1. **kafka-cluster-overview.json** - Cluster health and throughput
2. **kafka-broker-metrics.json** - Per-broker performance
3. **kafka-consumer-lag.json** - Consumer lag monitoring
4. **kafka-topic-metrics.json** - Topic-level metrics
5. **kafka-jvm-metrics.json** - JVM health (heap, GC, threads)
#### 3. **Grafana Provisioning**
- **Location**: `plugins/specweave-kafka/monitoring/grafana/provisioning/`
- **Files**:
- `dashboards/kafka.yml` - Dashboard provisioning config
- `datasources/prometheus.yml` - Prometheus datasource config
## Setup Workflow 1: JMX Exporter (Self-Hosted Kafka)
For Kafka running on VMs or bare metal (non-Kubernetes).
### Step 1: Download JMX Prometheus Agent
```bash
# Download JMX Prometheus agent JAR
cd /opt
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar
# Copy JMX Exporter config
cp plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml /opt/kafka-jmx-exporter.yml
```
### Step 2: Configure Kafka Broker
Add JMX exporter to Kafka startup script:
```bash
# Edit Kafka startup (e.g., /etc/systemd/system/kafka.service)
[Service]
Environment="KAFKA_OPTS=-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
```
Or add to `kafka-server-start.sh`:
```bash
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
```
### Step 3: Restart Kafka and Verify
```bash
# Restart Kafka broker
sudo systemctl restart kafka
# Verify JMX exporter is running (port 7071)
curl localhost:7071/metrics | grep kafka_server
# Expected output: kafka_server_broker_topic_metrics_bytesin_total{...} 12345
```
### Step 4: Configure Prometheus Scraping
Add Kafka brokers to Prometheus config:
```yaml
# prometheus.yml
scrape_configs:
- job_name: 'kafka'
static_configs:
- targets:
- 'kafka-broker-1:7071'
- 'kafka-broker-2:7071'
- 'kafka-broker-3:7071'
scrape_interval: 30s
```
```bash
# Reload Prometheus
sudo systemctl reload prometheus
# OR send SIGHUP
kill -HUP $(pidof prometheus)
# Verify scraping
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
```
## Setup Workflow 2: Strimzi (Kubernetes)
For Kafka running on Kubernetes with Strimzi Operator.
### Step 1: Create JMX Exporter ConfigMap
```bash
# Create ConfigMap from JMX exporter config
kubectl create configmap kafka-metrics \
--from-file=kafka-metrics-config.yml=plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml \
-n kafka
```
### Step 2: Configure Kafka CR with Metrics
```yaml
# kafka-cluster.yaml (add metricsConfig section)
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: my-kafka-cluster
namespace: kafka
spec:
kafka:
version: 3.7.0
replicas: 3
# ... other config ...
metricsConfig:
type: jmxPrometheusExporter
valueFrom:
configMapKeyRef:
name: kafka-metrics
key: kafka-metrics-config.yml
```
```bash
# Apply updated Kafka CR
kubectl apply -f kafka-cluster.yaml
# Verify metrics endpoint (wait for rolling restart)
kubectl exec -it kafka-my-kafka-cluster-0 -n kafka -- curl localhost:9404/metrics | grep kafka_server
```
### Step 3: Install Prometheus Operator (if not installed)
```bash
# Add Prometheus Community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
```
### Step 4: Create PodMonitor for Kafka
```yaml
# kafka-podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: kafka-metrics
namespace: kafka
labels:
app: strimzi
spec:
selector:
matchLabels:
strimzi.io/kind: Kafka
podMetricsEndpoints:
- port: tcp-prometheus
interval: 30s
```
```bash
# Apply PodMonitor
kubectl apply -f kafka-podmonitor.yaml
# Verify Prometheus is scraping Kafka
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open: http://localhost:9090/targets
# Should see kafka-metrics/* targets
```
## Setup Workflow 3: Grafana Dashboards
### Installation (Docker Compose)
If using Docker Compose for local development:
```yaml
# docker-compose.yml (add to existing Kafka setup)
version: '3.8'
services:
# ... Kafka services ...
prometheus:
image: prom/prometheus:v2.48.0
ports:
- "9090:9090"
volumes:
- ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
grafana:
image: grafana/grafana:10.2.0
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- ./monitoring/grafana/provisioning:/etc/grafana/provisioning
- ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
- grafana-data:/var/lib/grafana
volumes:
prometheus-data:
grafana-data:
```
```bash
# Start monitoring stack
docker-compose up -d prometheus grafana
# Access Grafana
# URL: http://localhost:3000
# Username: admin
# Password: admin
```
### Installation (Kubernetes)
Dashboards are auto-provisioned if using kube-prometheus-stack:
```bash
# Create and label a ConfigMap for each dashboard
# (kubectl does not expand wildcards in resource names, so label inside the loop)
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
  name=$(basename "$dashboard" .json)
  kubectl create configmap "kafka-dashboard-$name" \
    --from-file="$dashboard" \
    -n monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
  # Label for the Grafana sidecar's dashboard auto-discovery
  kubectl label configmap "kafka-dashboard-$name" -n monitoring grafana_dashboard=1 --overwrite
done
# Grafana will auto-import the dashboards (allow 30-60 seconds)
# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# URL: http://localhost:3000
# Username: admin
# Password: prom-operator (default kube-prometheus-stack password)
```
### Manual Dashboard Import
If auto-provisioning doesn't work, import manually through the Grafana UI (Dashboards → Import, then upload the JSON files from `plugins/specweave-kafka/monitoring/grafana/dashboards/`), or push them via the Grafana HTTP API:
```bash
# Import every dashboard via the API (default admin:admin credentials)
# The endpoint expects the dashboard wrapped in {"dashboard": ..., "overwrite": true}
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
  jq -n --slurpfile d "$dashboard" '{dashboard: ($d[0] | .id = null), overwrite: true}' \
    | curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
        -H "Content-Type: application/json" \
        -d @-
done
```
## Dashboard Overview
### 1. **Kafka Cluster Overview** (`kafka-cluster-overview.json`)
**Purpose**: High-level cluster health
**Key Metrics**:
- Active Controller Count (should be exactly 1)
- Under-Replicated Partitions (should be 0) ⚠️ CRITICAL
- Offline Partitions Count (should be 0) ⚠️ CRITICAL
- Unclean Leader Elections (should be 0)
- Cluster Throughput (bytes in/out per second)
- Request Rate (produce, fetch requests per second)
- ISR Changes (shrinks/expands)
- Leader Election Rate
**Use When**: Checking overall cluster health
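When Grafana is not handy, the same three health signals can be spot-checked directly against the Prometheus API; the metric names below assume the JMX exporter config shipped with this plugin:
```bash
# Exactly one broker should report active_controller_count=1; the other two metrics should be 0 everywhere
for q in \
  kafka_controller_active_controller_count \
  kafka_server_replica_manager_under_replicated_partitions \
  kafka_controller_offline_partitions_count; do
  curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode "query=${q}" \
    | jq --arg q "$q" -c '{metric: $q, result: [.data.result[] | {instance: .metric.instance, value: .value[1]}]}'
done
```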
### 2. **Kafka Broker Metrics** (`kafka-broker-metrics.json`)
**Purpose**: Per-broker performance
**Key Metrics**:
- Broker CPU Usage (% utilization)
- Broker Heap Memory Usage
- Broker Network Throughput (bytes in/out)
- Request Handler Idle Percentage (low = CPU saturation)
- File Descriptors (open vs max)
- Log Flush Latency (p50, p99)
- JVM GC Collection Count/Time
**Use When**: Investigating broker performance issues
### 3. **Kafka Consumer Lag** (`kafka-consumer-lag.json`)
**Purpose**: Consumer lag monitoring
**Key Metrics**:
- Consumer Lag per Topic/Partition
- Total Lag per Consumer Group
- Offset Commit Rate
- Current Consumer Offset
- Log End Offset (producer offset)
- Consumer Group Members
**Use When**: Troubleshooting slow consumers or lag spikes
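For a quick look outside Grafana, total lag per group can be pulled straight from Prometheus; these metrics come from Kafka Exporter (see Troubleshooting below if they are missing):
```bash
# Total lag per consumer group
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (consumergroup) (kafka_consumergroup_lag)' \
  | jq '.data.result[] | {group: .metric.consumergroup, lag: .value[1]}'
```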
### 4. **Kafka Topic Metrics** (`kafka-topic-metrics.json`)
**Purpose**: Topic-level metrics
**Key Metrics**:
- Messages Produced per Topic
- Bytes per Topic (in/out)
- Partition Count per Topic
- Replication Factor
- In-Sync Replicas
- Log Size per Partition
- Current Offset per Partition
- Partition Leader Distribution
**Use When**: Analyzing topic throughput and hotspots
### 5. **Kafka JVM Metrics** (`kafka-jvm-metrics.json`)
**Purpose**: JVM health monitoring
**Key Metrics**:
- Heap Memory Usage (used vs max)
- Heap Utilization Percentage
- GC Collection Rate (collections/sec)
- GC Collection Time (ms/sec)
- JVM Thread Count
- Heap Memory by Pool (young gen, old gen, survivor)
- Off-Heap Memory Usage (metaspace, code cache)
- GC Pause Time Percentiles (p50, p95, p99)
**Use When**: Investigating memory leaks or GC pauses
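Two quick checks mirror the most important panels here; the metric names match the alert rules below, so adjust them if your exporter config names them differently:
```bash
# Heap utilization ratio per broker (values approaching 1.0 indicate OOM risk)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=jvm_memory_heap_used_bytes{job="kafka"} / jvm_memory_heap_max_bytes{job="kafka"}' \
  | jq '.data.result[] | {instance: .metric.instance, heap_ratio: .value[1]}'

# GC time in ms per second over the last 5 minutes
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(jvm_gc_collection_time_ms_total{job="kafka"}[5m])' \
  | jq '.data.result[] | {instance: .metric.instance, gc_ms_per_sec: .value[1]}'
```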
## Critical Alerts Configuration
Create Prometheus alerting rules for critical Kafka metrics:
```yaml
# kafka-alerts.yml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: monitoring
  labels:
    release: prometheus  # required so the kube-prometheus-stack rule selector picks this rule up
spec:
groups:
- name: kafka.rules
interval: 30s
rules:
# CRITICAL: Under-Replicated Partitions
- alert: KafkaUnderReplicatedPartitions
expr: sum(kafka_server_replica_manager_under_replicated_partitions) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Kafka has under-replicated partitions"
description: "{{ $value }} partitions are under-replicated. Data loss risk!"
# CRITICAL: Offline Partitions
- alert: KafkaOfflinePartitions
expr: kafka_controller_offline_partitions_count > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Kafka has offline partitions"
description: "{{ $value }} partitions are offline. Service degradation!"
# CRITICAL: No Active Controller
- alert: KafkaNoActiveController
expr: kafka_controller_active_controller_count == 0
for: 1m
labels:
severity: critical
annotations:
summary: "No active Kafka controller"
description: "Cluster has no active controller. Cannot perform administrative operations!"
      # WARNING: High Consumer Lag (metric comes from Kafka Exporter; see Troubleshooting)
- alert: KafkaConsumerLagHigh
expr: sum by (consumergroup) (kafka_consumergroup_lag) > 10000
for: 10m
labels:
severity: warning
annotations:
summary: "Consumer group {{ $labels.consumergroup }} has high lag"
description: "Lag is {{ $value }} messages. Consumers may be slow."
# WARNING: High CPU Usage
- alert: KafkaBrokerHighCPU
expr: os_process_cpu_load{job="kafka"} > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "Broker {{ $labels.instance }} has high CPU usage"
description: "CPU usage is {{ $value | humanizePercentage }}. Consider scaling."
# WARNING: Low Heap Memory
- alert: KafkaBrokerLowHeapMemory
expr: jvm_memory_heap_used_bytes{job="kafka"} / jvm_memory_heap_max_bytes{job="kafka"} > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Broker {{ $labels.instance }} has low heap memory"
description: "Heap usage is {{ $value | humanizePercentage }}. Risk of OOM!"
# WARNING: High GC Time
- alert: KafkaBrokerHighGCTime
expr: rate(jvm_gc_collection_time_ms_total{job="kafka"}[5m]) > 500
for: 5m
labels:
severity: warning
annotations:
summary: "Broker {{ $labels.instance }} spending too much time in GC"
description: "GC time is {{ $value }}ms/sec. Application pauses likely."
```
```bash
# Apply alerts (Kubernetes)
kubectl apply -f kafka-alerts.yml
# Verify alerts loaded
kubectl get prometheusrules -n monitoring
```
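`kubectl get prometheusrules` only confirms the object exists. To confirm Prometheus loaded the group and to see current alert states, query its API (assumes a port-forward to Prometheus on localhost:9090 as shown earlier):
```bash
# List the kafka.rules alerts and their states (inactive / pending / firing)
curl -s http://localhost:9090/api/v1/rules \
  | jq '.data.groups[] | select(.name=="kafka.rules") | .rules[] | {alert: .name, state: .state}'
```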
## Troubleshooting
### "Prometheus not scraping Kafka metrics"
**Symptoms**: No Kafka metrics in Prometheus
**Fix**:
```bash
# 1. Verify JMX exporter is running
curl http://kafka-broker:7071/metrics
# 2. Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
# 3. Check Prometheus logs
kubectl logs -n monitoring prometheus-kube-prometheus-prometheus-0
# Common issues:
# - Firewall blocking port 7071
# - Incorrect scrape config
# - Kafka broker not running
```
### "Grafana dashboards not loading"
**Symptoms**: Dashboards show "No data"
**Fix**:
```bash
# 1. Verify Prometheus datasource
# Grafana UI → Configuration → Data Sources → Prometheus → Test
# 2. Check if Kafka metrics exist in Prometheus
# Prometheus UI → Graph → Enter: kafka_server_broker_topic_metrics_bytesin_total
# 3. Verify dashboard queries match your Prometheus job name
# Dashboard panels use job="kafka" by default
# If your job name is different, update dashboard JSON
```
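If the mismatch is only the job name, one option is to rewrite the label in the dashboard JSON before importing; this is an illustrative sed rewrite where `my-kafka-job` is a placeholder for your actual job name (the `.bak` files keep backups):
```bash
# PromQL expressions are stored as JSON strings, hence the escaped quotes
sed -i.bak 's/job=\\"kafka\\"/job=\\"my-kafka-job\\"/g' \
  plugins/specweave-kafka/monitoring/grafana/dashboards/*.json
```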
### "Consumer lag metrics missing"
**Symptoms**: Consumer lag dashboard empty
**Fix**:
Consumer lag metrics require **Kafka Exporter** (separate from JMX Exporter):
```bash
# Install Kafka Exporter (Kubernetes); point kafkaServer at your bootstrap address
# (for the Strimzi cluster above this would be my-kafka-cluster-kafka-bootstrap.kafka:9092)
helm install kafka-exporter prometheus-community/prometheus-kafka-exporter \
  --namespace monitoring \
  --set kafkaServer={kafka-bootstrap:9092}
# Or run as Docker container
docker run -d -p 9308:9308 \
danielqsj/kafka-exporter \
--kafka.server=kafka:9092 \
--web.listen-address=:9308
```
Then add the exporter to the Prometheus scrape config (`prometheus.yml`):
```yaml
scrape_configs:
  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']
```
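Before digging further, confirm the exporter itself is emitting lag metrics (localhost:9308 matches the `docker run` above; adjust the host for the Helm deployment):
```bash
# Should print kafka_consumergroup_lag{...} samples if the exporter can reach the cluster
curl -s http://localhost:9308/metrics | grep -E '^kafka_consumergroup_lag' | head
```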
## Integration with Other Skills
- **kafka-iac-deployment**: Set up monitoring during Terraform deployment
- **kafka-kubernetes**: Configure monitoring for Strimzi Kafka on K8s
- **kafka-architecture**: Use cluster sizing metrics to validate capacity planning
- **kafka-cli-tools**: Use kcat to generate test traffic and verify metrics
## Quick Reference Commands
```bash
# Check JMX exporter metrics
curl http://localhost:7071/metrics | grep -E "(kafka_server|kafka_controller)"
# Prometheus query examples
curl -g 'http://localhost:9090/api/v1/query?query=kafka_server_replica_manager_under_replicated_partitions'
# Grafana dashboard export
curl http://admin:admin@localhost:3000/api/dashboards/uid/kafka-cluster-overview | jq .dashboard > backup.json
# Reload Prometheus config
kill -HUP $(pidof prometheus)
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
```
---
**Next Steps After Monitoring Setup**:
1. Review all 5 Grafana dashboards to familiarize yourself with metrics
2. Set up alert routing to Slack, PagerDuty, or email (a minimal Alertmanager sketch is shown below)
3. Create runbooks for critical alerts (under-replicated partitions, offline partitions, no controller)
4. Monitor for 7 days to establish baseline metrics
5. Tune JVM settings based on GC metrics
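For step 2, a minimal Alertmanager routing sketch; the Slack webhook URL and channel are placeholders, and with kube-prometheus-stack this configuration usually goes into the chart's `alertmanager.config` values rather than a file on disk:
```bash
cat > alertmanager.yml <<'EOF'
route:
  receiver: kafka-team-slack
  group_by: ['alertname', 'severity']
receivers:
  - name: kafka-team-slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder webhook
        channel: '#kafka-alerts'
        send_resolved: true
EOF
```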