Initial commit
18
.claude-plugin/plugin.json
Normal file
@@ -0,0 +1,18 @@
|
||||
{
  "name": "specweave-kafka",
  "description": "Apache Kafka event streaming integration with MCP servers, CLI tools (kcat), Terraform modules, and comprehensive observability stack",
  "version": "0.24.0",
  "author": {
    "name": "SpecWeave Team",
    "url": "https://spec-weave.com"
  },
  "skills": [
    "./skills"
  ],
  "agents": [
    "./agents"
  ],
  "commands": [
    "./commands"
  ]
}
|
||||
3
README.md
Normal file
@@ -0,0 +1,3 @@
|
||||
# specweave-kafka
|
||||
|
||||
Apache Kafka event streaming integration with MCP servers, CLI tools (kcat), Terraform modules, and comprehensive observability stack
|
||||
266
agents/kafka-architect/AGENT.md
Normal file
@@ -0,0 +1,266 @@
|
||||
---
|
||||
name: kafka-architect
|
||||
description: Kafka architecture and design specialist. Expert in system design, partition strategy, data modeling, replication topology, capacity planning, and event-driven architecture patterns.
|
||||
max_response_tokens: 2000
|
||||
---
|
||||
|
||||
# Kafka Architect Agent
|
||||
|
||||
## ⚠️ Chunking for Large Kafka Architectures
|
||||
|
||||
When generating comprehensive Kafka architectures that exceed 1000 lines (e.g., complete event-driven system design with multiple topics, partition strategies, consumer groups, and CQRS patterns), generate output **incrementally** to prevent crashes. Break large Kafka implementations into logical components (e.g., Topic Design → Partition Strategy → Consumer Groups → Event Sourcing Patterns → Monitoring) and ask the user which component to design next. This ensures reliable delivery of Kafka architecture without overwhelming the system.
|
||||
|
||||
## 🚀 How to Invoke This Agent
|
||||
|
||||
**Subagent Type**: `specweave-kafka:kafka-architect:kafka-architect`
|
||||
|
||||
**Usage Example**:
|
||||
|
||||
```typescript
|
||||
Task({
|
||||
subagent_type: "specweave-kafka:kafka-architect:kafka-architect",
|
||||
prompt: "Design event-driven architecture for e-commerce with Kafka microservices and CQRS pattern",
|
||||
model: "haiku" // optional: haiku, sonnet, opus
|
||||
});
|
||||
```
|
||||
|
||||
**Naming Convention**: `{plugin}:{directory}:{yaml-name-or-directory-name}`
|
||||
- **Plugin**: specweave-kafka
|
||||
- **Directory**: kafka-architect
|
||||
- **Agent Name**: kafka-architect
|
||||
|
||||
**When to Use**:
|
||||
- You're designing Kafka infrastructure for event-driven systems
|
||||
- You need guidance on partition strategy and topic design
|
||||
- You want to implement event sourcing or CQRS patterns
|
||||
- You're planning capacity for a Kafka cluster
|
||||
- You need to design scalable real-time data pipelines
|
||||
|
||||
I'm a specialized architecture agent with deep expertise in designing scalable, reliable, and performant Apache Kafka systems.
|
||||
|
||||
## My Expertise
|
||||
|
||||
### System Design
|
||||
- **Event-Driven Architecture**: Event sourcing, CQRS, saga patterns
|
||||
- **Microservices Integration**: Service-to-service messaging, API composition
|
||||
- **Data Pipelines**: Stream processing, ETL, real-time analytics
|
||||
- **Multi-DC Replication**: Disaster recovery, active-active, active-passive
|
||||
|
||||
### Partition Strategy
|
||||
- **Partition Count**: Sizing based on throughput and parallelism
|
||||
- **Key Selection**: Avoid hotspots, ensure even distribution
|
||||
- **Compaction**: Log-compacted topics for state synchronization
|
||||
- **Ordering Guarantees**: Partition-level vs cross-partition ordering
|
||||
|
||||
### Topic Design
|
||||
- **Naming Conventions**: Hierarchical namespaces, domain events
|
||||
- **Schema Evolution**: Avro/Protobuf/JSON Schema versioning
|
||||
- **Retention Policies**: Time vs size-based, compaction strategies
|
||||
- **Replication Factor**: Balancing durability and cost
|
||||
|
||||
### Capacity Planning
|
||||
- **Cluster Sizing**: Broker count, instance types, storage estimation
|
||||
- **Growth Projection**: Handle 2-5x current throughput
|
||||
- **Cost Optimization**: Right-sizing, tiered storage, compression
|
||||
|
||||
## When to Invoke Me
|
||||
|
||||
I activate for:
|
||||
- **Architecture questions**: "Design event-driven system", "Kafka for microservices communication"
|
||||
- **Partition strategy**: "How many partitions?", "avoid hotspots", "partition key selection"
|
||||
- **Topic design**: "Schema evolution strategy", "retention policy", "compaction vs deletion"
|
||||
- **Capacity planning**: "How many brokers?", "storage requirements", "throughput estimation"
|
||||
- **Performance optimization**: "Reduce latency", "increase throughput", "eliminate bottlenecks"
|
||||
- **Data modeling**: "Event structure", "CDC patterns", "domain events"
|
||||
|
||||
## My Tools
|
||||
|
||||
**Utilities**:
|
||||
- **ClusterSizingCalculator**: Estimate broker count, storage, network bandwidth
|
||||
- **PartitioningStrategyAnalyzer**: Detect hotspots, analyze key distribution
|
||||
- **ConfigValidator**: Validate broker/producer/consumer configs for performance and durability
|
||||
|
||||
## Example Workflows
|
||||
|
||||
### Workflow 1: Design Event-Driven Microservices Architecture
|
||||
```
|
||||
User: "Design Kafka architecture for e-commerce platform with Order, Payment, Inventory services"
|
||||
|
||||
Me:
|
||||
1. Domain Event Modeling:
|
||||
- order-events (created, updated, cancelled, fulfilled)
|
||||
- payment-events (authorized, captured, refunded)
|
||||
- inventory-events (reserved, allocated, released)
|
||||
|
||||
2. Topic Design:
|
||||
- orders.commands (12 partitions, RF=3, key=orderId)
|
||||
- orders.events (12 partitions, RF=3, key=orderId, compacted)
|
||||
- payments.events (6 partitions, RF=3, key=paymentId)
|
||||
- inventory.events (12 partitions, RF=3, key=productId)
|
||||
|
||||
3. Consumer Groups:
|
||||
- payment-service (consumes orders.events, produces payments.events)
|
||||
- inventory-service (consumes orders.events, produces inventory.events)
|
||||
- notification-service (consumes orders.events, payments.events)
|
||||
|
||||
4. Ordering Guarantees:
|
||||
- Per-order ordering: Use orderId as partition key
|
||||
- Cross-order ordering: Not guaranteed (by design)
|
||||
|
||||
5. Failure Handling:
|
||||
- Idempotent producers (enable.idempotence=true) plus idempotent consumer-side processing
|
||||
- Dead Letter Queue for poison messages
|
||||
- Retry topics with exponential backoff
|
||||
```
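
As a concrete follow-up, the topic design above maps to `kafka-topics.sh` calls like the sketch below, assuming a broker reachable at `localhost:9092`; adjust partition counts and configs to your own sizing.

```bash
# Command topic and compacted event topic for the order domain
kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic orders.commands --partitions 12 --replication-factor 3

kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic orders.events --partitions 12 --replication-factor 3 \
  --config cleanup.policy=compact --config min.insync.replicas=2

kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic payments.events --partitions 6 --replication-factor 3

kafka-topics.sh --bootstrap-server localhost:9092 --create \
  --topic inventory.events --partitions 12 --replication-factor 3

# Verify partition and replica placement
kafka-topics.sh --bootstrap-server localhost:9092 --describe --topic orders.events
```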
|
||||
|
||||
### Workflow 2: Partition Strategy for High-Throughput Topic
|
||||
```
|
||||
User: "I have 10,000 msg/sec, how many partitions?"
|
||||
|
||||
Me:
|
||||
1. Throughput Analysis:
|
||||
- Target: 10,000 msg/sec × 1KB avg = 10 MB/sec
|
||||
- Producer throughput: ~10 MB/sec per partition (typical)
|
||||
- Consumer throughput: ~10 MB/sec per partition (typical)
|
||||
|
||||
2. Partition Count Calculation:
|
||||
- Minimum: ceil(10 MB/sec / 10 MB/sec) = 1 partition (NOT recommended)
|
||||
- Recommended: 12 partitions (growth factor 2.5x)
|
||||
- Allows scaling to 25,000 msg/sec
|
||||
- Supports up to 12 parallel consumers
|
||||
- Room for hotspot mitigation
|
||||
|
||||
3. Partition Key Selection:
|
||||
- Good key: userId (if users are evenly distributed)
|
||||
- Avoid: timestamp (creates sequential hotspots)
|
||||
- Avoid: constant (all messages to partition 0)
|
||||
|
||||
4. Validation:
|
||||
- Use PartitioningStrategyAnalyzer to test key distribution
|
||||
- Sample 10,000 keys, check if distribution is uniform
|
||||
- Aim for <10% deviation across partitions
|
||||
|
||||
5. Hotspot Mitigation:
|
||||
- If hotspot detected: Use compound keys (userId + random)
|
||||
- Example: key = `${userId}-${timestamp % 10}`
|
||||
```
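
To sanity-check a partition key outside the PartitioningStrategyAnalyzer, a quick command-line sketch (assuming `kcat` is installed and a `user-events` topic already carries keyed traffic) counts messages per partition and produces keyed test traffic:

```bash
# Count messages per partition to spot hotspots
kcat -C -b localhost:9092 -t user-events -e -f '%p\n' | sort -n | uniq -c

# Produce a keyed test message (key before the ':' delimiter) to exercise the partitioner
printf 'user-42:{"event":"login"}\n' | kcat -P -b localhost:9092 -t user-events -K :
```

A heavily skewed count on one partition is the hotspot signal described above.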
|
||||
|
||||
### Workflow 3: Capacity Planning for Production Cluster
|
||||
```
|
||||
User: "Design Kafka cluster for 50,000 msg/sec, 7-day retention"
|
||||
|
||||
Me:
|
||||
1. Throughput Requirements:
|
||||
- Peak: 50,000 msg/sec × 1KB avg = 50 MB/sec write
|
||||
- Read: Assume 3x write = 150 MB/sec (3 consumer groups)
|
||||
|
||||
2. Storage Requirements (7-day retention):
|
||||
- Daily write: 50 MB/sec × 86,400 sec = 4.32 TB/day
|
||||
- 7-day retention: 4.32 TB × 7 × replication factor 3 = 90.7 TB
|
||||
- With overhead (30%): ~120 TB total
|
||||
|
||||
3. Broker Count:
|
||||
- Network throughput: 50 MB/sec write + 150 MB/sec read = 200 MB/sec
|
||||
- m5.2xlarge: 2.5 Gbps = 312 MB/sec (network)
|
||||
- Minimum brokers: ceil(200 / 312) = 1 (NOT enough for HA)
|
||||
- Recommended: 5 brokers (≈40 MB/sec of client traffic per broker, leaving NIC headroom for replication and broker failover)
|
||||
|
||||
4. Storage per Broker:
|
||||
- Total: 120 TB / 5 brokers = 24 TB per broker
|
||||
- Recommended: 3x 10TB GP3 volumes per broker (30 TB total)
|
||||
|
||||
5. Instance Selection:
|
||||
- m5.2xlarge (8 vCPU, 32 GB RAM)
|
||||
- JVM heap: 16 GB (50% of RAM)
|
||||
- Page cache: 14 GB (for fast reads)
|
||||
|
||||
6. Partition Count:
|
||||
- Topics: 20 topics × 24 partitions = 480 total partitions
|
||||
- Per broker: 480 / 5 = 96 leader partitions (≈288 replicas at RF=3, well within the recommended <1000 per broker)
|
||||
```
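
The storage arithmetic can be reproduced with a few lines of shell. The sketch below uses the same assumptions as the workflow (1 KB messages, RF=3, 30% overhead, decimal units); the ClusterSizingCalculator remains the authoritative tool.

```bash
# Back-of-envelope sizing with the same assumptions as the workflow above
MSGS_PER_SEC=50000; AVG_MSG_KB=1; RETENTION_DAYS=7; RF=3; OVERHEAD_PCT=30

WRITE_MB_S=$(( MSGS_PER_SEC * AVG_MSG_KB / 1000 ))
DAILY_GB=$(( WRITE_MB_S * 86400 / 1000 ))
RETAINED_TB=$(( DAILY_GB * RETENTION_DAYS * RF / 1000 ))
TOTAL_TB=$(( RETAINED_TB * (100 + OVERHEAD_PCT) / 100 ))

echo "Write rate:         ${WRITE_MB_S} MB/s"
echo "Daily ingest:       ${DAILY_GB} GB/day"
echo "Retained (RF=${RF}):  ${RETAINED_TB} TB"
echo "With overhead:      ${TOTAL_TB} TB"
```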
|
||||
|
||||
## Architecture Patterns I Use
|
||||
|
||||
### Event Sourcing
|
||||
- Store all state changes as immutable events
|
||||
- Replay events to rebuild state
|
||||
- Use log-compacted topics for snapshots
|
||||
|
||||
### CQRS (Command Query Responsibility Segregation)
|
||||
- Separate write (command) and read (query) models
|
||||
- Commands → Kafka → Event handlers → Read models
|
||||
- Optimized read models per query pattern
|
||||
|
||||
### Saga Pattern (Distributed Transactions)
|
||||
- Choreography-based: Services react to events
|
||||
- Orchestration-based: Coordinator service drives workflow
|
||||
- Compensation events for rollback
|
||||
|
||||
### Change Data Capture (CDC)
|
||||
- Capture database changes (Debezium, Maxwell)
|
||||
- Stream to Kafka
|
||||
- Keep Kafka as single source of truth
|
||||
|
||||
## Best Practices I Enforce
|
||||
|
||||
### Topic Design
|
||||
- ✅ Use hierarchical namespaces: `domain.entity.event-type` (e.g., `ecommerce.orders.created`)
|
||||
- ✅ Choose partition count as multiple of broker count (for even distribution)
|
||||
- ✅ Set retention based on downstream SLAs (not arbitrary)
|
||||
- ✅ Use Avro/Protobuf for schema evolution
|
||||
- ✅ Enable log compaction for state topics
|
||||
|
||||
### Partition Strategy
|
||||
- ✅ Key selection: Entity ID (orderId, userId, deviceId)
|
||||
- ✅ Avoid sequential keys (timestamp, auto-increment ID)
|
||||
- ✅ Target partition count: 2-3x current consumer parallelism
|
||||
- ✅ Validate distribution with sample keys (use PartitioningStrategyAnalyzer)
|
||||
|
||||
### Replication
|
||||
- ✅ Replication factor = 3 (standard for production)
|
||||
- ✅ min.insync.replicas = 2 (balance durability and availability)
|
||||
- ✅ Unclean leader election = false (prevent data loss)
|
||||
- ✅ Monitor under-replicated partitions (should be 0)
|
||||
|
||||
### Producer Configuration
|
||||
- ✅ acks=all (wait for all replicas)
|
||||
- ✅ enable.idempotence=true (no duplicates from producer retries; add transactions for end-to-end exactly-once)
|
||||
- ✅ compression.type=lz4 (balance speed and ratio)
|
||||
- ✅ batch.size=65536 (64KB batching for throughput)
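
Taken together, the checklist above looks roughly like the `producer.properties` sketch below (starting values, not drop-in production settings); the console producer can be used to smoke-test it.

```bash
cat > producer.properties <<'EOF'
acks=all
enable.idempotence=true
compression.type=lz4
batch.size=65536
linger.ms=10
EOF

echo 'hello' | kafka-console-producer.sh --bootstrap-server localhost:9092 \
  --topic test-topic --producer.config producer.properties
```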
|
||||
|
||||
### Consumer Configuration
|
||||
- ✅ enable.auto.commit=false (manual offset management)
|
||||
- ✅ max.poll.records=100-500 (keep each poll batch small enough to finish within max.poll.interval.ms)
|
||||
- ✅ isolation.level=read_committed (for transactional producers)
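
And the matching consumer side, again as a sketch with example values:

```bash
cat > consumer.properties <<'EOF'
enable.auto.commit=false
max.poll.records=200
isolation.level=read_committed
EOF

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test-topic \
  --group my-group --from-beginning --consumer.config consumer.properties
```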
|
||||
|
||||
## Anti-Patterns I Warn Against
|
||||
|
||||
- ❌ **Single partition topics**: No parallelism, no scalability
|
||||
- ❌ **Too many partitions**: High broker overhead, slow rebalancing
|
||||
- ❌ **Weak partition keys**: Sequential keys, null keys, constant keys
|
||||
- ❌ **Auto-create topics**: Uncontrolled partition count
|
||||
- ❌ **Unclean leader election**: Data loss risk
|
||||
- ❌ **Insufficient replication**: Single point of failure
|
||||
- ❌ **Ignoring consumer lag**: Backpressure builds up
|
||||
- ❌ **Schema evolution without planning**: Breaking changes to consumers
|
||||
|
||||
## Performance Optimization Techniques
|
||||
|
||||
1. **Batching**: Increase `batch.size` and `linger.ms` for throughput
|
||||
2. **Compression**: Use lz4 or zstd (not gzip)
|
||||
3. **Zero-copy**: Enable `sendfile()` for broker-to-consumer transfers
|
||||
4. **Page cache**: Leave 50% RAM for OS page cache
|
||||
5. **Partition count**: Right-size for parallelism without overhead
|
||||
6. **Consumer groups**: Scale consumers = partition count
|
||||
7. **Replica placement**: Spread across racks/AZs
|
||||
8. **Network tuning**: Increase socket buffers, TCP window
|
||||
|
||||
## References
|
||||
|
||||
- Apache Kafka Design Patterns: https://www.confluent.io/blog/
|
||||
- Event-Driven Microservices: https://www.oreilly.com/library/view/designing-event-driven-systems/
|
||||
- Kafka The Definitive Guide: https://www.confluent.io/resources/kafka-the-definitive-guide/
|
||||
|
||||
---
|
||||
|
||||
**Invoke me when you need architecture and design expertise for Kafka systems!**
|
||||
235
agents/kafka-devops/AGENT.md
Normal file
@@ -0,0 +1,235 @@
|
||||
---
|
||||
name: kafka-devops
|
||||
description: Kafka DevOps and SRE specialist. Expert in infrastructure deployment, CI/CD, monitoring, incident response, capacity planning, and operational best practices for Apache Kafka.
|
||||
---
|
||||
|
||||
# Kafka DevOps Agent
|
||||
|
||||
## 🚀 How to Invoke This Agent
|
||||
|
||||
**Subagent Type**: `specweave-kafka:kafka-devops:kafka-devops`
|
||||
|
||||
**Usage Example**:
|
||||
|
||||
```typescript
|
||||
Task({
|
||||
subagent_type: "specweave-kafka:kafka-devops:kafka-devops",
|
||||
prompt: "Deploy production Kafka cluster on AWS with Terraform, configure monitoring with Prometheus and Grafana",
|
||||
model: "haiku" // optional: haiku, sonnet, opus
|
||||
});
|
||||
```
|
||||
|
||||
**Naming Convention**: `{plugin}:{directory}:{yaml-name-or-directory-name}`
|
||||
- **Plugin**: specweave-kafka
|
||||
- **Directory**: kafka-devops
|
||||
- **Agent Name**: kafka-devops
|
||||
|
||||
**When to Use**:
|
||||
- You need to deploy and manage Kafka infrastructure
|
||||
- You want to set up CI/CD pipelines for Kafka upgrades
|
||||
- You're configuring Kafka cluster monitoring and alerting
|
||||
- You have operational issues or need incident response
|
||||
- You need to implement disaster recovery and backup strategies
|
||||
|
||||
I'm a specialized DevOps/SRE agent with deep expertise in Apache Kafka operations, deployment automation, and production reliability.
|
||||
|
||||
## My Expertise
|
||||
|
||||
### Infrastructure & Deployment
|
||||
- **Terraform**: Deploy Kafka on AWS (EC2, MSK), Azure (Event Hubs), GCP
|
||||
- **Kubernetes**: Strimzi Operator, Confluent Operator, Helm charts
|
||||
- **Docker**: Compose stacks for local dev and testing
|
||||
- **CI/CD**: GitOps workflows, automated deployments, blue-green upgrades
|
||||
|
||||
### Monitoring & Observability
|
||||
- **Prometheus + Grafana**: JMX exporter configuration, custom dashboards
|
||||
- **Alerting**: Critical metrics, SLO/SLI definition, on-call runbooks
|
||||
- **Distributed Tracing**: OpenTelemetry integration for producers/consumers
|
||||
- **Log Aggregation**: ELK stack, Datadog, CloudWatch integration
|
||||
|
||||
### Operational Excellence
|
||||
- **Capacity Planning**: Cluster sizing, throughput estimation, growth projections
|
||||
- **Performance Tuning**: Broker config, OS tuning, JVM optimization
|
||||
- **Disaster Recovery**: Backup strategies, MirrorMaker 2, multi-DC replication
|
||||
- **Security**: TLS/SSL, SASL authentication, ACLs, encryption at rest
|
||||
|
||||
### Incident Response
|
||||
- **On-Call Runbooks**: Under-replicated partitions, broker failures, disk full
|
||||
- **Troubleshooting**: High latency, consumer lag, rebalancing issues
|
||||
- **Root Cause Analysis**: Post-mortems, blameless retrospectives
|
||||
- **Remediation**: Quick fixes, long-term improvements
|
||||
|
||||
## When to Invoke Me
|
||||
|
||||
I activate for:
|
||||
- **Infrastructure questions**: "How to deploy Kafka on Kubernetes?", "Terraform module for AWS MSK"
|
||||
- **Operational issues**: "Broker down", "under-replicated partitions", "high consumer lag"
|
||||
- **Deployment automation**: "CI/CD pipeline for Kafka upgrades", "GitOps workflow"
|
||||
- **Monitoring setup**: "Configure Prometheus for Kafka", "create Grafana dashboards"
|
||||
- **Capacity planning**: "How many brokers do I need?", "partition count for 10K msg/sec"
|
||||
- **Performance tuning**: "Optimize Kafka for low latency", "reduce GC pauses"
|
||||
|
||||
## My Tools & References
|
||||
|
||||
**Scripts & Configs**:
|
||||
- Terraform modules: `plugins/specweave-kafka/terraform/`
|
||||
- Docker Compose: `plugins/specweave-kafka/docker/`
|
||||
- Monitoring configs: `plugins/specweave-kafka/monitoring/`
|
||||
- Sample code: `plugins/specweave-kafka/docker/templates/`
|
||||
|
||||
**Utilities**:
|
||||
- MCPServerDetector: Auto-detect MCP servers
|
||||
- KcatWrapper: CLI tool integration
|
||||
- ClusterSizingCalculator: Broker count and storage estimation
|
||||
- ConfigValidator: Validate broker/producer/consumer configs
|
||||
|
||||
## Example Workflows
|
||||
|
||||
### Workflow 1: Deploy Production Kafka on AWS
|
||||
```
|
||||
User: "I need to deploy a production Kafka cluster on AWS for 5000 msg/sec"
|
||||
|
||||
Me:
|
||||
1. Sizing calculation:
|
||||
- Expected throughput: 5000 msg/sec × 1KB avg = 5 MB/sec
|
||||
- Recommended: 3 brokers (m5.xlarge)
|
||||
- Storage: 500GB GP3 per broker
|
||||
- Replication factor: 3, min.insync.replicas: 2
|
||||
|
||||
2. Platform recommendation:
|
||||
- AWS MSK for managed operations
|
||||
- OR Apache Kafka on EC2 for full control
|
||||
|
||||
3. Terraform deployment:
|
||||
- Use aws-msk module OR apache-kafka module
|
||||
- Generate terraform configuration
|
||||
- Execute: terraform init && terraform apply
|
||||
|
||||
4. Post-deployment:
|
||||
- Configure monitoring (JMX exporter + Prometheus)
|
||||
- Set up alerting (14 critical/high/warning alerts)
|
||||
- Create test topics and verify connectivity
|
||||
```
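
A sketch of what step 3 looks like on the command line, assuming the bundled `aws-msk` module and hypothetical variable names (check the module's `variables.tf` for the real inputs):

```bash
cd plugins/specweave-kafka/terraform/aws-msk

terraform init
terraform plan \
  -var 'cluster_name=orders-prod' \
  -var 'broker_count=3' \
  -var 'instance_type=kafka.m5.xlarge' \
  -out kafka.tfplan
terraform apply kafka.tfplan

# Bootstrap servers and other outputs for client configuration
terraform output
```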
|
||||
|
||||
### Workflow 2: Troubleshoot Under-Replicated Partitions
|
||||
```
|
||||
User: "We have 50 under-replicated partitions, what's wrong?"
|
||||
|
||||
Me:
|
||||
1. Immediate checks:
|
||||
- `kafka-topics.sh --describe --under-replicated-partitions`
|
||||
- Check broker status: All brokers up?
|
||||
- Check network: Latency between brokers?
|
||||
|
||||
2. Common causes:
|
||||
- Broker failure or restart
|
||||
- Network issues
|
||||
- Slow disk I/O (check `iostat -x 1`)
|
||||
- High producer throughput exceeding replication capacity
|
||||
|
||||
3. Investigation:
|
||||
- Check broker logs for errors
|
||||
- Review JMX metrics: ISR shrinks rate, replica lag
|
||||
- Check disk usage and IOPS
|
||||
|
||||
4. Remediation:
|
||||
- If broker down: Restart it
|
||||
- If network issues: Fix routing/firewall
|
||||
- If slow disk: Increase IOPS (use GP3 or io2)
|
||||
- If throughput: Scale horizontally (add brokers)
|
||||
|
||||
5. Prevention:
|
||||
- Monitor ISR shrinks rate (alert if > 5/sec)
|
||||
- Set up disk I/O monitoring
|
||||
- Regular capacity planning reviews
|
||||
```
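
The first-response checks in step 1 translate to commands like these (the data-dir path is an example):

```bash
# Which partitions are under-replicated, and on which brokers?
kafka-topics.sh --bootstrap-server localhost:9092 --describe --under-replicated-partitions

# Are all brokers registered and answering? (one "id:" header line per broker)
kafka-broker-api-versions.sh --bootstrap-server localhost:9092 | grep -c 'id:'

# On each broker host: disk saturation and free space
iostat -x 1 5
df -h /var/lib/kafka   # adjust to your log.dirs path
```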
|
||||
|
||||
### Workflow 3: Set Up Complete Monitoring Stack
|
||||
```
|
||||
User: "Set up monitoring for our Kafka cluster"
|
||||
|
||||
Me:
|
||||
1. JMX Exporter setup:
|
||||
- Download jmx_prometheus_javaagent JAR
|
||||
- Copy kafka-jmx-exporter.yml config
|
||||
- Add to KAFKA_OPTS: -javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/kafka-jmx-exporter.yml
|
||||
- Restart brokers
|
||||
|
||||
2. Prometheus configuration:
|
||||
- Add Kafka scrape config (job: kafka, port: 7071)
|
||||
- Reload Prometheus: kill -HUP $(pidof prometheus)
|
||||
|
||||
3. Grafana dashboards:
|
||||
- Install 5 dashboards (cluster, broker, consumer lag, topics, JVM)
|
||||
- Configure Prometheus datasource
|
||||
|
||||
4. Alerting rules:
|
||||
- Create 14 alerts (critical/high/warning)
|
||||
- Configure notification channels (Slack, PagerDuty)
|
||||
- Write runbooks for critical alerts
|
||||
|
||||
5. Verification:
|
||||
- Test metrics scraping
|
||||
- Open dashboards
|
||||
- Trigger test alert (stop a broker)
|
||||
```
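
For the VM/bare-metal case, step 1 looks roughly like this on a broker host (pin `VERSION` to a current jmx_exporter release and keep the config path from the steps above):

```bash
VERSION=1.0.1   # example version; pin to a real release
curl -Lo /opt/jmx_prometheus_javaagent.jar \
  "https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/${VERSION}/jmx_prometheus_javaagent-${VERSION}.jar"

export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/kafka-jmx-exporter.yml"
# ...restart the broker with KAFKA_OPTS set, then verify metrics are exposed:
curl -s http://localhost:7071/metrics | grep -m1 kafka_server
```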
|
||||
|
||||
## Best Practices I Enforce
|
||||
|
||||
### Deployment
|
||||
- ✅ Use KRaft mode (no ZooKeeper dependency)
|
||||
- ✅ Multi-AZ deployment (spread brokers across 3+ AZs)
|
||||
- ✅ Replication factor = 3, min.insync.replicas = 2
|
||||
- ✅ Disable unclean.leader.election.enable (prevent data loss)
|
||||
- ✅ Set auto.create.topics.enable = false (explicit topic creation)
|
||||
|
||||
### Monitoring
|
||||
- ✅ Monitor under-replicated partitions (should be 0)
|
||||
- ✅ Monitor offline partitions (should be 0)
|
||||
- ✅ Monitor active controller count (should be exactly 1)
|
||||
- ✅ Track consumer lag per group
|
||||
- ✅ Alert on ISR shrinks rate (>5/sec = issue)
|
||||
|
||||
### Performance
|
||||
- ✅ Use SSD storage (GP3 or better)
|
||||
- ✅ Tune JVM heap (50% of RAM, max 32GB)
|
||||
- ✅ Use G1GC for garbage collection
|
||||
- ✅ Increase num.network.threads and num.io.threads
|
||||
- ✅ Enable compression (lz4 for balance of speed and ratio)
|
||||
|
||||
### Security
|
||||
- ✅ Enable TLS/SSL encryption in transit
|
||||
- ✅ Use SASL authentication (SCRAM-SHA-512)
|
||||
- ✅ Implement ACLs for topic/group access
|
||||
- ✅ Rotate credentials regularly
|
||||
- ✅ Enable encryption at rest (for sensitive data)
|
||||
|
||||
## Common Incidents I Handle
|
||||
|
||||
1. **Under-Replicated Partitions** → Check broker health, network, disk I/O
|
||||
2. **High Consumer Lag** → Scale consumers, optimize processing logic
|
||||
3. **Broker Out of Disk** → Reduce retention, expand volumes
|
||||
4. **High GC Time** → Increase heap, tune GC parameters
|
||||
5. **Connection Refused** → Check security groups, SASL config, TLS certificates
|
||||
6. **Leader Election Storm** → Disable auto leader rebalancing, check network stability
|
||||
7. **Offline Partitions** → Identify failed brokers, restart safely
|
||||
8. **ISR Shrinks** → Investigate replication lag, disk I/O, network latency
|
||||
|
||||
## Runbooks
|
||||
|
||||
For critical alerts, I reference these runbooks:
|
||||
- Under-Replicated Partitions: `monitoring/prometheus/kafka-alerts.yml` (Alert 1)
|
||||
- Offline Partitions: `monitoring/prometheus/kafka-alerts.yml` (Alert 2)
|
||||
- No Active Controller: `monitoring/prometheus/kafka-alerts.yml` (Alert 3)
|
||||
- High Consumer Lag: `monitoring/prometheus/kafka-alerts.yml` (Alert 6)
|
||||
|
||||
## References
|
||||
|
||||
- Apache Kafka Documentation: https://kafka.apache.org/documentation/
|
||||
- Confluent Best Practices: https://docs.confluent.io/platform/current/
|
||||
- Strimzi Docs: https://strimzi.io/docs/
|
||||
- Prometheus JMX Exporter: https://github.com/prometheus/jmx_exporter
|
||||
|
||||
---
|
||||
|
||||
**Invoke me when you need DevOps/SRE expertise for Kafka deployment, monitoring, or incident response!**
|
||||
292
agents/kafka-observability/AGENT.md
Normal file
@@ -0,0 +1,292 @@
|
||||
---
|
||||
name: kafka-observability
|
||||
description: Kafka observability and monitoring specialist. Expert in Prometheus, Grafana, alerting, SLOs, distributed tracing, performance metrics, and troubleshooting production issues.
|
||||
---
|
||||
|
||||
# Kafka Observability Agent
|
||||
|
||||
## 🚀 How to Invoke This Agent
|
||||
|
||||
**Subagent Type**: `specweave-kafka:kafka-observability:kafka-observability`
|
||||
|
||||
**Usage Example**:
|
||||
|
||||
```typescript
|
||||
Task({
|
||||
subagent_type: "specweave-kafka:kafka-observability:kafka-observability",
|
||||
prompt: "Set up Kafka monitoring with Prometheus JMX exporter and create Grafana dashboards with alerting rules",
|
||||
model: "haiku" // optional: haiku, sonnet, opus
|
||||
});
|
||||
```
|
||||
|
||||
**Naming Convention**: `{plugin}:{directory}:{yaml-name-or-directory-name}`
|
||||
- **Plugin**: specweave-kafka
|
||||
- **Directory**: kafka-observability
|
||||
- **Agent Name**: kafka-observability
|
||||
|
||||
**When to Use**:
|
||||
- You need to set up monitoring for Kafka clusters
|
||||
- You want to configure alerting for critical Kafka metrics
|
||||
- You're troubleshooting high latency, consumer lag, or performance issues
|
||||
- You need to analyze Kafka performance bottlenecks
|
||||
- You're implementing SLOs for Kafka availability and latency
|
||||
|
||||
I'm a specialized observability agent with deep expertise in monitoring, alerting, and troubleshooting Apache Kafka in production.
|
||||
|
||||
## My Expertise
|
||||
|
||||
### Monitoring Infrastructure
|
||||
- **Prometheus + Grafana**: JMX exporter, custom dashboards, recording rules
|
||||
- **Metrics Collection**: Broker, topic, consumer, JVM, OS metrics
|
||||
- **Distributed Tracing**: OpenTelemetry integration for end-to-end visibility
|
||||
- **Log Aggregation**: ELK, Datadog, CloudWatch integration
|
||||
|
||||
### Alerting & SLOs
|
||||
- **Alert Design**: Critical vs warning, actionable alerts, reduce noise
|
||||
- **SLO Definition**: Availability, latency, throughput targets
|
||||
- **On-Call Runbooks**: Step-by-step remediation for common incidents
|
||||
- **Escalation Policies**: When to page, when to auto-remediate
|
||||
|
||||
### Performance Analysis
|
||||
- **Latency Profiling**: Produce latency, fetch latency, end-to-end latency
|
||||
- **Throughput Optimization**: Identify bottlenecks, scale appropriately
|
||||
- **Resource Utilization**: CPU, memory, disk I/O, network bandwidth
|
||||
- **Consumer Lag Analysis**: Root cause analysis, scaling recommendations
|
||||
|
||||
## When to Invoke Me
|
||||
|
||||
I activate for:
|
||||
- **Monitoring setup**: "Configure Prometheus for Kafka", "create Grafana dashboards"
|
||||
- **Alert configuration**: "Set up critical alerts", "SLO for 99.9% availability"
|
||||
- **Troubleshooting**: "High latency", "consumer lag spiking", "broker CPU at 100%"
|
||||
- **Performance analysis**: "Why is Kafka slow?", "optimize throughput", "reduce latency"
|
||||
- **Incident response**: "Under-replicated partitions", "offline partitions", "broker down"
|
||||
|
||||
## My Tools
|
||||
|
||||
**Dashboards**:
|
||||
- kafka-cluster-overview: Cluster health, throughput, ISR changes
|
||||
- kafka-broker-metrics: CPU, memory, network, request handlers
|
||||
- kafka-consumer-lag: Lag per group/topic, offset tracking
|
||||
- kafka-topic-metrics: Partition count, replication, log size
|
||||
- kafka-jvm-metrics: Heap, GC, threads, file descriptors
|
||||
|
||||
**Alerting Rules** (14 alerts):
|
||||
- CRITICAL: Under-replicated partitions, offline partitions, no controller, unclean leader elections
|
||||
- HIGH: Consumer lag, ISR shrinks, leader election rate
|
||||
- WARNING: CPU, memory, GC time, disk usage, file descriptors
|
||||
|
||||
## Example Workflows
|
||||
|
||||
### Workflow 1: Set Up Complete Monitoring Stack
|
||||
```
|
||||
User: "Set up monitoring for Kafka cluster on Kubernetes"
|
||||
|
||||
Me:
|
||||
1. JMX Exporter Configuration:
|
||||
- Create ConfigMap with kafka-jmx-exporter.yml
|
||||
- Add to Kafka CR: metricsConfig.type=jmxPrometheusExporter
|
||||
- Metrics exposed on port 9404 (Strimzi default)
|
||||
|
||||
2. Prometheus Integration:
|
||||
- Create PodMonitor for Kafka pods
|
||||
- Scrape interval: 30s
|
||||
- Target: strimzi.io/kind=Kafka pods
|
||||
|
||||
3. Grafana Dashboard Provisioning:
|
||||
- Create ConfigMaps for 5 dashboards
|
||||
- Label with grafana_dashboard=1
|
||||
- Auto-import within 60 seconds
|
||||
|
||||
4. Alerting Rules:
|
||||
- Deploy PrometheusRule with 14 alerts
|
||||
- Configure Alertmanager for Slack/PagerDuty
|
||||
- Test: Stop a broker, verify alert fires
|
||||
|
||||
5. Verification:
|
||||
- Open Grafana (kubectl port-forward 3000:80)
|
||||
- Check metrics: kafka_server_broker_topic_metrics_bytesin_total
|
||||
- View dashboards: Cluster Overview, Broker Metrics, Consumer Lag
|
||||
```
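
A sketch of the verification in step 5, assuming a Strimzi cluster named `my-cluster` in the `kafka` namespace and Grafana in `monitoring` (substitute your own resource names):

```bash
kubectl -n kafka get podmonitor
kubectl -n kafka port-forward pod/my-cluster-kafka-0 9404:9404 &
curl -s http://localhost:9404/metrics | grep -m1 kafka_server

kubectl -n monitoring port-forward svc/grafana 3000:80
```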
|
||||
|
||||
### Workflow 2: Troubleshoot High Consumer Lag
|
||||
```
|
||||
User: "Consumer group 'orders-processor' has 50K messages lag, why?"
|
||||
|
||||
Me:
|
||||
1. Immediate Metrics Check:
|
||||
- kafka_consumergroup_lag{consumergroup="orders-processor"} = 50,000
|
||||
- kafka_consumergroup_current_offset (where are consumers?)
|
||||
- kafka_consumergroup_log_end_offset (where are producers?)
|
||||
|
||||
2. Consumer Health Check:
|
||||
- kafka_consumergroup_members = ? (how many active consumers?)
|
||||
- Partition count = 12
|
||||
- Expected: 12 consumers (1 per partition)
|
||||
- If < 12 → Scale up consumers
|
||||
|
||||
3. Processing Time Analysis:
|
||||
- Check application metrics: message processing time
|
||||
- If > 100ms → Optimize processing logic
|
||||
- If database calls → Add caching, connection pooling
|
||||
|
||||
4. Producer Rate vs Consumer Rate:
|
||||
- Producer: rate(kafka_topic_partition_current_offset[5m]) = 500 msg/sec
|
||||
- Consumer: rate(kafka_consumergroup_current_offset[5m]) = 200 msg/sec
|
||||
- Gap: 300 msg/sec → Consumers too slow!
|
||||
|
||||
5. Remediation Options:
|
||||
a. Scale consumers: 12 → 24 (but limited by 12 partitions)
|
||||
b. Increase partitions: 12 → 24 (allows more consumers)
|
||||
c. Optimize processing: Reduce per-message latency
|
||||
d. Batch processing: Process 10 messages at once
|
||||
|
||||
6. Long-term Fix:
|
||||
- Increase partitions (requires re-partitioning)
|
||||
- Optimize consumer code (reduce latency)
|
||||
- Add more consumer instances
|
||||
- Monitor: kafka_consumergroup_lag < 1000 (SLO)
|
||||
```
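
Two quick ways to pull the same lag numbers, one straight from the broker and one from Prometheus (metric names as used above):

```bash
# Per-partition lag directly from the consumer group coordinator
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group orders-processor

# Aggregate lag from Prometheus
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(kafka_consumergroup_lag{consumergroup="orders-processor"})'
```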
|
||||
|
||||
### Workflow 3: Investigate Broker Performance Degradation
|
||||
```
|
||||
User: "Broker CPU spiked to 100%, latency increased, what's wrong?"
|
||||
|
||||
Me:
|
||||
1. Metrics Timeline Analysis:
|
||||
- os_process_cpu_load{instance="kafka-broker-2"} = 1.0 (100%)
|
||||
- kafka_network_request_metrics_totaltime_total{request="Produce"} spike
|
||||
- kafka_server_request_handler_avg_idle_percent = 0.05 (95% busy!)
|
||||
|
||||
2. Correlation Check (find root cause):
|
||||
- kafka_server_broker_topic_metrics_messagesin_total → No spike
|
||||
- kafka_log_flush_time_ms_p99 → Spike from 10ms to 500ms (disk I/O issue!)
|
||||
- iostat (via node exporter) → Disk queue depth = 50 (saturation)
|
||||
|
||||
3. Root Cause Identified: Disk I/O Saturation
|
||||
- Likely cause: Log flush taking too long
|
||||
- Check: log.flush.interval.messages and log.flush.interval.ms
|
||||
|
||||
4. Immediate Mitigation:
|
||||
- Check disk health: SMART errors?
|
||||
- Check IOPS limits: GP2 exhausted? Upgrade to GP3
|
||||
- Increase provisioned IOPS: 3000 → 10,000
|
||||
|
||||
5. Configuration Tuning:
|
||||
- Increase log.flush.interval.messages (flush less frequently)
|
||||
- Reduce log.segment.bytes (smaller segments = less data per flush)
|
||||
- Use faster storage class (io2 for critical production)
|
||||
|
||||
6. Monitoring:
|
||||
- Set alert: kafka_log_flush_time_ms_p99 > 100ms for 5m
|
||||
- Track: iostat iowait% < 20% (SLO)
|
||||
```
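
A short sketch of the disk-I/O confirmation and config check from steps 2-5, using broker id 2 as an example:

```bash
# Confirm disk saturation on the affected broker host
iostat -x 1 5

# Inspect the broker's effective flush-related settings
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type brokers --entity-name 2 --describe --all | grep log.flush
```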
|
||||
|
||||
## Critical Metrics I Monitor
|
||||
|
||||
### Cluster Health
|
||||
- `kafka_controller_active_controller_count` = 1 (exactly one)
|
||||
- `kafka_server_replica_manager_under_replicated_partitions` = 0
|
||||
- `kafka_controller_offline_partitions_count` = 0
|
||||
- `kafka_controller_unclean_leader_elections_total` = 0
|
||||
|
||||
### Broker Performance
|
||||
- `os_process_cpu_load` < 0.8 (80% CPU)
|
||||
- `jvm_memory_heap_used_bytes / jvm_memory_heap_max_bytes` < 0.85 (85% heap)
|
||||
- `kafka_server_request_handler_avg_idle_percent` > 0.3 (30% idle)
|
||||
- `os_open_file_descriptors / os_max_file_descriptors` < 0.8 (80% FD)
|
||||
|
||||
### Throughput & Latency
|
||||
- `kafka_server_broker_topic_metrics_bytesin_total` (bytes in/sec)
|
||||
- `kafka_server_broker_topic_metrics_bytesout_total` (bytes out/sec)
|
||||
- `kafka_network_request_metrics_totaltime_total{request="Produce"}` (produce latency)
|
||||
- `kafka_network_request_metrics_totaltime_total{request="FetchConsumer"}` (fetch latency)
|
||||
|
||||
### Consumer Lag
|
||||
- `kafka_consumergroup_lag` < 1000 messages (SLO)
|
||||
- `rate(kafka_consumergroup_current_offset[5m])` = consumer throughput
|
||||
- `rate(kafka_topic_partition_current_offset[5m])` = producer throughput
|
||||
|
||||
### JVM Health
|
||||
- `jvm_gc_collection_time_ms_total` < 500ms/sec (GC time)
|
||||
- `jvm_threads_count` < 500 (thread count)
|
||||
- `rate(jvm_gc_collection_count_total[5m])` < 1/sec (GC frequency)
|
||||
|
||||
## Alerting Best Practices
|
||||
|
||||
### Alert Severity Levels
|
||||
|
||||
**CRITICAL** (Page On-Call Immediately):
|
||||
- Under-replicated partitions > 0 for 5 minutes
|
||||
- Offline partitions > 0 for 1 minute
|
||||
- No active controller for 1 minute
|
||||
- Unclean leader elections > 0
|
||||
|
||||
**HIGH** (Notify During Business Hours):
|
||||
- Consumer lag > 10,000 messages for 10 minutes
|
||||
- ISR shrinks > 5/sec for 5 minutes
|
||||
- Leader election rate > 0.5/sec for 5 minutes
|
||||
|
||||
**WARNING** (Create Ticket, Investigate Next Day):
|
||||
- CPU usage > 80% for 5 minutes
|
||||
- Heap memory > 85% for 5 minutes
|
||||
- GC time > 500ms/sec for 5 minutes
|
||||
- Disk usage > 85% for 5 minutes
|
||||
|
||||
### Alert Design Principles
|
||||
- ✅ **Actionable**: Alert must require human intervention
|
||||
- ✅ **Specific**: Include exact metric value and threshold
|
||||
- ✅ **Runbook**: Link to step-by-step remediation guide
|
||||
- ✅ **Context**: Include related metrics for correlation
|
||||
- ❌ **Avoid Noise**: Don't alert on normal fluctuations
|
||||
|
||||
## SLO Definitions
|
||||
|
||||
### Example SLOs for Kafka
|
||||
```yaml
|
||||
# Availability SLO
- objective: "99.9% of produce requests succeed"
  measurement: success_rate(kafka_network_request_metrics_totaltime_total{request="Produce"})
  target: 0.999

# Latency SLO
- objective: "p99 produce latency < 100ms"
  measurement: histogram_quantile(0.99, kafka_network_request_metrics_totaltime_total{request="Produce"})
  target: 0.1  # 100ms

# Consumer Lag SLO
- objective: "95% of consumer groups have lag < 1000 messages"
  measurement: count(kafka_consumergroup_lag < 1000) / count(kafka_consumergroup_lag)
  target: 0.95
|
||||
```
|
||||
|
||||
## Troubleshooting Decision Tree
|
||||
|
||||
```
|
||||
High Latency Detected
|
||||
├─ Check Broker CPU
|
||||
│ └─ High (>80%) → Scale horizontally, optimize config
|
||||
│
|
||||
├─ Check Disk I/O
|
||||
│ └─ High (iowait >20%) → Upgrade storage (GP3/io2), tune flush settings
|
||||
│
|
||||
├─ Check Network
|
||||
│ └─ High RTT → Check inter-broker network, increase socket buffers
|
||||
│
|
||||
├─ Check GC Time
|
||||
│ └─ High (>500ms/sec) → Increase heap, tune GC (G1GC)
|
||||
│
|
||||
└─ Check Request Handler Idle %
|
||||
└─ Low (<30%) → Increase num.network.threads, num.io.threads
|
||||
```
|
||||
|
||||
## References
|
||||
|
||||
- Prometheus JMX Exporter: https://github.com/prometheus/jmx_exporter
|
||||
- Grafana Dashboards: `plugins/specweave-kafka/monitoring/grafana/dashboards/`
|
||||
- Alerting Rules: `plugins/specweave-kafka/monitoring/prometheus/kafka-alerts.yml`
|
||||
- Kafka Metrics Guide: https://kafka.apache.org/documentation/#monitoring
|
||||
|
||||
---
|
||||
|
||||
**Invoke me when you need observability, monitoring, alerting, or performance troubleshooting expertise!**
|
||||
99
commands/deploy.md
Normal file
@@ -0,0 +1,99 @@
|
||||
---
|
||||
name: specweave-kafka:deploy
|
||||
description: Deploy Apache Kafka cluster using Terraform (Apache Kafka, AWS MSK, or Azure Event Hubs). Guides platform selection, sizing, and deployment.
|
||||
---
|
||||
|
||||
# Deploy Kafka Cluster
|
||||
|
||||
Deploy Apache Kafka using Infrastructure as Code (Terraform).
|
||||
|
||||
## What This Command Does
|
||||
|
||||
1. **Platform Selection**: Helps you choose the right Kafka platform
|
||||
2. **Cluster Sizing**: Calculates broker count, instance types, storage
|
||||
3. **Terraform Generation**: Creates or uses existing Terraform modules
|
||||
4. **Deployment**: Guides through terraform init/plan/apply
|
||||
5. **Verification**: Tests cluster connectivity and basic operations
|
||||
|
||||
## Interactive Workflow
|
||||
|
||||
I'll ask you a few questions to determine the best deployment approach:
|
||||
|
||||
### Question 1: Which platform?
|
||||
- **Apache Kafka** (self-hosted on AWS EC2, KRaft mode)
|
||||
- **AWS MSK** (managed Kafka service)
|
||||
- **Azure Event Hubs** (Kafka-compatible API)
|
||||
|
||||
### Question 2: What's your use case?
|
||||
- **Development/Testing** (1 broker, small instance)
|
||||
- **Staging** (3 brokers, medium instances)
|
||||
- **Production** (3-5 brokers, large instances, multi-AZ)
|
||||
|
||||
### Question 3: Expected throughput?
|
||||
- Messages per second (peak)
|
||||
- Average message size
|
||||
- Retention period (hours/days)
|
||||
|
||||
Based on your answers, I'll:
|
||||
- ✅ Recommend broker count and instance types
|
||||
- ✅ Calculate storage requirements
|
||||
- ✅ Generate Terraform configuration
|
||||
- ✅ Guide deployment
|
||||
|
||||
## Example Usage
|
||||
|
||||
```bash
|
||||
# Start deployment wizard
|
||||
/specweave-kafka:deploy
|
||||
|
||||
# I'll activate kafka-iac-deployment skill and guide you through:
|
||||
# 1. Platform selection
|
||||
# 2. Sizing calculation (using ClusterSizingCalculator)
|
||||
# 3. Terraform module selection (apache-kafka, aws-msk, or azure-event-hubs)
|
||||
# 4. Deployment execution
|
||||
# 5. Post-deployment verification
|
||||
```
|
||||
|
||||
## What Gets Created
|
||||
|
||||
**Apache Kafka Deployment** (AWS EC2):
|
||||
- 3-5 EC2 instances (m5.xlarge or larger)
|
||||
- EBS volumes (GP3, 100Gi+ per broker)
|
||||
- Security groups (SASL_SSL on port 9093)
|
||||
- IAM roles for S3 backups
|
||||
- CloudWatch alarms
|
||||
- Load balancer (optional)
|
||||
|
||||
**AWS MSK Deployment**:
|
||||
- MSK cluster (3-6 brokers)
|
||||
- VPC, subnets, security groups
|
||||
- IAM authentication
|
||||
- CloudWatch monitoring
|
||||
- Auto-scaling (optional)
|
||||
|
||||
**Azure Event Hubs Deployment**:
|
||||
- Event Hubs namespace (Premium SKU)
|
||||
- Event hubs (topics)
|
||||
- Private endpoints
|
||||
- Auto-inflate enabled
|
||||
- Zone redundancy
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Terraform 1.5+ installed
|
||||
- AWS CLI (for AWS deployments) or Azure CLI (for Azure)
|
||||
- Appropriate cloud credentials configured
|
||||
- VPC and subnets created (if deploying to cloud)
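
Quick pre-flight checks for the list above (run only the line that matches your target cloud):

```bash
terraform version              # expect 1.5+
aws sts get-caller-identity    # AWS: confirms credentials are configured
az account show                # Azure: confirms you are logged in
```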
|
||||
|
||||
## Post-Deployment
|
||||
|
||||
After deployment succeeds, I'll:
|
||||
1. ✅ Output bootstrap servers
|
||||
2. ✅ Provide connection examples
|
||||
3. ✅ Suggest running `/specweave-kafka:monitor-setup` for Prometheus + Grafana
|
||||
4. ✅ Suggest testing with `/specweave-kafka:dev-env` locally
|
||||
|
||||
---
|
||||
|
||||
**Skills Activated**: kafka-iac-deployment, kafka-architecture
|
||||
**Related Commands**: /specweave-kafka:monitor-setup, /specweave-kafka:dev-env
|
||||
176
commands/dev-env.md
Normal file
@@ -0,0 +1,176 @@
|
||||
---
|
||||
name: specweave-kafka:dev-env
|
||||
description: Set up local Kafka development environment using Docker Compose. Includes Kafka (KRaft mode), Schema Registry, Kafka UI, Prometheus, and Grafana.
|
||||
---
|
||||
|
||||
# Set Up Local Kafka Dev Environment
|
||||
|
||||
Spin up a complete local Kafka development environment with one command.
|
||||
|
||||
## What This Command Does
|
||||
|
||||
1. **Docker Compose Selection**: Choose Kafka or Redpanda
|
||||
2. **Service Configuration**: Kafka + Schema Registry + UI + Monitoring
|
||||
3. **Environment Setup**: Generate docker-compose.yml
|
||||
4. **Start Services**: `docker-compose up -d`
|
||||
5. **Verification**: Test cluster and provide connection details
|
||||
|
||||
## Two Options Available
|
||||
|
||||
### Option 1: Apache Kafka (KRaft Mode)
|
||||
**Services**:
|
||||
- ✅ Kafka broker (KRaft mode, no ZooKeeper)
|
||||
- ✅ Schema Registry (Avro schemas)
|
||||
- ✅ Kafka UI (web interface, port 8080)
|
||||
- ✅ Prometheus (metrics, port 9090)
|
||||
- ✅ Grafana (dashboards, port 3000)
|
||||
|
||||
**Use When**: Testing Apache Kafka specifically, need Schema Registry
|
||||
|
||||
### Option 2: Redpanda (3-Node Cluster)
|
||||
**Services**:
|
||||
- ✅ Redpanda (3 brokers, Kafka-compatible)
|
||||
- ✅ Redpanda Console (web UI, port 8080)
|
||||
- ✅ Prometheus (metrics, port 9090)
|
||||
- ✅ Grafana (dashboards, port 3000)
|
||||
|
||||
**Use When**: Testing high-performance alternative, need multi-broker cluster locally
|
||||
|
||||
## Example Usage
|
||||
|
||||
```bash
|
||||
# Start dev environment setup
|
||||
/specweave-kafka:dev-env
|
||||
|
||||
# I'll ask:
|
||||
# 1. Which stack? (Kafka or Redpanda)
|
||||
# 2. Where to create files? (current directory or specify path)
|
||||
# 3. Custom ports? (use defaults or customize)
|
||||
|
||||
# Then I'll:
|
||||
# - Generate docker-compose.yml
|
||||
# - Start all services
|
||||
# - Wait for health checks
|
||||
# - Provide connection details
|
||||
# - Open Kafka UI in browser
|
||||
```
|
||||
|
||||
## What Gets Created
|
||||
|
||||
**Directory Structure**:
|
||||
```
|
||||
./kafka-dev/
|
||||
├── docker-compose.yml # Main compose file
|
||||
├── .env # Environment variables
|
||||
├── data/ # Persistent volumes
|
||||
│ ├── kafka/
|
||||
│ ├── prometheus/
|
||||
│ └── grafana/
|
||||
└── config/
|
||||
├── prometheus.yml # Prometheus config
|
||||
└── grafana/ # Dashboard provisioning
|
||||
```
|
||||
|
||||
**Services Running**:
|
||||
- Kafka: localhost:9092 (plaintext) or localhost:9093 (SASL_SSL)
|
||||
- Schema Registry: localhost:8081
|
||||
- Kafka UI: http://localhost:8080
|
||||
- Prometheus: http://localhost:9090
|
||||
- Grafana: http://localhost:3000 (admin/admin)
|
||||
|
||||
## Connection Examples
|
||||
|
||||
**After setup, connect with**:
|
||||
|
||||
### Producer (Node.js):
|
||||
```javascript
|
||||
const { Kafka } = require('kafkajs');

const kafka = new Kafka({
  clientId: 'my-app',
  brokers: ['localhost:9092']
});

// Wrap in an async function so await is valid in a CommonJS module
async function main() {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'test-topic',
    messages: [{ value: 'Hello Kafka!' }]
  });
  await producer.disconnect();
}

main().catch(console.error);
|
||||
```
|
||||
|
||||
### Consumer (Python):
|
||||
```python
|
||||
from kafka import KafkaConsumer
|
||||
|
||||
consumer = KafkaConsumer(
    'test-topic',
    bootstrap_servers=['localhost:9092'],
    group_id='my-group',
    auto_offset_reset='earliest'
)

for message in consumer:
    print(f"Received: {message.value}")
|
||||
```
|
||||
|
||||
### kcat (CLI):
|
||||
```bash
|
||||
# Produce message
|
||||
echo "Hello Kafka" | kcat -P -b localhost:9092 -t test-topic
|
||||
|
||||
# Consume messages
|
||||
kcat -C -b localhost:9092 -t test-topic -o beginning
|
||||
```
|
||||
|
||||
## Sample Producer/Consumer
|
||||
|
||||
I'll also create sample code templates:
|
||||
- `producer-nodejs.js` - Production-ready Node.js producer
|
||||
- `consumer-nodejs.js` - Production-ready Node.js consumer
|
||||
- `producer-python.py` - Python producer with error handling
|
||||
- `consumer-python.py` - Python consumer with DLQ
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Docker 20+ installed
|
||||
- Docker Compose v2+
|
||||
- 4GB+ free RAM (for Redpanda 3-node cluster)
|
||||
- Ports available: 8080, 8081, 9090, 9092, 9093, 3000
|
||||
|
||||
## Post-Setup
|
||||
|
||||
After environment starts, I'll:
|
||||
1. ✅ Open Kafka UI in browser (http://localhost:8080)
|
||||
2. ✅ Create a test topic via UI
|
||||
3. ✅ Show producer/consumer examples
|
||||
4. ✅ Provide kcat commands for testing
|
||||
5. ✅ Show Grafana dashboards (http://localhost:3000)
|
||||
|
||||
## Useful Commands
|
||||
|
||||
```bash
|
||||
# Start environment
|
||||
docker-compose up -d
|
||||
|
||||
# Stop environment
|
||||
docker-compose down
|
||||
|
||||
# Stop and remove data
|
||||
docker-compose down -v
|
||||
|
||||
# View logs
|
||||
docker-compose logs -f kafka
|
||||
|
||||
# Restart Kafka only
|
||||
docker-compose restart kafka
|
||||
|
||||
# Check health
|
||||
docker-compose ps
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
**Skills Activated**: kafka-cli-tools
|
||||
**Docker Compose Location**: `plugins/specweave-kafka/docker/`
|
||||
**Sample Code**: `plugins/specweave-kafka/docker/templates/`
|
||||
101
commands/mcp-configure.md
Normal file
@@ -0,0 +1,101 @@
|
||||
---
|
||||
name: specweave-kafka:mcp-configure
|
||||
description: Configure MCP (Model Context Protocol) server for Kafka integration. Auto-detects and configures kanapuli, tuannvm, Joel-hanson, or Confluent MCP servers.
|
||||
---
|
||||
|
||||
# Configure Kafka MCP Server
|
||||
|
||||
Set up MCP (Model Context Protocol) server integration for natural language Kafka operations.
|
||||
|
||||
## What This Command Does
|
||||
|
||||
1. **MCP Server Detection**: Auto-detect installed MCP servers
|
||||
2. **Server Ranking**: Recommend best server for your needs
|
||||
3. **Configuration**: Generate Claude Desktop config
|
||||
4. **Testing**: Verify MCP server connectivity
|
||||
5. **Usage Guide**: Show natural language examples
|
||||
|
||||
## Supported MCP Servers
|
||||
|
||||
| Server | Language | Features | Best For |
|
||||
|--------|----------|----------|----------|
|
||||
| **Confluent Official** | - | Natural language, Flink SQL, Enterprise | Production + Confluent Cloud |
|
||||
| **tuannvm/kafka-mcp-server** | Go | Advanced SASL (SCRAM-SHA-256/512) | Security-focused deployments |
|
||||
| **kanapuli/mcp-kafka** | Node.js | Basic operations, SASL_PLAINTEXT | Quick start, dev environments |
|
||||
| **Joel-hanson/kafka-mcp-server** | Python | Claude Desktop integration | Desktop AI workflows |
|
||||
|
||||
## Example Usage
|
||||
|
||||
```bash
|
||||
# Start MCP configuration wizard
|
||||
/specweave-kafka:mcp-configure
|
||||
|
||||
# I'll:
|
||||
# 1. Detect installed MCP servers (npm, go, pip, CLI)
|
||||
# 2. Rank servers (Confluent > tuannvm > kanapuli > Joel-hanson)
|
||||
# 3. Generate Claude Desktop config (~/.claude/settings.json)
|
||||
# 4. Test connection to Kafka
|
||||
# 5. Show natural language examples
|
||||
```
|
||||
|
||||
## What Gets Configured
|
||||
|
||||
**Claude Desktop Config** (`~/.claude/settings.json`):
|
||||
```json
|
||||
{
  "mcpServers": {
    "kafka": {
      "command": "npx",
      "args": ["mcp-kafka"],
      "env": {
        "KAFKA_BROKERS": "localhost:9092",
        "KAFKA_SASL_USERNAME": "admin",
        "KAFKA_SASL_PASSWORD": "admin-secret"
      }
    }
  }
}
|
||||
```
|
||||
|
||||
## Natural Language Examples
|
||||
|
||||
After MCP is configured, you can use natural language with Claude:
|
||||
|
||||
```
|
||||
You: "List all Kafka topics"
|
||||
Claude: [Uses MCP to call listTopics()]
|
||||
Output: user-events, order-events, payment-events
|
||||
|
||||
You: "Create a topic called 'analytics' with 12 partitions and RF=3"
|
||||
Claude: [Uses MCP to call createTopic()]
|
||||
Output: Topic 'analytics' created successfully
|
||||
|
||||
You: "What's the consumer lag for group 'orders-consumer'?"
|
||||
Claude: [Uses MCP to call getConsumerGroupOffsets()]
|
||||
Output: Total lag: 1,234 messages across 6 partitions
|
||||
|
||||
You: "Send a test message to 'user-events' topic"
|
||||
Claude: [Uses MCP to call produceMessage()]
|
||||
Output: Message sent to partition 3, offset 12345
|
||||
```
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Node.js 18+ (for kanapuli or Joel-hanson)
|
||||
- Go 1.20+ (for tuannvm)
|
||||
- Confluent Cloud account (for Confluent MCP)
|
||||
- Kafka cluster accessible from your machine
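
A quick sanity check of the prerequisites before configuring any server (only the runtime you need has to pass):

```bash
node --version                 # 18+ for kanapuli / Joel-hanson
go version                     # 1.20+ for tuannvm
kcat -L -b localhost:9092      # broker metadata must be reachable from this machine
```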
|
||||
|
||||
## Post-Configuration
|
||||
|
||||
After MCP is configured, I'll:
|
||||
1. ✅ Restart Claude Desktop (required for MCP changes)
|
||||
2. ✅ Test MCP server with simple command
|
||||
3. ✅ Show 10+ natural language examples
|
||||
4. ✅ Provide troubleshooting tips if connection fails
|
||||
|
||||
---
|
||||
|
||||
**Skills Activated**: kafka-mcp-integration
|
||||
**Related Commands**: /specweave-kafka:deploy, /specweave-kafka:dev-env
|
||||
**MCP Docs**: https://modelcontextprotocol.io/
|
||||
96
commands/monitor-setup.md
Normal file
@@ -0,0 +1,96 @@
|
||||
---
|
||||
name: specweave-kafka:monitor-setup
|
||||
description: Set up comprehensive Kafka monitoring with Prometheus + Grafana. Configures JMX exporter, dashboards, and alerting rules.
|
||||
---
|
||||
|
||||
# Set Up Kafka Monitoring
|
||||
|
||||
Configure comprehensive monitoring for your Kafka cluster using Prometheus and Grafana.
|
||||
|
||||
## What This Command Does
|
||||
|
||||
1. **JMX Exporter Setup**: Configure Prometheus JMX exporter for Kafka brokers
|
||||
2. **Prometheus Configuration**: Add Kafka scrape targets
|
||||
3. **Grafana Dashboards**: Install 5 pre-built dashboards
|
||||
4. **Alerting Rules**: Configure 14 critical/high/warning alerts
|
||||
5. **Verification**: Test metrics collection and dashboard access
|
||||
|
||||
## Interactive Workflow
|
||||
|
||||
I'll detect your environment and guide setup:
|
||||
|
||||
### Environment Detection
|
||||
- **Kubernetes** (Strimzi/Confluent Operator) → Use PodMonitor
|
||||
- **Docker Compose** → Add Prometheus + Grafana services
|
||||
- **VM/Bare Metal** → Configure JMX exporter JAR
|
||||
|
||||
### Question 1: Where is Kafka running?
|
||||
- Kubernetes (Strimzi)
|
||||
- Docker Compose
|
||||
- VMs/EC2 instances
|
||||
|
||||
### Question 2: Prometheus already installed?
|
||||
- Yes → Just add Kafka scrape config
|
||||
- No → Install Prometheus + Grafana stack
|
||||
|
||||
## Example Usage
|
||||
|
||||
```bash
|
||||
# Start monitoring setup wizard
|
||||
/specweave-kafka:monitor-setup
|
||||
|
||||
# I'll activate kafka-observability skill and:
|
||||
# 1. Detect your environment
|
||||
# 2. Configure JMX exporter (port 7071)
|
||||
# 3. Set up Prometheus scraping
|
||||
# 4. Install 5 Grafana dashboards
|
||||
# 5. Configure 14 alerting rules
|
||||
# 6. Verify metrics collection
|
||||
```
|
||||
|
||||
## What Gets Configured
|
||||
|
||||
**JMX Exporter** (Kafka brokers):
|
||||
- Metrics endpoint on port 7071
|
||||
- 50+ critical Kafka metrics exported
|
||||
- Broker, topic, consumer lag, JVM metrics
|
||||
|
||||
**Prometheus Scraping**:
|
||||
```yaml
|
||||
scrape_configs:
  - job_name: 'kafka'
    static_configs:
      - targets: ['kafka-0:7071', 'kafka-1:7071', 'kafka-2:7071']
|
||||
```
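
After editing the config, a minimal validate-and-reload sketch (the reload endpoint requires Prometheus to run with `--web.enable-lifecycle`):

```bash
promtool check config /etc/prometheus/prometheus.yml
curl -s -X POST http://localhost:9090/-/reload
curl -s http://localhost:9090/api/v1/targets | grep -o '"job":"kafka"' | head -1
```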
|
||||
|
||||
**5 Grafana Dashboards**:
|
||||
1. **Cluster Overview** - Health, throughput, ISR changes
|
||||
2. **Broker Metrics** - CPU, memory, network, request handlers
|
||||
3. **Consumer Lag** - Lag per group/topic, offset tracking
|
||||
4. **Topic Metrics** - Partition count, replication, log size
|
||||
5. **JVM Metrics** - Heap, GC, threads, file descriptors
|
||||
|
||||
**14 Alerting Rules**:
|
||||
- CRITICAL: Under-replicated partitions, offline partitions, no controller
|
||||
- HIGH: Consumer lag, ISR shrinks, leader elections
|
||||
- WARNING: CPU, memory, GC time, disk usage
|
||||
|
||||
## Prerequisites
|
||||
|
||||
- Kafka cluster running (self-hosted or K8s)
|
||||
- Prometheus installed (or will be installed)
|
||||
- Grafana installed (or will be installed)
|
||||
|
||||
## Post-Setup
|
||||
|
||||
After setup completes, I'll:
|
||||
1. ✅ Provide Grafana URL and credentials
|
||||
2. ✅ Show how to access dashboards
|
||||
3. ✅ Explain critical alerts
|
||||
4. ✅ Suggest testing alerts by stopping a broker
|
||||
|
||||
---
|
||||
|
||||
**Skills Activated**: kafka-observability
|
||||
**Related Commands**: /specweave-kafka:deploy
|
||||
**Dashboard Locations**: `plugins/specweave-kafka/monitoring/grafana/dashboards/`
|
||||
93
plugin.lock.json
Normal file
@@ -0,0 +1,93 @@
|
||||
{
|
||||
"$schema": "internal://schemas/plugin.lock.v1.json",
|
||||
"pluginId": "gh:anton-abyzov/specweave:plugins/specweave-kafka",
|
||||
"normalized": {
|
||||
"repo": null,
|
||||
"ref": "refs/tags/v20251128.0",
|
||||
"commit": "681f5a385e57731c64ef4b212d55544d87144203",
|
||||
"treeHash": "72b53317c0e203fe3777e89c5d9028138ef4f5c010ddbb40e9aefcb98071116a",
|
||||
"generatedAt": "2025-11-28T10:13:51.728305Z",
|
||||
"toolVersion": "publish_plugins.py@0.2.0"
|
||||
},
|
||||
"origin": {
|
||||
"remote": "git@github.com:zhongweili/42plugin-data.git",
|
||||
"branch": "master",
|
||||
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
|
||||
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
|
||||
},
|
||||
"manifest": {
|
||||
"name": "specweave-kafka",
|
||||
"description": "Apache Kafka event streaming integration with MCP servers, CLI tools (kcat), Terraform modules, and comprehensive observability stack",
|
||||
"version": "0.24.0"
|
||||
},
|
||||
"content": {
|
||||
"files": [
|
||||
{
|
||||
"path": "README.md",
|
||||
"sha256": "afb48227ea28ac5877048fa4fac0d0cfcb2f1ae6623286c0cd19bbd756530daa"
|
||||
},
|
||||
{
|
||||
"path": "agents/kafka-devops/AGENT.md",
|
||||
"sha256": "409e6d56102b053bc596eca97f1c693faf10dd70aede9d3cc97fbc896b68a9d9"
|
||||
},
|
||||
{
|
||||
"path": "agents/kafka-observability/AGENT.md",
|
||||
"sha256": "0693bcd35ef65f33ebf9e46cf8310bda8001a436d3e6bd6313e2aed6ec54fcc8"
|
||||
},
|
||||
{
|
||||
"path": "agents/kafka-architect/AGENT.md",
|
||||
"sha256": "f0bb437f1f6f912b8e1afba948f56b2146f4785ba6f1f79512bb50584608b8ea"
|
||||
},
|
||||
{
|
||||
"path": ".claude-plugin/plugin.json",
|
||||
"sha256": "52b4935226fb2716fdd2cec74187ce12252cafc46326a83c64aaf93fb2027cde"
|
||||
},
|
||||
{
|
||||
"path": "commands/monitor-setup.md",
|
||||
"sha256": "d18a3a37122d04ccb4f7a99286e07af58929b11162288a1db755d44626da58b8"
|
||||
},
|
||||
{
|
||||
"path": "commands/deploy.md",
|
||||
"sha256": "82c246ae4d7043e5da67dc6402bcc58f181b078f21ca67619efd266701894ad4"
|
||||
},
|
||||
{
|
||||
"path": "commands/mcp-configure.md",
|
||||
"sha256": "0e2144e29ab332a925535d81753094a426fa46670778fbcd70c802235b024266"
|
||||
},
|
||||
{
|
||||
"path": "commands/dev-env.md",
|
||||
"sha256": "c2cd943d4d1a3a2b05e7321976672594569b25f9d12151213b8b68081b3fe861"
|
||||
},
|
||||
{
|
||||
"path": "skills/kafka-kubernetes/SKILL.md",
|
||||
"sha256": "64da4d3d9cdbe7061d9e0254c7be4a531a9356f43494251958d24aa66622eb53"
|
||||
},
|
||||
{
|
||||
"path": "skills/kafka-mcp-integration/SKILL.md",
|
||||
"sha256": "e132fabf52ebad6a2b57e490a0c7738f6ca70ded6f6f879b8db719de66c17e0d"
|
||||
},
|
||||
{
|
||||
"path": "skills/kafka-architecture/SKILL.md",
|
||||
"sha256": "326e0a3de8c26ce4b36bfe76fc507adca6741b4078bae5214f04039561d84bfd"
|
||||
},
|
||||
{
|
||||
"path": "skills/kafka-observability/SKILL.md",
|
||||
"sha256": "c3b4b19fbac43f0fdba91009efe717f55348c66a26dce0c6dfb721a7c57d0817"
|
||||
},
|
||||
{
|
||||
"path": "skills/kafka-iac-deployment/SKILL.md",
|
||||
"sha256": "fc82a9a7990d1c8ca8ab4f9b8845cd53fdc46767a11e95432b517368fff000de"
|
||||
},
|
||||
{
|
||||
"path": "skills/kafka-cli-tools/SKILL.md",
|
||||
"sha256": "7658d0dfabdb1cf1fa5c83619f41acf2daff28597cdeb35b94e4ce484218d4e0"
|
||||
}
|
||||
],
|
||||
"dirSha256": "72b53317c0e203fe3777e89c5d9028138ef4f5c010ddbb40e9aefcb98071116a"
|
||||
},
|
||||
"security": {
|
||||
"scannedAt": null,
|
||||
"scannerVersion": null,
|
||||
"flags": []
|
||||
}
|
||||
}
|
||||
647
skills/kafka-architecture/SKILL.md
Normal file
@@ -0,0 +1,647 @@
|
||||
---
|
||||
name: kafka-architecture
|
||||
description: Expert knowledge of Apache Kafka architecture, cluster design, capacity planning, partitioning strategies, replication, and high availability. Auto-activates on keywords kafka architecture, cluster sizing, partition strategy, replication factor, kafka ha, kafka scalability, broker count, topic design, kafka performance, kafka capacity planning.
|
||||
---
|
||||
|
||||
# Kafka Architecture & Design Expert
|
||||
|
||||
Comprehensive knowledge of Apache Kafka architecture patterns, cluster design principles, and production best practices for building resilient, scalable event streaming platforms.
|
||||
|
||||
## Core Architecture Concepts
|
||||
|
||||
### Kafka Cluster Components
|
||||
|
||||
**Brokers**:
|
||||
- Individual Kafka servers that store and serve data
|
||||
- Each broker handles thousands of partitions
|
||||
- Typical: 3-10 brokers per cluster (small), 10-100+ (large enterprises)
|
||||
|
||||
**Controller**:
|
||||
- One broker elected as controller (via KRaft or ZooKeeper)
|
||||
- Manages partition leaders and replica assignments
|
||||
- Failure triggers automatic re-election
|
||||
|
||||
**Topics**:
|
||||
- Logical channels for message streams
|
||||
- Divided into partitions for parallelism
|
||||
- Can have different retention policies per topic
|
||||
|
||||
**Partitions**:
|
||||
- Ordered, immutable sequence of records
|
||||
- Unit of parallelism (each partition is consumed by at most one consumer within a group)
|
||||
- Distributed across brokers for load balancing
|
||||
|
||||
**Replicas**:
|
||||
- Copies of partitions across multiple brokers
|
||||
- 1 leader replica (serves reads/writes)
|
||||
- N-1 follower replicas (replication only)
|
||||
- In-Sync Replicas (ISR): Followers caught up with leader
|
||||
|
||||
### KRaft vs ZooKeeper Mode
|
||||
|
||||
**KRaft Mode** (Recommended, Kafka 3.3+):
|
||||
```yaml
|
||||
Cluster Metadata:
|
||||
- Stored in Kafka itself (no external ZooKeeper)
|
||||
- Metadata topic: __cluster_metadata
|
||||
- Controller quorum (3 or 5 nodes)
|
||||
- Faster failover (<1s vs 10-30s)
|
||||
- Simplified operations
|
||||
```
|
||||
|
||||
**ZooKeeper Mode** (Legacy, deprecated since 3.5, removed in 4.0):
|
||||
```yaml
|
||||
External Coordination:
|
||||
- Requires separate ZooKeeper ensemble (3-5 nodes)
|
||||
- Stores cluster metadata, configs, ACLs
|
||||
- Slower failover (10-30 seconds)
|
||||
- More complex to operate
|
||||
```
|
||||
|
||||
**Migration**: ZooKeeper → KRaft migration supported in Kafka 3.6+
|
||||
|
||||
## Cluster Sizing Guidelines
|
||||
|
||||
### Small Cluster (Development/Testing)
|
||||
|
||||
```yaml
|
||||
Configuration:
|
||||
Brokers: 3
|
||||
Partitions per broker: ~100-500
|
||||
Total partitions: 300-1500
|
||||
Replication factor: 3
|
||||
Hardware:
|
||||
- CPU: 4-8 cores
|
||||
- RAM: 8-16 GB
|
||||
- Disk: 500 GB - 1 TB SSD
|
||||
- Network: 1 Gbps
|
||||
|
||||
Use Cases:
|
||||
- Development environments
|
||||
- Low-volume production (<10 MB/s)
|
||||
- Proof of concepts
|
||||
- Single datacenter
|
||||
|
||||
Example Workload:
|
||||
- 50 topics
|
||||
- 5-10 partitions per topic
|
||||
- 1 million messages/day
|
||||
- 7-day retention
|
||||
```
|
||||
|
||||
### Medium Cluster (Standard Production)
|
||||
|
||||
```yaml
|
||||
Configuration:
|
||||
Brokers: 6-12
|
||||
Partitions per broker: 500-2000
|
||||
Total partitions: 3K-24K
|
||||
Replication factor: 3
|
||||
Hardware:
|
||||
- CPU: 16-32 cores
|
||||
- RAM: 64-128 GB
|
||||
- Disk: 2-8 TB NVMe SSD
|
||||
- Network: 10 Gbps
|
||||
|
||||
Use Cases:
|
||||
- Standard production workloads
|
||||
- Multi-team environments
|
||||
- Regional deployments
|
||||
- Up to 500 MB/s throughput
|
||||
|
||||
Example Workload:
|
||||
- 200-500 topics
|
||||
- 10-50 partitions per topic
|
||||
- 100 million messages/day
|
||||
- 30-day retention
|
||||
```
|
||||
|
||||
### Large Cluster (High-Scale Production)
|
||||
|
||||
```yaml
|
||||
Configuration:
|
||||
Brokers: 20-100+
|
||||
Partitions per broker: 2000-4000
|
||||
Total partitions: 40K-400K+
|
||||
Replication factor: 3
|
||||
Hardware:
|
||||
- CPU: 32-64 cores
|
||||
- RAM: 128-256 GB
|
||||
- Disk: 8-20 TB NVMe SSD
|
||||
- Network: 25-100 Gbps
|
||||
|
||||
Use Cases:
|
||||
- Large enterprises
|
||||
- Multi-region deployments
|
||||
- Event-driven architectures
|
||||
- 1+ GB/s throughput
|
||||
|
||||
Example Workload:
|
||||
- 1000+ topics
|
||||
- 50-200 partitions per topic
|
||||
- 1+ billion messages/day
|
||||
- 90-365 day retention
|
||||
```
|
||||
|
||||
### Kafka Streams / Exactly-Once Semantics (EOS) Clusters
|
||||
|
||||
```yaml
|
||||
Configuration:
|
||||
Brokers: 6-12+ (same as standard, but more control plane load)
|
||||
Partitions per broker: 500-1500 (fewer due to transaction overhead)
|
||||
Total partitions: 3K-18K
|
||||
Replication factor: 3
|
||||
Hardware:
|
||||
- CPU: 16-32 cores (more CPU for transactions)
|
||||
- RAM: 64-128 GB
|
||||
- Disk: 4-12 TB NVMe SSD (more for transaction logs)
|
||||
- Network: 10-25 Gbps
|
||||
|
||||
Special Considerations:
|
||||
- More brokers due to transaction coordinator load
|
||||
- Lower partition count per broker (transactions = more overhead)
|
||||
- Higher disk IOPS for transaction logs
|
||||
- min.insync.replicas=2 mandatory for EOS
|
||||
- acks=all required for producers
|
||||
|
||||
Use Cases:
|
||||
- Stream processing with exactly-once guarantees
|
||||
- Financial transactions
|
||||
- Event sourcing with strict ordering
|
||||
- Multi-step workflows requiring atomicity
|
||||
```
|
||||
|
||||
## Partitioning Strategy
|
||||
|
||||
### How Many Partitions?
|
||||
|
||||
**Formula**:
|
||||
```
|
||||
Partitions = max(
|
||||
Target Throughput / Single Partition Throughput,
|
||||
Number of Consumers (for parallelism),
|
||||
Future Growth Factor (2-3x)
|
||||
)
|
||||
|
||||
Single Partition Limits:
|
||||
- Write throughput: ~10-50 MB/s
|
||||
- Read throughput: ~30-100 MB/s
|
||||
- Message rate: ~10K-100K msg/s
|
||||
```
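
A quick scripted version of this rule works as a sanity check. A minimal sketch, assuming the mid-range per-partition figures above (20 MB/s write, 40 MB/s read); substitute your own measured numbers:

```bash
# Partition count = max(write-based, read-based) × growth factor
target_write_mbps=200
target_read_mbps=500
growth_factor=3

write_parts=$(( (target_write_mbps + 19) / 20 ))   # ceil(200 / 20) = 10
read_parts=$(( (target_read_mbps + 39) / 40 ))     # ceil(500 / 40) = 13
base=$(( write_parts > read_parts ? write_parts : read_parts ))

echo "Recommended partitions: $(( base * growth_factor ))"   # 39 -> round up to 40+
```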
|
||||
|
||||
**Examples**:
|
||||
|
||||
**High Throughput Topic** (Logs, Events):
|
||||
```yaml
|
||||
Requirements:
|
||||
- Write: 200 MB/s
|
||||
- Read: 500 MB/s (multiple consumers)
|
||||
- Expected growth: 3x in 1 year
|
||||
|
||||
Calculation:
|
||||
Write partitions: 200 MB/s ÷ 20 MB/s = 10
|
||||
Read partitions: 500 MB/s ÷ 40 MB/s = 13
|
||||
Growth factor: 13 × 3 = 39
|
||||
|
||||
Recommendation: 40-50 partitions
|
||||
```
|
||||
|
||||
**Low-Latency Topic** (Commands, Requests):
|
||||
```yaml
|
||||
Requirements:
|
||||
- Write: 5 MB/s
|
||||
- Read: 10 MB/s
|
||||
- Latency: <10ms p99
|
||||
- Order preservation: By user ID
|
||||
|
||||
Calculation:
|
||||
Throughput partitions: 5 MB/s ÷ 20 MB/s = 1
|
||||
Parallelism: 4 (headroom for consumer parallelism)
|
||||
|
||||
Recommendation: 4-6 partitions (keyed by user ID)
|
||||
```
|
||||
|
||||
**Dead Letter Queue**:
|
||||
```yaml
|
||||
Recommendation: 1-3 partitions
|
||||
Reason: Low volume, order less important
|
||||
```
|
||||
|
||||
### Partition Key Selection
|
||||
|
||||
**Good Keys** (High Cardinality, Even Distribution):
|
||||
```yaml
|
||||
✅ User ID (UUIDs):
|
||||
- Millions of unique values
|
||||
- Even distribution
|
||||
- Example: "user-123e4567-e89b-12d3-a456-426614174000"
|
||||
|
||||
✅ Device ID (IoT):
|
||||
- Unique per device
|
||||
- Natural sharding
|
||||
- Example: "device-sensor-001-zone-a"
|
||||
|
||||
✅ Order ID (E-commerce):
|
||||
- Unique per transaction
|
||||
- Even temporal distribution
|
||||
- Example: "order-2024-11-15-abc123"
|
||||
```
|
||||
|
||||
**Bad Keys** (Low Cardinality, Hotspots):
|
||||
```yaml
|
||||
❌ Country Code:
|
||||
- Only ~200 values
|
||||
- Uneven (US, CN >> others)
|
||||
- Creates partition hotspots
|
||||
|
||||
❌ Boolean Flags:
|
||||
- Only 2 values (true/false)
|
||||
- Severe imbalance
|
||||
|
||||
❌ Date (YYYY-MM-DD):
|
||||
- All today's traffic → 1 partition
|
||||
- Temporal hotspot
|
||||
```
|
||||
|
||||
**Compound Keys** (Best of Both):
|
||||
```yaml
|
||||
✅ Country + User ID:
|
||||
- Partition by country for locality
|
||||
- Sub-partition by user for distribution
|
||||
- Example: "US:user-123" → hash("US:user-123")
|
||||
|
||||
✅ Tenant + Event Type + Timestamp:
|
||||
- Multi-tenant isolation
|
||||
- Event type grouping
|
||||
- Temporal ordering
|
||||
```
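
To see a compound key in practice, the kcat flags covered in the kafka-cli-tools skill are enough. A sketch assuming a local broker, an existing `orders` topic, and an illustrative `country:userId` key format:

```bash
# Produce a few messages keyed by "country:userId"; hashing the compound key
# spreads traffic across partitions even though country alone is low-cardinality.
for user in 101 102 103 104; do
  echo "US:user-${user}|{\"orderId\":${user}}" \
    | kcat -P -b localhost:9092 -t orders -K"|"
done

# Show which partition each key landed on (-e exits at end of topic)
kcat -C -b localhost:9092 -t orders -o beginning -e -f 'partition=%p key=%k\n'
```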
|
||||
|
||||
## Replication & High Availability
|
||||
|
||||
### Replication Factor Guidelines
|
||||
|
||||
```yaml
|
||||
Development:
|
||||
Replication Factor: 1
|
||||
Reason: Fast, no durability needed
|
||||
|
||||
Production (Standard):
|
||||
Replication Factor: 3
|
||||
Reason: Balance durability vs cost
|
||||
Tolerates: 1 broker failure while remaining writable (with min.insync.replicas=2); data survives up to 2 failures
|
||||
|
||||
Production (Critical):
|
||||
Replication Factor: 5
|
||||
Reason: Maximum durability
|
||||
Tolerates: 2 broker failures while remaining writable (with min.insync.replicas=3); data survives up to 4 failures
|
||||
Use Cases: Financial transactions, audit logs
|
||||
|
||||
Multi-Datacenter:
|
||||
Replication Factor: 3 per DC (6 total)
|
||||
Reason: DC-level fault tolerance
|
||||
Requires: MirrorMaker 2 or Confluent Replicator
|
||||
```
|
||||
|
||||
### min.insync.replicas
|
||||
|
||||
**Configuration**:
|
||||
```yaml
|
||||
min.insync.replicas=2:
|
||||
- At least 2 replicas must acknowledge writes
|
||||
- Typical for replication.factor=3
|
||||
- Prevents data loss if 1 broker fails
|
||||
|
||||
min.insync.replicas=1:
|
||||
- Only leader must acknowledge (dangerous!)
|
||||
- Use only for non-critical topics
|
||||
|
||||
min.insync.replicas=3:
|
||||
- At least 3 replicas must acknowledge
|
||||
- For replication.factor=5 (critical systems)
|
||||
```
|
||||
|
||||
**Rule**: `min.insync.replicas ≤ replication.factor - 1` (to allow 1 replica failure)
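
As an example, the override can be applied per topic with the stock `kafka-configs.sh` tool (the `payments` topic and broker address are placeholders):

```bash
# Set min.insync.replicas=2 on a topic whose replication factor is 3
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name payments \
  --alter --add-config min.insync.replicas=2

# Verify the override
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name payments --describe
```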
|
||||
|
||||
### Rack Awareness
|
||||
|
||||
```yaml
|
||||
Configuration:
|
||||
broker.rack=rack1 # Broker 1
|
||||
broker.rack=rack2 # Broker 2
|
||||
broker.rack=rack3 # Broker 3
|
||||
|
||||
Benefit:
|
||||
- Replicas spread across racks
|
||||
- Survives rack-level failures (power, network)
|
||||
- Example: Topic with RF=3 → 1 replica per rack
|
||||
|
||||
Placement:
|
||||
Leader: rack1
|
||||
Follower 1: rack2
|
||||
Follower 2: rack3
|
||||
```
|
||||
|
||||
## Retention Strategies
|
||||
|
||||
### Time-Based Retention
|
||||
|
||||
```yaml
|
||||
Short-Term (Events, Logs):
|
||||
retention.ms: 86400000 # 1 day
|
||||
Use Cases: Real-time analytics, monitoring
|
||||
|
||||
Medium-Term (Transactions):
|
||||
retention.ms: 604800000 # 7 days
|
||||
Use Cases: Standard business events
|
||||
|
||||
Long-Term (Audit, Compliance):
|
||||
retention.ms: 31536000000 # 365 days
|
||||
Use Cases: Regulatory requirements, event sourcing
|
||||
|
||||
Infinite (Event Sourcing):
|
||||
retention.ms: -1 # Forever
|
||||
cleanup.policy: compact
|
||||
Use Cases: Source of truth, state rebuilding
|
||||
```
|
||||
|
||||
### Size-Based Retention
|
||||
|
||||
```yaml
|
||||
retention.bytes: 10737418240 # 10 GB per partition
|
||||
|
||||
Combined (Time OR Size):
|
||||
retention.ms: 604800000 # 7 days
|
||||
retention.bytes: 107374182400 # 100 GB
|
||||
# Whichever limit is reached first
|
||||
```
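
Retention can also be changed on a live topic without recreating it; a sketch using the stock `kafka-configs.sh` tool (topic name is a placeholder):

```bash
# Apply the combined 7-day / 100 GB policy to an existing topic
kafka-configs.sh --bootstrap-server localhost:9092 \
  --entity-type topics --entity-name events \
  --alter --add-config retention.ms=604800000,retention.bytes=107374182400
```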
|
||||
|
||||
### Compaction (Log Compaction)
|
||||
|
||||
```yaml
|
||||
cleanup.policy: compact
|
||||
|
||||
How It Works:
|
||||
- Keeps only latest value per key
|
||||
- Deletes old versions
|
||||
- Preserves full history initially, compacts later
|
||||
|
||||
Use Cases:
|
||||
- Database changelogs (CDC)
|
||||
- User profile updates
|
||||
- Configuration management
|
||||
- State stores
|
||||
|
||||
Example:
|
||||
Before Compaction:
|
||||
user:123 → {name: "Alice", v:1}
|
||||
user:123 → {name: "Alice", v:2, email: "alice@ex.com"}
|
||||
user:123 → {name: "Alice A.", v:3}
|
||||
|
||||
After Compaction:
|
||||
user:123 → {name: "Alice A.", v:3} # Latest only
|
||||
```
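
A compacted changelog topic can be created up front with standard tooling; a sketch (topic name and compaction tuning values are illustrative):

```bash
# Create a compacted topic for latest-value-per-key data (e.g., user profiles)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --create --topic user-profiles \
  --partitions 12 --replication-factor 3 \
  --config cleanup.policy=compact \
  --config min.cleanable.dirty.ratio=0.1 \
  --config segment.ms=600000
```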
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### Broker Configuration
|
||||
|
||||
```yaml
|
||||
# Network threads (handle client connections)
|
||||
num.network.threads: 8 # Increase for high connection count
|
||||
|
||||
# I/O threads (disk operations)
|
||||
num.io.threads: 16 # Set to number of disks × 2
|
||||
|
||||
# Replica fetcher threads
|
||||
num.replica.fetchers: 4 # Increase for many partitions
|
||||
|
||||
# Socket buffer sizes
|
||||
socket.send.buffer.bytes: 1048576 # 1 MB
|
||||
socket.receive.buffer.bytes: 1048576 # 1 MB
|
||||
|
||||
# Log flush (default: OS handles flushing)
|
||||
log.flush.interval.messages: 10000 # Flush every 10K messages
|
||||
log.flush.interval.ms: 1000 # Or every 1 second
|
||||
```
|
||||
|
||||
### Producer Optimization
|
||||
|
||||
```yaml
|
||||
High Throughput:
|
||||
batch.size: 65536 # 64 KB
|
||||
linger.ms: 100 # Wait 100ms for batching
|
||||
compression.type: lz4 # Fast compression
|
||||
acks: 1 # Leader only
|
||||
|
||||
Low Latency:
|
||||
batch.size: 16384 # 16 KB (default)
|
||||
linger.ms: 0 # Send immediately
|
||||
compression.type: none
|
||||
acks: 1
|
||||
|
||||
Durability (Exactly-Once):
|
||||
batch.size: 16384
|
||||
linger.ms: 10
|
||||
compression.type: lz4
|
||||
acks: all
|
||||
enable.idempotence: true
|
||||
transactional.id: "producer-1"
|
||||
```
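
These settings map onto librdkafka client properties, so a profile can be trialled from the shell with kcat before touching application code. A sketch of the high-throughput profile; it assumes a recent librdkafka that accepts the short property aliases shown (older versions use long-form names such as `queue.buffering.max.ms`):

```bash
# Push 100k small messages with the high-throughput settings
seq 1 100000 | kcat -P -b localhost:9092 -t perf-test \
  -X batch.size=65536 \
  -X linger.ms=100 \
  -X compression.type=lz4 \
  -X acks=1
```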
|
||||
|
||||
### Consumer Optimization
|
||||
|
||||
```yaml
|
||||
High Throughput:
|
||||
fetch.min.bytes: 1048576 # 1 MB
|
||||
fetch.max.wait.ms: 500 # Wait 500ms to accumulate
|
||||
|
||||
Low Latency:
|
||||
fetch.min.bytes: 1 # Immediate fetch
|
||||
fetch.max.wait.ms: 100 # Short wait
|
||||
|
||||
Max Parallelism:
|
||||
# Deploy consumers = number of partitions
|
||||
# More consumers than partitions = idle consumers
|
||||
```
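
The parallelism rule can be checked against a running group with the stock consumer-groups tool (group name `my-app` is a placeholder); members without assigned partitions indicate more consumers than partitions:

```bash
kafka-consumer-groups.sh --bootstrap-server localhost:9092 \
  --describe --group my-app --members --verbose
```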
|
||||
|
||||
## Multi-Datacenter Patterns
|
||||
|
||||
### Active-Passive (Disaster Recovery)
|
||||
|
||||
```yaml
|
||||
Architecture:
|
||||
Primary DC: Full Kafka cluster
|
||||
Secondary DC: Replica cluster (MirrorMaker 2)
|
||||
|
||||
Configuration:
|
||||
- Producers → Primary only
|
||||
- Consumers → Primary only
|
||||
- MirrorMaker 2: Primary → Secondary (async replication)
|
||||
|
||||
Failover:
|
||||
1. Detect primary failure
|
||||
2. Switch producers/consumers to secondary
|
||||
3. Promote secondary to primary
|
||||
|
||||
Recovery Time: 5-30 minutes (manual)
|
||||
Data Loss: Potential (async replication lag)
|
||||
```
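
A minimal MirrorMaker 2 configuration for this topology might look like the sketch below; the cluster aliases, bootstrap addresses, and the catch-all topic filter are assumptions to adapt:

```bash
cat > mm2.properties <<'EOF'
clusters = primary, dr
primary.bootstrap.servers = primary-kafka:9092
dr.bootstrap.servers = dr-kafka:9092

# One-way, async replication: primary -> dr
primary->dr.enabled = true
primary->dr.topics = .*
replication.factor = 3
EOF

# connect-mirror-maker.sh ships with Apache Kafka (bin/)
connect-mirror-maker.sh mm2.properties
```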
|
||||
|
||||
### Active-Active (Geo-Replication)
|
||||
|
||||
```yaml
|
||||
Architecture:
|
||||
DC1: Kafka cluster (region A)
|
||||
DC2: Kafka cluster (region B)
|
||||
Bidirectional replication via MirrorMaker 2
|
||||
|
||||
Configuration:
|
||||
- Producers → Nearest DC
|
||||
- Consumers → Nearest DC or both
|
||||
- Conflict resolution: Last-write-wins or custom
|
||||
|
||||
Challenges:
|
||||
- Duplicate messages (at-least-once delivery)
|
||||
- Ordering across DCs not guaranteed
|
||||
- Circular replication prevention
|
||||
|
||||
Use Cases:
|
||||
- Global applications
|
||||
- Regional compliance (GDPR)
|
||||
- Load distribution
|
||||
```
|
||||
|
||||
### Stretch Cluster (Synchronous Replication)
|
||||
|
||||
```yaml
|
||||
Architecture:
|
||||
Single Kafka cluster spanning 2 DCs
|
||||
Rack awareness: DC1 = rack1, DC2 = rack2
|
||||
|
||||
Configuration:
|
||||
min.insync.replicas: 2
|
||||
replication.factor: 4 (2 per DC)
|
||||
acks: all
|
||||
|
||||
Requirements:
|
||||
- Low latency between DCs (<10ms)
|
||||
- High bandwidth link (10+ Gbps)
|
||||
- Dedicated fiber
|
||||
|
||||
Trade-offs:
|
||||
Pros: Synchronous replication, zero data loss
|
||||
Cons: Latency penalty, network dependency
|
||||
```
|
||||
|
||||
## Monitoring & Observability
|
||||
|
||||
### Key Metrics
|
||||
|
||||
**Broker Metrics**:
|
||||
```yaml
|
||||
UnderReplicatedPartitions:
|
||||
Alert: > 0 for > 5 minutes
|
||||
Indicates: Replica lag, broker failure
|
||||
|
||||
OfflinePartitionsCount:
|
||||
Alert: > 0
|
||||
Indicates: No leader elected (critical!)
|
||||
|
||||
ActiveControllerCount:
|
||||
Alert: != 1 (should be exactly 1)
|
||||
Indicates: Split brain or no controller
|
||||
|
||||
RequestHandlerAvgIdlePercent:
|
||||
Alert: < 20%
|
||||
Indicates: Broker CPU saturation
|
||||
```
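
For ad-hoc checks of the first two alerts, the stock topic tool exposes matching flags (broker address is a placeholder):

```bash
# Partitions whose ISR is smaller than the replica set (should print nothing)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --under-replicated-partitions

# Partitions with no active leader (any output is critical)
kafka-topics.sh --bootstrap-server localhost:9092 \
  --describe --unavailable-partitions
```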
|
||||
|
||||
**Topic Metrics**:
|
||||
```yaml
|
||||
MessagesInPerSec:
|
||||
Monitor: Throughput trends
|
||||
Alert: Sudden drops (producer failure)
|
||||
|
||||
BytesInPerSec / BytesOutPerSec:
|
||||
Monitor: Network utilization
|
||||
Alert: Approaching NIC limits
|
||||
|
||||
RecordsLagMax (Consumer):
|
||||
Alert: > 10000 or growing
|
||||
Indicates: Consumer can't keep up
|
||||
```
|
||||
|
||||
**Disk Metrics**:
|
||||
```yaml
|
||||
LogSegmentSize:
|
||||
Monitor: Disk usage trends
|
||||
Alert: > 80% capacity
|
||||
|
||||
LogFlushRateAndTimeMs:
|
||||
Monitor: Disk write latency
|
||||
Alert: > 100ms p99 (slow disk)
|
||||
```
|
||||
|
||||
## Security Patterns
|
||||
|
||||
### Authentication & Authorization
|
||||
|
||||
```yaml
|
||||
SASL/SCRAM-SHA-512:
|
||||
- Industry standard
|
||||
- User/password authentication
|
||||
- Stored in ZooKeeper/KRaft
|
||||
|
||||
ACLs (Access Control Lists):
|
||||
- Per-topic, per-group permissions
|
||||
- Operations: READ, WRITE, CREATE, DELETE, ALTER
|
||||
- Example:
|
||||
bin/kafka-acls.sh --add \
|
||||
--allow-principal User:alice \
|
||||
--operation READ \
|
||||
--topic orders
|
||||
|
||||
mTLS (Mutual TLS):
|
||||
- Certificate-based auth
|
||||
- Strong cryptographic identity
|
||||
- Best for service-to-service
|
||||
```
|
||||
|
||||
## Integration with SpecWeave
|
||||
|
||||
**Automatic Architecture Detection**:
|
||||
```typescript
|
||||
import { ClusterSizingCalculator } from './lib/utils/sizing';
|
||||
|
||||
const calculator = new ClusterSizingCalculator();
|
||||
const recommendation = calculator.calculate({
|
||||
throughputMBps: 200,
|
||||
retentionDays: 30,
|
||||
replicationFactor: 3,
|
||||
topicCount: 100
|
||||
});
|
||||
|
||||
console.log(recommendation);
|
||||
// {
|
||||
// brokers: 8,
|
||||
// partitionsPerBroker: 1500,
|
||||
// diskPerBroker: 6000 GB,
|
||||
// ramPerBroker: 64 GB
|
||||
// }
|
||||
```
|
||||
|
||||
**SpecWeave Commands**:
|
||||
- `/specweave-kafka:deploy` - Validates cluster sizing before deployment
|
||||
- `/specweave-kafka:monitor-setup` - Configures metrics for key indicators
|
||||
|
||||
## Related Skills
|
||||
|
||||
- `/specweave-kafka:kafka-mcp-integration` - MCP server setup
|
||||
- `/specweave-kafka:kafka-cli-tools` - CLI operations
|
||||
|
||||
## External Links
|
||||
|
||||
- [Kafka Documentation - Architecture](https://kafka.apache.org/documentation/#design)
|
||||
- [Confluent - Kafka Sizing](https://www.confluent.io/blog/how-to-choose-the-number-of-topics-partitions-in-a-kafka-cluster/)
|
||||
- [KRaft Mode Overview](https://kafka.apache.org/documentation/#kraft)
|
||||
- [LinkedIn Engineering - Kafka at Scale](https://engineering.linkedin.com/kafka/running-kafka-scale)
|
||||
433
skills/kafka-cli-tools/SKILL.md
Normal file
433
skills/kafka-cli-tools/SKILL.md
Normal file
@@ -0,0 +1,433 @@
|
||||
---
|
||||
name: kafka-cli-tools
|
||||
description: Expert knowledge of Kafka CLI tools (kcat, kcli, kaf, kafkactl). Auto-activates on keywords kcat, kafkacat, kcli, kaf, kafkactl, kafka cli, kafka command line, produce message, consume topic, list topics, kafka metadata. Provides command examples, installation guides, and tool comparisons.
|
||||
---
|
||||
|
||||
# Kafka CLI Tools Expert
|
||||
|
||||
Comprehensive knowledge of modern Kafka CLI tools for production operations, development, and troubleshooting.
|
||||
|
||||
## Supported CLI Tools
|
||||
|
||||
### 1. kcat (kafkacat) - The Swiss Army Knife
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
# macOS
|
||||
brew install kcat
|
||||
|
||||
# Ubuntu/Debian
|
||||
apt-get install kafkacat   # packaged as "kcat" on newer Debian/Ubuntu releases
|
||||
|
||||
# From source
|
||||
git clone https://github.com/edenhill/kcat.git
|
||||
cd kcat
|
||||
./configure && make && sudo make install
|
||||
```
|
||||
|
||||
**Core Operations**:
|
||||
|
||||
**Produce Messages**:
|
||||
```bash
|
||||
# Simple produce
|
||||
echo "Hello Kafka" | kcat -P -b localhost:9092 -t my-topic
|
||||
|
||||
# Produce with key (key:value format)
|
||||
echo "user123:Login event" | kcat -P -b localhost:9092 -t events -K:
|
||||
|
||||
# Produce from file
|
||||
cat events.json | kcat -P -b localhost:9092 -t events
|
||||
|
||||
# Produce with headers
|
||||
echo "msg" | kcat -P -b localhost:9092 -t my-topic -H "source=app1" -H "version=1.0"
|
||||
|
||||
# Produce with compression
|
||||
echo "data" | kcat -P -b localhost:9092 -t my-topic -z gzip
|
||||
|
||||
# Produce with acks=all
|
||||
echo "critical-data" | kcat -P -b localhost:9092 -t my-topic -X acks=all
|
||||
```
|
||||
|
||||
**Consume Messages**:
|
||||
```bash
|
||||
# Consume from beginning
|
||||
kcat -C -b localhost:9092 -t my-topic -o beginning
|
||||
|
||||
# Consume from end (latest)
|
||||
kcat -C -b localhost:9092 -t my-topic -o end
|
||||
|
||||
# Consume specific partition
|
||||
kcat -C -b localhost:9092 -t my-topic -p 0 -o beginning
|
||||
|
||||
# Consume with consumer group
|
||||
kcat -C -b localhost:9092 -G my-group my-topic
|
||||
|
||||
# Consume N messages and exit
|
||||
kcat -C -b localhost:9092 -t my-topic -c 10
|
||||
|
||||
# Custom format (topic:partition:offset:key:value)
|
||||
kcat -C -b localhost:9092 -t my-topic -f 'Topic: %t, Partition: %p, Offset: %o, Key: %k, Value: %s\n'
|
||||
|
||||
# JSON output
|
||||
kcat -C -b localhost:9092 -t my-topic -J
|
||||
```
|
||||
|
||||
**Metadata & Admin**:
|
||||
```bash
|
||||
# List all topics
|
||||
kcat -L -b localhost:9092
|
||||
|
||||
# Get topic metadata (JSON)
|
||||
kcat -L -b localhost:9092 -t my-topic -J
|
||||
|
||||
# Query topic offsets
|
||||
kcat -Q -b localhost:9092 -t my-topic
|
||||
|
||||
# Check broker health
|
||||
kcat -L -b localhost:9092 | grep "broker\|topic"
|
||||
```
|
||||
|
||||
**SASL/SSL Authentication**:
|
||||
```bash
|
||||
# SASL/PLAINTEXT
|
||||
kcat -b localhost:9092 \
|
||||
-X security.protocol=SASL_PLAINTEXT \
|
||||
-X sasl.mechanism=PLAIN \
|
||||
-X sasl.username=admin \
|
||||
-X sasl.password=admin-secret \
|
||||
-L
|
||||
|
||||
# SASL/SSL
|
||||
kcat -b localhost:9093 \
|
||||
-X security.protocol=SASL_SSL \
|
||||
-X sasl.mechanism=SCRAM-SHA-256 \
|
||||
-X sasl.username=admin \
|
||||
-X sasl.password=admin-secret \
|
||||
-X ssl.ca.location=/path/to/ca-cert \
|
||||
-L
|
||||
|
||||
# mTLS (mutual TLS)
|
||||
kcat -b localhost:9093 \
|
||||
-X security.protocol=SSL \
|
||||
-X ssl.ca.location=/path/to/ca-cert \
|
||||
-X ssl.certificate.location=/path/to/client-cert.pem \
|
||||
-X ssl.key.location=/path/to/client-key.pem \
|
||||
-L
|
||||
```
|
||||
|
||||
### 2. kcli - Kubernetes-Native Kafka CLI
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
# Install via krew (Kubernetes plugin manager)
|
||||
kubectl krew install kcli
|
||||
|
||||
# Or download binary
|
||||
curl -LO https://github.com/cswank/kcli/releases/latest/download/kcli-linux-amd64
|
||||
chmod +x kcli-linux-amd64
|
||||
sudo mv kcli-linux-amd64 /usr/local/bin/kcli
|
||||
```
|
||||
|
||||
**Kubernetes Integration**:
|
||||
```bash
|
||||
# Connect to Kafka running in k8s
|
||||
kcli --context my-cluster --namespace kafka
|
||||
|
||||
# Produce to topic in k8s
|
||||
echo "msg" | kcli produce --topic my-topic --brokers kafka-broker:9092
|
||||
|
||||
# Consume from k8s Kafka
|
||||
kcli consume --topic my-topic --brokers kafka-broker:9092 --from-beginning
|
||||
|
||||
# List topics in k8s cluster
|
||||
kcli topics list --brokers kafka-broker:9092
|
||||
```
|
||||
|
||||
**Best For**:
|
||||
- Kubernetes-native deployments
|
||||
- Helmfile/Kustomize workflows
|
||||
- GitOps with ArgoCD/Flux
|
||||
|
||||
### 3. kaf - Modern Terminal UI
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
# macOS
|
||||
brew install kaf
|
||||
|
||||
# Linux (via snap)
|
||||
snap install kaf
|
||||
|
||||
# From source
|
||||
go install github.com/birdayz/kaf/cmd/kaf@latest
|
||||
```
|
||||
|
||||
**Interactive Features**:
|
||||
```bash
|
||||
# Configure cluster
|
||||
kaf config add-cluster local --brokers localhost:9092
|
||||
|
||||
# Use cluster
|
||||
kaf config use-cluster local
|
||||
|
||||
# Interactive topic browsing (TUI)
|
||||
kaf topics
|
||||
|
||||
# Interactive consume (arrow keys to navigate)
|
||||
kaf consume my-topic
|
||||
|
||||
# Produce interactively
|
||||
kaf produce my-topic
|
||||
|
||||
# Consumer group management
|
||||
kaf groups
|
||||
kaf group describe my-group
|
||||
kaf group reset my-group --topic my-topic --offset earliest
|
||||
|
||||
# Schema Registry integration
|
||||
kaf schemas
|
||||
kaf schema get my-schema
|
||||
```
|
||||
|
||||
**Best For**:
|
||||
- Development workflows
|
||||
- Quick topic exploration
|
||||
- Consumer group debugging
|
||||
- Schema Registry management
|
||||
|
||||
### 4. kafkactl - Advanced Admin Tool
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
# macOS
|
||||
brew install deviceinsight/packages/kafkactl
|
||||
|
||||
# Linux
|
||||
curl -L https://github.com/deviceinsight/kafkactl/releases/latest/download/kafkactl_linux_amd64 -o kafkactl
|
||||
chmod +x kafkactl
|
||||
sudo mv kafkactl /usr/local/bin/
|
||||
|
||||
# Via Docker
|
||||
docker run --rm -it deviceinsight/kafkactl:latest
|
||||
```
|
||||
|
||||
**Advanced Operations**:
|
||||
```bash
|
||||
# Configure context
|
||||
kafkactl config add-context local --brokers localhost:9092
|
||||
|
||||
# Topic management
|
||||
kafkactl create topic my-topic --partitions 3 --replication-factor 2
|
||||
kafkactl alter topic my-topic --config retention.ms=86400000
|
||||
kafkactl delete topic my-topic
|
||||
|
||||
# Consumer group operations
|
||||
kafkactl describe consumer-group my-group
|
||||
kafkactl reset consumer-group my-group --topic my-topic --offset earliest
|
||||
kafkactl delete consumer-group my-group
|
||||
|
||||
# ACL management
|
||||
kafkactl create acl --allow --principal User:alice --operation READ --topic my-topic
|
||||
kafkactl list acls
|
||||
|
||||
# Quota management
|
||||
kafkactl alter client-quota --user alice --producer-byte-rate 1048576
|
||||
|
||||
# Reassign partitions
|
||||
kafkactl alter partition --topic my-topic --partition 0 --replicas 1,2,3
|
||||
```
|
||||
|
||||
**Best For**:
|
||||
- Production cluster management
|
||||
- ACL administration
|
||||
- Partition reassignment
|
||||
- Quota management
|
||||
|
||||
## Tool Comparison Matrix
|
||||
|
||||
| Feature | kcat | kcli | kaf | kafkactl |
|
||||
|---------|------|------|-----|----------|
|
||||
| **Installation** | Easy | Medium | Easy | Easy |
|
||||
| **Produce** | ✅ Advanced | ✅ Basic | ✅ Interactive | ✅ Basic |
|
||||
| **Consume** | ✅ Advanced | ✅ Basic | ✅ Interactive | ✅ Basic |
|
||||
| **Metadata** | ✅ JSON | ✅ Basic | ✅ TUI | ✅ Detailed |
|
||||
| **TUI** | ❌ | ❌ | ✅ | ✅ Limited |
|
||||
| **Admin** | ❌ | ❌ | ⚠️ Limited | ✅ Advanced |
|
||||
| **SASL/SSL** | ✅ | ✅ | ✅ | ✅ |
|
||||
| **K8s Native** | ❌ | ✅ | ❌ | ❌ |
|
||||
| **Schema Reg** | ❌ | ❌ | ✅ | ❌ |
|
||||
| **ACLs** | ❌ | ❌ | ❌ | ✅ |
|
||||
| **Quotas** | ❌ | ❌ | ❌ | ✅ |
|
||||
| **Best For** | Scripting, ops | Kubernetes | Development | Production admin |
|
||||
|
||||
## Common Patterns
|
||||
|
||||
### 1. Topic Creation with Optimal Settings
|
||||
|
||||
```bash
|
||||
# Using kafkactl (recommended for production)
|
||||
kafkactl create topic orders \
|
||||
--partitions 12 \
|
||||
--replication-factor 3 \
|
||||
--config retention.ms=604800000 \
|
||||
--config compression.type=lz4 \
|
||||
--config min.insync.replicas=2
|
||||
|
||||
# Verify with kcat
|
||||
kcat -L -b localhost:9092 -t orders -J | jq '.topics[0]'
|
||||
```
|
||||
|
||||
### 2. Dead Letter Queue Pattern
|
||||
|
||||
```bash
|
||||
# Produce failed message to DLQ
|
||||
echo "failed-msg" | kcat -P -b localhost:9092 -t orders-dlq \
|
||||
-H "original-topic=orders" \
|
||||
-H "error=DeserializationException" \
|
||||
-H "timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
|
||||
|
||||
# Monitor DLQ
|
||||
kcat -C -b localhost:9092 -t orders-dlq -f 'Headers: %h\nValue: %s\n\n'
|
||||
```
|
||||
|
||||
### 3. Consumer Group Lag Monitoring
|
||||
|
||||
```bash
|
||||
# Using kafkactl
|
||||
kafkactl describe consumer-group my-app | grep LAG
|
||||
|
||||
# Using kcat (metadata and partition offsets only; kcat cannot report committed group offsets/lag)
|
||||
kcat -L -b localhost:9092 -J | jq '.topics[] | select(.topic=="my-topic") | .partitions[]'
|
||||
|
||||
# Using kaf (interactive)
|
||||
kaf groups
|
||||
# Then select group to see lag in TUI
|
||||
```
|
||||
|
||||
### 4. Multi-Cluster Replication Testing
|
||||
|
||||
```bash
|
||||
# Produce to source cluster
|
||||
echo "test" | kcat -P -b source-kafka:9092 -t replicated-topic
|
||||
|
||||
# Consume from target cluster
|
||||
kcat -C -b target-kafka:9092 -t replicated-topic -o end -c 1
|
||||
|
||||
# Compare offsets
|
||||
kcat -Q -b source-kafka:9092 -t replicated-topic
|
||||
kcat -Q -b target-kafka:9092 -t replicated-topic
|
||||
```
|
||||
|
||||
### 5. Performance Testing
|
||||
|
||||
```bash
|
||||
# Produce 10,000 messages with kcat
|
||||
seq 1 10000 | kcat -P -b localhost:9092 -t perf-test
|
||||
|
||||
# Consume and measure throughput
|
||||
time kcat -C -b localhost:9092 -t perf-test -c 10000 -o beginning > /dev/null
|
||||
|
||||
# Test with compression
|
||||
seq 1 10000 | kcat -P -b localhost:9092 -t perf-test -z lz4
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Connection Issues
|
||||
|
||||
```bash
|
||||
# Test broker connectivity
|
||||
kcat -L -b localhost:9092
|
||||
|
||||
# Check SSL/TLS connection
|
||||
openssl s_client -connect localhost:9093 -showcerts
|
||||
|
||||
# Verify SASL authentication
|
||||
kcat -b localhost:9092 \
|
||||
-X security.protocol=SASL_PLAINTEXT \
|
||||
-X sasl.mechanism=PLAIN \
|
||||
-X sasl.username=admin \
|
||||
-X sasl.password=wrong-password \
|
||||
-L
|
||||
# Should fail with authentication error
|
||||
```
|
||||
|
||||
### Message Not Appearing
|
||||
|
||||
```bash
|
||||
# Check topic exists
|
||||
kcat -L -b localhost:9092 | grep my-topic
|
||||
|
||||
# Check partition count
|
||||
kcat -L -b localhost:9092 -t my-topic -J | jq '.topics[0].partitions | length'
|
||||
|
||||
# Query all partition offsets
|
||||
kcat -Q -b localhost:9092 -t my-topic
|
||||
|
||||
# Consume from all partitions
|
||||
for i in {0..11}; do
|
||||
echo "Partition $i:"
|
||||
kcat -C -b localhost:9092 -t my-topic -p $i -c 1 -o end
|
||||
done
|
||||
```
|
||||
|
||||
### Consumer Group Stuck
|
||||
|
||||
```bash
|
||||
# Check consumer group state
|
||||
kafkactl describe consumer-group my-app
|
||||
|
||||
# Reset to beginning
|
||||
kafkactl reset consumer-group my-app --topic my-topic --offset earliest
|
||||
|
||||
# Reset to specific offset
|
||||
kafkactl reset consumer-group my-app --topic my-topic --partition 0 --offset 12345
|
||||
|
||||
# Delete consumer group (all consumers must be stopped first)
|
||||
kafkactl delete consumer-group my-app
|
||||
```
|
||||
|
||||
## Integration with SpecWeave
|
||||
|
||||
**Automatic CLI Tool Detection**:
|
||||
SpecWeave auto-detects installed CLI tools and recommends best tool for the operation:
|
||||
|
||||
```typescript
|
||||
import { CLIToolDetector } from './lib/cli/detector';
|
||||
|
||||
const detector = new CLIToolDetector();
|
||||
const available = await detector.detectAll();
|
||||
|
||||
// Recommended tool for produce operation
|
||||
if (available.includes('kcat')) {
|
||||
console.log('Use kcat for produce (fastest)');
|
||||
} else if (available.includes('kaf')) {
|
||||
console.log('Use kaf for produce (interactive)');
|
||||
}
|
||||
```
|
||||
|
||||
**SpecWeave Commands**:
|
||||
- `/specweave-kafka:dev-env` - Uses Docker Compose + kcat for local testing
|
||||
- `/specweave-kafka:monitor-setup` - Sets up kcat-based lag monitoring
|
||||
- `/specweave-kafka:mcp-configure` - Validates CLI tools are installed
|
||||
|
||||
## Security Best Practices
|
||||
|
||||
1. **Never hardcode credentials** - Use environment variables or secrets management (see the kcat sketch after this list)
|
||||
2. **Use SSL/TLS in production** - Configure `-X security.protocol=SASL_SSL`
|
||||
3. **Prefer SCRAM over PLAIN** - Use `-X sasl.mechanism=SCRAM-SHA-256`
|
||||
4. **Rotate credentials regularly** - Update passwords and certificates
|
||||
5. **Least privilege** - Grant only necessary ACLs to users
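
A sketch of practice 1 with kcat; the variable names and broker address are illustrative, and the password should be injected by a secrets manager or CI rather than committed anywhere:

```bash
export KAFKA_SASL_USERNAME="svc-orders"
export KAFKA_SASL_PASSWORD="<injected by your secrets manager or CI>"

kcat -b broker.internal:9093 \
  -X security.protocol=SASL_SSL \
  -X sasl.mechanism=SCRAM-SHA-256 \
  -X sasl.username="$KAFKA_SASL_USERNAME" \
  -X sasl.password="$KAFKA_SASL_PASSWORD" \
  -L
```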
|
||||
|
||||
## Related Skills
|
||||
|
||||
- `/specweave-kafka:kafka-mcp-integration` - MCP server setup and configuration
|
||||
- `/specweave-kafka:kafka-architecture` - Cluster design and sizing
|
||||
|
||||
## External Links
|
||||
|
||||
- [kcat GitHub](https://github.com/edenhill/kcat)
|
||||
- [kcli GitHub](https://github.com/cswank/kcli)
|
||||
- [kaf GitHub](https://github.com/birdayz/kaf)
|
||||
- [kafkactl GitHub](https://github.com/deviceinsight/kafkactl)
|
||||
- [Apache Kafka Documentation](https://kafka.apache.org/documentation/)
|
||||
449
skills/kafka-iac-deployment/SKILL.md
Normal file
449
skills/kafka-iac-deployment/SKILL.md
Normal file
@@ -0,0 +1,449 @@
|
||||
---
|
||||
name: kafka-iac-deployment
|
||||
description: Infrastructure as Code (IaC) deployment expert for Apache Kafka. Guides Terraform deployments across Apache Kafka (KRaft mode), AWS MSK, Azure Event Hubs. Activates for terraform, iac, infrastructure as code, deploy kafka, provision kafka, aws msk, azure event hubs, kafka infrastructure, terraform modules, cloud deployment, kafka deployment automation.
|
||||
---
|
||||
|
||||
# Kafka Infrastructure as Code (IaC) Deployment
|
||||
|
||||
Expert guidance for deploying Apache Kafka using Terraform across multiple platforms.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
I activate when you need help with:
|
||||
- **Terraform deployments**: "Deploy Kafka with Terraform", "provision Kafka cluster"
|
||||
- **Platform selection**: "Should I use AWS MSK or self-hosted Kafka?", "compare Kafka platforms"
|
||||
- **Infrastructure planning**: "How to size Kafka infrastructure", "Kafka on AWS vs Azure"
|
||||
- **IaC automation**: "Automate Kafka deployment", "CI/CD for Kafka infrastructure"
|
||||
|
||||
## What I Know
|
||||
|
||||
### Available Terraform Modules
|
||||
|
||||
This plugin provides 3 production-ready Terraform modules:
|
||||
|
||||
#### 1. **Apache Kafka (Self-Hosted, KRaft Mode)**
|
||||
- **Location**: `plugins/specweave-kafka/terraform/apache-kafka/`
|
||||
- **Platform**: AWS EC2 (can adapt to other clouds)
|
||||
- **Architecture**: KRaft mode (no ZooKeeper dependency)
|
||||
- **Features**:
|
||||
- Multi-broker cluster (3-5 brokers recommended)
|
||||
- Security groups with SASL_SSL
|
||||
- IAM roles for S3 backups
|
||||
- CloudWatch metrics and alarms
|
||||
- Auto-scaling group support
|
||||
- Custom VPC and subnet configuration
|
||||
- **Use When**:
|
||||
- ✅ You need full control over Kafka configuration
|
||||
- ✅ Running Kafka 3.6+ (KRaft mode)
|
||||
- ✅ Want to avoid ZooKeeper operational overhead
|
||||
- ✅ Multi-cloud or hybrid deployments
|
||||
- **Variables**:
|
||||
```hcl
|
||||
module "kafka" {
|
||||
source = "../../plugins/specweave-kafka/terraform/apache-kafka"
|
||||
|
||||
environment = "production"
|
||||
broker_count = 3
|
||||
kafka_version = "3.7.0"
|
||||
instance_type = "m5.xlarge"
|
||||
vpc_id = var.vpc_id
|
||||
subnet_ids = var.subnet_ids
|
||||
domain = "example.com"
|
||||
enable_s3_backups = true
|
||||
enable_monitoring = true
|
||||
}
|
||||
```
|
||||
|
||||
#### 2. **AWS MSK (Managed Streaming for Kafka)**
|
||||
- **Location**: `plugins/specweave-kafka/terraform/aws-msk/`
|
||||
- **Platform**: AWS Managed Service
|
||||
- **Features**:
|
||||
- Fully managed Kafka service
|
||||
- IAM authentication + SASL/SCRAM
|
||||
- Auto-scaling (provisioned throughput)
|
||||
- Built-in monitoring (CloudWatch)
|
||||
- Multi-AZ deployment
|
||||
- Encryption in transit and at rest
|
||||
- **Use When**:
|
||||
- ✅ You want AWS to manage Kafka operations
|
||||
- ✅ Need tight AWS integration (IAM, KMS, CloudWatch)
|
||||
- ✅ Prefer operational simplicity over cost
|
||||
- ✅ Running in AWS VPC
|
||||
- **Variables**:
|
||||
```hcl
|
||||
module "msk" {
|
||||
source = "../../plugins/specweave-kafka/terraform/aws-msk"
|
||||
|
||||
cluster_name = "my-kafka-cluster"
|
||||
kafka_version = "3.6.0"
|
||||
number_of_broker_nodes = 3
|
||||
broker_node_instance_type = "kafka.m5.large"
|
||||
|
||||
vpc_id = var.vpc_id
|
||||
subnet_ids = var.private_subnet_ids
|
||||
|
||||
enable_iam_auth = true
|
||||
enable_scram_auth = false
|
||||
enable_auto_scaling = true
|
||||
}
|
||||
```
|
||||
|
||||
#### 3. **Azure Event Hubs (Kafka API)**
|
||||
- **Location**: `plugins/specweave-kafka/terraform/azure-event-hubs/`
|
||||
- **Platform**: Azure Managed Service
|
||||
- **Features**:
|
||||
- Kafka 1.0+ protocol support
|
||||
- Auto-inflate (elastic scaling)
|
||||
- Premium SKU for high throughput
|
||||
- Zone redundancy
|
||||
- Private endpoints (VNet integration)
|
||||
- Event capture to Azure Storage
|
||||
- **Use When**:
|
||||
- ✅ Running on Azure cloud
|
||||
- ✅ Need Kafka-compatible API without Kafka operations
|
||||
- ✅ Want serverless scaling (auto-inflate)
|
||||
- ✅ Integrating with Azure ecosystem
|
||||
- **Variables**:
|
||||
```hcl
|
||||
module "event_hubs" {
|
||||
source = "../../plugins/specweave-kafka/terraform/azure-event-hubs"
|
||||
|
||||
namespace_name = "my-event-hub-ns"
|
||||
resource_group_name = var.resource_group_name
|
||||
location = "eastus"
|
||||
|
||||
sku = "Premium"
|
||||
capacity = 1
|
||||
kafka_enabled = true
|
||||
auto_inflate_enabled = true
|
||||
maximum_throughput_units = 20
|
||||
}
|
||||
```
|
||||
|
||||
## Platform Selection Decision Tree
|
||||
|
||||
```
|
||||
Need Kafka deployment? START HERE:
|
||||
|
||||
├─ Running on AWS?
|
||||
│ ├─ YES → Want managed service?
|
||||
│ │ ├─ YES → Use AWS MSK module (terraform/aws-msk)
|
||||
│ │ └─ NO → Use Apache Kafka module (terraform/apache-kafka)
|
||||
│ └─ NO → Continue...
|
||||
│
|
||||
├─ Running on Azure?
|
||||
│ ├─ YES → Use Azure Event Hubs module (terraform/azure-event-hubs)
|
||||
│ └─ NO → Continue...
|
||||
│
|
||||
├─ Multi-cloud or hybrid?
|
||||
│ └─ YES → Use Apache Kafka module (most portable)
|
||||
│
|
||||
├─ Need maximum control?
|
||||
│ └─ YES → Use Apache Kafka module
|
||||
│
|
||||
└─ Default → Use Apache Kafka module (self-hosted, KRaft mode)
|
||||
```
|
||||
|
||||
## Deployment Workflows
|
||||
|
||||
### Workflow 1: Deploy Self-Hosted Kafka (Apache Kafka Module)
|
||||
|
||||
**Scenario**: You want full control over Kafka on AWS EC2
|
||||
|
||||
```bash
|
||||
# 1. Create Terraform configuration
|
||||
cat > main.tf <<EOF
|
||||
module "kafka_cluster" {
|
||||
source = "../../plugins/specweave-kafka/terraform/apache-kafka"
|
||||
|
||||
environment = "production"
|
||||
broker_count = 3
|
||||
kafka_version = "3.7.0"
|
||||
instance_type = "m5.xlarge"
|
||||
|
||||
vpc_id = "vpc-12345678"
|
||||
subnet_ids = ["subnet-abc", "subnet-def", "subnet-ghi"]
|
||||
domain = "kafka.example.com"
|
||||
|
||||
enable_s3_backups = true
|
||||
enable_monitoring = true
|
||||
|
||||
tags = {
|
||||
Project = "MyApp"
|
||||
Environment = "Production"
|
||||
}
|
||||
}
|
||||
|
||||
output "broker_endpoints" {
|
||||
value = module.kafka_cluster.broker_endpoints
|
||||
}
|
||||
EOF
|
||||
|
||||
# 2. Initialize Terraform
|
||||
terraform init
|
||||
|
||||
# 3. Plan deployment (review what will be created)
|
||||
terraform plan
|
||||
|
||||
# 4. Apply (create infrastructure)
|
||||
terraform apply
|
||||
|
||||
# 5. Get broker endpoints
|
||||
terraform output broker_endpoints
|
||||
# Output: ["kafka-0.kafka.example.com:9093", "kafka-1.kafka.example.com:9093", ...]
|
||||
```
|
||||
|
||||
### Workflow 2: Deploy AWS MSK (Managed Service)
|
||||
|
||||
**Scenario**: You want AWS to manage Kafka operations
|
||||
|
||||
```bash
|
||||
# 1. Create Terraform configuration
|
||||
cat > main.tf <<EOF
|
||||
module "msk_cluster" {
|
||||
source = "../../plugins/specweave-kafka/terraform/aws-msk"
|
||||
|
||||
cluster_name = "my-msk-cluster"
|
||||
kafka_version = "3.6.0"
|
||||
number_of_broker_nodes = 3
|
||||
broker_node_instance_type = "kafka.m5.large"
|
||||
|
||||
vpc_id = var.vpc_id
|
||||
subnet_ids = var.private_subnet_ids
|
||||
|
||||
enable_iam_auth = true
|
||||
enable_auto_scaling = true
|
||||
|
||||
tags = {
|
||||
Project = "MyApp"
|
||||
}
|
||||
}
|
||||
|
||||
output "bootstrap_brokers" {
|
||||
value = module.msk_cluster.bootstrap_brokers_sasl_iam
|
||||
}
|
||||
EOF
|
||||
|
||||
# 2. Deploy
|
||||
terraform init && terraform apply
|
||||
|
||||
# 3. Configure IAM authentication
|
||||
# (module outputs IAM policy, attach to your application role)
|
||||
```
|
||||
|
||||
### Workflow 3: Deploy Azure Event Hubs (Kafka API)
|
||||
|
||||
**Scenario**: You're on Azure and want Kafka-compatible API
|
||||
|
||||
```bash
|
||||
# 1. Create Terraform configuration
|
||||
cat > main.tf <<EOF
|
||||
module "event_hubs" {
|
||||
source = "../../plugins/specweave-kafka/terraform/azure-event-hubs"
|
||||
|
||||
namespace_name = "my-kafka-namespace"
|
||||
resource_group_name = "my-resource-group"
|
||||
location = "eastus"
|
||||
|
||||
sku = "Premium"
|
||||
capacity = 1
|
||||
kafka_enabled = true
|
||||
auto_inflate_enabled = true
|
||||
maximum_throughput_units = 20
|
||||
|
||||
# Create hubs (topics) for your use case
|
||||
hubs = [
|
||||
{ name = "user-events", partitions = 12 },
|
||||
{ name = "order-events", partitions = 6 },
|
||||
{ name = "payment-events", partitions = 3 }
|
||||
]
|
||||
}
|
||||
|
||||
output "connection_string" {
|
||||
value = module.event_hubs.connection_string
|
||||
sensitive = true
|
||||
}
|
||||
EOF
|
||||
|
||||
# 2. Deploy
|
||||
terraform init && terraform apply
|
||||
|
||||
# 3. Get connection details
|
||||
terraform output connection_string
|
||||
```
|
||||
|
||||
## Infrastructure Sizing Recommendations
|
||||
|
||||
### Small Environment (Dev/Test)
|
||||
```hcl
|
||||
# Self-hosted: 1 broker, m5.large
|
||||
broker_count = 1
|
||||
instance_type = "m5.large"
|
||||
|
||||
# AWS MSK: 1 broker per AZ, kafka.m5.large
|
||||
number_of_broker_nodes = 3
|
||||
broker_node_instance_type = "kafka.m5.large"
|
||||
|
||||
# Azure Event Hubs: Basic SKU
|
||||
sku = "Basic"
|
||||
capacity = 1
|
||||
```
|
||||
|
||||
### Medium Environment (Staging/Production)
|
||||
```hcl
|
||||
# Self-hosted: 3 brokers, m5.xlarge
|
||||
broker_count = 3
|
||||
instance_type = "m5.xlarge"
|
||||
|
||||
# AWS MSK: 3 brokers, kafka.m5.xlarge
|
||||
number_of_broker_nodes = 3
|
||||
broker_node_instance_type = "kafka.m5.xlarge"
|
||||
|
||||
# Azure Event Hubs: Standard SKU with auto-inflate
|
||||
sku = "Standard"
|
||||
capacity = 2
|
||||
auto_inflate_enabled = true
|
||||
maximum_throughput_units = 10
|
||||
```
|
||||
|
||||
### Large Environment (High-Throughput Production)
|
||||
```hcl
|
||||
# Self-hosted: 5+ brokers, m5.2xlarge or m5.4xlarge
|
||||
broker_count = 5
|
||||
instance_type = "m5.2xlarge"
|
||||
|
||||
# AWS MSK: 6+ brokers, kafka.m5.2xlarge, auto-scaling
|
||||
number_of_broker_nodes = 6
|
||||
broker_node_instance_type = "kafka.m5.2xlarge"
|
||||
enable_auto_scaling = true
|
||||
|
||||
# Azure Event Hubs: Premium SKU with zone redundancy
|
||||
sku = "Premium"
|
||||
capacity = 4
|
||||
zone_redundant = true
|
||||
maximum_throughput_units = 20
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
### Security Best Practices
|
||||
1. **Always use encryption in transit**
|
||||
- Self-hosted: Enable SASL_SSL listener
|
||||
- AWS MSK: Set `encryption_in_transit_client_broker = "TLS"`
|
||||
- Azure Event Hubs: HTTPS/TLS enabled by default
|
||||
|
||||
2. **Use IAM authentication (when possible)**
|
||||
- AWS MSK: `enable_iam_auth = true`
|
||||
- Azure Event Hubs: Managed identities
|
||||
|
||||
3. **Network isolation**
|
||||
- Deploy in private subnets
|
||||
- Use security groups/NSGs restrictively
|
||||
- Azure: Enable private endpoints for Premium SKU
|
||||
|
||||
### High Availability Best Practices
|
||||
1. **Multi-AZ deployment**
|
||||
- Self-hosted: Distribute brokers across 3+ AZs
|
||||
- AWS MSK: Automatically multi-AZ
|
||||
- Azure Event Hubs: Enable `zone_redundant = true` (Premium)
|
||||
|
||||
2. **Replication factor = 3**
|
||||
- Self-hosted: `default.replication.factor=3`
|
||||
- AWS MSK: Configured automatically
|
||||
- Azure Event Hubs: N/A (fully managed)
|
||||
|
||||
3. **min.insync.replicas = 2**
|
||||
- Ensures durability even if 1 broker fails
|
||||
|
||||
### Cost Optimization
|
||||
1. **Right-size instances**
|
||||
- Use ClusterSizingCalculator utility (in kafka-architecture skill)
|
||||
- Start small, scale up based on metrics
|
||||
|
||||
2. **Auto-scaling (where available)**
|
||||
- AWS MSK: `enable_auto_scaling = true`
|
||||
- Azure Event Hubs: `auto_inflate_enabled = true`
|
||||
|
||||
3. **Retention policies**
|
||||
- Set `log.retention.hours` based on actual needs (default: 168 hours = 7 days)
|
||||
- Shorter retention = lower storage costs
|
||||
|
||||
## Monitoring Integration
|
||||
|
||||
All modules integrate with monitoring:
|
||||
|
||||
### Self-Hosted Kafka
|
||||
- CloudWatch metrics (via JMX Exporter)
|
||||
- Prometheus + Grafana dashboards (see kafka-observability skill)
|
||||
- Custom CloudWatch alarms
|
||||
|
||||
### AWS MSK
|
||||
- Built-in CloudWatch metrics
|
||||
- Enhanced monitoring available
|
||||
- Integration with CloudWatch Alarms
|
||||
|
||||
### Azure Event Hubs
|
||||
- Built-in Azure Monitor metrics
|
||||
- Diagnostic logs to Log Analytics
|
||||
- Integration with Azure Alerts
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "Terraform destroy fails on security groups"
|
||||
**Cause**: Resources using security groups still exist
|
||||
**Fix**:
|
||||
```bash
|
||||
# 1. Find dependent resources
|
||||
aws ec2 describe-network-interfaces --filters "Name=group-id,Values=sg-12345678"
|
||||
|
||||
# 2. Delete dependent resources first
|
||||
# 3. Retry terraform destroy
|
||||
```
|
||||
|
||||
### "AWS MSK cluster takes 20+ minutes to create"
|
||||
**Cause**: MSK provisioning is inherently slow (AWS behavior)
|
||||
**Fix**: This is normal. Use `-auto-approve` for automation:
|
||||
```bash
|
||||
terraform apply -auto-approve
|
||||
```
|
||||
|
||||
### "Azure Event Hubs: Connection refused"
|
||||
**Cause**: Kafka protocol not enabled OR incorrect connection string
|
||||
**Fix**:
|
||||
1. Verify `kafka_enabled = true` in Terraform
|
||||
2. Use Kafka connection string (not Event Hubs connection string); see the kcat connectivity check after this list
|
||||
3. Check firewall rules (Premium SKU supports private endpoints)
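
A quick connectivity check against the Kafka endpoint can be done with kcat, using the authentication pattern Azure documents for Kafka clients: SASL_SSL/PLAIN with the literal username `$ConnectionString` and the namespace connection string as the password (namespace name below is a placeholder):

```bash
kcat -b my-kafka-namespace.servicebus.windows.net:9093 \
  -X security.protocol=SASL_SSL \
  -X sasl.mechanism=PLAIN \
  -X sasl.username='$ConnectionString' \
  -X sasl.password="Endpoint=sb://my-kafka-namespace.servicebus.windows.net/;SharedAccessKeyName=RootManageSharedAccessKey;SharedAccessKey=<key>" \
  -L
```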
|
||||
|
||||
## Integration with Other Skills
|
||||
|
||||
- **kafka-architecture**: For cluster sizing and partitioning strategy
|
||||
- **kafka-observability**: For Prometheus + Grafana setup after deployment
|
||||
- **kafka-kubernetes**: For deploying Kafka on Kubernetes (alternative to Terraform)
|
||||
- **kafka-cli-tools**: For testing deployed clusters with kcat
|
||||
|
||||
## Quick Reference Commands
|
||||
|
||||
```bash
|
||||
# Terraform workflow
|
||||
terraform init # Initialize modules
|
||||
terraform plan # Preview changes
|
||||
terraform apply # Create infrastructure
|
||||
terraform output # Get outputs (endpoints, etc.)
|
||||
terraform destroy # Delete infrastructure
|
||||
|
||||
# AWS MSK specific
|
||||
aws kafka list-clusters # List MSK clusters
|
||||
aws kafka describe-cluster --cluster-arn <arn> # Get cluster details
|
||||
|
||||
# Azure Event Hubs specific
|
||||
az eventhubs namespace list # List namespaces
|
||||
az eventhubs eventhub list --namespace-name <name> --resource-group <rg> # List hubs
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
**Next Steps After Deployment**:
|
||||
1. Use **kafka-observability** skill to set up Prometheus + Grafana monitoring
|
||||
2. Use **kafka-cli-tools** skill to test cluster with kcat
|
||||
3. Deploy your producer/consumer applications
|
||||
4. Monitor cluster health and performance
|
||||
667
skills/kafka-kubernetes/SKILL.md
Normal file
667
skills/kafka-kubernetes/SKILL.md
Normal file
@@ -0,0 +1,667 @@
|
||||
---
|
||||
name: kafka-kubernetes
|
||||
description: Kubernetes deployment expert for Apache Kafka. Guides K8s deployments using Helm charts, operators (Strimzi, Confluent), StatefulSets, and production best practices. Activates for kubernetes, k8s, helm, kafka on kubernetes, strimzi, confluent operator, kafka operator, statefulset, kafka helm chart, k8s deployment, kubernetes kafka, deploy kafka to k8s.
|
||||
---
|
||||
|
||||
# Kafka on Kubernetes Deployment
|
||||
|
||||
Expert guidance for deploying Apache Kafka on Kubernetes using industry-standard tools.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
I activate when you need help with:
|
||||
- **Kubernetes deployments**: "Deploy Kafka on Kubernetes", "run Kafka in K8s", "Kafka Helm chart"
|
||||
- **Operator selection**: "Strimzi vs Confluent Operator", "which Kafka operator to use"
|
||||
- **StatefulSet patterns**: "Kafka StatefulSet best practices", "persistent volumes for Kafka"
|
||||
- **Production K8s**: "Production-ready Kafka on K8s", "Kafka high availability in Kubernetes"
|
||||
|
||||
## What I Know
|
||||
|
||||
### Deployment Options Comparison
|
||||
|
||||
| Approach | Difficulty | Production-Ready | Best For |
|
||||
|----------|-----------|------------------|----------|
|
||||
| **Strimzi Operator** | Easy | ✅ Yes | Self-managed Kafka on K8s, CNCF project |
|
||||
| **Confluent Operator** | Medium | ✅ Yes | Enterprise features, Confluent ecosystem |
|
||||
| **Bitnami Helm Chart** | Easy | ⚠️ Mostly | Quick dev/staging environments |
|
||||
| **Custom StatefulSet** | Hard | ⚠️ Requires expertise | Full control, custom requirements |
|
||||
|
||||
**Recommendation**: **Strimzi Operator** for most production use cases (CNCF project, active community, KRaft support)
|
||||
|
||||
## Deployment Approach 1: Strimzi Operator (Recommended)
|
||||
|
||||
**Strimzi** is a CNCF Sandbox project providing Kubernetes operators for Apache Kafka.
|
||||
|
||||
### Features
|
||||
- ✅ KRaft mode support (Kafka 3.6+, no ZooKeeper)
|
||||
- ✅ Declarative Kafka management (CRDs)
|
||||
- ✅ Automatic rolling upgrades
|
||||
- ✅ Built-in monitoring (Prometheus metrics)
|
||||
- ✅ Mirror Maker 2 for replication
|
||||
- ✅ Kafka Connect integration
|
||||
- ✅ User and topic management via CRDs
|
||||
|
||||
### Installation (Helm)
|
||||
|
||||
```bash
|
||||
# 1. Add Strimzi Helm repository
|
||||
helm repo add strimzi https://strimzi.io/charts/
|
||||
helm repo update
|
||||
|
||||
# 2. Create namespace
|
||||
kubectl create namespace kafka
|
||||
|
||||
# 3. Install Strimzi Operator
|
||||
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
|
||||
--namespace kafka \
|
||||
--set watchNamespaces="{kafka}" \
|
||||
--version 0.39.0
|
||||
|
||||
# 4. Verify operator is running
|
||||
kubectl get pods -n kafka
|
||||
# Output: strimzi-cluster-operator-... Running
|
||||
```
|
||||
|
||||
### Deploy Kafka Cluster (KRaft Mode)
|
||||
|
||||
```yaml
|
||||
# kafka-cluster.yaml
|
||||
apiVersion: kafka.strimzi.io/v1beta2
|
||||
kind: KafkaNodePool
|
||||
metadata:
|
||||
name: kafka-pool
|
||||
namespace: kafka
|
||||
labels:
|
||||
strimzi.io/cluster: my-kafka-cluster
|
||||
spec:
|
||||
replicas: 3
|
||||
roles:
|
||||
- controller
|
||||
- broker
|
||||
storage:
|
||||
type: jbod
|
||||
volumes:
|
||||
- id: 0
|
||||
type: persistent-claim
|
||||
size: 100Gi
|
||||
class: fast-ssd
|
||||
deleteClaim: false
|
||||
---
|
||||
apiVersion: kafka.strimzi.io/v1beta2
|
||||
kind: Kafka
|
||||
metadata:
|
||||
name: my-kafka-cluster
|
||||
namespace: kafka
|
||||
annotations:
|
||||
strimzi.io/kraft: enabled
|
||||
strimzi.io/node-pools: enabled
|
||||
spec:
|
||||
kafka:
|
||||
version: 3.7.0
|
||||
metadataVersion: 3.7-IV4
|
||||
replicas: 3
|
||||
|
||||
listeners:
|
||||
- name: plain
|
||||
port: 9092
|
||||
type: internal
|
||||
tls: false
|
||||
- name: tls
|
||||
port: 9093
|
||||
type: internal
|
||||
tls: true
|
||||
authentication:
|
||||
type: tls
|
||||
- name: external
|
||||
port: 9094
|
||||
type: loadbalancer
|
||||
tls: true
|
||||
authentication:
|
||||
type: tls
|
||||
|
||||
config:
|
||||
default.replication.factor: 3
|
||||
min.insync.replicas: 2
|
||||
offsets.topic.replication.factor: 3
|
||||
transaction.state.log.replication.factor: 3
|
||||
transaction.state.log.min.isr: 2
|
||||
auto.create.topics.enable: false
|
||||
log.retention.hours: 168
|
||||
log.segment.bytes: 1073741824
|
||||
compression.type: lz4
|
||||
|
||||
resources:
|
||||
requests:
|
||||
memory: 4Gi
|
||||
cpu: "2"
|
||||
limits:
|
||||
memory: 8Gi
|
||||
cpu: "4"
|
||||
|
||||
jvmOptions:
|
||||
-Xms: 2048m
|
||||
-Xmx: 4096m
|
||||
|
||||
metricsConfig:
|
||||
type: jmxPrometheusExporter
|
||||
valueFrom:
|
||||
configMapKeyRef:
|
||||
name: kafka-metrics
|
||||
key: kafka-metrics-config.yml
|
||||
```
|
||||
|
||||
```bash
|
||||
# Apply Kafka cluster
|
||||
kubectl apply -f kafka-cluster.yaml
|
||||
|
||||
# Wait for cluster to be ready (5-10 minutes)
|
||||
kubectl wait kafka/my-kafka-cluster --for=condition=Ready --timeout=600s -n kafka
|
||||
|
||||
# Check status
|
||||
kubectl get kafka -n kafka
|
||||
# Output: my-kafka-cluster 3.7.0 3 True
|
||||
```
|
||||
|
||||
### Create Topics (Declaratively)
|
||||
|
||||
```yaml
|
||||
# kafka-topics.yaml
|
||||
apiVersion: kafka.strimzi.io/v1beta2
|
||||
kind: KafkaTopic
|
||||
metadata:
|
||||
name: user-events
|
||||
namespace: kafka
|
||||
labels:
|
||||
strimzi.io/cluster: my-kafka-cluster
|
||||
spec:
|
||||
partitions: 12
|
||||
replicas: 3
|
||||
config:
|
||||
retention.ms: 604800000 # 7 days
|
||||
segment.bytes: 1073741824
|
||||
compression.type: lz4
|
||||
---
|
||||
apiVersion: kafka.strimzi.io/v1beta2
|
||||
kind: KafkaTopic
|
||||
metadata:
|
||||
name: order-events
|
||||
namespace: kafka
|
||||
labels:
|
||||
strimzi.io/cluster: my-kafka-cluster
|
||||
spec:
|
||||
partitions: 6
|
||||
replicas: 3
|
||||
config:
|
||||
retention.ms: 2592000000 # 30 days
|
||||
min.insync.replicas: 2
|
||||
```
|
||||
|
||||
```bash
|
||||
# Apply topics
|
||||
kubectl apply -f kafka-topics.yaml
|
||||
|
||||
# Verify topics created
|
||||
kubectl get kafkatopics -n kafka
|
||||
```
|
||||
|
||||
### Create Users (Declaratively)
|
||||
|
||||
```yaml
|
||||
# kafka-users.yaml
|
||||
apiVersion: kafka.strimzi.io/v1beta2
|
||||
kind: KafkaUser
|
||||
metadata:
|
||||
name: my-producer
|
||||
namespace: kafka
|
||||
labels:
|
||||
strimzi.io/cluster: my-kafka-cluster
|
||||
spec:
|
||||
authentication:
|
||||
type: tls
|
||||
authorization:
|
||||
type: simple
|
||||
acls:
|
||||
- resource:
|
||||
type: topic
|
||||
name: user-events
|
||||
patternType: literal
|
||||
operations: [Write, Describe]
|
||||
- resource:
|
||||
type: topic
|
||||
name: order-events
|
||||
patternType: literal
|
||||
operations: [Write, Describe]
|
||||
---
|
||||
apiVersion: kafka.strimzi.io/v1beta2
|
||||
kind: KafkaUser
|
||||
metadata:
|
||||
name: my-consumer
|
||||
namespace: kafka
|
||||
labels:
|
||||
strimzi.io/cluster: my-kafka-cluster
|
||||
spec:
|
||||
authentication:
|
||||
type: tls
|
||||
authorization:
|
||||
type: simple
|
||||
acls:
|
||||
- resource:
|
||||
type: topic
|
||||
name: user-events
|
||||
patternType: literal
|
||||
operations: [Read, Describe]
|
||||
- resource:
|
||||
type: group
|
||||
name: my-consumer-group
|
||||
patternType: literal
|
||||
operations: [Read]
|
||||
```
|
||||
|
||||
```bash
|
||||
# Apply users
|
||||
kubectl apply -f kafka-users.yaml
|
||||
|
||||
# Get user credentials (TLS certificates)
|
||||
kubectl get secret my-producer -n kafka -o jsonpath='{.data.user\.crt}' | base64 -d > producer.crt
|
||||
kubectl get secret my-producer -n kafka -o jsonpath='{.data.user\.key}' | base64 -d > producer.key
|
||||
kubectl get secret my-kafka-cluster-cluster-ca-cert -n kafka -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt
|
||||
```
|
||||
|
||||
## Deployment Approach 2: Confluent Operator
|
||||
|
||||
**Confluent for Kubernetes (CFK)** provides enterprise-grade Kafka management.
|
||||
|
||||
### Features
|
||||
- ✅ Full Confluent Platform (Kafka, Schema Registry, ksqlDB, Connect)
|
||||
- ✅ Hybrid deployments (K8s + on-prem)
|
||||
- ✅ Rolling upgrades with zero downtime
|
||||
- ✅ Multi-region replication
|
||||
- ✅ Advanced security (RBAC, encryption)
|
||||
- ⚠️ Requires Confluent Platform license (paid)
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
# 1. Add Confluent Helm repository
|
||||
helm repo add confluentinc https://packages.confluent.io/helm
|
||||
helm repo update
|
||||
|
||||
# 2. Create namespace
|
||||
kubectl create namespace confluent
|
||||
|
||||
# 3. Install Confluent Operator
|
||||
helm install confluent-operator confluentinc/confluent-for-kubernetes \
|
||||
--namespace confluent \
|
||||
--version 0.921.11
|
||||
|
||||
# 4. Verify
|
||||
kubectl get pods -n confluent
|
||||
```
|
||||
|
||||
### Deploy Kafka Cluster
|
||||
|
||||
```yaml
|
||||
# kafka-cluster-confluent.yaml
|
||||
apiVersion: platform.confluent.io/v1beta1
|
||||
kind: Kafka
|
||||
metadata:
|
||||
name: kafka
|
||||
namespace: confluent
|
||||
spec:
|
||||
replicas: 3
|
||||
image:
|
||||
application: confluentinc/cp-server:7.6.0
|
||||
init: confluentinc/confluent-init-container:2.7.0
|
||||
|
||||
dataVolumeCapacity: 100Gi
|
||||
storageClass:
|
||||
name: fast-ssd
|
||||
|
||||
metricReporter:
|
||||
enabled: true
|
||||
|
||||
listeners:
|
||||
internal:
|
||||
authentication:
|
||||
type: plain
|
||||
tls:
|
||||
enabled: true
|
||||
external:
|
||||
authentication:
|
||||
type: plain
|
||||
tls:
|
||||
enabled: true
|
||||
|
||||
dependencies:
|
||||
zookeeper:
|
||||
endpoint: zookeeper.confluent.svc.cluster.local:2181
|
||||
|
||||
podTemplate:
|
||||
resources:
|
||||
requests:
|
||||
memory: 4Gi
|
||||
cpu: 2
|
||||
limits:
|
||||
memory: 8Gi
|
||||
cpu: 4
|
||||
```
|
||||
|
||||
```bash
|
||||
# Apply Kafka cluster
|
||||
kubectl apply -f kafka-cluster-confluent.yaml
|
||||
|
||||
# Wait for cluster
|
||||
kubectl wait kafka/kafka --for=condition=Ready --timeout=600s -n confluent
|
||||
```
|
||||
|
||||
## Deployment Approach 3: Bitnami Helm Chart (Dev/Staging)
|
||||
|
||||
**Bitnami Helm Chart** is simple but less suitable for production.
|
||||
|
||||
### Installation
|
||||
|
||||
```bash
|
||||
# 1. Add Bitnami repository
|
||||
helm repo add bitnami https://charts.bitnami.com/bitnami
|
||||
helm repo update
|
||||
|
||||
# 2. Install Kafka (KRaft mode)
|
||||
helm install kafka bitnami/kafka \
|
||||
--namespace kafka \
|
||||
--create-namespace \
|
||||
--set kraft.enabled=true \
|
||||
--set controller.replicaCount=3 \
|
||||
--set broker.replicaCount=3 \
|
||||
--set persistence.size=100Gi \
|
||||
--set persistence.storageClass=fast-ssd \
|
||||
--set metrics.kafka.enabled=true \
|
||||
--set metrics.jmx.enabled=true
|
||||
|
||||
# 3. Get bootstrap servers
|
||||
export KAFKA_BOOTSTRAP=$(kubectl get svc kafka -n kafka -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'):9092
|
||||
```
|
||||
|
||||
**Limitations**:
|
||||
- ⚠️ Less production-ready than Strimzi/Confluent
|
||||
- ⚠️ Limited declarative topic/user management
|
||||
- ⚠️ Fewer advanced features (no MirrorMaker 2, limited RBAC)
|
||||
|
||||
## Production Best Practices
|
||||
|
||||
### 1. Storage Configuration
|
||||
|
||||
**Use SSD-backed storage classes** for Kafka logs:
|
||||
|
||||
```yaml
|
||||
apiVersion: storage.k8s.io/v1
|
||||
kind: StorageClass
|
||||
metadata:
|
||||
name: fast-ssd
|
||||
provisioner: ebs.csi.aws.com # AWS EBS CSI driver (or pd.csi.storage.gke.io for GKE)
|
||||
parameters:
|
||||
type: gp3 # AWS EBS GP3 (or io2 for extreme performance)
|
||||
iops: "3000"
|
||||
throughput: "125"
|
||||
allowVolumeExpansion: true
|
||||
volumeBindingMode: WaitForFirstConsumer
|
||||
```
|
||||
|
||||
**Kafka storage requirements**:
|
||||
- **Min IOPS**: 3000+ per broker
|
||||
- **Min Throughput**: 125 MB/s per broker
|
||||
- **Persistent**: Use `deleteClaim: false` so PVCs (and the data on them) are retained if the Kafka cluster is deleted
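In a Strimzi `Kafka` CR these requirements map onto the `storage` block roughly like this (a sketch; the size and storage class are assumptions):

```yaml
spec:
  kafka:
    storage:
      type: jbod
      volumes:
        - id: 0
          type: persistent-claim
          size: 100Gi
          class: fast-ssd
          deleteClaim: false   # retain PVCs if the Kafka cluster is deleted
```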
|
||||
|
||||
### 2. Resource Limits
|
||||
|
||||
```yaml
|
||||
resources:
|
||||
requests:
|
||||
memory: 4Gi
|
||||
cpu: "2"
|
||||
limits:
|
||||
memory: 8Gi
|
||||
cpu: "4"
|
||||
|
||||
jvmOptions:
|
||||
-Xms: 2048m # Initial heap (50% of memory request)
|
||||
-Xmx: 4096m # Max heap (50% of memory limit, leave room for OS cache)
|
||||
```
|
||||
|
||||
**Sizing guidelines**:
|
||||
- **Small (dev)**: 2 CPU, 4Gi memory
|
||||
- **Medium (staging)**: 4 CPU, 8Gi memory
|
||||
- **Large (production)**: 8 CPU, 16Gi memory
|
||||
|
||||
### 3. Pod Disruption Budgets
|
||||
|
||||
Ensure high availability during K8s upgrades:
|
||||
|
||||
```yaml
|
||||
apiVersion: policy/v1
|
||||
kind: PodDisruptionBudget
|
||||
metadata:
|
||||
name: kafka-pdb
|
||||
namespace: kafka
|
||||
spec:
|
||||
maxUnavailable: 1
|
||||
selector:
|
||||
matchLabels:
|
||||
app.kubernetes.io/name: kafka
|
||||
```
|
||||
|
||||
### 4. Affinity Rules
|
||||
|
||||
**Spread brokers across availability zones**:
|
||||
|
||||
```yaml
|
||||
spec:
|
||||
kafka:
|
||||
template:
|
||||
pod:
|
||||
affinity:
|
||||
podAntiAffinity:
|
||||
requiredDuringSchedulingIgnoredDuringExecution:
|
||||
- labelSelector:
|
||||
matchExpressions:
|
||||
- key: strimzi.io/name
|
||||
operator: In
|
||||
values:
|
||||
- my-kafka-cluster-kafka
|
||||
topologyKey: topology.kubernetes.io/zone
|
||||
```
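Anti-affinity spreads the pods; it also helps to tell Kafka which zone each broker is in, so replicas of a partition end up in different zones. With Strimzi that is the `rack` block, a small sketch:

```yaml
spec:
  kafka:
    rack:
      topologyKey: topology.kubernetes.io/zone   # sets broker.rack from the node's zone label
```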
|
||||
|
||||
### 5. Network Policies
|
||||
|
||||
**Restrict access to Kafka brokers**:
|
||||
|
||||
```yaml
|
||||
apiVersion: networking.k8s.io/v1
|
||||
kind: NetworkPolicy
|
||||
metadata:
|
||||
name: kafka-network-policy
|
||||
namespace: kafka
|
||||
spec:
|
||||
podSelector:
|
||||
matchLabels:
|
||||
strimzi.io/name: my-kafka-cluster-kafka
|
||||
policyTypes:
|
||||
- Ingress
|
||||
ingress:
|
||||
- from:
|
||||
- podSelector:
|
||||
matchLabels:
|
||||
app: my-producer
|
||||
- podSelector:
|
||||
matchLabels:
|
||||
app: my-consumer
|
||||
ports:
|
||||
- protocol: TCP
|
||||
port: 9092
|
||||
- protocol: TCP
|
||||
port: 9093
|
||||
```
|
||||
|
||||
## Monitoring Integration
|
||||
|
||||
### Prometheus + Grafana Setup
|
||||
|
||||
Strimzi provides a built-in Prometheus JMX metrics exporter:
|
||||
|
||||
```yaml
|
||||
# kafka-metrics-configmap.yaml
|
||||
apiVersion: v1
|
||||
kind: ConfigMap
|
||||
metadata:
|
||||
name: kafka-metrics
|
||||
namespace: kafka
|
||||
data:
|
||||
kafka-metrics-config.yml: |
|
||||
# Use JMX Exporter config from:
|
||||
# plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml
|
||||
lowercaseOutputName: true
|
||||
lowercaseOutputLabelNames: true
|
||||
whitelistObjectNames:
|
||||
- "kafka.server:type=BrokerTopicMetrics,name=*"
|
||||
# ... (copy from kafka-jmx-exporter.yml)
|
||||
```
|
||||
|
||||
```bash
|
||||
# Apply metrics config
|
||||
kubectl apply -f kafka-metrics-configmap.yaml
|
||||
|
||||
# Install Prometheus Operator (if not already installed)
|
||||
helm install prometheus prometheus-community/kube-prometheus-stack \
|
||||
--namespace monitoring \
|
||||
--create-namespace
|
||||
|
||||
# Create PodMonitor for Kafka
|
||||
kubectl apply -f - <<EOF
|
||||
apiVersion: monitoring.coreos.com/v1
|
||||
kind: PodMonitor
|
||||
metadata:
|
||||
name: kafka-metrics
|
||||
namespace: kafka
|
||||
spec:
|
||||
selector:
|
||||
matchLabels:
|
||||
strimzi.io/kind: Kafka
|
||||
podMetricsEndpoints:
|
||||
- port: tcp-prometheus
|
||||
interval: 30s
|
||||
EOF
|
||||
|
||||
# Access Grafana dashboards (from kafka-observability skill)
|
||||
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
|
||||
# Open: http://localhost:3000
|
||||
# Dashboards: Kafka Cluster Overview, Broker Metrics, Consumer Lag, Topic Metrics, JVM Metrics
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "Pods stuck in Pending state"
|
||||
**Cause**: Insufficient resources or storage class not found
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check events
|
||||
kubectl describe pod kafka-my-kafka-cluster-0 -n kafka
|
||||
|
||||
# Check storage class exists
|
||||
kubectl get storageclass
|
||||
|
||||
# If missing, create fast-ssd storage class (see Production Best Practices above)
|
||||
```
|
||||
|
||||
### "Kafka broker not ready after 10 minutes"
|
||||
**Cause**: Slow storage provisioning or resource limits too low
|
||||
**Fix**:
|
||||
```bash
|
||||
# Check broker logs
|
||||
kubectl logs kafka-my-kafka-cluster-0 -n kafka
|
||||
|
||||
# Common issues:
|
||||
# 1. Low IOPS on storage → Use GP3 or better
|
||||
# 2. Low memory → Increase resources.requests.memory
|
||||
# 3. KRaft quorum not formed → Check all brokers are running
|
||||
```
|
||||
|
||||
### "Cannot connect to Kafka from outside K8s"
|
||||
**Cause**: External listener not configured
|
||||
**Fix**:
|
||||
```yaml
|
||||
# Add external listener (Strimzi)
|
||||
spec:
|
||||
kafka:
|
||||
listeners:
|
||||
- name: external
|
||||
port: 9094
|
||||
type: loadbalancer
|
||||
tls: true
|
||||
authentication:
|
||||
type: tls
|
||||
|
||||
```

```bash
# Get external bootstrap server
|
||||
kubectl get kafka my-kafka-cluster -n kafka -o jsonpath='{.status.listeners[?(@.name=="external")].bootstrapServers}'
|
||||
```
|
||||
|
||||
## Scaling Operations
|
||||
|
||||
### Horizontal Scaling (Add Brokers)
|
||||
|
||||
```bash
|
||||
# Strimzi: Update KafkaNodePool replicas
|
||||
kubectl patch kafkanodepool kafka-pool -n kafka --type='json' \
|
||||
-p='[{"op": "replace", "path": "/spec/replicas", "value": 5}]'
|
||||
|
||||
# Confluent: Update Kafka CR
|
||||
kubectl patch kafka kafka -n confluent --type='json' \
|
||||
-p='[{"op": "replace", "path": "/spec/replicas", "value": 5}]'
|
||||
|
||||
# Wait for new brokers
|
||||
kubectl rollout status statefulset/kafka-my-kafka-cluster-kafka -n kafka
|
||||
```
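Note that new brokers start empty; existing partitions are not moved onto them automatically. With Strimzi, one way to spread load after a scale-up is a Cruise Control rebalance. A minimal sketch, assuming Cruise Control is enabled on the cluster (`spec.cruiseControl: {}`):

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaRebalance
metadata:
  name: rebalance-after-scale-up
  namespace: kafka
  labels:
    strimzi.io/cluster: my-kafka-cluster
spec: {}   # default goals; approve the generated proposal via the strimzi.io/rebalance annotation
```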
|
||||
|
||||
### Vertical Scaling (Change Resources)
|
||||
|
||||
```bash
|
||||
# Update resources in Kafka CR
|
||||
kubectl patch kafka my-kafka-cluster -n kafka --type='json' \
|
||||
-p='[
|
||||
{"op": "replace", "path": "/spec/kafka/resources/requests/memory", "value": "8Gi"},
|
||||
{"op": "replace", "path": "/spec/kafka/resources/requests/cpu", "value": "4"}
|
||||
]'
|
||||
|
||||
# Rolling restart will happen automatically
|
||||
```
|
||||
|
||||
## Integration with Other Skills
|
||||
|
||||
- **kafka-iac-deployment**: Alternative to K8s (use Terraform for cloud-managed Kafka)
|
||||
- **kafka-observability**: Set up Prometheus + Grafana dashboards for K8s Kafka
|
||||
- **kafka-architecture**: Cluster sizing and partitioning strategy
|
||||
- **kafka-cli-tools**: Test K8s Kafka cluster with kcat
|
||||
|
||||
## Quick Reference Commands
|
||||
|
||||
```bash
|
||||
# Strimzi
|
||||
kubectl get kafka -n kafka # List Kafka clusters
|
||||
kubectl get kafkatopics -n kafka # List topics
|
||||
kubectl get kafkausers -n kafka # List users
|
||||
kubectl logs kafka-my-kafka-cluster-0 -n kafka # Check broker logs
|
||||
|
||||
# Confluent
|
||||
kubectl get kafka -n confluent # List Kafka clusters
|
||||
kubectl get schemaregistry -n confluent # List Schema Registry
|
||||
kubectl get ksqldb -n confluent # List ksqlDB
|
||||
|
||||
# Port-forward for testing
|
||||
kubectl port-forward -n kafka svc/my-kafka-cluster-kafka-bootstrap 9092:9092
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
**Next Steps After K8s Deployment**:
|
||||
1. Use **kafka-observability** skill to verify Prometheus metrics and Grafana dashboards
|
||||
2. Use **kafka-cli-tools** skill to test cluster with kcat
|
||||
3. Deploy your producer/consumer applications to K8s
|
||||
4. Set up GitOps for declarative topic/user management (ArgoCD, Flux)
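For step 4, a minimal Argo CD `Application` sketch that syncs `KafkaTopic`/`KafkaUser` manifests from Git (the repo URL and path are placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: kafka-topics
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/kafka-gitops.git   # placeholder repo
    targetRevision: main
    path: topics/
  destination:
    server: https://kubernetes.default.svc
    namespace: kafka
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```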
|
||||
290
skills/kafka-mcp-integration/SKILL.md
Normal file
290
skills/kafka-mcp-integration/SKILL.md
Normal file
@@ -0,0 +1,290 @@
|
||||
---
|
||||
name: kafka-mcp-integration
|
||||
description: MCP server integration for Kafka operations. Auto-activates on keywords kafka mcp, mcp server, mcp configure, mcp setup, kanapuli, tuannvm, confluent mcp, kafka integration. Provides configuration examples and connection guidance for all 4 MCP servers.
|
||||
---
|
||||
|
||||
# Kafka MCP Server Integration
|
||||
|
||||
Expert knowledge for integrating SpecWeave with Kafka MCP (Model Context Protocol) servers. Supports 4 MCP server implementations with auto-detection and configuration guidance.
|
||||
|
||||
---
|
||||
|
||||
> **Code-First Recommendation**: For most Kafka automation tasks, [writing code is better than MCP](https://www.anthropic.com/engineering/code-execution-with-mcp) (98% token reduction). Use **kafkajs** or **kafka-node** directly:
|
||||
>
|
||||
> ```typescript
|
||||
> import { Kafka } from 'kafkajs';
|
||||
> const kafka = new Kafka({ brokers: ['localhost:9092'] });
|
||||
> const producer = kafka.producer();
|
||||
> await producer.connect();
> await producer.send({ topic: 'events', messages: [{ value: 'Hello' }] });
|
||||
> ```
|
||||
>
|
||||
> **When MCP IS useful**: Quick interactive debugging, topic exploration, Claude Desktop integration.
|
||||
>
|
||||
> **When to use code instead**: CI/CD pipelines, test automation, production scripts, anything that should be committed and reusable.
|
||||
|
||||
---
|
||||
|
||||
## Supported MCP Servers
|
||||
|
||||
### 1. kanapuli/mcp-kafka (Node.js)
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
npm install -g mcp-kafka
|
||||
```
|
||||
|
||||
**Capabilities**:
|
||||
- Authentication: SASL_PLAINTEXT, PLAINTEXT
|
||||
- Operations: produce, consume, list-topics, describe-topic, get-offsets
|
||||
- Best for: Basic Kafka operations, quick prototyping
|
||||
|
||||
**Configuration Example**:
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"kafka": {
|
||||
"command": "npx",
|
||||
"args": ["mcp-kafka"],
|
||||
"env": {
|
||||
"KAFKA_BROKERS": "localhost:9092",
|
||||
"KAFKA_SASL_MECHANISM": "plain",
|
||||
"KAFKA_SASL_USERNAME": "user",
|
||||
"KAFKA_SASL_PASSWORD": "password"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 2. tuannvm/kafka-mcp-server (Go)
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
go install github.com/tuannvm/kafka-mcp-server@latest
|
||||
```
|
||||
|
||||
**Capabilities**:
|
||||
- Authentication: SASL_SCRAM_SHA_256, SASL_SCRAM_SHA_512, SASL_SSL, PLAINTEXT
|
||||
- Operations: All CRUD operations, consumer group management, offset management
|
||||
- Best for: Production use, advanced SASL authentication
|
||||
|
||||
**Configuration Example**:
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"kafka": {
|
||||
"command": "kafka-mcp-server",
|
||||
"args": [
|
||||
"--brokers", "localhost:9092",
|
||||
"--sasl-mechanism", "SCRAM-SHA-256",
|
||||
"--sasl-username", "admin",
|
||||
"--sasl-password", "admin-secret"
|
||||
]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 3. Joel-hanson/kafka-mcp-server (Python)
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
pip install kafka-mcp-server
|
||||
```
|
||||
|
||||
**Capabilities**:
|
||||
- Authentication: SASL_PLAINTEXT, PLAINTEXT, SSL
|
||||
- Operations: produce, consume, list-topics, describe-topic
|
||||
- Best for: Claude Desktop integration, Python ecosystem
|
||||
|
||||
**Configuration Example**:
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"kafka": {
|
||||
"command": "python",
|
||||
"args": ["-m", "kafka_mcp_server"],
|
||||
"env": {
|
||||
"KAFKA_BOOTSTRAP_SERVERS": "localhost:9092"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### 4. Confluent Official MCP (Enterprise)
|
||||
|
||||
**Installation**:
|
||||
```bash
|
||||
confluent plugin install mcp-server
|
||||
```
|
||||
|
||||
**Capabilities**:
|
||||
- Authentication: OAuth, SASL_SCRAM, API Keys
|
||||
- Operations: All Kafka operations, Schema Registry, ksqlDB, Flink SQL
|
||||
- Advanced: Natural language interface, AI-powered query generation
|
||||
- Best for: Confluent Cloud, enterprise deployments
|
||||
|
||||
**Configuration Example**:
|
||||
```json
|
||||
{
|
||||
"mcpServers": {
|
||||
"kafka": {
|
||||
"command": "confluent",
|
||||
"args": ["mcp", "start"],
|
||||
"env": {
|
||||
"CONFLUENT_CLOUD_API_KEY": "your-api-key",
|
||||
"CONFLUENT_CLOUD_API_SECRET": "your-api-secret"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Auto-Detection
|
||||
|
||||
SpecWeave can auto-detect installed MCP servers:
|
||||
|
||||
```bash
|
||||
/specweave-kafka:mcp-configure
|
||||
```
|
||||
|
||||
This command:
|
||||
1. Scans for installed MCP servers (npm, go, pip, confluent CLI)
|
||||
2. Checks which servers are currently running
|
||||
3. Ranks servers by capabilities (Confluent > tuannvm > kanapuli > Joel-hanson)
|
||||
4. Generates recommended configuration
|
||||
5. Tests connection
|
||||
|
||||
## Quick Start
|
||||
|
||||
### Option 1: Auto-Configure (Recommended)
|
||||
|
||||
```bash
|
||||
/specweave-kafka:mcp-configure
|
||||
```
|
||||
|
||||
Interactive wizard guides you through:
|
||||
- MCP server selection (or auto-detect)
|
||||
- Broker URL configuration
|
||||
- Authentication setup
|
||||
- Connection testing
|
||||
|
||||
### Option 2: Manual Configuration
|
||||
|
||||
1. **Install preferred MCP server** (see installation commands above)
|
||||
|
||||
2. **Create `.mcp.json` configuration**:
|
||||
|
||||
```json
|
||||
{
|
||||
"serverType": "tuannvm",
|
||||
"brokerUrls": ["localhost:9092"],
|
||||
"authentication": {
|
||||
"mechanism": "SASL/SCRAM-SHA-256",
|
||||
"username": "admin",
|
||||
"password": "admin-secret"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
3. **Test connection**:
|
||||
|
||||
```bash
|
||||
# Via MCP server CLI
|
||||
kafka-mcp-server test-connection
|
||||
|
||||
# Or via SpecWeave
|
||||
node -e "import('./dist/lib/mcp/detector.js').then(async ({ MCPServerDetector }) => {
|
||||
const detector = new MCPServerDetector();
|
||||
const result = await detector.detectAll();
|
||||
console.log(JSON.stringify(result, null, 2));
|
||||
});"
|
||||
```
|
||||
|
||||
## MCP Server Comparison
|
||||
|
||||
| Feature | kanapuli | tuannvm | Joel-hanson | Confluent |
|
||||
|---------|----------|---------|-------------|-----------|
|
||||
| **Language** | Node.js | Go | Python | Official CLI |
|
||||
| **SASL_PLAINTEXT** | ✅ | ✅ | ✅ | ✅ |
|
||||
| **SCRAM-SHA-256** | ❌ | ✅ | ❌ | ✅ |
|
||||
| **SCRAM-SHA-512** | ❌ | ✅ | ❌ | ✅ |
|
||||
| **mTLS/SSL** | ❌ | ✅ | ✅ | ✅ |
|
||||
| **OAuth** | ❌ | ❌ | ❌ | ✅ |
|
||||
| **Consumer Groups** | ❌ | ✅ | ❌ | ✅ |
|
||||
| **Offset Mgmt** | ❌ | ✅ | ❌ | ✅ |
|
||||
| **Schema Registry** | ❌ | ❌ | ❌ | ✅ |
|
||||
| **ksqlDB** | ❌ | ❌ | ❌ | ✅ |
|
||||
| **Flink SQL** | ❌ | ❌ | ❌ | ✅ |
|
||||
| **AI/NL Interface** | ❌ | ❌ | ❌ | ✅ |
|
||||
| **Best For** | Prototyping | Production | Desktop | Enterprise |
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### MCP Server Not Detected
|
||||
|
||||
```bash
|
||||
# Check if MCP server installed
|
||||
npm list -g mcp-kafka # kanapuli
|
||||
which kafka-mcp-server # tuannvm
|
||||
pip show kafka-mcp-server # Joel-hanson
|
||||
confluent version # Confluent
|
||||
```
|
||||
|
||||
### Connection Refused
|
||||
|
||||
- Verify Kafka broker is running: `kcat -L -b localhost:9092`
|
||||
- Check firewall rules
|
||||
- Validate broker URL (correct host:port)
|
||||
|
||||
### Authentication Failed
|
||||
|
||||
- Double-check credentials (username, password, API keys)
|
||||
- Verify SASL mechanism matches broker configuration
|
||||
- Check broker logs for authentication errors
|
||||
|
||||
### Operations Not Working
|
||||
|
||||
- Ensure MCP server supports the operation (see comparison table)
|
||||
- Check broker ACLs (permissions for the authenticated user)
|
||||
- Verify topic exists: `/specweave-kafka:mcp-configure list-topics`
|
||||
|
||||
## Operations via MCP
|
||||
|
||||
Once configured, you can perform Kafka operations via MCP:
|
||||
|
||||
```typescript
|
||||
import { MCPServerDetector } from './lib/mcp/detector';
|
||||
|
||||
const detector = new MCPServerDetector();
|
||||
const result = await detector.detectAll();
|
||||
|
||||
// Use recommended server
|
||||
if (result.recommended) {
|
||||
console.log(`Using ${result.recommended} MCP server`);
|
||||
console.log(`Reason: ${result.rankingReason}`);
|
||||
}
|
||||
```
|
||||
|
||||
## Security Best Practices
|
||||
|
||||
1. **Never commit credentials** - Use environment variables or a secrets manager (see the sketch after this list)
|
||||
2. **Use strongest auth** - Prefer SCRAM-SHA-512 > SCRAM-SHA-256 > PLAINTEXT
|
||||
3. **Enable TLS/SSL** - Encrypt communication with broker
|
||||
4. **Rotate credentials** - Regularly update passwords and API keys
|
||||
5. **Least privilege** - Grant only necessary ACLs to MCP server user
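For point 1, one hypothetical pattern is to keep credentials in an untracked env file and export them in the shell that launches the MCP client, so `.mcp.json` never contains secrets (file and variable names are placeholders; check which environment variables your chosen server actually reads):

```bash
# Keep Kafka credentials out of version control
cat > .env.kafka-mcp <<'EOF'
KAFKA_SASL_USERNAME=admin
KAFKA_SASL_PASSWORD=change-me
EOF
echo ".env.kafka-mcp" >> .gitignore

# Export everything in the file into the current shell before starting the MCP client
set -a; source .env.kafka-mcp; set +a
```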
|
||||
|
||||
## Related Commands
|
||||
|
||||
- `/specweave-kafka:mcp-configure` - Interactive MCP server setup
|
||||
- `/specweave-kafka:dev-env start` - Start local Kafka for testing
|
||||
- `/specweave-kafka:deploy` - Deploy production Kafka cluster
|
||||
|
||||
## External Links
|
||||
|
||||
- [kanapuli/mcp-kafka](https://github.com/kanapuli/mcp-kafka)
|
||||
- [tuannvm/kafka-mcp-server](https://github.com/tuannvm/kafka-mcp-server)
|
||||
- [Joel-hanson/kafka-mcp-server](https://github.com/Joel-hanson/kafka-mcp-server)
|
||||
- [Confluent MCP Documentation](https://docs.confluent.io/platform/current/mcp/)
|
||||
- [MCP Protocol Specification](https://modelcontextprotocol.org/)
|
||||
576
skills/kafka-observability/SKILL.md
Normal file
576
skills/kafka-observability/SKILL.md
Normal file
@@ -0,0 +1,576 @@
|
||||
---
|
||||
name: kafka-observability
|
||||
description: Kafka monitoring and observability expert. Guides Prometheus + Grafana setup, JMX metrics, alerting rules, and dashboard configuration. Activates for kafka monitoring, prometheus, grafana, kafka metrics, jmx exporter, kafka observability, monitoring setup, kafka dashboards, alerting, kafka performance monitoring, metrics collection.
|
||||
---
|
||||
|
||||
# Kafka Monitoring & Observability
|
||||
|
||||
Expert guidance for implementing comprehensive monitoring and observability for Apache Kafka using Prometheus and Grafana.
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
I activate when you need help with:
|
||||
- **Monitoring setup**: "Set up Kafka monitoring", "configure Prometheus for Kafka", "Grafana dashboards for Kafka"
|
||||
- **Metrics collection**: "Kafka JMX metrics", "export Kafka metrics to Prometheus"
|
||||
- **Alerting**: "Kafka alerting rules", "alert on under-replicated partitions", "critical Kafka metrics"
|
||||
- **Troubleshooting**: "Monitor Kafka performance", "track consumer lag", "broker health monitoring"
|
||||
|
||||
## What I Know
|
||||
|
||||
### Available Monitoring Components
|
||||
|
||||
This plugin provides a complete monitoring stack:
|
||||
|
||||
#### 1. **Prometheus JMX Exporter Configuration**
|
||||
- **Location**: `plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml`
|
||||
- **Purpose**: Export Kafka JMX metrics to Prometheus format
|
||||
- **Metrics Exported**:
|
||||
- Broker topic metrics (bytes in/out, messages in, request rate)
|
||||
- Replica manager (under-replicated partitions, ISR shrinks/expands)
|
||||
- Controller metrics (active controller, offline partitions, leader elections)
|
||||
- Request metrics (produce/fetch latency)
|
||||
- Log metrics (flush rate, flush latency)
|
||||
- JVM metrics (heap, GC, threads, file descriptors)
|
||||
|
||||
#### 2. **Grafana Dashboards** (5 Dashboards)
|
||||
- **Location**: `plugins/specweave-kafka/monitoring/grafana/dashboards/`
|
||||
- **Dashboards**:
|
||||
1. **kafka-cluster-overview.json** - Cluster health and throughput
|
||||
2. **kafka-broker-metrics.json** - Per-broker performance
|
||||
3. **kafka-consumer-lag.json** - Consumer lag monitoring
|
||||
4. **kafka-topic-metrics.json** - Topic-level metrics
|
||||
5. **kafka-jvm-metrics.json** - JVM health (heap, GC, threads)
|
||||
|
||||
#### 3. **Grafana Provisioning**
|
||||
- **Location**: `plugins/specweave-kafka/monitoring/grafana/provisioning/`
|
||||
- **Files**:
|
||||
- `dashboards/kafka.yml` - Dashboard provisioning config
|
||||
- `datasources/prometheus.yml` - Prometheus datasource config
|
||||
|
||||
## Setup Workflow 1: JMX Exporter (Self-Hosted Kafka)
|
||||
|
||||
For Kafka running on VMs or bare metal (non-Kubernetes).
|
||||
|
||||
### Step 1: Download JMX Prometheus Agent
|
||||
|
||||
```bash
|
||||
# Download JMX Prometheus agent JAR
|
||||
cd /opt
|
||||
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar
|
||||
|
||||
# Copy JMX Exporter config
|
||||
cp plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml /opt/kafka-jmx-exporter.yml
|
||||
```
|
||||
|
||||
### Step 2: Configure Kafka Broker
|
||||
|
||||
Add the JMX exporter Java agent to the Kafka JVM options, either in the systemd unit or in the startup script:
|
||||
|
||||
```bash
|
||||
# Edit Kafka startup (e.g., /etc/systemd/system/kafka.service)
|
||||
[Service]
|
||||
Environment="KAFKA_OPTS=-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
|
||||
```
|
||||
|
||||
Or add to `kafka-server-start.sh`:
|
||||
|
||||
```bash
|
||||
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
|
||||
```
|
||||
|
||||
### Step 3: Restart Kafka and Verify
|
||||
|
||||
```bash
|
||||
# Restart Kafka broker
|
||||
sudo systemctl restart kafka
|
||||
|
||||
# Verify JMX exporter is running (port 7071)
|
||||
curl localhost:7071/metrics | grep kafka_server
|
||||
|
||||
# Expected output: kafka_server_broker_topic_metrics_bytesin_total{...} 12345
|
||||
```
|
||||
|
||||
### Step 4: Configure Prometheus Scraping
|
||||
|
||||
Add Kafka brokers to Prometheus config:
|
||||
|
||||
```yaml
|
||||
# prometheus.yml
|
||||
scrape_configs:
|
||||
- job_name: 'kafka'
|
||||
static_configs:
|
||||
- targets:
|
||||
- 'kafka-broker-1:7071'
|
||||
- 'kafka-broker-2:7071'
|
||||
- 'kafka-broker-3:7071'
|
||||
scrape_interval: 30s
|
||||
```
|
||||
|
||||
```bash
|
||||
# Reload Prometheus
|
||||
sudo systemctl reload prometheus
|
||||
|
||||
# OR send SIGHUP
|
||||
kill -HUP $(pidof prometheus)
|
||||
|
||||
# Verify scraping
|
||||
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
|
||||
```
|
||||
|
||||
## Setup Workflow 2: Strimzi (Kubernetes)
|
||||
|
||||
For Kafka running on Kubernetes with Strimzi Operator.
|
||||
|
||||
### Step 1: Create JMX Exporter ConfigMap
|
||||
|
||||
```bash
|
||||
# Create ConfigMap from JMX exporter config
|
||||
kubectl create configmap kafka-metrics \
|
||||
--from-file=kafka-metrics-config.yml=plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml \
|
||||
-n kafka
|
||||
```
|
||||
|
||||
### Step 2: Configure Kafka CR with Metrics
|
||||
|
||||
```yaml
|
||||
# kafka-cluster.yaml (add metricsConfig section)
|
||||
apiVersion: kafka.strimzi.io/v1beta2
|
||||
kind: Kafka
|
||||
metadata:
|
||||
name: my-kafka-cluster
|
||||
namespace: kafka
|
||||
spec:
|
||||
kafka:
|
||||
version: 3.7.0
|
||||
replicas: 3
|
||||
|
||||
# ... other config ...
|
||||
|
||||
metricsConfig:
|
||||
type: jmxPrometheusExporter
|
||||
valueFrom:
|
||||
configMapKeyRef:
|
||||
name: kafka-metrics
|
||||
key: kafka-metrics-config.yml
|
||||
```
|
||||
|
||||
```bash
|
||||
# Apply updated Kafka CR
|
||||
kubectl apply -f kafka-cluster.yaml
|
||||
|
||||
# Verify metrics endpoint (wait for rolling restart)
|
||||
kubectl exec -it kafka-my-kafka-cluster-0 -n kafka -- curl localhost:9404/metrics | grep kafka_server
|
||||
```
|
||||
|
||||
### Step 3: Install Prometheus Operator (if not installed)
|
||||
|
||||
```bash
|
||||
# Add Prometheus Community Helm repo
|
||||
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
|
||||
helm repo update
|
||||
|
||||
# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
|
||||
helm install prometheus prometheus-community/kube-prometheus-stack \
|
||||
--namespace monitoring \
|
||||
--create-namespace \
|
||||
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
|
||||
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
|
||||
```
|
||||
|
||||
### Step 4: Create PodMonitor for Kafka
|
||||
|
||||
```yaml
|
||||
# kafka-podmonitor.yaml
|
||||
apiVersion: monitoring.coreos.com/v1
|
||||
kind: PodMonitor
|
||||
metadata:
|
||||
name: kafka-metrics
|
||||
namespace: kafka
|
||||
labels:
|
||||
app: strimzi
|
||||
spec:
|
||||
selector:
|
||||
matchLabels:
|
||||
strimzi.io/kind: Kafka
|
||||
podMetricsEndpoints:
|
||||
- port: tcp-prometheus
|
||||
interval: 30s
|
||||
```
|
||||
|
||||
```bash
|
||||
# Apply PodMonitor
|
||||
kubectl apply -f kafka-podmonitor.yaml
|
||||
|
||||
# Verify Prometheus is scraping Kafka
|
||||
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
|
||||
# Open: http://localhost:9090/targets
|
||||
# Should see kafka-metrics/* targets
|
||||
```
|
||||
|
||||
## Setup Workflow 3: Grafana Dashboards
|
||||
|
||||
### Installation (Docker Compose)
|
||||
|
||||
If using Docker Compose for local development:
|
||||
|
||||
```yaml
|
||||
# docker-compose.yml (add to existing Kafka setup)
|
||||
version: '3.8'
|
||||
services:
|
||||
# ... Kafka services ...
|
||||
|
||||
prometheus:
|
||||
image: prom/prometheus:v2.48.0
|
||||
ports:
|
||||
- "9090:9090"
|
||||
volumes:
|
||||
- ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
|
||||
- prometheus-data:/prometheus
|
||||
command:
|
||||
- '--config.file=/etc/prometheus/prometheus.yml'
|
||||
- '--storage.tsdb.path=/prometheus'
|
||||
|
||||
grafana:
|
||||
image: grafana/grafana:10.2.0
|
||||
ports:
|
||||
- "3000:3000"
|
||||
environment:
|
||||
- GF_SECURITY_ADMIN_PASSWORD=admin
|
||||
volumes:
|
||||
- ./monitoring/grafana/provisioning:/etc/grafana/provisioning
|
||||
- ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
|
||||
- grafana-data:/var/lib/grafana
|
||||
|
||||
volumes:
|
||||
prometheus-data:
|
||||
grafana-data:
|
||||
```
|
||||
|
||||
```bash
|
||||
# Start monitoring stack
|
||||
docker-compose up -d prometheus grafana
|
||||
|
||||
# Access Grafana
|
||||
# URL: http://localhost:3000
|
||||
# Username: admin
|
||||
# Password: admin
|
||||
```
|
||||
|
||||
### Installation (Kubernetes)
|
||||
|
||||
Dashboards are auto-provisioned if using kube-prometheus-stack:
|
||||
|
||||
```bash
|
||||
# Create ConfigMaps for each dashboard
|
||||
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
|
||||
name=$(basename "$dashboard" .json)
|
||||
kubectl create configmap "kafka-dashboard-$name" \
|
||||
--from-file="$dashboard" \
|
||||
-n monitoring \
|
||||
--dry-run=client -o yaml | kubectl apply -f -
|
||||
done
|
||||
|
||||
# Label ConfigMaps for Grafana auto-discovery
|
||||
kubectl get configmap -n monitoring -o name | grep kafka-dashboard- | xargs -I{} kubectl label {} -n monitoring grafana_dashboard=1 --overwrite
|
||||
|
||||
# Grafana will auto-import dashboards (wait 30-60 seconds)
|
||||
|
||||
# Access Grafana
|
||||
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
|
||||
# URL: http://localhost:3000
|
||||
# Username: admin
|
||||
# Password: prom-operator (default kube-prometheus-stack password)
|
||||
```
|
||||
|
||||
### Manual Dashboard Import
|
||||
|
||||
If auto-provisioning doesn't work:
|
||||
|
||||
```bash
|
||||
# 1. Access Grafana UI
|
||||
# 2. Go to: Dashboards → Import
|
||||
# 3. Upload JSON files from:
|
||||
# plugins/specweave-kafka/monitoring/grafana/dashboards/
|
||||
|
||||
# Or use Grafana API
|
||||
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
|
||||
curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
|
||||
-H "Content-Type: application/json" \
|
||||
-d "$(jq '{dashboard: ., overwrite: true}' "$dashboard")"
|
||||
done
|
||||
```
|
||||
|
||||
## Dashboard Overview
|
||||
|
||||
### 1. **Kafka Cluster Overview** (`kafka-cluster-overview.json`)
|
||||
|
||||
**Purpose**: High-level cluster health
|
||||
|
||||
**Key Metrics**:
|
||||
- Active Controller Count (should be exactly 1)
|
||||
- Under-Replicated Partitions (should be 0) ⚠️ CRITICAL
|
||||
- Offline Partitions Count (should be 0) ⚠️ CRITICAL
|
||||
- Unclean Leader Elections (should be 0)
|
||||
- Cluster Throughput (bytes in/out per second)
|
||||
- Request Rate (produce, fetch requests per second)
|
||||
- ISR Changes (shrinks/expands)
|
||||
- Leader Election Rate
|
||||
|
||||
**Use When**: Checking overall cluster health
|
||||
|
||||
### 2. **Kafka Broker Metrics** (`kafka-broker-metrics.json`)
|
||||
|
||||
**Purpose**: Per-broker performance
|
||||
|
||||
**Key Metrics**:
|
||||
- Broker CPU Usage (% utilization)
|
||||
- Broker Heap Memory Usage
|
||||
- Broker Network Throughput (bytes in/out)
|
||||
- Request Handler Idle Percentage (low = CPU saturation)
|
||||
- File Descriptors (open vs max)
|
||||
- Log Flush Latency (p50, p99)
|
||||
- JVM GC Collection Count/Time
|
||||
|
||||
**Use When**: Investigating broker performance issues
|
||||
|
||||
### 3. **Kafka Consumer Lag** (`kafka-consumer-lag.json`)
|
||||
|
||||
**Purpose**: Consumer lag monitoring
|
||||
|
||||
**Key Metrics**:
|
||||
- Consumer Lag per Topic/Partition
|
||||
- Total Lag per Consumer Group
|
||||
- Offset Commit Rate
|
||||
- Current Consumer Offset
|
||||
- Log End Offset (producer offset)
|
||||
- Consumer Group Members
|
||||
|
||||
**Use When**: Troubleshooting slow consumers or lag spikes
|
||||
|
||||
### 4. **Kafka Topic Metrics** (`kafka-topic-metrics.json`)
|
||||
|
||||
**Purpose**: Topic-level metrics
|
||||
|
||||
**Key Metrics**:
|
||||
- Messages Produced per Topic
|
||||
- Bytes per Topic (in/out)
|
||||
- Partition Count per Topic
|
||||
- Replication Factor
|
||||
- In-Sync Replicas
|
||||
- Log Size per Partition
|
||||
- Current Offset per Partition
|
||||
- Partition Leader Distribution
|
||||
|
||||
**Use When**: Analyzing topic throughput and hotspots
|
||||
|
||||
### 5. **Kafka JVM Metrics** (`kafka-jvm-metrics.json`)
|
||||
|
||||
**Purpose**: JVM health monitoring
|
||||
|
||||
**Key Metrics**:
|
||||
- Heap Memory Usage (used vs max)
|
||||
- Heap Utilization Percentage
|
||||
- GC Collection Rate (collections/sec)
|
||||
- GC Collection Time (ms/sec)
|
||||
- JVM Thread Count
|
||||
- Heap Memory by Pool (young gen, old gen, survivor)
|
||||
- Off-Heap Memory Usage (metaspace, code cache)
|
||||
- GC Pause Time Percentiles (p50, p95, p99)
|
||||
|
||||
**Use When**: Investigating memory leaks or GC pauses
|
||||
|
||||
## Critical Alerts Configuration
|
||||
|
||||
Create Prometheus alerting rules for critical Kafka metrics:
|
||||
|
||||
```yaml
|
||||
# kafka-alerts.yml
|
||||
apiVersion: monitoring.coreos.com/v1
|
||||
kind: PrometheusRule
|
||||
metadata:
|
||||
name: kafka-alerts
|
||||
namespace: monitoring
|
||||
spec:
|
||||
groups:
|
||||
- name: kafka.rules
|
||||
interval: 30s
|
||||
rules:
|
||||
# CRITICAL: Under-Replicated Partitions
|
||||
- alert: KafkaUnderReplicatedPartitions
|
||||
expr: sum(kafka_server_replica_manager_under_replicated_partitions) > 0
|
||||
for: 5m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Kafka has under-replicated partitions"
|
||||
description: "{{ $value }} partitions are under-replicated. Data loss risk!"
|
||||
|
||||
# CRITICAL: Offline Partitions
|
||||
- alert: KafkaOfflinePartitions
|
||||
expr: kafka_controller_offline_partitions_count > 0
|
||||
for: 1m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "Kafka has offline partitions"
|
||||
description: "{{ $value }} partitions are offline. Service degradation!"
|
||||
|
||||
# CRITICAL: No Active Controller
|
||||
- alert: KafkaNoActiveController
|
||||
expr: kafka_controller_active_controller_count == 0
|
||||
for: 1m
|
||||
labels:
|
||||
severity: critical
|
||||
annotations:
|
||||
summary: "No active Kafka controller"
|
||||
description: "Cluster has no active controller. Cannot perform administrative operations!"
|
||||
|
||||
# WARNING: High Consumer Lag
|
||||
- alert: KafkaConsumerLagHigh
|
||||
expr: sum by (consumergroup) (kafka_consumergroup_lag) > 10000
|
||||
for: 10m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Consumer group {{ $labels.consumergroup }} has high lag"
|
||||
description: "Lag is {{ $value }} messages. Consumers may be slow."
|
||||
|
||||
# WARNING: High CPU Usage
|
||||
- alert: KafkaBrokerHighCPU
|
||||
expr: os_process_cpu_load{job="kafka"} > 0.8
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Broker {{ $labels.instance }} has high CPU usage"
|
||||
description: "CPU usage is {{ $value | humanizePercentage }}. Consider scaling."
|
||||
|
||||
# WARNING: Low Heap Memory
|
||||
- alert: KafkaBrokerLowHeapMemory
|
||||
expr: jvm_memory_heap_used_bytes{job="kafka"} / jvm_memory_heap_max_bytes{job="kafka"} > 0.9
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Broker {{ $labels.instance }} has low heap memory"
|
||||
description: "Heap usage is {{ $value | humanizePercentage }}. Risk of OOM!"
|
||||
|
||||
# WARNING: High GC Time
|
||||
- alert: KafkaBrokerHighGCTime
|
||||
expr: rate(jvm_gc_collection_time_ms_total{job="kafka"}[5m]) > 500
|
||||
for: 5m
|
||||
labels:
|
||||
severity: warning
|
||||
annotations:
|
||||
summary: "Broker {{ $labels.instance }} spending too much time in GC"
|
||||
description: "GC time is {{ $value }}ms/sec. Application pauses likely."
|
||||
```
|
||||
|
||||
```bash
|
||||
# Apply alerts (Kubernetes)
|
||||
kubectl apply -f kafka-alerts.yml
|
||||
|
||||
# Verify alerts loaded
|
||||
kubectl get prometheusrules -n monitoring
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### "Prometheus not scraping Kafka metrics"
|
||||
|
||||
**Symptoms**: No Kafka metrics in Prometheus
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# 1. Verify JMX exporter is running
|
||||
curl http://kafka-broker:7071/metrics
|
||||
|
||||
# 2. Check Prometheus targets
|
||||
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
|
||||
|
||||
# 3. Check Prometheus logs
|
||||
kubectl logs -n monitoring prometheus-kube-prometheus-prometheus-0
|
||||
|
||||
# Common issues:
|
||||
# - Firewall blocking port 7071
|
||||
# - Incorrect scrape config
|
||||
# - Kafka broker not running
|
||||
```
|
||||
|
||||
### "Grafana dashboards not loading"
|
||||
|
||||
**Symptoms**: Dashboards show "No data"
|
||||
|
||||
**Fix**:
|
||||
```bash
|
||||
# 1. Verify Prometheus datasource
|
||||
# Grafana UI → Configuration → Data Sources → Prometheus → Test
|
||||
|
||||
# 2. Check if Kafka metrics exist in Prometheus
|
||||
# Prometheus UI → Graph → Enter: kafka_server_broker_topic_metrics_bytesin_total
|
||||
|
||||
# 3. Verify dashboard queries match your Prometheus job name
|
||||
# Dashboard panels use job="kafka" by default
|
||||
# If your job name is different, update dashboard JSON
|
||||
```
|
||||
|
||||
### "Consumer lag metrics missing"
|
||||
|
||||
**Symptoms**: Consumer lag dashboard empty
|
||||
|
||||
**Fix**:
|
||||
Consumer lag metrics require **Kafka Exporter** (separate from JMX Exporter):
|
||||
|
||||
```bash
|
||||
# Install Kafka Exporter (Kubernetes)
|
||||
helm install kafka-exporter prometheus-community/prometheus-kafka-exporter \
|
||||
--namespace monitoring \
|
||||
--set kafkaServer={kafka-bootstrap:9092}
|
||||
|
||||
# Or run as Docker container
|
||||
docker run -d -p 9308:9308 \
|
||||
danielqsj/kafka-exporter \
|
||||
--kafka.server=kafka:9092 \
|
||||
--web.listen-address=:9308
|
||||
|
||||
```

```yaml
# Add to Prometheus scrape config
|
||||
scrape_configs:
|
||||
- job_name: 'kafka-exporter'
|
||||
static_configs:
|
||||
- targets: ['kafka-exporter:9308']
|
||||
```
|
||||
|
||||
## Integration with Other Skills
|
||||
|
||||
- **kafka-iac-deployment**: Set up monitoring during Terraform deployment
|
||||
- **kafka-kubernetes**: Configure monitoring for Strimzi Kafka on K8s
|
||||
- **kafka-architecture**: Use cluster sizing metrics to validate capacity planning
|
||||
- **kafka-cli-tools**: Use kcat to generate test traffic and verify metrics
|
||||
|
||||
## Quick Reference Commands
|
||||
|
||||
```bash
|
||||
# Check JMX exporter metrics
|
||||
curl http://localhost:7071/metrics | grep -E "(kafka_server|kafka_controller)"
|
||||
|
||||
# Prometheus query examples
|
||||
curl -g 'http://localhost:9090/api/v1/query?query=kafka_server_replica_manager_under_replicated_partitions'
|
||||
|
||||
# Grafana dashboard export
|
||||
curl http://admin:admin@localhost:3000/api/dashboards/uid/kafka-cluster-overview | jq .dashboard > backup.json
|
||||
|
||||
# Reload Prometheus config
|
||||
kill -HUP $(pidof prometheus)
|
||||
|
||||
# Check Prometheus targets
|
||||
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
**Next Steps After Monitoring Setup**:
|
||||
1. Review all 5 Grafana dashboards to familiarize yourself with metrics
|
||||
2. Set up alerting (Slack, PagerDuty, email); see the sketch below
|
||||
3. Create runbooks for critical alerts (under-replicated partitions, offline partitions, no controller)
|
||||
4. Monitor for 7 days to establish baseline metrics
|
||||
5. Tune JVM settings based on GC metrics
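For step 2, a minimal Alertmanager sketch that routes the critical Kafka alerts above to Slack (the webhook URL and channel are placeholders):

```yaml
# alertmanager.yml (fragment)
route:
  receiver: kafka-oncall
  routes:
    - matchers:
        - severity="critical"
      receiver: kafka-oncall
receivers:
  - name: kafka-oncall
    slack_configs:
      - api_url: https://hooks.slack.com/services/T000/B000/XXXX   # placeholder webhook
        channel: '#kafka-alerts'
        send_resolved: true
```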
|
||||