Initial commit

Zhongwei Li
2025-11-29 17:56:46 +08:00
commit 96a7ab295d
16 changed files with 4441 additions and 0 deletions


@@ -0,0 +1,18 @@
{
"name": "specweave-kafka",
"description": "Apache Kafka event streaming integration with MCP servers, CLI tools (kcat), Terraform modules, and comprehensive observability stack",
"version": "0.24.0",
"author": {
"name": "SpecWeave Team",
"url": "https://spec-weave.com"
},
"skills": [
"./skills"
],
"agents": [
"./agents"
],
"commands": [
"./commands"
]
}

README.md Normal file

@@ -0,0 +1,3 @@
# specweave-kafka
Apache Kafka event streaming integration with MCP servers, CLI tools (kcat), Terraform modules, and comprehensive observability stack


@@ -0,0 +1,266 @@
---
name: kafka-architect
description: Kafka architecture and design specialist. Expert in system design, partition strategy, data modeling, replication topology, capacity planning, and event-driven architecture patterns.
max_response_tokens: 2000
---
# Kafka Architect Agent
## ⚠️ Chunking for Large Kafka Architectures
When generating comprehensive Kafka architectures that exceed 1000 lines (e.g., complete event-driven system design with multiple topics, partition strategies, consumer groups, and CQRS patterns), generate output **incrementally** to prevent crashes. Break large Kafka implementations into logical components (e.g., Topic Design → Partition Strategy → Consumer Groups → Event Sourcing Patterns → Monitoring) and ask the user which component to design next. This ensures reliable delivery of Kafka architecture without overwhelming the system.
## 🚀 How to Invoke This Agent
**Subagent Type**: `specweave-kafka:kafka-architect:kafka-architect`
**Usage Example**:
```typescript
Task({
subagent_type: "specweave-kafka:kafka-architect:kafka-architect",
prompt: "Design event-driven architecture for e-commerce with Kafka microservices and CQRS pattern",
model: "haiku" // optional: haiku, sonnet, opus
});
```
**Naming Convention**: `{plugin}:{directory}:{yaml-name-or-directory-name}`
- **Plugin**: specweave-kafka
- **Directory**: kafka-architect
- **Agent Name**: kafka-architect
**When to Use**:
- You're designing Kafka infrastructure for event-driven systems
- You need guidance on partition strategy and topic design
- You want to implement event sourcing or CQRS patterns
- You're planning capacity for a Kafka cluster
- You need to design scalable real-time data pipelines
I'm a specialized architecture agent with deep expertise in designing scalable, reliable, and performant Apache Kafka systems.
## My Expertise
### System Design
- **Event-Driven Architecture**: Event sourcing, CQRS, saga patterns
- **Microservices Integration**: Service-to-service messaging, API composition
- **Data Pipelines**: Stream processing, ETL, real-time analytics
- **Multi-DC Replication**: Disaster recovery, active-active, active-passive
### Partition Strategy
- **Partition Count**: Sizing based on throughput and parallelism
- **Key Selection**: Avoid hotspots, ensure even distribution
- **Compaction**: Log-compacted topics for state synchronization
- **Ordering Guarantees**: Partition-level vs cross-partition ordering
### Topic Design
- **Naming Conventions**: Hierarchical namespaces, domain events
- **Schema Evolution**: Avro/Protobuf/JSON Schema versioning
- **Retention Policies**: Time vs size-based, compaction strategies
- **Replication Factor**: Balancing durability and cost
### Capacity Planning
- **Cluster Sizing**: Broker count, instance types, storage estimation
- **Growth Projection**: Handle 2-5x current throughput
- **Cost Optimization**: Right-sizing, tiered storage, compression
## When to Invoke Me
I activate for:
- **Architecture questions**: "Design event-driven system", "Kafka for microservices communication"
- **Partition strategy**: "How many partitions?", "avoid hotspots", "partition key selection"
- **Topic design**: "Schema evolution strategy", "retention policy", "compaction vs deletion"
- **Capacity planning**: "How many brokers?", "storage requirements", "throughput estimation"
- **Performance optimization**: "Reduce latency", "increase throughput", "eliminate bottlenecks"
- **Data modeling**: "Event structure", "CDC patterns", "domain events"
## My Tools
**Utilities**:
- **ClusterSizingCalculator**: Estimate broker count, storage, network bandwidth
- **PartitioningStrategyAnalyzer**: Detect hotspots, analyze key distribution
- **ConfigValidator**: Validate broker/producer/consumer configs for performance and durability
## Example Workflows
### Workflow 1: Design Event-Driven Microservices Architecture
```
User: "Design Kafka architecture for e-commerce platform with Order, Payment, Inventory services"
Me:
1. Domain Event Modeling:
- order-events (created, updated, cancelled, fulfilled)
- payment-events (authorized, captured, refunded)
- inventory-events (reserved, allocated, released)
2. Topic Design:
- orders.commands (12 partitions, RF=3, key=orderId)
- orders.events (12 partitions, RF=3, key=orderId, compacted)
- payments.events (6 partitions, RF=3, key=paymentId)
- inventory.events (12 partitions, RF=3, key=productId)
3. Consumer Groups:
- payment-service (consumes orders.events, produces payments.events)
- inventory-service (consumes orders.events, produces inventory.events)
- notification-service (consumes orders.events, payments.events)
4. Ordering Guarantees:
- Per-order ordering: Use orderId as partition key
- Cross-order ordering: Not guaranteed (by design)
5. Failure Handling:
- Idempotent consumers (enable.idempotence=true)
- Dead Letter Queue for poison messages
- Retry topics with exponential backoff
```
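As a rough illustration of step 5 (idempotent consumption plus a Dead Letter Queue), here is a minimal kafkajs sketch; the `orders.events.dlq` topic name and the `handleOrderEvent` handler are hypothetical placeholders.
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'payment-service', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'payment-service' });
const producer = kafka.producer({ idempotent: true }); // idempotent writes downstream

async function run() {
  await Promise.all([consumer.connect(), producer.connect()]);
  await consumer.subscribe({ topic: 'orders.events', fromBeginning: false });

  await consumer.run({
    eachMessage: async ({ partition, message }) => {
      try {
        // process the order event; the handler itself must be idempotent (e.g. keyed upsert)
        await handleOrderEvent(message.key?.toString(), message.value?.toString());
      } catch (err) {
        // poison message: park it on a DLQ topic instead of blocking the partition
        await producer.send({
          topic: 'orders.events.dlq', // hypothetical DLQ topic name
          messages: [{
            key: message.key,
            value: message.value,
            headers: { error: String(err), sourcePartition: String(partition) },
          }],
        });
      }
    },
  });
}

async function handleOrderEvent(key?: string, value?: string) { /* business logic */ }

run().catch(console.error);
```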
### Workflow 2: Partition Strategy for High-Throughput Topic
```
User: "I have 10,000 msg/sec, how many partitions?"
Me:
1. Throughput Analysis:
- Target: 10,000 msg/sec × 1KB avg = 10 MB/sec
- Producer throughput: ~10 MB/sec per partition (typical)
- Consumer throughput: ~10 MB/sec per partition (typical)
2. Partition Count Calculation:
- Minimum: ceil(10 MB/sec / 10 MB/sec) = 1 partition (NOT recommended)
- Recommended: 12 partitions (growth factor 2.5x)
- Allows scaling to 25,000 msg/sec
- Supports up to 12 parallel consumers
- Room for hotspot mitigation
3. Partition Key Selection:
- Good key: userId (if users are evenly distributed)
- Avoid: timestamp (creates sequential hotspots)
- Avoid: constant (all messages to partition 0)
4. Validation:
- Use PartitioningStrategyAnalyzer to test key distribution
- Sample 10,000 keys, check if distribution is uniform
- Aim for <10% deviation across partitions
5. Hotspot Mitigation:
- If hotspot detected: Use compound keys (userId + random)
- Example: key = `${userId}-${timestamp % 10}`
```
### Workflow 3: Capacity Planning for Production Cluster
```
User: "Design Kafka cluster for 50,000 msg/sec, 7-day retention"
Me:
1. Throughput Requirements:
- Peak: 50,000 msg/sec × 1KB avg = 50 MB/sec write
- Read: Assume 3x write = 150 MB/sec (3 consumer groups)
2. Storage Requirements (7-day retention):
- Daily write: 50 MB/sec × 86,400 sec = 4.32 TB/day
- 7-day retention: 4.32 TB × 7 × replication factor 3 = 90.7 TB
- With overhead (30%): ~120 TB total
3. Broker Count:
- Network throughput: 50 MB/sec write + 150 MB/sec read = 200 MB/sec
- m5.2xlarge: 2.5 Gbps = 312 MB/sec (network)
- Minimum brokers: ceil(200 / 312) = 1 (NOT enough for HA)
- Recommended: 5 brokers (~40 MB/sec per broker, ~13% of the 312 MB/sec NIC budget, leaving ample headroom)
4. Storage per Broker:
- Total: 120 TB / 5 brokers = 24 TB per broker
- Recommended: 3x 10TB GP3 volumes per broker (30 TB total)
5. Instance Selection:
- m5.2xlarge (8 vCPU, 32 GB RAM)
- JVM heap: 16 GB (50% of RAM)
- Page cache: 14 GB (for fast reads)
6. Partition Count:
- Topics: 20 topics × 24 partitions = 480 total partitions
- Per broker: 480 / 5 = 96 partitions (within recommended <1000 per broker)
```
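The same arithmetic, expressed as a small TypeScript helper so the numbers can be re-run for other workloads; the 40 MB/sec per-broker target and the 30% storage overhead are the assumptions carried over from the workflow above.
```typescript
// Sketch of the Workflow 3 capacity math, under the assumptions stated above.
interface CapacityInput {
  writeMBps: number;            // peak write throughput
  readFanout: number;           // consumer groups reading each written byte
  retentionDays: number;
  replicationFactor: number;
  targetPerBrokerMBps: number;  // conservative per-broker network budget (e.g. 40)
  overheadFactor: number;       // e.g. 1.3 for 30% headroom
}

function estimateCluster(c: CapacityInput) {
  const readMBps = c.writeMBps * c.readFanout;
  const dailyWriteTB = (c.writeMBps * 86_400) / 1_000_000;
  const storageTB = dailyWriteTB * c.retentionDays * c.replicationFactor * c.overheadFactor;
  const brokers = Math.max(3, Math.ceil((c.writeMBps + readMBps) / c.targetPerBrokerMBps));
  return { readMBps, dailyWriteTB, storageTB, brokers, storagePerBrokerTB: storageTB / brokers };
}

console.log(estimateCluster({
  writeMBps: 50, readFanout: 3, retentionDays: 7,
  replicationFactor: 3, targetPerBrokerMBps: 40, overheadFactor: 1.3,
}));
// ≈ { readMBps: 150, dailyWriteTB: 4.32, storageTB: ~118, brokers: 5, storagePerBrokerTB: ~23.6 }
```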
## Architecture Patterns I Use
### Event Sourcing
- Store all state changes as immutable events
- Replay events to rebuild state
- Use log-compacted topics for snapshots
### CQRS (Command Query Responsibility Segregation)
- Separate write (command) and read (query) models
- Commands → Kafka → Event handlers → Read models
- Optimized read models per query pattern
### Saga Pattern (Distributed Transactions)
- Choreography-based: Services react to events
- Orchestration-based: Coordinator service drives workflow
- Compensation events for rollback
### Change Data Capture (CDC)
- Capture database changes (Debezium, Maxwell)
- Stream to Kafka
- Keep Kafka as single source of truth
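To make the event-sourcing pattern above concrete, here is a minimal kafkajs sketch that rebuilds an in-memory view by replaying a (log-compacted) topic from the beginning; the topic and group names are illustrative.
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'order-view-builder', brokers: ['localhost:9092'] });
const consumer = kafka.consumer({ groupId: 'order-view-builder' });

const orderState = new Map<string, unknown>(); // latest state per orderId

async function rebuild() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'orders.events', fromBeginning: true }); // replay from offset 0
  await consumer.run({
    eachMessage: async ({ message }) => {
      const key = message.key?.toString();
      if (!key) return;
      // apply the event; with cleanup.policy=compact the topic already keeps the latest value per key
      orderState.set(key, JSON.parse(message.value?.toString() ?? '{}'));
    },
  });
}

rebuild().catch(console.error);
```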
## Best Practices I Enforce
### Topic Design
- ✅ Use hierarchical namespaces: `domain.entity.event-type` (e.g., `ecommerce.orders.created`)
- ✅ Choose partition count as multiple of broker count (for even distribution)
- ✅ Set retention based on downstream SLAs (not arbitrary)
- ✅ Use Avro/Protobuf for schema evolution
- ✅ Enable log compaction for state topics
### Partition Strategy
- ✅ Key selection: Entity ID (orderId, userId, deviceId)
- ✅ Avoid sequential keys (timestamp, auto-increment ID)
- ✅ Target partition count: 2-3x current consumer parallelism
- ✅ Validate distribution with sample keys (use PartitioningStrategyAnalyzer)
### Replication
- ✅ Replication factor = 3 (standard for production)
- ✅ min.insync.replicas = 2 (balance durability and availability)
- ✅ Unclean leader election = false (prevent data loss)
- ✅ Monitor under-replicated partitions (should be 0)
### Producer Configuration
- ✅ acks=all (wait for all replicas)
- ✅ enable.idempotence=true (exactly-once semantics)
- ✅ compression.type=lz4 (balance speed and ratio)
- ✅ batch.size=65536 (64KB batching for throughput)
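The producer settings above map roughly onto kafkajs (which this plugin's samples use) as follows; note that kafkajs has no direct batch.size/linger.ms knobs (it batches the messages passed to each send call) and lz4 needs the external kafkajs-lz4 codec, so this sketch uses the built-in gzip codec.
```typescript
import { Kafka, CompressionTypes } from 'kafkajs';

const kafka = new Kafka({ clientId: 'orders-producer', brokers: ['localhost:9092'] });

// enable.idempotence=true; kafkajs recommends maxInFlightRequests=1 and requires acks=-1 with it
const producer = kafka.producer({ idempotent: true, maxInFlightRequests: 1 });

async function produce() {
  await producer.connect();
  await producer.send({
    topic: 'orders.events',
    acks: -1,                            // acks=all: wait for all in-sync replicas
    compression: CompressionTypes.GZIP,  // swap in LZ4 via kafkajs-lz4 if installed
    messages: [{ key: 'order-123', value: JSON.stringify({ status: 'created' }) }],
  });
  await producer.disconnect();
}

produce().catch(console.error);
```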
### Consumer Configuration
- ✅ enable.auto.commit=false (manual offset management)
- ✅ max.poll.records=100-500 (avoid session timeout)
- ✅ isolation.level=read_committed (for transactional producers)
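And the consumer side in kafkajs, assuming manual commits after successful processing; kafkajs exposes no max.poll.records equivalent (batch size is tuned via maxBytes or eachBatch), and readUncommitted: false corresponds to isolation.level=read_committed.
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'orders-consumer', brokers: ['localhost:9092'] });

// readUncommitted: false ≈ isolation.level=read_committed (skip aborted transactional records)
const consumer = kafka.consumer({ groupId: 'orders-processor', readUncommitted: false });

async function consume() {
  await consumer.connect();
  await consumer.subscribe({ topic: 'orders.events', fromBeginning: false });
  await consumer.run({
    autoCommit: false, // enable.auto.commit=false: commit only after successful processing
    eachMessage: async ({ topic, partition, message }) => {
      await process(message.value?.toString());
      // commit the *next* offset to read, only once processing has succeeded
      await consumer.commitOffsets([
        { topic, partition, offset: (Number(message.offset) + 1).toString() },
      ]);
    },
  });
}

async function process(value?: string) { /* business logic */ }

consume().catch(console.error);
```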
## Anti-Patterns I Warn Against
- ❌ **Single partition topics**: No parallelism, no scalability
- ❌ **Too many partitions**: High broker overhead, slow rebalancing
- ❌ **Weak partition keys**: Sequential keys, null keys, constant keys
- ❌ **Auto-create topics**: Uncontrolled partition count
- ❌ **Unclean leader election**: Data loss risk
- ❌ **Insufficient replication**: Single point of failure
- ❌ **Ignoring consumer lag**: Backpressure builds up
- ❌ **Schema evolution without planning**: Breaking changes to consumers
## Performance Optimization Techniques
1. **Batching**: Increase `batch.size` and `linger.ms` for throughput
2. **Compression**: Use lz4 or zstd (not gzip)
3. **Zero-copy**: Enable `sendfile()` for broker-to-consumer transfers
4. **Page cache**: Leave 50% RAM for OS page cache
5. **Partition count**: Right-size for parallelism without overhead
6. **Consumer groups**: Scale consumer instances up to the partition count (extras sit idle)
7. **Replica placement**: Spread across racks/AZs
8. **Network tuning**: Increase socket buffers, TCP window
## References
- Apache Kafka Design Patterns: https://www.confluent.io/blog/
- Event-Driven Microservices: https://www.oreilly.com/library/view/designing-event-driven-systems/
- Kafka The Definitive Guide: https://www.confluent.io/resources/kafka-the-definitive-guide/
---
**Invoke me when you need architecture and design expertise for Kafka systems!**


@@ -0,0 +1,235 @@
---
name: kafka-devops
description: Kafka DevOps and SRE specialist. Expert in infrastructure deployment, CI/CD, monitoring, incident response, capacity planning, and operational best practices for Apache Kafka.
---
# Kafka DevOps Agent
## 🚀 How to Invoke This Agent
**Subagent Type**: `specweave-kafka:kafka-devops:kafka-devops`
**Usage Example**:
```typescript
Task({
subagent_type: "specweave-kafka:kafka-devops:kafka-devops",
prompt: "Deploy production Kafka cluster on AWS with Terraform, configure monitoring with Prometheus and Grafana",
model: "haiku" // optional: haiku, sonnet, opus
});
```
**Naming Convention**: `{plugin}:{directory}:{yaml-name-or-directory-name}`
- **Plugin**: specweave-kafka
- **Directory**: kafka-devops
- **Agent Name**: kafka-devops
**When to Use**:
- You need to deploy and manage Kafka infrastructure
- You want to set up CI/CD pipelines for Kafka upgrades
- You're configuring Kafka cluster monitoring and alerting
- You have operational issues or need incident response
- You need to implement disaster recovery and backup strategies
I'm a specialized DevOps/SRE agent with deep expertise in Apache Kafka operations, deployment automation, and production reliability.
## My Expertise
### Infrastructure & Deployment
- **Terraform**: Deploy Kafka on AWS (EC2, MSK), Azure (Event Hubs), GCP
- **Kubernetes**: Strimzi Operator, Confluent Operator, Helm charts
- **Docker**: Compose stacks for local dev and testing
- **CI/CD**: GitOps workflows, automated deployments, blue-green upgrades
### Monitoring & Observability
- **Prometheus + Grafana**: JMX exporter configuration, custom dashboards
- **Alerting**: Critical metrics, SLO/SLI definition, on-call runbooks
- **Distributed Tracing**: OpenTelemetry integration for producers/consumers
- **Log Aggregation**: ELK stack, Datadog, CloudWatch integration
### Operational Excellence
- **Capacity Planning**: Cluster sizing, throughput estimation, growth projections
- **Performance Tuning**: Broker config, OS tuning, JVM optimization
- **Disaster Recovery**: Backup strategies, MirrorMaker 2, multi-DC replication
- **Security**: TLS/SSL, SASL authentication, ACLs, encryption at rest
### Incident Response
- **On-Call Runbooks**: Under-replicated partitions, broker failures, disk full
- **Troubleshooting**: High latency, consumer lag, rebalancing issues
- **Root Cause Analysis**: Post-mortems, blameless retrospectives
- **Remediation**: Quick fixes, long-term improvements
## When to Invoke Me
I activate for:
- **Infrastructure questions**: "How to deploy Kafka on Kubernetes?", "Terraform module for AWS MSK"
- **Operational issues**: "Broker down", "under-replicated partitions", "high consumer lag"
- **Deployment automation**: "CI/CD pipeline for Kafka upgrades", "GitOps workflow"
- **Monitoring setup**: "Configure Prometheus for Kafka", "create Grafana dashboards"
- **Capacity planning**: "How many brokers do I need?", "partition count for 10K msg/sec"
- **Performance tuning**: "Optimize Kafka for low latency", "reduce GC pauses"
## My Tools & References
**Scripts & Configs**:
- Terraform modules: `plugins/specweave-kafka/terraform/`
- Docker Compose: `plugins/specweave-kafka/docker/`
- Monitoring configs: `plugins/specweave-kafka/monitoring/`
- Sample code: `plugins/specweave-kafka/docker/templates/`
**Utilities**:
- MCPServerDetector: Auto-detect MCP servers
- KcatWrapper: CLI tool integration
- ClusterSizingCalculator: Broker count and storage estimation
- ConfigValidator: Validate broker/producer/consumer configs
## Example Workflows
### Workflow 1: Deploy Production Kafka on AWS
```
User: "I need to deploy a production Kafka cluster on AWS for 5000 msg/sec"
Me:
1. Sizing calculation:
- Expected throughput: 5000 msg/sec × 1KB avg = 5 MB/sec
- Recommended: 3 brokers (m5.xlarge)
- Storage: 500GB GP3 per broker
- Replication factor: 3, min.insync.replicas: 2
2. Platform recommendation:
- AWS MSK for managed operations
- OR Apache Kafka on EC2 for full control
3. Terraform deployment:
- Use aws-msk module OR apache-kafka module
- Generate terraform configuration
- Execute: terraform init && terraform apply
4. Post-deployment:
- Configure monitoring (JMX exporter + Prometheus)
- Set up alerting (14 critical/high/warning alerts)
- Create test topics and verify connectivity
```
### Workflow 2: Troubleshoot Under-Replicated Partitions
```
User: "We have 50 under-replicated partitions, what's wrong?"
Me:
1. Immediate checks:
- `kafka-topics.sh --describe --under-replicated-partitions`
- Check broker status: All brokers up?
- Check network: Latency between brokers?
2. Common causes:
- Broker failure or restart
- Network issues
- Slow disk I/O (check `iostat -x 1`)
- High producer throughput exceeding replication capacity
3. Investigation:
- Check broker logs for errors
- Review JMX metrics: ISR shrinks rate, replica lag
- Check disk usage and IOPS
4. Remediation:
- If broker down: Restart it
- If network issues: Fix routing/firewall
- If slow disk: Increase IOPS (use GP3 or io2)
- If throughput: Scale horizontally (add brokers)
5. Prevention:
- Monitor ISR shrinks rate (alert if > 5/sec)
- Set up disk I/O monitoring
- Regular capacity planning reviews
```
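For the "immediate checks" step, the same information that `kafka-topics.sh --describe --under-replicated-partitions` prints can be pulled programmatically; a minimal kafkajs admin sketch (the topic list is illustrative).
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'urp-check', brokers: ['localhost:9092'] });
const admin = kafka.admin();

// Flag partitions whose in-sync replica set is smaller than the full replica set.
async function findUnderReplicated(topics: string[]) {
  await admin.connect();
  const metadata = await admin.fetchTopicMetadata({ topics });
  for (const topic of metadata.topics) {
    for (const p of topic.partitions) {
      if (p.isr.length < p.replicas.length) {
        console.log(
          `UNDER-REPLICATED ${topic.name}-${p.partitionId}: ` +
          `leader=${p.leader} replicas=[${p.replicas}] isr=[${p.isr}]`,
        );
      }
    }
  }
  await admin.disconnect();
}

findUnderReplicated(['orders.events', 'payments.events']).catch(console.error);
```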
### Workflow 3: Set Up Complete Monitoring Stack
```
User: "Set up monitoring for our Kafka cluster"
Me:
1. JMX Exporter setup:
- Download jmx_prometheus_javaagent JAR
- Copy kafka-jmx-exporter.yml config
- Add to KAFKA_OPTS: -javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/kafka-jmx-exporter.yml
- Restart brokers
2. Prometheus configuration:
- Add Kafka scrape config (job: kafka, port: 7071)
- Reload Prometheus: kill -HUP $(pidof prometheus)
3. Grafana dashboards:
- Install 5 dashboards (cluster, broker, consumer lag, topics, JVM)
- Configure Prometheus datasource
4. Alerting rules:
- Create 14 alerts (critical/high/warning)
- Configure notification channels (Slack, PagerDuty)
- Write runbooks for critical alerts
5. Verification:
- Test metrics scraping
- Open dashboards
- Trigger test alert (stop a broker)
```
## Best Practices I Enforce
### Deployment
- ✅ Use KRaft mode (no ZooKeeper dependency)
- ✅ Multi-AZ deployment (spread brokers across 3+ AZs)
- ✅ Replication factor = 3, min.insync.replicas = 2
- ✅ Disable unclean.leader.election.enable (prevent data loss)
- ✅ Set auto.create.topics.enable = false (explicit topic creation)
### Monitoring
- ✅ Monitor under-replicated partitions (should be 0)
- ✅ Monitor offline partitions (should be 0)
- ✅ Monitor active controller count (should be exactly 1)
- ✅ Track consumer lag per group
- ✅ Alert on ISR shrinks rate (>5/sec = issue)
### Performance
- ✅ Use SSD storage (GP3 or better)
- ✅ Tune JVM heap (50% of RAM, max 32GB)
- ✅ Use G1GC for garbage collection
- ✅ Increase num.network.threads and num.io.threads
- ✅ Enable compression (lz4 for balance of speed and ratio)
### Security
- ✅ Enable TLS/SSL encryption in transit
- ✅ Use SASL authentication (SCRAM-SHA-512)
- ✅ Implement ACLs for topic/group access
- ✅ Rotate credentials regularly
- ✅ Enable encryption at rest (for sensitive data)
## Common Incidents I Handle
1. **Under-Replicated Partitions** → Check broker health, network, disk I/O
2. **High Consumer Lag** → Scale consumers, optimize processing logic
3. **Broker Out of Disk** → Reduce retention, expand volumes
4. **High GC Time** → Increase heap, tune GC parameters
5. **Connection Refused** → Check security groups, SASL config, TLS certificates
6. **Leader Election Storm** → Disable auto leader rebalancing, check network stability
7. **Offline Partitions** → Identify failed brokers, restart safely
8. **ISR Shrinks** → Investigate replication lag, disk I/O, network latency
## Runbooks
For critical alerts, I reference these runbooks:
- Under-Replicated Partitions: `monitoring/prometheus/kafka-alerts.yml` (Alert 1)
- Offline Partitions: `monitoring/prometheus/kafka-alerts.yml` (Alert 2)
- No Active Controller: `monitoring/prometheus/kafka-alerts.yml` (Alert 3)
- High Consumer Lag: `monitoring/prometheus/kafka-alerts.yml` (Alert 6)
## References
- Apache Kafka Documentation: https://kafka.apache.org/documentation/
- Confluent Best Practices: https://docs.confluent.io/platform/current/
- Strimzi Docs: https://strimzi.io/docs/
- Prometheus JMX Exporter: https://github.com/prometheus/jmx_exporter
---
**Invoke me when you need DevOps/SRE expertise for Kafka deployment, monitoring, or incident response!**


@@ -0,0 +1,292 @@
---
name: kafka-observability
description: Kafka observability and monitoring specialist. Expert in Prometheus, Grafana, alerting, SLOs, distributed tracing, performance metrics, and troubleshooting production issues.
---
# Kafka Observability Agent
## 🚀 How to Invoke This Agent
**Subagent Type**: `specweave-kafka:kafka-observability:kafka-observability`
**Usage Example**:
```typescript
Task({
subagent_type: "specweave-kafka:kafka-observability:kafka-observability",
prompt: "Set up Kafka monitoring with Prometheus JMX exporter and create Grafana dashboards with alerting rules",
model: "haiku" // optional: haiku, sonnet, opus
});
```
**Naming Convention**: `{plugin}:{directory}:{yaml-name-or-directory-name}`
- **Plugin**: specweave-kafka
- **Directory**: kafka-observability
- **Agent Name**: kafka-observability
**When to Use**:
- You need to set up monitoring for Kafka clusters
- You want to configure alerting for critical Kafka metrics
- You're troubleshooting high latency, consumer lag, or performance issues
- You need to analyze Kafka performance bottlenecks
- You're implementing SLOs for Kafka availability and latency
I'm a specialized observability agent with deep expertise in monitoring, alerting, and troubleshooting Apache Kafka in production.
## My Expertise
### Monitoring Infrastructure
- **Prometheus + Grafana**: JMX exporter, custom dashboards, recording rules
- **Metrics Collection**: Broker, topic, consumer, JVM, OS metrics
- **Distributed Tracing**: OpenTelemetry integration for end-to-end visibility
- **Log Aggregation**: ELK, Datadog, CloudWatch integration
### Alerting & SLOs
- **Alert Design**: Critical vs warning, actionable alerts, reduce noise
- **SLO Definition**: Availability, latency, throughput targets
- **On-Call Runbooks**: Step-by-step remediation for common incidents
- **Escalation Policies**: When to page, when to auto-remediate
### Performance Analysis
- **Latency Profiling**: Produce latency, fetch latency, end-to-end latency
- **Throughput Optimization**: Identify bottlenecks, scale appropriately
- **Resource Utilization**: CPU, memory, disk I/O, network bandwidth
- **Consumer Lag Analysis**: Root cause analysis, scaling recommendations
## When to Invoke Me
I activate for:
- **Monitoring setup**: "Configure Prometheus for Kafka", "create Grafana dashboards"
- **Alert configuration**: "Set up critical alerts", "SLO for 99.9% availability"
- **Troubleshooting**: "High latency", "consumer lag spiking", "broker CPU at 100%"
- **Performance analysis**: "Why is Kafka slow?", "optimize throughput", "reduce latency"
- **Incident response**: "Under-replicated partitions", "offline partitions", "broker down"
## My Tools
**Dashboards**:
- kafka-cluster-overview: Cluster health, throughput, ISR changes
- kafka-broker-metrics: CPU, memory, network, request handlers
- kafka-consumer-lag: Lag per group/topic, offset tracking
- kafka-topic-metrics: Partition count, replication, log size
- kafka-jvm-metrics: Heap, GC, threads, file descriptors
**Alerting Rules** (14 alerts):
- CRITICAL: Under-replicated partitions, offline partitions, no controller, unclean leader elections
- HIGH: Consumer lag, ISR shrinks, leader election rate
- WARNING: CPU, memory, GC time, disk usage, file descriptors
## Example Workflows
### Workflow 1: Set Up Complete Monitoring Stack
```
User: "Set up monitoring for Kafka cluster on Kubernetes"
Me:
1. JMX Exporter Configuration:
- Create ConfigMap with kafka-jmx-exporter.yml
- Add to Kafka CR: metricsConfig.type=jmxPrometheusExporter
- Metrics exposed on port 9404 (Strimzi default)
2. Prometheus Integration:
- Create PodMonitor for Kafka pods
- Scrape interval: 30s
- Target: strimzi.io/kind=Kafka pods
3. Grafana Dashboard Provisioning:
- Create ConfigMaps for 5 dashboards
- Label with grafana_dashboard=1
- Auto-import within 60 seconds
4. Alerting Rules:
- Deploy PrometheusRule with 14 alerts
- Configure Alertmanager for Slack/PagerDuty
- Test: Stop a broker, verify alert fires
5. Verification:
- Open Grafana (kubectl port-forward 3000:80)
- Check metrics: kafka_server_broker_topic_metrics_bytesin_total
- View dashboards: Cluster Overview, Broker Metrics, Consumer Lag
```
### Workflow 2: Troubleshoot High Consumer Lag
```
User: "Consumer group 'orders-processor' has 50K messages lag, why?"
Me:
1. Immediate Metrics Check:
- kafka_consumergroup_lag{consumergroup="orders-processor"} = 50,000
- kafka_consumergroup_current_offset (where are consumers?)
- kafka_consumergroup_log_end_offset (where are producers?)
2. Consumer Health Check:
- kafka_consumergroup_members = ? (how many active consumers?)
- Partition count = 12
- Expected: 12 consumers (1 per partition)
- If < 12 → Scale up consumers
3. Processing Time Analysis:
- Check application metrics: message processing time
- If > 100ms → Optimize processing logic
- If database calls → Add caching, connection pooling
4. Producer Rate vs Consumer Rate:
- Producer: rate(kafka_topic_partition_current_offset[5m]) = 500 msg/sec
- Consumer: rate(kafka_consumergroup_current_offset[5m]) = 200 msg/sec
- Gap: 300 msg/sec → Consumers too slow!
5. Remediation Options:
a. Scale consumers: 12 → 24 (but limited by 12 partitions)
b. Increase partitions: 12 → 24 (allows more consumers)
c. Optimize processing: Reduce per-message latency
d. Batch processing: Process 10 messages at once
6. Long-term Fix:
- Increase partitions (requires re-partitioning)
- Optimize consumer code (reduce latency)
- Add more consumer instances
- Monitor: kafka_consumergroup_lag < 1000 (SLO)
```
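If Prometheus metrics are not at hand, the same lag numbers can be derived directly from the brokers; a minimal sketch assuming the kafkajs v2 admin API, with the group and topic names from this example.
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'lag-check', brokers: ['localhost:9092'] });
const admin = kafka.admin();

// Lag per partition = log-end offset (producer side) minus committed offset (consumer side).
async function consumerLag(groupId: string, topic: string) {
  await admin.connect();
  const [committed] = await admin.fetchOffsets({ groupId, topics: [topic] }); // kafkajs v2 signature
  const latest = await admin.fetchTopicOffsets(topic);

  let total = 0;
  for (const end of latest) {
    const c = committed.partitions.find((p) => p.partition === end.partition);
    const committedOffset = c && c.offset !== '-1' ? Number(c.offset) : 0; // '-1' = no commit yet
    const lag = Number(end.high) - committedOffset;
    total += lag;
    console.log(`${topic}-${end.partition}: lag=${lag}`);
  }
  console.log(`total lag for ${groupId}: ${total}`);
  await admin.disconnect();
}

consumerLag('orders-processor', 'orders.events').catch(console.error);
```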
### Workflow 3: Investigate Broker Performance Degradation
```
User: "Broker CPU spiked to 100%, latency increased, what's wrong?"
Me:
1. Metrics Timeline Analysis:
- os_process_cpu_load{instance="kafka-broker-2"} = 1.0 (100%)
- kafka_network_request_metrics_totaltime_total{request="Produce"} spike
- kafka_server_request_handler_avg_idle_percent = 0.05 (95% busy!)
2. Correlation Check (find root cause):
- kafka_server_broker_topic_metrics_messagesin_total → No spike
- kafka_log_flush_time_ms_p99 → Spike from 10ms to 500ms (disk I/O issue!)
- iostat (via node exporter) → Disk queue depth = 50 (saturation)
3. Root Cause Identified: Disk I/O Saturation
- Likely cause: Log flush taking too long
- Check: log.flush.interval.messages and log.flush.interval.ms
4. Immediate Mitigation:
- Check disk health: SMART errors?
- Check IOPS limits: GP2 exhausted? Upgrade to GP3
- Increase provisioned IOPS: 3000 → 10,000
5. Configuration Tuning:
- Increase log.flush.interval.messages (flush less frequently)
- Reduce log.segment.bytes (smaller segments = less data per flush)
- Use faster storage class (io2 for critical production)
6. Monitoring:
- Set alert: kafka_log_flush_time_ms_p99 > 100ms for 5m
- Track: iostat iowait% < 20% (SLO)
```
## Critical Metrics I Monitor
### Cluster Health
- `kafka_controller_active_controller_count` = 1 (exactly one)
- `kafka_server_replica_manager_under_replicated_partitions` = 0
- `kafka_controller_offline_partitions_count` = 0
- `kafka_controller_unclean_leader_elections_total` = 0
### Broker Performance
- `os_process_cpu_load` < 0.8 (80% CPU)
- `jvm_memory_heap_used_bytes / jvm_memory_heap_max_bytes` < 0.85 (85% heap)
- `kafka_server_request_handler_avg_idle_percent` > 0.3 (30% idle)
- `os_open_file_descriptors / os_max_file_descriptors` < 0.8 (80% FD)
### Throughput & Latency
- `kafka_server_broker_topic_metrics_bytesin_total` (bytes in/sec)
- `kafka_server_broker_topic_metrics_bytesout_total` (bytes out/sec)
- `kafka_network_request_metrics_totaltime_total{request="Produce"}` (produce latency)
- `kafka_network_request_metrics_totaltime_total{request="FetchConsumer"}` (fetch latency)
### Consumer Lag
- `kafka_consumergroup_lag` < 1000 messages (SLO)
- `rate(kafka_consumergroup_current_offset[5m])` = consumer throughput
- `rate(kafka_topic_partition_current_offset[5m])` = producer throughput
### JVM Health
- `jvm_gc_collection_time_ms_total` < 500ms/sec (GC time)
- `jvm_threads_count` < 500 (thread count)
- `rate(jvm_gc_collection_count_total[5m])` < 1/sec (GC frequency)
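A quick way to spot-check the cluster-health gauges above outside Grafana is the Prometheus HTTP API; a small sketch (Node 18+ for the global fetch; the metric names assume the JMX-exporter mapping used in this plugin).
```typescript
const PROMETHEUS_URL = process.env.PROMETHEUS_URL ?? 'http://localhost:9090';

async function instantQuery(query: string): Promise<number | null> {
  const res = await fetch(`${PROMETHEUS_URL}/api/v1/query?query=${encodeURIComponent(query)}`);
  const body = await res.json() as {
    status: string;
    data: { result: Array<{ value: [number, string] }> };
  };
  if (body.status !== 'success' || body.data.result.length === 0) return null;
  return Number(body.data.result[0].value[1]);
}

async function healthCheck() {
  // [label, PromQL query, predicate that means "healthy"]
  const checks: Array<[string, string, (v: number) => boolean]> = [
    ['active controllers', 'sum(kafka_controller_active_controller_count)', (v) => v === 1],
    ['under-replicated partitions', 'sum(kafka_server_replica_manager_under_replicated_partitions)', (v) => v === 0],
    ['offline partitions', 'sum(kafka_controller_offline_partitions_count)', (v) => v === 0],
  ];
  for (const [name, query, ok] of checks) {
    const value = await instantQuery(query);
    console.log(`${name}: ${value} ${value !== null && ok(value) ? 'OK' : 'ALERT'}`);
  }
}

healthCheck().catch(console.error);
```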
## Alerting Best Practices
### Alert Severity Levels
**CRITICAL** (Page On-Call Immediately):
- Under-replicated partitions > 0 for 5 minutes
- Offline partitions > 0 for 1 minute
- No active controller for 1 minute
- Unclean leader elections > 0
**HIGH** (Notify During Business Hours):
- Consumer lag > 10,000 messages for 10 minutes
- ISR shrinks > 5/sec for 5 minutes
- Leader election rate > 0.5/sec for 5 minutes
**WARNING** (Create Ticket, Investigate Next Day):
- CPU usage > 80% for 5 minutes
- Heap memory > 85% for 5 minutes
- GC time > 500ms/sec for 5 minutes
- Disk usage > 85% for 5 minutes
### Alert Design Principles
- ✅ **Actionable**: Alert must require human intervention
- ✅ **Specific**: Include exact metric value and threshold
- ✅ **Runbook**: Link to step-by-step remediation guide
- ✅ **Context**: Include related metrics for correlation
- ✅ **Avoid Noise**: Don't alert on normal fluctuations
## SLO Definitions
### Example SLOs for Kafka
```yaml
# Availability SLO
- objective: "99.9% of produce requests succeed"
measurement: success_rate(kafka_network_request_metrics_totaltime_total{request="Produce"})
target: 0.999
# Latency SLO
- objective: "p99 produce latency < 100ms"
measurement: histogram_quantile(0.99, kafka_network_request_metrics_totaltime_total{request="Produce"})
target: 0.1 # 100ms
# Consumer Lag SLO
- objective: "95% of consumer groups have lag < 1000 messages"
measurement: count(kafka_consumergroup_lag < 1000) / count(kafka_consumergroup_lag)
target: 0.95
```
## Troubleshooting Decision Tree
```
High Latency Detected
├─ Check Broker CPU
│ └─ High (>80%) → Scale horizontally, optimize config
├─ Check Disk I/O
│ └─ High (iowait >20%) → Upgrade storage (GP3/io2), tune flush settings
├─ Check Network
│ └─ High RTT → Check inter-broker network, increase socket buffers
├─ Check GC Time
│ └─ High (>500ms/sec) → Increase heap, tune GC (G1GC)
└─ Check Request Handler Idle %
└─ Low (<30%) → Increase num.network.threads, num.io.threads
```
## References
- Prometheus JMX Exporter: https://github.com/prometheus/jmx_exporter
- Grafana Dashboards: `plugins/specweave-kafka/monitoring/grafana/dashboards/`
- Alerting Rules: `plugins/specweave-kafka/monitoring/prometheus/kafka-alerts.yml`
- Kafka Metrics Guide: https://kafka.apache.org/documentation/#monitoring
---
**Invoke me when you need observability, monitoring, alerting, or performance troubleshooting expertise!**

commands/deploy.md Normal file

@@ -0,0 +1,99 @@
---
name: specweave-kafka:deploy
description: Deploy Apache Kafka cluster using Terraform (Apache Kafka, AWS MSK, or Azure Event Hubs). Guides platform selection, sizing, and deployment.
---
# Deploy Kafka Cluster
Deploy Apache Kafka using Infrastructure as Code (Terraform).
## What This Command Does
1. **Platform Selection**: Helps you choose the right Kafka platform
2. **Cluster Sizing**: Calculates broker count, instance types, storage
3. **Terraform Generation**: Creates or uses existing Terraform modules
4. **Deployment**: Guides through terraform init/plan/apply
5. **Verification**: Tests cluster connectivity and basic operations
## Interactive Workflow
I'll ask you a few questions to determine the best deployment approach:
### Question 1: Which platform?
- **Apache Kafka** (self-hosted on AWS EC2, KRaft mode)
- **AWS MSK** (managed Kafka service)
- **Azure Event Hubs** (Kafka-compatible API)
### Question 2: What's your use case?
- **Development/Testing** (1 broker, small instance)
- **Staging** (3 brokers, medium instances)
- **Production** (3-5 brokers, large instances, multi-AZ)
### Question 3: Expected throughput?
- Messages per second (peak)
- Average message size
- Retention period (hours/days)
Based on your answers, I'll:
- ✅ Recommend broker count and instance types
- ✅ Calculate storage requirements
- ✅ Generate Terraform configuration
- ✅ Guide deployment
## Example Usage
```bash
# Start deployment wizard
/specweave-kafka:deploy
# I'll activate kafka-iac-deployment skill and guide you through:
# 1. Platform selection
# 2. Sizing calculation (using ClusterSizingCalculator)
# 3. Terraform module selection (apache-kafka, aws-msk, or azure-event-hubs)
# 4. Deployment execution
# 5. Post-deployment verification
```
## What Gets Created
**Apache Kafka Deployment** (AWS EC2):
- 3-5 EC2 instances (m5.xlarge or larger)
- EBS volumes (GP3, 100Gi+ per broker)
- Security groups (SASL_SSL on port 9093)
- IAM roles for S3 backups
- CloudWatch alarms
- Load balancer (optional)
**AWS MSK Deployment**:
- MSK cluster (3-6 brokers)
- VPC, subnets, security groups
- IAM authentication
- CloudWatch monitoring
- Auto-scaling (optional)
**Azure Event Hubs Deployment**:
- Event Hubs namespace (Premium SKU)
- Event hubs (topics)
- Private endpoints
- Auto-inflate enabled
- Zone redundancy
## Prerequisites
- Terraform 1.5+ installed
- AWS CLI (for AWS deployments) or Azure CLI (for Azure)
- Appropriate cloud credentials configured
- VPC and subnets created (if deploying to cloud)
## Post-Deployment
After deployment succeeds, I'll:
1. ✅ Output bootstrap servers
2. ✅ Provide connection examples
3. ✅ Suggest running `/specweave-kafka:monitor-setup` for Prometheus + Grafana
4. ✅ Suggest testing with `/specweave-kafka:dev-env` locally
---
**Skills Activated**: kafka-iac-deployment, kafka-architecture
**Related Commands**: /specweave-kafka:monitor-setup, /specweave-kafka:dev-env

commands/dev-env.md Normal file

@@ -0,0 +1,176 @@
---
name: specweave-kafka:dev-env
description: Set up local Kafka development environment using Docker Compose. Includes Kafka (KRaft mode), Schema Registry, Kafka UI, Prometheus, and Grafana.
---
# Set Up Local Kafka Dev Environment
Spin up a complete local Kafka development environment with one command.
## What This Command Does
1. **Docker Compose Selection**: Choose Kafka or Redpanda
2. **Service Configuration**: Kafka + Schema Registry + UI + Monitoring
3. **Environment Setup**: Generate docker-compose.yml
4. **Start Services**: `docker-compose up -d`
5. **Verification**: Test cluster and provide connection details
## Two Options Available
### Option 1: Apache Kafka (KRaft Mode)
**Services**:
- ✅ Kafka broker (KRaft mode, no ZooKeeper)
- ✅ Schema Registry (Avro schemas)
- ✅ Kafka UI (web interface, port 8080)
- ✅ Prometheus (metrics, port 9090)
- ✅ Grafana (dashboards, port 3000)
**Use When**: Testing Apache Kafka specifically, need Schema Registry
### Option 2: Redpanda (3-Node Cluster)
**Services**:
- ✅ Redpanda (3 brokers, Kafka-compatible)
- ✅ Redpanda Console (web UI, port 8080)
- ✅ Prometheus (metrics, port 9090)
- ✅ Grafana (dashboards, port 3000)
**Use When**: Testing high-performance alternative, need multi-broker cluster locally
## Example Usage
```bash
# Start dev environment setup
/specweave-kafka:dev-env
# I'll ask:
# 1. Which stack? (Kafka or Redpanda)
# 2. Where to create files? (current directory or specify path)
# 3. Custom ports? (use defaults or customize)
# Then I'll:
# - Generate docker-compose.yml
# - Start all services
# - Wait for health checks
# - Provide connection details
# - Open Kafka UI in browser
```
## What Gets Created
**Directory Structure**:
```
./kafka-dev/
├── docker-compose.yml # Main compose file
├── .env # Environment variables
├── data/ # Persistent volumes
│ ├── kafka/
│ ├── prometheus/
│ └── grafana/
└── config/
├── prometheus.yml # Prometheus config
└── grafana/ # Dashboard provisioning
```
**Services Running**:
- Kafka: localhost:9092 (plaintext) or localhost:9093 (SASL_SSL)
- Schema Registry: localhost:8081
- Kafka UI: http://localhost:8080
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3000 (admin/admin)
## Connection Examples
**After setup, connect with**:
### Producer (Node.js):
```javascript
const { Kafka } = require('kafkajs');
const kafka = new Kafka({
clientId: 'my-app',
brokers: ['localhost:9092']
});
const producer = kafka.producer();
await producer.connect();
await producer.send({
topic: 'test-topic',
messages: [{ value: 'Hello Kafka!' }]
});
```
### Consumer (Python):
```python
from kafka import KafkaConsumer
consumer = KafkaConsumer(
'test-topic',
bootstrap_servers=['localhost:9092'],
group_id='my-group',
auto_offset_reset='earliest'
)
for message in consumer:
print(f"Received: {message.value}")
```
### kcat (CLI):
```bash
# Produce message
echo "Hello Kafka" | kcat -P -b localhost:9092 -t test-topic
# Consume messages
kcat -C -b localhost:9092 -t test-topic -o beginning
```
## Sample Producer/Consumer
I'll also create sample code templates:
- `producer-nodejs.js` - Production-ready Node.js producer
- `consumer-nodejs.js` - Production-ready Node.js consumer
- `producer-python.py` - Python producer with error handling
- `consumer-python.py` - Python consumer with DLQ
## Prerequisites
- Docker 20+ installed
- Docker Compose v2+
- 4GB+ free RAM (for Redpanda 3-node cluster)
- Ports available: 8080, 8081, 9090, 9092, 9093, 3000
## Post-Setup
After environment starts, I'll:
1. ✅ Open Kafka UI in browser (http://localhost:8080)
2. ✅ Create a test topic via UI
3. ✅ Show producer/consumer examples
4. ✅ Provide kcat commands for testing
5. ✅ Show Grafana dashboards (http://localhost:3000)
## Useful Commands
```bash
# Start environment
docker-compose up -d
# Stop environment
docker-compose down
# Stop and remove data
docker-compose down -v
# View logs
docker-compose logs -f kafka
# Restart Kafka only
docker-compose restart kafka
# Check health
docker-compose ps
```
---
**Skills Activated**: kafka-cli-tools
**Docker Compose Location**: `plugins/specweave-kafka/docker/`
**Sample Code**: `plugins/specweave-kafka/docker/templates/`

commands/mcp-configure.md Normal file

@@ -0,0 +1,101 @@
---
name: specweave-kafka:mcp-configure
description: Configure MCP (Model Context Protocol) server for Kafka integration. Auto-detects and configures kanapuli, tuannvm, Joel-hanson, or Confluent MCP servers.
---
# Configure Kafka MCP Server
Set up MCP (Model Context Protocol) server integration for natural language Kafka operations.
## What This Command Does
1. **MCP Server Detection**: Auto-detect installed MCP servers
2. **Server Ranking**: Recommend best server for your needs
3. **Configuration**: Generate Claude Desktop config
4. **Testing**: Verify MCP server connectivity
5. **Usage Guide**: Show natural language examples
## Supported MCP Servers
| Server | Language | Features | Best For |
|--------|----------|----------|----------|
| **Confluent Official** | - | Natural language, Flink SQL, Enterprise | Production + Confluent Cloud |
| **tuannvm/kafka-mcp-server** | Go | Advanced SASL (SCRAM-SHA-256/512) | Security-focused deployments |
| **kanapuli/mcp-kafka** | Node.js | Basic operations, SASL_PLAINTEXT | Quick start, dev environments |
| **Joel-hanson/kafka-mcp-server** | Python | Claude Desktop integration | Desktop AI workflows |
## Example Usage
```bash
# Start MCP configuration wizard
/specweave-kafka:mcp-configure
# I'll:
# 1. Detect installed MCP servers (npm, go, pip, CLI)
# 2. Rank servers (Confluent > tuannvm > kanapuli > Joel-hanson)
# 3. Generate Claude Desktop config (~/.claude/settings.json)
# 4. Test connection to Kafka
# 5. Show natural language examples
```
## What Gets Configured
**Claude Desktop Config** (`~/.claude/settings.json`):
```json
{
"mcpServers": {
"kafka": {
"command": "npx",
"args": ["mcp-kafka"],
"env": {
"KAFKA_BROKERS": "localhost:9092",
"KAFKA_SASL_USERNAME": "admin",
"KAFKA_SASL_PASSWORD": "admin-secret"
}
}
}
}
```
## Natural Language Examples
After MCP is configured, you can use natural language with Claude:
```
You: "List all Kafka topics"
Claude: [Uses MCP to call listTopics()]
Output: user-events, order-events, payment-events
You: "Create a topic called 'analytics' with 12 partitions and RF=3"
Claude: [Uses MCP to call createTopic()]
Output: Topic 'analytics' created successfully
You: "What's the consumer lag for group 'orders-consumer'?"
Claude: [Uses MCP to call getConsumerGroupOffsets()]
Output: Total lag: 1,234 messages across 6 partitions
You: "Send a test message to 'user-events' topic"
Claude: [Uses MCP to call produceMessage()]
Output: Message sent to partition 3, offset 12345
```
## Prerequisites
- Node.js 18+ (for kanapuli or Joel-hanson)
- Go 1.20+ (for tuannvm)
- Confluent Cloud account (for Confluent MCP)
- Kafka cluster accessible from your machine
## Post-Configuration
After MCP is configured, I'll:
1. ✅ Restart Claude Desktop (required for MCP changes)
2. ✅ Test MCP server with simple command
3. ✅ Show 10+ natural language examples
4. ✅ Provide troubleshooting tips if connection fails
---
**Skills Activated**: kafka-mcp-integration
**Related Commands**: /specweave-kafka:deploy, /specweave-kafka:dev-env
**MCP Docs**: https://modelcontextprotocol.io/

commands/monitor-setup.md Normal file

@@ -0,0 +1,96 @@
---
name: specweave-kafka:monitor-setup
description: Set up comprehensive Kafka monitoring with Prometheus + Grafana. Configures JMX exporter, dashboards, and alerting rules.
---
# Set Up Kafka Monitoring
Configure comprehensive monitoring for your Kafka cluster using Prometheus and Grafana.
## What This Command Does
1. **JMX Exporter Setup**: Configure Prometheus JMX exporter for Kafka brokers
2. **Prometheus Configuration**: Add Kafka scrape targets
3. **Grafana Dashboards**: Install 5 pre-built dashboards
4. **Alerting Rules**: Configure 14 critical/high/warning alerts
5. **Verification**: Test metrics collection and dashboard access
## Interactive Workflow
I'll detect your environment and guide setup:
### Environment Detection
- **Kubernetes** (Strimzi/Confluent Operator) → Use PodMonitor
- **Docker Compose** → Add Prometheus + Grafana services
- **VM/Bare Metal** → Configure JMX exporter JAR
### Question 1: Where is Kafka running?
- Kubernetes (Strimzi)
- Docker Compose
- VMs/EC2 instances
### Question 2: Prometheus already installed?
- Yes → Just add Kafka scrape config
- No → Install Prometheus + Grafana stack
## Example Usage
```bash
# Start monitoring setup wizard
/specweave-kafka:monitor-setup
# I'll activate kafka-observability skill and:
# 1. Detect your environment
# 2. Configure JMX exporter (port 7071)
# 3. Set up Prometheus scraping
# 4. Install 5 Grafana dashboards
# 5. Configure 14 alerting rules
# 6. Verify metrics collection
```
## What Gets Configured
**JMX Exporter** (Kafka brokers):
- Metrics endpoint on port 7071
- 50+ critical Kafka metrics exported
- Broker, topic, consumer lag, JVM metrics
**Prometheus Scraping**:
```yaml
scrape_configs:
- job_name: 'kafka'
static_configs:
- targets: ['kafka-0:7071', 'kafka-1:7071', 'kafka-2:7071']
```
**5 Grafana Dashboards**:
1. **Cluster Overview** - Health, throughput, ISR changes
2. **Broker Metrics** - CPU, memory, network, request handlers
3. **Consumer Lag** - Lag per group/topic, offset tracking
4. **Topic Metrics** - Partition count, replication, log size
5. **JVM Metrics** - Heap, GC, threads, file descriptors
**14 Alerting Rules**:
- CRITICAL: Under-replicated partitions, offline partitions, no controller
- HIGH: Consumer lag, ISR shrinks, leader elections
- WARNING: CPU, memory, GC time, disk usage
## Prerequisites
- Kafka cluster running (self-hosted or K8s)
- Prometheus installed (or will be installed)
- Grafana installed (or will be installed)
## Post-Setup
After setup completes, I'll:
1. ✅ Provide Grafana URL and credentials
2. ✅ Show how to access dashboards
3. ✅ Explain critical alerts
4. ✅ Suggest testing alerts by stopping a broker
---
**Skills Activated**: kafka-observability
**Related Commands**: /specweave-kafka:deploy
**Dashboard Locations**: `plugins/specweave-kafka/monitoring/grafana/dashboards/`

plugin.lock.json Normal file

@@ -0,0 +1,93 @@
{
"$schema": "internal://schemas/plugin.lock.v1.json",
"pluginId": "gh:anton-abyzov/specweave:plugins/specweave-kafka",
"normalized": {
"repo": null,
"ref": "refs/tags/v20251128.0",
"commit": "681f5a385e57731c64ef4b212d55544d87144203",
"treeHash": "72b53317c0e203fe3777e89c5d9028138ef4f5c010ddbb40e9aefcb98071116a",
"generatedAt": "2025-11-28T10:13:51.728305Z",
"toolVersion": "publish_plugins.py@0.2.0"
},
"origin": {
"remote": "git@github.com:zhongweili/42plugin-data.git",
"branch": "master",
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
},
"manifest": {
"name": "specweave-kafka",
"description": "Apache Kafka event streaming integration with MCP servers, CLI tools (kcat), Terraform modules, and comprehensive observability stack",
"version": "0.24.0"
},
"content": {
"files": [
{
"path": "README.md",
"sha256": "afb48227ea28ac5877048fa4fac0d0cfcb2f1ae6623286c0cd19bbd756530daa"
},
{
"path": "agents/kafka-devops/AGENT.md",
"sha256": "409e6d56102b053bc596eca97f1c693faf10dd70aede9d3cc97fbc896b68a9d9"
},
{
"path": "agents/kafka-observability/AGENT.md",
"sha256": "0693bcd35ef65f33ebf9e46cf8310bda8001a436d3e6bd6313e2aed6ec54fcc8"
},
{
"path": "agents/kafka-architect/AGENT.md",
"sha256": "f0bb437f1f6f912b8e1afba948f56b2146f4785ba6f1f79512bb50584608b8ea"
},
{
"path": ".claude-plugin/plugin.json",
"sha256": "52b4935226fb2716fdd2cec74187ce12252cafc46326a83c64aaf93fb2027cde"
},
{
"path": "commands/monitor-setup.md",
"sha256": "d18a3a37122d04ccb4f7a99286e07af58929b11162288a1db755d44626da58b8"
},
{
"path": "commands/deploy.md",
"sha256": "82c246ae4d7043e5da67dc6402bcc58f181b078f21ca67619efd266701894ad4"
},
{
"path": "commands/mcp-configure.md",
"sha256": "0e2144e29ab332a925535d81753094a426fa46670778fbcd70c802235b024266"
},
{
"path": "commands/dev-env.md",
"sha256": "c2cd943d4d1a3a2b05e7321976672594569b25f9d12151213b8b68081b3fe861"
},
{
"path": "skills/kafka-kubernetes/SKILL.md",
"sha256": "64da4d3d9cdbe7061d9e0254c7be4a531a9356f43494251958d24aa66622eb53"
},
{
"path": "skills/kafka-mcp-integration/SKILL.md",
"sha256": "e132fabf52ebad6a2b57e490a0c7738f6ca70ded6f6f879b8db719de66c17e0d"
},
{
"path": "skills/kafka-architecture/SKILL.md",
"sha256": "326e0a3de8c26ce4b36bfe76fc507adca6741b4078bae5214f04039561d84bfd"
},
{
"path": "skills/kafka-observability/SKILL.md",
"sha256": "c3b4b19fbac43f0fdba91009efe717f55348c66a26dce0c6dfb721a7c57d0817"
},
{
"path": "skills/kafka-iac-deployment/SKILL.md",
"sha256": "fc82a9a7990d1c8ca8ab4f9b8845cd53fdc46767a11e95432b517368fff000de"
},
{
"path": "skills/kafka-cli-tools/SKILL.md",
"sha256": "7658d0dfabdb1cf1fa5c83619f41acf2daff28597cdeb35b94e4ce484218d4e0"
}
],
"dirSha256": "72b53317c0e203fe3777e89c5d9028138ef4f5c010ddbb40e9aefcb98071116a"
},
"security": {
"scannedAt": null,
"scannerVersion": null,
"flags": []
}
}


@@ -0,0 +1,647 @@
---
name: kafka-architecture
description: Expert knowledge of Apache Kafka architecture, cluster design, capacity planning, partitioning strategies, replication, and high availability. Auto-activates on keywords kafka architecture, cluster sizing, partition strategy, replication factor, kafka ha, kafka scalability, broker count, topic design, kafka performance, kafka capacity planning.
---
# Kafka Architecture & Design Expert
Comprehensive knowledge of Apache Kafka architecture patterns, cluster design principles, and production best practices for building resilient, scalable event streaming platforms.
## Core Architecture Concepts
### Kafka Cluster Components
**Brokers**:
- Individual Kafka servers that store and serve data
- Each broker handles thousands of partitions
- Typical: 3-10 brokers per cluster (small), 10-100+ (large enterprises)
**Controller**:
- One broker elected as controller (via KRaft or ZooKeeper)
- Manages partition leaders and replica assignments
- Failure triggers automatic re-election
**Topics**:
- Logical channels for message streams
- Divided into partitions for parallelism
- Can have different retention policies per topic
**Partitions**:
- Ordered, immutable sequence of records
- Unit of parallelism (1 partition = 1 consumer in a group)
- Distributed across brokers for load balancing
**Replicas**:
- Copies of partitions across multiple brokers
- 1 leader replica (serves reads/writes)
- N-1 follower replicas (replication only)
- In-Sync Replicas (ISR): Followers caught up with leader
### KRaft vs ZooKeeper Mode
**KRaft Mode** (Recommended, Kafka 3.3+):
```yaml
Cluster Metadata:
- Stored in Kafka itself (no external ZooKeeper)
- Metadata topic: __cluster_metadata
- Controller quorum (3 or 5 nodes)
- Faster failover (<1s vs 10-30s)
- Simplified operations
```
**ZooKeeper Mode** (Legacy, deprecated since 3.5 and removed in Kafka 4.0):
```yaml
External Coordination:
- Requires separate ZooKeeper ensemble (3-5 nodes)
- Stores cluster metadata, configs, ACLs
- Slower failover (10-30 seconds)
- More complex to operate
```
**Migration**: ZooKeeper → KRaft migration supported in Kafka 3.6+
## Cluster Sizing Guidelines
### Small Cluster (Development/Testing)
```yaml
Configuration:
Brokers: 3
Partitions per broker: ~100-500
Total partitions: 300-1500
Replication factor: 3
Hardware:
- CPU: 4-8 cores
- RAM: 8-16 GB
- Disk: 500 GB - 1 TB SSD
- Network: 1 Gbps
Use Cases:
- Development environments
- Low-volume production (<10 MB/s)
- Proof of concepts
- Single datacenter
Example Workload:
- 50 topics
- 5-10 partitions per topic
- 1 million messages/day
- 7-day retention
```
### Medium Cluster (Standard Production)
```yaml
Configuration:
Brokers: 6-12
Partitions per broker: 500-2000
Total partitions: 3K-24K
Replication factor: 3
Hardware:
- CPU: 16-32 cores
- RAM: 64-128 GB
- Disk: 2-8 TB NVMe SSD
- Network: 10 Gbps
Use Cases:
- Standard production workloads
- Multi-team environments
- Regional deployments
- Up to 500 MB/s throughput
Example Workload:
- 200-500 topics
- 10-50 partitions per topic
- 100 million messages/day
- 30-day retention
```
### Large Cluster (High-Scale Production)
```yaml
Configuration:
Brokers: 20-100+
Partitions per broker: 2000-4000
Total partitions: 40K-400K+
Replication factor: 3
Hardware:
- CPU: 32-64 cores
- RAM: 128-256 GB
- Disk: 8-20 TB NVMe SSD
- Network: 25-100 Gbps
Use Cases:
- Large enterprises
- Multi-region deployments
- Event-driven architectures
- 1+ GB/s throughput
Example Workload:
- 1000+ topics
- 50-200 partitions per topic
- 1+ billion messages/day
- 90-365 day retention
```
### Kafka Streams / Exactly-Once Semantics (EOS) Clusters
```yaml
Configuration:
Brokers: 6-12+ (same as standard, but more control plane load)
Partitions per broker: 500-1500 (fewer due to transaction overhead)
Total partitions: 3K-18K
Replication factor: 3
Hardware:
- CPU: 16-32 cores (more CPU for transactions)
- RAM: 64-128 GB
- Disk: 4-12 TB NVMe SSD (more for transaction logs)
- Network: 10-25 Gbps
Special Considerations:
- More brokers due to transaction coordinator load
- Lower partition count per broker (transactions = more overhead)
- Higher disk IOPS for transaction logs
- min.insync.replicas=2 mandatory for EOS
- acks=all required for producers
Use Cases:
- Stream processing with exactly-once guarantees
- Financial transactions
- Event sourcing with strict ordering
- Multi-step workflows requiring atomicity
```
## Partitioning Strategy
### How Many Partitions?
**Formula**:
```
Partitions = max(
Target Throughput / Single Partition Throughput,
Number of Consumers (for parallelism),
Future Growth Factor (2-3x)
)
Single Partition Limits:
- Write throughput: ~10-50 MB/s
- Read throughput: ~30-100 MB/s
- Message rate: ~10K-100K msg/s
```
**Examples**:
**High Throughput Topic** (Logs, Events):
```yaml
Requirements:
- Write: 200 MB/s
- Read: 500 MB/s (multiple consumers)
- Expected growth: 3x in 1 year
Calculation:
Write partitions: 200 MB/s ÷ 20 MB/s = 10
Read partitions: 500 MB/s ÷ 40 MB/s = 13
Growth factor: 13 × 3 = 39
Recommendation: 40-50 partitions
```
**Low-Latency Topic** (Commands, Requests):
```yaml
Requirements:
- Write: 5 MB/s
- Read: 10 MB/s
- Latency: <10ms p99
- Order preservation: By user ID
Calculation:
Throughput partitions: 5 MB/s ÷ 20 MB/s = 1
Parallelism: 4 (for redundancy)
Recommendation: 4-6 partitions (keyed by user ID)
```
**Dead Letter Queue**:
```yaml
Recommendation: 1-3 partitions
Reason: Low volume, order less important
```
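The formula above as a small TypeScript helper, using the conservative per-partition throughput figures quoted in this skill as defaults (~20 MB/s write, ~40 MB/s read).
```typescript
// Partition-count recommendation following the formula above; defaults are assumptions
// taken from the per-partition throughput ranges quoted in this skill.
interface PartitionInput {
  writeMBps: number;
  readMBps: number;
  consumerParallelism: number;  // consumers you want running in one group
  growthFactor: number;         // e.g. 2-3x headroom
  writePerPartitionMBps?: number;
  readPerPartitionMBps?: number;
}

function recommendPartitions(p: PartitionInput): number {
  const writePerPartition = p.writePerPartitionMBps ?? 20;
  const readPerPartition = p.readPerPartitionMBps ?? 40;
  const base = Math.max(
    Math.ceil(p.writeMBps / writePerPartition),
    Math.ceil(p.readMBps / readPerPartition),
    p.consumerParallelism,
  );
  return Math.ceil(base * p.growthFactor);
}

// High-throughput example above: 200 MB/s write, 500 MB/s read, 3x growth → ~39 partitions
console.log(recommendPartitions({ writeMBps: 200, readMBps: 500, consumerParallelism: 10, growthFactor: 3 }));
```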
### Partition Key Selection
**Good Keys** (High Cardinality, Even Distribution):
```yaml
✅ User ID (UUIDs):
- Millions of unique values
- Even distribution
- Example: "user-123e4567-e89b-12d3-a456-426614174000"
✅ Device ID (IoT):
- Unique per device
- Natural sharding
- Example: "device-sensor-001-zone-a"
✅ Order ID (E-commerce):
- Unique per transaction
- Even temporal distribution
- Example: "order-2024-11-15-abc123"
```
**Bad Keys** (Low Cardinality, Hotspots):
```yaml
❌ Country Code:
- Only ~200 values
- Uneven (US, CN >> others)
- Creates partition hotspots
❌ Boolean Flags:
- Only 2 values (true/false)
- Severe imbalance
❌ Date (YYYY-MM-DD):
- All today's traffic → 1 partition
- Temporal hotspot
```
**Compound Keys** (Best of Both):
```yaml
✅ Country + User ID:
- Partition by country for locality
- Sub-partition by user for distribution
- Example: "US:user-123" → hash("US:user-123")
✅ Tenant + Event Type + Timestamp:
- Multi-tenant isolation
- Event type grouping
- Temporal ordering
```
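To check a candidate key for hotspots before rollout, hash a sample of real keys and measure the skew across partitions; a minimal sketch using FNV-1a as a stand-in for the client's murmur2 partitioner (the goal is estimating skew, not reproducing exact assignments).
```typescript
// Measure how evenly a sample of candidate keys spreads across partitions.
function fnv1a(key: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    hash ^= key.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

function distributionSkew(keys: string[], partitions: number) {
  const counts = new Array<number>(partitions).fill(0);
  for (const key of keys) counts[fnv1a(key) % partitions]++;
  const expected = keys.length / partitions;
  const maxDeviation = Math.max(...counts.map((c) => Math.abs(c - expected) / expected));
  return { counts, maxDeviationPct: Math.round(maxDeviation * 100) };
}

// Sample 10,000 synthetic user IDs; aim for <10% deviation across 12 partitions
const sample = Array.from({ length: 10_000 }, (_, i) => `user-${i}-${Math.random().toString(36).slice(2)}`);
console.log(distributionSkew(sample, 12));
```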
## Replication & High Availability
### Replication Factor Guidelines
```yaml
Development:
Replication Factor: 1
Reason: Fast, no durability needed
Production (Standard):
Replication Factor: 3
Reason: Balance durability vs cost
Tolerates: 1 broker failure with no loss of write availability (min.insync.replicas=2); data survives up to 2 failures
Production (Critical):
Replication Factor: 5
Reason: Maximum durability
Tolerates: 2 broker failures with no loss of write availability (min.insync.replicas=3); data survives up to 4 failures
Use Cases: Financial transactions, audit logs
Multi-Datacenter:
Replication Factor: 3 per DC (6 total)
Reason: DC-level fault tolerance
Requires: MirrorMaker 2 or Confluent Replicator
```
### min.insync.replicas
**Configuration**:
```yaml
min.insync.replicas=2:
- At least 2 replicas must acknowledge writes
- Typical for replication.factor=3
- Prevents data loss if 1 broker fails
min.insync.replicas=1:
- Only leader must acknowledge (dangerous!)
- Use only for non-critical topics
min.insync.replicas=3:
- At least 3 replicas must acknowledge
- For replication.factor=5 (critical systems)
```
**Rule**: `min.insync.replicas ≤ replication.factor - 1` (to allow 1 replica failure)
### Rack Awareness
```yaml
Configuration:
broker.rack=rack1 # Broker 1
broker.rack=rack2 # Broker 2
broker.rack=rack3 # Broker 3
Benefit:
- Replicas spread across racks
- Survives rack-level failures (power, network)
- Example: Topic with RF=3 → 1 replica per rack
Placement:
Leader: rack1
Follower 1: rack2
Follower 2: rack3
```
## Retention Strategies
### Time-Based Retention
```yaml
Short-Term (Events, Logs):
retention.ms: 86400000 # 1 day
Use Cases: Real-time analytics, monitoring
Medium-Term (Transactions):
retention.ms: 604800000 # 7 days
Use Cases: Standard business events
Long-Term (Audit, Compliance):
retention.ms: 31536000000 # 365 days
Use Cases: Regulatory requirements, event sourcing
Infinite (Event Sourcing):
retention.ms: -1 # Forever
cleanup.policy: compact
Use Cases: Source of truth, state rebuilding
```
### Size-Based Retention
```yaml
retention.bytes: 10737418240 # 10 GB per partition
Combined (Time OR Size):
retention.ms: 604800000 # 7 days
retention.bytes: 107374182400 # 100 GB
# Whichever limit is reached first
```
### Compaction (Log Compaction)
```yaml
cleanup.policy: compact
How It Works:
- Keeps only latest value per key
- Deletes old versions
- Preserves full history initially, compacts later
Use Cases:
- Database changelogs (CDC)
- User profile updates
- Configuration management
- State stores
Example:
Before Compaction:
user:123 → {name: "Alice", v:1}
user:123 → {name: "Alice", v:2, email: "alice@ex.com"}
user:123 → {name: "Alice A.", v:3}
After Compaction:
user:123 → {name: "Alice A.", v:3} # Latest only
```
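Retention and compaction are set per topic at creation time (or later via the admin alterConfigs API); a kafkajs admin sketch creating one time/size-retained topic and one compacted topic, where the `user.profiles` topic name is illustrative.
```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'topic-admin', brokers: ['localhost:9092'] });
const admin = kafka.admin();

async function createTopics() {
  await admin.connect();
  await admin.createTopics({
    waitForLeaders: true,
    topics: [
      {
        // time-retained event topic: 7 days OR 100 GB per partition, whichever comes first
        topic: 'orders.events',
        numPartitions: 12,
        replicationFactor: 3,
        configEntries: [
          { name: 'retention.ms', value: '604800000' },
          { name: 'retention.bytes', value: '107374182400' },
          { name: 'min.insync.replicas', value: '2' },
        ],
      },
      {
        // compacted state topic: keep only the latest value per key
        topic: 'user.profiles',
        numPartitions: 12,
        replicationFactor: 3,
        configEntries: [
          { name: 'cleanup.policy', value: 'compact' },
          { name: 'min.insync.replicas', value: '2' },
        ],
      },
    ],
  });
  await admin.disconnect();
}

createTopics().catch(console.error);
```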
## Performance Optimization
### Broker Configuration
```yaml
# Network threads (handle client connections)
num.network.threads: 8 # Increase for high connection count
# I/O threads (disk operations)
num.io.threads: 16 # Set to number of disks × 2
# Replica fetcher threads
num.replica.fetchers: 4 # Increase for many partitions
# Socket buffer sizes
socket.send.buffer.bytes: 1048576 # 1 MB
socket.receive.buffer.bytes: 1048576 # 1 MB
# Log flush (default: OS handles flushing)
log.flush.interval.messages: 10000 # Flush every 10K messages
log.flush.interval.ms: 1000 # Or every 1 second
```
### Producer Optimization
```yaml
High Throughput:
batch.size: 65536 # 64 KB
linger.ms: 100 # Wait 100ms for batching
compression.type: lz4 # Fast compression
acks: 1 # Leader only
Low Latency:
batch.size: 16384 # 16 KB (default)
linger.ms: 0 # Send immediately
compression.type: none
acks: 1
Durability (Exactly-Once):
batch.size: 16384
linger.ms: 10
compression.type: lz4
acks: all
enable.idempotence: true
transactional.id: "producer-1"
```
### Consumer Optimization
```yaml
High Throughput:
fetch.min.bytes: 1048576 # 1 MB
fetch.max.wait.ms: 500 # Wait 500ms to accumulate
Low Latency:
fetch.min.bytes: 1 # Immediate fetch
fetch.max.wait.ms: 100 # Short wait
Max Parallelism:
# Deploy consumers = number of partitions
# More consumers than partitions = idle consumers
```
## Multi-Datacenter Patterns
### Active-Passive (Disaster Recovery)
```yaml
Architecture:
Primary DC: Full Kafka cluster
Secondary DC: Replica cluster (MirrorMaker 2)
Configuration:
- Producers → Primary only
- Consumers → Primary only
- MirrorMaker 2: Primary → Secondary (async replication)
Failover:
1. Detect primary failure
2. Switch producers/consumers to secondary
3. Promote secondary to primary
Recovery Time: 5-30 minutes (manual)
Data Loss: Potential (async replication lag)
```
### Active-Active (Geo-Replication)
```yaml
Architecture:
DC1: Kafka cluster (region A)
DC2: Kafka cluster (region B)
Bidirectional replication via MirrorMaker 2
Configuration:
- Producers → Nearest DC
- Consumers → Nearest DC or both
- Conflict resolution: Last-write-wins or custom
Challenges:
- Duplicate messages (at-least-once delivery)
- Ordering across DCs not guaranteed
- Circular replication prevention
Use Cases:
- Global applications
- Regional compliance (GDPR)
- Load distribution
```
### Stretch Cluster (Synchronous Replication)
```yaml
Architecture:
Single Kafka cluster spanning 2 DCs
Rack awareness: DC1 = rack1, DC2 = rack2
Configuration:
min.insync.replicas: 2
replication.factor: 4 (2 per DC)
acks: all
Requirements:
- Low latency between DCs (<10ms)
- High bandwidth link (10+ Gbps)
- Dedicated fiber
Trade-offs:
Pros: Synchronous replication, zero data loss
Cons: Latency penalty, network dependency
```
## Monitoring & Observability
### Key Metrics
**Broker Metrics**:
```yaml
UnderReplicatedPartitions:
Alert: > 0 for > 5 minutes
Indicates: Replica lag, broker failure
OfflinePartitionsCount:
Alert: > 0
Indicates: No leader elected (critical!)
ActiveControllerCount:
Alert: != 1 (should be exactly 1)
Indicates: Split brain or no controller
RequestHandlerAvgIdlePercent:
Alert: < 20%
Indicates: Broker CPU saturation
```
**Topic Metrics**:
```yaml
MessagesInPerSec:
Monitor: Throughput trends
Alert: Sudden drops (producer failure)
BytesInPerSec / BytesOutPerSec:
Monitor: Network utilization
Alert: Approaching NIC limits
RecordsLagMax (Consumer):
Alert: > 10000 or growing
Indicates: Consumer can't keep up
```
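Consumer lag can also be approximated from the client side. A minimal sketch (assuming kafkajs v2 and an existing consumer group) compares each partition's log-end offset with the group's committed offset:

```typescript
import { Kafka } from 'kafkajs';

const kafka = new Kafka({ clientId: 'lag-checker', brokers: ['localhost:9092'] });

// Lag per partition = log-end offset (producer side) - committed offset (consumer side).
async function printLag(groupId: string, topic: string): Promise<void> {
  const admin = kafka.admin();
  await admin.connect();
  try {
    const endOffsets = await admin.fetchTopicOffsets(topic); // [{ partition, offset, high, low }]
    const committed = await admin.fetchOffsets({ groupId, topics: [topic] });
    const byPartition = new Map(
      committed[0].partitions.map((p) => [p.partition, BigInt(p.offset)] as [number, bigint]),
    );
    for (const { partition, offset } of endOffsets) {
      // -1 means no committed offset yet; treat it as 0 for a rough estimate.
      const committedOffset = byPartition.get(partition) ?? 0n;
      const lag = BigInt(offset) - (committedOffset < 0n ? 0n : committedOffset);
      console.log(`partition ${partition}: lag=${lag}`);
    }
  } finally {
    await admin.disconnect();
  }
}

printLag('my-group', 'my-topic').catch(console.error);
```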
**Disk Metrics**:
```yaml
LogSegmentSize:
Monitor: Disk usage trends
Alert: > 80% capacity
LogFlushRateAndTimeMs:
Monitor: Disk write latency
Alert: > 100ms p99 (slow disk)
```
## Security Patterns
### Authentication & Authorization
```yaml
SASL/SCRAM-SHA-512:
- Industry standard
- User/password authentication
- Stored in ZooKeeper/KRaft
ACLs (Access Control Lists):
- Per-topic, per-group permissions
- Operations: READ, WRITE, CREATE, DELETE, ALTER
- Example:
bin/kafka-acls.sh --add \
--allow-principal User:alice \
--operation READ \
--topic orders
mTLS (Mutual TLS):
- Certificate-based auth
- Strong cryptographic identity
- Best for service-to-service
```
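The same grant can be scripted. A hedged sketch using the kafkajs admin API (the cluster must have an authorizer enabled, and the admin client must authenticate as a principal allowed to manage ACLs; names mirror the kafka-acls.sh example above):

```typescript
import {
  Kafka,
  AclResourceTypes,
  AclOperationTypes,
  AclPermissionTypes,
  ResourcePatternTypes,
} from 'kafkajs';

const kafka = new Kafka({ clientId: 'acl-admin', brokers: ['localhost:9092'] });

// Grants User:alice READ on the "orders" topic, like the kafka-acls.sh example above.
async function grantRead(): Promise<void> {
  const admin = kafka.admin();
  await admin.connect();
  try {
    await admin.createAcls({
      acl: [{
        resourceType: AclResourceTypes.TOPIC,
        resourceName: 'orders',
        resourcePatternType: ResourcePatternTypes.LITERAL,
        principal: 'User:alice',
        host: '*',
        operation: AclOperationTypes.READ,
        permissionType: AclPermissionTypes.ALLOW,
      }],
    });
  } finally {
    await admin.disconnect();
  }
}

grantRead().catch(console.error);
```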
## Integration with SpecWeave
**Automatic Architecture Detection**:
```typescript
import { ClusterSizingCalculator } from './lib/utils/sizing';
const calculator = new ClusterSizingCalculator();
const recommendation = calculator.calculate({
throughputMBps: 200,
retentionDays: 30,
replicationFactor: 3,
topicCount: 100
});
console.log(recommendation);
// {
// brokers: 8,
// partitionsPerBroker: 1500,
// diskPerBroker: 6000 GB,
// ramPerBroker: 64 GB
// }
```
**SpecWeave Commands**:
- `/specweave-kafka:deploy` - Validates cluster sizing before deployment
- `/specweave-kafka:monitor-setup` - Configures metrics for key indicators
## Related Skills
- `/specweave-kafka:kafka-mcp-integration` - MCP server setup
- `/specweave-kafka:kafka-cli-tools` - CLI operations
## External Links
- [Kafka Documentation - Architecture](https://kafka.apache.org/documentation/#design)
- [Confluent - Kafka Sizing](https://www.confluent.io/blog/how-to-choose-the-number-of-topics-partitions-in-a-kafka-cluster/)
- [KRaft Mode Overview](https://kafka.apache.org/documentation/#kraft)
- [LinkedIn Engineering - Kafka at Scale](https://engineering.linkedin.com/kafka/running-kafka-scale)

View File

@@ -0,0 +1,433 @@
---
name: kafka-cli-tools
description: Expert knowledge of Kafka CLI tools (kcat, kcli, kaf, kafkactl). Auto-activates on keywords kcat, kafkacat, kcli, kaf, kafkactl, kafka cli, kafka command line, produce message, consume topic, list topics, kafka metadata. Provides command examples, installation guides, and tool comparisons.
---
# Kafka CLI Tools Expert
Comprehensive knowledge of modern Kafka CLI tools for production operations, development, and troubleshooting.
## Supported CLI Tools
### 1. kcat (kafkacat) - The Swiss Army Knife
**Installation**:
```bash
# macOS
brew install kcat
# Ubuntu/Debian (packaged as kafkacat on older releases, kcat on newer ones)
apt-get install kafkacat

# From source
git clone https://github.com/edenhill/kcat.git
cd kcat
./configure && make && sudo make install
```
**Core Operations**:
**Produce Messages**:
```bash
# Simple produce
echo "Hello Kafka" | kcat -P -b localhost:9092 -t my-topic
# Produce with key (key:value format)
echo "user123:Login event" | kcat -P -b localhost:9092 -t events -K:
# Produce from file
cat events.json | kcat -P -b localhost:9092 -t events
# Produce with headers
echo "msg" | kcat -P -b localhost:9092 -t my-topic -H "source=app1" -H "version=1.0"
# Produce with compression
echo "data" | kcat -P -b localhost:9092 -t my-topic -z gzip
# Produce with acks=all
echo "critical-data" | kcat -P -b localhost:9092 -t my-topic -X acks=all
```
**Consume Messages**:
```bash
# Consume from beginning
kcat -C -b localhost:9092 -t my-topic -o beginning
# Consume from end (latest)
kcat -C -b localhost:9092 -t my-topic -o end
# Consume specific partition
kcat -C -b localhost:9092 -t my-topic -p 0 -o beginning
# Consume with consumer group
kcat -C -b localhost:9092 -G my-group my-topic
# Consume N messages and exit
kcat -C -b localhost:9092 -t my-topic -c 10
# Custom format (topic:partition:offset:key:value)
kcat -C -b localhost:9092 -t my-topic -f 'Topic: %t, Partition: %p, Offset: %o, Key: %k, Value: %s\n'
# JSON output
kcat -C -b localhost:9092 -t my-topic -J
```
**Metadata & Admin**:
```bash
# List all topics
kcat -L -b localhost:9092
# Get topic metadata (JSON)
kcat -L -b localhost:9092 -t my-topic -J
# Query topic offsets
kcat -Q -b localhost:9092 -t my-topic
# Check broker health
kcat -L -b localhost:9092 | grep "broker\|topic"
```
**SASL/SSL Authentication**:
```bash
# SASL/PLAINTEXT
kcat -b localhost:9092 \
-X security.protocol=SASL_PLAINTEXT \
-X sasl.mechanism=PLAIN \
-X sasl.username=admin \
-X sasl.password=admin-secret \
-L
# SASL/SSL
kcat -b localhost:9093 \
-X security.protocol=SASL_SSL \
-X sasl.mechanism=SCRAM-SHA-256 \
-X sasl.username=admin \
-X sasl.password=admin-secret \
-X ssl.ca.location=/path/to/ca-cert \
-L
# mTLS (mutual TLS)
kcat -b localhost:9093 \
-X security.protocol=SSL \
-X ssl.ca.location=/path/to/ca-cert \
-X ssl.certificate.location=/path/to/client-cert.pem \
-X ssl.key.location=/path/to/client-key.pem \
-L
```
### 2. kcli - Kubernetes-Native Kafka CLI
**Installation**:
```bash
# Install via krew (Kubernetes plugin manager)
kubectl krew install kcli
# Or download binary
curl -LO https://github.com/cswank/kcli/releases/latest/download/kcli-linux-amd64
chmod +x kcli-linux-amd64
sudo mv kcli-linux-amd64 /usr/local/bin/kcli
```
**Kubernetes Integration**:
```bash
# Connect to Kafka running in k8s
kcli --context my-cluster --namespace kafka
# Produce to topic in k8s
echo "msg" | kcli produce --topic my-topic --brokers kafka-broker:9092
# Consume from k8s Kafka
kcli consume --topic my-topic --brokers kafka-broker:9092 --from-beginning
# List topics in k8s cluster
kcli topics list --brokers kafka-broker:9092
```
**Best For**:
- Kubernetes-native deployments
- Helmfile/Kustomize workflows
- GitOps with ArgoCD/Flux
### 3. kaf - Modern Terminal UI
**Installation**:
```bash
# macOS
brew install kaf
# Linux (via snap)
snap install kaf
# From source
go install github.com/birdayz/kaf/cmd/kaf@latest
```
**Interactive Features**:
```bash
# Configure cluster
kaf config add-cluster local --brokers localhost:9092
# Use cluster
kaf config use-cluster local
# Interactive topic browsing (TUI)
kaf topics
# Interactive consume (arrow keys to navigate)
kaf consume my-topic
# Produce interactively
kaf produce my-topic
# Consumer group management
kaf groups
kaf group describe my-group
kaf group reset my-group --topic my-topic --offset earliest
# Schema Registry integration
kaf schemas
kaf schema get my-schema
```
**Best For**:
- Development workflows
- Quick topic exploration
- Consumer group debugging
- Schema Registry management
### 4. kafkactl - Advanced Admin Tool
**Installation**:
```bash
# macOS
brew install deviceinsight/packages/kafkactl
# Linux
curl -L https://github.com/deviceinsight/kafkactl/releases/latest/download/kafkactl_linux_amd64 -o kafkactl
chmod +x kafkactl
sudo mv kafkactl /usr/local/bin/
# Via Docker
docker run --rm -it deviceinsight/kafkactl:latest
```
**Advanced Operations**:
```bash
# Configure context
kafkactl config add-context local --brokers localhost:9092
# Topic management
kafkactl create topic my-topic --partitions 3 --replication-factor 2
kafkactl alter topic my-topic --config retention.ms=86400000
kafkactl delete topic my-topic
# Consumer group operations
kafkactl describe consumer-group my-group
kafkactl reset consumer-group my-group --topic my-topic --offset earliest
kafkactl delete consumer-group my-group
# ACL management
kafkactl create acl --allow --principal User:alice --operation READ --topic my-topic
kafkactl list acls
# Quota management
kafkactl alter client-quota --user alice --producer-byte-rate 1048576
# Reassign partitions
kafkactl alter partition --topic my-topic --partition 0 --replicas 1,2,3
```
**Best For**:
- Production cluster management
- ACL administration
- Partition reassignment
- Quota management
## Tool Comparison Matrix
| Feature | kcat | kcli | kaf | kafkactl |
|---------|------|------|-----|----------|
| **Installation** | Easy | Medium | Easy | Easy |
| **Produce** | ✅ Advanced | ✅ Basic | ✅ Interactive | ✅ Basic |
| **Consume** | ✅ Advanced | ✅ Basic | ✅ Interactive | ✅ Basic |
| **Metadata** | ✅ JSON | ✅ Basic | ✅ TUI | ✅ Detailed |
| **TUI** | ❌ | ❌ | ✅ | ✅ Limited |
| **Admin** | ❌ | ❌ | ⚠️ Limited | ✅ Advanced |
| **SASL/SSL** | ✅ | ✅ | ✅ | ✅ |
| **K8s Native** | ❌ | ✅ | ❌ | ❌ |
| **Schema Reg** | ❌ | ❌ | ✅ | ❌ |
| **ACLs** | ❌ | ❌ | ❌ | ✅ |
| **Quotas** | ❌ | ❌ | ❌ | ✅ |
| **Best For** | Scripting, ops | Kubernetes | Development | Production admin |
## Common Patterns
### 1. Topic Creation with Optimal Settings
```bash
# Using kafkactl (recommended for production)
kafkactl create topic orders \
--partitions 12 \
--replication-factor 3 \
--config retention.ms=604800000 \
--config compression.type=lz4 \
--config min.insync.replicas=2
# Verify with kcat
kcat -L -b localhost:9092 -t orders -J | jq '.topics[0]'
```
### 2. Dead Letter Queue Pattern
```bash
# Produce failed message to DLQ
echo "failed-msg" | kcat -P -b localhost:9092 -t orders-dlq \
-H "original-topic=orders" \
-H "error=DeserializationException" \
-H "timestamp=$(date -u +%Y-%m-%dT%H:%M:%SZ)"
# Monitor DLQ
kcat -C -b localhost:9092 -t orders-dlq -f 'Headers: %h\nValue: %s\n\n'
```
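The same pattern from application code, as a hedged kafkajs sketch (topic and header names mirror the kcat example; in real code the producer would be connected once per process, not per message):

```typescript
import { Kafka, KafkaMessage } from 'kafkajs';

const kafka = new Kafka({ clientId: 'order-consumer', brokers: ['localhost:9092'] });
const dlqProducer = kafka.producer();

// Forward a message that failed processing to the dead letter queue,
// preserving the original payload and attaching error metadata as headers.
async function sendToDlq(original: KafkaMessage, error: Error): Promise<void> {
  await dlqProducer.connect();
  await dlqProducer.send({
    topic: 'orders-dlq',
    messages: [{
      key: original.key,
      value: original.value,
      headers: {
        'original-topic': 'orders',
        'error': error.name,
        'timestamp': new Date().toISOString(),
      },
    }],
  });
  await dlqProducer.disconnect();
}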
### 3. Consumer Group Lag Monitoring
```bash
# Using kafkactl
kafkactl describe consumer-group my-app | grep LAG
# Using kcat (metadata and offsets only; kcat does not compute lag itself)
kcat -L -b localhost:9092 -J | jq '.topics[] | select(.topic=="my-topic") | .partitions[]'
# Using kaf (interactive)
kaf groups
# Then select group to see lag in TUI
```
### 4. Multi-Cluster Replication Testing
```bash
# Produce to source cluster
echo "test" | kcat -P -b source-kafka:9092 -t replicated-topic
# Consume from target cluster
kcat -C -b target-kafka:9092 -t replicated-topic -o end -c 1
# Compare offsets
kcat -Q -b source-kafka:9092 -t replicated-topic
kcat -Q -b target-kafka:9092 -t replicated-topic
```
### 5. Performance Testing
```bash
# Produce 10,000 messages with kcat
seq 1 10000 | kcat -P -b localhost:9092 -t perf-test
# Consume and measure throughput
time kcat -C -b localhost:9092 -t perf-test -c 10000 -o beginning > /dev/null
# Test with compression
seq 1 10000 | kcat -P -b localhost:9092 -t perf-test -z lz4
```
## Troubleshooting
### Connection Issues
```bash
# Test broker connectivity
kcat -L -b localhost:9092
# Check SSL/TLS connection
openssl s_client -connect localhost:9093 -showcerts
# Verify SASL authentication
kcat -b localhost:9092 \
-X security.protocol=SASL_PLAINTEXT \
-X sasl.mechanism=PLAIN \
-X sasl.username=admin \
-X sasl.password=wrong-password \
-L
# Should fail with authentication error
```
### Message Not Appearing
```bash
# Check topic exists
kcat -L -b localhost:9092 | grep my-topic
# Check partition count
kcat -L -b localhost:9092 -t my-topic -J | jq '.topics[0].partitions | length'
# Query all partition offsets
kcat -Q -b localhost:9092 -t my-topic
# Check the last message in each partition (adjust the range to your partition count)
for i in {0..11}; do
  echo "Partition $i:"
  kcat -C -b localhost:9092 -t my-topic -p $i -o -1 -c 1 -e
done
```
### Consumer Group Stuck
```bash
# Check consumer group state
kafkactl describe consumer-group my-app
# Reset to beginning
kafkactl reset consumer-group my-app --topic my-topic --offset earliest
# Reset to specific offset
kafkactl reset consumer-group my-app --topic my-topic --partition 0 --offset 12345
# Delete consumer group (all consumers must be stopped first)
kafkactl delete consumer-group my-app
```
## Integration with SpecWeave
**Automatic CLI Tool Detection**:
SpecWeave auto-detects installed CLI tools and recommends the best tool for each operation:
```typescript
import { CLIToolDetector } from './lib/cli/detector';
const detector = new CLIToolDetector();
const available = await detector.detectAll();
// Recommended tool for produce operation
if (available.includes('kcat')) {
console.log('Use kcat for produce (fastest)');
} else if (available.includes('kaf')) {
console.log('Use kaf for produce (interactive)');
}
```
**SpecWeave Commands**:
- `/specweave-kafka:dev-env` - Uses Docker Compose + kcat for local testing
- `/specweave-kafka:monitor-setup` - Sets up kcat-based lag monitoring
- `/specweave-kafka:mcp-configure` - Validates CLI tools are installed
## Security Best Practices
1. **Never hardcode credentials** - Use environment variables or secrets management
2. **Use SSL/TLS in production** - Configure `-X security.protocol=SASL_SSL`
3. **Prefer SCRAM over PLAIN** - Use `-X sasl.mechanism=SCRAM-SHA-256`
4. **Rotate credentials regularly** - Update passwords and certificates
5. **Least privilege** - Grant only necessary ACLs to users
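A minimal client-side sketch applying practices 1-3 (assuming kafkajs; environment variable names are illustrative):

```typescript
import { Kafka } from 'kafkajs';

// Credentials come from the environment (or a secrets manager), never from source control.
const { KAFKA_BROKERS, KAFKA_USERNAME, KAFKA_PASSWORD } = process.env;
if (!KAFKA_BROKERS || !KAFKA_USERNAME || !KAFKA_PASSWORD) {
  throw new Error('KAFKA_BROKERS, KAFKA_USERNAME and KAFKA_PASSWORD must be set');
}

export const kafka = new Kafka({
  clientId: 'secure-app',
  brokers: KAFKA_BROKERS.split(','),
  ssl: true,                    // TLS in transit (SASL_SSL)
  sasl: {
    mechanism: 'scram-sha-256', // prefer SCRAM over PLAIN
    username: KAFKA_USERNAME,
    password: KAFKA_PASSWORD,
  },
});
```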
## Related Skills
- `/specweave-kafka:kafka-mcp-integration` - MCP server setup and configuration
- `/specweave-kafka:kafka-architecture` - Cluster design and sizing
## External Links
- [kcat GitHub](https://github.com/edenhill/kcat)
- [kcli GitHub](https://github.com/cswank/kcli)
- [kaf GitHub](https://github.com/birdayz/kaf)
- [kafkactl GitHub](https://github.com/deviceinsight/kafkactl)
- [Apache Kafka Documentation](https://kafka.apache.org/documentation/)

View File

@@ -0,0 +1,449 @@
---
name: kafka-iac-deployment
description: Infrastructure as Code (IaC) deployment expert for Apache Kafka. Guides Terraform deployments across Apache Kafka (KRaft mode), AWS MSK, Azure Event Hubs. Activates for terraform, iac, infrastructure as code, deploy kafka, provision kafka, aws msk, azure event hubs, kafka infrastructure, terraform modules, cloud deployment, kafka deployment automation.
---
# Kafka Infrastructure as Code (IaC) Deployment
Expert guidance for deploying Apache Kafka using Terraform across multiple platforms.
## When to Use This Skill
I activate when you need help with:
- **Terraform deployments**: "Deploy Kafka with Terraform", "provision Kafka cluster"
- **Platform selection**: "Should I use AWS MSK or self-hosted Kafka?", "compare Kafka platforms"
- **Infrastructure planning**: "How to size Kafka infrastructure", "Kafka on AWS vs Azure"
- **IaC automation**: "Automate Kafka deployment", "CI/CD for Kafka infrastructure"
## What I Know
### Available Terraform Modules
This plugin provides 3 production-ready Terraform modules:
#### 1. **Apache Kafka (Self-Hosted, KRaft Mode)**
- **Location**: `plugins/specweave-kafka/terraform/apache-kafka/`
- **Platform**: AWS EC2 (can adapt to other clouds)
- **Architecture**: KRaft mode (no ZooKeeper dependency)
- **Features**:
- Multi-broker cluster (3-5 brokers recommended)
- Security groups with SASL_SSL
- IAM roles for S3 backups
- CloudWatch metrics and alarms
- Auto-scaling group support
- Custom VPC and subnet configuration
- **Use When**:
- ✅ You need full control over Kafka configuration
- ✅ Running Kafka 3.6+ (KRaft mode)
- ✅ Want to avoid ZooKeeper operational overhead
- ✅ Multi-cloud or hybrid deployments
- **Variables**:
```hcl
module "kafka" {
source = "../../plugins/specweave-kafka/terraform/apache-kafka"
environment = "production"
broker_count = 3
kafka_version = "3.7.0"
instance_type = "m5.xlarge"
vpc_id = var.vpc_id
subnet_ids = var.subnet_ids
domain = "example.com"
enable_s3_backups = true
enable_monitoring = true
}
```
#### 2. **AWS MSK (Managed Streaming for Kafka)**
- **Location**: `plugins/specweave-kafka/terraform/aws-msk/`
- **Platform**: AWS Managed Service
- **Features**:
- Fully managed Kafka service
- IAM authentication + SASL/SCRAM
- Auto-scaling (provisioned throughput)
- Built-in monitoring (CloudWatch)
- Multi-AZ deployment
- Encryption in transit and at rest
- **Use When**:
- ✅ You want AWS to manage Kafka operations
- ✅ Need tight AWS integration (IAM, KMS, CloudWatch)
- ✅ Prefer operational simplicity over cost
- ✅ Running in AWS VPC
- **Variables**:
```hcl
module "msk" {
source = "../../plugins/specweave-kafka/terraform/aws-msk"
cluster_name = "my-kafka-cluster"
kafka_version = "3.6.0"
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.large"
vpc_id = var.vpc_id
subnet_ids = var.private_subnet_ids
enable_iam_auth = true
enable_scram_auth = false
enable_auto_scaling = true
}
```
#### 3. **Azure Event Hubs (Kafka API)**
- **Location**: `plugins/specweave-kafka/terraform/azure-event-hubs/`
- **Platform**: Azure Managed Service
- **Features**:
- Kafka 1.0+ protocol support
- Auto-inflate (elastic scaling)
- Premium SKU for high throughput
- Zone redundancy
- Private endpoints (VNet integration)
- Event capture to Azure Storage
- **Use When**:
- ✅ Running on Azure cloud
- ✅ Need Kafka-compatible API without Kafka operations
- ✅ Want serverless scaling (auto-inflate)
- ✅ Integrating with Azure ecosystem
- **Variables**:
```hcl
module "event_hubs" {
source = "../../plugins/specweave-kafka/terraform/azure-event-hubs"
namespace_name = "my-event-hub-ns"
resource_group_name = var.resource_group_name
location = "eastus"
sku = "Premium"
capacity = 1
kafka_enabled = true
auto_inflate_enabled = true
maximum_throughput_units = 20
}
```
## Platform Selection Decision Tree
```
Need Kafka deployment? START HERE:
├─ Running on AWS?
│ ├─ YES → Want managed service?
│ │ ├─ YES → Use AWS MSK module (terraform/aws-msk)
│ │ └─ NO → Use Apache Kafka module (terraform/apache-kafka)
│ └─ NO → Continue...
├─ Running on Azure?
│ ├─ YES → Use Azure Event Hubs module (terraform/azure-event-hubs)
│ └─ NO → Continue...
├─ Multi-cloud or hybrid?
│ └─ YES → Use Apache Kafka module (most portable)
├─ Need maximum control?
│ └─ YES → Use Apache Kafka module
└─ Default → Use Apache Kafka module (self-hosted, KRaft mode)
```
## Deployment Workflows
### Workflow 1: Deploy Self-Hosted Kafka (Apache Kafka Module)
**Scenario**: You want full control over Kafka on AWS EC2
```bash
# 1. Create Terraform configuration
cat > main.tf <<EOF
module "kafka_cluster" {
source = "../../plugins/specweave-kafka/terraform/apache-kafka"
environment = "production"
broker_count = 3
kafka_version = "3.7.0"
instance_type = "m5.xlarge"
vpc_id = "vpc-12345678"
subnet_ids = ["subnet-abc", "subnet-def", "subnet-ghi"]
domain = "kafka.example.com"
enable_s3_backups = true
enable_monitoring = true
tags = {
Project = "MyApp"
Environment = "Production"
}
}
output "broker_endpoints" {
value = module.kafka_cluster.broker_endpoints
}
EOF
# 2. Initialize Terraform
terraform init
# 3. Plan deployment (review what will be created)
terraform plan
# 4. Apply (create infrastructure)
terraform apply
# 5. Get broker endpoints
terraform output broker_endpoints
# Output: ["kafka-0.kafka.example.com:9093", "kafka-1.kafka.example.com:9093", ...]
```
### Workflow 2: Deploy AWS MSK (Managed Service)
**Scenario**: You want AWS to manage Kafka operations
```bash
# 1. Create Terraform configuration
cat > main.tf <<EOF
module "msk_cluster" {
source = "../../plugins/specweave-kafka/terraform/aws-msk"
cluster_name = "my-msk-cluster"
kafka_version = "3.6.0"
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.large"
vpc_id = var.vpc_id
subnet_ids = var.private_subnet_ids
enable_iam_auth = true
enable_auto_scaling = true
tags = {
Project = "MyApp"
}
}
output "bootstrap_brokers" {
value = module.msk_cluster.bootstrap_brokers_sasl_iam
}
EOF
# 2. Deploy
terraform init && terraform apply
# 3. Configure IAM authentication
# (module outputs IAM policy, attach to your application role)
```
### Workflow 3: Deploy Azure Event Hubs (Kafka API)
**Scenario**: You're on Azure and want Kafka-compatible API
```bash
# 1. Create Terraform configuration
cat > main.tf <<EOF
module "event_hubs" {
source = "../../plugins/specweave-kafka/terraform/azure-event-hubs"
namespace_name = "my-kafka-namespace"
resource_group_name = "my-resource-group"
location = "eastus"
sku = "Premium"
capacity = 1
kafka_enabled = true
auto_inflate_enabled = true
maximum_throughput_units = 20
# Create hubs (topics) for your use case
hubs = [
{ name = "user-events", partitions = 12 },
{ name = "order-events", partitions = 6 },
{ name = "payment-events", partitions = 3 }
]
}
output "connection_string" {
value = module.event_hubs.connection_string
sensitive = true
}
EOF
# 2. Deploy
terraform init && terraform apply
# 3. Get connection details
terraform output connection_string
```
## Infrastructure Sizing Recommendations
### Small Environment (Dev/Test)
```hcl
# Self-hosted: 1 broker, m5.large
broker_count = 1
instance_type = "m5.large"
# AWS MSK: 1 broker per AZ, kafka.m5.large
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.large"
# Azure Event Hubs: Basic SKU
sku = "Basic"
capacity = 1
```
### Medium Environment (Staging/Production)
```hcl
# Self-hosted: 3 brokers, m5.xlarge
broker_count = 3
instance_type = "m5.xlarge"
# AWS MSK: 3 brokers, kafka.m5.xlarge
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.xlarge"
# Azure Event Hubs: Standard SKU with auto-inflate
sku = "Standard"
capacity = 2
auto_inflate_enabled = true
maximum_throughput_units = 10
```
### Large Environment (High-Throughput Production)
```hcl
# Self-hosted: 5+ brokers, m5.2xlarge or m5.4xlarge
broker_count = 5
instance_type = "m5.2xlarge"
# AWS MSK: 6+ brokers, kafka.m5.2xlarge, auto-scaling
number_of_broker_nodes = 6
broker_node_instance_type = "kafka.m5.2xlarge"
enable_auto_scaling = true
# Azure Event Hubs: Premium SKU with zone redundancy
sku = "Premium"
capacity = 4
zone_redundant = true
maximum_throughput_units = 20
```
## Best Practices
### Security Best Practices
1. **Always use encryption in transit**
- Self-hosted: Enable SASL_SSL listener
- AWS MSK: Set `encryption_in_transit_client_broker = "TLS"`
- Azure Event Hubs: HTTPS/TLS enabled by default
2. **Use IAM authentication (when possible)**
- AWS MSK: `enable_iam_auth = true`
- Azure Event Hubs: Managed identities
3. **Network isolation**
- Deploy in private subnets
- Use security groups/NSGs restrictively
- Azure: Enable private endpoints for Premium SKU
### High Availability Best Practices
1. **Multi-AZ deployment**
- Self-hosted: Distribute brokers across 3+ AZs
- AWS MSK: Automatically multi-AZ
- Azure Event Hubs: Enable `zone_redundant = true` (Premium)
2. **Replication factor = 3**
- Self-hosted: `default.replication.factor=3`
- AWS MSK: Configured automatically
- Azure Event Hubs: N/A (fully managed)
3. **min.insync.replicas = 2**
- Ensures durability even if 1 broker fails
### Cost Optimization
1. **Right-size instances**
- Use ClusterSizingCalculator utility (in kafka-architecture skill)
- Start small, scale up based on metrics
2. **Auto-scaling (where available)**
- AWS MSK: `enable_auto_scaling = true`
- Azure Event Hubs: `auto_inflate_enabled = true`
3. **Retention policies**
- Set `log.retention.hours` based on actual needs (default: 168 hours = 7 days)
- Shorter retention = lower storage costs
## Monitoring Integration
All modules integrate with monitoring:
### Self-Hosted Kafka
- CloudWatch metrics (via JMX Exporter)
- Prometheus + Grafana dashboards (see kafka-observability skill)
- Custom CloudWatch alarms
### AWS MSK
- Built-in CloudWatch metrics
- Enhanced monitoring available
- Integration with CloudWatch Alarms
### Azure Event Hubs
- Built-in Azure Monitor metrics
- Diagnostic logs to Log Analytics
- Integration with Azure Alerts
## Troubleshooting
### "Terraform destroy fails on security groups"
**Cause**: Resources using security groups still exist
**Fix**:
```bash
# 1. Find dependent resources
aws ec2 describe-network-interfaces --filters "Name=group-id,Values=sg-12345678"
# 2. Delete dependent resources first
# 3. Retry terraform destroy
```
### "AWS MSK cluster takes 20+ minutes to create"
**Cause**: MSK provisioning is inherently slow (AWS behavior)
**Fix**: This is expected; budget 20-30 minutes in automation pipelines and use `-auto-approve` to skip the interactive prompt:
```bash
terraform apply -auto-approve
```
### "Azure Event Hubs: Connection refused"
**Cause**: Kafka protocol not enabled OR incorrect connection string
**Fix**:
1. Verify `kafka_enabled = true` in Terraform
2. Use Kafka connection string (not Event Hubs connection string)
3. Check firewall rules (Premium SKU supports private endpoints)
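For reference, a hedged kafkajs sketch of the Kafka-protocol connection to Event Hubs (the SASL username is literally `$ConnectionString`; the namespace and environment variable names are illustrative):

```typescript
import { Kafka } from 'kafkajs';

// Event Hubs exposes the Kafka protocol on port 9093 of the namespace endpoint.
// The connection string comes from the Terraform output and stays out of source control.
const connectionString = process.env.EVENTHUB_CONNECTION_STRING;
if (!connectionString) {
  throw new Error('EVENTHUB_CONNECTION_STRING must be set');
}

const kafka = new Kafka({
  clientId: 'eventhubs-client',
  brokers: ['my-kafka-namespace.servicebus.windows.net:9093'],
  ssl: true,
  sasl: {
    mechanism: 'plain',
    username: '$ConnectionString', // literal value required by Event Hubs
    password: connectionString,
  },
});

async function smokeTest(): Promise<void> {
  const admin = kafka.admin();
  await admin.connect();
  console.log(await admin.listTopics()); // hubs appear as Kafka topics
  await admin.disconnect();
}

smokeTest().catch(console.error);
```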
## Integration with Other Skills
- **kafka-architecture**: For cluster sizing and partitioning strategy
- **kafka-observability**: For Prometheus + Grafana setup after deployment
- **kafka-kubernetes**: For deploying Kafka on Kubernetes (alternative to Terraform)
- **kafka-cli-tools**: For testing deployed clusters with kcat
## Quick Reference Commands
```bash
# Terraform workflow
terraform init # Initialize modules
terraform plan # Preview changes
terraform apply # Create infrastructure
terraform output # Get outputs (endpoints, etc.)
terraform destroy # Delete infrastructure
# AWS MSK specific
aws kafka list-clusters # List MSK clusters
aws kafka describe-cluster --cluster-arn <arn> # Get cluster details
# Azure Event Hubs specific
az eventhubs namespace list # List namespaces
az eventhubs eventhub list --namespace-name <name> --resource-group <rg> # List hubs
```
---
**Next Steps After Deployment**:
1. Use **kafka-observability** skill to set up Prometheus + Grafana monitoring
2. Use **kafka-cli-tools** skill to test cluster with kcat
3. Deploy your producer/consumer applications
4. Monitor cluster health and performance

View File

@@ -0,0 +1,667 @@
---
name: kafka-kubernetes
description: Kubernetes deployment expert for Apache Kafka. Guides K8s deployments using Helm charts, operators (Strimzi, Confluent), StatefulSets, and production best practices. Activates for kubernetes, k8s, helm, kafka on kubernetes, strimzi, confluent operator, kafka operator, statefulset, kafka helm chart, k8s deployment, kubernetes kafka, deploy kafka to k8s.
---
# Kafka on Kubernetes Deployment
Expert guidance for deploying Apache Kafka on Kubernetes using industry-standard tools.
## When to Use This Skill
I activate when you need help with:
- **Kubernetes deployments**: "Deploy Kafka on Kubernetes", "run Kafka in K8s", "Kafka Helm chart"
- **Operator selection**: "Strimzi vs Confluent Operator", "which Kafka operator to use"
- **StatefulSet patterns**: "Kafka StatefulSet best practices", "persistent volumes for Kafka"
- **Production K8s**: "Production-ready Kafka on K8s", "Kafka high availability in Kubernetes"
## What I Know
### Deployment Options Comparison
| Approach | Difficulty | Production-Ready | Best For |
|----------|-----------|------------------|----------|
| **Strimzi Operator** | Easy | ✅ Yes | Self-managed Kafka on K8s, CNCF project |
| **Confluent Operator** | Medium | ✅ Yes | Enterprise features, Confluent ecosystem |
| **Bitnami Helm Chart** | Easy | ⚠️ Mostly | Quick dev/staging environments |
| **Custom StatefulSet** | Hard | ⚠️ Requires expertise | Full control, custom requirements |
**Recommendation**: **Strimzi Operator** for most production use cases (CNCF project, active community, KRaft support)
## Deployment Approach 1: Strimzi Operator (Recommended)
**Strimzi** is a CNCF project providing Kubernetes operators for Apache Kafka.
### Features
- ✅ KRaft mode support (Kafka 3.6+, no ZooKeeper)
- ✅ Declarative Kafka management (CRDs)
- ✅ Automatic rolling upgrades
- ✅ Built-in monitoring (Prometheus metrics)
- ✅ Mirror Maker 2 for replication
- ✅ Kafka Connect integration
- ✅ User and topic management via CRDs
### Installation (Helm)
```bash
# 1. Add Strimzi Helm repository
helm repo add strimzi https://strimzi.io/charts/
helm repo update
# 2. Create namespace
kubectl create namespace kafka
# 3. Install Strimzi Operator
helm install strimzi-kafka-operator strimzi/strimzi-kafka-operator \
--namespace kafka \
--set watchNamespaces="{kafka}" \
--version 0.39.0
# 4. Verify operator is running
kubectl get pods -n kafka
# Output: strimzi-cluster-operator-... Running
```
### Deploy Kafka Cluster (KRaft Mode)
```yaml
# kafka-cluster.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
name: kafka-pool
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
replicas: 3
roles:
- controller
- broker
storage:
type: jbod
volumes:
- id: 0
type: persistent-claim
size: 100Gi
class: fast-ssd
deleteClaim: false
---
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: my-kafka-cluster
namespace: kafka
annotations:
strimzi.io/kraft: enabled
strimzi.io/node-pools: enabled
spec:
kafka:
version: 3.7.0
metadataVersion: 3.7-IV4
replicas: 3
listeners:
- name: plain
port: 9092
type: internal
tls: false
- name: tls
port: 9093
type: internal
tls: true
authentication:
type: tls
- name: external
port: 9094
type: loadbalancer
tls: true
authentication:
type: tls
config:
default.replication.factor: 3
min.insync.replicas: 2
offsets.topic.replication.factor: 3
transaction.state.log.replication.factor: 3
transaction.state.log.min.isr: 2
auto.create.topics.enable: false
log.retention.hours: 168
log.segment.bytes: 1073741824
compression.type: lz4
resources:
requests:
memory: 4Gi
cpu: "2"
limits:
memory: 8Gi
cpu: "4"
jvmOptions:
-Xms: 2048m
-Xmx: 4096m
metricsConfig:
type: jmxPrometheusExporter
valueFrom:
configMapKeyRef:
name: kafka-metrics
key: kafka-metrics-config.yml
```
```bash
# Apply Kafka cluster
kubectl apply -f kafka-cluster.yaml
# Wait for cluster to be ready (5-10 minutes)
kubectl wait kafka/my-kafka-cluster --for=condition=Ready --timeout=600s -n kafka
# Check status
kubectl get kafka -n kafka
# Output: my-kafka-cluster 3.7.0 3 True
```
### Create Topics (Declaratively)
```yaml
# kafka-topics.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
name: user-events
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
partitions: 12
replicas: 3
config:
retention.ms: 604800000 # 7 days
segment.bytes: 1073741824
compression.type: lz4
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
name: order-events
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
partitions: 6
replicas: 3
config:
retention.ms: 2592000000 # 30 days
min.insync.replicas: 2
```
```bash
# Apply topics
kubectl apply -f kafka-topics.yaml
# Verify topics created
kubectl get kafkatopics -n kafka
```
### Create Users (Declaratively)
```yaml
# kafka-users.yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: my-producer
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
authentication:
type: tls
authorization:
type: simple
acls:
- resource:
type: topic
name: user-events
patternType: literal
operations: [Write, Describe]
- resource:
type: topic
name: order-events
patternType: literal
operations: [Write, Describe]
---
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaUser
metadata:
name: my-consumer
namespace: kafka
labels:
strimzi.io/cluster: my-kafka-cluster
spec:
authentication:
type: tls
authorization:
type: simple
acls:
- resource:
type: topic
name: user-events
patternType: literal
operations: [Read, Describe]
- resource:
type: group
name: my-consumer-group
patternType: literal
operations: [Read]
```
```bash
# Apply users
kubectl apply -f kafka-users.yaml
# Get user credentials (TLS certificates)
kubectl get secret my-producer -n kafka -o jsonpath='{.data.user\.crt}' | base64 -d > producer.crt
kubectl get secret my-producer -n kafka -o jsonpath='{.data.user\.key}' | base64 -d > producer.key
kubectl get secret my-kafka-cluster-cluster-ca-cert -n kafka -o jsonpath='{.data.ca\.crt}' | base64 -d > ca.crt
```
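A sketch of how an application would use those extracted files with a TLS-authenticated client (assuming kafkajs; the bootstrap address shown is the standard Strimzi `<cluster>-kafka-bootstrap` service on the TLS listener and may differ in your setup):

```typescript
import { readFileSync } from 'fs';
import { Kafka } from 'kafkajs';

// Certificates extracted from the Strimzi-generated secrets above.
const kafka = new Kafka({
  clientId: 'my-producer',
  brokers: ['my-kafka-cluster-kafka-bootstrap.kafka.svc:9093'],
  ssl: {
    ca: [readFileSync('ca.crt', 'utf-8')],
    cert: readFileSync('producer.crt', 'utf-8'),
    key: readFileSync('producer.key', 'utf-8'),
  },
});

async function produce(): Promise<void> {
  const producer = kafka.producer();
  await producer.connect();
  await producer.send({
    topic: 'user-events',
    messages: [{ key: 'user-123', value: JSON.stringify({ event: 'login' }) }],
  });
  await producer.disconnect();
}

produce().catch(console.error);
```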
## Deployment Approach 2: Confluent Operator
**Confluent for Kubernetes (CFK)** provides enterprise-grade Kafka management.
### Features
- ✅ Full Confluent Platform (Kafka, Schema Registry, ksqlDB, Connect)
- ✅ Hybrid deployments (K8s + on-prem)
- ✅ Rolling upgrades with zero downtime
- ✅ Multi-region replication
- ✅ Advanced security (RBAC, encryption)
- ⚠️ Requires Confluent Platform license (paid)
### Installation
```bash
# 1. Add Confluent Helm repository
helm repo add confluentinc https://packages.confluent.io/helm
helm repo update
# 2. Create namespace
kubectl create namespace confluent
# 3. Install Confluent Operator
helm install confluent-operator confluentinc/confluent-for-kubernetes \
--namespace confluent \
--version 0.921.11
# 4. Verify
kubectl get pods -n confluent
```
### Deploy Kafka Cluster
```yaml
# kafka-cluster-confluent.yaml
apiVersion: platform.confluent.io/v1beta1
kind: Kafka
metadata:
name: kafka
namespace: confluent
spec:
replicas: 3
image:
application: confluentinc/cp-server:7.6.0
init: confluentinc/confluent-init-container:2.7.0
dataVolumeCapacity: 100Gi
storageClass:
name: fast-ssd
metricReporter:
enabled: true
listeners:
internal:
authentication:
type: plain
tls:
enabled: true
external:
authentication:
type: plain
tls:
enabled: true
dependencies:
zookeeper:
endpoint: zookeeper.confluent.svc.cluster.local:2181
podTemplate:
resources:
requests:
memory: 4Gi
cpu: 2
limits:
memory: 8Gi
cpu: 4
```
```bash
# Apply Kafka cluster
kubectl apply -f kafka-cluster-confluent.yaml
# Wait for cluster
kubectl wait kafka/kafka --for=condition=Ready --timeout=600s -n confluent
```
## Deployment Approach 3: Bitnami Helm Chart (Dev/Staging)
**Bitnami Helm Chart** is simple but less suitable for production.
### Installation
```bash
# 1. Add Bitnami repository
helm repo add bitnami https://charts.bitnami.com/bitnami
helm repo update
# 2. Install Kafka (KRaft mode)
helm install kafka bitnami/kafka \
--namespace kafka \
--create-namespace \
--set kraft.enabled=true \
--set controller.replicaCount=3 \
--set broker.replicaCount=3 \
--set persistence.size=100Gi \
--set persistence.storageClass=fast-ssd \
--set metrics.kafka.enabled=true \
--set metrics.jmx.enabled=true
# 3. Get bootstrap servers
export KAFKA_BOOTSTRAP=$(kubectl get svc kafka -n kafka -o jsonpath='{.status.loadBalancer.ingress[0].hostname}'):9092
```
**Limitations**:
- ⚠️ Less production-ready than Strimzi/Confluent
- ⚠️ Limited declarative topic/user management
- ⚠️ Fewer advanced features (no MirrorMaker 2, limited RBAC)
## Production Best Practices
### 1. Storage Configuration
**Use SSD-backed storage classes** for Kafka logs:
```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
name: fast-ssd
provisioner: ebs.csi.aws.com # AWS EBS CSI driver (pd.csi.storage.gke.io on GKE)
parameters:
  type: gp3          # AWS EBS GP3 (or io2 for extreme performance)
  iops: "3000"       # gp3 takes absolute iops/throughput, not iopsPerGB
  throughput: "125"
allowVolumeExpansion: true
volumeBindingMode: WaitForFirstConsumer
```
**Kafka storage requirements**:
- **Min IOPS**: 3000+ per broker
- **Min Throughput**: 125 MB/s per broker
- **Persistent**: Use `deleteClaim: false` (keep volumes when the Kafka cluster resource is deleted)
### 2. Resource Limits
```yaml
resources:
requests:
memory: 4Gi
cpu: "2"
limits:
memory: 8Gi
cpu: "4"
jvmOptions:
-Xms: 2048m # Initial heap (50% of memory request)
-Xmx: 4096m # Max heap (50% of memory limit, leave room for OS cache)
```
**Sizing guidelines**:
- **Small (dev)**: 2 CPU, 4Gi memory
- **Medium (staging)**: 4 CPU, 8Gi memory
- **Large (production)**: 8 CPU, 16Gi memory
### 3. Pod Disruption Budgets
Ensure high availability during K8s upgrades:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: kafka-pdb
namespace: kafka
spec:
maxUnavailable: 1
selector:
matchLabels:
app.kubernetes.io/name: kafka
```
### 4. Affinity Rules
**Spread brokers across availability zones**:
```yaml
spec:
kafka:
template:
pod:
affinity:
podAntiAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
- labelSelector:
matchExpressions:
- key: strimzi.io/name
operator: In
values:
- my-kafka-cluster-kafka
topologyKey: topology.kubernetes.io/zone
```
### 5. Network Policies
**Restrict access to Kafka brokers**:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
name: kafka-network-policy
namespace: kafka
spec:
podSelector:
matchLabels:
strimzi.io/name: my-kafka-cluster-kafka
policyTypes:
- Ingress
ingress:
- from:
- podSelector:
matchLabels:
app: my-producer
- podSelector:
matchLabels:
app: my-consumer
ports:
- protocol: TCP
port: 9092
- protocol: TCP
port: 9093
```
## Monitoring Integration
### Prometheus + Grafana Setup
Strimzi provides built-in Prometheus metrics exporter:
```yaml
# kafka-metrics-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
name: kafka-metrics
namespace: kafka
data:
kafka-metrics-config.yml: |
# Use JMX Exporter config from:
# plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml
lowercaseOutputName: true
lowercaseOutputLabelNames: true
whitelistObjectNames:
- "kafka.server:type=BrokerTopicMetrics,name=*"
# ... (copy from kafka-jmx-exporter.yml)
```
```bash
# Apply metrics config
kubectl apply -f kafka-metrics-configmap.yaml
# Install Prometheus Operator (if not already installed)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace
# Create PodMonitor for Kafka
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: kafka-metrics
namespace: kafka
spec:
selector:
matchLabels:
strimzi.io/kind: Kafka
podMetricsEndpoints:
- port: tcp-prometheus
interval: 30s
EOF
# Access Grafana dashboards (from kafka-observability skill)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Open: http://localhost:3000
# Dashboards: Kafka Cluster Overview, Broker Metrics, Consumer Lag, Topic Metrics, JVM Metrics
```
## Troubleshooting
### "Pods stuck in Pending state"
**Cause**: Insufficient resources or storage class not found
**Fix**:
```bash
# Check events
kubectl describe pod kafka-my-kafka-cluster-0 -n kafka
# Check storage class exists
kubectl get storageclass
# If missing, create fast-ssd storage class (see Production Best Practices above)
```
### "Kafka broker not ready after 10 minutes"
**Cause**: Slow storage provisioning or resource limits too low
**Fix**:
```bash
# Check broker logs
kubectl logs kafka-my-kafka-cluster-0 -n kafka
# Common issues:
# 1. Low IOPS on storage → Use GP3 or better
# 2. Low memory → Increase resources.requests.memory
# 3. KRaft quorum not formed → Check all brokers are running
```
### "Cannot connect to Kafka from outside K8s"
**Cause**: External listener not configured
**Fix**:
```yaml
# Add external listener (Strimzi)
spec:
kafka:
listeners:
- name: external
port: 9094
type: loadbalancer
tls: true
authentication:
type: tls
# Get external bootstrap server
kubectl get kafka my-kafka-cluster -n kafka -o jsonpath='{.status.listeners[?(@.name=="external")].bootstrapServers}'
```
## Scaling Operations
### Horizontal Scaling (Add Brokers)
```bash
# Strimzi: Update KafkaNodePool replicas
kubectl patch kafkanodepool kafka-pool -n kafka --type='json' \
-p='[{"op": "replace", "path": "/spec/replicas", "value": 5}]'
# Confluent: Update Kafka CR
kubectl patch kafka kafka -n confluent --type='json' \
-p='[{"op": "replace", "path": "/spec/replicas", "value": 5}]'
# Wait for new brokers
kubectl rollout status statefulset/kafka-my-kafka-cluster-kafka -n kafka
```
### Vertical Scaling (Change Resources)
```bash
# Update resources in Kafka CR
kubectl patch kafka my-kafka-cluster -n kafka --type='json' \
-p='[
{"op": "replace", "path": "/spec/kafka/resources/requests/memory", "value": "8Gi"},
{"op": "replace", "path": "/spec/kafka/resources/requests/cpu", "value": "4"}
]'
# Rolling restart will happen automatically
```
## Integration with Other Skills
- **kafka-iac-deployment**: Alternative to K8s (use Terraform for cloud-managed Kafka)
- **kafka-observability**: Set up Prometheus + Grafana dashboards for K8s Kafka
- **kafka-architecture**: Cluster sizing and partitioning strategy
- **kafka-cli-tools**: Test K8s Kafka cluster with kcat
## Quick Reference Commands
```bash
# Strimzi
kubectl get kafka -n kafka # List Kafka clusters
kubectl get kafkatopics -n kafka # List topics
kubectl get kafkausers -n kafka # List users
kubectl logs kafka-my-kafka-cluster-0 -n kafka # Check broker logs
# Confluent
kubectl get kafka -n confluent # List Kafka clusters
kubectl get schemaregistry -n confluent # List Schema Registry
kubectl get ksqldb -n confluent # List ksqlDB
# Port-forward for testing
kubectl port-forward -n kafka svc/my-kafka-cluster-kafka-bootstrap 9092:9092
```
---
**Next Steps After K8s Deployment**:
1. Use **kafka-observability** skill to verify Prometheus metrics and Grafana dashboards
2. Use **kafka-cli-tools** skill to test cluster with kcat
3. Deploy your producer/consumer applications to K8s
4. Set up GitOps for declarative topic/user management (ArgoCD, Flux)

View File

@@ -0,0 +1,290 @@
---
name: kafka-mcp-integration
description: MCP server integration for Kafka operations. Auto-activates on keywords kafka mcp, mcp server, mcp configure, mcp setup, kanapuli, tuannvm, confluent mcp, kafka integration. Provides configuration examples and connection guidance for all 4 MCP servers.
---
# Kafka MCP Server Integration
Expert knowledge for integrating SpecWeave with Kafka MCP (Model Context Protocol) servers. Supports 4 MCP server implementations with auto-detection and configuration guidance.
---
> **Code-First Recommendation**: For most Kafka automation tasks, [writing code is better than MCP](https://www.anthropic.com/engineering/code-execution-with-mcp) (98% token reduction). Use **kafkajs** or **kafka-node** directly:
>
> ```typescript
> import { Kafka } from 'kafkajs';
> const kafka = new Kafka({ brokers: ['localhost:9092'] });
> const producer = kafka.producer();
> await producer.connect();
> await producer.send({ topic: 'events', messages: [{ value: 'Hello' }] });
> ```
>
> **When MCP IS useful**: Quick interactive debugging, topic exploration, Claude Desktop integration.
>
> **When to use code instead**: CI/CD pipelines, test automation, production scripts, anything that should be committed and reusable.
---
## Supported MCP Servers
### 1. kanapuli/mcp-kafka (Node.js)
**Installation**:
```bash
npm install -g mcp-kafka
```
**Capabilities**:
- Authentication: SASL_PLAINTEXT, PLAINTEXT
- Operations: produce, consume, list-topics, describe-topic, get-offsets
- Best for: Basic Kafka operations, quick prototyping
**Configuration Example**:
```json
{
"mcpServers": {
"kafka": {
"command": "npx",
"args": ["mcp-kafka"],
"env": {
"KAFKA_BROKERS": "localhost:9092",
"KAFKA_SASL_MECHANISM": "plain",
"KAFKA_SASL_USERNAME": "user",
"KAFKA_SASL_PASSWORD": "password"
}
}
}
}
```
### 2. tuannvm/kafka-mcp-server (Go)
**Installation**:
```bash
go install github.com/tuannvm/kafka-mcp-server@latest
```
**Capabilities**:
- Authentication: SASL_SCRAM_SHA_256, SASL_SCRAM_SHA_512, SASL_SSL, PLAINTEXT
- Operations: All CRUD operations, consumer group management, offset management
- Best for: Production use, advanced SASL authentication
**Configuration Example**:
```json
{
"mcpServers": {
"kafka": {
"command": "kafka-mcp-server",
"args": [
"--brokers", "localhost:9092",
"--sasl-mechanism", "SCRAM-SHA-256",
"--sasl-username", "admin",
"--sasl-password", "admin-secret"
]
}
}
}
```
### 3. Joel-hanson/kafka-mcp-server (Python)
**Installation**:
```bash
pip install kafka-mcp-server
```
**Capabilities**:
- Authentication: SASL_PLAINTEXT, PLAINTEXT, SSL
- Operations: produce, consume, list-topics, describe-topic
- Best for: Claude Desktop integration, Python ecosystem
**Configuration Example**:
```json
{
"mcpServers": {
"kafka": {
"command": "python",
"args": ["-m", "kafka_mcp_server"],
"env": {
"KAFKA_BOOTSTRAP_SERVERS": "localhost:9092"
}
}
}
}
```
### 4. Confluent Official MCP (Enterprise)
**Installation**:
```bash
confluent plugin install mcp-server
```
**Capabilities**:
- Authentication: OAuth, SASL_SCRAM, API Keys
- Operations: All Kafka operations, Schema Registry, ksqlDB, Flink SQL
- Advanced: Natural language interface, AI-powered query generation
- Best for: Confluent Cloud, enterprise deployments
**Configuration Example**:
```json
{
"mcpServers": {
"kafka": {
"command": "confluent",
"args": ["mcp", "start"],
"env": {
"CONFLUENT_CLOUD_API_KEY": "your-api-key",
"CONFLUENT_CLOUD_API_SECRET": "your-api-secret"
}
}
}
}
```
## Auto-Detection
SpecWeave can auto-detect installed MCP servers:
```bash
/specweave-kafka:mcp-configure
```
This command:
1. Scans for installed MCP servers (npm, go, pip, confluent CLI)
2. Checks which servers are currently running
3. Ranks servers by capabilities (Confluent > tuannvm > kanapuli > Joel-hanson)
4. Generates recommended configuration
5. Tests connection
## Quick Start
### Option 1: Auto-Configure (Recommended)
```bash
/specweave-kafka:mcp-configure
```
Interactive wizard guides you through:
- MCP server selection (or auto-detect)
- Broker URL configuration
- Authentication setup
- Connection testing
### Option 2: Manual Configuration
1. **Install preferred MCP server** (see installation commands above)
2. **Create `.mcp.json` configuration**:
```json
{
"serverType": "tuannvm",
"brokerUrls": ["localhost:9092"],
"authentication": {
"mechanism": "SASL/SCRAM-SHA-256",
"username": "admin",
"password": "admin-secret"
}
}
```
3. **Test connection**:
```bash
# Via MCP server CLI
kafka-mcp-server test-connection
# Or via SpecWeave
node -e "import('./dist/lib/mcp/detector.js').then(async ({ MCPServerDetector }) => {
const detector = new MCPServerDetector();
const result = await detector.detectAll();
console.log(JSON.stringify(result, null, 2));
});"
```
## MCP Server Comparison
| Feature | kanapuli | tuannvm | Joel-hanson | Confluent |
|---------|----------|---------|-------------|-----------|
| **Language** | Node.js | Go | Python | Official CLI |
| **SASL_PLAINTEXT** | ✅ | ✅ | ✅ | ✅ |
| **SCRAM-SHA-256** | ❌ | ✅ | ❌ | ✅ |
| **SCRAM-SHA-512** | ❌ | ✅ | ❌ | ✅ |
| **mTLS/SSL** | ❌ | ✅ | ✅ | ✅ |
| **OAuth** | ❌ | ❌ | ❌ | ✅ |
| **Consumer Groups** | ❌ | ✅ | ❌ | ✅ |
| **Offset Mgmt** | ❌ | ✅ | ❌ | ✅ |
| **Schema Registry** | ❌ | ❌ | ❌ | ✅ |
| **ksqlDB** | ❌ | ❌ | ❌ | ✅ |
| **Flink SQL** | ❌ | ❌ | ❌ | ✅ |
| **AI/NL Interface** | ❌ | ❌ | ❌ | ✅ |
| **Best For** | Prototyping | Production | Desktop | Enterprise |
## Troubleshooting
### MCP Server Not Detected
```bash
# Check if MCP server installed
npm list -g mcp-kafka # kanapuli
which kafka-mcp-server # tuannvm
pip show kafka-mcp-server # Joel-hanson
confluent version # Confluent
```
### Connection Refused
- Verify Kafka broker is running: `kcat -L -b localhost:9092`
- Check firewall rules
- Validate broker URL (correct host:port)
### Authentication Failed
- Double-check credentials (username, password, API keys)
- Verify SASL mechanism matches broker configuration
- Check broker logs for authentication errors
### Operations Not Working
- Ensure MCP server supports the operation (see comparison table)
- Check broker ACLs (permissions for the authenticated user)
- Verify topic exists: `/specweave-kafka:mcp-configure list-topics`
## Operations via MCP
Once configured, you can perform Kafka operations via MCP:
```typescript
import { MCPServerDetector } from './lib/mcp/detector';
const detector = new MCPServerDetector();
const result = await detector.detectAll();
// Use recommended server
if (result.recommended) {
console.log(`Using ${result.recommended} MCP server`);
console.log(`Reason: ${result.rankingReason}`);
}
```
## Security Best Practices
1. **Never commit credentials** - Use environment variables or secrets manager
2. **Use strongest auth** - Prefer SCRAM-SHA-512 > SCRAM-SHA-256 > PLAINTEXT
3. **Enable TLS/SSL** - Encrypt communication with broker
4. **Rotate credentials** - Regularly update passwords and API keys
5. **Least privilege** - Grant only necessary ACLs to MCP server user
## Related Commands
- `/specweave-kafka:mcp-configure` - Interactive MCP server setup
- `/specweave-kafka:dev-env start` - Start local Kafka for testing
- `/specweave-kafka:deploy` - Deploy production Kafka cluster
## External Links
- [kanapuli/mcp-kafka](https://github.com/kanapuli/mcp-kafka)
- [tuannvm/kafka-mcp-server](https://github.com/tuannvm/kafka-mcp-server)
- [Joel-hanson/kafka-mcp-server](https://github.com/Joel-hanson/kafka-mcp-server)
- [Confluent MCP Documentation](https://docs.confluent.io/platform/current/mcp/)
- [MCP Protocol Specification](https://modelcontextprotocol.org/)

View File

@@ -0,0 +1,576 @@
---
name: kafka-observability
description: Kafka monitoring and observability expert. Guides Prometheus + Grafana setup, JMX metrics, alerting rules, and dashboard configuration. Activates for kafka monitoring, prometheus, grafana, kafka metrics, jmx exporter, kafka observability, monitoring setup, kafka dashboards, alerting, kafka performance monitoring, metrics collection.
---
# Kafka Monitoring & Observability
Expert guidance for implementing comprehensive monitoring and observability for Apache Kafka using Prometheus and Grafana.
## When to Use This Skill
I activate when you need help with:
- **Monitoring setup**: "Set up Kafka monitoring", "configure Prometheus for Kafka", "Grafana dashboards for Kafka"
- **Metrics collection**: "Kafka JMX metrics", "export Kafka metrics to Prometheus"
- **Alerting**: "Kafka alerting rules", "alert on under-replicated partitions", "critical Kafka metrics"
- **Troubleshooting**: "Monitor Kafka performance", "track consumer lag", "broker health monitoring"
## What I Know
### Available Monitoring Components
This plugin provides a complete monitoring stack:
#### 1. **Prometheus JMX Exporter Configuration**
- **Location**: `plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml`
- **Purpose**: Export Kafka JMX metrics to Prometheus format
- **Metrics Exported**:
- Broker topic metrics (bytes in/out, messages in, request rate)
- Replica manager (under-replicated partitions, ISR shrinks/expands)
- Controller metrics (active controller, offline partitions, leader elections)
- Request metrics (produce/fetch latency)
- Log metrics (flush rate, flush latency)
- JVM metrics (heap, GC, threads, file descriptors)
#### 2. **Grafana Dashboards** (5 Dashboards)
- **Location**: `plugins/specweave-kafka/monitoring/grafana/dashboards/`
- **Dashboards**:
1. **kafka-cluster-overview.json** - Cluster health and throughput
2. **kafka-broker-metrics.json** - Per-broker performance
3. **kafka-consumer-lag.json** - Consumer lag monitoring
4. **kafka-topic-metrics.json** - Topic-level metrics
5. **kafka-jvm-metrics.json** - JVM health (heap, GC, threads)
#### 3. **Grafana Provisioning**
- **Location**: `plugins/specweave-kafka/monitoring/grafana/provisioning/`
- **Files**:
- `dashboards/kafka.yml` - Dashboard provisioning config
- `datasources/prometheus.yml` - Prometheus datasource config
## Setup Workflow 1: JMX Exporter (Self-Hosted Kafka)
For Kafka running on VMs or bare metal (non-Kubernetes).
### Step 1: Download JMX Prometheus Agent
```bash
# Download JMX Prometheus agent JAR
cd /opt
wget https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.20.0/jmx_prometheus_javaagent-0.20.0.jar
# Copy JMX Exporter config
cp plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml /opt/kafka-jmx-exporter.yml
```
### Step 2: Configure Kafka Broker
Add JMX exporter to Kafka startup script:
```bash
# Edit Kafka startup (e.g., /etc/systemd/system/kafka.service)
[Service]
Environment="KAFKA_OPTS=-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
```
Or add to `kafka-server-start.sh`:
```bash
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent-0.20.0.jar=7071:/opt/kafka-jmx-exporter.yml"
```
### Step 3: Restart Kafka and Verify
```bash
# Restart Kafka broker
sudo systemctl restart kafka
# Verify JMX exporter is running (port 7071)
curl localhost:7071/metrics | grep kafka_server
# Expected output: kafka_server_broker_topic_metrics_bytesin_total{...} 12345
```
### Step 4: Configure Prometheus Scraping
Add Kafka brokers to Prometheus config:
```yaml
# prometheus.yml
scrape_configs:
- job_name: 'kafka'
static_configs:
- targets:
- 'kafka-broker-1:7071'
- 'kafka-broker-2:7071'
- 'kafka-broker-3:7071'
scrape_interval: 30s
```
```bash
# Reload Prometheus
sudo systemctl reload prometheus
# OR send SIGHUP
kill -HUP $(pidof prometheus)
# Verify scraping
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
```
## Setup Workflow 2: Strimzi (Kubernetes)
For Kafka running on Kubernetes with Strimzi Operator.
### Step 1: Create JMX Exporter ConfigMap
```bash
# Create ConfigMap from JMX exporter config
kubectl create configmap kafka-metrics \
--from-file=kafka-metrics-config.yml=plugins/specweave-kafka/monitoring/prometheus/kafka-jmx-exporter.yml \
-n kafka
```
### Step 2: Configure Kafka CR with Metrics
```yaml
# kafka-cluster.yaml (add metricsConfig section)
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
name: my-kafka-cluster
namespace: kafka
spec:
kafka:
version: 3.7.0
replicas: 3
# ... other config ...
metricsConfig:
type: jmxPrometheusExporter
valueFrom:
configMapKeyRef:
name: kafka-metrics
key: kafka-metrics-config.yml
```
```bash
# Apply updated Kafka CR
kubectl apply -f kafka-cluster.yaml
# Verify metrics endpoint (wait for rolling restart)
kubectl exec -it kafka-my-kafka-cluster-0 -n kafka -- curl localhost:9404/metrics | grep kafka_server
```
### Step 3: Install Prometheus Operator (if not installed)
```bash
# Add Prometheus Community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install kube-prometheus-stack (Prometheus + Grafana + Alertmanager)
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
--set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
```
### Step 4: Create PodMonitor for Kafka
```yaml
# kafka-podmonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
name: kafka-metrics
namespace: kafka
labels:
app: strimzi
spec:
selector:
matchLabels:
strimzi.io/kind: Kafka
podMetricsEndpoints:
- port: tcp-prometheus
interval: 30s
```
```bash
# Apply PodMonitor
kubectl apply -f kafka-podmonitor.yaml
# Verify Prometheus is scraping Kafka
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090
# Open: http://localhost:9090/targets
# Should see kafka-metrics/* targets
```
## Setup Workflow 3: Grafana Dashboards
### Installation (Docker Compose)
If using Docker Compose for local development:
```yaml
# docker-compose.yml (add to existing Kafka setup)
version: '3.8'
services:
# ... Kafka services ...
prometheus:
image: prom/prometheus:v2.48.0
ports:
- "9090:9090"
volumes:
- ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
grafana:
image: grafana/grafana:10.2.0
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=admin
volumes:
- ./monitoring/grafana/provisioning:/etc/grafana/provisioning
- ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
- grafana-data:/var/lib/grafana
volumes:
prometheus-data:
grafana-data:
```
```bash
# Start monitoring stack
docker-compose up -d prometheus grafana
# Access Grafana
# URL: http://localhost:3000
# Username: admin
# Password: admin
```
### Installation (Kubernetes)
Dashboards are auto-provisioned if using kube-prometheus-stack:
```bash
# Create and label a ConfigMap for each dashboard
# (kubectl does not expand wildcards in resource names, so label inside the loop)
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
  name=$(basename "$dashboard" .json)
  kubectl create configmap "kafka-dashboard-$name" \
    --from-file="$dashboard" \
    -n monitoring \
    --dry-run=client -o yaml | kubectl apply -f -
  # Label for the Grafana sidecar's dashboard auto-discovery
  kubectl label configmap "kafka-dashboard-$name" -n monitoring grafana_dashboard=1 --overwrite
done
# Grafana will auto-import the dashboards (allow 30-60 seconds)
# Access Grafana
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# URL: http://localhost:3000
# Username: admin
# Password: prom-operator (default kube-prometheus-stack password)
```
### Manual Dashboard Import
If auto-provisioning doesn't work, import manually through the Grafana UI (Dashboards → Import, then upload the JSON files from `plugins/specweave-kafka/monitoring/grafana/dashboards/`), or push them via the Grafana HTTP API:
```bash
# Import every dashboard via the API (default admin:admin credentials)
# The endpoint expects the dashboard wrapped in {"dashboard": ..., "overwrite": true}
for dashboard in plugins/specweave-kafka/monitoring/grafana/dashboards/*.json; do
  jq -n --slurpfile d "$dashboard" '{dashboard: ($d[0] | .id = null), overwrite: true}' \
    | curl -X POST http://admin:admin@localhost:3000/api/dashboards/db \
        -H "Content-Type: application/json" \
        -d @-
done
```
## Dashboard Overview
### 1. **Kafka Cluster Overview** (`kafka-cluster-overview.json`)
**Purpose**: High-level cluster health
**Key Metrics**:
- Active Controller Count (should be exactly 1)
- Under-Replicated Partitions (should be 0) ⚠️ CRITICAL
- Offline Partitions Count (should be 0) ⚠️ CRITICAL
- Unclean Leader Elections (should be 0)
- Cluster Throughput (bytes in/out per second)
- Request Rate (produce, fetch requests per second)
- ISR Changes (shrinks/expands)
- Leader Election Rate
**Use When**: Checking overall cluster health
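When Grafana is not handy, the same three health signals can be spot-checked directly against the Prometheus API; the metric names below assume the JMX exporter config shipped with this plugin:
```bash
# Exactly one broker should report active_controller_count=1; the other two metrics should be 0 everywhere
for q in \
  kafka_controller_active_controller_count \
  kafka_server_replica_manager_under_replicated_partitions \
  kafka_controller_offline_partitions_count; do
  curl -sG 'http://localhost:9090/api/v1/query' --data-urlencode "query=${q}" \
    | jq --arg q "$q" -c '{metric: $q, result: [.data.result[] | {instance: .metric.instance, value: .value[1]}]}'
done
```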
### 2. **Kafka Broker Metrics** (`kafka-broker-metrics.json`)
**Purpose**: Per-broker performance
**Key Metrics**:
- Broker CPU Usage (% utilization)
- Broker Heap Memory Usage
- Broker Network Throughput (bytes in/out)
- Request Handler Idle Percentage (low = CPU saturation)
- File Descriptors (open vs max)
- Log Flush Latency (p50, p99)
- JVM GC Collection Count/Time
**Use When**: Investigating broker performance issues
### 3. **Kafka Consumer Lag** (`kafka-consumer-lag.json`)
**Purpose**: Consumer lag monitoring
**Key Metrics**:
- Consumer Lag per Topic/Partition
- Total Lag per Consumer Group
- Offset Commit Rate
- Current Consumer Offset
- Log End Offset (producer offset)
- Consumer Group Members
**Use When**: Troubleshooting slow consumers or lag spikes
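For a quick look outside Grafana, total lag per group can be pulled straight from Prometheus; these metrics come from Kafka Exporter (see Troubleshooting below if they are missing):
```bash
# Total lag per consumer group
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum by (consumergroup) (kafka_consumergroup_lag)' \
  | jq '.data.result[] | {group: .metric.consumergroup, lag: .value[1]}'
```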
### 4. **Kafka Topic Metrics** (`kafka-topic-metrics.json`)
**Purpose**: Topic-level metrics
**Key Metrics**:
- Messages Produced per Topic
- Bytes per Topic (in/out)
- Partition Count per Topic
- Replication Factor
- In-Sync Replicas
- Log Size per Partition
- Current Offset per Partition
- Partition Leader Distribution
**Use When**: Analyzing topic throughput and hotspots
### 5. **Kafka JVM Metrics** (`kafka-jvm-metrics.json`)
**Purpose**: JVM health monitoring
**Key Metrics**:
- Heap Memory Usage (used vs max)
- Heap Utilization Percentage
- GC Collection Rate (collections/sec)
- GC Collection Time (ms/sec)
- JVM Thread Count
- Heap Memory by Pool (young gen, old gen, survivor)
- Off-Heap Memory Usage (metaspace, code cache)
- GC Pause Time Percentiles (p50, p95, p99)
**Use When**: Investigating memory leaks or GC pauses
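Two quick checks mirror the most important panels here; the metric names match the alert rules below, so adjust them if your exporter config names them differently:
```bash
# Heap utilization ratio per broker (values approaching 1.0 indicate OOM risk)
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=jvm_memory_heap_used_bytes{job="kafka"} / jvm_memory_heap_max_bytes{job="kafka"}' \
  | jq '.data.result[] | {instance: .metric.instance, heap_ratio: .value[1]}'

# GC time in ms per second over the last 5 minutes
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(jvm_gc_collection_time_ms_total{job="kafka"}[5m])' \
  | jq '.data.result[] | {instance: .metric.instance, gc_ms_per_sec: .value[1]}'
```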
## Critical Alerts Configuration
Create Prometheus alerting rules for critical Kafka metrics:
```yaml
# kafka-alerts.yml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kafka-alerts
  namespace: monitoring
  labels:
    release: prometheus  # required so the kube-prometheus-stack rule selector picks this rule up
spec:
groups:
- name: kafka.rules
interval: 30s
rules:
# CRITICAL: Under-Replicated Partitions
- alert: KafkaUnderReplicatedPartitions
expr: sum(kafka_server_replica_manager_under_replicated_partitions) > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Kafka has under-replicated partitions"
description: "{{ $value }} partitions are under-replicated. Data loss risk!"
# CRITICAL: Offline Partitions
- alert: KafkaOfflinePartitions
expr: kafka_controller_offline_partitions_count > 0
for: 1m
labels:
severity: critical
annotations:
summary: "Kafka has offline partitions"
description: "{{ $value }} partitions are offline. Service degradation!"
# CRITICAL: No Active Controller
- alert: KafkaNoActiveController
expr: kafka_controller_active_controller_count == 0
for: 1m
labels:
severity: critical
annotations:
summary: "No active Kafka controller"
description: "Cluster has no active controller. Cannot perform administrative operations!"
      # WARNING: High Consumer Lag (metric comes from Kafka Exporter; see Troubleshooting)
- alert: KafkaConsumerLagHigh
expr: sum by (consumergroup) (kafka_consumergroup_lag) > 10000
for: 10m
labels:
severity: warning
annotations:
summary: "Consumer group {{ $labels.consumergroup }} has high lag"
description: "Lag is {{ $value }} messages. Consumers may be slow."
# WARNING: High CPU Usage
- alert: KafkaBrokerHighCPU
expr: os_process_cpu_load{job="kafka"} > 0.8
for: 5m
labels:
severity: warning
annotations:
summary: "Broker {{ $labels.instance }} has high CPU usage"
description: "CPU usage is {{ $value | humanizePercentage }}. Consider scaling."
# WARNING: Low Heap Memory
- alert: KafkaBrokerLowHeapMemory
expr: jvm_memory_heap_used_bytes{job="kafka"} / jvm_memory_heap_max_bytes{job="kafka"} > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "Broker {{ $labels.instance }} has low heap memory"
description: "Heap usage is {{ $value | humanizePercentage }}. Risk of OOM!"
# WARNING: High GC Time
- alert: KafkaBrokerHighGCTime
expr: rate(jvm_gc_collection_time_ms_total{job="kafka"}[5m]) > 500
for: 5m
labels:
severity: warning
annotations:
summary: "Broker {{ $labels.instance }} spending too much time in GC"
description: "GC time is {{ $value }}ms/sec. Application pauses likely."
```
```bash
# Apply alerts (Kubernetes)
kubectl apply -f kafka-alerts.yml
# Verify alerts loaded
kubectl get prometheusrules -n monitoring
```
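`kubectl get prometheusrules` only confirms the object exists. To confirm Prometheus loaded the group and to see current alert states, query its API (assumes a port-forward to Prometheus on localhost:9090 as shown earlier):
```bash
# List the kafka.rules alerts and their states (inactive / pending / firing)
curl -s http://localhost:9090/api/v1/rules \
  | jq '.data.groups[] | select(.name=="kafka.rules") | .rules[] | {alert: .name, state: .state}'
```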
## Troubleshooting
### "Prometheus not scraping Kafka metrics"
**Symptoms**: No Kafka metrics in Prometheus
**Fix**:
```bash
# 1. Verify JMX exporter is running
curl http://kafka-broker:7071/metrics
# 2. Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
# 3. Check Prometheus logs
kubectl logs -n monitoring prometheus-kube-prometheus-prometheus-0
# Common issues:
# - Firewall blocking port 7071
# - Incorrect scrape config
# - Kafka broker not running
```
### "Grafana dashboards not loading"
**Symptoms**: Dashboards show "No data"
**Fix**:
```bash
# 1. Verify Prometheus datasource
# Grafana UI → Configuration → Data Sources → Prometheus → Test
# 2. Check if Kafka metrics exist in Prometheus
# Prometheus UI → Graph → Enter: kafka_server_broker_topic_metrics_bytesin_total
# 3. Verify dashboard queries match your Prometheus job name
# Dashboard panels use job="kafka" by default
# If your job name is different, update dashboard JSON
```
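If the mismatch is only the job name, one option is to rewrite the label in the dashboard JSON before importing; this is an illustrative sed rewrite where `my-kafka-job` is a placeholder for your actual job name (the `.bak` files keep backups):
```bash
# PromQL expressions are stored as JSON strings, hence the escaped quotes
sed -i.bak 's/job=\\"kafka\\"/job=\\"my-kafka-job\\"/g' \
  plugins/specweave-kafka/monitoring/grafana/dashboards/*.json
```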
### "Consumer lag metrics missing"
**Symptoms**: Consumer lag dashboard empty
**Fix**:
Consumer lag metrics require **Kafka Exporter** (separate from JMX Exporter):
```bash
# Install Kafka Exporter (Kubernetes); point kafkaServer at your bootstrap address
# (for the Strimzi cluster above this would be my-kafka-cluster-kafka-bootstrap.kafka:9092)
helm install kafka-exporter prometheus-community/prometheus-kafka-exporter \
  --namespace monitoring \
  --set kafkaServer={kafka-bootstrap:9092}
# Or run as Docker container
docker run -d -p 9308:9308 \
danielqsj/kafka-exporter \
--kafka.server=kafka:9092 \
--web.listen-address=:9308
```
Then add the exporter to the Prometheus scrape config (`prometheus.yml`):
```yaml
scrape_configs:
  - job_name: 'kafka-exporter'
    static_configs:
      - targets: ['kafka-exporter:9308']
```
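Before digging further, confirm the exporter itself is emitting lag metrics (localhost:9308 matches the `docker run` above; adjust the host for the Helm deployment):
```bash
# Should print kafka_consumergroup_lag{...} samples if the exporter can reach the cluster
curl -s http://localhost:9308/metrics | grep -E '^kafka_consumergroup_lag' | head
```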
## Integration with Other Skills
- **kafka-iac-deployment**: Set up monitoring during Terraform deployment
- **kafka-kubernetes**: Configure monitoring for Strimzi Kafka on K8s
- **kafka-architecture**: Use cluster sizing metrics to validate capacity planning
- **kafka-cli-tools**: Use kcat to generate test traffic and verify metrics
## Quick Reference Commands
```bash
# Check JMX exporter metrics
curl http://localhost:7071/metrics | grep -E "(kafka_server|kafka_controller)"
# Prometheus query examples
curl -g 'http://localhost:9090/api/v1/query?query=kafka_server_replica_manager_under_replicated_partitions'
# Grafana dashboard export
curl http://admin:admin@localhost:3000/api/dashboards/uid/kafka-cluster-overview | jq .dashboard > backup.json
# Reload Prometheus config
kill -HUP $(pidof prometheus)
# Check Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.job=="kafka")'
```
---
**Next Steps After Monitoring Setup**:
1. Review all 5 Grafana dashboards to familiarize yourself with metrics
2. Set up alert routing to Slack, PagerDuty, or email (a minimal Alertmanager sketch is shown below)
3. Create runbooks for critical alerts (under-replicated partitions, offline partitions, no controller)
4. Monitor for 7 days to establish baseline metrics
5. Tune JVM settings based on GC metrics
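For step 2, a minimal Alertmanager routing sketch; the Slack webhook URL and channel are placeholders, and with kube-prometheus-stack this configuration usually goes into the chart's `alertmanager.config` values rather than a file on disk:
```bash
cat > alertmanager.yml <<'EOF'
route:
  receiver: kafka-team-slack
  group_by: ['alertname', 'severity']
receivers:
  - name: kafka-team-slack
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX/YYY/ZZZ'  # placeholder webhook
        channel: '#kafka-alerts'
        send_resolved: true
EOF
```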