---
name: kafka-iac-deployment
description: Infrastructure as Code (IaC) deployment expert for Apache Kafka. Guides Terraform deployments across Apache Kafka (KRaft mode), AWS MSK, Azure Event Hubs. Activates for terraform, iac, infrastructure as code, deploy kafka, provision kafka, aws msk, azure event hubs, kafka infrastructure, terraform modules, cloud deployment, kafka deployment automation.
---
# Kafka Infrastructure as Code (IaC) Deployment
Expert guidance for deploying Apache Kafka using Terraform across multiple platforms.
## When to Use This Skill
I activate when you need help with:
- **Terraform deployments**: "Deploy Kafka with Terraform", "provision Kafka cluster"
- **Platform selection**: "Should I use AWS MSK or self-hosted Kafka?", "compare Kafka platforms"
- **Infrastructure planning**: "How to size Kafka infrastructure", "Kafka on AWS vs Azure"
- **IaC automation**: "Automate Kafka deployment", "CI/CD for Kafka infrastructure"
## What I Know
### Available Terraform Modules
This plugin provides 3 production-ready Terraform modules:
#### 1. **Apache Kafka (Self-Hosted, KRaft Mode)**
- **Location**: `plugins/specweave-kafka/terraform/apache-kafka/`
- **Platform**: AWS EC2 (can adapt to other clouds)
- **Architecture**: KRaft mode (no ZooKeeper dependency)
- **Features**:
  - Multi-broker cluster (3-5 brokers recommended)
  - Security groups and SASL_SSL listeners
  - IAM roles for S3 backups
  - CloudWatch metrics and alarms
  - Auto-scaling group support
  - Custom VPC and subnet configuration
- **Use When**:
  - ✅ You need full control over Kafka configuration
  - ✅ Running Kafka 3.6+ (KRaft mode)
  - ✅ Want to avoid ZooKeeper operational overhead
  - ✅ Multi-cloud or hybrid deployments
- **Variables**:
```hcl
module "kafka" {
  source            = "../../plugins/specweave-kafka/terraform/apache-kafka"
  environment       = "production"
  broker_count      = 3
  kafka_version     = "3.7.0"
  instance_type     = "m5.xlarge"
  vpc_id            = var.vpc_id
  subnet_ids        = var.subnet_ids
  domain            = "example.com"
  enable_s3_backups = true
  enable_monitoring = true
}
```
#### 2. **AWS MSK (Managed Streaming for Kafka)**
- **Location**: `plugins/specweave-kafka/terraform/aws-msk/`
- **Platform**: AWS Managed Service
- **Features**:
  - Fully managed Kafka service
  - IAM authentication + SASL/SCRAM
  - Storage auto-scaling
  - Built-in monitoring (CloudWatch)
  - Multi-AZ deployment
  - Encryption in transit and at rest
- **Use When**:
  - ✅ You want AWS to manage Kafka operations
  - ✅ Need tight AWS integration (IAM, KMS, CloudWatch)
  - ✅ Prefer operational simplicity over cost
  - ✅ Running in AWS VPC
- **Variables**:
```hcl
module "msk" {
  source                    = "../../plugins/specweave-kafka/terraform/aws-msk"
  cluster_name              = "my-kafka-cluster"
  kafka_version             = "3.6.0"
  number_of_broker_nodes    = 3
  broker_node_instance_type = "kafka.m5.large"
  vpc_id                    = var.vpc_id
  subnet_ids                = var.private_subnet_ids
  enable_iam_auth           = true
  enable_scram_auth         = false
  enable_auto_scaling       = true
}
```
#### 3. **Azure Event Hubs (Kafka API)**
- **Location**: `plugins/specweave-kafka/terraform/azure-event-hubs/`
- **Platform**: Azure Managed Service
- **Features**:
  - Kafka 1.0+ protocol support (Standard tier and above)
  - Auto-inflate (elastic scaling of throughput units, Standard tier)
  - Premium SKU for high throughput
  - Zone redundancy
  - Private endpoints (VNet integration)
  - Event capture to Azure Storage
- **Use When**:
  - ✅ Running on Azure cloud
  - ✅ Need Kafka-compatible API without Kafka operations
  - ✅ Want elastic scaling (auto-inflate)
  - ✅ Integrating with Azure ecosystem
- **Variables**:
```hcl
module "event_hubs" {
  source                   = "../../plugins/specweave-kafka/terraform/azure-event-hubs"
  namespace_name           = "my-event-hub-ns"
  resource_group_name      = var.resource_group_name
  location                 = "eastus"
  # Auto-inflate and maximum_throughput_units apply to the Standard tier only
  sku                      = "Standard"
  capacity                 = 1
  kafka_enabled            = true
  auto_inflate_enabled     = true
  maximum_throughput_units = 20
}
```
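The tier constraints behind these settings can be checked before running Terraform. The following is a minimal sketch, not part of the module: `validate_event_hubs_config` is a hypothetical helper name, and it only encodes two Azure rules (the Basic tier has no Kafka endpoint, and auto-inflate is a Standard-tier feature).

```bash
# Hypothetical pre-flight check (not shipped with the module).
# Usage: validate_event_hubs_config <sku> <auto_inflate_enabled>
validate_event_hubs_config() {
  sku="$1"
  auto_inflate="$2"

  # The Basic tier does not expose the Kafka endpoint at all.
  if [ "$sku" = "Basic" ]; then
    echo "Basic tier does not expose the Kafka endpoint" >&2
    return 1
  fi

  # Auto-inflate scales throughput units, which only exist on Standard.
  if [ "$auto_inflate" = "true" ] && [ "$sku" != "Standard" ]; then
    echo "auto-inflate applies to the Standard tier only" >&2
    return 1
  fi

  echo "ok"
}

validate_event_hubs_config Standard true
```

Running a check like this in CI before `terraform plan` surfaces an invalid SKU combination earlier than the Azure API would.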
## Platform Selection Decision Tree
```
Need Kafka deployment? START HERE:
├─ Running on AWS?
│ ├─ YES → Want managed service?
│ │ ├─ YES → Use AWS MSK module (terraform/aws-msk)
│ │ └─ NO → Use Apache Kafka module (terraform/apache-kafka)
│ └─ NO → Continue...
├─ Running on Azure?
│ ├─ YES → Use Azure Event Hubs module (terraform/azure-event-hubs)
│ └─ NO → Continue...
├─ Multi-cloud or hybrid?
│ └─ YES → Use Apache Kafka module (most portable)
├─ Need maximum control?
│ └─ YES → Use Apache Kafka module
└─ Default → Use Apache Kafka module (self-hosted, KRaft mode)
```
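For scripted environments, the decision tree above can be sketched as a small shell function. This is an illustration only: `choose_kafka_module` and its two arguments are hypothetical names, and the function returns the module directory names used in this document.

```bash
# Hypothetical helper mirroring the decision tree above.
# Usage: choose_kafka_module <cloud> <wants_managed_service>
choose_kafka_module() {
  cloud="$1"    # "aws", "azure", or anything else (multi-cloud, hybrid, on-prem)
  managed="$2"  # "yes" if a managed service is preferred

  case "$cloud" in
    aws)
      if [ "$managed" = "yes" ]; then
        echo "terraform/aws-msk"
      else
        echo "terraform/apache-kafka"
      fi
      ;;
    azure)
      echo "terraform/azure-event-hubs"
      ;;
    *)
      # Multi-cloud, hybrid, maximum control, and the default all land here.
      echo "terraform/apache-kafka"
      ;;
  esac
}

choose_kafka_module aws yes
choose_kafka_module hybrid no
```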
## Deployment Workflows
### Workflow 1: Deploy Self-Hosted Kafka (Apache Kafka Module)
**Scenario**: You want full control over Kafka on AWS EC2
```bash
# 1. Create Terraform configuration
cat > main.tf <<EOF
module "kafka_cluster" {
  source            = "../../plugins/specweave-kafka/terraform/apache-kafka"
  environment       = "production"
  broker_count      = 3
  kafka_version     = "3.7.0"
  instance_type     = "m5.xlarge"
  vpc_id            = "vpc-12345678"
  subnet_ids        = ["subnet-abc", "subnet-def", "subnet-ghi"]
  domain            = "kafka.example.com"
  enable_s3_backups = true
  enable_monitoring = true
  tags = {
    Project     = "MyApp"
    Environment = "Production"
  }
}
output "broker_endpoints" {
  value = module.kafka_cluster.broker_endpoints
}
EOF
# 2. Initialize Terraform
terraform init
# 3. Plan deployment (review what will be created)
terraform plan
# 4. Apply (create infrastructure)
terraform apply
# 5. Get broker endpoints
terraform output broker_endpoints
# Output: ["kafka-0.kafka.example.com:9093", "kafka-1.kafka.example.com:9093", ...]
```
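Kafka clients usually expect a comma-separated `bootstrap.servers` string rather than a list. A minimal sketch of that conversion follows, assuming `broker_endpoints` is a flat list of `host:port` strings as in the sample output above; `endpoints_to_bootstrap_servers` is a hypothetical helper name.

```bash
# Hypothetical helper: turns a JSON array of strings read from stdin into a
# comma-separated list by stripping whitespace, quotes, and brackets.
# Assumes a flat array of host:port strings; use a JSON parser such as jq
# for anything more complex.
endpoints_to_bootstrap_servers() {
  tr -d '\n\t "' | sed 's/[][]//g'
}

# With a deployed cluster, feed it the JSON form of the output:
#   terraform output -json broker_endpoints | endpoints_to_bootstrap_servers
printf '%s' '["kafka-0.kafka.example.com:9093", "kafka-1.kafka.example.com:9093"]' \
  | endpoints_to_bootstrap_servers
```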
### Workflow 2: Deploy AWS MSK (Managed Service)
**Scenario**: You want AWS to manage Kafka operations
```bash
# 1. Create Terraform configuration
cat > main.tf <<'EOF'
# Declare the inputs referenced below; supply values via terraform.tfvars or -var
variable "vpc_id" {
  type = string
}
variable "private_subnet_ids" {
  type = list(string)
}
module "msk_cluster" {
  source                    = "../../plugins/specweave-kafka/terraform/aws-msk"
  cluster_name              = "my-msk-cluster"
  kafka_version             = "3.6.0"
  number_of_broker_nodes    = 3
  broker_node_instance_type = "kafka.m5.large"
  vpc_id                    = var.vpc_id
  subnet_ids                = var.private_subnet_ids
  enable_iam_auth           = true
  enable_auto_scaling       = true
  tags = {
    Project = "MyApp"
  }
}
output "bootstrap_brokers" {
  value = module.msk_cluster.bootstrap_brokers_sasl_iam
}
EOF
# 2. Deploy
terraform init && terraform apply
# 3. Configure IAM authentication
# (module outputs IAM policy, attach to your application role)
```
### Workflow 3: Deploy Azure Event Hubs (Kafka API)
**Scenario**: You're on Azure and want Kafka-compatible API
```bash
# 1. Create Terraform configuration
cat > main.tf <<EOF
module "event_hubs" {
  source                   = "../../plugins/specweave-kafka/terraform/azure-event-hubs"
  namespace_name           = "my-kafka-namespace"
  resource_group_name      = "my-resource-group"
  location                 = "eastus"
  # Auto-inflate and maximum_throughput_units apply to the Standard tier only
  sku                      = "Standard"
  capacity                 = 1
  kafka_enabled            = true
  auto_inflate_enabled     = true
  maximum_throughput_units = 20
  # Create hubs (topics) for your use case
  hubs = [
    { name = "user-events", partitions = 12 },
    { name = "order-events", partitions = 6 },
    { name = "payment-events", partitions = 3 }
  ]
}
output "connection_string" {
  value     = module.event_hubs.connection_string
  sensitive = true
}
EOF
# 2. Deploy
terraform init && terraform apply
# 3. Get connection details
terraform output connection_string
```
## Infrastructure Sizing Recommendations
### Small Environment (Dev/Test)
```hcl
# Self-hosted: 1 broker, m5.large
broker_count = 1
instance_type = "m5.large"
# AWS MSK: 1 broker per AZ, kafka.m5.large
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.large"
# Azure Event Hubs: Standard SKU (the Basic tier does not expose the Kafka endpoint)
sku = "Standard"
capacity = 1
```
### Medium Environment (Staging/Production)
```hcl
# Self-hosted: 3 brokers, m5.xlarge
broker_count = 3
instance_type = "m5.xlarge"
# AWS MSK: 3 brokers, kafka.m5.xlarge
number_of_broker_nodes = 3
broker_node_instance_type = "kafka.m5.xlarge"
# Azure Event Hubs: Standard SKU with auto-inflate
sku = "Standard"
capacity = 2
auto_inflate_enabled = true
maximum_throughput_units = 10
```
### Large Environment (High-Throughput Production)
```hcl
# Self-hosted: 5+ brokers, m5.2xlarge or m5.4xlarge
broker_count = 5
instance_type = "m5.2xlarge"
# AWS MSK: 6+ brokers, kafka.m5.2xlarge, auto-scaling
number_of_broker_nodes = 6
broker_node_instance_type = "kafka.m5.2xlarge"
enable_auto_scaling = true
# Azure Event Hubs: Premium SKU with zone redundancy
# (capacity counts processing units; maximum_throughput_units applies to Standard auto-inflate only)
sku = "Premium"
capacity = 4
zone_redundant = true
```
## Best Practices
### Security Best Practices
1. **Always use encryption in transit**
   - Self-hosted: Enable the SASL_SSL listener
   - AWS MSK: Set `encryption_in_transit_client_broker = "TLS"`
   - Azure Event Hubs: TLS is always required (the Kafka endpoint listens on port 9093 with SASL_SSL)
2. **Use IAM authentication (when possible)**
   - AWS MSK: `enable_iam_auth = true`
   - Azure Event Hubs: Managed identities
3. **Network isolation**
   - Deploy in private subnets
   - Use security groups/NSGs restrictively
   - Azure: Enable private endpoints for Premium SKU
### High Availability Best Practices
1. **Multi-AZ deployment**
   - Self-hosted: Distribute brokers across 3+ AZs
   - AWS MSK: Brokers are spread evenly across the AZs of the subnets you supply
   - Azure Event Hubs: Enable `zone_redundant = true` (Premium)
2. **Replication factor = 3**
   - Self-hosted: `default.replication.factor=3`
   - AWS MSK: Configured automatically
   - Azure Event Hubs: N/A (fully managed)
3. **min.insync.replicas = 2**
   - With replication factor 3 and producers using `acks=all`, acknowledged writes survive the loss of 1 broker and topics stay writable
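The relationship between the two settings is simple arithmetic: a partition keeps accepting `acks=all` writes as long as no more than `replication factor - min.insync.replicas` of its replicas are down. Below is a minimal sketch that renders the broker properties and that headroom; `render_durability_properties` is a hypothetical helper, and rejecting a `min.insync.replicas` equal to the replication factor is a choice made for this sketch, not a Kafka rule.

```bash
# Hypothetical helper: prints server.properties lines for the given
# replication factor and min.insync.replicas, plus the failure headroom.
# Usage: render_durability_properties <replication_factor> <min_insync_replicas>
render_durability_properties() {
  rf="$1"
  min_isr="$2"

  # Sketch-level policy: require at least one broker of headroom.
  if [ "$min_isr" -ge "$rf" ]; then
    echo "min.insync.replicas ($min_isr) must be lower than the replication factor ($rf)" >&2
    return 1
  fi

  printf 'default.replication.factor=%s\n' "$rf"
  printf 'min.insync.replicas=%s\n' "$min_isr"
  printf '# replicas that can be down while acks=all writes continue: %s\n' "$(( rf - min_isr ))"
}

render_durability_properties 3 2
```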
### Cost Optimization
1. **Right-size instances**
   - Use the ClusterSizingCalculator utility (in kafka-architecture skill)
   - Start small, scale up based on metrics
2. **Auto-scaling (where available)**
   - AWS MSK: `enable_auto_scaling = true`
   - Azure Event Hubs: `auto_inflate_enabled = true`
3. **Retention policies**
   - Set `log.retention.hours` based on actual needs (default: 168 hours = 7 days)
   - Shorter retention = lower storage costs
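To see how retention drives storage, a rough estimate is ingress rate × retention time × replication factor. The sketch below uses integer shell arithmetic and ignores compression, index files, and free-space headroom; `estimate_storage_gib` is a hypothetical helper and is not the ClusterSizingCalculator mentioned above.

```bash
# Hypothetical helper: rough cluster-wide storage estimate in GiB.
# Usage: estimate_storage_gib <ingress_mib_per_sec> <retention_hours> <replication_factor>
estimate_storage_gib() {
  mib_per_sec="$1"
  retention_hours="$2"
  replication_factor="$3"

  # MiB written per second * seconds retained * number of replicas
  total_mib=$(( mib_per_sec * 3600 * retention_hours * replication_factor ))

  # Integer division rounds down to whole GiB
  echo $(( total_mib / 1024 ))
}

estimate_storage_gib 10 168 3   # 7-day default retention -> 17718
estimate_storage_gib 10 24 3    # 1-day retention         -> 2531
```

In this example, cutting retention from 7 days to 1 day reduces the estimate by a factor of seven.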
## Monitoring Integration
All modules integrate with monitoring:
### Self-Hosted Kafka
- CloudWatch metrics (via JMX Exporter)
- Prometheus + Grafana dashboards (see kafka-observability skill)
- Custom CloudWatch alarms
### AWS MSK
- Built-in CloudWatch metrics
- Enhanced monitoring available
- Integration with CloudWatch Alarms
### Azure Event Hubs
- Built-in Azure Monitor metrics
- Diagnostic logs to Log Analytics
- Integration with Azure Alerts
## Troubleshooting
### "Terraform destroy fails on security groups"
**Cause**: Resources using security groups still exist
**Fix**:
```bash
# 1. Find dependent resources
aws ec2 describe-network-interfaces --filters "Name=group-id,Values=sg-12345678"
# 2. Delete dependent resources first
# 3. Retry terraform destroy
```
### "AWS MSK cluster takes 20+ minutes to create"
**Cause**: MSK provisioning is inherently slow (AWS behavior)
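**Fix**: This is normal, and creation time cannot be shortened. In automation, pass `-auto-approve` so the run does not block on the interactive approval prompt: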
```bash
terraform apply -auto-approve
```
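For CI/CD, a common alternative to `-auto-approve` is to save the plan and apply exactly that file, so what was reviewed is what gets created. A minimal sketch follows; the function name `run_terraform_pipeline`, the `TERRAFORM_BIN` override, and the `tfplan` file name are assumptions made for this example.

```bash
# Hypothetical non-interactive pipeline step.
# TERRAFORM_BIN is an assumed override hook, handy for dry runs and tests.
run_terraform_pipeline() {
  tf="${TERRAFORM_BIN:-terraform}"

  # -input=false makes Terraform fail instead of prompting for missing variables
  "$tf" init -input=false || return 1

  # Save the plan so the apply step executes exactly what was reviewed
  "$tf" plan -input=false -out=tfplan || return 1

  # Applying a saved plan file does not ask for confirmation
  "$tf" apply -input=false tfplan
}

# Dry run that only prints the commands:
#   TERRAFORM_BIN=echo run_terraform_pipeline
```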
### "Azure Event Hubs: Connection refused"
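**Cause**: Kafka endpoint not available (Kafka not enabled, or Basic tier) OR incorrect client connection settings
**Fix**:
1. Verify `kafka_enabled = true` in Terraform and that the namespace is not on the Basic tier
2. Connect to the Kafka endpoint `<namespace>.servicebus.windows.net:9093` using `SASL_SSL` with the `PLAIN` mechanism, username `$ConnectionString`, and the Event Hubs connection string as the password
3. Check firewall rules (Premium SKU supports private endpoints)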
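The client settings for the Event Hubs Kafka endpoint can be sketched as a generated properties file. This is an illustration under assumptions: `render_event_hubs_client_properties` is a hypothetical helper, and the connection string is whatever the module's `connection_string` output returns.

```bash
# Hypothetical helper: prints Kafka client properties for an Event Hubs namespace.
# Usage: render_event_hubs_client_properties <namespace> <connection_string>
render_event_hubs_client_properties() {
  namespace="$1"
  connection_string="$2"

  # The Kafka endpoint always listens on port 9093
  printf 'bootstrap.servers=%s.servicebus.windows.net:9093\n' "$namespace"
  printf 'security.protocol=SASL_SSL\n'
  printf 'sasl.mechanism=PLAIN\n'

  # The username is the literal text $ConnectionString; single quotes keep
  # the shell from expanding it. The connection string is the password.
  printf 'sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required username="$ConnectionString" password="%s";\n' "$connection_string"
}

# Example (writes a secret to disk, so restrict the file's permissions):
#   render_event_hubs_client_properties my-kafka-namespace \
#     "$(terraform output -raw connection_string)" > client.properties
```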
## Integration with Other Skills
- **kafka-architecture**: For cluster sizing and partitioning strategy
- **kafka-observability**: For Prometheus + Grafana setup after deployment
- **kafka-kubernetes**: For deploying Kafka on Kubernetes (alternative to Terraform)
- **kafka-cli-tools**: For testing deployed clusters with kcat
## Quick Reference Commands
```bash
# Terraform workflow
terraform init # Initialize modules
terraform plan # Preview changes
terraform apply # Create infrastructure
terraform output # Get outputs (endpoints, etc.)
terraform destroy # Delete infrastructure
# AWS MSK specific
aws kafka list-clusters # List MSK clusters
aws kafka describe-cluster --cluster-arn <arn> # Get cluster details
# Azure Event Hubs specific
az eventhubs namespace list # List namespaces
az eventhubs eventhub list --namespace-name <name> --resource-group <rg> # List hubs
```
---
**Next Steps After Deployment**:
1. Use **kafka-observability** skill to set up Prometheus + Grafana monitoring
2. Use **kafka-cli-tools** skill to test cluster with kcat
3. Deploy your producer/consumer applications
4. Monitor cluster health and performance