gh-ahmedasmar-devops-claude…/skills/references/cost_optimization.md

# Terraform Cost Optimization Guide

Strategies for optimizing cloud infrastructure costs when using Terraform.

## Table of Contents

1. [Right-Sizing Resources](#right-sizing-resources)
2. [Spot and Reserved Instances](#spot-and-reserved-instances)
3. [Storage Optimization](#storage-optimization)
4. [Networking Costs](#networking-costs)
5. [Resource Lifecycle](#resource-lifecycle)
6. [Cost Tagging](#cost-tagging)
7. [Monitoring and Alerts](#monitoring-and-alerts)
8. [Multi-Cloud Considerations](#multi-cloud-considerations)

---

## Right-Sizing Resources

### Compute Resources

**Start small, scale up:**
```hcl
variable "instance_type" {
  type        = string
  description = "EC2 instance type"
  default     = "t3.micro"  # Start with smallest reasonable size

  validation {
    condition     = can(regex("^t[0-9]\\.", var.instance_type))
    error_message = "Consider starting with burstable (t-series) instances for cost optimization."
  }
}
```

**Use auto-scaling instead of over-provisioning:**
```hcl
resource "aws_autoscaling_group" "app" {
  min_size         = 2   # Minimum for HA
  desired_capacity = 2   # Normal load
  max_size         = 10  # Peak load

  # Scale based on actual usage
  target_group_arns = [aws_lb_target_group.app.arn]

  tag {
    key                 = "Environment"
    value               = var.environment
    propagate_at_launch = true
  }
}
```

### Database Right-Sizing

**Start with appropriate size:**
```hcl
resource "aws_db_instance" "main" {
  instance_class = var.environment == "prod" ? "db.t3.medium" : "db.t3.micro"

  # Enable auto-scaling for storage
  allocated_storage     = 20
  max_allocated_storage = 100  # Auto-scale up to 100GB

  # Use cheaper storage for non-prod
  storage_type = var.environment == "prod" ? "io1" : "gp3"
}
```

---

## Spot and Reserved Instances

### Spot Instances for Non-Critical Workloads

**Launch Template for Spot:**
```hcl
resource "aws_launch_template" "spot" {
  name_prefix   = "spot-"
  image_id      = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  instance_market_options {
    market_type = "spot"

    spot_options {
      max_price                      = "0.05"  # Set price limit
      spot_instance_type             = "one-time"
      instance_interruption_behavior = "terminate"
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "spot-instance"
      Workload    = "non-critical"
      CostSavings = "true"
    }
  }
}

resource "aws_autoscaling_group" "spot" {
  desired_capacity = 5
  max_size         = 10
  min_size         = 0

  mixed_instances_policy {
    instances_distribution {
      on_demand_percentage_above_base_capacity = 20  # 20% on-demand, 80% spot
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.spot.id
        version            = "$Latest"
      }

      # Multiple instance types increase spot availability
      override {
        instance_type = "t3.medium"
      }
      override {
        instance_type = "t3.large"
      }
      override {
        instance_type = "t3a.medium"
      }
    }
  }
}
```

### Reserved Instances (Use Outside Terraform)

Terraform shouldn't manage reservations directly, but should:
- Tag resources consistently for reservation planning
- Use Instance Savings Plans for flexibility
- Monitor usage patterns to inform reservation purchases

**Tagging for reservation analysis:**
```hcl
locals {
  reservation_tags = {
    ReservationCandidate = var.environment == "prod" ? "true" : "false"
    UsagePattern         = "steady-state"  # or "variable", "burst"
    CostCenter          = var.cost_center
  }
}
```

---

## Storage Optimization

### S3 Lifecycle Policies

**Automatic tiering:**
```hcl
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    id     = "log-retention"
    status = "Enabled"

    transition {
      days          = 30
      storage_class = "STANDARD_IA"  # Infrequent Access after 30 days
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"  # Instant Retrieval Glacier after 90 days
    }

    transition {
      days          = 180
      storage_class = "DEEP_ARCHIVE"  # Deep Archive after 180 days
    }

    expiration {
      days = 365  # Delete after 1 year
    }
  }
}
```

**Intelligent tiering for variable access:**
```hcl
resource "aws_s3_bucket_intelligent_tiering_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id
  name   = "entire-bucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}
```

### EBS Volume Optimization

**Use appropriate volume types:**
```hcl
resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  root_block_device {
    volume_type = "gp3"  # gp3 is cheaper than gp2 with better baseline
    volume_size = 20
    iops        = 3000   # Default, only pay more if you need more
    throughput  = 125    # Default
    encrypted   = true

    # Delete on termination to avoid orphaned volumes
    delete_on_termination = true
  }

  tags = {
    Name = "app-server"
  }
}
```

**Snapshot lifecycle:**
```hcl
resource "aws_dlm_lifecycle_policy" "snapshots" {
  description        = "EBS snapshot lifecycle"
  execution_role_arn = aws_iam_role.dlm.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "Daily snapshots"

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["03:00"]
      }

      retain_rule {
        count = 7  # Keep only 7 days of snapshots
      }

      copy_tags = true
    }

    target_tags = {
      BackupEnabled = "true"
    }
  }
}
```

---

## Networking Costs

### Minimize Data Transfer

**Use VPC endpoints to avoid NAT charges:**
```hcl
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${var.region}.s3"
  route_table_ids = [
    aws_route_table.private.id
  ]

  tags = {
    Name        = "s3-endpoint"
    CostSavings = "reduces-nat-charges"
  }
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${var.region}.dynamodb"
  route_table_ids = [
    aws_route_table.private.id
  ]
}
```

**Interface endpoints for AWS services:**
```hcl
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name        = "ecr-api-endpoint"
    CostSavings = "reduces-nat-data-transfer"
  }
}
```

### Regional Optimization

**Co-locate resources in same region/AZ:**
```hcl
# Bad - cross-region data transfer is expensive
resource "aws_instance" "app" {
  availability_zone = "us-east-1a"
}

resource "aws_rds_cluster" "main" {
  availability_zones = ["us-west-2a"]  # Different region!
}

# Good - same region and AZ when possible
resource "aws_instance" "app" {
  availability_zone = var.availability_zone
}

resource "aws_rds_cluster" "main" {
  availability_zones = [var.availability_zone]  # Same AZ
}
```

---

## Resource Lifecycle

### Scheduled Shutdown for Non-Production

**Lambda to stop/start instances:**
```hcl
resource "aws_lambda_function" "scheduler" {
  filename      = "scheduler.zip"
  function_name = "instance-scheduler"
  role          = aws_iam_role.scheduler.arn
  handler       = "scheduler.handler"
  runtime       = "python3.9"

  environment {
    variables = {
      TAG_KEY   = "Schedule"
      TAG_VALUE = "business-hours"
    }
  }
}

# EventBridge rule to stop instances at night
resource "aws_cloudwatch_event_rule" "stop_instances" {
  name                = "stop-dev-instances"
  description         = "Stop dev instances at 7 PM"
  schedule_expression = "cron(0 19 ? * MON-FRI *)"  # 7 PM weekdays
}

resource "aws_cloudwatch_event_target" "stop" {
  rule      = aws_cloudwatch_event_rule.stop_instances.name
  target_id = "stop-instances"
  arn       = aws_lambda_function.scheduler.arn

  input = jsonencode({
    action = "stop"
  })
}

# Start instances in the morning
resource "aws_cloudwatch_event_rule" "start_instances" {
  name                = "start-dev-instances"
  description         = "Start dev instances at 8 AM"
  schedule_expression = "cron(0 8 ? * MON-FRI *)"  # 8 AM weekdays
}
```

**Tag instances for scheduling:**
```hcl
resource "aws_instance" "dev" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  tags = {
    Name        = "dev-server"
    Environment = "dev"
    Schedule    = "business-hours"  # Scheduler will stop/start based on this
    AutoShutdown = "true"
  }
}
```

### Cleanup Old Resources

**S3 lifecycle for temporary data:**
```hcl
resource "aws_s3_bucket_lifecycle_configuration" "temp" {
  bucket = aws_s3_bucket.temp.id

  rule {
    id     = "cleanup-temp-files"
    status = "Enabled"

    filter {
      prefix = "temp/"
    }

    expiration {
      days = 7  # Delete after 7 days
    }

    abort_incomplete_multipart_upload {
      days_after_initiation = 1
    }
  }
}
```

---

## Cost Tagging

### Comprehensive Tagging Strategy

**Define tagging locals:**
```hcl
locals {
  common_tags = {
    # Cost allocation tags
    CostCenter  = var.cost_center
    Project     = var.project_name
    Environment = var.environment
    Owner       = var.team_email

    # Operational tags
    ManagedBy       = "Terraform"
    TerraformModule = basename(abspath(path.module))

    # Cost optimization tags
    AutoShutdown        = var.environment != "prod" ? "enabled" : "disabled"
    ReservationCandidate = var.environment == "prod" ? "true" : "false"
    CostOptimized       = "true"
  }
}

# Apply to all resources
resource "aws_instance" "app" {
  # ... configuration ...

  tags = merge(
    local.common_tags,
    {
      Name = "${var.environment}-app-server"
      Role = "application"
    }
  )
}
```

**Enforce tagging with AWS Config:**
```hcl
resource "aws_config_config_rule" "required_tags" {
  name = "required-tags"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  input_parameters = jsonencode({
    tag1Key = "CostCenter"
    tag2Key = "Environment"
    tag3Key = "Owner"
  })

  depends_on = [aws_config_configuration_recorder.main]
}
```

---

## Monitoring and Alerts

### Budget Alerts

**AWS Budgets with Terraform:**
```hcl
resource "aws_budgets_budget" "monthly" {
  name              = "${var.environment}-monthly-budget"
  budget_type       = "COST"
  limit_amount      = var.monthly_budget
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  time_period_start = "2024-01-01_00:00"

  cost_filter {
    name = "TagKeyValue"
    values = [
      "Environment$${var.environment}"
    ]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email]
  }
}
```

### Cost Anomaly Detection

```hcl
resource "aws_ce_anomaly_monitor" "service" {
  name              = "${var.environment}-service-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name      = "${var.environment}-anomaly-alerts"
  frequency = "DAILY"

  monitor_arn_list = [
    aws_ce_anomaly_monitor.service.arn
  ]

  subscriber {
    type    = "EMAIL"
    address = var.cost_alert_email
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"]  # Alert on $100+ anomalies
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}
```

---

## Multi-Cloud Considerations

### Azure Cost Optimization

**Use Azure Hybrid Benefit:**
```hcl
resource "azurerm_linux_virtual_machine" "main" {
  # ... configuration ...

  # Use Azure Hybrid Benefit for licensing savings
  license_type = "RHEL_BYOS"  # or "SLES_BYOS"
}
```

**Azure Reserved Instances (outside Terraform):**
- Purchase through Azure Portal
- Tag VMs with `ReservationGroup` for planning

### GCP Cost Optimization

**Use committed use discounts:**
```hcl
resource "google_compute_instance" "main" {
  # ... configuration ...

  # Use committed use discount
  scheduling {
    automatic_restart   = true
    on_host_maintenance = "MIGRATE"
    preemptible         = var.environment != "prod"  # Preemptible for non-prod
  }
}
```

**GCP Preemptible VMs:**
```hcl
resource "google_compute_instance_template" "preemptible" {
  machine_type = "n1-standard-1"

  scheduling {
    automatic_restart   = false
    on_host_maintenance = "TERMINATE"
    preemptible         = true  # Up to 80% cost reduction
  }
}
```

---

## Cost Optimization Checklist

### Before Deployment
- [ ] Right-size compute resources (start small)
- [ ] Use appropriate storage tiers
- [ ] Enable auto-scaling instead of over-provisioning
- [ ] Implement tagging strategy
- [ ] Configure lifecycle policies
- [ ] Set up VPC endpoints for AWS services

### After Deployment
- [ ] Monitor actual usage vs. provisioned capacity
- [ ] Review cost allocation tags
- [ ] Identify reservation opportunities
- [ ] Configure budget alerts
- [ ] Enable cost anomaly detection
- [ ] Schedule non-production resource shutdown

### Ongoing
- [ ] Monthly cost review
- [ ] Quarterly right-sizing analysis
- [ ] Annual reservation review
- [ ] Remove unused resources
- [ ] Optimize data transfer patterns
- [ ] Update instance families (new generations are often cheaper)

---

## Cost Estimation Tools

### Use `infracost` in CI/CD

```bash
# Install infracost
curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh

# Generate cost estimate
infracost breakdown --path .

# Compare cost changes in PR
infracost diff --path . --compare-to tfplan.json
```

### Terraform Cloud Cost Estimation

Enable in Terraform Cloud workspace settings for automatic cost estimates on every plan.

---

## Additional Resources

- AWS Cost Optimization: https://aws.amazon.com/pricing/cost-optimization/
- Azure Cost Management: https://azure.microsoft.com/en-us/products/cost-management/
- GCP Cost Management: https://cloud.google.com/cost-management
- Infracost: https://www.infracost.io/
- Cloud Cost Optimization Tools: Kubecost, CloudHealth, CloudCheckr