Terraform Cost Optimization Guide

Strategies for optimizing cloud infrastructure costs when using Terraform.

Table of Contents

  1. Right-Sizing Resources
  2. Spot and Reserved Instances
  3. Storage Optimization
  4. Networking Costs
  5. Resource Lifecycle
  6. Cost Tagging
  7. Monitoring and Alerts
  8. Multi-Cloud Considerations

Right-Sizing Resources

Compute Resources

Start small, scale up:

variable "instance_type" {
  type        = string
  description = "EC2 instance type"
  default     = "t3.micro"  # Start with smallest reasonable size

  validation {
    condition     = can(regex("^t[0-9]+[a-z]*\\.", var.instance_type))  # t3, t3a, t4g, ...
    error_message = "Instance type must be a burstable (t-series) instance; relax this check only when a larger class is justified."
  }
}

Use auto-scaling instead of over-provisioning:

resource "aws_autoscaling_group" "app" {
  min_size         = 2   # Minimum for HA
  desired_capacity = 2   # Normal load
  max_size         = 10  # Peak load

  # Scale based on actual usage
  target_group_arns = [aws_lb_target_group.app.arn]

  tag {
    key                 = "Environment"
    value               = var.environment
    propagate_at_launch = true
  }
}

Database Right-Sizing

Start with appropriate size:

resource "aws_db_instance" "main" {
  instance_class = var.environment == "prod" ? "db.t3.medium" : "db.t3.micro"

  # Enable auto-scaling for storage
  allocated_storage     = 20
  max_allocated_storage = 100  # Auto-scale up to 100GB

  # Use cheaper storage for non-prod
  storage_type = var.environment == "prod" ? "io1" : "gp3"
}

Spot and Reserved Instances

Spot Instances for Non-Critical Workloads

Launch Template for Spot:

resource "aws_launch_template" "spot" {
  name_prefix   = "spot-"
  image_id      = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  instance_market_options {
    market_type = "spot"

    spot_options {
      max_price                      = "0.05"  # Set price limit
      spot_instance_type             = "one-time"
      instance_interruption_behavior = "terminate"
    }
  }

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "spot-instance"
      Workload    = "non-critical"
      CostSavings = "true"
    }
  }
}

resource "aws_autoscaling_group" "spot" {
  desired_capacity = 5
  max_size         = 10
  min_size         = 0

  mixed_instances_policy {
    instances_distribution {
      on_demand_percentage_above_base_capacity = 20  # 20% on-demand, 80% spot
      spot_allocation_strategy                 = "capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.spot.id
        version            = "$Latest"
      }

      # Multiple instance types increase spot availability
      override {
        instance_type = "t3.medium"
      }
      override {
        instance_type = "t3.large"
      }
      override {
        instance_type = "t3a.medium"
      }
    }
  }
}

Reserved Instances (Use Outside Terraform)

Terraform shouldn't manage reservations directly, but your configuration should:

  • Tag resources consistently for reservation planning
  • Use EC2 Instance Savings Plans or Compute Savings Plans for flexibility
  • Monitor usage patterns to inform reservation purchases

Tagging for reservation analysis:

locals {
  reservation_tags = {
    ReservationCandidate = var.environment == "prod" ? "true" : "false"
    UsagePattern         = "steady-state"  # or "variable", "burst"
    CostCenter           = var.cost_center
  }
}

Storage Optimization

S3 Lifecycle Policies

Automatic tiering:

resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    id     = "log-retention"
    status = "Enabled"

    transition {
      days          = 30
      storage_class = "STANDARD_IA"  # Infrequent Access after 30 days
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR"  # Instant Retrieval Glacier after 90 days
    }

    transition {
      days          = 180
      storage_class = "DEEP_ARCHIVE"  # Deep Archive after 180 days
    }

    expiration {
      days = 365  # Delete after 1 year
    }
  }
}

Intelligent tiering for variable access:

resource "aws_s3_bucket_intelligent_tiering_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id
  name   = "entire-bucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}

EBS Volume Optimization

Use appropriate volume types:

resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  root_block_device {
    volume_type = "gp3"  # gp3 is cheaper than gp2 with better baseline
    volume_size = 20
    iops        = 3000   # Default, only pay more if you need more
    throughput  = 125    # Default
    encrypted   = true

    # Delete on termination to avoid orphaned volumes
    delete_on_termination = true
  }

  tags = {
    Name = "app-server"
  }
}

Snapshot lifecycle:

resource "aws_dlm_lifecycle_policy" "snapshots" {
  description        = "EBS snapshot lifecycle"
  execution_role_arn = aws_iam_role.dlm.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "Daily snapshots"

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["03:00"]
      }

      retain_rule {
        count = 7  # Keep only 7 days of snapshots
      }

      copy_tags = true
    }

    target_tags = {
      BackupEnabled = "true"
    }
  }
}

Networking Costs

Minimize Data Transfer

Use VPC endpoints to avoid NAT charges:

resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${var.region}.s3"
  route_table_ids = [
    aws_route_table.private.id
  ]

  tags = {
    Name        = "s3-endpoint"
    CostSavings = "reduces-nat-charges"
  }
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${var.region}.dynamodb"
  route_table_ids = [
    aws_route_table.private.id
  ]
}

Interface endpoints for AWS services:

resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name        = "ecr-api-endpoint"
    CostSavings = "reduces-nat-data-transfer"
  }
}

Regional Optimization

Co-locate resources in same region/AZ:

# Bad - cross-region data transfer is expensive
resource "aws_instance" "app" {
  availability_zone = "us-east-1a"
}

resource "aws_rds_cluster" "main" {
  availability_zones = ["us-west-2a"]  # Different region!
}

# Good - same region and AZ when possible
resource "aws_instance" "app" {
  availability_zone = var.availability_zone
}

resource "aws_rds_cluster" "main" {
  availability_zones = [var.availability_zone]  # Same AZ
}

Resource Lifecycle

Scheduled Shutdown for Non-Production

Lambda to stop/start instances:

resource "aws_lambda_function" "scheduler" {
  filename      = "scheduler.zip"
  function_name = "instance-scheduler"
  role          = aws_iam_role.scheduler.arn
  handler       = "scheduler.handler"
  runtime       = "python3.12"

  environment {
    variables = {
      TAG_KEY   = "Schedule"
      TAG_VALUE = "business-hours"
    }
  }
}

# EventBridge rule to stop instances at night
resource "aws_cloudwatch_event_rule" "stop_instances" {
  name                = "stop-dev-instances"
  description         = "Stop dev instances at 7 PM"
  schedule_expression = "cron(0 19 ? * MON-FRI *)"  # 19:00 UTC on weekdays
}

resource "aws_cloudwatch_event_target" "stop" {
  rule      = aws_cloudwatch_event_rule.stop_instances.name
  target_id = "stop-instances"
  arn       = aws_lambda_function.scheduler.arn

  input = jsonencode({
    action = "stop"
  })
}
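
EventBridge also needs permission to invoke the function, or the scheduled rules will fire with no effect. A minimal sketch, assuming the resource names above:

resource "aws_lambda_permission" "allow_eventbridge_stop" {
  statement_id  = "AllowStopRule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.stop_instances.arn
}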

# Start instances in the morning
resource "aws_cloudwatch_event_rule" "start_instances" {
  name                = "start-dev-instances"
  description         = "Start dev instances at 8 AM"
  schedule_expression = "cron(0 8 ? * MON-FRI *)"  # 08:00 UTC on weekdays
}
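
The start rule also needs a target mirroring the stop target; a sketch reusing the same scheduler function:

resource "aws_cloudwatch_event_target" "start" {
  rule      = aws_cloudwatch_event_rule.start_instances.name
  target_id = "start-instances"
  arn       = aws_lambda_function.scheduler.arn

  input = jsonencode({
    action = "start"
  })
}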

Tag instances for scheduling:

resource "aws_instance" "dev" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  tags = {
    Name         = "dev-server"
    Environment  = "dev"
    Schedule     = "business-hours"  # Scheduler will stop/start based on this
    AutoShutdown = "true"
  }
}

Cleanup Old Resources

S3 lifecycle for temporary data:

resource "aws_s3_bucket_lifecycle_configuration" "temp" {
  bucket = aws_s3_bucket.temp.id

  rule {
    id     = "cleanup-temp-files"
    status = "Enabled"

    filter {
      prefix = "temp/"
    }

    expiration {
      days = 7  # Delete after 7 days
    }

    abort_incomplete_multipart_upload {
      days_after_initiation = 1
    }
  }
}

Cost Tagging

Comprehensive Tagging Strategy

Define tagging locals:

locals {
  common_tags = {
    # Cost allocation tags
    CostCenter  = var.cost_center
    Project     = var.project_name
    Environment = var.environment
    Owner       = var.team_email

    # Operational tags
    ManagedBy       = "Terraform"
    TerraformModule = basename(abspath(path.module))

    # Cost optimization tags
    AutoShutdown         = var.environment != "prod" ? "enabled" : "disabled"
    ReservationCandidate = var.environment == "prod" ? "true" : "false"
    CostOptimized        = "true"
  }
}

# Apply to all resources
resource "aws_instance" "app" {
  # ... configuration ...

  tags = merge(
    local.common_tags,
    {
      Name = "${var.environment}-app-server"
      Role = "application"
    }
  )
}
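
Tags only show up in Cost Explorer and billing reports once they are activated as cost allocation tags. A sketch, assuming the AWS provider's aws_ce_cost_allocation_tag resource (management/payer account only):

resource "aws_ce_cost_allocation_tag" "cost_center" {
  tag_key = "CostCenter"
  status  = "Active"
}

resource "aws_ce_cost_allocation_tag" "environment" {
  tag_key = "Environment"
  status  = "Active"
}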

Enforce tagging with AWS Config:

resource "aws_config_config_rule" "required_tags" {
  name = "required-tags"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  input_parameters = jsonencode({
    tag1Key = "CostCenter"
    tag2Key = "Environment"
    tag3Key = "Owner"
  })

  depends_on = [aws_config_configuration_recorder.main]
}

Monitoring and Alerts

Budget Alerts

AWS Budgets with Terraform:

resource "aws_budgets_budget" "monthly" {
  name              = "${var.environment}-monthly-budget"
  budget_type       = "COST"
  limit_amount      = var.monthly_budget
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  time_period_start = "2024-01-01_00:00"

  cost_filter {
    name = "TagKeyValue"
    # AWS expects "user:<TagKey>$<TagValue>"; "$${...}" in a plain string is
    # an escape sequence (literal "${"), so build the value with format()
    values = [
      format("user:Environment$%s", var.environment)
    ]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email]
  }
}

Cost Anomaly Detection

resource "aws_ce_anomaly_monitor" "service" {
  name              = "${var.environment}-service-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name      = "${var.environment}-anomaly-alerts"
  frequency = "DAILY"

  monitor_arn_list = [
    aws_ce_anomaly_monitor.service.arn
  ]

  subscriber {
    type    = "EMAIL"
    address = var.cost_alert_email
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"]  # Alert on $100+ anomalies
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}

Multi-Cloud Considerations

Azure Cost Optimization

Use Azure Hybrid Benefit:

resource "azurerm_linux_virtual_machine" "main" {
  # ... configuration ...

  # Use Azure Hybrid Benefit for licensing savings
  license_type = "RHEL_BYOS"  # or "SLES_BYOS"
}

Azure Reserved Instances (outside Terraform):

  • Purchase through Azure Portal
  • Tag VMs with ReservationGroup for planning
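
The tag can be applied directly in the VM definition so reservation planning reports stay accurate; a minimal sketch, with ReservationGroup as an illustrative tag name:

resource "azurerm_linux_virtual_machine" "main" {
  # ... configuration ...

  tags = {
    Environment      = var.environment
    ReservationGroup = var.environment == "prod" ? "steady-state" : "none"
  }
}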

GCP Cost Optimization

Committed use discounts and preemptible capacity:

resource "google_compute_instance" "main" {
  # ... configuration ...

  # Preemptible VMs require TERMINATE maintenance and no automatic restart
  scheduling {
    preemptible         = var.environment != "prod"  # Preemptible for non-prod
    automatic_restart   = var.environment == "prod"
    on_host_maintenance = var.environment == "prod" ? "MIGRATE" : "TERMINATE"
  }
}
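
Committed use discounts themselves can also be purchased through Terraform; a hedged sketch, assuming the google_compute_region_commitment resource and a one-year vCPU/memory commitment (amounts are illustrative):

resource "google_compute_region_commitment" "prod" {
  name   = "prod-commitment"
  region = var.region
  plan   = "TWELVE_MONTH"  # or "THIRTY_SIX_MONTH"

  resources {
    type   = "VCPU"
    amount = 8
  }
  resources {
    type   = "MEMORY"
    amount = 32  # GB
  }
}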

GCP Preemptible VMs:

resource "google_compute_instance_template" "preemptible" {
  machine_type = "n1-standard-1"

  scheduling {
    automatic_restart   = false
    on_host_maintenance = "TERMINATE"
    preemptible         = true  # Spot/preemptible pricing is typically 60-91% off on-demand
  }
}

Cost Optimization Checklist

Before Deployment

  • Right-size compute resources (start small)
  • Use appropriate storage tiers
  • Enable auto-scaling instead of over-provisioning
  • Implement tagging strategy
  • Configure lifecycle policies
  • Set up VPC endpoints for AWS services

After Deployment

  • Monitor actual usage vs. provisioned capacity
  • Review cost allocation tags
  • Identify reservation opportunities
  • Configure budget alerts
  • Enable cost anomaly detection
  • Schedule non-production resource shutdown

Ongoing

  • Monthly cost review
  • Quarterly right-sizing analysis
  • Annual reservation review
  • Remove unused resources
  • Optimize data transfer patterns
  • Update instance families (new generations are often cheaper)

Cost Estimation Tools

Use Infracost in CI/CD

# Install Infracost
curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh

# Generate a cost estimate for the current directory
infracost breakdown --path .

# Compare cost changes in a PR against a baseline generated on main
infracost breakdown --path . --format json --out-file infracost-base.json
infracost diff --path . --compare-to infracost-base.json

Terraform Cloud Cost Estimation

Enable in Terraform Cloud workspace settings for automatic cost estimates on every plan.


Additional Resources