# Terraform Cost Optimization Guide

Strategies for optimizing cloud infrastructure costs when using Terraform.

## Table of Contents

1. [Right-Sizing Resources](#right-sizing-resources)
2. [Spot and Reserved Instances](#spot-and-reserved-instances)
3. [Storage Optimization](#storage-optimization)
4. [Networking Costs](#networking-costs)
5. [Resource Lifecycle](#resource-lifecycle)
6. [Cost Tagging](#cost-tagging)
7. [Monitoring and Alerts](#monitoring-and-alerts)
8. [Multi-Cloud Considerations](#multi-cloud-considerations)
9. [Cost Optimization Checklist](#cost-optimization-checklist)
10. [Cost Estimation Tools](#cost-estimation-tools)
11. [Additional Resources](#additional-resources)

---

## Right-Sizing Resources

### Compute Resources

**Start small, scale up:**
```hcl
variable "instance_type" {
  type        = string
  description = "EC2 instance type"
  default     = "t3.micro" # Start with the smallest reasonable size

  validation {
    # Accept any burstable family, including suffixed ones (t3a, t4g)
    condition     = can(regex("^t[0-9][a-z]*\\.", var.instance_type))
    error_message = "Use a burstable (t-series) instance type, such as t3.micro, for cost optimization."
  }
}
```

**Use auto-scaling instead of over-provisioning:**
```hcl
resource "aws_autoscaling_group" "app" {
  # ... launch template and subnets omitted ...

  min_size         = 2  # Minimum for HA
  desired_capacity = 2  # Normal load
  max_size         = 10 # Peak load

  # Register instances with the load balancer target group.
  # Scaling on actual usage is driven by a scaling policy attached to this group.
  target_group_arns = [aws_lb_target_group.app.arn]

  tag {
    key                 = "Environment"
    value               = var.environment
    propagate_at_launch = true
  }
}
```

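An Auto Scaling group only sets capacity bounds; a scaling policy is what moves the instance count between `min_size` and `max_size`. The following is a minimal sketch of a target tracking policy for the `app` group. The policy name and the 60% CPU target are assumptions for illustration, not values from this guide; pick a metric and target that match your workload.

```hcl
resource "aws_autoscaling_policy" "cpu_target" {
  name                   = "cpu-target-tracking" # hypothetical name
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }

    # Assumed target: add or remove instances to hold average CPU near 60%
    target_value = 60
  }
}
```

Once a policy adjusts capacity, Terraform may try to reset `desired_capacity` on the next apply. A common way around this is to add `desired_capacity` to `lifecycle { ignore_changes = [...] }` on the group.
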
### Database Right-Sizing

**Start with appropriate size:**
```hcl
resource "aws_db_instance" "main" {
  instance_class = var.environment == "prod" ? "db.t3.medium" : "db.t3.micro"

  # Enable storage autoscaling
  allocated_storage     = 20
  max_allocated_storage = 100 # Auto-scale up to 100 GiB

  # Use cheaper storage for non-prod.
  # Note: io1 also requires `iops` and has a higher minimum allocated_storage than gp3.
  storage_type = var.environment == "prod" ? "io1" : "gp3"
}
```

---

## Spot and Reserved Instances

### Spot Instances for Non-Critical Workloads

**Launch Template for Spot:**
```hcl
resource "aws_launch_template" "spot" {
  name_prefix   = "spot-"
  image_id      = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  # No instance_market_options here: a launch template that requests Spot
  # cannot be used with an Auto Scaling group's mixed_instances_policy.
  # The Spot settings live in instances_distribution below.

  tag_specifications {
    resource_type = "instance"
    tags = {
      Name        = "spot-instance"
      Workload    = "non-critical"
      CostSavings = "true"
    }
  }
}

resource "aws_autoscaling_group" "spot" {
  # ... subnets (vpc_zone_identifier) omitted ...

  desired_capacity = 5
  max_size         = 10
  min_size         = 0

  mixed_instances_policy {
    instances_distribution {
      on_demand_percentage_above_base_capacity = 20 # 20% on-demand, 80% spot
      spot_allocation_strategy                 = "capacity-optimized"
      spot_max_price                           = "0.05" # Set price limit
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.spot.id
        version            = "$Latest"
      }

      # Multiple instance types increase spot availability
      override {
        instance_type = "t3.medium"
      }
      override {
        instance_type = "t3.large"
      }
      override {
        instance_type = "t3a.medium"
      }
    }
  }
}
```

### Reserved Instances (Use Outside Terraform)

Reservations and Savings Plans are usually purchased outside Terraform, but Terraform configurations can support that planning:
- Tag resources consistently for reservation planning
- Consider Savings Plans where more flexibility than Reserved Instances is needed
- Monitor usage patterns to inform reservation purchases

**Tagging for reservation analysis:**
```hcl
locals {
  reservation_tags = {
    ReservationCandidate = var.environment == "prod" ? "true" : "false"
    UsagePattern         = "steady-state" # or "variable", "burst"
    CostCenter           = var.cost_center
  }
}
```

---

## Storage Optimization

### S3 Lifecycle Policies

**Automatic tiering:**
```hcl
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
  bucket = aws_s3_bucket.logs.id

  rule {
    id     = "log-retention"
    status = "Enabled"

    # Empty filter applies the rule to every object in the bucket
    filter {}

    transition {
      days          = 30
      storage_class = "STANDARD_IA" # Infrequent Access after 30 days
    }

    transition {
      days          = 90
      storage_class = "GLACIER_IR" # Glacier Instant Retrieval after 90 days
    }

    transition {
      days          = 180
      storage_class = "DEEP_ARCHIVE" # Deep Archive after 180 days
    }

    expiration {
      days = 365 # Delete after 1 year
    }
  }
}
```

**Intelligent tiering for variable access:**
```hcl
resource "aws_s3_bucket_intelligent_tiering_configuration" "assets" {
  bucket = aws_s3_bucket.assets.id
  name   = "entire-bucket"

  tiering {
    access_tier = "ARCHIVE_ACCESS"
    days        = 90
  }

  tiering {
    access_tier = "DEEP_ARCHIVE_ACCESS"
    days        = 180
  }
}
```

### EBS Volume Optimization

**Use appropriate volume types:**
```hcl
resource "aws_instance" "app" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  root_block_device {
    volume_type = "gp3" # gp3 is cheaper than gp2 with better baseline
    volume_size = 20
    iops        = 3000 # Default, only pay more if you need more
    throughput  = 125  # Default
    encrypted   = true

    # Delete on termination to avoid orphaned volumes
    delete_on_termination = true
  }

  tags = {
    Name = "app-server"
  }
}
```

**Snapshot lifecycle:**
```hcl
resource "aws_dlm_lifecycle_policy" "snapshots" {
  description        = "EBS snapshot lifecycle"
  execution_role_arn = aws_iam_role.dlm.arn
  state              = "ENABLED"

  policy_details {
    resource_types = ["VOLUME"]

    schedule {
      name = "Daily snapshots"

      create_rule {
        interval      = 24
        interval_unit = "HOURS"
        times         = ["03:00"]
      }

      retain_rule {
        count = 7 # Keep the 7 most recent snapshots (7 days at one per day)
      }

      copy_tags = true
    }

    target_tags = {
      BackupEnabled = "true"
    }
  }
}
```

---

## Networking Costs

### Minimize Data Transfer

**Use VPC endpoints to avoid NAT charges:**
```hcl
# Gateway endpoints (the default endpoint type) for S3 and DynamoDB carry no hourly charge
resource "aws_vpc_endpoint" "s3" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${var.region}.s3"
  route_table_ids = [
    aws_route_table.private.id
  ]

  tags = {
    Name        = "s3-endpoint"
    CostSavings = "reduces-nat-charges"
  }
}

resource "aws_vpc_endpoint" "dynamodb" {
  vpc_id       = aws_vpc.main.id
  service_name = "com.amazonaws.${var.region}.dynamodb"
  route_table_ids = [
    aws_route_table.private.id
  ]
}
```

**Interface endpoints for AWS services:**
```hcl
# Interface endpoints have their own hourly and per-GB charges;
# compare them against the NAT traffic they replace
resource "aws_vpc_endpoint" "ecr_api" {
  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.ecr.api"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name        = "ecr-api-endpoint"
    CostSavings = "reduces-nat-data-transfer"
  }
}
```

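When several services need interface endpoints, repeating the resource gets verbose. The sketch below uses `for_each` to create one endpoint per service from a single list. The local name `interface_endpoint_services` and the services in it are assumptions for illustration; it reuses the VPC, subnet, and security group references from the example above.

```hcl
locals {
  # Hypothetical list: include only services your workloads actually call
  interface_endpoint_services = toset(["ecr.dkr", "logs", "sts"])
}

resource "aws_vpc_endpoint" "interface" {
  for_each = local.interface_endpoint_services

  vpc_id              = aws_vpc.main.id
  service_name        = "com.amazonaws.${var.region}.${each.key}"
  vpc_endpoint_type   = "Interface"
  subnet_ids          = aws_subnet.private[*].id
  security_group_ids  = [aws_security_group.vpc_endpoints.id]
  private_dns_enabled = true

  tags = {
    Name = "${each.key}-endpoint"
  }
}
```

Keeping the list short matters for cost: each entry adds an endpoint per subnet, and every endpoint is billed hourly whether or not it carries traffic.
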
### Regional Optimization

**Co-locate resources in same region/AZ:**
```hcl
# Bad - cross-region data transfer is expensive
resource "aws_instance" "app" {
  availability_zone = "us-east-1a"
}

resource "aws_rds_cluster_instance" "writer" {
  availability_zone = "us-west-2a" # Different region!
}

# Good - same region and AZ when possible
resource "aws_instance" "app" {
  availability_zone = var.availability_zone
}

# Pin the cluster instance, not the cluster: RDS assigns three AZs to an
# aws_rds_cluster even if fewer are listed in availability_zones
resource "aws_rds_cluster_instance" "writer" {
  availability_zone = var.availability_zone # Same AZ
}
```

---

## Resource Lifecycle

### Scheduled Shutdown for Non-Production

**Lambda to stop/start instances:**
```hcl
resource "aws_lambda_function" "scheduler" {
  filename      = "scheduler.zip"
  function_name = "instance-scheduler"
  role          = aws_iam_role.scheduler.arn
  handler       = "scheduler.handler"
  runtime       = "python3.12"

  environment {
    variables = {
      TAG_KEY   = "Schedule"
      TAG_VALUE = "business-hours"
    }
  }
}

# EventBridge rule to stop instances at night
# (EventBridge cron expressions are evaluated in UTC)
resource "aws_cloudwatch_event_rule" "stop_instances" {
  name                = "stop-dev-instances"
  description         = "Stop dev instances at 7 PM UTC"
  schedule_expression = "cron(0 19 ? * MON-FRI *)" # 7 PM UTC, weekdays
}

resource "aws_cloudwatch_event_target" "stop" {
  rule      = aws_cloudwatch_event_rule.stop_instances.name
  target_id = "stop-instances"
  arn       = aws_lambda_function.scheduler.arn

  input = jsonencode({
    action = "stop"
  })
}

# Start instances in the morning
resource "aws_cloudwatch_event_rule" "start_instances" {
  name                = "start-dev-instances"
  description         = "Start dev instances at 8 AM UTC"
  schedule_expression = "cron(0 8 ? * MON-FRI *)" # 8 AM UTC, weekdays
}
```

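Two pieces are still needed before the schedule works end to end: a target for the start rule, and permission for EventBridge to invoke the function. The sketch below fills both in. It assumes the scheduler function accepts `action = "start"` in the same way it accepts `"stop"`; the `statement_id` values are hypothetical.

```hcl
resource "aws_cloudwatch_event_target" "start" {
  rule      = aws_cloudwatch_event_rule.start_instances.name
  target_id = "start-instances"
  arn       = aws_lambda_function.scheduler.arn

  # Assumed payload, mirroring the "stop" target
  input = jsonencode({
    action = "start"
  })
}

# Without these permissions the rules fire but the invocations are rejected
resource "aws_lambda_permission" "allow_eventbridge" {
  for_each = {
    stop  = aws_cloudwatch_event_rule.stop_instances.arn
    start = aws_cloudwatch_event_rule.start_instances.arn
  }

  statement_id  = "AllowExecutionFromEventBridge-${each.key}"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = each.value
}
```
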
**Tag instances for scheduling:**
```hcl
resource "aws_instance" "dev" {
  ami           = data.aws_ami.amazon_linux.id
  instance_type = "t3.medium"

  tags = {
    Name         = "dev-server"
    Environment  = "dev"
    Schedule     = "business-hours" # Scheduler will stop/start based on this
    AutoShutdown = "true"
  }
}
```

### Cleanup Old Resources

**S3 lifecycle for temporary data:**
```hcl
resource "aws_s3_bucket_lifecycle_configuration" "temp" {
  bucket = aws_s3_bucket.temp.id

  rule {
    id     = "cleanup-temp-files"
    status = "Enabled"

    filter {
      prefix = "temp/"
    }

    expiration {
      days = 7 # Delete after 7 days
    }

    abort_incomplete_multipart_upload {
      days_after_initiation = 1
    }
  }
}
```

---

## Cost Tagging

### Comprehensive Tagging Strategy

**Define tagging locals:**
```hcl
locals {
  common_tags = {
    # Cost allocation tags
    CostCenter  = var.cost_center
    Project     = var.project_name
    Environment = var.environment
    Owner       = var.team_email

    # Operational tags
    ManagedBy       = "Terraform"
    TerraformModule = basename(abspath(path.module))

    # Cost optimization tags
    AutoShutdown         = var.environment != "prod" ? "enabled" : "disabled"
    ReservationCandidate = var.environment == "prod" ? "true" : "false"
    CostOptimized        = "true"
  }
}

# Apply to all resources
resource "aws_instance" "app" {
  # ... configuration ...

  tags = merge(
    local.common_tags,
    {
      Name = "${var.environment}-app-server"
      Role = "application"
    }
  )
}
```

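Calling `merge()` on every resource is easy to forget. As an alternative sketch, the AWS provider's `default_tags` block applies a tag map to every taggable resource the provider creates, so individual resources only declare what is specific to them. This assumes the provider is configured in the root module and that `var.region` holds the target region.

```hcl
provider "aws" {
  region = var.region

  # Applied to every taggable resource created through this provider
  default_tags {
    tags = local.common_tags
  }
}

resource "aws_instance" "app" {
  # ... configuration ...

  # Only resource-specific tags are needed here; provider defaults are merged in
  tags = {
    Name = "${var.environment}-app-server"
    Role = "application"
  }
}
```

Some resources are not covered this way. In particular, the `tag` blocks on `aws_autoscaling_group` still need to be set explicitly, so check the provider documentation for the resources you use.
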
**Enforce tagging with AWS Config:**
```hcl
resource "aws_config_config_rule" "required_tags" {
  name = "required-tags"

  source {
    owner             = "AWS"
    source_identifier = "REQUIRED_TAGS"
  }

  input_parameters = jsonencode({
    tag1Key = "CostCenter"
    tag2Key = "Environment"
    tag3Key = "Owner"
  })

  depends_on = [aws_config_configuration_recorder.main]
}
```

---

## Monitoring and Alerts

### Budget Alerts

**AWS Budgets with Terraform:**
```hcl
resource "aws_budgets_budget" "monthly" {
  name              = "${var.environment}-monthly-budget"
  budget_type       = "COST"
  limit_amount      = var.monthly_budget
  limit_unit        = "USD"
  time_unit         = "MONTHLY"
  time_period_start = "2024-01-01_00:00"

  cost_filter {
    name = "TagKeyValue"
    values = [
      # Format is user:<TagKey>$<TagValue>. format() is used because "$${" in a
      # string is an escape sequence and would suppress the interpolation.
      format("user:Environment$%s", var.environment)
    ]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = [var.budget_alert_email]
  }
}
```

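Alerts on actual spend arrive after the money is spent. AWS Budgets can also notify on forecasted spend, which leaves time to react before the limit is crossed. The sketch below shows this as a separate, minimal budget so it stands on its own. The resource name is hypothetical, and in practice the `notification` block would more likely be added to the existing monthly budget.

```hcl
resource "aws_budgets_budget" "monthly_forecast" {
  name         = "${var.environment}-monthly-forecast" # hypothetical name
  budget_type  = "COST"
  limit_amount = var.monthly_budget
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  # Fires when the projected month-end spend exceeds the budget
  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = [var.budget_alert_email]
  }
}
```
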
### Cost Anomaly Detection

```hcl
resource "aws_ce_anomaly_monitor" "service" {
  name              = "${var.environment}-service-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name      = "${var.environment}-anomaly-alerts"
  frequency = "DAILY"

  monitor_arn_list = [
    aws_ce_anomaly_monitor.service.arn
  ]

  subscriber {
    type    = "EMAIL"
    address = var.cost_alert_email
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"] # Alert on $100+ anomalies
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}
```

---

## Multi-Cloud Considerations

### Azure Cost Optimization

**Use Azure Hybrid Benefit:**
```hcl
resource "azurerm_linux_virtual_machine" "main" {
  # ... configuration ...

  # Use Azure Hybrid Benefit for licensing savings
  license_type = "RHEL_BYOS" # or "SLES_BYOS"
}
```

**Azure Reserved Instances (outside Terraform):**
- Purchase through Azure Portal
- Tag VMs with `ReservationGroup` for planning

### GCP Cost Optimization

**Use committed use discounts and preemptible capacity:**
```hcl
resource "google_compute_instance" "main" {
  # ... configuration ...

  # Committed use discounts are purchased separately and apply to matching
  # usage; the scheduling block only controls preemptibility.
  scheduling {
    # Preemptible VMs require automatic_restart = false and on_host_maintenance = "TERMINATE"
    automatic_restart   = var.environment == "prod"
    on_host_maintenance = var.environment == "prod" ? "MIGRATE" : "TERMINATE"
    preemptible         = var.environment != "prod" # Preemptible for non-prod
  }
}
```

**GCP Preemptible VMs:**
```hcl
resource "google_compute_instance_template" "preemptible" {
  # ... disk and network configuration ...

  machine_type = "n1-standard-1"

  scheduling {
    automatic_restart   = false
    on_host_maintenance = "TERMINATE"
    preemptible         = true # Up to 80% cost reduction
  }
}
```
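
Google Cloud also offers Spot VMs, which are configured through the same `scheduling` block with `provisioning_model = "SPOT"`. The sketch below shows a complete template under that model. The template name prefix, boot image, and network are assumptions for illustration; substitute the values your project uses.

```hcl
resource "google_compute_instance_template" "spot" {
  name_prefix  = "spot-" # hypothetical prefix
  machine_type = "n1-standard-1"

  disk {
    source_image = "debian-cloud/debian-12" # assumed image
    boot         = true
    auto_delete  = true
  }

  network_interface {
    network = "default" # assumed network
  }

  scheduling {
    provisioning_model          = "SPOT"
    preemptible                 = true
    automatic_restart           = false
    on_host_maintenance         = "TERMINATE"
    instance_termination_action = "STOP" # or "DELETE" for disposable workers
  }
}
```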

---

## Cost Optimization Checklist

### Before Deployment
- [ ] Right-size compute resources (start small)
- [ ] Use appropriate storage tiers
- [ ] Enable auto-scaling instead of over-provisioning
- [ ] Implement tagging strategy
- [ ] Configure lifecycle policies
- [ ] Set up VPC endpoints for AWS services

### After Deployment
- [ ] Monitor actual usage vs. provisioned capacity
- [ ] Review cost allocation tags
- [ ] Identify reservation opportunities
- [ ] Configure budget alerts
- [ ] Enable cost anomaly detection
- [ ] Schedule non-production resource shutdown

### Ongoing
- [ ] Monthly cost review
- [ ] Quarterly right-sizing analysis
- [ ] Annual reservation review
- [ ] Remove unused resources
- [ ] Optimize data transfer patterns
- [ ] Update instance families (new generations are often cheaper)

---

## Cost Estimation Tools

### Use `infracost` in CI/CD

```bash
# Install infracost
curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh

# Generate cost estimate
infracost breakdown --path .

# Compare cost changes in PR: save a baseline on the base branch, then diff against it
infracost breakdown --path . --format json --out-file infracost-base.json
infracost diff --path . --compare-to infracost-base.json
```

### Terraform Cloud Cost Estimation

Enable cost estimation in the Terraform Cloud organization settings to get automatic cost estimates on every plan.

---

## Additional Resources

- AWS Cost Optimization: https://aws.amazon.com/pricing/cost-optimization/
- Azure Cost Management: https://azure.microsoft.com/en-us/products/cost-management/
- GCP Cost Management: https://cloud.google.com/cost-management
- Infracost: https://www.infracost.io/
- Cloud Cost Optimization Tools: Kubecost, CloudHealth, CloudCheckr