# Terraform Best Practices
Comprehensive guide to Terraform best practices for infrastructure as code.
## Table of Contents
1. [Project Structure](#project-structure)
2. [State Management](#state-management)
3. [Module Design](#module-design)
4. [Variable Management](#variable-management)
5. [Resource Naming](#resource-naming)
6. [Security Practices](#security-practices)
7. [Testing & Validation](#testing--validation)
8. [CI/CD Integration](#cicd-integration)
---
## Project Structure
### Recommended Directory Layout
```
terraform-project/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   └── prod/
├── modules/
│   ├── networking/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   └── README.md
│   ├── compute/
│   └── database/
├── global/
│   ├── iam/
│   └── dns/
└── README.md
```
### Key Principles
**Separate Environments**
- Use directories for each environment (dev, staging, prod)
- Each environment has its own state file
- Prevents accidental changes to wrong environment
**Reusable Modules**
- Common infrastructure patterns in modules/
- Modules are versioned and tested
- Used across multiple environments
**Global Resources**
- Resources shared across environments (IAM, DNS)
- Separate state for better isolation
- Carefully managed with extra review
---
## State Management
### Remote State is Essential
**Why Remote State:**
- Team collaboration and locking
- State backup and versioning
- Secure credential handling
- Disaster recovery
**Recommended Backend: S3 + DynamoDB**
```hcl
terraform {
backend "s3" {
bucket = "company-terraform-state"
key = "prod/networking/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-state-lock"
kms_key_id = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY_ID"
}
}
```
**State Best Practices:**
1. **Enable Encryption**: Always encrypt state at rest
2. **Enable Versioning**: On S3 bucket for state recovery
3. **Use State Locking**: A DynamoDB table prevents concurrent modifications (backing resources sketched after this list)
4. **Restrict Access**: IAM policies limiting who can read/write state
5. **Separate State Files**: Different states for different components
6. **Regular Backups**: Automated backups of state files
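The bucket and lock table behind items 1-3 can themselves be managed by Terraform, usually in a small bootstrap configuration applied once. A minimal sketch; the bucket and table names are illustrative:
```hcl
resource "aws_s3_bucket" "tf_state" {
  bucket = "company-terraform-state" # Illustrative; must be globally unique
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # The S3 backend requires exactly this attribute name
  attribute {
    name = "LockID"
    type = "S"
  }
}
```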
### State File Organization
**Bad - Single State:**
```
terraform.tfstate (contains everything)
```
**Good - Multiple States:**
```
networking/terraform.tfstate
compute/terraform.tfstate
database/terraform.tfstate
```
**Benefits:**
- Reduced blast radius
- Faster plan/apply operations
- Parallel team work
- Easier to understand and debug
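When components live in separate states, one configuration can read another's outputs through the `terraform_remote_state` data source. A sketch, assuming the networking state publishes a `vpc_id` output:
```hcl
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "app" {
  # Consume an output exported by the networking state
  vpc_id = data.terraform_remote_state.networking.outputs.vpc_id
}
```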
### State Management Commands
```bash
# List resources in state
terraform state list
# Show specific resource
terraform state show aws_instance.example
# Move resource to different address
terraform state mv aws_instance.old aws_instance.new
# Remove resource from state (doesn't destroy)
terraform state rm aws_instance.example
# Import existing resource
terraform import aws_instance.example i-1234567890abcdef0
# Pull state for inspection (read-only)
terraform state pull > state.json
```
---
## Module Design
### Module Structure
Every module should have:
```
module-name/
├── main.tf       # Primary resources
├── variables.tf  # Input variables
├── outputs.tf    # Output values
├── versions.tf   # Version constraints
├── README.md     # Documentation
└── examples/     # Usage examples
    └── complete/
        ├── main.tf
        └── variables.tf
```
### Module Best Practices
**1. Single Responsibility**
Each module should do one thing well:
- ✅ `vpc-module` creates a VPC with subnets, route tables, and NACLs
- ❌ `infrastructure` creates VPC, EC2, RDS, S3, and everything else
**2. Composability**
Modules should work together:
```hcl
module "vpc" {
source = "./modules/vpc"
cidr = "10.0.0.0/16"
}
module "eks" {
source = "./modules/eks"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
}
```
**3. Sensible Defaults**
```hcl
variable "instance_type" {
type = string
description = "EC2 instance type"
default = "t3.micro" # Reasonable default
}
variable "enable_monitoring" {
type = bool
description = "Enable detailed monitoring"
default = false # Cost-effective default
}
```
**4. Complete Documentation**
```hcl
variable "vpc_cidr" {
type = string
description = "CIDR block for VPC. Must be a valid IPv4 CIDR."
validation {
condition = can(cidrhost(var.vpc_cidr, 0))
error_message = "Must be a valid IPv4 CIDR block."
}
}
```
**5. Output Useful Values**
```hcl
output "vpc_id" {
description = "ID of the VPC"
value = aws_vpc.main.id
}
output "private_subnet_ids" {
description = "List of private subnet IDs for deploying workloads"
value = aws_subnet.private[*].id
}
output "nat_gateway_ips" {
description = "Elastic IPs of NAT gateways for firewall whitelisting"
value = aws_eip.nat[*].public_ip
}
```
### Module Versioning
**Use Git Tags for Versioning:**
```hcl
module "vpc" {
source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.2.3"
# Configuration...
}
```
**Semantic Versioning:**
- v1.0.0 → First stable release
- v1.1.0 → New features (backward compatible)
- v1.1.1 → Bug fixes
- v2.0.0 → Breaking changes
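The `versions.tf` listed in the module structure is where a module pins the Terraform and provider versions it is tested against; a typical sketch:
```hcl
terraform {
  required_version = ">= 1.3.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.0, < 6.0" # Tested range; widen deliberately
    }
  }
}
```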
---
## Variable Management
### Variable Declaration
**Always Include:**
```hcl
variable "environment" {
type = string
description = "Environment name (dev, staging, prod)"
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be dev, staging, or prod."
}
}
```
### Variable Files Hierarchy
```
terraform.tfvars # Default values (committed, no secrets)
dev.tfvars # Dev overrides
prod.tfvars # Prod overrides
secrets.auto.tfvars # Auto-loaded (in .gitignore)
```
**Usage:**
```bash
terraform apply -var-file="prod.tfvars"
```
### Sensitive Variables
**Mark as Sensitive:**
```hcl
variable "database_password" {
type = string
description = "Master password for database"
sensitive = true
}
```
**Never commit secrets:**
```bash
# .gitignore
*.auto.tfvars
secrets.tfvars
terraform.tfvars # If contains secrets
```
**Better: Use External Secret Management**
```hcl
data "aws_secretsmanager_secret_version" "db_password" {
secret_id = "prod/database/master-password"
}
resource "aws_db_instance" "main" {
password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```
### Variable Organization
**Group related variables:**
```hcl
# Network Configuration
variable "vpc_cidr" { }
variable "availability_zones" { }
variable "public_subnet_cidrs" { }
variable "private_subnet_cidrs" { }
# Application Configuration
variable "app_name" { }
variable "app_version" { }
variable "instance_count" { }
# Tagging
variable "tags" {
type = map(string)
description = "Common tags for all resources"
default = {}
}
```
---
## Resource Naming
### Naming Conventions
**Terraform Resources (snake_case):**
```hcl
resource "aws_vpc" "main_vpc" { }
resource "aws_subnet" "public_subnet_az1" { }
resource "aws_instance" "web_server_01" { }
```
**AWS Resource Names (kebab-case):**
```hcl
resource "aws_s3_bucket" "logs" {
bucket = "company-prod-application-logs"
# company-{env}-{service}-{purpose}
}
resource "aws_instance" "web" {
tags = {
Name = "prod-web-server-01"
# {env}-{service}-{type}-{number}
}
}
```
### Naming Standards
**Pattern: `{company}-{environment}-{service}-{resource_type}`**
Examples:
- `acme-prod-api-alb`
- `acme-dev-workers-asg`
- `acme-staging-database-rds`
**Benefits:**
- Easy filtering in AWS console
- Clear ownership and purpose
- Consistent across environments
- Billing and cost tracking
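One way to keep the pattern consistent is to assemble names from a shared local instead of typing them by hand. A sketch; the `company` and `service` variables are illustrative:
```hcl
locals {
  # {company}-{environment}-{service}
  name_prefix = "${var.company}-${var.environment}-${var.service}"
}

resource "aws_lb" "api" {
  name = "${local.name_prefix}-alb" # e.g. acme-prod-api-alb
  # ...
}
```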
---
## Security Practices
### 1. Principle of Least Privilege
```hcl
# Bad - Too permissive
resource "aws_iam_policy" "bad" {
policy = jsonencode({
Statement = [{
Effect = "Allow"
Action = "*"
Resource = "*"
}]
})
}
# Good - Specific permissions
resource "aws_iam_policy" "good" {
policy = jsonencode({
Statement = [{
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:PutObject"
]
Resource = "arn:aws:s3:::my-bucket/*"
}]
})
}
```
### 2. Encryption Everywhere
```hcl
# Encrypt S3 buckets
resource "aws_s3_bucket" "secure" {
bucket = "my-secure-bucket"
}
resource "aws_s3_bucket_server_side_encryption_configuration" "secure" {
bucket = aws_s3_bucket.secure.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.bucket.arn
}
}
}
# Encrypt EBS volumes
resource "aws_instance" "secure" {
root_block_device {
encrypted = true
}
}
# Encrypt RDS databases
resource "aws_db_instance" "secure" {
storage_encrypted = true
kms_key_id = aws_kms_key.rds.arn
}
```
### 3. Network Security
```hcl
# Restrictive security groups
resource "aws_security_group" "web" {
name_prefix = "web-"
# Only allow specific inbound
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"] # Consider restricting further
}
# Explicit outbound
egress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
}
# Use private subnets for workloads
resource "aws_subnet" "private" {
map_public_ip_on_launch = false # No public IPs
}
```
### 4. Secret Management
**Never in Code:**
```hcl
# ❌ NEVER DO THIS
resource "aws_db_instance" "bad" {
password = "MySecretPassword123" # NEVER!
}
```
**Use AWS Secrets Manager:**
```hcl
# ✅ CORRECT APPROACH
data "aws_secretsmanager_secret_version" "db" {
secret_id = var.db_secret_arn
}
resource "aws_db_instance" "good" {
password = data.aws_secretsmanager_secret_version.db.secret_string
}
```
### 5. Resource Tagging
```hcl
locals {
common_tags = {
Environment = var.environment
ManagedBy = "Terraform"
Owner = "platform-team"
Project = var.project_name
CostCenter = var.cost_center
}
}
resource "aws_instance" "web" {
tags = merge(
local.common_tags,
{
Name = "web-server"
Role = "webserver"
}
)
}
```
---
## Testing & Validation
### Pre-Deployment Validation
**1. Terraform Validate**
```bash
terraform validate
```
Checks syntax and configuration validity.
**2. Terraform Plan**
```bash
terraform plan -out=tfplan
```
Review changes before applying.
**3. tflint**
```bash
tflint --module
```
Linter for catching errors and enforcing conventions.
**4. checkov**
```bash
checkov -d .
```
Security and compliance scanning.
**5. terraform-docs**
```bash
terraform-docs markdown . > README.md
```
Auto-generate documentation.
### Automated Testing
**Terratest (Go):**
```go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../examples/complete",
    })
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)
    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)
}
```
---
## CI/CD Integration
### GitHub Actions Example
```yaml
name: Terraform
on:
pull_request:
branches: [main]
push:
branches: [main]
jobs:
terraform:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
- name: Terraform Init
run: terraform init
- name: Terraform Validate
run: terraform validate
- name: Terraform Plan
run: terraform plan -no-color
if: github.event_name == 'pull_request'
- name: Terraform Apply
run: terraform apply -auto-approve
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
```
### Best Practices for CI/CD
1. **Always run plan on PRs** - Review changes before merge
2. **Require approvals** - Human review for production
3. **Use workspaces or directories** - Separate pipeline per environment
4. **Store state remotely** - S3 backend with locking
5. **Use credential management** - OIDC or IAM roles, never store long-lived credentials (role sketch below)
6. **Run security scans** - checkov, tfsec in pipeline
7. **Tag releases** - Version your infrastructure code
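The OIDC trust from item 5 can itself be provisioned with Terraform. A minimal sketch for GitHub Actions; the repository name is illustrative, and the thumbprint should be verified against GitHub's current published value:
```hcl
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"] # Verify current value
}

resource "aws_iam_role" "terraform_ci" {
  name = "terraform-ci"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        StringLike = {
          # Only workflows from this repository may assume the role
          "token.actions.githubusercontent.com:sub" = "repo:company/infrastructure:*"
        }
      }
    }]
  })
}
```
Attach a least-privilege policy to this role rather than administrator access, per the security practices above.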
---
## Common Pitfalls to Avoid
### 1. Not Using Remote State
- ❌ Local state doesn't work for teams
- ✅ Use S3, Terraform Cloud, or other remote backend
### 2. Hardcoding Values
- ❌ `region = "us-east-1"` hardcoded in every resource
- ✅ Use variables and locals
### 3. Not Using Modules
- ❌ Copying code between environments
- ✅ Create reusable modules
### 4. Ignoring State
- ❌ Manually modifying infrastructure
- ✅ All changes through Terraform
### 5. Poor Naming
- ❌ `resource "aws_instance" "i1" { }`
- ✅ `resource "aws_instance" "web_server_01" { }`
### 6. No Documentation
- ❌ No README, no comments
- ✅ Document everything
### 7. Massive State Files
- ❌ Single state for entire infrastructure
- ✅ Break into logical components
### 8. No Testing
- ❌ Apply directly to production
- ✅ Test in dev/staging first
---
## Quick Reference
### Essential Commands
```bash
# Initialize
terraform init
# Validate configuration
terraform validate
# Format code
terraform fmt -recursive
# Plan changes
terraform plan
# Apply changes
terraform apply
# Destroy resources
terraform destroy
# Show current state
terraform show
# List resources
terraform state list
# Output values
terraform output
```
### Useful Flags
```bash
# Plan without color
terraform plan -no-color
# Apply without prompts
terraform apply -auto-approve
# Destroy specific resource
terraform destroy -target=aws_instance.example
# Use specific var file
terraform apply -var-file="prod.tfvars"
# Set variable via CLI
terraform apply -var="environment=prod"
```

# Terraform Cost Optimization Guide
Strategies for optimizing cloud infrastructure costs when using Terraform.
## Table of Contents
1. [Right-Sizing Resources](#right-sizing-resources)
2. [Spot and Reserved Instances](#spot-and-reserved-instances)
3. [Storage Optimization](#storage-optimization)
4. [Networking Costs](#networking-costs)
5. [Resource Lifecycle](#resource-lifecycle)
6. [Cost Tagging](#cost-tagging)
7. [Monitoring and Alerts](#monitoring-and-alerts)
8. [Multi-Cloud Considerations](#multi-cloud-considerations)
---
## Right-Sizing Resources
### Compute Resources
**Start small, scale up:**
```hcl
variable "instance_type" {
type = string
description = "EC2 instance type"
default = "t3.micro" # Start with smallest reasonable size
validation {
condition = can(regex("^t[0-9]", var.instance_type)) # Covers t2/t3/t3a/t4g
error_message = "Consider starting with burstable (t-series) instances for cost optimization."
}
}
```
**Use auto-scaling instead of over-provisioning:**
```hcl
resource "aws_autoscaling_group" "app" {
min_size = 2 # Minimum for HA
desired_capacity = 2 # Normal load
max_size = 10 # Peak load
# Scale based on actual usage
target_group_arns = [aws_lb_target_group.app.arn]
tag {
key = "Environment"
value = var.environment
propagate_at_launch = true
}
}
```
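The "scale based on actual usage" comment is typically implemented with a target-tracking policy; a sketch that holds average CPU near 50%:
```hcl
resource "aws_autoscaling_policy" "cpu" {
  name                   = "target-cpu-50"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 50.0 # Add instances above this, remove below
  }
}
```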
### Database Right-Sizing
**Start with appropriate size:**
```hcl
resource "aws_db_instance" "main" {
instance_class = var.environment == "prod" ? "db.t3.medium" : "db.t3.micro"
# Enable auto-scaling for storage
allocated_storage = 20
max_allocated_storage = 100 # Auto-scale up to 100GB
# Use cheaper storage for non-prod
storage_type = var.environment == "prod" ? "io1" : "gp3"
}
```
---
## Spot and Reserved Instances
### Spot Instances for Non-Critical Workloads
**Launch Template for Spot:**
```hcl
resource "aws_launch_template" "spot" {
name_prefix = "spot-"
image_id = data.aws_ami.amazon_linux.id
instance_type = "t3.medium"
instance_market_options {
market_type = "spot"
spot_options {
max_price = "0.05" # Set price limit
spot_instance_type = "one-time"
instance_interruption_behavior = "terminate"
}
}
tag_specifications {
resource_type = "instance"
tags = {
Name = "spot-instance"
Workload = "non-critical"
CostSavings = "true"
}
}
}
resource "aws_autoscaling_group" "spot" {
desired_capacity = 5
max_size = 10
min_size = 0
mixed_instances_policy {
instances_distribution {
on_demand_percentage_above_base_capacity = 20 # 20% on-demand, 80% spot
spot_allocation_strategy = "capacity-optimized"
}
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.spot.id
version = "$Latest"
}
# Multiple instance types increase spot availability
override {
instance_type = "t3.medium"
}
override {
instance_type = "t3.large"
}
override {
instance_type = "t3a.medium"
}
}
}
}
```
### Reserved Instances (Use Outside Terraform)
Terraform shouldn't manage reservations directly; instead:
- Tag resources consistently for reservation planning
- Use Compute or EC2 Instance Savings Plans for flexibility
- Monitor usage patterns to inform reservation purchases
**Tagging for reservation analysis:**
```hcl
locals {
reservation_tags = {
ReservationCandidate = var.environment == "prod" ? "true" : "false"
UsagePattern = "steady-state" # or "variable", "burst"
CostCenter = var.cost_center
}
}
```
---
## Storage Optimization
### S3 Lifecycle Policies
**Automatic tiering:**
```hcl
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
bucket = aws_s3_bucket.logs.id
rule {
id = "log-retention"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA" # Infrequent Access after 30 days
}
transition {
days = 90
storage_class = "GLACIER_IR" # Instant Retrieval Glacier after 90 days
}
transition {
days = 180
storage_class = "DEEP_ARCHIVE" # Deep Archive after 180 days
}
expiration {
days = 365 # Delete after 1 year
}
}
}
```
**Intelligent tiering for variable access:**
```hcl
resource "aws_s3_bucket_intelligent_tiering_configuration" "assets" {
bucket = aws_s3_bucket.assets.id
name = "entire-bucket"
tiering {
access_tier = "ARCHIVE_ACCESS"
days = 90
}
tiering {
access_tier = "DEEP_ARCHIVE_ACCESS"
days = 180
}
}
```
### EBS Volume Optimization
**Use appropriate volume types:**
```hcl
resource "aws_instance" "app" {
ami = data.aws_ami.amazon_linux.id
instance_type = "t3.medium"
root_block_device {
volume_type = "gp3" # gp3 is cheaper than gp2 with better baseline
volume_size = 20
iops = 3000 # Default, only pay more if you need more
throughput = 125 # Default
encrypted = true
# Delete on termination to avoid orphaned volumes
delete_on_termination = true
}
tags = {
Name = "app-server"
}
}
```
**Snapshot lifecycle:**
```hcl
resource "aws_dlm_lifecycle_policy" "snapshots" {
description = "EBS snapshot lifecycle"
execution_role_arn = aws_iam_role.dlm.arn
state = "ENABLED"
policy_details {
resource_types = ["VOLUME"]
schedule {
name = "Daily snapshots"
create_rule {
interval = 24
interval_unit = "HOURS"
times = ["03:00"]
}
retain_rule {
count = 7 # Keep only 7 days of snapshots
}
copy_tags = true
}
target_tags = {
BackupEnabled = "true"
}
}
}
```
---
## Networking Costs
### Minimize Data Transfer
**Use VPC endpoints to avoid NAT charges:**
```hcl
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.s3"
route_table_ids = [
aws_route_table.private.id
]
tags = {
Name = "s3-endpoint"
CostSavings = "reduces-nat-charges"
}
}
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.dynamodb"
route_table_ids = [
aws_route_table.private.id
]
}
```
**Interface endpoints for AWS services:**
```hcl
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.ecr.api"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
tags = {
Name = "ecr-api-endpoint"
CostSavings = "reduces-nat-data-transfer"
}
}
```
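The `aws_security_group.vpc_endpoints` referenced above isn't shown; a minimal sketch that admits HTTPS only from inside the VPC:
```hcl
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "vpc-endpoints-"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.main.cidr_block] # VPC-internal traffic only
  }
}
```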
### Regional Optimization
**Co-locate resources in same region/AZ:**
```hcl
# Bad - cross-region data transfer is expensive
resource "aws_instance" "app" {
availability_zone = "us-east-1a"
}
resource "aws_rds_cluster" "main" {
availability_zones = ["us-west-2a"] # Different region!
}
# Good - same region and AZ when possible
resource "aws_instance" "app" {
availability_zone = var.availability_zone
}
resource "aws_rds_cluster" "main" {
availability_zones = [var.availability_zone] # Same AZ
}
```
---
## Resource Lifecycle
### Scheduled Shutdown for Non-Production
**Lambda to stop/start instances:**
```hcl
resource "aws_lambda_function" "scheduler" {
filename = "scheduler.zip"
function_name = "instance-scheduler"
role = aws_iam_role.scheduler.arn
handler = "scheduler.handler"
runtime = "python3.9"
environment {
variables = {
TAG_KEY = "Schedule"
TAG_VALUE = "business-hours"
}
}
}
# EventBridge rule to stop instances at night
resource "aws_cloudwatch_event_rule" "stop_instances" {
name = "stop-dev-instances"
description = "Stop dev instances at 7 PM"
schedule_expression = "cron(0 19 ? * MON-FRI *)" # 7 PM weekdays (EventBridge cron runs in UTC)
}
resource "aws_cloudwatch_event_target" "stop" {
rule = aws_cloudwatch_event_rule.stop_instances.name
target_id = "stop-instances"
arn = aws_lambda_function.scheduler.arn
input = jsonencode({
action = "stop"
})
}
# Start instances in the morning
resource "aws_cloudwatch_event_rule" "start_instances" {
name = "start-dev-instances"
description = "Start dev instances at 8 AM"
schedule_expression = "cron(0 8 ? * MON-FRI *)" # 8 AM weekdays (UTC)
}
```
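The example wires a target only for the stop rule; the start rule needs its own target, and EventBridge needs permission to invoke the function. A sketch completing the wiring:
```hcl
resource "aws_cloudwatch_event_target" "start" {
  rule      = aws_cloudwatch_event_rule.start_instances.name
  target_id = "start-instances"
  arn       = aws_lambda_function.scheduler.arn
  input = jsonencode({
    action = "start"
  })
}

# Without these, EventBridge cannot invoke the Lambda
resource "aws_lambda_permission" "allow_stop_rule" {
  statement_id  = "AllowStopRule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.stop_instances.arn
}

resource "aws_lambda_permission" "allow_start_rule" {
  statement_id  = "AllowStartRule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.start_instances.arn
}
```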
**Tag instances for scheduling:**
```hcl
resource "aws_instance" "dev" {
ami = data.aws_ami.amazon_linux.id
instance_type = "t3.medium"
tags = {
Name = "dev-server"
Environment = "dev"
Schedule = "business-hours" # Scheduler will stop/start based on this
AutoShutdown = "true"
}
}
```
### Cleanup Old Resources
**S3 lifecycle for temporary data:**
```hcl
resource "aws_s3_bucket_lifecycle_configuration" "temp" {
bucket = aws_s3_bucket.temp.id
rule {
id = "cleanup-temp-files"
status = "Enabled"
filter {
prefix = "temp/"
}
expiration {
days = 7 # Delete after 7 days
}
abort_incomplete_multipart_upload {
days_after_initiation = 1
}
}
}
```
---
## Cost Tagging
### Comprehensive Tagging Strategy
**Define tagging locals:**
```hcl
locals {
common_tags = {
# Cost allocation tags
CostCenter = var.cost_center
Project = var.project_name
Environment = var.environment
Owner = var.team_email
# Operational tags
ManagedBy = "Terraform"
TerraformModule = basename(abspath(path.module))
# Cost optimization tags
AutoShutdown = var.environment != "prod" ? "enabled" : "disabled"
ReservationCandidate = var.environment == "prod" ? "true" : "false"
CostOptimized = "true"
}
}
# Apply to all resources
resource "aws_instance" "app" {
# ... configuration ...
tags = merge(
local.common_tags,
{
Name = "${var.environment}-app-server"
Role = "application"
}
)
}
```
**Enforce tagging with AWS Config:**
```hcl
resource "aws_config_config_rule" "required_tags" {
name = "required-tags"
source {
owner = "AWS"
source_identifier = "REQUIRED_TAGS"
}
input_parameters = jsonencode({
tag1Key = "CostCenter"
tag2Key = "Environment"
tag3Key = "Owner"
})
depends_on = [aws_config_configuration_recorder.main]
}
```
---
## Monitoring and Alerts
### Budget Alerts
**AWS Budgets with Terraform:**
```hcl
resource "aws_budgets_budget" "monthly" {
name = "${var.environment}-monthly-budget"
budget_type = "COST"
limit_amount = var.monthly_budget
limit_unit = "USD"
time_unit = "MONTHLY"
time_period_start = "2024-01-01_00:00"
cost_filter {
name = "TagKeyValue"
# Tag filter values take the form "user:TagKey$TagValue"
values = [
format("user:Environment$%s", var.environment)
]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = [var.budget_alert_email]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = [var.budget_alert_email]
}
}
```
### Cost Anomaly Detection
```hcl
resource "aws_ce_anomaly_monitor" "service" {
name = "${var.environment}-service-monitor"
monitor_type = "DIMENSIONAL"
monitor_dimension = "SERVICE"
}
resource "aws_ce_anomaly_subscription" "alerts" {
name = "${var.environment}-anomaly-alerts"
frequency = "DAILY"
monitor_arn_list = [
aws_ce_anomaly_monitor.service.arn
]
subscriber {
type = "EMAIL"
address = var.cost_alert_email
}
threshold_expression {
dimension {
key = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
values = ["100"] # Alert on $100+ anomalies
match_options = ["GREATER_THAN_OR_EQUAL"]
}
}
}
```
---
## Multi-Cloud Considerations
### Azure Cost Optimization
**Use Azure Hybrid Benefit:**
```hcl
resource "azurerm_linux_virtual_machine" "main" {
# ... configuration ...
# Use Azure Hybrid Benefit for licensing savings
license_type = "RHEL_BYOS" # or "SLES_BYOS"
}
```
**Azure Reserved Instances (outside Terraform):**
- Purchase through Azure Portal
- Tag VMs with `ReservationGroup` for planning
### GCP Cost Optimization
**Committed use discounts are purchased outside Terraform (like AWS reservations); within Terraform, use preemptible scheduling for non-production:**
```hcl
resource "google_compute_instance" "main" {
# ... configuration ...
scheduling {
preemptible = var.environment != "prod" # Preemptible for non-prod
# Preemptible VMs cannot auto-restart and must stop for host maintenance
automatic_restart = var.environment == "prod"
on_host_maintenance = var.environment == "prod" ? "MIGRATE" : "TERMINATE"
}
}
```
**GCP Preemptible VMs:**
```hcl
resource "google_compute_instance_template" "preemptible" {
machine_type = "n1-standard-1"
scheduling {
automatic_restart = false
on_host_maintenance = "TERMINATE"
preemptible = true # Up to 80% cost reduction
}
}
```
---
## Cost Optimization Checklist
### Before Deployment
- [ ] Right-size compute resources (start small)
- [ ] Use appropriate storage tiers
- [ ] Enable auto-scaling instead of over-provisioning
- [ ] Implement tagging strategy
- [ ] Configure lifecycle policies
- [ ] Set up VPC endpoints for AWS services
### After Deployment
- [ ] Monitor actual usage vs. provisioned capacity
- [ ] Review cost allocation tags
- [ ] Identify reservation opportunities
- [ ] Configure budget alerts
- [ ] Enable cost anomaly detection
- [ ] Schedule non-production resource shutdown
### Ongoing
- [ ] Monthly cost review
- [ ] Quarterly right-sizing analysis
- [ ] Annual reservation review
- [ ] Remove unused resources
- [ ] Optimize data transfer patterns
- [ ] Update instance families (new generations are often cheaper)
---
## Cost Estimation Tools
### Use `infracost` in CI/CD
```bash
# Install infracost
curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh
# Generate cost estimate
infracost breakdown --path .
# Compare cost changes in PR
infracost diff --path . --compare-to tfplan.json
```
### Terraform Cloud Cost Estimation
Enable in Terraform Cloud workspace settings for automatic cost estimates on every plan.
---
## Additional Resources
- AWS Cost Optimization: https://aws.amazon.com/pricing/cost-optimization/
- Azure Cost Management: https://azure.microsoft.com/en-us/products/cost-management/
- GCP Cost Management: https://cloud.google.com/cost-management
- Infracost: https://www.infracost.io/
- Cloud Cost Optimization Tools: Kubecost, CloudHealth, CloudCheckr

# Terraform Troubleshooting Guide
Common Terraform and Terragrunt issues with solutions.
## Table of Contents
1. [State Issues](#state-issues)
2. [Provider Issues](#provider-issues)
3. [Resource Errors](#resource-errors)
4. [Module Issues](#module-issues)
5. [Terragrunt Specific](#terragrunt-specific)
6. [Performance Issues](#performance-issues)
---
## State Issues
### State Lock Error
**Symptom:**
```
Error locking state: Error acquiring the state lock
Lock Info:
ID: abc123...
Path: terraform.tfstate
Operation: OperationTypeApply
Who: user@hostname
Created: 2024-01-15 10:30:00 UTC
```
**Common Causes:**
1. Previous operation crashed or was interrupted
2. Another user/process is running terraform
3. State lock wasn't released properly
**Resolution:**
1. **Verify no one else is running terraform:**
```bash
# Check with team first!
```
2. **Force unlock (use with caution):**
```bash
terraform force-unlock abc123
```
3. **For DynamoDB backend, check lock table:**
```bash
aws dynamodb get-item \
--table-name terraform-state-lock \
--key '{"LockID": {"S": "path/to/state/terraform.tfstate-md5"}}'
```
**Prevention:**
- Use proper state locking backend (S3 + DynamoDB)
- Implement timeout in CI/CD pipelines
- Always let terraform complete or properly cancel
---
### State Drift Detected
**Symptom:**
```
Note: Objects have changed outside of Terraform
Terraform detected the following changes made outside of Terraform
since the last "terraform apply":
```
**Common Causes:**
1. Manual changes in AWS console
2. Another tool modifying resources
3. Auto-scaling or auto-remediation
**Resolution:**
1. **Review the drift:**
```bash
terraform plan -detailed-exitcode
```
2. **Options:**
- **Import changes:** Update terraform to match reality
- **Revert changes:** Apply terraform to restore desired state
- **Refresh state:** `terraform apply -refresh-only`
3. **Import specific changes:**
```bash
# Update your .tf files, then:
terraform plan # Verify it matches
terraform apply
```
**Prevention:**
- Implement policy to prevent manual changes
- Use AWS Config rules to detect drift
- Regular `terraform plan` to catch drift early
- Consider using Terraform Cloud drift detection
---
### State Corruption
**Symptom:**
```
Error: Failed to load state
Error: state snapshot was created by Terraform v1.5.0,
which is newer than current v1.3.0
```
**Common Causes:**
1. Using different Terraform versions
2. State file manually edited
3. Incomplete state upload
**Resolution:**
1. **Version mismatch:**
```bash
# Upgrade to matching version
tfenv install 1.5.0
tfenv use 1.5.0
```
2. **Restore from backup:**
```bash
# For S3 backend with versioning
aws s3api list-object-versions \
--bucket terraform-state \
--prefix prod/terraform.tfstate
# Restore specific version
aws s3api get-object \
--bucket terraform-state \
--key prod/terraform.tfstate \
--version-id VERSION_ID \
terraform.tfstate
```
3. **Rebuild state (last resort):**
```bash
# Remove corrupted state
terraform state rm aws_instance.example
# Re-import resources
terraform import aws_instance.example i-1234567890abcdef0
```
**Prevention:**
- Pin Terraform version in `versions.tf` (see the snippet after this list)
- Enable S3 versioning for state bucket
- Never manually edit state files
- Use consistent Terraform versions across team
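A minimal pin that makes version mismatches fail fast and keeps the whole team on the same release:
```hcl
terraform {
  # Everyone, including CI, runs the same minor version
  required_version = "~> 1.5.0"
}
```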
---
## Provider Issues
### Provider Version Conflict
**Symptom:**
```
Error: Incompatible provider version
Provider registry.terraform.io/hashicorp/aws v5.0.0 does not have
a package available for your current platform
```
**Resolution:**
1. **Specify version constraints:**
```hcl
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 4.67.0" # Use compatible version
}
}
}
```
2. **Clean provider cache:**
```bash
rm -rf .terraform
terraform init -upgrade
```
3. **Lock file sync:**
```bash
terraform providers lock \
-platform=darwin_amd64 \
-platform=darwin_arm64 \
-platform=linux_amd64
```
---
### Authentication Failures
**Symptom:**
```
Error: error configuring Terraform AWS Provider:
no valid credential sources found
```
**Common Causes:**
1. Missing AWS credentials
2. Expired credentials
3. Incorrect IAM permissions
**Resolution:**
1. **Verify credentials:**
```bash
aws sts get-caller-identity
```
2. **Check credential order:**
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- Shared credentials file (~/.aws/credentials)
- IAM role (for EC2/ECS)
3. **Configure provider:**
```hcl
provider "aws" {
region = "us-east-1"
# Option 1: Use profile
profile = "production"
# Option 2: Assume role
assume_role {
role_arn = "arn:aws:iam::ACCOUNT:role/TerraformRole"
}
}
```
4. **Check IAM permissions:**
```bash
# Test specific permission
aws ec2 describe-instances --dry-run
```
**Prevention:**
- Use IAM roles in CI/CD
- Implement OIDC for GitHub Actions
- Regular credential rotation
- Use AWS SSO for developers
---
## Resource Errors
### Resource Already Exists
**Symptom:**
```
Error: creating EC2 Instance: EntityAlreadyExists:
Resource with id 'i-1234567890abcdef0' already exists
```
**Resolution:**
1. **Import existing resource:**
```bash
terraform import aws_instance.web i-1234567890abcdef0
```
2. **Verify configuration matches:**
```bash
terraform plan # Should show no changes after import
```
3. **If configuration differs, update it:**
```hcl
resource "aws_instance" "web" {
ami = "ami-abc123" # Match existing
instance_type = "t3.micro" # Match existing
}
```
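On Terraform 1.5 or newer, the import can also be declared in configuration, so it flows through plan/apply and code review instead of an imperative command:
```hcl
import {
  to = aws_instance.web
  id = "i-1234567890abcdef0"
}
```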
---
### Dependency Errors
**Symptom:**
```
Error: resource depends on resource "aws_vpc.main" that
is not declared in the configuration
```
**Resolution:**
1. **Add explicit dependency:**
```hcl
resource "aws_subnet" "private" {
vpc_id = aws_vpc.main.id
depends_on = [
aws_internet_gateway.main # Explicit dependency
]
}
```
2. **Use data sources for existing resources:**
```hcl
data "aws_vpc" "existing" {
id = "vpc-12345678"
}
resource "aws_subnet" "new" {
vpc_id = data.aws_vpc.existing.id
}
```
---
### Timeout Errors
**Symptom:**
```
Error: timeout while waiting for state to become 'available'
(last state: 'pending', timeout: 10m0s)
```
**Resolution:**
1. **Increase timeout:**
```hcl
resource "aws_db_instance" "main" {
# ... configuration ...
timeouts {
create = "60m"
update = "60m"
delete = "60m"
}
}
```
2. **Check resource status manually:**
```bash
aws rds describe-db-instances --db-instance-identifier mydb
```
3. **Retry the operation:**
```bash
terraform apply
```
---
## Module Issues
### Module Source Not Found
**Symptom:**
```
Error: Failed to download module
Could not download module "vpc" (main.tf:10) source:
git::https://github.com/company/terraform-modules.git//vpc
```
**Resolution:**
1. **Verify source URL:**
```hcl
module "vpc" {
source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.0.0"
# Add authentication if private repo
}
```
2. **For private repos, configure Git auth:**
```bash
# SSH key
git config --global url."git@github.com:".insteadOf "https://github.com/"
# Or use HTTPS with token
git config --global url."https://oauth2:TOKEN@github.com/".insteadOf "https://github.com/"
```
3. **Clear module cache:**
```bash
rm -rf .terraform/modules
terraform init
```
---
### Module Version Conflicts
**Symptom:**
```
Error: Inconsistent dependency lock file
Module has dependencies locked at version 1.0.0 but
root module requires version 2.0.0
```
**Resolution:**
1. **Update lock file:**
```bash
terraform init -upgrade
```
2. **Pin module version:**
```hcl
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 3.0" # Compatible with 3.x
}
```
---
## Terragrunt Specific
### Dependency Cycle Detected
**Symptom:**
```
Error: Dependency cycle detected:
module-a depends on module-b
module-b depends on module-c
module-c depends on module-a
```
**Resolution:**
1. **Review dependencies in terragrunt.hcl:**
```hcl
dependency "vpc" {
config_path = "../vpc"
}
dependency "database" {
config_path = "../database"
}
# Don't create circular references!
```
2. **Refactor to remove cycle:**
- Split modules differently
- Use data sources instead of dependencies
- Pass values through variables
3. **Use mock outputs during planning:**
```hcl
dependency "vpc" {
config_path = "../vpc"
mock_outputs = {
vpc_id = "vpc-mock"
}
mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}
```
---
### Hook Failures
**Symptom:**
```
Error: Hook execution failed
Command: pre_apply_hook.sh
Exit code: 1
```
**Resolution:**
1. **Debug the hook:**
```bash
# Run hook manually
bash .terragrunt-cache/.../pre_apply_hook.sh
```
2. **Add error handling to hook:**
```bash
#!/bin/bash
set -e # Exit on error
# Your hook logic
if ! command -v jq &> /dev/null; then
echo "jq is required but not installed"
exit 1
fi
```
3. **Make hook executable:**
```bash
chmod +x hooks/pre_apply_hook.sh
```
---
### Include Path Issues
**Symptom:**
```
Error: Cannot include file
Path does not exist: ../common.hcl
```
**Resolution:**
1. **Use correct relative path:**
```hcl
include "root" {
path = find_in_parent_folders()
}
include "common" {
path = "${get_terragrunt_dir()}/../common.hcl"
}
```
2. **Verify file exists:**
```bash
ls -la ../common.hcl
```
---
## Performance Issues
### Slow Plans/Applies
**Symptoms:**
- `terraform plan` takes >5 minutes
- `terraform apply` very slow
- State operations timing out
**Common Causes:**
1. Too many resources in single state
2. Slow provider API calls
3. Large number of data sources
4. Complex interpolations
**Resolution:**
1. **Split state files:**
```
networking/ # Separate state
compute/ # Separate state
database/ # Separate state
```
2. **Use targeted operations:**
```bash
terraform plan -target=aws_instance.web
terraform apply -target=module.vpc
```
3. **Optimize data sources:**
```hcl
# Bad - queries every plan
data "aws_ami" "ubuntu" {
most_recent = true
# ... filters
}
# Better - use specific AMI
variable "ami_id" {
default = "ami-abc123" # Update periodically
}
```
4. **Enable parallelism:**
```bash
terraform apply -parallelism=20 # Default is 10
```
5. **Use caching (Terragrunt):**
```hcl
remote_state {
backend = "s3"
config = {
skip_credentials_validation = true # Faster
skip_metadata_api_check = true
}
}
```
---
## Quick Diagnostic Steps
When encountering any Terraform error:
1. **Read the full error message** - Don't skip details
2. **Check recent changes** - What changed since last successful run?
3. **Verify versions** - Terraform, providers, modules
4. **Check state** - Is it locked? Corrupted?
5. **Test authentication** - Can you access resources manually?
6. **Review logs** - Use TF_LOG=DEBUG for detailed output
7. **Isolate the problem** - Use -target to test specific resources
### Enable Debug Logging
```bash
export TF_LOG=DEBUG
export TF_LOG_PATH=terraform-debug.log
terraform plan
```
### Test Configuration
```bash
terraform validate # Syntax check
terraform fmt -check # Format check
tflint # Linting
```
---
## Prevention Checklist
- [ ] Use remote state with locking
- [ ] Pin Terraform and provider versions
- [ ] Implement pre-commit hooks
- [ ] Run plan before every apply
- [ ] Use modules for reusable components
- [ ] Enable state versioning/backups
- [ ] Document architecture and dependencies
- [ ] Implement CI/CD with proper reviews
- [ ] Regular terraform plan in CI to detect drift
- [ ] Monitor and alert on state changes