Initial commit

skills/references/best_practices.md (new file, 709 lines)
@@ -0,0 +1,709 @@
# Terraform Best Practices
|
||||
|
||||
Comprehensive guide to Terraform best practices for infrastructure as code.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Project Structure](#project-structure)
|
||||
2. [State Management](#state-management)
|
||||
3. [Module Design](#module-design)
|
||||
4. [Variable Management](#variable-management)
|
||||
5. [Resource Naming](#resource-naming)
|
||||
6. [Security Practices](#security-practices)
|
||||
7. [Testing & Validation](#testing--validation)
|
||||
8. [CI/CD Integration](#cicd-integration)
|
||||
|
||||
---
|
||||
|
||||
## Project Structure
|
||||
|
||||
### Recommended Directory Layout
|
||||
|
||||
```
|
||||
terraform-project/
|
||||
├── environments/
|
||||
│ ├── dev/
|
||||
│ │ ├── main.tf
|
||||
│ │ ├── variables.tf
|
||||
│ │ ├── outputs.tf
|
||||
│ │ ├── terraform.tfvars
|
||||
│ │ └── backend.tf
|
||||
│ ├── staging/
|
||||
│ └── prod/
|
||||
├── modules/
|
||||
│ ├── networking/
|
||||
│ │ ├── main.tf
|
||||
│ │ ├── variables.tf
|
||||
│ │ ├── outputs.tf
|
||||
│ │ ├── versions.tf
|
||||
│ │ └── README.md
|
||||
│ ├── compute/
|
||||
│ └── database/
|
||||
├── global/
|
||||
│ ├── iam/
|
||||
│ └── dns/
|
||||
└── README.md
|
||||
```
|
||||
|
||||
### Key Principles
|
||||
|
||||
**Separate Environments**
|
||||
- Use directories for each environment (dev, staging, prod)
|
||||
- Each environment has its own state file
|
||||
- Prevents accidental changes to wrong environment
|
||||
|
||||
**Reusable Modules**
|
||||
- Common infrastructure patterns in modules/
|
||||
- Modules are versioned and tested
|
||||
- Used across multiple environments
|
||||
|
||||
**Global Resources**
|
||||
- Resources shared across environments (IAM, DNS)
|
||||
- Separate state for better isolation
|
||||
- Carefully managed with extra review
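
As a rough sketch of how these principles fit together, an environment directory composes the shared modules and keeps its own backend; the module inputs and bucket name below are illustrative assumptions:

```hcl
# environments/prod/main.tf (illustrative)
module "networking" {
  source = "../../modules/networking"

  environment = "prod"
  vpc_cidr    = "10.0.0.0/16"
}

# environments/prod/backend.tf - each environment points at its own state key
terraform {
  backend "s3" {
    bucket = "company-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}
```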
|
||||
|
||||
---
|
||||
|
||||
## State Management
|
||||
|
||||
### Remote State is Essential
|
||||
|
||||
**Why Remote State:**
|
||||
- Team collaboration and locking
|
||||
- State backup and versioning
|
||||
- Secure credential handling
|
||||
- Disaster recovery
|
||||
|
||||
**Recommended Backend: S3 + DynamoDB**
|
||||
|
||||
```hcl
|
||||
terraform {
|
||||
backend "s3" {
|
||||
bucket = "company-terraform-state"
|
||||
key = "prod/networking/terraform.tfstate"
|
||||
region = "us-east-1"
|
||||
encrypt = true
|
||||
dynamodb_table = "terraform-state-lock"
|
||||
kms_key_id = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY_ID"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**State Best Practices:**
|
||||
|
||||
1. **Enable Encryption**: Always encrypt state at rest
|
||||
2. **Enable Versioning**: On S3 bucket for state recovery
|
||||
3. **Use State Locking**: DynamoDB table prevents concurrent modifications
|
||||
4. **Restrict Access**: IAM policies limiting who can read/write state
|
||||
5. **Separate State Files**: Different states for different components
|
||||
6. **Regular Backups**: Automated backups of state files
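
The backing resources referenced above are usually bootstrapped once, outside the regular environments. A minimal sketch, assuming the bucket and table names used in the backend block:

```hcl
resource "aws_s3_bucket" "state" {
  bucket = "company-terraform-state"
}

# Versioning allows recovery of earlier state snapshots
resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# The S3 backend's locking requires a table with a "LockID" string hash key
resource "aws_dynamodb_table" "lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```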
|
||||
|
||||
### State File Organization
|
||||
|
||||
**Bad - Single State:**
|
||||
```
|
||||
terraform.tfstate (contains everything)
|
||||
```
|
||||
|
||||
**Good - Multiple States:**
|
||||
```
|
||||
networking/terraform.tfstate
|
||||
compute/terraform.tfstate
|
||||
database/terraform.tfstate
|
||||
```
|
||||
|
||||
**Benefits:**
|
||||
- Reduced blast radius
|
||||
- Faster plan/apply operations
|
||||
- Parallel team work
|
||||
- Easier to understand and debug
|
||||
|
||||
### State Management Commands
|
||||
|
||||
```bash
|
||||
# List resources in state
|
||||
terraform state list
|
||||
|
||||
# Show specific resource
|
||||
terraform state show aws_instance.example
|
||||
|
||||
# Move resource to different address
|
||||
terraform state mv aws_instance.old aws_instance.new
|
||||
|
||||
# Remove resource from state (doesn't destroy)
|
||||
terraform state rm aws_instance.example
|
||||
|
||||
# Import existing resource
|
||||
terraform import aws_instance.example i-1234567890abcdef0
|
||||
|
||||
# Pull state for inspection (read-only)
|
||||
terraform state pull > state.json
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Module Design
|
||||
|
||||
### Module Structure
|
||||
|
||||
Every module should have:
|
||||
|
||||
```
|
||||
module-name/
|
||||
├── main.tf # Primary resources
|
||||
├── variables.tf # Input variables
|
||||
├── outputs.tf # Output values
|
||||
├── versions.tf # Version constraints
|
||||
├── README.md # Documentation
|
||||
└── examples/ # Usage examples
|
||||
└── complete/
|
||||
├── main.tf
|
||||
└── variables.tf
|
||||
```
|
||||
|
||||
### Module Best Practices
|
||||
|
||||
**1. Single Responsibility**
|
||||
Each module should do one thing well:
|
||||
- ✅ `vpc-module` creates VPC with subnets, route tables, NACLs
|
||||
- ❌ `infrastructure` creates VPC, EC2, RDS, S3, everything
|
||||
|
||||
**2. Composability**
|
||||
Modules should work together:
|
||||
```hcl
|
||||
module "vpc" {
|
||||
source = "./modules/vpc"
|
||||
cidr = "10.0.0.0/16"
|
||||
}
|
||||
|
||||
module "eks" {
|
||||
source = "./modules/eks"
|
||||
vpc_id = module.vpc.vpc_id
|
||||
subnet_ids = module.vpc.private_subnet_ids
|
||||
}
|
||||
```
|
||||
|
||||
**3. Sensible Defaults**
|
||||
```hcl
|
||||
variable "instance_type" {
|
||||
type = string
|
||||
description = "EC2 instance type"
|
||||
default = "t3.micro" # Reasonable default
|
||||
}
|
||||
|
||||
variable "enable_monitoring" {
|
||||
type = bool
|
||||
description = "Enable detailed monitoring"
|
||||
default = false # Cost-effective default
|
||||
}
|
||||
```
|
||||
|
||||
**4. Complete Documentation**
|
||||
|
||||
```hcl
|
||||
variable "vpc_cidr" {
|
||||
type = string
|
||||
description = "CIDR block for VPC. Must be a valid IPv4 CIDR."
|
||||
|
||||
validation {
|
||||
condition = can(cidrhost(var.vpc_cidr, 0))
|
||||
error_message = "Must be a valid IPv4 CIDR block."
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**5. Output Useful Values**
|
||||
|
||||
```hcl
|
||||
output "vpc_id" {
|
||||
description = "ID of the VPC"
|
||||
value = aws_vpc.main.id
|
||||
}
|
||||
|
||||
output "private_subnet_ids" {
|
||||
description = "List of private subnet IDs for deploying workloads"
|
||||
value = aws_subnet.private[*].id
|
||||
}
|
||||
|
||||
output "nat_gateway_ips" {
|
||||
description = "Elastic IPs of NAT gateways for firewall whitelisting"
|
||||
value = aws_eip.nat[*].public_ip
|
||||
}
|
||||
```
|
||||
|
||||
### Module Versioning
|
||||
|
||||
**Use Git Tags for Versioning:**
|
||||
```hcl
|
||||
module "vpc" {
|
||||
source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.2.3"
|
||||
# Configuration...
|
||||
}
|
||||
```
|
||||
|
||||
**Semantic Versioning:**
|
||||
- v1.0.0 → First stable release
|
||||
- v1.1.0 → New features (backward compatible)
|
||||
- v1.1.1 → Bug fixes
|
||||
- v2.0.0 → Breaking changes
|
||||
|
||||
---
|
||||
|
||||
## Variable Management
|
||||
|
||||
### Variable Declaration
|
||||
|
||||
**Always Include:**
|
||||
```hcl
|
||||
variable "environment" {
|
||||
type = string
|
||||
description = "Environment name (dev, staging, prod)"
|
||||
|
||||
validation {
|
||||
condition = contains(["dev", "staging", "prod"], var.environment)
|
||||
error_message = "Environment must be dev, staging, or prod."
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Variable Files Hierarchy
|
||||
|
||||
```
|
||||
terraform.tfvars # Default values (committed, no secrets)
|
||||
dev.tfvars # Dev overrides
|
||||
prod.tfvars # Prod overrides
|
||||
secrets.auto.tfvars # Auto-loaded (in .gitignore)
|
||||
```
|
||||
|
||||
**Usage:**
|
||||
```bash
|
||||
terraform apply -var-file="prod.tfvars"
|
||||
```
|
||||
|
||||
### Sensitive Variables
|
||||
|
||||
**Mark as Sensitive:**
|
||||
```hcl
|
||||
variable "database_password" {
|
||||
type = string
|
||||
description = "Master password for database"
|
||||
sensitive = true
|
||||
}
|
||||
```
|
||||
|
||||
**Never commit secrets:**
|
||||
```bash
|
||||
# .gitignore
|
||||
*.auto.tfvars
|
||||
secrets.tfvars
|
||||
terraform.tfvars   # if it contains secrets
|
||||
```
|
||||
|
||||
**Better: Use External Secret Management**
|
||||
```hcl
|
||||
data "aws_secretsmanager_secret_version" "db_password" {
|
||||
secret_id = "prod/database/master-password"
|
||||
}
|
||||
|
||||
resource "aws_db_instance" "main" {
|
||||
password = data.aws_secretsmanager_secret_version.db_password.secret_string
|
||||
}
|
||||
```
|
||||
|
||||
### Variable Organization
|
||||
|
||||
**Group related variables:**
|
||||
```hcl
|
||||
# Network Configuration
|
||||
variable "vpc_cidr" { }
|
||||
variable "availability_zones" { }
|
||||
variable "public_subnet_cidrs" { }
|
||||
variable "private_subnet_cidrs" { }
|
||||
|
||||
# Application Configuration
|
||||
variable "app_name" { }
|
||||
variable "app_version" { }
|
||||
variable "instance_count" { }
|
||||
|
||||
# Tagging
|
||||
variable "tags" {
|
||||
type = map(string)
|
||||
description = "Common tags for all resources"
|
||||
default = {}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resource Naming
|
||||
|
||||
### Naming Conventions
|
||||
|
||||
**Terraform Resources (snake_case):**
|
||||
```hcl
|
||||
resource "aws_vpc" "main_vpc" { }
|
||||
resource "aws_subnet" "public_subnet_az1" { }
|
||||
resource "aws_instance" "web_server_01" { }
|
||||
```
|
||||
|
||||
**AWS Resource Names (kebab-case):**
|
||||
```hcl
|
||||
resource "aws_s3_bucket" "logs" {
|
||||
bucket = "company-prod-application-logs"
|
||||
# company-{env}-{service}-{purpose}
|
||||
}
|
||||
|
||||
resource "aws_instance" "web" {
|
||||
tags = {
|
||||
Name = "prod-web-server-01"
|
||||
# {env}-{service}-{type}-{number}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Naming Standards
|
||||
|
||||
**Pattern: `{company}-{environment}-{service}-{resource_type}`**
|
||||
|
||||
Examples:
|
||||
- `acme-prod-api-alb`
|
||||
- `acme-dev-workers-asg`
|
||||
- `acme-staging-database-rds`
|
||||
|
||||
**Benefits:**
|
||||
- Easy filtering in AWS console
|
||||
- Clear ownership and purpose
|
||||
- Consistent across environments
|
||||
- Billing and cost tracking
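
One way to keep the pattern consistent is to build the prefix once in a local value; the `company`, `environment`, and `service` variables below are assumptions:

```hcl
locals {
  name_prefix = "${var.company}-${var.environment}-${var.service}"
}

resource "aws_lb" "api" {
  name = "${local.name_prefix}-alb" # e.g. acme-prod-api-alb
  # ...
}
```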
|
||||
|
||||
---
|
||||
|
||||
## Security Practices
|
||||
|
||||
### 1. Principle of Least Privilege
|
||||
|
||||
```hcl
|
||||
# Bad - Too permissive
|
||||
resource "aws_iam_policy" "bad" {
|
||||
policy = jsonencode({
|
||||
Statement = [{
|
||||
Effect = "Allow"
|
||||
Action = "*"
|
||||
Resource = "*"
|
||||
}]
|
||||
})
|
||||
}
|
||||
|
||||
# Good - Specific permissions
|
||||
resource "aws_iam_policy" "good" {
|
||||
policy = jsonencode({
|
||||
Statement = [{
|
||||
Effect = "Allow"
|
||||
Action = [
|
||||
"s3:GetObject",
|
||||
"s3:PutObject"
|
||||
]
|
||||
Resource = "arn:aws:s3:::my-bucket/*"
|
||||
}]
|
||||
})
|
||||
}
|
||||
```
|
||||
|
||||
### 2. Encryption Everywhere
|
||||
|
||||
```hcl
|
||||
# Encrypt S3 buckets
|
||||
resource "aws_s3_bucket" "secure" {
|
||||
bucket = "my-secure-bucket"
|
||||
}
|
||||
|
||||
resource "aws_s3_bucket_server_side_encryption_configuration" "secure" {
|
||||
bucket = aws_s3_bucket.secure.id
|
||||
|
||||
rule {
|
||||
apply_server_side_encryption_by_default {
|
||||
sse_algorithm = "aws:kms"
|
||||
kms_master_key_id = aws_kms_key.bucket.arn
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# Encrypt EBS volumes
|
||||
resource "aws_instance" "secure" {
|
||||
root_block_device {
|
||||
encrypted = true
|
||||
}
|
||||
}
|
||||
|
||||
# Encrypt RDS databases
|
||||
resource "aws_db_instance" "secure" {
|
||||
storage_encrypted = true
|
||||
kms_key_id = aws_kms_key.rds.arn
|
||||
}
|
||||
```
|
||||
|
||||
### 3. Network Security
|
||||
|
||||
```hcl
|
||||
# Restrictive security groups
|
||||
resource "aws_security_group" "web" {
|
||||
name_prefix = "web-"
|
||||
|
||||
# Only allow specific inbound
|
||||
ingress {
|
||||
from_port = 443
|
||||
to_port = 443
|
||||
protocol = "tcp"
|
||||
cidr_blocks = ["0.0.0.0/0"] # Consider restricting further
|
||||
}
|
||||
|
||||
# Explicit outbound
|
||||
egress {
|
||||
from_port = 443
|
||||
to_port = 443
|
||||
protocol = "tcp"
|
||||
cidr_blocks = ["0.0.0.0/0"]
|
||||
}
|
||||
}
|
||||
|
||||
# Use private subnets for workloads
|
||||
resource "aws_subnet" "private" {
|
||||
map_public_ip_on_launch = false # No public IPs
|
||||
}
|
||||
```
|
||||
|
||||
### 4. Secret Management
|
||||
|
||||
**Never in Code:**
|
||||
```hcl
|
||||
# ❌ NEVER DO THIS
|
||||
resource "aws_db_instance" "bad" {
|
||||
password = "MySecretPassword123" # NEVER!
|
||||
}
|
||||
```
|
||||
|
||||
**Use AWS Secrets Manager:**
|
||||
```hcl
|
||||
# ✅ CORRECT APPROACH
|
||||
data "aws_secretsmanager_secret_version" "db" {
|
||||
secret_id = var.db_secret_arn
|
||||
}
|
||||
|
||||
resource "aws_db_instance" "good" {
|
||||
password = data.aws_secretsmanager_secret_version.db.secret_string
|
||||
}
|
||||
```
|
||||
|
||||
### 5. Resource Tagging
|
||||
|
||||
```hcl
|
||||
locals {
|
||||
common_tags = {
|
||||
Environment = var.environment
|
||||
ManagedBy = "Terraform"
|
||||
Owner = "platform-team"
|
||||
Project = var.project_name
|
||||
CostCenter = var.cost_center
|
||||
}
|
||||
}
|
||||
|
||||
resource "aws_instance" "web" {
|
||||
tags = merge(
|
||||
local.common_tags,
|
||||
{
|
||||
Name = "web-server"
|
||||
Role = "webserver"
|
||||
}
|
||||
)
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Testing & Validation
|
||||
|
||||
### Pre-Deployment Validation
|
||||
|
||||
**1. Terraform Validate**
|
||||
```bash
|
||||
terraform validate
|
||||
```
|
||||
Checks syntax and configuration validity.
|
||||
|
||||
**2. Terraform Plan**
|
||||
```bash
|
||||
terraform plan -out=tfplan
|
||||
```
|
||||
Review changes before applying.
|
||||
|
||||
**3. tflint**
|
||||
```bash
|
||||
tflint --module
|
||||
```
|
||||
Linter for catching errors and enforcing conventions.
|
||||
|
||||
**4. checkov**
|
||||
```bash
|
||||
checkov -d .
|
||||
```
|
||||
Security and compliance scanning.
|
||||
|
||||
**5. terraform-docs**
|
||||
```bash
|
||||
terraform-docs markdown . > README.md
|
||||
```
|
||||
Auto-generate documentation.
|
||||
|
||||
### Automated Testing
|
||||
|
||||
**Terratest (Go):**
|
||||
```go
|
||||
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../examples/complete",
	})

	// Destroy the resources at the end of the test
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)

	vpcId := terraform.Output(t, terraformOptions, "vpc_id")
	assert.NotEmpty(t, vpcId)
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## CI/CD Integration
|
||||
|
||||
### GitHub Actions Example
|
||||
|
||||
```yaml
|
||||
name: Terraform
|
||||
|
||||
on:
|
||||
pull_request:
|
||||
branches: [main]
|
||||
push:
|
||||
branches: [main]
|
||||
|
||||
jobs:
|
||||
terraform:
|
||||
runs-on: ubuntu-latest
|
||||
steps:
|
||||
- uses: actions/checkout@v3
|
||||
|
||||
- name: Setup Terraform
|
||||
uses: hashicorp/setup-terraform@v2
|
||||
|
||||
- name: Terraform Init
|
||||
run: terraform init
|
||||
|
||||
- name: Terraform Validate
|
||||
run: terraform validate
|
||||
|
||||
- name: Terraform Plan
|
||||
run: terraform plan -no-color
|
||||
if: github.event_name == 'pull_request'
|
||||
|
||||
- name: Terraform Apply
|
||||
run: terraform apply -auto-approve
|
||||
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
|
||||
```
|
||||
|
||||
### Best Practices for CI/CD
|
||||
|
||||
1. **Always run plan on PRs** - Review changes before merge
|
||||
2. **Require approvals** - Human review for production
|
||||
3. **Use workspaces or directories** - Separate pipeline per environment
|
||||
4. **Store state remotely** - S3 backend with locking
|
||||
5. **Use credential management** - OIDC or IAM roles, never store credentials
|
||||
6. **Run security scans** - checkov, tfsec in pipeline
|
||||
7. **Tag releases** - Version your infrastructure code
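
For item 5, a sketch of OIDC-based authentication that can replace stored keys in the workflow above; the role ARN is a placeholder and the role's trust policy must allow GitHub's OIDC provider:

```yaml
permissions:
  id-token: write
  contents: read

steps:
  - name: Configure AWS credentials
    uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/terraform-ci # placeholder ARN
      aws-region: us-east-1
```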
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls to Avoid
|
||||
|
||||
### 1. Not Using Remote State
|
||||
- ❌ Local state doesn't work for teams
|
||||
- ✅ Use S3, Terraform Cloud, or other remote backend
|
||||
|
||||
### 2. Hardcoding Values
|
||||
- ❌ `region = "us-east-1"` in every resource
|
||||
- ✅ Use variables and locals
|
||||
|
||||
### 3. Not Using Modules
|
||||
- ❌ Copying code between environments
|
||||
- ✅ Create reusable modules
|
||||
|
||||
### 4. Ignoring State
|
||||
- ❌ Manually modifying infrastructure
|
||||
- ✅ All changes through Terraform
|
||||
|
||||
### 5. Poor Naming
|
||||
- ❌ `resource "aws_instance" "i1" { }`
|
||||
- ✅ `resource "aws_instance" "web_server_01" { }`
|
||||
|
||||
### 6. No Documentation
|
||||
- ❌ No README, no comments
|
||||
- ✅ Document everything
|
||||
|
||||
### 7. Massive State Files
|
||||
- ❌ Single state for entire infrastructure
|
||||
- ✅ Break into logical components
|
||||
|
||||
### 8. No Testing
|
||||
- ❌ Apply directly to production
|
||||
- ✅ Test in dev/staging first
|
||||
|
||||
---
|
||||
|
||||
## Quick Reference
|
||||
|
||||
### Essential Commands
|
||||
```bash
|
||||
# Initialize
|
||||
terraform init
|
||||
|
||||
# Validate configuration
|
||||
terraform validate
|
||||
|
||||
# Format code
|
||||
terraform fmt -recursive
|
||||
|
||||
# Plan changes
|
||||
terraform plan
|
||||
|
||||
# Apply changes
|
||||
terraform apply
|
||||
|
||||
# Destroy resources
|
||||
terraform destroy
|
||||
|
||||
# Show current state
|
||||
terraform show
|
||||
|
||||
# List resources
|
||||
terraform state list
|
||||
|
||||
# Output values
|
||||
terraform output
|
||||
```
|
||||
|
||||
### Useful Flags
|
||||
```bash
|
||||
# Plan without color
|
||||
terraform plan -no-color
|
||||
|
||||
# Apply without prompts
|
||||
terraform apply -auto-approve
|
||||
|
||||
# Destroy specific resource
|
||||
terraform destroy -target=aws_instance.example
|
||||
|
||||
# Use specific var file
|
||||
terraform apply -var-file="prod.tfvars"
|
||||
|
||||
# Set variable via CLI
|
||||
terraform apply -var="environment=prod"
|
||||
```

skills/references/cost_optimization.md (new file, 665 lines)
@@ -0,0 +1,665 @@
# Terraform Cost Optimization Guide
|
||||
|
||||
Strategies for optimizing cloud infrastructure costs when using Terraform.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Right-Sizing Resources](#right-sizing-resources)
|
||||
2. [Spot and Reserved Instances](#spot-and-reserved-instances)
|
||||
3. [Storage Optimization](#storage-optimization)
|
||||
4. [Networking Costs](#networking-costs)
|
||||
5. [Resource Lifecycle](#resource-lifecycle)
|
||||
6. [Cost Tagging](#cost-tagging)
|
||||
7. [Monitoring and Alerts](#monitoring-and-alerts)
|
||||
8. [Multi-Cloud Considerations](#multi-cloud-considerations)
|
||||
|
||||
---
|
||||
|
||||
## Right-Sizing Resources
|
||||
|
||||
### Compute Resources
|
||||
|
||||
**Start small, scale up:**
|
||||
```hcl
|
||||
variable "instance_type" {
|
||||
type = string
|
||||
description = "EC2 instance type"
|
||||
default = "t3.micro" # Start with smallest reasonable size
|
||||
|
||||
validation {
|
||||
    condition     = can(regex("^t[0-9][a-z]?\\.", var.instance_type))
    error_message = "Consider starting with burstable (t-series) instances for cost optimization."
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Use auto-scaling instead of over-provisioning:**
|
||||
```hcl
|
||||
resource "aws_autoscaling_group" "app" {
|
||||
min_size = 2 # Minimum for HA
|
||||
desired_capacity = 2 # Normal load
|
||||
max_size = 10 # Peak load
|
||||
|
||||
# Scale based on actual usage
|
||||
target_group_arns = [aws_lb_target_group.app.arn]
|
||||
|
||||
tag {
|
||||
key = "Environment"
|
||||
value = var.environment
|
||||
propagate_at_launch = true
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Database Right-Sizing
|
||||
|
||||
**Start with appropriate size:**
|
||||
```hcl
|
||||
resource "aws_db_instance" "main" {
|
||||
instance_class = var.environment == "prod" ? "db.t3.medium" : "db.t3.micro"
|
||||
|
||||
# Enable auto-scaling for storage
|
||||
allocated_storage = 20
|
||||
max_allocated_storage = 100 # Auto-scale up to 100GB
|
||||
|
||||
# Use cheaper storage for non-prod
|
||||
storage_type = var.environment == "prod" ? "io1" : "gp3"
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Spot and Reserved Instances
|
||||
|
||||
### Spot Instances for Non-Critical Workloads
|
||||
|
||||
**Launch Template for Spot:**
|
||||
```hcl
|
||||
resource "aws_launch_template" "spot" {
|
||||
name_prefix = "spot-"
|
||||
image_id = data.aws_ami.amazon_linux.id
|
||||
instance_type = "t3.medium"
|
||||
|
||||
instance_market_options {
|
||||
market_type = "spot"
|
||||
|
||||
spot_options {
|
||||
max_price = "0.05" # Set price limit
|
||||
spot_instance_type = "one-time"
|
||||
instance_interruption_behavior = "terminate"
|
||||
}
|
||||
}
|
||||
|
||||
tag_specifications {
|
||||
resource_type = "instance"
|
||||
tags = {
|
||||
Name = "spot-instance"
|
||||
Workload = "non-critical"
|
||||
CostSavings = "true"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
resource "aws_autoscaling_group" "spot" {
|
||||
desired_capacity = 5
|
||||
max_size = 10
|
||||
min_size = 0
|
||||
|
||||
mixed_instances_policy {
|
||||
instances_distribution {
|
||||
on_demand_percentage_above_base_capacity = 20 # 20% on-demand, 80% spot
|
||||
spot_allocation_strategy = "capacity-optimized"
|
||||
}
|
||||
|
||||
launch_template {
|
||||
launch_template_specification {
|
||||
launch_template_id = aws_launch_template.spot.id
|
||||
version = "$Latest"
|
||||
}
|
||||
|
||||
# Multiple instance types increase spot availability
|
||||
override {
|
||||
instance_type = "t3.medium"
|
||||
}
|
||||
override {
|
||||
instance_type = "t3.large"
|
||||
}
|
||||
override {
|
||||
instance_type = "t3a.medium"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Reserved Instances (Use Outside Terraform)
|
||||
|
||||
Terraform shouldn't manage reservations directly, but should:
|
||||
- Tag resources consistently for reservation planning
|
||||
- Use Instance Savings Plans for flexibility
|
||||
- Monitor usage patterns to inform reservation purchases
|
||||
|
||||
**Tagging for reservation analysis:**
|
||||
```hcl
|
||||
locals {
|
||||
reservation_tags = {
|
||||
ReservationCandidate = var.environment == "prod" ? "true" : "false"
|
||||
UsagePattern = "steady-state" # or "variable", "burst"
|
||||
CostCenter = var.cost_center
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Storage Optimization
|
||||
|
||||
### S3 Lifecycle Policies
|
||||
|
||||
**Automatic tiering:**
|
||||
```hcl
|
||||
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
|
||||
bucket = aws_s3_bucket.logs.id
|
||||
|
||||
rule {
|
||||
id = "log-retention"
|
||||
status = "Enabled"
|
||||
|
||||
transition {
|
||||
days = 30
|
||||
storage_class = "STANDARD_IA" # Infrequent Access after 30 days
|
||||
}
|
||||
|
||||
transition {
|
||||
days = 90
|
||||
storage_class = "GLACIER_IR" # Instant Retrieval Glacier after 90 days
|
||||
}
|
||||
|
||||
transition {
|
||||
days = 180
|
||||
storage_class = "DEEP_ARCHIVE" # Deep Archive after 180 days
|
||||
}
|
||||
|
||||
expiration {
|
||||
days = 365 # Delete after 1 year
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Intelligent tiering for variable access:**
|
||||
```hcl
|
||||
resource "aws_s3_bucket_intelligent_tiering_configuration" "assets" {
|
||||
bucket = aws_s3_bucket.assets.id
|
||||
name = "entire-bucket"
|
||||
|
||||
tiering {
|
||||
access_tier = "ARCHIVE_ACCESS"
|
||||
days = 90
|
||||
}
|
||||
|
||||
tiering {
|
||||
access_tier = "DEEP_ARCHIVE_ACCESS"
|
||||
days = 180
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### EBS Volume Optimization
|
||||
|
||||
**Use appropriate volume types:**
|
||||
```hcl
|
||||
resource "aws_instance" "app" {
|
||||
ami = data.aws_ami.amazon_linux.id
|
||||
instance_type = "t3.medium"
|
||||
|
||||
root_block_device {
|
||||
volume_type = "gp3" # gp3 is cheaper than gp2 with better baseline
|
||||
volume_size = 20
|
||||
iops = 3000 # Default, only pay more if you need more
|
||||
throughput = 125 # Default
|
||||
encrypted = true
|
||||
|
||||
# Delete on termination to avoid orphaned volumes
|
||||
delete_on_termination = true
|
||||
}
|
||||
|
||||
tags = {
|
||||
Name = "app-server"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
**Snapshot lifecycle:**
|
||||
```hcl
|
||||
resource "aws_dlm_lifecycle_policy" "snapshots" {
|
||||
description = "EBS snapshot lifecycle"
|
||||
execution_role_arn = aws_iam_role.dlm.arn
|
||||
state = "ENABLED"
|
||||
|
||||
policy_details {
|
||||
resource_types = ["VOLUME"]
|
||||
|
||||
schedule {
|
||||
name = "Daily snapshots"
|
||||
|
||||
create_rule {
|
||||
interval = 24
|
||||
interval_unit = "HOURS"
|
||||
times = ["03:00"]
|
||||
}
|
||||
|
||||
retain_rule {
|
||||
count = 7 # Keep only 7 days of snapshots
|
||||
}
|
||||
|
||||
copy_tags = true
|
||||
}
|
||||
|
||||
target_tags = {
|
||||
BackupEnabled = "true"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Networking Costs
|
||||
|
||||
### Minimize Data Transfer
|
||||
|
||||
**Use VPC endpoints to avoid NAT charges:**
|
||||
```hcl
|
||||
resource "aws_vpc_endpoint" "s3" {
|
||||
vpc_id = aws_vpc.main.id
|
||||
service_name = "com.amazonaws.${var.region}.s3"
|
||||
route_table_ids = [
|
||||
aws_route_table.private.id
|
||||
]
|
||||
|
||||
tags = {
|
||||
Name = "s3-endpoint"
|
||||
CostSavings = "reduces-nat-charges"
|
||||
}
|
||||
}
|
||||
|
||||
resource "aws_vpc_endpoint" "dynamodb" {
|
||||
vpc_id = aws_vpc.main.id
|
||||
service_name = "com.amazonaws.${var.region}.dynamodb"
|
||||
route_table_ids = [
|
||||
aws_route_table.private.id
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
**Interface endpoints for AWS services:**
|
||||
```hcl
|
||||
resource "aws_vpc_endpoint" "ecr_api" {
|
||||
vpc_id = aws_vpc.main.id
|
||||
service_name = "com.amazonaws.${var.region}.ecr.api"
|
||||
vpc_endpoint_type = "Interface"
|
||||
subnet_ids = aws_subnet.private[*].id
|
||||
security_group_ids = [aws_security_group.vpc_endpoints.id]
|
||||
private_dns_enabled = true
|
||||
|
||||
tags = {
|
||||
Name = "ecr-api-endpoint"
|
||||
CostSavings = "reduces-nat-data-transfer"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Regional Optimization
|
||||
|
||||
**Co-locate resources in same region/AZ:**
|
||||
```hcl
|
||||
# Bad - cross-region data transfer is expensive
|
||||
resource "aws_instance" "app" {
|
||||
availability_zone = "us-east-1a"
|
||||
}
|
||||
|
||||
resource "aws_rds_cluster" "main" {
|
||||
availability_zones = ["us-west-2a"] # Different region!
|
||||
}
|
||||
|
||||
# Good - same region and AZ when possible
|
||||
resource "aws_instance" "app" {
|
||||
availability_zone = var.availability_zone
|
||||
}
|
||||
|
||||
resource "aws_rds_cluster" "main" {
|
||||
availability_zones = [var.availability_zone] # Same AZ
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Resource Lifecycle
|
||||
|
||||
### Scheduled Shutdown for Non-Production
|
||||
|
||||
**Lambda to stop/start instances:**
|
||||
```hcl
|
||||
resource "aws_lambda_function" "scheduler" {
|
||||
filename = "scheduler.zip"
|
||||
function_name = "instance-scheduler"
|
||||
role = aws_iam_role.scheduler.arn
|
||||
handler = "scheduler.handler"
|
||||
runtime = "python3.9"
|
||||
|
||||
environment {
|
||||
variables = {
|
||||
TAG_KEY = "Schedule"
|
||||
TAG_VALUE = "business-hours"
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
# EventBridge rule to stop instances at night
|
||||
resource "aws_cloudwatch_event_rule" "stop_instances" {
|
||||
name = "stop-dev-instances"
|
||||
description = "Stop dev instances at 7 PM"
|
||||
schedule_expression = "cron(0 19 ? * MON-FRI *)" # 7 PM weekdays
|
||||
}
|
||||
|
||||
resource "aws_cloudwatch_event_target" "stop" {
|
||||
rule = aws_cloudwatch_event_rule.stop_instances.name
|
||||
target_id = "stop-instances"
|
||||
arn = aws_lambda_function.scheduler.arn
|
||||
|
||||
input = jsonencode({
|
||||
action = "stop"
|
||||
})
|
||||
}
|
||||
|
||||
# Start instances in the morning
|
||||
resource "aws_cloudwatch_event_rule" "start_instances" {
|
||||
name = "start-dev-instances"
|
||||
description = "Start dev instances at 8 AM"
|
||||
schedule_expression = "cron(0 8 ? * MON-FRI *)" # 8 AM weekdays
|
||||
}
|
||||
```
|
||||
|
||||
**Tag instances for scheduling:**
|
||||
```hcl
|
||||
resource "aws_instance" "dev" {
|
||||
ami = data.aws_ami.amazon_linux.id
|
||||
instance_type = "t3.medium"
|
||||
|
||||
tags = {
|
||||
Name = "dev-server"
|
||||
Environment = "dev"
|
||||
Schedule = "business-hours" # Scheduler will stop/start based on this
|
||||
AutoShutdown = "true"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Cleanup Old Resources
|
||||
|
||||
**S3 lifecycle for temporary data:**
|
||||
```hcl
|
||||
resource "aws_s3_bucket_lifecycle_configuration" "temp" {
|
||||
bucket = aws_s3_bucket.temp.id
|
||||
|
||||
rule {
|
||||
id = "cleanup-temp-files"
|
||||
status = "Enabled"
|
||||
|
||||
filter {
|
||||
prefix = "temp/"
|
||||
}
|
||||
|
||||
expiration {
|
||||
days = 7 # Delete after 7 days
|
||||
}
|
||||
|
||||
abort_incomplete_multipart_upload {
|
||||
days_after_initiation = 1
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Cost Tagging
|
||||
|
||||
### Comprehensive Tagging Strategy
|
||||
|
||||
**Define tagging locals:**
|
||||
```hcl
|
||||
locals {
|
||||
common_tags = {
|
||||
# Cost allocation tags
|
||||
CostCenter = var.cost_center
|
||||
Project = var.project_name
|
||||
Environment = var.environment
|
||||
Owner = var.team_email
|
||||
|
||||
# Operational tags
|
||||
ManagedBy = "Terraform"
|
||||
TerraformModule = basename(abspath(path.module))
|
||||
|
||||
# Cost optimization tags
|
||||
AutoShutdown = var.environment != "prod" ? "enabled" : "disabled"
|
||||
ReservationCandidate = var.environment == "prod" ? "true" : "false"
|
||||
CostOptimized = "true"
|
||||
}
|
||||
}
|
||||
|
||||
# Apply to all resources
|
||||
resource "aws_instance" "app" {
|
||||
# ... configuration ...
|
||||
|
||||
tags = merge(
|
||||
local.common_tags,
|
||||
{
|
||||
Name = "${var.environment}-app-server"
|
||||
Role = "application"
|
||||
}
|
||||
)
|
||||
}
|
||||
```
|
||||
|
||||
**Enforce tagging with AWS Config:**
|
||||
```hcl
|
||||
resource "aws_config_config_rule" "required_tags" {
|
||||
name = "required-tags"
|
||||
|
||||
source {
|
||||
owner = "AWS"
|
||||
source_identifier = "REQUIRED_TAGS"
|
||||
}
|
||||
|
||||
input_parameters = jsonencode({
|
||||
tag1Key = "CostCenter"
|
||||
tag2Key = "Environment"
|
||||
tag3Key = "Owner"
|
||||
})
|
||||
|
||||
depends_on = [aws_config_configuration_recorder.main]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Monitoring and Alerts
|
||||
|
||||
### Budget Alerts
|
||||
|
||||
**AWS Budgets with Terraform:**
|
||||
```hcl
|
||||
resource "aws_budgets_budget" "monthly" {
|
||||
name = "${var.environment}-monthly-budget"
|
||||
budget_type = "COST"
|
||||
limit_amount = var.monthly_budget
|
||||
limit_unit = "USD"
|
||||
time_unit = "MONTHLY"
|
||||
time_period_start = "2024-01-01_00:00"
|
||||
|
||||
cost_filter {
|
||||
name = "TagKeyValue"
|
||||
values = [
|
||||
"Environment$${var.environment}"
|
||||
]
|
||||
}
|
||||
|
||||
notification {
|
||||
comparison_operator = "GREATER_THAN"
|
||||
threshold = 80
|
||||
threshold_type = "PERCENTAGE"
|
||||
notification_type = "ACTUAL"
|
||||
subscriber_email_addresses = [var.budget_alert_email]
|
||||
}
|
||||
|
||||
notification {
|
||||
comparison_operator = "GREATER_THAN"
|
||||
threshold = 100
|
||||
threshold_type = "PERCENTAGE"
|
||||
notification_type = "ACTUAL"
|
||||
subscriber_email_addresses = [var.budget_alert_email]
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Cost Anomaly Detection
|
||||
|
||||
```hcl
|
||||
resource "aws_ce_anomaly_monitor" "service" {
|
||||
name = "${var.environment}-service-monitor"
|
||||
monitor_type = "DIMENSIONAL"
|
||||
monitor_dimension = "SERVICE"
|
||||
}
|
||||
|
||||
resource "aws_ce_anomaly_subscription" "alerts" {
|
||||
name = "${var.environment}-anomaly-alerts"
|
||||
frequency = "DAILY"
|
||||
|
||||
monitor_arn_list = [
|
||||
aws_ce_anomaly_monitor.service.arn
|
||||
]
|
||||
|
||||
subscriber {
|
||||
type = "EMAIL"
|
||||
address = var.cost_alert_email
|
||||
}
|
||||
|
||||
threshold_expression {
|
||||
dimension {
|
||||
key = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
|
||||
values = ["100"] # Alert on $100+ anomalies
|
||||
match_options = ["GREATER_THAN_OR_EQUAL"]
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Multi-Cloud Considerations
|
||||
|
||||
### Azure Cost Optimization
|
||||
|
||||
**Use Azure Hybrid Benefit:**
|
||||
```hcl
|
||||
resource "azurerm_linux_virtual_machine" "main" {
|
||||
# ... configuration ...
|
||||
|
||||
# Use Azure Hybrid Benefit for licensing savings
|
||||
license_type = "RHEL_BYOS" # or "SLES_BYOS"
|
||||
}
|
||||
```
|
||||
|
||||
**Azure Reserved Instances (outside Terraform):**
|
||||
- Purchase through Azure Portal
|
||||
- Tag VMs with `ReservationGroup` for planning
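
A minimal sketch of that tagging, assuming a `ReservationGroup` value agreed with the team:

```hcl
resource "azurerm_linux_virtual_machine" "main" {
  # ... configuration ...

  tags = {
    ReservationGroup = "steady-state-prod" # illustrative grouping for reservation planning
    Environment      = var.environment
  }
}
```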
|
||||
|
||||
### GCP Cost Optimization
|
||||
|
||||
**Committed use discounts (purchased outside Terraform) pair with preemptible scheduling:**
|
||||
```hcl
|
||||
resource "google_compute_instance" "main" {
|
||||
# ... configuration ...
|
||||
|
||||
  # Committed use discounts are purchased at the billing level, outside Terraform.
  # Preemptible scheduling adds further savings for non-production workloads.
  scheduling {
    preemptible         = var.environment != "prod" # Preemptible for non-prod
    # Preemptible VMs require automatic_restart = false and on_host_maintenance = "TERMINATE"
    automatic_restart   = var.environment == "prod"
    on_host_maintenance = var.environment == "prod" ? "MIGRATE" : "TERMINATE"
  }
|
||||
}
|
||||
```
|
||||
|
||||
**GCP Preemptible VMs:**
|
||||
```hcl
|
||||
resource "google_compute_instance_template" "preemptible" {
|
||||
machine_type = "n1-standard-1"
|
||||
|
||||
scheduling {
|
||||
automatic_restart = false
|
||||
on_host_maintenance = "TERMINATE"
|
||||
preemptible = true # Up to 80% cost reduction
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Cost Optimization Checklist
|
||||
|
||||
### Before Deployment
|
||||
- [ ] Right-size compute resources (start small)
|
||||
- [ ] Use appropriate storage tiers
|
||||
- [ ] Enable auto-scaling instead of over-provisioning
|
||||
- [ ] Implement tagging strategy
|
||||
- [ ] Configure lifecycle policies
|
||||
- [ ] Set up VPC endpoints for AWS services
|
||||
|
||||
### After Deployment
|
||||
- [ ] Monitor actual usage vs. provisioned capacity
|
||||
- [ ] Review cost allocation tags
|
||||
- [ ] Identify reservation opportunities
|
||||
- [ ] Configure budget alerts
|
||||
- [ ] Enable cost anomaly detection
|
||||
- [ ] Schedule non-production resource shutdown
|
||||
|
||||
### Ongoing
|
||||
- [ ] Monthly cost review
|
||||
- [ ] Quarterly right-sizing analysis
|
||||
- [ ] Annual reservation review
|
||||
- [ ] Remove unused resources
|
||||
- [ ] Optimize data transfer patterns
|
||||
- [ ] Update instance families (new generations are often cheaper)
|
||||
|
||||
---
|
||||
|
||||
## Cost Estimation Tools
|
||||
|
||||
### Use `infracost` in CI/CD
|
||||
|
||||
```bash
|
||||
# Install infracost
|
||||
curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh
|
||||
|
||||
# Generate cost estimate
|
||||
infracost breakdown --path .
|
||||
|
||||
# Compare cost changes in PR
|
||||
infracost diff --path . --compare-to tfplan.json
|
||||
```
|
||||
|
||||
### Terraform Cloud Cost Estimation
|
||||
|
||||
Enable in Terraform Cloud workspace settings for automatic cost estimates on every plan.
|
||||
|
||||
---
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- AWS Cost Optimization: https://aws.amazon.com/pricing/cost-optimization/
|
||||
- Azure Cost Management: https://azure.microsoft.com/en-us/products/cost-management/
|
||||
- GCP Cost Management: https://cloud.google.com/cost-management
|
||||
- Infracost: https://www.infracost.io/
|
||||
- Cloud Cost Optimization Tools: Kubecost, CloudHealth, CloudCheckr

skills/references/troubleshooting.md (new file, 635 lines)
@@ -0,0 +1,635 @@
# Terraform Troubleshooting Guide
|
||||
|
||||
Common Terraform and Terragrunt issues with solutions.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [State Issues](#state-issues)
|
||||
2. [Provider Issues](#provider-issues)
|
||||
3. [Resource Errors](#resource-errors)
|
||||
4. [Module Issues](#module-issues)
|
||||
5. [Terragrunt Specific](#terragrunt-specific)
|
||||
6. [Performance Issues](#performance-issues)
|
||||
|
||||
---
|
||||
|
||||
## State Issues
|
||||
|
||||
### State Lock Error
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error locking state: Error acquiring the state lock
|
||||
Lock Info:
|
||||
ID: abc123...
|
||||
Path: terraform.tfstate
|
||||
Operation: OperationTypeApply
|
||||
Who: user@hostname
|
||||
Created: 2024-01-15 10:30:00 UTC
|
||||
```
|
||||
|
||||
**Common Causes:**
|
||||
1. Previous operation crashed or was interrupted
|
||||
2. Another user/process is running terraform
|
||||
3. State lock wasn't released properly
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Verify no one else is running terraform:**
|
||||
```bash
|
||||
# Check with team first!
|
||||
```
|
||||
|
||||
2. **Force unlock (use with caution):**
|
||||
```bash
|
||||
terraform force-unlock abc123
|
||||
```
|
||||
|
||||
3. **For DynamoDB backend, check lock table:**
|
||||
```bash
|
||||
aws dynamodb get-item \
|
||||
--table-name terraform-state-lock \
|
||||
--key '{"LockID": {"S": "path/to/state/terraform.tfstate-md5"}}'
|
||||
```
|
||||
|
||||
**Prevention:**
|
||||
- Use proper state locking backend (S3 + DynamoDB)
|
||||
- Implement timeout in CI/CD pipelines
|
||||
- Always let terraform complete or properly cancel
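
In pipelines, a lock timeout lets a queued run wait briefly instead of failing immediately; the duration below is arbitrary:

```bash
# Wait up to 5 minutes for the state lock before giving up
terraform apply -lock-timeout=5m
```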
|
||||
|
||||
---
|
||||
|
||||
### State Drift Detected
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Note: Objects have changed outside of Terraform
|
||||
|
||||
Terraform detected the following changes made outside of Terraform
|
||||
since the last "terraform apply":
|
||||
```
|
||||
|
||||
**Common Causes:**
|
||||
1. Manual changes in AWS console
|
||||
2. Another tool modifying resources
|
||||
3. Auto-scaling or auto-remediation
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Review the drift:**
|
||||
```bash
|
||||
terraform plan -detailed-exitcode
|
||||
```
|
||||
|
||||
2. **Options:**
|
||||
- **Import changes:** Update terraform to match reality
|
||||
- **Revert changes:** Apply terraform to restore desired state
|
||||
- **Refresh state:** `terraform apply -refresh-only`
|
||||
|
||||
3. **Import specific changes:**
|
||||
```bash
|
||||
# Update your .tf files, then:
|
||||
terraform plan # Verify it matches
|
||||
terraform apply
|
||||
```
|
||||
|
||||
**Prevention:**
|
||||
- Implement policy to prevent manual changes
|
||||
- Use AWS Config rules to detect drift
|
||||
- Regular `terraform plan` to catch drift early
|
||||
- Consider using Terraform Cloud drift detection
|
||||
|
||||
---
|
||||
|
||||
### State Corruption
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Failed to load state
|
||||
Error: state snapshot was created by Terraform v1.5.0,
|
||||
which is newer than current v1.3.0
|
||||
```
|
||||
|
||||
**Common Causes:**
|
||||
1. Using different Terraform versions
|
||||
2. State file manually edited
|
||||
3. Incomplete state upload
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Version mismatch:**
|
||||
```bash
|
||||
# Upgrade to matching version
|
||||
tfenv install 1.5.0
|
||||
tfenv use 1.5.0
|
||||
```
|
||||
|
||||
2. **Restore from backup:**
|
||||
```bash
|
||||
# For S3 backend with versioning
|
||||
aws s3api list-object-versions \
|
||||
--bucket terraform-state \
|
||||
--prefix prod/terraform.tfstate
|
||||
|
||||
# Restore specific version
|
||||
aws s3api get-object \
|
||||
--bucket terraform-state \
|
||||
--key prod/terraform.tfstate \
|
||||
--version-id VERSION_ID \
|
||||
terraform.tfstate
|
||||
```
|
||||
|
||||
3. **Rebuild state (last resort):**
|
||||
```bash
|
||||
# Remove corrupted state
|
||||
terraform state rm aws_instance.example
|
||||
|
||||
# Re-import resources
|
||||
terraform import aws_instance.example i-1234567890abcdef0
|
||||
```
|
||||
|
||||
**Prevention:**
|
||||
- Pin Terraform version in `versions.tf`
|
||||
- Enable S3 versioning for state bucket
|
||||
- Never manually edit state files
|
||||
- Use consistent Terraform versions across team
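
A sketch of a `versions.tf` that pins both the CLI and provider versions; the exact constraints are examples:

```hcl
terraform {
  required_version = "~> 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
```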
|
||||
|
||||
---
|
||||
|
||||
## Provider Issues
|
||||
|
||||
### Provider Version Conflict
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Incompatible provider version
|
||||
|
||||
Provider registry.terraform.io/hashicorp/aws v5.0.0 does not have
|
||||
a package available for your current platform
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Specify version constraints:**
|
||||
```hcl
|
||||
terraform {
|
||||
required_providers {
|
||||
aws = {
|
||||
source = "hashicorp/aws"
|
||||
version = "~> 4.67.0" # Use compatible version
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
2. **Clean provider cache:**
|
||||
```bash
|
||||
rm -rf .terraform
|
||||
terraform init -upgrade
|
||||
```
|
||||
|
||||
3. **Lock file sync:**
|
||||
```bash
|
||||
terraform providers lock \
|
||||
-platform=darwin_amd64 \
|
||||
-platform=darwin_arm64 \
|
||||
-platform=linux_amd64
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Authentication Failures
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: error configuring Terraform AWS Provider:
|
||||
no valid credential sources found
|
||||
```
|
||||
|
||||
**Common Causes:**
|
||||
1. Missing AWS credentials
|
||||
2. Expired credentials
|
||||
3. Incorrect IAM permissions
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Verify credentials:**
|
||||
```bash
|
||||
aws sts get-caller-identity
|
||||
```
|
||||
|
||||
2. **Check credential order:**
|
||||
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
|
||||
- Shared credentials file (~/.aws/credentials)
|
||||
- IAM role (for EC2/ECS)
|
||||
|
||||
3. **Configure provider:**
|
||||
```hcl
|
||||
provider "aws" {
|
||||
region = "us-east-1"
|
||||
|
||||
# Option 1: Use profile
|
||||
profile = "production"
|
||||
|
||||
# Option 2: Assume role
|
||||
assume_role {
|
||||
role_arn = "arn:aws:iam::ACCOUNT:role/TerraformRole"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
4. **Check IAM permissions:**
|
||||
```bash
|
||||
# Test specific permission
|
||||
aws ec2 describe-instances --dry-run
|
||||
```
|
||||
|
||||
**Prevention:**
|
||||
- Use IAM roles in CI/CD
|
||||
- Implement OIDC for GitHub Actions
|
||||
- Regular credential rotation
|
||||
- Use AWS SSO for developers
|
||||
|
||||
---
|
||||
|
||||
## Resource Errors
|
||||
|
||||
### Resource Already Exists
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: creating EC2 Instance: EntityAlreadyExists:
|
||||
Resource with id 'i-1234567890abcdef0' already exists
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Import existing resource:**
|
||||
```bash
|
||||
terraform import aws_instance.web i-1234567890abcdef0
|
||||
```
|
||||
|
||||
2. **Verify configuration matches:**
|
||||
```bash
|
||||
terraform plan # Should show no changes after import
|
||||
```
|
||||
|
||||
3. **If configuration differs, update it:**
|
||||
```hcl
|
||||
resource "aws_instance" "web" {
|
||||
ami = "ami-abc123" # Match existing
|
||||
instance_type = "t3.micro" # Match existing
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Dependency Errors
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: resource depends on resource "aws_vpc.main" that
|
||||
is not declared in the configuration
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Add explicit dependency:**
|
||||
```hcl
|
||||
resource "aws_subnet" "private" {
|
||||
vpc_id = aws_vpc.main.id
|
||||
|
||||
depends_on = [
|
||||
aws_internet_gateway.main # Explicit dependency
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
2. **Use data sources for existing resources:**
|
||||
```hcl
|
||||
data "aws_vpc" "existing" {
|
||||
id = "vpc-12345678"
|
||||
}
|
||||
|
||||
resource "aws_subnet" "new" {
|
||||
vpc_id = data.aws_vpc.existing.id
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Timeout Errors
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: timeout while waiting for state to become 'available'
|
||||
(last state: 'pending', timeout: 10m0s)
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Increase timeout:**
|
||||
```hcl
|
||||
resource "aws_db_instance" "main" {
|
||||
# ... configuration ...
|
||||
|
||||
timeouts {
|
||||
create = "60m"
|
||||
update = "60m"
|
||||
delete = "60m"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
2. **Check resource status manually:**
|
||||
```bash
|
||||
aws rds describe-db-instances --db-instance-identifier mydb
|
||||
```
|
||||
|
||||
3. **Retry the operation:**
|
||||
```bash
|
||||
terraform apply
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Module Issues
|
||||
|
||||
### Module Source Not Found
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Failed to download module
|
||||
|
||||
Could not download module "vpc" (main.tf:10) source:
|
||||
git::https://github.com/company/terraform-modules.git//vpc
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Verify source URL:**
|
||||
```hcl
|
||||
module "vpc" {
|
||||
source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.0.0"
|
||||
# Add authentication if private repo
|
||||
}
|
||||
```
|
||||
|
||||
2. **For private repos, configure Git auth:**
|
||||
```bash
|
||||
# SSH key
|
||||
git config --global url."git@github.com:".insteadOf "https://github.com/"
|
||||
|
||||
# Or use HTTPS with token
|
||||
git config --global url."https://oauth2:TOKEN@github.com/".insteadOf "https://github.com/"
|
||||
```
|
||||
|
||||
3. **Clear module cache:**
|
||||
```bash
|
||||
rm -rf .terraform/modules
|
||||
terraform init
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Module Version Conflicts
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Inconsistent dependency lock file
|
||||
|
||||
Module has dependencies locked at version 1.0.0 but
|
||||
root module requires version 2.0.0
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Update lock file:**
|
||||
```bash
|
||||
terraform init -upgrade
|
||||
```
|
||||
|
||||
2. **Pin module version:**
|
||||
```hcl
|
||||
module "vpc" {
|
||||
source = "terraform-aws-modules/vpc/aws"
|
||||
version = "~> 3.0" # Compatible with 3.x
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Terragrunt Specific
|
||||
|
||||
### Dependency Cycle Detected
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Dependency cycle detected:
|
||||
module-a depends on module-b
|
||||
module-b depends on module-c
|
||||
module-c depends on module-a
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Review dependencies in terragrunt.hcl:**
|
||||
```hcl
|
||||
dependency "vpc" {
|
||||
config_path = "../vpc"
|
||||
}
|
||||
|
||||
dependency "database" {
|
||||
config_path = "../database"
|
||||
}
|
||||
|
||||
# Don't create circular references!
|
||||
```
|
||||
|
||||
2. **Refactor to remove cycle:**
|
||||
- Split modules differently
|
||||
- Use data sources instead of dependencies
|
||||
- Pass values through variables
|
||||
|
||||
3. **Use mock outputs during planning:**
|
||||
```hcl
|
||||
dependency "vpc" {
|
||||
config_path = "../vpc"
|
||||
|
||||
mock_outputs = {
|
||||
vpc_id = "vpc-mock"
|
||||
}
|
||||
mock_outputs_allowed_terraform_commands = ["validate", "plan"]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Hook Failures
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Hook execution failed
|
||||
Command: pre_apply_hook.sh
|
||||
Exit code: 1
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Debug the hook:**
|
||||
```bash
|
||||
# Run hook manually
|
||||
bash .terragrunt-cache/.../pre_apply_hook.sh
|
||||
```
|
||||
|
||||
2. **Add error handling to hook:**
|
||||
```bash
|
||||
#!/bin/bash
|
||||
set -e # Exit on error
|
||||
|
||||
# Your hook logic
|
||||
if ! command -v jq &> /dev/null; then
|
||||
echo "jq is required but not installed"
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Make hook executable:**
|
||||
```bash
|
||||
chmod +x hooks/pre_apply_hook.sh
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Include Path Issues
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Cannot include file
|
||||
Path does not exist: ../common.hcl
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Use correct relative path:**
|
||||
```hcl
|
||||
include "root" {
|
||||
path = find_in_parent_folders()
|
||||
}
|
||||
|
||||
include "common" {
|
||||
path = "${get_terragrunt_dir()}/../common.hcl"
|
||||
}
|
||||
```
|
||||
|
||||
2. **Verify file exists:**
|
||||
```bash
|
||||
ls -la ../common.hcl
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Performance Issues
|
||||
|
||||
### Slow Plans/Applies
|
||||
|
||||
**Symptoms:**
|
||||
- `terraform plan` takes >5 minutes
|
||||
- `terraform apply` very slow
|
||||
- State operations timing out
|
||||
|
||||
**Common Causes:**
|
||||
1. Too many resources in single state
|
||||
2. Slow provider API calls
|
||||
3. Large number of data sources
|
||||
4. Complex interpolations
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Split state files:**
|
||||
```
|
||||
networking/ # Separate state
|
||||
compute/ # Separate state
|
||||
database/ # Separate state
|
||||
```
|
||||
|
||||
2. **Use targeted operations:**
|
||||
```bash
|
||||
terraform plan -target=aws_instance.web
|
||||
terraform apply -target=module.vpc
|
||||
```
|
||||
|
||||
3. **Optimize data sources:**
|
||||
```hcl
|
||||
# Bad - queries every plan
|
||||
data "aws_ami" "ubuntu" {
|
||||
most_recent = true
|
||||
# ... filters
|
||||
}
|
||||
|
||||
# Better - use specific AMI
|
||||
variable "ami_id" {
|
||||
default = "ami-abc123" # Update periodically
|
||||
}
|
||||
```
|
||||
|
||||
4. **Enable parallelism:**
|
||||
```bash
|
||||
terraform apply -parallelism=20 # Default is 10
|
||||
```
|
||||
|
||||
5. **Use caching (Terragrunt):**
|
||||
```hcl
|
||||
remote_state {
|
||||
backend = "s3"
|
||||
config = {
|
||||
skip_credentials_validation = true # Faster
|
||||
skip_metadata_api_check = true
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quick Diagnostic Steps
|
||||
|
||||
When encountering any Terraform error:
|
||||
|
||||
1. **Read the full error message** - Don't skip details
|
||||
2. **Check recent changes** - What changed since last successful run?
|
||||
3. **Verify versions** - Terraform, providers, modules
|
||||
4. **Check state** - Is it locked? Corrupted?
|
||||
5. **Test authentication** - Can you access resources manually?
|
||||
6. **Review logs** - Use TF_LOG=DEBUG for detailed output
|
||||
7. **Isolate the problem** - Use -target to test specific resources
|
||||
|
||||
### Enable Debug Logging
|
||||
|
||||
```bash
|
||||
export TF_LOG=DEBUG
|
||||
export TF_LOG_PATH=terraform-debug.log
|
||||
terraform plan
|
||||
```
|
||||
|
||||
### Test Configuration
|
||||
|
||||
```bash
|
||||
terraform validate # Syntax check
|
||||
terraform fmt -check # Format check
|
||||
tflint # Linting
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention Checklist
|
||||
|
||||
- [ ] Use remote state with locking
|
||||
- [ ] Pin Terraform and provider versions
|
||||
- [ ] Implement pre-commit hooks
|
||||
- [ ] Run plan before every apply
|
||||
- [ ] Use modules for reusable components
|
||||
- [ ] Enable state versioning/backups
|
||||
- [ ] Document architecture and dependencies
|
||||
- [ ] Implement CI/CD with proper reviews
|
||||
- [ ] Regular terraform plan in CI to detect drift
|
||||
- [ ] Monitor and alert on state changes
|