710 lines
14 KiB
Markdown
710 lines
14 KiB
Markdown
# Terraform Best Practices
|
|
|
|
Comprehensive guide to Terraform best practices for infrastructure as code.
|
|
|
|
## Table of Contents
|
|
|
|
1. [Project Structure](#project-structure)
|
|
2. [State Management](#state-management)
|
|
3. [Module Design](#module-design)
|
|
4. [Variable Management](#variable-management)
|
|
5. [Resource Naming](#resource-naming)
|
|
6. [Security Practices](#security-practices)
|
|
7. [Testing & Validation](#testing--validation)
|
|
8. [CI/CD Integration](#cicd-integration)
|
|
|
|
---
|
|
|
|
## Project Structure
|
|
|
|
### Recommended Directory Layout
|
|
|
|
```
|
|
terraform-project/
|
|
├── environments/
|
|
│ ├── dev/
|
|
│ │ ├── main.tf
|
|
│ │ ├── variables.tf
|
|
│ │ ├── outputs.tf
|
|
│ │ ├── terraform.tfvars
|
|
│ │ └── backend.tf
|
|
│ ├── staging/
|
|
│ └── prod/
|
|
├── modules/
|
|
│ ├── networking/
|
|
│ │ ├── main.tf
|
|
│ │ ├── variables.tf
|
|
│ │ ├── outputs.tf
|
|
│ │ ├── versions.tf
|
|
│ │ └── README.md
|
|
│ ├── compute/
|
|
│ └── database/
|
|
├── global/
|
|
│ ├── iam/
|
|
│ └── dns/
|
|
└── README.md
|
|
```
|
|
|
|
### Key Principles
|
|
|
|
**Separate Environments**
|
|
- Use directories for each environment (dev, staging, prod)
|
|
- Each environment has its own state file
|
|
- Prevents accidental changes to wrong environment
|
|
|
|
**Reusable Modules**
|
|
- Common infrastructure patterns in modules/
|
|
- Modules are versioned and tested
|
|
- Used across multiple environments
|
|
|
|
**Global Resources**
|
|
- Resources shared across environments (IAM, DNS)
|
|
- Separate state for better isolation
|
|
- Carefully managed with extra review
|
|
|
|
---
|
|
|
|
## State Management
|
|
|
|
### Remote State is Essential
|
|
|
|
**Why Remote State:**
|
|
- Team collaboration and locking
|
|
- State backup and versioning
|
|
- Secure credential handling
|
|
- Disaster recovery
|
|
|
|
**Recommended Backend: S3 + DynamoDB**
|
|
|
|
```hcl
|
|
terraform {
|
|
backend "s3" {
|
|
bucket = "company-terraform-state"
|
|
key = "prod/networking/terraform.tfstate"
|
|
region = "us-east-1"
|
|
encrypt = true
|
|
dynamodb_table = "terraform-state-lock"
|
|
kms_key_id = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY_ID"
|
|
}
|
|
}
|
|
```
|
|
|
|
**State Best Practices:**
|
|
|
|
1. **Enable Encryption**: Always encrypt state at rest
|
|
2. **Enable Versioning**: On S3 bucket for state recovery
|
|
3. **Use State Locking**: DynamoDB table prevents concurrent modifications
|
|
4. **Restrict Access**: IAM policies limiting who can read/write state
|
|
5. **Separate State Files**: Different states for different components
|
|
6. **Regular Backups**: Automated backups of state files
|
|
|
|
### State File Organization
|
|
|
|
**Bad - Single State:**
|
|
```
|
|
terraform.tfstate (contains everything)
|
|
```
|
|
|
|
**Good - Multiple States:**
|
|
```
|
|
networking/terraform.tfstate
|
|
compute/terraform.tfstate
|
|
database/terraform.tfstate
|
|
```
|
|
|
|
**Benefits:**
|
|
- Reduced blast radius
|
|
- Faster plan/apply operations
|
|
- Parallel team work
|
|
- Easier to understand and debug
|
|
|
|
### State Management Commands
|
|
|
|
```bash
|
|
# List resources in state
|
|
terraform state list
|
|
|
|
# Show specific resource
|
|
terraform state show aws_instance.example
|
|
|
|
# Move resource to different address
|
|
terraform state mv aws_instance.old aws_instance.new
|
|
|
|
# Remove resource from state (doesn't destroy)
|
|
terraform state rm aws_instance.example
|
|
|
|
# Import existing resource
|
|
terraform import aws_instance.example i-1234567890abcdef0
|
|
|
|
# Pull state for inspection (read-only)
|
|
terraform state pull > state.json
|
|
```
|
|
|
|
---
|
|
|
|
## Module Design
|
|
|
|
### Module Structure
|
|
|
|
Every module should have:
|
|
|
|
```
|
|
module-name/
|
|
├── main.tf # Primary resources
|
|
├── variables.tf # Input variables
|
|
├── outputs.tf # Output values
|
|
├── versions.tf # Version constraints
|
|
├── README.md # Documentation
|
|
└── examples/ # Usage examples
|
|
└── complete/
|
|
├── main.tf
|
|
└── variables.tf
|
|
```
|
|
|
|
### Module Best Practices
|
|
|
|
**1. Single Responsibility**
|
|
Each module should do one thing well:
|
|
- ✅ `vpc-module` creates VPC with subnets, route tables, NACLs
|
|
- ❌ `infrastructure` creates VPC, EC2, RDS, S3, everything
|
|
|
|
**2. Composability**
|
|
Modules should work together:
|
|
```hcl
|
|
module "vpc" {
|
|
source = "./modules/vpc"
|
|
cidr = "10.0.0.0/16"
|
|
}
|
|
|
|
module "eks" {
|
|
source = "./modules/eks"
|
|
vpc_id = module.vpc.vpc_id
|
|
subnet_ids = module.vpc.private_subnet_ids
|
|
}
|
|
```
|
|
|
|
**3. Sensible Defaults**
|
|
```hcl
|
|
variable "instance_type" {
|
|
type = string
|
|
description = "EC2 instance type"
|
|
default = "t3.micro" # Reasonable default
|
|
}
|
|
|
|
variable "enable_monitoring" {
|
|
type = bool
|
|
description = "Enable detailed monitoring"
|
|
default = false # Cost-effective default
|
|
}
|
|
```
|
|
|
|
**4. Complete Documentation**
|
|
|
|
```hcl
|
|
variable "vpc_cidr" {
|
|
type = string
|
|
description = "CIDR block for VPC. Must be a valid IPv4 CIDR."
|
|
|
|
validation {
|
|
condition = can(cidrhost(var.vpc_cidr, 0))
|
|
error_message = "Must be a valid IPv4 CIDR block."
|
|
}
|
|
}
|
|
```
|
|
|
|
**5. Output Useful Values**
|
|
|
|
```hcl
|
|
output "vpc_id" {
|
|
description = "ID of the VPC"
|
|
value = aws_vpc.main.id
|
|
}
|
|
|
|
output "private_subnet_ids" {
|
|
description = "List of private subnet IDs for deploying workloads"
|
|
value = aws_subnet.private[*].id
|
|
}
|
|
|
|
output "nat_gateway_ips" {
|
|
description = "Elastic IPs of NAT gateways for firewall whitelisting"
|
|
value = aws_eip.nat[*].public_ip
|
|
}
|
|
```
|
|
|
|
### Module Versioning
|
|
|
|
**Use Git Tags for Versioning:**
|
|
```hcl
|
|
module "vpc" {
|
|
source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.2.3"
|
|
# Configuration...
|
|
}
|
|
```
|
|
|
|
**Semantic Versioning:**
|
|
- v1.0.0 → First stable release
|
|
- v1.1.0 → New features (backward compatible)
|
|
- v1.1.1 → Bug fixes
|
|
- v2.0.0 → Breaking changes
|
|
|
|
---
|
|
|
|
## Variable Management
|
|
|
|
### Variable Declaration
|
|
|
|
**Always Include:**
|
|
```hcl
|
|
variable "environment" {
|
|
type = string
|
|
description = "Environment name (dev, staging, prod)"
|
|
|
|
validation {
|
|
condition = contains(["dev", "staging", "prod"], var.environment)
|
|
error_message = "Environment must be dev, staging, or prod."
|
|
}
|
|
}
|
|
```
|
|
|
|
### Variable Files Hierarchy
|
|
|
|
```
|
|
terraform.tfvars # Default values (committed, no secrets)
|
|
dev.tfvars # Dev overrides
|
|
prod.tfvars # Prod overrides
|
|
secrets.auto.tfvars # Auto-loaded (in .gitignore)
|
|
```
|
|
|
|
**Usage:**
|
|
```bash
|
|
terraform apply -var-file="prod.tfvars"
|
|
```
|
|
|
|
### Sensitive Variables
|
|
|
|
**Mark as Sensitive:**
|
|
```hcl
|
|
variable "database_password" {
|
|
type = string
|
|
description = "Master password for database"
|
|
sensitive = true
|
|
}
|
|
```
|
|
|
|
**Never commit secrets:**
|
|
```bash
|
|
# .gitignore
|
|
*.auto.tfvars
|
|
secrets.tfvars
|
|
terraform.tfvars # If contains secrets
|
|
```
|
|
|
|
**Better: Use External Secret Management**
|
|
```hcl
|
|
data "aws_secretsmanager_secret_version" "db_password" {
|
|
secret_id = "prod/database/master-password"
|
|
}
|
|
|
|
resource "aws_db_instance" "main" {
|
|
password = data.aws_secretsmanager_secret_version.db_password.secret_string
|
|
}
|
|
```
|
|
|
|
### Variable Organization
|
|
|
|
**Group related variables:**
|
|
```hcl
|
|
# Network Configuration
|
|
variable "vpc_cidr" { }
|
|
variable "availability_zones" { }
|
|
variable "public_subnet_cidrs" { }
|
|
variable "private_subnet_cidrs" { }
|
|
|
|
# Application Configuration
|
|
variable "app_name" { }
|
|
variable "app_version" { }
|
|
variable "instance_count" { }
|
|
|
|
# Tagging
|
|
variable "tags" {
|
|
type = map(string)
|
|
description = "Common tags for all resources"
|
|
default = {}
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Resource Naming
|
|
|
|
### Naming Conventions
|
|
|
|
**Terraform Resources (snake_case):**
|
|
```hcl
|
|
resource "aws_vpc" "main_vpc" { }
|
|
resource "aws_subnet" "public_subnet_az1" { }
|
|
resource "aws_instance" "web_server_01" { }
|
|
```
|
|
|
|
**AWS Resource Names (kebab-case):**
|
|
```hcl
|
|
resource "aws_s3_bucket" "logs" {
|
|
bucket = "company-prod-application-logs"
|
|
# company-{env}-{service}-{purpose}
|
|
}
|
|
|
|
resource "aws_instance" "web" {
|
|
tags = {
|
|
Name = "prod-web-server-01"
|
|
# {env}-{service}-{type}-{number}
|
|
}
|
|
}
|
|
```
|
|
|
|
### Naming Standards
|
|
|
|
**Pattern: `{company}-{environment}-{service}-{resource_type}`**
|
|
|
|
Examples:
|
|
- `acme-prod-api-alb`
|
|
- `acme-dev-workers-asg`
|
|
- `acme-staging-database-rds`
|
|
|
|
**Benefits:**
|
|
- Easy filtering in AWS console
|
|
- Clear ownership and purpose
|
|
- Consistent across environments
|
|
- Billing and cost tracking
|
|
|
|
---
|
|
|
|
## Security Practices
|
|
|
|
### 1. Principle of Least Privilege
|
|
|
|
```hcl
|
|
# Bad - Too permissive
|
|
resource "aws_iam_policy" "bad" {
|
|
policy = jsonencode({
|
|
Statement = [{
|
|
Effect = "Allow"
|
|
Action = "*"
|
|
Resource = "*"
|
|
}]
|
|
})
|
|
}
|
|
|
|
# Good - Specific permissions
|
|
resource "aws_iam_policy" "good" {
|
|
policy = jsonencode({
|
|
Statement = [{
|
|
Effect = "Allow"
|
|
Action = [
|
|
"s3:GetObject",
|
|
"s3:PutObject"
|
|
]
|
|
Resource = "arn:aws:s3:::my-bucket/*"
|
|
}]
|
|
})
|
|
}
|
|
```
|
|
|
|
### 2. Encryption Everywhere
|
|
|
|
```hcl
|
|
# Encrypt S3 buckets
|
|
resource "aws_s3_bucket" "secure" {
|
|
bucket = "my-secure-bucket"
|
|
}
|
|
|
|
resource "aws_s3_bucket_server_side_encryption_configuration" "secure" {
|
|
bucket = aws_s3_bucket.secure.id
|
|
|
|
rule {
|
|
apply_server_side_encryption_by_default {
|
|
sse_algorithm = "aws:kms"
|
|
kms_master_key_id = aws_kms_key.bucket.arn
|
|
}
|
|
}
|
|
}
|
|
|
|
# Encrypt EBS volumes
|
|
resource "aws_instance" "secure" {
|
|
root_block_device {
|
|
encrypted = true
|
|
}
|
|
}
|
|
|
|
# Encrypt RDS databases
|
|
resource "aws_db_instance" "secure" {
|
|
storage_encrypted = true
|
|
kms_key_id = aws_kms_key.rds.arn
|
|
}
|
|
```
|
|
|
|
### 3. Network Security
|
|
|
|
```hcl
|
|
# Restrictive security groups
|
|
resource "aws_security_group" "web" {
|
|
name_prefix = "web-"
|
|
|
|
# Only allow specific inbound
|
|
ingress {
|
|
from_port = 443
|
|
to_port = 443
|
|
protocol = "tcp"
|
|
cidr_blocks = ["0.0.0.0/0"] # Consider restricting further
|
|
}
|
|
|
|
# Explicit outbound
|
|
egress {
|
|
from_port = 443
|
|
to_port = 443
|
|
protocol = "tcp"
|
|
cidr_blocks = ["0.0.0.0/0"]
|
|
}
|
|
}
|
|
|
|
# Use private subnets for workloads
|
|
resource "aws_subnet" "private" {
|
|
map_public_ip_on_launch = false # No public IPs
|
|
}
|
|
```
|
|
|
|
### 4. Secret Management
|
|
|
|
**Never in Code:**
|
|
```hcl
|
|
# ❌ NEVER DO THIS
|
|
resource "aws_db_instance" "bad" {
|
|
password = "MySecretPassword123" # NEVER!
|
|
}
|
|
```
|
|
|
|
**Use AWS Secrets Manager:**
|
|
```hcl
|
|
# ✅ CORRECT APPROACH
|
|
data "aws_secretsmanager_secret_version" "db" {
|
|
secret_id = var.db_secret_arn
|
|
}
|
|
|
|
resource "aws_db_instance" "good" {
|
|
password = data.aws_secretsmanager_secret_version.db.secret_string
|
|
}
|
|
```
|
|
|
|
### 5. Resource Tagging
|
|
|
|
```hcl
|
|
locals {
|
|
common_tags = {
|
|
Environment = var.environment
|
|
ManagedBy = "Terraform"
|
|
Owner = "platform-team"
|
|
Project = var.project_name
|
|
CostCenter = var.cost_center
|
|
}
|
|
}
|
|
|
|
resource "aws_instance" "web" {
|
|
tags = merge(
|
|
local.common_tags,
|
|
{
|
|
Name = "web-server"
|
|
Role = "webserver"
|
|
}
|
|
)
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## Testing & Validation
|
|
|
|
### Pre-Deployment Validation
|
|
|
|
**1. Terraform Validate**
|
|
```bash
|
|
terraform validate
|
|
```
|
|
Checks syntax and configuration validity.
|
|
|
|
**2. Terraform Plan**
|
|
```bash
|
|
terraform plan -out=tfplan
|
|
```
|
|
Review changes before applying.
|
|
|
|
**3. tflint**
|
|
```bash
|
|
tflint --module
|
|
```
|
|
Linter for catching errors and enforcing conventions.
|
|
|
|
**4. checkov**
|
|
```bash
|
|
checkov -d .
|
|
```
|
|
Security and compliance scanning.
|
|
|
|
**5. terraform-docs**
|
|
```bash
|
|
terraform-docs markdown . > README.md
|
|
```
|
|
Auto-generate documentation.
|
|
|
|
### Automated Testing
|
|
|
|
**Terratest (Go):**
|
|
```go
|
|
func TestVPCCreation(t *testing.T) {
|
|
terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
|
|
TerraformDir: "../examples/complete",
|
|
})
|
|
|
|
defer terraform.Destroy(t, terraformOptions)
|
|
terraform.InitAndApply(t, terraformOptions)
|
|
|
|
vpcId := terraform.Output(t, terraformOptions, "vpc_id")
|
|
assert.NotEmpty(t, vpcId)
|
|
}
|
|
```
|
|
|
|
---
|
|
|
|
## CI/CD Integration
|
|
|
|
### GitHub Actions Example
|
|
|
|
```yaml
|
|
name: Terraform
|
|
|
|
on:
|
|
pull_request:
|
|
branches: [main]
|
|
push:
|
|
branches: [main]
|
|
|
|
jobs:
|
|
terraform:
|
|
runs-on: ubuntu-latest
|
|
steps:
|
|
- uses: actions/checkout@v3
|
|
|
|
- name: Setup Terraform
|
|
uses: hashicorp/setup-terraform@v2
|
|
|
|
- name: Terraform Init
|
|
run: terraform init
|
|
|
|
- name: Terraform Validate
|
|
run: terraform validate
|
|
|
|
- name: Terraform Plan
|
|
run: terraform plan -no-color
|
|
if: github.event_name == 'pull_request'
|
|
|
|
- name: Terraform Apply
|
|
run: terraform apply -auto-approve
|
|
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
|
|
```
|
|
|
|
### Best Practices for CI/CD
|
|
|
|
1. **Always run plan on PRs** - Review changes before merge
|
|
2. **Require approvals** - Human review for production
|
|
3. **Use workspaces or directories** - Separate pipeline per environment
|
|
4. **Store state remotely** - S3 backend with locking
|
|
5. **Use credential management** - OIDC or IAM roles, never store credentials
|
|
6. **Run security scans** - checkov, tfsec in pipeline
|
|
7. **Tag releases** - Version your infrastructure code
|
|
|
|
---
|
|
|
|
## Common Pitfalls to Avoid
|
|
|
|
### 1. Not Using Remote State
|
|
- ❌ Local state doesn't work for teams
|
|
- ✅ Use S3, Terraform Cloud, or other remote backend
|
|
|
|
### 2. Hardcoding Values
|
|
- ❌ `region = "us-east-1"` in every resource
|
|
- ✅ Use variables and locals
|
|
|
|
### 3. Not Using Modules
|
|
- ❌ Copying code between environments
|
|
- ✅ Create reusable modules
|
|
|
|
### 4. Ignoring State
|
|
- ❌ Manually modifying infrastructure
|
|
- ✅ All changes through Terraform
|
|
|
|
### 5. Poor Naming
|
|
- ❌ `resource "aws_instance" "i1" { }`
|
|
- ✅ `resource "aws_instance" "web_server_01" { }`
|
|
|
|
### 6. No Documentation
|
|
- ❌ No README, no comments
|
|
- ✅ Document everything
|
|
|
|
### 7. Massive State Files
|
|
- ❌ Single state for entire infrastructure
|
|
- ✅ Break into logical components
|
|
|
|
### 8. No Testing
|
|
- ❌ Apply directly to production
|
|
- ✅ Test in dev/staging first
|
|
|
|
---
|
|
|
|
## Quick Reference
|
|
|
|
### Essential Commands
|
|
```bash
|
|
# Initialize
|
|
terraform init
|
|
|
|
# Validate configuration
|
|
terraform validate
|
|
|
|
# Format code
|
|
terraform fmt -recursive
|
|
|
|
# Plan changes
|
|
terraform plan
|
|
|
|
# Apply changes
|
|
terraform apply
|
|
|
|
# Destroy resources
|
|
terraform destroy
|
|
|
|
# Show current state
|
|
terraform show
|
|
|
|
# List resources
|
|
terraform state list
|
|
|
|
# Output values
|
|
terraform output
|
|
```
|
|
|
|
### Useful Flags
|
|
```bash
|
|
# Plan without color
|
|
terraform plan -no-color
|
|
|
|
# Apply without prompts
|
|
terraform apply -auto-approve
|
|
|
|
# Destroy specific resource
|
|
terraform destroy -target=aws_instance.example
|
|
|
|
# Use specific var file
|
|
terraform apply -var-file="prod.tfvars"
|
|
|
|
# Set variable via CLI
|
|
terraform apply -var="environment=prod"
|
|
```
|