2025-11-29 17:51:17 +08:00

Terraform Best Practices

Comprehensive guide to Terraform best practices for infrastructure as code.

Table of Contents

  1. Project Structure
  2. State Management
  3. Module Design
  4. Variable Management
  5. Resource Naming
  6. Security Practices
  7. Testing & Validation
  8. CI/CD Integration
  9. Common Pitfalls to Avoid
  10. Quick Reference

Project Structure

terraform-project/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   └── prod/
├── modules/
│   ├── networking/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   └── README.md
│   ├── compute/
│   └── database/
├── global/
│   ├── iam/
│   └── dns/
└── README.md

Key Principles

Separate Environments

  • Use directories for each environment (dev, staging, prod)
  • Each environment has its own state file
  • Prevents accidental changes to wrong environment

Reusable Modules

  • Common infrastructure patterns in modules/
  • Modules are versioned and tested
  • Used across multiple environments

Global Resources

  • Resources shared across environments (IAM, DNS)
  • Separate state for better isolation
  • Carefully managed with extra review
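
As a minimal sketch of how an environment directory might consume the shared modules, the following could live in environments/prod/main.tf. The module input names (environment, vpc_cidr, vpc_id, subnet_ids) and the output names are assumptions for illustration; the real modules define their own interface.

```hcl
# environments/prod/main.tf (illustrative; input and output names are hypothetical)

module "networking" {
  source = "../../modules/networking"

  environment = "prod"
  vpc_cidr    = "10.0.0.0/16"
}

module "compute" {
  source = "../../modules/compute"

  environment = "prod"

  # Wire the networking outputs into the compute module
  vpc_id     = module.networking.vpc_id
  subnet_ids = module.networking.private_subnet_ids
}
```

The dev and staging directories would call the same modules with different values, so each environment keeps its own backend.tf and state file.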

State Management

Remote State is Essential

Why Remote State:

  • Team collaboration and locking
  • State backup and versioning
  • Encryption and access control for secrets stored in state
  • Disaster recovery

Recommended Backend: S3 + DynamoDB

terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-lock"
    kms_key_id     = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY_ID"
  }
}

State Best Practices:

  1. Enable Encryption: Always encrypt state at rest
  2. Enable Versioning: On S3 bucket for state recovery
  3. Use State Locking: DynamoDB table prevents concurrent modifications
  4. Restrict Access: IAM policies limiting who can read/write state
  5. Separate State Files: Different states for different components
  6. Regular Backups: Automated backups of state files
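
The backend itself has to exist before anything can use it. A minimal sketch of the backing resources, assuming the bucket and table names from the backend block above and that these are managed in a small, separate bootstrap configuration:

```hcl
# Bootstrap resources for the S3 backend (sketch; names mirror the backend block above)

resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-state"

  # Guard against accidental deletion of the state bucket
  lifecycle {
    prevent_destroy = true
  }
}

# Versioning allows recovery of earlier state revisions
resource "aws_s3_bucket_versioning" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encrypt state at rest
resource "aws_s3_bucket_server_side_encryption_configuration" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

# State must never be publicly readable
resource "aws_s3_bucket_public_access_block" "terraform_state" {
  bucket = aws_s3_bucket.terraform_state.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table; the S3 backend expects a string hash key named LockID
resource "aws_dynamodb_table" "terraform_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID"

  attribute {
    name = "LockID"
    type = "S"
  }
}
```

Terraform 1.10 and later can also lock through S3 itself by setting use_lockfile = true in the backend block, in which case the DynamoDB table is not needed.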

State File Organization

Bad - Single State:

terraform.tfstate  (contains everything)

Good - Multiple States:

networking/terraform.tfstate
compute/terraform.tfstate
database/terraform.tfstate

Benefits:

  • Reduced blast radius
  • Faster plan/apply operations
  • Parallel team work
  • Easier to understand and debug
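
Splitting state means components need a way to read each other's outputs. One option is the terraform_remote_state data source. This is a sketch that assumes the networking state lives at the key shown in the backend example and exposes a private_subnet_ids output; var.ami_id is a hypothetical input.

```hcl
# compute/main.tf (sketch): read outputs from the networking state

variable "ami_id" {
  type        = string
  description = "AMI for the application instance (hypothetical input)"
}

data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "company-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "app" {
  ami           = var.ami_id
  instance_type = "t3.micro"

  # Only root-module outputs of the networking configuration are visible here
  subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_ids[0]
}
```

Reading remote state requires read access to the whole state object, so teams that want tighter boundaries sometimes publish the needed values through SSM parameters or similar instead.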

State Management Commands

# List resources in state
terraform state list

# Show specific resource
terraform state show aws_instance.example

# Move resource to different address
terraform state mv aws_instance.old aws_instance.new

# Remove resource from state (doesn't destroy)
terraform state rm aws_instance.example

# Import existing resource
terraform import aws_instance.example i-1234567890abcdef0

# Pull state for inspection (read-only; the output can contain secrets, so do not commit it)
terraform state pull > state.json
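
Several of these commands have declarative counterparts that go through the normal plan and review cycle. The sketch below assumes a recent Terraform release (moved blocks need 1.1+, import blocks 1.5+, removed blocks 1.7+) and uses hypothetical resource addresses; the matching resource blocks for aws_instance.new and aws_instance.imported are assumed to exist elsewhere in the configuration.

```hcl
# Equivalent of: terraform state mv aws_instance.old aws_instance.new
moved {
  from = aws_instance.old
  to   = aws_instance.new
}

# Equivalent of: terraform import aws_instance.imported i-1234567890abcdef0
import {
  to = aws_instance.imported
  id = "i-1234567890abcdef0"
}

# Equivalent of: terraform state rm aws_instance.legacy
# The resource block for aws_instance.legacy must be deleted from the configuration.
removed {
  from = aws_instance.legacy

  lifecycle {
    destroy = false # forget the resource without destroying it
  }
}
```

Because these blocks live in code, the state change shows up in terraform plan and can be reviewed in a pull request like any other change.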

Module Design

Module Structure

Every module should have:

module-name/
├── main.tf          # Primary resources
├── variables.tf     # Input variables
├── outputs.tf       # Output values
├── versions.tf      # Version constraints
├── README.md        # Documentation
└── examples/        # Usage examples
    └── complete/
        ├── main.tf
        └── variables.tf
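
A minimal sketch of what versions.tf might contain. The specific version numbers are assumptions; pick the minimums the module has actually been tested against.

```hcl
# versions.tf (sketch; version numbers are illustrative)
terraform {
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 5.0"
    }
  }
}
```

Reusable modules usually declare a minimum version only, and leave upper bounds and exact pins to the root configuration that calls them.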

Module Best Practices

1. Single Responsibility. Each module should do one thing well:

  • ✅ vpc-module creates VPC with subnets, route tables, NACLs
  • ❌ infrastructure creates VPC, EC2, RDS, S3, everything

2. Composability. Modules should work together, with one module's outputs feeding another's inputs:

module "vpc" {
  source = "./modules/vpc"
  cidr   = "10.0.0.0/16"
}

module "eks" {
  source     = "./modules/eks"
  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids
}

3. Sensible Defaults

variable "instance_type" {
  type        = string
  description = "EC2 instance type"
  default     = "t3.micro"  # Reasonable default
}

variable "enable_monitoring" {
  type        = bool
  description = "Enable detailed monitoring"
  default     = false  # Cost-effective default
}

4. Complete Documentation

variable "vpc_cidr" {
  type        = string
  description = "CIDR block for VPC. Must be a valid IPv4 CIDR."

  validation {
    # cidrnetmask only accepts IPv4 prefixes, so IPv6 input is rejected as well
    condition     = can(cidrnetmask(var.vpc_cidr))
    error_message = "Must be a valid IPv4 CIDR block."
  }
}

5. Output Useful Values

output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "List of private subnet IDs for deploying workloads"
  value       = aws_subnet.private[*].id
}

output "nat_gateway_ips" {
  description = "Elastic IPs of NAT gateways for firewall whitelisting"
  value       = aws_eip.nat[*].public_ip
}

Module Versioning

Use Git Tags for Versioning:

module "vpc" {
  source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.2.3"
  # Configuration...
}

Semantic Versioning:

  • v1.0.0 → First stable release
  • v1.1.0 → New features (backward compatible)
  • v1.1.1 → Bug fixes
  • v2.0.0 → Breaking changes
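
Git sources pin one exact ref. Modules published to a registry can instead use the version argument with a constraint, which is where semantic versioning pays off. A sketch using the public terraform-aws-modules/vpc/aws module as an example; the name and cidr values are placeholders.

```hcl
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 5.0" # any 5.x release, never 6.0.0 or later

  name = "example"
  cidr = "10.0.0.0/16"
}
```

The version argument only applies to registry sources. For Git sources the ref query parameter remains the way to pin.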

Variable Management

Variable Declaration

Always include a type, a description, and a validation block where the set of valid values is known:

variable "environment" {
  type        = string
  description = "Environment name (dev, staging, prod)"
  
  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

Variable Files Hierarchy

terraform.tfvars        # Default values (committed, no secrets)
dev.tfvars             # Dev overrides
prod.tfvars            # Prod overrides  
secrets.auto.tfvars    # Auto-loaded (in .gitignore)

Usage:

terraform apply -var-file="prod.tfvars"

Sensitive Variables

Mark as Sensitive:

variable "database_password" {
  type        = string
  description = "Master password for database"
  sensitive   = true
}
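
Sensitivity propagates: any output derived from a sensitive variable must be marked sensitive too, or Terraform rejects the configuration. A minimal sketch, assuming the database_password variable declared above and a hypothetical database_host input:

```hcl
variable "database_host" {
  type        = string
  description = "Database hostname (hypothetical input for this sketch)"
}

output "database_url" {
  description = "Connection string for the application"
  value       = "postgres://app:${var.database_password}@${var.database_host}:5432/app"

  # Required because the value is derived from a sensitive variable
  sensitive = true
}
```

Marking a value sensitive only redacts it from plan and apply output. It is still stored in the state file, which is one more reason to encrypt and restrict access to state.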

Never commit secrets:

# .gitignore
*.auto.tfvars
secrets.tfvars
# Ignore terraform.tfvars only if it contains secrets
terraform.tfvars

Better: Use External Secret Management

data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/database/master-password"
}

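resource "aws_db_instance" "main" {
  # Other required arguments omitted for brevity.
  # The resolved password is still written to Terraform state, so protect the state file.
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}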

Variable Organization

Group related variables:

# Network Configuration
variable "vpc_cidr" { type = string }
variable "availability_zones" { type = list(string) }
variable "public_subnet_cidrs" { type = list(string) }
variable "private_subnet_cidrs" { type = list(string) }

# Application Configuration
variable "app_name" { type = string }
variable "app_version" { type = string }
variable "instance_count" { type = number }

# Tagging
variable "tags" {
  type        = map(string)
  description = "Common tags for all resources"
  default     = {}
}
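
Related variables can also be grouped into a single object-typed variable, so the grouping is enforced by the type system instead of by comments. This is a sketch of an alternative to the flat declarations above, not a replacement for them. The enable_nat_gateway attribute is a hypothetical addition, and optional() with a default needs Terraform 1.3 or later.

```hcl
variable "network" {
  description = "Network layout for the environment"

  type = object({
    vpc_cidr             = string
    availability_zones   = list(string)
    public_subnet_cidrs  = list(string)
    private_subnet_cidrs = list(string)

    # Callers may omit this attribute; it then defaults to true
    enable_nat_gateway = optional(bool, true)
  })
}
```

Usage inside the module then reads as var.network.vpc_cidr, var.network.availability_zones, and so on.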

Resource Naming

Naming Conventions

Terraform Resources (snake_case):

resource "aws_vpc" "main_vpc" { }
resource "aws_subnet" "public_subnet_az1" { }
resource "aws_instance" "web_server_01" { }

AWS Resource Names (kebab-case):

resource "aws_s3_bucket" "logs" {
  bucket = "company-prod-application-logs"
  # company-{env}-{service}-{purpose}
}

resource "aws_instance" "web" {
  tags = {
    Name = "prod-web-server-01"
    # {env}-{service}-{type}-{number}
  }
}

Naming Standards

Pattern: {company}-{environment}-{service}-{resource_type}

Examples:

  • acme-prod-api-alb
  • acme-dev-workers-asg
  • acme-staging-database-rds

Benefits:

  • Easy filtering in AWS console
  • Clear ownership and purpose
  • Consistent across environments
  • Billing and cost tracking
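
The pattern is easier to keep consistent when names are built in one place. A minimal sketch using locals; the variable names company, environment, and service are assumptions.

```hcl
variable "company" {
  type = string
}

variable "environment" {
  type = string
}

variable "service" {
  type = string
}

locals {
  # {company}-{environment}-{service}
  name_prefix = "${var.company}-${var.environment}-${var.service}"

  # Append the resource type suffix per resource
  names = {
    alb = "${local.name_prefix}-alb"
    asg = "${local.name_prefix}-asg"
    rds = "${local.name_prefix}-rds"
  }
}

output "resource_names" {
  description = "Generated names following the naming standard"
  value       = local.names
}
```

With company = "acme", environment = "prod", and service = "api", local.names.alb evaluates to "acme-prod-api-alb", matching the first example above.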

Security Practices

1. Principle of Least Privilege

# Bad - Too permissive
resource "aws_iam_policy" "bad" {
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect   = "Allow"
      Action   = "*"
      Resource = "*"
    }]
  })
}

# Good - Specific permissions
resource "aws_iam_policy" "good" {
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = [
        "s3:GetObject",
        "s3:PutObject"
      ]
      Resource = "arn:aws:s3:::my-bucket/*"
    }]
  })
}

2. Encryption Everywhere

# Encrypt S3 buckets
resource "aws_s3_bucket" "secure" {
  bucket = "my-secure-bucket"
}

resource "aws_s3_bucket_server_side_encryption_configuration" "secure" {
  bucket = aws_s3_bucket.secure.id
  
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = aws_kms_key.bucket.arn
    }
  }
}

# Encrypt EBS volumes
resource "aws_instance" "secure" {
  root_block_device {
    encrypted = true
  }
}

# Encrypt RDS databases
resource "aws_db_instance" "secure" {
  storage_encrypted = true
  kms_key_id        = aws_kms_key.rds.arn
}
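
Per-resource flags are easy to forget. For EBS, encryption can also be switched on as an account-level default for the current region. A sketch, assuming a dedicated KMS key is wanted; the resource names are hypothetical.

```hcl
resource "aws_kms_key" "ebs" {
  description         = "Default key for EBS volume encryption"
  enable_key_rotation = true
}

# Encrypt every new EBS volume in this account and region by default
resource "aws_ebs_encryption_by_default" "this" {
  enabled = true
}

# Use the customer-managed key instead of the AWS-managed default key
resource "aws_ebs_default_kms_key" "this" {
  key_arn = aws_kms_key.ebs.arn
}
```

These settings are regional, so a multi-region setup would need them once per provider alias.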

3. Network Security

# Restrictive security groups
resource "aws_security_group" "web" {
  name_prefix = "web-"
  
  # Only allow specific inbound
  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # Consider restricting further
  }
  
  # Explicit outbound
  egress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

# Use private subnets for workloads
resource "aws_subnet" "private" {
  map_public_ip_on_launch = false  # No public IPs
}
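
Where traffic comes from another tier and not from the internet, a security group reference is tighter than a CIDR range. A minimal sketch, assuming a load balancer in front of an application listening on port 8080; the vpc_id variable and the port are assumptions.

```hcl
variable "vpc_id" {
  type        = string
  description = "VPC in which to create the security groups"
}

resource "aws_security_group" "alb" {
  name_prefix = "alb-"
  vpc_id      = var.vpc_id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]
  }
}

resource "aws_security_group" "app" {
  name_prefix = "app-"
  vpc_id      = var.vpc_id

  # Only the load balancer's security group may reach the application port
  ingress {
    from_port       = 8080
    to_port         = 8080
    protocol        = "tcp"
    security_groups = [aws_security_group.alb.id]
  }
}
```

The application tier then accepts no direct traffic, even from other hosts inside the VPC.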

4. Secret Management

Never in Code:

# ❌ NEVER DO THIS
resource "aws_db_instance" "bad" {
  password = "MySecretPassword123"  # NEVER!
}

Use AWS Secrets Manager:

# ✅ CORRECT APPROACH
data "aws_secretsmanager_secret_version" "db" {
  secret_id = var.db_secret_arn
}

resource "aws_db_instance" "good" {
  password = data.aws_secretsmanager_secret_version.db.secret_string
}

5. Resource Tagging

locals {
  common_tags = {
    Environment = var.environment
    ManagedBy   = "Terraform"
    Owner       = "platform-team"
    Project     = var.project_name
    CostCenter  = var.cost_center
  }
}

resource "aws_instance" "web" {
  tags = merge(
    local.common_tags,
    {
      Name = "web-server"
      Role = "webserver"
    }
  )
}
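
The AWS provider can also apply common tags itself through default_tags, which removes the need to call merge() in every resource. A sketch reusing the tag values above; the region is an assumption.

```hcl
provider "aws" {
  region = "us-east-1"

  # Applied to every taggable resource created through this provider
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "Terraform"
      Owner       = "platform-team"
      Project     = var.project_name
      CostCenter  = var.cost_center
    }
  }
}

resource "aws_instance" "web" {
  # ami, instance_type, and other arguments omitted

  # Resource-level tags are merged with, and override, the provider defaults
  tags = {
    Name = "web-server"
    Role = "webserver"
  }
}
```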

Testing & Validation

Pre-Deployment Validation

1. Terraform Validate

terraform validate

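Checks syntax and internal consistency of the configuration. It does not contact provider APIs or remote state, so it cannot catch errors that only surface at plan or apply time.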

2. Terraform Plan

terraform plan -out=tfplan

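Review changes before applying. Applying the saved plan file with terraform apply tfplan ensures that exactly the reviewed changes are made.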

3. tflint

# Older tflint releases used --module for this
tflint --call-module-type=all

Linter for catching errors and enforcing conventions.

4. checkov

checkov -d .

Security and compliance scanning.

5. terraform-docs

terraform-docs markdown . > README.md

Auto-generate documentation.

Automated Testing

Terratest (Go):

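package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../examples/complete",
    })

    // Destroy the resources at the end of the test, even if an assertion fails
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)
}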
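
Terraform 1.6 and later also ship a built-in test framework, which keeps tests in HCL. A minimal sketch, assuming the module under test declares a vpc_cidr variable and an aws_vpc.main resource that uses it; the file name tests/vpc.tftest.hcl is hypothetical.

```hcl
# tests/vpc.tftest.hcl (sketch; run with `terraform test`)

variables {
  vpc_cidr = "10.0.0.0/16"
}

run "vpc_uses_requested_cidr" {
  # Plan only: nothing is created, so the test is fast and free
  command = plan

  assert {
    condition     = aws_vpc.main.cidr_block == "10.0.0.0/16"
    error_message = "VPC CIDR block does not match the vpc_cidr input."
  }
}
```

A plan-only run still needs provider credentials. Switching to command = apply creates real resources, which terraform test destroys when the run finishes.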

CI/CD Integration

GitHub Actions Example

name: Terraform

on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

jobs:
  terraform:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Terraform
        uses: hashicorp/setup-terraform@v3

      - name: Terraform Format Check
        run: terraform fmt -check -recursive

      - name: Terraform Init
        run: terraform init -input=false

      - name: Terraform Validate
        run: terraform validate

      - name: Terraform Plan
        run: terraform plan -no-color -input=false
        if: github.event_name == 'pull_request'

      # Gate this step behind a protected environment or required reviewers
      - name: Terraform Apply
        run: terraform apply -auto-approve -input=false
        if: github.event_name == 'push' && github.ref == 'refs/heads/main'

Best Practices for CI/CD

  1. Always run plan on PRs - Review changes before merge
  2. Require approvals - Human review for production
  3. Use workspaces or directories - Separate pipeline per environment
  4. Store state remotely - S3 backend with locking
  5. Use credential management - OIDC or IAM roles, never store long-lived credentials in the repository or CI settings
  6. Run security scans - checkov, tfsec (or its successor, Trivy) in pipeline
  7. Tag releases - Version your infrastructure code

Common Pitfalls to Avoid

1. Not Using Remote State

  • ❌ Local state doesn't work for teams
  • ✅ Use S3, Terraform Cloud, or other remote backend

2. Hardcoding Values

  • ❌ region = "us-east-1" in every resource
  • ✅ Use variables and locals

3. Not Using Modules

  • ❌ Copying code between environments
  • ✅ Create reusable modules

4. Ignoring State

  • ❌ Manually modifying infrastructure
  • ✅ All changes through Terraform

5. Poor Naming

  • ❌ resource "aws_instance" "i1" { }
  • ✅ resource "aws_instance" "web_server_01" { }

6. No Documentation

  • ❌ No README, no comments
  • ✅ Document everything

7. Massive State Files

  • ❌ Single state for entire infrastructure
  • ✅ Break into logical components

8. No Testing

  • ❌ Apply directly to production
  • ✅ Test in dev/staging first

Quick Reference

Essential Commands

# Initialize
terraform init

# Validate configuration
terraform validate

# Format code
terraform fmt -recursive

# Plan changes
terraform plan

# Apply changes
terraform apply

# Destroy resources
terraform destroy

# Show current state
terraform show

# List resources
terraform state list

# Output values
terraform output

Useful Flags

# Plan without color
terraform plan -no-color

# Apply without prompts (skips the interactive review; intended for automation)
terraform apply -auto-approve

# Destroy specific resource
terraform destroy -target=aws_instance.example

# Use specific var file
terraform apply -var-file="prod.tfvars"

# Set variable via CLI
terraform apply -var="environment=prod"