# Terraform Best Practices
Comprehensive guide to Terraform best practices for infrastructure as code.
## Table of Contents
1. [Project Structure](#project-structure)
2. [State Management](#state-management)
3. [Module Design](#module-design)
4. [Variable Management](#variable-management)
5. [Resource Naming](#resource-naming)
6. [Security Practices](#security-practices)
7. [Testing & Validation](#testing--validation)
8. [CI/CD Integration](#cicd-integration)
---
## Project Structure
### Recommended Directory Layout
```
terraform-project/
├── environments/
│   ├── dev/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── terraform.tfvars
│   │   └── backend.tf
│   ├── staging/
│   └── prod/
├── modules/
│   ├── networking/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   ├── versions.tf
│   │   └── README.md
│   ├── compute/
│   └── database/
├── global/
│   ├── iam/
│   └── dns/
└── README.md
```
### Key Principles
**Separate Environments**
- Use directories for each environment (dev, staging, prod)
- Each environment has its own state file
- Prevents accidental changes to wrong environment
**Reusable Modules**
- Common infrastructure patterns in modules/
- Modules are versioned and tested
- Used across multiple environments
**Global Resources**
- Resources shared across environments (IAM, DNS)
- Separate state for better isolation
- Carefully managed with extra review
---
## State Management
### Remote State is Essential
**Why Remote State:**
- Team collaboration and locking
- State backup and versioning
- Secure credential handling
- Disaster recovery
**Recommended Backend: S3 + DynamoDB**
```hcl
terraform {
backend "s3" {
bucket = "company-terraform-state"
key = "prod/networking/terraform.tfstate"
region = "us-east-1"
encrypt = true
dynamodb_table = "terraform-state-lock"
kms_key_id = "arn:aws:kms:us-east-1:ACCOUNT:key/KEY_ID"
}
}
```
**State Best Practices:**
1. **Enable Encryption**: Always encrypt state at rest
2. **Enable Versioning**: On S3 bucket for state recovery
3. **Use State Locking**: A DynamoDB table prevents concurrent modifications (backing resources sketched after this list)
4. **Restrict Access**: IAM policies limiting who can read/write state
5. **Separate State Files**: Different states for different components
6. **Regular Backups**: Automated backups of state files
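The bucket and lock table behind items 1-3 can themselves be managed by Terraform, usually in a small bootstrap configuration applied once. A minimal sketch; the bucket and table names are illustrative:
```hcl
resource "aws_s3_bucket" "tf_state" {
  bucket = "company-terraform-state" # Illustrative; must be globally unique
}

resource "aws_s3_bucket_versioning" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "tf_state" {
  bucket = aws_s3_bucket.tf_state.id
  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}

resource "aws_dynamodb_table" "tf_lock" {
  name         = "terraform-state-lock"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "LockID" # The S3 backend requires exactly this attribute name
  attribute {
    name = "LockID"
    type = "S"
  }
}
```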
### State File Organization
**Bad - Single State:**
```
terraform.tfstate (contains everything)
```
**Good - Multiple States:**
```
networking/terraform.tfstate
compute/terraform.tfstate
database/terraform.tfstate
```
**Benefits:**
- Reduced blast radius
- Faster plan/apply operations
- Parallel team work
- Easier to understand and debug
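When components live in separate states, one configuration can read another's outputs through the `terraform_remote_state` data source. A sketch, assuming the networking state publishes a `vpc_id` output:
```hcl
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "company-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_security_group" "app" {
  # Consume an output exported by the networking state
  vpc_id = data.terraform_remote_state.networking.outputs.vpc_id
}
```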
### State Management Commands
```bash
# List resources in state
terraform state list
# Show specific resource
terraform state show aws_instance.example
# Move resource to different address
terraform state mv aws_instance.old aws_instance.new
# Remove resource from state (doesn't destroy)
terraform state rm aws_instance.example
# Import existing resource
terraform import aws_instance.example i-1234567890abcdef0
# Pull state for inspection (read-only)
terraform state pull > state.json
```
---
## Module Design
### Module Structure
Every module should have:
```
module-name/
├── main.tf       # Primary resources
├── variables.tf  # Input variables
├── outputs.tf    # Output values
├── versions.tf   # Version constraints
├── README.md     # Documentation
└── examples/     # Usage examples
    └── complete/
        ├── main.tf
        └── variables.tf
```
### Module Best Practices
**1. Single Responsibility**
Each module should do one thing well:
- ✅ `vpc-module` creates a VPC with subnets, route tables, and NACLs
- ❌ `infrastructure` creates VPC, EC2, RDS, S3, and everything else
**2. Composability**
Modules should work together:
```hcl
module "vpc" {
source = "./modules/vpc"
cidr = "10.0.0.0/16"
}
module "eks" {
source = "./modules/eks"
vpc_id = module.vpc.vpc_id
subnet_ids = module.vpc.private_subnet_ids
}
```
**3. Sensible Defaults**
```hcl
variable "instance_type" {
type = string
description = "EC2 instance type"
default = "t3.micro" # Reasonable default
}
variable "enable_monitoring" {
type = bool
description = "Enable detailed monitoring"
default = false # Cost-effective default
}
```
**4. Complete Documentation**
```hcl
variable "vpc_cidr" {
type = string
description = "CIDR block for VPC. Must be a valid IPv4 CIDR."
validation {
condition = can(cidrhost(var.vpc_cidr, 0))
error_message = "Must be a valid IPv4 CIDR block."
}
}
```
**5. Output Useful Values**
```hcl
output "vpc_id" {
description = "ID of the VPC"
value = aws_vpc.main.id
}
output "private_subnet_ids" {
description = "List of private subnet IDs for deploying workloads"
value = aws_subnet.private[*].id
}
output "nat_gateway_ips" {
description = "Elastic IPs of NAT gateways for firewall whitelisting"
value = aws_eip.nat[*].public_ip
}
```
### Module Versioning
**Use Git Tags for Versioning:**
```hcl
module "vpc" {
source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.2.3"
# Configuration...
}
```
**Semantic Versioning:**
- v1.0.0 → First stable release
- v1.1.0 → New features (backward compatible)
- v1.1.1 → Bug fixes
- v2.0.0 → Breaking changes
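The `versions.tf` listed in the module structure is where a module pins the Terraform and provider versions it is tested against; a typical sketch:
```hcl
terraform {
  required_version = ">= 1.3.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = ">= 4.0, < 6.0" # Tested range; widen deliberately
    }
  }
}
```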
---
## Variable Management
### Variable Declaration
**Always Include:**
```hcl
variable "environment" {
type = string
description = "Environment name (dev, staging, prod)"
validation {
condition = contains(["dev", "staging", "prod"], var.environment)
error_message = "Environment must be dev, staging, or prod."
}
}
```
### Variable Files Hierarchy
```
terraform.tfvars # Default values (committed, no secrets)
dev.tfvars # Dev overrides
prod.tfvars # Prod overrides
secrets.auto.tfvars # Auto-loaded (in .gitignore)
```
**Usage:**
```bash
terraform apply -var-file="prod.tfvars"
```
### Sensitive Variables
**Mark as Sensitive:**
```hcl
variable "database_password" {
type = string
description = "Master password for database"
sensitive = true
}
```
**Never commit secrets:**
```bash
# .gitignore
*.auto.tfvars
secrets.tfvars
terraform.tfvars # If contains secrets
```
**Better: Use External Secret Management**
```hcl
data "aws_secretsmanager_secret_version" "db_password" {
secret_id = "prod/database/master-password"
}
resource "aws_db_instance" "main" {
password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
```
### Variable Organization
**Group related variables:**
```hcl
# Network Configuration
variable "vpc_cidr" { }
variable "availability_zones" { }
variable "public_subnet_cidrs" { }
variable "private_subnet_cidrs" { }
# Application Configuration
variable "app_name" { }
variable "app_version" { }
variable "instance_count" { }
# Tagging
variable "tags" {
type = map(string)
description = "Common tags for all resources"
default = {}
}
```
---
## Resource Naming
### Naming Conventions
**Terraform Resources (snake_case):**
```hcl
resource "aws_vpc" "main_vpc" { }
resource "aws_subnet" "public_subnet_az1" { }
resource "aws_instance" "web_server_01" { }
```
**AWS Resource Names (kebab-case):**
```hcl
resource "aws_s3_bucket" "logs" {
bucket = "company-prod-application-logs"
# company-{env}-{service}-{purpose}
}
resource "aws_instance" "web" {
tags = {
Name = "prod-web-server-01"
# {env}-{service}-{type}-{number}
}
}
```
### Naming Standards
**Pattern: `{company}-{environment}-{service}-{resource_type}`**
Examples:
- `acme-prod-api-alb`
- `acme-dev-workers-asg`
- `acme-staging-database-rds`
**Benefits:**
- Easy filtering in AWS console
- Clear ownership and purpose
- Consistent across environments
- Billing and cost tracking
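One way to keep the pattern consistent is to assemble names from a shared local instead of typing them by hand. A sketch; the `company` and `service` variables are illustrative:
```hcl
locals {
  # {company}-{environment}-{service}
  name_prefix = "${var.company}-${var.environment}-${var.service}"
}

resource "aws_lb" "api" {
  name = "${local.name_prefix}-alb" # e.g. acme-prod-api-alb
  # ...
}
```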
---
## Security Practices
### 1. Principle of Least Privilege
```hcl
# Bad - Too permissive
resource "aws_iam_policy" "bad" {
policy = jsonencode({
Statement = [{
Effect = "Allow"
Action = "*"
Resource = "*"
}]
})
}
# Good - Specific permissions
resource "aws_iam_policy" "good" {
policy = jsonencode({
Statement = [{
Effect = "Allow"
Action = [
"s3:GetObject",
"s3:PutObject"
]
Resource = "arn:aws:s3:::my-bucket/*"
}]
})
}
```
### 2. Encryption Everywhere
```hcl
# Encrypt S3 buckets
resource "aws_s3_bucket" "secure" {
bucket = "my-secure-bucket"
}
resource "aws_s3_bucket_server_side_encryption_configuration" "secure" {
bucket = aws_s3_bucket.secure.id
rule {
apply_server_side_encryption_by_default {
sse_algorithm = "aws:kms"
kms_master_key_id = aws_kms_key.bucket.arn
}
}
}
# Encrypt EBS volumes
resource "aws_instance" "secure" {
root_block_device {
encrypted = true
}
}
# Encrypt RDS databases
resource "aws_db_instance" "secure" {
storage_encrypted = true
kms_key_id = aws_kms_key.rds.arn
}
```
### 3. Network Security
```hcl
# Restrictive security groups
resource "aws_security_group" "web" {
name_prefix = "web-"
# Only allow specific inbound
ingress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"] # Consider restricting further
}
# Explicit outbound
egress {
from_port = 443
to_port = 443
protocol = "tcp"
cidr_blocks = ["0.0.0.0/0"]
}
}
# Use private subnets for workloads
resource "aws_subnet" "private" {
map_public_ip_on_launch = false # No public IPs
}
```
### 4. Secret Management
**Never in Code:**
```hcl
# ❌ NEVER DO THIS
resource "aws_db_instance" "bad" {
password = "MySecretPassword123" # NEVER!
}
```
**Use AWS Secrets Manager:**
```hcl
# ✅ CORRECT APPROACH
data "aws_secretsmanager_secret_version" "db" {
secret_id = var.db_secret_arn
}
resource "aws_db_instance" "good" {
password = data.aws_secretsmanager_secret_version.db.secret_string
}
```
### 5. Resource Tagging
```hcl
locals {
common_tags = {
Environment = var.environment
ManagedBy = "Terraform"
Owner = "platform-team"
Project = var.project_name
CostCenter = var.cost_center
}
}
resource "aws_instance" "web" {
tags = merge(
local.common_tags,
{
Name = "web-server"
Role = "webserver"
}
)
}
```
---
## Testing & Validation
### Pre-Deployment Validation
**1. Terraform Validate**
```bash
terraform validate
```
Checks syntax and configuration validity.
**2. Terraform Plan**
```bash
terraform plan -out=tfplan
```
Review changes before applying.
**3. tflint**
```bash
tflint --module
```
Linter for catching errors and enforcing conventions.
**4. checkov**
```bash
checkov -d .
```
Security and compliance scanning.
**5. terraform-docs**
```bash
terraform-docs markdown . > README.md
```
Auto-generate documentation.
### Automated Testing
**Terratest (Go):**
```go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/stretchr/testify/assert"
)

func TestVPCCreation(t *testing.T) {
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../examples/complete",
    })
    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)
    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    assert.NotEmpty(t, vpcId)
}
```
---
## CI/CD Integration
### GitHub Actions Example
```yaml
name: Terraform
on:
pull_request:
branches: [main]
push:
branches: [main]
jobs:
terraform:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v3
- name: Setup Terraform
uses: hashicorp/setup-terraform@v2
- name: Terraform Init
run: terraform init
- name: Terraform Validate
run: terraform validate
- name: Terraform Plan
run: terraform plan -no-color
if: github.event_name == 'pull_request'
- name: Terraform Apply
run: terraform apply -auto-approve
if: github.event_name == 'push' && github.ref == 'refs/heads/main'
```
### Best Practices for CI/CD
1. **Always run plan on PRs** - Review changes before merge
2. **Require approvals** - Human review for production
3. **Use workspaces or directories** - Separate pipeline per environment
4. **Store state remotely** - S3 backend with locking
5. **Use credential management** - OIDC or IAM roles, never store long-lived credentials (role sketch below)
6. **Run security scans** - checkov, tfsec in pipeline
7. **Tag releases** - Version your infrastructure code
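The OIDC trust from item 5 can itself be provisioned with Terraform. A minimal sketch for GitHub Actions; the repository name is illustrative, and the thumbprint should be verified against GitHub's current published value:
```hcl
resource "aws_iam_openid_connect_provider" "github" {
  url             = "https://token.actions.githubusercontent.com"
  client_id_list  = ["sts.amazonaws.com"]
  thumbprint_list = ["6938fd4d98bab03faadb97b34396831e3780aea1"] # Verify current value
}

resource "aws_iam_role" "terraform_ci" {
  name = "terraform-ci"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = { Federated = aws_iam_openid_connect_provider.github.arn }
      Action    = "sts:AssumeRoleWithWebIdentity"
      Condition = {
        StringEquals = {
          "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
        }
        StringLike = {
          # Only workflows from this repository may assume the role
          "token.actions.githubusercontent.com:sub" = "repo:company/infrastructure:*"
        }
      }
    }]
  })
}
```
Attach a least-privilege policy to this role rather than administrator access, per the security practices above.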
---
## Common Pitfalls to Avoid
### 1. Not Using Remote State
- ❌ Local state doesn't work for teams
- ✅ Use S3, Terraform Cloud, or other remote backend
### 2. Hardcoding Values
- ❌ `region = "us-east-1"` hardcoded in every resource
- ✅ Use variables and locals
### 3. Not Using Modules
- ❌ Copying code between environments
- ✅ Create reusable modules
### 4. Ignoring State
- ❌ Manually modifying infrastructure
- ✅ All changes through Terraform
### 5. Poor Naming
- ❌ `resource "aws_instance" "i1" { }`
- ✅ `resource "aws_instance" "web_server_01" { }`
### 6. No Documentation
- ❌ No README, no comments
- ✅ Document everything
### 7. Massive State Files
- ❌ Single state for entire infrastructure
- ✅ Break into logical components
### 8. No Testing
- ❌ Apply directly to production
- ✅ Test in dev/staging first
---
## Quick Reference
### Essential Commands
```bash
# Initialize
terraform init
# Validate configuration
terraform validate
# Format code
terraform fmt -recursive
# Plan changes
terraform plan
# Apply changes
terraform apply
# Destroy resources
terraform destroy
# Show current state
terraform show
# List resources
terraform state list
# Output values
terraform output
```
### Useful Flags
```bash
# Plan without color
terraform plan -no-color
# Apply without prompts
terraform apply -auto-approve
# Destroy specific resource
terraform destroy -target=aws_instance.example
# Use specific var file
terraform apply -var-file="prod.tfvars"
# Set variable via CLI
terraform apply -var="environment=prod"
```

# Terraform Cost Optimization Guide
Strategies for optimizing cloud infrastructure costs when using Terraform.
## Table of Contents
1. [Right-Sizing Resources](#right-sizing-resources)
2. [Spot and Reserved Instances](#spot-and-reserved-instances)
3. [Storage Optimization](#storage-optimization)
4. [Networking Costs](#networking-costs)
5. [Resource Lifecycle](#resource-lifecycle)
6. [Cost Tagging](#cost-tagging)
7. [Monitoring and Alerts](#monitoring-and-alerts)
8. [Multi-Cloud Considerations](#multi-cloud-considerations)
---
## Right-Sizing Resources
### Compute Resources
**Start small, scale up:**
```hcl
variable "instance_type" {
type = string
description = "EC2 instance type"
default = "t3.micro" # Start with smallest reasonable size
validation {
condition = can(regex("^t[0-9]", var.instance_type)) # Covers t2/t3/t3a/t4g
error_message = "Consider starting with burstable (t-series) instances for cost optimization."
}
}
```
**Use auto-scaling instead of over-provisioning:**
```hcl
resource "aws_autoscaling_group" "app" {
min_size = 2 # Minimum for HA
desired_capacity = 2 # Normal load
max_size = 10 # Peak load
# Scale based on actual usage
target_group_arns = [aws_lb_target_group.app.arn]
tag {
key = "Environment"
value = var.environment
propagate_at_launch = true
}
}
```
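The "scale based on actual usage" comment is typically implemented with a target-tracking policy; a sketch that holds average CPU near 50%:
```hcl
resource "aws_autoscaling_policy" "cpu" {
  name                   = "target-cpu-50"
  autoscaling_group_name = aws_autoscaling_group.app.name
  policy_type            = "TargetTrackingScaling"

  target_tracking_configuration {
    predefined_metric_specification {
      predefined_metric_type = "ASGAverageCPUUtilization"
    }
    target_value = 50.0 # Add instances above this, remove below
  }
}
```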
### Database Right-Sizing
**Start with appropriate size:**
```hcl
resource "aws_db_instance" "main" {
instance_class = var.environment == "prod" ? "db.t3.medium" : "db.t3.micro"
# Enable auto-scaling for storage
allocated_storage = 20
max_allocated_storage = 100 # Auto-scale up to 100GB
# Use cheaper storage for non-prod
storage_type = var.environment == "prod" ? "io1" : "gp3"
}
```
---
## Spot and Reserved Instances
### Spot Instances for Non-Critical Workloads
**Launch Template for Spot:**
```hcl
resource "aws_launch_template" "spot" {
name_prefix = "spot-"
image_id = data.aws_ami.amazon_linux.id
instance_type = "t3.medium"
instance_market_options {
market_type = "spot"
spot_options {
max_price = "0.05" # Set price limit
spot_instance_type = "one-time"
instance_interruption_behavior = "terminate"
}
}
tag_specifications {
resource_type = "instance"
tags = {
Name = "spot-instance"
Workload = "non-critical"
CostSavings = "true"
}
}
}
resource "aws_autoscaling_group" "spot" {
desired_capacity = 5
max_size = 10
min_size = 0
mixed_instances_policy {
instances_distribution {
on_demand_percentage_above_base_capacity = 20 # 20% on-demand, 80% spot
spot_allocation_strategy = "capacity-optimized"
}
launch_template {
launch_template_specification {
launch_template_id = aws_launch_template.spot.id
version = "$Latest"
}
# Multiple instance types increase spot availability
override {
instance_type = "t3.medium"
}
override {
instance_type = "t3.large"
}
override {
instance_type = "t3a.medium"
}
}
}
}
```
### Reserved Instances (Use Outside Terraform)
Terraform shouldn't manage reservations directly; instead:
- Tag resources consistently for reservation planning
- Use Compute or EC2 Instance Savings Plans for flexibility
- Monitor usage patterns to inform reservation purchases
**Tagging for reservation analysis:**
```hcl
locals {
reservation_tags = {
ReservationCandidate = var.environment == "prod" ? "true" : "false"
UsagePattern = "steady-state" # or "variable", "burst"
CostCenter = var.cost_center
}
}
```
---
## Storage Optimization
### S3 Lifecycle Policies
**Automatic tiering:**
```hcl
resource "aws_s3_bucket_lifecycle_configuration" "logs" {
bucket = aws_s3_bucket.logs.id
rule {
id = "log-retention"
status = "Enabled"
transition {
days = 30
storage_class = "STANDARD_IA" # Infrequent Access after 30 days
}
transition {
days = 90
storage_class = "GLACIER_IR" # Instant Retrieval Glacier after 90 days
}
transition {
days = 180
storage_class = "DEEP_ARCHIVE" # Deep Archive after 180 days
}
expiration {
days = 365 # Delete after 1 year
}
}
}
```
**Intelligent tiering for variable access:**
```hcl
resource "aws_s3_bucket_intelligent_tiering_configuration" "assets" {
bucket = aws_s3_bucket.assets.id
name = "entire-bucket"
tiering {
access_tier = "ARCHIVE_ACCESS"
days = 90
}
tiering {
access_tier = "DEEP_ARCHIVE_ACCESS"
days = 180
}
}
```
### EBS Volume Optimization
**Use appropriate volume types:**
```hcl
resource "aws_instance" "app" {
ami = data.aws_ami.amazon_linux.id
instance_type = "t3.medium"
root_block_device {
volume_type = "gp3" # gp3 is cheaper than gp2 with better baseline
volume_size = 20
iops = 3000 # Default, only pay more if you need more
throughput = 125 # Default
encrypted = true
# Delete on termination to avoid orphaned volumes
delete_on_termination = true
}
tags = {
Name = "app-server"
}
}
```
**Snapshot lifecycle:**
```hcl
resource "aws_dlm_lifecycle_policy" "snapshots" {
description = "EBS snapshot lifecycle"
execution_role_arn = aws_iam_role.dlm.arn
state = "ENABLED"
policy_details {
resource_types = ["VOLUME"]
schedule {
name = "Daily snapshots"
create_rule {
interval = 24
interval_unit = "HOURS"
times = ["03:00"]
}
retain_rule {
count = 7 # Keep only 7 days of snapshots
}
copy_tags = true
}
target_tags = {
BackupEnabled = "true"
}
}
}
```
---
## Networking Costs
### Minimize Data Transfer
**Use VPC endpoints to avoid NAT charges:**
```hcl
resource "aws_vpc_endpoint" "s3" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.s3"
route_table_ids = [
aws_route_table.private.id
]
tags = {
Name = "s3-endpoint"
CostSavings = "reduces-nat-charges"
}
}
resource "aws_vpc_endpoint" "dynamodb" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.dynamodb"
route_table_ids = [
aws_route_table.private.id
]
}
```
**Interface endpoints for AWS services:**
```hcl
resource "aws_vpc_endpoint" "ecr_api" {
vpc_id = aws_vpc.main.id
service_name = "com.amazonaws.${var.region}.ecr.api"
vpc_endpoint_type = "Interface"
subnet_ids = aws_subnet.private[*].id
security_group_ids = [aws_security_group.vpc_endpoints.id]
private_dns_enabled = true
tags = {
Name = "ecr-api-endpoint"
CostSavings = "reduces-nat-data-transfer"
}
}
```
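The `aws_security_group.vpc_endpoints` referenced above isn't shown; a minimal sketch that admits HTTPS only from inside the VPC:
```hcl
resource "aws_security_group" "vpc_endpoints" {
  name_prefix = "vpc-endpoints-"
  vpc_id      = aws_vpc.main.id

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = [aws_vpc.main.cidr_block] # VPC-internal traffic only
  }
}
```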
### Regional Optimization
**Co-locate resources in same region/AZ:**
```hcl
# Bad - cross-region data transfer is expensive
resource "aws_instance" "app" {
availability_zone = "us-east-1a"
}
resource "aws_rds_cluster" "main" {
availability_zones = ["us-west-2a"] # Different region!
}
# Good - same region and AZ when possible
resource "aws_instance" "app" {
availability_zone = var.availability_zone
}
resource "aws_rds_cluster" "main" {
availability_zones = [var.availability_zone] # Same AZ
}
```
---
## Resource Lifecycle
### Scheduled Shutdown for Non-Production
**Lambda to stop/start instances:**
```hcl
resource "aws_lambda_function" "scheduler" {
filename = "scheduler.zip"
function_name = "instance-scheduler"
role = aws_iam_role.scheduler.arn
handler = "scheduler.handler"
runtime = "python3.9"
environment {
variables = {
TAG_KEY = "Schedule"
TAG_VALUE = "business-hours"
}
}
}
# EventBridge rule to stop instances at night
resource "aws_cloudwatch_event_rule" "stop_instances" {
name = "stop-dev-instances"
description = "Stop dev instances at 7 PM"
schedule_expression = "cron(0 19 ? * MON-FRI *)" # 7 PM weekdays (EventBridge cron runs in UTC)
}
resource "aws_cloudwatch_event_target" "stop" {
rule = aws_cloudwatch_event_rule.stop_instances.name
target_id = "stop-instances"
arn = aws_lambda_function.scheduler.arn
input = jsonencode({
action = "stop"
})
}
# Start instances in the morning
resource "aws_cloudwatch_event_rule" "start_instances" {
name = "start-dev-instances"
description = "Start dev instances at 8 AM"
schedule_expression = "cron(0 8 ? * MON-FRI *)" # 8 AM weekdays (UTC)
}
```
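The example wires a target only for the stop rule; the start rule needs its own target, and EventBridge needs permission to invoke the function. A sketch completing the wiring:
```hcl
resource "aws_cloudwatch_event_target" "start" {
  rule      = aws_cloudwatch_event_rule.start_instances.name
  target_id = "start-instances"
  arn       = aws_lambda_function.scheduler.arn
  input = jsonencode({
    action = "start"
  })
}

# Without these, EventBridge cannot invoke the Lambda
resource "aws_lambda_permission" "allow_stop_rule" {
  statement_id  = "AllowStopRule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.stop_instances.arn
}

resource "aws_lambda_permission" "allow_start_rule" {
  statement_id  = "AllowStartRule"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.scheduler.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.start_instances.arn
}
```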
**Tag instances for scheduling:**
```hcl
resource "aws_instance" "dev" {
ami = data.aws_ami.amazon_linux.id
instance_type = "t3.medium"
tags = {
Name = "dev-server"
Environment = "dev"
Schedule = "business-hours" # Scheduler will stop/start based on this
AutoShutdown = "true"
}
}
```
### Cleanup Old Resources
**S3 lifecycle for temporary data:**
```hcl
resource "aws_s3_bucket_lifecycle_configuration" "temp" {
bucket = aws_s3_bucket.temp.id
rule {
id = "cleanup-temp-files"
status = "Enabled"
filter {
prefix = "temp/"
}
expiration {
days = 7 # Delete after 7 days
}
abort_incomplete_multipart_upload {
days_after_initiation = 1
}
}
}
```
---
## Cost Tagging
### Comprehensive Tagging Strategy
**Define tagging locals:**
```hcl
locals {
common_tags = {
# Cost allocation tags
CostCenter = var.cost_center
Project = var.project_name
Environment = var.environment
Owner = var.team_email
# Operational tags
ManagedBy = "Terraform"
TerraformModule = basename(abspath(path.module))
# Cost optimization tags
AutoShutdown = var.environment != "prod" ? "enabled" : "disabled"
ReservationCandidate = var.environment == "prod" ? "true" : "false"
CostOptimized = "true"
}
}
# Apply to all resources
resource "aws_instance" "app" {
# ... configuration ...
tags = merge(
local.common_tags,
{
Name = "${var.environment}-app-server"
Role = "application"
}
)
}
```
**Enforce tagging with AWS Config:**
```hcl
resource "aws_config_config_rule" "required_tags" {
name = "required-tags"
source {
owner = "AWS"
source_identifier = "REQUIRED_TAGS"
}
input_parameters = jsonencode({
tag1Key = "CostCenter"
tag2Key = "Environment"
tag3Key = "Owner"
})
depends_on = [aws_config_configuration_recorder.main]
}
```
---
## Monitoring and Alerts
### Budget Alerts
**AWS Budgets with Terraform:**
```hcl
resource "aws_budgets_budget" "monthly" {
name = "${var.environment}-monthly-budget"
budget_type = "COST"
limit_amount = var.monthly_budget
limit_unit = "USD"
time_unit = "MONTHLY"
time_period_start = "2024-01-01_00:00"
cost_filter {
name = "TagKeyValue"
# Tag filter values take the form "user:TagKey$TagValue"
values = [
format("user:Environment$%s", var.environment)
]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 80
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = [var.budget_alert_email]
}
notification {
comparison_operator = "GREATER_THAN"
threshold = 100
threshold_type = "PERCENTAGE"
notification_type = "ACTUAL"
subscriber_email_addresses = [var.budget_alert_email]
}
}
```
### Cost Anomaly Detection
```hcl
resource "aws_ce_anomaly_monitor" "service" {
name = "${var.environment}-service-monitor"
monitor_type = "DIMENSIONAL"
monitor_dimension = "SERVICE"
}
resource "aws_ce_anomaly_subscription" "alerts" {
name = "${var.environment}-anomaly-alerts"
frequency = "DAILY"
monitor_arn_list = [
aws_ce_anomaly_monitor.service.arn
]
subscriber {
type = "EMAIL"
address = var.cost_alert_email
}
threshold_expression {
dimension {
key = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
values = ["100"] # Alert on $100+ anomalies
match_options = ["GREATER_THAN_OR_EQUAL"]
}
}
}
```
---
## Multi-Cloud Considerations
### Azure Cost Optimization
**Use Azure Hybrid Benefit:**
```hcl
resource "azurerm_linux_virtual_machine" "main" {
# ... configuration ...
# Use Azure Hybrid Benefit for licensing savings
license_type = "RHEL_BYOS" # or "SLES_BYOS"
}
```
**Azure Reserved Instances (outside Terraform):**
- Purchase through Azure Portal
- Tag VMs with `ReservationGroup` for planning
### GCP Cost Optimization
**Committed use discounts are purchased outside Terraform (like AWS reservations); within Terraform, use preemptible scheduling for non-production:**
```hcl
resource "google_compute_instance" "main" {
# ... configuration ...
scheduling {
preemptible = var.environment != "prod" # Preemptible for non-prod
# Preemptible VMs cannot auto-restart and must stop for host maintenance
automatic_restart = var.environment == "prod"
on_host_maintenance = var.environment == "prod" ? "MIGRATE" : "TERMINATE"
}
}
```
**GCP Preemptible VMs:**
```hcl
resource "google_compute_instance_template" "preemptible" {
machine_type = "n1-standard-1"
scheduling {
automatic_restart = false
on_host_maintenance = "TERMINATE"
preemptible = true # Up to 80% cost reduction
}
}
```
---
## Cost Optimization Checklist
### Before Deployment
- [ ] Right-size compute resources (start small)
- [ ] Use appropriate storage tiers
- [ ] Enable auto-scaling instead of over-provisioning
- [ ] Implement tagging strategy
- [ ] Configure lifecycle policies
- [ ] Set up VPC endpoints for AWS services
### After Deployment
- [ ] Monitor actual usage vs. provisioned capacity
- [ ] Review cost allocation tags
- [ ] Identify reservation opportunities
- [ ] Configure budget alerts
- [ ] Enable cost anomaly detection
- [ ] Schedule non-production resource shutdown
### Ongoing
- [ ] Monthly cost review
- [ ] Quarterly right-sizing analysis
- [ ] Annual reservation review
- [ ] Remove unused resources
- [ ] Optimize data transfer patterns
- [ ] Update instance families (new generations are often cheaper)
---
## Cost Estimation Tools
### Use `infracost` in CI/CD
```bash
# Install infracost
curl -fsSL https://raw.githubusercontent.com/infracost/infracost/master/scripts/install.sh | sh
# Generate cost estimate
infracost breakdown --path .
# Compare cost changes in PR
infracost diff --path . --compare-to tfplan.json
```
### Terraform Cloud Cost Estimation
Enable in Terraform Cloud workspace settings for automatic cost estimates on every plan.
---
## Additional Resources
- AWS Cost Optimization: https://aws.amazon.com/pricing/cost-optimization/
- Azure Cost Management: https://azure.microsoft.com/en-us/products/cost-management/
- GCP Cost Management: https://cloud.google.com/cost-management
- Infracost: https://www.infracost.io/
- Cloud Cost Optimization Tools: Kubecost, CloudHealth, CloudCheckr

# Terraform Troubleshooting Guide
Common Terraform and Terragrunt issues with solutions.
## Table of Contents
1. [State Issues](#state-issues)
2. [Provider Issues](#provider-issues)
3. [Resource Errors](#resource-errors)
4. [Module Issues](#module-issues)
5. [Terragrunt Specific](#terragrunt-specific)
6. [Performance Issues](#performance-issues)
---
## State Issues
### State Lock Error
**Symptom:**
```
Error locking state: Error acquiring the state lock
Lock Info:
ID: abc123...
Path: terraform.tfstate
Operation: OperationTypeApply
Who: user@hostname
Created: 2024-01-15 10:30:00 UTC
```
**Common Causes:**
1. Previous operation crashed or was interrupted
2. Another user/process is running terraform
3. State lock wasn't released properly
**Resolution:**
1. **Verify no one else is running terraform:**
```bash
# Check with team first!
```
2. **Force unlock (use with caution):**
```bash
terraform force-unlock abc123
```
3. **For DynamoDB backend, check lock table:**
```bash
aws dynamodb get-item \
--table-name terraform-state-lock \
--key '{"LockID": {"S": "path/to/state/terraform.tfstate-md5"}}'
```
**Prevention:**
- Use proper state locking backend (S3 + DynamoDB)
- Implement timeout in CI/CD pipelines
- Always let terraform complete or properly cancel
---
### State Drift Detected
**Symptom:**
```
Note: Objects have changed outside of Terraform
Terraform detected the following changes made outside of Terraform
since the last "terraform apply":
```
**Common Causes:**
1. Manual changes in AWS console
2. Another tool modifying resources
3. Auto-scaling or auto-remediation
**Resolution:**
1. **Review the drift:**
```bash
terraform plan -detailed-exitcode
```
2. **Options:**
- **Import changes:** Update terraform to match reality
- **Revert changes:** Apply terraform to restore desired state
- **Refresh state:** `terraform apply -refresh-only`
3. **Import specific changes:**
```bash
# Update your .tf files, then:
terraform plan # Verify it matches
terraform apply
```
**Prevention:**
- Implement policy to prevent manual changes
- Use AWS Config rules to detect drift
- Regular `terraform plan` to catch drift early
- Consider using Terraform Cloud drift detection
---
### State Corruption
**Symptom:**
```
Error: Failed to load state
Error: state snapshot was created by Terraform v1.5.0,
which is newer than current v1.3.0
```
**Common Causes:**
1. Using different Terraform versions
2. State file manually edited
3. Incomplete state upload
**Resolution:**
1. **Version mismatch:**
```bash
# Upgrade to matching version
tfenv install 1.5.0
tfenv use 1.5.0
```
2. **Restore from backup:**
```bash
# For S3 backend with versioning
aws s3api list-object-versions \
--bucket terraform-state \
--prefix prod/terraform.tfstate
# Restore specific version
aws s3api get-object \
--bucket terraform-state \
--key prod/terraform.tfstate \
--version-id VERSION_ID \
terraform.tfstate
```
3. **Rebuild state (last resort):**
```bash
# Remove corrupted state
terraform state rm aws_instance.example
# Re-import resources
terraform import aws_instance.example i-1234567890abcdef0
```
**Prevention:**
- Pin Terraform version in `versions.tf` (see the snippet after this list)
- Enable S3 versioning for state bucket
- Never manually edit state files
- Use consistent Terraform versions across team
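A minimal pin that makes version mismatches fail fast and keeps the whole team on the same release:
```hcl
terraform {
  # Everyone, including CI, runs the same minor version
  required_version = "~> 1.5.0"
}
```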
---
## Provider Issues
### Provider Version Conflict
**Symptom:**
```
Error: Incompatible provider version
Provider registry.terraform.io/hashicorp/aws v5.0.0 does not have
a package available for your current platform
```
**Resolution:**
1. **Specify version constraints:**
```hcl
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 4.67.0" # Use compatible version
}
}
}
```
2. **Clean provider cache:**
```bash
rm -rf .terraform
terraform init -upgrade
```
3. **Lock file sync:**
```bash
terraform providers lock \
-platform=darwin_amd64 \
-platform=darwin_arm64 \
-platform=linux_amd64
```
---
### Authentication Failures
**Symptom:**
```
Error: error configuring Terraform AWS Provider:
no valid credential sources found
```
**Common Causes:**
1. Missing AWS credentials
2. Expired credentials
3. Incorrect IAM permissions
**Resolution:**
1. **Verify credentials:**
```bash
aws sts get-caller-identity
```
2. **Check credential order:**
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- Shared credentials file (~/.aws/credentials)
- IAM role (for EC2/ECS)
3. **Configure provider:**
```hcl
provider "aws" {
region = "us-east-1"
# Option 1: Use profile
profile = "production"
# Option 2: Assume role
assume_role {
role_arn = "arn:aws:iam::ACCOUNT:role/TerraformRole"
}
}
```
4. **Check IAM permissions:**
```bash
# Test specific permission
aws ec2 describe-instances --dry-run
```
**Prevention:**
- Use IAM roles in CI/CD
- Implement OIDC for GitHub Actions
- Regular credential rotation
- Use AWS SSO for developers
---
## Resource Errors
### Resource Already Exists
**Symptom:**
```
Error: creating EC2 Instance: EntityAlreadyExists:
Resource with id 'i-1234567890abcdef0' already exists
```
**Resolution:**
1. **Import existing resource:**
```bash
terraform import aws_instance.web i-1234567890abcdef0
```
2. **Verify configuration matches:**
```bash
terraform plan # Should show no changes after import
```
3. **If configuration differs, update it:**
```hcl
resource "aws_instance" "web" {
ami = "ami-abc123" # Match existing
instance_type = "t3.micro" # Match existing
}
```
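On Terraform 1.5 or newer, the import can also be declared in configuration, so it flows through plan/apply and code review instead of an imperative command:
```hcl
import {
  to = aws_instance.web
  id = "i-1234567890abcdef0"
}
```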
---
### Dependency Errors
**Symptom:**
```
Error: resource depends on resource "aws_vpc.main" that
is not declared in the configuration
```
**Resolution:**
1. **Add explicit dependency:**
```hcl
resource "aws_subnet" "private" {
vpc_id = aws_vpc.main.id
depends_on = [
aws_internet_gateway.main # Explicit dependency
]
}
```
2. **Use data sources for existing resources:**
```hcl
data "aws_vpc" "existing" {
id = "vpc-12345678"
}
resource "aws_subnet" "new" {
vpc_id = data.aws_vpc.existing.id
}
```
---
### Timeout Errors
**Symptom:**
```
Error: timeout while waiting for state to become 'available'
(last state: 'pending', timeout: 10m0s)
```
**Resolution:**
1. **Increase timeout:**
```hcl
resource "aws_db_instance" "main" {
# ... configuration ...
timeouts {
create = "60m"
update = "60m"
delete = "60m"
}
}
```
2. **Check resource status manually:**
```bash
aws rds describe-db-instances --db-instance-identifier mydb
```
3. **Retry the operation:**
```bash
terraform apply
```
---
## Module Issues
### Module Source Not Found
**Symptom:**
```
Error: Failed to download module
Could not download module "vpc" (main.tf:10) source:
git::https://github.com/company/terraform-modules.git//vpc
```
**Resolution:**
1. **Verify source URL:**
```hcl
module "vpc" {
source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.0.0"
# Add authentication if private repo
}
```
2. **For private repos, configure Git auth:**
```bash
# SSH key
git config --global url."git@github.com:".insteadOf "https://github.com/"
# Or use HTTPS with token
git config --global url."https://oauth2:TOKEN@github.com/".insteadOf "https://github.com/"
```
3. **Clear module cache:**
```bash
rm -rf .terraform/modules
terraform init
```
---
### Module Version Conflicts
**Symptom:**
```
Error: Inconsistent dependency lock file
Module has dependencies locked at version 1.0.0 but
root module requires version 2.0.0
```
**Resolution:**
1. **Update lock file:**
```bash
terraform init -upgrade
```
2. **Pin module version:**
```hcl
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 3.0" # Compatible with 3.x
}
```
---
## Terragrunt Specific
### Dependency Cycle Detected
**Symptom:**
```
Error: Dependency cycle detected:
module-a depends on module-b
module-b depends on module-c
module-c depends on module-a
```
**Resolution:**
1. **Review dependencies in terragrunt.hcl:**
```hcl
dependency "vpc" {
config_path = "../vpc"
}
dependency "database" {
config_path = "../database"
}
# Don't create circular references!
```
2. **Refactor to remove cycle:**
- Split modules differently
- Use data sources instead of dependencies
- Pass values through variables
3. **Use mock outputs during planning:**
```hcl
dependency "vpc" {
config_path = "../vpc"
mock_outputs = {
vpc_id = "vpc-mock"
}
mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}
```
---
### Hook Failures
**Symptom:**
```
Error: Hook execution failed
Command: pre_apply_hook.sh
Exit code: 1
```
**Resolution:**
1. **Debug the hook:**
```bash
# Run hook manually
bash .terragrunt-cache/.../pre_apply_hook.sh
```
2. **Add error handling to hook:**
```bash
#!/bin/bash
set -e # Exit on error
# Your hook logic
if ! command -v jq &> /dev/null; then
echo "jq is required but not installed"
exit 1
fi
```
3. **Make hook executable:**
```bash
chmod +x hooks/pre_apply_hook.sh
```
---
### Include Path Issues
**Symptom:**
```
Error: Cannot include file
Path does not exist: ../common.hcl
```
**Resolution:**
1. **Use correct relative path:**
```hcl
include "root" {
path = find_in_parent_folders()
}
include "common" {
path = "${get_terragrunt_dir()}/../common.hcl"
}
```
2. **Verify file exists:**
```bash
ls -la ../common.hcl
```
---
## Performance Issues
### Slow Plans/Applies
**Symptoms:**
- `terraform plan` takes >5 minutes
- `terraform apply` very slow
- State operations timing out
**Common Causes:**
1. Too many resources in single state
2. Slow provider API calls
3. Large number of data sources
4. Complex interpolations
**Resolution:**
1. **Split state files:**
```
networking/ # Separate state
compute/ # Separate state
database/ # Separate state
```
2. **Use targeted operations:**
```bash
terraform plan -target=aws_instance.web
terraform apply -target=module.vpc
```
3. **Optimize data sources:**
```hcl
# Bad - queries every plan
data "aws_ami" "ubuntu" {
most_recent = true
# ... filters
}
# Better - use specific AMI
variable "ami_id" {
default = "ami-abc123" # Update periodically
}
```
4. **Enable parallelism:**
```bash
terraform apply -parallelism=20 # Default is 10
```
5. **Use caching (Terragrunt):**
```hcl
remote_state {
backend = "s3"
config = {
skip_credentials_validation = true # Faster
skip_metadata_api_check = true
}
}
```
---
## Quick Diagnostic Steps
When encountering any Terraform error:
1. **Read the full error message** - Don't skip details
2. **Check recent changes** - What changed since last successful run?
3. **Verify versions** - Terraform, providers, modules
4. **Check state** - Is it locked? Corrupted?
5. **Test authentication** - Can you access resources manually?
6. **Review logs** - Use TF_LOG=DEBUG for detailed output
7. **Isolate the problem** - Use -target to test specific resources
### Enable Debug Logging
```bash
export TF_LOG=DEBUG
export TF_LOG_PATH=terraform-debug.log
terraform plan
```
### Test Configuration
```bash
terraform validate # Syntax check
terraform fmt -check # Format check
tflint # Linting
```
---
## Prevention Checklist
- [ ] Use remote state with locking
- [ ] Pin Terraform and provider versions
- [ ] Implement pre-commit hooks
- [ ] Run plan before every apply
- [ ] Use modules for reusable components
- [ ] Enable state versioning/backups
- [ ] Document architecture and dependencies
- [ ] Implement CI/CD with proper reviews
- [ ] Regular terraform plan in CI to detect drift
- [ ] Monitor and alert on state changes