Terraform Troubleshooting Guide

Common Terraform and Terragrunt issues with solutions.

Table of Contents

  1. State Issues
  2. Provider Issues
  3. Resource Errors
  4. Module Issues
  5. Terragrunt Specific
  6. Performance Issues
  7. Quick Diagnostic Steps
  8. Prevention Checklist

State Issues

State Lock Error

Symptom:

Error locking state: Error acquiring the state lock
Lock Info:
  ID:        abc123...
  Path:      terraform.tfstate
  Operation: OperationTypeApply
  Who:       user@hostname
  Created:   2024-01-15 10:30:00 UTC

Common Causes:

  1. Previous operation crashed or was interrupted
  2. Another user/process is running terraform
  3. State lock wasn't released properly

Resolution:

  1. Verify no one else is running terraform:
# Check with team first!
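  2. Force unlock (use with caution), passing the full lock ID from the error message:
terraform force-unlock LOCK_ID
  3. For DynamoDB backend, check lock table:
# The lock item's LockID is "<bucket>/<key>"; the item ending in "-md5" holds the state digest, not the lock
aws dynamodb get-item \
  --table-name terraform-state-lock \
  --key '{"LockID": {"S": "BUCKET/path/to/state/terraform.tfstate"}}'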
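
When the lock error only exists in a CI log, the lock ID can be pulled out of the captured text instead of being copied by hand. A minimal sketch, assuming the error output contains an `ID:` line as in the symptom above; `extract_lock_id` is a hypothetical helper name, not a Terraform command.

```shell
# Print the lock ID found in Terraform's "Error acquiring the state lock" output.
# Reads the error text on stdin; prints nothing and returns 1 if no "ID:" line exists.
extract_lock_id() {
  awk '$1 == "ID:" { print $2; found = 1; exit } END { exit !found }'
}

# Usage:
#   terraform plan 2>&1 | extract_lock_id
```

The printed value is what `terraform force-unlock` expects as its argument; confirm with the team before using it.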

Prevention:

  • Use proper state locking backend (S3 + DynamoDB)
  • Implement timeout in CI/CD pipelines
  • Always let terraform complete or properly cancel
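
For the pipeline timeout, one option is Terraform's own `-lock-timeout` flag, which makes a command wait for a held lock instead of failing immediately. A small sketch; the wrapper name `tf_wait_for_lock` and the `TF_LOCK_TIMEOUT` variable are assumptions of this example, not Terraform features.

```shell
# Run a Terraform subcommand that waits for a held state lock before giving up.
# TF_LOCK_TIMEOUT is a hypothetical variable read by this sketch (default: 5m).
tf_wait_for_lock() {
  subcommand=$1
  shift
  terraform "$subcommand" -lock-timeout="${TF_LOCK_TIMEOUT:-5m}" "$@"
}

# Usage:
#   tf_wait_for_lock plan -out=tfplan
#   TF_LOCK_TIMEOUT=10m tf_wait_for_lock apply tfplan
```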

State Drift Detected

Symptom:

Note: Objects have changed outside of Terraform

Terraform detected the following changes made outside of Terraform
since the last "terraform apply":

Common Causes:

  1. Manual changes in AWS console
  2. Another tool modifying resources
  3. Auto-scaling or auto-remediation

Resolution:

  1. Review the drift:
terraform plan -detailed-exitcode
  2. Options:

    • Import changes: Update the Terraform configuration to match reality
    • Revert changes: Apply terraform to restore desired state
    • Refresh state: terraform apply -refresh-only
  3. Import specific changes:

# Update your .tf files, then:
terraform plan  # Verify it matches
terraform apply

Prevention:

  • Implement policy to prevent manual changes
  • Use AWS Config rules to detect drift
  • Regular terraform plan to catch drift early
  • Consider using Terraform Cloud drift detection
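
A scheduled drift check can rely on the exit code of `terraform plan -detailed-exitcode`: 0 means no changes, 1 means the plan failed, and 2 means changes are present. A minimal sketch, assuming it runs in an initialized working directory; `check_drift` is a hypothetical name.

```shell
# Report drift based on the exit code of `terraform plan -detailed-exitcode`.
# Returns the plan's exit code so a CI job can fail or alert on it.
check_drift() {
  rc=0
  terraform plan -detailed-exitcode -input=false >/dev/null 2>&1 || rc=$?
  case $rc in
    0) echo "in sync" ;;
    2) echo "drift or pending changes" ;;
    *) echo "plan failed (exit $rc)" ;;
  esac
  return "$rc"
}
```

Note that exit code 2 also covers configuration changes that have not been applied yet, so the result is only a drift signal when the configuration itself is unchanged.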

State Corruption

Symptom:

Error: Failed to load state
Error: state snapshot was created by Terraform v1.5.0, 
which is newer than current v1.3.0

Common Causes:

  1. Using different Terraform versions
  2. State file manually edited
  3. Incomplete state upload

Resolution:

  1. Version mismatch:
# Upgrade to matching version
tfenv install 1.5.0
tfenv use 1.5.0
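  2. Restore from backup:
# For S3 backend with versioning
aws s3api list-object-versions \
  --bucket terraform-state \
  --prefix prod/terraform.tfstate

# Download a specific version
aws s3api get-object \
  --bucket terraform-state \
  --key prod/terraform.tfstate \
  --version-id VERSION_ID \
  terraform.tfstate

# Review the file, then upload it as the current state
# (an older serial is rejected unless -force is given)
terraform state push terraform.tfstate
  3. Rebuild state (last resort):
# Remove the affected resource from state (the real resource is not destroyed)
terraform state rm aws_instance.example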

# Re-import resources
terraform import aws_instance.example i-1234567890abcdef0

Prevention:

  • Pin Terraform version in versions.tf
  • Enable S3 versioning for state bucket
  • Never manually edit state files
  • Use consistent Terraform versions across team
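
The version mismatch can be caught before running a plan by comparing the CLI version with the `terraform_version` field recorded in the state file. The sketch below only implements the comparison; `version_lt` is a hypothetical helper, and the usage lines assume `jq` is installed and a state file has been downloaded locally.

```shell
# Return success (0) when version A is strictly lower than version B.
# Compares dot-separated numeric fields; pre-release suffixes are ignored.
version_lt() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    na = split(a, x, "."); nb = split(b, y, ".")
    n = (na > nb) ? na : nb
    for (i = 1; i <= n; i++) {
      if (x[i] + 0 < y[i] + 0) exit 0
      if (x[i] + 0 > y[i] + 0) exit 1
    }
    exit 1
  }'
}

# Usage (assumes jq and a local copy of the state file):
#   cli=$(terraform version -json | jq -r .terraform_version)
#   state=$(jq -r .terraform_version terraform.tfstate)
#   version_lt "$cli" "$state" && echo "CLI $cli is older than state $state"
```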

Provider Issues

Provider Version Conflict

Symptom:

Error: Incompatible provider version

Provider registry.terraform.io/hashicorp/aws v5.0.0 does not have 
a package available for your current platform

Resolution:

  1. Specify version constraints:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 4.67.0"  # Use compatible version
    }
  }
}
  2. Clean provider cache:
rm -rf .terraform
terraform init -upgrade
  3. Lock file sync:
terraform providers lock \
  -platform=darwin_amd64 \
  -platform=darwin_arm64 \
  -platform=linux_amd64

Authentication Failures

Symptom:

Error: error configuring Terraform AWS Provider: 
no valid credential sources found

Common Causes:

  1. Missing AWS credentials
  2. Expired credentials
  3. Incorrect IAM permissions

Resolution:

  1. Verify credentials:
aws sts get-caller-identity
  2. Check credential order:

    • Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
    • Shared credentials file (~/.aws/credentials)
    • IAM role (for EC2/ECS)
  3. Configure provider:

provider "aws" {
  region = "us-east-1"
  
  # Option 1: Use profile
  profile = "production"
  
  # Option 2: Assume role
  assume_role {
    role_arn = "arn:aws:iam::ACCOUNT:role/TerraformRole"
  }
}
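  4. Check IAM permissions:
# Test a specific permission. With --dry-run, a "DryRunOperation" error means
# the call is allowed; "UnauthorizedOperation" means the permission is missing.
aws ec2 describe-instances --dry-run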

Prevention:

  • Use IAM roles in CI/CD
  • Implement OIDC for GitHub Actions
  • Regular credential rotation
  • Use AWS SSO for developers
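
Before digging into IAM, it can help to see which of the credential sources listed under Resolution are actually present on the machine. A rough sketch: it checks for presence only and does not validate anything. `aws_credential_sources` is a hypothetical name; `AWS_PROFILE` and `AWS_SHARED_CREDENTIALS_FILE` are standard AWS environment variables.

```shell
# List which common AWS credential sources are present on this machine.
# Presence only: expired or wrong credentials are not detected here.
aws_credential_sources() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "environment variables"
  fi
  if [ -n "${AWS_PROFILE:-}" ]; then
    echo "profile: $AWS_PROFILE"
  fi
  if [ -f "${AWS_SHARED_CREDENTIALS_FILE:-$HOME/.aws/credentials}" ]; then
    echo "shared credentials file"
  fi
}
```

If the function prints nothing and the host is not an EC2 instance or ECS task with a role attached, the "no valid credential sources found" error is expected.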

Resource Errors

Resource Already Exists

Symptom:

Error: creating EC2 Instance: EntityAlreadyExists: 
Resource with id 'i-1234567890abcdef0' already exists

Resolution:

  1. Import existing resource:
terraform import aws_instance.web i-1234567890abcdef0
  2. Verify configuration matches:
terraform plan  # Should show no changes after import
  3. If configuration differs, update it:
resource "aws_instance" "web" {
  ami           = "ami-abc123"  # Match existing
  instance_type = "t3.micro"    # Match existing
}

Dependency Errors

Symptom:

Error: resource depends on resource "aws_vpc.main" that 
is not declared in the configuration

Resolution:

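  1. Make sure every referenced resource is declared; add depends_on only for dependencies Terraform cannot infer from references: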
resource "aws_subnet" "private" {
  vpc_id = aws_vpc.main.id
  
  depends_on = [
    aws_internet_gateway.main  # Explicit dependency
  ]
}
  2. Use data sources for existing resources:
data "aws_vpc" "existing" {
  id = "vpc-12345678"
}

resource "aws_subnet" "new" {
  vpc_id = data.aws_vpc.existing.id
}
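
To see which dependencies Terraform has actually inferred, the DOT output of `terraform graph` can be filtered for one resource. A small sketch; the exact node labels vary between Terraform versions, so the filter matches on the plain resource address. `edges_for` is a hypothetical helper name.

```shell
# Print the dependency edges that mention a given resource address.
# Reads DOT text (as produced by `terraform graph`) on stdin.
edges_for() {
  grep -F -- '->' | grep -F -- "$1"
}

# Usage:
#   terraform graph | edges_for aws_subnet.private
```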

Timeout Errors

Symptom:

Error: timeout while waiting for state to become 'available'
(last state: 'pending', timeout: 10m0s)

Resolution:

  1. Increase timeout:
resource "aws_db_instance" "main" {
  # ... configuration ...
  
  timeouts {
    create = "60m"
    update = "60m"
    delete = "60m"
  }
}
  2. Check resource status manually:
aws rds describe-db-instances --db-instance-identifier mydb
  3. Retry the operation:
terraform apply
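
Rather than re-running the status check by hand before retrying, the check can be polled until the resource reaches the expected state. A generic sketch; `poll_until` is a hypothetical helper, and the attempt count and delay in the usage lines are arbitrary values.

```shell
# Re-run a command until it prints the expected value or attempts run out.
# Usage: poll_until EXPECTED MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
poll_until() {
  expected=$1
  attempts=$2
  delay=$3
  shift 3
  i=1
  while [ "$i" -le "$attempts" ]; do
    actual=$("$@" 2>/dev/null) || actual=""
    if [ "$actual" = "$expected" ]; then
      return 0
    fi
    echo "attempt $i/$attempts: status is '${actual:-unknown}'" >&2
    i=$((i + 1))
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Usage:
#   poll_until available 40 30 \
#     aws rds describe-db-instances --db-instance-identifier mydb \
#       --query 'DBInstances[0].DBInstanceStatus' --output text \
#   && terraform apply
```

For RDS specifically, the AWS CLI also ships a built-in waiter, `aws rds wait db-instance-available`, which covers the same case without a custom loop.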

Module Issues

Module Source Not Found

Symptom:

Error: Failed to download module

Could not download module "vpc" (main.tf:10) source: 
git::https://github.com/company/terraform-modules.git//vpc

Resolution:

  1. Verify source URL:
module "vpc" {
  source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.0.0"
  # Add authentication if private repo
}
  2. For private repos, configure Git auth:
# SSH key
git config --global url."git@github.com:".insteadOf "https://github.com/"

# Or use HTTPS with token
git config --global url."https://oauth2:TOKEN@github.com/".insteadOf "https://github.com/"
  3. Clear module cache:
rm -rf .terraform/modules
terraform init
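
To verify a source URL outside Terraform, the `git::` source string can be split into repository, subdirectory, and ref, and the repository and ref checked with `git ls-remote`. A sketch that assumes a `scheme://` style URL like the one above (SCP-style `git@host:path` sources are not handled); `parse_module_source` is a hypothetical helper.

```shell
# Split a Terraform git module source into "REPO SUBDIR REF" (one line).
# Missing parts are printed as "-". Assumes a scheme://host/path style URL.
parse_module_source() {
  s=${1#git::}
  ref=-
  case $s in
    *\?ref=*) ref=${s##*\?ref=}; s=${s%%\?*} ;;
  esac
  scheme=${s%%://*}
  rest=${s#*://}
  subdir=-
  case $rest in
    *//*) subdir=${rest#*//}; rest=${rest%%//*} ;;
  esac
  printf '%s %s %s\n' "$scheme://$rest" "$subdir" "$ref"
}

# Usage:
#   set -- $(parse_module_source "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.0.0")
#   git ls-remote --exit-code "$1" "$3"   # non-zero exit: repo unreachable or ref missing
```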

Module Version Conflicts

Symptom:

Error: Inconsistent dependency lock file

Module has dependencies locked at version 1.0.0 but 
root module requires version 2.0.0

Resolution:

  1. Update lock file:
terraform init -upgrade
  2. Pin module version:
module "vpc" {
  source  = "terraform-aws-modules/vpc/aws"
  version = "~> 3.0"  # Compatible with 3.x
}

Terragrunt Specific

Dependency Cycle Detected

Symptom:

Error: Dependency cycle detected:
  module-a depends on module-b
  module-b depends on module-c  
  module-c depends on module-a

Resolution:

  1. Review dependencies in terragrunt.hcl:
dependency "vpc" {
  config_path = "../vpc"
}

dependency "database" {
  config_path = "../database"
}

# Don't create circular references!
  2. Refactor to remove cycle:

    • Split modules differently
    • Use data sources instead of dependencies
    • Pass values through variables
  3. Use mock outputs during planning:

dependency "vpc" {
  config_path = "../vpc"
  
  mock_outputs = {
    vpc_id = "vpc-mock"
  }
  mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}
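
When the dependency blocks are spread over many terragrunt.hcl files, a cycle is easier to find by writing the edges down and letting `tsort` check them. A sketch, assuming the edges are listed by hand as "dependent dependency" pairs; `has_cycle` is a hypothetical helper.

```shell
# Succeed (0) when the edge list on stdin contains a cycle.
# Each input line is "DEPENDENT DEPENDENCY"; tsort reports loops on stderr.
has_cycle() {
  errors=$(tsort 2>&1 >/dev/null) || true
  [ -n "$errors" ]
}

# Usage:
#   printf '%s\n' 'module-a module-b' 'module-b module-c' 'module-c module-a' | has_cycle \
#     && echo "cycle found"
```

Running `tsort` directly on the same input also prints the modules involved in the loop, which points at the edge to remove.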

Hook Failures

Symptom:

Error: Hook execution failed
Command: pre_apply_hook.sh
Exit code: 1

Resolution:

  1. Debug the hook:
# Run hook manually
bash .terragrunt-cache/.../pre_apply_hook.sh
  2. Add error handling to hook:
#!/bin/bash
set -euo pipefail  # Exit on errors, unset variables, and failed pipeline stages

# Your hook logic
if ! command -v jq &> /dev/null; then
    echo "jq is required but not installed" >&2
    exit 1
fi
  3. Make hook executable:
chmod +x hooks/pre_apply_hook.sh

Include Path Issues

Symptom:

Error: Cannot include file
Path does not exist: ../common.hcl

Resolution:

  1. Use correct relative path:
include "root" {
  path = find_in_parent_folders()
}

include "common" {
  path = "${get_terragrunt_dir()}/../common.hcl"
}
  2. Verify file exists:
ls -la ../common.hcl
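
To check which file a parent-folder lookup would resolve to, the upward search can be reproduced in the shell. This is a rough approximation of what `find_in_parent_folders()` does, not Terragrunt's implementation; `find_in_parents` is a hypothetical helper, and the file and directory names in the usage line are examples.

```shell
# Walk up from the parent of START_DIR and print the first match for NAME.
# Usage: find_in_parents NAME [START_DIR]   (START_DIR defaults to ".")
find_in_parents() {
  name=$1
  dir=$(cd "${2:-.}" && pwd) || return 1
  dir=$(dirname "$dir")   # start in the parent, not the directory itself
  while :; do
    if [ -f "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
    [ "$dir" = "/" ] && return 1
    dir=$(dirname "$dir")
  done
}

# Usage:
#   find_in_parents common.hcl live/prod/vpc
```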

Performance Issues

Slow Plans/Applies

Symptoms:

  • terraform plan takes >5 minutes
  • terraform apply very slow
  • State operations timing out

Common Causes:

  1. Too many resources in single state
  2. Slow provider API calls
  3. Large number of data sources
  4. Complex interpolations

Resolution:

  1. Split state files:
networking/     # Separate state
compute/        # Separate state  
database/       # Separate state
  2. Use targeted operations (for troubleshooting, not routine runs):
terraform plan -target=aws_instance.web
terraform apply -target=module.vpc
  3. Optimize data sources:
# Bad - queries every plan
data "aws_ami" "ubuntu" {
  most_recent = true
  # ... filters
}

# Better - use specific AMI
variable "ami_id" {
  default = "ami-abc123"  # Update periodically
}
  4. Increase parallelism:
terraform apply -parallelism=20  # Default is 10
  5. Skip backend validation checks (Terragrunt):
remote_state {
  backend = "s3"
  config = {
    skip_credentials_validation = true  # Faster
    skip_metadata_api_check     = true
  }
}
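
Before splitting a state, it helps to know where the resources are concentrated. The sketch below groups the output of `terraform state list` by top-level module; `count_by_module` is a hypothetical helper, and module instance keys that contain dots are not handled.

```shell
# Count resources per top-level module from `terraform state list` output (stdin).
# Addresses outside any module are grouped under "root".
count_by_module() {
  awk -F. '
    { group = ($1 == "module") ? ($1 "." $2) : "root"; count[group]++ }
    END { for (g in count) print count[g], g }
  ' | LC_ALL=C sort -k1,1nr -k2,2
}

# Usage:
#   terraform state list | count_by_module
```

Modules with the largest counts are the natural candidates for moving into their own state.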

Quick Diagnostic Steps

When encountering any Terraform error:

  1. Read the full error message - Don't skip details
  2. Check recent changes - What changed since last successful run?
  3. Verify versions - Terraform, providers, modules
  4. Check state - Is it locked? Corrupted?
  5. Test authentication - Can you access resources manually?
  6. Review logs - Use TF_LOG=DEBUG for detailed output
  7. Isolate the problem - Use -target to test specific resources

Enable Debug Logging

export TF_LOG=DEBUG
export TF_LOG_PATH=terraform-debug.log
terraform plan
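
Exporting `TF_LOG` leaves debug logging on for every later command in the same shell. A small wrapper can scope it to a single run; `tf_debug` is a hypothetical name, and the default log file name simply reuses the one above.

```shell
# Run one Terraform command with debug logging, without leaving TF_LOG
# set in the calling shell (the exports happen inside a subshell).
tf_debug() {
  (
    TF_LOG=DEBUG
    TF_LOG_PATH=${TF_LOG_PATH:-terraform-debug.log}
    export TF_LOG TF_LOG_PATH
    terraform "$@"
  )
}

# Usage:
#   tf_debug plan
```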

Test Configuration

terraform validate  # Syntax check
terraform fmt -check  # Format check
tflint  # Linting
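
The three checks can be chained so that a pre-commit hook or CI step stops at the first failure. A minimal sketch, assuming the directory is already initialized (which `terraform validate` needs); `preflight` is a hypothetical name, and tflint is treated as optional.

```shell
# Run the static checks in order and stop at the first failure.
# tflint is skipped when it is not installed.
preflight() {
  terraform fmt -check -recursive || { echo "formatting check failed" >&2; return 1; }
  terraform validate || { echo "validation failed" >&2; return 1; }
  if command -v tflint >/dev/null 2>&1; then
    tflint || { echo "tflint reported issues" >&2; return 1; }
  fi
  echo "preflight checks passed"
}
```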

Prevention Checklist

  • Use remote state with locking
  • Pin Terraform and provider versions
  • Implement pre-commit hooks
  • Run plan before every apply
  • Use modules for reusable components
  • Enable state versioning/backups
  • Document architecture and dependencies
  • Implement CI/CD with proper reviews
  • Regular terraform plan in CI to detect drift
  • Monitor and alert on state changes