Terraform Troubleshooting Guide
Common Terraform and Terragrunt issues with solutions.
State Issues
State Lock Error
Symptom:
Error locking state: Error acquiring the state lock
Lock Info:
ID: abc123...
Path: terraform.tfstate
Operation: OperationTypeApply
Who: user@hostname
Created: 2024-01-15 10:30:00 UTC
Common Causes:
- Previous operation crashed or was interrupted
- Another user/process is running terraform
- State lock wasn't released properly
Resolution:
- Verify no one else is running terraform:
# Check with team first!
- Force unlock (use with caution):
terraform force-unlock abc123
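- For DynamoDB backend, check lock table:
# The lock item's LockID is <bucket>/<key>; the <bucket>/<key>-md5 item only stores the state checksum
aws dynamodb get-item \
--table-name terraform-state-lock \
--key '{"LockID": {"S": "BUCKET_NAME/path/to/state/terraform.tfstate"}}'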
Prevention:
- Use proper state locking backend (S3 + DynamoDB)
- Implement timeout in CI/CD pipelines
- Always let terraform complete or properly cancel
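One way to act on the CI/CD timeout advice is to make every pipeline invocation wait a bounded time for the lock instead of failing immediately. The sketch below relies on the CLI's `-lock-timeout` and `-input=false` options; the function name `tf_ci`, the `TF_LOCK_TIMEOUT` variable, and the 5m default are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical wrapper for CI jobs: wait a bounded time for the state lock
# and never prompt for input. TF_LOCK_TIMEOUT is an assumed variable name.
tf_ci() {
  local subcommand="$1"
  shift
  terraform "$subcommand" -input=false \
    -lock-timeout="${TF_LOCK_TIMEOUT:-5m}" "$@"
}

# Usage (in a pipeline step):
#   tf_ci plan -out=tfplan
#   TF_LOCK_TIMEOUT=10m tf_ci apply tfplan
```

A short wait absorbs overlapping pipeline runs, while a lock that outlives the timeout still surfaces as an error for a human to investigate.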
State Drift Detected
Symptom:
Note: Objects have changed outside of Terraform
Terraform detected the following changes made outside of Terraform
since the last "terraform apply":
Common Causes:
- Manual changes in AWS console
- Another tool modifying resources
- Auto-scaling or auto-remediation
Resolution:
- Review the drift:
terraform plan -detailed-exitcode
- Options:
- Import changes: Update terraform to match reality
- Revert changes: Apply terraform to restore desired state
- Refresh state:
terraform apply -refresh-only
- Import specific changes:
# Update your .tf files, then:
terraform plan # Verify it matches
terraform apply
Prevention:
- Implement policy to prevent manual changes
- Use AWS Config rules to detect drift
- Regular terraform plan to catch drift early
- Consider using Terraform Cloud drift detection
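For scheduled drift checks, the exit status of `terraform plan -detailed-exitcode` is easier to act on than the plan text: 0 means no changes, 1 means an error, and 2 means a non-empty diff. A minimal sketch of that mapping follows; the function name and the labels are hypothetical.

```shell
#!/bin/sh
# Map the exit status of `terraform plan -detailed-exitcode` to a label.
# Documented statuses: 0 = no changes, 1 = error, 2 = changes present.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "changes" ;;
    *) echo "error" ;;
  esac
}

# Usage (e.g. in a scheduled CI job):
#   terraform plan -detailed-exitcode -input=false >plan.log 2>&1
#   status=$(classify_plan_exit "$?")
#   [ "$status" = "changes" ] && echo "Drift or pending changes; see plan.log"
```

Status 2 covers both drift and configuration changes that have not been applied yet, so the plan output still needs a human look.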
State Corruption
Symptom:
Error: Failed to load state
Error: state snapshot was created by Terraform v1.5.0,
which is newer than current v1.3.0
Common Causes:
- Using different Terraform versions
- State file manually edited
- Incomplete state upload
Resolution:
- Version mismatch:
# Upgrade to matching version
tfenv install 1.5.0
tfenv use 1.5.0
- Restore from backup:
# For S3 backend with versioning
aws s3api list-object-versions \
--bucket terraform-state \
--prefix prod/terraform.tfstate
# Restore specific version
aws s3api get-object \
--bucket terraform-state \
--key prod/terraform.tfstate \
--version-id VERSION_ID \
terraform.tfstate
- Rebuild state (last resort):
# Remove corrupted state
terraform state rm aws_instance.example
# Re-import resources
terraform import aws_instance.example i-1234567890abcdef0
Prevention:
- Pin Terraform version in versions.tf
- Enable S3 versioning for state bucket
- Never manually edit state files
- Use consistent Terraform versions across team
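A version mismatch can be caught before Terraform touches the state by comparing the version that wrote it against the installed CLI. The sketch below assumes that a CLI at least as new as the state's writer can read it, and that `sort -V` (GNU coreutils version sort) is available; the function name is hypothetical.

```shell
#!/bin/sh
# Succeed when the installed CLI version is at least the version that
# wrote the state. Relies on `sort -V`, which GNU coreutils provides;
# other platforms may lack it.
cli_can_read_state() {
  local state_version="$1" cli_version="$2" lowest
  lowest=$(printf '%s\n%s\n' "$state_version" "$cli_version" | sort -V | head -n 1)
  [ "$lowest" = "$state_version" ]
}

# Usage (assumes jq is installed):
#   cli=$(terraform version -json | jq -r .terraform_version)
#   cli_can_read_state 1.5.0 "$cli" || echo "Upgrade Terraform before continuing"
```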
Provider Issues
Provider Version Conflict
Symptom:
Error: Incompatible provider version
Provider registry.terraform.io/hashicorp/aws v5.0.0 does not have
a package available for your current platform
Resolution:
- Specify version constraints:
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 4.67.0" # Use compatible version
}
}
}
- Clean provider cache:
rm -rf .terraform
terraform init -upgrade
- Lock file sync:
terraform providers lock \
-platform=darwin_amd64 \
-platform=darwin_arm64 \
-platform=linux_amd64
Authentication Failures
Symptom:
Error: error configuring Terraform AWS Provider:
no valid credential sources found
Common Causes:
- Missing AWS credentials
- Expired credentials
- Incorrect IAM permissions
Resolution:
- Verify credentials:
aws sts get-caller-identity
- Check credential order:
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- Shared credentials file (~/.aws/credentials)
- IAM role (for EC2/ECS)
- Configure provider:
provider "aws" {
region = "us-east-1"
# Option 1: Use profile
profile = "production"
# Option 2: Assume role
assume_role {
role_arn = "arn:aws:iam::ACCOUNT:role/TerraformRole"
}
}
- Check IAM permissions:
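# Test a specific permission. With --dry-run, a "DryRunOperation" error means
# the call would have succeeded; "UnauthorizedOperation" means access is denied.
aws ec2 describe-instances --dry-run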
Prevention:
- Use IAM roles in CI/CD
- Implement OIDC for GitHub Actions
- Regular credential rotation
- Use AWS SSO for developers
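When it is unclear which credentials a run will pick up, a quick check that follows the simplified order listed in the resolution steps can narrow things down. This is only a sketch: the function name and labels are hypothetical, and the real AWS SDK chain has more steps (SSO, web identity, container credentials) that it ignores.

```shell
#!/bin/sh
# Report the first credential source found, following the simplified
# order: environment variables, shared credentials file, then IAM role.
# It cannot confirm that a role is attached, hence "role-or-none".
detect_aws_credential_source() {
  if [ -n "${AWS_ACCESS_KEY_ID:-}" ] && [ -n "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "environment"
  elif [ -f "${AWS_SHARED_CREDENTIALS_FILE:-$HOME/.aws/credentials}" ]; then
    echo "shared-credentials-file"
  else
    echo "role-or-none"
  fi
}

# Usage:
#   echo "Credentials come from: $(detect_aws_credential_source)"
#   aws sts get-caller-identity
```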
Resource Errors
Resource Already Exists
Symptom:
Error: creating EC2 Instance: EntityAlreadyExists:
Resource with id 'i-1234567890abcdef0' already exists
Resolution:
- Import existing resource:
terraform import aws_instance.web i-1234567890abcdef0
- Verify configuration matches:
terraform plan # Should show no changes after import
- If configuration differs, update it:
resource "aws_instance" "web" {
ami = "ami-abc123" # Match existing
instance_type = "t3.micro" # Match existing
}
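Before importing, it can help to confirm that the address is not already tracked, since importing into an address that already exists in state fails. A minimal sketch using `terraform state list`; the function name is hypothetical, and an empty or unreadable state is treated as "not tracked".

```shell
#!/bin/sh
# Succeed when the given resource address is NOT yet tracked in state,
# i.e. when an import is still needed. Matches the address exactly.
needs_import() {
  local address="$1"
  ! terraform state list | grep -Fxq -- "$address"
}

# Usage:
#   if needs_import aws_instance.web; then
#     terraform import aws_instance.web i-1234567890abcdef0
#   fi
```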
Dependency Errors
Symptom:
Error: resource depends on resource "aws_vpc.main" that
is not declared in the configuration
Resolution:
- Add explicit dependency:
resource "aws_subnet" "private" {
vpc_id = aws_vpc.main.id
depends_on = [
aws_internet_gateway.main # Explicit dependency
]
}
- Use data sources for existing resources:
data "aws_vpc" "existing" {
id = "vpc-12345678"
}
resource "aws_subnet" "new" {
vpc_id = data.aws_vpc.existing.id
}
Timeout Errors
Symptom:
Error: timeout while waiting for state to become 'available'
(last state: 'pending', timeout: 10m0s)
Resolution:
- Increase timeout:
resource "aws_db_instance" "main" {
# ... configuration ...
timeouts {
create = "60m"
update = "60m"
delete = "60m"
}
}
- Check resource status manually:
aws rds describe-db-instances --db-instance-identifier mydb
- Retry the operation:
terraform apply
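In automation, the retry step can be wrapped in a bounded loop so that a slow resource gets a few more chances without looping forever. The helper below is a sketch: `retry` and the `RETRY_DELAY` variable are hypothetical names, and doubling the delay each round is one choice among many.

```shell
#!/bin/sh
# Run a command up to MAX_ATTEMPTS times, sleeping between failures.
# RETRY_DELAY (seconds, default 30) is an assumed variable name; the
# delay doubles after each failed attempt.
retry() {
  local max_attempts="$1" attempt=1 delay="${RETRY_DELAY:-30}"
  shift
  while :; do
    "$@" && return 0
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "retry: giving up after $attempt attempts: $*" >&2
      return 1
    fi
    echo "retry: attempt $attempt failed; sleeping ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Usage:
#   retry 3 terraform apply -input=false -auto-approve
```

Retrying only helps with transient failures; if every attempt hits the same timeout, raise the `timeouts` values or check the resource status instead.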
Module Issues
Module Source Not Found
Symptom:
Error: Failed to download module
Could not download module "vpc" (main.tf:10) source:
git::https://github.com/company/terraform-modules.git//vpc
Resolution:
- Verify source URL:
module "vpc" {
source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.0.0"
# Add authentication if private repo
}
- For private repos, configure Git auth:
# SSH key
git config --global url."git@github.com:".insteadOf "https://github.com/"
# Or use HTTPS with token
git config --global url."https://oauth2:TOKEN@github.com/".insteadOf "https://github.com/"
- Clear module cache:
rm -rf .terraform/modules
terraform init
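When the source string itself is suspect, splitting it into repository, subdirectory, and ref lets each part be checked on its own. The parser below is a sketch that assumes a URL with a scheme (such as https:// or ssh://) and at most a single `?ref=` query parameter; the function name is hypothetical.

```shell
#!/bin/sh
# Split a "git::<url>//<subdir>?ref=<ref>" module source into its parts.
# Assumes the URL has a scheme and carries no query parameter besides ref.
parse_module_source() {
  local src="${1#git::}" ref="" subdir="" scheme rest
  case "$src" in
    *\?ref=*) ref=${src##*\?ref=}; src=${src%%\?*} ;;
  esac
  scheme=${src%%://*}
  rest=${src#*://}
  case "$rest" in
    *//*) subdir=${rest#*//}; rest=${rest%%//*} ;;
  esac
  printf 'repo=%s\nsubdir=%s\nref=%s\n' "$scheme://$rest" "$subdir" "$ref"
}

# Usage:
#   parse_module_source "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.0.0"
#   # then confirm the ref exists on the remote:
#   git ls-remote --exit-code https://github.com/company/terraform-modules.git v1.0.0
```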
Module Version Conflicts
Symptom:
Error: Inconsistent dependency lock file
Module has dependencies locked at version 1.0.0 but
root module requires version 2.0.0
Resolution:
- Update lock file:
terraform init -upgrade
- Pin module version:
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 3.0" # Compatible with 3.x
}
Terragrunt Specific
Dependency Cycle Detected
Symptom:
Error: Dependency cycle detected:
module-a depends on module-b
module-b depends on module-c
module-c depends on module-a
Resolution:
- Review dependencies in terragrunt.hcl:
dependency "vpc" {
config_path = "../vpc"
}
dependency "database" {
config_path = "../database"
}
# Don't create circular references!
- Refactor to remove cycle:
- Split modules differently
- Use data sources instead of dependencies
- Pass values through variables
- Use mock outputs during planning:
dependency "vpc" {
config_path = "../vpc"
mock_outputs = {
vpc_id = "vpc-mock"
}
mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}
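While refactoring, a proposed dependency layout can be checked for cycles before any terragrunt.hcl is edited. The sketch below feeds hand-written "unit dependency" pairs to `tsort`, which reports loops on stderr; the function name is hypothetical, and malformed input (an odd number of tokens) is also reported as a problem.

```shell
#!/bin/sh
# Read "unit dependency" pairs on stdin (one edge per line) and succeed
# when tsort reports a problem with the graph, typically a loop. Checking
# stderr avoids relying on tsort's exit status, which varies by platform.
has_dependency_cycle() {
  local errors
  errors=$(tsort 2>&1 >/dev/null)
  [ -n "$errors" ]
}

# Usage:
#   printf '%s\n' "app vpc" "app database" "database vpc" > edges.txt
#   if has_dependency_cycle < edges.txt; then echo "cycle found"; fi
```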
Hook Failures
Symptom:
Error: Hook execution failed
Command: pre_apply_hook.sh
Exit code: 1
Resolution:
- Debug the hook:
# Run hook manually
bash .terragrunt-cache/.../pre_apply_hook.sh
- Add error handling to hook:
#!/bin/bash
set -e # Exit on error
# Your hook logic
if ! command -v jq &> /dev/null; then
echo "jq is required but not installed"
exit 1
fi
- Make hook executable:
chmod +x hooks/pre_apply_hook.sh
Include Path Issues
Symptom:
Error: Cannot include file
Path does not exist: ../common.hcl
Resolution:
- Use correct relative path:
include "root" {
path = find_in_parent_folders()
}
include "common" {
path = "${get_terragrunt_dir()}/../common.hcl"
}
- Verify file exists:
ls -la ../common.hcl
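To see which file a parent-folder lookup would reach, the upward search can be approximated in the shell. This is a sketch of the idea only, not Terragrunt's implementation: `find_in_parents` is a hypothetical helper, and the behaviour of `find_in_parent_folders()` may differ between Terragrunt versions.

```shell
#!/bin/sh
# Walk up from a starting directory (default: current directory) and print
# the first parent directory's copy of the named file. The starting
# directory itself is not checked, only its ancestors.
find_in_parents() {
  local name="$1" dir
  dir=$(cd "${2:-.}" && pwd) || return 1
  while [ "$dir" != "/" ]; do
    dir=$(dirname "$dir")
    if [ -e "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  return 1
}

# Usage:
#   find_in_parents common.hcl || echo "common.hcl not found in any parent folder"
```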
Performance Issues
Slow Plans/Applies
Symptoms:
- terraform plan takes >5 minutes
- terraform apply very slow
- State operations timing out
Common Causes:
- Too many resources in single state
- Slow provider API calls
- Large number of data sources
- Complex interpolations
Resolution:
- Split state files:
networking/ # Separate state
compute/ # Separate state
database/ # Separate state
- Use targeted operations:
terraform plan -target=aws_instance.web
terraform apply -target=module.vpc
- Optimize data sources:
# Bad - queries every plan
data "aws_ami" "ubuntu" {
most_recent = true
# ... filters
}
# Better - use specific AMI
variable "ami_id" {
default = "ami-abc123" # Update periodically
}
- Enable parallelism:
terraform apply -parallelism=20 # Default is 10
- Skip redundant backend validation checks (Terragrunt):
remote_state {
backend = "s3"
config = {
skip_credentials_validation = true # Faster
skip_metadata_api_check = true
}
}
Quick Diagnostic Steps
When encountering any Terraform error:
- Read the full error message - Don't skip details
- Check recent changes - What changed since last successful run?
- Verify versions - Terraform, providers, modules
- Check state - Is it locked? Corrupted?
- Test authentication - Can you access resources manually?
- Review logs - Use TF_LOG=DEBUG for detailed output
- Isolate the problem - Use -target to test specific resources
Enable Debug Logging
export TF_LOG=DEBUG
export TF_LOG_PATH=terraform-debug.log
terraform plan
Test Configuration
terraform validate # Syntax check
terraform fmt -check # Format check
tflint # Linting
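These checks can be chained into one pre-flight step for local use or CI. A minimal sketch, assuming the working directory has already been initialised with terraform init (which terraform validate needs); the function name is hypothetical and tflint is treated as optional.

```shell
#!/bin/sh
# Run the configuration checks in order and stop at the first failure.
# tflint is skipped with a warning when it is not installed.
preflight() {
  terraform validate || return 1
  terraform fmt -check || return 1
  if command -v tflint >/dev/null 2>&1; then
    tflint || return 1
  else
    echo "preflight: tflint not found, skipping lint" >&2
  fi
}

# Usage:
#   preflight && terraform plan
```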
Prevention Checklist
- Use remote state with locking
- Pin Terraform and provider versions
- Implement pre-commit hooks
- Run plan before every apply
- Use modules for reusable components
- Enable state versioning/backups
- Document architecture and dependencies
- Implement CI/CD with proper reviews
- Run terraform plan regularly in CI to detect drift
- Monitor and alert on state changes