Initial commit

This commit is contained in:
Zhongwei Li
2025-11-29 17:51:17 +08:00
commit 04d2231fb6
14 changed files with 4569 additions and 0 deletions

View File

@@ -0,0 +1,635 @@
# Terraform Troubleshooting Guide
Common Terraform and Terragrunt issues with solutions.
## Table of Contents
1. [State Issues](#state-issues)
2. [Provider Issues](#provider-issues)
3. [Resource Errors](#resource-errors)
4. [Module Issues](#module-issues)
5. [Terragrunt Specific](#terragrunt-specific)
6. [Performance Issues](#performance-issues)
---
## State Issues
### State Lock Error
**Symptom:**
```
Error locking state: Error acquiring the state lock
Lock Info:
ID: abc123...
Path: terraform.tfstate
Operation: OperationTypeApply
Who: user@hostname
Created: 2024-01-15 10:30:00 UTC
```
**Common Causes:**
1. Previous operation crashed or was interrupted
2. Another user/process is running terraform
3. State lock wasn't released properly
**Resolution:**
1. **Verify no one else is running terraform:**
```bash
# Check with team first!
```
2. **Force unlock (use with caution):**
```bash
terraform force-unlock abc123
```
3. **For DynamoDB backend, check lock table:**
```bash
aws dynamodb get-item \
--table-name terraform-state-lock \
--key '{"LockID": {"S": "path/to/state/terraform.tfstate-md5"}}'
```
**Prevention:**
- Use proper state locking backend (S3 + DynamoDB)
- Implement timeout in CI/CD pipelines
- Always let terraform complete or properly cancel
---
### State Drift Detected
**Symptom:**
```
Note: Objects have changed outside of Terraform
Terraform detected the following changes made outside of Terraform
since the last "terraform apply":
```
**Common Causes:**
1. Manual changes in AWS console
2. Another tool modifying resources
3. Auto-scaling or auto-remediation
**Resolution:**
1. **Review the drift:**
```bash
terraform plan -detailed-exitcode
```
2. **Options:**
- **Import changes:** Update terraform to match reality
- **Revert changes:** Apply terraform to restore desired state
- **Refresh state:** `terraform apply -refresh-only`
3. **Import specific changes:**
```bash
# Update your .tf files, then:
terraform plan # Verify it matches
terraform apply
```
**Prevention:**
- Implement policy to prevent manual changes
- Use AWS Config rules to detect drift
- Regular `terraform plan` to catch drift early
- Consider using Terraform Cloud drift detection
---
### State Corruption
**Symptom:**
```
Error: Failed to load state
Error: state snapshot was created by Terraform v1.5.0,
which is newer than current v1.3.0
```
**Common Causes:**
1. Using different Terraform versions
2. State file manually edited
3. Incomplete state upload
**Resolution:**
1. **Version mismatch:**
```bash
# Upgrade to matching version
tfenv install 1.5.0
tfenv use 1.5.0
```
2. **Restore from backup:**
```bash
# For S3 backend with versioning
aws s3api list-object-versions \
--bucket terraform-state \
--prefix prod/terraform.tfstate
# Restore specific version
aws s3api get-object \
--bucket terraform-state \
--key prod/terraform.tfstate \
--version-id VERSION_ID \
terraform.tfstate
```
3. **Rebuild state (last resort):**
```bash
# Remove corrupted state
terraform state rm aws_instance.example
# Re-import resources
terraform import aws_instance.example i-1234567890abcdef0
```
**Prevention:**
- Pin Terraform version in `versions.tf`
- Enable S3 versioning for state bucket
- Never manually edit state files
- Use consistent Terraform versions across team
---
## Provider Issues
### Provider Version Conflict
**Symptom:**
```
Error: Incompatible provider version
Provider registry.terraform.io/hashicorp/aws v5.0.0 does not have
a package available for your current platform
```
**Resolution:**
1. **Specify version constraints:**
```hcl
terraform {
required_providers {
aws = {
source = "hashicorp/aws"
version = "~> 4.67.0" # Use compatible version
}
}
}
```
2. **Clean provider cache:**
```bash
rm -rf .terraform
terraform init -upgrade
```
3. **Lock file sync:**
```bash
terraform providers lock \
-platform=darwin_amd64 \
-platform=darwin_arm64 \
-platform=linux_amd64
```
---
### Authentication Failures
**Symptom:**
```
Error: error configuring Terraform AWS Provider:
no valid credential sources found
```
**Common Causes:**
1. Missing AWS credentials
2. Expired credentials
3. Incorrect IAM permissions
**Resolution:**
1. **Verify credentials:**
```bash
aws sts get-caller-identity
```
2. **Check credential order:**
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
- Shared credentials file (~/.aws/credentials)
- IAM role (for EC2/ECS)
3. **Configure provider:**
```hcl
provider "aws" {
region = "us-east-1"
# Option 1: Use profile
profile = "production"
# Option 2: Assume role
assume_role {
role_arn = "arn:aws:iam::ACCOUNT:role/TerraformRole"
}
}
```
4. **Check IAM permissions:**
```bash
# Test specific permission
aws ec2 describe-instances --dry-run
```
**Prevention:**
- Use IAM roles in CI/CD
- Implement OIDC for GitHub Actions
- Regular credential rotation
- Use AWS SSO for developers
---
## Resource Errors
### Resource Already Exists
**Symptom:**
```
Error: creating EC2 Instance: EntityAlreadyExists:
Resource with id 'i-1234567890abcdef0' already exists
```
**Resolution:**
1. **Import existing resource:**
```bash
terraform import aws_instance.web i-1234567890abcdef0
```
2. **Verify configuration matches:**
```bash
terraform plan # Should show no changes after import
```
3. **If configuration differs, update it:**
```hcl
resource "aws_instance" "web" {
ami = "ami-abc123" # Match existing
instance_type = "t3.micro" # Match existing
}
```
---
### Dependency Errors
**Symptom:**
```
Error: resource depends on resource "aws_vpc.main" that
is not declared in the configuration
```
**Resolution:**
1. **Add explicit dependency:**
```hcl
resource "aws_subnet" "private" {
vpc_id = aws_vpc.main.id
depends_on = [
aws_internet_gateway.main # Explicit dependency
]
}
```
2. **Use data sources for existing resources:**
```hcl
data "aws_vpc" "existing" {
id = "vpc-12345678"
}
resource "aws_subnet" "new" {
vpc_id = data.aws_vpc.existing.id
}
```
---
### Timeout Errors
**Symptom:**
```
Error: timeout while waiting for state to become 'available'
(last state: 'pending', timeout: 10m0s)
```
**Resolution:**
1. **Increase timeout:**
```hcl
resource "aws_db_instance" "main" {
# ... configuration ...
timeouts {
create = "60m"
update = "60m"
delete = "60m"
}
}
```
2. **Check resource status manually:**
```bash
aws rds describe-db-instances --db-instance-identifier mydb
```
3. **Retry the operation:**
```bash
terraform apply
```
---
## Module Issues
### Module Source Not Found
**Symptom:**
```
Error: Failed to download module
Could not download module "vpc" (main.tf:10) source:
git::https://github.com/company/terraform-modules.git//vpc
```
**Resolution:**
1. **Verify source URL:**
```hcl
module "vpc" {
source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.0.0"
# Add authentication if private repo
}
```
2. **For private repos, configure Git auth:**
```bash
# SSH key
git config --global url."git@github.com:".insteadOf "https://github.com/"
# Or use HTTPS with token
git config --global url."https://oauth2:TOKEN@github.com/".insteadOf "https://github.com/"
```
3. **Clear module cache:**
```bash
rm -rf .terraform/modules
terraform init
```
---
### Module Version Conflicts
**Symptom:**
```
Error: Inconsistent dependency lock file
Module has dependencies locked at version 1.0.0 but
root module requires version 2.0.0
```
**Resolution:**
1. **Update lock file:**
```bash
terraform init -upgrade
```
2. **Pin module version:**
```hcl
module "vpc" {
source = "terraform-aws-modules/vpc/aws"
version = "~> 3.0" # Compatible with 3.x
}
```
---
## Terragrunt Specific
### Dependency Cycle Detected
**Symptom:**
```
Error: Dependency cycle detected:
module-a depends on module-b
module-b depends on module-c
module-c depends on module-a
```
**Resolution:**
1. **Review dependencies in terragrunt.hcl:**
```hcl
dependency "vpc" {
config_path = "../vpc"
}
dependency "database" {
config_path = "../database"
}
# Don't create circular references!
```
2. **Refactor to remove cycle:**
- Split modules differently
- Use data sources instead of dependencies
- Pass values through variables
3. **Use mock outputs during planning:**
```hcl
dependency "vpc" {
config_path = "../vpc"
mock_outputs = {
vpc_id = "vpc-mock"
}
mock_outputs_allowed_terraform_commands = ["validate", "plan"]
}
```
---
### Hook Failures
**Symptom:**
```
Error: Hook execution failed
Command: pre_apply_hook.sh
Exit code: 1
```
**Resolution:**
1. **Debug the hook:**
```bash
# Run hook manually
bash .terragrunt-cache/.../pre_apply_hook.sh
```
2. **Add error handling to hook:**
```bash
#!/bin/bash
set -e # Exit on error
# Your hook logic
if ! command -v jq &> /dev/null; then
echo "jq is required but not installed"
exit 1
fi
```
3. **Make hook executable:**
```bash
chmod +x hooks/pre_apply_hook.sh
```
---
### Include Path Issues
**Symptom:**
```
Error: Cannot include file
Path does not exist: ../common.hcl
```
**Resolution:**
1. **Use correct relative path:**
```hcl
include "root" {
path = find_in_parent_folders()
}
include "common" {
path = "${get_terragrunt_dir()}/../common.hcl"
}
```
2. **Verify file exists:**
```bash
ls -la ../common.hcl
```
---
## Performance Issues
### Slow Plans/Applies
**Symptoms:**
- `terraform plan` takes >5 minutes
- `terraform apply` very slow
- State operations timing out
**Common Causes:**
1. Too many resources in single state
2. Slow provider API calls
3. Large number of data sources
4. Complex interpolations
**Resolution:**
1. **Split state files:**
```
networking/ # Separate state
compute/ # Separate state
database/ # Separate state
```
2. **Use targeted operations:**
```bash
terraform plan -target=aws_instance.web
terraform apply -target=module.vpc
```
3. **Optimize data sources:**
```hcl
# Bad - queries every plan
data "aws_ami" "ubuntu" {
most_recent = true
# ... filters
}
# Better - use specific AMI
variable "ami_id" {
default = "ami-abc123" # Update periodically
}
```
4. **Enable parallelism:**
```bash
terraform apply -parallelism=20 # Default is 10
```
5. **Use caching (Terragrunt):**
```hcl
remote_state {
backend = "s3"
config = {
skip_credentials_validation = true # Faster
skip_metadata_api_check = true
}
}
```
---
## Quick Diagnostic Steps
When encountering any Terraform error:
1. **Read the full error message** - Don't skip details
2. **Check recent changes** - What changed since last successful run?
3. **Verify versions** - Terraform, providers, modules
4. **Check state** - Is it locked? Corrupted?
5. **Test authentication** - Can you access resources manually?
6. **Review logs** - Use TF_LOG=DEBUG for detailed output
7. **Isolate the problem** - Use -target to test specific resources
### Enable Debug Logging
```bash
export TF_LOG=DEBUG
export TF_LOG_PATH=terraform-debug.log
terraform plan
```
### Test Configuration
```bash
terraform validate # Syntax check
terraform fmt -check # Format check
tflint # Linting
```
---
## Prevention Checklist
- [ ] Use remote state with locking
- [ ] Pin Terraform and provider versions
- [ ] Implement pre-commit hooks
- [ ] Run plan before every apply
- [ ] Use modules for reusable components
- [ ] Enable state versioning/backups
- [ ] Document architecture and dependencies
- [ ] Implement CI/CD with proper reviews
- [ ] Regular terraform plan in CI to detect drift
- [ ] Monitor and alert on state changes