Initial commit
This commit is contained in:
635
skills/references/troubleshooting.md
Normal file
635
skills/references/troubleshooting.md
Normal file
@@ -0,0 +1,635 @@
|
||||
# Terraform Troubleshooting Guide
|
||||
|
||||
Common Terraform and Terragrunt issues with solutions.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [State Issues](#state-issues)
|
||||
2. [Provider Issues](#provider-issues)
|
||||
3. [Resource Errors](#resource-errors)
|
||||
4. [Module Issues](#module-issues)
|
||||
5. [Terragrunt Specific](#terragrunt-specific)
|
||||
6. [Performance Issues](#performance-issues)
|
||||
|
||||
---
|
||||
|
||||
## State Issues
|
||||
|
||||
### State Lock Error
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error locking state: Error acquiring the state lock
|
||||
Lock Info:
|
||||
ID: abc123...
|
||||
Path: terraform.tfstate
|
||||
Operation: OperationTypeApply
|
||||
Who: user@hostname
|
||||
Created: 2024-01-15 10:30:00 UTC
|
||||
```
|
||||
|
||||
**Common Causes:**
|
||||
1. Previous operation crashed or was interrupted
|
||||
2. Another user/process is running terraform
|
||||
3. State lock wasn't released properly
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Verify no one else is running terraform:**
|
||||
```bash
|
||||
# Check with team first!
|
||||
```
|
||||
|
||||
2. **Force unlock (use with caution):**
|
||||
```bash
|
||||
terraform force-unlock abc123
|
||||
```
|
||||
|
||||
3. **For DynamoDB backend, check lock table:**
|
||||
```bash
|
||||
aws dynamodb get-item \
|
||||
--table-name terraform-state-lock \
|
||||
--key '{"LockID": {"S": "path/to/state/terraform.tfstate-md5"}}'
|
||||
```
|
||||
|
||||
**Prevention:**
|
||||
- Use proper state locking backend (S3 + DynamoDB)
|
||||
- Implement timeout in CI/CD pipelines
|
||||
- Always let terraform complete or properly cancel
|
||||
|
||||
---
|
||||
|
||||
### State Drift Detected
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Note: Objects have changed outside of Terraform
|
||||
|
||||
Terraform detected the following changes made outside of Terraform
|
||||
since the last "terraform apply":
|
||||
```
|
||||
|
||||
**Common Causes:**
|
||||
1. Manual changes in AWS console
|
||||
2. Another tool modifying resources
|
||||
3. Auto-scaling or auto-remediation
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Review the drift:**
|
||||
```bash
|
||||
terraform plan -detailed-exitcode
|
||||
```
|
||||
|
||||
2. **Options:**
|
||||
- **Import changes:** Update terraform to match reality
|
||||
- **Revert changes:** Apply terraform to restore desired state
|
||||
- **Refresh state:** `terraform apply -refresh-only`
|
||||
|
||||
3. **Import specific changes:**
|
||||
```bash
|
||||
# Update your .tf files, then:
|
||||
terraform plan # Verify it matches
|
||||
terraform apply
|
||||
```
|
||||
|
||||
**Prevention:**
|
||||
- Implement policy to prevent manual changes
|
||||
- Use AWS Config rules to detect drift
|
||||
- Regular `terraform plan` to catch drift early
|
||||
- Consider using Terraform Cloud drift detection
|
||||
|
||||
---
|
||||
|
||||
### State Corruption
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Failed to load state
|
||||
Error: state snapshot was created by Terraform v1.5.0,
|
||||
which is newer than current v1.3.0
|
||||
```
|
||||
|
||||
**Common Causes:**
|
||||
1. Using different Terraform versions
|
||||
2. State file manually edited
|
||||
3. Incomplete state upload
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Version mismatch:**
|
||||
```bash
|
||||
# Upgrade to matching version
|
||||
tfenv install 1.5.0
|
||||
tfenv use 1.5.0
|
||||
```
|
||||
|
||||
2. **Restore from backup:**
|
||||
```bash
|
||||
# For S3 backend with versioning
|
||||
aws s3api list-object-versions \
|
||||
--bucket terraform-state \
|
||||
--prefix prod/terraform.tfstate
|
||||
|
||||
# Restore specific version
|
||||
aws s3api get-object \
|
||||
--bucket terraform-state \
|
||||
--key prod/terraform.tfstate \
|
||||
--version-id VERSION_ID \
|
||||
terraform.tfstate
|
||||
```
|
||||
|
||||
3. **Rebuild state (last resort):**
|
||||
```bash
|
||||
# Remove corrupted state
|
||||
terraform state rm aws_instance.example
|
||||
|
||||
# Re-import resources
|
||||
terraform import aws_instance.example i-1234567890abcdef0
|
||||
```
|
||||
|
||||
**Prevention:**
|
||||
- Pin Terraform version in `versions.tf`
|
||||
- Enable S3 versioning for state bucket
|
||||
- Never manually edit state files
|
||||
- Use consistent Terraform versions across team
|
||||
|
||||
---
|
||||
|
||||
## Provider Issues
|
||||
|
||||
### Provider Version Conflict
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Incompatible provider version
|
||||
|
||||
Provider registry.terraform.io/hashicorp/aws v5.0.0 does not have
|
||||
a package available for your current platform
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Specify version constraints:**
|
||||
```hcl
|
||||
terraform {
|
||||
required_providers {
|
||||
aws = {
|
||||
source = "hashicorp/aws"
|
||||
version = "~> 4.67.0" # Use compatible version
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
2. **Clean provider cache:**
|
||||
```bash
|
||||
rm -rf .terraform
|
||||
terraform init -upgrade
|
||||
```
|
||||
|
||||
3. **Lock file sync:**
|
||||
```bash
|
||||
terraform providers lock \
|
||||
-platform=darwin_amd64 \
|
||||
-platform=darwin_arm64 \
|
||||
-platform=linux_amd64
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Authentication Failures
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: error configuring Terraform AWS Provider:
|
||||
no valid credential sources found
|
||||
```
|
||||
|
||||
**Common Causes:**
|
||||
1. Missing AWS credentials
|
||||
2. Expired credentials
|
||||
3. Incorrect IAM permissions
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Verify credentials:**
|
||||
```bash
|
||||
aws sts get-caller-identity
|
||||
```
|
||||
|
||||
2. **Check credential order:**
|
||||
- Environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY)
|
||||
- Shared credentials file (~/.aws/credentials)
|
||||
- IAM role (for EC2/ECS)
|
||||
|
||||
3. **Configure provider:**
|
||||
```hcl
|
||||
provider "aws" {
|
||||
region = "us-east-1"
|
||||
|
||||
# Option 1: Use profile
|
||||
profile = "production"
|
||||
|
||||
# Option 2: Assume role
|
||||
assume_role {
|
||||
role_arn = "arn:aws:iam::ACCOUNT:role/TerraformRole"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
4. **Check IAM permissions:**
|
||||
```bash
|
||||
# Test specific permission
|
||||
aws ec2 describe-instances --dry-run
|
||||
```
|
||||
|
||||
**Prevention:**
|
||||
- Use IAM roles in CI/CD
|
||||
- Implement OIDC for GitHub Actions
|
||||
- Regular credential rotation
|
||||
- Use AWS SSO for developers
|
||||
|
||||
---
|
||||
|
||||
## Resource Errors
|
||||
|
||||
### Resource Already Exists
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: creating EC2 Instance: EntityAlreadyExists:
|
||||
Resource with id 'i-1234567890abcdef0' already exists
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Import existing resource:**
|
||||
```bash
|
||||
terraform import aws_instance.web i-1234567890abcdef0
|
||||
```
|
||||
|
||||
2. **Verify configuration matches:**
|
||||
```bash
|
||||
terraform plan # Should show no changes after import
|
||||
```
|
||||
|
||||
3. **If configuration differs, update it:**
|
||||
```hcl
|
||||
resource "aws_instance" "web" {
|
||||
ami = "ami-abc123" # Match existing
|
||||
instance_type = "t3.micro" # Match existing
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Dependency Errors
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: resource depends on resource "aws_vpc.main" that
|
||||
is not declared in the configuration
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Add explicit dependency:**
|
||||
```hcl
|
||||
resource "aws_subnet" "private" {
|
||||
vpc_id = aws_vpc.main.id
|
||||
|
||||
depends_on = [
|
||||
aws_internet_gateway.main # Explicit dependency
|
||||
]
|
||||
}
|
||||
```
|
||||
|
||||
2. **Use data sources for existing resources:**
|
||||
```hcl
|
||||
data "aws_vpc" "existing" {
|
||||
id = "vpc-12345678"
|
||||
}
|
||||
|
||||
resource "aws_subnet" "new" {
|
||||
vpc_id = data.aws_vpc.existing.id
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Timeout Errors
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: timeout while waiting for state to become 'available'
|
||||
(last state: 'pending', timeout: 10m0s)
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Increase timeout:**
|
||||
```hcl
|
||||
resource "aws_db_instance" "main" {
|
||||
# ... configuration ...
|
||||
|
||||
timeouts {
|
||||
create = "60m"
|
||||
update = "60m"
|
||||
delete = "60m"
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
2. **Check resource status manually:**
|
||||
```bash
|
||||
aws rds describe-db-instances --db-instance-identifier mydb
|
||||
```
|
||||
|
||||
3. **Retry the operation:**
|
||||
```bash
|
||||
terraform apply
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Module Issues
|
||||
|
||||
### Module Source Not Found
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Failed to download module
|
||||
|
||||
Could not download module "vpc" (main.tf:10) source:
|
||||
git::https://github.com/company/terraform-modules.git//vpc
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Verify source URL:**
|
||||
```hcl
|
||||
module "vpc" {
|
||||
source = "git::https://github.com/company/terraform-modules.git//vpc?ref=v1.0.0"
|
||||
# Add authentication if private repo
|
||||
}
|
||||
```
|
||||
|
||||
2. **For private repos, configure Git auth:**
|
||||
```bash
|
||||
# SSH key
|
||||
git config --global url."git@github.com:".insteadOf "https://github.com/"
|
||||
|
||||
# Or use HTTPS with token
|
||||
git config --global url."https://oauth2:TOKEN@github.com/".insteadOf "https://github.com/"
|
||||
```
|
||||
|
||||
3. **Clear module cache:**
|
||||
```bash
|
||||
rm -rf .terraform/modules
|
||||
terraform init
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Module Version Conflicts
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Inconsistent dependency lock file
|
||||
|
||||
Module has dependencies locked at version 1.0.0 but
|
||||
root module requires version 2.0.0
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Update lock file:**
|
||||
```bash
|
||||
terraform init -upgrade
|
||||
```
|
||||
|
||||
2. **Pin module version:**
|
||||
```hcl
|
||||
module "vpc" {
|
||||
source = "terraform-aws-modules/vpc/aws"
|
||||
version = "~> 3.0" # Compatible with 3.x
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Terragrunt Specific
|
||||
|
||||
### Dependency Cycle Detected
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Dependency cycle detected:
|
||||
module-a depends on module-b
|
||||
module-b depends on module-c
|
||||
module-c depends on module-a
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Review dependencies in terragrunt.hcl:**
|
||||
```hcl
|
||||
dependency "vpc" {
|
||||
config_path = "../vpc"
|
||||
}
|
||||
|
||||
dependency "database" {
|
||||
config_path = "../database"
|
||||
}
|
||||
|
||||
# Don't create circular references!
|
||||
```
|
||||
|
||||
2. **Refactor to remove cycle:**
|
||||
- Split modules differently
|
||||
- Use data sources instead of dependencies
|
||||
- Pass values through variables
|
||||
|
||||
3. **Use mock outputs during planning:**
|
||||
```hcl
|
||||
dependency "vpc" {
|
||||
config_path = "../vpc"
|
||||
|
||||
mock_outputs = {
|
||||
vpc_id = "vpc-mock"
|
||||
}
|
||||
mock_outputs_allowed_terraform_commands = ["validate", "plan"]
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Hook Failures
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Hook execution failed
|
||||
Command: pre_apply_hook.sh
|
||||
Exit code: 1
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Debug the hook:**
|
||||
```bash
|
||||
# Run hook manually
|
||||
bash .terragrunt-cache/.../pre_apply_hook.sh
|
||||
```
|
||||
|
||||
2. **Add error handling to hook:**
|
||||
```bash
|
||||
#!/bin/bash
|
||||
set -e # Exit on error
|
||||
|
||||
# Your hook logic
|
||||
if ! command -v jq &> /dev/null; then
|
||||
echo "jq is required but not installed"
|
||||
exit 1
|
||||
fi
|
||||
```
|
||||
|
||||
3. **Make hook executable:**
|
||||
```bash
|
||||
chmod +x hooks/pre_apply_hook.sh
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Include Path Issues
|
||||
|
||||
**Symptom:**
|
||||
```
|
||||
Error: Cannot include file
|
||||
Path does not exist: ../common.hcl
|
||||
```
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Use correct relative path:**
|
||||
```hcl
|
||||
include "root" {
|
||||
path = find_in_parent_folders()
|
||||
}
|
||||
|
||||
include "common" {
|
||||
path = "${get_terragrunt_dir()}/../common.hcl"
|
||||
}
|
||||
```
|
||||
|
||||
2. **Verify file exists:**
|
||||
```bash
|
||||
ls -la ../common.hcl
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Performance Issues
|
||||
|
||||
### Slow Plans/Applies
|
||||
|
||||
**Symptoms:**
|
||||
- `terraform plan` takes >5 minutes
|
||||
- `terraform apply` very slow
|
||||
- State operations timing out
|
||||
|
||||
**Common Causes:**
|
||||
1. Too many resources in single state
|
||||
2. Slow provider API calls
|
||||
3. Large number of data sources
|
||||
4. Complex interpolations
|
||||
|
||||
**Resolution:**
|
||||
|
||||
1. **Split state files:**
|
||||
```
|
||||
networking/ # Separate state
|
||||
compute/ # Separate state
|
||||
database/ # Separate state
|
||||
```
|
||||
|
||||
2. **Use targeted operations:**
|
||||
```bash
|
||||
terraform plan -target=aws_instance.web
|
||||
terraform apply -target=module.vpc
|
||||
```
|
||||
|
||||
3. **Optimize data sources:**
|
||||
```hcl
|
||||
# Bad - queries every plan
|
||||
data "aws_ami" "ubuntu" {
|
||||
most_recent = true
|
||||
# ... filters
|
||||
}
|
||||
|
||||
# Better - use specific AMI
|
||||
variable "ami_id" {
|
||||
default = "ami-abc123" # Update periodically
|
||||
}
|
||||
```
|
||||
|
||||
4. **Enable parallelism:**
|
||||
```bash
|
||||
terraform apply -parallelism=20 # Default is 10
|
||||
```
|
||||
|
||||
5. **Use caching (Terragrunt):**
|
||||
```hcl
|
||||
remote_state {
|
||||
backend = "s3"
|
||||
config = {
|
||||
skip_credentials_validation = true # Faster
|
||||
skip_metadata_api_check = true
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quick Diagnostic Steps
|
||||
|
||||
When encountering any Terraform error:
|
||||
|
||||
1. **Read the full error message** - Don't skip details
|
||||
2. **Check recent changes** - What changed since last successful run?
|
||||
3. **Verify versions** - Terraform, providers, modules
|
||||
4. **Check state** - Is it locked? Corrupted?
|
||||
5. **Test authentication** - Can you access resources manually?
|
||||
6. **Review logs** - Use TF_LOG=DEBUG for detailed output
|
||||
7. **Isolate the problem** - Use -target to test specific resources
|
||||
|
||||
### Enable Debug Logging
|
||||
|
||||
```bash
|
||||
export TF_LOG=DEBUG
|
||||
export TF_LOG_PATH=terraform-debug.log
|
||||
terraform plan
|
||||
```
|
||||
|
||||
### Test Configuration
|
||||
|
||||
```bash
|
||||
terraform validate # Syntax check
|
||||
terraform fmt -check # Format check
|
||||
tflint # Linting
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Prevention Checklist
|
||||
|
||||
- [ ] Use remote state with locking
|
||||
- [ ] Pin Terraform and provider versions
|
||||
- [ ] Implement pre-commit hooks
|
||||
- [ ] Run plan before every apply
|
||||
- [ ] Use modules for reusable components
|
||||
- [ ] Enable state versioning/backups
|
||||
- [ ] Document architecture and dependencies
|
||||
- [ ] Implement CI/CD with proper reviews
|
||||
- [ ] Regular terraform plan in CI to detect drift
|
||||
- [ ] Monitor and alert on state changes
|
||||
Reference in New Issue
Block a user