Initial commit

Zhongwei Li
2025-11-29 18:38:39 +08:00
commit e52a0f7f53
14 changed files with 1919 additions and 0 deletions


@@ -0,0 +1,274 @@
---
name: cost-optimization
description: Optimize cloud costs through resource rightsizing, tagging strategies, reserved instances, and spending analysis. Use when reducing cloud expenses, analyzing infrastructure costs, or implementing cost governance policies.
---
# Cloud Cost Optimization
Strategies and patterns for optimizing cloud costs across AWS, Azure, and GCP.
## Purpose
Implement systematic cost optimization strategies to reduce cloud spending while maintaining performance and reliability.
## When to Use
- Reduce cloud spending
- Right-size resources
- Implement cost governance
- Optimize multi-cloud costs
- Meet budget constraints
## Cost Optimization Framework
### 1. Visibility
- Implement cost allocation tags
- Use cloud cost management tools
- Set up budget alerts
- Create cost dashboards
### 2. Right-Sizing
- Analyze resource utilization
- Downsize over-provisioned resources
- Use auto-scaling
- Remove idle resources
### 3. Pricing Models
- Use reserved capacity
- Leverage spot/preemptible instances
- Implement savings plans
- Use committed use discounts
### 4. Architecture Optimization
- Use managed services
- Implement caching
- Optimize data transfer
- Use lifecycle policies
## AWS Cost Optimization
### Reserved Instances
```
Savings: 30-72% vs On-Demand
Term: 1 or 3 years
Payment: All/Partial/No upfront
Flexibility: Standard or Convertible
```
### Savings Plans
```
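Compute Savings Plans: Up to 66% savings; applies to EC2, Fargate, Lambda
EC2 Instance Savings Plans: Up to 72% savings; applies to EC2 only
Compute plans flex across: Instance families, regions, OS
EC2 Instance plans: Committed to one instance family in one region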
```
### Spot Instances
```
Savings: Up to 90% vs On-Demand
Best for: Batch jobs, CI/CD, stateless workloads
Risk: 2-minute interruption notice
Strategy: Mix with On-Demand for resilience
```
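The "mix with On-Demand" strategy can be sketched with an Auto Scaling group that uses a mixed instances policy. This is a minimal sketch under stated assumptions rather than a production configuration: the launch template `aws_launch_template.app`, the variable `private_subnet_ids`, the instance types, and all capacity numbers are hypothetical.

```hcl
# Assumed to exist elsewhere: aws_launch_template.app, var.private_subnet_ids
resource "aws_autoscaling_group" "mixed" {
  name                = "mixed-spot-asg"
  min_size            = 2
  max_size            = 10
  desired_capacity    = 4
  vpc_zone_identifier = var.private_subnet_ids

  mixed_instances_policy {
    instances_distribution {
      # Keep a fixed On-Demand floor, then fill mostly with Spot.
      on_demand_base_capacity                  = 2
      on_demand_percentage_above_base_capacity = 25
      spot_allocation_strategy                 = "price-capacity-optimized"
    }

    launch_template {
      launch_template_specification {
        launch_template_id = aws_launch_template.app.id
        version            = "$Latest"
      }

      # Several instance types give Spot more capacity pools to draw from.
      override {
        instance_type = "m5.large"
      }

      override {
        instance_type = "m5a.large"
      }
    }
  }
}
```

Keeping a small On-Demand base means a Spot interruption reduces capacity instead of removing it entirely.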
### S3 Cost Optimization
```hcl
resource "aws_s3_bucket_lifecycle_configuration" "example" {
  bucket = aws_s3_bucket.example.id

  rule {
    id     = "transition-to-ia"
    status = "Enabled"

    # An empty filter applies the rule to every object in the bucket.
    filter {}

    transition {
      days          = 30
      storage_class = "STANDARD_IA"
    }

    transition {
      days          = 90
      storage_class = "GLACIER"
    }

    expiration {
      days = 365
    }
  }
}
```
## Azure Cost Optimization
### Reserved VM Instances
- 1 or 3 year terms
- Up to 72% savings
- Flexible sizing
- Exchangeable
### Azure Hybrid Benefit
- Use existing Windows Server licenses
- Up to 80% savings with RI
- Available for Windows and SQL Server
### Azure Advisor Recommendations
- Right-size VMs
- Delete unused resources
- Use reserved capacity
- Optimize storage
## GCP Cost Optimization
### Committed Use Discounts
- 1 or 3 year commitment
- Up to 57% savings
- Applies to vCPUs and memory
- Resource-based or spend-based
### Sustained Use Discounts
- Automatic discounts
- Up to 30% for running instances
- No commitment required
- Applies to Compute Engine, GKE
### Preemptible VMs
- Up to 80% savings
- 24-hour maximum runtime (Spot VMs, the newer offering, have no runtime limit)
- Best for batch workloads
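A batch worker on discounted, interruptible capacity might be declared as follows. This is a minimal sketch, assuming the `default` network exists; the instance name, machine type, zone, and image are illustrative. It uses the Spot provisioning model; omitting `provisioning_model` yields a classic preemptible VM with the 24-hour limit.

```hcl
resource "google_compute_instance" "batch_worker" {
  name         = "batch-worker" # hypothetical name
  machine_type = "e2-standard-4"
  zone         = "us-central1-a"

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    network = "default"
  }

  scheduling {
    # Interruptible capacity: GCP can reclaim the VM at any time.
    preemptible        = true
    automatic_restart  = false
    provisioning_model = "SPOT"
  }
}
```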
## Tagging Strategy
### AWS Tagging
```hcl
locals {
  common_tags = {
    Environment = "production"
    Project     = "my-project"
    CostCenter  = "engineering"
    Owner       = "team@example.com"
    ManagedBy   = "terraform"
  }
}

resource "aws_instance" "example" {
  ami           = "ami-12345678"
  instance_type = "t3.medium"

  tags = merge(
    local.common_tags,
    {
      Name = "web-server"
    }
  )
}
```
**Reference:** See `references/tagging-standards.md`
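As an alternative to merging tags on every resource, the AWS provider can apply a common tag set to everything it manages through `default_tags`. A minimal sketch, assuming the same illustrative tag values and an arbitrary region:

```hcl
provider "aws" {
  region = "us-east-1" # assumed region

  # Applied to every taggable resource created through this provider.
  default_tags {
    tags = {
      Environment = "production"
      Project     = "my-project"
      CostCenter  = "engineering"
      Owner       = "team@example.com"
      ManagedBy   = "terraform"
    }
  }
}

resource "aws_instance" "example" {
  ami           = "ami-12345678"
  instance_type = "t3.medium"

  # Only resource-specific tags are needed here.
  tags = {
    Name = "web-server"
  }
}
```

Resource-level tags with the same key override the provider defaults, so per-resource exceptions remain possible.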
## Cost Monitoring
### Budget Alerts
```hcl
# AWS Budget
resource "aws_budgets_budget" "monthly" {
  name              = "monthly-budget"
  budget_type       = "COST"
  limit_amount      = "1000"
  limit_unit        = "USD"
  time_period_start = "2024-01-01_00:00"
  time_unit         = "MONTHLY"

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["team@example.com"]
  }
}
```
### Cost Anomaly Detection
- AWS Cost Anomaly Detection
- Azure Cost Management alerts
- GCP Budget alerts
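For the AWS option, anomaly detection can itself be managed in Terraform. The following is a minimal sketch, assuming a per-service monitor, a daily email digest, and a 100 USD impact threshold; the resource names, email address, and threshold are illustrative.

```hcl
resource "aws_ce_anomaly_monitor" "services" {
  name              = "service-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "daily" {
  name             = "daily-anomaly-digest"
  frequency        = "DAILY"
  monitor_arn_list = [aws_ce_anomaly_monitor.services.arn]

  subscriber {
    type    = "EMAIL"
    address = "team@example.com"
  }

  # Notify only when the total anomaly impact is at least 100 USD.
  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      match_options = ["GREATER_THAN_OR_EQUAL"]
      values        = ["100"]
    }
  }
}
```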
## Architecture Patterns
### Pattern 1: Serverless First
- Use Lambda/Functions for event-driven
- Pay only for execution time
- Auto-scaling included
- No idle costs
### Pattern 2: Right-Sized Databases
```
Development: db.t3.small RDS
Staging: db.t3.large RDS
Production: db.r6g.2xlarge RDS with read replicas
```
### Pattern 3: Multi-Tier Storage
```
Hot data: S3 Standard
Warm data: S3 Standard-IA (30 days)
Cold data: S3 Glacier (90 days)
Archive: S3 Glacier Deep Archive (365 days)
```
### Pattern 4: Auto-Scaling
```hcl
resource "aws_autoscaling_policy" "scale_up" {
  name                   = "scale-up"
  scaling_adjustment     = 2
  adjustment_type        = "ChangeInCapacity"
  cooldown               = 300
  autoscaling_group_name = aws_autoscaling_group.main.name
}

resource "aws_cloudwatch_metric_alarm" "cpu_high" {
  alarm_name          = "cpu-high"
  comparison_operator = "GreaterThanThreshold"
  evaluation_periods  = 2
  metric_name         = "CPUUtilization"
  namespace           = "AWS/EC2"
  period              = 60
  statistic           = "Average"
  threshold           = 80
  alarm_actions       = [aws_autoscaling_policy.scale_up.arn]

  # Scope the metric to the Auto Scaling group; without a dimension the
  # alarm does not track this group's instances.
  dimensions = {
    AutoScalingGroupName = aws_autoscaling_group.main.name
  }
}
```
## Cost Optimization Checklist
- [ ] Implement cost allocation tags
- [ ] Delete unused resources (EBS, EIPs, snapshots)
- [ ] Right-size instances based on utilization
- [ ] Use reserved capacity for steady workloads
- [ ] Implement auto-scaling
- [ ] Optimize storage classes
- [ ] Use lifecycle policies
- [ ] Enable cost anomaly detection
- [ ] Set budget alerts
- [ ] Review costs weekly
- [ ] Use spot/preemptible instances
- [ ] Optimize data transfer costs
- [ ] Implement caching layers
- [ ] Use managed services
- [ ] Monitor and optimize continuously
## Tools
- **AWS:** Cost Explorer, Cost Anomaly Detection, Compute Optimizer
- **Azure:** Cost Management, Advisor
- **GCP:** Cost Management, Recommender
- **Multi-cloud:** CloudHealth, Cloudability, Kubecost
## Reference Files
- `references/tagging-standards.md` - Tagging conventions
- `assets/cost-analysis-template.xlsx` - Cost analysis spreadsheet
## Related Skills
- `terraform-module-library` - For resource provisioning
- `multi-cloud-architecture` - For cloud selection


@@ -0,0 +1,226 @@
---
name: hybrid-cloud-networking
description: Configure secure, high-performance connectivity between on-premises infrastructure and cloud platforms using VPN and dedicated connections. Use when building hybrid cloud architectures, connecting data centers to cloud, or implementing secure cross-premises networking.
---
# Hybrid Cloud Networking
Configure secure, high-performance connectivity between on-premises and cloud environments using VPN, Direct Connect, and ExpressRoute.
## Purpose
Establish secure, reliable network connectivity between on-premises data centers and cloud providers (AWS, Azure, GCP).
## When to Use
- Connect on-premises to cloud
- Extend datacenter to cloud
- Implement hybrid active-active setups
- Meet compliance requirements
- Migrate to cloud gradually
## Connection Options
### AWS Connectivity
#### 1. Site-to-Site VPN
- IPSec VPN over internet
- Up to 1.25 Gbps per tunnel
- Cost-effective for moderate bandwidth
- Higher latency, internet-dependent
```hcl
resource "aws_vpn_gateway" "main" {
  vpc_id = aws_vpc.main.id

  tags = {
    Name = "main-vpn-gateway"
  }
}

resource "aws_customer_gateway" "main" {
  bgp_asn    = 65000
  ip_address = "203.0.113.1"
  type       = "ipsec.1"
}

resource "aws_vpn_connection" "main" {
  vpn_gateway_id      = aws_vpn_gateway.main.id
  customer_gateway_id = aws_customer_gateway.main.id
  type                = "ipsec.1"
  static_routes_only  = false
}
```
#### 2. AWS Direct Connect
- Dedicated network connection
- 1 Gbps to 100 Gbps
- Lower latency, consistent bandwidth
- More expensive, setup time required
**Reference:** See `references/direct-connect.md`
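On the Terraform side, the cloud end of a Direct Connect setup is often a Direct Connect gateway associated with a virtual private gateway. A minimal sketch, assuming the `aws_vpn_gateway.main` resource from the VPN example; the gateway name and ASN are illustrative, and the physical connection and virtual interfaces are provisioned separately.

```hcl
resource "aws_dx_gateway" "main" {
  name            = "main-dx-gateway" # hypothetical name
  amazon_side_asn = "64512"
}

# Attach the Direct Connect gateway to the VPC's virtual private gateway.
resource "aws_dx_gateway_association" "vgw" {
  dx_gateway_id         = aws_dx_gateway.main.id
  associated_gateway_id = aws_vpn_gateway.main.id
}
```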
### Azure Connectivity
#### 1. Site-to-Site VPN
```hcl
resource "azurerm_virtual_network_gateway" "vpn" {
  name                = "vpn-gateway"
  location            = azurerm_resource_group.main.location
  resource_group_name = azurerm_resource_group.main.name
  type                = "Vpn"
  vpn_type            = "RouteBased"
  sku                 = "VpnGw1"

  ip_configuration {
    name                          = "vnetGatewayConfig"
    public_ip_address_id          = azurerm_public_ip.vpn.id
    private_ip_address_allocation = "Dynamic"
    subnet_id                     = azurerm_subnet.gateway.id
  }
}
```
#### 2. Azure ExpressRoute
- Private connection via connectivity provider
- Up to 100 Gbps
- Low latency, high reliability
- Premium for global connectivity
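An ExpressRoute circuit could be declared roughly as follows. This is a sketch under assumptions: the provider name, peering location, and bandwidth are placeholders that depend on the connectivity provider actually used, and the resource group is assumed to be defined elsewhere.

```hcl
resource "azurerm_express_route_circuit" "main" {
  name                  = "main-expressroute" # hypothetical name
  resource_group_name   = azurerm_resource_group.main.name
  location              = azurerm_resource_group.main.location
  service_provider_name = "Equinix"        # placeholder provider
  peering_location      = "Silicon Valley" # placeholder location
  bandwidth_in_mbps     = 1000

  sku {
    # Use tier = "Premium" when global connectivity is required.
    tier   = "Standard"
    family = "MeteredData"
  }
}
```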
### GCP Connectivity
#### 1. Cloud VPN
- IPSec VPN (Classic or HA VPN)
- HA VPN: 99.99% SLA
- Up to 3 Gbps per tunnel
#### 2. Cloud Interconnect
- Dedicated (10 Gbps, 100 Gbps)
- Partner (50 Mbps to 50 Gbps)
- Lower latency than VPN
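The GCP side of an HA VPN typically pairs an HA VPN gateway with a Cloud Router for BGP. A minimal sketch, assuming a network resource named `google_compute_network.main`; the names, region, and private ASN are illustrative, and the tunnels and BGP peers are omitted.

```hcl
resource "google_compute_ha_vpn_gateway" "main" {
  name    = "main-ha-vpn"
  region  = "us-central1"
  network = google_compute_network.main.id
}

# Cloud Router exchanges routes with the on-premises router over BGP.
resource "google_compute_router" "main" {
  name    = "main-router"
  region  = "us-central1"
  network = google_compute_network.main.id

  bgp {
    asn = 64514 # private ASN chosen for illustration
  }
}
```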
## Hybrid Network Patterns
### Pattern 1: Hub-and-Spoke
```
On-Premises Datacenter
        ↓
VPN/Direct Connect
        ↓
Transit Gateway (AWS) / vWAN (Azure)
        ├─ Production VPC/VNet
        ├─ Staging VPC/VNet
        └─ Development VPC/VNet
```
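On AWS, the hub in this pattern can be sketched with a Transit Gateway, one VPC attachment per spoke, and a VPN connection that terminates on the hub. The VPC, subnet, and customer gateway references below are assumptions about resources defined elsewhere.

```hcl
resource "aws_ec2_transit_gateway" "hub" {
  description     = "Hub for hybrid connectivity"
  amazon_side_asn = 64512
}

# One attachment per spoke VPC (only production is shown).
resource "aws_ec2_transit_gateway_vpc_attachment" "production" {
  transit_gateway_id = aws_ec2_transit_gateway.hub.id
  vpc_id             = aws_vpc.production.id
  subnet_ids         = aws_subnet.production_private[*].id
}

# The on-premises VPN terminates on the hub instead of on a single VPC.
resource "aws_vpn_connection" "on_prem" {
  transit_gateway_id  = aws_ec2_transit_gateway.hub.id
  customer_gateway_id = aws_customer_gateway.main.id
  type                = "ipsec.1"
}
```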
### Pattern 2: Multi-Region Hybrid
```
On-Premises
├─ Direct Connect → us-east-1
└─ Direct Connect → us-west-2
Cross-Region Peering
```
### Pattern 3: Multi-Cloud Hybrid
```
On-Premises Datacenter
├─ Direct Connect → AWS
├─ ExpressRoute → Azure
└─ Interconnect → GCP
```
## Routing Configuration
### BGP Configuration
```
On-Premises Router:
- AS Number: 65000
- Advertise: 10.0.0.0/8
Cloud Router:
- AS Number: 64512 (AWS), 65515 (Azure)
- Advertise: Cloud VPC/VNet CIDRs
```
### Route Propagation
- Enable route propagation on route tables
- Use BGP for dynamic routing
- Implement route filtering
- Monitor route advertisements
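On AWS, enabling route propagation means linking the virtual private gateway to each route table that should learn BGP routes. A minimal sketch, assuming a route table named `aws_route_table.private` exists:

```hcl
# Routes learned over BGP from on-premises appear in this route table.
resource "aws_vpn_gateway_route_propagation" "private" {
  vpn_gateway_id = aws_vpn_gateway.main.id
  route_table_id = aws_route_table.private.id
}
```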
## Security Best Practices
1. **Use private connectivity** (Direct Connect/ExpressRoute)
2. **Implement encryption** for VPN tunnels
3. **Use VPC endpoints** to avoid internet routing
4. **Configure network ACLs** and security groups
5. **Enable VPC Flow Logs** for monitoring
6. **Implement DDoS protection**
7. **Use PrivateLink/Private Endpoints**
8. **Monitor connections** with CloudWatch/Azure Monitor
9. **Implement redundancy** (dual tunnels)
10. **Regular security audits**
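Two of these practices, VPC endpoints and VPC Flow Logs, can be sketched in Terraform as follows. The region in the service name, the route table, and the log bucket `aws_s3_bucket.flow_logs` are assumptions for illustration.

```hcl
# Gateway endpoint keeps S3 traffic on the AWS network instead of the internet.
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.us-east-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]
}

# Capture accepted and rejected traffic for the whole VPC.
resource "aws_flow_log" "vpc" {
  vpc_id               = aws_vpc.main.id
  traffic_type         = "ALL"
  log_destination_type = "s3"
  log_destination      = aws_s3_bucket.flow_logs.arn
}
```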
## High Availability
### Dual VPN Tunnels
```hcl
resource "aws_vpn_connection" "primary" {
  vpn_gateway_id      = aws_vpn_gateway.main.id
  customer_gateway_id = aws_customer_gateway.primary.id
  type                = "ipsec.1"
}

resource "aws_vpn_connection" "secondary" {
  vpn_gateway_id      = aws_vpn_gateway.main.id
  customer_gateway_id = aws_customer_gateway.secondary.id
  type                = "ipsec.1"
}
```
### Active-Active Configuration
- Multiple connections from different locations
- BGP for automatic failover
- Equal-cost multi-path (ECMP) routing
- Monitor health of all connections
## Monitoring and Troubleshooting
### Key Metrics
- Tunnel status (up/down)
- Bytes in/out
- Packet loss
- Latency
- BGP session status
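Tunnel status can be alarmed on directly. The sketch below assumes the `aws_vpn_connection.main` resource from the Site-to-Site VPN example and a hypothetical SNS topic `aws_sns_topic.network_alerts`; it relies on the `TunnelState` metric in the `AWS/VPN` namespace, where 1 means up and 0 means down.

```hcl
resource "aws_cloudwatch_metric_alarm" "vpn_tunnel_down" {
  alarm_name          = "vpn-tunnel-down"
  namespace           = "AWS/VPN"
  metric_name         = "TunnelState"
  statistic           = "Minimum"
  period              = 300
  evaluation_periods  = 2
  comparison_operator = "LessThanThreshold"
  threshold           = 1

  # Treat missing datapoints as a failure rather than as healthy.
  treat_missing_data = "breaching"

  dimensions = {
    VpnId = aws_vpn_connection.main.id
  }

  alarm_actions = [aws_sns_topic.network_alerts.arn]
}
```

Using the minimum statistic makes the alarm fire when any tunnel of the connection drops, not only when both do.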
### Troubleshooting
```bash
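# AWS VPN (tunnel telemetry is part of the describe-vpn-connections output)
aws ec2 describe-vpn-connections
aws ec2 describe-vpn-connections \
  --query 'VpnConnections[].VgwTelemetry'
# Azure VPN (names in angle brackets are placeholders)
az network vpn-connection show \
  --name <connection-name> --resource-group <resource-group>
az network vpn-connection show-device-config-script \
  --name <connection-name> --resource-group <resource-group> \
  --vendor <vendor> --device-family <device-family> \
  --firmware-version <firmware-version>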
```
## Cost Optimization
1. **Right-size connections** based on traffic
2. **Use VPN for low-bandwidth** workloads
3. **Consolidate traffic** through fewer connections
4. **Minimize data transfer** costs
5. **Use Direct Connect** for high bandwidth
6. **Implement caching** to reduce traffic
## Reference Files
- `references/vpn-setup.md` - VPN configuration guide
- `references/direct-connect.md` - Direct Connect setup
## Related Skills
- `multi-cloud-architecture` - For architecture decisions
- `terraform-module-library` - For IaC implementation


@@ -0,0 +1,177 @@
---
name: multi-cloud-architecture
description: Design multi-cloud architectures using a decision framework to select and integrate services across AWS, Azure, and GCP. Use when building multi-cloud systems, avoiding vendor lock-in, or leveraging best-of-breed services from multiple providers.
---
# Multi-Cloud Architecture
Decision framework and patterns for architecting applications across AWS, Azure, and GCP.
## Purpose
Design cloud-agnostic architectures and make informed decisions about service selection across cloud providers.
## When to Use
- Design multi-cloud strategies
- Migrate between cloud providers
- Select cloud services for specific workloads
- Implement cloud-agnostic architectures
- Optimize costs across providers
## Cloud Service Comparison
### Compute Services
| AWS | Azure | GCP | Use Case |
|-----|-------|-----|----------|
| EC2 | Virtual Machines | Compute Engine | IaaS VMs |
| ECS | Container Instances | Cloud Run | Containers |
| EKS | AKS | GKE | Kubernetes |
| Lambda | Functions | Cloud Functions | Serverless |
| Fargate | Container Apps | Cloud Run | Managed containers |
### Storage Services
| AWS | Azure | GCP | Use Case |
|-----|-------|-----|----------|
| S3 | Blob Storage | Cloud Storage | Object storage |
| EBS | Managed Disks | Persistent Disk | Block storage |
| EFS | Azure Files | Filestore | File storage |
| Glacier | Archive Storage | Archive Storage | Cold storage |
### Database Services
| AWS | Azure | GCP | Use Case |
|-----|-------|-----|----------|
| RDS | SQL Database | Cloud SQL | Managed SQL |
| DynamoDB | Cosmos DB | Firestore | NoSQL |
| Aurora | Azure Database for PostgreSQL/MySQL | Cloud Spanner | Distributed SQL |
| ElastiCache | Cache for Redis | Memorystore | Caching |
**Reference:** See `references/service-comparison.md` for complete comparison
## Multi-Cloud Patterns
### Pattern 1: Single Provider with DR
- Primary workload in one cloud
- Disaster recovery in another
- Database replication across clouds
- Automated failover
### Pattern 2: Best-of-Breed
- Use best service from each provider
- AI/ML on GCP
- Enterprise apps on Azure
- General compute on AWS
### Pattern 3: Geographic Distribution
- Serve users from nearest cloud region
- Data sovereignty compliance
- Global load balancing
- Regional failover
### Pattern 4: Cloud-Agnostic Abstraction
- Kubernetes for compute
- PostgreSQL for database
- S3-compatible storage (MinIO)
- Open source tools
## Cloud-Agnostic Architecture
### Use Portable Alternatives
- **Compute:** Kubernetes (EKS/AKS/GKE)
- **Database:** PostgreSQL/MySQL (RDS/SQL Database/Cloud SQL)
- **Message Queue:** Apache Kafka (MSK/Event Hubs/Confluent)
- **Cache:** Redis (ElastiCache/Azure Cache/Memorystore)
- **Object Storage:** S3-compatible API
- **Monitoring:** Prometheus/Grafana
- **Service Mesh:** Istio/Linkerd
### Abstraction Layers
```
Application Layer
        ↓
Infrastructure Abstraction (Terraform)
        ↓
Cloud Provider APIs
        ↓
AWS / Azure / GCP
```
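The Terraform layer in this stack could start from a root configuration that declares all three providers. A minimal sketch; the version constraints, region, and project ID are assumptions, and credentials are expected to come from the environment.

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.0"
    }
    google = {
      source  = "hashicorp/google"
      version = "~> 5.0"
    }
  }
}

provider "aws" {
  region = "us-east-1"
}

provider "azurerm" {
  # The azurerm provider requires a features block, even when empty.
  features {}
}

provider "google" {
  project = "my-project-id" # hypothetical project
  region  = "us-central1"
}
```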
## Cost Comparison
### Compute Pricing Factors
- **AWS:** On-demand, Reserved, Spot, Savings Plans
- **Azure:** Pay-as-you-go, Reserved, Spot
- **GCP:** On-demand, Committed use, Preemptible
### Cost Optimization Strategies
1. Use reserved/committed capacity (30-70% savings)
2. Leverage spot/preemptible instances
3. Right-size resources
4. Use serverless for variable workloads
5. Optimize data transfer costs
6. Implement lifecycle policies
7. Use cost allocation tags
8. Monitor with cloud cost tools
**Reference:** See `references/multi-cloud-patterns.md`
## Migration Strategy
### Phase 1: Assessment
- Inventory current infrastructure
- Identify dependencies
- Assess cloud compatibility
- Estimate costs
### Phase 2: Pilot
- Select pilot workload
- Implement in target cloud
- Test thoroughly
- Document learnings
### Phase 3: Migration
- Migrate workloads incrementally
- Maintain dual-run period
- Monitor performance
- Validate functionality
### Phase 4: Optimization
- Right-size resources
- Implement cloud-native services
- Optimize costs
- Enhance security
## Best Practices
1. **Use infrastructure as code** (Terraform/OpenTofu)
2. **Implement CI/CD pipelines** for deployments
3. **Design for failure** across clouds
4. **Use managed services** when possible
5. **Implement comprehensive monitoring**
6. **Automate cost optimization**
7. **Follow security best practices**
8. **Document cloud-specific configurations**
9. **Test disaster recovery** procedures
10. **Train teams** on multiple clouds
## Reference Files
- `references/service-comparison.md` - Complete service comparison
- `references/multi-cloud-patterns.md` - Architecture patterns
## Related Skills
- `terraform-module-library` - For IaC implementation
- `cost-optimization` - For cost management
- `hybrid-cloud-networking` - For connectivity


@@ -0,0 +1,249 @@
---
name: terraform-module-library
description: Build reusable Terraform modules for AWS, Azure, and GCP infrastructure following infrastructure-as-code best practices. Use when creating infrastructure modules, standardizing cloud provisioning, or implementing reusable IaC components.
---
# Terraform Module Library
Production-ready Terraform module patterns for AWS, Azure, and GCP infrastructure.
## Purpose
Create reusable, well-tested Terraform modules for common cloud infrastructure patterns across multiple cloud providers.
## When to Use
- Build reusable infrastructure components
- Standardize cloud resource provisioning
- Implement infrastructure as code best practices
- Create multi-cloud compatible modules
- Establish organizational Terraform standards
## Module Structure
```
terraform-modules/
├── aws/
│   ├── vpc/
│   ├── eks/
│   ├── rds/
│   └── s3/
├── azure/
│   ├── vnet/
│   ├── aks/
│   └── storage/
└── gcp/
    ├── vpc/
    ├── gke/
    └── cloud-sql/
```
## Standard Module Pattern
```
module-name/
├── main.tf          # Main resources
├── variables.tf     # Input variables
├── outputs.tf       # Output values
├── versions.tf      # Provider versions
├── README.md        # Documentation
├── examples/        # Usage examples
│   └── complete/
│       ├── main.tf
│       └── variables.tf
└── tests/           # Terratest files
    └── module_test.go
```
## AWS VPC Module Example
**main.tf:**
```hcl
resource "aws_vpc" "main" {
  cidr_block           = var.cidr_block
  enable_dns_hostnames = var.enable_dns_hostnames
  enable_dns_support   = var.enable_dns_support

  tags = merge(
    {
      Name = var.name
    },
    var.tags
  )
}

resource "aws_subnet" "private" {
  count             = length(var.private_subnet_cidrs)
  vpc_id            = aws_vpc.main.id
  cidr_block        = var.private_subnet_cidrs[count.index]
  availability_zone = var.availability_zones[count.index]

  tags = merge(
    {
      Name = "${var.name}-private-${count.index + 1}"
      Tier = "private"
    },
    var.tags
  )
}

resource "aws_internet_gateway" "main" {
  count  = var.create_internet_gateway ? 1 : 0
  vpc_id = aws_vpc.main.id

  tags = merge(
    {
      Name = "${var.name}-igw"
    },
    var.tags
  )
}
```
**variables.tf:**
```hcl
variable "name" {
  description = "Name of the VPC"
  type        = string
}

variable "cidr_block" {
  description = "CIDR block for VPC"
  type        = string

  validation {
    condition     = can(regex("^([0-9]{1,3}\\.){3}[0-9]{1,3}/[0-9]{1,2}$", var.cidr_block))
    error_message = "CIDR block must be valid IPv4 CIDR notation."
  }
}

variable "availability_zones" {
  description = "List of availability zones"
  type        = list(string)
}

variable "private_subnet_cidrs" {
  description = "CIDR blocks for private subnets"
  type        = list(string)
  default     = []
}

variable "enable_dns_hostnames" {
  description = "Enable DNS hostnames in VPC"
  type        = bool
  default     = true
}

# Referenced by main.tf, so it must be declared here.
variable "enable_dns_support" {
  description = "Enable DNS support in VPC"
  type        = bool
  default     = true
}

# Referenced by main.tf, so it must be declared here.
variable "create_internet_gateway" {
  description = "Whether to create an Internet Gateway"
  type        = bool
  default     = false
}

variable "tags" {
  description = "Additional tags"
  type        = map(string)
  default     = {}
}
```
**outputs.tf:**
```hcl
output "vpc_id" {
  description = "ID of the VPC"
  value       = aws_vpc.main.id
}

output "private_subnet_ids" {
  description = "IDs of private subnets"
  value       = aws_subnet.private[*].id
}

output "vpc_cidr_block" {
  description = "CIDR block of VPC"
  value       = aws_vpc.main.cidr_block
}
```
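The module layout also lists a `versions.tf`, which is not shown above. A minimal sketch of what it might contain, assuming Terraform 1.5 or later and the AWS provider 5.x line; both constraints are illustrative choices.

```hcl
terraform {
  # Minimum Terraform version the module is tested against (assumed).
  required_version = ">= 1.5.0"

  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}
```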
## Best Practices
1. **Use semantic versioning** for modules
2. **Document all variables** with descriptions
3. **Provide examples** in examples/ directory
4. **Use validation blocks** for input validation
5. **Output important attributes** for module composition
6. **Pin provider versions** in versions.tf
7. **Use locals** for computed values
8. **Implement conditional resources** with count/for_each
9. **Test modules** with Terratest
10. **Tag all resources** consistently
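Practices 7 and 8 can be combined: a local builds a map, and `for_each` creates one resource per entry. The sketch below is an alternative to the `count`-based subnets in the VPC module, not an addition to it. It assumes the module's `private_subnet_cidrs` and `availability_zones` variables have the same length and that each availability zone appears only once.

```hcl
locals {
  # Map each availability zone to its subnet CIDR.
  private_subnets = {
    for idx, cidr in var.private_subnet_cidrs :
    var.availability_zones[idx] => cidr
  }
}

resource "aws_subnet" "private" {
  for_each = local.private_subnets

  vpc_id            = aws_vpc.main.id
  availability_zone = each.key
  cidr_block        = each.value

  tags = merge(
    {
      Name = "${var.name}-private-${each.key}"
      Tier = "private"
    },
    var.tags
  )
}
```

Because instances are keyed by availability zone instead of list position, removing one subnet does not force Terraform to recreate the others.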
## Module Composition
```hcl
module "vpc" {
  source = "../../modules/aws/vpc"

  name               = "production"
  cidr_block         = "10.0.0.0/16"
  availability_zones = ["us-west-2a", "us-west-2b", "us-west-2c"]

  private_subnet_cidrs = [
    "10.0.1.0/24",
    "10.0.2.0/24",
    "10.0.3.0/24"
  ]

  tags = {
    Environment = "production"
    ManagedBy   = "terraform"
  }
}

module "rds" {
  source = "../../modules/aws/rds"

  identifier     = "production-db"
  engine         = "postgres"
  engine_version = "15.3"
  instance_class = "db.t3.large"
  vpc_id         = module.vpc.vpc_id
  subnet_ids     = module.vpc.private_subnet_ids

  tags = {
    Environment = "production"
  }
}
```
## Reference Files
- `assets/vpc-module/` - Complete VPC module example
- `assets/rds-module/` - RDS module example
- `references/aws-modules.md` - AWS module patterns
- `references/azure-modules.md` - Azure module patterns
- `references/gcp-modules.md` - GCP module patterns
## Testing
```go
// tests/vpc_test.go
package test

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)

func TestVPCModule(t *testing.T) {
	terraformOptions := &terraform.Options{
		// Path to the example configuration that exercises the module.
		TerraformDir: "../examples/complete",
	}

	// Destroy the resources when the test ends, even if it fails.
	defer terraform.Destroy(t, terraformOptions)

	terraform.InitAndApply(t, terraformOptions)

	// The example must expose a vpc_id output for this lookup to succeed.
	vpcID := terraform.Output(t, terraformOptions, "vpc_id")
	assert.NotEmpty(t, vpcID)
}
```
## Related Skills
- `multi-cloud-architecture` - For architectural decisions
- `cost-optimization` - For cost-effective designs


@@ -0,0 +1,63 @@
# AWS Terraform Module Patterns
## VPC Module
- VPC with public/private subnets
- Internet Gateway and NAT Gateways
- Route tables and associations
- Network ACLs
- VPC Flow Logs
## EKS Module
- EKS cluster with managed node groups
- IRSA (IAM Roles for Service Accounts)
- Cluster autoscaler
- VPC CNI configuration
- Cluster logging
## RDS Module
- RDS instance or cluster
- Automated backups
- Read replicas
- Parameter groups
- Subnet groups
- Security groups
## S3 Module
- S3 bucket with versioning
- Encryption at rest
- Bucket policies
- Lifecycle rules
- Replication configuration
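The first two items, versioning and encryption at rest, might look like this inside such a module. This is a minimal sketch; the `bucket_name` variable is hypothetical, and SSE-KMS with the AWS managed key is an assumed default.

```hcl
variable "bucket_name" {
  description = "Name of the S3 bucket (hypothetical module input)"
  type        = string
}

resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name
}

resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id

  versioning_configuration {
    status = "Enabled"
  }
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id

  rule {
    # Without an explicit key ID, S3 uses the AWS managed KMS key.
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms"
    }
  }
}
```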
## ALB Module
- Application Load Balancer
- Target groups
- Listener rules
- SSL/TLS certificates
- Access logs
## Lambda Module
- Lambda function
- IAM execution role
- CloudWatch Logs
- Environment variables
- VPC configuration (optional)
## Security Group Module
- Reusable security group rules
- Ingress/egress rules
- Dynamic rule creation
- Rule descriptions
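Dynamic rule creation is commonly done with a `dynamic` block driven by a list variable. A minimal sketch; the variable names, the object shape, and the allow-all egress rule are assumptions for illustration.

```hcl
variable "name" {
  description = "Security group name"
  type        = string
}

variable "vpc_id" {
  description = "VPC in which to create the security group"
  type        = string
}

variable "ingress_rules" {
  description = "Ingress rules to create (hypothetical shape)"
  type = list(object({
    description = string
    from_port   = number
    to_port     = number
    protocol    = string
    cidr_blocks = list(string)
  }))
  default = []
}

resource "aws_security_group" "this" {
  name   = var.name
  vpc_id = var.vpc_id

  # One ingress block is generated per element of var.ingress_rules.
  dynamic "ingress" {
    for_each = var.ingress_rules

    content {
      description = ingress.value.description
      from_port   = ingress.value.from_port
      to_port     = ingress.value.to_port
      protocol    = ingress.value.protocol
      cidr_blocks = ingress.value.cidr_blocks
    }
  }

  egress {
    description = "Allow all outbound traffic"
    from_port   = 0
    to_port     = 0
    protocol    = "-1"
    cidr_blocks = ["0.0.0.0/0"]
  }
}
```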
## Best Practices
1. Use AWS provider version ~> 5.0
2. Enable encryption by default
3. Use least-privilege IAM
4. Tag all resources consistently
5. Enable logging and monitoring
6. Use KMS for encryption
7. Implement backup strategies
8. Use PrivateLink when possible
9. Enable GuardDuty/SecurityHub
10. Follow AWS Well-Architected Framework