Initial commit
11
.claude-plugin/plugin.json
Normal file
@@ -0,0 +1,11 @@
{
  "name": "aws-cost-optimization",
  "description": "AWS cost optimization and FinOps workflows with automated analysis scripts",
  "version": "1.0.0",
  "author": {
    "name": "DevOps Skills Team"
  },
  "skills": [
    "./"
  ]
}
3
README.md
Normal file
@@ -0,0 +1,3 @@
# aws-cost-optimization

AWS cost optimization and FinOps workflows with automated analysis scripts
608
SKILL.md
Normal file
@@ -0,0 +1,608 @@
---
name: aws-cost-finops
description: AWS cost optimization and FinOps workflows. Use for finding unused resources, analyzing Reserved Instance opportunities, detecting cost anomalies, rightsizing instances, evaluating Spot instances, migrating to newer generation instances, implementing FinOps best practices, optimizing storage/network/database costs, and managing cloud financial operations. Includes automated analysis scripts and comprehensive reference documentation.
---

# AWS Cost Optimization & FinOps

Systematic workflows for AWS cost optimization and financial operations management.

## When to Use This Skill

Use this skill when you need to:

- **Find cost savings**: Identify unused resources, rightsizing opportunities, or commitment discounts
- **Analyze spending**: Understand cost trends, detect anomalies, or break down costs
- **Optimize architecture**: Choose cost-effective services, storage tiers, or instance types
- **Implement FinOps**: Set up governance, tagging, budgets, or monthly reviews
- **Make purchase decisions**: Evaluate Reserved Instances, Savings Plans, or Spot instances
- **Troubleshoot costs**: Investigate unexpected bills or cost spikes
- **Plan budgets**: Forecast costs or evaluate the impact of new projects

## Cost Optimization Workflow

Follow this systematic approach to AWS cost optimization:
```
┌─────────────────────────────────────────────┐
│  1. DISCOVER                                │
│  What are we spending money on?             │
│  Run: find_unused_resources.py              │
│  Run: cost_anomaly_detector.py              │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│  2. ANALYZE                                 │
│  Where are the optimization opportunities?  │
│  Run: rightsizing_analyzer.py               │
│  Run: detect_old_generations.py             │
│  Run: spot_recommendations.py               │
│  Run: analyze_ri_recommendations.py         │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│  3. PRIORITIZE                              │
│  What should we optimize first?             │
│  - Quick wins (low risk, high savings)      │
│  - Low-hanging fruit (easy to implement)    │
│  - Strategic improvements                   │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│  4. IMPLEMENT                               │
│  Execute optimization actions               │
│  - Delete unused resources                  │
│  - Rightsize instances                      │
│  - Purchase commitments                     │
│  - Migrate to new generations               │
└─────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────┐
│  5. MONITOR                                 │
│  Verify savings and track metrics           │
│  - Monthly cost reviews                     │
│  - Tag compliance monitoring                │
│  - Budget variance tracking                 │
└─────────────────────────────────────────────┘
```
---

## Core Workflows

### Workflow 1: Monthly Cost Optimization Review

**Frequency**: Run monthly (first week of each month)

**Step 1: Find Unused Resources**
```bash
# Scan for waste across all resources
python3 scripts/find_unused_resources.py

# Expected output:
# - Unattached EBS volumes
# - Old snapshots
# - Unused Elastic IPs
# - Idle NAT Gateways
# - Idle EC2 instances
# - Unused load balancers
# - Estimated monthly savings
```
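As a sketch of the kind of check this scan performs for EBS waste (illustrative only; the field names mirror the EC2 `DescribeVolumes` response shape, and the flat gp3 price is an assumption, not the script's actual pricing logic):

```python
# Sketch: flag EBS volumes that are not attached to any instance.
# An EBS volume in state "available" is provisioned but unattached,
# so it accrues cost without serving traffic.
GP3_PRICE_PER_GB_MONTH = 0.08  # assumed us-east-1 gp3 rate, $/GB-month

def find_unattached_volumes(volumes):
    """Return (volume_id, estimated_monthly_cost) for unattached volumes."""
    findings = []
    for vol in volumes:
        if vol["State"] == "available":
            findings.append((vol["VolumeId"], vol["Size"] * GP3_PRICE_PER_GB_MONTH))
    return findings

volumes = [
    {"VolumeId": "vol-1", "State": "in-use", "Size": 100},
    {"VolumeId": "vol-2", "State": "available", "Size": 500},
]
print(find_unattached_volumes(volumes))  # [('vol-2', 40.0)]
```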
**Step 2: Analyze Cost Anomalies**
```bash
# Detect unusual spending patterns
python3 scripts/cost_anomaly_detector.py --days 30

# Expected output:
# - Cost spikes and anomalies
# - Top cost drivers
# - Period-over-period comparison
# - 30-day forecast
```
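The spike detection above can be sketched as a simple z-score test over daily spend (an illustration of the idea; the actual script may use a different statistical method):

```python
# Sketch: flag days whose spend deviates more than `threshold`
# standard deviations from the series mean.
from statistics import mean, stdev

def detect_anomalies(daily_costs, threshold=2.0):
    """Return indexes of days whose cost is an outlier vs the series mean."""
    mu, sigma = mean(daily_costs), stdev(daily_costs)
    if sigma == 0:
        return []
    return [i for i, c in enumerate(daily_costs) if abs(c - mu) / sigma > threshold]

costs = [100, 102, 98, 101, 99, 100, 250]  # day 6 spikes
print(detect_anomalies(costs))  # [6]
```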
**Step 3: Identify Rightsizing Opportunities**
```bash
# Find oversized instances
python3 scripts/rightsizing_analyzer.py --days 30

# Expected output:
# - EC2 instances with low utilization
# - RDS instances with low utilization
# - Recommended smaller instance types
# - Estimated savings
```
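Rightsizing logic of this kind can be sketched as a one-size-down heuristic (illustrative; the 40% CPU threshold and the simplified size ladder are assumptions, not the script's actual rules):

```python
# Sketch: recommend one size down when average CPU stays below a threshold.
SIZE_ORDER = ["large", "xlarge", "2xlarge", "4xlarge"]  # simplified ladder

def recommend_size(instance_type, avg_cpu_pct, threshold=40.0):
    """Return a smaller type in the same family if CPU is under threshold."""
    family, size = instance_type.split(".")
    idx = SIZE_ORDER.index(size)
    if avg_cpu_pct < threshold and idx > 0:
        return f"{family}.{SIZE_ORDER[idx - 1]}"
    return instance_type  # already right-sized, by this heuristic

print(recommend_size("m5.2xlarge", avg_cpu_pct=12.0))  # m5.xlarge
print(recommend_size("m5.large", avg_cpu_pct=75.0))    # m5.large
```

A real analyzer would also weigh memory, network, and peak (not just average) CPU before recommending a change.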
**Step 4: Generate Monthly Report**
```bash
# Use the template to compile findings
cp assets/templates/monthly_cost_report.md reports/$(date +%Y-%m)-cost-report.md

# Fill in:
# - Findings from scripts
# - Action items
# - Team cost breakdowns
# - Optimization wins
```

**Step 5: Team Review Meeting**
- Present findings to engineering teams
- Assign optimization tasks
- Track action items to completion

---
### Workflow 2: Commitment Purchase Analysis (RI/Savings Plans)

**When**: Quarterly, or whenever usage patterns stabilize

**Step 1: Analyze Current Usage**
```bash
# Identify workloads suitable for commitments
python3 scripts/analyze_ri_recommendations.py --days 60

# Looks for:
# - EC2 instances running consistently for 60+ days
# - RDS instances with stable usage
# - ROI comparison for 1-year vs 3-year commitments
```
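The ROI comparison reduces to break-even arithmetic, sketched below (the hourly rates are made-up example numbers, not real AWS prices):

```python
# Sketch: annual savings and break-even utilization for a 1-year RI.
def ri_breakeven(on_demand_hourly, ri_effective_hourly):
    """Return (savings_pct, breakeven_utilization_pct) for an always-on RI.

    An RI is billed for every hour of the term, so it beats on-demand only
    once the instance would have run at least `breakeven_pct` of the year.
    """
    savings_pct = (1 - ri_effective_hourly / on_demand_hourly) * 100
    breakeven_pct = ri_effective_hourly / on_demand_hourly * 100
    return round(savings_pct, 1), round(breakeven_pct, 1)

# Assumed example rates for an m5.xlarge-class instance:
print(ri_breakeven(on_demand_hourly=0.192, ri_effective_hourly=0.12))  # (37.5, 62.5)
```

Read: with these rates the RI saves 37.5% if the instance runs 24/7, and stops paying for itself below ~62.5% utilization.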
**Step 2: Review Recommendations**

Evaluate each recommendation:
```
✅ Good candidate if:
- Running 24/7 for 60+ days
- Workload is stable and predictable
- No plans to change architecture
- Savings > 30%

❌ Poor candidate if:
- Workload is variable or experimental
- Architecture changes planned
- Instance type may change
- Dev/test environment
```
**Step 3: Choose Commitment Type**

**Reserved Instances**:
- Standard RI: Highest discount (up to ~63%), no flexibility
- Convertible RI: Moderate discount (up to ~54%), can change instance type
- Best for: specific instance types, stable workloads

**Savings Plans**:
- Compute SP: Flexible across instance types and regions (up to ~66% savings)
- EC2 Instance SP: Flexible across sizes within one family and region (up to ~72% savings)
- Best for: variable workloads within those constraints

**Decision Matrix**:
```
Known instance type, won't change → Standard RI
May need to change types          → Convertible RI or Compute SP
Variable workloads                → Compute Savings Plan
Maximum flexibility               → Compute Savings Plan
```
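The matrix above can be expressed as a small helper (a direct transcription of the table; the inputs are judgment calls about the workload, not API data):

```python
# Sketch: the commitment decision matrix as a function.
def choose_commitment(known_instance_type, may_change_type, variable_workload):
    """Map workload traits to a commitment type, mirroring the matrix above."""
    if variable_workload:
        return "Compute Savings Plan"
    if may_change_type:
        return "Convertible RI or Compute Savings Plan"
    if known_instance_type:
        return "Standard RI"
    return "Compute Savings Plan"  # default to maximum flexibility

print(choose_commitment(known_instance_type=True,
                        may_change_type=False,
                        variable_workload=False))  # Standard RI
```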
**Step 4: Purchase and Track**
- Purchase through the AWS Console or CLI
- Tag commitments with purchase date and owner
- Monitor utilization monthly
- Aim for >90% utilization

**Reference**: See `references/best_practices.md` for detailed commitment strategies

---
### Workflow 3: Instance Generation Migration

**When**: During architecture reviews or optimization sprints

**Step 1: Detect Old Instances**
```bash
# Find outdated instance generations
python3 scripts/detect_old_generations.py

# Identifies:
# - t2 → t3 migrations (~10% savings)
# - m4 → m5 → m6i migrations
# - Intel → Graviton opportunities (~20% savings)
```
**Step 2: Prioritize Migrations**

**Quick Wins (Low Risk)**:
```
t2 → t3:   Drop-in replacement, ~10% savings
m4 → m5:   Better performance, ~5% savings
gp2 → gp3: No downtime, ~20% savings
```

**Medium Effort (Testing Required)**:
```
x86 → Graviton (ARM64): ~20% savings
- Requires ARM64 compatibility testing
- Most modern frameworks support ARM64
- Test in staging first
```
**Step 3: Execute the Migration**

**For EC2 (x86 to x86)**:
1. Stop the instance
2. Change the instance type
3. Start the instance
4. Verify the application

**For Graviton Migration**:
1. Create an ARM64 AMI or Docker image
2. Launch a new Graviton instance
3. Test thoroughly
4. Cut over traffic
5. Terminate the old instance

**Step 4: Validate Savings**
- Monitor new costs in Cost Explorer
- Verify performance is acceptable
- Document the migration for other teams

**Reference**: See `references/best_practices.md` → Compute Optimization

---
### Workflow 4: Spot Instance Evaluation

**When**: For fault-tolerant workloads or Auto Scaling Groups

**Step 1: Identify Candidates**
```bash
# Analyze workloads for Spot suitability
python3 scripts/spot_recommendations.py

# Evaluates:
# - Instances in Auto Scaling Groups (good candidates)
# - Dev/test/staging environments
# - Batch processing workloads
# - CI/CD and build servers
```
**Step 2: Assess Suitability**

**Excellent for Spot**:
- Stateless applications
- Batch jobs
- CI/CD pipelines
- Data processing
- Auto Scaling Groups

**NOT suitable for Spot**:
- Databases (without replicas)
- Stateful applications
- Real-time services
- Mission-critical workloads
**Step 3: Implementation Strategy**

**Option 1: Fargate Spot (Easiest)**
```yaml
# Capacity provider strategy for an ECS service; the task definition
# itself only needs FARGATE in requiresCompatibilities
requiresCompatibilities:
  - FARGATE
capacityProviderStrategy:
  - capacityProvider: FARGATE_SPOT
    weight: 70 # 70% Spot
  - capacityProvider: FARGATE
    weight: 30 # 30% On-Demand
```
**Option 2: EC2 Auto Scaling with Spot**
```yaml
# Mixed instances policy
MixedInstancesPolicy:
  InstancesDistribution:
    OnDemandBaseCapacity: 2
    OnDemandPercentageAboveBaseCapacity: 30
    SpotAllocationStrategy: capacity-optimized
  LaunchTemplate:
    Overrides:
      - InstanceType: m5.large
      - InstanceType: m5a.large
      - InstanceType: m5n.large
```
**Option 3: EC2 Spot Fleet**
```bash
# Create a Spot Fleet with diverse instance types
aws ec2 request-spot-fleet --spot-fleet-request-config file://spot-fleet.json
```
**Step 4: Implement Interruption Handling**
```
Handle the 2-minute termination notice.
Instance metadata: /latest/meta-data/spot/instance-action

In the application:
1. Poll for the termination notice
2. Gracefully shut down (save state)
3. Drain connections
4. Exit
```
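The polling step can be sketched in Python (a minimal illustration: a plain IMDSv1-style GET is shown for brevity, while IMDSv2 deployments also require a session token, and the network call only succeeds on an actual EC2 instance):

```python
# Sketch: poll instance metadata for the Spot interruption notice.
# A 200 response on spot/instance-action means termination is scheduled;
# a 404 means no notice has been issued yet.
import urllib.error
import urllib.request

METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def should_begin_shutdown(status_code):
    """Decide from the metadata status code whether to start draining."""
    return status_code == 200

def poll_once():
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=1) as resp:
            return should_begin_shutdown(resp.status)
    except urllib.error.HTTPError as err:
        return should_begin_shutdown(err.code)  # 404 -> keep running
    except urllib.error.URLError:
        return False  # metadata service unreachable (not on EC2)

# In the application loop: if poll_once() is True, save state, drain
# connections, then exit before the 2-minute window closes.
print(should_begin_shutdown(200), should_begin_shutdown(404))  # True False
```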
**Reference**: See `references/best_practices.md` → Compute Optimization → Spot Instances

---

## Quick Reference: Cost Optimization Scripts

### All Scripts Location
```bash
ls scripts/
# find_unused_resources.py
# analyze_ri_recommendations.py
# detect_old_generations.py
# spot_recommendations.py
# rightsizing_analyzer.py
# cost_anomaly_detector.py
```
### Script Usage Patterns

**Monthly Review (run all)**:
```bash
python3 scripts/find_unused_resources.py
python3 scripts/cost_anomaly_detector.py --days 30
python3 scripts/rightsizing_analyzer.py --days 30
```

**Quarterly Optimization**:
```bash
python3 scripts/analyze_ri_recommendations.py --days 60
python3 scripts/detect_old_generations.py
python3 scripts/spot_recommendations.py
```

**Specific Region Only**:
```bash
python3 scripts/find_unused_resources.py --region us-east-1
python3 scripts/rightsizing_analyzer.py --region us-west-2
```

**Named AWS Profile**:
```bash
python3 scripts/find_unused_resources.py --profile production
python3 scripts/cost_anomaly_detector.py --profile production --days 60
```

### Script Requirements
```bash
# Install dependencies
pip install boto3 tabulate

# AWS credentials required
# Configure via: aws configure
# Or use: --profile PROFILE_NAME
```
---

## Service-Specific Optimization

### Compute Optimization
**Key Actions**:
- Migrate to Graviton (~20% savings)
- Use Spot for fault-tolerant workloads (up to ~70% savings)
- Purchase RIs for stable workloads (~40-65% savings)
- Right-size oversized instances

**Reference**: `references/best_practices.md` → Compute Optimization

### Storage Optimization
**Key Actions**:
- Convert gp2 → gp3 (~20% savings)
- Implement S3 lifecycle policies (~50-95% savings on tiered data)
- Delete old snapshots
- Use S3 Intelligent-Tiering

**Reference**: `references/best_practices.md` → Storage Optimization
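A lifecycle configuration of the kind referenced above might look like the following (an illustrative rule set in the `put-bucket-lifecycle-configuration` JSON shape; the prefix and day counts are placeholders to tune for your data):

```json
{
  "Rules": [
    {
      "ID": "archive-then-expire-logs",
      "Filter": { "Prefix": "logs/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" }
      ],
      "Expiration": { "Days": 365 }
    }
  ]
}
```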
### Network Optimization
**Key Actions**:
- Replace NAT Gateways with VPC Endpoints where traffic allows (roughly $25-30/month saved per gateway, plus per-GB processing charges)
- Use CloudFront to reduce data transfer costs
- Colocate resources in the same AZ when possible

**Reference**: `references/best_practices.md` → Network Optimization
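The NAT-vs-endpoint trade-off comes down to simple arithmetic (the rates below are assumed us-east-1 figures; check your region's current pricing — gateway endpoints for S3/DynamoDB carry no hourly or per-GB charge):

```python
# Sketch: monthly NAT Gateway cost for traffic that an S3/DynamoDB
# gateway endpoint could carry for free.
NAT_HOURLY = 0.045       # $/hour per NAT Gateway (assumed rate)
NAT_PER_GB = 0.045       # $/GB processed (assumed rate)
HOURS_PER_MONTH = 730

def nat_monthly_cost(gb_processed):
    """Hourly charge plus data-processing charge for one NAT Gateway."""
    return NAT_HOURLY * HOURS_PER_MONTH + NAT_PER_GB * gb_processed

# 500 GB/month of S3 traffic through the NAT:
print(round(nat_monthly_cost(500), 2))  # 55.35
```

If that traffic moves to a gateway endpoint, the entire amount is avoided; interface endpoints, by contrast, do carry a small hourly charge of their own.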
### Database Optimization
**Key Actions**:
- Right-size RDS instances
- Use gp3 storage (~20% cheaper than gp2)
- Evaluate Aurora Serverless for variable workloads
- Purchase RDS Reserved Instances

**Reference**: `references/best_practices.md` → Database Optimization

---
## Service Alternatives Decision Guide

Need help choosing between services?

**Question**: "Should I use EC2, Lambda, or Fargate?"
**Answer**: See `references/service_alternatives.md` → Compute Alternatives

**Question**: "Which S3 storage class should I use?"
**Answer**: See `references/service_alternatives.md` → Storage Alternatives

**Question**: "Should I use RDS or Aurora?"
**Answer**: See `references/service_alternatives.md` → Database Alternatives

**Question**: "NAT Gateway vs VPC Endpoint vs NAT Instance?"
**Answer**: See `references/service_alternatives.md` → Networking Alternatives

---
## FinOps Governance & Process

### Setting Up FinOps

**Phase 1: Foundation (Month 1)**
- Enable Cost Explorer
- Set up AWS Budgets
- Define a tagging strategy
- Activate cost allocation tags

**Phase 2: Visibility (Months 2-3)**
- Implement tagging enforcement
- Run the optimization scripts
- Set up monthly reviews
- Create team cost reports

**Phase 3: Culture (Ongoing)**
- Cost metrics in engineering KPIs
- Cost review in architecture decisions
- Regular optimization sprints
- FinOps champions on each team

**Full Guide**: See `references/finops_governance.md`
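Tagging enforcement of the kind described in Phase 2 can be sketched as a compliance check (illustrative; the required-tag set matches the tags used elsewhere in this skill, but your organization's list may differ):

```python
# Sketch: flag resources missing required cost-allocation tags and
# compute an overall compliance rate.
REQUIRED_TAGS = {"Environment", "Owner", "Project", "CostCenter"}

def missing_tags(resource_tags):
    """Return the required tag keys a resource lacks, sorted for stability."""
    return sorted(REQUIRED_TAGS - set(resource_tags))

def compliance_rate(resources):
    """Fraction of resources carrying every required tag."""
    compliant = sum(1 for tags in resources if not missing_tags(tags))
    return compliant / len(resources)

fleet = [
    {"Environment": "prod", "Owner": "alice", "Project": "web", "CostCenter": "42"},
    {"Environment": "dev"},
]
print(missing_tags(fleet[1]))   # ['CostCenter', 'Owner', 'Project']
print(compliance_rate(fleet))   # 0.5
```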
### Monthly Review Process

**Week 1**: Data Collection
- Run all optimization scripts
- Export Cost & Usage Reports
- Compile findings

**Week 2**: Analysis
- Identify trends
- Find opportunities
- Prioritize actions

**Week 3**: Team Reviews
- Present to engineering teams
- Discuss optimizations
- Assign action items

**Week 4**: Executive Reporting
- Create executive summary
- Forecast next quarter
- Report optimization wins

**Template**: See `assets/templates/monthly_cost_report.md`

**Detailed Process**: See `references/finops_governance.md` → Monthly Review Process

---
## Cost Optimization Checklist

### Quick Wins (Do First)
- [ ] Delete unattached EBS volumes
- [ ] Delete old EBS snapshots (>90 days)
- [ ] Release unused Elastic IPs
- [ ] Convert gp2 → gp3 volumes
- [ ] Stop/terminate idle EC2 instances
- [ ] Enable S3 Intelligent-Tiering
- [ ] Set up AWS Budgets and alerts

### Medium Effort (This Quarter)
- [ ] Right-size oversized instances
- [ ] Migrate to newer instance generations
- [ ] Purchase Reserved Instances for stable workloads
- [ ] Implement S3 lifecycle policies
- [ ] Replace NAT Gateways with VPC Endpoints (where applicable)
- [ ] Enable automated resource scheduling (dev/test)
- [ ] Implement a tagging strategy and enforcement

### Strategic Initiatives (Ongoing)
- [ ] Migrate to Graviton instances
- [ ] Implement Spot for fault-tolerant workloads
- [ ] Establish a monthly cost review process
- [ ] Set up cost allocation by team
- [ ] Implement a chargeback/showback model
- [ ] Build a FinOps culture and practices

---
## Troubleshooting Cost Issues

### "My bill suddenly increased"

1. Run cost anomaly detection:
```bash
python3 scripts/cost_anomaly_detector.py --days 30
```
2. Check Cost Explorer for a service-level breakdown
3. Review CloudTrail for resource creation events
4. Check for Auto Scaling events
5. Check whether any Reserved Instances have expired
### "I need to reduce costs by X%"

Follow the optimization workflow:
1. Run all discovery scripts
2. Calculate total potential savings
3. Prioritize by: Savings Amount × (1 / Effort)
4. Focus on quick wins first
5. Implement strategic changes for long-term savings
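The "Savings Amount × (1 / Effort)" ranking from the steps above can be sketched as follows (effort on an assumed 1-5 scale; the opportunity names and figures are placeholders):

```python
# Sketch: rank opportunities by savings per unit of effort, highest first.
def prioritize(opportunities):
    return sorted(opportunities,
                  key=lambda o: o["savings"] / o["effort"],
                  reverse=True)

opps = [
    {"name": "rightsize-rds", "savings": 400, "effort": 3},   # score ~133
    {"name": "delete-volumes", "savings": 150, "effort": 1},  # score 150
    {"name": "graviton", "savings": 900, "effort": 5},        # score 180
]
for o in prioritize(opps):
    print(o["name"])
```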
### "How do I know if Reserved Instances make sense?"

Run the RI analysis:
```bash
python3 scripts/analyze_ri_recommendations.py --days 60
```

Look for:
- Instances running consistently for 60+ days
- Workloads that won't change
- Savings > 30%

### "Which resources can I safely delete?"

Run the unused resource finder:
```bash
python3 scripts/find_unused_resources.py
```

Usually safe to delete:
- Unattached EBS volumes (after verifying)
- Snapshots older than 90 days (if backups exist elsewhere)
- Unused Elastic IPs (after verifying they are not referenced in DNS)
- EC2 instances stopped for more than 30 days (after confirming they are abandoned)

Always verify with the resource owner before deleting anything!

---
## Best Practices Summary

1. **Tag Everything**: Consistent tagging enables cost allocation and accountability
2. **Monitor Continuously**: Weekly script runs catch waste early
3. **Review Monthly**: Regular reviews prevent cost drift
4. **Right-size Proactively**: Don't wait for cost issues to optimize
5. **Use Commitments Wisely**: RIs/SPs for stable workloads only
6. **Test Before Migrating**: Especially for Graviton or Spot
7. **Automate Cleanup**: Schedule shutdown of dev/test resources
8. **Share Wins**: Celebrate cost savings to build a FinOps culture

---

## Additional Resources

**Detailed References**:
- `references/best_practices.md`: Comprehensive optimization strategies
- `references/service_alternatives.md`: Cost-effective service selection
- `references/finops_governance.md`: Organizational FinOps practices

**Templates**:
- `assets/templates/monthly_cost_report.md`: Monthly reporting template

**Scripts**:
- All scripts live in the `scripts/` directory; run any of them with `--help` for usage

**AWS Documentation**:
- AWS Cost Explorer: https://aws.amazon.com/aws-cost-management/aws-cost-explorer/
- AWS Budgets: https://aws.amazon.com/aws-cost-management/aws-budgets/
- FinOps Foundation: https://www.finops.org
298
assets/templates/monthly_cost_report.md
Normal file
@@ -0,0 +1,298 @@
# AWS Cost Optimization Report - [Month Year]

**Report Date**: [Date]
**Reporting Period**: [Start Date] - [End Date]
**Prepared By**: [Your Name/Team]

---

## Executive Summary

| Metric | Value | vs Budget | vs Last Month |
|--------|-------|-----------|---------------|
| **Total AWS Spend** | $XX,XXX | ±X% | ±X% |
| **Largest Service** | Service Name ($X,XXX) | - | - |
| **Optimization Savings** | $X,XXX | - | - |
| **Projected Next Month** | $XX,XXX | - | - |

### Key Highlights
- ✅ [Positive highlight, e.g., "Reduced compute costs by 15%"]
- ⚠️ [Area of concern, e.g., "Storage costs increased 25% due to new backups"]
- 🎯 [Action taken, e.g., "Purchased Reserved Instances for $X,XXX annual savings"]

---
## Cost Breakdown by Service

| Service | Current Month | Last Month | Change | % of Total |
|---------|--------------|------------|--------|-----------|
| EC2 | $XX,XXX | $XX,XXX | +/-X% | XX% |
| RDS | $XX,XXX | $XX,XXX | +/-X% | XX% |
| S3 | $XX,XXX | $XX,XXX | +/-X% | XX% |
| Data Transfer | $XX,XXX | $XX,XXX | +/-X% | XX% |
| Lambda | $XX,XXX | $XX,XXX | +/-X% | XX% |
| Other | $XX,XXX | $XX,XXX | +/-X% | XX% |
| **Total** | **$XX,XXX** | **$XX,XXX** | **+/-X%** | **100%** |

---

## Cost by Environment

| Environment | Cost | % of Total | Budget | Variance |
|-------------|------|-----------|--------|----------|
| Production | $XX,XXX | XX% | $XX,XXX | +/-X% |
| Staging | $XX,XXX | XX% | $XX,XXX | +/-X% |
| Development | $XX,XXX | XX% | $XX,XXX | +/-X% |
| Test | $XX,XXX | XX% | $XX,XXX | +/-X% |
| **Total** | **$XX,XXX** | **100%** | **$XX,XXX** | **+/-X%** |

---

## Cost by Team/Project

| Team/Project | Cost | % of Total | vs Last Month |
|--------------|------|-----------|---------------|
| Team Alpha | $XX,XXX | XX% | +/-X% |
| Team Beta | $XX,XXX | XX% | +/-X% |
| Team Gamma | $XX,XXX | XX% | +/-X% |
| Platform/Shared | $XX,XXX | XX% | +/-X% |
| Untagged | $XX,XXX | XX% | +/-X% |
| **Total** | **$XX,XXX** | **100%** | **+/-X%** |

---
## Cost Anomalies Detected

### Significant Cost Increases

| Date | Service | Cost | Baseline | Increase | Root Cause | Action Taken |
|------|---------|------|----------|----------|------------|--------------|
| [Date] | [Service] | $XXX | $XXX | +XX% | [Explanation] | [Action] |

### Unusual Spending Patterns

- **[Service/Resource]**: [Description of anomaly and investigation findings]

---

## Optimization Activities This Month

### Actions Completed

1. **[Optimization Action 1]**
   - **Description**: [What was done]
   - **Monthly Savings**: $XXX
   - **Annual Savings**: $XXX
   - **Effort**: [Hours/Days]

2. **[Optimization Action 2]**
   - **Description**: [What was done]
   - **Monthly Savings**: $XXX
   - **Annual Savings**: $XXX
   - **Effort**: [Hours/Days]

3. **[Optimization Action 3]**
   - **Description**: [What was done]
   - **Monthly Savings**: $XXX
   - **Annual Savings**: $XXX
   - **Effort**: [Hours/Days]

### Total Savings Achieved
- **Monthly**: $XXX
- **Annual**: $XXX

---
## Optimization Opportunities Identified

### High Priority (Recommended This Month)

1. **[Opportunity 1]**
   - **Issue**: [Description of waste/inefficiency]
   - **Recommendation**: [What to do]
   - **Estimated Monthly Savings**: $XXX
   - **Effort**: [Low/Medium/High]
   - **Risk**: [Low/Medium/High]
   - **Owner**: [Team/Person]
   - **Deadline**: [Date]

2. **[Opportunity 2]**
   - **Issue**: [Description]
   - **Recommendation**: [Action]
   - **Estimated Monthly Savings**: $XXX
   - **Effort**: [Low/Medium/High]
   - **Risk**: [Low/Medium/High]
   - **Owner**: [Team/Person]
   - **Deadline**: [Date]

### Medium Priority (Next Quarter)

1. **[Opportunity 3]**
   - **Details**: [Brief description]
   - **Estimated Monthly Savings**: $XXX

2. **[Opportunity 4]**
   - **Details**: [Brief description]
   - **Estimated Monthly Savings**: $XXX

---
## Resource Inventory

### Unused Resources Found

| Resource Type | Count | Total Monthly Cost | Action |
|---------------|-------|-------------------|--------|
| Unattached EBS Volumes | XX | $XXX | Delete after review |
| Old Snapshots (>90 days) | XX | $XXX | Delete after review |
| Unused Elastic IPs | XX | $XXX | Release |
| Idle NAT Gateways | XX | $XXX | Review and consolidate |
| Idle Load Balancers | XX | $XXX | Delete |
| Stopped EC2 (>30 days) | XX | $XXX | Terminate |

**Total Potential Savings**: $XXX/month

### Rightsizing Recommendations

| Instance ID | Current Type | Recommended Type | Monthly Savings | Utilization |
|-------------|--------------|------------------|-----------------|-------------|
| i-xxxxx | m5.2xlarge | m5.xlarge | $XXX | Avg CPU: XX% |
| i-xxxxx | c5.4xlarge | c5.2xlarge | $XXX | Avg CPU: XX% |
| i-xxxxx | r5.8xlarge | r5.4xlarge | $XXX | Avg CPU: XX% |

**Total Potential Savings**: $XXX/month

### Reserved Instance/Savings Plan Opportunities

| Service | Instance Type | Quantity | Commitment | Monthly Savings | Annual Savings |
|---------|--------------|----------|------------|-----------------|----------------|
| EC2 | m5.xlarge | 10 | 1yr Standard RI | $XXX | $XXX |
| RDS | db.r5.large | 5 | 3yr Standard RI | $XXX | $XXX |

**Total Potential Annual Savings**: $XXX

---
## Commitment Utilization

### Reserved Instances

| Instance Type | Purchased | Utilized | Utilization % | Status |
|---------------|-----------|----------|---------------|--------|
| m5.xlarge | 20 | 19.2 | 96% | ✅ Good |
| c5.2xlarge | 10 | 7.5 | 75% | ⚠️ Review |
| r5.large | 5 | 5.0 | 100% | ✅ Good |

### Savings Plans

| Commitment Type | Commitment | Used | Utilization % | Status |
|----------------|------------|------|---------------|--------|
| Compute SP | $5,000/month | $4,950 | 99% | ✅ Good |
| EC2 Instance SP | $2,000/month | $1,800 | 90% | ✅ Good |

---

## Tag Compliance

| Tag | Compliance Rate | Resources Missing Tags | Trend |
|-----|----------------|------------------------|-------|
| Environment | 95% | 120 | ↗️ Improving |
| Owner | 88% | 280 | → Stable |
| Project | 92% | 180 | ↗️ Improving |
| CostCenter | 85% | 350 | ↘️ Declining |

**Action Required**: [Teams/resources that need to improve tagging]

---
## Forecast & Projections
|
||||||
|
|
||||||
|
### Next Month Forecast
|
||||||
|
|
||||||
|
- **AWS Cost Explorer Forecast**: $XX,XXX
|
||||||
|
- **Confidence Level**: [High/Medium/Low]
|
||||||
|
- **Known Variables**:
|
||||||
|
- ✅ [Factor that will decrease costs]
|
||||||
|
- ⚠️ [Factor that will increase costs]
|
||||||
|
|
||||||
|
### Quarterly Projection
|
||||||
|
|
||||||
|
| Quarter | Projected Cost | vs Previous Quarter | Notes |
|
||||||
|
|---------|---------------|---------------------|-------|
|
||||||
|
| Q[X] [Year] | $XXX,XXX | +/-X% | [Notes] |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Upcoming Changes & Impact
|
||||||
|
|
||||||
|
### New Projects/Initiatives
|
||||||
|
|
||||||
|
1. **[Project Name]**
|
||||||
|
- **Launch Date**: [Date]
|
||||||
|
- **Expected Monthly Cost**: $XXX
|
||||||
|
- **Budget Allocated**: $XXX
|
||||||
|
|
||||||
|
2. **[Project Name]**
|
||||||
|
- **Launch Date**: [Date]
|
||||||
|
- **Expected Monthly Cost**: $XXX
|
||||||
|
- **Budget Allocated**: $XXX
|
||||||
|
|
||||||
|
### Planned Optimizations
|
||||||
|
|
||||||
|
1. **[Planned Activity]**
|
||||||
|
- **Scheduled**: [Date]
|
||||||
|
- **Expected Savings**: $XXX/month
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Action Items from Last Month
|
||||||
|
|
||||||
|
| Item | Owner | Status | Notes |
|
||||||
|
|------|-------|--------|-------|
|
||||||
|
| [Action item 1] | [Name] | ✅ Complete | [Notes] |
|
||||||
|
| [Action item 2] | [Name] | 🔄 In Progress | [Notes] |
|
||||||
|
| [Action item 3] | [Name] | ❌ Blocked | [Notes] |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Action Items for Next Month
|
||||||
|
|
||||||
|
| Priority | Item | Owner | Deadline | Expected Savings |
|
||||||
|
|----------|------|-------|----------|------------------|
|
||||||
|
| 🔴 High | [Action 1] | [Name] | [Date] | $XXX/month |
|
||||||
|
| 🔴 High | [Action 2] | [Name] | [Date] | $XXX/month |
|
||||||
|
| 🟡 Medium | [Action 3] | [Name] | [Date] | $XXX/month |
|
||||||
|
| 🟡 Medium | [Action 4] | [Name] | [Date] | $XXX/month |
|
||||||
|
| 🟢 Low | [Action 5] | [Name] | [Date] | $XXX/month |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Appendix
|
||||||
|
|
||||||
|
### Methodology
|
||||||
|
|
||||||
|
- **Data Source**: AWS Cost Explorer, Cost & Usage Reports
|
||||||
|
- **Scripts Used**:
|
||||||
|
- `find_unused_resources.py`
|
||||||
|
- `analyze_ri_recommendations.py`
|
||||||
|
- `rightsizing_analyzer.py`
|
||||||
|
- `cost_anomaly_detector.py`
|
||||||
|
- **Analysis Period**: [Days] days of data
|
||||||
|
- **Cost Estimation**: Based on [region] pricing, [assumptions]
|
||||||
|
|
||||||
|
### Definitions
|
||||||
|
|
||||||
|
- **Untagged Resources**: Resources missing one or more required tags
|
||||||
|
- **Idle Resources**: Resources with <5% avg utilization over analysis period
|
||||||
|
- **Optimization Savings**: Actual realized savings from completed optimizations
|
||||||
|
- **Potential Savings**: Estimated savings from recommended actions
|
||||||
|
|
||||||
|
### Contact
|
||||||
|
|
||||||
|
For questions about this report, contact:
|
||||||
|
- **FinOps Team**: [email]
|
||||||
|
- **Report Author**: [name, email]
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
**Next Review Date**: [Date]
|
||||||
85
plugin.lock.json
Normal file
@@ -0,0 +1,85 @@
{
  "$schema": "internal://schemas/plugin.lock.v1.json",
  "pluginId": "gh:ahmedasmar/devops-claude-skills:aws-cost-optimization",
  "normalized": {
    "repo": null,
    "ref": "refs/tags/v20251128.0",
    "commit": "8af875d202ad9fffe3aad470ddfc44807008c807",
    "treeHash": "ef6f5974900c9c322acd19ad2bc266d3d5af03d47494c64181307cc41cfc6013",
    "generatedAt": "2025-11-28T10:13:02.954058Z",
    "toolVersion": "publish_plugins.py@0.2.0"
  },
  "origin": {
    "remote": "git@github.com:zhongweili/42plugin-data.git",
    "branch": "master",
    "commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
    "repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
  },
  "manifest": {
    "name": "aws-cost-optimization",
    "description": "AWS cost optimization and FinOps workflows with automated analysis scripts",
    "version": "1.0.0"
  },
  "content": {
    "files": [
      {
        "path": "README.md",
        "sha256": "760d3040eedf58f280c4c83efe429d9fae7a775d919140525790531b3c0305eb"
      },
      {
        "path": "SKILL.md",
        "sha256": "70743548409e846b3913ad3c197a4ccd3960f1463f391d88eb2ddb90865b3695"
      },
      {
        "path": "references/finops_governance.md",
        "sha256": "3a799de330494b65d2cca0417f93dfe52ca4f53a0c136fb6f6a3b9a68d1cb132"
      },
      {
        "path": "references/best_practices.md",
        "sha256": "276432147c83917b1262819ea99e71a2650c54723a4126144bd231db4ecc3852"
      },
      {
        "path": "references/service_alternatives.md",
        "sha256": "857c8fb0d38f866d219536440df530cbc279c1c6b43de0e8a7ea8535f4a0fed7"
      },
      {
        "path": "scripts/spot_recommendations.py",
        "sha256": "0683409a643399a253068aca744d306a2a082eb564c699869265203d45c568b6"
      },
      {
        "path": "scripts/find_unused_resources.py",
        "sha256": "08ad6fd4b3a63f1d0a5d1cbfcbfe4b97e8097a73d8eb698618563b90f7282494"
      },
      {
        "path": "scripts/cost_anomaly_detector.py",
        "sha256": "2008ea17ffba6683d03a0a82f1f38eae0755be6db9be95138b3aa0c4631034ad"
      },
      {
        "path": "scripts/rightsizing_analyzer.py",
        "sha256": "3c065e84053dbba462b586cff1456dbb59d885f1c2ddc8ba986008c0872fc667"
      },
      {
        "path": "scripts/analyze_ri_recommendations.py",
        "sha256": "8611eb7aee1c6db721d861ab139b6571690d84c8b369a49b6475bbed5a0e3f6c"
      },
      {
        "path": "scripts/detect_old_generations.py",
        "sha256": "aaa5b9e616fa26bb972cc14616050b20b6a1c46a520bb233ac4bc7428ff47397"
      },
      {
        "path": ".claude-plugin/plugin.json",
        "sha256": "27b05fae94d7a80326cc793d0ca992d2cb6f72a4d21ecbe7213b948d3063dd1c"
      },
      {
        "path": "assets/templates/monthly_cost_report.md",
        "sha256": "9ec05851de442b0952764035421d1478875d07ad8a2e03b7cd16e842ca6c4621"
      }
    ],
    "dirSha256": "ef6f5974900c9c322acd19ad2bc266d3d5af03d47494c64181307cc41cfc6013"
  },
  "security": {
    "scannedAt": null,
    "scannerVersion": null,
    "flags": []
  }
}
362
references/best_practices.md
Normal file
@@ -0,0 +1,362 @@
# AWS Cost Optimization Best Practices

Comprehensive strategies for optimizing AWS costs across all major service categories.

## Table of Contents

1. [Compute Optimization](#compute-optimization)
2. [Storage Optimization](#storage-optimization)
3. [Network Optimization](#network-optimization)
4. [Database Optimization](#database-optimization)
5. [Container & Serverless Optimization](#container--serverless-optimization)
6. [General Principles](#general-principles)

---

## Compute Optimization

### EC2 Instance Optimization

**Right Instance Family**
- **General Purpose (T3, M5, M6i)**: Web servers, small-medium databases, dev environments
- **Compute Optimized (C5, C6i, C6g)**: CPU-intensive workloads, batch processing, HPC
- **Memory Optimized (R5, R6i, R6g)**: Databases, in-memory caches, big data
- **Storage Optimized (I3, D2)**: High IOPS, data warehousing, Hadoop

**Graviton Migration (ARM64)**
- Up to 20% cost savings with M6g, C6g, R6g, T4g instances
- Test compatibility first: most modern languages/frameworks support ARM64
- Best for: stateless applications, containerized workloads, open-source software

**Instance Sizing**
- Start small and scale up based on metrics
- Monitor CPU, memory, and network for 2+ weeks before committing
- Use CloudWatch metrics to identify underutilized instances
- Consider burstable instances (T3) for variable workloads

**Purchase Options**
- **On-Demand**: Flexible, no commitment, highest cost
- **Reserved Instances**: 1-3 year commitment, up to 63% savings
  - Standard RI: Highest discount, no flexibility
  - Convertible RI: Moderate discount, can change instance types
- **Savings Plans**: Flexible commitment to compute spend, up to 66% savings
- **Spot Instances**: Up to 90% savings, suitable for fault-tolerant workloads

### Auto Scaling

**Horizontal Scaling**
- Scale out during peak, scale in during off-peak
- Use target tracking policies (CPU, ALB requests, custom metrics)
- Set minimum instances for high availability, maximum for cost control
- Consider scheduled scaling for predictable patterns

**Mixed Instances Policy**
- Combine instance types for better Spot availability
- Mix Spot and On-Demand for reliability
- Example: 70% Spot, 30% On-Demand for fault-tolerant apps
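The blended savings from a split like that can be estimated directly. A quick sketch, with placeholder rates (the actual Spot discount varies by instance pool and region):

```python
def blended_hourly_cost(on_demand_rate: float, spot_discount: float,
                        spot_share: float, instances: int) -> float:
    """Average fleet cost per hour for a Spot/On-Demand mix."""
    spot_rate = on_demand_rate * (1 - spot_discount)
    return instances * (spot_share * spot_rate + (1 - spot_share) * on_demand_rate)

# Hypothetical: 10 instances at $0.10/h On-Demand, Spot running ~70% cheaper
full_od = blended_hourly_cost(0.10, 0.70, 0.0, 10)   # all On-Demand
mixed   = blended_hourly_cost(0.10, 0.70, 0.7, 10)   # 70% Spot / 30% On-Demand
print(f"savings: {1 - mixed / full_od:.0%}")         # 49%
```

Even with 30% of capacity kept On-Demand for reliability, roughly half the fleet cost disappears.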
### Lambda Optimization

**Memory Configuration**
- Memory allocation determines CPU allocation
- More memory = faster execution = potentially lower cost
- Test different memory settings to find the cost/performance sweet spot
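Because Lambda bills per GB-second, doubling memory is cost-neutral whenever it at least halves duration on a CPU-bound function. A sketch of that accounting; the two rates are approximate x86 list prices and should be checked against the current Lambda pricing page:

```python
GB_SECOND = 0.0000166667   # approximate $/GB-second (verify current pricing)
PER_REQUEST = 0.0000002    # approximate $/invocation (verify current pricing)

def lambda_cost(memory_mb: int, duration_ms: float, invocations: int) -> float:
    """Monthly Lambda compute + request cost for one function configuration."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000) * invocations
    return gb_seconds * GB_SECOND + invocations * PER_REQUEST

# 1M invocations: 128 MB @ 800 ms vs 256 MB @ 400 ms (CPU-bound workload)
print(round(lambda_cost(128, 800, 1_000_000), 2))
print(round(lambda_cost(256, 400, 1_000_000), 2))  # same cost, half the latency
```

When duration does not scale down linearly with memory, the sweet spot sits somewhere in between; tools like AWS Lambda Power Tuning automate exactly this sweep.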
**Cold Start Mitigation**
- Provisioned concurrency for critical functions (adds cost)
- Keep functions warm with scheduled invocations
- Minimize deployment package size
- Use Lambda layers for shared dependencies

**Execution Time**
- Optimize code to reduce execution duration
- Every 100ms of execution matters at scale
- Consider Graviton2 (arm64) for 20% better price/performance

---

## Storage Optimization

### S3 Cost Optimization

**Storage Classes**
- **S3 Standard**: Frequently accessed data
- **S3 Intelligent-Tiering**: Auto-moves between tiers, ideal for unknown patterns
- **S3 Standard-IA**: Infrequent access, 50% cheaper than Standard
- **S3 One Zone-IA**: Non-critical, infrequent access, 20% cheaper than Standard-IA
- **S3 Glacier Instant Retrieval**: Archive with instant access, 68% cheaper
- **S3 Glacier Flexible Retrieval**: Archive, retrieval in minutes-hours, 77% cheaper
- **S3 Glacier Deep Archive**: Long-term archive, retrieval in 12 hours, 83% cheaper

**Lifecycle Policies**
- Automatically transition objects between storage classes
- Delete incomplete multipart uploads after 7 days
- Example policy:
  - 0-30 days: S3 Standard
  - 30-90 days: S3 Standard-IA
  - 90-365 days: S3 Glacier Flexible Retrieval
  - 365+ days: S3 Glacier Deep Archive or Delete
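That example policy translates to a lifecycle configuration roughly like the following (the rule ID is a placeholder; `GLACIER` is the API name for Glacier Flexible Retrieval). It can be applied with `aws s3api put-bucket-lifecycle-configuration`:

```json
{
  "Rules": [
    {
      "ID": "tiering-and-expiry",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Transitions": [
        { "Days": 30, "StorageClass": "STANDARD_IA" },
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 365, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "AbortIncompleteMultipartUpload": { "DaysAfterInitiation": 7 }
    }
  ]
}
```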
**Request Optimization**
- Use CloudFront CDN to reduce S3 GET requests
- Batch operations instead of individual API calls
- Use S3 Select to retrieve subsets of data
- Enable S3 Transfer Acceleration for faster uploads (if needed)

**Cost Monitoring**
- Enable S3 Storage Lens for usage analytics
- Set up S3 Storage Class Analysis
- Monitor request costs (they can exceed storage costs for small files)

### EBS Optimization

**Volume Types**
- **gp3**: General purpose, 20% cheaper than gp2, configurable IOPS/throughput
- **gp2**: Legacy general purpose (migrate to gp3)
- **io2**: High performance, mission-critical (only if needed)
- **st1**: Throughput-optimized HDD for big data (cheaper for sequential access)
- **sc1**: Cold HDD for infrequent access (cheapest)
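The gp2-to-gp3 saving is easy to quantify per volume. The $/GB-month rates below are approximate us-east-1 list prices, so verify against the current EBS pricing page:

```python
GP2_PER_GB = 0.10   # approximate $/GB-month
GP3_PER_GB = 0.08   # approximate $/GB-month (3,000 IOPS / 125 MB/s included)

def gp2_to_gp3_savings(size_gb: int, volumes: int = 1) -> float:
    """Monthly savings from migrating gp2 volumes to gp3 at baseline performance."""
    return (GP2_PER_GB - GP3_PER_GB) * size_gb * volumes

print(f"${gp2_to_gp3_savings(500, volumes=20):.2f}/month")  # a fleet of 20 x 500 GB
```

Since gp3 can be converted in place via Elastic Volumes with no downtime, this is one of the lowest-risk items on the quick-wins checklist.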
**Snapshot Management**
- Delete old snapshots (they accumulate quickly)
- Use Data Lifecycle Manager for automated snapshot policies
- Snapshots are incremental, but deleting them safely is complex (let Data Lifecycle Manager handle it)
- Consider cross-region replication costs

**Volume Cleanup**
- Delete unattached volumes
- Right-size oversized volumes
- Use EBS Elastic Volumes to modify volumes without downtime

---

## Network Optimization

### Data Transfer Costs

**General Rules**
- **Free**: Inbound from the internet, same-AZ traffic (same subnet)
- **Cheap**: Same-region traffic across AZs
- **Expensive**: Cross-region, outbound to the internet, CloudFront to origin

**Optimization Strategies**
- Colocate resources in the same AZ when possible (consider HA trade-offs)
- Use VPC endpoints for AWS service access (avoids NAT/IGW costs)
- Implement caching with CloudFront and ElastiCache
- Compress data before transfer
- Use AWS PrivateLink instead of internet egress

### NAT Gateway Optimization

**Cost Structure**
- ~$32.85/month per NAT Gateway
- Data processing charges: $0.045/GB
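Those two rates make the NAT-versus-endpoint comparison a one-liner. The endpoint figures used here are the ones quoted in the Alternatives list below and are likewise approximate:

```python
def nat_gateway_monthly(gb_processed: float) -> float:
    """Approximate monthly NAT Gateway cost: hourly charge + data processing."""
    return 32.85 + 0.045 * gb_processed

def interface_endpoint_monthly(gb_processed: float) -> float:
    """Approximate monthly interface VPC endpoint cost (single AZ)."""
    return 7.20 + 0.01 * gb_processed

for gb in (100, 1_000, 10_000):
    print(gb, round(nat_gateway_monthly(gb), 2),
          round(interface_endpoint_monthly(gb), 2))
```

At every volume the endpoint wins for traffic to AWS services, and the gap widens with throughput; NAT only remains necessary for traffic that must actually reach the internet.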
**Alternatives**
- **VPC Endpoints**: Direct access to AWS services (S3, DynamoDB, etc.)
  - Interface endpoints: $7.20/month + $0.01/GB
  - Gateway endpoints: Free for S3 and DynamoDB
- **NAT Instance**: Cheaper but requires management
- **Single NAT Gateway**: Use one instead of one per AZ (reduces HA)
- **S3 Gateway Endpoint**: Free alternative for S3 access

**When to Use What**
- High traffic to AWS services → VPC Endpoints
- Low traffic, dev/test → Single NAT Gateway or NAT instance
- Production, HA required → NAT Gateway per AZ
- S3 access only → S3 Gateway Endpoint (free)

### CloudFront Optimization

**Use Cases for Savings**
- Reduce S3 data transfer costs (CloudFront egress is cheaper)
- Cache frequently accessed content
- Regional edge caches for less popular content

**Configuration**
- Use an appropriate price class (exclude expensive regions if not needed)
- Set proper TTLs to maximize cache hit ratio
- Use compression (gzip, brotli)
- Monitor cache hit ratio and adjust

---

## Database Optimization

### RDS Cost Optimization

**Instance Sizing**
- Right-size based on CloudWatch metrics (CPU, memory, connections)
- Consider burstable instances (db.t3) for variable workloads
- Graviton instances (db.m6g, db.r6g) offer 20% savings

**Storage Optimization**
- Use gp3 instead of gp2 (20% cheaper)
- Enable storage autoscaling with an upper limit
- Delete old automated backups
- Reduce the backup retention period if possible

**High Availability Trade-offs**
- Multi-AZ doubles cost (needed for production)
- Single-AZ is acceptable for dev/test
- Read replicas for read scaling (cheaper than a bigger instance)

**Aurora vs RDS**
- Aurora costs more but offers better scaling
- Aurora Serverless v2 for variable workloads
- Standard RDS for predictable workloads
- Community PostgreSQL/MySQL for dev/test

### DynamoDB Optimization

**Capacity Modes**
- **On-Demand**: Pay per request, unpredictable traffic
- **Provisioned**: Cheaper for consistent traffic, requires capacity planning
- **Reserved Capacity**: 1-3 year commitment for provisioned capacity

**Table Design**
- Use single-table design to minimize costs
- Implement GSIs/LSIs carefully (they add cost)
- Enable point-in-time recovery only if needed
- Use TTL to auto-expire old data

**Read Optimization**
- Use eventually consistent reads (50% cheaper than strongly consistent)
- Implement caching (DAX or ElastiCache)
- Batch operations when possible
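The 50% figure falls straight out of the capacity-unit rules: a strongly consistent read of up to 4 KB consumes 1 RCU, an eventually consistent one consumes 0.5 RCU. A small sketch of that accounting:

```python
import math

def read_capacity_units(item_kb: float, strongly_consistent: bool) -> float:
    """RCUs consumed by one read: 1 RCU per 4 KB chunk (strong), half that (eventual)."""
    units = math.ceil(item_kb / 4)
    return units if strongly_consistent else units / 2

assert read_capacity_units(3, strongly_consistent=True) == 1
assert read_capacity_units(3, strongly_consistent=False) == 0.5
assert read_capacity_units(9, strongly_consistent=True) == 3   # rounds up to 3 x 4 KB
```

On provisioned tables this halves the RCUs you must provision for read-heavy paths that can tolerate replication lag of typically under a second.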
### ElastiCache Optimization

**Node Types**
- Graviton instances (cache.m6g, cache.r6g) for 20% savings
- Right-size based on memory usage and eviction rates

**Redis vs Memcached**
- Redis: More features, persistence, replication (more expensive)
- Memcached: Simpler, no persistence, multi-threaded (cheaper)

**Strategies**
- Reserved nodes for 30-55% savings
- Single-AZ for dev/test
- Monitor eviction rates to avoid over-provisioning

---

## Container & Serverless Optimization

### ECS/Fargate Optimization

**Compute Options**
- **EC2 Launch Type**: More control, cheaper for steady workloads
- **Fargate**: Serverless, easier management, better for variable loads
- **Fargate Spot**: Up to 70% savings for fault-tolerant tasks

**Graviton Support**
- Fargate ARM64 support is available
- ECS on Graviton2 EC2 instances for 20% savings

**Right-sizing**
- Start with minimal CPU/memory, scale up based on metrics
- Use Container Insights for utilization data
- Consider task packing (multiple containers per task)

### EKS Optimization

**Control Plane**
- $73/month per cluster (consider consolidation)
- Use a single cluster with namespaces when appropriate

**Worker Nodes**
- Use Spot instances for fault-tolerant pods (up to 90% savings)
- Managed node groups with Graviton instances
- Karpenter for intelligent autoscaling
- Mixed instance types for better Spot availability

**Cost Visibility**
- Kubecost or OpenCost for K8s cost attribution
- Resource requests/limits prevent waste
- Cluster Autoscaler for automatic node scaling

---

## General Principles

### Tagging Strategy

**Cost Allocation Tags**
- Environment: prod, staging, dev, test
- Owner: team/person responsible
- Project: business initiative
- CostCenter: chargeback allocation
- Application: specific app name

**Tag Enforcement**
- Use AWS Organizations policies to enforce tagging
- Service Control Policies to prevent untagged resources
- AWS Config rules for compliance

### Monitoring and Governance

**Cost Monitoring Tools**
- AWS Cost Explorer: Historical analysis
- AWS Budgets: Proactive alerts
- Cost and Usage Reports: Detailed data export
- Cost Anomaly Detection: Automatic anomaly alerts

**Regular Reviews**
- Monthly cost review meetings
- Quarterly rightsizing exercises
- Annual Reserved Instance/Savings Plan optimization
- Automated reports to stakeholders

### Automation

**Infrastructure as Code**
- Define resource sizes in code (prevents oversizing)
- Automated cleanup of dev/test resources
- Scheduled shutdown of non-production resources

**Cost Optimization Tools**
- AWS Compute Optimizer: ML-based recommendations
- AWS Trusted Advisor: Best practice checks
- Third-party tools: CloudHealth, Cloudability, Spot.io

### Cultural Best Practices

**Engineering Ownership**
- Engineers should see the cost impact of their changes
- Cost metrics in dashboards alongside performance
- Cost budgets for teams/projects

**Experiments and Cleanup**
- Tag experimental resources with expiration dates
- Automated cleanup of abandoned resources
- Regular audits of unused resources

**Cost-Aware Architecture**
- Design for cost from the beginning
- Choose appropriate service tiers
- Implement auto-scaling and right-sizing from day one
- Consider serverless and managed services

---

## Quick Wins Checklist

- [ ] Delete unattached EBS volumes
- [ ] Delete old EBS snapshots
- [ ] Release unused Elastic IPs
- [ ] Stop or terminate idle EC2 instances
- [ ] Right-size oversized instances
- [ ] Convert gp2 to gp3 volumes
- [ ] Enable S3 Intelligent-Tiering
- [ ] Set up S3 lifecycle policies
- [ ] Replace NAT Gateways with VPC Endpoints where possible
- [ ] Migrate to Graviton instances
- [ ] Purchase Reserved Instances/Savings Plans for stable workloads
- [ ] Use Spot instances for fault-tolerant workloads
- [ ] Delete old RDS snapshots
- [ ] Enable DynamoDB auto-scaling
- [ ] Set up cost allocation tags
- [ ] Enable AWS Budgets alerts
- [ ] Schedule shutdown of dev/test resources
740
references/finops_governance.md
Normal file
@@ -0,0 +1,740 @@
# FinOps Governance Framework

Organizational practices, processes, and governance for AWS cost optimization.

## Table of Contents

1. [FinOps Principles](#finops-principles)
2. [Cost Allocation & Tagging](#cost-allocation--tagging)
3. [Budget Management](#budget-management)
4. [Monthly Review Process](#monthly-review-process)
5. [Roles & Responsibilities](#roles--responsibilities)
6. [Chargeback & Showback](#chargeback--showback)
7. [Policy & Governance](#policy--governance)
8. [Metrics & KPIs](#metrics--kpis)

---

## FinOps Principles

### The FinOps Framework

FinOps is the practice of bringing financial accountability to cloud spending through collaboration between engineering, finance, and business teams.

**Core Principles:**

1. **Teams Need to Collaborate**
   - Engineering makes technical decisions
   - Finance provides visibility and reporting
   - Business sets priorities and budgets
   - Cross-functional cost optimization

2. **Everyone Takes Ownership**
   - Engineers see the cost impact of their decisions
   - Teams have cost budgets and accountability
   - Cost is an efficiency metric, not just a finance concern

3. **Decisions Are Driven by Business Value**
   - Speed, quality, and cost trade-offs
   - Investment vs optimization decisions
   - ROI-based prioritization

4. **Take Advantage of the Variable Cost Model**
   - Scale resources up and down as needed
   - Use different pricing models strategically
   - Optimize for actual usage patterns

5. **A Centralized Team Drives FinOps**
   - A central FinOps team enables and coaches
   - Distributed execution by product teams
   - Share best practices and tools

### FinOps Maturity Model

**Crawl Phase (Getting Started)**
- Basic cost visibility
- Manual reporting
- Ad-hoc optimization
- Initial tagging strategy
- Basic budget alerts

**Walk Phase (Improving)**
- Automated cost reporting
- Regular optimization reviews
- Systematic tagging enforcement
- Team cost allocation
- Reserved Instance planning
- Monthly optimization meetings

**Run Phase (Optimized)**
- Real-time cost visibility
- Automated optimization
- Cost-aware engineering culture
- Predictive forecasting
- Automated guardrails
- FinOps integrated into the SDLC

---
## Cost Allocation & Tagging

### Tagging Strategy

**Required Tags (Enforce via Policy)**

```yaml
Required Tags:
  Environment:
    values: [prod, staging, dev, test]
    purpose: Separate production from non-production costs

  Owner:
    values: [email or team name]
    purpose: Contact for resource questions

  Project:
    values: [project code]
    purpose: Track project spending

  CostCenter:
    values: [department code]
    purpose: Chargeback allocation

  Application:
    values: [app name]
    purpose: Application-level cost tracking
```

**Optional but Recommended Tags**

```yaml
Optional Tags:
  ExpirationDate:
    format: YYYY-MM-DD
    purpose: Auto-cleanup scheduling

  DataClassification:
    values: [public, internal, confidential, restricted]
    purpose: Security and compliance

  BackupRequired:
    values: [true, false]
    purpose: Backup policy enforcement

  Criticality:
    values: [critical, high, medium, low]
    purpose: Priority and SLA determination
```

### Tag Enforcement

**Using AWS Organizations Service Control Policies (SCP)**

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyEC2CreationWithoutTags",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances"
      ],
      "Resource": [
        "arn:aws:ec2:*:*:instance/*"
      ],
      "Condition": {
        "StringNotLike": {
          "aws:RequestTag/Environment": ["prod", "staging", "dev", "test"],
          "aws:RequestTag/Owner": "*",
          "aws:RequestTag/Project": "*"
        }
      }
    }
  ]
}
```

**Using AWS Config Rules**

- **required-tags**: Enforce tags on all resources
- **ec2-instance-no-public-ip**: Prevent public IPs unless tagged
- Custom Lambda-based rules for complex logic

**Tag Compliance Monitoring**

```bash
# Example: check tag compliance.
# Run weekly to find untagged resources.

aws resourcegroupstaggingapi get-resources \
  --query 'ResourceTagMappingList[?length(Tags) == `0`]' \
  --output table

# Or use Tag Editor in the AWS Console
```

### Cost Allocation Tags

**Activating Cost Allocation Tags**

1. Go to AWS Billing → Cost Allocation Tags
2. Select user-defined tags to activate
3. Wait up to 24 hours for tags to appear in Cost Explorer
4. Tags only apply to charges incurred after activation

**Best Practices**

- Activate tags before using them
- Use consistent naming (e.g., `Environment`, not `Env` or `environment`)
- Document tag values in a wiki/runbook
- Review and update the tag strategy quarterly

---
## Budget Management
|
||||||
|
|
||||||
|
### AWS Budgets Setup
|
||||||
|
|
||||||
|
**Budget Types**
|
||||||
|
|
||||||
|
1. **Cost Budget**: Track spending against threshold
|
||||||
|
2. **Usage Budget**: Track service usage (e.g., EC2 hours)
|
||||||
|
3. **Savings Plans Budget**: Track commitment utilization
|
||||||
|
4. **Reservation Budget**: Track RI utilization
|
||||||
|
|
||||||
|
**Recommended Budgets**
|
||||||
|
|
||||||
|
**1. Overall Monthly Budget**
|
||||||
|
```yaml
|
||||||
|
Budget Name: Company-Wide-Monthly-Budget
|
||||||
|
Amount: $50,000/month
|
||||||
|
Alerts:
|
||||||
|
- 50% actual: Email CFO, FinOps team
|
||||||
|
- 80% actual: Email CFO, CTO, FinOps team
|
||||||
|
- 100% forecasted: Email CFO, CTO, all team leads
|
||||||
|
- 100% actual: Email everyone + Slack alert
|
||||||
|
```
|
||||||
|
|
||||||
|
**2. Per-Environment Budgets**
|
||||||
|
```yaml
|
||||||
|
Budget Name: Production-Environment-Budget
|
||||||
|
Amount: $30,000/month
|
||||||
|
Filter: Environment=prod
|
||||||
|
Alerts:
|
||||||
|
- 80% actual: Email engineering leads
|
||||||
|
- 100% forecasted: Email CTO + FinOps
|
||||||
|
|
||||||
|
Budget Name: Dev-Environment-Budget
|
||||||
|
Amount: $5,000/month
|
||||||
|
Filter: Environment=dev
|
||||||
|
Alerts:
|
||||||
|
- 100% actual: Email dev team leads
|
||||||
|
- 120% actual: Automated shutdown (if possible)
|
||||||
|
```
|
||||||
|
|
||||||
|
**3. Per-Team Budgets**
|
||||||
|
```yaml
|
||||||
|
Budget Name: Team-Platform-Budget
|
||||||
|
Amount: $15,000/month
|
||||||
|
Filter: Owner=platform-team
|
||||||
|
Alerts:
|
||||||
|
- 90% actual: Email platform team
|
||||||
|
- 100% forecasted: Email platform team + manager
|
||||||
|
```
|
||||||
|
|
||||||
|
**4. Per-Project Budgets**
|
||||||
|
```yaml
|
||||||
|
Budget Name: Project-Phoenix-Budget
|
||||||
|
Amount: $8,000/month
|
||||||
|
Filter: Project=phoenix
|
||||||
|
Alerts:
|
||||||
|
- 75% actual: Email project owner
|
||||||
|
- 100% actual: Email project owner + sponsor
|
||||||
|
```
|
||||||
|
|
||||||
|
### Budget Alert Actions

**Automated Responses to Budget Alerts**

```python
# Lambda function triggered by the Budget alert SNS topic.
# Note: real Budgets notifications arrive as a text message in
# event['Records'][0]['Sns']['Message']; the structured fields below
# assume a pre-parsed payload for brevity. The helpers
# (stop_dev_instances, send_slack_alert, create_cost_investigation_ticket)
# are implemented elsewhere.

def lambda_handler(event, context):
    # Parse budget alert
    budget_name = event['budgetName']
    threshold = event['threshold']

    if threshold >= 100:
        # Stop non-production instances
        stop_dev_instances()

        # Send Slack alert
        send_slack_alert(f"🚨 Budget {budget_name} exceeded!")

        # Create JIRA ticket
        create_cost_investigation_ticket()

    elif threshold >= 80:
        # Send warning
        send_slack_alert(f"⚠️ Budget {budget_name} at 80%")
```
---

## Monthly Review Process

### FinOps Monthly Cadence

**Week 1: Data Collection**
- Export Cost & Usage Reports
- Run cost optimization scripts
- Gather CloudWatch metrics
- Compile anomaly reports

**Week 2: Analysis**
- Identify cost trends
- Find optimization opportunities
- Compare to previous months
- Analyze tag compliance

**Week 3: Team Review Meetings**
- Present findings to engineering teams
- Discuss optimization opportunities
- Assign action items
- Review upcoming projects

**Week 4: Executive Reporting**
- Create executive summary
- Present cost trends to leadership
- Report on optimization wins
- Forecast next quarter

### Monthly Review Meeting Agenda

**Attendees**: Engineering Leads, FinOps Team, Finance Rep, Product Manager

**Agenda (1 hour)**

1. **Previous Month Recap (10 min)**
   - Total spend vs budget
   - Top 5 services by cost
   - Month-over-month comparison
   - Budget variance explanation

2. **Cost Anomalies (10 min)**
   - Unusual spending spikes
   - Root cause analysis
   - Prevention measures

3. **Optimization Opportunities (15 min)**
   - Unused resources found
   - Rightsizing recommendations
   - Reserved Instance opportunities
   - Estimated savings

4. **Team Cost Breakdown (10 min)**
   - Per-team spending
   - Top spenders
   - Tag compliance status

5. **Upcoming Changes (10 min)**
   - New projects launching
   - Expected cost impact
   - Budget adjustments needed

6. **Action Items Review (5 min)**
   - Follow-up on previous items
   - Assign new action items
   - Set deadlines

**Deliverable**: Monthly FinOps Report (template provided)
### Monthly Report Template

```markdown
# AWS Cost Report - [Month Year]

## Executive Summary
- Total spend: $XX,XXX
- vs Budget: X% (under/over)
- vs Last month: +/-X%
- Optimization savings: $X,XXX

## Cost Breakdown
| Service | Cost | % of Total | MoM Change |
|---------|------|------------|------------|
| EC2     | $XX  | XX%        | +/-X%      |
| RDS     | $XX  | XX%        | +/-X%      |

## Optimization Actions Taken
1. Migrated 20 instances to Graviton (saved $X/month)
2. Purchased Reserved Instances (saved $X/month)
3. Deleted unused resources (saved $X/month)

## Recommendations for Next Month
1. Right-size 15 oversized instances (potential $X/month savings)
2. Implement S3 lifecycle policies (potential $X/month savings)

## Action Items
- [ ] [Owner] Task description (Deadline)
```
---

## Roles & Responsibilities

### FinOps Team Structure

**FinOps Lead**
- Owns overall cloud financial management
- Reports to CFO and CTO
- Sets FinOps strategy and goals
- Manages budget process

**Cloud Cost Analyst**
- Analyzes spending trends
- Generates reports and dashboards
- Identifies optimization opportunities
- Runs monthly review process

**Cloud Architect (FinOps focus)**
- Advises on cost-optimized architectures
- Implements cost optimization tools
- Trains engineers on FinOps practices
- Reviews architectural designs for cost impact

### Engineering Team Responsibilities

**Engineering Manager**
- Owns team budget
- Reviews monthly cost reports
- Prioritizes optimization work
- Ensures tagging compliance

**Engineers**
- Tag all resources they create
- Consider cost in design decisions
- Implement optimization recommendations
- Delete unused resources

**Platform/SRE Team**
- Implements cost optimization tooling
- Automates cost monitoring
- Provides cost visibility dashboards
- Enforces tagging policies
---

## Chargeback & Showback

### Showback (Visibility Only)

**Purpose**: Show teams their costs without charging them
**Goal**: Raise cost awareness

**Implementation**:
- Monthly cost reports per team
- Dashboard showing team spending
- Highlight cost trends
- No budget enforcement

**Best for**: Organizations new to FinOps

### Chargeback (Financial Accountability)

**Purpose**: Allocate costs back to business units
**Goal**: Financial accountability

**Implementation**:
- Tag-based cost allocation
- Transfer costs between cost centers
- Teams have hard budgets
- Overspending requires justification

**Best for**: Mature FinOps organizations

### Hybrid Model (Recommended)

**Shared Costs**: Charged to central IT
- VPC resources
- Security tools
- Monitoring infrastructure
- Shared services

**Team Costs**: Charged to teams
- Compute resources (EC2, Lambda)
- Databases
- Storage
- Application-specific services

**Implementation**:
```
Total AWS Bill: $100,000

Shared Costs (30%): $30,000
→ Charged to IT/Platform budget

Team Costs (70%): $70,000
→ Allocated by tags:
  - Team A (Project=alpha): $20,000
  - Team B (Project=beta): $25,000
  - Team C (Project=gamma): $15,000
  - Untagged (alert!): $10,000 → Needs investigation
```
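The hybrid split can be computed mechanically from tag-level cost data. A minimal sketch — the tag names and the 30% shared fraction are illustrative, matching the example above:

```python
def allocate_hybrid(costs_by_tag: dict[str, float], shared_pct: float) -> dict:
    """Split a bill into a shared pool plus per-team allocations.

    costs_by_tag maps a Project tag (or "untagged") to its direct cost;
    shared_pct is the fraction of the *total* bill treated as shared,
    so the grand total is team spend divided by (1 - shared_pct).
    """
    total = sum(costs_by_tag.values()) / (1 - shared_pct)
    shared = total * shared_pct
    report = {"total": total, "shared": shared, "teams": dict(costs_by_tag)}
    # Untagged spend is surfaced explicitly, never silently absorbed.
    report["untagged"] = costs_by_tag.get("untagged", 0.0)
    return report

report = allocate_hybrid(
    {"alpha": 20_000, "beta": 25_000, "gamma": 15_000, "untagged": 10_000},
    shared_pct=0.30,
)
# report["total"] ≈ 100,000 and report["shared"] ≈ 30,000, as in the example
```

Feeding this from Cost Explorer's group-by-tag output turns the monthly allocation into a one-liner per billing period.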
---

## Policy & Governance

### Cost Governance Policies

**1. Resource Creation Policies**

```yaml
Policy: All resources must be tagged
Enforcement: Service Control Policy (SCP)
Exception process: Request via FinOps team

Policy: Dev/test resources must auto-stop nights/weekends
Enforcement: AWS Instance Scheduler
Exception process: Tag with NoAutoStop=true (requires approval)

Policy: S3 buckets must have lifecycle policies
Enforcement: AWS Config rule
Exception process: Document justification in bucket tags
```

**2. Approval Workflows**

```yaml
# Spending thresholds requiring approval

"< $1,000/month":
  - Auto-approved
  - Must be tagged

"$1,000 - $5,000/month":
  - Engineering manager approval
  - Documented in JIRA

"$5,000 - $20,000/month":
  - Director approval
  - Budget impact assessment
  - FinOps team review

"> $20,000/month":
  - VP approval
  - Business case required
  - Quarterly review checkpoint
```

**3. Reserved Instance / Savings Plans Policy**

```yaml
Policy: All commitments require FinOps review

Process:
  1. Team identifies workload suitable for commitment
  2. Submit request to FinOps with:
     - Resource details
     - Usage history (30+ days)
     - Business justification
  3. FinOps analyzes and recommends
  4. Finance approves commitment
  5. FinOps purchases and tracks utilization
```

### Automation & Guardrails

**Automated Actions**

```yaml
# Non-production resource scheduling
Schedule: Instance Scheduler
  - Stop all dev/test EC2/RDS instances at 7pm weekdays
  - Stop all dev/test instances all weekend
  - Start at 7am weekdays
  - Exception tag: NoAutoStop=true

# Untagged resource alerts
Trigger: AWS Config rule violation
Action:
  - Send Slack alert to team
  - Create JIRA ticket
  - Escalate if not tagged in 48 hours

# Old snapshot cleanup
Schedule: Weekly Lambda function
Action:
  - Delete snapshots older than 90 days (unless tagged KeepForever=true)
  - Notify teams of deletions
  - Estimate savings

# Budget breach response
Trigger: Budget > 100%
Action:
  - Email alerts to stakeholders
  - Create incident ticket
  - Stop non-production resources (optional)
```
---

## Metrics & KPIs

### Key FinOps Metrics

**1. Cost Metrics**

```yaml
Total Monthly Cloud Spend:
  Target: Within budget
  Trend: Track month-over-month

Cost per Customer:
  Calculation: Total AWS Cost / Active Customers
  Target: Decreasing over time

Cost per Transaction:
  Calculation: Total AWS Cost / Transactions Processed
  Target: Optimize for efficiency

Unit Economics:
  Calculation: Revenue per Customer - Cost per Customer
  Target: Positive and growing
```
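These unit-economics formulas compute directly from exported billing and product data. A sketch with illustrative numbers:

```python
def unit_economics(total_cost: float, revenue_per_customer: float,
                   active_customers: int) -> dict:
    """Cost per customer and per-customer margin, per the formulas above."""
    cost_per_customer = total_cost / active_customers
    return {
        "cost_per_customer": cost_per_customer,
        "margin_per_customer": revenue_per_customer - cost_per_customer,
    }

# $50k monthly AWS bill, $40/customer revenue, 2,000 active customers:
m = unit_economics(total_cost=50_000, revenue_per_customer=40.0,
                   active_customers=2_000)
# cost_per_customer = 25.0, margin_per_customer = 15.0
```

Tracking this figure month over month is what turns a raw bill into the "positive and growing" target above.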
**2. Efficiency Metrics**

```yaml
Compute Utilization:
  Metric: Average CPU utilization
  Target: 40-60% (room for burst, not over-provisioned)

Storage Utilization:
  Metric: "% of S3 in cost-optimized tiers"
  Target: ">60% in IA or Glacier tiers"

Reserved Instance Coverage:
  Metric: "% of On-Demand usage covered by RIs/SPs"
  Target: ">70% for stable workloads"

RI/SP Utilization:
  Metric: "% of RIs/SPs actually used"
  Target: ">90%"
```

**3. Operational Metrics**

```yaml
Tag Compliance:
  Metric: "% of resources with required tags"
  Target: ">95%"

Budget Variance:
  Metric: Actual vs Budget %
  Target: ±5%

Optimization Savings:
  Metric: $ saved per month from optimizations
  Target: Growing

Mean Time to Optimize (MTTO):
  Metric: Days from finding opportunity to implementing
  Target: <30 days
```

**4. Organizational Metrics**

```yaml
FinOps Engagement:
  Metric: "% of teams attending monthly reviews"
  Target: 100%

Cost Awareness:
  Survey: Do engineers know their team's monthly cost?
  Target: ">80% aware"

Optimization Velocity:
  Metric: "# optimization tasks completed per quarter"
  Target: Growing trend
```

### Dashboard Requirements

**Executive Dashboard (Monthly)**
- Total spend vs budget
- Spend by service (top 10)
- Month-over-month trend
- Forecast for next quarter
- Optimization savings achieved

**Engineering Dashboard (Real-time)**
- Per-team costs (daily)
- Cost anomaly alerts
- Untagged resources count
- Budget utilization %
- Top cost drivers

**FinOps Dashboard (Daily)**
- Detailed service costs
- Tag compliance metrics
- RI/SP utilization
- Rightsizing opportunities
- Unused resource counts
---

## Getting Started Checklist

### Phase 1: Foundation (Month 1)
- [ ] Enable Cost Explorer
- [ ] Set up AWS Budgets
- [ ] Define tagging strategy
- [ ] Activate cost allocation tags
- [ ] Set up Cost and Usage Reports (CUR)
- [ ] Create basic cost dashboard

### Phase 2: Visibility (Months 2-3)
- [ ] Implement tagging enforcement
- [ ] Run first optimization scripts
- [ ] Set up monthly review meeting
- [ ] Create team cost reports
- [ ] Assign team cost owners
- [ ] Document FinOps processes

### Phase 3: Optimization (Months 4-6)
- [ ] Implement automated resource scheduling
- [ ] Purchase first Reserved Instances
- [ ] Set up cost anomaly detection
- [ ] Automate reporting
- [ ] Train engineering teams
- [ ] Implement showback/chargeback

### Phase 4: Culture (Ongoing)
- [ ] Cost metrics in engineering KPIs
- [ ] Cost review in architecture reviews
- [ ] Regular optimization sprints
- [ ] FinOps champions in each team
- [ ] Cost-aware development practices
- [ ] Continuous improvement
---

## Resources

**AWS Native Tools**
- AWS Cost Explorer
- AWS Budgets
- AWS Cost Anomaly Detection
- AWS Compute Optimizer
- AWS Trusted Advisor
- AWS Cost & Usage Reports

**Third-Party Tools**
- CloudHealth (VMware)
- Cloudability (Apptio)
- Kubecost (Kubernetes cost monitoring)
- Spot.io (Cost optimization platform)

**FinOps Foundation**
- https://www.finops.org
- FinOps Certified Practitioner certification
- FinOps community and best practices
466
references/service_alternatives.md
Normal file
@@ -0,0 +1,466 @@
# AWS Service Alternatives - Cost Optimization Guide
|
||||||
|
|
||||||
|
When to use cheaper alternatives and cost-effective service options for common AWS services.
|
||||||
|
|
||||||
|
## Table of Contents
|
||||||
|
|
||||||
|
1. [Compute Alternatives](#compute-alternatives)
|
||||||
|
2. [Storage Alternatives](#storage-alternatives)
|
||||||
|
3. [Database Alternatives](#database-alternatives)
|
||||||
|
4. [Networking Alternatives](#networking-alternatives)
|
||||||
|
5. [Application Services](#application-services)
|
||||||
|
|
||||||
|
---

## Compute Alternatives

### EC2 vs Lambda vs Fargate

**EC2 (Most Economical for Consistent Workloads)**
- **When to use**: 24/7 workloads, predictable traffic, need full OS control
- **Cost model**: Hourly charges, cheaper with Reserved Instances
- **Best for**: Always-on applications, legacy apps, specific OS/kernel requirements
- **Example**: Web server handling steady traffic → EC2 with Reserved Instance

**Lambda (Most Economical for Intermittent Work)**
- **When to use**: Event-driven, sporadic usage, < 15 minute executions
- **Cost model**: Pay per execution and duration (GB-seconds)
- **Best for**: APIs with sporadic traffic, scheduled tasks, event processing
- **Example**: Image processing triggered by S3 upload → Lambda
- **Break-even**: ~20-30 hours/month execution time vs equivalent EC2

**Fargate (Middle Ground)**
- **When to use**: Containerized apps, variable traffic, don't want to manage servers
- **Cost model**: Pay for vCPU and memory allocated
- **Best for**: Microservices, batch jobs, variable load applications
- **Example**: Background worker that scales 0-10 containers → Fargate
- **Tip**: Fargate Spot offers up to 70% savings for fault-tolerant tasks

**Decision Matrix**
```
Consistent 24/7 load         → EC2 with Reserved Instances
Variable load, containerized → Fargate (or Fargate Spot)
Event-driven, < 15 min       → Lambda
Batch processing             → Fargate Spot or EC2 Spot
```
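The Lambda-vs-EC2 break-even is easy to estimate from list prices. A sketch using illustrative rates — roughly $0.0000166667 per GB-second plus $0.20 per million requests for Lambda, and ~$0.0084/hour for a small always-on instance; verify current regional pricing before deciding:

```python
def lambda_monthly_cost(invocations: int, avg_ms: int, memory_gb: float,
                        gb_second_rate: float = 0.0000166667,
                        request_rate: float = 0.20 / 1_000_000) -> float:
    """Approximate monthly Lambda cost (ignores the free tier)."""
    gb_seconds = invocations * (avg_ms / 1000) * memory_gb
    return gb_seconds * gb_second_rate + invocations * request_rate

def ec2_monthly_cost(hourly_rate: float, hours: float = 730) -> float:
    """Cost of an always-on instance (~730 hours in a month)."""
    return hourly_rate * hours

# 1M invocations x 200ms x 0.5 GB  vs  an always-on t4g.micro (~$0.0084/hr):
lam = lambda_monthly_cost(1_000_000, 200, 0.5)   # ≈ $1.87/month
ec2 = ec2_monthly_cost(0.0084)                   # ≈ $6.13/month
# At this volume Lambda wins; rerun with your own traffic to find the crossover.
```

Plugging in real invocation counts makes the "~20-30 hours/month of execution" rule of thumb concrete for a specific workload.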
### EC2 Instance Alternatives

**Standard vs Graviton (ARM64)**
- **Graviton Savings**: 20% cheaper for same performance
- **When to use**: Modern applications, ARM-compatible workloads
- **Alternatives**:
  - t3.large → t4g.large (20% cheaper)
  - m5.xlarge → m6g.xlarge (20% cheaper)
  - c5.2xlarge → c6g.2xlarge (20% cheaper)
- **Considerations**: Test application compatibility first

**Current vs Previous Generation**
- **Migration Savings**: 5-10% cheaper, better performance
- **Examples**:
  - t2 → t3 (10% cheaper, better performance)
  - m4 → m5 → m6i (progressive improvements)
  - c4 → c5 → c6i (better price/performance)
- **Action**: Check `detect_old_generations.py` script

**On-Demand vs Spot vs Reserved**
- **On-Demand**: $X/hour, highest cost, full flexibility
- **Spot**: 60-90% discount, can be interrupted
- **Reserved (1yr)**: 30-40% discount
- **Reserved (3yr)**: 50-65% discount
- **Decision**: Use Spot for fault-tolerant workloads, Reserved Instances for predictable ones, and On-Demand for the rest
---

## Storage Alternatives

### S3 Storage Classes

**Frequently Accessed Data**
```
S3 Standard → $0.023/GB/month
Use when: Accessing files multiple times per month
```

**Infrequently Accessed Data**
```
S3 Standard → S3 Standard-IA
$0.023/GB/month → $0.0125/GB/month (46% cheaper)
Retrieval cost: $0.01/GB
Break-even: fewer than ~1 full retrieval per month
Use when: Backups, disaster recovery, infrequently accessed files
```

**Unknown Access Patterns**
```
S3 Standard → S3 Intelligent-Tiering
$0.023/GB/month → Automatic optimization
Extra cost: $0.0025 per 1000 objects monitored
Use when: Unclear access patterns, don't want to manage lifecycle
Best for: Mixed workloads, analytics datasets
```

**Archive Storage**
```
S3 Standard → S3 Glacier Instant Retrieval
$0.023/GB → $0.004/GB (83% cheaper)
Retrieval: Milliseconds, $0.03/GB
Use when: Archive with immediate access needs (e.g., medical records)

S3 Standard → S3 Glacier Flexible Retrieval
$0.023/GB → $0.0036/GB (84% cheaper)
Retrieval: Minutes to hours, $0.01/GB
Use when: Archive data, acceptable retrieval delay

S3 Standard → S3 Glacier Deep Archive
$0.023/GB → $0.00099/GB (96% cheaper)
Retrieval: 12 hours, $0.02/GB
Use when: Long-term archive, regulatory compliance, rarely accessed
```

**Decision Tree**
```
Accessed daily             → S3 Standard
Accessed monthly           → S3 Standard-IA
Unknown pattern            → S3 Intelligent-Tiering
Archive, instant access    → Glacier Instant Retrieval
Archive, can wait hours    → Glacier Flexible Retrieval
Archive, can wait 12 hours → Glacier Deep Archive
```
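The Standard-vs-Standard-IA break-even falls directly out of those prices. A sketch comparing monthly cost per GB as a function of how often the data is fully read back (rates taken from the tables above):

```python
def s3_monthly_cost_per_gb(tier: str, retrievals_per_month: float) -> float:
    """Monthly $/GB for a tier: storage plus retrieval, using the listed prices."""
    rates = {                 # (storage $/GB-month, retrieval $/GB)
        "standard":    (0.023, 0.0),
        "standard_ia": (0.0125, 0.01),
    }
    storage, retrieval = rates[tier]
    return storage + retrieval * retrievals_per_month

# Below ~1 full retrieval per month Standard-IA wins; above it, Standard does.
assert s3_monthly_cost_per_gb("standard_ia", 0.5) < s3_monthly_cost_per_gb("standard", 0.5)
assert s3_monthly_cost_per_gb("standard_ia", 2.0) > s3_monthly_cost_per_gb("standard", 2.0)
```

The same two-tuple table extends naturally to the Glacier tiers if you add their storage and retrieval rates.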
### EBS Volume Types

**General Purpose Volumes**
```
gp2 → gp3
$0.10/GB → $0.08/GB (20% cheaper)
Additional benefits: Configurable IOPS/throughput independent of size
Action: Convert all gp2 to gp3 (no downtime required)
```

**High Performance Workloads**
```
io1 → io2
Same price, better durability and IOPS
io2 Block Express: For highest performance needs

Consider: Do you really need provisioned IOPS?
Many workloads perform fine on gp3 (up to 16,000 IOPS)
Test gp3 before committing to io2
```

**Throughput-Optimized Workloads**
```
gp3 → st1 (Throughput Optimized HDD)
$0.08/GB → $0.045/GB (44% cheaper)
Use when: Big data, data warehouses, log processing
Sequential access patterns, throughput more important than IOPS
```

**Cold Data**
```
gp3 → sc1 (Cold HDD)
$0.08/GB → $0.015/GB (81% cheaper)
Use when: Infrequently accessed data, lowest cost priority
Example: Archive storage, cold backups
```
### EFS vs S3 vs EBS

**S3 (Cheapest for Object Storage)**
- **Cost**: $0.023/GB/month (Standard)
- **When to use**: Object storage, static files, backups
- **Pros**: Unlimited scale, integrates with everything
- **Cons**: Not a file system, higher latency

**EBS (Best for Single-Instance Block Storage)**
- **Cost**: $0.08/GB/month (gp3)
- **When to use**: Boot volumes, database storage, single EC2 instance
- **Pros**: High performance, low latency
- **Cons**: Single-AZ, attached to one instance

**EFS (File System Across Multiple Instances)**
- **Cost**: $0.30/GB/month (Standard), $0.016/GB/month (IA)
- **When to use**: Shared file storage across multiple instances
- **Pros**: Multi-AZ, grows automatically, NFSv4
- **Cons**: More expensive than EBS
- **Optimization**: Use EFS Intelligent-Tiering to auto-move to IA class

**Decision Matrix**
```
Single instance, block storage   → EBS
Multiple instances, shared files → EFS (with Intelligent-Tiering)
Object storage, static files     → S3
Large data, high throughput      → FSx for Lustre
Windows file shares              → FSx for Windows
```

---

## Database Alternatives

### RDS vs Aurora vs Self-Managed

**RDS PostgreSQL/MySQL (Baseline)**
- **Cost**: Instance + storage
- **When to use**: Standard relational DB needs
- **Example**: db.t3.medium = ~$60/month + storage

**Aurora PostgreSQL/MySQL (2-3x RDS Cost)**
- **Cost**: Instance + storage + I/O charges
- **When to use**: Need high availability, auto-scaling storage, read replicas
- **Pros**: Better performance, automatic failover, up to 15 read replicas
- **Cons**: More expensive
- **Break-even**: High read traffic, need fast replication

**Aurora Serverless v2 (Variable Workloads)**
- **Cost**: Pay per ACU (Aurora Capacity Unit) per second
- **When to use**: Variable load, dev/test, infrequent usage
- **Example**: Dev database used 8 hours/day → 67% savings vs always-on
- **Limitation**: Minimum capacity charges apply

**Self-Managed on EC2 (Cheapest for Experts)**
- **Cost**: Just EC2 + EBS costs
- **When to use**: Full control needed, specific configuration, cost-sensitive
- **Pros**: Can be 50-70% cheaper than RDS
- **Cons**: You manage backups, patching, HA, monitoring
- **Consideration**: Factor in operational overhead

**Decision Matrix**
```
Standard workload, managed preferred → RDS
High availability, many reads        → Aurora
Variable workload                    → Aurora Serverless v2
Cost-sensitive, have DBA expertise   → Self-managed on EC2
Dev/test, intermittent use           → Aurora Serverless v2
```

### DynamoDB Pricing Models

**On-Demand (Unpredictable Traffic)**
- **Cost**: $1.25 per million writes, $0.25 per million reads
- **When to use**: Variable traffic, new applications, spiky workloads
- **Pros**: No capacity planning, scales automatically
- **Example**: New API with unknown traffic pattern

**Provisioned Capacity (Predictable Traffic)**
- **Cost**: $0.00065 per WCU/hour, $0.00013 per RCU/hour
- **When to use**: Predictable traffic patterns
- **Savings**: 60-80% cheaper than on-demand at consistent usage
- **Example**: Application with steady 100 req/sec

**Reserved Capacity (Long-term Commitment)**
- **Cost**: Additional 30-50% discount on provisioned capacity
- **When to use**: Known long-term capacity needs
- **Commitment**: 1-3 years

**Break-Even Calculation**
```
On-Demand:   $1.25 per million writes
Provisioned: ~$0.47 per million writes (at capacity)
Break-even:  ~65% consistent utilization

Action: Start with on-demand, switch to provisioned once patterns clear
```
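That comparison can be scripted against your own traffic numbers. A sketch using the listed write prices — verify current rates, and note that the per-million cost of provisioned capacity depends heavily on how busy the provisioned WCUs actually are:

```python
def on_demand_write_cost(writes_per_month: float, per_million: float = 1.25) -> float:
    """On-demand write cost at the listed $1.25 per million writes."""
    return writes_per_month / 1_000_000 * per_million

def provisioned_write_cost(wcus: int, wcu_hour_rate: float = 0.00065,
                           hours_per_month: float = 730) -> float:
    """Provisioned write cost: WCUs are billed per hour whether used or not."""
    return wcus * wcu_hour_rate * hours_per_month

# Steady 100 writes/sec, fully utilizing 100 provisioned WCUs:
writes = 100 * 3600 * 730            # ≈ 263M writes/month
od = on_demand_write_cost(writes)    # ≈ $328/month
prov = provisioned_write_cost(100)   # ≈ $47/month
# At full utilization provisioned is far cheaper per write; the ~$0.47/million
# figure above reflects capacity running well below 100% busy.
```

Rerunning with peak-vs-average traffic shows why spiky workloads stay on on-demand: idle provisioned WCUs still accrue the hourly charge.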
### Database Migration Options

**From Commercial to Open Source**
```
Oracle → Aurora PostgreSQL or RDS PostgreSQL
Savings: 90% on licensing costs
Consider: PostgreSQL compatibility, migration effort

SQL Server → Aurora PostgreSQL or RDS PostgreSQL/MySQL
Savings: 50-90% on licensing costs
Consider: Application compatibility, migration effort
```

**From RDS to Aurora**
```
Only if: High availability requirements, many read replicas needed
Cost increase: 20-50% more
Benefit: Better performance, automatic failover, scaling
```

**From Aurora to RDS**
```
When: Don't need Aurora features, cost-conscious
Savings: 20-50%
Downgrade if: Single-AZ sufficient, limited read replicas needed
```
---

## Networking Alternatives

### NAT Gateway Alternatives

**NAT Gateway (Default, Expensive)**
- **Cost**: $32.85/month + $0.045/GB processed
- **When to use**: Production, high availability, easy management

**VPC Endpoints (Cheaper for AWS Services)**
- **Gateway Endpoint (S3, DynamoDB)**: FREE
- **Interface Endpoint**: $7.20/month + $0.01/GB
- **When to use**: Accessing S3, DynamoDB, or other AWS services
- **Savings**: $25-30/month vs NAT Gateway
- **Example**: Lambda accessing S3 → Use S3 Gateway Endpoint

**NAT Instance (Cheapest, More Work)**
- **Cost**: Just EC2 cost (e.g., t3.micro = $7.50/month)
- **When to use**: Dev/test, cost-sensitive, low traffic
- **Cons**: Must manage, less resilient, manual HA setup
- **Savings**: 75% vs NAT Gateway

**Decision Matrix**
```
S3 or DynamoDB only           → Gateway Endpoint (FREE)
Other AWS services            → Interface Endpoint
Production, high availability → NAT Gateway
Dev/test, low traffic         → NAT Instance or single NAT Gateway
```
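A quick model of the egress options above, using the listed prices (traffic is GB/month through the device; a Gateway Endpoint for S3/DynamoDB would simply be $0):

```python
def nat_gateway_cost(gb_per_month: float) -> float:
    """Monthly NAT Gateway cost: fixed hourly charge plus per-GB processing."""
    return 32.85 + 0.045 * gb_per_month

def interface_endpoint_cost(gb_per_month: float) -> float:
    """Monthly Interface Endpoint cost (one endpoint, one AZ)."""
    return 7.20 + 0.01 * gb_per_month

# At 500 GB/month: gateway ≈ $55.35, interface endpoint ≈ $12.20.
gateway = nat_gateway_cost(500)
endpoint = interface_endpoint_cost(500)
```

Because both options carry a fixed monthly floor, the comparison only shifts with per-GB traffic — the endpoint stays cheaper at every volume for traffic it can serve.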
### Load Balancer Alternatives

**Application Load Balancer (ALB)**
- **Cost**: $16.20/month + LCU charges
- **When to use**: HTTP/HTTPS, path-based routing, microservices
- **Features**: Layer 7, content-based routing, Lambda targets

**Network Load Balancer (NLB)**
- **Cost**: $22.35/month + LCU charges
- **When to use**: TCP/UDP, extreme performance, static IPs
- **Use case**: Non-HTTP protocols, high throughput

**Classic Load Balancer (Legacy)**
- **Cost**: $18/month + data charges
- **Recommendation**: Migrate to ALB or NLB (better features, often cheaper)

**CloudFront + S3 (Static Content)**
- **Cost**: Much cheaper for static content
- **When to use**: Static website, single-page app
- **Setup**: S3 static hosting + CloudFront distribution
- **Savings**: 90% vs ALB for static content

**API Gateway (REST APIs)**
- **Cost**: Pay per request
- **When to use**: REST API, need API management features
- **Alternative to**: ALB for simple APIs

---
|
||||||
|
|
||||||
|
## Application Services
|
||||||
|
|
||||||
|
### Message Queue Alternatives
|
||||||
|
|
||||||
|
**SQS vs SNS vs EventBridge vs Kinesis**
|
||||||
|
|
||||||
|
**SQS (Point-to-Point, Cheapest)**
|
||||||
|
- **Cost**: $0.40 per million requests (Standard), $0.50 (FIFO)
|
||||||
|
- **When to use**: Work queues, decoupling services
|
||||||
|
- **Best for**: Job processing, task queues
|
||||||
|
|
||||||
|
**SNS (Pub/Sub, Cheap)**
|
||||||
|
- **Cost**: $0.50 per million publishes
|
||||||
|
- **When to use**: Fan-out notifications, multiple subscribers
|
||||||
|
- **Best for**: Notifications, multiple consumers
|
||||||
|
|
||||||
|
**EventBridge (Event Router)**
|
||||||
|
- **Cost**: $1.00 per million events
|
||||||
|
- **When to use**: Event-driven architecture, complex routing
|
||||||
|
- **Best for**: Cross-account events, SaaS integrations
|
||||||
|
|
||||||
|
**Kinesis (Streaming, Expensive)**
|
||||||
|
- **Cost**: $0.015 per shard-hour + PUT charges
|
||||||
|
- **When to use**: Real-time streaming, ordered processing
|
||||||
|
- **Best for**: Logs, analytics, real-time processing
|
||||||
|
- **Alternative**: Kinesis Data Firehose (simpler, cheaper for basic needs)
|
||||||
|
|
||||||
|
**Decision Matrix**
|
||||||
|
```
|
||||||
|
Simple queue → SQS
|
||||||
|
Multiple consumers → SNS
|
||||||
|
Complex event routing → EventBridge
|
||||||
|
Real-time streaming → Kinesis
|
||||||
|
Log aggregation → Kinesis Firehose
|
||||||
|
```
|
||||||
|
|
||||||
|
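The per-request prices above translate directly into monthly estimates. A minimal sketch using the list rates quoted in this section (real bills also depend on payload size multipliers and data transfer):

```python
# Rough monthly cost estimate for the messaging options above.
# Prices per million requests/events are the list rates quoted in this
# section; actual bills also depend on payload size and data transfer.
PRICE_PER_MILLION = {
    "sqs_standard": 0.40,
    "sqs_fifo": 0.50,
    "sns": 0.50,
    "eventbridge": 1.00,
}

def monthly_messaging_cost(service: str, requests_per_month: int) -> float:
    """Estimate monthly cost for a given request volume."""
    return PRICE_PER_MILLION[service] * requests_per_month / 1_000_000

# 500M messages/month: SQS Standard vs EventBridge
for svc in ("sqs_standard", "eventbridge"):
    print(f"{svc}: ${monthly_messaging_cost(svc, 500_000_000):.2f}/month")
```

At high volumes the 2.5x gap between SQS and EventBridge matters, which is why the matrix reserves EventBridge for routing problems SQS cannot express.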
### Container Orchestration

**ECS vs EKS vs Fargate**

**ECS on EC2 (Cheapest)**
- **Cost**: Just EC2 costs (no ECS fee)
- **When to use**: AWS-native, simpler workloads
- **Best for**: Cost-sensitive, AWS-specific deployments

**ECS on Fargate (Serverless, Easy)**
- **Cost**: Pay per task (vCPU + memory)
- **When to use**: Variable load, no desire to manage servers
- **Best for**: Variable workloads, simpler operations

**EKS (Kubernetes, Expensive)**
- **Cost**: $73/month per cluster + node costs
- **When to use**: Need Kubernetes, multi-cloud, complex deployments
- **Best for**: Teams with Kubernetes expertise that need the K8s ecosystem
- **Tip**: Consolidate workloads onto fewer clusters

**Decision Matrix**

```
AWS-native, cost-sensitive     → ECS on EC2
Variable load, easy management → ECS on Fargate
Need Kubernetes                → EKS
Multiple environments          → Consider a single EKS cluster with namespaces
```

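The "pay per task" model can be compared against a flat EC2 host with simple arithmetic. A sketch, where the Fargate per-vCPU-hour and per-GB-hour rates are approximate us-east-1 list prices and an assumption of this example:

```python
# Compare ECS-on-Fargate task pricing with a flat EC2 host.
# Fargate rates below are approximate us-east-1 list prices
# (an assumption of this sketch; check current pricing).
FARGATE_VCPU_HOUR = 0.04048   # $/vCPU-hour
FARGATE_GB_HOUR = 0.004445    # $/GB-hour
HOURS_PER_MONTH = 730

def fargate_monthly(vcpu: float, memory_gb: float,
                    hours: float = HOURS_PER_MONTH) -> float:
    """Monthly cost of one Fargate task at the given size and runtime."""
    return (vcpu * FARGATE_VCPU_HOUR + memory_gb * FARGATE_GB_HOUR) * hours

# A 1 vCPU / 2 GB task running 24x7 vs an m5.large ($0.096/hr)
print(f"Fargate 1vCPU/2GB, always-on: ${fargate_monthly(1, 2):.2f}/month")
print(f"m5.large, always-on:          ${0.096 * HOURS_PER_MONTH:.2f}/month")
# Fargate wins when the task runs only part of the month:
print(f"Fargate 1vCPU/2GB, 8h/day:    ${fargate_monthly(1, 2, 8 * 30):.2f}/month")
```

This is the intuition behind the matrix: steady 24x7 load favors ECS on EC2, bursty or part-time load favors Fargate.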
---

## Quick Reference: When to Switch

### Immediate Actions (Low Risk)
- [ ] gp2 → gp3 (20% savings, no downtime)
- [ ] S3 Standard → Intelligent-Tiering (automatic optimization)
- [ ] NAT Gateway → VPC Endpoints for S3/DynamoDB (gateway endpoints are free)
- [ ] Old-generation instances → new generation (10-20% savings)
- [ ] Intel → Graviton (20% savings, test first)

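The gp2 → gp3 line item is easy to quantify before migrating. A sketch using per-GB-month rates that are us-east-1 list prices and an assumption of this example (the migration itself is a single `aws ec2 modify-volume --volume-type gp3` call per volume, with no detach or downtime):

```python
# Estimate monthly savings from migrating gp2 volumes to gp3.
# $/GB-month rates are us-east-1 list prices (assumption of this sketch).
GP2_GB_MONTH = 0.10
GP3_GB_MONTH = 0.08  # gp3 also includes 3,000 IOPS / 125 MB/s baseline

def gp3_migration_savings(total_gp2_gb: int) -> float:
    """Monthly savings from converting gp2 capacity to gp3."""
    return total_gp2_gb * (GP2_GB_MONTH - GP3_GB_MONTH)

# A fleet with 10 TB of gp2 storage
print(f"Savings: ${gp3_migration_savings(10_000):.2f}/month")
```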
### Medium Effort Actions
- [ ] On-Demand → Reserved Instances/Savings Plans (40-65% savings)
- [ ] Always-on EC2 → Lambda for intermittent work
- [ ] S3 Standard → Lifecycle policies (50-95% savings on old data)
- [ ] RDS On-Demand → Reserved Instances (40-65% savings)
- [ ] DynamoDB On-Demand → Provisioned (60-80% savings if traffic is predictable)

### High Effort Actions (Evaluate Carefully)
- [ ] RDS → Aurora (usually more expensive; only if you need Aurora features)
- [ ] Aurora → RDS (20-50% savings if you don't need Aurora features)
- [ ] Commercial DB → PostgreSQL (up to 90% savings, significant migration effort)
- [ ] EC2 → Lambda (case-by-case; break-even analysis needed)
- [ ] ECS → EKS (usually more expensive; only if you need Kubernetes)

---

## Cost Comparison Tool

Use this mental model when evaluating alternatives:

```
1. Calculate current monthly cost
2. Calculate alternative monthly cost
3. Estimate migration effort (hours × hourly rate)
4. Calculate payback period: Migration Cost / Monthly Savings
5. Decide: Payback < 3 months → Likely worth it
          Payback > 6 months → Evaluate carefully
```

**Example:**
```
Current:     ALB for static site  = $20/month
Alternative: CloudFront + S3      = $2/month
Savings:     $18/month
Migration:   4 hours × $100/hour  = $400
Payback:     $400 / $18 ≈ 22 months → Probably not worth it

But if: Multiple sites, reusable pattern → Worth the investment
```

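The payback model above can be captured as a small helper; the thresholds mirror the 3-month/6-month rules of thumb in this section:

```python
# Payback-period calculator for the migration decision model above.
def payback_months(current_monthly: float, alternative_monthly: float,
                   migration_hours: float, hourly_rate: float) -> float:
    """Months until the one-time migration cost is recovered by savings."""
    monthly_savings = current_monthly - alternative_monthly
    if monthly_savings <= 0:
        return float("inf")  # alternative is not cheaper; never pays back
    return (migration_hours * hourly_rate) / monthly_savings

def verdict(months: float) -> str:
    """Apply the 3-month / 6-month rules of thumb."""
    if months < 3:
        return "Likely worth it"
    if months > 6:
        return "Evaluate carefully"
    return "Borderline"

# The ALB -> CloudFront + S3 example from above
m = payback_months(20, 2, 4, 100)
print(f"Payback: {m:.1f} months -> {verdict(m)}")
```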
347
scripts/analyze_ri_recommendations.py
Executable file
@@ -0,0 +1,347 @@
#!/usr/bin/env python3
"""
Analyze EC2 and RDS usage patterns to recommend Reserved Instances.

This script:
- Identifies consistently running EC2 instances
- Calculates potential savings with Reserved Instances
- Recommends RI types (Standard vs Convertible) and commitment levels (1yr vs 3yr)
- Analyzes RDS instances for RI opportunities

Usage:
    python3 analyze_ri_recommendations.py [--region REGION] [--profile PROFILE] [--days DAYS]

Requirements:
    pip install boto3 tabulate
"""

import argparse
import sys
from collections import defaultdict
from datetime import datetime
from typing import Dict, List

import boto3
from tabulate import tabulate


class RIAnalyzer:
    def __init__(self, profile: str = None, region: str = None, days: int = 30):
        self.session = boto3.Session(profile_name=profile) if profile else boto3.Session()
        self.regions = [region] if region else self._get_all_regions()
        self.days = days
        self.recommendations = {
            'ec2': [],
            'rds': []
        }
        self.total_potential_savings = 0.0

        # Simplified RI discount rates (actual rates vary by region/instance type)
        self.ri_discounts = {
            '1yr_no_upfront': 0.40,       # ~40% savings
            '1yr_partial_upfront': 0.42,  # ~42% savings
            '1yr_all_upfront': 0.43,      # ~43% savings
            '3yr_no_upfront': 0.60,       # ~60% savings
            '3yr_partial_upfront': 0.62,  # ~62% savings
            '3yr_all_upfront': 0.63       # ~63% savings
        }

    def _get_all_regions(self) -> List[str]:
        """Get all enabled AWS regions."""
        ec2 = self.session.client('ec2', region_name='us-east-1')
        regions = ec2.describe_regions(AllRegions=False)
        return [region['RegionName'] for region in regions['Regions']]

    def _estimate_hourly_cost(self, instance_type: str) -> float:
        """Rough estimate of On-Demand hourly cost."""
        cost_map = {
            't2.micro': 0.0116, 't2.small': 0.023, 't2.medium': 0.0464,
            't3.micro': 0.0104, 't3.small': 0.0208, 't3.medium': 0.0416,
            't3.large': 0.0832, 't3.xlarge': 0.1664, 't3.2xlarge': 0.3328,
            'm5.large': 0.096, 'm5.xlarge': 0.192, 'm5.2xlarge': 0.384,
            'm5.4xlarge': 0.768, 'm5.8xlarge': 1.536, 'm5.12xlarge': 2.304,
            'c5.large': 0.085, 'c5.xlarge': 0.17, 'c5.2xlarge': 0.34,
            'c5.4xlarge': 0.68, 'c5.9xlarge': 1.53, 'c5.18xlarge': 3.06,
            'r5.large': 0.126, 'r5.xlarge': 0.252, 'r5.2xlarge': 0.504,
            'r5.4xlarge': 1.008, 'r5.8xlarge': 2.016, 'r5.12xlarge': 3.024,
        }

        # Default fallback based on instance family
        if instance_type not in cost_map:
            family = instance_type.split('.')[0]
            family_defaults = {'t2': 0.02, 't3': 0.02, 'm5': 0.10, 'c5': 0.09, 'r5': 0.13}
            return family_defaults.get(family, 0.10)

        return cost_map[instance_type]

    def _calculate_savings(self, hourly_cost: float, hours_running: float) -> Dict[str, Dict[str, float]]:
        """Calculate potential savings with different RI options."""
        monthly_od_cost = hourly_cost * hours_running

        savings = {}
        for ri_type, discount in self.ri_discounts.items():
            monthly_ri_cost = monthly_od_cost * (1 - discount)
            monthly_savings = monthly_od_cost - monthly_ri_cost
            savings[ri_type] = {
                'monthly_cost': monthly_ri_cost,
                'monthly_savings': monthly_savings,
                'annual_savings': monthly_savings * 12
            }

        return savings

    def analyze_ec2_instances(self):
        """Analyze EC2 instances for RI opportunities."""
        print(f"\n[1/2] Analyzing EC2 instances (last {self.days} days)...")

        # Group instances by type and platform
        instance_groups = defaultdict(lambda: {'count': 0, 'instances': []})

        for region in self.regions:
            try:
                ec2 = self.session.client('ec2', region_name=region)

                instances = ec2.describe_instances(
                    Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
                )

                for reservation in instances['Reservations']:
                    for instance in reservation['Instances']:
                        instance_id = instance['InstanceId']
                        instance_type = instance['InstanceType']
                        platform = instance.get('Platform', 'Linux/UNIX')

                        # Treat instances launched before the analysis window as
                        # consistently running, and therefore RI candidates
                        launch_time = instance['LaunchTime']
                        days_running = (datetime.now(launch_time.tzinfo) - launch_time).days

                        if days_running >= self.days:
                            key = f"{instance_type}_{platform}_{region}"
                            instance_groups[key]['count'] += 1
                            instance_groups[key]['instances'].append({
                                'id': instance_id,
                                'type': instance_type,
                                'platform': platform,
                                'region': region,
                                'days_running': days_running
                            })

            except Exception as e:
                print(f"  Error scanning {region}: {str(e)}")

        # Generate recommendations
        for key, data in instance_groups.items():
            if data['count'] > 0:
                sample = data['instances'][0]
                instance_type = sample['type']
                platform = sample['platform']
                region = sample['region']
                count = data['count']

                hourly_cost = self._estimate_hourly_cost(instance_type)
                hours_per_month = 730  # Average hours in a month
                savings = self._calculate_savings(hourly_cost, hours_per_month * count)

                # Recommend best option (3yr all upfront for max savings)
                best_option = savings['3yr_all_upfront']
                self.total_potential_savings += best_option['annual_savings']

                self.recommendations['ec2'].append({
                    'Region': region,
                    'Instance Type': instance_type,
                    'Platform': platform,
                    'Count': count,
                    'Current Monthly Cost': f"${hourly_cost * hours_per_month * count:.2f}",
                    '1yr Savings (monthly)': f"${savings['1yr_all_upfront']['monthly_savings']:.2f}",
                    '3yr Savings (monthly)': f"${savings['3yr_all_upfront']['monthly_savings']:.2f}",
                    'Annual Savings (3yr)': f"${best_option['annual_savings']:.2f}",
                    'Recommendation': '3yr Standard RI (All Upfront)'
                })

        print(f"  Found {len(self.recommendations['ec2'])} RI opportunities")

    def analyze_rds_instances(self):
        """Analyze RDS instances for RI opportunities."""
        print(f"\n[2/2] Analyzing RDS instances (last {self.days} days)...")

        instance_groups = defaultdict(lambda: {'count': 0, 'instances': []})

        for region in self.regions:
            try:
                rds = self.session.client('rds', region_name=region)
                instances = rds.describe_db_instances()

                for instance in instances['DBInstances']:
                    instance_id = instance['DBInstanceIdentifier']
                    instance_class = instance['DBInstanceClass']
                    engine = instance['Engine']
                    multi_az = instance['MultiAZ']

                    # Check if instance has been running for the analysis period
                    create_time = instance['InstanceCreateTime']
                    days_running = (datetime.now(create_time.tzinfo) - create_time).days

                    if days_running >= self.days:
                        key = f"{instance_class}_{engine}_{multi_az}_{region}"
                        instance_groups[key]['count'] += 1
                        instance_groups[key]['instances'].append({
                            'id': instance_id,
                            'class': instance_class,
                            'engine': engine,
                            'multi_az': multi_az,
                            'region': region,
                            'days_running': days_running
                        })

            except Exception as e:
                print(f"  Error scanning {region}: {str(e)}")

        # Generate recommendations
        for key, data in instance_groups.items():
            if data['count'] > 0:
                sample = data['instances'][0]
                instance_class = sample['class']
                engine = sample['engine']
                multi_az = sample['multi_az']
                region = sample['region']
                count = data['count']

                # RDS pricing is roughly 2x EC2 for the same instance size;
                # this is a rough approximation
                base_hourly = self._estimate_hourly_cost(instance_class.replace('db.', ''))
                hourly_cost = base_hourly * 2
                if multi_az:
                    hourly_cost *= 2  # Multi-AZ doubles the cost

                hours_per_month = 730
                savings = self._calculate_savings(hourly_cost, hours_per_month * count)

                best_option = savings['3yr_all_upfront']
                self.total_potential_savings += best_option['annual_savings']

                self.recommendations['rds'].append({
                    'Region': region,
                    'Instance Class': instance_class,
                    'Engine': engine,
                    'Multi-AZ': 'Yes' if multi_az else 'No',
                    'Count': count,
                    'Current Monthly Cost': f"${hourly_cost * hours_per_month * count:.2f}",
                    '1yr Savings (monthly)': f"${savings['1yr_all_upfront']['monthly_savings']:.2f}",
                    '3yr Savings (monthly)': f"${savings['3yr_all_upfront']['monthly_savings']:.2f}",
                    'Annual Savings (3yr)': f"${best_option['annual_savings']:.2f}",
                    'Recommendation': '3yr Standard RI (All Upfront)'
                })

        print(f"  Found {len(self.recommendations['rds'])} RI opportunities")

    def print_report(self):
        """Print RI recommendations report."""
        print("\n" + "=" * 100)
        print("RESERVED INSTANCE RECOMMENDATIONS")
        print("=" * 100)

        if self.recommendations['ec2']:
            print("\nEC2 RESERVED INSTANCE OPPORTUNITIES")
            print("-" * 100)
            print(tabulate(self.recommendations['ec2'], headers='keys', tablefmt='grid'))

        if self.recommendations['rds']:
            print("\nRDS RESERVED INSTANCE OPPORTUNITIES")
            print("-" * 100)
            print(tabulate(self.recommendations['rds'], headers='keys', tablefmt='grid'))

        print("\n" + "=" * 100)
        print(f"TOTAL ANNUAL SAVINGS POTENTIAL: ${self.total_potential_savings:.2f}")
        print("=" * 100)

        print("\n\nRECOMMENDATIONS:")
        print("- Standard RIs offer the highest discount but no flexibility to change instance type")
        print("- Consider Convertible RIs if you need flexibility (slightly lower discount)")
        print("- All Upfront payment offers maximum savings")
        print("- Partial Upfront balances savings with cash flow")
        print("- No Upfront minimizes initial cost but reduces savings")
        print("\nNEXT STEPS:")
        print("1. Review workload stability and growth projections")
        print("2. Compare RI costs with Savings Plans for additional flexibility")
        print("3. Purchase RIs through the AWS Console or CLI")
        print("4. Monitor RI utilization to ensure maximum benefit")

    def run(self):
        """Run RI analysis."""
        print("Analyzing AWS resources for RI opportunities...")
        print(f"Looking at instances running for at least {self.days} days")
        print(f"Scanning {len(self.regions)} region(s)...\n")

        self.analyze_ec2_instances()
        self.analyze_rds_instances()

        self.print_report()


def main():
    parser = argparse.ArgumentParser(
        description='Analyze AWS resources for Reserved Instance opportunities',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Analyze all regions with default profile
  python3 analyze_ri_recommendations.py

  # Analyze specific region for instances running 60+ days
  python3 analyze_ri_recommendations.py --region us-east-1 --days 60

  # Use named profile
  python3 analyze_ri_recommendations.py --profile production
"""
    )

    parser.add_argument('--region', help='AWS region (default: all regions)')
    parser.add_argument('--profile', help='AWS profile name (default: default profile)')
    parser.add_argument('--days', type=int, default=30,
                        help='Minimum days instance must be running (default: 30)')

    args = parser.parse_args()

    try:
        analyzer = RIAnalyzer(
            profile=args.profile,
            region=args.region,
            days=args.days
        )
        analyzer.run()
    except Exception as e:
        print(f"Error: {str(e)}", file=sys.stderr)
        sys.exit(1)


if __name__ == '__main__':
    main()
382
scripts/cost_anomaly_detector.py
Executable file
@@ -0,0 +1,382 @@
#!/usr/bin/env python3
"""
Detect cost anomalies and unusual spending patterns in AWS.

This script:
- Analyzes Cost Explorer data for spending trends
- Detects anomalies and unexpected cost increases
- Identifies top cost drivers
- Compares period-over-period spending

Usage:
    python3 cost_anomaly_detector.py [--profile PROFILE] [--days DAYS]

Requirements:
    pip install boto3 tabulate
"""

import argparse
import sys
from collections import defaultdict
from datetime import datetime, timedelta

import boto3
from tabulate import tabulate


class CostAnomalyDetector:
    def __init__(self, profile: str = None, days: int = 30):
        self.session = boto3.Session(profile_name=profile) if profile else boto3.Session()
        self.days = days
        self.ce = self.session.client('ce', region_name='us-east-1')  # Cost Explorer is global

        self.findings = {
            'anomalies': [],
            'top_services': [],
            'trend_analysis': []
        }

        # Anomaly detection threshold
        self.anomaly_threshold = 1.5  # 50% increase over baseline triggers an alert

    def _get_date_range(self, days: int) -> tuple:
        """Get start and end dates for analysis."""
        end = datetime.now().date()
        start = end - timedelta(days=days)
        return start.strftime('%Y-%m-%d'), end.strftime('%Y-%m-%d')

    def analyze_daily_costs(self):
        """Analyze daily cost trends."""
        print(f"\n[1/4] Analyzing daily costs (last {self.days} days)...")

        start_date, end_date = self._get_date_range(self.days)

        try:
            response = self.ce.get_cost_and_usage(
                TimePeriod={'Start': start_date, 'End': end_date},
                Granularity='DAILY',
                Metrics=['UnblendedCost'],
                GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
            )

            # Aggregate daily costs
            daily_totals = defaultdict(float)
            service_costs = defaultdict(lambda: defaultdict(float))

            for result in response['ResultsByTime']:
                date = result['TimePeriod']['Start']
                for group in result['Groups']:
                    service = group['Keys'][0]
                    cost = float(group['Metrics']['UnblendedCost']['Amount'])

                    daily_totals[date] += cost
                    service_costs[service][date] = cost

            # Detect daily anomalies
            dates = sorted(daily_totals.keys())
            if len(dates) > 7:
                # Calculate baseline (average of the first week)
                baseline = sum(daily_totals[d] for d in dates[:7]) / 7

                for date in dates[7:]:
                    daily_cost = daily_totals[date]
                    if daily_cost > baseline * self.anomaly_threshold:
                        increase_pct = ((daily_cost - baseline) / baseline) * 100

                        # Find which service caused the spike
                        top_service = max(
                            ((svc, service_costs[svc][date]) for svc in service_costs),
                            key=lambda x: x[1]
                        )

                        self.findings['anomalies'].append({
                            'Date': date,
                            'Daily Cost': f"${daily_cost:.2f}",
                            'Baseline': f"${baseline:.2f}",
                            'Increase': f"+{increase_pct:.1f}%",
                            'Top Service': top_service[0],
                            'Service Cost': f"${top_service[1]:.2f}",
                            'Severity': 'High' if increase_pct > 100 else 'Medium'
                        })

            print(f"  Detected {len(self.findings['anomalies'])} cost anomalies")

        except Exception as e:
            print(f"  Error analyzing daily costs: {str(e)}")

    def analyze_top_services(self):
        """Identify top cost drivers."""
        print("\n[2/4] Analyzing top cost drivers...")

        start_date, end_date = self._get_date_range(self.days)

        try:
            response = self.ce.get_cost_and_usage(
                TimePeriod={'Start': start_date, 'End': end_date},
                Granularity='MONTHLY',
                Metrics=['UnblendedCost'],
                GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
            )

            service_totals = {}
            for result in response['ResultsByTime']:
                for group in result['Groups']:
                    service = group['Keys'][0]
                    cost = float(group['Metrics']['UnblendedCost']['Amount'])
                    service_totals[service] = service_totals.get(service, 0) + cost

            # Get top 10 services
            sorted_services = sorted(service_totals.items(), key=lambda x: x[1], reverse=True)[:10]

            total_cost = sum(service_totals.values())

            for service, cost in sorted_services:
                percentage = (cost / total_cost * 100) if total_cost > 0 else 0

                self.findings['top_services'].append({
                    'Service': service,
                    'Cost': f"${cost:.2f}",
                    'Percentage': f"{percentage:.1f}%",
                    'Daily Average': f"${cost / self.days:.2f}"
                })

            print(f"  Identified top {len(self.findings['top_services'])} cost drivers")

        except Exception as e:
            print(f"  Error analyzing top services: {str(e)}")

    def compare_periods(self):
        """Compare current period with previous period."""
        print("\n[3/4] Comparing cost trends...")

        # Current period
        current_end = datetime.now().date()
        current_start = current_end - timedelta(days=self.days)

        # Previous period
        previous_end = current_start - timedelta(days=1)
        previous_start = previous_end - timedelta(days=self.days)

        try:
            # Get current period costs
            current_response = self.ce.get_cost_and_usage(
                TimePeriod={
                    'Start': current_start.strftime('%Y-%m-%d'),
                    'End': current_end.strftime('%Y-%m-%d')
                },
                Granularity='MONTHLY',
                Metrics=['UnblendedCost'],
                GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
            )

            # Get previous period costs
            previous_response = self.ce.get_cost_and_usage(
                TimePeriod={
                    'Start': previous_start.strftime('%Y-%m-%d'),
                    'End': previous_end.strftime('%Y-%m-%d')
                },
                Granularity='MONTHLY',
                Metrics=['UnblendedCost'],
                GroupBy=[{'Type': 'DIMENSION', 'Key': 'SERVICE'}]
            )

            # Aggregate by service
            current_costs = {}
            for result in current_response['ResultsByTime']:
                for group in result['Groups']:
                    service = group['Keys'][0]
                    cost = float(group['Metrics']['UnblendedCost']['Amount'])
                    current_costs[service] = current_costs.get(service, 0) + cost

            previous_costs = {}
            for result in previous_response['ResultsByTime']:
                for group in result['Groups']:
                    service = group['Keys'][0]
                    cost = float(group['Metrics']['UnblendedCost']['Amount'])
                    previous_costs[service] = previous_costs.get(service, 0) + cost

            # Compare services
            all_services = set(current_costs.keys()) | set(previous_costs.keys())

            for service in all_services:
                current = current_costs.get(service, 0)
                previous = previous_costs.get(service, 0)

                if previous > 0:
                    change_pct = ((current - previous) / previous) * 100
                    change_amount = current - previous
                elif current > 0:
                    change_pct = 100
                    change_amount = current
                else:
                    continue

                # Only report significant changes (> 10% or > $10)
                if abs(change_pct) > 10 or abs(change_amount) > 10:
                    trend = "↑ Increase" if change_amount > 0 else "↓ Decrease"

                    self.findings['trend_analysis'].append({
                        'Service': service,
                        'Previous Period': f"${previous:.2f}",
                        'Current Period': f"${current:.2f}",
                        'Change': f"${change_amount:+.2f}",
                        'Change %': f"{change_pct:+.1f}%",
                        'Trend': trend
                    })

            # Sort by absolute dollar change (strip the '$'; float() handles the sign)
            self.findings['trend_analysis'].sort(
                key=lambda x: abs(float(x['Change'].replace('$', ''))),
                reverse=True
            )

            print(f"  Compared {len(self.findings['trend_analysis'])} services")

        except Exception as e:
            print(f"  Error comparing periods: {str(e)}")

    def get_forecast(self):
        """Get AWS cost forecast."""
        print("\n[4/4] Getting cost forecast...")

        try:
            # Get 30-day forecast
            start_date = datetime.now().date()
            end_date = start_date + timedelta(days=30)

            response = self.ce.get_cost_forecast(
                TimePeriod={
                    'Start': start_date.strftime('%Y-%m-%d'),
                    'End': end_date.strftime('%Y-%m-%d')
                },
                Metric='UNBLENDED_COST',
                Granularity='MONTHLY'
            )

            forecast_amount = float(response['Total']['Amount'])
            print(f"  30-day forecast: ${forecast_amount:.2f}")

            return forecast_amount

        except Exception as e:
            print(f"  Error getting forecast: {str(e)}")
            return None

    def print_report(self, forecast_amount: float = None):
        """Print cost anomaly report."""
        print("\n" + "=" * 110)
        print("AWS COST ANOMALY DETECTION REPORT")
        print("=" * 110)

        # Anomalies
        if self.findings['anomalies']:
            print("\nCOST ANOMALIES DETECTED")
            print("-" * 110)
            print(tabulate(self.findings['anomalies'], headers='keys', tablefmt='grid'))
            print("\n⚠️  These dates show unusual cost spikes. Investigate immediately.")

        # Top Services
        if self.findings['top_services']:
            print("\nTOP COST DRIVERS")
            print("-" * 110)
            print(tabulate(self.findings['top_services'], headers='keys', tablefmt='grid'))

        # Trend Analysis
        if self.findings['trend_analysis']:
            print("\nPERIOD-OVER-PERIOD COMPARISON")
            print(f"(Current {self.days} days vs Previous {self.days} days)")
            print("-" * 110)
            # Show top 15 changes
            print(tabulate(self.findings['trend_analysis'][:15], headers='keys', tablefmt='grid'))

        # Forecast
        if forecast_amount:
            print("\nCOST FORECAST")
            print("-" * 110)
            print(f"Projected 30-day cost: ${forecast_amount:.2f}")
            print(f"Projected monthly run rate: ${forecast_amount:.2f}")

        print("\n" + "=" * 110)

        print("\n\nRECOMMENDED ACTIONS:")
        print("\n1. For Cost Anomalies:")
        print("   - Review CloudWatch Logs for the affected service on anomaly dates")
        print("   - Check for configuration changes or deployments")
        print("   - Verify no unauthorized resource creation")
        print("   - Set up billing alerts to catch future anomalies")

        print("\n2. For Top Cost Drivers:")
        print("   - Review each service for optimization opportunities")
        print("   - Consider Reserved Instances for consistent workloads")
        print("   - Implement auto-scaling to match demand")
        print("   - Archive or delete unused resources")

        print("\n3. Cost Monitoring Best Practices:")
        print("   - Set up AWS Budgets with email/SNS alerts")
        print("   - Enable Cost Anomaly Detection in the AWS Console")
        print("   - Tag resources for cost allocation and tracking")
        print("   - Run this script weekly to track trends")
        print("   - Review Cost Explorer monthly for detailed analysis")

        print("\n4. Immediate Actions:")
        print("   - aws budgets create-budget (set spending alerts)")
        print("   - aws ce get-anomaly-subscriptions (review anomaly detection setup)")
        print("   - Review IAM policies to prevent unauthorized spending")
        print("   - Implement cost allocation tags across all resources")

    def run(self):
        """Run cost anomaly detection."""
        print("=" * 80)
        print("AWS COST ANOMALY DETECTOR")
        print("=" * 80)
        print(f"Analysis period: {self.days} days")

        self.analyze_daily_costs()
        self.analyze_top_services()
        self.compare_periods()
        forecast = self.get_forecast()

        self.print_report(forecast)


def main():
    parser = argparse.ArgumentParser(
        description='Detect AWS cost anomalies and analyze spending trends',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Analyze last 30 days (default)
  python3 cost_anomaly_detector.py

  # Analyze last 60 days
  python3 cost_anomaly_detector.py --days 60

  # Use named profile
  python3 cost_anomaly_detector.py --profile production

Note: This script requires Cost Explorer API access, which may incur small charges.
"""
    )

    parser.add_argument('--profile', help='AWS profile name (default: default profile)')
    parser.add_argument('--days', type=int, default=30,
                        help='Days of cost data to analyze (default: 30)')

    args = parser.parse_args()

    try:
        detector = CostAnomalyDetector(
            profile=args.profile,
            days=args.days
        )
        detector.run()
    except Exception as e:
        print(f"Error: {str(e)}", file=sys.stderr)
        print("\nNote: Cost Explorer API access is required. Ensure:", file=sys.stderr)
        print("1. Cost Explorer is enabled in AWS Console", file=sys.stderr)
|
||||||
|
print("2. IAM user has 'ce:GetCostAndUsage' and 'ce:GetCostForecast' permissions", file=sys.stderr)
|
||||||
|
sys.exit(1)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
|
||||||
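The anomaly report above flags days whose spend departs sharply from the recent baseline. The flagging idea can be sketched independently of the AWS APIs; this is a minimal standalone example, and the 2-standard-deviation cutoff is an illustrative assumption, not necessarily the script's exact rule:

```python
from statistics import mean, stdev

def flag_anomalous_days(daily_costs, threshold_sigma=2.0):
    """Return (day_index, cost) pairs whose cost exceeds
    mean + threshold_sigma * stddev of the whole series."""
    mu = mean(daily_costs)
    sigma = stdev(daily_costs)
    cutoff = mu + threshold_sigma * sigma
    return [(i, c) for i, c in enumerate(daily_costs) if c > cutoff]

# Ten quiet days plus one spike on day 10
costs = [100, 102, 98, 101, 99, 100, 103, 97, 100, 101, 250]
print(flag_anomalous_days(costs))  # → [(10, 250)]
```

In practice the daily costs would come from Cost Explorer's `GetCostAndUsage` with `Granularity='DAILY'`; a rolling-window baseline is usually preferable to a whole-series mean when the series trends upward.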
334
scripts/detect_old_generations.py
Executable file
@@ -0,0 +1,334 @@
#!/usr/bin/env python3
"""
Detect old generation EC2 and RDS instances that should be migrated to newer generations.

This script identifies:
- Old generation EC2 instances (t2 → t3, m4 → m5, etc.)
- ARM/Graviton migration opportunities
- Old generation RDS instances
- Estimated cost savings from migration

Usage:
    python3 detect_old_generations.py [--region REGION] [--profile PROFILE]

Requirements:
    pip install boto3 tabulate
"""

import argparse
import sys
from typing import List

import boto3
from tabulate import tabulate


class OldGenerationDetector:
    def __init__(self, profile: str = None, region: str = None):
        self.session = boto3.Session(profile_name=profile) if profile else boto3.Session()
        self.regions = [region] if region else self._get_all_regions()
        self.findings = {
            'ec2_migrations': [],
            'graviton_opportunities': [],
            'rds_migrations': []
        }
        self.total_savings = 0.0

        # Migration mapping: old → new generation
        self.ec2_migrations = {
            # General Purpose
            't2.micro': ('t3.micro', 0.10),       # ~10% savings
            't2.small': ('t3.small', 0.10),
            't2.medium': ('t3.medium', 0.10),
            't2.large': ('t3.large', 0.10),
            't2.xlarge': ('t3.xlarge', 0.10),
            't2.2xlarge': ('t3.2xlarge', 0.10),
            'm4.large': ('m5.large', 0.04),       # ~4% savings
            'm4.xlarge': ('m5.xlarge', 0.04),
            'm4.2xlarge': ('m5.2xlarge', 0.04),
            'm4.4xlarge': ('m5.4xlarge', 0.04),
            'm4.10xlarge': ('m5.12xlarge', 0.10),
            'm4.16xlarge': ('m5.24xlarge', 0.10),
            'm5.large': ('m6i.large', 0.04),      # M5 → M6i
            'm5.xlarge': ('m6i.xlarge', 0.04),
            # Compute Optimized
            'c4.large': ('c5.large', 0.10),       # ~10% savings
            'c4.xlarge': ('c5.xlarge', 0.10),
            'c4.2xlarge': ('c5.2xlarge', 0.10),
            'c4.4xlarge': ('c5.4xlarge', 0.10),
            'c4.8xlarge': ('c5.9xlarge', 0.10),
            'c5.large': ('c6i.large', 0.05),      # C5 → C6i
            'c5.xlarge': ('c6i.xlarge', 0.05),
            # Memory Optimized
            'r4.large': ('r5.large', 0.08),       # ~8% savings
            'r4.xlarge': ('r5.xlarge', 0.08),
            'r4.2xlarge': ('r5.2xlarge', 0.08),
            'r4.4xlarge': ('r5.4xlarge', 0.08),
            'r4.8xlarge': ('r5.8xlarge', 0.08),
            'r5.large': ('r6i.large', 0.03),      # R5 → R6i
            'r5.xlarge': ('r6i.xlarge', 0.03),
        }

        # Graviton migration opportunities (even better savings)
        self.graviton_migrations = {
            't4g.micro': None,  # placeholder removed; see mapping below
        }
        self.graviton_migrations = {
            't3.micro': ('t4g.micro', 0.20),      # ~20% savings
            't3.small': ('t4g.small', 0.20),
            't3.medium': ('t4g.medium', 0.20),
            't3.large': ('t4g.large', 0.20),
            't3.xlarge': ('t4g.xlarge', 0.20),
            't3.2xlarge': ('t4g.2xlarge', 0.20),
            'm5.large': ('m6g.large', 0.20),
            'm5.xlarge': ('m6g.xlarge', 0.20),
            'm5.2xlarge': ('m6g.2xlarge', 0.20),
            'm5.4xlarge': ('m6g.4xlarge', 0.20),
            'm6i.large': ('m6g.large', 0.20),
            'm6i.xlarge': ('m6g.xlarge', 0.20),
            'c5.large': ('c6g.large', 0.20),
            'c5.xlarge': ('c6g.xlarge', 0.20),
            'c5.2xlarge': ('c6g.2xlarge', 0.20),
            'r5.large': ('r6g.large', 0.20),
            'r5.xlarge': ('r6g.xlarge', 0.20),
            'r5.2xlarge': ('r6g.2xlarge', 0.20),
        }

        # RDS instance migrations
        self.rds_migrations = {
            'db.t2.micro': ('db.t3.micro', 0.10),
            'db.t2.small': ('db.t3.small', 0.10),
            'db.t2.medium': ('db.t3.medium', 0.10),
            'db.m4.large': ('db.m5.large', 0.05),
            'db.m4.xlarge': ('db.m5.xlarge', 0.05),
            'db.m4.2xlarge': ('db.m5.2xlarge', 0.05),
            'db.r4.large': ('db.r5.large', 0.08),
            'db.r4.xlarge': ('db.r5.xlarge', 0.08),
            'db.r4.2xlarge': ('db.r5.2xlarge', 0.08),
        }

    def _get_all_regions(self) -> List[str]:
        """Get all enabled AWS regions."""
        ec2 = self.session.client('ec2', region_name='us-east-1')
        regions = ec2.describe_regions(AllRegions=False)
        return [region['RegionName'] for region in regions['Regions']]

    def _estimate_hourly_cost(self, instance_type: str) -> float:
        """Rough estimate of hourly On-Demand cost."""
        cost_map = {
            't2.micro': 0.0116, 't2.small': 0.023, 't2.medium': 0.0464,
            't3.micro': 0.0104, 't3.small': 0.0208, 't3.medium': 0.0416,
            't3.large': 0.0832, 't3.xlarge': 0.1664, 't3.2xlarge': 0.3328,
            't4g.micro': 0.0084, 't4g.small': 0.0168, 't4g.medium': 0.0336,
            'm4.large': 0.10, 'm4.xlarge': 0.20, 'm4.2xlarge': 0.40,
            'm5.large': 0.096, 'm5.xlarge': 0.192, 'm5.2xlarge': 0.384,
            'm6i.large': 0.096, 'm6i.xlarge': 0.192,
            'm6g.large': 0.077, 'm6g.xlarge': 0.154, 'm6g.2xlarge': 0.308,
            'c4.large': 0.10, 'c4.xlarge': 0.199, 'c4.2xlarge': 0.398,
            'c5.large': 0.085, 'c5.xlarge': 0.17, 'c5.2xlarge': 0.34,
            'c6i.large': 0.085, 'c6i.xlarge': 0.17,
            'c6g.large': 0.068, 'c6g.xlarge': 0.136, 'c6g.2xlarge': 0.272,
            'r4.large': 0.133, 'r4.xlarge': 0.266, 'r4.2xlarge': 0.532,
            'r5.large': 0.126, 'r5.xlarge': 0.252, 'r5.2xlarge': 0.504,
            'r6i.large': 0.126, 'r6i.xlarge': 0.252,
            'r6g.large': 0.101, 'r6g.xlarge': 0.202, 'r6g.2xlarge': 0.403,
        }
        return cost_map.get(instance_type, 0.10)

    def detect_ec2_migrations(self):
        """Detect old generation EC2 instances."""
        print("\n[1/2] Scanning for old generation EC2 instances...")

        for region in self.regions:
            try:
                ec2 = self.session.client('ec2', region_name=region)
                instances = ec2.describe_instances(
                    Filters=[{'Name': 'instance-state-name', 'Values': ['running', 'stopped']}]
                )

                for reservation in instances['Reservations']:
                    for instance in reservation['Instances']:
                        instance_id = instance['InstanceId']
                        instance_type = instance['InstanceType']
                        state = instance['State']['Name']

                        name_tag = next((tag['Value'] for tag in instance.get('Tags', [])
                                         if tag['Key'] == 'Name'), 'N/A')

                        # Check for standard migration
                        if instance_type in self.ec2_migrations:
                            new_type, savings_pct = self.ec2_migrations[instance_type]
                            current_cost = self._estimate_hourly_cost(instance_type)
                            new_cost = self._estimate_hourly_cost(new_type)
                            monthly_savings = (current_cost - new_cost) * 730

                            if state == 'running':
                                self.total_savings += monthly_savings * 12

                            self.findings['ec2_migrations'].append({
                                'Region': region,
                                'Instance ID': instance_id,
                                'Name': name_tag,
                                'Current Type': instance_type,
                                'Recommended Type': new_type,
                                'State': state,
                                'Savings %': f"{savings_pct*100:.0f}%",
                                'Monthly Savings': f"${monthly_savings:.2f}",
                                'Migration Type': 'Standard Upgrade'
                            })

                        # Check for Graviton migration
                        elif instance_type in self.graviton_migrations:
                            new_type, savings_pct = self.graviton_migrations[instance_type]
                            current_cost = self._estimate_hourly_cost(instance_type)
                            new_cost = self._estimate_hourly_cost(new_type)
                            monthly_savings = (current_cost - new_cost) * 730

                            if state == 'running':
                                self.total_savings += monthly_savings * 12

                            self.findings['graviton_opportunities'].append({
                                'Region': region,
                                'Instance ID': instance_id,
                                'Name': name_tag,
                                'Current Type': instance_type,
                                'Graviton Type': new_type,
                                'State': state,
                                'Savings %': f"{savings_pct*100:.0f}%",
                                'Monthly Savings': f"${monthly_savings:.2f}",
                                'Note': 'Requires ARM64 compatibility'
                            })

            except Exception as e:
                print(f"  Error scanning {region}: {str(e)}")

        print(f"  Found {len(self.findings['ec2_migrations'])} standard migrations")
        print(f"  Found {len(self.findings['graviton_opportunities'])} Graviton opportunities")

    def detect_rds_migrations(self):
        """Detect old generation RDS instances."""
        print("\n[2/2] Scanning for old generation RDS instances...")

        for region in self.regions:
            try:
                rds = self.session.client('rds', region_name=region)
                instances = rds.describe_db_instances()

                for instance in instances['DBInstances']:
                    instance_id = instance['DBInstanceIdentifier']
                    instance_class = instance['DBInstanceClass']
                    engine = instance['Engine']
                    status = instance['DBInstanceStatus']

                    if instance_class in self.rds_migrations:
                        new_class, savings_pct = self.rds_migrations[instance_class]

                        # RDS pricing is roughly 2x the equivalent EC2 rate
                        base_type = instance_class.replace('db.', '')
                        current_cost = self._estimate_hourly_cost(base_type) * 2
                        new_cost = current_cost * (1 - savings_pct)
                        monthly_savings = (current_cost - new_cost) * 730

                        if status == 'available':
                            self.total_savings += monthly_savings * 12

                        self.findings['rds_migrations'].append({
                            'Region': region,
                            'Instance ID': instance_id,
                            'Engine': engine,
                            'Current Class': instance_class,
                            'Recommended Class': new_class,
                            'Status': status,
                            'Savings %': f"{savings_pct*100:.0f}%",
                            'Monthly Savings': f"${monthly_savings:.2f}"
                        })

            except Exception as e:
                print(f"  Error scanning {region}: {str(e)}")

        print(f"  Found {len(self.findings['rds_migrations'])} RDS migrations")

    def print_report(self):
        """Print migration recommendations report."""
        print("\n" + "="*100)
        print("OLD GENERATION INSTANCE MIGRATION REPORT")
        print("="*100)

        if self.findings['ec2_migrations']:
            print("\nEC2 STANDARD MIGRATION OPPORTUNITIES")
            print("-" * 100)
            print(tabulate(self.findings['ec2_migrations'], headers='keys', tablefmt='grid'))

        if self.findings['graviton_opportunities']:
            print("\nEC2 GRAVITON (ARM64) MIGRATION OPPORTUNITIES")
            print("-" * 100)
            print(tabulate(self.findings['graviton_opportunities'], headers='keys', tablefmt='grid'))
            print("\nNOTE: Graviton instances offer significant savings but require ARM64-compatible workloads")
            print("Test thoroughly before migrating production workloads")

        if self.findings['rds_migrations']:
            print("\nRDS MIGRATION OPPORTUNITIES")
            print("-" * 100)
            print(tabulate(self.findings['rds_migrations'], headers='keys', tablefmt='grid'))

        print("\n" + "="*100)
        print(f"ESTIMATED ANNUAL SAVINGS: ${self.total_savings:.2f}")
        print("="*100)

        print("\n\nMIGRATION RECOMMENDATIONS:")
        print("\nEC2 Standard Migrations (x86):")
        print("- Generally drop-in replacements with better performance")
        print("- Can be done with an instance type change (stop/start required)")
        print("- Minimal to no application changes needed")
        print("\nGraviton Migrations (ARM64):")
        print("- Requires ARM64-compatible applications and dependencies")
        print("- Test in non-production first")
        print("- Most modern languages/frameworks support ARM64")
        print("- Offers the best price/performance ratio")
        print("\nRDS Migrations:")
        print("- Requires database instance modification")
        print("- Triggers brief downtime during modification")
        print("- Schedule during a maintenance window")
        print("- Test with Multi-AZ for minimal downtime")

    def run(self):
        """Run old generation detection."""
        print(f"Scanning for old generation instances across {len(self.regions)} region(s)...")

        self.detect_ec2_migrations()
        self.detect_rds_migrations()

        self.print_report()


def main():
    parser = argparse.ArgumentParser(
        description='Detect old generation AWS instances and recommend migrations',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Scan all regions with default profile
  python3 detect_old_generations.py

  # Scan specific region
  python3 detect_old_generations.py --region us-east-1

  # Use named profile
  python3 detect_old_generations.py --profile production
"""
    )

    parser.add_argument('--region', help='AWS region (default: all regions)')
    parser.add_argument('--profile', help='AWS profile name (default: default profile)')

    args = parser.parse_args()

    try:
        detector = OldGenerationDetector(
            profile=args.profile,
            region=args.region
        )
        detector.run()
    except Exception as e:
        print(f"Error: {str(e)}", file=sys.stderr)
        sys.exit(1)


if __name__ == '__main__':
    main()
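The savings figures in the report above reduce to one piece of arithmetic: monthly savings = (old hourly rate − new hourly rate) × 730 hours, then × 12 for the annual figure. A standalone sketch of that calculation (the rates used in the example are the same illustrative t2/t3 rates hard-coded in the script, not live AWS pricing):

```python
HOURS_PER_MONTH = 730  # AWS's conventional hours-per-month figure

def migration_savings(old_hourly: float, new_hourly: float) -> tuple:
    """Return (monthly, annual) savings from migrating one instance,
    rounded to cents."""
    monthly = (old_hourly - new_hourly) * HOURS_PER_MONTH
    return round(monthly, 2), round(monthly * 12, 2)

# t2.medium ($0.0464/hr) -> t3.medium ($0.0416/hr)
print(migration_savings(0.0464, 0.0416))  # → (3.5, 42.05)
```

Per instance the number is small; the case for migration is that it multiplies across a fleet and that newer generations also deliver better per-dollar performance.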
409
scripts/find_unused_resources.py
Executable file
@@ -0,0 +1,409 @@
#!/usr/bin/env python3
|
||||||
|
"""
|
||||||
|
Find unused AWS resources that are costing money.
|
||||||
|
|
||||||
|
This script identifies:
|
||||||
|
- Unattached EBS volumes
|
||||||
|
- Old EBS snapshots
|
||||||
|
- Unused Elastic IPs
|
||||||
|
- Idle NAT Gateways
|
||||||
|
- Idle EC2 instances (low utilization)
|
||||||
|
- Unattached load balancers
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
python3 find_unused_resources.py [--region REGION] [--profile PROFILE] [--snapshot-age-days DAYS]
|
||||||
|
|
||||||
|
Requirements:
|
||||||
|
pip install boto3 tabulate
|
||||||
|
"""
|
||||||
|
|
||||||
|
import argparse
|
||||||
|
import boto3
|
||||||
|
from datetime import datetime, timedelta
|
||||||
|
from typing import List, Dict, Any
|
||||||
|
from tabulate import tabulate
|
||||||
|
import sys
|
||||||
|
|
||||||
|
|
||||||
|
class UnusedResourceFinder:
|
||||||
|
def __init__(self, profile: str = None, region: str = None, snapshot_age_days: int = 90):
|
||||||
|
self.session = boto3.Session(profile_name=profile) if profile else boto3.Session()
|
||||||
|
self.regions = [region] if region else self._get_all_regions()
|
||||||
|
self.snapshot_age_days = snapshot_age_days
|
||||||
|
self.findings = {
|
||||||
|
'ebs_volumes': [],
|
||||||
|
'snapshots': [],
|
||||||
|
'elastic_ips': [],
|
||||||
|
'nat_gateways': [],
|
||||||
|
'idle_instances': [],
|
||||||
|
'load_balancers': []
|
||||||
|
}
|
||||||
|
self.total_cost_estimate = 0.0
|
||||||
|
|
||||||
|
def _get_all_regions(self) -> List[str]:
|
||||||
|
"""Get all enabled AWS regions."""
|
||||||
|
ec2 = self.session.client('ec2', region_name='us-east-1')
|
||||||
|
regions = ec2.describe_regions(AllRegions=False)
|
||||||
|
return [region['RegionName'] for region in regions['Regions']]
|
||||||
|
|
||||||
|
def find_unattached_volumes(self):
|
||||||
|
"""Find unattached EBS volumes."""
|
||||||
|
print("\n[1/6] Scanning for unattached EBS volumes...")
|
||||||
|
|
||||||
|
for region in self.regions:
|
||||||
|
try:
|
||||||
|
ec2 = self.session.client('ec2', region_name=region)
|
||||||
|
volumes = ec2.describe_volumes(
|
||||||
|
Filters=[{'Name': 'status', 'Values': ['available']}]
|
||||||
|
)
|
||||||
|
|
||||||
|
for volume in volumes['Volumes']:
|
||||||
|
# Rough cost estimate: gp3 $0.08/GB/month
|
||||||
|
monthly_cost = volume['Size'] * 0.08
|
||||||
|
self.total_cost_estimate += monthly_cost
|
||||||
|
|
||||||
|
self.findings['ebs_volumes'].append({
|
||||||
|
'Region': region,
|
||||||
|
'Volume ID': volume['VolumeId'],
|
||||||
|
'Size (GB)': volume['Size'],
|
||||||
|
'Type': volume['VolumeType'],
|
||||||
|
'Created': volume['CreateTime'].strftime('%Y-%m-%d'),
|
||||||
|
'Est. Monthly Cost': f"${monthly_cost:.2f}"
|
||||||
|
})
|
||||||
|
except Exception as e:
|
||||||
|
print(f" Error scanning {region}: {str(e)}")
|
||||||
|
|
||||||
|
print(f" Found {len(self.findings['ebs_volumes'])} unattached volumes")
|
||||||
|
|
||||||
|
def find_old_snapshots(self):
|
||||||
|
"""Find old EBS snapshots."""
|
||||||
|
print(f"\n[2/6] Scanning for snapshots older than {self.snapshot_age_days} days...")
|
||||||
|
|
||||||
|
cutoff_date = datetime.now(datetime.now().astimezone().tzinfo) - timedelta(days=self.snapshot_age_days)
|
||||||
|
|
||||||
|
for region in self.regions:
|
||||||
|
try:
|
||||||
|
ec2 = self.session.client('ec2', region_name=region)
|
||||||
|
|
||||||
|
# Get account ID
|
||||||
|
sts = self.session.client('sts')
|
||||||
|
account_id = sts.get_caller_identity()['Account']
|
||||||
|
|
||||||
|
snapshots = ec2.describe_snapshots(OwnerIds=[account_id])
|
||||||
|
|
||||||
|
for snapshot in snapshots['Snapshots']:
|
||||||
|
if snapshot['StartTime'] < cutoff_date:
|
||||||
|
# Snapshot cost: $0.05/GB/month
|
||||||
|
monthly_cost = snapshot['VolumeSize'] * 0.05
|
||||||
|
self.total_cost_estimate += monthly_cost
|
||||||
|
|
||||||
|
age_days = (datetime.now(datetime.now().astimezone().tzinfo) - snapshot['StartTime']).days
|
||||||
|
|
||||||
|
self.findings['snapshots'].append({
|
||||||
|
'Region': region,
|
||||||
|
'Snapshot ID': snapshot['SnapshotId'],
|
||||||
|
'Size (GB)': snapshot['VolumeSize'],
|
||||||
|
'Age (days)': age_days,
|
||||||
|
'Created': snapshot['StartTime'].strftime('%Y-%m-%d'),
|
||||||
|
'Est. Monthly Cost': f"${monthly_cost:.2f}"
|
||||||
|
})
|
||||||
|
except Exception as e:
|
||||||
|
print(f" Error scanning {region}: {str(e)}")
|
||||||
|
|
||||||
|
print(f" Found {len(self.findings['snapshots'])} old snapshots")
|
||||||
|
|
||||||
|
def find_unused_elastic_ips(self):
|
||||||
|
"""Find unassociated Elastic IPs."""
|
||||||
|
print("\n[3/6] Scanning for unused Elastic IPs...")
|
||||||
|
|
||||||
|
for region in self.regions:
|
||||||
|
try:
|
||||||
|
ec2 = self.session.client('ec2', region_name=region)
|
||||||
|
addresses = ec2.describe_addresses()
|
||||||
|
|
||||||
|
for address in addresses['Addresses']:
|
||||||
|
if 'AssociationId' not in address:
|
||||||
|
# Unassociated EIP: ~$3.65/month
|
||||||
|
monthly_cost = 3.65
|
||||||
|
self.total_cost_estimate += monthly_cost
|
||||||
|
|
||||||
|
self.findings['elastic_ips'].append({
|
||||||
|
'Region': region,
|
||||||
|
'Allocation ID': address['AllocationId'],
|
||||||
|
'Public IP': address.get('PublicIp', 'N/A'),
|
||||||
|
'Status': 'Unassociated',
|
||||||
|
'Est. Monthly Cost': f"${monthly_cost:.2f}"
|
||||||
|
})
|
||||||
|
except Exception as e:
|
||||||
|
print(f" Error scanning {region}: {str(e)}")
|
||||||
|
|
||||||
|
print(f" Found {len(self.findings['elastic_ips'])} unused Elastic IPs")
|
||||||
|
|
||||||
|
def find_idle_nat_gateways(self):
|
||||||
|
"""Find NAT Gateways with low traffic."""
|
||||||
|
print("\n[4/6] Scanning for idle NAT Gateways...")
|
||||||
|
|
||||||
|
for region in self.regions:
|
||||||
|
try:
|
||||||
|
ec2 = self.session.client('ec2', region_name=region)
|
||||||
|
nat_gateways = ec2.describe_nat_gateways(
|
||||||
|
Filters=[{'Name': 'state', 'Values': ['available']}]
|
||||||
|
)
|
||||||
|
|
||||||
|
cloudwatch = self.session.client('cloudwatch', region_name=region)
|
||||||
|
|
||||||
|
for nat in nat_gateways['NatGateways']:
|
||||||
|
nat_id = nat['NatGatewayId']
|
||||||
|
|
||||||
|
# Check CloudWatch metrics for the last 7 days
|
||||||
|
end_time = datetime.now()
|
||||||
|
start_time = end_time - timedelta(days=7)
|
||||||
|
|
||||||
|
try:
|
||||||
|
metrics = cloudwatch.get_metric_statistics(
|
||||||
|
Namespace='AWS/NATGateway',
|
||||||
|
MetricName='BytesOutToSource',
|
||||||
|
Dimensions=[{'Name': 'NatGatewayId', 'Value': nat_id}],
|
||||||
|
StartTime=start_time,
|
||||||
|
EndTime=end_time,
|
||||||
|
Period=86400, # 1 day
|
||||||
|
Statistics=['Sum']
|
||||||
|
)
|
||||||
|
|
||||||
|
total_bytes = sum([point['Sum'] for point in metrics['Datapoints']])
|
||||||
|
avg_gb_per_day = (total_bytes / (1024**3)) / 7
|
||||||
|
|
||||||
|
# NAT Gateway: ~$32.85/month + data processing
|
||||||
|
monthly_cost = 32.85
|
||||||
|
self.total_cost_estimate += monthly_cost
|
||||||
|
|
||||||
|
# Flag as idle if less than 1GB/day average
|
||||||
|
if avg_gb_per_day < 1:
|
||||||
|
self.findings['nat_gateways'].append({
|
||||||
|
'Region': region,
|
||||||
|
'NAT Gateway ID': nat_id,
|
||||||
|
'VPC': nat.get('VpcId', 'N/A'),
|
||||||
|
'Avg Traffic (GB/day)': f"{avg_gb_per_day:.2f}",
|
||||||
|
'Status': 'Low Traffic',
|
||||||
|
'Est. Monthly Cost': f"${monthly_cost:.2f}"
|
||||||
|
})
|
||||||
|
except Exception:
|
||||||
|
# If we can't get metrics, still report the NAT Gateway
|
||||||
|
self.findings['nat_gateways'].append({
|
||||||
|
'Region': region,
|
||||||
|
'NAT Gateway ID': nat_id,
|
||||||
|
'VPC': nat.get('VpcId', 'N/A'),
|
||||||
|
'Avg Traffic (GB/day)': 'N/A',
|
||||||
|
'Status': 'Metrics Unavailable',
|
||||||
|
'Est. Monthly Cost': f"${monthly_cost:.2f}"
|
||||||
|
})
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f" Error scanning {region}: {str(e)}")
|
||||||
|
|
||||||
|
print(f" Found {len(self.findings['nat_gateways'])} idle NAT Gateways")
|
||||||
|
|
||||||
|
def find_idle_instances(self):
|
||||||
|
"""Find EC2 instances with low CPU utilization."""
|
||||||
|
print("\n[5/6] Scanning for idle EC2 instances...")
|
||||||
|
|
||||||
|
for region in self.regions:
|
||||||
|
try:
|
||||||
|
ec2 = self.session.client('ec2', region_name=region)
|
||||||
|
cloudwatch = self.session.client('cloudwatch', region_name=region)
|
||||||
|
|
||||||
|
instances = ec2.describe_instances(
|
||||||
|
Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
|
||||||
|
)
|
||||||
|
|
||||||
|
for reservation in instances['Reservations']:
|
||||||
|
for instance in reservation['Instances']:
|
||||||
|
instance_id = instance['InstanceId']
|
||||||
|
instance_type = instance['InstanceType']
|
||||||
|
|
||||||
|
# Check CPU utilization for the last 7 days
|
||||||
|
end_time = datetime.now()
|
||||||
|
start_time = end_time - timedelta(days=7)
|
||||||
|
|
||||||
|
try:
|
||||||
|
metrics = cloudwatch.get_metric_statistics(
|
||||||
|
Namespace='AWS/EC2',
|
||||||
|
MetricName='CPUUtilization',
|
||||||
|
Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
|
||||||
|
StartTime=start_time,
|
||||||
|
EndTime=end_time,
|
||||||
|
Period=3600, # 1 hour
|
||||||
|
Statistics=['Average']
|
||||||
|
)
|
||||||
|
|
||||||
|
if metrics['Datapoints']:
|
||||||
|
avg_cpu = sum([point['Average'] for point in metrics['Datapoints']]) / len(metrics['Datapoints'])
|
||||||
|
max_cpu = max([point['Average'] for point in metrics['Datapoints']])
|
||||||
|
|
||||||
|
# Flag instances with avg CPU < 5% and max < 15%
|
||||||
|
if avg_cpu < 5 and max_cpu < 15:
|
||||||
|
# Rough cost estimate (varies by instance type)
|
||||||
|
# This is approximate - you'd need pricing API for accuracy
|
||||||
|
monthly_cost = self._estimate_instance_cost(instance_type)
|
||||||
|
self.total_cost_estimate += monthly_cost
|
||||||
|
|
||||||
|
name_tag = next((tag['Value'] for tag in instance.get('Tags', []) if tag['Key'] == 'Name'), 'N/A')
|
||||||
|
|
||||||
|
self.findings['idle_instances'].append({
|
||||||
|
'Region': region,
|
||||||
|
'Instance ID': instance_id,
|
||||||
|
'Name': name_tag,
|
||||||
|
'Type': instance_type,
|
||||||
|
'Avg CPU (%)': f"{avg_cpu:.2f}",
|
||||||
|
'Max CPU (%)': f"{max_cpu:.2f}",
|
||||||
|
'Est. Monthly Cost': f"${monthly_cost:.2f}"
|
||||||
|
})
|
||||||
|
except Exception:
|
||||||
|
pass
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f" Error scanning {region}: {str(e)}")
|
||||||
|
|
||||||
|
print(f" Found {len(self.findings['idle_instances'])} idle instances")
|
||||||
|
|
||||||
|
def find_unused_load_balancers(self):
|
||||||
|
"""Find load balancers with no targets."""
|
||||||
|
print("\n[6/6] Scanning for unused load balancers...")
|
||||||
|
|
||||||
|
for region in self.regions:
|
||||||
|
try:
|
||||||
|
# Check Application/Network Load Balancers
|
||||||
|
elbv2 = self.session.client('elbv2', region_name=region)
|
||||||
|
load_balancers = elbv2.describe_load_balancers()
|
||||||
|
|
||||||
|
for lb in load_balancers['LoadBalancers']:
|
||||||
|
lb_arn = lb['LoadBalancerArn']
|
||||||
|
lb_name = lb['LoadBalancerName']
|
||||||
|
lb_type = lb['Type']
|
||||||
|
|
||||||
|
# Check target groups
|
||||||
|
target_groups = elbv2.describe_target_groups(LoadBalancerArn=lb_arn)
|
||||||
|
|
||||||
|
has_healthy_targets = False
|
||||||
|
for tg in target_groups['TargetGroups']:
|
||||||
|
health = elbv2.describe_target_health(TargetGroupArn=tg['TargetGroupArn'])
|
||||||
|
if any(target['TargetHealth']['State'] == 'healthy' for target in health['TargetHealthDescriptions']):
|
||||||
|
has_healthy_targets = True
|
||||||
|
break
|
||||||
|
|
||||||
|
if not has_healthy_targets:
|
||||||
|
# ALB: ~$16.20/month, NLB: ~$22.35/month
|
||||||
|
monthly_cost = 22.35 if lb_type == 'network' else 16.20
|
||||||
|
self.total_cost_estimate += monthly_cost
|
||||||
|
|
||||||
|
self.findings['load_balancers'].append({
|
||||||
|
'Region': region,
|
||||||
|
'Name': lb_name,
|
||||||
|
'Type': lb_type.upper(),
|
||||||
|
'DNS': lb['DNSName'],
|
||||||
|
'Status': 'No Healthy Targets',
|
||||||
|
'Est. Monthly Cost': f"${monthly_cost:.2f}"
|
||||||
|
})
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
print(f" Error scanning {region}: {str(e)}")
|
||||||
|
|
||||||
|
print(f" Found {len(self.findings['load_balancers'])} unused load balancers")
|
||||||
|
|
||||||
|
def _estimate_instance_cost(self, instance_type: str) -> float:
|
||||||
|
"""Rough estimate of monthly instance cost (On-Demand, us-east-1)."""
|
||||||
|
# This is a simplified approximation
|
||||||
|
cost_map = {
|
||||||
|
't2': 0.0116, 't3': 0.0104, 't3a': 0.0094,
|
||||||
|
'm5': 0.096, 'm5a': 0.086, 'm6i': 0.096,
|
||||||
|
'c5': 0.085, 'c5a': 0.077, 'c6i': 0.085,
|
||||||
|
'r5': 0.126, 'r5a': 0.113, 'r6i': 0.126,
|
||||||
|
}
|
||||||
|
|
||||||
|
# Extract family (e.g., 't3' from 't3.micro')
|
||||||
|
family = instance_type.split('.')[0]
|
        hourly_cost = cost_map.get(family, 0.10)  # Default to $0.10/hour

        return hourly_cost * 730  # Hours per month

    def print_report(self):
        """Print findings report."""
        print("\n" + "="*80)
        print("AWS UNUSED RESOURCES REPORT")
        print("="*80)

        sections = [
            ('UNATTACHED EBS VOLUMES', 'ebs_volumes'),
            ('OLD SNAPSHOTS', 'snapshots'),
            ('UNUSED ELASTIC IPs', 'elastic_ips'),
            ('IDLE NAT GATEWAYS', 'nat_gateways'),
            ('IDLE EC2 INSTANCES', 'idle_instances'),
            ('UNUSED LOAD BALANCERS', 'load_balancers')
        ]

        for title, key in sections:
            findings = self.findings[key]
            if findings:
                print(f"\n{title} ({len(findings)} found)")
                print("-" * 80)
                print(tabulate(findings, headers='keys', tablefmt='grid'))

        print("\n" + "="*80)
        print(f"ESTIMATED MONTHLY SAVINGS: ${self.total_cost_estimate:.2f}")
        print("="*80)
        print("\nNOTE: Cost estimates are approximate. Actual savings may vary.")
        print("Review each resource before deletion to avoid disrupting services.")

    def run(self):
        """Run all scans."""
        print(f"Scanning AWS account across {len(self.regions)} region(s)...")
        print("This may take several minutes...\n")

        self.find_unattached_volumes()
        self.find_old_snapshots()
        self.find_unused_elastic_ips()
        self.find_idle_nat_gateways()
        self.find_idle_instances()
        self.find_unused_load_balancers()

        self.print_report()


def main():
    parser = argparse.ArgumentParser(
        description='Find unused AWS resources that are costing money',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Scan all regions with the default profile
  python3 find_unused_resources.py

  # Scan a specific region with a named profile
  python3 find_unused_resources.py --region us-east-1 --profile production

  # Find snapshots older than 180 days
  python3 find_unused_resources.py --snapshot-age-days 180
"""
    )

    parser.add_argument('--region', help='AWS region (default: all regions)')
    parser.add_argument('--profile', help='AWS profile name (default: default profile)')
    parser.add_argument('--snapshot-age-days', type=int, default=90,
                        help='Snapshots older than this are flagged (default: 90)')

    args = parser.parse_args()

    try:
        finder = UnusedResourceFinder(
            profile=args.profile,
            region=args.region,
            snapshot_age_days=args.snapshot_age_days
        )
        finder.run()
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == '__main__':
    main()
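The scanner's savings figures all come from the same convention: a rough hourly rate (with a family-level fallback of $0.10/hour) scaled by roughly 730 hours per month. A minimal offline sketch of that estimate — the `estimate_monthly_cost` name and the sample rates here are illustrative, not part of the script:

```python
# Offline sketch of the scanner's cost fallback: look up a family-level
# hourly rate (defaulting to $0.10/hour) and scale by ~730 hours/month.
FAMILY_HOURLY = {'t3': 0.04, 'm5': 0.20, 'c5': 0.17, 'r5': 0.25}  # rough on-demand rates
HOURS_PER_MONTH = 730

def estimate_monthly_cost(instance_type: str) -> float:
    """Approximate monthly on-demand cost for an instance type."""
    family = instance_type.split('.')[0]
    hourly = FAMILY_HOURLY.get(family, 0.10)  # default $0.10/hour
    return hourly * HOURS_PER_MONTH

print(f"${estimate_monthly_cost('m5.xlarge'):.2f}")   # $146.00
print(f"${estimate_monthly_cost('x2gd.large'):.2f}")  # $73.00 (unknown family falls back)
```

Because these are flat per-family rates, treat the output as an order-of-magnitude signal; the real bill depends on region, tenancy, and pricing model.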
387 scripts/rightsizing_analyzer.py Executable file
@@ -0,0 +1,387 @@
#!/usr/bin/env python3
"""
Analyze EC2 and RDS instances for rightsizing opportunities.

This script identifies:
- Oversized EC2 instances (low CPU/memory utilization)
- Oversized RDS instances (low CPU/connection utilization)
- Recommended smaller instance types
- Potential cost savings

Usage:
    python3 rightsizing_analyzer.py [--region REGION] [--profile PROFILE] [--days DAYS]

Requirements:
    pip install boto3 tabulate
"""

import argparse
import sys
from datetime import datetime, timedelta
from typing import List, Optional

import boto3
from tabulate import tabulate


class RightsizingAnalyzer:
    def __init__(self, profile: str = None, region: str = None, days: int = 14):
        self.session = boto3.Session(profile_name=profile) if profile else boto3.Session()
        self.regions = [region] if region else self._get_all_regions()
        self.days = days
        self.findings = {
            'ec2': [],
            'rds': []
        }
        self.total_savings = 0.0

        # CPU thresholds for rightsizing
        self.cpu_thresholds = {
            'underutilized': 15,  # < 15% avg CPU
            'low': 30,            # < 30% avg CPU
        }

    def _get_all_regions(self) -> List[str]:
        """Get all enabled AWS regions."""
        ec2 = self.session.client('ec2', region_name='us-east-1')
        regions = ec2.describe_regions(AllRegions=False)
        return [region['RegionName'] for region in regions['Regions']]

    def _estimate_hourly_cost(self, instance_type: str) -> float:
        """Rough estimate of hourly on-demand cost."""
        cost_map = {
            't3.micro': 0.0104, 't3.small': 0.0208, 't3.medium': 0.0416,
            't3.large': 0.0832, 't3.xlarge': 0.1664, 't3.2xlarge': 0.3328,
            'm5.large': 0.096, 'm5.xlarge': 0.192, 'm5.2xlarge': 0.384,
            'm5.4xlarge': 0.768, 'm5.8xlarge': 1.536, 'm5.12xlarge': 2.304,
            'm5.16xlarge': 3.072, 'm5.24xlarge': 4.608,
            'c5.large': 0.085, 'c5.xlarge': 0.17, 'c5.2xlarge': 0.34,
            'c5.4xlarge': 0.68, 'c5.9xlarge': 1.53, 'c5.12xlarge': 2.04,
            'c5.18xlarge': 3.06, 'c5.24xlarge': 4.08,
            'r5.large': 0.126, 'r5.xlarge': 0.252, 'r5.2xlarge': 0.504,
            'r5.4xlarge': 1.008, 'r5.8xlarge': 2.016, 'r5.12xlarge': 3.024,
            'r5.16xlarge': 4.032, 'r5.24xlarge': 6.048,
        }

        if instance_type not in cost_map:
            family = instance_type.split('.')[0]
            family_defaults = {'t3': 0.04, 'm5': 0.20, 'c5': 0.17, 'r5': 0.25}
            return family_defaults.get(family, 0.10)

        return cost_map[instance_type]

    def _get_smaller_instance_type(self, current_type: str) -> Optional[str]:
        """Suggest the next smaller instance type in the same family."""
        # Size progression within families
        sizes = ['nano', 'micro', 'small', 'medium', 'large', 'xlarge', '2xlarge',
                 '3xlarge', '4xlarge', '8xlarge', '9xlarge', '12xlarge', '16xlarge',
                 '18xlarge', '24xlarge', '32xlarge']

        parts = current_type.split('.')
        if len(parts) != 2:
            return None

        family, size = parts

        if size not in sizes:
            return None

        current_idx = sizes.index(size)
        if current_idx <= 0:
            return None  # Already at smallest

        # Go down one size
        new_size = sizes[current_idx - 1]
        return f"{family}.{new_size}"

    def analyze_ec2_instances(self):
        """Analyze EC2 instances for rightsizing."""
        print(f"\n[1/2] Analyzing EC2 instances (last {self.days} days)...")

        for region in self.regions:
            try:
                ec2 = self.session.client('ec2', region_name=region)
                cloudwatch = self.session.client('cloudwatch', region_name=region)

                instances = ec2.describe_instances(
                    Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
                )

                for reservation in instances['Reservations']:
                    for instance in reservation['Instances']:
                        instance_id = instance['InstanceId']
                        instance_type = instance['InstanceType']

                        # Skip smallest instances (already optimized)
                        if any(size in instance_type for size in ['nano', 'micro', 'small']):
                            continue

                        name_tag = next((tag['Value'] for tag in instance.get('Tags', [])
                                         if tag['Key'] == 'Name'), 'N/A')

                        # Get CloudWatch metrics
                        end_time = datetime.now()
                        start_time = end_time - timedelta(days=self.days)

                        try:
                            # CPU Utilization
                            cpu_metrics = cloudwatch.get_metric_statistics(
                                Namespace='AWS/EC2',
                                MetricName='CPUUtilization',
                                Dimensions=[{'Name': 'InstanceId', 'Value': instance_id}],
                                StartTime=start_time,
                                EndTime=end_time,
                                Period=3600,
                                Statistics=['Average', 'Maximum']
                            )

                            if not cpu_metrics['Datapoints']:
                                continue

                            avg_cpu = sum(p['Average'] for p in cpu_metrics['Datapoints']) / len(cpu_metrics['Datapoints'])
                            max_cpu = max(p['Maximum'] for p in cpu_metrics['Datapoints'])

                            # Check if underutilized
                            if avg_cpu < self.cpu_thresholds['low'] and max_cpu < 60:
                                smaller_type = self._get_smaller_instance_type(instance_type)

                                if smaller_type:
                                    current_cost = self._estimate_hourly_cost(instance_type)
                                    new_cost = self._estimate_hourly_cost(smaller_type)
                                    monthly_savings = (current_cost - new_cost) * 730
                                    annual_savings = monthly_savings * 12

                                    self.total_savings += annual_savings

                                    # Determine severity
                                    if avg_cpu < self.cpu_thresholds['underutilized']:
                                        severity = "High"
                                    else:
                                        severity = "Medium"

                                    self.findings['ec2'].append({
                                        'Region': region,
                                        'Instance ID': instance_id,
                                        'Name': name_tag,
                                        'Current Type': instance_type,
                                        'Recommended Type': smaller_type,
                                        'Avg CPU (%)': f"{avg_cpu:.1f}",
                                        'Max CPU (%)': f"{max_cpu:.1f}",
                                        'Monthly Savings': f"${monthly_savings:.2f}",
                                        'Severity': severity
                                    })

                        except Exception:
                            pass  # Skip instances without metrics

            except Exception as e:
                print(f"  Error scanning {region}: {e}")

        print(f"  Found {len(self.findings['ec2'])} rightsizing opportunities")

    def analyze_rds_instances(self):
        """Analyze RDS instances for rightsizing."""
        print(f"\n[2/2] Analyzing RDS instances (last {self.days} days)...")

        for region in self.regions:
            try:
                rds = self.session.client('rds', region_name=region)
                cloudwatch = self.session.client('cloudwatch', region_name=region)

                instances = rds.describe_db_instances()

                for instance in instances['DBInstances']:
                    instance_id = instance['DBInstanceIdentifier']
                    instance_class = instance['DBInstanceClass']
                    engine = instance['Engine']

                    # Skip smallest instances
                    if any(size in instance_class for size in ['micro', 'small']):
                        continue

                    # Get CloudWatch metrics
                    end_time = datetime.now()
                    start_time = end_time - timedelta(days=self.days)

                    try:
                        # CPU Utilization
                        cpu_metrics = cloudwatch.get_metric_statistics(
                            Namespace='AWS/RDS',
                            MetricName='CPUUtilization',
                            Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': instance_id}],
                            StartTime=start_time,
                            EndTime=end_time,
                            Period=3600,
                            Statistics=['Average', 'Maximum']
                        )

                        # Database Connections
                        conn_metrics = cloudwatch.get_metric_statistics(
                            Namespace='AWS/RDS',
                            MetricName='DatabaseConnections',
                            Dimensions=[{'Name': 'DBInstanceIdentifier', 'Value': instance_id}],
                            StartTime=start_time,
                            EndTime=end_time,
                            Period=3600,
                            Statistics=['Average', 'Maximum']
                        )

                        if not cpu_metrics['Datapoints']:
                            continue

                        avg_cpu = sum(p['Average'] for p in cpu_metrics['Datapoints']) / len(cpu_metrics['Datapoints'])
                        max_cpu = max(p['Maximum'] for p in cpu_metrics['Datapoints'])

                        avg_conns = 0
                        if conn_metrics['Datapoints']:
                            avg_conns = sum(p['Average'] for p in conn_metrics['Datapoints']) / len(conn_metrics['Datapoints'])

                        # Check if underutilized
                        if avg_cpu < self.cpu_thresholds['low'] and max_cpu < 60:
                            # DBInstanceClass looks like 'db.m5.large'; strip the
                            # 'db.' prefix so the family.size helper can parse it.
                            base_type = instance_class.replace('db.', '')
                            smaller_base = self._get_smaller_instance_type(base_type)
                            smaller_class = f"db.{smaller_base}" if smaller_base else None

                            if smaller_class:
                                # RDS pricing is roughly 2x EC2
                                current_cost = self._estimate_hourly_cost(base_type) * 2
                                new_cost = self._estimate_hourly_cost(smaller_base) * 2

                                monthly_savings = (current_cost - new_cost) * 730
                                annual_savings = monthly_savings * 12

                                self.total_savings += annual_savings

                                # Determine severity
                                if avg_cpu < self.cpu_thresholds['underutilized']:
                                    severity = "High"
                                else:
                                    severity = "Medium"

                                self.findings['rds'].append({
                                    'Region': region,
                                    'Instance ID': instance_id,
                                    'Engine': engine,
                                    'Current Class': instance_class,
                                    'Recommended Class': smaller_class,
                                    'Avg CPU (%)': f"{avg_cpu:.1f}",
                                    'Max CPU (%)': f"{max_cpu:.1f}",
                                    'Avg Connections': f"{avg_conns:.0f}",
                                    'Monthly Savings': f"${monthly_savings:.2f}",
                                    'Severity': severity
                                })

                    except Exception:
                        pass  # Skip instances without metrics

            except Exception as e:
                print(f"  Error scanning {region}: {e}")

        print(f"  Found {len(self.findings['rds'])} rightsizing opportunities")

    def print_report(self):
        """Print rightsizing report."""
        print("\n" + "="*110)
        print("RIGHTSIZING RECOMMENDATIONS")
        print("="*110)

        if self.findings['ec2']:
            print("\nEC2 RIGHTSIZING OPPORTUNITIES")
            print("-" * 110)
            sorted_ec2 = sorted(self.findings['ec2'],
                                key=lambda x: float(x['Monthly Savings'].replace('$', '')),
                                reverse=True)
            print(tabulate(sorted_ec2, headers='keys', tablefmt='grid'))

        if self.findings['rds']:
            print("\nRDS RIGHTSIZING OPPORTUNITIES")
            print("-" * 110)
            sorted_rds = sorted(self.findings['rds'],
                                key=lambda x: float(x['Monthly Savings'].replace('$', '')),
                                reverse=True)
            print(tabulate(sorted_rds, headers='keys', tablefmt='grid'))

        print("\n" + "="*110)
        print(f"TOTAL ANNUAL SAVINGS: ${self.total_savings:.2f}")
        print("="*110)

        print("\n\nRIGHTSIZING BEST PRACTICES:")
        print("\n1. Before Rightsizing:")
        print("   - Review metrics over a longer period (30+ days recommended)")
        print("   - Check for seasonal patterns or cyclical workloads")
        print("   - Verify that the current size isn't required for burst capacity")
        print("   - Review application performance requirements")

        print("\n2. Rightsizing Process:")
        print("   - Test in a non-production environment first")
        print("   - Schedule during a maintenance window")
        print("   - EC2: Stop instance → Change type → Start")
        print("   - RDS: Modify instance (causes brief downtime)")
        print("   - Monitor performance after the change")

        print("\n3. Important Considerations:")
        print("   - Some instance type changes aren't supported in place (requires a new instance)")
        print("   - EBS-optimized settings may change with instance type")
        print("   - Network performance varies by instance size")
        print("   - Consider vertical scaling limits vs horizontal scaling")

        print("\n4. Alternative Approaches:")
        print("   - Consider serverless options (Lambda, Fargate, Aurora Serverless)")
        print("   - Use Auto Scaling to match capacity to demand")
        print("   - Implement horizontal scaling instead of larger instances")
        print("   - Evaluate containerization for better resource utilization")

    def run(self):
        """Run rightsizing analysis."""
        print("Analyzing AWS resources for rightsizing opportunities...")
        print(f"Metrics period: {self.days} days")
        print(f"Scanning {len(self.regions)} region(s)...\n")

        self.analyze_ec2_instances()
        self.analyze_rds_instances()

        self.print_report()


def main():
    parser = argparse.ArgumentParser(
        description='Analyze AWS resources for rightsizing opportunities',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Analyze all regions (14 days of metrics)
  python3 rightsizing_analyzer.py

  # Analyze with 30 days of metrics for better accuracy
  python3 rightsizing_analyzer.py --days 30

  # Analyze a specific region
  python3 rightsizing_analyzer.py --region us-east-1

  # Use a named profile
  python3 rightsizing_analyzer.py --profile production
"""
    )

    parser.add_argument('--region', help='AWS region (default: all regions)')
    parser.add_argument('--profile', help='AWS profile name (default: default profile)')
    parser.add_argument('--days', type=int, default=14,
                        help='Days of metrics to analyze (default: 14)')

    args = parser.parse_args()

    try:
        analyzer = RightsizingAnalyzer(
            profile=args.profile,
            region=args.region,
            days=args.days
        )
        analyzer.run()
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == '__main__':
    main()
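The size step-down at the heart of the analyzer is pure string manipulation, so it can be exercised without AWS credentials. A minimal standalone sketch of the same logic (the `smaller_instance_type` name is illustrative; the script's method lives on the class):

```python
from typing import Optional

# Size progression mirrored from the analyzer's _get_smaller_instance_type.
SIZES = ['nano', 'micro', 'small', 'medium', 'large', 'xlarge', '2xlarge',
         '3xlarge', '4xlarge', '8xlarge', '9xlarge', '12xlarge', '16xlarge',
         '18xlarge', '24xlarge', '32xlarge']

def smaller_instance_type(current_type: str) -> Optional[str]:
    """Return the next size down in the same family, or None."""
    parts = current_type.split('.')
    if len(parts) != 2:
        return None  # e.g. an unstripped RDS class like 'db.m5.large'
    family, size = parts
    if size not in SIZES or SIZES.index(size) == 0:
        return None  # Unknown size, or already at the smallest
    return f"{family}.{SIZES[SIZES.index(size) - 1]}"

print(smaller_instance_type('m5.2xlarge'))   # m5.xlarge
print(smaller_instance_type('t3.nano'))      # None
print(smaller_instance_type('db.m5.large'))  # None — why RDS classes need the prefix stripped
```

The third call illustrates the trap with RDS classes: `db.m5.large` splits into three parts, so the `db.` prefix must be removed before calling the helper and re-added afterwards.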
281 scripts/spot_recommendations.py Executable file
@@ -0,0 +1,281 @@
#!/usr/bin/env python3
"""
Analyze EC2 workloads and recommend Spot instance opportunities.

This script identifies:
- Fault-tolerant workloads suitable for Spot instances
- Potential savings from Spot vs On-Demand
- Instances in Auto Scaling Groups (good Spot candidates)
- Non-critical workloads based on tags

Usage:
    python3 spot_recommendations.py [--region REGION] [--profile PROFILE]

Requirements:
    pip install boto3 tabulate
"""

import argparse
import sys
from datetime import datetime
from typing import Dict, List

import boto3
from tabulate import tabulate


class SpotRecommendationAnalyzer:
    def __init__(self, profile: str = None, region: str = None):
        self.session = boto3.Session(profile_name=profile) if profile else boto3.Session()
        self.regions = [region] if region else self._get_all_regions()
        self.recommendations = []
        self.total_savings = 0.0

        # Average Spot savings (typically 60-90% discount)
        self.spot_discount = 0.70  # Conservative 70% discount

        # Tags that indicate Spot suitability
        self.spot_friendly_tags = {
            'Environment': ['dev', 'development', 'test', 'testing', 'staging', 'qa'],
            'Workload': ['batch', 'processing', 'worker', 'ci', 'build'],
            'CriticalLevel': ['low', 'non-critical', 'noncritical']
        }

    def _get_all_regions(self) -> List[str]:
        """Get all enabled AWS regions."""
        ec2 = self.session.client('ec2', region_name='us-east-1')
        regions = ec2.describe_regions(AllRegions=False)
        return [region['RegionName'] for region in regions['Regions']]

    def _estimate_hourly_cost(self, instance_type: str) -> float:
        """Rough estimate of hourly on-demand cost."""
        cost_map = {
            't3.micro': 0.0104, 't3.small': 0.0208, 't3.medium': 0.0416,
            't3.large': 0.0832, 't3.xlarge': 0.1664, 't3.2xlarge': 0.3328,
            'm5.large': 0.096, 'm5.xlarge': 0.192, 'm5.2xlarge': 0.384,
            'm5.4xlarge': 0.768, 'm5.8xlarge': 1.536,
            'c5.large': 0.085, 'c5.xlarge': 0.17, 'c5.2xlarge': 0.34,
            'c5.4xlarge': 0.68, 'c5.9xlarge': 1.53,
            'r5.large': 0.126, 'r5.xlarge': 0.252, 'r5.2xlarge': 0.504,
            'r5.4xlarge': 1.008, 'r5.8xlarge': 2.016,
        }

        # Default fallback
        if instance_type not in cost_map:
            family = instance_type.split('.')[0]
            family_defaults = {'t3': 0.04, 'm5': 0.10, 'c5': 0.09, 'r5': 0.13}
            return family_defaults.get(family, 0.10)

        return cost_map[instance_type]

    def _calculate_suitability_score(self, instance: Dict, asg_member: bool) -> tuple:
        """Calculate Spot suitability score (0-100) and reasons."""
        score = 0
        reasons = []

        # Check if in Auto Scaling Group (high suitability)
        if asg_member:
            score += 40
            reasons.append("Part of Auto Scaling Group")

        # Check tags for environment/workload type
        tags = {tag['Key']: tag['Value'].lower() for tag in instance.get('Tags', [])}

        for key, spot_values in self.spot_friendly_tags.items():
            if key in tags and tags[key] in spot_values:
                score += 20
                reasons.append(f"{key}={tags[key]}")

        # Check instance age (older instances might be more stable)
        launch_time = instance['LaunchTime']
        days_running = (datetime.now(launch_time.tzinfo) - launch_time).days
        if days_running > 30:
            score += 10
            reasons.append(f"Running {days_running} days (stable)")

        # Check instance size (smaller instances have better Spot availability)
        instance_type = instance['InstanceType']
        if any(size in instance_type for size in ['micro', 'small', 'medium', 'large']):
            score += 15
            reasons.append("Standard size (good Spot availability)")

        # Default baseline
        if not reasons:
            score = 30
            reasons.append("General compute workload")

        return min(score, 100), reasons

    def analyze_instances(self):
        """Analyze EC2 instances for Spot opportunities."""
        print(f"\nAnalyzing EC2 instances across {len(self.regions)} region(s)...")

        for region in self.regions:
            try:
                ec2 = self.session.client('ec2', region_name=region)
                autoscaling = self.session.client('autoscaling', region_name=region)

                # Get all Auto Scaling Groups
                asg_instances = set()
                try:
                    asgs = autoscaling.describe_auto_scaling_groups()
                    for asg in asgs['AutoScalingGroups']:
                        for instance in asg['Instances']:
                            asg_instances.add(instance['InstanceId'])
                except Exception:
                    pass

                # Get all running instances. Note: the 'instance-lifecycle'
                # filter only accepts 'spot' and 'scheduled' — On-Demand
                # instances have no InstanceLifecycle field — so filter
                # client-side instead.
                instances = ec2.describe_instances(
                    Filters=[{'Name': 'instance-state-name', 'Values': ['running']}]
                )

                for reservation in instances['Reservations']:
                    for instance in reservation['Instances']:
                        # Skip instances that are already Spot or Scheduled
                        if instance.get('InstanceLifecycle'):
                            continue

                        instance_id = instance['InstanceId']
                        instance_type = instance['InstanceType']
                        asg_member = instance_id in asg_instances

                        # Calculate suitability
                        score, reasons = self._calculate_suitability_score(instance, asg_member)

                        # Calculate savings
                        hourly_cost = self._estimate_hourly_cost(instance_type)
                        monthly_savings = hourly_cost * 730 * self.spot_discount
                        annual_savings = monthly_savings * 12

                        self.total_savings += annual_savings

                        # Get instance name
                        name_tag = next((tag['Value'] for tag in instance.get('Tags', [])
                                         if tag['Key'] == 'Name'), 'N/A')

                        # Determine recommendation
                        if score >= 70:
                            recommendation = "Highly Recommended"
                        elif score >= 50:
                            recommendation = "Recommended"
                        elif score >= 30:
                            recommendation = "Consider (with caution)"
                        else:
                            recommendation = "Not Recommended"

                        self.recommendations.append({
                            'Region': region,
                            'Instance ID': instance_id,
                            'Name': name_tag,
                            'Type': instance_type,
                            'In ASG': 'Yes' if asg_member else 'No',
                            'Suitability Score': f"{score}/100",
                            'Monthly Savings': f"${monthly_savings:.2f}",
                            'Recommendation': recommendation,
                            'Reasons': ', '.join(reasons[:2])  # Show top 2 reasons
                        })

            except Exception as e:
                print(f"  Error scanning {region}: {e}")

        print(f"  Analyzed {len(self.recommendations)} instances")

    def print_report(self):
        """Print Spot recommendations report."""
        print("\n" + "="*120)
        print("SPOT INSTANCE RECOMMENDATIONS")
        print("="*120)

        # Sort by suitability score (descending)
        sorted_recs = sorted(self.recommendations,
                             key=lambda x: int(x['Suitability Score'].split('/')[0]),
                             reverse=True)

        if sorted_recs:
            print(tabulate(sorted_recs, headers='keys', tablefmt='grid'))

        print("\n" + "="*120)
        print(f"TOTAL ANNUAL SAVINGS POTENTIAL: ${self.total_savings:.2f}")
        print(f"(Assumes {int(self.spot_discount*100)}% average Spot discount)")
        print("="*120)

        print("\n\nSPOT INSTANCE BEST PRACTICES:")
        print("\n1. Use Spot Instances for:")
        print("   - Stateless applications")
        print("   - Batch processing jobs")
        print("   - CI/CD and build servers")
        print("   - Data analysis and processing")
        print("   - Dev/test/staging environments")
        print("   - Auto Scaling Groups with mixed instance types")

        print("\n2. Do NOT use Spot Instances for:")
        print("   - Databases without replicas")
        print("   - Stateful applications without checkpointing")
        print("   - Real-time, latency-sensitive services")
        print("   - Applications that can't handle interruptions")

        print("\n3. Spot Best Practices:")
        print("   - Use Spot Fleet or Auto Scaling Groups with Spot")
        print("   - Diversify across multiple instance types")
        print("   - Implement graceful shutdown handlers (2-minute warning)")
        print("   - Use Spot Instance interruption notices")
        print("   - Consider a Spot + On-Demand mix (e.g., 70/30)")
        print("   - Set an appropriate max price (typically the On-Demand price)")

        print("\n4. Implementation Steps:")
        print("   - Test Spot behavior in non-production first")
        print("   - Implement interruption handling in your application")
        print("   - Use EC2 Fleet or Auto Scaling with a mixed instances policy")
        print("   - Monitor Spot interruption rates")
        print("   - Set up CloudWatch alarms for Spot terminations")

        print("\n5. Tools to Use:")
        print("   - EC2 Spot Instance Advisor (check interruption rates)")
        print("   - Auto Scaling Groups with a mixed instances policy")
        print("   - Spot Fleet for diverse instance type selection")
        print("   - AWS Spot Instances best practices guide")

    def run(self):
        """Run Spot analysis."""
        print("="*80)
        print("AWS SPOT INSTANCE OPPORTUNITY ANALYZER")
        print("="*80)

        self.analyze_instances()
        self.print_report()


def main():
    parser = argparse.ArgumentParser(
        description='Analyze EC2 workloads for Spot instance opportunities',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Analyze all regions with the default profile
  python3 spot_recommendations.py

  # Analyze a specific region
  python3 spot_recommendations.py --region us-east-1

  # Use a named profile
  python3 spot_recommendations.py --profile production
"""
    )

    parser.add_argument('--region', help='AWS region (default: all regions)')
    parser.add_argument('--profile', help='AWS profile name (default: default profile)')

    args = parser.parse_args()

    try:
        analyzer = SpotRecommendationAnalyzer(
            profile=args.profile,
            region=args.region
        )
        analyzer.run()
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        sys.exit(1)


if __name__ == '__main__':
    main()
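The tag- and ASG-based scoring can likewise be tried offline against a hand-built instance dict. A simplified sketch of the same weighting (the age bonus and `CriticalLevel` tag omitted for brevity; weights mirror the script's):

```python
# Simplified offline sketch of Spot suitability scoring:
# +40 for ASG membership, +20 per spot-friendly tag, +15 for standard sizes.
SPOT_FRIENDLY_TAGS = {
    'Environment': ['dev', 'development', 'test', 'testing', 'staging', 'qa'],
    'Workload': ['batch', 'processing', 'worker', 'ci', 'build'],
}

def suitability_score(instance: dict, asg_member: bool) -> int:
    score = 0
    if asg_member:
        score += 40  # ASG members tolerate replacement, so Spot fits well
    tags = {t['Key']: t['Value'].lower() for t in instance.get('Tags', [])}
    for key, values in SPOT_FRIENDLY_TAGS.items():
        if tags.get(key) in values:
            score += 20
    # Smaller/standard sizes generally have better Spot availability
    if any(s in instance['InstanceType'] for s in ['micro', 'small', 'medium', 'large']):
        score += 15
    return min(score, 100)

demo = {'InstanceType': 'm5.large',
        'Tags': [{'Key': 'Environment', 'Value': 'staging'}]}
print(suitability_score(demo, asg_member=True))  # 75 → "Highly Recommended"
```

A score of 75 crosses the script's "Highly Recommended" threshold of 70; a bare untagged instance outside an ASG would only pick up the size bonus, landing in "Not Recommended" territory.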