Initial commit
362
references/best_practices.md
Normal file
@@ -0,0 +1,362 @@
# AWS Cost Optimization Best Practices

Comprehensive strategies for optimizing AWS costs across all major service categories.

## Table of Contents

1. [Compute Optimization](#compute-optimization)
2. [Storage Optimization](#storage-optimization)
3. [Network Optimization](#network-optimization)
4. [Database Optimization](#database-optimization)
5. [Container & Serverless Optimization](#container--serverless-optimization)
6. [General Principles](#general-principles)

---

## Compute Optimization

### EC2 Instance Optimization

**Right Instance Family**
- **General Purpose (T3, M5, M6i)**: Web servers, small-medium databases, dev environments
- **Compute Optimized (C5, C6i, C6g)**: CPU-intensive workloads, batch processing, HPC
- **Memory Optimized (R5, R6i, R6g)**: Databases, in-memory caches, big data
- **Storage Optimized (I3, D2)**: High IOPS, data warehousing, Hadoop

**Graviton Migration (ARM64)**
- Up to 20% cost savings with M6g, C6g, R6g, T4g instances
- Test compatibility first: Most modern languages/frameworks support ARM64
- Best for: Stateless applications, containerized workloads, open-source software

**Instance Sizing**
- Start small and scale up based on metrics
- Monitor CPU, memory, network for 2+ weeks before committing
- Use CloudWatch metrics to identify underutilized instances
- Consider burstable instances (T3) for variable workloads

**Purchase Options**
- **On-Demand**: Flexible, no commitment, highest cost
- **Reserved Instances**: 1- or 3-year commitment, up to 72% savings
  - Standard RI: Highest discount, least flexibility
  - Convertible RI: Moderate discount, can change instance types
- **Savings Plans**: Flexible commitment to compute spend, up to 66% savings
- **Spot Instances**: Up to 90% savings, suitable for fault-tolerant workloads

### Auto Scaling

**Horizontal Scaling**
- Scale out during peak, scale in during off-peak
- Use target tracking policies (CPU, ALB requests, custom metrics)
- Set minimum instances for high availability, maximum for cost control
- Consider scheduled scaling for predictable patterns

**Mixed Instances Policy**
- Combine instance types for better Spot availability
- Mix Spot and On-Demand for reliability
- Example: 70% Spot, 30% On-Demand for fault-tolerant apps
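
To see what a 70/30 split is worth, the blended hourly cost can be estimated with a few lines of arithmetic. This is a minimal sketch: the function name, the $0.10/hour On-Demand price, and the 65% Spot discount are illustrative assumptions, not quoted AWS prices.

```python
def blended_hourly_cost(instance_count, on_demand_price, spot_discount, spot_fraction):
    """Return the hourly cost of a fleet split between Spot and On-Demand.

    spot_discount is the fraction saved versus On-Demand (0.65 means 65% off);
    spot_fraction is the share of the fleet running on Spot.
    """
    if not 0 <= spot_fraction <= 1 or not 0 <= spot_discount <= 1:
        raise ValueError("fractions must be between 0 and 1")
    spot_price = on_demand_price * (1 - spot_discount)
    spot_cost = instance_count * spot_fraction * spot_price
    on_demand_cost = instance_count * (1 - spot_fraction) * on_demand_price
    return spot_cost + on_demand_cost


# Hypothetical fleet: 10 instances at $0.10/hour On-Demand, Spot at 65% off.
all_on_demand = blended_hourly_cost(10, 0.10, 0.65, 0.0)
mixed = blended_hourly_cost(10, 0.10, 0.65, 0.7)
print(f"On-Demand only: ${all_on_demand:.3f}/h, 70/30 mix: ${mixed:.3f}/h")
```

Spot prices vary by instance type and Availability Zone, so the discount should come from recent price history rather than a fixed figure.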

### Lambda Optimization

**Memory Configuration**
- Memory allocation determines CPU allocation
- More memory = faster execution = potentially lower cost
- Test different memory settings to find cost/performance sweet spot
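
The memory trade-off can be compared numerically once durations have been measured at each setting. The sketch below assumes x86 pricing of $0.0000166667 per GB-second and $0.20 per million requests, ignores the free tier, and uses made-up duration measurements; check current prices for your region before relying on the numbers.

```python
GB_SECOND_PRICE = 0.0000166667    # assumed x86 price per GB-second
REQUEST_PRICE = 0.20 / 1_000_000  # assumed price per request


def lambda_monthly_cost(memory_mb, avg_duration_ms, invocations):
    """Estimate monthly Lambda cost, ignoring the free tier."""
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * invocations
    return gb_seconds * GB_SECOND_PRICE + invocations * REQUEST_PRICE


# Hypothetical measurements: memory setting (MB) -> average duration (ms).
measurements = {512: 800, 1024: 350, 2048: 300}
costs = {
    memory: lambda_monthly_cost(memory, duration, 10_000_000)
    for memory, duration in measurements.items()
}
cheapest = min(costs, key=costs.get)
for memory, cost in costs.items():
    print(f"{memory} MB: ${cost:.2f}/month")
print(f"Cheapest setting: {cheapest} MB")
```

In this made-up data set the middle setting wins because the duration drop outweighs the higher per-millisecond rate, while the largest setting no longer speeds up enough to pay for itself.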

**Cold Start Mitigation**
- Provisioned concurrency for critical functions (adds cost)
- Keep functions warm with scheduled invocations
- Minimize deployment package size
- Use Lambda layers for shared dependencies

**Execution Time**
- Optimize code to reduce execution duration
- Duration is billed per millisecond, so small reductions add up at scale
- Consider Graviton2 (arm64) for 20% better price/performance

---

## Storage Optimization

### S3 Cost Optimization

**Storage Classes**
- **S3 Standard**: Frequently accessed data
- **S3 Intelligent-Tiering**: Auto-moves between tiers, ideal for unknown patterns
- **S3 Standard-IA**: Infrequent access, 50% cheaper than Standard
- **S3 One Zone-IA**: Non-critical, infrequent access, 20% cheaper than Standard-IA
- **S3 Glacier Instant Retrieval**: Archive with instant access, 68% cheaper
- **S3 Glacier Flexible Retrieval**: Archive, retrieval in minutes-hours, 77% cheaper
- **S3 Glacier Deep Archive**: Long-term archive, retrieval in 12 hours, 83% cheaper

**Lifecycle Policies**
- Automatically transition objects between storage classes
- Delete incomplete multipart uploads after 7 days
- Example policy:
  - 0-30 days: S3 Standard
  - 30-90 days: S3 Standard-IA
  - 90-365 days: S3 Glacier Flexible Retrieval
  - 365+ days: S3 Glacier Deep Archive or Delete
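
The example policy above amounts to a lookup from object age to storage class. The sketch below models that mapping in plain Python so the tier boundaries can be reviewed or unit-tested before they are turned into a real lifecycle configuration. The function and constant names are hypothetical, and the class names are display labels, not S3 API values.

```python
# Tiers from the example policy: (minimum age in days, storage class label).
LIFECYCLE_TIERS = [
    (365, "S3 Glacier Deep Archive"),
    (90, "S3 Glacier Flexible Retrieval"),
    (30, "S3 Standard-IA"),
    (0, "S3 Standard"),
]


def storage_class_for_age(age_days):
    """Return the storage class an object of the given age should be in."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    # Tiers are ordered from oldest to newest, so the first match wins.
    for min_age, storage_class in LIFECYCLE_TIERS:
        if age_days >= min_age:
            return storage_class


for age in (5, 45, 200, 800):
    print(f"{age} days -> {storage_class_for_age(age)}")
```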

**Request Optimization**
- Use CloudFront CDN to reduce S3 GET requests
- Batch operations instead of individual API calls
- Use S3 Select to retrieve subsets of data
- Enable S3 Transfer Acceleration for faster uploads (adds cost; only if needed)

**Cost Monitoring**
- Enable S3 Storage Lens for usage analytics
- Set up S3 Storage Class Analysis
- Monitor request costs (can exceed storage costs for small files)

### EBS Optimization

**Volume Types**
- **gp3**: General purpose, 20% cheaper than gp2, configurable IOPS/throughput
- **gp2**: Legacy general purpose (migrate to gp3)
- **io2**: High performance, mission-critical (only if needed)
- **st1**: Throughput-optimized HDD for big data (cheaper for sequential access)
- **sc1**: Cold HDD for infrequent access (cheapest)

**Snapshot Management**
- Delete old snapshots (they accumulate quickly)
- Use Amazon Data Lifecycle Manager for automated snapshot policies
- Snapshots are incremental, so deleting one frees only the blocks no other snapshot references; automate retention rather than deleting by hand
- Account for cross-region snapshot copy costs

**Volume Cleanup**
- Delete unattached volumes
- Right-size oversized volumes
- Consider EBS Elastic Volumes to modify without downtime

---

## Network Optimization

### Data Transfer Costs

**General Rules**
- **Free**: Inbound from internet, same-AZ traffic over private IPs
- **Cheap**: Same-region traffic across AZs
- **Expensive**: Cross-region, outbound to internet, CloudFront to origin

**Optimization Strategies**
- Colocate resources in same AZ when possible (consider HA trade-offs)
- Use VPC endpoints for AWS service access (avoids NAT/IGW costs)
- Implement caching with CloudFront, ElastiCache
- Compress data before transfer
- Use AWS PrivateLink instead of internet egress

### NAT Gateway Optimization

**Cost Structure**
- ~$32.85/month per NAT Gateway ($0.045/hour)
- Data processing charges: $0.045/GB

**Alternatives**
- **VPC Endpoints**: Direct access to AWS services (S3, DynamoDB, etc.)
  - Interface endpoints: $7.20/month per AZ + $0.01/GB
  - Gateway endpoints: Free for S3 and DynamoDB
- **NAT Instance**: Cheaper but requires management
- **Single NAT Gateway**: Use one instead of one per AZ (reduces HA)
- **S3 Gateway Endpoint**: Free alternative for S3 access

**When to Use What**
- High traffic to AWS services → VPC Endpoints
- Low traffic, dev/test → Single NAT Gateway or NAT instance
- Production, HA required → NAT Gateway per AZ
- S3 access only → S3 Gateway Endpoint (free)
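
Whether an interface endpoint pays off depends on how much AWS-service traffic currently flows through the NAT Gateway. The sketch below uses the prices quoted in this section and assumes the NAT Gateway stays in place for other traffic, so only its per-GB charge is avoided. The function names are illustrative.

```python
NAT_PER_GB = 0.045       # $/GB processed by the NAT Gateway (figure quoted above)
ENDPOINT_FIXED = 7.20    # $/month per interface endpoint (figure quoted above)
ENDPOINT_PER_GB = 0.01   # $/GB processed by the interface endpoint


def interface_endpoint_cost(gb_per_month, endpoints=1):
    """Monthly cost of routing traffic through interface endpoints."""
    return endpoints * ENDPOINT_FIXED + gb_per_month * ENDPOINT_PER_GB


def endpoint_savings(gb_per_month, endpoints=1):
    """Monthly savings from moving AWS-service traffic off an existing NAT Gateway.

    A negative result means the endpoint costs more than it saves.
    """
    avoided_nat_charges = gb_per_month * NAT_PER_GB
    return avoided_nat_charges - interface_endpoint_cost(gb_per_month, endpoints)


def break_even_gb(endpoints=1):
    """Traffic volume at which the endpoint's fixed fee is recovered."""
    return endpoints * ENDPOINT_FIXED / (NAT_PER_GB - ENDPOINT_PER_GB)


for volume in (100, 1000):
    print(f"{volume} GB/month: savings ${endpoint_savings(volume):.2f}")
print(f"Break-even: {break_even_gb():.0f} GB/month")
```

With these figures a single endpoint breaks even at roughly 200 GB per month. Gateway endpoints for S3 and DynamoDB have no such threshold because they are free.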

### CloudFront Optimization

**Use Cases for Savings**
- Reduce S3 data transfer costs (CloudFront egress is cheaper)
- Cache frequently accessed content
- Regional edge caches for less popular content

**Configuration**
- Use appropriate price class (exclude expensive regions if not needed)
- Set proper TTL to maximize cache hit ratio
- Use compression (gzip, brotli)
- Monitor cache hit ratio and adjust

---

## Database Optimization

### RDS Cost Optimization

**Instance Sizing**
- Right-size based on CloudWatch metrics (CPU, memory, connections)
- Consider burstable instances (db.t3) for variable workloads
- Graviton instances (db.m6g, db.r6g) offer 20% savings

**Storage Optimization**
- Use gp3 instead of gp2 (20% cheaper)
- Enable storage autoscaling with upper limit
- Delete old manual snapshots (automated backups expire with the retention period)
- Reduce backup retention period if possible

**High Availability Trade-offs**
- Multi-AZ roughly doubles instance cost (needed for production)
- Single-AZ acceptable for dev/test
- Read replicas for read scaling (cheaper than bigger instance)

**Aurora vs RDS**
- Aurora costs more but offers better scaling
- Aurora Serverless v2 for variable workloads
- Standard RDS for predictable workloads
- Community PostgreSQL/MySQL engines for dev/test

### DynamoDB Optimization

**Capacity Modes**
- **On-Demand**: Pay per request, unpredictable traffic
- **Provisioned**: Cheaper for consistent traffic, requires capacity planning
- **Reserved Capacity**: 1- or 3-year commitment for provisioned capacity

**Table Design**
- Use single-table design to minimize costs
- Implement GSI/LSI carefully (they add cost)
- Enable point-in-time recovery only if needed
- Use TTL to auto-expire old data

**Read Optimization**
- Use eventually consistent reads (50% cheaper than strongly consistent)
- Implement caching (DAX or ElastiCache)
- Batch operations when possible
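
The 50% figure for eventually consistent reads follows from how read capacity units are counted: one RCU covers one strongly consistent read per second of an item up to 4 KB, or two eventually consistent reads of that size. A minimal sketch of the calculation (the function name is illustrative):

```python
import math


def read_capacity_units(item_size_kb, reads_per_second, eventually_consistent=True):
    """Provisioned RCUs needed for a steady read rate."""
    # Item size is rounded up to the next 4 KB boundary for each read.
    units_per_read = math.ceil(item_size_kb / 4)
    rcus = units_per_read * reads_per_second
    if eventually_consistent:
        rcus = rcus / 2
    return math.ceil(rcus)


# Hypothetical workload: 6 KB items read 100 times per second.
strong = read_capacity_units(6, 100, eventually_consistent=False)
eventual = read_capacity_units(6, 100, eventually_consistent=True)
print(f"Strongly consistent: {strong} RCUs, eventually consistent: {eventual} RCUs")
```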

### ElastiCache Optimization

**Node Types**
- Graviton instances (cache.m6g, cache.r6g) for 20% savings
- Right-size based on memory usage and eviction rates

**Redis vs Memcached**
- Redis: More features, persistence, replication (more expensive)
- Memcached: Simpler, no persistence, multi-threaded (cheaper)

**Strategies**
- Reserved nodes for 30-55% savings
- Single-AZ for dev/test
- Monitor eviction rates to avoid over-provisioning

---

## Container & Serverless Optimization

### ECS/Fargate Optimization

**Compute Options**
- **EC2 Launch Type**: More control, cheaper for steady workloads
- **Fargate**: Serverless, easier management, better for variable loads
- **Fargate Spot**: Up to 70% savings for fault-tolerant tasks

**Graviton Support**
- Fargate ARM64 support available
- ECS on Graviton2 EC2 instances for 20% savings

**Right-sizing**
- Start with minimal CPU/memory, scale up based on metrics
- Use Container Insights for utilization data
- Consider task packing (multiple containers per task)

### EKS Optimization

**Control Plane**
- $73/month per cluster (consider consolidation)
- Use single cluster with namespaces when appropriate

**Worker Nodes**
- Use Spot instances for fault-tolerant pods (up to 90% savings)
- Managed node groups with Graviton instances
- Karpenter for intelligent autoscaling
- Mixed instance types for better Spot availability

**Cost Visibility**
- Kubecost or OpenCost for K8s cost attribution
- Resource requests/limits prevent waste
- Cluster autoscaler for automatic node scaling

---

## General Principles

### Tagging Strategy

**Cost Allocation Tags**
- Environment: prod, staging, dev, test
- Owner: team/person responsible
- Project: business initiative
- CostCenter: chargeback allocation
- Application: specific app name

**Tag Enforcement**
- Use AWS Organizations policies to enforce tagging
- Service Control Policies to prevent untagged resources
- AWS Config rules for compliance

### Monitoring and Governance

**Cost Monitoring Tools**
- AWS Cost Explorer: Historical analysis
- AWS Budgets: Proactive alerts
- Cost and Usage Reports: Detailed data export
- Cost Anomaly Detection: Automatic anomaly alerts

**Regular Reviews**
- Monthly cost review meetings
- Quarterly rightsizing exercises
- Annual Reserved Instance/Savings Plan optimization
- Automated reports to stakeholders

### Automation

**Infrastructure as Code**
- Define resource sizes in code (prevent oversizing)
- Automated cleanup of dev/test resources
- Scheduled shutdown of non-production resources

**Cost Optimization Tools**
- AWS Compute Optimizer: ML-based recommendations
- AWS Trusted Advisor: Best practice checks
- Third-party tools: CloudHealth, Cloudability, Spot.io

### Cultural Best Practices

**Engineering Ownership**
- Engineers should see cost impact of their changes
- Cost metrics in dashboards alongside performance
- Cost budgets for teams/projects

**Experiments and Cleanup**
- Tag experimental resources with expiration dates
- Automated cleanup of abandoned resources
- Regular audits of unused resources

**Cost-Aware Architecture**
- Design for cost from the beginning
- Choose appropriate service tiers
- Implement auto-scaling and right-sizing from day one
- Consider serverless and managed services

---

## Quick Wins Checklist

- [ ] Delete unattached EBS volumes
- [ ] Delete old EBS snapshots
- [ ] Release unused Elastic IPs
- [ ] Stop or terminate idle EC2 instances
- [ ] Right-size oversized instances
- [ ] Convert gp2 to gp3 volumes
- [ ] Enable S3 Intelligent-Tiering
- [ ] Set up S3 lifecycle policies
- [ ] Replace NAT Gateways with VPC Endpoints where possible
- [ ] Migrate to Graviton instances
- [ ] Purchase Reserved Instances/Savings Plans for stable workloads
- [ ] Use Spot instances for fault-tolerant workloads
- [ ] Delete old RDS snapshots
- [ ] Enable DynamoDB auto-scaling
- [ ] Set up cost allocation tags
- [ ] Enable AWS Budgets alerts
- [ ] Schedule shutdown of dev/test resources

740
references/finops_governance.md
Normal file
@@ -0,0 +1,740 @@
# FinOps Governance Framework

Organizational practices, processes, and governance for AWS cost optimization.

## Table of Contents

1. [FinOps Principles](#finops-principles)
2. [Cost Allocation & Tagging](#cost-allocation--tagging)
3. [Budget Management](#budget-management)
4. [Monthly Review Process](#monthly-review-process)
5. [Roles & Responsibilities](#roles--responsibilities)
6. [Chargeback & Showback](#chargeback--showback)
7. [Policy & Governance](#policy--governance)
8. [Metrics & KPIs](#metrics--kpis)

---

## FinOps Principles

### The FinOps Framework

FinOps is the practice of bringing financial accountability to cloud spending through collaboration between engineering, finance, and business teams.

**Core Principles:**

1. **Teams Need to Collaborate**
   - Engineering makes technical decisions
   - Finance provides visibility and reporting
   - Business sets priorities and budgets
   - Cross-functional cost optimization

2. **Everyone Takes Ownership**
   - Engineers see cost impact of their decisions
   - Teams have cost budgets and accountability
   - Cost is an efficiency metric, not just a finance concern

3. **Decisions Driven by Business Value**
   - Speed, quality, and cost trade-offs
   - Investment vs optimization decisions
   - ROI-based prioritization

4. **Take Advantage of Variable Cost Model**
   - Scale resources up and down as needed
   - Use different pricing models strategically
   - Optimize for actual usage patterns

5. **Centralized Team Drives FinOps**
   - Central FinOps team enables and supports
   - Distributed execution by product teams
   - Share best practices and tools

### FinOps Maturity Model

**Crawl Phase (Getting Started)**
- Basic cost visibility
- Manual reporting
- Ad-hoc optimization
- Initial tagging strategy
- Basic budget alerts

**Walk Phase (Improving)**
- Automated cost reporting
- Regular optimization reviews
- Systematic tagging enforcement
- Team cost allocation
- Reserved Instance planning
- Monthly optimization meetings

**Run Phase (Optimized)**
- Real-time cost visibility
- Automated optimization
- Cost-aware engineering culture
- Predictive forecasting
- Automated guardrails
- FinOps integrated in SDLC

---

## Cost Allocation & Tagging

### Tagging Strategy

**Required Tags (Enforce via Policy)**

```yaml
Required Tags:
  Environment:
    values: [prod, staging, dev, test]
    purpose: Separate production from non-production costs

  Owner:
    values: [email or team name]
    purpose: Contact for resource questions

  Project:
    values: [project code]
    purpose: Track project spending

  CostCenter:
    values: [department code]
    purpose: Chargeback allocation

  Application:
    values: [app name]
    purpose: Application-level cost tracking
```

**Optional but Recommended Tags**

```yaml
Optional Tags:
  ExpirationDate:
    format: YYYY-MM-DD
    purpose: Auto-cleanup scheduling

  DataClassification:
    values: [public, internal, confidential, restricted]
    purpose: Security and compliance

  BackupRequired:
    values: [true, false]
    purpose: Backup policy enforcement

  Criticality:
    values: [critical, high, medium, low]
    purpose: Priority and SLA determination
```

### Tag Enforcement

**Using AWS Organizations Service Control Policies (SCP)**

Each required tag gets its own statement: condition keys listed under one operator are ANDed, so a single combined statement would deny a launch only when every tag check failed at once.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyEC2WithoutValidEnvironmentTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "StringNotLike": {
          "aws:RequestTag/Environment": ["prod", "staging", "dev", "test"]
        }
      }
    },
    {
      "Sid": "DenyEC2WithoutOwnerTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {"aws:RequestTag/Owner": "true"}
      }
    },
    {
      "Sid": "DenyEC2WithoutProjectTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances"],
      "Resource": ["arn:aws:ec2:*:*:instance/*"],
      "Condition": {
        "Null": {"aws:RequestTag/Project": "true"}
      }
    }
  ]
}
```

**Using AWS Config Rules**

- **required-tags**: Flag supported resources that are missing required tags (detective, not preventive)
- **ec2-instance-no-public-ip**: Flag EC2 instances that have a public IP association
- Custom Lambda-based rules for complex logic

**Tag Compliance Monitoring**

```bash
# Example: Check tag compliance
# Run weekly to find untagged resources
# Note: this API only returns resources that are, or once were, tagged

aws resourcegroupstaggingapi get-resources \
  --query 'ResourceTagMappingList[?length(Tags) == `0`]' \
  --output table

# Or use Tag Editor in AWS Console
```
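
Beyond finding resources with no tags at all, a compliance check usually needs to confirm that each required key is present and that `Environment` holds an allowed value. The sketch below works on an in-memory inventory. The inventory shape, resource IDs, and function names are assumptions for illustration; in practice the tag dictionaries would come from whatever inventory export you already have.

```python
REQUIRED_TAGS = ["Environment", "Owner", "Project", "CostCenter", "Application"]
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "test"}


def tag_violations(tags):
    """Return a list of problems with one resource's tags (empty if compliant)."""
    problems = [f"missing {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    environment = tags.get("Environment")
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment={environment}")
    return problems


def compliance_rate(resources):
    """Percentage of resources with no tag violations."""
    if not resources:
        return 100.0
    compliant = sum(1 for tags in resources.values() if not tag_violations(tags))
    return 100.0 * compliant / len(resources)


# Hypothetical inventory: resource ID -> tag dictionary.
inventory = {
    "i-0aaa": {"Environment": "prod", "Owner": "platform-team", "Project": "phoenix",
               "CostCenter": "cc-100", "Application": "api"},
    "i-0bbb": {"Environment": "qa", "Owner": "data-team"},
}
for resource_id, tags in inventory.items():
    print(resource_id, tag_violations(tags) or "compliant")
print(f"Compliance: {compliance_rate(inventory):.0f}%")
```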

### Cost Allocation Tags

**Activating Cost Allocation Tags**

1. Go to AWS Billing → Cost Allocation Tags
2. Select user-defined tags to activate
3. Wait up to 24 hours for tags to appear in Cost Explorer
4. Tags only apply to charges after activation

**Best Practices**

- Activate tags before using them
- Use consistent naming (e.g., `Environment` not `Env` or `environment`)
- Document tag values in wiki/runbook
- Review and update tag strategy quarterly

---

## Budget Management

### AWS Budgets Setup

**Budget Types**

1. **Cost Budget**: Track spending against threshold
2. **Usage Budget**: Track service usage (e.g., EC2 hours)
3. **Savings Plans Budget**: Track commitment utilization
4. **Reservation Budget**: Track RI utilization

**Recommended Budgets**

**1. Overall Monthly Budget**
```yaml
Budget Name: Company-Wide-Monthly-Budget
Amount: $50,000/month
Alerts:
  - 50% actual: Email CFO, FinOps team
  - 80% actual: Email CFO, CTO, FinOps team
  - 100% forecasted: Email CFO, CTO, all team leads
  - 100% actual: Email everyone + Slack alert
```

**2. Per-Environment Budgets**
```yaml
Budget Name: Production-Environment-Budget
Amount: $30,000/month
Filter: Environment=prod
Alerts:
  - 80% actual: Email engineering leads
  - 100% forecasted: Email CTO + FinOps

Budget Name: Dev-Environment-Budget
Amount: $5,000/month
Filter: Environment=dev
Alerts:
  - 100% actual: Email dev team leads
  - 120% actual: Automated shutdown (if possible)
```

**3. Per-Team Budgets**
```yaml
Budget Name: Team-Platform-Budget
Amount: $15,000/month
Filter: Owner=platform-team
Alerts:
  - 90% actual: Email platform team
  - 100% forecasted: Email platform team + manager
```

**4. Per-Project Budgets**
```yaml
Budget Name: Project-Phoenix-Budget
Amount: $8,000/month
Filter: Project=phoenix
Alerts:
  - 75% actual: Email project owner
  - 100% actual: Email project owner + sponsor
```
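
Each alert line in these budgets pairs a percentage with either actual or forecasted spend. The sketch below shows how such thresholds could be evaluated for a given month. It models the logic only and is not the AWS Budgets API; the function name and the mid-month figures are hypothetical.

```python
def triggered_alerts(budget, actual, forecast, alerts):
    """Return every alert whose threshold has been crossed.

    alerts is a list of (percent, kind, recipients) tuples where kind is
    "actual" or "forecasted", mirroring the alert lines in the budgets above.
    """
    spend = {"actual": actual, "forecasted": forecast}
    fired = []
    for percent, kind, recipients in alerts:
        if spend[kind] >= budget * percent / 100:
            fired.append((percent, kind, recipients))
    return fired


company_alerts = [
    (50, "actual", "CFO, FinOps team"),
    (80, "actual", "CFO, CTO, FinOps team"),
    (100, "forecasted", "CFO, CTO, all team leads"),
    (100, "actual", "everyone + Slack"),
]
# Hypothetical mid-month position against the $50,000 company-wide budget.
fired = triggered_alerts(50_000, 31_000, 52_500, company_alerts)
for percent, kind, recipients in fired:
    print(f"{percent}% {kind}: notify {recipients}")
```

Here the forecasted alert fires before actual spend reaches 80%, which is the point of including forecast-based thresholds: they give earlier warning than actual-spend thresholds alone.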
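
### Budget Alert Actions

**Automated Responses to Budget Alerts**

```python
# Lambda function triggered by Budget alert SNS topic.
# Assumption: the SNS message body is JSON with "budgetName" and "threshold"
# (percent of budget). Native AWS Budgets notifications are plain text, so
# adapt the parsing below to the message format you actually receive.
import json


def stop_dev_instances():
    """Placeholder: stop non-production instances (e.g. tagged Environment=dev)."""


def send_slack_alert(text):
    """Placeholder: post the message to a Slack webhook."""
    print(text)


def create_cost_investigation_ticket():
    """Placeholder: open a JIRA ticket for cost investigation."""


def lambda_handler(event, context):
    # SNS delivers the payload as a string under Records[].Sns.Message
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    budget_name = message["budgetName"]
    threshold = float(message["threshold"])

    if threshold >= 100:
        stop_dev_instances()
        send_slack_alert(f"🚨 Budget {budget_name} exceeded!")
        create_cost_investigation_ticket()
    elif threshold >= 80:
        send_slack_alert(f"⚠️ Budget {budget_name} at {threshold:.0f}%")
```

---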

## Monthly Review Process

### FinOps Monthly Cadence

**Week 1: Data Collection**
- Export Cost & Usage Reports
- Run cost optimization scripts
- Gather CloudWatch metrics
- Compile anomaly reports

**Week 2: Analysis**
- Identify cost trends
- Find optimization opportunities
- Compare to previous months
- Analyze tag compliance

**Week 3: Team Review Meetings**
- Present findings to engineering teams
- Discuss optimization opportunities
- Assign action items
- Review upcoming projects

**Week 4: Executive Reporting**
- Create executive summary
- Present cost trends to leadership
- Report on optimization wins
- Forecast next quarter

### Monthly Review Meeting Agenda

**Attendees**: Engineering Leads, FinOps Team, Finance Rep, Product Manager

**Agenda (1 hour)**

1. **Previous Month Recap (10 min)**
   - Total spend vs budget
   - Top 5 services by cost
   - Month-over-month comparison
   - Budget variance explanation

2. **Cost Anomalies (10 min)**
   - Unusual spending spikes
   - Root cause analysis
   - Prevention measures

3. **Optimization Opportunities (15 min)**
   - Unused resources found
   - Rightsizing recommendations
   - Reserved Instance opportunities
   - Estimated savings

4. **Team Cost Breakdown (10 min)**
   - Per-team spending
   - Top spenders
   - Tag compliance status

5. **Upcoming Changes (10 min)**
   - New projects launching
   - Expected cost impact
   - Budget adjustments needed

6. **Action Items Review (5 min)**
   - Follow-up on previous items
   - Assign new action items
   - Set deadlines

**Deliverable**: Monthly FinOps Report (template provided)

### Monthly Report Template

```markdown
# AWS Cost Report - [Month Year]

## Executive Summary
- Total spend: $XX,XXX
- vs Budget: X% (under/over)
- vs Last month: +/-X%
- Optimization savings: $X,XXX

## Cost Breakdown
| Service | Cost | % of Total | MoM Change |
|---------|------|------------|------------|
| EC2     | $XX  | XX%        | +/-X%      |
| RDS     | $XX  | XX%        | +/-X%      |

## Optimization Actions Taken
1. Migrated 20 instances to Graviton (saved $X/month)
2. Purchased Reserved Instances (saved $X/month)
3. Deleted unused resources (saved $X/month)

## Recommendations for Next Month
1. Right-size 15 oversized instances (potential $X/month savings)
2. Implement S3 lifecycle policies (potential $X/month savings)

## Action Items
- [ ] [Owner] Task description (Deadline)
```

---

## Roles & Responsibilities

### FinOps Team Structure

**FinOps Lead**
- Owns overall cloud financial management
- Reports to CFO and CTO
- Sets FinOps strategy and goals
- Manages budget process

**Cloud Cost Analyst**
- Analyzes spending trends
- Generates reports and dashboards
- Identifies optimization opportunities
- Runs monthly review process

**Cloud Architect (FinOps focus)**
- Advises on cost-optimized architectures
- Implements cost optimization tools
- Trains engineers on FinOps practices
- Reviews architectural designs for cost impact

### Engineering Team Responsibilities

**Engineering Manager**
- Owns team budget
- Reviews monthly cost reports
- Prioritizes optimization work
- Ensures tagging compliance

**Engineers**
- Tag all resources they create
- Consider cost in design decisions
- Implement optimization recommendations
- Delete unused resources

**Platform/SRE Team**
- Implements cost optimization tooling
- Automates cost monitoring
- Provides cost visibility dashboards
- Enforces tagging policies

---

## Chargeback & Showback

### Showback (Visibility Only)

**Purpose**: Show teams their costs without charging them
**Goal**: Raise cost awareness

**Implementation**:
- Monthly cost reports per team
- Dashboard showing team spending
- Highlight cost trends
- No budget enforcement

**Best for**: Organizations new to FinOps

### Chargeback (Financial Accountability)

**Purpose**: Allocate costs back to business units
**Goal**: Financial accountability

**Implementation**:
- Tag-based cost allocation
- Transfer costs between cost centers
- Teams have hard budgets
- Overspending requires justification

**Best for**: Mature FinOps organizations

### Hybrid Model (Recommended)

**Shared Costs**: Charged to central IT
- VPC resources
- Security tools
- Monitoring infrastructure
- Shared services

**Team Costs**: Charged to teams
- Compute resources (EC2, Lambda)
- Databases
- Storage
- Application-specific services

**Implementation**:
```
Total AWS Bill: $100,000

Shared Costs (30%): $30,000
  → Charged to IT/Platform budget

Team Costs (70%): $70,000
  → Allocated by tags:
    - Team A (Project=alpha): $20,000
    - Team B (Project=beta): $25,000
    - Team C (Project=gamma): $15,000
    - Untagged (alert!): $10,000 → Needs investigation
```
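
The split above can be reproduced from billing line items with a small aggregation. This is a minimal sketch, assuming each line item has already been reduced to a service name, an optional `Project` tag, and a cost; the tuple shape, function name, and sample figures are hypothetical.

```python
from collections import defaultdict


def allocate_costs(line_items, shared_services):
    """Split billing line items into shared, per-project, and untagged totals.

    line_items is an iterable of (service, project_tag_or_None, cost).
    """
    totals = defaultdict(float)
    for service, project, cost in line_items:
        if service in shared_services:
            totals["shared"] += cost
        elif project:
            totals[f"project:{project}"] += cost
        else:
            # No Project tag on a non-shared service: surface for investigation.
            totals["untagged"] += cost
    return dict(totals)


# Hypothetical line items reproducing the $100,000 example.
bill = [
    ("VPC", None, 30_000),
    ("EC2", "alpha", 20_000),
    ("RDS", "beta", 25_000),
    ("S3", "gamma", 15_000),
    ("EC2", None, 10_000),
]
report = allocate_costs(bill, shared_services={"VPC"})
for bucket, cost in report.items():
    print(f"{bucket}: ${cost:,.0f}")
```

Keeping untagged spend in its own bucket, rather than spreading it across teams, makes tag compliance gaps visible in the monthly report.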

---

## Policy & Governance

### Cost Governance Policies

**1. Resource Creation Policies**

```yaml
Policy: All resources must be tagged
Enforcement: Service Control Policy (SCP)
Exception process: Request via FinOps team

Policy: Dev/test resources must auto-stop nights/weekends
Enforcement: AWS Instance Scheduler
Exception process: Tag with NoAutoStop=true (requires approval)

Policy: S3 buckets must have lifecycle policies
Enforcement: AWS Config rule
Exception process: Document justification in bucket tags
```

**2. Approval Workflows**

```yaml
# Spending thresholds requiring approval

< $1,000/month:
  - Auto-approved
  - Must be tagged

$1,000 - $5,000/month:
  - Engineering manager approval
  - Documented in JIRA

$5,000 - $20,000/month:
  - Director approval
  - Budget impact assessment
  - FinOps team review

> $20,000/month:
  - VP approval
  - Business case required
  - Quarterly review checkpoint
```
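
The thresholds translate directly into a lookup that a request form or pipeline check could call. In this sketch the function name is hypothetical, and the handling of the exact boundary values ($5,000 and $20,000 fall into the lower tier) is an assumption, since the ranges above overlap at their edges.

```python
def required_approval(monthly_cost):
    """Map an estimated monthly cost in dollars to the approval tier above."""
    if monthly_cost < 0:
        raise ValueError("monthly_cost must be non-negative")
    if monthly_cost < 1_000:
        return "auto-approved"
    if monthly_cost <= 5_000:
        return "engineering manager"
    if monthly_cost <= 20_000:
        return "director"
    return "VP"


for estimate in (250, 3_000, 12_000, 45_000):
    print(f"${estimate:,}/month -> {required_approval(estimate)}")
```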

**3. Reserved Instance / Savings Plans Policy**

```yaml
Policy: All commitments require FinOps review

Process:
  1. Team identifies workload suitable for commitment
  2. Submit request to FinOps with:
     - Resource details
     - Usage history (30+ days)
     - Business justification
  3. FinOps analyzes and recommends
  4. Finance approves commitment
  5. FinOps purchases and tracks utilization
```

### Automation & Guardrails

**Automated Actions**

```yaml
# Non-production resource scheduling
Schedule: Instance Scheduler
  - Stop all dev/test EC2/RDS instances at 7pm weekdays
  - Stop all dev/test instances all weekend
  - Start at 7am weekdays
  - Exception tag: NoAutoStop=true

# Untagged resource alerts
Trigger: AWS Config rule violation
Action:
  - Send Slack alert to team
  - Create JIRA ticket
  - Escalate if not tagged in 48 hours

# Old snapshot cleanup
Schedule: Weekly Lambda function
Action:
  - Delete snapshots older than 90 days (unless tagged KeepForever=true)
  - Notify teams of deletions
  - Estimate savings

# Budget breach response
Trigger: Budget > 100%
Action:
  - Email alerts to stakeholders
  - Create incident ticket
  - Stop non-production resources (optional)
```
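
The snapshot cleanup rule is simple to state but easy to get wrong around the exemption tag, so it helps to isolate the selection logic from the deletion call. The sketch below covers only the selection step. The snapshot dictionary shape and the function name are assumptions for illustration; the weekly Lambda would build these records from its own snapshot listing and perform the deletions separately.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_delete(snapshots, now, max_age_days=90):
    """Return IDs of snapshots older than max_age_days, skipping KeepForever=true.

    Each snapshot is a dict with "id", "start_time" (timezone-aware datetime)
    and "tags" (dict of tag key to value).
    """
    cutoff = now - timedelta(days=max_age_days)
    return [
        snap["id"]
        for snap in snapshots
        if snap["start_time"] < cutoff
        and snap.get("tags", {}).get("KeepForever", "").lower() != "true"
    ]


# Hypothetical snapshot records evaluated at a fixed point in time.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
snapshots = [
    {"id": "snap-old", "start_time": now - timedelta(days=120), "tags": {}},
    {"id": "snap-kept", "start_time": now - timedelta(days=400),
     "tags": {"KeepForever": "true"}},
    {"id": "snap-new", "start_time": now - timedelta(days=10), "tags": {}},
]
print(snapshots_to_delete(snapshots, now))
```

Passing `now` in as a parameter keeps the function deterministic, which makes the 90-day boundary straightforward to test.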

---

## Metrics & KPIs

### Key FinOps Metrics

**1. Cost Metrics**
```yaml
Total Monthly Cloud Spend:
  Target: Within budget
  Trend: Track month-over-month

Cost per Customer:
  Calculation: Total AWS Cost / Active Customers
  Target: Decreasing over time

Cost per Transaction:
  Calculation: Total AWS Cost / Transactions Processed
  Target: Optimize for efficiency

Unit Economics:
  Calculation: Revenue per Customer - Cost per Customer
  Target: Positive and growing
```
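As a minimal sketch, the three calculations above can be computed together from four inputs. The function and key names are illustrative assumptions, not part of any AWS tooling.

```python
def unit_economics(total_aws_cost, active_customers, transactions, revenue_per_customer):
    """Compute the cost metrics defined above for one billing period."""
    if active_customers <= 0 or transactions <= 0:
        raise ValueError("customer and transaction counts must be positive")
    cost_per_customer = total_aws_cost / active_customers
    cost_per_transaction = total_aws_cost / transactions
    return {
        "cost_per_customer": cost_per_customer,
        "cost_per_transaction": cost_per_transaction,
        # Unit economics: revenue per customer minus cost per customer
        "unit_margin": revenue_per_customer - cost_per_customer,
    }
```

Tracking these values month over month matters more than any single reading, since the targets above are about direction (decreasing cost per customer, growing margin).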

**2. Efficiency Metrics**
```yaml
Compute Utilization:
  Metric: Average CPU utilization
  Target: 40-60% (room for burst, not over-provisioned)

Storage Utilization:
  Metric: % of S3 in cost-optimized tiers
  Target: >60% in IA or Glacier tiers

Reserved Instance Coverage:
  Metric: % of On-Demand usage covered by RIs/SPs
  Target: >70% for stable workloads

RI/SP Utilization:
  Metric: % of RIs/SPs actually used
  Target: >90%
```
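Coverage and utilization are easy to confuse, so here is a small sketch of how the two differ. It assumes usage is measured in normalized instance-hours and that the three inputs come from your own billing export; the function name is hypothetical.

```python
def commitment_metrics(purchased_hours, used_hours, on_demand_hours):
    """Return (coverage, utilization) as fractions between 0 and 1.

    coverage:    share of eligible usage that ran under an RI or Savings Plan
    utilization: share of the purchased commitment that was actually used
    """
    eligible_hours = used_hours + on_demand_hours
    if purchased_hours <= 0 or eligible_hours <= 0:
        raise ValueError("hours must be positive")
    coverage = used_hours / eligible_hours
    utilization = used_hours / purchased_hours
    return coverage, utilization
```

Low utilization means commitments were over-purchased; low coverage means stable usage is still billed at On-Demand rates. The two targets above pull in opposite directions, which is why both are tracked.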

**3. Operational Metrics**
```yaml
Tag Compliance:
  Metric: % of resources with required tags
  Target: >95%

Budget Variance:
  Metric: Actual vs Budget %
  Target: ±5%

Optimization Savings:
  Metric: $ saved per month from optimizations
  Target: Growing

Mean Time to Optimize (MTTO):
  Metric: Days from finding opportunity to implementing
  Target: <30 days
```
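A rough sketch of the first two operational metrics follows. The required tag set is a hypothetical example; substitute the tags your own policy mandates.

```python
REQUIRED_TAGS = {"Owner", "Environment", "CostCenter"}  # hypothetical tag policy


def tag_compliance_pct(resource_tags):
    """Percentage of resources carrying every required tag.

    resource_tags is a list with one dict of tags per resource.
    """
    if not resource_tags:
        raise ValueError("no resources to evaluate")
    compliant = sum(1 for tags in resource_tags if REQUIRED_TAGS <= set(tags))
    return 100.0 * compliant / len(resource_tags)


def budget_variance_pct(actual, budget):
    """Signed variance of actual spend against budget, in percent."""
    return 100.0 * (actual - budget) / budget
```

A positive variance means overspend; the ±5% target above corresponds to `abs(budget_variance_pct(actual, budget)) <= 5`.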

**4. Organizational Metrics**
```yaml
FinOps Engagement:
  Metric: % of teams attending monthly reviews
  Target: 100%

Cost Awareness:
  Survey: Do engineers know their team's monthly cost?
  Target: >80% aware

Optimization Velocity:
  Metric: # optimization tasks completed per quarter
  Target: Growing trend
```

### Dashboard Requirements

**Executive Dashboard (Monthly)**
- Total spend vs budget
- Spend by service (top 10)
- Month-over-month trend
- Forecast for next quarter
- Optimization savings achieved

**Engineering Dashboard (Real-time)**
- Per-team costs (daily)
- Cost anomaly alerts
- Untagged resources count
- Budget utilization %
- Top cost drivers

**FinOps Dashboard (Daily)**
- Detailed service costs
- Tag compliance metrics
- RI/SP utilization
- Rightsizing opportunities
- Unused resource counts

---

## Getting Started Checklist

### Phase 1: Foundation (Month 1)
- [ ] Enable Cost Explorer
- [ ] Set up AWS Budgets
- [ ] Define tagging strategy
- [ ] Activate cost allocation tags
- [ ] Set up Cost and Usage Reports (CUR)
- [ ] Create basic cost dashboard

### Phase 2: Visibility (Months 2-3)
- [ ] Implement tagging enforcement
- [ ] Run first optimization scripts
- [ ] Set up monthly review meeting
- [ ] Create team cost reports
- [ ] Assign team cost owners
- [ ] Document FinOps processes

### Phase 3: Optimization (Months 4-6)
- [ ] Implement automated resource scheduling
- [ ] Purchase first Reserved Instances
- [ ] Set up cost anomaly detection
- [ ] Automate reporting
- [ ] Train engineering teams
- [ ] Implement showback/chargeback

### Phase 4: Culture (Ongoing)
- [ ] Cost metrics in engineering KPIs
- [ ] Cost review in architecture reviews
- [ ] Regular optimization sprints
- [ ] FinOps champions in each team
- [ ] Cost-aware development practices
- [ ] Continuous improvement

---

## Resources

**AWS Native Tools**
- AWS Cost Explorer
- AWS Budgets
- AWS Cost Anomaly Detection
- AWS Compute Optimizer
- AWS Trusted Advisor
- AWS Cost & Usage Reports

**Third-Party Tools**
- CloudHealth (VMware)
- Cloudability (Apptio)
- Kubecost (Kubernetes cost monitoring)
- Spot.io (Cost optimization platform)

**FinOps Foundation**
- https://www.finops.org
- FinOps Certified Practitioner certification
- FinOps community and best practices
466
references/service_alternatives.md
Normal file
@@ -0,0 +1,466 @@
# AWS Service Alternatives - Cost Optimization Guide

When to use cheaper alternatives and cost-effective service options for common AWS services.

## Table of Contents

1. [Compute Alternatives](#compute-alternatives)
2. [Storage Alternatives](#storage-alternatives)
3. [Database Alternatives](#database-alternatives)
4. [Networking Alternatives](#networking-alternatives)
5. [Application Services](#application-services)

---

## Compute Alternatives

### EC2 vs Lambda vs Fargate

**EC2 (Most Economical for Consistent Workloads)**
- **When to use**: 24/7 workloads, predictable traffic, need full OS control
- **Cost model**: Hourly charges, cheaper with Reserved Instances
- **Best for**: Always-on applications, legacy apps, specific OS/kernel requirements
- **Example**: Web server handling steady traffic → EC2 with Reserved Instance

**Lambda (Most Economical for Intermittent Work)**
- **When to use**: Event-driven, sporadic usage, < 15 minute executions
- **Cost model**: Pay per execution and duration (GB-seconds)
- **Best for**: APIs with sporadic traffic, scheduled tasks, event processing
- **Example**: Image processing triggered by S3 upload → Lambda
- **Break-even**: ~20-30 hours/month execution time vs equivalent EC2

**Fargate (Middle Ground)**
- **When to use**: Containerized apps, variable traffic, don't want to manage servers
- **Cost model**: Pay for vCPU and memory allocated
- **Best for**: Microservices, batch jobs, variable load applications
- **Example**: Background worker that scales 0-10 containers → Fargate
- **Tip**: Fargate Spot offers up to 70% savings for fault-tolerant tasks

**Decision Matrix**
```
Consistent 24/7 load → EC2 with Reserved Instances
Variable load, containerized → Fargate (or Fargate Spot)
Event-driven, < 15 min → Lambda
Batch processing → Fargate Spot or EC2 Spot
```
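To check where the Lambda break-even falls for a specific workload, a short calculation like the one below can help. The two Lambda rates are assumed x86 list prices and should be verified against current pricing for your region; the function names are illustrative, and the sketch ignores the free tier, data transfer, and any EC2 discounts.

```python
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667  # assumed list price; verify for your region
LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.20   # assumed list price; verify for your region


def lambda_monthly_cost(invocations, avg_duration_s, memory_gb):
    """Estimate monthly Lambda cost from request count, duration and memory."""
    compute = invocations * avg_duration_s * memory_gb * LAMBDA_PRICE_PER_GB_SECOND
    requests = invocations / 1_000_000 * LAMBDA_PRICE_PER_MILLION_REQUESTS
    return compute + requests


def break_even_hours(ec2_monthly_cost, memory_gb):
    """Hours of Lambda execution per month that cost as much as the EC2 instance.

    Request fees are ignored, so this slightly overstates the break-even point.
    """
    lambda_cost_per_hour = memory_gb * LAMBDA_PRICE_PER_GB_SECOND * 3600
    return ec2_monthly_cost / lambda_cost_per_hour
```

The result is sensitive to the memory setting and to the instance price you compare against, so any single break-even figure should be read as a rough guide and recomputed with your own numbers.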

### EC2 Instance Alternatives

**Standard vs Graviton (ARM64)**
- **Graviton Savings**: 20% cheaper for same performance
- **When to use**: Modern applications, ARM-compatible workloads
- **Alternatives**:
  - t3.large → t4g.large (20% cheaper)
  - m5.xlarge → m6g.xlarge (20% cheaper)
  - c5.2xlarge → c6g.2xlarge (20% cheaper)
- **Considerations**: Test application compatibility first

**Current vs Previous Generation**
- **Migration Savings**: 5-10% cheaper, better performance
- **Examples**:
  - t2 → t3 (10% cheaper, better performance)
  - m4 → m5 → m6i (progressive improvements)
  - c4 → c5 → c6i (better price/performance)
- **Action**: Check `detect_old_generations.py` script

**On-Demand vs Spot vs Reserved**
- **On-Demand**: $X/hour, highest cost, full flexibility
- **Spot**: 60-90% discount, can be interrupted
- **Reserved (1yr)**: 30-40% discount
- **Reserved (3yr)**: 50-65% discount
- **Decision**: Use Spot for fault-tolerant, RI for predictable, On-Demand for rest

---

## Storage Alternatives

### S3 Storage Classes

**Frequently Accessed Data**
```
S3 Standard → $0.023/GB/month
Use when: Accessing files multiple times per month
```

**Infrequently Accessed Data**
```
S3 Standard → S3 Standard-IA
$0.023/GB/month → $0.0125/GB/month (46% cheaper)
Retrieval cost: $0.01/GB
Break-even: < 1 access per month
Use when: Backups, disaster recovery, infrequently accessed files
```

**Unknown Access Patterns**
```
S3 Standard → S3 Intelligent-Tiering
$0.023/GB/month → Automatic optimization
Extra cost: $0.0025 per 1000 objects monitored
Use when: Unclear access patterns, don't want to manage lifecycle
Best for: Mixed workloads, analytics datasets
```

**Archive Storage**
```
S3 Standard → S3 Glacier Instant Retrieval
$0.023/GB → $0.004/GB (83% cheaper)
Retrieval: Milliseconds, $0.03/GB
Use when: Archive with immediate access needs (e.g., medical records)

S3 Standard → S3 Glacier Flexible Retrieval
$0.023/GB → $0.0036/GB (84% cheaper)
Retrieval: Minutes to hours, $0.01/GB
Use when: Archive data, acceptable retrieval delay

S3 Standard → S3 Glacier Deep Archive
$0.023/GB → $0.00099/GB (96% cheaper)
Retrieval: 12 hours, $0.02/GB
Use when: Long-term archive, regulatory compliance, rarely accessed
```

**Decision Tree**
```
Accessed daily → S3 Standard
Accessed monthly → S3 Standard-IA
Unknown pattern → S3 Intelligent-Tiering
Archive, instant access → Glacier Instant Retrieval
Archive, can wait hours → Glacier Flexible Retrieval
Archive, can wait 12 hours → Glacier Deep Archive
```
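The trade-off between storage price and retrieval fees can be made concrete with a small comparison, sketched below using the per-GB figures quoted in this section. It is deliberately simplified: request charges, minimum storage durations, minimum object sizes, and Intelligent-Tiering monitoring fees are not modeled, and the function names are illustrative.

```python
# (storage $/GB-month, retrieval $/GB), taken from the figures in this section
S3_CLASSES = {
    "Standard": (0.023, 0.0),
    "Standard-IA": (0.0125, 0.01),
    "Glacier Instant Retrieval": (0.004, 0.03),
    "Glacier Flexible Retrieval": (0.0036, 0.01),
    "Glacier Deep Archive": (0.00099, 0.02),
}


def monthly_cost(storage_class, stored_gb, retrieved_gb):
    """Storage plus retrieval cost for one month in the given class."""
    storage_price, retrieval_price = S3_CLASSES[storage_class]
    return stored_gb * storage_price + retrieved_gb * retrieval_price


def cheapest_class(stored_gb, retrieved_gb):
    """Class with the lowest monthly cost, considering price only (not latency)."""
    return min(S3_CLASSES, key=lambda name: monthly_cost(name, stored_gb, retrieved_gb))
```

Because the comparison looks at price alone, the result still has to be checked against how quickly the data must be retrievable, as the decision tree above describes.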

### EBS Volume Types

**General Purpose Volumes**
```
gp2 → gp3
$0.10/GB → $0.08/GB (20% cheaper)
Additional benefits: Configurable IOPS/throughput independent of size
Action: Convert all gp2 to gp3 (no downtime required)
```

**High Performance Workloads**
```
io1 → io2
Same price, better durability and IOPS
io2 Block Express: For highest performance needs

Consider: Do you really need provisioned IOPS?
Many workloads perform fine on gp3 (up to 16,000 IOPS)
Test gp3 before committing to io2
```

**Throughput-Optimized Workloads**
```
gp3 → st1 (Throughput Optimized HDD)
$0.08/GB → $0.045/GB (44% cheaper)
Use when: Big data, data warehouses, log processing
Sequential access patterns, throughput more important than IOPS
```

**Cold Data**
```
gp3 → sc1 (Cold HDD)
$0.08/GB → $0.015/GB (81% cheaper)
Use when: Infrequently accessed data, lowest cost priority
Example: Archive storage, cold backups
```
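A quick way to size the opportunity is to total the per-GB price difference across a volume inventory. The sketch below uses the storage prices quoted above; the inventory format (a list of type and size pairs) and the function name are assumptions for illustration. Only the storage price is modeled, so any IOPS or throughput provisioned above the gp3 baseline would be billed separately.

```python
EBS_PRICE_PER_GB_MONTH = {"gp2": 0.10, "gp3": 0.08, "st1": 0.045, "sc1": 0.015}


def migration_savings(volumes, source_type, target_type):
    """Monthly savings from moving every volume of source_type to target_type.

    volumes is a list of (volume_type, size_gb) tuples.
    """
    price_delta = EBS_PRICE_PER_GB_MONTH[source_type] - EBS_PRICE_PER_GB_MONTH[target_type]
    total_gb = sum(size_gb for vol_type, size_gb in volumes if vol_type == source_type)
    return total_gb * price_delta
```

For example, `migration_savings(inventory, "gp2", "gp3")` estimates the gp2 to gp3 conversion, while the same call with `"gp3"` and `"sc1"` estimates moving cold data to Cold HDD.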

### EFS vs S3 vs EBS

**S3 (Cheapest for Object Storage)**
- **Cost**: $0.023/GB/month (Standard)
- **When to use**: Object storage, static files, backups
- **Pros**: Unlimited scale, integrates with everything
- **Cons**: Not a file system, higher latency

**EBS (Best for Single-Instance Block Storage)**
- **Cost**: $0.08/GB/month (gp3)
- **When to use**: Boot volumes, database storage, single EC2 instance
- **Pros**: High performance, low latency
- **Cons**: Single-AZ, attached to one instance

**EFS (File System Across Multiple Instances)**
- **Cost**: $0.30/GB/month (Standard), $0.016/GB/month (IA)
- **When to use**: Shared file storage across multiple instances
- **Pros**: Multi-AZ, grows automatically, NFSv4
- **Cons**: More expensive than EBS
- **Optimization**: Use EFS Intelligent-Tiering to auto-move to IA class

**Decision Matrix**
```
Single instance, block storage → EBS
Multiple instances, shared files → EFS (with Intelligent-Tiering)
Object storage, static files → S3
Large data, high throughput → FSx for Lustre
Windows file shares → FSx for Windows
```

---

## Database Alternatives

### RDS vs Aurora vs Self-Managed

**RDS PostgreSQL/MySQL (Baseline)**
- **Cost**: Instance + storage
- **When to use**: Standard relational DB needs
- **Example**: db.t3.medium = ~$60/month + storage

**Aurora PostgreSQL/MySQL (2-3x RDS Cost)**
- **Cost**: Instance + storage + I/O charges
- **When to use**: Need high availability, auto-scaling storage, read replicas
- **Pros**: Better performance, automatic failover, up to 15 read replicas
- **Cons**: More expensive
- **Break-even**: High read traffic, need fast replication

**Aurora Serverless v2 (Variable Workloads)**
- **Cost**: Pay per ACU (Aurora Capacity Unit) per second
- **When to use**: Variable load, dev/test, infrequent usage
- **Example**: Dev database used 8 hours/day → 67% savings vs always-on
- **Limitation**: Min capacity charges apply

**Self-Managed on EC2 (Cheapest for Experts)**
- **Cost**: Just EC2 + EBS costs
- **When to use**: Full control needed, specific configuration, cost-sensitive
- **Pros**: Can be 50-70% cheaper than RDS
- **Cons**: You manage backups, patching, HA, monitoring
- **Consideration**: Factor in operational overhead

**Decision Matrix**
```
Standard workload, managed preferred → RDS
High availability, many reads → Aurora
Variable workload → Aurora Serverless v2
Cost-sensitive, have DBA expertise → Self-managed on EC2
Dev/test, intermittent use → Aurora Serverless v2
```

### DynamoDB Pricing Models

**On-Demand (Unpredictable Traffic)**
- **Cost**: $1.25 per million writes, $0.25 per million reads
- **When to use**: Variable traffic, new applications, spiky workloads
- **Pros**: No capacity planning, scales automatically
- **Example**: New API with unknown traffic pattern

**Provisioned Capacity (Predictable Traffic)**
- **Cost**: $0.00065 per WCU/hour, $0.00013 per RCU/hour
- **When to use**: Predictable traffic patterns
- **Savings**: 60-80% cheaper than on-demand at consistent usage
- **Example**: Application with steady 100 req/sec

**Reserved Capacity (Long-term Commitment)**
- **Cost**: Additional 30-50% discount on provisioned capacity
- **When to use**: Known long-term capacity needs
- **Commitment**: 1-3 years

**Break-Even Calculation**
```
On-Demand: $1.25 per million writes
Provisioned: ~$0.18 per million writes at 100% utilization ($0.00065 per WCU-hour / 3,600 writes per WCU-hour)
Break-even: ~14% average utilization of provisioned capacity ($0.18 / $1.25)

Action: Start with on-demand, switch to provisioned once patterns clear
```
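The break-even can be rederived from the on-demand and provisioned write rates quoted in this section. The sketch below assumes one WCU sustains one write per second for items up to 1 KB, so a WCU-hour covers at most 3,600 writes; the function names are illustrative.

```python
ON_DEMAND_PER_MILLION_WRITES = 1.25  # on-demand rate quoted in this section
PROVISIONED_PER_WCU_HOUR = 0.00065   # provisioned rate quoted in this section
WRITES_PER_WCU_HOUR = 3600           # 1 WCU = 1 write/second (items up to 1 KB)


def provisioned_cost_per_million_writes(utilization):
    """Effective provisioned cost per million writes at a given utilization (0-1]."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    writes_served_per_wcu_hour = WRITES_PER_WCU_HOUR * utilization
    return PROVISIONED_PER_WCU_HOUR / writes_served_per_wcu_hour * 1_000_000


def break_even_utilization():
    """Utilization at which provisioned and on-demand cost the same per write."""
    return provisioned_cost_per_million_writes(1.0) / ON_DEMAND_PER_MILLION_WRITES
```

In practice provisioned tables run with headroom for spikes, so the effective cost per write sits above the full-utilization figure; passing your observed average utilization to `provisioned_cost_per_million_writes` gives a more realistic comparison.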

### Database Migration Options

**From Commercial to Open Source**
```
Oracle → Aurora PostgreSQL or RDS PostgreSQL
Savings: 90% on licensing costs
Consider: PostgreSQL compatibility, migration effort

SQL Server → Aurora PostgreSQL or RDS PostgreSQL/MySQL
Savings: 50-90% on licensing costs
Consider: Application compatibility, migration effort
```

**From RDS to Aurora**
```
Only if: High availability requirements, many read replicas needed
Cost increase: 20-50% more
Benefit: Better performance, automatic failover, scaling
```

**From Aurora to RDS**
```
When: Don't need Aurora features, cost-conscious
Savings: 20-50%
Downgrade if: Single-AZ sufficient, limited read replicas needed
```

---

## Networking Alternatives

### NAT Gateway Alternatives

**NAT Gateway (Default, Expensive)**
- **Cost**: $32.85/month + $0.045/GB processed
- **When to use**: Production, high availability, easy management

**VPC Endpoints (Cheaper for AWS Services)**
- **Gateway Endpoint (S3, DynamoDB)**: FREE
- **Interface Endpoint**: $7.20/month + $0.01/GB
- **When to use**: Accessing S3, DynamoDB, or other AWS services
- **Savings**: $25-30/month vs NAT Gateway
- **Example**: Lambda accessing S3 → Use S3 Gateway Endpoint

**NAT Instance (Cheapest, More Work)**
- **Cost**: Just EC2 cost (e.g., t3.micro = $7.50/month)
- **When to use**: Dev/test, cost-sensitive, low traffic
- **Cons**: Must manage, less resilient, manual HA setup
- **Savings**: 75% vs NAT Gateway

**Decision Matrix**
```
S3 or DynamoDB only → Gateway Endpoint (FREE)
Other AWS services → Interface Endpoint
Production, high availability → NAT Gateway
Dev/test, low traffic → NAT Instance or single NAT Gateway
```
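Because NAT Gateway charges scale with traffic, the savings from endpoints grow with data volume. The sketch below compares monthly cost at a given volume using the fixed and per-GB figures from this section; it assumes a single NAT Gateway and a single interface endpoint in one Availability Zone, and the function names are illustrative.

```python
def nat_gateway_monthly(gb_processed):
    """Monthly NAT Gateway cost: fixed charge plus per-GB processing."""
    return 32.85 + 0.045 * gb_processed


def interface_endpoint_monthly(gb_processed):
    """Monthly cost of one interface endpoint in one Availability Zone."""
    return 7.20 + 0.01 * gb_processed


def endpoint_savings(gb_processed, gateway_endpoint=False):
    """Savings from routing this traffic through an endpoint instead of NAT.

    Gateway endpoints (S3, DynamoDB) carry no charge, so the whole NAT cost
    for that traffic is saved. This assumes the NAT Gateway can be removed;
    if it must stay for other traffic, only the per-GB portion is saved.
    """
    endpoint_cost = 0.0 if gateway_endpoint else interface_endpoint_monthly(gb_processed)
    return nat_gateway_monthly(gb_processed) - endpoint_cost
```

At low volumes the fixed charges dominate; at higher volumes the per-GB difference is what drives the decision.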

### Load Balancer Alternatives

**Application Load Balancer (ALB)**
- **Cost**: $16.20/month + LCU charges
- **When to use**: HTTP/HTTPS, path-based routing, microservices
- **Features**: Layer 7, content-based routing, Lambda targets

**Network Load Balancer (NLB)**
- **Cost**: $22.35/month + LCU charges
- **When to use**: TCP/UDP, extreme performance, static IPs
- **Use case**: Non-HTTP protocols, high throughput

**Classic Load Balancer (Legacy)**
- **Cost**: $18/month + data charges
- **Recommendation**: Migrate to ALB or NLB (better features, often cheaper)

**CloudFront + S3 (Static Content)**
- **Cost**: Much cheaper for static content
- **When to use**: Static website, single-page app
- **Setup**: S3 static hosting + CloudFront distribution
- **Savings**: 90% vs ALB for static content

**API Gateway (REST APIs)**
- **Cost**: Pay per request
- **When to use**: REST API, need API management features
- **Alternative to**: ALB for simple APIs

---

## Application Services

### Message Queue Alternatives

**SQS vs SNS vs EventBridge vs Kinesis**

**SQS (Point-to-Point, Cheapest)**
- **Cost**: $0.40 per million requests (Standard), $0.50 (FIFO)
- **When to use**: Work queues, decoupling services
- **Best for**: Job processing, task queues

**SNS (Pub/Sub, Cheap)**
- **Cost**: $0.50 per million publishes
- **When to use**: Fan-out notifications, multiple subscribers
- **Best for**: Notifications, multiple consumers

**EventBridge (Event Router)**
- **Cost**: $1.00 per million events
- **When to use**: Event-driven architecture, complex routing
- **Best for**: Cross-account events, SaaS integrations

**Kinesis (Streaming, Expensive)**
- **Cost**: $0.015 per shard-hour + PUT charges
- **When to use**: Real-time streaming, ordered processing
- **Best for**: Logs, analytics, real-time processing
- **Alternative**: Kinesis Data Firehose (simpler, cheaper for basic needs)

**Decision Matrix**
```
Simple queue → SQS
Multiple consumers → SNS
Complex event routing → EventBridge
Real-time streaming → Kinesis
Log aggregation → Kinesis Firehose
```

### Container Orchestration

**ECS vs EKS vs Fargate**

**ECS on EC2 (Cheapest)**
- **Cost**: Just EC2 costs (no ECS fee)
- **When to use**: AWS-native, simpler workloads
- **Best for**: Cost-sensitive, AWS-specific deployments

**ECS on Fargate (Serverless, Easy)**
- **Cost**: Pay per task (vCPU + memory)
- **When to use**: Variable load, don't want to manage servers
- **Best for**: Variable workloads, simpler operations

**EKS (Kubernetes, Expensive)**
- **Cost**: $73/month per cluster + node costs
- **When to use**: Need Kubernetes, multi-cloud, complex deployments
- **Best for**: Kubernetes expertise, need K8s ecosystem
- **Tip**: Consolidate workloads to fewer clusters

**Decision Matrix**
```
AWS-native, cost-sensitive → ECS on EC2
Variable load, easy management → ECS on Fargate
Need Kubernetes → EKS
Multiple environments → Consider single EKS cluster with namespaces
```

---

## Quick Reference: When to Switch

### Immediate Actions (Low Risk)
- [ ] gp2 → gp3 (20% savings, no downtime)
- [ ] S3 Standard → Intelligent-Tiering (auto-optimization)
- [ ] NAT Gateway → VPC Endpoints for S3/DynamoDB (free)
- [ ] Old generation instances → New generation (10-20% savings)
- [ ] Intel → Graviton (20% savings, test first)

### Medium Effort Actions
- [ ] On-Demand → Reserved Instances/Savings Plans (40-65% savings)
- [ ] Always-on EC2 → Lambda for intermittent work
- [ ] S3 Standard → Lifecycle policies (50-95% savings on old data)
- [ ] RDS On-Demand → Reserved Instances (40-65% savings)
- [ ] DynamoDB On-Demand → Provisioned (60-80% savings if predictable)

### High Effort Actions (Evaluate Carefully)
- [ ] RDS → Aurora (usually more expensive, only if need features)
- [ ] Aurora → RDS (20-50% savings if don't need Aurora features)
- [ ] Commercial DB → PostgreSQL (90% savings, migration effort)
- [ ] EC2 → Lambda (case-by-case, break-even analysis needed)
- [ ] ECS → EKS (usually more expensive, only if need K8s)

---

## Cost Comparison Tool

Use this mental model when evaluating alternatives:

```
1. Calculate current monthly cost
2. Calculate alternative monthly cost
3. Estimate migration effort (hours × $cost)
4. Calculate payback period: Migration Cost / Monthly Savings
5. Decide: Payback < 3 months → Likely worth it
           Payback > 6 months → Evaluate carefully
```

**Example:**
```
Current: ALB for static site = $20/month
Alternative: CloudFront + S3 = $2/month
Savings: $18/month
Migration: 4 hours × $100/hour = $400
Payback: $400 / $18 = 22 months → Maybe not worth it

But if: Multiple sites, reusable pattern → Worth the investment
```
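The five steps above translate directly into a small helper, sketched here. The function names are illustrative, and the label returned for a payback between 3 and 6 months is an assumption, since the model above does not say how to treat that range.

```python
def payback_months(current_monthly, alternative_monthly, migration_hours, hourly_rate):
    """Months needed for monthly savings to repay the one-off migration cost."""
    monthly_savings = current_monthly - alternative_monthly
    if monthly_savings <= 0:
        return float("inf")  # the alternative never pays for itself
    migration_cost = migration_hours * hourly_rate
    return migration_cost / monthly_savings


def verdict(months):
    """Map a payback period onto the decision thresholds from the steps above."""
    if months < 3:
        return "Likely worth it"
    if months > 6:
        return "Evaluate carefully"
    return "Judgment call"  # assumption: the 3-6 month range is not specified above
```

Calling `payback_months(20, 2, 4, 100)` reproduces the static-site example: a $400 migration repaid at $18 per month takes a little over 22 months.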