Initial commit

Zhongwei Li
2025-11-29 17:51:09 +08:00
commit 9d4643f587
14 changed files with 4713 additions and 0 deletions


@@ -0,0 +1,362 @@
# AWS Cost Optimization Best Practices
Comprehensive strategies for optimizing AWS costs across all major service categories.
## Table of Contents
1. [Compute Optimization](#compute-optimization)
2. [Storage Optimization](#storage-optimization)
3. [Network Optimization](#network-optimization)
4. [Database Optimization](#database-optimization)
5. [Container & Serverless Optimization](#container--serverless-optimization)
6. [General Principles](#general-principles)
---
## Compute Optimization
### EC2 Instance Optimization
**Right Instance Family**
- **General Purpose (T3, M5, M6i)**: Web servers, small-medium databases, dev environments
- **Compute Optimized (C5, C6i, C6g)**: CPU-intensive workloads, batch processing, HPC
- **Memory Optimized (R5, R6i, R6g)**: Databases, in-memory caches, big data
- **Storage Optimized (I3, D2)**: High IOPS, data warehousing, Hadoop
**Graviton Migration (ARM64)**
- Up to 20% cost savings with M6g, C6g, R6g, T4g instances
- Test compatibility first: Most modern languages/frameworks support ARM64
- Best for: Stateless applications, containerized workloads, open-source software
**Instance Sizing**
- Start small and scale up based on metrics
- Monitor CPU, memory, network for 2+ weeks before committing
- Use CloudWatch metrics to identify underutilized instances
- Consider burstable instances (T3) for variable workloads
**Purchase Options**
- **On-Demand**: Flexible, no commitment, highest cost
- **Reserved Instances**: 1- or 3-year commitment, up to 72% savings
  - Standard RI: Highest discount, least flexibility
  - Convertible RI: Lower discount, can be exchanged for other instance families
- **Savings Plans**: Flexible commitment to compute spend, up to 66% savings
- **Spot Instances**: Up to 90% savings, suitable for fault-tolerant workloads
### Auto Scaling
**Horizontal Scaling**
- Scale out during peak, scale in during off-peak
- Use target tracking policies (CPU, ALB requests, custom metrics)
- Set minimum instances for high availability, maximum for cost control
- Consider scheduled scaling for predictable patterns
**Mixed Instances Policy**
- Combine instance types for better Spot availability
- Mix Spot and On-Demand for reliability
- Example: 70% Spot, 30% On-Demand for fault-tolerant apps
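
The 70/30 split can be turned into a rough blended hourly rate. This is a minimal sketch, assuming a hypothetical On-Demand price of $0.10/hour and a 70% Spot discount (real Spot prices fluctuate); `blended_hourly_cost` is an illustrative helper, not an AWS API.

```python
def blended_hourly_cost(instances, on_demand_price, spot_fraction, spot_discount):
    """Estimate the hourly cost of a fleet split between Spot and On-Demand."""
    spot_price = on_demand_price * (1 - spot_discount)
    spot_count = instances * spot_fraction
    on_demand_count = instances - spot_count
    return spot_count * spot_price + on_demand_count * on_demand_price


# Hypothetical inputs: 10 instances, $0.10/hour On-Demand, 70% on Spot at a 70% discount.
mixed = blended_hourly_cost(10, 0.10, 0.70, 0.70)
all_on_demand = blended_hourly_cost(10, 0.10, 0.0, 0.0)
print(f"mixed: ${mixed:.2f}/h, all On-Demand: ${all_on_demand:.2f}/h")
```

Under these assumed prices the mixed fleet costs roughly half of the all-On-Demand fleet, while 30% of capacity stays safe from Spot interruptions.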
### Lambda Optimization
**Memory Configuration**
- Memory allocation determines CPU allocation
- More memory = faster execution = potentially lower cost
- Test different memory settings to find cost/performance sweet spot
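
To see why more memory can lower the bill, compare two configurations by GB-seconds. The sketch below assumes x86 list prices of $0.0000166667 per GB-second and $0.20 per million requests, ignores the free tier, and uses made-up duration measurements; check current pricing for your Region before relying on the numbers.

```python
# Assumed x86 prices for illustration only.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_MILLION_REQUESTS = 0.20


def lambda_monthly_cost(memory_mb, avg_duration_ms, invocations):
    """Estimate monthly Lambda cost from memory, average duration, and call count."""
    gb_seconds = (memory_mb / 1024) * (avg_duration_ms / 1000) * invocations
    compute = gb_seconds * PRICE_PER_GB_SECOND
    requests = invocations / 1_000_000 * PRICE_PER_MILLION_REQUESTS
    return compute + requests


# Hypothetical measurements: doubling memory cuts duration from 800 ms to 350 ms.
low = lambda_monthly_cost(512, 800, 10_000_000)
high = lambda_monthly_cost(1024, 350, 10_000_000)
print(f"512 MB: ${low:.2f}/month, 1024 MB: ${high:.2f}/month")
```

If the duration falls by more than half when memory doubles, the larger setting is both faster and cheaper; if it falls by less, the smaller setting wins on cost.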
**Cold Start Mitigation**
- Provisioned concurrency for critical functions (adds cost)
- Keep functions warm with scheduled invocations
- Minimize deployment package size
- Use Lambda layers for shared dependencies
**Execution Time**
- Optimize code to reduce execution duration
- Lambda bills duration in 1 ms increments, so every millisecond matters at scale
- Consider Graviton2 (arm64), priced about 20% lower per GB-second than x86
---
## Storage Optimization
### S3 Cost Optimization
**Storage Classes**
- **S3 Standard**: Frequently accessed data
- **S3 Intelligent-Tiering**: Auto-moves between tiers, ideal for unknown patterns
- **S3 Standard-IA**: Infrequent access, about 46% cheaper than Standard
- **S3 One Zone-IA**: Non-critical, infrequent access, 20% cheaper than Standard-IA
- **S3 Glacier Instant Retrieval**: Archive with millisecond access, about 83% cheaper than Standard
- **S3 Glacier Flexible Retrieval**: Archive, retrieval in minutes to hours, about 84% cheaper than Standard
- **S3 Glacier Deep Archive**: Long-term archive, retrieval within 12 hours, about 96% cheaper than Standard
**Lifecycle Policies**
- Automatically transition objects between storage classes
- Delete incomplete multipart uploads after 7 days
- Example policy:
  - 0-30 days: S3 Standard
  - 30-90 days: S3 Standard-IA
  - 90-365 days: S3 Glacier Flexible Retrieval
  - 365+ days: S3 Glacier Deep Archive or Delete
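
The example policy could be expressed as a lifecycle configuration like the one below. This is a sketch: the rule ID and bucket name are hypothetical, and the dictionary follows the shape accepted by boto3's `put_bucket_lifecycle_configuration`, where `GLACIER` is the storage class name for Glacier Flexible Retrieval.

```python
def build_lifecycle_configuration(prefix=""):
    """Return a lifecycle configuration mirroring the example tiers above."""
    return {
        "Rules": [
            {
                "ID": "tiered-archive",  # hypothetical rule name
                "Filter": {"Prefix": prefix},  # empty prefix applies to all objects
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Clean up abandoned multipart uploads after 7 days.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


config = build_lifecycle_configuration()
# With boto3 installed and credentials configured, this could be applied as:
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket", LifecycleConfiguration=config)
print(config["Rules"][0]["Transitions"])
```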
**Request Optimization**
- Use CloudFront CDN to reduce S3 GET requests
- Batch operations instead of individual API calls
- Use S3 Select to retrieve subsets of data
- Enable S3 Transfer Acceleration only where faster long-distance uploads justify its extra per-GB charge
**Cost Monitoring**
- Enable S3 Storage Lens for usage analytics
- Set up S3 Storage Class Analysis
- Monitor request costs (can exceed storage costs for small files)
### EBS Optimization
**Volume Types**
- **gp3**: General purpose, 20% cheaper than gp2, configurable IOPS/throughput
- **gp2**: Legacy general purpose (migrate to gp3)
- **io2**: High performance, mission-critical (only if needed)
- **st1**: Throughput-optimized HDD for big data (cheaper for sequential access)
- **sc1**: Cold HDD for infrequent access (cheapest)
**Snapshot Management**
- Delete old snapshots (they accumulate quickly)
- Use Amazon Data Lifecycle Manager to automate snapshot creation and retention
- Snapshots are incremental, so deleting one frees only the blocks that no other snapshot references
- Consider cross-region replication costs
**Volume Cleanup**
- Delete unattached volumes
- Right-size oversized volumes
- Consider EBS Elastic Volumes to modify without downtime
---
## Network Optimization
### Data Transfer Costs
**General Rules**
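- **Free**: Inbound from the internet; traffic within the same AZ over private IPs
- **Cheap**: Same-region traffic across AZs (charged per GB in each direction)
- **Expensive**: Cross-region traffic and outbound traffic to the internet (transfer from AWS origins to CloudFront is free)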
**Optimization Strategies**
- Colocate resources in same AZ when possible (consider HA trade-offs)
- Use VPC endpoints for AWS service access (avoids NAT/IGW costs)
- Implement caching with CloudFront, ElastiCache
- Compress data before transfer
- Use AWS PrivateLink instead of internet egress
### NAT Gateway Optimization
**Cost Structure**
- ~$32.85/month per NAT Gateway
- Data processing charges: $0.045/GB
**Alternatives**
- **VPC Endpoints**: Direct access to AWS services (S3, DynamoDB, etc.)
  - Interface endpoints: $7.20/month per endpoint per AZ + $0.01/GB
  - Gateway endpoints: Free for S3 and DynamoDB
- **NAT Instance**: Cheaper but requires management
- **Single NAT Gateway**: Use one instead of one per AZ (reduces HA)
- **S3 Gateway Endpoint**: Free alternative for S3 access
**When to Use What**
- High traffic to AWS services → VPC Endpoints
- Low traffic, dev/test → Single NAT Gateway or NAT instance
- Production, HA required → NAT Gateway per AZ
- S3 access only → S3 Gateway Endpoint (free)
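
A quick way to compare the options is to plug monthly traffic into the rates quoted above. The sketch below uses only this section's figures ($0.045/hour and $0.045/GB for a NAT Gateway, $7.20/month and $0.01/GB for an interface endpoint); the function names and the 1,000 GB traffic volume are illustrative assumptions.

```python
HOURS_PER_MONTH = 730


def nat_gateway_monthly(gb_processed, gateways=1):
    """Hourly charge (about $32.85/month each) plus per-GB processing."""
    return gateways * 0.045 * HOURS_PER_MONTH + gb_processed * 0.045


def interface_endpoint_monthly(gb_processed, endpoints=1):
    """Fixed monthly charge per endpoint plus per-GB processing."""
    return endpoints * 7.20 + gb_processed * 0.01


# Hypothetical workload: 1,000 GB/month of traffic to a single AWS service.
traffic_gb = 1000
nat_cost = nat_gateway_monthly(traffic_gb)
endpoint_cost = interface_endpoint_monthly(traffic_gb)
print(f"NAT Gateway: ${nat_cost:.2f}, interface endpoint: ${endpoint_cost:.2f}")
```

For S3 or DynamoDB traffic the gateway endpoint is free, so it beats both options regardless of volume.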
### CloudFront Optimization
**Use Cases for Savings**
- Reduce S3 data transfer costs (CloudFront egress is cheaper)
- Cache frequently accessed content
- Regional edge caches for less popular content
**Configuration**
- Use appropriate price class (exclude expensive regions if not needed)
- Set proper TTL to maximize cache hit ratio
- Use compression (gzip, brotli)
- Monitor cache hit ratio and adjust
---
## Database Optimization
### RDS Cost Optimization
**Instance Sizing**
- Right-size based on CloudWatch metrics (CPU, memory, connections)
- Consider burstable instances (db.t3) for variable workloads
- Graviton instances (db.m6g, db.r6g) offer 20% savings
**Storage Optimization**
- Use gp3 instead of gp2 (20% cheaper)
- Enable storage autoscaling with upper limit
- Delete manual snapshots that are no longer needed (automated backups expire with the retention period)
- Reduce backup retention period if possible
**High Availability Trade-offs**
- Multi-AZ doubles cost (needed for production)
- Single-AZ acceptable for dev/test
- Read replicas for read scaling (cheaper than bigger instance)
**Aurora vs RDS**
- Aurora costs more but offers better scaling
- Aurora Serverless v2 for variable workloads
- Standard RDS for predictable workloads
- PostgreSQL/MySQL community for dev/test
### DynamoDB Optimization
**Capacity Modes**
- **On-Demand**: Pay per request, unpredictable traffic
- **Provisioned**: Cheaper for consistent traffic, requires capacity planning
- **Reserved Capacity**: 1- or 3-year commitment for provisioned capacity
**Table Design**
- Use single-table design to minimize costs
- Implement GSI/LSI carefully (they add cost)
- Enable point-in-time recovery only if needed
- Use TTL to auto-expire old data
**Read Optimization**
- Use eventually consistent reads (50% cheaper than strongly consistent)
- Implement caching (DAX or ElastiCache)
- Batch operations when possible
### ElastiCache Optimization
**Node Types**
- Graviton instances (cache.m6g, cache.r6g) for 20% savings
- Right-size based on memory usage and eviction rates
**Redis vs Memcached**
- Redis: More features, persistence, replication (more expensive)
- Memcached: Simpler, no persistence, multi-threaded (cheaper)
**Strategies**
- Reserved nodes for 30-55% savings
- Single-AZ for dev/test
- Monitor eviction rates to avoid over-provisioning
---
## Container & Serverless Optimization
### ECS/Fargate Optimization
**Compute Options**
- **EC2 Launch Type**: More control, cheaper for steady workloads
- **Fargate**: Serverless, easier management, better for variable loads
- **Fargate Spot**: Up to 70% savings for fault-tolerant tasks
**Graviton Support**
- Fargate ARM64 support available
- ECS on Graviton2 EC2 instances for 20% savings
**Right-sizing**
- Start with minimal CPU/memory, scale up based on metrics
- Use Container Insights for utilization data
- Consider task packing (multiple containers per task)
### EKS Optimization
**Control Plane**
- $73/month per cluster (consider consolidation)
- Use single cluster with namespaces when appropriate
**Worker Nodes**
- Use Spot instances for fault-tolerant pods (up to 90% savings)
- Managed node groups with Graviton instances
- Karpenter for intelligent autoscaling
- Mixed instance types for better Spot availability
**Cost Visibility**
- Kubecost or OpenCost for K8s cost attribution
- Resource requests/limits prevent waste
- Cluster autoscaler for automatic node scaling
---
## General Principles
### Tagging Strategy
**Cost Allocation Tags**
- Environment: prod, staging, dev, test
- Owner: team/person responsible
- Project: business initiative
- CostCenter: chargeback allocation
- Application: specific app name
**Tag Enforcement**
- Use AWS Organizations policies to enforce tagging
- Service Control Policies to prevent untagged resources
- AWS Config rules for compliance
### Monitoring and Governance
**Cost Monitoring Tools**
- AWS Cost Explorer: Historical analysis
- AWS Budgets: Proactive alerts
- Cost and Usage Reports: Detailed data export
- Cost Anomaly Detection: Automatic anomaly alerts
**Regular Reviews**
- Monthly cost review meetings
- Quarterly rightsizing exercises
- Annual Reserved Instance/Savings Plan optimization
- Automated reports to stakeholders
### Automation
**Infrastructure as Code**
- Define resource sizes in code (prevent oversizing)
- Automated cleanup of dev/test resources
- Scheduled shutdown of non-production resources
**Cost Optimization Tools**
- AWS Compute Optimizer: ML-based recommendations
- AWS Trusted Advisor: Best practice checks
- Third-party tools: CloudHealth, Cloudability, Spot.io
### Cultural Best Practices
**Engineering Ownership**
- Engineers should see cost impact of their changes
- Cost metrics in dashboards alongside performance
- Cost budgets for teams/projects
**Experiments and Cleanup**
- Tag experimental resources with expiration dates
- Automated cleanup of abandoned resources
- Regular audits of unused resources
**Cost-Aware Architecture**
- Design for cost from the beginning
- Choose appropriate service tiers
- Implement auto-scaling and right-sizing from day one
- Consider serverless and managed services
---
## Quick Wins Checklist
- [ ] Delete unattached EBS volumes
- [ ] Delete old EBS snapshots
- [ ] Release unused Elastic IPs
- [ ] Stop or terminate idle EC2 instances
- [ ] Right-size oversized instances
- [ ] Convert gp2 to gp3 volumes
- [ ] Enable S3 Intelligent-Tiering
- [ ] Set up S3 lifecycle policies
- [ ] Replace NAT Gateways with VPC Endpoints where possible
- [ ] Migrate to Graviton instances
- [ ] Purchase Reserved Instances/Savings Plans for stable workloads
- [ ] Use Spot instances for fault-tolerant workloads
- [ ] Delete old RDS snapshots
- [ ] Enable DynamoDB auto-scaling
- [ ] Set up cost allocation tags
- [ ] Enable AWS Budgets alerts
- [ ] Schedule shutdown of dev/test resources


@@ -0,0 +1,740 @@
# FinOps Governance Framework
Organizational practices, processes, and governance for AWS cost optimization.
## Table of Contents
1. [FinOps Principles](#finops-principles)
2. [Cost Allocation & Tagging](#cost-allocation--tagging)
3. [Budget Management](#budget-management)
4. [Monthly Review Process](#monthly-review-process)
5. [Roles & Responsibilities](#roles--responsibilities)
6. [Chargeback & Showback](#chargeback--showback)
7. [Policy & Governance](#policy--governance)
8. [Metrics & KPIs](#metrics--kpis)
---
## FinOps Principles
### The FinOps Framework
FinOps is the practice of bringing financial accountability to cloud spending through collaboration between engineering, finance, and business teams.
**Core Principles:**
1. **Teams Need to Collaborate**
   - Engineering makes technical decisions
   - Finance provides visibility and reporting
   - Business sets priorities and budgets
   - Cross-functional cost optimization
2. **Everyone Takes Ownership**
   - Engineers see cost impact of their decisions
   - Teams have cost budgets and accountability
   - Cost is an efficiency metric, not just a finance concern
3. **Decisions Driven by Business Value**
   - Speed, quality, and cost trade-offs
   - Investment vs optimization decisions
   - ROI-based prioritization
4. **Take Advantage of Variable Cost Model**
   - Scale resources up and down as needed
   - Use different pricing models strategically
   - Optimize for actual usage patterns
5. **Centralized Team Drives FinOps**
   - Central FinOps team enables and supports product teams
   - Distributed execution by product teams
   - Share best practices and tools
### FinOps Maturity Model
**Crawl Phase (Getting Started)**
- Basic cost visibility
- Manual reporting
- Ad-hoc optimization
- Initial tagging strategy
- Basic budget alerts
**Walk Phase (Improving)**
- Automated cost reporting
- Regular optimization reviews
- Systematic tagging enforcement
- Team cost allocation
- Reserved Instance planning
- Monthly optimization meetings
**Run Phase (Optimized)**
- Real-time cost visibility
- Automated optimization
- Cost-aware engineering culture
- Predictive forecasting
- Automated guardrails
- FinOps integrated in SDLC
---
## Cost Allocation & Tagging
### Tagging Strategy
**Required Tags (Enforce via Policy)**
```yaml
Required Tags:
  Environment:
    values: [prod, staging, dev, test]
    purpose: Separate production from non-production costs
  Owner:
    values: [email or team name]
    purpose: Contact for resource questions
  Project:
    values: [project code]
    purpose: Track project spending
  CostCenter:
    values: [department code]
    purpose: Chargeback allocation
  Application:
    values: [app name]
    purpose: Application-level cost tracking
```
**Optional but Recommended Tags**
```yaml
Optional Tags:
  ExpirationDate:
    format: YYYY-MM-DD
    purpose: Auto-cleanup scheduling
  DataClassification:
    values: [public, internal, confidential, restricted]
    purpose: Security and compliance
  BackupRequired:
    values: [true, false]
    purpose: Backup policy enforcement
  Criticality:
    values: [critical, high, medium, low]
    purpose: Priority and SLA determination
```
### Tag Enforcement
**Using AWS Organizations Service Control Policies (SCP)**
```json
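{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyEC2WithoutValidEnvironmentTag",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestTag/Environment": ["prod", "staging", "dev", "test"]
        }
      }
    },
    {
      "Sid": "DenyEC2WithoutOwnerTag",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": { "aws:RequestTag/Owner": "true" }
      }
    },
    {
      "Sid": "DenyEC2WithoutProjectTag",
      "Effect": "Deny",
      "Action": "ec2:RunInstances",
      "Resource": "arn:aws:ec2:*:*:instance/*",
      "Condition": {
        "Null": { "aws:RequestTag/Project": "true" }
      }
    }
  ]
}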
```
**Using AWS Config Rules**
- **required-tags**: Flag resources of supported types that are missing required tags (detective, not preventive)
- **ec2-instance-no-public-ip**: Flag EC2 instances that have a public IP address
- Custom Lambda-based rules for complex logic
**Tag Compliance Monitoring**
```bash
# Example: Check tag compliance
# Run weekly to find untagged resources
aws resourcegroupstaggingapi get-resources \
--query 'ResourceTagMappingList[?length(Tags) == `0`]' \
--output table
# Or use Tag Editor in AWS Console
```
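
Tag compliance can also be checked in code once tags have been collected per resource. The following is a minimal sketch, assuming tags are already available as plain dictionaries; the resource IDs and tag values are made up, and `missing_tags` and `is_compliant` are illustrative helpers rather than AWS APIs.

```python
REQUIRED_TAGS = {"Environment", "Owner", "Project", "CostCenter", "Application"}
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "test"}


def missing_tags(tags):
    """Return the required tag keys that are absent or empty, sorted by name."""
    return sorted(key for key in REQUIRED_TAGS if not tags.get(key))


def is_compliant(tags):
    """A resource complies when all required tags exist and Environment is valid."""
    return not missing_tags(tags) and tags["Environment"] in ALLOWED_ENVIRONMENTS


# Hypothetical resources keyed by ID.
resources = {
    "i-0abc": {"Environment": "prod", "Owner": "platform-team", "Project": "phoenix",
               "CostCenter": "cc-100", "Application": "api"},
    "vol-0def": {"Environment": "sandbox", "Owner": "platform-team"},
}
for resource_id, tags in resources.items():
    print(resource_id, is_compliant(tags), missing_tags(tags))
```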
### Cost Allocation Tags
**Activating Cost Allocation Tags**
1. Go to AWS Billing → Cost Allocation Tags
2. Select user-defined tags to activate
3. Wait up to 24 hours for tags to appear in Cost Explorer
4. Tags only apply to charges after activation
**Best Practices**
- Activate tags as soon as they are first applied, so that as little spend as possible goes unattributed
- Use consistent naming (e.g., `Environment` not `Env` or `environment`)
- Document tag values in wiki/runbook
- Review and update tag strategy quarterly
---
## Budget Management
### AWS Budgets Setup
**Budget Types**
1. **Cost Budget**: Track spending against threshold
2. **Usage Budget**: Track service usage (e.g., EC2 hours)
3. **Savings Plans Budget**: Track commitment utilization
4. **Reservation Budget**: Track RI utilization
**Recommended Budgets**
**1. Overall Monthly Budget**
```yaml
Budget Name: Company-Wide-Monthly-Budget
Amount: $50,000/month
Alerts:
- 50% actual: Email CFO, FinOps team
- 80% actual: Email CFO, CTO, FinOps team
- 100% forecasted: Email CFO, CTO, all team leads
- 100% actual: Email everyone + Slack alert
```
**2. Per-Environment Budgets**
```yaml
Budget Name: Production-Environment-Budget
Amount: $30,000/month
Filter: Environment=prod
Alerts:
- 80% actual: Email engineering leads
- 100% forecasted: Email CTO + FinOps
Budget Name: Dev-Environment-Budget
Amount: $5,000/month
Filter: Environment=dev
Alerts:
- 100% actual: Email dev team leads
- 120% actual: Automated shutdown (if possible)
```
**3. Per-Team Budgets**
```yaml
Budget Name: Team-Platform-Budget
Amount: $15,000/month
Filter: Owner=platform-team
Alerts:
- 90% actual: Email platform team
- 100% forecasted: Email platform team + manager
```
**4. Per-Project Budgets**
```yaml
Budget Name: Project-Phoenix-Budget
Amount: $8,000/month
Filter: Project=phoenix
Alerts:
- 75% actual: Email project owner
- 100% actual: Email project owner + sponsor
```
### Budget Alert Actions
**Automated Responses to Budget Alerts**
```python
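import json

# Placeholder hooks: replace with real integrations (EC2 stop, Slack webhook, JIRA API).
def stop_dev_instances(): ...
def send_slack_alert(message): print(message)
def create_cost_investigation_ticket(): ...

# Lambda function triggered by Budget alert SNS topic
def lambda_handler(event, context):
    # SNS delivers the payload in Records[0].Sns.Message. This sketch assumes the
    # message is JSON with budgetName and threshold fields.
    alert = json.loads(event['Records'][0]['Sns']['Message'])
    budget_name = alert['budgetName']
    threshold = float(alert['threshold'])
    if threshold >= 100:
        # Stop non-production instances
        stop_dev_instances()
        # Send Slack alert
        send_slack_alert(f"🚨 Budget {budget_name} exceeded!")
        # Create JIRA ticket
        create_cost_investigation_ticket()
    elif threshold >= 80:
        # Send warning
        send_slack_alert(f"⚠️ Budget {budget_name} at {threshold:.0f}%")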
```
---
## Monthly Review Process
### FinOps Monthly Cadence
**Week 1: Data Collection**
- Export Cost & Usage Reports
- Run cost optimization scripts
- Gather CloudWatch metrics
- Compile anomaly reports
**Week 2: Analysis**
- Identify cost trends
- Find optimization opportunities
- Compare to previous months
- Analyze tag compliance
**Week 3: Team Review Meetings**
- Present findings to engineering teams
- Discuss optimization opportunities
- Assign action items
- Review upcoming projects
**Week 4: Executive Reporting**
- Create executive summary
- Present cost trends to leadership
- Report on optimization wins
- Forecast next quarter
### Monthly Review Meeting Agenda
**Attendees**: Engineering Leads, FinOps Team, Finance Rep, Product Manager
**Agenda (1 hour)**
1. **Previous Month Recap (10 min)**
- Total spend vs budget
- Top 5 services by cost
- Month-over-month comparison
- Budget variance explanation
2. **Cost Anomalies (10 min)**
- Unusual spending spikes
- Root cause analysis
- Prevention measures
3. **Optimization Opportunities (15 min)**
- Unused resources found
- Rightsizing recommendations
- Reserved Instance opportunities
- Estimated savings
4. **Team Cost Breakdown (10 min)**
- Per-team spending
- Top spenders
- Tag compliance status
5. **Upcoming Changes (10 min)**
- New projects launching
- Expected cost impact
- Budget adjustments needed
6. **Action Items Review (5 min)**
- Follow-up on previous items
- Assign new action items
- Set deadlines
**Deliverable**: Monthly FinOps Report (template provided)
### Monthly Report Template
```markdown
# AWS Cost Report - [Month Year]
## Executive Summary
- Total spend: $XX,XXX
- vs Budget: X% (under/over)
- vs Last month: +/-X%
- Optimization savings: $X,XXX
## Cost Breakdown
| Service | Cost | % of Total | MoM Change |
|---------|------|-----------|-----------|
| EC2 | $XX | XX% | +/-X% |
| RDS | $XX | XX% | +/-X% |
## Optimization Actions Taken
1. Migrated 20 instances to Graviton (saved $X/month)
2. Purchased Reserved Instances (saved $X/month)
3. Deleted unused resources (saved $X/month)
## Recommendations for Next Month
1. Right-size 15 oversized instances (potential $X/month savings)
2. Implement S3 lifecycle policies (potential $X/month savings)
## Action Items
- [ ] [Owner] Task description (Deadline)
```
---
## Roles & Responsibilities
### FinOps Team Structure
**FinOps Lead**
- Owns overall cloud financial management
- Reports to CFO and CTO
- Sets FinOps strategy and goals
- Manages budget process
**Cloud Cost Analyst**
- Analyzes spending trends
- Generates reports and dashboards
- Identifies optimization opportunities
- Runs monthly review process
**Cloud Architect (FinOps focus)**
- Advises on cost-optimized architectures
- Implements cost optimization tools
- Trains engineers on FinOps practices
- Reviews architectural designs for cost impact
### Engineering Team Responsibilities
**Engineering Manager**
- Owns team budget
- Reviews monthly cost reports
- Prioritizes optimization work
- Ensures tagging compliance
**Engineers**
- Tag all resources they create
- Consider cost in design decisions
- Implement optimization recommendations
- Delete unused resources
**Platform/SRE Team**
- Implements cost optimization tooling
- Automates cost monitoring
- Provides cost visibility dashboards
- Enforces tagging policies
---
## Chargeback & Showback
### Showback (Visibility Only)
**Purpose**: Show teams their costs without charging them
**Goal**: Raise cost awareness
**Implementation**:
- Monthly cost reports per team
- Dashboard showing team spending
- Highlight cost trends
- No budget enforcement
**Best for**: Organizations new to FinOps
### Chargeback (Financial Accountability)
**Purpose**: Allocate costs back to business units
**Goal**: Financial accountability
**Implementation**:
- Tag-based cost allocation
- Transfer costs between cost centers
- Teams have hard budgets
- Overspending requires justification
**Best for**: Mature FinOps organizations
### Hybrid Model (Recommended)
**Shared Costs**: Charged to central IT
- VPC resources
- Security tools
- Monitoring infrastructure
- Shared services
**Team Costs**: Charged to teams
- Compute resources (EC2, Lambda)
- Databases
- Storage
- Application-specific services
**Implementation**:
```
Total AWS Bill: $100,000

Shared Costs (30%): $30,000
  → Charged to IT/Platform budget

Team Costs (70%): $70,000
  → Allocated by tags:
    - Team A (Project=alpha): $20,000
    - Team B (Project=beta): $25,000
    - Team C (Project=gamma): $15,000
    - Untagged (alert!): $10,000 → Needs investigation
```
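
The same split can be computed from tagged cost data. This is a sketch under the assumptions of the example above (30% shared costs, per-project totals already aggregated by tag); `allocate_costs` is an illustrative name.

```python
def allocate_costs(total_bill, shared_fraction, tagged_costs):
    """Split a bill into shared, per-team, and untagged portions."""
    shared = total_bill * shared_fraction
    team_pool = total_bill - shared
    # Whatever is not covered by tags needs investigation.
    untagged = team_pool - sum(tagged_costs.values())
    return {"shared": shared, "teams": dict(tagged_costs), "untagged": untagged}


report = allocate_costs(
    100_000, 0.30, {"alpha": 20_000, "beta": 25_000, "gamma": 15_000}
)
print(f"shared: ${report['shared']:,.0f}, untagged: ${report['untagged']:,.0f}")
```

Tracking the untagged remainder as its own line item makes tagging gaps visible every month instead of hiding them inside a team's share.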
---
## Policy & Governance
### Cost Governance Policies
**1. Resource Creation Policies**
```yaml
- Policy: All resources must be tagged
  Enforcement: Service Control Policy (SCP)
  Exception process: Request via FinOps team

- Policy: Dev/test resources must auto-stop nights/weekends
  Enforcement: AWS Instance Scheduler
  Exception process: Tag with NoAutoStop=true (requires approval)

- Policy: S3 buckets must have lifecycle policies
  Enforcement: AWS Config rule
  Exception process: Document justification in bucket tags
```
**2. Approval Workflows**
```yaml
# Spending thresholds requiring approval
"< $1,000/month":
  - Auto-approved
  - Must be tagged

"$1,000 - $5,000/month":
  - Engineering manager approval
  - Documented in JIRA

"$5,000 - $20,000/month":
  - Director approval
  - Budget impact assessment
  - FinOps team review

"> $20,000/month":
  - VP approval
  - Business case required
  - Quarterly review checkpoint
```
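
The thresholds can be encoded as a small routing function for use in a request form or ticket automation. This sketch assumes that a request of exactly $5,000/month goes to the director tier, since the ranges above overlap at that boundary; the function name and return labels are illustrative.

```python
def required_approval(monthly_cost):
    """Map an estimated monthly cost in USD to the approval tier listed above."""
    if monthly_cost < 1_000:
        return "auto-approved"
    if monthly_cost < 5_000:
        return "engineering manager"
    if monthly_cost <= 20_000:  # assumption: $5,000 exactly routes to the director
        return "director"
    return "vp"


for cost in (500, 2_500, 12_000, 45_000):
    print(f"${cost:,}/month -> {required_approval(cost)}")
```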
**3. Reserved Instance / Savings Plans Policy**
```yaml
Policy: All commitments require FinOps review
Process:
  1. Team identifies workload suitable for commitment
  2. Submit request to FinOps with:
     - Resource details
     - Usage history (30+ days)
     - Business justification
  3. FinOps analyzes and recommends
  4. Finance approves commitment
  5. FinOps purchases and tracks utilization
```
### Automation & Guardrails
**Automated Actions**
```yaml
# Non-production resource scheduling
Schedule: Instance Scheduler
  - Stop all dev/test EC2/RDS instances at 7pm weekdays
  - Stop all dev/test instances all weekend
  - Start at 7am weekdays
  - Exception tag: NoAutoStop=true

# Untagged resource alerts
Trigger: AWS Config rule violation
Action:
  - Send Slack alert to team
  - Create JIRA ticket
  - Escalate if not tagged in 48 hours

# Old snapshot cleanup
Schedule: Weekly Lambda function
Action:
  - Delete snapshots older than 90 days (unless tagged KeepForever=true)
  - Notify teams of deletions
  - Estimate savings

# Budget breach response
Trigger: Budget > 100%
Action:
  - Email alerts to stakeholders
  - Create incident ticket
  - Stop non-production resources (optional)
```
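
The snapshot cleanup rule can be sketched as a pure selection function, which keeps the deletion logic easy to test before wiring it to AWS. The input shape (`SnapshotId`, `StartTime`, `Tags` as a list of `Key`/`Value` pairs) mirrors what boto3's `describe_snapshots` returns; the snapshot IDs and dates are made up.

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_delete(snapshots, now, max_age_days=90):
    """Select snapshot IDs older than the cutoff that are not tagged KeepForever=true."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        snap["SnapshotId"]
        for snap in snapshots
        if snap["StartTime"] < cutoff
        and not any(
            tag["Key"] == "KeepForever" and tag["Value"] == "true"
            for tag in snap.get("Tags", [])
        )
    ]


# Hypothetical inventory for illustration.
now = datetime(2025, 1, 1, tzinfo=timezone.utc)
snapshots = [
    {"SnapshotId": "snap-old", "StartTime": now - timedelta(days=120)},
    {"SnapshotId": "snap-keep", "StartTime": now - timedelta(days=400),
     "Tags": [{"Key": "KeepForever", "Value": "true"}]},
    {"SnapshotId": "snap-new", "StartTime": now - timedelta(days=10)},
]
print(snapshots_to_delete(snapshots, now))
```

Separating selection from deletion also allows a dry-run mode that notifies teams before anything is removed.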
---
## Metrics & KPIs
### Key FinOps Metrics
**1. Cost Metrics**
```yaml
Total Monthly Cloud Spend:
  Target: Within budget
  Trend: Track month-over-month

Cost per Customer:
  Calculation: Total AWS Cost / Active Customers
  Target: Decreasing over time

Cost per Transaction:
  Calculation: Total AWS Cost / Transactions Processed
  Target: Optimize for efficiency

Unit Economics:
  Calculation: Revenue per Customer - Cost per Customer
  Target: Positive and growing
```
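
As a worked example of the unit-cost calculations, the sketch below applies the formulas to made-up figures ($50,000 monthly AWS cost, 2,000 active customers, $40 revenue per customer); none of these numbers come from real data.

```python
def unit_economics(total_aws_cost, active_customers, revenue_per_customer):
    """Apply the Cost per Customer and Unit Economics formulas listed above."""
    cost_per_customer = total_aws_cost / active_customers
    return {
        "cost_per_customer": cost_per_customer,
        "margin_per_customer": revenue_per_customer - cost_per_customer,
    }


# Hypothetical month: $50,000 AWS bill, 2,000 customers, $40 revenue per customer.
metrics = unit_economics(50_000, 2_000, 40.0)
print(metrics)
```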
**2. Efficiency Metrics**
```yaml
Compute Utilization:
  Metric: Average CPU utilization
  Target: 40-60% (room for burst, not over-provisioned)

Storage Utilization:
  Metric: % of S3 in cost-optimized tiers
  Target: >60% in IA or Glacier tiers

Reserved Instance Coverage:
  Metric: % of On-Demand usage covered by RIs/SPs
  Target: >70% for stable workloads

RI/SP Utilization:
  Metric: % of RIs/SPs actually used
  Target: >90%
```
**3. Operational Metrics**
```yaml
Tag Compliance:
  Metric: % of resources with required tags
  Target: >95%

Budget Variance:
  Metric: Actual vs Budget %
  Target: ±5%

Optimization Savings:
  Metric: $ saved per month from optimizations
  Target: Growing

Mean Time to Optimize (MTTO):
  Metric: Days from finding opportunity to implementing
  Target: <30 days
```
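
Two of these metrics reduce to simple ratios. A minimal sketch with hypothetical inputs ($52,000 actual spend against a $50,000 budget, 940 of 1,000 resources tagged); the helper names are illustrative.

```python
def budget_variance_pct(actual, budget):
    """Signed variance as a percentage of budget (positive means overspend)."""
    return (actual - budget) / budget * 100


def tag_compliance_pct(tagged, total):
    """Share of resources carrying all required tags."""
    return tagged / total * 100 if total else 100.0


variance = budget_variance_pct(52_000, 50_000)
compliance = tag_compliance_pct(940, 1_000)
print(f"variance: {variance:+.1f}% (target ±5%), tag compliance: {compliance:.1f}% (target >95%)")
```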
**4. Organizational Metrics**
```yaml
FinOps Engagement:
  Metric: % of teams attending monthly reviews
  Target: 100%

Cost Awareness:
  Survey: Do engineers know their team's monthly cost?
  Target: >80% aware

Optimization Velocity:
  Metric: # optimization tasks completed per quarter
  Target: Growing trend
```
### Dashboard Requirements
**Executive Dashboard (Monthly)**
- Total spend vs budget
- Spend by service (top 10)
- Month-over-month trend
- Forecast for next quarter
- Optimization savings achieved
**Engineering Dashboard (Real-time)**
- Per-team costs (daily)
- Cost anomaly alerts
- Untagged resources count
- Budget utilization %
- Top cost drivers
**FinOps Dashboard (Daily)**
- Detailed service costs
- Tag compliance metrics
- RI/SP utilization
- Rightsizing opportunities
- Unused resource counts
---
## Getting Started Checklist
### Phase 1: Foundation (Month 1)
- [ ] Enable Cost Explorer
- [ ] Set up AWS Budgets
- [ ] Define tagging strategy
- [ ] Activate cost allocation tags
- [ ] Set up Cost and Usage Reports (CUR)
- [ ] Create basic cost dashboard
### Phase 2: Visibility (Months 2-3)
- [ ] Implement tagging enforcement
- [ ] Run first optimization scripts
- [ ] Set up monthly review meeting
- [ ] Create team cost reports
- [ ] Assign team cost owners
- [ ] Document FinOps processes
### Phase 3: Optimization (Months 4-6)
- [ ] Implement automated resource scheduling
- [ ] Purchase first Reserved Instances
- [ ] Set up cost anomaly detection
- [ ] Automate reporting
- [ ] Train engineering teams
- [ ] Implement showback/chargeback
### Phase 4: Culture (Ongoing)
- [ ] Cost metrics in engineering KPIs
- [ ] Cost review in architecture reviews
- [ ] Regular optimization sprints
- [ ] FinOps champions in each team
- [ ] Cost-aware development practices
- [ ] Continuous improvement
---
## Resources
**AWS Native Tools**
- AWS Cost Explorer
- AWS Budgets
- AWS Cost Anomaly Detection
- AWS Compute Optimizer
- AWS Trusted Advisor
- AWS Cost & Usage Reports
**Third-Party Tools**
- CloudHealth (VMware)
- Cloudability (Apptio)
- Kubecost (Kubernetes cost monitoring)
- Spot.io (Cost optimization platform)
**FinOps Foundation**
- https://www.finops.org
- FinOps Certified Practitioner certification
- FinOps community and best practices


@@ -0,0 +1,466 @@
# AWS Service Alternatives - Cost Optimization Guide
When to use cheaper alternatives and cost-effective service options for common AWS services.
## Table of Contents
1. [Compute Alternatives](#compute-alternatives)
2. [Storage Alternatives](#storage-alternatives)
3. [Database Alternatives](#database-alternatives)
4. [Networking Alternatives](#networking-alternatives)
5. [Application Services](#application-services)
---
## Compute Alternatives
### EC2 vs Lambda vs Fargate
**EC2 (Most Economical for Consistent Workloads)**
- **When to use**: 24/7 workloads, predictable traffic, need full OS control
- **Cost model**: Hourly charges, cheaper with Reserved Instances
- **Best for**: Always-on applications, legacy apps, specific OS/kernel requirements
- **Example**: Web server handling steady traffic → EC2 with Reserved Instance
**Lambda (Most Economical for Intermittent Work)**
- **When to use**: Event-driven, sporadic usage, < 15 minute executions
- **Cost model**: Pay per execution and duration (GB-seconds)
- **Best for**: APIs with sporadic traffic, scheduled tasks, event processing
- **Example**: Image processing triggered by S3 upload → Lambda
- **Break-even**: ~20-30 hours/month execution time vs equivalent EC2
**Fargate (Middle Ground)**
- **When to use**: Containerized apps, variable traffic, don't want to manage servers
- **Cost model**: Pay for vCPU and memory allocated
- **Best for**: Microservices, batch jobs, variable load applications
- **Example**: Background worker that scales 0-10 containers → Fargate
- **Tip**: Fargate Spot offers up to 70% savings for fault-tolerant tasks
**Decision Matrix**
```
Consistent 24/7 load → EC2 with Reserved Instances
Variable load, containerized → Fargate (or Fargate Spot)
Event-driven, < 15 min → Lambda
Batch processing → Fargate Spot or EC2 Spot
```
### EC2 Instance Alternatives
**Standard vs Graviton (ARM64)**
- **Graviton Savings**: 20% cheaper for same performance
- **When to use**: Modern applications, ARM-compatible workloads
- **Alternatives**:
- t3.large → t4g.large (20% cheaper)
- m5.xlarge → m6g.xlarge (20% cheaper)
- c5.2xlarge → c6g.2xlarge (20% cheaper)
- **Considerations**: Test application compatibility first
**Current vs Previous Generation**
- **Migration Savings**: 5-10% cheaper, better performance
- **Examples**:
- t2 → t3 (10% cheaper, better performance)
- m4 → m5 → m6i (progressive improvements)
- c4 → c5 → c6i (better price/performance)
- **Action**: Check `detect_old_generations.py` script
**On-Demand vs Spot vs Reserved**
- **On-Demand**: $X/hour, highest cost, full flexibility
- **Spot**: 60-90% discount, can be interrupted
- **Reserved (1yr)**: 30-40% discount
- **Reserved (3yr)**: 50-65% discount
- **Decision**: Use Spot for fault-tolerant, RI for predictable, On-Demand for rest
---
## Storage Alternatives
### S3 Storage Classes
**Frequently Accessed Data**
```
S3 Standard → $0.023/GB/month
Use when: Accessing files multiple times per month
```
**Infrequently Accessed Data**
```
S3 Standard → S3 Standard-IA
$0.023/GB/month → $0.0125/GB/month (46% cheaper)
Retrieval cost: $0.01/GB
Break-even: < 1 access per month
Use when: Backups, disaster recovery, infrequently accessed files
```
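
The break-even figure follows from the prices quoted above: Standard-IA saves $0.0105/GB/month on storage but charges $0.01/GB on retrieval. The sketch below uses only those quoted prices and ignores request charges, the 30-day minimum storage duration, and minimum object size, so treat it as a rough estimate.

```python
STANDARD_PER_GB = 0.023
STANDARD_IA_PER_GB = 0.0125
IA_RETRIEVAL_PER_GB = 0.01


def monthly_cost_per_gb(storage_class, full_reads_per_month):
    """Monthly cost of 1 GB that is read in full N times per month."""
    if storage_class == "STANDARD":
        return STANDARD_PER_GB
    return STANDARD_IA_PER_GB + IA_RETRIEVAL_PER_GB * full_reads_per_month


# Number of full reads per month at which both classes cost the same.
break_even_reads = (STANDARD_PER_GB - STANDARD_IA_PER_GB) / IA_RETRIEVAL_PER_GB
print(f"break-even: {break_even_reads:.2f} full reads per month")
```

Data read in full less than about once a month is cheaper in Standard-IA; data read more often is cheaper in Standard.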
**Unknown Access Patterns**
```
S3 Standard → S3 Intelligent-Tiering
$0.023/GB/month → Automatic optimization
Extra cost: $0.0025 per 1000 objects monitored
Use when: Unclear access patterns, don't want to manage lifecycle
Best for: Mixed workloads, analytics datasets
```
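Whether the monitoring fee pays for itself depends mostly on object size. The sketch below estimates the net monthly saving for a bucket; it assumes the infrequent-access tier is priced like Standard-IA ($0.0125/GB) and that `infrequent_fraction` (a hypothetical input) is the share of data that ends up in that tier.

```python
MONITORING_FEE_PER_1000_OBJECTS = 0.0025  # USD per month, from the block above
STANDARD_PER_GB = 0.023                   # USD per GB-month
INFREQUENT_TIER_PER_GB = 0.0125           # assumption: same price as Standard-IA


def monthly_net_saving(object_count, avg_object_mb, infrequent_fraction):
    """Storage saving minus monitoring fee, in USD per month."""
    total_gb = object_count * avg_object_mb / 1000
    storage_saving = (
        total_gb * infrequent_fraction * (STANDARD_PER_GB - INFREQUENT_TIER_PER_GB)
    )
    monitoring_fee = object_count / 1000 * MONITORING_FEE_PER_1000_OBJECTS
    return storage_saving - monitoring_fee


if __name__ == "__main__":
    # Large objects: the storage saving dominates the monitoring fee.
    print(f"5 MB objects:   {monthly_net_saving(1_000_000, 5, 0.6):+.2f} USD/month")
    # Small objects: the monitoring fee can exceed the saving.
    print(f"0.2 MB objects: {monthly_net_saving(1_000_000, 0.2, 0.6):+.2f} USD/month")
```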
**Archive Storage**
```
S3 Standard → S3 Glacier Instant Retrieval
$0.023/GB → $0.004/GB (83% cheaper)
Retrieval: Milliseconds, $0.03/GB
Use when: Archive with immediate access needs (e.g., medical records)

S3 Standard → S3 Glacier Flexible Retrieval
$0.023/GB → $0.0036/GB (84% cheaper)
Retrieval: Minutes to hours, $0.01/GB
Use when: Archive data, acceptable retrieval delay

S3 Standard → S3 Glacier Deep Archive
$0.023/GB → $0.00099/GB (96% cheaper)
Retrieval: 12 hours, $0.02/GB
Use when: Long-term archive, regulatory compliance, rarely accessed
```
**Decision Tree**
```
Accessed daily → S3 Standard
Accessed monthly → S3 Standard-IA
Unknown pattern → S3 Intelligent-Tiering
Archive, instant access → Glacier Instant Retrieval
Archive, can wait hours → Glacier Flexible Retrieval
Archive, can wait 12 hours → Glacier Deep Archive
```
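When data ages predictably, the decision tree can be expressed as a lifecycle rule rather than applied by hand. The sketch below only builds the rule as a plain dictionary in the shape used by the S3 lifecycle API (for example the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`); the day thresholds, the `logs/` prefix, and the rule ID format are illustrative assumptions, not recommendations from this guide.

```python
import json


def build_lifecycle_rule(prefix, ia_after_days=30, glacier_after_days=90,
                         deep_archive_after_days=365):
    """Build one lifecycle rule that tiers objects down as they age.

    All day thresholds are hypothetical defaults; tune them to real access logs.
    """
    return {
        "ID": f"tier-down-{prefix.strip('/') or 'all'}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_after_days, "StorageClass": "STANDARD_IA"},
            # "GLACIER" is the API name for Glacier Flexible Retrieval.
            {"Days": glacier_after_days, "StorageClass": "GLACIER"},
            {"Days": deep_archive_after_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }


if __name__ == "__main__":
    lifecycle_configuration = {"Rules": [build_lifecycle_rule("logs/")]}
    print(json.dumps(lifecycle_configuration, indent=2))
```

Nothing here calls AWS; review the printed configuration before applying it to a bucket.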
### EBS Volume Types
**General Purpose Volumes**
```
gp2 → gp3
$0.10/GB → $0.08/GB (20% cheaper)
Additional benefits: Configurable IOPS/throughput independent of size
Action: Convert all gp2 to gp3 (no downtime required)
```
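A quick way to size the gp2 → gp3 opportunity is to total the per-GB difference across volumes and flag the ones that need a closer look. This sketch uses the two prices above and the standard baselines (gp2: 3 IOPS per GiB, minimum 100, maximum 16,000; gp3: 3,000 IOPS included). It does not price extra provisioned IOPS or throughput on gp3, so flagged volumes will save somewhat less than shown.

```python
GP2_PER_GB = 0.10         # USD per GB-month, from the block above
GP3_PER_GB = 0.08         # USD per GB-month, from the block above
GP3_BASELINE_IOPS = 3000  # IOPS included with every gp3 volume


def gp2_baseline_iops(size_gib):
    """gp2 baseline performance: 3 IOPS per GiB, clamped to 100..16,000."""
    return min(max(3 * size_gib, 100), 16000)


def gp3_migration_report(volume_sizes_gib):
    """Per-volume saving plus a flag for volumes that exceed the gp3 baseline."""
    report = []
    for size in volume_sizes_gib:
        report.append({
            "size_gib": size,
            "monthly_saving": round(size * (GP2_PER_GB - GP3_PER_GB), 2),
            # True means extra gp3 IOPS must be provisioned to match gp2.
            "needs_extra_iops": gp2_baseline_iops(size) > GP3_BASELINE_IOPS,
        })
    return report


if __name__ == "__main__":
    for row in gp3_migration_report([100, 500, 2000]):
        print(row)
```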
**High Performance Workloads**
```
io1 → io2
Same price, better durability and IOPS
io2 Block Express: For highest performance needs
Consider: Do you really need provisioned IOPS?
Many workloads perform fine on gp3 (up to 16,000 IOPS)
Test gp3 before committing to io2
```
**Throughput-Optimized Workloads**
```
gp3 → st1 (Throughput Optimized HDD)
$0.08/GB → $0.045/GB (44% cheaper)
Use when: Big data, data warehouses, log processing
Sequential access patterns, throughput more important than IOPS
```
**Cold Data**
```
gp3 → sc1 (Cold HDD)
$0.08/GB → $0.015/GB (81% cheaper)
Use when: Infrequently accessed data, lowest cost priority
Example: Archive storage, cold backups
```
### EFS vs S3 vs EBS
**S3 (Cheapest for Object Storage)**
- **Cost**: $0.023/GB/month (Standard)
- **When to use**: Object storage, static files, backups
- **Pros**: Unlimited scale, integrates with everything
- **Cons**: Not a file system, higher latency
**EBS (Best for Single-Instance Block Storage)**
- **Cost**: $0.08/GB/month (gp3)
- **When to use**: Boot volumes, database storage, single EC2 instance
- **Pros**: High performance, low latency
- **Cons**: Single-AZ, attached to one instance
**EFS (File System Across Multiple Instances)**
- **Cost**: $0.30/GB/month (Standard), $0.016/GB/month (IA)
- **When to use**: Shared file storage across multiple instances
- **Pros**: Multi-AZ, grows automatically, NFSv4
- **Cons**: More expensive than EBS
- **Optimization**: Use EFS Intelligent-Tiering to auto-move to IA class
**Decision Matrix**
```
Single instance, block storage → EBS
Multiple instances, shared files → EFS (with Intelligent-Tiering)
Object storage, static files → S3
Large data, high throughput → FSx for Lustre
Windows file shares → FSx for Windows
```
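The per-GB prices quoted in this section differ by more than an order of magnitude, so it helps to put them side by side for a given data size. The sketch below does that and also shows how much EFS costs drop once part of the data sits in the IA class. The 80% IA share is a hypothetical input, and EFS IA access charges are not modeled because this guide does not list them.

```python
# USD per GB-month, copied from the bullets above.
PRICE_PER_GB_MONTH = {
    "s3_standard": 0.023,
    "ebs_gp3": 0.08,
    "efs_standard": 0.30,
    "efs_ia": 0.016,
}


def flat_cost(service, total_gb):
    """Monthly cost when all data sits in a single class."""
    return total_gb * PRICE_PER_GB_MONTH[service]


def efs_blended_cost(total_gb, ia_fraction):
    """Monthly EFS cost when `ia_fraction` of the data has moved to the IA class."""
    ia_gb = total_gb * ia_fraction
    standard_gb = total_gb - ia_gb
    return (standard_gb * PRICE_PER_GB_MONTH["efs_standard"]
            + ia_gb * PRICE_PER_GB_MONTH["efs_ia"])


if __name__ == "__main__":
    size_gb = 1000
    for service in ("s3_standard", "ebs_gp3", "efs_standard"):
        print(f"{service:<13} ${flat_cost(service, size_gb):7.2f}")
    print(f"{'efs, 80% IA':<13} ${efs_blended_cost(size_gb, 0.8):7.2f}")
```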
---
## Database Alternatives
### RDS vs Aurora vs Self-Managed
**RDS PostgreSQL/MySQL (Baseline)**
- **Cost**: Instance + storage
- **When to use**: Standard relational DB needs
- **Example**: db.t3.medium = ~$60/month + storage
**Aurora PostgreSQL/MySQL (Typically 20-50% More Than RDS)**
- **Cost**: Instance + storage + I/O charges
- **When to use**: Need high availability, auto-scaling storage, read replicas
- **Pros**: Better performance, automatic failover, up to 15 read replicas
- **Cons**: More expensive
- **Break-even**: High read traffic, need fast replication
**Aurora Serverless v2 (Variable Workloads)**
- **Cost**: Pay per ACU (Aurora Capacity Unit) per second
- **When to use**: Variable load, dev/test, infrequent usage
- **Example**: Dev database used 8 hours/day → up to ~67% savings vs always-on (billed for 8 of 24 hours)
- **Limitation**: Minimum capacity is still billed while idle, which reduces the savings
**Self-Managed on EC2 (Cheapest for Experts)**
- **Cost**: Just EC2 + EBS costs
- **When to use**: Full control needed, specific configuration, cost-sensitive
- **Pros**: Can be 50-70% cheaper than RDS
- **Cons**: You manage backups, patching, HA, monitoring
- **Consideration**: Factor in operational overhead
**Decision Matrix**
```
Standard workload, managed preferred → RDS
High availability, many reads → Aurora
Variable workload → Aurora Serverless v2
Cost-sensitive, have DBA expertise → Self-managed on EC2
Dev/test, intermittent use → Aurora Serverless v2
```
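The Aurora Serverless v2 saving for an intermittently used database depends on how much capacity is still billed while idle. The sketch below compares ACU-hours per day against running at the active capacity around the clock. The ACU values are hypothetical, and the comparison is serverless-versus-serverless; it does not translate ACUs into a provisioned instance price.

```python
def serverless_saving_fraction(active_hours_per_day, active_acu, idle_acu):
    """Fraction of ACU-hours saved versus holding `active_acu` for 24 hours a day."""
    idle_hours = 24 - active_hours_per_day
    serverless_acu_hours = active_hours_per_day * active_acu + idle_hours * idle_acu
    always_on_acu_hours = 24 * active_acu
    return 1 - serverless_acu_hours / always_on_acu_hours


if __name__ == "__main__":
    # Hypothetical dev database: 4 ACUs while in use, 8 hours a day.
    print(f"Idle at 0 ACU:   {serverless_saving_fraction(8, 4, 0):.0%} saved")
    print(f"Idle at 0.5 ACU: {serverless_saving_fraction(8, 4, 0.5):.0%} saved")
```

The gap between the two results is the cost of the minimum capacity setting.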
### DynamoDB Pricing Models
**On-Demand (Unpredictable Traffic)**
- **Cost**: $1.25 per million writes, $0.25 per million reads
- **When to use**: Variable traffic, new applications, spiky workloads
- **Pros**: No capacity planning, scales automatically
- **Example**: New API with unknown traffic pattern
**Provisioned Capacity (Predictable Traffic)**
- **Cost**: $0.00065 per WCU/hour, $0.00013 per RCU/hour
- **When to use**: Predictable traffic patterns
- **Savings**: 60-80% cheaper than on-demand at consistent usage
- **Example**: Application with steady 100 req/sec
**Reserved Capacity (Long-term Commitment)**
- **Cost**: Additional 30-50% discount on provisioned capacity
- **When to use**: Known long-term capacity needs
- **Commitment**: 1-3 years
**Break-Even Calculation**
```
On-Demand: $1.25 per million writes
Provisioned: ~$0.18 per million writes at 100% utilization ($0.00065 per WCU-hour / 3,600 writes)
Break-even: ~14% average utilization of provisioned capacity
Action: Start with on-demand, switch to provisioned once patterns clear
```
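The break-even can be rechecked against current prices by deriving the effective provisioned cost per million writes from the hourly WCU price. This is a minimal sketch using the prices listed in this section; it assumes standard writes of up to 1 KB (one WCU sustains one such write per second) and ignores reads, auto scaling lag, and reserved capacity.

```python
ON_DEMAND_PER_MILLION_WRITES = 1.25  # USD, from the On-Demand bullet above
PROVISIONED_PER_WCU_HOUR = 0.00065   # USD, from the Provisioned bullet above
WRITES_PER_WCU_HOUR = 3600           # one write per second for one hour


def provisioned_cost_per_million_writes(utilization):
    """Effective USD per million writes when only `utilization` (0-1] of WCUs is used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    writes_served_per_wcu_hour = WRITES_PER_WCU_HOUR * utilization
    return PROVISIONED_PER_WCU_HOUR / writes_served_per_wcu_hour * 1_000_000


def break_even_utilization():
    """Utilization at which provisioned and on-demand writes cost the same."""
    return provisioned_cost_per_million_writes(1.0) / ON_DEMAND_PER_MILLION_WRITES


if __name__ == "__main__":
    for utilization in (1.0, 0.5, 0.25):
        cost = provisioned_cost_per_million_writes(utilization)
        print(f"{utilization:>4.0%} utilized: ${cost:.2f} per million writes")
    print(f"Break-even utilization: {break_even_utilization():.1%}")
```

Below the break-even utilization, on-demand is the cheaper mode; above it, provisioned capacity wins.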
### Database Migration Options
**From Commercial to Open Source**
```
Oracle → Aurora PostgreSQL or RDS PostgreSQL
Savings: 90% on licensing costs
Consider: PostgreSQL compatibility, migration effort

SQL Server → Aurora PostgreSQL or RDS PostgreSQL/MySQL
Savings: 50-90% on licensing costs
Consider: Application compatibility, migration effort
```
**From RDS to Aurora**
```
Only if: High availability requirements, many read replicas needed
Cost increase: 20-50% more
Benefit: Better performance, automatic failover, scaling
```
**From Aurora to RDS**
```
When: Don't need Aurora features, cost-conscious
Savings: 20-50%
Downgrade if: Single-AZ sufficient, limited read replicas needed
```
---
## Networking Alternatives
### NAT Gateway Alternatives
**NAT Gateway (Default, Expensive)**
- **Cost**: $32.85/month + $0.045/GB processed
- **When to use**: Production, high availability, easy management
**VPC Endpoints (Cheaper for AWS Services)**
- **Gateway Endpoint (S3, DynamoDB)**: FREE
- **Interface Endpoint**: $7.20/month + $0.01/GB
- **When to use**: Accessing S3, DynamoDB, or other AWS services
- **Savings**: About $25-33/month in fixed charges for each NAT Gateway removed, plus lower per-GB processing fees
- **Example**: Lambda accessing S3 → Use S3 Gateway Endpoint
**NAT Instance (Cheapest, More Work)**
- **Cost**: Just EC2 cost (e.g., t3.micro = $7.50/month)
- **When to use**: Dev/test, cost-sensitive, low traffic
- **Cons**: Must manage, less resilient, manual HA setup
- **Savings**: 75% vs NAT Gateway
**Decision Matrix**
```
S3 or DynamoDB only → Gateway Endpoint (FREE)
Other AWS services → Interface Endpoint
Production, high availability → NAT Gateway
Dev/test, low traffic → NAT Instance or single NAT Gateway
```
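Because NAT Gateway cost grows with traffic, the comparison is easiest to see as a function of GB processed per month. The sketch below uses only the fixed and per-GB prices listed above; the traffic volume and endpoint count are hypothetical inputs.

```python
NAT_GATEWAY_FIXED = 32.85          # USD per month, from the list above
NAT_GATEWAY_PER_GB = 0.045         # USD per GB processed
INTERFACE_ENDPOINT_FIXED = 7.20    # USD per month per endpoint
INTERFACE_ENDPOINT_PER_GB = 0.01   # USD per GB processed


def nat_gateway_monthly(gb_processed):
    """Monthly cost of one NAT Gateway."""
    return NAT_GATEWAY_FIXED + gb_processed * NAT_GATEWAY_PER_GB


def interface_endpoint_monthly(gb_processed, endpoint_count=1):
    """Monthly cost of routing the same traffic through interface endpoints."""
    return (endpoint_count * INTERFACE_ENDPOINT_FIXED
            + gb_processed * INTERFACE_ENDPOINT_PER_GB)


def gateway_endpoint_monthly(gb_processed):
    """S3 and DynamoDB gateway endpoints carry no charge."""
    return 0.0


if __name__ == "__main__":
    traffic_gb = 500  # hypothetical monthly traffic to AWS services
    print(f"NAT Gateway:           ${nat_gateway_monthly(traffic_gb):.2f}")
    print(f"1 interface endpoint:  ${interface_endpoint_monthly(traffic_gb):.2f}")
    print(f"5 interface endpoints: ${interface_endpoint_monthly(traffic_gb, 5):.2f}")
    print(f"Gateway endpoint:      ${gateway_endpoint_monthly(traffic_gb):.2f}")
```

The fixed NAT Gateway charge is only saved if the gateway can be removed entirely; if it must stay for internet-bound traffic, only the per-GB difference is saved.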
### Load Balancer Alternatives
**Application Load Balancer (ALB)**
- **Cost**: $16.20/month + LCU charges
- **When to use**: HTTP/HTTPS, path-based routing, microservices
- **Features**: Layer 7, content-based routing, Lambda targets
**Network Load Balancer (NLB)**
- **Cost**: $22.35/month + NLCU charges
- **When to use**: TCP/UDP, extreme performance, static IPs
- **Use case**: Non-HTTP protocols, high throughput
**Classic Load Balancer (Legacy)**
- **Cost**: $18/month + data charges
- **Recommendation**: Migrate to ALB or NLB (better features, often cheaper)
**CloudFront + S3 (Static Content)**
- **Cost**: Much cheaper for static content
- **When to use**: Static website, single-page app
- **Setup**: S3 static hosting + CloudFront distribution
- **Savings**: 90% vs ALB for static content
**API Gateway (REST APIs)**
- **Cost**: Pay per request
- **When to use**: REST API, need API management features
- **Alternative to**: ALB for simple APIs
---
## Application Services
### Message Queue Alternatives
**SQS vs SNS vs EventBridge vs Kinesis**
**SQS (Point-to-Point, Cheapest)**
- **Cost**: $0.40 per million requests (Standard), $0.50 (FIFO)
- **When to use**: Work queues, decoupling services
- **Best for**: Job processing, task queues
**SNS (Pub/Sub, Cheap)**
- **Cost**: $0.50 per million publishes
- **When to use**: Fan-out notifications, multiple subscribers
- **Best for**: Notifications, multiple consumers
**EventBridge (Event Router)**
- **Cost**: $1.00 per million events
- **When to use**: Event-driven architecture, complex routing
- **Best for**: Cross-account events, SaaS integrations
**Kinesis (Streaming, Expensive)**
- **Cost**: $0.015 per shard-hour + PUT charges
- **When to use**: Real-time streaming, ordered processing
- **Best for**: Logs, analytics, real-time processing
- **Alternative**: Kinesis Data Firehose (simpler, cheaper for basic needs)
**Decision Matrix**
```
Simple queue → SQS
Multiple consumers → SNS
Complex event routing → EventBridge
Real-time streaming → Kinesis
Log aggregation → Kinesis Firehose
```
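For a rough monthly comparison, the per-request services can be priced from message volume and Kinesis from shard count. The sketch below uses only the prices listed above; Kinesis PUT charges are left out because this guide does not quote them, and the 730-hour month is an averaging assumption.

```python
# USD per million requests, publishes, or events, from the bullets above.
PRICE_PER_MILLION = {
    "sqs_standard": 0.40,
    "sqs_fifo": 0.50,
    "sns": 0.50,
    "eventbridge": 1.00,
}
KINESIS_PER_SHARD_HOUR = 0.015  # USD, PUT charges not included
HOURS_PER_MONTH = 730           # assumption: average number of hours in a month


def request_based_cost(service, millions_per_month):
    """Monthly cost of a per-request service for the given volume."""
    return PRICE_PER_MILLION[service] * millions_per_month


def kinesis_shard_cost(shard_count):
    """Monthly shard-hour cost for an always-on Kinesis stream."""
    return shard_count * KINESIS_PER_SHARD_HOUR * HOURS_PER_MONTH


if __name__ == "__main__":
    volume = 10  # hypothetical: 10 million messages per month
    for service in PRICE_PER_MILLION:
        print(f"{service:<13} ${request_based_cost(service, volume):6.2f}")
    print(f"{'kinesis x1':<13} ${kinesis_shard_cost(1):6.2f} (shard-hours only)")
```

SQS bills per request, and a message usually needs separate send, receive, and delete calls, so multiply the message count accordingly when estimating SQS cost.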
### Container Orchestration
**ECS vs EKS vs Fargate**
**ECS on EC2 (Cheapest)**
- **Cost**: Just EC2 costs (no ECS fee)
- **When to use**: AWS-native, simpler workloads
- **Best for**: Cost-sensitive, AWS-specific deployments
**ECS on Fargate (Serverless, Easy)**
- **Cost**: Pay per task (vCPU + memory)
- **When to use**: Variable load, don't want to manage servers
- **Best for**: Variable workloads, simpler operations
**EKS (Kubernetes, Expensive)**
- **Cost**: $73/month per cluster + node costs
- **When to use**: Need Kubernetes, multi-cloud, complex deployments
- **Best for**: Kubernetes expertise, need K8s ecosystem
- **Tip**: Consolidate workloads to fewer clusters
**Decision Matrix**
```
AWS-native, cost-sensitive → ECS on EC2
Variable load, easy management → ECS on Fargate
Need Kubernetes → EKS
Multiple environments → Consider single EKS cluster with namespaces
```
---
## Quick Reference: When to Switch
### Immediate Actions (Low Risk)
- [ ] gp2 → gp3 (20% savings, no downtime)
- [ ] S3 Standard → Intelligent-Tiering (auto-optimization)
- [ ] NAT Gateway → VPC Endpoints for S3/DynamoDB (free)
- [ ] Old generation instances → New generation (5-10% savings)
- [ ] x86 (Intel/AMD) → Graviton (up to 20% savings, test first)
### Medium Effort Actions
- [ ] On-Demand → Reserved Instances/Savings Plans (30-65% savings)
- [ ] Always-on EC2 → Lambda for intermittent work
- [ ] S3 Standard → Lifecycle policies (50-95% savings on old data)
- [ ] RDS On-Demand → Reserved Instances (30-65% savings)
- [ ] DynamoDB On-Demand → Provisioned (60-80% savings if predictable)
### High Effort Actions (Evaluate Carefully)
- [ ] RDS → Aurora (usually more expensive, only if need features)
- [ ] Aurora → RDS (20-50% savings if don't need Aurora features)
- [ ] Commercial DB → PostgreSQL (90% savings, migration effort)
- [ ] EC2 → Lambda (case-by-case, break-even analysis needed)
- [ ] ECS → EKS (usually more expensive, only if need K8s)
---
## Cost Comparison Tool
Use this mental model when evaluating alternatives:
```
1. Calculate current monthly cost
2. Calculate alternative monthly cost
3. Estimate migration effort (hours × hourly rate)
4. Calculate payback period: Migration Cost / Monthly Savings
5. Decide: Payback < 3 months → Likely worth it
           Payback > 6 months → Evaluate carefully
```
**Example:**
```
Current: ALB for static site = $20/month
Alternative: CloudFront + S3 = $2/month
Savings: $18/month
Migration: 4 hours × $100/hour = $400
Payback: $400 / $18 = 22 months → Maybe not worth it
But if: Multiple sites, reusable pattern → Worth the investment
```
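The same mental model can be written as a small helper for comparing several candidate migrations at once. This is a sketch: the function names are made up for illustration, and the "Judgment call" label for paybacks between 3 and 6 months is an assumption, since the thresholds above leave that range open.

```python
def payback_months(migration_hours, hourly_rate, current_monthly, alternative_monthly):
    """Months until migration cost is recovered, or None if nothing is saved."""
    monthly_savings = current_monthly - alternative_monthly
    if monthly_savings <= 0:
        return None
    return migration_hours * hourly_rate / monthly_savings


def verdict(months):
    """Map a payback period onto the decision thresholds from the steps above."""
    if months is None:
        return "No savings"
    if months < 3:
        return "Likely worth it"
    if months > 6:
        return "Evaluate carefully"
    return "Judgment call"  # assumption: the 3-6 month range is not defined above


if __name__ == "__main__":
    # Values from the ALB → CloudFront + S3 example above.
    months = payback_months(4, 100, current_monthly=20, alternative_monthly=2)
    print(f"Payback: {months:.1f} months → {verdict(months)}")
```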