Initial commit

2025-11-29 17:51:09 +08:00
commit 9d4643f587
14 changed files with 4713 additions and 0 deletions
--- a/references/best_practices.md
+++ b/references/best_practices.md
@@ -0,0 +1,362 @@
+# AWS Cost Optimization Best Practices
+
+Comprehensive strategies for optimizing AWS costs across all major service categories.
+
+## Table of Contents
+
+1. [Compute Optimization](#compute-optimization)
+2. [Storage Optimization](#storage-optimization)
+3. [Network Optimization](#network-optimization)
+4. [Database Optimization](#database-optimization)
+5. [Container & Serverless Optimization](#container--serverless-optimization)
+6. [General Principles](#general-principles)
+
+---
+
+## Compute Optimization
+
+### EC2 Instance Optimization
+
+**Right Instance Family**
+- **General Purpose (T3, M5, M6i)**: Web servers, small-medium databases, dev environments
+- **Compute Optimized (C5, C6i, C6g)**: CPU-intensive workloads, batch processing, HPC
+- **Memory Optimized (R5, R6i, R6g)**: Databases, in-memory caches, big data
+- **Storage Optimized (I3, D2)**: High IOPS, data warehousing, Hadoop
+
+**Graviton Migration (ARM64)**
+- Up to 20% cost savings with M6g, C6g, R6g, T4g instances
+- Test compatibility first: Most modern languages/frameworks support ARM64
+- Best for: Stateless applications, containerized workloads, open-source software
+
+**Instance Sizing**
+- Start small and scale up based on metrics
+- Monitor CPU, memory, network for 2+ weeks before committing
+- Use CloudWatch metrics to identify underutilized instances
+- Consider burstable instances (T3) for variable workloads
+
+**Purchase Options**
+- **On-Demand**: Flexible, no commitment, highest cost
+- **Reserved Instances**: 1- or 3-year commitment, up to 72% savings
+  - Standard RI: Highest discount, no flexibility
+  - Convertible RI: Moderate discount, can change instance types
+- **Savings Plans**: Flexible commitment to compute spend, up to 66% savings
+- **Spot Instances**: Up to 90% savings, suitable for fault-tolerant workloads
+
+### Auto Scaling
+
+**Horizontal Scaling**
+- Scale out during peak, scale in during off-peak
+- Use target tracking policies (CPU, ALB requests, custom metrics)
+- Set minimum instances for high availability, maximum for cost control
+- Consider scheduled scaling for predictable patterns
+
+**Mixed Instances Policy**
+- Combine instance types for better Spot availability
+- Mix Spot and On-Demand for reliability
+- Example: 70% Spot, 30% On-Demand for fault-tolerant apps
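+
+As a rough sketch of how the 70/30 example plays out, the blended hourly cost of a fleet can be estimated as follows. The function name, the $0.10/hour rate, and the 70% Spot discount are illustrative assumptions, not quoted prices.
+
+```python
+def blended_hourly_cost(instances, on_demand_rate, spot_discount, spot_share):
+    """Estimate fleet cost per hour for a Spot/On-Demand mix.
+
+    spot_discount and spot_share are fractions between 0 and 1.
+    """
+    spot_count = instances * spot_share
+    on_demand_count = instances - spot_count
+    spot_rate = on_demand_rate * (1 - spot_discount)
+    return spot_count * spot_rate + on_demand_count * on_demand_rate
+
+
+# Hypothetical inputs: 10 instances at $0.10/hour, Spot assumed 70% cheaper.
+mixed = blended_hourly_cost(10, 0.10, 0.70, 0.70)
+all_on_demand = blended_hourly_cost(10, 0.10, 0.70, 0.0)
+print(f"mixed: ${mixed:.2f}/h, all On-Demand: ${all_on_demand:.2f}/h")
+```
+
+Actual Spot discounts vary by instance type, Region, and time, so the inputs should come from current pricing data.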
+
+### Lambda Optimization
+
+**Memory Configuration**
+- Memory allocation determines CPU allocation
+- More memory = faster execution = potentially lower cost
+- Test different memory settings to find cost/performance sweet spot
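+
+A minimal sketch of the memory/duration trade-off, assuming compute cost is billed in GB-seconds. The per-GB-second rate and the measured durations below are placeholder assumptions; substitute current pricing and your own measurements.
+
+```python
+# Assumed x86 rate in USD per GB-second; check current Lambda pricing.
+PRICE_PER_GB_SECOND = 0.0000166667
+
+
+def invocation_cost(memory_mb, duration_ms, price=PRICE_PER_GB_SECOND):
+    """Compute-only cost of one invocation (request fee excluded)."""
+    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
+    return gb_seconds * price
+
+
+# Hypothetical measurements: memory (MB) -> average duration (ms).
+measurements = {512: 1200, 1024: 500, 2048: 300}
+costs = {mem: invocation_cost(mem, ms) for mem, ms in measurements.items()}
+cheapest = min(costs, key=costs.get)
+print(f"cheapest setting: {cheapest} MB")
+```
+
+In this made-up data set the middle setting wins because the extra memory shortens execution enough to offset its higher rate.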
+
+**Cold Start Mitigation**
+- Provisioned concurrency for critical functions (adds cost)
+- Keep functions warm with scheduled invocations
+- Minimize deployment package size
+- Use Lambda layers for shared dependencies
+
+**Execution Time**
+- Optimize code to reduce execution duration
+- Duration is billed in 1 ms increments, so every millisecond saved matters at scale
+- Consider Graviton2 (arm64) for 20% better price/performance
+
+---
+
+## Storage Optimization
+
+### S3 Cost Optimization
+
+**Storage Classes**
+- **S3 Standard**: Frequently accessed data
+- **S3 Intelligent-Tiering**: Auto-moves between tiers, ideal for unknown patterns
+- **S3 Standard-IA**: Infrequent access, roughly 46% cheaper than Standard (retrieval fees apply)
+- **S3 One Zone-IA**: Non-critical, infrequent access, 20% cheaper than Standard-IA
+- **S3 Glacier Instant Retrieval**: Archive with instant access, 68% cheaper
+- **S3 Glacier Flexible Retrieval**: Archive, retrieval in minutes-hours, 77% cheaper
+- **S3 Glacier Deep Archive**: Long-term archive, retrieval in 12 hours, 83% cheaper
+
+**Lifecycle Policies**
+- Automatically transition objects between storage classes
+- Delete incomplete multipart uploads after 7 days
+- Example policy:
+  - 0-30 days: S3 Standard
+  - 30-90 days: S3 Standard-IA
+  - 90-365 days: S3 Glacier Flexible Retrieval
+  - 365+ days: S3 Glacier Deep Archive or Delete
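+
+The example policy above could be expressed as a lifecycle configuration along these lines. This is a sketch: the rule ID and bucket name are hypothetical, the final stage picks Deep Archive over deletion, and the boto3 call is left commented out so the snippet runs without credentials.
+
+```python
+lifecycle_configuration = {
+    "Rules": [
+        {
+            "ID": "tier-then-archive",  # hypothetical rule name
+            "Status": "Enabled",
+            "Filter": {"Prefix": ""},  # empty prefix: whole bucket
+            "Transitions": [
+                {"Days": 30, "StorageClass": "STANDARD_IA"},
+                {"Days": 90, "StorageClass": "GLACIER"},  # Flexible Retrieval
+                {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
+            ],
+            # Clean up incomplete multipart uploads after 7 days
+            "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
+        }
+    ]
+}
+
+# To apply it (requires boto3 and AWS credentials):
+# import boto3
+# boto3.client("s3").put_bucket_lifecycle_configuration(
+#     Bucket="example-bucket",  # placeholder bucket name
+#     LifecycleConfiguration=lifecycle_configuration,
+# )
+```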
+
+**Request Optimization**
+- Use CloudFront CDN to reduce S3 GET requests
+- Batch operations instead of individual API calls
+- Use S3 Select to retrieve subsets of data
+- Enable S3 Transfer Acceleration for faster uploads (if needed)
+
+**Cost Monitoring**
+- Enable S3 Storage Lens for usage analytics
+- Set up S3 Storage Class Analysis
+- Monitor request costs (can exceed storage costs for small files)
+
+### EBS Optimization
+
+**Volume Types**
+- **gp3**: General purpose, 20% cheaper than gp2, configurable IOPS/throughput
+- **gp2**: Legacy general purpose (migrate to gp3)
+- **io2**: High performance, mission-critical (only if needed)
+- **st1**: Throughput-optimized HDD for big data (cheaper for sequential access)
+- **sc1**: Cold HDD for infrequent access (cheapest)
+
+**Snapshot Management**
+- Delete old snapshots (they accumulate quickly)
+- Use Amazon Data Lifecycle Manager to automate snapshot creation and retention
+- Snapshots are incremental, so deleting one frees only the blocks that no other snapshot references
+- Consider cross-region replication costs
+
+**Volume Cleanup**
+- Delete unattached volumes
+- Right-size oversized volumes
+- Consider EBS Elastic Volumes to modify without downtime
+
+---
+
+## Network Optimization
+
+### Data Transfer Costs
+
+**General Rules**
+- **Free**: Inbound from internet, same-AZ traffic over private IPs
+- **Cheap**: Same-region traffic across AZs
+- **Expensive**: Cross-region, outbound to internet, CloudFront to origin
+
+**Optimization Strategies**
+- Colocate resources in same AZ when possible (consider HA trade-offs)
+- Use VPC endpoints for AWS service access (avoids NAT/IGW costs)
+- Implement caching with CloudFront, ElastiCache
+- Compress data before transfer
+- Use AWS PrivateLink instead of internet egress
+
+### NAT Gateway Optimization
+
+**Cost Structure**
+- ~$32.85/month per NAT Gateway
+- Data processing charges: $0.045/GB
+
+**Alternatives**
+- **VPC Endpoints**: Direct access to AWS services (S3, DynamoDB, etc.)
+  - Interface endpoints: $7.20/month + $0.01/GB
+  - Gateway endpoints: Free for S3 and DynamoDB
+- **NAT Instance**: Cheaper but requires management
+- **Single NAT Gateway**: Use one instead of one per AZ (reduces HA)
+- **S3 Gateway Endpoint**: Free alternative for S3 access
+
+**When to Use What**
+- High traffic to AWS services → VPC Endpoints
+- Low traffic, dev/test → Single NAT Gateway or NAT instance
+- Production, HA required → NAT Gateway per AZ
+- S3 access only → S3 Gateway Endpoint (free)
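+
+Using the figures quoted in this section, a back-of-the-envelope comparison might look like the sketch below. It assumes a single AZ, one interface endpoint, and that all of the traffic is bound for one AWS service; real bills depend on Region and current pricing.
+
+```python
+# Figures taken from the cost structure above (USD).
+NAT_FIXED = 32.85        # per NAT Gateway per month
+NAT_PER_GB = 0.045
+ENDPOINT_FIXED = 7.20    # per interface endpoint per month
+ENDPOINT_PER_GB = 0.01
+
+
+def monthly_cost(fixed, per_gb, gb):
+    """Fixed monthly charge plus data processing for `gb` gigabytes."""
+    return fixed + per_gb * gb
+
+
+def compare(gb):
+    """Return (nat_cost, endpoint_cost) for a monthly traffic volume."""
+    nat = monthly_cost(NAT_FIXED, NAT_PER_GB, gb)
+    endpoint = monthly_cost(ENDPOINT_FIXED, ENDPOINT_PER_GB, gb)
+    return nat, endpoint
+
+
+for gb in (100, 1000, 10000):
+    nat, endpoint = compare(gb)
+    print(f"{gb:>6} GB: NAT ${nat:,.2f} vs interface endpoint ${endpoint:,.2f}")
+```
+
+If the NAT Gateway must stay for internet-bound traffic, only the per-GB difference is saved, and the endpoint's fixed charge is added on top.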
+
+### CloudFront Optimization
+
+**Use Cases for Savings**
+- Reduce S3 data transfer costs (CloudFront egress is cheaper)
+- Cache frequently accessed content
+- Regional edge caches for less popular content
+
+**Configuration**
+- Use appropriate price class (exclude expensive regions if not needed)
+- Set proper TTL to maximize cache hit ratio
+- Use compression (gzip, brotli)
+- Monitor cache hit ratio and adjust
+
+---
+
+## Database Optimization
+
+### RDS Cost Optimization
+
+**Instance Sizing**
+- Right-size based on CloudWatch metrics (CPU, memory, connections)
+- Consider burstable instances (db.t3) for variable workloads
+- Graviton instances (db.m6g, db.r6g) offer 20% savings
+
+**Storage Optimization**
+- Use gp3 instead of gp2 (20% cheaper)
+- Enable storage autoscaling with upper limit
+- Delete old automated backups
+- Reduce backup retention period if possible
+
+**High Availability Trade-offs**
+- Multi-AZ doubles cost (needed for production)
+- Single-AZ acceptable for dev/test
+- Read replicas for read scaling (cheaper than bigger instance)
+
+**Aurora vs RDS**
+- Aurora costs more but offers better scaling
+- Aurora Serverless v2 for variable workloads
+- Standard RDS for predictable workloads
+- PostgreSQL/MySQL community for dev/test
+
+### DynamoDB Optimization
+
+**Capacity Modes**
+- **On-Demand**: Pay per request, unpredictable traffic
+- **Provisioned**: Cheaper for consistent traffic, requires capacity planning
+- **Reserved Capacity**: 1- or 3-year commitment for provisioned capacity
+
+**Table Design**
+- Use single-table design to minimize costs
+- Implement GSI/LSI carefully (they add cost)
+- Enable point-in-time recovery only if needed
+- Use TTL to auto-expire old data
+
+**Read Optimization**
+- Use eventually consistent reads (50% cheaper than strongly consistent)
+- Implement caching (DAX or ElastiCache)
+- Batch operations when possible
+
+### ElastiCache Optimization
+
+**Node Types**
+- Graviton instances (cache.m6g, cache.r6g) for 20% savings
+- Right-size based on memory usage and eviction rates
+
+**Redis vs Memcached**
+- Redis: More features, persistence, replication (more expensive)
+- Memcached: Simpler, no persistence, multi-threaded (cheaper)
+
+**Strategies**
+- Reserved nodes for 30-55% savings
+- Single-AZ for dev/test
+- Monitor eviction rates to avoid over-provisioning
+
+---
+
+## Container & Serverless Optimization
+
+### ECS/Fargate Optimization
+
+**Compute Options**
+- **EC2 Launch Type**: More control, cheaper for steady workloads
+- **Fargate**: Serverless, easier management, better for variable loads
+- **Fargate Spot**: Up to 70% savings for fault-tolerant tasks
+
+**Graviton Support**
+- Fargate ARM64 support available
+- ECS on Graviton2 EC2 instances for 20% savings
+
+**Right-sizing**
+- Start with minimal CPU/memory, scale up based on metrics
+- Use Container Insights for utilization data
+- Consider task packing (multiple containers per task)
+
+### EKS Optimization
+
+**Control Plane**
+- $73/month per cluster (consider consolidation)
+- Use single cluster with namespaces when appropriate
+
+**Worker Nodes**
+- Use Spot instances for fault-tolerant pods (up to 90% savings)
+- Managed node groups with Graviton instances
+- Karpenter for intelligent autoscaling
+- Mixed instance types for better Spot availability
+
+**Cost Visibility**
+- Kubecost or OpenCost for K8s cost attribution
+- Resource requests/limits prevent waste
+- Cluster autoscaler for automatic node scaling
+
+---
+
+## General Principles
+
+### Tagging Strategy
+
+**Cost Allocation Tags**
+- Environment: prod, staging, dev, test
+- Owner: team/person responsible
+- Project: business initiative
+- CostCenter: chargeback allocation
+- Application: specific app name
+
+**Tag Enforcement**
+- Use AWS Organizations policies to enforce tagging
+- Service Control Policies to prevent untagged resources
+- AWS Config rules for compliance
+
+### Monitoring and Governance
+
+**Cost Monitoring Tools**
+- AWS Cost Explorer: Historical analysis
+- AWS Budgets: Proactive alerts
+- Cost and Usage Reports: Detailed data export
+- Cost Anomaly Detection: Automatic anomaly alerts
+
+**Regular Reviews**
+- Monthly cost review meetings
+- Quarterly rightsizing exercises
+- Annual Reserved Instance/Savings Plan optimization
+- Automated reports to stakeholders
+
+### Automation
+
+**Infrastructure as Code**
+- Define resource sizes in code (prevent oversizing)
+- Automated cleanup of dev/test resources
+- Scheduled shutdown of non-production resources
+
+**Cost Optimization Tools**
+- AWS Compute Optimizer: ML-based recommendations
+- AWS Trusted Advisor: Best practice checks
+- Third-party tools: CloudHealth, Cloudability, Spot.io
+
+### Cultural Best Practices
+
+**Engineering Ownership**
+- Engineers should see cost impact of their changes
+- Cost metrics in dashboards alongside performance
+- Cost budgets for teams/projects
+
+**Experiments and Cleanup**
+- Tag experimental resources with expiration dates
+- Automated cleanup of abandoned resources
+- Regular audits of unused resources
+
+**Cost-Aware Architecture**
+- Design for cost from the beginning
+- Choose appropriate service tiers
+- Implement auto-scaling and right-sizing from day one
+- Consider serverless and managed services
+
+---
+
+## Quick Wins Checklist
+
+- [ ] Delete unattached EBS volumes
+- [ ] Delete old EBS snapshots
+- [ ] Release unused Elastic IPs
+- [ ] Stop or terminate idle EC2 instances
+- [ ] Right-size oversized instances
+- [ ] Convert gp2 to gp3 volumes
+- [ ] Enable S3 Intelligent-Tiering
+- [ ] Set up S3 lifecycle policies
+- [ ] Replace NAT Gateways with VPC Endpoints where possible
+- [ ] Migrate to Graviton instances
+- [ ] Purchase Reserved Instances/Savings Plans for stable workloads
+- [ ] Use Spot instances for fault-tolerant workloads
+- [ ] Delete old RDS snapshots
+- [ ] Enable DynamoDB auto-scaling
+- [ ] Set up cost allocation tags
+- [ ] Enable AWS Budgets alerts
+- [ ] Schedule shutdown of dev/test resources
--- a/references/finops_governance.md
+++ b/references/finops_governance.md
@@ -0,0 +1,740 @@
+# FinOps Governance Framework
+
+Organizational practices, processes, and governance for AWS cost optimization.
+
+## Table of Contents
+
+1. [FinOps Principles](#finops-principles)
+2. [Cost Allocation & Tagging](#cost-allocation--tagging)
+3. [Budget Management](#budget-management)
+4. [Monthly Review Process](#monthly-review-process)
+5. [Roles & Responsibilities](#roles--responsibilities)
+6. [Chargeback & Showback](#chargeback--showback)
+7. [Policy & Governance](#policy--governance)
+8. [Metrics & KPIs](#metrics--kpis)
+
+---
+
+## FinOps Principles
+
+### The FinOps Framework
+
+FinOps is the practice of bringing financial accountability to cloud spending through collaboration between engineering, finance, and business teams.
+
+**Core Principles:**
+
+1. **Teams Need to Collaborate**
+   - Engineering makes technical decisions
+   - Finance provides visibility and reporting
+   - Business sets priorities and budgets
+   - Cross-functional cost optimization
+
+2. **Everyone Takes Ownership**
+   - Engineers see cost impact of their decisions
+   - Teams have cost budgets and accountability
+   - Cost is an efficiency metric, not just a finance concern
+
+3. **Decisions Driven by Business Value**
+   - Speed, quality, and cost trade-offs
+   - Investment vs optimization decisions
+   - ROI-based prioritization
+
+4. **Take Advantage of Variable Cost Model**
+   - Scale resources up and down as needed
+   - Use different pricing models strategically
+   - Optimize for actual usage patterns
+
+5. **Centralized Team Drives FinOps**
+   - Central FinOps team enables teams with tooling, reporting, and guidance
+   - Distributed execution by product teams
+   - Share best practices and tools
+
+### FinOps Maturity Model
+
+**Crawl Phase (Getting Started)**
+- Basic cost visibility
+- Manual reporting
+- Ad-hoc optimization
+- Initial tagging strategy
+- Basic budget alerts
+
+**Walk Phase (Improving)**
+- Automated cost reporting
+- Regular optimization reviews
+- Systematic tagging enforcement
+- Team cost allocation
+- Reserved Instance planning
+- Monthly optimization meetings
+
+**Run Phase (Optimized)**
+- Real-time cost visibility
+- Automated optimization
+- Cost-aware engineering culture
+- Predictive forecasting
+- Automated guardrails
+- FinOps integrated in SDLC
+
+---
+
+## Cost Allocation & Tagging
+
+### Tagging Strategy
+
+**Required Tags (Enforce via Policy)**
+
+```yaml
+Required Tags:
+  Environment:
+    values: [prod, staging, dev, test]
+    purpose: Separate production from non-production costs
+
+  Owner:
+    values: [email or team name]
+    purpose: Contact for resource questions
+
+  Project:
+    values: [project code]
+    purpose: Track project spending
+
+  CostCenter:
+    values: [department code]
+    purpose: Chargeback allocation
+
+  Application:
+    values: [app name]
+    purpose: Application-level cost tracking
+```
+
+**Optional but Recommended Tags**
+
+```yaml
+Optional Tags:
+  ExpirationDate:
+    format: YYYY-MM-DD
+    purpose: Auto-cleanup scheduling
+
+  DataClassification:
+    values: [public, internal, confidential, restricted]
+    purpose: Security and compliance
+
+  BackupRequired:
+    values: [true, false]
+    purpose: Backup policy enforcement
+
+  Criticality:
+    values: [critical, high, medium, low]
+    purpose: Priority and SLA determination
+```
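+
+One way to check a resource's tags against the required and optional tag definitions is sketched below. The function name and the returned message format are assumptions for illustration; the tag keys and allowed values come from the lists above.
+
+```python
+from datetime import date
+
+# None means "any non-empty value is accepted".
+REQUIRED_TAGS = {
+    "Environment": {"prod", "staging", "dev", "test"},
+    "Owner": None,
+    "Project": None,
+    "CostCenter": None,
+    "Application": None,
+}
+
+
+def tag_violations(tags):
+    """Return a list of problems found in one resource's tag dict."""
+    problems = []
+    for key, allowed in REQUIRED_TAGS.items():
+        value = tags.get(key, "")
+        if not value:
+            problems.append(f"missing {key}")
+        elif allowed is not None and value not in allowed:
+            problems.append(f"invalid {key}={value}")
+    # Optional tag: must be YYYY-MM-DD when present.
+    expiry = tags.get("ExpirationDate")
+    if expiry:
+        try:
+            date.fromisoformat(expiry)
+        except ValueError:
+            problems.append(f"invalid ExpirationDate={expiry}")
+    return problems
+
+
+example = {"Environment": "qa", "Owner": "team-a", "Project": "p1", "CostCenter": "cc9"}
+print(tag_violations(example))
+```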
+
+### Tag Enforcement
+
+**Using AWS Organizations Service Control Policies (SCP)**
+
+```json
+{
+  "Version": "2012-10-17",
+  "Statement": [
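+    {
+      "Sid": "DenyRunInstancesWithoutOwnerTag",
+      "Effect": "Deny",
+      "Action": "ec2:RunInstances",
+      "Resource": "arn:aws:ec2:*:*:instance/*",
+      "Condition": { "Null": { "aws:RequestTag/Owner": "true" } }
+    },
+    {
+      "Sid": "DenyRunInstancesWithoutProjectTag",
+      "Effect": "Deny",
+      "Action": "ec2:RunInstances",
+      "Resource": "arn:aws:ec2:*:*:instance/*",
+      "Condition": { "Null": { "aws:RequestTag/Project": "true" } }
+    },
+    {
+      "Sid": "DenyRunInstancesWithInvalidEnvironmentTag",
+      "Effect": "Deny",
+      "Action": "ec2:RunInstances",
+      "Resource": "arn:aws:ec2:*:*:instance/*",
+      "Condition": {
+        "StringNotEquals": {
+          "aws:RequestTag/Environment": ["prod", "staging", "dev", "test"]
+        }
+      }
+    }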
+  ]
+}
+```
+
+**Using AWS Config Rules**
+
+- **required-tags**: Flags supported resources that are missing the specified tags (detective, not preventive)
+- **ec2-instance-no-public-ip**: Flags EC2 instances that have a public IP address
+- Custom Lambda-based rules for complex logic
+
+**Tag Compliance Monitoring**
+
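+```bash
+# Example: Check tag compliance
+# Run weekly to find resources that currently have no tags.
+# Note: this API only returns resources that are, or once were, tagged;
+# resources that have never been tagged will not appear here.
+
+aws resourcegroupstaggingapi get-resources \
+  --query 'ResourceTagMappingList[?length(Tags) == `0`]' \
+  --output table
+
+# Or use Tag Editor in AWS Console
+```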
+
+### Cost Allocation Tags
+
+**Activating Cost Allocation Tags**
+
+1. Go to AWS Billing → Cost Allocation Tags
+2. Select user-defined tags to activate
+3. Wait up to 24 hours for tags to appear in Cost Explorer
+4. Tags only apply to charges after activation
+
+**Best Practices**
+
+- Activate tags before using them
+- Use consistent naming (e.g., `Environment` not `Env` or `environment`)
+- Document tag values in wiki/runbook
+- Review and update tag strategy quarterly
+
+---
+
+## Budget Management
+
+### AWS Budgets Setup
+
+**Budget Types**
+
+1. **Cost Budget**: Track spending against threshold
+2. **Usage Budget**: Track service usage (e.g., EC2 hours)
+3. **Savings Plans Budget**: Track commitment utilization
+4. **Reservation Budget**: Track RI utilization
+
+**Recommended Budgets**
+
+**1. Overall Monthly Budget**
+```yaml
+Budget Name: Company-Wide-Monthly-Budget
+Amount: $50,000/month
+Alerts:
+  - 50% actual: Email CFO, FinOps team
+  - 80% actual: Email CFO, CTO, FinOps team
+  - 100% forecasted: Email CFO, CTO, all team leads
+  - 100% actual: Email everyone + Slack alert
+```
+
+**2. Per-Environment Budgets**
+```yaml
+Budget Name: Production-Environment-Budget
+Amount: $30,000/month
+Filter: Environment=prod
+Alerts:
+  - 80% actual: Email engineering leads
+  - 100% forecasted: Email CTO + FinOps
+
+Budget Name: Dev-Environment-Budget
+Amount: $5,000/month
+Filter: Environment=dev
+Alerts:
+  - 100% actual: Email dev team leads
+  - 120% actual: Automated shutdown (if possible)
+```
+
+**3. Per-Team Budgets**
+```yaml
+Budget Name: Team-Platform-Budget
+Amount: $15,000/month
+Filter: Owner=platform-team
+Alerts:
+  - 90% actual: Email platform team
+  - 100% forecasted: Email platform team + manager
+```
+
+**4. Per-Project Budgets**
+```yaml
+Budget Name: Project-Phoenix-Budget
+Amount: $8,000/month
+Filter: Project=phoenix
+Alerts:
+  - 75% actual: Email project owner
+  - 100% actual: Email project owner + sponsor
+```
+
+### Budget Alert Actions
+
+**Automated Responses to Budget Alerts**
+
+```python
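+# Lambda function triggered by Budget alert SNS topic.
+# Assumption: the SNS message body is JSON carrying "budgetName" and
+# "threshold" (percent of budget); adapt the parsing to your actual payload.
+import json
+
+
+def stop_dev_instances():
+    """Placeholder: stop non-production instances (e.g. with boto3)."""
+
+
+def send_slack_alert(text):
+    """Placeholder: post to a Slack webhook; prints so the sketch runs."""
+    print(text)
+
+
+def create_cost_investigation_ticket():
+    """Placeholder: open a JIRA ticket for follow-up."""
+
+
+def lambda_handler(event, context):
+    # SNS delivers the alert as a string under Records[0].Sns.Message
+    message = json.loads(event["Records"][0]["Sns"]["Message"])
+    budget_name = message["budgetName"]
+    threshold = float(message["threshold"])
+
+    if threshold >= 100:
+        # Stop non-production instances
+        stop_dev_instances()
+        # Send Slack alert
+        send_slack_alert(f"🚨 Budget {budget_name} exceeded!")
+        # Create JIRA ticket
+        create_cost_investigation_ticket()
+    elif threshold >= 80:
+        # Send warning
+        send_slack_alert(f"⚠️  Budget {budget_name} at {threshold:.0f}%")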
+```
+
+---
+
+## Monthly Review Process
+
+### FinOps Monthly Cadence
+
+**Week 1: Data Collection**
+- Export Cost & Usage Reports
+- Run cost optimization scripts
+- Gather CloudWatch metrics
+- Compile anomaly reports
+
+**Week 2: Analysis**
+- Identify cost trends
+- Find optimization opportunities
+- Compare to previous months
+- Analyze tag compliance
+
+**Week 3: Team Review Meetings**
+- Present findings to engineering teams
+- Discuss optimization opportunities
+- Assign action items
+- Review upcoming projects
+
+**Week 4: Executive Reporting**
+- Create executive summary
+- Present cost trends to leadership
+- Report on optimization wins
+- Forecast next quarter
+
+### Monthly Review Meeting Agenda
+
+**Attendees**: Engineering Leads, FinOps Team, Finance Rep, Product Manager
+
+**Agenda (1 hour)**
+
+1. **Previous Month Recap (10 min)**
+   - Total spend vs budget
+   - Top 5 services by cost
+   - Month-over-month comparison
+   - Budget variance explanation
+
+2. **Cost Anomalies (10 min)**
+   - Unusual spending spikes
+   - Root cause analysis
+   - Prevention measures
+
+3. **Optimization Opportunities (15 min)**
+   - Unused resources found
+   - Rightsizing recommendations
+   - Reserved Instance opportunities
+   - Estimated savings
+
+4. **Team Cost Breakdown (10 min)**
+   - Per-team spending
+   - Top spenders
+   - Tag compliance status
+
+5. **Upcoming Changes (10 min)**
+   - New projects launching
+   - Expected cost impact
+   - Budget adjustments needed
+
+6. **Action Items Review (5 min)**
+   - Follow-up on previous items
+   - Assign new action items
+   - Set deadlines
+
+**Deliverable**: Monthly FinOps Report (template provided)
+
+### Monthly Report Template
+
+```markdown
+# AWS Cost Report - [Month Year]
+
+## Executive Summary
+- Total spend: $XX,XXX
+- vs Budget: X% (under/over)
+- vs Last month: +/-X%
+- Optimization savings: $X,XXX
+
+## Cost Breakdown
+| Service | Cost | % of Total | MoM Change |
+|---------|------|-----------|-----------|
+| EC2     | $XX  | XX%       | +/-X%     |
+| RDS     | $XX  | XX%       | +/-X%     |
+
+## Optimization Actions Taken
+1. Migrated 20 instances to Graviton (saved $X/month)
+2. Purchased Reserved Instances (saved $X/month)
+3. Deleted unused resources (saved $X/month)
+
+## Recommendations for Next Month
+1. Right-size 15 oversized instances (potential $X/month savings)
+2. Implement S3 lifecycle policies (potential $X/month savings)
+
+## Action Items
+- [ ] [Owner] Task description (Deadline)
+```
+
+---
+
+## Roles & Responsibilities
+
+### FinOps Team Structure
+
+**FinOps Lead**
+- Owns overall cloud financial management
+- Reports to CFO and CTO
+- Sets FinOps strategy and goals
+- Manages budget process
+
+**Cloud Cost Analyst**
+- Analyzes spending trends
+- Generates reports and dashboards
+- Identifies optimization opportunities
+- Runs monthly review process
+
+**Cloud Architect (FinOps focus)**
+- Advises on cost-optimized architectures
+- Implements cost optimization tools
+- Trains engineers on FinOps practices
+- Reviews architectural designs for cost impact
+
+### Engineering Team Responsibilities
+
+**Engineering Manager**
+- Owns team budget
+- Reviews monthly cost reports
+- Prioritizes optimization work
+- Ensures tagging compliance
+
+**Engineers**
+- Tag all resources they create
+- Consider cost in design decisions
+- Implement optimization recommendations
+- Delete unused resources
+
+**Platform/SRE Team**
+- Implements cost optimization tooling
+- Automates cost monitoring
+- Provides cost visibility dashboards
+- Enforces tagging policies
+
+---
+
+## Chargeback & Showback
+
+### Showback (Visibility Only)
+
+**Purpose**: Show teams their costs without charging them
+**Goal**: Raise cost awareness
+
+**Implementation**:
+- Monthly cost reports per team
+- Dashboard showing team spending
+- Highlight cost trends
+- No budget enforcement
+
+**Best for**: Organizations new to FinOps
+
+### Chargeback (Financial Accountability)
+
+**Purpose**: Allocate costs back to business units
+**Goal**: Financial accountability
+
+**Implementation**:
+- Tag-based cost allocation
+- Transfer costs between cost centers
+- Teams have hard budgets
+- Overspending requires justification
+
+**Best for**: Mature FinOps organizations
+
+### Hybrid Model (Recommended)
+
+**Shared Costs**: Charged to central IT
+- VPC resources
+- Security tools
+- Monitoring infrastructure
+- Shared services
+
+**Team Costs**: Charged to teams
+- Compute resources (EC2, Lambda)
+- Databases
+- Storage
+- Application-specific services
+
+**Implementation**:
+```
+Total AWS Bill: $100,000
+
+Shared Costs (30%): $30,000
+  → Charged to IT/Platform budget
+
+Team Costs (70%): $70,000
+  → Allocated by tags:
+    - Team A (Project=alpha): $20,000
+    - Team B (Project=beta): $25,000
+    - Team C (Project=gamma): $15,000
+    - Untagged (alert!): $10,000 → Needs investigation
+```
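+
+The allocation above can be reproduced with a small script. This is a sketch under stated assumptions: the line-item shape `(service, project_tag, cost)` and the set of shared services are hypothetical, and real data would come from the Cost and Usage Reports.
+
+```python
+from collections import defaultdict
+
+
+def allocate_by_project(line_items, shared_services):
+    """Split line items into shared, per-project, and untagged totals.
+
+    line_items: iterable of (service, project_tag_or_None, cost).
+    """
+    totals = defaultdict(float)
+    for service, project, cost in line_items:
+        if service in shared_services:
+            totals["shared"] += cost
+        elif project:
+            totals[project] += cost
+        else:
+            totals["untagged"] += cost  # needs investigation
+    return dict(totals)
+
+
+# Hypothetical line items mirroring the breakdown above.
+items = [
+    ("VPC", None, 30000),
+    ("EC2", "alpha", 20000),
+    ("RDS", "beta", 25000),
+    ("EC2", "gamma", 15000),
+    ("S3", None, 10000),
+]
+result = allocate_by_project(items, shared_services={"VPC"})
+print(result)
+```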
+
+---
+
+## Policy & Governance
+
+### Cost Governance Policies
+
+**1. Resource Creation Policies**
+
+```yaml
+Policy: All resources must be tagged
+Enforcement: Service Control Policy (SCP)
+Exception process: Request via FinOps team
+
+Policy: Dev/test resources must auto-stop nights/weekends
+Enforcement: AWS Instance Scheduler
+Exception process: Tag with NoAutoStop=true (requires approval)
+
+Policy: S3 buckets must have lifecycle policies
+Enforcement: AWS Config rule
+Exception process: Document justification in bucket tags
+```
+
+**2. Approval Workflows**
+
+```yaml
+# Spending thresholds requiring approval
+
+"< $1,000/month":
+  - Auto-approved
+  - Must be tagged
+
+"$1,000 - $5,000/month":
+  - Engineering manager approval
+  - Documented in JIRA
+
+"$5,000 - $20,000/month":
+  - Director approval
+  - Budget impact assessment
+  - FinOps team review
+
+"> $20,000/month":
+  - VP approval
+  - Business case required
+  - Quarterly review checkpoint
+```
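+
+A minimal sketch of how the spending thresholds could be encoded for an automated check. The function name and return labels are hypothetical, and the boundary handling (each tier includes its upper bound) is an assumption, since the ranges above overlap at $5,000.
+
+```python
+def required_approval(monthly_cost):
+    """Map an estimated monthly cost (USD) to an approval tier."""
+    if monthly_cost < 1000:
+        return "auto-approved"
+    if monthly_cost <= 5000:
+        return "engineering manager"
+    if monthly_cost <= 20000:
+        return "director"
+    return "vp"
+
+
+for cost in (250, 4200, 12000, 50000):
+    print(f"${cost:,}/month -> {required_approval(cost)}")
+```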
+
+**3. Reserved Instance / Savings Plans Policy**
+
+```yaml
+Policy: All commitments require FinOps review
+
+Process:
+  1. Team identifies workload suitable for commitment
+  2. Submit request to FinOps with:
+     - Resource details
+     - Usage history (30+ days)
+     - Business justification
+  3. FinOps analyzes and recommends
+  4. Finance approves commitment
+  5. FinOps purchases and tracks utilization
+```
+
+### Automation & Guardrails
+
+**Automated Actions**
+
+```yaml
+# Non-production resource scheduling
+Schedule: Instance Scheduler
+  - Stop all dev/test EC2/RDS instances at 7pm weekdays
+  - Stop all dev/test instances all weekend
+  - Start at 7am weekdays
+  - Exception tag: NoAutoStop=true
+
+# Untagged resource alerts
+Trigger: AWS Config rule violation
+Action:
+  - Send Slack alert to team
+  - Create JIRA ticket
+  - Escalate if not tagged in 48 hours
+
+# Old snapshot cleanup
+Schedule: Weekly Lambda function
+Action:
+  - Delete snapshots older than 90 days (unless tagged KeepForever=true)
+  - Notify teams of deletions
+  - Estimate savings
+
+# Budget breach response
+Trigger: Budget > 100%
+Action:
+  - Email alerts to stakeholders
+  - Create incident ticket
+  - Stop non-production resources (optional)
+```
+
+---
+
+## Metrics & KPIs
+
+### Key FinOps Metrics
+
+**1. Cost Metrics**
+```yaml
+Total Monthly Cloud Spend:
+  Target: Within budget
+  Trend: Track month-over-month
+
+Cost per Customer:
+  Calculation: Total AWS Cost / Active Customers
+  Target: Decreasing over time
+
+Cost per Transaction:
+  Calculation: Total AWS Cost / Transactions Processed
+  Target: Optimize for efficiency
+
+Unit Economics:
+  Calculation: Revenue per Customer - Cost per Customer
+  Target: Positive and growing
+```
+
+**2. Efficiency Metrics**
+```yaml
+Compute Utilization:
+  Metric: Average CPU utilization
+  Target: 40-60% (room for burst, not over-provisioned)
+
+Storage Utilization:
+  Metric: % of S3 in cost-optimized tiers
+  Target: >60% in IA or Glacier tiers
+
+Reserved Instance Coverage:
+  Metric: % of eligible usage covered by RIs/SPs
+  Target: >70% for stable workloads
+
+RI/SP Utilization:
+  Metric: % of RIs/SPs actually used
+  Target: >90%
+```
+
+**3. Operational Metrics**
+```yaml
+Tag Compliance:
+  Metric: % of resources with required tags
+  Target: >95%
+
+Budget Variance:
+  Metric: Actual vs Budget %
+  Target: ±5%
+
+Optimization Savings:
+  Metric: $ saved per month from optimizations
+  Target: Growing
+
+Mean Time to Optimize (MTTO):
+  Metric: Days from finding opportunity to implementing
+  Target: <30 days
+```
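+
+Two of the operational metrics above reduce to simple arithmetic. The helper names below are illustrative assumptions; the ±5% tolerance is the target stated above.
+
+```python
+def budget_variance_pct(actual, budget):
+    """Signed variance as a percentage of budget (positive = over budget)."""
+    return (actual - budget) / budget * 100
+
+
+def within_target(actual, budget, tolerance_pct=5.0):
+    """True when spend is within the +/- tolerance of budget."""
+    return abs(budget_variance_pct(actual, budget)) <= tolerance_pct
+
+
+def tag_compliance_pct(tagged, total):
+    """Share of resources carrying all required tags."""
+    return 100.0 if total == 0 else tagged / total * 100
+
+
+# Hypothetical month: $52,000 spent against a $50,000 budget.
+print(f"variance: {budget_variance_pct(52000, 50000):+.1f}%")
+print(f"tag compliance: {tag_compliance_pct(940, 1000):.1f}%")
+```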
+
+**4. Organizational Metrics**
+```yaml
+FinOps Engagement:
+  Metric: % of teams attending monthly reviews
+  Target: 100%
+
+Cost Awareness:
+  Survey: Do engineers know their team's monthly cost?
+  Target: >80% aware
+
+Optimization Velocity:
+  Metric: # optimization tasks completed per quarter
+  Target: Growing trend
+```
+
+### Dashboard Requirements
+
+**Executive Dashboard (Monthly)**
+- Total spend vs budget
+- Spend by service (top 10)
+- Month-over-month trend
+- Forecast for next quarter
+- Optimization savings achieved
+
+**Engineering Dashboard (Real-time)**
+- Per-team costs (daily)
+- Cost anomaly alerts
+- Untagged resources count
+- Budget utilization %
+- Top cost drivers
+
+**FinOps Dashboard (Daily)**
+- Detailed service costs
+- Tag compliance metrics
+- RI/SP utilization
+- Rightsizing opportunities
+- Unused resource counts
+
+---
+
+## Getting Started Checklist
+
+### Phase 1: Foundation (Month 1)
+- [ ] Enable Cost Explorer
+- [ ] Set up AWS Budgets
+- [ ] Define tagging strategy
+- [ ] Activate cost allocation tags
+- [ ] Set up Cost and Usage Reports (CUR)
+- [ ] Create basic cost dashboard
+
+### Phase 2: Visibility (Months 2-3)
+- [ ] Implement tagging enforcement
+- [ ] Run first optimization scripts
+- [ ] Set up monthly review meeting
+- [ ] Create team cost reports
+- [ ] Assign team cost owners
+- [ ] Document FinOps processes
+
+### Phase 3: Optimization (Months 4-6)
+- [ ] Implement automated resource scheduling
+- [ ] Purchase first Reserved Instances
+- [ ] Set up cost anomaly detection
+- [ ] Automate reporting
+- [ ] Train engineering teams
+- [ ] Implement showback/chargeback
+
+### Phase 4: Culture (Ongoing)
+- [ ] Cost metrics in engineering KPIs
+- [ ] Cost review in architecture reviews
+- [ ] Regular optimization sprints
+- [ ] FinOps champions in each team
+- [ ] Cost-aware development practices
+- [ ] Continuous improvement
+
+---
+
+## Resources
+
+**AWS Native Tools**
+- AWS Cost Explorer
+- AWS Budgets
+- AWS Cost Anomaly Detection
+- AWS Compute Optimizer
+- AWS Trusted Advisor
+- AWS Cost & Usage Reports
+
+**Third-Party Tools**
+- CloudHealth (VMware)
+- Cloudability (Apptio)
+- Kubecost (Kubernetes cost monitoring)
+- Spot.io (Cost optimization platform)
+
+**FinOps Foundation**
+- https://www.finops.org
+- FinOps Certified Practitioner certification
+- FinOps community and best practices
--- a/references/service_alternatives.md
+++ b/references/service_alternatives.md
@@ -0,0 +1,466 @@
+# AWS Service Alternatives - Cost Optimization Guide
+
+When to use cheaper alternatives and cost-effective service options for common AWS services.
+
+## Table of Contents
+
+1. [Compute Alternatives](#compute-alternatives)
+2. [Storage Alternatives](#storage-alternatives)
+3. [Database Alternatives](#database-alternatives)
+4. [Networking Alternatives](#networking-alternatives)
+5. [Application Services](#application-services)
+
+---
+
+## Compute Alternatives
+
+### EC2 vs Lambda vs Fargate
+
+**EC2 (Most Economical for Consistent Workloads)**
+- **When to use**: 24/7 workloads, predictable traffic, need full OS control
+- **Cost model**: Hourly charges, cheaper with Reserved Instances
+- **Best for**: Always-on applications, legacy apps, specific OS/kernel requirements
+- **Example**: Web server handling steady traffic → EC2 with Reserved Instance
+
+**Lambda (Most Economical for Intermittent Work)**
+- **When to use**: Event-driven, sporadic usage, < 15 minute executions
+- **Cost model**: Pay per execution and duration (GB-seconds)
+- **Best for**: APIs with sporadic traffic, scheduled tasks, event processing
+- **Example**: Image processing triggered by S3 upload → Lambda
+- **Break-even**: ~20-30 hours/month execution time vs equivalent EC2
+
+**Fargate (Middle Ground)**
+- **When to use**: Containerized apps, variable traffic, don't want to manage servers
+- **Cost model**: Pay for vCPU and memory allocated
+- **Best for**: Microservices, batch jobs, variable load applications
+- **Example**: Background worker that scales 0-10 containers → Fargate
+- **Tip**: Fargate Spot offers up to 70% savings for fault-tolerant tasks
+
+**Decision Matrix**
+```
+Consistent 24/7 load → EC2 with Reserved Instances
+Variable load, containerized → Fargate (or Fargate Spot)
+Event-driven, < 15 min → Lambda
+Batch processing → Fargate Spot or EC2 Spot
+```
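
The Lambda vs EC2 break-even shifts with memory size and invocation volume, so it helps to estimate it per workload. A minimal sketch, assuming illustrative x86 list prices for Lambda (the two constants below are assumptions, not figures from this guide) and ignoring the free tier:

```python
# Assumed Lambda list prices (not from this guide); check current regional pricing.
LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667
LAMBDA_PRICE_PER_MILLION_REQUESTS = 0.20


def lambda_monthly_cost(invocations: int, avg_seconds: float, memory_gb: float) -> float:
    """Estimate monthly Lambda cost from invocation count, duration, and memory."""
    gb_seconds = invocations * avg_seconds * memory_gb
    compute = gb_seconds * LAMBDA_PRICE_PER_GB_SECOND
    requests = invocations / 1_000_000 * LAMBDA_PRICE_PER_MILLION_REQUESTS
    return compute + requests


def cheaper_option(invocations: int, avg_seconds: float, memory_gb: float,
                   ec2_monthly_cost: float) -> str:
    """Return which option is cheaper for the given workload and EC2 bill."""
    lambda_cost = lambda_monthly_cost(invocations, avg_seconds, memory_gb)
    return "Lambda" if lambda_cost < ec2_monthly_cost else "EC2"
```

For example, `cheaper_option(1_000_000, 1.0, 1.0, 60.0)` compares one million one-second invocations at 1 GB against a hypothetical $60/month instance.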
+
+### EC2 Instance Alternatives
+
+**Standard vs Graviton (ARM64)**
+- **Graviton Savings**: About 20% cheaper for comparable performance
+- **When to use**: Modern applications, ARM-compatible workloads
+- **Alternatives**:
+  - t3.large → t4g.large (20% cheaper)
+  - m5.xlarge → m6g.xlarge (20% cheaper)
+  - c5.2xlarge → c6g.2xlarge (20% cheaper)
+- **Considerations**: Test application compatibility first
+
+**Current vs Previous Generation**
+- **Migration Savings**: 5-10% cheaper, better performance
+- **Examples**:
+  - t2 → t3 (10% cheaper, better performance)
+  - m4 → m5 → m6i (progressive improvements)
+  - c4 → c5 → c6i (better price/performance)
+- **Action**: Check `detect_old_generations.py` script
+
+**On-Demand vs Spot vs Reserved**
+- **On-Demand**: List price (baseline), highest cost, full flexibility
+- **Spot**: 60-90% discount, can be interrupted
+- **Reserved (1yr)**: 30-40% discount
+- **Reserved (3yr)**: 50-65% discount
+- **Decision**: Use Spot for fault-tolerant, RI for predictable, On-Demand for rest
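
To see how a mix of purchase options affects a bill, the discount ranges above can be applied to an On-Demand rate. This is a rough sketch: the midpoints chosen below are an assumption for illustration, and real discounts depend on instance type, term, and payment option.

```python
# Midpoints of the discount ranges listed above (illustrative assumption).
DISCOUNTS = {
    "on_demand": 0.0,
    "spot": 0.75,           # midpoint of 60-90%
    "reserved_1yr": 0.35,   # midpoint of 30-40%
    "reserved_3yr": 0.575,  # midpoint of 50-65%
}


def blended_cost(on_demand_rate: float, hours_by_option: dict) -> float:
    """Total cost for instance-hours split across purchase options."""
    total = 0.0
    for option, hours in hours_by_option.items():
        total += hours * on_demand_rate * (1 - DISCOUNTS[option])
    return total
```

Calling `blended_cost(0.10, {"reserved_1yr": 730, "spot": 200})` prices one always-on instance under a 1-year RI plus 200 Spot hours at a hypothetical $0.10/hour On-Demand rate.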
+
+---
+
+## Storage Alternatives
+
+### S3 Storage Classes
+
+**Frequently Accessed Data**
+```
+S3 Standard → $0.023/GB/month
+Use when: Accessing files multiple times per month
+```
+
+**Infrequently Accessed Data**
+```
+S3 Standard → S3 Standard-IA
+$0.023/GB/month → $0.0125/GB/month (46% cheaper)
+Retrieval cost: $0.01/GB
+Break-even: about one full retrieval of the data per month
+            ($0.0105/GB storage savings vs $0.01/GB retrieval)
+Use when: Backups, disaster recovery, infrequently accessed files
+```
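
The Standard-IA break-even follows directly from the prices above. A small sketch that uses only those three figures and ignores request fees and any minimum-duration or minimum-size charges:

```python
STANDARD_PER_GB = 0.023      # S3 Standard, per GB-month (from the text above)
IA_PER_GB = 0.0125           # S3 Standard-IA, per GB-month
IA_RETRIEVAL_PER_GB = 0.01   # Standard-IA retrieval, per GB


def standard_cost(stored_gb: float) -> float:
    """Monthly storage cost in S3 Standard."""
    return stored_gb * STANDARD_PER_GB


def standard_ia_cost(stored_gb: float, retrieved_gb: float) -> float:
    """Monthly storage plus retrieval cost in S3 Standard-IA."""
    return stored_gb * IA_PER_GB + retrieved_gb * IA_RETRIEVAL_PER_GB


def break_even_retrieval_ratio() -> float:
    """GB retrieved per GB stored per month at which both classes cost the same."""
    return (STANDARD_PER_GB - IA_PER_GB) / IA_RETRIEVAL_PER_GB
```

A ratio near 1 means Standard-IA stops paying off once roughly the whole dataset is read back each month.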
+
+**Unknown Access Patterns**
+```
+S3 Standard → S3 Intelligent-Tiering
+$0.023/GB/month → Automatic optimization
+Extra cost: $0.0025 per 1000 objects monitored
+Use when: Unclear access patterns, don't want to manage lifecycle
+Best for: Mixed workloads, analytics datasets
+```
+
+**Archive Storage**
+```
+S3 Standard → S3 Glacier Instant Retrieval
+$0.023/GB → $0.004/GB (83% cheaper)
+Retrieval: Milliseconds, $0.03/GB
+Use when: Archive with immediate access needs (e.g., medical records)
+
+S3 Standard → S3 Glacier Flexible Retrieval
+$0.023/GB → $0.0036/GB (84% cheaper)
+Retrieval: Minutes to hours, $0.01/GB
+Use when: Archive data, acceptable retrieval delay
+
+S3 Standard → S3 Glacier Deep Archive
+$0.023/GB → $0.00099/GB (96% cheaper)
+Retrieval: Within 12 hours, $0.02/GB
+Use when: Long-term archive, regulatory compliance, rarely accessed
+```
+
+**Decision Tree**
+```
+Accessed daily → S3 Standard
+Accessed monthly → S3 Standard-IA
+Unknown pattern → S3 Intelligent-Tiering
+Archive, instant access → Glacier Instant Retrieval
+Archive, can wait hours → Glacier Flexible Retrieval
+Archive, can wait 12 hours → Glacier Deep Archive
+```
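
The decision tree can be turned into a lifecycle rule that moves aging objects down the tiers automatically. The sketch below only builds the configuration dictionary; the day thresholds and the `logs/` prefix are hypothetical choices, not recommendations from this guide. The resulting structure has the shape that boto3's `put_bucket_lifecycle_configuration` accepts as its `LifecycleConfiguration` argument.

```python
import json


def build_lifecycle_rule(prefix: str, ia_days: int = 30, glacier_days: int = 90,
                         deep_archive_days: int = 365) -> dict:
    """Build one S3 lifecycle rule that tiers objects down as they age."""
    if not ia_days < glacier_days < deep_archive_days:
        raise ValueError("transition days must be strictly increasing")
    return {
        "ID": f"tiering-{prefix.strip('/') or 'all'}",
        "Status": "Enabled",
        "Filter": {"Prefix": prefix},
        "Transitions": [
            {"Days": ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": glacier_days, "StorageClass": "GLACIER"},  # Glacier Flexible Retrieval
            {"Days": deep_archive_days, "StorageClass": "DEEP_ARCHIVE"},
        ],
    }


# Hypothetical example: tier everything under logs/ with the default thresholds.
lifecycle_configuration = {"Rules": [build_lifecycle_rule("logs/")]}
print(json.dumps(lifecycle_configuration, indent=2))
```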
+
+### EBS Volume Types
+
+**General Purpose Volumes**
+```
+gp2 → gp3
+$0.10/GB → $0.08/GB (20% cheaper)
+Additional benefits: Configurable IOPS/throughput independent of size
+Action: Convert all gp2 to gp3 (no downtime required)
+```
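
The 20% figure holds when gp3's baseline performance is enough. Large gp2 volumes earn more baseline IOPS than gp3 includes, so matching them costs extra. A rough estimate, assuming gp2 provides 3 IOPS per GB (minimum 100, maximum 16,000), gp3 includes 3,000 IOPS, and extra gp3 IOPS cost $0.005 per IOPS-month (these three figures are assumptions, not from this guide; throughput charges are ignored):

```python
GP2_PER_GB = 0.10            # from the text above
GP3_PER_GB = 0.08            # from the text above
GP3_BASELINE_IOPS = 3000     # assumption: IOPS included with every gp3 volume
GP3_EXTRA_IOPS_PRICE = 0.005 # assumption: per provisioned IOPS-month above baseline


def gp2_baseline_iops(size_gb: int) -> int:
    """Baseline IOPS a gp2 volume of this size receives (assumed 3 IOPS/GB rule)."""
    return min(max(3 * size_gb, 100), 16000)


def gp3_cost(size_gb: int, iops: int = GP3_BASELINE_IOPS) -> float:
    """Monthly gp3 cost including any IOPS provisioned above the baseline."""
    extra_iops = max(iops - GP3_BASELINE_IOPS, 0)
    return size_gb * GP3_PER_GB + extra_iops * GP3_EXTRA_IOPS_PRICE


def migration_savings(size_gb: int) -> float:
    """Monthly savings when gp3 is provisioned to match the gp2 baseline IOPS."""
    matched_iops = max(gp2_baseline_iops(size_gb), GP3_BASELINE_IOPS)
    return size_gb * GP2_PER_GB - gp3_cost(size_gb, matched_iops)
```

Under these assumptions a 500 GB volume saves the full 20%, while a 2,000 GB volume saves less because 3,000 additional IOPS must be provisioned.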
+
+**High Performance Workloads**
+```
+io1 → io2
+Same price, better durability and IOPS
+io2 Block Express: For highest performance needs
+
+Consider: Do you really need provisioned IOPS?
+Many workloads perform fine on gp3 (up to 16,000 IOPS)
+Test gp3 before committing to io2
+```
+
+**Throughput-Optimized Workloads**
+```
+gp3 → st1 (Throughput Optimized HDD)
+$0.08/GB → $0.045/GB (44% cheaper)
+Use when: Big data, data warehouses, log processing
+Sequential access patterns, throughput more important than IOPS
+```
+
+**Cold Data**
+```
+gp3 → sc1 (Cold HDD)
+$0.08/GB → $0.015/GB (81% cheaper)
+Use when: Infrequently accessed data, lowest cost priority
+Example: Archive storage, cold backups
+```
+
+### EFS vs S3 vs EBS
+
+**S3 (Cheapest for Object Storage)**
+- **Cost**: $0.023/GB/month (Standard)
+- **When to use**: Object storage, static files, backups
+- **Pros**: Unlimited scale, integrates with everything
+- **Cons**: Not a file system, higher latency
+
+**EBS (Best for Single-Instance Block Storage)**
+- **Cost**: $0.08/GB/month (gp3)
+- **When to use**: Boot volumes, database storage, single EC2 instance
+- **Pros**: High performance, low latency
+- **Cons**: Single-AZ, attached to one instance
+
+**EFS (File System Across Multiple Instances)**
+- **Cost**: $0.30/GB/month (Standard), $0.016/GB/month (IA)
+- **When to use**: Shared file storage across multiple instances
+- **Pros**: Multi-AZ, grows automatically, NFSv4
+- **Cons**: More expensive than EBS
+- **Optimization**: Use EFS Intelligent-Tiering to auto-move to IA class
+
+**Decision Matrix**
+```
+Single instance, block storage → EBS
+Multiple instances, shared files → EFS (with Intelligent-Tiering)
+Object storage, static files → S3
+Large data, high throughput → FSx for Lustre
+Windows file shares → FSx for Windows
+```
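
How much EFS costs in practice depends on the share of data that ends up in the IA class. A minimal sketch using the two EFS prices above; it ignores IA access charges, and the 80% IA share in the usage note is a hypothetical figure:

```python
EFS_STANDARD_PER_GB = 0.30  # from the text above
EFS_IA_PER_GB = 0.016       # from the text above


def efs_monthly_cost(total_gb: float, ia_fraction: float) -> float:
    """Monthly EFS storage cost when ia_fraction of the data sits in the IA class."""
    if not 0.0 <= ia_fraction <= 1.0:
        raise ValueError("ia_fraction must be between 0 and 1")
    standard_gb = total_gb * (1 - ia_fraction)
    ia_gb = total_gb * ia_fraction
    return standard_gb * EFS_STANDARD_PER_GB + ia_gb * EFS_IA_PER_GB
```

For instance, `efs_monthly_cost(1000, 0.8)` estimates 1 TB with 80% of the data tiered to IA, which can be compared against `efs_monthly_cost(1000, 0.0)` for the untiered case.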
+
+---
+
+## Database Alternatives
+
+### RDS vs Aurora vs Self-Managed
+
+**RDS PostgreSQL/MySQL (Baseline)**
+- **Cost**: Instance + storage
+- **When to use**: Standard relational DB needs
+- **Example**: db.t3.medium = ~$60/month + storage
+
+**Aurora PostgreSQL/MySQL (Typically 20-50% More Than RDS)**
+- **Cost**: Instance + storage + I/O charges
+- **When to use**: Need high availability, auto-scaling storage, read replicas
+- **Pros**: Better performance, automatic failover, up to 15 read replicas
+- **Cons**: More expensive
+- **Worth it when**: Read traffic is high or fast replication is required
+
+**Aurora Serverless v2 (Variable Workloads)**
+- **Cost**: Pay per ACU (Aurora Capacity Unit) per second
+- **When to use**: Variable load, dev/test, infrequent usage
+- **Example**: Dev database used 8 hours/day → up to ~67% savings vs always-on (active 8 of 24 hours)
+- **Limitation**: Min capacity charges apply
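
The savings estimate and the minimum-capacity caveat can be combined in one formula. This sketch assumes the active-hour rate matches the always-on rate, which is a simplification; `idle_cost_fraction` is a hypothetical parameter for the share of peak cost still billed while idle at minimum capacity.

```python
def serverless_savings(active_hours_per_day: float, idle_cost_fraction: float = 0.0) -> float:
    """Fractional savings vs an always-on database billed at the peak rate.

    idle_cost_fraction: share of the peak hourly cost still charged while idle
    (0.0 means idle time is free; hypothetical parameter for this sketch).
    """
    if not 0 <= active_hours_per_day <= 24:
        raise ValueError("active_hours_per_day must be between 0 and 24")
    if not 0.0 <= idle_cost_fraction <= 1.0:
        raise ValueError("idle_cost_fraction must be between 0 and 1")
    active_share = active_hours_per_day / 24
    idle_share = 1 - active_share
    relative_cost = active_share + idle_share * idle_cost_fraction
    return 1 - relative_cost
```

With free idle time, 8 active hours a day gives the two-thirds savings quoted above; any nonzero idle charge reduces it.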
+
+**Self-Managed on EC2 (Cheapest for Experts)**
+- **Cost**: Just EC2 + EBS costs
+- **When to use**: Full control needed, specific configuration, cost-sensitive
+- **Pros**: Can be 50-70% cheaper than RDS
+- **Cons**: You manage backups, patching, HA, monitoring
+- **Consideration**: Factor in operational overhead
+
+**Decision Matrix**
+```
+Standard workload, managed preferred → RDS
+High availability, many reads → Aurora
+Variable workload → Aurora Serverless v2
+Cost-sensitive, have DBA expertise → Self-managed on EC2
+Dev/test, intermittent use → Aurora Serverless v2
+```
+
+### DynamoDB Pricing Models
+
+**On-Demand (Unpredictable Traffic)**
+- **Cost**: $1.25 per million writes, $0.25 per million reads
+- **When to use**: Variable traffic, new applications, spiky workloads
+- **Pros**: No capacity planning, scales automatically
+- **Example**: New API with unknown traffic pattern
+
+**Provisioned Capacity (Predictable Traffic)**
+- **Cost**: $0.00065 per WCU/hour, $0.00013 per RCU/hour
+- **When to use**: Predictable traffic patterns
+- **Savings**: 60-80% cheaper than on-demand at consistent usage
+- **Example**: Application with steady 100 req/sec
+
+**Reserved Capacity (Long-term Commitment)**
+- **Cost**: Additional 30-50% discount on provisioned capacity
+- **When to use**: Known long-term capacity needs
+- **Commitment**: 1 or 3 years
+
+**Break-Even Calculation**
+```
+On-Demand: $1.25 per million writes
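+Provisioned: ~$0.18 per million writes at 100% utilization
+             ($0.00065 per WCU-hour ÷ 3,600 writes per WCU-hour)
+             (~$0.47 is the monthly cost of one WCU, not a per-million rate)
+Break-even: ~14% sustained utilization ($0.18 / $1.25)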
+
+Action: Start with on-demand, switch to provisioned once patterns clear
+```
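
The comparison can be recomputed from the listed per-unit prices. A short sketch, assuming one WCU sustains one standard write per second (3,600 per hour) and that provisioned capacity is billed whether or not it is used:

```python
ON_DEMAND_PER_MILLION_WRITES = 1.25  # from the text above
PROVISIONED_PER_WCU_HOUR = 0.00065   # from the text above
WRITES_PER_WCU_HOUR = 3600           # assumption: one standard write per second per WCU


def provisioned_cost_per_million_writes(utilization: float) -> float:
    """Effective cost per million writes when only part of the capacity is used."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    writes_per_hour = WRITES_PER_WCU_HOUR * utilization
    return PROVISIONED_PER_WCU_HOUR / writes_per_hour * 1_000_000


def break_even_utilization() -> float:
    """Utilization at which provisioned and on-demand cost the same per write."""
    return provisioned_cost_per_million_writes(1.0) / ON_DEMAND_PER_MILLION_WRITES
```

Below the break-even utilization on-demand is cheaper; above it, provisioned capacity wins, before any auto scaling headroom is taken into account.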
+
+### Database Migration Options
+
+**From Commercial to Open Source**
+```
+Oracle → Aurora PostgreSQL or RDS PostgreSQL
+Savings: Up to 90% on licensing costs
+Consider: PostgreSQL compatibility, migration effort
+
+SQL Server → Aurora PostgreSQL or RDS PostgreSQL/MySQL
+Savings: 50-90% on licensing costs
+Consider: Application compatibility, migration effort
+```
+
+**From RDS to Aurora**
+```
+Only if: High availability requirements, many read replicas needed
+Cost increase: 20-50% more
+Benefit: Better performance, automatic failover, scaling
+```
+
+**From Aurora to RDS**
+```
+When: Don't need Aurora features, cost-conscious
+Savings: 20-50%
+Downgrade if: Single-AZ sufficient, limited read replicas needed
+```
+
+---
+
+## Networking Alternatives
+
+### NAT Gateway Alternatives
+
+**NAT Gateway (Default, Expensive)**
+- **Cost**: $32.85/month + $0.045/GB processed
+- **When to use**: Production, high availability, easy management
+
+**VPC Endpoints (Cheaper for AWS Services)**
+- **Gateway Endpoint (S3, DynamoDB)**: FREE
+- **Interface Endpoint**: $7.20/month + $0.01/GB (per endpoint, per AZ)
+- **When to use**: Accessing S3, DynamoDB, or other AWS services
+- **Savings**: $25-30/month vs NAT Gateway
+- **Example**: Lambda accessing S3 → Use S3 Gateway Endpoint
+
+**NAT Instance (Cheapest, More Work)**
+- **Cost**: Just EC2 cost (e.g., t3.micro = $7.50/month)
+- **When to use**: Dev/test, cost-sensitive, low traffic
+- **Cons**: Must manage, less resilient, manual HA setup
+- **Savings**: 75% vs NAT Gateway
+
+**Decision Matrix**
+```
+S3 or DynamoDB only → Gateway Endpoint (FREE)
+Other AWS services → Interface Endpoint
+Production, high availability → NAT Gateway
+Dev/test, low traffic → NAT Instance or single NAT Gateway
+```
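
Interface endpoints are billed per endpoint, so the savings shrink as more services need one. A quick comparison using the prices above (a single NAT Gateway and one AZ are assumed for simplicity):

```python
NAT_GATEWAY_MONTHLY = 32.85         # from the text above
NAT_GATEWAY_PER_GB = 0.045
INTERFACE_ENDPOINT_MONTHLY = 7.20   # per endpoint
INTERFACE_ENDPOINT_PER_GB = 0.01


def nat_gateway_cost(gb_processed: float) -> float:
    """Monthly cost of one NAT Gateway for the given traffic volume."""
    return NAT_GATEWAY_MONTHLY + gb_processed * NAT_GATEWAY_PER_GB


def interface_endpoint_cost(gb_processed: float, endpoints: int = 1) -> float:
    """Monthly cost of routing the same traffic through interface endpoints."""
    return endpoints * INTERFACE_ENDPOINT_MONTHLY + gb_processed * INTERFACE_ENDPOINT_PER_GB


def gateway_endpoint_cost(gb_processed: float) -> float:
    """Gateway endpoints (S3, DynamoDB) carry no hourly or per-GB charge."""
    return 0.0
```

At 100 GB/month, one interface endpoint costs far less than a NAT Gateway, but five endpoints land close to the NAT Gateway price.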
+
+### Load Balancer Alternatives
+
+**Application Load Balancer (ALB)**
+- **Cost**: $16.20/month + LCU charges
+- **When to use**: HTTP/HTTPS, path-based routing, microservices
+- **Features**: Layer 7, content-based routing, Lambda targets
+
+**Network Load Balancer (NLB)**
+- **Cost**: ~$16.20/month + NLCU charges (same hourly base rate as ALB in most regions)
+- **When to use**: TCP/UDP, extreme performance, static IPs
+- **Use case**: Non-HTTP protocols, high throughput
+
+**Classic Load Balancer (Legacy)**
+- **Cost**: $18/month + data charges
+- **Recommendation**: Migrate to ALB or NLB (better features, often cheaper)
+
+**CloudFront + S3 (Static Content)**
+- **Cost**: Much cheaper for static content
+- **When to use**: Static website, single-page app
+- **Setup**: S3 static hosting + CloudFront distribution
+- **Savings**: 90% vs ALB for static content
+
+**API Gateway (REST APIs)**
+- **Cost**: Pay per request
+- **When to use**: REST API, need API management features
+- **Alternative to**: ALB for simple APIs
+
+---
+
+## Application Services
+
+### Message Queue Alternatives
+
+**SQS vs SNS vs EventBridge vs Kinesis**
+
+**SQS (Point-to-Point, Cheapest)**
+- **Cost**: $0.40 per million requests (Standard), $0.50 (FIFO)
+- **When to use**: Work queues, decoupling services
+- **Best for**: Job processing, task queues
+
+**SNS (Pub/Sub, Cheap)**
+- **Cost**: $0.50 per million publishes
+- **When to use**: Fan-out notifications, multiple subscribers
+- **Best for**: Notifications, multiple consumers
+
+**EventBridge (Event Router)**
+- **Cost**: $1.00 per million events
+- **When to use**: Event-driven architecture, complex routing
+- **Best for**: Cross-account events, SaaS integrations
+
+**Kinesis (Streaming, Expensive)**
+- **Cost**: $0.015 per shard-hour + PUT charges
+- **When to use**: Real-time streaming, ordered processing
+- **Best for**: Logs, analytics, real-time processing
+- **Alternative**: Kinesis Data Firehose (simpler, cheaper for basic needs)
+
+**Decision Matrix**
+```
+Simple queue → SQS
+Multiple consumers → SNS
+Complex event routing → EventBridge
+Real-time streaming → Kinesis
+Log aggregation → Kinesis Firehose
+```
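
The per-request services and Kinesis are priced on different axes, so a side-by-side estimate helps. This sketch uses the prices above, assumes a 730-hour month, and leaves out free tiers and Kinesis PUT charges. Note that SQS bills requests rather than messages, so one message usually costs several requests (send, receive, delete).

```python
# Price per million requests/publishes/events, from the text above.
PRICE_PER_MILLION = {
    "sqs_standard": 0.40,
    "sqs_fifo": 0.50,
    "sns": 0.50,
    "eventbridge": 1.00,
}
KINESIS_SHARD_HOUR = 0.015  # from the text above
HOURS_PER_MONTH = 730       # assumption


def per_request_cost(service: str, millions: float) -> float:
    """Monthly cost for a request-priced service at the given volume."""
    return PRICE_PER_MILLION[service] * millions


def kinesis_shard_cost(shards: int) -> float:
    """Monthly shard-hour cost for a provisioned Kinesis stream (PUT charges excluded)."""
    return shards * KINESIS_SHARD_HOUR * HOURS_PER_MONTH
```

Comparing `per_request_cost("sqs_standard", 100)` with `kinesis_shard_cost(4)` shows how the two models diverge as volume grows.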
+
+### Container Orchestration
+
+**ECS vs EKS vs Fargate**
+
+**ECS on EC2 (Cheapest)**
+- **Cost**: Just EC2 costs (no ECS fee)
+- **When to use**: AWS-native, simpler workloads
+- **Best for**: Cost-sensitive, AWS-specific deployments
+
+**ECS on Fargate (Serverless, Easy)**
+- **Cost**: Pay per task (vCPU + memory)
+- **When to use**: Variable load, don't want to manage servers
+- **Best for**: Variable workloads, simpler operations
+
+**EKS (Kubernetes, Expensive)**
+- **Cost**: $73/month per cluster + node costs
+- **When to use**: Need Kubernetes, multi-cloud, complex deployments
+- **Best for**: Kubernetes expertise, need K8s ecosystem
+- **Tip**: Consolidate workloads to fewer clusters
+
+**Decision Matrix**
+```
+AWS-native, cost-sensitive → ECS on EC2
+Variable load, easy management → ECS on Fargate
+Need Kubernetes → EKS
+Multiple environments → Consider single EKS cluster with namespaces
+```
+
+---
+
+## Quick Reference: When to Switch
+
+### Immediate Actions (Low Risk)
+- [ ] gp2 → gp3 (20% savings, no downtime)
+- [ ] S3 Standard → Intelligent-Tiering (auto-optimization)
+- [ ] NAT Gateway → VPC Endpoints for S3/DynamoDB (free)
+- [ ] Old generation instances → New generation (5-10% savings)
+- [ ] Intel → Graviton (20% savings, test first)
+
+### Medium Effort Actions
+- [ ] On-Demand → Reserved Instances/Savings Plans (30-65% savings, depending on term)
+- [ ] Always-on EC2 → Lambda for intermittent work
+- [ ] S3 Standard → Lifecycle policies (50-95% savings on old data)
+- [ ] RDS On-Demand → Reserved Instances (40-65% savings)
+- [ ] DynamoDB On-Demand → Provisioned (60-80% savings if predictable)
+
+### High Effort Actions (Evaluate Carefully)
+- [ ] RDS → Aurora (usually more expensive, only if need features)
+- [ ] Aurora → RDS (20-50% savings if don't need Aurora features)
+- [ ] Commercial DB → PostgreSQL (up to 90% licensing savings, migration effort)
+- [ ] EC2 → Lambda (case-by-case, break-even analysis needed)
+- [ ] ECS → EKS (usually more expensive, only if need K8s)
+
+---
+
+## Cost Comparison Tool
+
+Use this mental model when evaluating alternatives:
+
+```
+1. Calculate current monthly cost
+2. Calculate alternative monthly cost
+3. Estimate migration cost (hours × hourly rate)
+4. Calculate payback period: Migration Cost / Monthly Savings
+5. Decide: Payback < 3 months → Likely worth it
+           Payback > 6 months → Evaluate carefully
+```
+
+**Example:**
+```
+Current: ALB for static site = $20/month
+Alternative: CloudFront + S3 = $2/month
+Savings: $18/month
+Migration: 4 hours × $100/hour = $400
+Payback: $400 / $18 = 22 months → Maybe not worth it
+
+But if: Multiple sites, reusable pattern → Worth the investment
+```
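
The payback calculation above is easy to script for a list of candidate migrations. A minimal sketch; the "Judgment call" label for the 3-6 month range is an assumption, since the mental model above does not classify that band.

```python
def payback_months(migration_hours: float, hourly_rate: float,
                   current_monthly: float, alternative_monthly: float) -> float:
    """Months needed for monthly savings to repay the one-time migration cost."""
    monthly_savings = current_monthly - alternative_monthly
    if monthly_savings <= 0:
        return float("inf")  # the alternative never pays back
    return migration_hours * hourly_rate / monthly_savings


def verdict(months: float) -> str:
    """Classify a payback period using the thresholds from the mental model."""
    if months < 3:
        return "Likely worth it"
    if months > 6:
        return "Evaluate carefully"
    return "Judgment call"  # 3-6 months is not classified above (assumption)
```

With the example figures, `payback_months(4, 100, 20, 2)` returns roughly 22 months, which `verdict` classifies as "Evaluate carefully".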