# AWS Cost Optimization Best Practices

Comprehensive strategies for optimizing AWS costs across all major service categories.

## Table of Contents

1. [Compute Optimization](#compute-optimization)
2. [Storage Optimization](#storage-optimization)
3. [Network Optimization](#network-optimization)
4. [Database Optimization](#database-optimization)
5. [Container & Serverless Optimization](#container--serverless-optimization)
6. [General Principles](#general-principles)
7. [Quick Wins Checklist](#quick-wins-checklist)

---

## Compute Optimization

### EC2 Instance Optimization

**Right Instance Family**
- **General Purpose (T3, M5, M6i)**: Web servers, small-medium databases, dev environments
- **Compute Optimized (C5, C6i, C6g)**: CPU-intensive workloads, batch processing, HPC
- **Memory Optimized (R5, R6i, R6g)**: Databases, in-memory caches, big data
- **Storage Optimized (I3, D2)**: High IOPS, data warehousing, Hadoop

**Graviton Migration (ARM64)**
- Up to 20% cost savings with M6g, C6g, R6g, T4g instances
- Test compatibility first: Most modern languages/frameworks support ARM64
- Best for: Stateless applications, containerized workloads, open-source software

**Instance Sizing**
- Start small and scale up based on metrics
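- Monitor CPU, memory, and network for 2+ weeks before committing (memory metrics require the CloudWatch agent)
- Use CloudWatch metrics to identify underutilized instances
- Consider burstable instances (T3) for variable workloads; watch for CPU credit surcharges in unlimited mode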
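
The sizing rule above can be sketched as a simple check over exported CloudWatch datapoints. The 40% threshold and the function name are illustrative assumptions, not AWS guidance; the input is assumed to be a list of hourly average CPU percentages.

```python
from statistics import quantiles


def is_underutilized(cpu_percent_samples, p95_threshold=40.0, min_samples=14 * 24):
    """Return True when 95th-percentile CPU stays below the threshold.

    cpu_percent_samples: hourly average CPU utilization values (0-100).
    min_samples: require roughly two weeks of hourly data before deciding.
    """
    if len(cpu_percent_samples) < min_samples:
        return False  # not enough history to judge
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = quantiles(cpu_percent_samples, n=20)[-1]
    return p95 < p95_threshold
```

Using a high percentile rather than the mean avoids downsizing an instance that is idle most of the day but busy during a daily peak.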

**Purchase Options**
- **On-Demand**: Flexible, no commitment, highest cost
- **Reserved Instances**: 1- or 3-year commitment, up to 72% savings
  - Standard RI: Highest discount, instance family fixed for the term
  - Convertible RI: Lower discount, can be exchanged for other instance families
- **Savings Plans**: 1- or 3-year commitment to an hourly spend; up to 66% savings with Compute Savings Plans, up to 72% with EC2 Instance Savings Plans
- **Spot Instances**: Up to 90% savings, suitable for fault-tolerant workloads
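
A commitment only pays off if the capacity is actually used. As a rough sketch (the function names and the 730-hour month are assumptions for illustration), a commitment with discount `d` beats On-Demand once utilization exceeds `1 - d`:

```python
def break_even_utilization(discount):
    """Fraction of the term an instance must run for a commitment to beat On-Demand.

    discount: committed-rate discount versus On-Demand, e.g. 0.40 for 40%.
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be in [0, 1)")
    return 1 - discount


def monthly_cost(on_demand_hourly, hours_used, discount=0.0, committed=False,
                 hours_in_month=730):
    """Committed capacity is billed for every hour; On-Demand only for hours used."""
    if committed:
        return on_demand_hourly * (1 - discount) * hours_in_month
    return on_demand_hourly * hours_used
```

For example, with a 40% discount the break-even point is 60% utilization: an instance that runs only half the month is cheaper On-Demand.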

### Auto Scaling

**Horizontal Scaling**
- Scale out during peak, scale in during off-peak
- Use target tracking policies (CPU, ALB requests, custom metrics)
- Set minimum instances for high availability, maximum for cost control
- Consider scheduled scaling for predictable patterns

**Mixed Instances Policy**
- Combine instance types for better Spot availability
- Mix Spot and On-Demand for reliability
- Example: 70% Spot, 30% On-Demand for fault-tolerant apps
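
The 70/30 split above maps onto the `MixedInstancesPolicy` parameter of the EC2 Auto Scaling `CreateAutoScalingGroup` API. The sketch below only builds the request structure; the launch template name and instance types are placeholders, and the allocation strategy is one reasonable choice rather than a requirement.

```python
def mixed_instances_policy(launch_template_name, instance_types,
                           on_demand_percent=30, on_demand_base=0):
    """Build a MixedInstancesPolicy dict for an Auto Scaling group.

    on_demand_percent=30 leaves 70% of capacity above the base on Spot.
    """
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,  # placeholder name
                "Version": "$Latest",
            },
            # Several similar types widen the Spot pools the group can draw from.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": on_demand_base,
            "OnDemandPercentageAboveBaseCapacity": on_demand_percent,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }


policy = mixed_instances_policy("web-template", ["m5.large", "m5a.large", "m6i.large"])
```

The resulting dict would be passed as `MixedInstancesPolicy=policy` to boto3's `create_auto_scaling_group`, alongside the group's size and subnet settings.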

### Lambda Optimization

**Memory Configuration**
- Memory allocation determines CPU allocation
- More memory = faster execution = potentially lower cost
- Test different memory settings to find cost/performance sweet spot
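
To make the memory trade-off concrete, the sketch below compares per-invocation cost across measured settings. The prices are assumed us-east-1 x86 list prices and the durations are made-up measurements; substitute current pricing and your own benchmarks.

```python
# Assumed list prices (us-east-1, x86); verify against current Lambda pricing.
PRICE_PER_GB_SECOND = 0.0000166667
PRICE_PER_REQUEST = 0.20 / 1_000_000


def invocation_cost(memory_mb, duration_ms):
    """Cost of one invocation: GB-seconds consumed plus the per-request fee."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * PRICE_PER_GB_SECOND + PRICE_PER_REQUEST


def cheapest_setting(measurements):
    """measurements: {memory_mb: measured average duration_ms}."""
    return min(measurements, key=lambda mb: invocation_cost(mb, measurements[mb]))


# Hypothetical benchmark: doubling memory from 128 MB more than halves duration.
measured = {128: 1000, 256: 400, 512: 300}
best = cheapest_setting(measured)
```

Here 256 MB wins because the speed-up outweighs the higher per-second rate, while 512 MB adds cost without a matching gain.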

**Cold Start Mitigation**
- Provisioned concurrency for critical functions (adds cost)
- Keep functions warm with scheduled invocations
- Minimize deployment package size
- Use Lambda layers for shared dependencies

**Execution Time**
- Optimize code to reduce execution duration
- Duration is billed per millisecond, so small reductions add up at scale
- Consider Graviton2 (arm64): duration is priced about 20% lower than x86

---

## Storage Optimization

### S3 Cost Optimization

**Storage Classes**
- **S3 Standard**: Frequently accessed data
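- **S3 Intelligent-Tiering**: Moves objects between access tiers automatically for a small per-object monitoring fee; ideal for unknown access patterns
- **S3 Standard-IA**: Infrequent access, roughly 45% cheaper storage than Standard, with per-GB retrieval fees
- **S3 One Zone-IA**: Non-critical, infrequent access, 20% cheaper than Standard-IA
- **S3 Glacier Instant Retrieval**: Archive with millisecond access, up to 68% cheaper than Standard-IA
- **S3 Glacier Flexible Retrieval**: Archive, retrieval in minutes to hours, roughly 84% cheaper storage than Standard
- **S3 Glacier Deep Archive**: Long-term archive, standard retrieval within 12 hours, roughly 95% cheaper storage than Standard

Percentages are approximate and based on us-east-1 list prices; the IA and Glacier classes also carry minimum storage durations and retrieval charges.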

**Lifecycle Policies**
- Automatically transition objects between storage classes
- Delete incomplete multipart uploads after 7 days
- Example policy:
  - 0-30 days: S3 Standard
  - 30-90 days: S3 Standard-IA
  - 90-365 days: S3 Glacier Flexible Retrieval
  - 365+ days: S3 Glacier Deep Archive or Delete
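
The example policy above can be expressed as an S3 lifecycle configuration. This sketch builds the structure that boto3's `put_bucket_lifecycle_configuration` accepts; the rule ID and the default empty prefix (whole bucket) are illustrative choices.

```python
def lifecycle_configuration(prefix=""):
    """Lifecycle rules matching the 30/90/365-day example policy."""
    return {
        "Rules": [
            {
                "ID": "tiered-archive",  # illustrative rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},  # empty prefix applies to every object
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},  # Glacier Flexible Retrieval
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # Clean up parts left behind by uploads that never completed.
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }


config = lifecycle_configuration()
```

It would be applied with `s3.put_bucket_lifecycle_configuration(Bucket=..., LifecycleConfiguration=config)`. To delete instead of archiving at 365 days, replace the last transition with an `"Expiration": {"Days": 365}` entry.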

**Request Optimization**
- Use CloudFront CDN to reduce S3 GET requests
- Batch operations instead of individual API calls
- Use S3 Select to retrieve subsets of data (existing users only; AWS closed S3 Select to new customers in 2024)
- Enable S3 Transfer Acceleration only when faster uploads justify its additional per-GB charge

**Cost Monitoring**
- Enable S3 Storage Lens for usage analytics
- Set up S3 Storage Class Analysis
- Monitor request costs (can exceed storage costs for small files)
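
The small-file effect is easy to see with arithmetic. The sketch below uses assumed S3 Standard list prices for us-east-1 and compares the one-time PUT cost of a batch of objects with a month of storing them.

```python
# Assumed S3 Standard list prices (us-east-1); verify against current pricing.
STORAGE_PER_GB_MONTH = 0.023
PUT_PER_1000_REQUESTS = 0.005


def upload_vs_storage_cost(object_count, object_size_bytes):
    """Compare the PUT request cost with one month of storage for the same objects."""
    storage_gb = object_count * object_size_bytes / 1024**3
    return {
        "put_requests": object_count / 1000 * PUT_PER_1000_REQUESTS,
        "storage_month": storage_gb * STORAGE_PER_GB_MONTH,
    }


# One million 1 KB objects: about $5 in PUT requests versus about $0.02 of storage.
costs = upload_vs_storage_cost(1_000_000, 1024)
```

Aggregating small records into larger objects before upload reduces the request count and usually the total bill.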

### EBS Optimization

**Volume Types**
- **gp3**: General purpose, 20% cheaper than gp2, configurable IOPS/throughput
- **gp2**: Legacy general purpose (migrate to gp3)
- **io2**: High performance, mission-critical (only if needed)
- **st1**: Throughput-optimized HDD for big data (cheaper for sequential access)
- **sc1**: Cold HDD for infrequent access (cheapest)
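
A quick comparison shows where the gp3 saving comes from. The rates below are assumed us-east-1 list prices; gp3 includes 3,000 IOPS and 125 MB/s, with anything above that billed separately.

```python
# Assumed us-east-1 list prices; verify against current EBS pricing.
GP2_PER_GB = 0.10
GP3_PER_GB = 0.08
GP3_PER_EXTRA_IOPS = 0.005   # per provisioned IOPS above 3,000
GP3_PER_EXTRA_MBPS = 0.04    # per MB/s of throughput above 125


def gp2_monthly(size_gb):
    return size_gb * GP2_PER_GB


def gp3_monthly(size_gb, iops=3000, throughput_mbps=125):
    extra_iops = max(0, iops - 3000)
    extra_mbps = max(0, throughput_mbps - 125)
    return (size_gb * GP3_PER_GB
            + extra_iops * GP3_PER_EXTRA_IOPS
            + extra_mbps * GP3_PER_EXTRA_MBPS)
```

A 500 GB volume drops from about $50 to about $40 per month. For large gp2 volumes, whose baseline is 3 IOPS per GB, provision matching IOPS on gp3 before comparing, since the extra IOPS narrow the saving.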

**Snapshot Management**
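- Delete old snapshots (they accumulate quickly)
- Use Amazon Data Lifecycle Manager to automate snapshot creation and retention
- Snapshots are incremental: deleting one frees only the blocks that no other snapshot references, so savings can be smaller than the snapshot's nominal size
- Account for storage and data transfer costs when copying snapshots across regions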

**Volume Cleanup**
- Delete unattached volumes
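- Right-size oversized volumes (EBS volumes cannot shrink in place; migrate data to a smaller volume)
- Use EBS Elastic Volumes to increase size or change type, IOPS, and throughput without downtime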

---

## Network Optimization

### Data Transfer Costs

**General Rules**
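- **Free**: Inbound from the internet; traffic within the same AZ over private IP addresses
- **Cheap**: Same-region traffic across AZs (billed per GB in each direction)
- **Expensive**: Cross-region, outbound to internet, CloudFront uploads to origin (transfer from AWS origins to CloudFront is free)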

**Optimization Strategies**
- Colocate resources in same AZ when possible (consider HA trade-offs)
- Use VPC endpoints for AWS service access (avoids NAT Gateway data processing charges)
- Implement caching with CloudFront, ElastiCache
- Compress data before transfer
- Use AWS PrivateLink instead of internet egress

### NAT Gateway Optimization

**Cost Structure**
- ~$32.85/month per NAT Gateway ($0.045/hour × 730 hours, us-east-1)
- Data processing charges: $0.045/GB, in addition to standard data transfer charges

**Alternatives**
- **VPC Endpoints**: Direct access to AWS services (S3, DynamoDB, etc.)
  - Interface endpoints: ~$7.30/month per endpoint per AZ ($0.01/hour) + $0.01/GB
  - Gateway endpoints: Free, available for S3 and DynamoDB only
- **NAT Instance**: Cheaper but requires management
- **Single NAT Gateway**: Use one instead of one per AZ (reduces HA and adds cross-AZ transfer charges)

**When to Use What**
- High traffic to AWS services → VPC Endpoints
- Low traffic, dev/test → Single NAT Gateway or NAT instance
- Production, HA required → NAT Gateway per AZ
- S3 access only → S3 Gateway Endpoint (free)
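
Whether an interface endpoint pays for itself depends on traffic volume. The sketch below assumes the us-east-1 rates quoted above and a 730-hour month, and assumes the NAT Gateway stays in place for other traffic, so only the per-GB difference counts as a saving.

```python
HOURS_PER_MONTH = 730


def nat_gateway_monthly(gb, gateways=1, hourly=0.045, per_gb=0.045):
    """Fixed hourly charge per gateway plus per-GB data processing."""
    return gateways * hourly * HOURS_PER_MONTH + gb * per_gb


def interface_endpoint_monthly(gb, azs=1, hourly=0.01, per_gb=0.01):
    """Fixed hourly charge per AZ plus per-GB data processing."""
    return azs * hourly * HOURS_PER_MONTH + gb * per_gb


def endpoint_break_even_gb(azs=1, nat_per_gb=0.045, endpoint_hourly=0.01,
                           endpoint_per_gb=0.01):
    """Monthly GB above which routing via an endpoint beats routing via NAT."""
    fixed = azs * endpoint_hourly * HOURS_PER_MONTH
    return fixed / (nat_per_gb - endpoint_per_gb)
```

With one AZ the break-even is roughly 209 GB per month to that service; with three AZs it is about three times that. Gateway endpoints for S3 and DynamoDB have no such threshold because they are free.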

### CloudFront Optimization

**Use Cases for Savings**
- Reduce S3 data transfer costs (S3-to-CloudFront transfer is free, and CloudFront egress is often cheaper than direct S3 egress)
- Cache frequently accessed content
- Regional edge caches for less popular content

**Configuration**
- Use appropriate price class (exclude expensive regions if not needed)
- Set proper TTL to maximize cache hit ratio
- Use compression (gzip, brotli)
- Monitor cache hit ratio and adjust

---

## Database Optimization

### RDS Cost Optimization

**Instance Sizing**
- Right-size based on CloudWatch metrics (CPU, memory, connections)
- Consider burstable instances (db.t3) for variable workloads
- Graviton instances (db.m6g, db.r6g) offer 20% savings

**Storage Optimization**
- Use gp3 instead of gp2 (20% cheaper)
- Enable storage autoscaling with upper limit
- Delete old manual snapshots (automated backups expire on their own once the retention period passes)
- Reduce backup retention period if possible

**High Availability Trade-offs**
- Multi-AZ roughly doubles instance and storage cost (usually warranted for production)
- Single-AZ acceptable for dev/test
- Read replicas for read scaling (cheaper than bigger instance)

**Aurora vs RDS**
- Aurora costs more but offers better scaling
- Aurora Serverless v2 for variable workloads
- Standard RDS for predictable workloads
- Open-source engines (PostgreSQL, MySQL) for dev/test, avoiding commercial license fees

### DynamoDB Optimization

**Capacity Modes**
- **On-Demand**: Pay per request; suits unpredictable or spiky traffic
- **Provisioned**: Cheaper for consistent traffic, requires capacity planning
- **Reserved Capacity**: 1- or 3-year commitment for provisioned capacity

**Table Design**
- Use single-table design to minimize costs
- Implement GSI/LSI carefully (they add cost)
- Enable point-in-time recovery only if needed
- Use TTL to auto-expire old data

**Read Optimization**
- Use eventually consistent reads (50% cheaper than strongly consistent)
- Implement caching (DAX or ElastiCache)
- Batch operations when possible
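
The 50% figure follows from how read capacity is counted: a strongly consistent read of an item up to 4 KB consumes one read capacity unit, and an eventually consistent read consumes half. A minimal sketch, treating 4 KB as 4096 bytes (an assumption of this example):

```python
import math


def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=False):
    """Estimate the RCUs needed for a steady read rate.

    Each read is rounded up to the next 4 KB block; eventually consistent
    reads cost half as much per block.
    """
    blocks = math.ceil(item_size_bytes / 4096)
    units_per_read = blocks if strongly_consistent else blocks / 2
    return math.ceil(units_per_read * reads_per_second)
```

For a 6,000-byte item read 100 times per second, this gives 200 RCUs with strong consistency and 100 RCUs with eventual consistency. Keeping items under a 4 KB boundary has the same halving effect.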

### ElastiCache Optimization

**Node Types**
- Graviton instances (cache.m6g, cache.r6g) for 20% savings
- Right-size based on memory usage and eviction rates

**Redis vs Memcached**
- Redis: More features, persistence, replication (more expensive)
- Memcached: Simpler, no persistence, multi-threaded (cheaper)

**Strategies**
- Reserved nodes for 30-55% savings
- Single-AZ for dev/test
- Monitor eviction rates to avoid over-provisioning

---

## Container & Serverless Optimization

### ECS/Fargate Optimization

**Compute Options**
- **EC2 Launch Type**: More control, cheaper for steady workloads
- **Fargate**: Serverless, easier management, better for variable loads
- **Fargate Spot**: Up to 70% savings for fault-tolerant tasks

**Graviton Support**
- Fargate ARM64 support available
- ECS on Graviton2 EC2 instances for 20% savings

**Right-sizing**
- Start with minimal CPU/memory, scale up based on metrics
- Use Container Insights for utilization data
- Consider task packing (multiple containers per task)

### EKS Optimization

**Control Plane**
- $73/month per cluster (consider consolidation)
- Use single cluster with namespaces when appropriate

**Worker Nodes**
- Use Spot instances for fault-tolerant pods (up to 90% savings)
- Managed node groups with Graviton instances
- Karpenter for intelligent autoscaling
- Mixed instance types for better Spot availability

**Cost Visibility**
- Kubecost or OpenCost for K8s cost attribution
- Resource requests/limits prevent waste
- Cluster autoscaler for automatic node scaling

---

## General Principles

### Tagging Strategy

**Cost Allocation Tags**
- Environment: prod, staging, dev, test
- Owner: team/person responsible
- Project: business initiative
- CostCenter: chargeback allocation
- Application: specific app name

**Tag Enforcement**
- Use AWS Organizations tag policies to standardize tag keys and values
- Service Control Policies to deny creation of untagged resources
- AWS Config rules (for example, the managed `required-tags` rule) for compliance
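
As one possible shape for the Service Control Policy approach, the sketch below builds a policy document that denies `ec2:RunInstances` when a required tag is missing from the request. The function name and the restriction to EC2 instances are assumptions for illustration; a real policy would likely cover more actions and resource types.

```python
def require_tag_scp(tag_key):
    """Build an SCP document that blocks EC2 launches lacking the given tag."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyRunInstancesWithoutTag",
                "Effect": "Deny",
                "Action": "ec2:RunInstances",
                "Resource": ["arn:aws:ec2:*:*:instance/*"],
                # The Null condition is true when the tag is absent from the request.
                "Condition": {"Null": {f"aws:RequestTag/{tag_key}": "true"}},
            }
        ],
    }


policy = require_tag_scp("CostCenter")
```

Serialized with `json.dumps(policy)`, the document can be attached to an organizational unit. Test it in a sandbox OU first, since a deny that is too broad blocks legitimate launches.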

### Monitoring and Governance

**Cost Monitoring Tools**
- AWS Cost Explorer: Historical analysis
- AWS Budgets: Proactive alerts
- Cost and Usage Reports: Detailed data export
- Cost Anomaly Detection: Automatic anomaly alerts

**Regular Reviews**
- Monthly cost review meetings
- Quarterly rightsizing exercises
- Annual Reserved Instance/Savings Plan optimization
- Automated reports to stakeholders

### Automation

**Infrastructure as Code**
- Define resource sizes in code (prevent oversizing)
- Automated cleanup of dev/test resources
- Scheduled shutdown of non-production resources
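
The payoff from scheduled shutdown is simple to estimate. This sketch assumes a weekday business-hours schedule (08:00 to 18:00, an illustrative choice) and shows both the decision a scheduler would make and the resulting saving on hourly-billed resources.

```python
from datetime import datetime

HOURS_PER_WEEK = 168


def should_be_running(now, start_hour=8, end_hour=18):
    """True during weekday business hours; weekday() is 0 for Monday."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour


def scheduled_savings_fraction(hours_per_day=10, days_per_week=5):
    """Share of hourly compute cost avoided by running only on schedule."""
    running_hours = hours_per_day * days_per_week
    return 1 - running_hours / HOURS_PER_WEEK
```

Running 10 hours a day, 5 days a week avoids about 70% of instance-hours. Attached storage such as EBS volumes keeps billing while instances are stopped.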

**Cost Optimization Tools**
- AWS Compute Optimizer: ML-based recommendations
- AWS Trusted Advisor: Best practice checks
- Third-party tools: CloudHealth, Cloudability, Spot.io

### Cultural Best Practices

**Engineering Ownership**
- Engineers should see cost impact of their changes
- Cost metrics in dashboards alongside performance
- Cost budgets for teams/projects

**Experiments and Cleanup**
- Tag experimental resources with expiration dates
- Automated cleanup of abandoned resources
- Regular audits of unused resources
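
An expiration tag makes cleanup mechanical. The sketch below assumes a hypothetical `ExpiresOn` tag holding an ISO date and a mapping of resource IDs to tag dicts, as an inventory script might produce; it only reports candidates and deletes nothing.

```python
from datetime import date


def expired_resources(resources, today, tag_key="ExpiresOn"):
    """Return IDs of resources whose expiration tag lies before today.

    resources: {resource_id: {tag_key: tag_value}}
    Untagged resources are skipped here and should be caught by tag enforcement.
    """
    expired = []
    for resource_id, tags in resources.items():
        value = tags.get(tag_key)
        if value is None:
            continue
        if date.fromisoformat(value) < today:
            expired.append(resource_id)
    return sorted(expired)


inventory = {
    "i-aaa": {"ExpiresOn": "2024-01-31", "Owner": "data-team"},
    "i-bbb": {"ExpiresOn": "2024-12-31"},
    "vol-ccc": {"Owner": "platform"},
}
stale = expired_resources(inventory, today=date(2024, 6, 1))
```

Reporting first and deleting in a separate, reviewed step keeps a mistyped date from removing something still in use.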

**Cost-Aware Architecture**
- Design for cost from the beginning
- Choose appropriate service tiers
- Implement auto-scaling and right-sizing from day one
- Consider serverless and managed services

---

## Quick Wins Checklist

- [ ] Delete unattached EBS volumes
- [ ] Delete old EBS snapshots
- [ ] Release unused Elastic IPs
- [ ] Stop or terminate idle EC2 instances
- [ ] Right-size oversized instances
- [ ] Convert gp2 to gp3 volumes
- [ ] Enable S3 Intelligent-Tiering
- [ ] Set up S3 lifecycle policies
- [ ] Replace NAT Gateways with VPC Endpoints where possible
- [ ] Migrate to Graviton instances
- [ ] Purchase Reserved Instances/Savings Plans for stable workloads
- [ ] Use Spot instances for fault-tolerant workloads
- [ ] Delete old RDS snapshots
- [ ] Enable DynamoDB auto scaling on provisioned-capacity tables
- [ ] Set up cost allocation tags
- [ ] Enable AWS Budgets alerts
- [ ] Schedule shutdown of dev/test resources