Initial commit
This commit is contained in:
362
references/best_practices.md
Normal file
362
references/best_practices.md
Normal file
@@ -0,0 +1,362 @@
|
||||
# AWS Cost Optimization Best Practices
|
||||
|
||||
Comprehensive strategies for optimizing AWS costs across all major service categories.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Compute Optimization](#compute-optimization)
|
||||
2. [Storage Optimization](#storage-optimization)
|
||||
3. [Network Optimization](#network-optimization)
|
||||
4. [Database Optimization](#database-optimization)
|
||||
5. [Container & Serverless Optimization](#container--serverless-optimization)
|
||||
6. [General Principles](#general-principles)
|
||||
|
||||
---
|
||||
|
||||
## Compute Optimization
|
||||
|
||||
### EC2 Instance Optimization
|
||||
|
||||
**Right Instance Family**
|
||||
- **General Purpose (T3, M5, M6i)**: Web servers, small-medium databases, dev environments
|
||||
- **Compute Optimized (C5, C6i, C6g)**: CPU-intensive workloads, batch processing, HPC
|
||||
- **Memory Optimized (R5, R6i, R6g)**: Databases, in-memory caches, big data
|
||||
- **Storage Optimized (I3, D2)**: High IOPS, data warehousing, Hadoop
|
||||
|
||||
**Graviton Migration (ARM64)**
|
||||
- Up to 20% cost savings with M6g, C6g, R6g, T4g instances
|
||||
- Test compatibility first: Most modern languages/frameworks support ARM64
|
||||
- Best for: Stateless applications, containerized workloads, open-source software
|
||||
|
||||
**Instance Sizing**
|
||||
- Start small and scale up based on metrics
|
||||
- Monitor CPU, memory, network for 2+ weeks before committing
|
||||
- Use CloudWatch metrics to identify underutilized instances
|
||||
- Consider burstable instances (T3) for variable workloads
|
||||
|
||||
**Purchase Options**
|
||||
- **On-Demand**: Flexible, no commitment, highest cost
|
||||
- **Reserved Instances**: 1-3 year commitment, up to 63% savings
|
||||
- Standard RI: Highest discount, no flexibility
|
||||
- Convertible RI: Moderate discount, can change instance types
|
||||
- **Savings Plans**: Flexible commitment to compute spend, up to 66% savings
|
||||
- **Spot Instances**: Up to 90% savings, suitable for fault-tolerant workloads
|
||||
|
||||
### Auto Scaling
|
||||
|
||||
**Horizontal Scaling**
|
||||
- Scale out during peak, scale in during off-peak
|
||||
- Use target tracking policies (CPU, ALB requests, custom metrics)
|
||||
- Set minimum instances for high availability, maximum for cost control
|
||||
- Consider scheduled scaling for predictable patterns
|
||||
|
||||
**Mixed Instances Policy**
|
||||
- Combine instance types for better Spot availability
|
||||
- Mix Spot and On-Demand for reliability
|
||||
- Example: 70% Spot, 30% On-Demand for fault-tolerant apps
|
||||
|
||||
### Lambda Optimization
|
||||
|
||||
**Memory Configuration**
|
||||
- Memory allocation determines CPU allocation
|
||||
- More memory = faster execution = potentially lower cost
|
||||
- Test different memory settings to find cost/performance sweet spot
|
||||
|
||||
**Cold Start Mitigation**
|
||||
- Provisioned concurrency for critical functions (adds cost)
|
||||
- Keep functions warm with scheduled invocations
|
||||
- Minimize deployment package size
|
||||
- Use Lambda layers for shared dependencies
|
||||
|
||||
**Execution Time**
|
||||
- Optimize code to reduce execution duration
|
||||
- Every 100ms of execution matters at scale
|
||||
- Consider Graviton2 (arm64) for 20% better price/performance
|
||||
|
||||
---
|
||||
|
||||
## Storage Optimization
|
||||
|
||||
### S3 Cost Optimization
|
||||
|
||||
**Storage Classes**
|
||||
- **S3 Standard**: Frequently accessed data
|
||||
- **S3 Intelligent-Tiering**: Auto-moves between tiers, ideal for unknown patterns
|
||||
- **S3 Standard-IA**: Infrequent access, 50% cheaper than Standard
|
||||
- **S3 One Zone-IA**: Non-critical, infrequent access, 20% cheaper than Standard-IA
|
||||
- **S3 Glacier Instant Retrieval**: Archive with instant access, 68% cheaper
|
||||
- **S3 Glacier Flexible Retrieval**: Archive, retrieval in minutes-hours, 77% cheaper
|
||||
- **S3 Glacier Deep Archive**: Long-term archive, retrieval in 12 hours, 83% cheaper
|
||||
|
||||
**Lifecycle Policies**
|
||||
- Automatically transition objects between storage classes
|
||||
- Delete incomplete multipart uploads after 7 days
|
||||
- Example policy:
|
||||
- 0-30 days: S3 Standard
|
||||
- 30-90 days: S3 Standard-IA
|
||||
- 90-365 days: S3 Glacier Flexible Retrieval
|
||||
- 365+ days: S3 Glacier Deep Archive or Delete
|
||||
|
||||
**Request Optimization**
|
||||
- Use CloudFront CDN to reduce S3 GET requests
|
||||
- Batch operations instead of individual API calls
|
||||
- Use S3 Select to retrieve subsets of data
|
||||
- Enable S3 Transfer Acceleration for faster uploads (if needed)
|
||||
|
||||
**Cost Monitoring**
|
||||
- Enable S3 Storage Lens for usage analytics
|
||||
- Set up S3 Storage Class Analysis
|
||||
- Monitor request costs (can exceed storage costs for small files)
|
||||
|
||||
### EBS Optimization
|
||||
|
||||
**Volume Types**
|
||||
- **gp3**: General purpose, 20% cheaper than gp2, configurable IOPS/throughput
|
||||
- **gp2**: Legacy general purpose (migrate to gp3)
|
||||
- **io2**: High performance, mission-critical (only if needed)
|
||||
- **st1**: Throughput-optimized HDD for big data (cheaper for sequential access)
|
||||
- **sc1**: Cold HDD for infrequent access (cheapest)
|
||||
|
||||
**Snapshot Management**
|
||||
- Delete old snapshots (they accumulate quickly)
|
||||
- Use Lifecycle Manager for automated snapshot policies
|
||||
- Snapshots are incremental but deletion is complex (use Data Lifecycle Manager)
|
||||
- Consider cross-region replication costs
|
||||
|
||||
**Volume Cleanup**
|
||||
- Delete unattached volumes
|
||||
- Right-size oversized volumes
|
||||
- Consider EBS Elastic Volumes to modify without downtime
|
||||
|
||||
---
|
||||
|
||||
## Network Optimization
|
||||
|
||||
### Data Transfer Costs
|
||||
|
||||
**General Rules**
|
||||
- **Free**: Inbound from internet, same-AZ traffic (same subnet)
|
||||
- **Cheap**: Same-region traffic across AZs
|
||||
- **Expensive**: Cross-region, outbound to internet, CloudFront to origin
|
||||
|
||||
**Optimization Strategies**
|
||||
- Colocate resources in same AZ when possible (consider HA trade-offs)
|
||||
- Use VPC endpoints for AWS service access (avoids NAT/IGW costs)
|
||||
- Implement caching with CloudFront, ElastiCache
|
||||
- Compress data before transfer
|
||||
- Use AWS PrivateLink instead of internet egress
|
||||
|
||||
### NAT Gateway Optimization
|
||||
|
||||
**Cost Structure**
|
||||
- ~$32.85/month per NAT Gateway
|
||||
- Data processing charges: $0.045/GB
|
||||
|
||||
**Alternatives**
|
||||
- **VPC Endpoints**: Direct access to AWS services (S3, DynamoDB, etc.)
|
||||
- Interface endpoints: $7.20/month + $0.01/GB
|
||||
- Gateway endpoints: Free for S3 and DynamoDB
|
||||
- **NAT Instance**: Cheaper but requires management
|
||||
- **Single NAT Gateway**: Use one instead of one per AZ (reduces HA)
|
||||
- **S3 Gateway Endpoint**: Free alternative for S3 access
|
||||
|
||||
**When to Use What**
|
||||
- High traffic to AWS services → VPC Endpoints
|
||||
- Low traffic, dev/test → Single NAT Gateway or NAT instance
|
||||
- Production, HA required → NAT Gateway per AZ
|
||||
- S3 access only → S3 Gateway Endpoint (free)
|
||||
|
||||
### CloudFront Optimization
|
||||
|
||||
**Use Cases for Savings**
|
||||
- Reduce S3 data transfer costs (CloudFront egress is cheaper)
|
||||
- Cache frequently accessed content
|
||||
- Regional edge caches for less popular content
|
||||
|
||||
**Configuration**
|
||||
- Use appropriate price class (exclude expensive regions if not needed)
|
||||
- Set proper TTL to maximize cache hit ratio
|
||||
- Use compression (gzip, brotli)
|
||||
- Monitor cache hit ratio and adjust
|
||||
|
||||
---
|
||||
|
||||
## Database Optimization
|
||||
|
||||
### RDS Cost Optimization
|
||||
|
||||
**Instance Sizing**
|
||||
- Right-size based on CloudWatch metrics (CPU, memory, connections)
|
||||
- Consider burstable instances (db.t3) for variable workloads
|
||||
- Graviton instances (db.m6g, db.r6g) offer 20% savings
|
||||
|
||||
**Storage Optimization**
|
||||
- Use gp3 instead of gp2 (20% cheaper)
|
||||
- Enable storage autoscaling with upper limit
|
||||
- Delete old automated backups
|
||||
- Reduce backup retention period if possible
|
||||
|
||||
**High Availability Trade-offs**
|
||||
- Multi-AZ doubles cost (needed for production)
|
||||
- Single-AZ acceptable for dev/test
|
||||
- Read replicas for read scaling (cheaper than bigger instance)
|
||||
|
||||
**Aurora vs RDS**
|
||||
- Aurora costs more but offers better scaling
|
||||
- Aurora Serverless v2 for variable workloads
|
||||
- Standard RDS for predictable workloads
|
||||
- PostgreSQL/MySQL community for dev/test
|
||||
|
||||
### DynamoDB Optimization
|
||||
|
||||
**Capacity Modes**
|
||||
- **On-Demand**: Pay per request, unpredictable traffic
|
||||
- **Provisioned**: Cheaper for consistent traffic, requires capacity planning
|
||||
- **Reserved Capacity**: 1-3 year commitment for provisioned capacity
|
||||
|
||||
**Table Design**
|
||||
- Use single-table design to minimize costs
|
||||
- Implement GSI/LSI carefully (they add cost)
|
||||
- Enable point-in-time recovery only if needed
|
||||
- Use TTL to auto-expire old data
|
||||
|
||||
**Read Optimization**
|
||||
- Use eventually consistent reads (50% cheaper than strongly consistent)
|
||||
- Implement caching (DAX or ElastiCache)
|
||||
- Batch operations when possible
|
||||
|
||||
### ElastiCache Optimization
|
||||
|
||||
**Node Types**
|
||||
- Graviton instances (cache.m6g, cache.r6g) for 20% savings
|
||||
- Right-size based on memory usage and eviction rates
|
||||
|
||||
**Redis vs Memcached**
|
||||
- Redis: More features, persistence, replication (more expensive)
|
||||
- Memcached: Simpler, no persistence, multi-threaded (cheaper)
|
||||
|
||||
**Strategies**
|
||||
- Reserved nodes for 30-55% savings
|
||||
- Single-AZ for dev/test
|
||||
- Monitor eviction rates to avoid over-provisioning
|
||||
|
||||
---
|
||||
|
||||
## Container & Serverless Optimization
|
||||
|
||||
### ECS/Fargate Optimization
|
||||
|
||||
**Compute Options**
|
||||
- **EC2 Launch Type**: More control, cheaper for steady workloads
|
||||
- **Fargate**: Serverless, easier management, better for variable loads
|
||||
- **Fargate Spot**: Up to 70% savings for fault-tolerant tasks
|
||||
|
||||
**Graviton Support**
|
||||
- Fargate ARM64 support available
|
||||
- ECS on Graviton2 EC2 instances for 20% savings
|
||||
|
||||
**Right-sizing**
|
||||
- Start with minimal CPU/memory, scale up based on metrics
|
||||
- Use Container Insights for utilization data
|
||||
- Consider task packing (multiple containers per task)
|
||||
|
||||
### EKS Optimization
|
||||
|
||||
**Control Plane**
|
||||
- $73/month per cluster (consider consolidation)
|
||||
- Use single cluster with namespaces when appropriate
|
||||
|
||||
**Worker Nodes**
|
||||
- Use Spot instances for fault-tolerant pods (up to 90% savings)
|
||||
- Managed node groups with Graviton instances
|
||||
- Karpenter for intelligent autoscaling
|
||||
- Mixed instance types for better Spot availability
|
||||
|
||||
**Cost Visibility**
|
||||
- Kubecost or OpenCost for K8s cost attribution
|
||||
- Resource requests/limits prevent waste
|
||||
- Cluster autoscaler for automatic node scaling
|
||||
|
||||
---
|
||||
|
||||
## General Principles
|
||||
|
||||
### Tagging Strategy
|
||||
|
||||
**Cost Allocation Tags**
|
||||
- Environment: prod, staging, dev, test
|
||||
- Owner: team/person responsible
|
||||
- Project: business initiative
|
||||
- CostCenter: chargeback allocation
|
||||
- Application: specific app name
|
||||
|
||||
**Tag Enforcement**
|
||||
- Use AWS Organizations policies to enforce tagging
|
||||
- Service Control Policies to prevent untagged resources
|
||||
- AWS Config rules for compliance
|
||||
|
||||
### Monitoring and Governance
|
||||
|
||||
**Cost Monitoring Tools**
|
||||
- AWS Cost Explorer: Historical analysis
|
||||
- AWS Budgets: Proactive alerts
|
||||
- Cost and Usage Reports: Detailed data export
|
||||
- Cost Anomaly Detection: Automatic anomaly alerts
|
||||
|
||||
**Regular Reviews**
|
||||
- Monthly cost review meetings
|
||||
- Quarterly rightsizing exercises
|
||||
- Annual Reserved Instance/Savings Plan optimization
|
||||
- Automated reports to stakeholders
|
||||
|
||||
### Automation
|
||||
|
||||
**Infrastructure as Code**
|
||||
- Define resource sizes in code (prevent oversizing)
|
||||
- Automated cleanup of dev/test resources
|
||||
- Scheduled shutdown of non-production resources
|
||||
|
||||
**Cost Optimization Tools**
|
||||
- AWS Compute Optimizer: ML-based recommendations
|
||||
- AWS Trusted Advisor: Best practice checks
|
||||
- Third-party tools: CloudHealth, Cloudability, Spot.io
|
||||
|
||||
### Cultural Best Practices
|
||||
|
||||
**Engineering Ownership**
|
||||
- Engineers should see cost impact of their changes
|
||||
- Cost metrics in dashboards alongside performance
|
||||
- Cost budgets for teams/projects
|
||||
|
||||
**Experiments and Cleanup**
|
||||
- Tag experimental resources with expiration dates
|
||||
- Automated cleanup of abandoned resources
|
||||
- Regular audits of unused resources
|
||||
|
||||
**Cost-Aware Architecture**
|
||||
- Design for cost from the beginning
|
||||
- Choose appropriate service tiers
|
||||
- Implement auto-scaling and right-sizing from day one
|
||||
- Consider serverless and managed services
|
||||
|
||||
---
|
||||
|
||||
## Quick Wins Checklist
|
||||
|
||||
- [ ] Delete unattached EBS volumes
|
||||
- [ ] Delete old EBS snapshots
|
||||
- [ ] Release unused Elastic IPs
|
||||
- [ ] Stop or terminate idle EC2 instances
|
||||
- [ ] Right-size oversized instances
|
||||
- [ ] Convert gp2 to gp3 volumes
|
||||
- [ ] Enable S3 Intelligent-Tiering
|
||||
- [ ] Set up S3 lifecycle policies
|
||||
- [ ] Replace NAT Gateways with VPC Endpoints where possible
|
||||
- [ ] Migrate to Graviton instances
|
||||
- [ ] Purchase Reserved Instances/Savings Plans for stable workloads
|
||||
- [ ] Use Spot instances for fault-tolerant workloads
|
||||
- [ ] Delete old RDS snapshots
|
||||
- [ ] Enable DynamoDB auto-scaling
|
||||
- [ ] Set up cost allocation tags
|
||||
- [ ] Enable AWS Budgets alerts
|
||||
- [ ] Schedule shutdown of dev/test resources
|
||||
Reference in New Issue
Block a user