zhongwei/gh-ahmedasmar-devops-claude-skills-aws-cost-optimization

Zhongwei Li 9d4643f587 Initial commit

2025-11-29 17:51:09 +08:00

AWS Cost Optimization Best Practices

Comprehensive strategies for optimizing AWS costs across all major service categories.

Compute Optimization
Storage Optimization
Network Optimization
Database Optimization
Container & Serverless Optimization
General Principles

Compute Optimization

EC2 Instance Optimization

Right Instance Family

General Purpose (T3, M5, M6i): Web servers, small-medium databases, dev environments
Compute Optimized (C5, C6i, C6g): CPU-intensive workloads, batch processing, HPC
Memory Optimized (R5, R6i, R6g): Databases, in-memory caches, big data
Storage Optimized (I3, D2): High IOPS, data warehousing, Hadoop

Graviton Migration (ARM64)

Up to 20% cost savings with M6g, C6g, R6g, T4g instances
Test compatibility first: Most modern languages/frameworks support ARM64
Best for: Stateless applications, containerized workloads, open-source software

Instance Sizing

Start small and scale up based on metrics
Monitor CPU, memory, network for 2+ weeks before committing
Use CloudWatch metrics to identify underutilized instances
Consider burstable instances (T3) for variable workloads
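
A minimal sketch of how CPU samples pulled from CloudWatch might be screened for underutilization. The thresholds (10% average, 40% peak) and the sample data are illustrative assumptions, not AWS recommendations, and the function name is hypothetical.

```python
from statistics import mean

def is_underutilized(cpu_percent_samples, avg_threshold=10.0, peak_threshold=40.0):
    """Return True when both average and peak CPU stay below the thresholds."""
    if not cpu_percent_samples:
        return False  # no data: make no recommendation
    return (mean(cpu_percent_samples) < avg_threshold
            and max(cpu_percent_samples) < peak_threshold)

# Hypothetical hourly CPU averages (percent) for two instances
idle_box = [3.0, 4.5, 2.0, 6.0, 5.5]
busy_box = [35.0, 60.0, 42.0, 80.0, 55.0]

print("idle_box underutilized:", is_underutilized(idle_box))
print("busy_box underutilized:", is_underutilized(busy_box))
```

In practice the sample window should cover the 2+ weeks mentioned above, and memory and network metrics deserve the same check before downsizing.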

Purchase Options

On-Demand: Flexible, no commitment, highest cost
Reserved Instances: 1- or 3-year commitment, up to 72% savings
- Standard RI: Highest discount, instance family is fixed
- Convertible RI: Lower discount (up to 66%), can be exchanged for other instance types
Savings Plans: Commitment to an hourly compute spend, up to 72% savings (EC2 Instance Savings Plans) or up to 66% (Compute Savings Plans)
Spot Instances: Up to 90% savings, suitable for fault-tolerant workloads
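
The trade-off between On-Demand and a commitment can be sketched with simple arithmetic. The hourly price and the 40% discount below are hypothetical placeholders; real rates depend on instance type, region, term, and payment option.

```python
HOURS_PER_MONTH = 730  # average hours in a month

def monthly_cost(on_demand_hourly, discount=0.0, hours=HOURS_PER_MONTH):
    """Monthly cost at a given discount off the On-Demand hourly rate."""
    return round(on_demand_hourly * (1.0 - discount) * hours, 2)

def break_even_utilization(discount):
    """Fraction of the month an instance must run for a commitment to pay off.

    A commitment is billed for every hour, used or not, so it only beats
    On-Demand when actual usage exceeds (1 - discount) of the hours.
    """
    return 1.0 - discount

on_demand = monthly_cost(0.10)                  # hypothetical $0.10/hour
committed = monthly_cost(0.10, discount=0.40)   # hypothetical 40% discount

print("On-Demand:", on_demand)
print("Committed:", committed)
print("Break-even utilization:", break_even_utilization(0.40))
```

A workload that runs less than the break-even fraction of the month is usually better left on On-Demand or Spot.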

Auto Scaling

Horizontal Scaling

Scale out during peak, scale in during off-peak
Use target tracking policies (CPU, ALB requests, custom metrics)
Set minimum instances for high availability, maximum for cost control
Consider scheduled scaling for predictable patterns
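
As an illustration of a target tracking policy, the sketch below builds the keyword arguments that boto3's Auto Scaling `put_scaling_policy` call accepts. The group name, policy name, and 50% CPU target are assumptions chosen for the example.

```python
def cpu_target_tracking_policy(asg_name, target_cpu=50.0):
    """Build kwargs for a CPU-based target tracking scaling policy."""
    return {
        "AutoScalingGroupName": asg_name,
        "PolicyName": f"{asg_name}-cpu-target",  # hypothetical naming scheme
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingConfiguration": {
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "ASGAverageCPUUtilization",
            },
            "TargetValue": target_cpu,
        },
    }

policy_kwargs = cpu_target_tracking_policy("web-asg")  # "web-asg" is hypothetical
print(policy_kwargs)
```

With credentials configured, the dictionary could be applied with `boto3.client("autoscaling").put_scaling_policy(**policy_kwargs)`.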

Mixed Instances Policy

Combine instance types for better Spot availability
Mix Spot and On-Demand for reliability
Example: 70% Spot, 30% On-Demand for fault-tolerant apps
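
The 70/30 split above could be expressed as a `MixedInstancesPolicy` structure for boto3's `create_auto_scaling_group`. This is a sketch: the launch template name and instance types are hypothetical, and the allocation strategy shown is one reasonable choice rather than a requirement.

```python
def mixed_instances_policy(launch_template_name, instance_types, on_demand_percent=30):
    """Build a MixedInstancesPolicy with the given On-Demand percentage."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": launch_template_name,
                "Version": "$Latest",
            },
            # Several instance types widen the Spot pools the group can draw from
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 0,
            # 30% On-Demand leaves 70% of capacity on Spot
            "OnDemandPercentageAboveBaseCapacity": on_demand_percent,
            "SpotAllocationStrategy": "price-capacity-optimized",
        },
    }

policy = mixed_instances_policy("app-template", ["m5.large", "m5a.large", "m6i.large"])
print(policy["InstancesDistribution"])
```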

Lambda Optimization

Memory Configuration

Memory allocation determines CPU allocation
More memory = faster execution = potentially lower cost
Test different memory settings to find cost/performance sweet spot
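
To see why more memory can lower cost, consider the duration charge, which is memory (GB) multiplied by billed time (seconds). The sketch below compares hypothetical measurements; the per-GB-second price is an assumed x86 list price and the durations are invented for illustration. Per-request charges are ignored.

```python
PRICE_PER_GB_SECOND = 0.0000166667  # assumed x86 price; check current pricing

def invocation_cost(memory_mb, duration_ms, price=PRICE_PER_GB_SECOND):
    """Duration charge for one invocation."""
    gb_seconds = (memory_mb / 1024) * (duration_ms / 1000)
    return gb_seconds * price

def cheapest_setting(measurements):
    """measurements maps memory_mb -> measured average duration_ms."""
    return min(measurements, key=lambda mb: invocation_cost(mb, measurements[mb]))

# Hypothetical benchmark: duration drops as memory (and CPU share) grows
measured = {128: 2400, 512: 500, 1024: 280}

for mb, ms in measured.items():
    print(mb, "MB:", f"{invocation_cost(mb, ms):.10f}", "USD per invocation")
print("Cheapest:", cheapest_setting(measured), "MB")
```

Here 512 MB wins because the 4x memory increase cuts duration by more than 4x; past that point the extra memory no longer pays for itself.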

Cold Start Mitigation

Provisioned concurrency for critical functions (adds cost)
Keep functions warm with scheduled invocations
Minimize deployment package size
Use Lambda layers for shared dependencies

Execution Time

Optimize code to reduce execution duration
Lambda bills duration in 1 ms increments, so every millisecond saved matters at scale
Consider Graviton2 (arm64), which AWS prices about 20% lower per GB-second than x86

Storage Optimization

S3 Cost Optimization

Storage Classes

S3 Standard: Frequently accessed data
S3 Intelligent-Tiering: Auto-moves between tiers, ideal for unknown patterns
S3 Standard-IA: Infrequent access, 50% cheaper than Standard
S3 One Zone-IA: Non-critical, infrequent access, 20% cheaper than Standard-IA
S3 Glacier Instant Retrieval: Archive with millisecond access, up to 68% cheaper than Standard-IA
S3 Glacier Flexible Retrieval: Archive, retrieval in minutes to hours, roughly 84% cheaper than Standard
S3 Glacier Deep Archive: Long-term archive, retrieval within 12 hours, roughly 95% cheaper than Standard

Lifecycle Policies

Automatically transition objects between storage classes
Delete incomplete multipart uploads after 7 days
Example policy:
- 0-30 days: S3 Standard
- 30-90 days: S3 Standard-IA
- 90-365 days: S3 Glacier Flexible Retrieval
- 365+ days: S3 Glacier Deep Archive or Delete
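
The example policy above, together with the 7-day multipart cleanup, might look like the following lifecycle configuration for boto3's `put_bucket_lifecycle_configuration`. The rule ID and the empty prefix (whole bucket) are assumptions; `GLACIER` is the storage class identifier for Glacier Flexible Retrieval.

```python
def lifecycle_configuration(prefix=""):
    """Build a lifecycle configuration matching the tiering example."""
    return {
        "Rules": [
            {
                "ID": "tiering-example",        # hypothetical rule name
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},   # "" applies to every object
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    }

config = lifecycle_configuration()
print([t["StorageClass"] for t in config["Rules"][0]["Transitions"]])
```

It could then be applied with `boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="my-bucket", LifecycleConfiguration=config)`, where the bucket name is a placeholder.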

Request Optimization

Use CloudFront CDN to reduce S3 GET requests
Batch operations instead of individual API calls
Use S3 Select to retrieve subsets of data
Enable S3 Transfer Acceleration only when upload speed justifies it (it adds a per-GB charge)

Cost Monitoring

Enable S3 Storage Lens for usage analytics
Set up S3 Storage Class Analysis
Monitor request costs (can exceed storage costs for small files)
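
A quick calculation shows how request charges can dominate for small objects. The prices are assumed list prices (PUT at $0.005 per 1,000 requests, Standard storage at $0.023 per GB-month) and may differ by region or over time.

```python
def upload_vs_storage(object_count, object_size_kb,
                      put_price_per_1000=0.005,
                      storage_price_per_gb_month=0.023):
    """Return (one-time PUT cost, first-month storage cost) in USD."""
    put_cost = object_count / 1000 * put_price_per_1000
    storage_gb = object_count * object_size_kb / (1024 * 1024)
    storage_cost = storage_gb * storage_price_per_gb_month
    return put_cost, storage_cost

# One million 1 KB objects
put_cost, storage_cost = upload_vs_storage(1_000_000, 1)
print(f"PUT requests: ${put_cost:.2f}")
print(f"Storage (1 month): ${storage_cost:.4f}")
```

For this workload, writing the objects costs far more than storing them for a month, which is an argument for batching small records into larger objects.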

EBS Optimization

Volume Types

gp3: General purpose, 20% cheaper than gp2, configurable IOPS/throughput
gp2: Legacy general purpose (migrate to gp3)
io2: High performance, mission-critical (only if needed)
st1: Throughput-optimized HDD for big data (cheaper for sequential access)
sc1: Cold HDD for infrequent access (cheapest)
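
The gp2-to-gp3 saving can be estimated as follows. All prices are assumed us-east-1 list prices; the gp3 baseline of 3,000 IOPS and 125 MB/s is included in the per-GB price, and only provisioned performance above that baseline is charged extra.

```python
def gp2_monthly(size_gb, price_per_gb=0.10):
    """Monthly gp2 cost (performance scales with size, no separate charge)."""
    return size_gb * price_per_gb

def gp3_monthly(size_gb, iops=3000, throughput_mbps=125,
                price_per_gb=0.08, price_per_extra_iops=0.005,
                price_per_extra_mbps=0.04):
    """Monthly gp3 cost including any performance above the free baseline."""
    extra_iops = max(0, iops - 3000)
    extra_mbps = max(0, throughput_mbps - 125)
    return (size_gb * price_per_gb
            + extra_iops * price_per_extra_iops
            + extra_mbps * price_per_extra_mbps)

print("gp2 500 GB:", gp2_monthly(500))
print("gp3 500 GB, baseline:", gp3_monthly(500))
print("gp3 500 GB, 4000 IOPS:", gp3_monthly(500, iops=4000))
```

Volumes that need performance well above the baseline should be priced this way before migrating, since the extra IOPS and throughput charges narrow the gap.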

Snapshot Management

Delete old snapshots (they accumulate quickly)
Use Amazon Data Lifecycle Manager for automated snapshot creation and retention policies
Snapshots are incremental, so deleting one frees only the blocks that no other snapshot still references
Consider cross-region replication costs

Volume Cleanup

Delete unattached volumes
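Right-size oversized volumes (EBS volumes cannot be shrunk in place, so this means migrating data to a smaller volume)
Use EBS Elastic Volumes to change volume type, IOPS, throughput, or increase size without downtime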
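
Unattached volumes can be picked out of an EC2 `describe_volumes` response by their `available` state. The sketch below works on an already-fetched list so it runs without AWS access; the volume records are invented sample data.

```python
def unattached_volumes(volumes):
    """Return (volume IDs, total GiB) for volumes not attached to any instance."""
    idle = [v for v in volumes
            if v.get("State") == "available" and not v.get("Attachments")]
    return [v["VolumeId"] for v in idle], sum(v.get("Size", 0) for v in idle)

# Hypothetical records shaped like entries in describe_volumes()["Volumes"]
sample = [
    {"VolumeId": "vol-aaa", "State": "in-use", "Size": 100,
     "Attachments": [{"InstanceId": "i-123"}]},
    {"VolumeId": "vol-bbb", "State": "available", "Size": 250, "Attachments": []},
    {"VolumeId": "vol-ccc", "State": "available", "Size": 50, "Attachments": []},
]

ids, total_gib = unattached_volumes(sample)
print(ids, total_gib)
```

Taking a final snapshot before deleting is a common safeguard, since snapshot storage is cheaper than keeping the volume.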

Network Optimization

Data Transfer Costs

General Rules

Free: Inbound from internet, same-AZ traffic over private IP addresses
Cheap: Same-region traffic across AZs (billed per GB in each direction)
Expensive: Cross-region, outbound to internet (transfer from AWS origins to CloudFront is not charged)

Optimization Strategies

Colocate resources in same AZ when possible (consider HA trade-offs)
Use VPC endpoints for AWS service access (avoids NAT Gateway data processing charges)
Implement caching with CloudFront, ElastiCache
Compress data before transfer
Use AWS PrivateLink instead of internet egress

NAT Gateway Optimization

Cost Structure

~$32.85/month per NAT Gateway
Data processing charges: $0.045/GB

Alternatives

VPC Endpoints: Direct access to AWS services (S3, DynamoDB, etc.)
- Interface endpoints: ~$7.30/month per AZ + $0.01/GB
- Gateway endpoints: Free for S3 and DynamoDB
NAT Instance: Cheaper but requires management
Single NAT Gateway: Use one instead of one per AZ (reduces HA)
S3 Gateway Endpoint: Free alternative for S3 access

When to Use What

High traffic to AWS services → VPC Endpoints
Low traffic, dev/test → Single NAT Gateway or NAT instance
Production, HA required → NAT Gateway per AZ
S3 access only → S3 Gateway Endpoint (free)
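
The guidance above follows from the cost structure. A rough comparison, using the hourly and per-GB rates quoted in this section as assumed prices and 730 hours per month:

```python
HOURS_PER_MONTH = 730

def nat_gateway_monthly(gb_processed, gateways=1, hourly=0.045, per_gb=0.045):
    """Monthly NAT Gateway cost: fixed hourly charge plus data processing."""
    return gateways * hourly * HOURS_PER_MONTH + gb_processed * per_gb

def interface_endpoint_monthly(gb_processed, azs=1, hourly=0.01, per_gb=0.01):
    """Monthly interface endpoint cost: hourly charge per AZ plus data processing."""
    return azs * hourly * HOURS_PER_MONTH + gb_processed * per_gb

traffic_gb = 1000  # hypothetical monthly traffic to one AWS service
print(f"NAT Gateway:        ${nat_gateway_monthly(traffic_gb):.2f}")
print(f"Interface endpoint: ${interface_endpoint_monthly(traffic_gb):.2f}")
print("Gateway endpoint (S3/DynamoDB): $0.00")
```

If the NAT Gateway must stay for other internet-bound traffic, only the per-GB portion moves to the endpoint, so the fixed hourly charge should be left out of the saving.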

CloudFront Optimization

Use Cases for Savings

Reduce S3 data transfer costs (CloudFront egress is cheaper)
Cache frequently accessed content
Regional edge caches for less popular content

Configuration

Use appropriate price class (exclude expensive regions if not needed)
Set proper TTL to maximize cache hit ratio
Use compression (gzip, brotli)
Monitor cache hit ratio and adjust

Database Optimization

RDS Cost Optimization

Instance Sizing

Right-size based on CloudWatch metrics (CPU, memory, connections)
Consider burstable instances (db.t3) for variable workloads
Graviton instances (db.m6g, db.r6g) can cost up to 20% less than comparable x86 instances

Storage Optimization

Use gp3 instead of gp2 (20% cheaper)
Enable storage autoscaling with upper limit
Delete old manual snapshots (automated backups expire on their own at the end of the retention period)
Reduce backup retention period if possible

High Availability Trade-offs

Multi-AZ roughly doubles instance cost (reserve it for production workloads that need it)
Single-AZ acceptable for dev/test
Read replicas for read scaling (cheaper than bigger instance)

Aurora vs RDS

Aurora costs more but offers better scaling
Aurora Serverless v2 for variable workloads
Standard RDS for predictable workloads
RDS for PostgreSQL/MySQL (community engines) for dev/test

DynamoDB Optimization

Capacity Modes

On-Demand: Pay per request, unpredictable traffic
Provisioned: Cheaper for consistent traffic, requires capacity planning
Reserved Capacity: 1- or 3-year commitment for provisioned capacity

Table Design

Use single-table design to minimize costs
Implement GSI/LSI carefully (they add cost)
Enable point-in-time recovery only if needed
Use TTL to auto-expire old data

Read Optimization

Use eventually consistent reads (50% cheaper than strongly consistent)
Implement caching (DAX or ElastiCache)
Batch operations when possible
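
The 50% figure comes from how read capacity is counted: one read capacity unit covers one strongly consistent read per second of up to 4 KB, or two eventually consistent reads. A small sketch, with a hypothetical item size and read rate:

```python
import math

def read_capacity_units(item_size_bytes, reads_per_second, strongly_consistent=True):
    """Provisioned RCUs needed for a steady read rate."""
    units_per_read = math.ceil(item_size_bytes / 4096)  # rounded up to 4 KB blocks
    if not strongly_consistent:
        units_per_read /= 2  # eventually consistent reads cost half
    return math.ceil(units_per_read * reads_per_second)

# Hypothetical workload: 6 KB items read 100 times per second
strong = read_capacity_units(6 * 1024, 100, strongly_consistent=True)
eventual = read_capacity_units(6 * 1024, 100, strongly_consistent=False)
print("Strongly consistent:", strong, "RCU")
print("Eventually consistent:", eventual, "RCU")
```

The rounding to 4 KB blocks also means that trimming a 6 KB item to under 4 KB halves its read cost.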

ElastiCache Optimization

Node Types

Graviton instances (cache.m6g, cache.r6g) for up to 20% savings
Right-size based on memory usage and eviction rates

Redis vs Memcached

Redis: More features, persistence, replication (more expensive)
Memcached: Simpler, no persistence, multi-threaded (cheaper)

Strategies

Reserved nodes for 30-55% savings
Single-AZ for dev/test
Monitor eviction rates to avoid over-provisioning

Container & Serverless Optimization

ECS/Fargate Optimization

Compute Options

EC2 Launch Type: More control, cheaper for steady workloads
Fargate: Serverless, easier management, better for variable loads
Fargate Spot: Up to 70% savings for fault-tolerant tasks

Graviton Support

Fargate ARM64 support available
ECS on Graviton2 EC2 instances for 20% savings

Right-sizing

Start with minimal CPU/memory, scale up based on metrics
Use Container Insights for utilization data
Consider task packing (multiple containers per task)

EKS Optimization

Control Plane

$73/month per cluster (consider consolidation)
Use single cluster with namespaces when appropriate

Worker Nodes

Use Spot instances for fault-tolerant pods (up to 90% savings)
Managed node groups with Graviton instances
Karpenter for intelligent autoscaling
Mixed instance types for better Spot availability

Cost Visibility

Kubecost or OpenCost for K8s cost attribution
Resource requests/limits prevent waste
Cluster autoscaler for automatic node scaling

General Principles

Tagging Strategy

Cost Allocation Tags

Environment: prod, staging, dev, test
Owner: team/person responsible
Project: business initiative
CostCenter: chargeback allocation
Application: specific app name

Tag Enforcement

Use AWS Organizations tag policies to standardize tag keys and values
Use Service Control Policies to deny creation of untagged resources
Use AWS Config rules (such as required-tags) to detect noncompliant resources
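
A tag audit along these lines can also be scripted. The sketch below checks a resource's tags against the cost allocation keys listed above; the function name and the sample resource are hypothetical.

```python
REQUIRED_TAGS = ("Environment", "Owner", "Project", "CostCenter", "Application")
ALLOWED_ENVIRONMENTS = {"prod", "staging", "dev", "test"}

def tag_violations(tags):
    """Return a list of human-readable problems for one resource's tags."""
    problems = [f"missing tag: {key}" for key in REQUIRED_TAGS if not tags.get(key)]
    env = tags.get("Environment")
    if env and env not in ALLOWED_ENVIRONMENTS:
        problems.append(f"invalid Environment: {env}")
    return problems

# Hypothetical resource with an incomplete tag set
example_tags = {"Environment": "qa", "Owner": "platform-team", "Project": "checkout"}
print(tag_violations(example_tags))
```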

Monitoring and Governance

Cost Monitoring Tools

AWS Cost Explorer: Historical analysis
AWS Budgets: Proactive alerts
Cost and Usage Reports: Detailed data export
Cost Anomaly Detection: Automatic anomaly alerts

Regular Reviews

Monthly cost review meetings
Quarterly rightsizing exercises
Annual Reserved Instance/Savings Plan optimization
Automated reports to stakeholders

Automation

Infrastructure as Code

Define resource sizes in code (prevent oversizing)
Automated cleanup of dev/test resources
Scheduled shutdown of non-production resources
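
Scheduled shutdown usually reduces to a small rule evaluated by a scheduler. A minimal sketch, assuming weekday working hours of 08:00 to 19:00 and that only `prod` runs around the clock; both are illustrative choices.

```python
from datetime import datetime

def should_be_running(environment, now, start_hour=8, stop_hour=19):
    """Decide whether a resource should be up at the given local time."""
    if environment == "prod":
        return True  # production is never scheduled off
    is_weekday = now.weekday() < 5  # Monday=0 ... Friday=4
    return is_weekday and start_hour <= now.hour < stop_hour

def weekly_running_fraction(start_hour=8, stop_hour=19, days=5):
    """Share of the 168-hour week that a scheduled resource stays up."""
    return (stop_hour - start_hour) * days / 168

print(should_be_running("dev", datetime(2025, 1, 6, 10)))   # Monday morning
print(should_be_running("dev", datetime(2025, 1, 4, 10)))   # Saturday morning
print(f"Running {weekly_running_fraction():.0%} of the week")
```

Under these assumptions a dev resource runs 55 of 168 hours, so its compute charge falls by roughly two thirds; storage attached to stopped instances is still billed.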

Cost Optimization Tools

AWS Compute Optimizer: ML-based recommendations
AWS Trusted Advisor: Best practice checks
Third-party tools: CloudHealth, Cloudability, Spot.io

Cultural Best Practices

Engineering Ownership

Engineers should see cost impact of their changes
Cost metrics in dashboards alongside performance
Cost budgets for teams/projects

Experiments and Cleanup

Tag experimental resources with expiration dates
Automated cleanup of abandoned resources
Regular audits of unused resources
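
Expiration tags make cleanup easy to automate. The sketch below assumes a hypothetical `ExpiresOn` tag holding an ISO date (YYYY-MM-DD) and a simplified resource record; neither is an AWS convention.

```python
from datetime import date

def expired_resources(resources, today):
    """Return IDs of resources whose ExpiresOn tag is before today."""
    expired = []
    for resource in resources:
        raw = resource.get("Tags", {}).get("ExpiresOn")
        if raw is None:
            continue  # untagged resources are handled by the tagging audit
        try:
            expires = date.fromisoformat(raw)
        except ValueError:
            continue  # malformed date: skip rather than delete by mistake
        if expires < today:
            expired.append(resource["Id"])
    return expired

# Hypothetical inventory
inventory = [
    {"Id": "i-old", "Tags": {"ExpiresOn": "2025-01-15"}},
    {"Id": "i-new", "Tags": {"ExpiresOn": "2025-12-31"}},
    {"Id": "i-bad", "Tags": {"ExpiresOn": "next week"}},
    {"Id": "i-none", "Tags": {}},
]
print(expired_resources(inventory, today=date(2025, 6, 1)))
```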

Cost-Aware Architecture

Design for cost from the beginning
Choose appropriate service tiers
Implement auto-scaling and right-sizing from day one
Consider serverless and managed services

Quick Wins Checklist

Delete unattached EBS volumes
Delete old EBS snapshots
Release unused Elastic IPs
Stop or terminate idle EC2 instances
Right-size oversized instances
Convert gp2 to gp3 volumes
Enable S3 Intelligent-Tiering
Set up S3 lifecycle policies
Replace NAT Gateways with VPC Endpoints where possible
Migrate to Graviton instances
Purchase Reserved Instances/Savings Plans for stable workloads
Use Spot instances for fault-tolerant workloads
Delete old RDS snapshots
Enable DynamoDB auto-scaling
Set up cost allocation tags
Enable AWS Budgets alerts
Schedule shutdown of dev/test resources
