AWS Cost & Operations Patterns
Comprehensive patterns and best practices for AWS cost optimization, monitoring, and operational excellence.
Table of Contents
- Cost Optimization Patterns
- Monitoring Patterns
- Observability Patterns
- Security and Audit Patterns
- Troubleshooting Workflows
Cost Optimization Patterns
Pattern 1: Cost Estimation Before Deployment
When: Before deploying any new infrastructure
MCP Server: AWS Pricing MCP
Steps:
- List all resources to be deployed
- Query pricing for each resource type
- Calculate monthly costs based on expected usage
- Compare pricing across regions
- Document cost estimates in architecture docs
Example:
Resource: Lambda Function
- Invocations: 1,000,000/month
- Duration: 3 seconds avg
- Memory: 512 MB
- Region: us-east-1
Estimated cost: $X/month
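The arithmetic behind an estimate like the one above can be sketched in a few lines. This is a minimal sketch, not a substitute for a pricing query: the two rates are assumed on-demand x86 prices for us-east-1 and should be confirmed through the AWS Pricing MCP, and the free tier is ignored. The function name is illustrative.

```python
# Assumed on-demand x86 rates for us-east-1; confirm against current AWS pricing.
PRICE_PER_MILLION_REQUESTS = 0.20      # USD per 1M invocations (assumption)
PRICE_PER_GB_SECOND = 0.0000166667     # USD per GB-second of compute (assumption)


def estimate_lambda_monthly_cost(invocations, avg_duration_s, memory_mb,
                                 price_per_million_requests=PRICE_PER_MILLION_REQUESTS,
                                 price_per_gb_second=PRICE_PER_GB_SECOND):
    """Return (request_cost, compute_cost, total) in USD, ignoring the free tier."""
    request_cost = invocations / 1_000_000 * price_per_million_requests
    # Compute is billed on memory (in GB) multiplied by execution time (in seconds).
    gb_seconds = invocations * avg_duration_s * (memory_mb / 1024)
    compute_cost = gb_seconds * price_per_gb_second
    return request_cost, compute_cost, request_cost + compute_cost


request_cost, compute_cost, total = estimate_lambda_monthly_cost(1_000_000, 3, 512)
print(f"requests ${request_cost:.2f} + compute ${compute_cost:.2f} = ${total:.2f}/month")
```

Passing the rates as parameters makes it easy to rerun the same estimate for another region once its prices have been looked up.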
Pattern 2: Monthly Cost Review
When: First week of every month
MCP Servers: Cost Explorer MCP, Billing and Cost Management MCP
Steps:
- Review total spending vs. budget
- Analyze cost by service (top 5 services)
- Identify cost anomalies (>20% month-over-month increase)
- Review cost by environment (dev/staging/prod)
- Check cost allocation tag coverage
- Generate cost optimization recommendations
Key Metrics:
- Month-over-month cost change
- Cost per environment
- Cost per application/project
- Untagged resource costs
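As a rough illustration of the anomaly step, the sketch below compares two months of per-service costs and flags increases above 20%. The input dictionaries stand in for data exported from Cost Explorer; their shape, the service labels, and the function name are assumptions made for this example.

```python
def find_cost_anomalies(previous, current, threshold=0.20):
    """Flag services whose cost rose by more than `threshold` month over month.

    `previous` and `current` map a service name to its monthly cost in USD.
    A service with no cost last month is always flagged (change is None).
    """
    anomalies = []
    for service, cost in current.items():
        before = previous.get(service, 0.0)
        if before == 0.0:
            if cost > 0.0:
                anomalies.append((service, before, cost, None))
            continue
        change = (cost - before) / before
        if change > threshold:
            anomalies.append((service, before, cost, change))
    # Largest absolute increase first, so the biggest dollar impact leads the review.
    return sorted(anomalies, key=lambda a: a[2] - a[1], reverse=True)


last_month = {"EC2": 1000.0, "S3": 200.0, "Lambda": 50.0}
this_month = {"EC2": 1100.0, "S3": 300.0, "Lambda": 55.0, "RDS": 400.0}
for service, before, after, change in find_cost_anomalies(last_month, this_month):
    label = "new" if change is None else f"+{change:.0%}"
    print(f"{service}: ${before:.2f} -> ${after:.2f} ({label})")
```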
Pattern 3: Right-Sizing Resources
When: Quarterly or when utilization alerts trigger
MCP Servers: CloudWatch MCP, Cost Explorer MCP
Steps:
- Query CloudWatch for resource utilization metrics
- Identify over-provisioned resources (< 40% utilization)
- Identify under-provisioned resources (> 80% utilization)
- Calculate potential savings from right-sizing
- Plan and execute right-sizing changes
- Monitor post-change performance
Common Right-Sizing Scenarios:
- EC2 instances with low CPU utilization
- RDS instances with excess capacity
- DynamoDB tables with low read/write usage
- Lambda functions with excessive memory allocation
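The 40% and 80% thresholds above can be applied to utilization samples with a small helper. This is a simplified sketch that looks only at the average; the function name is illustrative, and a real review would probably also consider peak utilization before downsizing.

```python
from statistics import mean


def classify_utilization(samples, low=40.0, high=80.0):
    """Classify a resource from utilization samples given in percent."""
    if not samples:
        raise ValueError("at least one utilization sample is required")
    average = mean(samples)
    if average < low:
        return "over-provisioned"
    if average > high:
        return "under-provisioned"
    return "right-sized"


print(classify_utilization([12.0, 18.5, 25.0]))   # average well below 40%
print(classify_utilization([82.0, 91.0, 88.0]))   # average above 80%
```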
Pattern 4: Unused Resource Cleanup
When: Monthly or triggered by cost anomalies
MCP Servers: Cost Explorer MCP, CloudTrail MCP
Steps:
- Identify resources with zero usage
- Query CloudTrail for last access time
- Tag resources for deletion review
- Notify resource owners
- Delete confirmed unused resources
- Track cost savings
Common Unused Resources:
- Unattached EBS volumes
- Old EBS snapshots
- Idle Load Balancers
- Unused Elastic IPs
- Old AMIs and their backing snapshots
- Stopped EC2 instances (long-term)
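For the first item in this list, a minimal sketch of the identification step might look like the following. It assumes volume records shaped like the `Volumes` entries returned by the EC2 `DescribeVolumes` API (`VolumeId`, `State`, `Attachments`, `CreateTime`); the 30-day age cutoff is an arbitrary assumption, and creation time is only a proxy for how long a volume has been idle.

```python
from datetime import datetime, timedelta, timezone


def unattached_volumes(volumes, min_age_days=30, now=None):
    """Return IDs of volumes that are unattached and older than `min_age_days`."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=min_age_days)
    candidates = []
    for volume in volumes:
        # An unattached volume reports state "available" and has no attachments.
        unattached = volume["State"] == "available" and not volume.get("Attachments")
        if unattached and volume["CreateTime"] <= cutoff:
            candidates.append(volume["VolumeId"])
    return candidates
```

The result is a list of candidates for the tagging and owner-notification steps, not a deletion list.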
Monitoring Patterns
Pattern 1: Critical Service Monitoring
When: All production services
MCP Server: CloudWatch MCP
Metrics to Monitor:
- Availability: Service uptime, health checks
- Performance: Latency, response time
- Errors: Error rate, failed requests
- Saturation: CPU, memory, disk, network utilization
Alarm Thresholds (adjust based on SLAs):
- Error rate: > 1% for 2 consecutive periods
- Latency: p99 > 1 second for 5 minutes
- CPU: > 80% for 10 minutes
- Memory: > 85% for 5 minutes
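A threshold such as "CPU > 80% for 10 minutes" maps onto a CloudWatch alarm as a period length times a number of evaluation periods. The sketch below only builds the parameter dictionary for an EC2 instance; the alarm name and helper name are illustrative, and the 5-minute period is an assumption.

```python
def cpu_alarm_params(instance_id, threshold=80.0, minutes=10, period_s=300):
    """Build keyword arguments for a CloudWatch CPU alarm on one EC2 instance."""
    if (minutes * 60) % period_s != 0:
        raise ValueError("the alarm window must be a whole number of periods")
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": period_s,
        # 10 minutes at a 300-second period means 2 consecutive breaching periods.
        "EvaluationPeriods": (minutes * 60) // period_s,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "missing",
    }


params = cpu_alarm_params("i-0123456789abcdef0")
print(params["EvaluationPeriods"], "periods of", params["Period"], "seconds")
```

With boto3 installed and credentials configured, the dictionary could be passed as `boto3.client("cloudwatch").put_metric_alarm(**params)`.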
Pattern 2: Lambda Function Monitoring
MCP Server: CloudWatch MCP
Key Metrics:
- Invocations (Count)
- Errors (Count, %)
- Duration (Average, p99)
- Throttles (Count)
- ConcurrentExecutions (Max)
- IteratorAge (for stream processing)
Recommended Alarms:
- Error rate > 1%
- Duration > 80% of timeout
- Throttles > 0
- ConcurrentExecutions > 80% of reserved concurrency (or of the account concurrency limit if none is reserved)
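Two of these alarms are relative to the function's own configuration, so their thresholds have to be derived. A small sketch, assuming the timeout is configured in seconds while the Duration metric is reported in milliseconds; the helper names are illustrative.

```python
def lambda_alarm_thresholds(timeout_s, reserved_concurrency, ratio=0.80):
    """Derive absolute alarm thresholds from a function's configuration."""
    return {
        # Timeout is set in seconds; the Duration metric is in milliseconds.
        "duration_ms": timeout_s * 1000 * ratio,
        "concurrent_executions": reserved_concurrency * ratio,
    }


def error_rate_percent(errors, invocations):
    """Error rate as a percentage; zero invocations counts as a 0% error rate."""
    if invocations == 0:
        return 0.0
    return 100.0 * errors / invocations


thresholds = lambda_alarm_thresholds(timeout_s=30, reserved_concurrency=100)
print(thresholds)
print(error_rate_percent(errors=12, invocations=1000) > 1.0)
```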
Pattern 3: API Gateway Monitoring
MCP Server: CloudWatch MCP
Key Metrics:
- Count (Total requests)
- 4XXError, 5XXError
- Latency (p50, p95, p99)
- IntegrationLatency
- CacheHitCount, CacheMissCount
Recommended Alarms:
- 5XX error rate > 0.5%
- 4XX error rate > 5%
- Latency p99 > 2 seconds
- Integration latency spike
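Error-rate alarms need a ratio of two metrics, which CloudWatch expresses through metric math. The following is a sketch of the alarm parameters for the 5XX rate, assuming a REST API identified by the `ApiName` dimension; the alarm name, metric IDs, and helper name are illustrative.

```python
def api_5xx_rate_alarm_params(api_name, threshold_percent=0.5,
                              period_s=300, evaluation_periods=1):
    """Build put_metric_alarm keyword arguments for a 5XX error-rate alarm."""

    def metric(metric_id, metric_name):
        # Raw input metrics are summed per period and not alarmed on directly.
        return {
            "Id": metric_id,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/ApiGateway",
                    "MetricName": metric_name,
                    "Dimensions": [{"Name": "ApiName", "Value": api_name}],
                },
                "Period": period_s,
                "Stat": "Sum",
            },
            "ReturnData": False,
        }

    return {
        "AlarmName": f"{api_name}-5xx-rate",
        "Metrics": [
            # Only the expression returns data, so it is what the alarm evaluates.
            {"Id": "rate", "Expression": "100 * errors / requests",
             "Label": "5XX error rate (%)", "ReturnData": True},
            metric("errors", "5XXError"),
            metric("requests", "Count"),
        ],
        "Threshold": threshold_percent,
        "ComparisonOperator": "GreaterThanThreshold",
        "EvaluationPeriods": evaluation_periods,
        "TreatMissingData": "notBreaching",
    }
```

Alarming on the rate rather than the raw 5XXError count keeps the threshold meaningful as traffic grows.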
Pattern 4: Database Monitoring
MCP Server: CloudWatch MCP
RDS Metrics:
- CPUUtilization
- DatabaseConnections
- FreeableMemory
- ReadLatency, WriteLatency
- ReadIOPS, WriteIOPS
- FreeStorageSpace
DynamoDB Metrics:
- ConsumedReadCapacityUnits
- ConsumedWriteCapacityUnits
- UserErrors
- SystemErrors
- ThrottledRequests
Recommended Alarms:
- RDS CPU > 80% for 10 minutes
- RDS connections > 80% of max
- RDS free storage < 10 GB
- DynamoDB throttled requests > 0
- DynamoDB user errors spike
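One detail that is easy to miss when turning the RDS alarms into thresholds: CloudWatch reports `FreeStorageSpace` in bytes, so "10 GB" has to be converted. A minimal sketch, assuming 10 GB is read as 10 GiB and that `max_connections` is known for the instance class; the helper name is illustrative.

```python
GIB = 1024 ** 3


def rds_alarm_thresholds(max_connections, min_free_storage_gib=10, connection_percent=80):
    """Convert the relative RDS alarm rules into absolute metric thresholds."""
    return {
        # Alarm when DatabaseConnections rises above this value.
        "DatabaseConnections": max_connections * connection_percent // 100,
        # Alarm when FreeStorageSpace (in bytes) falls below this value.
        "FreeStorageSpace": min_free_storage_gib * GIB,
    }


print(rds_alarm_thresholds(max_connections=500))
```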
Observability Patterns
Pattern 1: Distributed Tracing Setup
MCP Server: CloudWatch Application Signals MCP
Components:
- Service Map: Visualize service dependencies
- Traces: Track requests across services
- Metrics: Monitor latency and errors per service
- SLOs: Define and track service level objectives
Implementation:
- Enable X-Ray tracing on Lambda functions
- Add X-Ray SDK to application code
- Configure sampling rules
- Create CloudWatch ServiceLens dashboards
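For the SLO component listed above, it can help to translate an availability target into an error budget. This is generic SLO arithmetic rather than anything specific to Application Signals, and the function names are illustrative.

```python
def error_budget_minutes(slo_percent, window_days=30):
    """Minutes of allowed unavailability for an availability SLO over a window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


def budget_remaining(slo_percent, bad_minutes, window_days=30):
    """Fraction of the error budget still unspent (negative once exhausted)."""
    budget = error_budget_minutes(slo_percent, window_days)
    return (budget - bad_minutes) / budget


print(f"{error_budget_minutes(99.9):.1f} minutes of budget per 30 days")
print(f"{budget_remaining(99.9, bad_minutes=21.6):.0%} of the budget remaining")
```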
Pattern 2: Log Aggregation and Analysis
MCP Server: CloudWatch MCP
Log Strategy:
- Centralize Logs: Send all application logs to CloudWatch Logs
- Structure Logs: Use JSON format for structured logging
- Log Insights: Use CloudWatch Logs Insights for queries
- Retention: Set appropriate retention periods
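One way to produce JSON-structured logs from Python is a custom formatter for the standard `logging` module. The sketch below is only one possible approach; the field names `service_name`, `error_type`, and `duration` are assumptions chosen so that log fields can be queried by name in Logs Insights.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    EXTRA_FIELDS = ("service_name", "error_type", "duration")

    def format(self, record):
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed through `extra=` become attributes on the record.
        for key in self.EXTRA_FIELDS:
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(name="app"):
    """Return a logger that writes JSON lines to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.propagate = False
    return logger


log = build_logger()
log.error("payment failed", extra={"service_name": "checkout", "error_type": "Timeout"})
```

On Lambda, anything written to stdout or stderr ends up in CloudWatch Logs, so no extra shipping step should be needed there.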
Example Log Insights Queries:
# Find errors in last hour
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
# Count errors by type
stats count() as error_count by error_type
| sort error_count desc
# Calculate p99 latency
stats pct(duration, 99) by service_name
Pattern 3: Custom Metrics
MCP Server: CloudWatch MCP
When to Use Custom Metrics:
- Business-specific KPIs (orders/minute, revenue/hour)
- Application-specific metrics (cache hit rate, queue depth)
- Performance metrics not provided by AWS
Best Practices:
- Use consistent namespace: CompanyName/ApplicationName
- Include relevant dimensions (environment, region, version)
- Publish metrics at appropriate intervals
- Use metric filters for log-derived metrics
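The namespace and dimension conventions above can be captured in a helper that builds the request for CloudWatch's `PutMetricData` API. This sketch only constructs the payload; `ExampleCo`, `Checkout`, and the metric name are hypothetical values.

```python
from datetime import datetime, timezone


def build_metric_payload(company, application, name, value, unit="Count", **dimensions):
    """Build put_metric_data keyword arguments using a Company/Application namespace."""
    return {
        "Namespace": f"{company}/{application}",
        "MetricData": [{
            "MetricName": name,
            # Sorted so the same dimension set always serializes the same way.
            "Dimensions": [{"Name": key, "Value": str(val)}
                           for key, val in sorted(dimensions.items())],
            "Timestamp": datetime.now(timezone.utc),
            "Value": float(value),
            "Unit": unit,
        }],
    }


payload = build_metric_payload("ExampleCo", "Checkout", "OrdersPlaced", 42,
                               environment="prod", region="us-east-1", version="1.4.2")
print(payload["Namespace"], payload["MetricData"][0]["Dimensions"])
```

With boto3 available, the payload could be sent as `boto3.client("cloudwatch").put_metric_data(**payload)`. Every distinct dimension combination creates a separate metric, so high-cardinality values such as request IDs are best kept out of dimensions.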
Security and Audit Patterns
Pattern 1: API Activity Auditing
MCP Server: CloudTrail MCP
Regular Audit Queries:
# Find all IAM changes
eventName: CreateUser, DeleteUser, AttachUserPolicy, etc.
Time: Last 24 hours
# Track S3 bucket deletions
eventName: DeleteBucket
Time: Last 7 days
# Find failed login attempts
eventName: ConsoleLogin
responseElements.ConsoleLogin: Failure
# Monitor privileged actions
userIdentity.arn: *admin* OR *root*
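Once events have been retrieved, the last two queries can be applied as plain filters. The sketch below assumes each event is a CloudTrail record already parsed into a dictionary; CloudTrail records the outcome of a `ConsoleLogin` under `responseElements.ConsoleLogin`. The `admin` marker and the function names are illustrative assumptions.

```python
def failed_console_logins(events):
    """Return ConsoleLogin events whose recorded outcome is a failure."""
    failures = []
    for event in events:
        if event.get("eventName") != "ConsoleLogin":
            continue
        response = event.get("responseElements") or {}
        if response.get("ConsoleLogin") == "Failure":
            failures.append(event)
    return failures


def is_privileged(event, markers=("admin",)):
    """True for root activity or for identities whose ARN contains a marker."""
    identity = event.get("userIdentity") or {}
    if identity.get("type") == "Root":
        return True
    arn = (identity.get("arn") or "").lower()
    return any(marker in arn for marker in markers)
```

Matching on ARN substrings is a coarse heuristic; a list of known privileged role ARNs would likely be more reliable.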
Audit Schedule:
- Daily: Review privileged user actions
- Weekly: Audit IAM changes and security group modifications
- Monthly: Comprehensive security review
Pattern 2: Security Posture Assessment
MCP Server: Well-Architected Security Assessment Tool MCP
Assessment Areas:
- Identity and Access Management
  - Least privilege implementation
  - MFA enforcement
  - Role-based access control
  - Service control policies
- Detective Controls
  - CloudTrail enabled in all regions
  - GuardDuty findings review
  - Config rule compliance
  - Security Hub findings
- Infrastructure Protection
  - VPC security groups review
  - Network ACLs configuration
  - AWS WAF rules
  - Security group ingress rules
- Data Protection
  - Encryption at rest (S3, EBS, RDS)
  - Encryption in transit (TLS/SSL)
  - KMS key usage and rotation
  - Secrets Manager utilization
- Incident Response
  - IR playbooks documented
  - Automated response procedures
  - Contact information current
  - Regular IR drills
Assessment Frequency:
- Quarterly: Full Well-Architected review
- Monthly: High-priority findings review
- Weekly: Critical security findings
Pattern 3: Compliance Monitoring
MCP Servers: CloudTrail MCP, CloudWatch MCP
Compliance Requirements:
- Data residency (ensure data stays in approved regions)
- Access logging (all access logged and retained)
- Encryption requirements (data encrypted at rest and in transit)
- Change management (all changes tracked in CloudTrail)
Compliance Dashboards:
- Encryption coverage by service
- CloudTrail logging status
- Failed login attempts
- Privileged access usage
- Non-compliant resources
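The data residency requirement can be checked against a resource inventory by reading the region out of each ARN. A minimal sketch, assuming the inventory is a list of ARNs; the approved region set is hypothetical, and global resources with an empty region field (for example S3 bucket or IAM ARNs) are skipped and would need a separate check.

```python
APPROVED_REGIONS = {"us-east-1", "eu-west-1"}  # hypothetical policy


def arn_region(arn):
    """Extract the region field from an ARN (empty string for global resources)."""
    parts = arn.split(":", 5)
    if len(parts) < 6 or parts[0] != "arn":
        raise ValueError(f"not a valid ARN: {arn!r}")
    return parts[3]


def out_of_region(arns, approved=APPROVED_REGIONS):
    """Return ARNs whose region is set and not in the approved list."""
    violations = []
    for arn in arns:
        region = arn_region(arn)
        if region and region not in approved:
            violations.append(arn)
    return violations
```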
Troubleshooting Workflows
Workflow 1: High Lambda Error Rate
MCP Servers: CloudWatch MCP, CloudWatch Application Signals MCP
Steps:
- Query CloudWatch for Lambda error metrics
- Check error logs in CloudWatch Logs
- Identify error patterns (timeout, memory, permission)
- Check Lambda configuration (memory, timeout, permissions)
- Review recent code deployments
- Check downstream service health
- Implement fix and monitor
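The pattern-identification step can be roughed out by bucketing log lines into the three categories named above. The substrings below are assumptions based on commonly seen Lambda and AWS SDK messages, not an exhaustive or authoritative list.

```python
from collections import Counter

# Substrings assumed to indicate each failure category; extend as needed.
PATTERNS = {
    "timeout": ("Task timed out after",),
    "memory": ("Runtime.OutOfMemory", "signal: killed"),
    "permission": ("AccessDenied", "is not authorized to perform"),
}


def classify_error(line):
    """Return the first matching category for a log line, or 'other'."""
    for category, needles in PATTERNS.items():
        if any(needle in line for needle in needles):
            return category
    return "other"


def summarize_errors(lines):
    """Count log lines per failure category."""
    return Counter(classify_error(line) for line in lines)
```

A dominant category points to the next step: timeouts and memory errors lead to the configuration check, permission errors to the execution role.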
Workflow 2: Increased Latency
MCP Servers: CloudWatch MCP, CloudWatch Application Signals MCP
Steps:
- Identify latency spike in CloudWatch metrics
- Check service map for slow dependencies
- Query distributed traces for slow requests
- Check database query performance
- Review API Gateway integration latency
- Check Lambda cold starts
- Identify bottleneck and optimize
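For the cold-start step, Lambda's `REPORT` log lines include an `Init Duration` field only on invocations that initialized a new execution environment. A small sketch that estimates the cold-start share from such lines; the function name and return shape are illustrative.

```python
import re

INIT_RE = re.compile(r"Init Duration: ([\d.]+) ms")


def cold_start_stats(log_lines):
    """Summarize cold starts from Lambda REPORT log lines."""
    reports = [line for line in log_lines if line.startswith("REPORT ")]
    # Only cold-start invocations carry an "Init Duration" field.
    inits = [float(m.group(1)) for line in reports if (m := INIT_RE.search(line))]
    return {
        "invocations": len(reports),
        "cold_starts": len(inits),
        "cold_start_ratio": len(inits) / len(reports) if reports else 0.0,
        "max_init_ms": max(inits) if inits else 0.0,
    }
```

A high ratio alongside the latency spike suggests the bottleneck is initialization rather than a downstream dependency.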
Workflow 3: Cost Spike Investigation
MCP Servers: Cost Explorer MCP, CloudWatch MCP, CloudTrail MCP
Steps:
- Use Cost Explorer to identify service causing spike
- Check CloudWatch metrics for usage increase
- Review CloudTrail for recent resource creation
- Identify root cause (misconfiguration, runaway process, attack)
- Implement cost controls (budgets, alarms, service quotas)
- Clean up unnecessary resources
Workflow 4: Security Incident Response
MCP Servers: CloudTrail MCP, GuardDuty (via CloudWatch), Well-Architected Assessment MCP
Steps:
- Identify security event in GuardDuty or CloudWatch
- Query CloudTrail for related API activity
- Determine scope and impact
- Isolate affected resources
- Revoke compromised credentials
- Implement remediation
- Conduct post-incident review
- Update security controls
Summary
- Cost Optimization: Use Pricing, Cost Explorer, and Billing MCPs for proactive cost management
- Monitoring: Set up comprehensive CloudWatch alarms for all critical services
- Observability: Implement distributed tracing and structured logging
- Security: Regular CloudTrail audits and Well-Architected assessments
- Proactive: Don't wait for incidents - monitor and optimize continuously