395 lines
10 KiB
Markdown
395 lines
10 KiB
Markdown
# AWS Cost & Operations Patterns
|
|
|
|
Comprehensive patterns and best practices for AWS cost optimization, monitoring, and operational excellence.
|
|
|
|
## Table of Contents
|
|
|
|
- [Cost Optimization Patterns](#cost-optimization-patterns)
|
|
- [Monitoring Patterns](#monitoring-patterns)
|
|
- [Observability Patterns](#observability-patterns)
|
|
- [Security and Audit Patterns](#security-and-audit-patterns)
|
|
- [Troubleshooting Workflows](#troubleshooting-workflows)
|
|
|
|
## Cost Optimization Patterns
|
|
|
|
### Pattern 1: Cost Estimation Before Deployment
|
|
|
|
**When**: Before deploying any new infrastructure
|
|
|
|
**MCP Server**: AWS Pricing MCP
|
|
|
|
**Steps**:
|
|
1. List all resources to be deployed
|
|
2. Query pricing for each resource type
|
|
3. Calculate monthly costs based on expected usage
|
|
4. Compare pricing across regions
|
|
5. Document cost estimates in architecture docs
|
|
|
|
**Example**:
|
|
```
|
|
Resource: Lambda Function
|
|
- Invocations: 1,000,000/month
|
|
- Duration: 3 seconds avg
|
|
- Memory: 512 MB
|
|
- Region: us-east-1
|
|
Estimated cost: $X/month
|
|
```
|
|
|
|
### Pattern 2: Monthly Cost Review
|
|
|
|
**When**: First week of every month
|
|
|
|
**MCP Servers**: Cost Explorer MCP, Billing and Cost Management MCP
|
|
|
|
**Steps**:
|
|
1. Review total spending vs. budget
|
|
2. Analyze cost by service (top 5 services)
|
|
3. Identify cost anomalies (>20% increase)
|
|
4. Review cost by environment (dev/staging/prod)
|
|
5. Check cost allocation tag coverage
|
|
6. Generate cost optimization recommendations
|
|
|
|
**Key Metrics**:
|
|
- Month-over-month cost change
|
|
- Cost per environment
|
|
- Cost per application/project
|
|
- Untagged resource costs
|
|
|
|
### Pattern 3: Right-Sizing Resources
|
|
|
|
**When**: Quarterly or when utilization alerts trigger
|
|
|
|
**MCP Servers**: CloudWatch MCP, Cost Explorer MCP
|
|
|
|
**Steps**:
|
|
1. Query CloudWatch for resource utilization metrics
|
|
2. Identify over-provisioned resources (< 40% utilization)
|
|
3. Identify under-provisioned resources (> 80% utilization)
|
|
4. Calculate potential savings from right-sizing
|
|
5. Plan and execute right-sizing changes
|
|
6. Monitor post-change performance
|
|
|
|
**Common Right-Sizing Scenarios**:
|
|
- EC2 instances with low CPU utilization
|
|
- RDS instances with excess capacity
|
|
- DynamoDB tables with low read/write usage
|
|
- Lambda functions with excessive memory allocation
|
|
|
|
### Pattern 4: Unused Resource Cleanup
|
|
|
|
**When**: Monthly or triggered by cost anomalies
|
|
|
|
**MCP Servers**: Cost Explorer MCP, CloudTrail MCP
|
|
|
|
**Steps**:
|
|
1. Identify resources with zero usage
|
|
2. Query CloudTrail for last access time
|
|
3. Tag resources for deletion review
|
|
4. Notify resource owners
|
|
5. Delete confirmed unused resources
|
|
6. Track cost savings
|
|
|
|
**Common Unused Resources**:
|
|
- Unattached EBS volumes
|
|
- Old EBS snapshots
|
|
- Idle Load Balancers
|
|
- Unused Elastic IPs
|
|
- Old AMIs and snapshots
|
|
- Stopped EC2 instances (long-term)
|
|
|
|
## Monitoring Patterns
|
|
|
|
### Pattern 1: Critical Service Monitoring
|
|
|
|
**When**: All production services
|
|
|
|
**MCP Server**: CloudWatch MCP
|
|
|
|
**Metrics to Monitor**:
|
|
- **Availability**: Service uptime, health checks
|
|
- **Performance**: Latency, response time
|
|
- **Errors**: Error rate, failed requests
|
|
- **Saturation**: CPU, memory, disk, network utilization
|
|
|
|
**Alarm Thresholds** (adjust based on SLAs):
|
|
- Error rate: > 1% for 2 consecutive periods
|
|
- Latency: p99 > 1 second for 5 minutes
|
|
- CPU: > 80% for 10 minutes
|
|
- Memory: > 85% for 5 minutes
|
|
|
|
### Pattern 2: Lambda Function Monitoring
|
|
|
|
**MCP Server**: CloudWatch MCP
|
|
|
|
**Key Metrics**:
|
|
```
|
|
- Invocations (Count)
|
|
- Errors (Count, %)
|
|
- Duration (Average, p99)
|
|
- Throttles (Count)
|
|
- ConcurrentExecutions (Max)
|
|
- IteratorAge (for stream processing)
|
|
```
|
|
|
|
**Recommended Alarms**:
|
|
- Error rate > 1%
|
|
- Duration > 80% of timeout
|
|
- Throttles > 0
|
|
- ConcurrentExecutions > 80% of reserved
|
|
|
|
### Pattern 3: API Gateway Monitoring
|
|
|
|
**MCP Server**: CloudWatch MCP
|
|
|
|
**Key Metrics**:
|
|
```
|
|
- Count (Total requests)
|
|
- 4XXError, 5XXError
|
|
- Latency (p50, p95, p99)
|
|
- IntegrationLatency
|
|
- CacheHitCount, CacheMissCount
|
|
```
|
|
|
|
**Recommended Alarms**:
|
|
- 5XX error rate > 0.5%
|
|
- 4XX error rate > 5%
|
|
- Latency p99 > 2 seconds
|
|
- Integration latency spike
|
|
|
|
### Pattern 4: Database Monitoring
|
|
|
|
**MCP Server**: CloudWatch MCP
|
|
|
|
**RDS Metrics**:
|
|
```
|
|
- CPUUtilization
|
|
- DatabaseConnections
|
|
- FreeableMemory
|
|
- ReadLatency, WriteLatency
|
|
- ReadIOPS, WriteIOPS
|
|
- FreeStorageSpace
|
|
```
|
|
|
|
**DynamoDB Metrics**:
|
|
```
|
|
- ConsumedReadCapacityUnits
|
|
- ConsumedWriteCapacityUnits
|
|
- UserErrors
|
|
- SystemErrors
|
|
- ThrottledRequests
|
|
```
|
|
|
|
**Recommended Alarms**:
|
|
- RDS CPU > 80% for 10 minutes
|
|
- RDS connections > 80% of max
|
|
- RDS free storage < 10 GB
|
|
- DynamoDB throttled requests > 0
|
|
- DynamoDB user errors spike
|
|
|
|
## Observability Patterns
|
|
|
|
### Pattern 1: Distributed Tracing Setup
|
|
|
|
**MCP Server**: CloudWatch Application Signals MCP
|
|
|
|
**Components**:
|
|
1. **Service Map**: Visualize service dependencies
|
|
2. **Traces**: Track requests across services
|
|
3. **Metrics**: Monitor latency and errors per service
|
|
4. **SLOs**: Define and track service level objectives
|
|
|
|
**Implementation**:
|
|
- Enable X-Ray tracing on Lambda functions
|
|
- Add X-Ray SDK to application code
|
|
- Configure sampling rules
|
|
- Create service lens dashboards
|
|
|
|
### Pattern 2: Log Aggregation and Analysis
|
|
|
|
**MCP Server**: CloudWatch MCP
|
|
|
|
**Log Strategy**:
|
|
1. **Centralize Logs**: Send all application logs to CloudWatch Logs
|
|
2. **Structure Logs**: Use JSON format for structured logging
|
|
3. **Log Insights**: Use CloudWatch Logs Insights for queries
|
|
4. **Retention**: Set appropriate retention periods
|
|
|
|
**Example Log Insights Queries**:
|
|
```
|
|
# Find errors in last hour
|
|
fields @timestamp, @message
|
|
| filter @message like /ERROR/
|
|
| sort @timestamp desc
|
|
| limit 100
|
|
|
|
# Count errors by type
|
|
stats count() by error_type
|
|
| sort count desc
|
|
|
|
# Calculate p99 latency
|
|
stats percentile(duration, 99) by service_name
|
|
```
|
|
|
|
### Pattern 3: Custom Metrics
|
|
|
|
**MCP Server**: CloudWatch MCP
|
|
|
|
**When to Use Custom Metrics**:
|
|
- Business-specific KPIs (orders/minute, revenue/hour)
|
|
- Application-specific metrics (cache hit rate, queue depth)
|
|
- Performance metrics not provided by AWS
|
|
|
|
**Best Practices**:
|
|
- Use consistent namespace: `CompanyName/ApplicationName`
|
|
- Include relevant dimensions (environment, region, version)
|
|
- Publish metrics at appropriate intervals
|
|
- Use metric filters for log-derived metrics
|
|
|
|
## Security and Audit Patterns
|
|
|
|
### Pattern 1: API Activity Auditing
|
|
|
|
**MCP Server**: CloudTrail MCP
|
|
|
|
**Regular Audit Queries**:
|
|
```
|
|
# Find all IAM changes
|
|
eventName: CreateUser, DeleteUser, AttachUserPolicy, etc.
|
|
Time: Last 24 hours
|
|
|
|
# Track S3 bucket deletions
|
|
eventName: DeleteBucket
|
|
Time: Last 7 days
|
|
|
|
# Find failed login attempts
|
|
eventName: ConsoleLogin
|
|
errorCode: Failure
|
|
|
|
# Monitor privileged actions
|
|
userIdentity.arn: *admin* OR *root*
|
|
```
|
|
|
|
**Audit Schedule**:
|
|
- Daily: Review privileged user actions
|
|
- Weekly: Audit IAM changes and security group modifications
|
|
- Monthly: Comprehensive security review
|
|
|
|
### Pattern 2: Security Posture Assessment
|
|
|
|
**MCP Server**: Well-Architected Security Assessment Tool MCP
|
|
|
|
**Assessment Areas**:
|
|
1. **Identity and Access Management**
|
|
- Least privilege implementation
|
|
- MFA enforcement
|
|
- Role-based access control
|
|
- Service control policies
|
|
|
|
2. **Detective Controls**
|
|
- CloudTrail enabled in all regions
|
|
- GuardDuty findings review
|
|
- Config rule compliance
|
|
- Security Hub findings
|
|
|
|
3. **Infrastructure Protection**
|
|
- VPC security groups review
|
|
- Network ACLs configuration
|
|
- AWS WAF rules
|
|
- Security group ingress rules
|
|
|
|
4. **Data Protection**
|
|
- Encryption at rest (S3, EBS, RDS)
|
|
- Encryption in transit (TLS/SSL)
|
|
- KMS key usage and rotation
|
|
- Secrets Manager utilization
|
|
|
|
5. **Incident Response**
|
|
- IR playbooks documented
|
|
- Automated response procedures
|
|
- Contact information current
|
|
- Regular IR drills
|
|
|
|
**Assessment Frequency**:
|
|
- Quarterly: Full Well-Architected review
|
|
- Monthly: High-priority findings review
|
|
- Weekly: Critical security findings
|
|
|
|
### Pattern 3: Compliance Monitoring
|
|
|
|
**MCP Servers**: CloudTrail MCP, CloudWatch MCP
|
|
|
|
**Compliance Requirements**:
|
|
- Data residency (ensure data stays in approved regions)
|
|
- Access logging (all access logged and retained)
|
|
- Encryption requirements (data encrypted at rest and in transit)
|
|
- Change management (all changes tracked in CloudTrail)
|
|
|
|
**Compliance Dashboards**:
|
|
- Encryption coverage by service
|
|
- CloudTrail logging status
|
|
- Failed login attempts
|
|
- Privileged access usage
|
|
- Non-compliant resources
|
|
|
|
## Troubleshooting Workflows
|
|
|
|
### Workflow 1: High Lambda Error Rate
|
|
|
|
**MCP Servers**: CloudWatch MCP, CloudWatch Application Signals MCP
|
|
|
|
**Steps**:
|
|
1. Query CloudWatch for Lambda error metrics
|
|
2. Check error logs in CloudWatch Logs
|
|
3. Identify error patterns (timeout, memory, permission)
|
|
4. Check Lambda configuration (memory, timeout, permissions)
|
|
5. Review recent code deployments
|
|
6. Check downstream service health
|
|
7. Implement fix and monitor
|
|
|
|
### Workflow 2: Increased Latency
|
|
|
|
**MCP Servers**: CloudWatch MCP, CloudWatch Application Signals MCP
|
|
|
|
**Steps**:
|
|
1. Identify latency spike in CloudWatch metrics
|
|
2. Check service map for slow dependencies
|
|
3. Query distributed traces for slow requests
|
|
4. Check database query performance
|
|
5. Review API Gateway integration latency
|
|
6. Check Lambda cold starts
|
|
7. Identify bottleneck and optimize
|
|
|
|
### Workflow 3: Cost Spike Investigation
|
|
|
|
**MCP Servers**: Cost Explorer MCP, CloudWatch MCP, CloudTrail MCP
|
|
|
|
**Steps**:
|
|
1. Use Cost Explorer to identify service causing spike
|
|
2. Check CloudWatch metrics for usage increase
|
|
3. Review CloudTrail for recent resource creation
|
|
4. Identify root cause (misconfiguration, runaway process, attack)
|
|
5. Implement cost controls (budgets, alarms, service quotas)
|
|
6. Clean up unnecessary resources
|
|
|
|
### Workflow 4: Security Incident Response
|
|
|
|
**MCP Servers**: CloudTrail MCP, GuardDuty (via CloudWatch), Well-Architected Assessment MCP
|
|
|
|
**Steps**:
|
|
1. Identify security event in GuardDuty or CloudWatch
|
|
2. Query CloudTrail for related API activity
|
|
3. Determine scope and impact
|
|
4. Isolate affected resources
|
|
5. Revoke compromised credentials
|
|
6. Implement remediation
|
|
7. Conduct post-incident review
|
|
8. Update security controls
|
|
|
|
## Summary
|
|
|
|
- **Cost Optimization**: Use Pricing, Cost Explorer, and Billing MCPs for proactive cost management
|
|
- **Monitoring**: Set up comprehensive CloudWatch alarms for all critical services
|
|
- **Observability**: Implement distributed tracing and structured logging
|
|
- **Security**: Regular CloudTrail audits and Well-Architected assessments
|
|
- **Proactive**: Don't wait for incidents - monitor and optimize continuously
|