Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 09:08:38 +08:00
commit 38b562e994
16 changed files with 7792 additions and 0 deletions


@@ -0,0 +1,299 @@
---
name: aws-cost-operations
description: This skill provides AWS cost optimization, monitoring, and operational best practices with integrated MCP servers for billing analysis, cost estimation, observability, and security assessment.
---
# AWS Cost & Operations
This skill provides comprehensive guidance for AWS cost optimization, monitoring, observability, and operational excellence with integrated MCP servers.
## Integrated MCP Servers
This skill includes 8 MCP servers automatically configured with the plugin:
### Cost Management Servers
#### 1. AWS Billing and Cost Management MCP Server
**Purpose**: Real-time billing and cost management
- View current AWS spending and trends
- Analyze billing details across services
- Track budget utilization
- Monitor cost allocation tags
- Review consolidated billing for organizations
#### 2. AWS Pricing MCP Server
**Purpose**: Pre-deployment cost estimation and optimization
- Estimate costs before deploying resources
- Compare pricing across regions
- Calculate Total Cost of Ownership (TCO)
- Evaluate different service options for cost efficiency
- Get current pricing information for AWS services
#### 3. AWS Cost Explorer MCP Server
**Purpose**: Detailed cost analysis and reporting
- Analyze historical spending patterns
- Create custom cost reports
- Identify cost anomalies and trends
- Forecast future costs
- Analyze cost by service, region, or tag
- Generate cost optimization recommendations
### Monitoring & Observability Servers
#### 4. Amazon CloudWatch MCP Server
**Purpose**: Metrics, alarms, and logs analysis
- Query CloudWatch metrics and logs
- Create and manage CloudWatch alarms
- Analyze application performance metrics
- Troubleshoot operational issues
- Set up custom dashboards
- Monitor resource utilization
#### 5. Amazon CloudWatch Application Signals MCP Server
**Purpose**: Application monitoring and performance insights
- Monitor application health and performance
- Analyze service-level objectives (SLOs)
- Track application dependencies
- Identify performance bottlenecks
- Monitor service map and traces
#### 6. AWS Managed Prometheus MCP Server
**Purpose**: Prometheus-compatible monitoring
- Query Prometheus metrics
- Monitor containerized applications
- Analyze Kubernetes workload metrics
- Create PromQL queries
- Track custom application metrics
### Audit & Security Servers
#### 7. AWS CloudTrail MCP Server
**Purpose**: AWS API activity and audit analysis
- Analyze AWS API calls and user activity
- Track resource changes and modifications
- Investigate security incidents
- Audit compliance requirements
- Identify unusual access patterns
- Review who made what changes when
#### 8. AWS Well-Architected Security Assessment Tool MCP Server
**Purpose**: Security assessment against Well-Architected Framework
- Assess security posture against AWS best practices
- Identify security gaps and vulnerabilities
- Get security improvement recommendations
- Review security pillar compliance
- Generate security assessment reports
## When to Use This Skill
Use this skill when:
- Optimizing AWS costs and reducing spending
- Estimating costs before deployment
- Monitoring application and infrastructure performance
- Setting up observability and alerting
- Analyzing spending patterns and trends
- Investigating operational issues
- Auditing AWS activity and changes
- Assessing security posture
- Implementing operational excellence
## Cost Optimization Best Practices
### Pre-Deployment Cost Estimation
**Always estimate costs before deploying**:
1. Use **AWS Pricing MCP** to estimate resource costs
2. Compare pricing across different regions
3. Evaluate alternative service options
4. Calculate expected monthly costs
5. Plan for scaling and growth
**Example workflow**:
```
"Estimate the monthly cost of running a Lambda function with
1 million invocations, 512MB memory, 3-second duration in us-east-1"
```
### Cost Analysis and Optimization
**Regular cost reviews**:
1. Use **Cost Explorer MCP** to analyze spending trends
2. Identify cost anomalies and unexpected charges
3. Review costs by service, region, and environment
4. Compare actual vs. budgeted costs
5. Generate cost optimization recommendations
**Cost optimization strategies**:
- Right-size over-provisioned resources
- Use appropriate storage classes (S3, EBS)
- Implement auto-scaling for dynamic workloads
- Leverage Savings Plans and Reserved Instances
- Delete unused resources and snapshots
- Use cost allocation tags effectively
### Budget Monitoring
**Track spending against budgets**:
1. Use **Billing and Cost Management MCP** to monitor budgets
2. Set up budget alerts for threshold breaches
3. Review budget utilization regularly
4. Adjust budgets based on trends
5. Implement cost controls and governance
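
As a rough illustration of steps 2 and 3, the sketch below computes budget utilization and reports which alert thresholds have been crossed. The `BudgetStatus` shape and the 50/80/100 percent thresholds are illustrative assumptions, not values taken from AWS Budgets or from the MCP server.

```typescript
// Hypothetical shape for a budget snapshot; not an AWS Budgets API type.
interface BudgetStatus {
  budgetName: string;
  limitUsd: number;  // budgeted amount for the period
  actualUsd: number; // spend so far in the period
}

// Assumed alert thresholds, as percentages of the budget limit.
const ALERT_THRESHOLDS_PCT: number[] = [50, 80, 100];

// Returns utilization (percent of limit) and every threshold already crossed.
function evaluateBudget(budget: BudgetStatus): { utilizationPct: number; breached: number[] } {
  if (budget.limitUsd <= 0) {
    throw new Error(`Budget "${budget.budgetName}" must have a positive limit`);
  }
  const utilizationPct = (budget.actualUsd / budget.limitUsd) * 100;
  const breached = ALERT_THRESHOLDS_PCT.filter((t) => utilizationPct >= t);
  return { utilizationPct, breached };
}

const budgetResult = evaluateBudget({ budgetName: 'prod-monthly', limitUsd: 1000, actualUsd: 850 });
console.log(
  `${budgetResult.utilizationPct.toFixed(1)}% used, thresholds crossed: ${budgetResult.breached.join(', ')}`,
);
```

In practice the spend figures would come from the Billing and Cost Management MCP rather than being hard-coded.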
## Monitoring and Observability Best Practices
### CloudWatch Metrics and Alarms
**Implement comprehensive monitoring**:
1. Use **CloudWatch MCP** to query metrics and logs
2. Set up alarms for critical metrics:
   - CPU and memory utilization
   - Error rates and latency
   - Queue depths and processing times
   - API Gateway throttling
   - Lambda errors and timeouts
3. Create CloudWatch dashboards for visualization
4. Use CloudWatch Logs Insights for troubleshooting
**Example alarm scenarios**:
- Lambda error rate > 1%
- EC2 CPU utilization > 80%
- API Gateway 4xx/5xx error spike
- DynamoDB throttled requests
- ECS task failures
### Application Performance Monitoring
**Monitor application health**:
1. Use **CloudWatch Application Signals MCP** for APM
2. Track service-level objectives (SLOs)
3. Monitor application dependencies
4. Identify performance bottlenecks
5. Set up distributed tracing
### Container and Kubernetes Monitoring
**For containerized workloads**:
1. Use **AWS Managed Prometheus MCP** for metrics
2. Monitor container resource utilization
3. Track pod and node health
4. Create PromQL queries for custom metrics
5. Set up alerts for container anomalies
## Audit and Security Best Practices
### CloudTrail Activity Analysis
**Audit AWS activity**:
1. Use **CloudTrail MCP** to analyze API activity
2. Track who made changes to resources
3. Investigate security incidents
4. Monitor for suspicious activity patterns
5. Audit compliance with policies
**Common audit scenarios**:
- "Who deleted this S3 bucket?"
- "Show all IAM role changes in the last 24 hours"
- "List failed login attempts"
- "Find all actions by a specific user"
- "Track modifications to security groups"
### Security Assessment
**Regular security reviews**:
1. Use **Well-Architected Security Assessment MCP**
2. Assess security posture against best practices
3. Identify security gaps and vulnerabilities
4. Implement recommended security improvements
5. Document security compliance
**Security assessment areas**:
- Identity and Access Management (IAM)
- Detective controls and monitoring
- Infrastructure protection
- Data protection and encryption
- Incident response preparedness
## Using MCP Servers Effectively
### Cost Analysis Workflow
1. **Pre-deployment**: Use Pricing MCP to estimate costs
2. **Post-deployment**: Use Billing MCP to track actual spending
3. **Analysis**: Use Cost Explorer MCP for detailed cost analysis
4. **Optimization**: Implement recommendations from Cost Explorer
### Monitoring Workflow
1. **Setup**: Configure CloudWatch metrics and alarms
2. **Monitor**: Use CloudWatch MCP to track key metrics
3. **Analyze**: Use Application Signals for APM insights
4. **Troubleshoot**: Query CloudWatch Logs for issue resolution
### Security Workflow
1. **Audit**: Use CloudTrail MCP to review activity
2. **Assess**: Use Well-Architected Security Assessment
3. **Remediate**: Implement security recommendations
4. **Monitor**: Track security events via CloudWatch
### MCP Usage Best Practices
1. **Cost Awareness**: Check pricing before deploying resources
2. **Proactive Monitoring**: Set up alarms for critical metrics
3. **Regular Reviews**: Analyze costs and performance weekly
4. **Audit Trails**: Review CloudTrail logs for compliance
5. **Security First**: Run security assessments regularly
6. **Optimize Continuously**: Act on cost and performance recommendations
## Operational Excellence Guidelines
### Cost Optimization
- **Tag Everything**: Use consistent cost allocation tags
- **Review Monthly**: Analyze spending trends and anomalies
- **Right-size**: Match resources to actual usage
- **Automate**: Use auto-scaling and scheduling
- **Monitor Budgets**: Set alerts for cost overruns
### Monitoring and Alerting
- **Critical Metrics**: Alert on business-critical metrics
- **Noise Reduction**: Fine-tune thresholds to reduce false positives
- **Actionable Alerts**: Ensure alerts have clear remediation steps
- **Dashboard Visibility**: Create dashboards for key stakeholders
- **Log Retention**: Balance cost and compliance needs
### Security and Compliance
- **Least Privilege**: Grant minimum required permissions
- **Audit Regularly**: Review CloudTrail logs for anomalies
- **Encrypt Data**: Use encryption at rest and in transit
- **Assess Continuously**: Run security assessments frequently
- **Incident Response**: Have procedures for security events
## Additional Resources
For detailed operational patterns and best practices, refer to the comprehensive reference:
**File**: `references/operations-patterns.md`
This reference includes:
- Cost optimization strategies
- Monitoring and alerting patterns
- Observability best practices
- Security and compliance guidelines
- Troubleshooting workflows
## CloudWatch Alarms Reference
**File**: `references/cloudwatch-alarms.md`
Common alarm configurations for:
- Lambda functions
- EC2 instances
- RDS databases
- DynamoDB tables
- API Gateway
- ECS services
- Application Load Balancers


@@ -0,0 +1,567 @@
# CloudWatch Alarms Reference
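Common CloudWatch alarm configurations for AWS services. The snippets assume AWS CDK v2 (`Duration` from `aws-cdk-lib`, `cloudwatch` from `aws-cdk-lib/aws-cloudwatch`) and are meant to run inside a construct, where `this` is the scope and resources such as `lambdaFunction`, `api`, and `table` are already defined.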
## Lambda Functions
### Error Count Alarm
```typescript
new cloudwatch.Alarm(this, 'LambdaErrorAlarm', {
metric: lambdaFunction.metricErrors({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 10,
evaluationPeriods: 1,
treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
alarmDescription: 'Lambda error count exceeded threshold',
});
```
### Duration Alarm (Approaching Timeout)
```typescript
new cloudwatch.Alarm(this, 'LambdaDurationAlarm', {
metric: lambdaFunction.metricDuration({
statistic: 'Maximum',
period: Duration.minutes(5),
}),
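// `timeout` is optional on the construct; Lambda's default of 3 seconds is used as the fallback
threshold: (lambdaFunction.timeout ?? Duration.seconds(3)).toMilliseconds() * 0.8, // 80% of timeout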
evaluationPeriods: 2,
alarmDescription: 'Lambda duration approaching timeout',
});
```
### Throttle Alarm
```typescript
new cloudwatch.Alarm(this, 'LambdaThrottleAlarm', {
metric: lambdaFunction.metricThrottles({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 5,
evaluationPeriods: 1,
alarmDescription: 'Lambda function is being throttled',
});
```
### Concurrent Executions Alarm
```typescript
new cloudwatch.Alarm(this, 'LambdaConcurrencyAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/Lambda',
metricName: 'ConcurrentExecutions',
dimensionsMap: {
FunctionName: lambdaFunction.functionName,
},
statistic: 'Maximum',
period: Duration.minutes(1),
}),
threshold: 100, // Adjust based on reserved concurrency
evaluationPeriods: 2,
alarmDescription: 'Lambda concurrent executions high',
});
```
## API Gateway
### 5XX Error Count Alarm
```typescript
new cloudwatch.Alarm(this, 'Api5xxAlarm', {
metric: api.metricServerError({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 10,
evaluationPeriods: 1,
alarmDescription: 'API Gateway 5XX errors exceeded threshold',
});
```
### 4XX Error Count Alarm
```typescript
new cloudwatch.Alarm(this, 'Api4xxAlarm', {
metric: api.metricClientError({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 50,
evaluationPeriods: 2,
alarmDescription: 'API Gateway 4XX errors exceeded threshold',
});
```
### Latency Alarm
```typescript
new cloudwatch.Alarm(this, 'ApiLatencyAlarm', {
metric: api.metricLatency({
statistic: 'p99',
period: Duration.minutes(5),
}),
threshold: 2000, // 2 seconds
evaluationPeriods: 2,
alarmDescription: 'API Gateway p99 latency exceeded threshold',
});
```
## DynamoDB
### Read Throttle Alarm
```typescript
new cloudwatch.Alarm(this, 'DynamoDBReadThrottleAlarm', {
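// UserErrors is an account-level metric and does not report throttling;
// use the per-operation ThrottledRequests metric instead.
metric: table.metricThrottledRequestsForOperation('GetItem', {
statistic: 'Sum',
period: Duration.minutes(5),
}),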
threshold: 5,
evaluationPeriods: 1,
alarmDescription: 'DynamoDB read operations being throttled',
});
```
### Write Throttle Alarm
```typescript
new cloudwatch.Alarm(this, 'DynamoDBWriteThrottleAlarm', {
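// UserErrors is an account-level metric and does not report throttling;
// use the per-operation ThrottledRequests metric instead.
metric: table.metricThrottledRequestsForOperation('PutItem', {
statistic: 'Sum',
period: Duration.minutes(5),
}),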
threshold: 5,
evaluationPeriods: 1,
alarmDescription: 'DynamoDB write operations being throttled',
});
```
### Consumed Capacity Alarm
```typescript
new cloudwatch.Alarm(this, 'DynamoDBCapacityAlarm', {
metric: table.metricConsumedReadCapacityUnits({
statistic: 'Sum',
period: Duration.minutes(5),
}),
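// Sum is the total units consumed over the 300-second period, while
// provisionedCapacity is in units per second, so scale before comparing.
threshold: provisionedCapacity * 300 * 0.8, // 80% of provisioned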
evaluationPeriods: 2,
alarmDescription: 'DynamoDB consumed capacity approaching limit',
});
```
## EC2 Instances
### CPU Utilization Alarm
```typescript
new cloudwatch.Alarm(this, 'EC2CpuAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/EC2',
metricName: 'CPUUtilization',
dimensionsMap: {
InstanceId: instance.instanceId,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 80,
evaluationPeriods: 3,
alarmDescription: 'EC2 CPU utilization high',
});
```
### Status Check Failed Alarm
```typescript
new cloudwatch.Alarm(this, 'EC2StatusCheckAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/EC2',
metricName: 'StatusCheckFailed',
dimensionsMap: {
InstanceId: instance.instanceId,
},
statistic: 'Maximum',
period: Duration.minutes(1),
}),
threshold: 1,
evaluationPeriods: 2,
alarmDescription: 'EC2 status check failed',
});
```
### Disk Space Alarm (Requires CloudWatch Agent)
```typescript
new cloudwatch.Alarm(this, 'EC2DiskAlarm', {
metric: new cloudwatch.Metric({
namespace: 'CWAgent',
metricName: 'disk_used_percent',
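dimensionsMap: {
// Must match every dimension the agent publishes; by default this also
// includes device and fstype unless the agent config aggregates them away.
InstanceId: instance.instanceId,
path: '/',
},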
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 85,
evaluationPeriods: 2,
alarmDescription: 'EC2 disk space usage high',
});
```
## RDS Databases
### CPU Alarm
```typescript
new cloudwatch.Alarm(this, 'RDSCpuAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/RDS',
metricName: 'CPUUtilization',
dimensionsMap: {
DBInstanceIdentifier: dbInstance.instanceIdentifier,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 80,
evaluationPeriods: 3,
alarmDescription: 'RDS CPU utilization high',
});
```
### Connection Count Alarm
```typescript
new cloudwatch.Alarm(this, 'RDSConnectionAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/RDS',
metricName: 'DatabaseConnections',
dimensionsMap: {
DBInstanceIdentifier: dbInstance.instanceIdentifier,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: maxConnections * 0.8, // 80% of max connections
evaluationPeriods: 2,
alarmDescription: 'RDS connection count approaching limit',
});
```
### Free Storage Space Alarm
```typescript
new cloudwatch.Alarm(this, 'RDSStorageAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/RDS',
metricName: 'FreeStorageSpace',
dimensionsMap: {
DBInstanceIdentifier: dbInstance.instanceIdentifier,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 10 * 1024 * 1024 * 1024, // 10 GB in bytes
comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
evaluationPeriods: 1,
alarmDescription: 'RDS free storage space low',
});
```
## ECS Services
### Task Count Alarm
```typescript
new cloudwatch.Alarm(this, 'ECSTaskCountAlarm', {
metric: new cloudwatch.Metric({
namespace: 'ECS/ContainerInsights',
metricName: 'RunningTaskCount',
dimensionsMap: {
ServiceName: service.serviceName,
ClusterName: cluster.clusterName,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 1,
comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
evaluationPeriods: 2,
alarmDescription: 'ECS service has no running tasks',
});
```
### CPU Utilization Alarm
```typescript
new cloudwatch.Alarm(this, 'ECSCpuAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ECS',
metricName: 'CPUUtilization',
dimensionsMap: {
ServiceName: service.serviceName,
ClusterName: cluster.clusterName,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 80,
evaluationPeriods: 3,
alarmDescription: 'ECS service CPU utilization high',
});
```
### Memory Utilization Alarm
```typescript
new cloudwatch.Alarm(this, 'ECSMemoryAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ECS',
metricName: 'MemoryUtilization',
dimensionsMap: {
ServiceName: service.serviceName,
ClusterName: cluster.clusterName,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 85,
evaluationPeriods: 2,
alarmDescription: 'ECS service memory utilization high',
});
```
## SQS Queues
### Queue Depth Alarm
```typescript
new cloudwatch.Alarm(this, 'SQSDepthAlarm', {
metric: queue.metricApproximateNumberOfMessagesVisible({
statistic: 'Maximum',
period: Duration.minutes(5),
}),
threshold: 1000,
evaluationPeriods: 2,
alarmDescription: 'SQS queue depth exceeded threshold',
});
```
### Age of Oldest Message Alarm
```typescript
new cloudwatch.Alarm(this, 'SQSAgeAlarm', {
metric: queue.metricApproximateAgeOfOldestMessage({
statistic: 'Maximum',
period: Duration.minutes(5),
}),
threshold: 300, // 5 minutes in seconds
evaluationPeriods: 1,
alarmDescription: 'SQS messages not being processed timely',
});
```
## Application Load Balancer
### Target Health Alarm
```typescript
new cloudwatch.Alarm(this, 'ALBUnhealthyTargetAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ApplicationELB',
metricName: 'UnHealthyHostCount',
dimensionsMap: {
LoadBalancer: loadBalancer.loadBalancerFullName,
TargetGroup: targetGroup.targetGroupFullName,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 1,
evaluationPeriods: 2,
alarmDescription: 'ALB has unhealthy targets',
});
```
### HTTP 5XX Alarm
```typescript
new cloudwatch.Alarm(this, 'ALB5xxAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ApplicationELB',
metricName: 'HTTPCode_Target_5XX_Count',
dimensionsMap: {
LoadBalancer: loadBalancer.loadBalancerFullName,
},
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 10,
evaluationPeriods: 1,
alarmDescription: 'ALB target 5XX errors exceeded threshold',
});
```
### Response Time Alarm
```typescript
new cloudwatch.Alarm(this, 'ALBLatencyAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ApplicationELB',
metricName: 'TargetResponseTime',
dimensionsMap: {
LoadBalancer: loadBalancer.loadBalancerFullName,
},
statistic: 'p99',
period: Duration.minutes(5),
}),
threshold: 1, // 1 second
evaluationPeriods: 2,
alarmDescription: 'ALB p99 response time exceeded threshold',
});
```
## Composite Alarms
### Service Health Composite Alarm
```typescript
const errorAlarm = new cloudwatch.Alarm(this, 'ErrorAlarm', { /* ... */ });
const latencyAlarm = new cloudwatch.Alarm(this, 'LatencyAlarm', { /* ... */ });
const throttleAlarm = new cloudwatch.Alarm(this, 'ThrottleAlarm', { /* ... */ });
new cloudwatch.CompositeAlarm(this, 'ServiceHealthAlarm', {
compositeAlarmName: 'service-health',
alarmRule: cloudwatch.AlarmRule.anyOf(
errorAlarm,
latencyAlarm,
throttleAlarm
),
alarmDescription: 'Overall service health degraded',
});
```
## Alarm Actions
### SNS Topic Integration
```typescript
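import * as sns from 'aws-cdk-lib/aws-sns';
import * as subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import * as actions from 'aws-cdk-lib/aws-cloudwatch-actions';

const topic = new sns.Topic(this, 'AlarmTopic', {
displayName: 'CloudWatch Alarms',
});
// Email subscription (the recipient must confirm it before notifications arrive)
topic.addSubscription(new subscriptions.EmailSubscription('ops@example.com'));
// Notify on both ALARM and OK transitions; `alarm` is any cloudwatch.Alarm defined earlier
alarm.addAlarmAction(new actions.SnsAction(topic));
alarm.addOkAction(new actions.SnsAction(topic));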
```
### Auto Scaling Action
```typescript
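// scaleOnMetric is defined on scaling targets such as an AutoScalingGroup, not on
// the target group; `autoScalingGroup` is assumed to be the group behind it.
const scalingAction = autoScalingGroup.scaleOnMetric('ScaleUp', {
metric: targetGroup.metricTargetResponseTime(), // reported in seconds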
scalingSteps: [
{ upper: 1, change: 0 },
{ lower: 1, change: +1 },
{ lower: 2, change: +2 },
],
});
```
## Alarm Best Practices
### Threshold Selection
**CPU/Memory Alarms**:
- Warning: 70-80%
- Critical: 80-90%
- Consider burst patterns and normal usage
**Error Rate Alarms**:
- Threshold based on SLA (e.g., 99.9% = 0.1% error rate)
- Account for normal error rates
- Different thresholds for different error types
**Latency Alarms**:
- p99 latency for user-facing APIs
- Warning: 80% of SLA target
- Critical: 100% of SLA target
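
The rules of thumb above can be turned into concrete numbers. The following is a minimal sketch that derives an error-rate threshold and latency thresholds from SLA targets; the `SlaTargets` and `AlarmThresholds` types are hypothetical helpers introduced for illustration only.

```typescript
// Hypothetical input: the targets an SLA commits to.
interface SlaTargets {
  availabilityPct: number; // e.g. 99.9
  latencyMs: number;       // p99 latency target in milliseconds
}

interface AlarmThresholds {
  errorRatePct: number;      // maximum tolerated error rate
  latencyWarningMs: number;  // 80% of the latency target
  latencyCriticalMs: number; // 100% of the latency target
}

function thresholdsFromSla(sla: SlaTargets): AlarmThresholds {
  if (sla.availabilityPct <= 0 || sla.availabilityPct > 100) {
    throw new Error('availabilityPct must be in the range (0, 100]');
  }
  return {
    errorRatePct: 100 - sla.availabilityPct, // 99.9% availability leaves a 0.1% error budget
    latencyWarningMs: sla.latencyMs * 0.8,
    latencyCriticalMs: sla.latencyMs,
  };
}

const slaThresholds = thresholdsFromSla({ availabilityPct: 99.9, latencyMs: 2000 });
console.log(`error rate threshold: ${slaThresholds.errorRatePct.toFixed(2)}%`);
console.log(`latency warning/critical: ${slaThresholds.latencyWarningMs} ms / ${slaThresholds.latencyCriticalMs} ms`);
```

The error-rate value is formatted before printing because the subtraction is done in floating point.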
### Evaluation Periods
**Fast-changing metrics** (1-2 periods):
- Error counts
- Failed health checks
- Critical application errors
**Slow-changing metrics** (3-5 periods):
- CPU utilization
- Memory usage
- Disk usage
**Cost-related metrics** (longer periods):
- Daily spending
- Resource count changes
- Usage patterns
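
To see why the number of evaluation periods matters, here is a deliberately simplified model of alarm evaluation: the alarm fires only when the most recent N datapoints all exceed the threshold. This is a sketch, not CloudWatch's actual algorithm, which also handles missing data and "M out of N" settings.

```typescript
// Simplified model: ALARM when the latest `evaluationPeriods` datapoints all exceed the threshold.
function wouldAlarm(datapoints: number[], threshold: number, evaluationPeriods: number): boolean {
  if (!Number.isInteger(evaluationPeriods) || evaluationPeriods < 1) {
    throw new Error('evaluationPeriods must be a positive integer');
  }
  if (datapoints.length < evaluationPeriods) {
    return false; // not enough data to decide
  }
  return datapoints.slice(-evaluationPeriods).every((value) => value > threshold);
}

// CPU utilization with a single spike in the latest period.
const cpuSeries = [45, 50, 48, 95];

console.log(wouldAlarm(cpuSeries, 80, 1)); // true: one period reacts to the spike immediately
console.log(wouldAlarm(cpuSeries, 80, 3)); // false: three periods require sustained load
```

A short window suits error counts and health checks, while a longer one filters out transient spikes in utilization metrics.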
### Missing Data Handling
```typescript
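// treatMissingData is set as a construct property, not called as a method.
// For intermittent workloads: missing data counts as within threshold
new cloudwatch.Alarm(this, 'IntermittentAlarm', { /* ... */ treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING });
// For always-on services: missing data counts as a breach
new cloudwatch.Alarm(this, 'AlwaysOnAlarm', { /* ... */ treatMissingData: cloudwatch.TreatMissingData.BREACHING });
// To surface gaps as INSUFFICIENT_DATA instead of OK or ALARM (the default)
new cloudwatch.Alarm(this, 'DefaultAlarm', { /* ... */ treatMissingData: cloudwatch.TreatMissingData.MISSING });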
```
### Alarm Naming Conventions
```typescript
// Pattern: <service>-<metric>-<severity>
'lambda-errors-critical'
'api-latency-warning'
'rds-cpu-warning'
'ecs-tasks-critical'
```
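
A small helper can keep names consistent with this pattern. The sketch below is one possible implementation; `alarmName` and the `Severity` type are hypothetical and not part of the CDK.

```typescript
type Severity = 'info' | 'warning' | 'critical';

// Lowercases the input and collapses anything that is not a letter or digit into single hyphens.
function slugify(value: string): string {
  return value
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Builds a name of the form <service>-<metric>-<severity>.
function alarmName(service: string, metric: string, severity: Severity): string {
  const parts = [slugify(service), slugify(metric)];
  if (parts.some((part) => part.length === 0)) {
    throw new Error('service and metric must contain at least one letter or digit');
  }
  return [...parts, severity].join('-');
}

console.log(alarmName('Lambda', 'Errors', 'critical')); // lambda-errors-critical
console.log(alarmName('API', 'Latency', 'warning'));    // api-latency-warning
```

The result can be passed as the `alarmName` property when constructing an alarm.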
### Alarm Actions Best Practices
1. **Separate topics by severity**:
   - Critical alarms → PagerDuty/on-call
   - Warning alarms → Slack/email
   - Info alarms → Metrics dashboard
2. **Include context in alarm description**:
   - Service name
   - Expected threshold
   - Troubleshooting runbook link
3. **Auto-remediation where possible**:
   - Lambda errors → automatic retry
   - CPU high → auto-scaling trigger
   - Disk full → automated cleanup
4. **Alarm fatigue prevention**:
   - Tune thresholds based on actual patterns
   - Use composite alarms to reduce noise
   - Implement proper evaluation periods
   - Regularly review and adjust alarms
## Monitoring Dashboard
### Recommended Dashboard Layout
**Service Overview**:
- Request count and rate
- Error count and percentage
- Latency (p50, p95, p99)
- Availability percentage
**Resource Utilization**:
- CPU utilization by service
- Memory utilization by service
- Network throughput
- Disk I/O
**Cost Metrics**:
- Daily spending by service
- Month-to-date costs
- Budget utilization
- Cost anomalies
**Security Metrics**:
- Failed login attempts
- IAM policy changes
- Security group modifications
- GuardDuty findings


@@ -0,0 +1,394 @@
# AWS Cost & Operations Patterns
Comprehensive patterns and best practices for AWS cost optimization, monitoring, and operational excellence.
## Table of Contents
- [Cost Optimization Patterns](#cost-optimization-patterns)
- [Monitoring Patterns](#monitoring-patterns)
- [Observability Patterns](#observability-patterns)
- [Security and Audit Patterns](#security-and-audit-patterns)
- [Troubleshooting Workflows](#troubleshooting-workflows)
## Cost Optimization Patterns
### Pattern 1: Cost Estimation Before Deployment
**When**: Before deploying any new infrastructure
**MCP Server**: AWS Pricing MCP
**Steps**:
1. List all resources to be deployed
2. Query pricing for each resource type
3. Calculate monthly costs based on expected usage
4. Compare pricing across regions
5. Document cost estimates in architecture docs
**Example**:
```
Resource: Lambda Function
- Invocations: 1,000,000/month
- Duration: 3 seconds avg
- Memory: 512 MB
- Region: us-east-1
Estimated cost: $X/month
```
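
The arithmetic behind such an estimate can be sketched as follows. The two price constants are assumptions (commonly published on-demand rates for x86 Lambda in us-east-1) and should be checked against the AWS Pricing MCP before use; the free tier is ignored.

```typescript
// Assumed prices; verify against current AWS pricing for the target region and architecture.
const PRICE_PER_MILLION_REQUESTS_USD = 0.2;
const PRICE_PER_GB_SECOND_USD = 0.0000166667;

interface LambdaUsage {
  invocationsPerMonth: number;
  avgDurationSeconds: number;
  memoryMb: number;
}

// Monthly cost before any free-tier allowance: request charge plus compute (GB-second) charge.
function estimateLambdaMonthlyCostUsd(usage: LambdaUsage): number {
  const requestCost = (usage.invocationsPerMonth / 1e6) * PRICE_PER_MILLION_REQUESTS_USD;
  const gbSeconds = usage.invocationsPerMonth * usage.avgDurationSeconds * (usage.memoryMb / 1024);
  return requestCost + gbSeconds * PRICE_PER_GB_SECOND_USD;
}

const lambdaEstimate = estimateLambdaMonthlyCostUsd({
  invocationsPerMonth: 1e6,
  avgDurationSeconds: 3,
  memoryMb: 512,
});
console.log(`Estimated cost: $${lambdaEstimate.toFixed(2)}/month`);
```

With these assumed prices, the example workload comes to roughly $25.20 per month, almost all of it compute time rather than request charges.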
### Pattern 2: Monthly Cost Review
**When**: First week of every month
**MCP Servers**: Cost Explorer MCP, Billing and Cost Management MCP
**Steps**:
1. Review total spending vs. budget
2. Analyze cost by service (top 5 services)
3. Identify cost anomalies (>20% increase)
4. Review cost by environment (dev/staging/prod)
5. Check cost allocation tag coverage
6. Generate cost optimization recommendations
**Key Metrics**:
- Month-over-month cost change
- Cost per environment
- Cost per application/project
- Untagged resource costs
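
Step 3 (anomalies above a 20% increase) can be expressed as a simple month-over-month comparison. This is a minimal sketch that assumes per-service totals have already been retrieved, for example from the Cost Explorer MCP; the `CostByService` and `CostAnomaly` types are hypothetical.

```typescript
// Hypothetical per-service monthly totals in USD.
type CostByService = Record<string, number>;

interface CostAnomaly {
  service: string;
  previousUsd: number;
  currentUsd: number;
  changePct: number; // Infinity when the service had no spend last month
}

// Flags services whose cost rose by more than `thresholdPct` month over month.
function findCostAnomalies(
  previous: CostByService,
  current: CostByService,
  thresholdPct = 20,
): CostAnomaly[] {
  const anomalies: CostAnomaly[] = [];
  for (const [service, currentUsd] of Object.entries(current)) {
    const previousUsd = previous[service] ?? 0;
    const changePct =
      previousUsd > 0 ? ((currentUsd - previousUsd) / previousUsd) * 100 : Infinity;
    if (currentUsd > 0 && changePct > thresholdPct) {
      anomalies.push({ service, previousUsd, currentUsd, changePct });
    }
  }
  // Largest absolute increase first, so the most expensive surprises lead the report.
  return anomalies.sort(
    (a, b) => b.currentUsd - b.previousUsd - (a.currentUsd - a.previousUsd),
  );
}

const lastMonth: CostByService = { EC2: 1000, S3: 200, Lambda: 50 };
const thisMonth: CostByService = { EC2: 1100, S3: 300, Lambda: 55, DynamoDB: 40 };

for (const anomaly of findCostAnomalies(lastMonth, thisMonth)) {
  console.log(`${anomaly.service}: $${anomaly.previousUsd} -> $${anomaly.currentUsd}`);
}
```

Sorting by absolute increase rather than percentage avoids ranking a tiny new service above a large one that grew substantially.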
### Pattern 3: Right-Sizing Resources
**When**: Quarterly or when utilization alerts trigger
**MCP Servers**: CloudWatch MCP, Cost Explorer MCP
**Steps**:
1. Query CloudWatch for resource utilization metrics
2. Identify over-provisioned resources (< 40% utilization)
3. Identify under-provisioned resources (> 80% utilization)
4. Calculate potential savings from right-sizing
5. Plan and execute right-sizing changes
6. Monitor post-change performance
**Common Right-Sizing Scenarios**:
- EC2 instances with low CPU utilization
- RDS instances with excess capacity
- DynamoDB tables with low read/write usage
- Lambda functions with excessive memory allocation
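
The utilization cut-offs from steps 2 and 3 translate directly into a classification function. The sketch below assumes average utilization per resource has already been queried from CloudWatch; the `UtilizationSample` type and the resource IDs are made up for illustration.

```typescript
type SizingVerdict = 'over-provisioned' | 'right-sized' | 'under-provisioned';

// Hypothetical summary of one resource's average utilization over the review window.
interface UtilizationSample {
  resourceId: string;
  avgUtilizationPct: number;
}

// Below 40% counts as over-provisioned, above 80% as under-provisioned.
function classifyUtilization(pct: number): SizingVerdict {
  if (pct < 40) return 'over-provisioned';
  if (pct > 80) return 'under-provisioned';
  return 'right-sized';
}

// Groups resource IDs by verdict so each group can be reviewed separately.
function rightSizingReport(samples: UtilizationSample[]): Record<SizingVerdict, string[]> {
  const report: Record<SizingVerdict, string[]> = {
    'over-provisioned': [],
    'right-sized': [],
    'under-provisioned': [],
  };
  for (const sample of samples) {
    report[classifyUtilization(sample.avgUtilizationPct)].push(sample.resourceId);
  }
  return report;
}

const sizingReport = rightSizingReport([
  { resourceId: 'web-1', avgUtilizationPct: 12 },
  { resourceId: 'api-1', avgUtilizationPct: 63 },
  { resourceId: 'db-1', avgUtilizationPct: 91 },
]);
console.log(JSON.stringify(sizingReport, null, 2));
```

Averages can hide bursts, so peak utilization is worth checking before downsizing anything flagged as over-provisioned.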
### Pattern 4: Unused Resource Cleanup
**When**: Monthly or triggered by cost anomalies
**MCP Servers**: Cost Explorer MCP, CloudTrail MCP
**Steps**:
1. Identify resources with zero usage
2. Query CloudTrail for last access time
3. Tag resources for deletion review
4. Notify resource owners
5. Delete confirmed unused resources
6. Track cost savings
**Common Unused Resources**:
- Unattached EBS volumes
- Old EBS snapshots
- Idle Load Balancers
- Unused Elastic IPs
- Old, unused AMIs (and the snapshots that back them)
- Stopped EC2 instances (long-term)
## Monitoring Patterns
### Pattern 1: Critical Service Monitoring
**When**: All production services
**MCP Server**: CloudWatch MCP
**Metrics to Monitor**:
- **Availability**: Service uptime, health checks
- **Performance**: Latency, response time
- **Errors**: Error rate, failed requests
- **Saturation**: CPU, memory, disk, network utilization
**Alarm Thresholds** (adjust based on SLAs):
- Error rate: > 1% for 2 consecutive periods
- Latency: p99 > 1 second for 5 minutes
- CPU: > 80% for 10 minutes
- Memory: > 85% for 5 minutes
### Pattern 2: Lambda Function Monitoring
**MCP Server**: CloudWatch MCP
**Key Metrics**:
```
- Invocations (Count)
- Errors (Count, %)
- Duration (Average, p99)
- Throttles (Count)
- ConcurrentExecutions (Max)
- IteratorAge (for stream processing)
```
**Recommended Alarms**:
- Error rate > 1%
- Duration > 80% of timeout
- Throttles > 0
- ConcurrentExecutions > 80% of reserved
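
The four recommended alarms can be checked against a window of metric values as shown below. This is an illustrative sketch only: `LambdaWindowStats` and `LambdaLimits` are hypothetical types, and real alarms would be evaluated by CloudWatch rather than in application code.

```typescript
// Hypothetical aggregate of Lambda metrics over one evaluation window.
interface LambdaWindowStats {
  invocations: number;
  errors: number;
  throttles: number;
  maxDurationMs: number;
  maxConcurrentExecutions: number;
}

interface LambdaLimits {
  timeoutMs: number;
  reservedConcurrency: number;
}

// Returns the names of the recommended alarms that the window would breach.
function breachedLambdaAlarms(stats: LambdaWindowStats, limits: LambdaLimits): string[] {
  const breaches: string[] = [];
  // Errors is a count, so the rate has to be derived from Invocations.
  const errorRatePct = stats.invocations > 0 ? (stats.errors / stats.invocations) * 100 : 0;
  if (errorRatePct > 1) breaches.push('error-rate');
  if (stats.maxDurationMs > limits.timeoutMs * 0.8) breaches.push('duration');
  if (stats.throttles > 0) breaches.push('throttles');
  if (stats.maxConcurrentExecutions > limits.reservedConcurrency * 0.8) breaches.push('concurrency');
  return breaches;
}

const lambdaBreaches = breachedLambdaAlarms(
  { invocations: 10000, errors: 150, throttles: 0, maxDurationMs: 2000, maxConcurrentExecutions: 90 },
  { timeoutMs: 30000, reservedConcurrency: 100 },
);
console.log(lambdaBreaches.join(', '));
```

Note that the error-rate rule needs both the `Errors` and `Invocations` metrics, which is why a rate alarm in CloudWatch is usually built from a metric math expression rather than a single metric.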
### Pattern 3: API Gateway Monitoring
**MCP Server**: CloudWatch MCP
**Key Metrics**:
```
- Count (Total requests)
- 4XXError, 5XXError
- Latency (p50, p95, p99)
- IntegrationLatency
- CacheHitCount, CacheMissCount
```
**Recommended Alarms**:
- 5XX error rate > 0.5%
- 4XX error rate > 5%
- Latency p99 > 2 seconds
- Integration latency spike
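
For readers less familiar with percentile statistics, the sketch below computes p50, p95, and p99 from raw latency samples using the nearest-rank method. CloudWatch computes percentiles itself and may use a different method, so treat this purely as an explanation of what the numbers mean.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of samples are <= it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error('at least one sample is required');
  }
  if (p <= 0 || p > 100) {
    throw new Error('p must be in the range (0, 100]');
  }
  const sorted = [...samples].sort((a, b) => a - b);
  // Multiply before dividing to keep the rank exact for whole-number percentiles.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1] as number;
}

// Example request latencies in milliseconds.
const latenciesMs = [120, 180, 200, 250, 300, 450, 800, 1200, 2500, 150];

const p50 = percentile(latenciesMs, 50);
const p99 = percentile(latenciesMs, 99);
console.log(`p50=${p50} ms, p95=${percentile(latenciesMs, 95)} ms, p99=${p99} ms`);
console.log(`p99 above 2 seconds: ${p99 > 2000}`);
```

The median here looks healthy while the p99 breaches the 2-second threshold, which is why user-facing latency alarms are usually set on a high percentile rather than the average.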
### Pattern 4: Database Monitoring
**MCP Server**: CloudWatch MCP
**RDS Metrics**:
```
- CPUUtilization
- DatabaseConnections
- FreeableMemory
- ReadLatency, WriteLatency
- ReadIOPS, WriteIOPS
- FreeStorageSpace
```
**DynamoDB Metrics**:
```
- ConsumedReadCapacityUnits
- ConsumedWriteCapacityUnits
- UserErrors
- SystemErrors
- ThrottledRequests
```
**Recommended Alarms**:
- RDS CPU > 80% for 10 minutes
- RDS connections > 80% of max
- RDS free storage < 10 GB
- DynamoDB throttled requests > 0
- DynamoDB user errors spike
## Observability Patterns
### Pattern 1: Distributed Tracing Setup
**MCP Server**: CloudWatch Application Signals MCP
**Components**:
1. **Service Map**: Visualize service dependencies
2. **Traces**: Track requests across services
3. **Metrics**: Monitor latency and errors per service
4. **SLOs**: Define and track service level objectives
**Implementation**:
- Enable X-Ray tracing on Lambda functions
- Add X-Ray SDK to application code
- Configure sampling rules
- Create ServiceLens dashboards
### Pattern 2: Log Aggregation and Analysis
**MCP Server**: CloudWatch MCP
**Log Strategy**:
1. **Centralize Logs**: Send all application logs to CloudWatch Logs
2. **Structure Logs**: Use JSON format for structured logging
3. **Log Insights**: Use CloudWatch Logs Insights for queries
4. **Retention**: Set appropriate retention periods
**Example Log Insights Queries**:
```
# Find recent errors (set the time range to the last hour in the query window)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
# Count errors by type
stats count(*) as error_count by error_type
| sort error_count desc
# Calculate p99 latency
stats pct(duration, 99) by service_name
```
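
Queries like these only work when the application emits structured logs. A minimal sketch of a JSON log formatter follows; the field names `service_name`, `error_type`, and `duration` are assumptions chosen to match the queries above, and `formatLogLine` is a hypothetical helper rather than part of any AWS SDK.

```typescript
type LogLevel = 'INFO' | 'WARN' | 'ERROR';
type LogFields = Record<string, string | number | boolean>;

// Serializes one log event as a single JSON line, which CloudWatch Logs Insights
// can query by field name without custom parsing.
function formatLogLine(level: LogLevel, message: string, fields: LogFields = {}): string {
  // Core fields are written last so caller-supplied fields cannot overwrite them.
  return JSON.stringify({
    ...fields,
    timestamp: new Date().toISOString(),
    level,
    message,
  });
}

console.log(
  formatLogLine('ERROR', 'Payment provider timed out', {
    service_name: 'checkout',
    error_type: 'TimeoutError',
    duration: 3021,
  }),
);
```

Writing one JSON object per line to standard output is enough for Lambda and most container log drivers to deliver it to CloudWatch Logs.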
### Pattern 3: Custom Metrics
**MCP Server**: CloudWatch MCP
**When to Use Custom Metrics**:
- Business-specific KPIs (orders/minute, revenue/hour)
- Application-specific metrics (cache hit rate, queue depth)
- Performance metrics not provided by AWS
**Best Practices**:
- Use consistent namespace: `CompanyName/ApplicationName`
- Include relevant dimensions (environment, region, version)
- Publish metrics at appropriate intervals
- Use metric filters for log-derived metrics
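
The naming and dimension practices above can be captured in a small builder. The sketch below produces an object that mirrors the shape of a CloudWatch `PutMetricData` request but sends nothing; the local interfaces, the `buildCustomMetric` helper, and the example names are all assumptions for illustration.

```typescript
// Local types mirroring the PutMetricData request shape; a subset, not the SDK's own types.
interface MetricDimension {
  Name: string;
  Value: string;
}

type MetricUnit = 'Count' | 'Milliseconds' | 'Percent' | 'None';

interface MetricDatum {
  MetricName: string;
  Dimensions: MetricDimension[];
  Timestamp: Date;
  Value: number;
  Unit: MetricUnit;
}

interface PutMetricDataInput {
  Namespace: string;
  MetricData: MetricDatum[];
}

// Builds a request with a CompanyName/ApplicationName namespace and explicit dimensions.
function buildCustomMetric(
  company: string,
  application: string,
  metricName: string,
  value: number,
  unit: MetricUnit,
  dimensions: Record<string, string>,
): PutMetricDataInput {
  return {
    Namespace: `${company}/${application}`,
    MetricData: [
      {
        MetricName: metricName,
        Dimensions: Object.entries(dimensions).map(([Name, Value]) => ({ Name, Value })),
        Timestamp: new Date(),
        Value: value,
        Unit: unit,
      },
    ],
  };
}

const metricInput = buildCustomMetric('ExampleCorp', 'Checkout', 'OrdersPerMinute', 42, 'Count', {
  Environment: 'prod',
  Region: 'us-east-1',
  Version: '1.4.2',
});
console.log(JSON.stringify(metricInput, null, 2));
```

Each distinct combination of dimension values is stored as a separate metric, so high-cardinality values such as request IDs are best kept out of dimensions.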
## Security and Audit Patterns
### Pattern 1: API Activity Auditing
**MCP Server**: CloudTrail MCP
**Regular Audit Queries**:
```
# Find all IAM changes
eventName: CreateUser, DeleteUser, AttachUserPolicy, etc.
Time: Last 24 hours
# Track S3 bucket deletions
eventName: DeleteBucket
Time: Last 7 days
# Find failed login attempts
eventName: ConsoleLogin
responseElements.ConsoleLogin: Failure
# Monitor privileged actions
userIdentity.arn: *admin* OR *root*
```
**Audit Schedule**:
- Daily: Review privileged user actions
- Weekly: Audit IAM changes and security group modifications
- Monthly: Comprehensive security review
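
Once events have been retrieved, the first audit query amounts to a filter over event records. The sketch below assumes events are available as plain objects using a small subset of the CloudTrail record fields; the `TrailEvent` type, the event list, and the sample ARNs are illustrative rather than exhaustive.

```typescript
// Minimal subset of a CloudTrail event record, for illustration.
interface TrailEvent {
  eventTime: string; // ISO 8601 timestamp
  eventSource: string;
  eventName: string;
  userIdentity: { type: string; arn?: string };
}

// A few IAM write operations; a real audit would cover many more.
const IAM_CHANGE_EVENTS = new Set(['CreateUser', 'DeleteUser', 'AttachUserPolicy']);

// Returns IAM change events that occurred within the last `windowHours` before `now`.
function recentIamChanges(records: TrailEvent[], now: Date, windowHours = 24): TrailEvent[] {
  const cutoff = now.getTime() - windowHours * 60 * 60 * 1000;
  return records.filter(
    (record) =>
      record.eventSource === 'iam.amazonaws.com' &&
      IAM_CHANGE_EVENTS.has(record.eventName) &&
      Date.parse(record.eventTime) >= cutoff,
  );
}

const sampleRecords: TrailEvent[] = [
  {
    eventTime: '2025-01-01T12:00:00Z',
    eventSource: 'iam.amazonaws.com',
    eventName: 'CreateUser',
    userIdentity: { type: 'IAMUser', arn: 'arn:aws:iam::123456789012:user/alice' },
  },
  {
    eventTime: '2024-12-30T08:00:00Z',
    eventSource: 'iam.amazonaws.com',
    eventName: 'DeleteUser',
    userIdentity: { type: 'IAMUser', arn: 'arn:aws:iam::123456789012:user/bob' },
  },
  {
    eventTime: '2025-01-01T18:00:00Z',
    eventSource: 'ec2.amazonaws.com',
    eventName: 'DescribeInstances',
    userIdentity: { type: 'AssumedRole' },
  },
];

const iamChanges = recentIamChanges(sampleRecords, new Date('2025-01-02T00:00:00Z'));
for (const change of iamChanges) {
  console.log(`${change.eventTime} ${change.eventName} by ${change.userIdentity.arn ?? 'unknown'}`);
}
```

Passing `now` explicitly keeps the function deterministic, which makes the daily and weekly audit windows easy to test.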
### Pattern 2: Security Posture Assessment
**MCP Server**: Well-Architected Security Assessment Tool MCP
**Assessment Areas**:
1. **Identity and Access Management**
   - Least privilege implementation
   - MFA enforcement
   - Role-based access control
   - Service control policies
2. **Detective Controls**
   - CloudTrail enabled in all regions
   - GuardDuty findings review
   - Config rule compliance
   - Security Hub findings
3. **Infrastructure Protection**
   - VPC security groups review
   - Network ACLs configuration
   - AWS WAF rules
   - Security group ingress rules
4. **Data Protection**
   - Encryption at rest (S3, EBS, RDS)
   - Encryption in transit (TLS/SSL)
   - KMS key usage and rotation
   - Secrets Manager utilization
5. **Incident Response**
   - IR playbooks documented
   - Automated response procedures
   - Contact information current
   - Regular IR drills
**Assessment Frequency**:
- Quarterly: Full Well-Architected review
- Monthly: High-priority findings review
- Weekly: Critical security findings
### Pattern 3: Compliance Monitoring
**MCP Servers**: CloudTrail MCP, CloudWatch MCP
**Compliance Requirements**:
- Data residency (ensure data stays in approved regions)
- Access logging (all access logged and retained)
- Encryption requirements (data encrypted at rest and in transit)
- Change management (all changes tracked in CloudTrail)
**Compliance Dashboards**:
- Encryption coverage by service
- CloudTrail logging status
- Failed login attempts
- Privileged access usage
- Non-compliant resources
## Troubleshooting Workflows
### Workflow 1: High Lambda Error Rate
**MCP Servers**: CloudWatch MCP, CloudWatch Application Signals MCP
**Steps**:
1. Query CloudWatch for Lambda error metrics
2. Check error logs in CloudWatch Logs
3. Identify error patterns (timeout, memory, permission)
4. Check Lambda configuration (memory, timeout, permissions)
5. Review recent code deployments
6. Check downstream service health
7. Implement fix and monitor
### Workflow 2: Increased Latency
**MCP Servers**: CloudWatch MCP, CloudWatch Application Signals MCP
**Steps**:
1. Identify latency spike in CloudWatch metrics
2. Check service map for slow dependencies
3. Query distributed traces for slow requests
4. Check database query performance
5. Review API Gateway integration latency
6. Check Lambda cold starts
7. Identify bottleneck and optimize
### Workflow 3: Cost Spike Investigation
**MCP Servers**: Cost Explorer MCP, CloudWatch MCP, CloudTrail MCP
**Steps**:
1. Use Cost Explorer to identify service causing spike
2. Check CloudWatch metrics for usage increase
3. Review CloudTrail for recent resource creation
4. Identify root cause (misconfiguration, runaway process, attack)
5. Implement cost controls (budgets, alarms, service quotas)
6. Clean up unnecessary resources
### Workflow 4: Security Incident Response
**MCP Servers**: CloudTrail MCP, GuardDuty (via CloudWatch), Well-Architected Assessment MCP
**Steps**:
1. Identify security event in GuardDuty or CloudWatch
2. Query CloudTrail for related API activity
3. Determine scope and impact
4. Isolate affected resources
5. Revoke compromised credentials
6. Implement remediation
7. Conduct post-incident review
8. Update security controls
## Summary
- **Cost Optimization**: Use Pricing, Cost Explorer, and Billing MCPs for proactive cost management
- **Monitoring**: Set up comprehensive CloudWatch alarms for all critical services
- **Observability**: Implement distributed tracing and structured logging
- **Security**: Regular CloudTrail audits and Well-Architected assessments
- **Proactive**: Don't wait for incidents - monitor and optimize continuously