Initial commit

Zhongwei Li
2025-11-30 09:08:38 +08:00
commit 38b562e994
16 changed files with 7792 additions and 0 deletions

# CloudWatch Alarms Reference
Common CloudWatch alarm configurations for AWS services.
## Lambda Functions
### Error Rate Alarm
```typescript
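// Imports shared by the snippets in this reference (assumes AWS CDK v2)
import { Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';

new cloudwatch.Alarm(this, 'LambdaErrorAlarm', {
  metric: lambdaFunction.metricErrors({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 10, // errors per 5-minute period
  evaluationPeriods: 1,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  alarmDescription: 'Lambda error count exceeded threshold',
});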
```
### Duration Alarm (Approaching Timeout)
```typescript
new cloudwatch.Alarm(this, 'LambdaDurationAlarm', {
metric: lambdaFunction.metricDuration({
statistic: 'Maximum',
period: Duration.minutes(5),
}),
threshold: (lambdaFunction.timeout ?? Duration.seconds(3)).toMilliseconds() * 0.8, // 80% of timeout (falls back to Lambda's 3-second default when none is set)
evaluationPeriods: 2,
alarmDescription: 'Lambda duration approaching timeout',
});
```
### Throttle Alarm
```typescript
new cloudwatch.Alarm(this, 'LambdaThrottleAlarm', {
metric: lambdaFunction.metricThrottles({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 5,
evaluationPeriods: 1,
alarmDescription: 'Lambda function is being throttled',
});
```
### Concurrent Executions Alarm
```typescript
new cloudwatch.Alarm(this, 'LambdaConcurrencyAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/Lambda',
metricName: 'ConcurrentExecutions',
dimensionsMap: {
FunctionName: lambdaFunction.functionName,
},
statistic: 'Maximum',
period: Duration.minutes(1),
}),
threshold: 100, // Adjust based on reserved concurrency
evaluationPeriods: 2,
alarmDescription: 'Lambda concurrent executions high',
});
```
## API Gateway
### 5XX Error Rate Alarm
```typescript
new cloudwatch.Alarm(this, 'Api5xxAlarm', {
metric: api.metricServerError({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 10,
evaluationPeriods: 1,
alarmDescription: 'API Gateway 5XX errors exceeded threshold',
});
```
### 4XX Error Rate Alarm
```typescript
new cloudwatch.Alarm(this, 'Api4xxAlarm', {
metric: api.metricClientError({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 50,
evaluationPeriods: 2,
alarmDescription: 'API Gateway 4XX errors exceeded threshold',
});
```
### Latency Alarm
```typescript
new cloudwatch.Alarm(this, 'ApiLatencyAlarm', {
metric: api.metricLatency({
statistic: 'p99',
period: Duration.minutes(5),
}),
threshold: 2000, // 2 seconds
evaluationPeriods: 2,
alarmDescription: 'API Gateway p99 latency exceeded threshold',
});
```
## DynamoDB
### Read Throttle Alarm
```typescript
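new cloudwatch.Alarm(this, 'DynamoDBReadThrottleAlarm', {
  // ReadThrottleEvents counts read requests that exceeded the table's read capacity
  metric: new cloudwatch.Metric({
    namespace: 'AWS/DynamoDB',
    metricName: 'ReadThrottleEvents',
    dimensionsMap: {
      TableName: table.tableName,
    },
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 5,
  evaluationPeriods: 1,
  alarmDescription: 'DynamoDB read operations being throttled',
});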
```
### Write Throttle Alarm
```typescript
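new cloudwatch.Alarm(this, 'DynamoDBWriteThrottleAlarm', {
  // WriteThrottleEvents counts write requests that exceeded the table's write capacity
  metric: new cloudwatch.Metric({
    namespace: 'AWS/DynamoDB',
    metricName: 'WriteThrottleEvents',
    dimensionsMap: {
      TableName: table.tableName,
    },
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 5,
  evaluationPeriods: 1,
  alarmDescription: 'DynamoDB write operations being throttled',
});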
```
### Consumed Capacity Alarm
```typescript
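new cloudwatch.Alarm(this, 'DynamoDBCapacityAlarm', {
  metric: table.metricConsumedReadCapacityUnits({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  // Sum is the total units consumed over the 5-minute (300 s) period, while
  // provisionedCapacity is assumed to be read capacity units per second,
  // so the threshold is scaled by the period length.
  threshold: provisionedCapacity * 300 * 0.8, // 80% of provisioned
  evaluationPeriods: 2,
  alarmDescription: 'DynamoDB consumed capacity approaching limit',
});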
```
## EC2 Instances
### CPU Utilization Alarm
```typescript
new cloudwatch.Alarm(this, 'EC2CpuAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/EC2',
metricName: 'CPUUtilization',
dimensionsMap: {
InstanceId: instance.instanceId,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 80,
evaluationPeriods: 3,
alarmDescription: 'EC2 CPU utilization high',
});
```
### Status Check Failed Alarm
```typescript
new cloudwatch.Alarm(this, 'EC2StatusCheckAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/EC2',
metricName: 'StatusCheckFailed',
dimensionsMap: {
InstanceId: instance.instanceId,
},
statistic: 'Maximum',
period: Duration.minutes(1),
}),
threshold: 1,
evaluationPeriods: 2,
alarmDescription: 'EC2 status check failed',
});
```
### Disk Space Alarm (Requires CloudWatch Agent)
```typescript
new cloudwatch.Alarm(this, 'EC2DiskAlarm', {
metric: new cloudwatch.Metric({
namespace: 'CWAgent',
metricName: 'disk_used_percent',
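// Dimensions must exactly match those published by the agent (its default disk metrics also carry device and fstype)
dimensionsMap: {
InstanceId: instance.instanceId,
path: '/',
},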
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 85,
evaluationPeriods: 2,
alarmDescription: 'EC2 disk space usage high',
});
```
## RDS Databases
### CPU Alarm
```typescript
new cloudwatch.Alarm(this, 'RDSCpuAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/RDS',
metricName: 'CPUUtilization',
dimensionsMap: {
DBInstanceIdentifier: dbInstance.instanceIdentifier,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 80,
evaluationPeriods: 3,
alarmDescription: 'RDS CPU utilization high',
});
```
### Connection Count Alarm
```typescript
new cloudwatch.Alarm(this, 'RDSConnectionAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/RDS',
metricName: 'DatabaseConnections',
dimensionsMap: {
DBInstanceIdentifier: dbInstance.instanceIdentifier,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: maxConnections * 0.8, // 80% of max connections
evaluationPeriods: 2,
alarmDescription: 'RDS connection count approaching limit',
});
```
### Free Storage Space Alarm
```typescript
new cloudwatch.Alarm(this, 'RDSStorageAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/RDS',
metricName: 'FreeStorageSpace',
dimensionsMap: {
DBInstanceIdentifier: dbInstance.instanceIdentifier,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 10 * 1024 * 1024 * 1024, // 10 GB in bytes
comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
evaluationPeriods: 1,
alarmDescription: 'RDS free storage space low',
});
```
## ECS Services
### Task Count Alarm (Requires Container Insights)
```typescript
new cloudwatch.Alarm(this, 'ECSTaskCountAlarm', {
metric: new cloudwatch.Metric({
namespace: 'ECS/ContainerInsights',
metricName: 'RunningTaskCount',
dimensionsMap: {
ServiceName: service.serviceName,
ClusterName: cluster.clusterName,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 1,
comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
evaluationPeriods: 2,
alarmDescription: 'ECS service has no running tasks',
});
```
### CPU Utilization Alarm
```typescript
new cloudwatch.Alarm(this, 'ECSCpuAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ECS',
metricName: 'CPUUtilization',
dimensionsMap: {
ServiceName: service.serviceName,
ClusterName: cluster.clusterName,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 80,
evaluationPeriods: 3,
alarmDescription: 'ECS service CPU utilization high',
});
```
### Memory Utilization Alarm
```typescript
new cloudwatch.Alarm(this, 'ECSMemoryAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ECS',
metricName: 'MemoryUtilization',
dimensionsMap: {
ServiceName: service.serviceName,
ClusterName: cluster.clusterName,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 85,
evaluationPeriods: 2,
alarmDescription: 'ECS service memory utilization high',
});
```
## SQS Queues
### Queue Depth Alarm
```typescript
new cloudwatch.Alarm(this, 'SQSDepthAlarm', {
metric: queue.metricApproximateNumberOfMessagesVisible({
statistic: 'Maximum',
period: Duration.minutes(5),
}),
threshold: 1000,
evaluationPeriods: 2,
alarmDescription: 'SQS queue depth exceeded threshold',
});
```
### Age of Oldest Message Alarm
```typescript
new cloudwatch.Alarm(this, 'SQSAgeAlarm', {
metric: queue.metricApproximateAgeOfOldestMessage({
statistic: 'Maximum',
period: Duration.minutes(5),
}),
threshold: 300, // 5 minutes in seconds
evaluationPeriods: 1,
alarmDescription: 'SQS messages not being processed timely',
});
```
## Application Load Balancer
### Target Health Alarm
```typescript
new cloudwatch.Alarm(this, 'ALBUnhealthyTargetAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ApplicationELB',
metricName: 'UnHealthyHostCount',
dimensionsMap: {
LoadBalancer: loadBalancer.loadBalancerFullName,
TargetGroup: targetGroup.targetGroupFullName,
},
statistic: 'Average',
period: Duration.minutes(5),
}),
threshold: 1,
evaluationPeriods: 2,
alarmDescription: 'ALB has unhealthy targets',
});
```
### HTTP 5XX Alarm
```typescript
new cloudwatch.Alarm(this, 'ALB5xxAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ApplicationELB',
metricName: 'HTTPCode_Target_5XX_Count',
dimensionsMap: {
LoadBalancer: loadBalancer.loadBalancerFullName,
},
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 10,
evaluationPeriods: 1,
alarmDescription: 'ALB target 5XX errors exceeded threshold',
});
```
### Response Time Alarm
```typescript
new cloudwatch.Alarm(this, 'ALBLatencyAlarm', {
metric: new cloudwatch.Metric({
namespace: 'AWS/ApplicationELB',
metricName: 'TargetResponseTime',
dimensionsMap: {
LoadBalancer: loadBalancer.loadBalancerFullName,
},
statistic: 'p99',
period: Duration.minutes(5),
}),
threshold: 1, // 1 second
evaluationPeriods: 2,
alarmDescription: 'ALB p99 response time exceeded threshold',
});
```
## Composite Alarms
### Service Health Composite Alarm
```typescript
const errorAlarm = new cloudwatch.Alarm(this, 'ErrorAlarm', { /* ... */ });
const latencyAlarm = new cloudwatch.Alarm(this, 'LatencyAlarm', { /* ... */ });
const throttleAlarm = new cloudwatch.Alarm(this, 'ThrottleAlarm', { /* ... */ });
new cloudwatch.CompositeAlarm(this, 'ServiceHealthAlarm', {
compositeAlarmName: 'service-health',
alarmRule: cloudwatch.AlarmRule.anyOf(
errorAlarm,
latencyAlarm,
throttleAlarm
),
alarmDescription: 'Overall service health degraded',
});
```
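A composite alarm is ultimately driven by a rule expression over its child alarms. The sketch below shows, in plain TypeScript, the kind of expression an "any of" or "all of" rule corresponds to. `anyOfRule` and `allOfRule` are hypothetical helpers written for illustration, and the alarm names are made-up examples; CDK builds the equivalent expression for you from the alarm constructs.

```typescript
// Renders an "any of" rule: the composite fires when at least one child is in ALARM.
// Alarm names are assumed not to contain double quotes.
function anyOfRule(alarmNames: string[]): string {
  if (alarmNames.length === 0) {
    throw new Error('At least one alarm name is required');
  }
  return alarmNames.map((name) => `ALARM("${name}")`).join(' OR ');
}

// Renders an "all of" rule: the composite fires only when every child is in ALARM.
function allOfRule(alarmNames: string[]): string {
  if (alarmNames.length === 0) {
    throw new Error('At least one alarm name is required');
  }
  return alarmNames.map((name) => `ALARM("${name}")`).join(' AND ');
}

const serviceHealthRule = anyOfRule([
  'lambda-errors-critical',
  'api-latency-warning',
  'lambda-throttles-warning',
]);
console.log(serviceHealthRule);
```

An "all of" rule is the stricter choice when a single noisy child alarm should not page anyone on its own.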
## Alarm Actions
### SNS Topic Integration
```typescript
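// Module paths assume AWS CDK v2
import * as sns from 'aws-cdk-lib/aws-sns';
import * as subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import * as actions from 'aws-cdk-lib/aws-cloudwatch-actions';

const topic = new sns.Topic(this, 'AlarmTopic', {
  displayName: 'CloudWatch Alarms',
});
// Email subscription
topic.addSubscription(new subscriptions.EmailSubscription('ops@example.com'));
// Notify when the alarm fires and again when it recovers
alarm.addAlarmAction(new actions.SnsAction(topic));
alarm.addOkAction(new actions.SnsAction(topic));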
```
### Auto Scaling Action
```typescript
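// Step scaling is attached to the Auto Scaling group; `autoScalingGroup` is assumed
// to be an autoscaling.AutoScalingGroup defined elsewhere in the stack.
const scalingAction = autoScalingGroup.scaleOnMetric('ScaleUp', {
  metric: targetGroup.metricTargetResponseTime(), // seconds
  scalingSteps: [
    { upper: 1, change: 0 },
    { lower: 1, change: +1 },
    { lower: 2, change: +2 },
  ],
});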
```
## Alarm Best Practices
### Threshold Selection
**CPU/Memory Alarms**:
- Warning: 70-80%
- Critical: 80-90%
- Consider burst patterns and normal usage
**Error Rate Alarms**:
- Threshold based on SLA (e.g., 99.9% = 0.1% error rate)
- Account for normal error rates
- Different thresholds for different error types
**Latency Alarms**:
- p99 latency for user-facing APIs
- Warning: 80% of SLA target
- Critical: 100% of SLA target
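The error-rate and latency rules above reduce to simple arithmetic. A minimal sketch follows; the helper names `allowedErrorRatePercent` and `latencyThresholdsMs` are hypothetical, and the 2000 ms target is only an example value.

```typescript
// Error budget (in percent) implied by an availability SLA, e.g. 99.9 gives about 0.1.
function allowedErrorRatePercent(slaPercent: number): number {
  if (slaPercent < 0 || slaPercent > 100) {
    throw new Error('SLA must be between 0 and 100');
  }
  return 100 - slaPercent;
}

// Warning at 80% of the SLA latency target, critical at 100% of it.
function latencyThresholdsMs(slaTargetMs: number): { warning: number; critical: number } {
  return { warning: slaTargetMs * 0.8, critical: slaTargetMs };
}

const errorBudget = allowedErrorRatePercent(99.9);
const latency = latencyThresholdsMs(2000);
console.log(`Error rate threshold: ${errorBudget.toFixed(2)}%`);
console.log(`Latency warning: ${latency.warning} ms, critical: ${latency.critical} ms`);
```

The error budget is rounded for display because floating-point subtraction does not yield exactly 0.1.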
### Evaluation Periods
**Fast-changing metrics** (1-2 periods):
- Error counts
- Failed health checks
- Critical application errors
**Slow-changing metrics** (3-5 periods):
- CPU utilization
- Memory usage
- Disk usage
**Cost-related metrics** (longer periods):
- Daily spending
- Resource count changes
- Usage patterns
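To see why the number of evaluation periods matters, the sketch below models a simplified alarm: it fires only when each of the most recent `evaluationPeriods` datapoints breaches the threshold. This is an illustration rather than CloudWatch's actual evaluation logic; it assumes a greater-than-or-equal comparison and ignores missing data, and `wouldAlarm` is a hypothetical name.

```typescript
// True when the last `evaluationPeriods` datapoints are all at or above the threshold.
function wouldAlarm(datapoints: number[], threshold: number, evaluationPeriods: number): boolean {
  if (evaluationPeriods < 1 || datapoints.length < evaluationPeriods) {
    return false;
  }
  return datapoints.slice(-evaluationPeriods).every((value) => value >= threshold);
}

// One datapoint per period, oldest first.
const errorCounts = [0, 0, 0, 12]; // a single spike in the latest period
const cpuPercent = [82, 85, 91, 88]; // sustained high utilization

console.log(wouldAlarm(errorCounts, 10, 1)); // a 1-period alarm reacts to the spike
console.log(wouldAlarm(errorCounts, 10, 3)); // a 3-period alarm ignores it
console.log(wouldAlarm(cpuPercent, 80, 3)); // sustained load trips the 3-period alarm
```

Short windows suit metrics where a single bad period is already actionable; longer windows filter out brief spikes in metrics such as CPU.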
### Missing Data Handling
```typescript
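// treatMissingData is an alarm prop, set when the alarm is constructed (construct IDs here are placeholders)
// For intermittent workloads
new cloudwatch.Alarm(this, 'BatchJobAlarm', { /* ... */ treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING });
// For always-on services
new cloudwatch.Alarm(this, 'ApiAlarm', { /* ... */ treatMissingData: cloudwatch.TreatMissingData.BREACHING });
// To distinguish from data issues (the alarm moves to INSUFFICIENT_DATA instead)
new cloudwatch.Alarm(this, 'PipelineAlarm', { /* ... */ treatMissingData: cloudwatch.TreatMissingData.MISSING });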
```
### Alarm Naming Conventions
```typescript
// Pattern: <service>-<metric>-<severity>
'lambda-errors-critical'
'api-latency-warning'
'rds-cpu-warning'
'ecs-tasks-critical'
```
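One way to keep names consistent is to generate them rather than type them by hand. The following sketch assumes the `<service>-<metric>-<severity>` pattern above; `alarmName` is a hypothetical helper, not part of CDK.

```typescript
type Severity = 'critical' | 'warning' | 'info';

// Lowercases a name part and collapses anything that is not a letter or digit into hyphens.
function slug(part: string): string {
  return part
    .trim()
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '');
}

// Builds an alarm name following <service>-<metric>-<severity>.
function alarmName(service: string, metric: string, severity: Severity): string {
  return [slug(service), slug(metric), severity].join('-');
}

console.log(alarmName('Lambda', 'Errors', 'critical')); // lambda-errors-critical
console.log(alarmName('RDS', 'CPU', 'warning')); // rds-cpu-warning
```

The resulting string can be passed as the `alarmName` prop so that dashboards and notification routing can key off the severity suffix.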
### Alarm Actions Best Practices
1. **Separate topics by severity**:
- Critical alarms → PagerDuty/on-call
- Warning alarms → Slack/email
- Info alarms → Metrics dashboard
2. **Include context in alarm description**:
- Service name
- Expected threshold
- Troubleshooting runbook link
3. **Auto-remediation where possible**:
- Lambda errors → automatic retry
- CPU high → auto-scaling trigger
- Disk full → automated cleanup
4. **Alarm fatigue prevention**:
- Tune thresholds based on actual patterns
- Use composite alarms to reduce noise
- Implement proper evaluation periods
- Regularly review and adjust alarms
## Monitoring Dashboard
### Recommended Dashboard Layout
**Service Overview**:
- Request count and rate
- Error count and percentage
- Latency (p50, p95, p99)
- Availability percentage
**Resource Utilization**:
- CPU utilization by service
- Memory utilization by service
- Network throughput
- Disk I/O
**Cost Metrics**:
- Daily spending by service
- Month-to-date costs
- Budget utilization
- Cost anomalies
**Security Metrics**:
- Failed login attempts
- IAM policy changes
- Security group modifications
- GuardDuty findings

# AWS Cost & Operations Patterns
Comprehensive patterns and best practices for AWS cost optimization, monitoring, and operational excellence.
## Table of Contents
- [Cost Optimization Patterns](#cost-optimization-patterns)
- [Monitoring Patterns](#monitoring-patterns)
- [Observability Patterns](#observability-patterns)
- [Security and Audit Patterns](#security-and-audit-patterns)
- [Troubleshooting Workflows](#troubleshooting-workflows)
## Cost Optimization Patterns
### Pattern 1: Cost Estimation Before Deployment
**When**: Before deploying any new infrastructure
**MCP Server**: AWS Pricing MCP
**Steps**:
1. List all resources to be deployed
2. Query pricing for each resource type
3. Calculate monthly costs based on expected usage
4. Compare pricing across regions
5. Document cost estimates in architecture docs
**Example**:
```
Resource: Lambda Function
- Invocations: 1,000,000/month
- Duration: 3 seconds avg
- Memory: 512 MB
- Region: us-east-1
Estimated cost: $X/month
```
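As a worked version of the example above, the sketch below estimates the monthly bill from the listed inputs. The two prices are assumptions (x86 on-demand rates for us-east-1 of $0.20 per million requests and $0.0000166667 per GB-second) and should be confirmed through the AWS Pricing MCP before use; the free tier is ignored, and `estimateLambdaMonthlyCost` is a hypothetical helper.

```typescript
// Assumed us-east-1 x86 on-demand prices; verify against current AWS pricing.
const PRICE_PER_MILLION_REQUESTS_USD = 0.2;
const PRICE_PER_GB_SECOND_USD = 0.0000166667;

interface LambdaUsage {
  invocationsPerMonth: number;
  avgDurationSeconds: number;
  memoryMb: number;
}

// Request charge plus compute (GB-second) charge, with the free tier ignored.
function estimateLambdaMonthlyCost(usage: LambdaUsage): number {
  const requestCost = (usage.invocationsPerMonth / 1000000) * PRICE_PER_MILLION_REQUESTS_USD;
  const gbSeconds = usage.invocationsPerMonth * usage.avgDurationSeconds * (usage.memoryMb / 1024);
  const computeCost = gbSeconds * PRICE_PER_GB_SECOND_USD;
  return requestCost + computeCost;
}

const lambdaEstimate = estimateLambdaMonthlyCost({
  invocationsPerMonth: 1000000,
  avgDurationSeconds: 3,
  memoryMb: 512,
});
console.log(`Estimated cost: $${lambdaEstimate.toFixed(2)}/month`);
```

Under these assumed prices the inputs come to 1,500,000 GB-seconds, or roughly $25.00 of compute plus $0.20 of requests, about $25.20 per month.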
### Pattern 2: Monthly Cost Review
**When**: First week of every month
**MCP Servers**: Cost Explorer MCP, Billing and Cost Management MCP
**Steps**:
1. Review total spending vs. budget
2. Analyze cost by service (top 5 services)
3. Identify cost anomalies (>20% increase)
4. Review cost by environment (dev/staging/prod)
5. Check cost allocation tag coverage
6. Generate cost optimization recommendations
**Key Metrics**:
- Month-over-month cost change
- Cost per environment
- Cost per application/project
- Untagged resource costs
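The anomaly check in step 3 can be sketched as a month-over-month comparison per service. The data shape and the `findCostAnomalies` helper are assumptions for illustration; in practice the figures would come from the Cost Explorer MCP.

```typescript
interface ServiceCost {
  service: string;
  previousMonthUsd: number;
  currentMonthUsd: number;
}

interface CostAnomaly {
  service: string;
  changePercent: number;
}

// Flags services whose month-over-month increase exceeds the threshold (20% by default).
// A service with no spend last month and some spend this month is reported as Infinity.
function findCostAnomalies(costs: ServiceCost[], thresholdPercent: number = 20): CostAnomaly[] {
  return costs
    .map((c) => ({
      service: c.service,
      changePercent:
        c.previousMonthUsd === 0
          ? (c.currentMonthUsd > 0 ? Infinity : 0)
          : ((c.currentMonthUsd - c.previousMonthUsd) / c.previousMonthUsd) * 100,
    }))
    .filter((a) => a.changePercent > thresholdPercent)
    .sort((a, b) => b.changePercent - a.changePercent);
}

// Made-up sample figures.
const anomalies = findCostAnomalies([
  { service: 'EC2', previousMonthUsd: 1000, currentMonthUsd: 1300 },
  { service: 'S3', previousMonthUsd: 200, currentMonthUsd: 210 },
  { service: 'Lambda', previousMonthUsd: 50, currentMonthUsd: 75 },
]);
for (const a of anomalies) {
  console.log(`${a.service}: +${a.changePercent.toFixed(1)}%`);
}
```

Sorting by percentage change surfaces the steepest increases first; sorting by absolute dollar change is a reasonable alternative when small services dominate the list.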
### Pattern 3: Right-Sizing Resources
**When**: Quarterly or when utilization alerts trigger
**MCP Servers**: CloudWatch MCP, Cost Explorer MCP
**Steps**:
1. Query CloudWatch for resource utilization metrics
2. Identify over-provisioned resources (< 40% utilization)
3. Identify under-provisioned resources (> 80% utilization)
4. Calculate potential savings from right-sizing
5. Plan and execute right-sizing changes
6. Monitor post-change performance
**Common Right-Sizing Scenarios**:
- EC2 instances with low CPU utilization
- RDS instances with excess capacity
- DynamoDB tables with low read/write usage
- Lambda functions with excessive memory allocation
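Steps 2 and 3 amount to bucketing each resource by its average utilization. A minimal sketch using the 40% and 80% cut-offs above follows; `classifyUtilization` and the resource names are hypothetical.

```typescript
type SizingVerdict = 'over-provisioned' | 'under-provisioned' | 'right-sized';

// Below lowPercent is over-provisioned; above highPercent is under-provisioned.
function classifyUtilization(
  avgUtilizationPercent: number,
  lowPercent: number = 40,
  highPercent: number = 80,
): SizingVerdict {
  if (avgUtilizationPercent < lowPercent) return 'over-provisioned';
  if (avgUtilizationPercent > highPercent) return 'under-provisioned';
  return 'right-sized';
}

// Made-up average CPU utilization per resource.
const fleet: Record<string, number> = {
  'web-server-1': 12,
  'batch-worker-1': 93,
  'orders-db': 55,
};
for (const [name, utilization] of Object.entries(fleet)) {
  console.log(`${name}: ${classifyUtilization(utilization)}`);
}
```

Averages hide bursts, so it is worth checking peak utilization as well before downsizing anything flagged as over-provisioned.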
### Pattern 4: Unused Resource Cleanup
**When**: Monthly or triggered by cost anomalies
**MCP Servers**: Cost Explorer MCP, CloudTrail MCP
**Steps**:
1. Identify resources with zero usage
2. Query CloudTrail for last access time
3. Tag resources for deletion review
4. Notify resource owners
5. Delete confirmed unused resources
6. Track cost savings
**Common Unused Resources**:
- Unattached EBS volumes
- Old EBS snapshots
- Idle Load Balancers
- Unused Elastic IPs
- Old AMIs (and the snapshots backing them)
- Stopped EC2 instances (long-term)
## Monitoring Patterns
### Pattern 1: Critical Service Monitoring
**When**: All production services
**MCP Server**: CloudWatch MCP
**Metrics to Monitor**:
- **Availability**: Service uptime, health checks
- **Performance**: Latency, response time
- **Errors**: Error rate, failed requests
- **Saturation**: CPU, memory, disk, network utilization
**Alarm Thresholds** (adjust based on SLAs):
- Error rate: > 1% for 2 consecutive periods
- Latency: p99 > 1 second for 5 minutes
- CPU: > 80% for 10 minutes
- Memory: > 85% for 5 minutes
### Pattern 2: Lambda Function Monitoring
**MCP Server**: CloudWatch MCP
**Key Metrics**:
```
- Invocations (Count)
- Errors (Count, %)
- Duration (Average, p99)
- Throttles (Count)
- ConcurrentExecutions (Max)
- IteratorAge (for stream processing)
```
**Recommended Alarms**:
- Error rate > 1%
- Duration > 80% of timeout
- Throttles > 0
- ConcurrentExecutions > 80% of reserved
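Lambda publishes errors and invocations as counts, so the error-rate percentage has to be derived from the two, and the duration and concurrency thresholds depend on the function's own settings. The sketch below shows that arithmetic; `errorRatePercent` and `lambdaAlarmThresholds` are hypothetical helpers, and the 30-second timeout and reserved concurrency of 100 are example values.

```typescript
// Error rate as a percentage of invocations; 0 when there were no invocations,
// so idle periods do not divide by zero.
function errorRatePercent(errors: number, invocations: number): number {
  return invocations === 0 ? 0 : (errors / invocations) * 100;
}

interface LambdaAlarmThresholds {
  errorRatePercent: number;
  durationMs: number;
  throttles: number;
  concurrentExecutions: number;
}

// Thresholds following the recommendations above.
function lambdaAlarmThresholds(timeoutSeconds: number, reservedConcurrency: number): LambdaAlarmThresholds {
  return {
    errorRatePercent: 1,
    durationMs: timeoutSeconds * 1000 * 0.8, // 80% of timeout
    throttles: 0,
    concurrentExecutions: reservedConcurrency * 0.8, // 80% of reserved
  };
}

const thresholds = lambdaAlarmThresholds(30, 100);
const observedRate = errorRatePercent(15, 1000);
console.log(`Error rate ${observedRate.toFixed(2)}% breaches: ${observedRate > thresholds.errorRatePercent}`);
console.log(`Duration threshold: ${thresholds.durationMs} ms`);
```

In CloudWatch itself the same ratio is usually expressed with metric math over the `Errors` and `Invocations` metrics.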
### Pattern 3: API Gateway Monitoring
**MCP Server**: CloudWatch MCP
**Key Metrics**:
```
- Count (Total requests)
- 4XXError, 5XXError
- Latency (p50, p95, p99)
- IntegrationLatency
- CacheHitCount, CacheMissCount
```
**Recommended Alarms**:
- 5XX error rate > 0.5%
- 4XX error rate > 5%
- Latency p99 > 2 seconds
- Integration latency spike
### Pattern 4: Database Monitoring
**MCP Server**: CloudWatch MCP
**RDS Metrics**:
```
- CPUUtilization
- DatabaseConnections
- FreeableMemory
- ReadLatency, WriteLatency
- ReadIOPS, WriteIOPS
- FreeStorageSpace
```
**DynamoDB Metrics**:
```
- ConsumedReadCapacityUnits
- ConsumedWriteCapacityUnits
- UserErrors
- SystemErrors
- ThrottledRequests
```
**Recommended Alarms**:
- RDS CPU > 80% for 10 minutes
- RDS connections > 80% of max
- RDS free storage < 10 GB
- DynamoDB throttled requests > 0
- DynamoDB user errors spike
## Observability Patterns
### Pattern 1: Distributed Tracing Setup
**MCP Server**: CloudWatch Application Signals MCP
**Components**:
1. **Service Map**: Visualize service dependencies
2. **Traces**: Track requests across services
3. **Metrics**: Monitor latency and errors per service
4. **SLOs**: Define and track service level objectives
**Implementation**:
- Enable X-Ray tracing on Lambda functions
- Add X-Ray SDK to application code
- Configure sampling rules
- Create ServiceLens dashboards
### Pattern 2: Log Aggregation and Analysis
**MCP Server**: CloudWatch MCP
**Log Strategy**:
1. **Centralize Logs**: Send all application logs to CloudWatch Logs
2. **Structure Logs**: Use JSON format for structured logging
3. **Log Insights**: Use CloudWatch Logs Insights for queries
4. **Retention**: Set appropriate retention periods
**Example Log Insights Queries**:
```
# Find errors (set the query time range to the last hour)
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
# Count errors by type
stats count(*) as error_count by error_type
| sort error_count desc
# Calculate p99 latency
stats pct(duration, 99) by service_name
```
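Queries like these only work if the application emits the fields they reference. A minimal sketch of the structured-logging step from the log strategy follows; `formatLogLine` is a hypothetical helper, and the field names (`service_name`, `error_type`, `duration`) are chosen to match the sample queries rather than taken from any particular logging library.

```typescript
type LogLevel = 'INFO' | 'WARN' | 'ERROR';

// Builds a single-line JSON log entry; extra fields become top-level keys
// that Logs Insights can discover and query.
function formatLogLine(
  level: LogLevel,
  message: string,
  fields: Record<string, string | number> = {},
): string {
  return JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    ...fields,
  });
}

const logLine = formatLogLine('ERROR', 'Payment request timed out', {
  service_name: 'checkout',
  error_type: 'TimeoutError',
  duration: 3021, // milliseconds
});
console.log(logLine);
```

Writing one JSON object per line to stdout is enough for Lambda and most container log drivers to deliver it to CloudWatch Logs.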
### Pattern 3: Custom Metrics
**MCP Server**: CloudWatch MCP
**When to Use Custom Metrics**:
- Business-specific KPIs (orders/minute, revenue/hour)
- Application-specific metrics (cache hit rate, queue depth)
- Performance metrics not provided by AWS
**Best Practices**:
- Use consistent namespace: `CompanyName/ApplicationName`
- Include relevant dimensions (environment, region, version)
- Publish metrics at appropriate intervals
- Use metric filters for log-derived metrics
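One option for publishing custom metrics, not prescribed by this document, is the CloudWatch embedded metric format (EMF), where a structured log line carries both the values and the metric definition. The sketch below builds such a record; `buildEmfRecord` is a hypothetical helper, and the namespace, dimension values, and metric name are examples.

```typescript
interface EmfMetric {
  name: string;
  value: number;
  unit: string; // a CloudWatch unit such as 'Count' or 'Milliseconds'
}

// Builds one EMF log line: the _aws block declares the metrics, and the
// dimension and metric values sit alongside it as top-level keys.
function buildEmfRecord(
  namespace: string,
  dimensions: Record<string, string>,
  metrics: EmfMetric[],
  timestampMs: number = Date.now(),
): string {
  const record: Record<string, unknown> = {
    _aws: {
      Timestamp: timestampMs,
      CloudWatchMetrics: [
        {
          Namespace: namespace,
          Dimensions: [Object.keys(dimensions)],
          Metrics: metrics.map((m) => ({ Name: m.name, Unit: m.unit })),
        },
      ],
    },
    ...dimensions,
  };
  for (const m of metrics) {
    record[m.name] = m.value;
  }
  return JSON.stringify(record);
}

const emfLine = buildEmfRecord(
  'CompanyName/ApplicationName',
  { environment: 'prod', region: 'us-east-1', version: '1.4.2' },
  [{ name: 'OrdersPerMinute', value: 42, unit: 'Count' }],
);
console.log(emfLine);
```

Each distinct combination of dimension values becomes its own metric, so high-cardinality values such as request IDs are best kept out of the dimensions.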
## Security and Audit Patterns
### Pattern 1: API Activity Auditing
**MCP Server**: CloudTrail MCP
**Regular Audit Queries**:
```
# Find all IAM changes
eventName: CreateUser, DeleteUser, AttachUserPolicy, etc.
Time: Last 24 hours
# Track S3 bucket deletions
eventName: DeleteBucket
Time: Last 7 days
# Find failed login attempts
eventName: ConsoleLogin
responseElements.ConsoleLogin: Failure
# Monitor privileged actions
userIdentity.arn: *admin* OR *root*
```
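The first audit query can also be applied to events that have already been retrieved. The sketch below filters an in-memory list for IAM changes in the last 24 hours. The `TrailEvent` shape is a simplified assumption (real CloudTrail records nest the ARN under `userIdentity.arn`), `recentIamChanges` is a hypothetical helper, and the event-name set covers only the three names listed above.

```typescript
interface TrailEvent {
  eventName: string;
  eventTime: string; // ISO 8601 timestamp
  userArn?: string; // flattened from userIdentity.arn for brevity
}

// Only the event names spelled out in the query above; extend as needed.
const IAM_CHANGE_EVENTS = new Set(['CreateUser', 'DeleteUser', 'AttachUserPolicy']);

// Returns IAM change events that occurred within the window ending at `now`.
function recentIamChanges(events: TrailEvent[], now: Date, windowHours: number = 24): TrailEvent[] {
  const cutoff = now.getTime() - windowHours * 60 * 60 * 1000;
  return events.filter(
    (e) => IAM_CHANGE_EVENTS.has(e.eventName) && Date.parse(e.eventTime) >= cutoff,
  );
}

// Made-up sample events.
const sampleEvents: TrailEvent[] = [
  { eventName: 'CreateUser', eventTime: '2025-01-02T09:00:00Z', userArn: 'arn:aws:iam::111122223333:user/alice' },
  { eventName: 'DeleteBucket', eventTime: '2025-01-02T10:00:00Z' },
  { eventName: 'DeleteUser', eventTime: '2024-12-30T10:00:00Z' },
];
const iamChanges = recentIamChanges(sampleEvents, new Date('2025-01-02T12:00:00Z'));
console.log(iamChanges.map((e) => e.eventName));
```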
**Audit Schedule**:
- Daily: Review privileged user actions
- Weekly: Audit IAM changes and security group modifications
- Monthly: Comprehensive security review
### Pattern 2: Security Posture Assessment
**MCP Server**: Well-Architected Security Assessment Tool MCP
**Assessment Areas**:
1. **Identity and Access Management**
- Least privilege implementation
- MFA enforcement
- Role-based access control
- Service control policies
2. **Detective Controls**
- CloudTrail enabled in all regions
- GuardDuty findings review
- Config rule compliance
- Security Hub findings
3. **Infrastructure Protection**
- VPC security groups review
- Network ACLs configuration
- AWS WAF rules
- Security group ingress rules
4. **Data Protection**
- Encryption at rest (S3, EBS, RDS)
- Encryption in transit (TLS/SSL)
- KMS key usage and rotation
- Secrets Manager utilization
5. **Incident Response**
- IR playbooks documented
- Automated response procedures
- Contact information current
- Regular IR drills
**Assessment Frequency**:
- Quarterly: Full Well-Architected review
- Monthly: High-priority findings review
- Weekly: Critical security findings
### Pattern 3: Compliance Monitoring
**MCP Servers**: CloudTrail MCP, CloudWatch MCP
**Compliance Requirements**:
- Data residency (ensure data stays in approved regions)
- Access logging (all access logged and retained)
- Encryption requirements (data encrypted at rest and in transit)
- Change management (all changes tracked in CloudTrail)
**Compliance Dashboards**:
- Encryption coverage by service
- CloudTrail logging status
- Failed login attempts
- Privileged access usage
- Non-compliant resources
## Troubleshooting Workflows
### Workflow 1: High Lambda Error Rate
**MCP Servers**: CloudWatch MCP, CloudWatch Application Signals MCP
**Steps**:
1. Query CloudWatch for Lambda error metrics
2. Check error logs in CloudWatch Logs
3. Identify error patterns (timeout, memory, permission)
4. Check Lambda configuration (memory, timeout, permissions)
5. Review recent code deployments
6. Check downstream service health
7. Implement fix and monitor
### Workflow 2: Increased Latency
**MCP Servers**: CloudWatch MCP, CloudWatch Application Signals MCP
**Steps**:
1. Identify latency spike in CloudWatch metrics
2. Check service map for slow dependencies
3. Query distributed traces for slow requests
4. Check database query performance
5. Review API Gateway integration latency
6. Check Lambda cold starts
7. Identify bottleneck and optimize
### Workflow 3: Cost Spike Investigation
**MCP Servers**: Cost Explorer MCP, CloudWatch MCP, CloudTrail MCP
**Steps**:
1. Use Cost Explorer to identify service causing spike
2. Check CloudWatch metrics for usage increase
3. Review CloudTrail for recent resource creation
4. Identify root cause (misconfiguration, runaway process, attack)
5. Implement cost controls (budgets, alarms, service quotas)
6. Clean up unnecessary resources
### Workflow 4: Security Incident Response
**MCP Servers**: CloudTrail MCP, GuardDuty (via CloudWatch), Well-Architected Assessment MCP
**Steps**:
1. Identify security event in GuardDuty or CloudWatch
2. Query CloudTrail for related API activity
3. Determine scope and impact
4. Isolate affected resources
5. Revoke compromised credentials
6. Implement remediation
7. Conduct post-incident review
8. Update security controls
## Summary
- **Cost Optimization**: Use Pricing, Cost Explorer, and Billing MCPs for proactive cost management
- **Monitoring**: Set up comprehensive CloudWatch alarms for all critical services
- **Observability**: Implement distributed tracing and structured logging
- **Security**: Regular CloudTrail audits and Well-Architected assessments
- **Proactive**: Don't wait for incidents - monitor and optimize continuously