Initial commit

2025-11-30 09:08:38 +08:00
commit 38b562e994
16 changed files with 7792 additions and 0 deletions
--- a/skills/aws-cost-operations/references/cloudwatch-alarms.md
+++ b/skills/aws-cost-operations/references/cloudwatch-alarms.md
@@ -0,0 +1,567 @@
+# CloudWatch Alarms Reference
+
+Common CloudWatch alarm configurations for AWS services.
+
+## Lambda Functions
+
+### Error Rate Alarm
+```typescript
+new cloudwatch.Alarm(this, 'LambdaErrorAlarm', {
+  metric: lambdaFunction.metricErrors({
+    statistic: 'Sum',
+    period: Duration.minutes(5),
+  }),
+  threshold: 10,
+  evaluationPeriods: 1,
+  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
+  alarmDescription: 'Lambda error count exceeded threshold',
+});
+```
+
+### Duration Alarm (Approaching Timeout)
+```typescript
+new cloudwatch.Alarm(this, 'LambdaDurationAlarm', {
+  metric: lambdaFunction.metricDuration({
+    statistic: 'Maximum',
+    period: Duration.minutes(5),
+  }),
+  threshold: lambdaFunction.timeout.toMilliseconds() * 0.8, // 80% of timeout
+  evaluationPeriods: 2,
+  alarmDescription: 'Lambda duration approaching timeout',
+});
+```
+
+### Throttle Alarm
+```typescript
+new cloudwatch.Alarm(this, 'LambdaThrottleAlarm', {
+  metric: lambdaFunction.metricThrottles({
+    statistic: 'Sum',
+    period: Duration.minutes(5),
+  }),
+  threshold: 5,
+  evaluationPeriods: 1,
+  alarmDescription: 'Lambda function is being throttled',
+});
+```
+
+### Concurrent Executions Alarm
+```typescript
+new cloudwatch.Alarm(this, 'LambdaConcurrencyAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'AWS/Lambda',
+    metricName: 'ConcurrentExecutions',
+    dimensionsMap: {
+      FunctionName: lambdaFunction.functionName,
+    },
+    statistic: 'Maximum',
+    period: Duration.minutes(1),
+  }),
+  threshold: 100, // Adjust based on reserved concurrency
+  evaluationPeriods: 2,
+  alarmDescription: 'Lambda concurrent executions high',
+});
+```
+
+## API Gateway
+
+### 5XX Error Rate Alarm
+```typescript
+new cloudwatch.Alarm(this, 'Api5xxAlarm', {
+  metric: api.metricServerError({
+    statistic: 'Sum',
+    period: Duration.minutes(5),
+  }),
+  threshold: 10,
+  evaluationPeriods: 1,
+  alarmDescription: 'API Gateway 5XX errors exceeded threshold',
+});
+```
+
+### 4XX Error Rate Alarm
+```typescript
+new cloudwatch.Alarm(this, 'Api4xxAlarm', {
+  metric: api.metricClientError({
+    statistic: 'Sum',
+    period: Duration.minutes(5),
+  }),
+  threshold: 50,
+  evaluationPeriods: 2,
+  alarmDescription: 'API Gateway 4XX errors exceeded threshold',
+});
+```
+
+### Latency Alarm
+```typescript
+new cloudwatch.Alarm(this, 'ApiLatencyAlarm', {
+  metric: api.metricLatency({
+    statistic: 'p99',
+    period: Duration.minutes(5),
+  }),
+  threshold: 2000, // 2 seconds
+  evaluationPeriods: 2,
+  alarmDescription: 'API Gateway p99 latency exceeded threshold',
+});
+```
+
+## DynamoDB
+
+### Read Throttle Alarm
+```typescript
+new cloudwatch.Alarm(this, 'DynamoDBReadThrottleAlarm', {
+  metric: table.metricUserErrors({
+    dimensions: {
+      Operation: 'GetItem',
+    },
+    statistic: 'Sum',
+    period: Duration.minutes(5),
+  }),
+  threshold: 5,
+  evaluationPeriods: 1,
+  alarmDescription: 'DynamoDB read operations being throttled',
+});
+```
+
+### Write Throttle Alarm
+```typescript
+new cloudwatch.Alarm(this, 'DynamoDBWriteThrottleAlarm', {
+  metric: table.metricUserErrors({
+    dimensions: {
+      Operation: 'PutItem',
+    },
+    statistic: 'Sum',
+    period: Duration.minutes(5),
+  }),
+  threshold: 5,
+  evaluationPeriods: 1,
+  alarmDescription: 'DynamoDB write operations being throttled',
+});
+```
+
+### Consumed Capacity Alarm
+```typescript
+new cloudwatch.Alarm(this, 'DynamoDBCapacityAlarm', {
+  metric: table.metricConsumedReadCapacityUnits({
+    statistic: 'Sum',
+    period: Duration.minutes(5),
+  }),
+  threshold: provisionedCapacity * 0.8, // 80% of provisioned
+  evaluationPeriods: 2,
+  alarmDescription: 'DynamoDB consumed capacity approaching limit',
+});
+```
+
+## EC2 Instances
+
+### CPU Utilization Alarm
+```typescript
+new cloudwatch.Alarm(this, 'EC2CpuAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'AWS/EC2',
+    metricName: 'CPUUtilization',
+    dimensionsMap: {
+      InstanceId: instance.instanceId,
+    },
+    statistic: 'Average',
+    period: Duration.minutes(5),
+  }),
+  threshold: 80,
+  evaluationPeriods: 3,
+  alarmDescription: 'EC2 CPU utilization high',
+});
+```
+
+### Status Check Failed Alarm
+```typescript
+new cloudwatch.Alarm(this, 'EC2StatusCheckAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'AWS/EC2',
+    metricName: 'StatusCheckFailed',
+    dimensionsMap: {
+      InstanceId: instance.instanceId,
+    },
+    statistic: 'Maximum',
+    period: Duration.minutes(1),
+  }),
+  threshold: 1,
+  evaluationPeriods: 2,
+  alarmDescription: 'EC2 status check failed',
+});
+```
+
+### Disk Space Alarm (Requires CloudWatch Agent)
+```typescript
+new cloudwatch.Alarm(this, 'EC2DiskAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'CWAgent',
+    metricName: 'disk_used_percent',
+    dimensionsMap: {
+      InstanceId: instance.instanceId,
+      path: '/',
+    },
+    statistic: 'Average',
+    period: Duration.minutes(5),
+  }),
+  threshold: 85,
+  evaluationPeriods: 2,
+  alarmDescription: 'EC2 disk space usage high',
+});
+```
+
+## RDS Databases
+
+### CPU Alarm
+```typescript
+new cloudwatch.Alarm(this, 'RDSCpuAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'AWS/RDS',
+    metricName: 'CPUUtilization',
+    dimensionsMap: {
+      DBInstanceIdentifier: dbInstance.instanceIdentifier,
+    },
+    statistic: 'Average',
+    period: Duration.minutes(5),
+  }),
+  threshold: 80,
+  evaluationPeriods: 3,
+  alarmDescription: 'RDS CPU utilization high',
+});
+```
+
+### Connection Count Alarm
+```typescript
+new cloudwatch.Alarm(this, 'RDSConnectionAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'AWS/RDS',
+    metricName: 'DatabaseConnections',
+    dimensionsMap: {
+      DBInstanceIdentifier: dbInstance.instanceIdentifier,
+    },
+    statistic: 'Average',
+    period: Duration.minutes(5),
+  }),
+  threshold: maxConnections * 0.8, // 80% of max connections
+  evaluationPeriods: 2,
+  alarmDescription: 'RDS connection count approaching limit',
+});
+```
+
+### Free Storage Space Alarm
+```typescript
+new cloudwatch.Alarm(this, 'RDSStorageAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'AWS/RDS',
+    metricName: 'FreeStorageSpace',
+    dimensionsMap: {
+      DBInstanceIdentifier: dbInstance.instanceIdentifier,
+    },
+    statistic: 'Average',
+    period: Duration.minutes(5),
+  }),
+  threshold: 10 * 1024 * 1024 * 1024, // 10 GB in bytes
+  comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
+  evaluationPeriods: 1,
+  alarmDescription: 'RDS free storage space low',
+});
+```
+
+## ECS Services
+
+### Task Count Alarm
+```typescript
+new cloudwatch.Alarm(this, 'ECSTaskCountAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'ECS/ContainerInsights',
+    metricName: 'RunningTaskCount',
+    dimensionsMap: {
+      ServiceName: service.serviceName,
+      ClusterName: cluster.clusterName,
+    },
+    statistic: 'Average',
+    period: Duration.minutes(5),
+  }),
+  threshold: 1,
+  comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
+  evaluationPeriods: 2,
+  alarmDescription: 'ECS service has no running tasks',
+});
+```
+
+### CPU Utilization Alarm
+```typescript
+new cloudwatch.Alarm(this, 'ECSCpuAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'AWS/ECS',
+    metricName: 'CPUUtilization',
+    dimensionsMap: {
+      ServiceName: service.serviceName,
+      ClusterName: cluster.clusterName,
+    },
+    statistic: 'Average',
+    period: Duration.minutes(5),
+  }),
+  threshold: 80,
+  evaluationPeriods: 3,
+  alarmDescription: 'ECS service CPU utilization high',
+});
+```
+
+### Memory Utilization Alarm
+```typescript
+new cloudwatch.Alarm(this, 'ECSMemoryAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'AWS/ECS',
+    metricName: 'MemoryUtilization',
+    dimensionsMap: {
+      ServiceName: service.serviceName,
+      ClusterName: cluster.clusterName,
+    },
+    statistic: 'Average',
+    period: Duration.minutes(5),
+  }),
+  threshold: 85,
+  evaluationPeriods: 2,
+  alarmDescription: 'ECS service memory utilization high',
+});
+```
+
+## SQS Queues
+
+### Queue Depth Alarm
+```typescript
+new cloudwatch.Alarm(this, 'SQSDepthAlarm', {
+  metric: queue.metricApproximateNumberOfMessagesVisible({
+    statistic: 'Maximum',
+    period: Duration.minutes(5),
+  }),
+  threshold: 1000,
+  evaluationPeriods: 2,
+  alarmDescription: 'SQS queue depth exceeded threshold',
+});
+```
+
+### Age of Oldest Message Alarm
+```typescript
+new cloudwatch.Alarm(this, 'SQSAgeAlarm', {
+  metric: queue.metricApproximateAgeOfOldestMessage({
+    statistic: 'Maximum',
+    period: Duration.minutes(5),
+  }),
+  threshold: 300, // 5 minutes in seconds
+  evaluationPeriods: 1,
+  alarmDescription: 'SQS messages not being processed timely',
+});
+```
+
+## Application Load Balancer
+
+### Target Health Alarm
+```typescript
+new cloudwatch.Alarm(this, 'ALBUnhealthyTargetAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'AWS/ApplicationELB',
+    metricName: 'UnHealthyHostCount',
+    dimensionsMap: {
+      LoadBalancer: loadBalancer.loadBalancerFullName,
+      TargetGroup: targetGroup.targetGroupFullName,
+    },
+    statistic: 'Average',
+    period: Duration.minutes(5),
+  }),
+  threshold: 1,
+  evaluationPeriods: 2,
+  alarmDescription: 'ALB has unhealthy targets',
+});
+```
+
+### HTTP 5XX Alarm
+```typescript
+new cloudwatch.Alarm(this, 'ALB5xxAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'AWS/ApplicationELB',
+    metricName: 'HTTPCode_Target_5XX_Count',
+    dimensionsMap: {
+      LoadBalancer: loadBalancer.loadBalancerFullName,
+    },
+    statistic: 'Sum',
+    period: Duration.minutes(5),
+  }),
+  threshold: 10,
+  evaluationPeriods: 1,
+  alarmDescription: 'ALB target 5XX errors exceeded threshold',
+});
+```
+
+### Response Time Alarm
+```typescript
+new cloudwatch.Alarm(this, 'ALBLatencyAlarm', {
+  metric: new cloudwatch.Metric({
+    namespace: 'AWS/ApplicationELB',
+    metricName: 'TargetResponseTime',
+    dimensionsMap: {
+      LoadBalancer: loadBalancer.loadBalancerFullName,
+    },
+    statistic: 'p99',
+    period: Duration.minutes(5),
+  }),
+  threshold: 1, // 1 second
+  evaluationPeriods: 2,
+  alarmDescription: 'ALB p99 response time exceeded threshold',
+});
+```
+
+## Composite Alarms
+
+### Service Health Composite Alarm
+```typescript
+const errorAlarm = new cloudwatch.Alarm(this, 'ErrorAlarm', { /* ... */ });
+const latencyAlarm = new cloudwatch.Alarm(this, 'LatencyAlarm', { /* ... */ });
+const throttleAlarm = new cloudwatch.Alarm(this, 'ThrottleAlarm', { /* ... */ });
+
+new cloudwatch.CompositeAlarm(this, 'ServiceHealthAlarm', {
+  compositeAlarmName: 'service-health',
+  alarmRule: cloudwatch.AlarmRule.anyOf(
+    errorAlarm,
+    latencyAlarm,
+    throttleAlarm
+  ),
+  alarmDescription: 'Overall service health degraded',
+});
+```
+
+## Alarm Actions
+
+### SNS Topic Integration
+```typescript
+const topic = new sns.Topic(this, 'AlarmTopic', {
+  displayName: 'CloudWatch Alarms',
+});
+
+// Email subscription
+topic.addSubscription(new subscriptions.EmailSubscription('ops@example.com'));
+
+// Add action to alarm
+alarm.addAlarmAction(new actions.SnsAction(topic));
+alarm.addOkAction(new actions.SnsAction(topic));
+```
+
+### Auto Scaling Action
+```typescript
+const scalingAction = targetGroup.scaleOnMetric('ScaleUp', {
+  metric: targetGroup.metricTargetResponseTime(),
+  scalingSteps: [
+    { upper: 1, change: 0 },
+    { lower: 1, change: +1 },
+    { lower: 2, change: +2 },
+  ],
+});
+```
+
+## Alarm Best Practices
+
+### Threshold Selection
+
+**CPU/Memory Alarms**:
+- Warning: 70-80%
+- Critical: 80-90%
+- Consider burst patterns and normal usage
+
+**Error Rate Alarms**:
+- Threshold based on SLA (e.g., 99.9% = 0.1% error rate)
+- Account for normal error rates
+- Different thresholds for different error types
+
+**Latency Alarms**:
+- p99 latency for user-facing APIs
+- Warning: 80% of SLA target
+- Critical: 100% of SLA target
+
+### Evaluation Periods
+
+**Fast-changing metrics** (1-2 periods):
+- Error counts
+- Failed health checks
+- Critical application errors
+
+**Slow-changing metrics** (3-5 periods):
+- CPU utilization
+- Memory usage
+- Disk usage
+
+**Cost-related metrics** (longer periods):
+- Daily spending
+- Resource count changes
+- Usage patterns
+
+### Missing Data Handling
+
+```typescript
+// For intermittent workloads
+alarm.treatMissingData(cloudwatch.TreatMissingData.NOT_BREACHING);
+
+// For always-on services
+alarm.treatMissingData(cloudwatch.TreatMissingData.BREACHING);
+
+// To distinguish from data issues
+alarm.treatMissingData(cloudwatch.TreatMissingData.MISSING);
+```
+
+### Alarm Naming Conventions
+
+```typescript
+// Pattern: <service>-<metric>-<severity>
+'lambda-errors-critical'
+'api-latency-warning'
+'rds-cpu-warning'
+'ecs-tasks-critical'
+```
+
+### Alarm Actions Best Practices
+
+1. **Separate topics by severity**:
+   - Critical alarms → PagerDuty/on-call
+   - Warning alarms → Slack/email
+   - Info alarms → Metrics dashboard
+
+2. **Include context in alarm description**:
+   - Service name
+   - Expected threshold
+   - Troubleshooting runbook link
+
+3. **Auto-remediation where possible**:
+   - Lambda errors → automatic retry
+   - CPU high → auto-scaling trigger
+   - Disk full → automated cleanup
+
+4. **Alarm fatigue prevention**:
+   - Tune thresholds based on actual patterns
+   - Use composite alarms to reduce noise
+   - Implement proper evaluation periods
+   - Regularly review and adjust alarms
+
+## Monitoring Dashboard
+
+### Recommended Dashboard Layout
+
+**Service Overview**:
+- Request count and rate
+- Error count and percentage
+- Latency (p50, p95, p99)
+- Availability percentage
+
+**Resource Utilization**:
+- CPU utilization by service
+- Memory utilization by service
+- Network throughput
+- Disk I/O
+
+**Cost Metrics**:
+- Daily spending by service
+- Month-to-date costs
+- Budget utilization
+- Cost anomalies
+
+**Security Metrics**:
+- Failed login attempts
+- IAM policy changes
+- Security group modifications
+- GuardDuty findings
--- a/skills/aws-cost-operations/references/operations-patterns.md
+++ b/skills/aws-cost-operations/references/operations-patterns.md
@@ -0,0 +1,394 @@
+# AWS Cost & Operations Patterns
+
+Comprehensive patterns and best practices for AWS cost optimization, monitoring, and operational excellence.
+
+## Table of Contents
+
+- [Cost Optimization Patterns](#cost-optimization-patterns)
+- [Monitoring Patterns](#monitoring-patterns)
+- [Observability Patterns](#observability-patterns)
+- [Security and Audit Patterns](#security-and-audit-patterns)
+- [Troubleshooting Workflows](#troubleshooting-workflows)
+
+## Cost Optimization Patterns
+
+### Pattern 1: Cost Estimation Before Deployment
+
+**When**: Before deploying any new infrastructure
+
+**MCP Server**: AWS Pricing MCP
+
+**Steps**:
+1. List all resources to be deployed
+2. Query pricing for each resource type
+3. Calculate monthly costs based on expected usage
+4. Compare pricing across regions
+5. Document cost estimates in architecture docs
+
+**Example**:
+```
+Resource: Lambda Function
+- Invocations: 1,000,000/month
+- Duration: 3 seconds avg
+- Memory: 512 MB
+- Region: us-east-1
+Estimated cost: $X/month
+```
+
+### Pattern 2: Monthly Cost Review
+
+**When**: First week of every month
+
+**MCP Servers**: Cost Explorer MCP, Billing and Cost Management MCP
+
+**Steps**:
+1. Review total spending vs. budget
+2. Analyze cost by service (top 5 services)
+3. Identify cost anomalies (>20% increase)
+4. Review cost by environment (dev/staging/prod)
+5. Check cost allocation tag coverage
+6. Generate cost optimization recommendations
+
+**Key Metrics**:
+- Month-over-month cost change
+- Cost per environment
+- Cost per application/project
+- Untagged resource costs
+
+### Pattern 3: Right-Sizing Resources
+
+**When**: Quarterly or when utilization alerts trigger
+
+**MCP Servers**: CloudWatch MCP, Cost Explorer MCP
+
+**Steps**:
+1. Query CloudWatch for resource utilization metrics
+2. Identify over-provisioned resources (< 40% utilization)
+3. Identify under-provisioned resources (> 80% utilization)
+4. Calculate potential savings from right-sizing
+5. Plan and execute right-sizing changes
+6. Monitor post-change performance
+
+**Common Right-Sizing Scenarios**:
+- EC2 instances with low CPU utilization
+- RDS instances with excess capacity
+- DynamoDB tables with low read/write usage
+- Lambda functions with excessive memory allocation
+
+### Pattern 4: Unused Resource Cleanup
+
+**When**: Monthly or triggered by cost anomalies
+
+**MCP Servers**: Cost Explorer MCP, CloudTrail MCP
+
+**Steps**:
+1. Identify resources with zero usage
+2. Query CloudTrail for last access time
+3. Tag resources for deletion review
+4. Notify resource owners
+5. Delete confirmed unused resources
+6. Track cost savings
+
+**Common Unused Resources**:
+- Unattached EBS volumes
+- Old EBS snapshots
+- Idle Load Balancers
+- Unused Elastic IPs
+- Old AMIs and snapshots
+- Stopped EC2 instances (long-term)
+
+## Monitoring Patterns
+
+### Pattern 1: Critical Service Monitoring
+
+**When**: All production services
+
+**MCP Server**: CloudWatch MCP
+
+**Metrics to Monitor**:
+- **Availability**: Service uptime, health checks
+- **Performance**: Latency, response time
+- **Errors**: Error rate, failed requests
+- **Saturation**: CPU, memory, disk, network utilization
+
+**Alarm Thresholds** (adjust based on SLAs):
+- Error rate: > 1% for 2 consecutive periods
+- Latency: p99 > 1 second for 5 minutes
+- CPU: > 80% for 10 minutes
+- Memory: > 85% for 5 minutes
+
+### Pattern 2: Lambda Function Monitoring
+
+**MCP Server**: CloudWatch MCP
+
+**Key Metrics**:
+```
+- Invocations (Count)
+- Errors (Count, %)
+- Duration (Average, p99)
+- Throttles (Count)
+- ConcurrentExecutions (Max)
+- IteratorAge (for stream processing)
+```
+
+**Recommended Alarms**:
+- Error rate > 1%
+- Duration > 80% of timeout
+- Throttles > 0
+- ConcurrentExecutions > 80% of reserved
+
+### Pattern 3: API Gateway Monitoring
+
+**MCP Server**: CloudWatch MCP
+
+**Key Metrics**:
+```
+- Count (Total requests)
+- 4XXError, 5XXError
+- Latency (p50, p95, p99)
+- IntegrationLatency
+- CacheHitCount, CacheMissCount
+```
+
+**Recommended Alarms**:
+- 5XX error rate > 0.5%
+- 4XX error rate > 5%
+- Latency p99 > 2 seconds
+- Integration latency spike
+
+### Pattern 4: Database Monitoring
+
+**MCP Server**: CloudWatch MCP
+
+**RDS Metrics**:
+```
+- CPUUtilization
+- DatabaseConnections
+- FreeableMemory
+- ReadLatency, WriteLatency
+- ReadIOPS, WriteIOPS
+- FreeStorageSpace
+```
+
+**DynamoDB Metrics**:
+```
+- ConsumedReadCapacityUnits
+- ConsumedWriteCapacityUnits
+- UserErrors
+- SystemErrors
+- ThrottledRequests
+```
+
+**Recommended Alarms**:
+- RDS CPU > 80% for 10 minutes
+- RDS connections > 80% of max
+- RDS free storage < 10 GB
+- DynamoDB throttled requests > 0
+- DynamoDB user errors spike
+
+## Observability Patterns
+
+### Pattern 1: Distributed Tracing Setup
+
+**MCP Server**: CloudWatch Application Signals MCP
+
+**Components**:
+1. **Service Map**: Visualize service dependencies
+2. **Traces**: Track requests across services
+3. **Metrics**: Monitor latency and errors per service
+4. **SLOs**: Define and track service level objectives
+
+**Implementation**:
+- Enable X-Ray tracing on Lambda functions
+- Add X-Ray SDK to application code
+- Configure sampling rules
+- Create service lens dashboards
+
+### Pattern 2: Log Aggregation and Analysis
+
+**MCP Server**: CloudWatch MCP
+
+**Log Strategy**:
+1. **Centralize Logs**: Send all application logs to CloudWatch Logs
+2. **Structure Logs**: Use JSON format for structured logging
+3. **Log Insights**: Use CloudWatch Logs Insights for queries
+4. **Retention**: Set appropriate retention periods
+
+**Example Log Insights Queries**:
+```
+# Find errors in last hour
+fields @timestamp, @message
+| filter @message like /ERROR/
+| sort @timestamp desc
+| limit 100
+
+# Count errors by type
+stats count() by error_type
+| sort count desc
+
+# Calculate p99 latency
+stats percentile(duration, 99) by service_name
+```
+
+### Pattern 3: Custom Metrics
+
+**MCP Server**: CloudWatch MCP
+
+**When to Use Custom Metrics**:
+- Business-specific KPIs (orders/minute, revenue/hour)
+- Application-specific metrics (cache hit rate, queue depth)
+- Performance metrics not provided by AWS
+
+**Best Practices**:
+- Use consistent namespace: `CompanyName/ApplicationName`
+- Include relevant dimensions (environment, region, version)
+- Publish metrics at appropriate intervals
+- Use metric filters for log-derived metrics
+
+## Security and Audit Patterns
+
+### Pattern 1: API Activity Auditing
+
+**MCP Server**: CloudTrail MCP
+
+**Regular Audit Queries**:
+```
+# Find all IAM changes
+eventName: CreateUser, DeleteUser, AttachUserPolicy, etc.
+Time: Last 24 hours
+
+# Track S3 bucket deletions
+eventName: DeleteBucket
+Time: Last 7 days
+
+# Find failed login attempts
+eventName: ConsoleLogin
+errorCode: Failure
+
+# Monitor privileged actions
+userIdentity.arn: *admin* OR *root*
+```
+
+**Audit Schedule**:
+- Daily: Review privileged user actions
+- Weekly: Audit IAM changes and security group modifications
+- Monthly: Comprehensive security review
+
+### Pattern 2: Security Posture Assessment
+
+**MCP Server**: Well-Architected Security Assessment Tool MCP
+
+**Assessment Areas**:
+1. **Identity and Access Management**
+   - Least privilege implementation
+   - MFA enforcement
+   - Role-based access control
+   - Service control policies
+
+2. **Detective Controls**
+   - CloudTrail enabled in all regions
+   - GuardDuty findings review
+   - Config rule compliance
+   - Security Hub findings
+
+3. **Infrastructure Protection**
+   - VPC security groups review
+   - Network ACLs configuration
+   - AWS WAF rules
+   - Security group ingress rules
+
+4. **Data Protection**
+   - Encryption at rest (S3, EBS, RDS)
+   - Encryption in transit (TLS/SSL)
+   - KMS key usage and rotation
+   - Secrets Manager utilization
+
+5. **Incident Response**
+   - IR playbooks documented
+   - Automated response procedures
+   - Contact information current
+   - Regular IR drills
+
+**Assessment Frequency**:
+- Quarterly: Full Well-Architected review
+- Monthly: High-priority findings review
+- Weekly: Critical security findings
+
+### Pattern 3: Compliance Monitoring
+
+**MCP Servers**: CloudTrail MCP, CloudWatch MCP
+
+**Compliance Requirements**:
+- Data residency (ensure data stays in approved regions)
+- Access logging (all access logged and retained)
+- Encryption requirements (data encrypted at rest and in transit)
+- Change management (all changes tracked in CloudTrail)
+
+**Compliance Dashboards**:
+- Encryption coverage by service
+- CloudTrail logging status
+- Failed login attempts
+- Privileged access usage
+- Non-compliant resources
+
+## Troubleshooting Workflows
+
+### Workflow 1: High Lambda Error Rate
+
+**MCP Servers**: CloudWatch MCP, CloudWatch Application Signals MCP
+
+**Steps**:
+1. Query CloudWatch for Lambda error metrics
+2. Check error logs in CloudWatch Logs
+3. Identify error patterns (timeout, memory, permission)
+4. Check Lambda configuration (memory, timeout, permissions)
+5. Review recent code deployments
+6. Check downstream service health
+7. Implement fix and monitor
+
+### Workflow 2: Increased Latency
+
+**MCP Servers**: CloudWatch MCP, CloudWatch Application Signals MCP
+
+**Steps**:
+1. Identify latency spike in CloudWatch metrics
+2. Check service map for slow dependencies
+3. Query distributed traces for slow requests
+4. Check database query performance
+5. Review API Gateway integration latency
+6. Check Lambda cold starts
+7. Identify bottleneck and optimize
+
+### Workflow 3: Cost Spike Investigation
+
+**MCP Servers**: Cost Explorer MCP, CloudWatch MCP, CloudTrail MCP
+
+**Steps**:
+1. Use Cost Explorer to identify service causing spike
+2. Check CloudWatch metrics for usage increase
+3. Review CloudTrail for recent resource creation
+4. Identify root cause (misconfiguration, runaway process, attack)
+5. Implement cost controls (budgets, alarms, service quotas)
+6. Clean up unnecessary resources
+
+### Workflow 4: Security Incident Response
+
+**MCP Servers**: CloudTrail MCP, GuardDuty (via CloudWatch), Well-Architected Assessment MCP
+
+**Steps**:
+1. Identify security event in GuardDuty or CloudWatch
+2. Query CloudTrail for related API activity
+3. Determine scope and impact
+4. Isolate affected resources
+5. Revoke compromised credentials
+6. Implement remediation
+7. Conduct post-incident review
+8. Update security controls
+
+## Summary
+
+- **Cost Optimization**: Use Pricing, Cost Explorer, and Billing MCPs for proactive cost management
+- **Monitoring**: Set up comprehensive CloudWatch alarms for all critical services
+- **Observability**: Implement distributed tracing and structured logging
+- **Security**: Regular CloudTrail audits and Well-Architected assessments
+- **Proactive**: Don't wait for incidents - monitor and optimize continuously