# CloudWatch Alarms Reference

Common CloudWatch alarm configurations for AWS services.

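The snippets in this reference are AWS CDK (TypeScript) fragments. They assume a standard set of `aws-cdk-lib` imports along the lines of the sketch below; constructs such as `lambdaFunction`, `api`, `table`, `queue`, `service`, and `instance` are assumed to be defined elsewhere in the stack.

```typescript
import { Duration } from 'aws-cdk-lib';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as actions from 'aws-cdk-lib/aws-cloudwatch-actions';
import * as sns from 'aws-cdk-lib/aws-sns';
import * as subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
```
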
## Lambda Functions

### Error Rate Alarm

```typescript
new cloudwatch.Alarm(this, 'LambdaErrorAlarm', {
  metric: lambdaFunction.metricErrors({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 10,
  evaluationPeriods: 1,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
  alarmDescription: 'Lambda error count exceeded threshold',
});
```

### Duration Alarm (Approaching Timeout)

```typescript
new cloudwatch.Alarm(this, 'LambdaDurationAlarm', {
  metric: lambdaFunction.metricDuration({
    statistic: 'Maximum',
    period: Duration.minutes(5),
  }),
  threshold: lambdaFunction.timeout.toMilliseconds() * 0.8, // 80% of the configured timeout
  evaluationPeriods: 2,
  alarmDescription: 'Lambda duration approaching timeout',
});
```

### Throttle Alarm

```typescript
new cloudwatch.Alarm(this, 'LambdaThrottleAlarm', {
  metric: lambdaFunction.metricThrottles({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 5,
  evaluationPeriods: 1,
  alarmDescription: 'Lambda function is being throttled',
});
```

### Concurrent Executions Alarm

```typescript
new cloudwatch.Alarm(this, 'LambdaConcurrencyAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/Lambda',
    metricName: 'ConcurrentExecutions',
    dimensionsMap: {
      FunctionName: lambdaFunction.functionName,
    },
    statistic: 'Maximum',
    period: Duration.minutes(1),
  }),
  threshold: 100, // Adjust based on reserved concurrency
  evaluationPeriods: 2,
  alarmDescription: 'Lambda concurrent executions high',
});
```

## API Gateway

### 5XX Error Rate Alarm

```typescript
new cloudwatch.Alarm(this, 'Api5xxAlarm', {
  metric: api.metricServerError({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 10,
  evaluationPeriods: 1,
  alarmDescription: 'API Gateway 5XX errors exceeded threshold',
});
```

### 4XX Error Rate Alarm

```typescript
new cloudwatch.Alarm(this, 'Api4xxAlarm', {
  metric: api.metricClientError({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 50,
  evaluationPeriods: 2,
  alarmDescription: 'API Gateway 4XX errors exceeded threshold',
});
```

### Latency Alarm

```typescript
new cloudwatch.Alarm(this, 'ApiLatencyAlarm', {
  metric: api.metricLatency({
    statistic: 'p99',
    period: Duration.minutes(5),
  }),
  threshold: 2000, // 2 seconds (Latency is reported in milliseconds)
  evaluationPeriods: 2,
  alarmDescription: 'API Gateway p99 latency exceeded threshold',
});
```

## DynamoDB

### Read Throttle Alarm

```typescript
new cloudwatch.Alarm(this, 'DynamoDBReadThrottleAlarm', {
  // ReadThrottleEvents is the DynamoDB metric for throttled read requests
  metric: table.metric('ReadThrottleEvents', {
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 5,
  evaluationPeriods: 1,
  alarmDescription: 'DynamoDB read operations being throttled',
});
```

### Write Throttle Alarm

```typescript
new cloudwatch.Alarm(this, 'DynamoDBWriteThrottleAlarm', {
  // WriteThrottleEvents is the DynamoDB metric for throttled write requests
  metric: table.metric('WriteThrottleEvents', {
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 5,
  evaluationPeriods: 1,
  alarmDescription: 'DynamoDB write operations being throttled',
});
```

### Consumed Capacity Alarm

```typescript
new cloudwatch.Alarm(this, 'DynamoDBCapacityAlarm', {
  metric: table.metricConsumedReadCapacityUnits({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  // provisionedCapacity is RCU/sec, so compare against the 5-minute (300 s) total at 80%
  threshold: provisionedCapacity * 300 * 0.8,
  evaluationPeriods: 2,
  alarmDescription: 'DynamoDB consumed capacity approaching provisioned limit',
});
```

## EC2 Instances

### CPU Utilization Alarm

```typescript
new cloudwatch.Alarm(this, 'EC2CpuAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/EC2',
    metricName: 'CPUUtilization',
    dimensionsMap: {
      InstanceId: instance.instanceId,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 80,
  evaluationPeriods: 3,
  alarmDescription: 'EC2 CPU utilization high',
});
```

### Status Check Failed Alarm

```typescript
new cloudwatch.Alarm(this, 'EC2StatusCheckAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/EC2',
    metricName: 'StatusCheckFailed',
    dimensionsMap: {
      InstanceId: instance.instanceId,
    },
    statistic: 'Maximum',
    period: Duration.minutes(1),
  }),
  threshold: 1,
  evaluationPeriods: 2,
  alarmDescription: 'EC2 status check failed',
});
```

### Disk Space Alarm (Requires CloudWatch Agent)

```typescript
new cloudwatch.Alarm(this, 'EC2DiskAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'CWAgent',
    metricName: 'disk_used_percent',
    // Dimensions must match what the CloudWatch agent publishes;
    // depending on agent configuration this may also include device and fstype.
    dimensionsMap: {
      InstanceId: instance.instanceId,
      path: '/',
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 85,
  evaluationPeriods: 2,
  alarmDescription: 'EC2 disk space usage high',
});
```

## RDS Databases

### CPU Alarm

```typescript
new cloudwatch.Alarm(this, 'RDSCpuAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/RDS',
    metricName: 'CPUUtilization',
    dimensionsMap: {
      DBInstanceIdentifier: dbInstance.instanceIdentifier,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 80,
  evaluationPeriods: 3,
  alarmDescription: 'RDS CPU utilization high',
});
```

### Connection Count Alarm

```typescript
new cloudwatch.Alarm(this, 'RDSConnectionAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/RDS',
    metricName: 'DatabaseConnections',
    dimensionsMap: {
      DBInstanceIdentifier: dbInstance.instanceIdentifier,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: maxConnections * 0.8, // 80% of max connections
  evaluationPeriods: 2,
  alarmDescription: 'RDS connection count approaching limit',
});
```

### Free Storage Space Alarm

```typescript
new cloudwatch.Alarm(this, 'RDSStorageAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/RDS',
    metricName: 'FreeStorageSpace',
    dimensionsMap: {
      DBInstanceIdentifier: dbInstance.instanceIdentifier,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 10 * 1024 * 1024 * 1024, // 10 GB in bytes
  comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
  evaluationPeriods: 1,
  alarmDescription: 'RDS free storage space low',
});
```

## ECS Services

### Task Count Alarm

```typescript
new cloudwatch.Alarm(this, 'ECSTaskCountAlarm', {
  metric: new cloudwatch.Metric({
    // RunningTaskCount requires Container Insights to be enabled on the cluster
    namespace: 'ECS/ContainerInsights',
    metricName: 'RunningTaskCount',
    dimensionsMap: {
      ServiceName: service.serviceName,
      ClusterName: cluster.clusterName,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 1,
  comparisonOperator: cloudwatch.ComparisonOperator.LESS_THAN_THRESHOLD,
  evaluationPeriods: 2,
  alarmDescription: 'ECS service has no running tasks',
});
```

### CPU Utilization Alarm

```typescript
new cloudwatch.Alarm(this, 'ECSCpuAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ECS',
    metricName: 'CPUUtilization',
    dimensionsMap: {
      ServiceName: service.serviceName,
      ClusterName: cluster.clusterName,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 80,
  evaluationPeriods: 3,
  alarmDescription: 'ECS service CPU utilization high',
});
```

### Memory Utilization Alarm

```typescript
new cloudwatch.Alarm(this, 'ECSMemoryAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ECS',
    metricName: 'MemoryUtilization',
    dimensionsMap: {
      ServiceName: service.serviceName,
      ClusterName: cluster.clusterName,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 85,
  evaluationPeriods: 2,
  alarmDescription: 'ECS service memory utilization high',
});
```

## SQS Queues

### Queue Depth Alarm

```typescript
new cloudwatch.Alarm(this, 'SQSDepthAlarm', {
  metric: queue.metricApproximateNumberOfMessagesVisible({
    statistic: 'Maximum',
    period: Duration.minutes(5),
  }),
  threshold: 1000,
  evaluationPeriods: 2,
  alarmDescription: 'SQS queue depth exceeded threshold',
});
```

### Age of Oldest Message Alarm

```typescript
new cloudwatch.Alarm(this, 'SQSAgeAlarm', {
  metric: queue.metricApproximateAgeOfOldestMessage({
    statistic: 'Maximum',
    period: Duration.minutes(5),
  }),
  threshold: 300, // 5 minutes, in seconds
  evaluationPeriods: 1,
  alarmDescription: 'SQS messages are not being processed quickly enough',
});
```

## Application Load Balancer

### Target Health Alarm

```typescript
new cloudwatch.Alarm(this, 'ALBUnhealthyTargetAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ApplicationELB',
    metricName: 'UnHealthyHostCount',
    dimensionsMap: {
      LoadBalancer: loadBalancer.loadBalancerFullName,
      TargetGroup: targetGroup.targetGroupFullName,
    },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 1,
  evaluationPeriods: 2,
  alarmDescription: 'ALB has unhealthy targets',
});
```

### HTTP 5XX Alarm

```typescript
new cloudwatch.Alarm(this, 'ALB5xxAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ApplicationELB',
    metricName: 'HTTPCode_Target_5XX_Count',
    dimensionsMap: {
      LoadBalancer: loadBalancer.loadBalancerFullName,
    },
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 10,
  evaluationPeriods: 1,
  alarmDescription: 'ALB target 5XX errors exceeded threshold',
});
```

### Response Time Alarm

```typescript
new cloudwatch.Alarm(this, 'ALBLatencyAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/ApplicationELB',
    metricName: 'TargetResponseTime',
    dimensionsMap: {
      LoadBalancer: loadBalancer.loadBalancerFullName,
    },
    statistic: 'p99',
    period: Duration.minutes(5),
  }),
  threshold: 1, // 1 second
  evaluationPeriods: 2,
  alarmDescription: 'ALB p99 response time exceeded threshold',
});
```

## Composite Alarms

### Service Health Composite Alarm

```typescript
const errorAlarm = new cloudwatch.Alarm(this, 'ErrorAlarm', { /* ... */ });
const latencyAlarm = new cloudwatch.Alarm(this, 'LatencyAlarm', { /* ... */ });
const throttleAlarm = new cloudwatch.Alarm(this, 'ThrottleAlarm', { /* ... */ });

new cloudwatch.CompositeAlarm(this, 'ServiceHealthAlarm', {
  compositeAlarmName: 'service-health',
  alarmRule: cloudwatch.AlarmRule.anyOf(errorAlarm, latencyAlarm, throttleAlarm),
  alarmDescription: 'Overall service health degraded',
});
```

## Alarm Actions

### SNS Topic Integration

```typescript
const topic = new sns.Topic(this, 'AlarmTopic', {
  displayName: 'CloudWatch Alarms',
});

// Email subscription
topic.addSubscription(new subscriptions.EmailSubscription('ops@example.com'));

// Add actions to an alarm
alarm.addAlarmAction(new actions.SnsAction(topic));
alarm.addOkAction(new actions.SnsAction(topic));
```

### Auto Scaling Action

```typescript
// Step scaling for the ECS service behind the target group; scaleOnMetric
// creates the CloudWatch alarms and the step scaling policy automatically.
const scalableTarget = service.autoScaleTaskCount({ maxCapacity: 10 }); // example capacity

scalableTarget.scaleOnMetric('ScaleOnResponseTime', {
  metric: targetGroup.metricTargetResponseTime(),
  scalingSteps: [
    { upper: 1, change: 0 },  // below 1 s: no change
    { lower: 1, change: +1 }, // 1-2 s: add one task
    { lower: 2, change: +2 }, // above 2 s: add two tasks
  ],
});
```

## Alarm Best Practices

### Threshold Selection

**CPU/Memory Alarms**:
- Warning: 70-80%
- Critical: 80-90%
- Consider burst patterns and normal usage

**Error Rate Alarms**:
- Base the threshold on your SLA (e.g., a 99.9% success SLA allows a 0.1% error rate)
- Account for the normal background error rate
- Use different thresholds for different error types

**Latency Alarms**:
- Alarm on p99 latency for user-facing APIs
- Warning: 80% of the SLA target
- Critical: 100% of the SLA target (a warning/critical pair is sketched below)

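As a rough sketch of the warning/critical split, assuming a hypothetical 500 ms p99 SLA target for the API Gateway example above (the value and alarm names are illustrative, not part of this reference):

```typescript
const slaLatencyMs = 500; // assumed SLA target, adjust to your service

const p99Latency = api.metricLatency({
  statistic: 'p99',
  period: Duration.minutes(5),
});

// Warning at 80% of the SLA target
new cloudwatch.Alarm(this, 'ApiLatencyWarning', {
  metric: p99Latency,
  threshold: slaLatencyMs * 0.8,
  evaluationPeriods: 3,
  alarmDescription: 'API p99 latency above 80% of SLA target',
});

// Critical at 100% of the SLA target
new cloudwatch.Alarm(this, 'ApiLatencyCritical', {
  metric: p99Latency,
  threshold: slaLatencyMs,
  evaluationPeriods: 2,
  alarmDescription: 'API p99 latency above SLA target',
});
```
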
### Evaluation Periods

**Fast-changing metrics** (1-2 periods; a sketch of both cases follows these lists):
- Error counts
- Failed health checks
- Critical application errors

**Slow-changing metrics** (3-5 periods):
- CPU utilization
- Memory usage
- Disk usage

**Cost-related metrics** (longer periods):
- Daily spending
- Resource count changes
- Usage patterns

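A sketch of how these choices map to CDK, reusing constructs from the sections above; `datapointsToAlarm` turns the alarm into an "M out of N" rule so a single noisy datapoint on a slow-moving metric does not trigger it (thresholds are illustrative):

```typescript
// Fast-changing metric: alarm on a single breaching 5-minute period
new cloudwatch.Alarm(this, 'ErrorCountFastAlarm', {
  metric: lambdaFunction.metricErrors({ statistic: 'Sum', period: Duration.minutes(5) }),
  threshold: 10,
  evaluationPeriods: 1,
});

// Slow-changing metric: require 3 breaching datapoints out of 5 periods
new cloudwatch.Alarm(this, 'CpuSustainedAlarm', {
  metric: new cloudwatch.Metric({
    namespace: 'AWS/EC2',
    metricName: 'CPUUtilization',
    dimensionsMap: { InstanceId: instance.instanceId },
    statistic: 'Average',
    period: Duration.minutes(5),
  }),
  threshold: 80,
  evaluationPeriods: 5,
  datapointsToAlarm: 3, // "M out of N" evaluation
});
```
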
### Missing Data Handling

`treatMissingData` is a construction-time property of the alarm, not a method on it:

```typescript
// For intermittent workloads (e.g., scheduled jobs) where gaps are expected:
new cloudwatch.Alarm(this, 'IntermittentWorkloadAlarm', {
  metric: lambdaFunction.metricErrors({ statistic: 'Sum', period: Duration.minutes(5) }),
  threshold: 1,
  evaluationPeriods: 1,
  treatMissingData: cloudwatch.TreatMissingData.NOT_BREACHING,
});

// For always-on services, treat missing data as breaching:
//   treatMissingData: cloudwatch.TreatMissingData.BREACHING
// To surface gaps as INSUFFICIENT_DATA and distinguish them from real breaches:
//   treatMissingData: cloudwatch.TreatMissingData.MISSING
```

### Alarm Naming Conventions

```typescript
// Pattern: <service>-<metric>-<severity>
'lambda-errors-critical'
'api-latency-warning'
'rds-cpu-warning'
'ecs-tasks-critical'
```

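One way to enforce the pattern is to set `alarmName` explicitly; the helper below is a hypothetical convenience, not part of any library:

```typescript
// Hypothetical helper for the <service>-<metric>-<severity> pattern
const alarmName = (service: string, metric: string, severity: 'warning' | 'critical') =>
  `${service}-${metric}-${severity}`;

new cloudwatch.Alarm(this, 'LambdaErrorsCritical', {
  alarmName: alarmName('lambda', 'errors', 'critical'), // => 'lambda-errors-critical'
  metric: lambdaFunction.metricErrors({ statistic: 'Sum', period: Duration.minutes(5) }),
  threshold: 10,
  evaluationPeriods: 1,
});
```
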
### Alarm Actions Best Practices

1. **Separate topics by severity** (a two-topic sketch follows this list):
   - Critical alarms → PagerDuty/on-call
   - Warning alarms → Slack/email
   - Info alarms → metrics dashboard

2. **Include context in the alarm description**:
   - Service name
   - Expected threshold
   - Link to the troubleshooting runbook

3. **Auto-remediate where possible**:
   - Lambda errors → automatic retry
   - High CPU → auto-scaling trigger
   - Disk full → automated cleanup

4. **Prevent alarm fatigue**:
   - Tune thresholds based on actual traffic patterns
   - Use composite alarms to reduce noise
   - Choose evaluation periods deliberately
   - Review and adjust alarms regularly

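A minimal sketch of severity-based routing, assuming two SNS topics and existing `criticalAlarm`/`warningAlarm` constructs (the e-mail endpoints are placeholders; a real setup would typically point the critical topic at an on-call integration):

```typescript
// Critical alarms page the on-call rotation
const criticalTopic = new sns.Topic(this, 'CriticalAlarmTopic');
criticalTopic.addSubscription(new subscriptions.EmailSubscription('oncall@example.com'));

// Warning alarms go to a lower-urgency channel
const warningTopic = new sns.Topic(this, 'WarningAlarmTopic');
warningTopic.addSubscription(new subscriptions.EmailSubscription('team@example.com'));

// criticalAlarm and warningAlarm are assumed to exist elsewhere in the stack
criticalAlarm.addAlarmAction(new actions.SnsAction(criticalTopic));
warningAlarm.addAlarmAction(new actions.SnsAction(warningTopic));
```
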
## Monitoring Dashboard

### Recommended Dashboard Layout

**Service Overview** (a CDK sketch of this row follows these lists):
- Request count and rate
- Error count and percentage
- Latency (p50, p95, p99)
- Availability percentage

**Resource Utilization**:
- CPU utilization by service
- Memory utilization by service
- Network throughput
- Disk I/O

**Cost Metrics**:
- Daily spending by service
- Month-to-date costs
- Budget utilization
- Cost anomalies

**Security Metrics**:
- Failed login attempts
- IAM policy changes
- Security group modifications
- GuardDuty findings

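A brief sketch of the service-overview row using the CDK `Dashboard` construct, reusing the API Gateway metrics from earlier sections (widget layout and names are illustrative):

```typescript
const dashboard = new cloudwatch.Dashboard(this, 'ServiceDashboard', {
  dashboardName: 'service-overview',
});

dashboard.addWidgets(
  new cloudwatch.GraphWidget({
    title: 'Requests',
    left: [api.metricCount({ statistic: 'Sum', period: Duration.minutes(5) })],
  }),
  new cloudwatch.GraphWidget({
    title: 'Errors (4XX / 5XX)',
    left: [
      api.metricClientError({ statistic: 'Sum', period: Duration.minutes(5) }),
      api.metricServerError({ statistic: 'Sum', period: Duration.minutes(5) }),
    ],
  }),
  new cloudwatch.GraphWidget({
    title: 'Latency (p50 / p95 / p99)',
    left: [
      api.metricLatency({ statistic: 'p50', period: Duration.minutes(5) }),
      api.metricLatency({ statistic: 'p95', period: Duration.minutes(5) }),
      api.metricLatency({ statistic: 'p99', period: Duration.minutes(5) }),
    ],
  }),
);
```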