# Serverless Observability Best Practices

Comprehensive observability patterns for serverless applications, based on AWS best practices.

## Table of Contents

- [Three Pillars of Observability](#three-pillars-of-observability)
- [Metrics](#metrics)
- [Logging](#logging)
- [Tracing](#tracing)
- [Unified Observability](#unified-observability)
- [Alerting](#alerting)
- [Dashboard Best Practices](#dashboard-best-practices)
- [Monitoring Serverless Architectures](#monitoring-serverless-architectures)
- [OpenTelemetry Integration](#opentelemetry-integration)
- [Best Practices Summary](#best-practices-summary)

## Three Pillars of Observability

### Metrics

**Numeric data measured at intervals (time series)**

- Request rate, error rate, duration
- CPU%, memory%, disk%
- Custom business metrics
- Service Level Indicators (SLIs)

### Logs

**Timestamped records of discrete events**

- Application events and errors
- State transitions
- Debugging information
- Audit trails

### Traces

**A single request's journey across services**

- Request flow through a distributed system
- Service dependencies
- Latency breakdown
- Error propagation

## Metrics

### CloudWatch Metrics for Lambda

**Out-of-the-box metrics** (automatically available):

- Invocations
- Errors
- Throttles
- Duration
- ConcurrentExecutions
- IteratorAge (for streams)

**CDK Configuration**:

```typescript
const fn = new NodejsFunction(this, 'Function', {
  entry: 'src/handler.ts',
});

// Create alarms on metrics
new cloudwatch.Alarm(this, 'ErrorAlarm', {
  metric: fn.metricErrors({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 10,
  evaluationPeriods: 1,
});

new cloudwatch.Alarm(this, 'DurationAlarm', {
  metric: fn.metricDuration({
    statistic: 'p99',
    period: Duration.minutes(5),
  }),
  threshold: 1000, // 1 second
  evaluationPeriods: 2,
});
```

### Custom Metrics

**Use CloudWatch Embedded Metric Format (EMF)**:

```typescript
export const handler = async (event: any) => {
  const startTime = Date.now();

  try {
    const result = await processOrder(event);

    // Emit custom metrics
    console.log(JSON.stringify({
      _aws: {
        Timestamp: Date.now(),
        CloudWatchMetrics: [{
          Namespace: 'MyApp/Orders',
          Dimensions: [['ServiceName', 'Operation']],
          Metrics: [
            { Name: 'ProcessingTime', Unit: 'Milliseconds' },
            { Name: 'OrderValue', Unit: 'None' },
          ],
        }],
      },
      ServiceName: 'OrderService',
      Operation: 'ProcessOrder',
      ProcessingTime: Date.now() - startTime,
      OrderValue: result.amount,
    }));

    return result;
  } catch (error) {
    // Emit error metric (Timestamp is required by the EMF specification)
    console.log(JSON.stringify({
      _aws: {
        Timestamp: Date.now(),
        CloudWatchMetrics: [{
          Namespace: 'MyApp/Orders',
          Dimensions: [['ServiceName']],
          Metrics: [{ Name: 'Errors', Unit: 'Count' }],
        }],
      },
      ServiceName: 'OrderService',
      Errors: 1,
    }));

    throw error;
  }
};
```

**Using Lambda Powertools**:

```typescript
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';

const metrics = new Metrics({
  namespace: 'MyApp',
  serviceName: 'OrderService',
});

export const handler = async (event: any) => {
  metrics.addMetric('Invocation', MetricUnits.Count, 1);

  const startTime = Date.now();

  try {
    const result = await processOrder(event);

    metrics.addMetric('Success', MetricUnits.Count, 1);
    metrics.addMetric('ProcessingTime', MetricUnits.Milliseconds, Date.now() - startTime);
    metrics.addMetric('OrderValue', MetricUnits.None, result.amount);

    return result;
  } catch (error) {
    metrics.addMetric('Error', MetricUnits.Count, 1);
    throw error;
  } finally {
    // Flush buffered metrics as a single EMF log entry
    metrics.publishStoredMetrics();
  }
};
```

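If you use Middy, Powertools also ships a `logMetrics` middleware that flushes metrics automatically, even when the handler throws. A minimal sketch, assuming Powertools v1-style exports, the `@middy/core` package, and the `processOrder` function from the examples above:

```typescript
import middy from '@middy/core';
import { Metrics, MetricUnits, logMetrics } from '@aws-lambda-powertools/metrics';

const metrics = new Metrics({ namespace: 'MyApp', serviceName: 'OrderService' });

const lambdaHandler = async (event: any) => {
  // Metrics added here are published by the middleware after the handler returns or throws
  metrics.addMetric('Invocation', MetricUnits.Count, 1);
  return processOrder(event);
};

// captureColdStartMetric emits a ColdStart metric on the first invocation of each environment
export const handler = middy(lambdaHandler).use(
  logMetrics(metrics, { captureColdStartMetric: true })
);
```
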
## Logging

### Structured Logging

**Use JSON format for logs**:

```typescript
// ✅ GOOD - Structured JSON logging
export const handler = async (event: any, context: Context) => {
  const startTime = Date.now();

  console.log(JSON.stringify({
    level: 'INFO',
    message: 'Processing order',
    orderId: event.orderId,
    customerId: event.customerId,
    timestamp: new Date().toISOString(),
    requestId: context.awsRequestId,
  }));

  try {
    const result = await processOrder(event);

    console.log(JSON.stringify({
      level: 'INFO',
      message: 'Order processed successfully',
      orderId: event.orderId,
      duration: Date.now() - startTime,
      timestamp: new Date().toISOString(),
    }));

    return result;
  } catch (error) {
    console.error(JSON.stringify({
      level: 'ERROR',
      message: 'Order processing failed',
      orderId: event.orderId,
      error: {
        name: error.name,
        message: error.message,
        stack: error.stack,
      },
      timestamp: new Date().toISOString(),
    }));

    throw error;
  }
};

// ❌ BAD - Unstructured logging
console.log('Processing order ' + orderId + ' for customer ' + customerId);
```

**Using Lambda Powertools Logger**:

```typescript
import { Logger } from '@aws-lambda-powertools/logger';

const logger = new Logger({
  serviceName: 'OrderService',
  logLevel: 'INFO',
});

export const handler = async (event: any, context: Context) => {
  // Adds Lambda context (request ID, function name, cold start) to every log entry
  logger.addContext(context);

  logger.info('Processing order', {
    orderId: event.orderId,
    customerId: event.customerId,
  });

  try {
    const result = await processOrder(event);

    logger.info('Order processed', {
      orderId: event.orderId,
      amount: result.amount,
    });

    return result;
  } catch (error) {
    logger.error('Order processing failed', {
      orderId: event.orderId,
      error,
    });

    throw error;
  }
};
```

### Log Levels

**Use appropriate log levels**:

- **ERROR**: Errors requiring immediate attention
- **WARN**: Warnings or recoverable errors
- **INFO**: Important business events
- **DEBUG**: Detailed debugging information (disable in production)

```typescript
const logger = new Logger({
  serviceName: 'OrderService',
  logLevel: process.env.LOG_LEVEL || 'INFO',
});

logger.debug('Detailed processing info', { data });
logger.info('Business event occurred', { event });
logger.warn('Recoverable error', { error });
logger.error('Critical failure', { error });
```

### Log Insights Queries

**Common CloudWatch Logs Insights queries**:

```
# Find errors in the last hour
fields @timestamp, @message, level, error.message
| filter level = "ERROR"
| sort @timestamp desc
| limit 100

# Count errors by type
stats count(*) as errorCount by error.name
| sort errorCount desc

# Calculate p99 latency
stats percentile(duration, 99) by serviceName

# Find slow requests
fields @timestamp, orderId, duration
| filter duration > 1000
| sort duration desc
| limit 50

# Track a specific customer's requests
fields @timestamp, @message, orderId
| filter customerId = "customer-123"
| sort @timestamp desc
```

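The same queries can be run programmatically, for example from a scheduled audit Lambda. A minimal sketch using `StartQueryCommand` and `GetQueryResultsCommand` from `@aws-sdk/client-cloudwatch-logs`; the log group name is an illustrative placeholder:

```typescript
import {
  CloudWatchLogsClient,
  StartQueryCommand,
  GetQueryResultsCommand,
} from '@aws-sdk/client-cloudwatch-logs';

const logs = new CloudWatchLogsClient({});

// Run a Logs Insights query over the last hour and poll until it completes
export async function findRecentErrors() {
  const nowSeconds = Math.floor(Date.now() / 1000);

  const { queryId } = await logs.send(new StartQueryCommand({
    logGroupName: '/aws/lambda/order-service', // placeholder log group
    startTime: nowSeconds - 3600,
    endTime: nowSeconds,
    queryString:
      'fields @timestamp, @message | filter level = "ERROR" | sort @timestamp desc | limit 100',
  }));

  while (true) {
    const { status, results } = await logs.send(new GetQueryResultsCommand({ queryId }));
    if (status === 'Complete') return results ?? [];
    if (status === 'Failed' || status === 'Cancelled') throw new Error(`Query ${status}`);
    await new Promise((resolve) => setTimeout(resolve, 1000)); // wait before polling again
  }
}
```
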
## Tracing

### Enable X-Ray Tracing

**Configure X-Ray for Lambda**:

```typescript
const fn = new NodejsFunction(this, 'Function', {
  entry: 'src/handler.ts',
  tracing: lambda.Tracing.ACTIVE, // Enable X-Ray
});

// API Gateway tracing
const api = new apigateway.RestApi(this, 'Api', {
  deployOptions: {
    tracingEnabled: true,
  },
});

// Step Functions tracing
new stepfunctions.StateMachine(this, 'StateMachine', {
  definition,
  tracingEnabled: true,
});
```

**Instrument application code**:

```typescript
import AWSXRay from 'aws-xray-sdk-core';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';

// Wrap AWS SDK clients so their calls appear as subsegments
const client = AWSXRay.captureAWSv3Client(new DynamoDBClient({}));

export const handler = async (event: any) => {
  const segment = AWSXRay.getSegment();

  // Custom subsegment
  const subsegment = segment.addNewSubsegment('ProcessOrder');

  try {
    // Add annotations (indexed for filtering)
    subsegment.addAnnotation('orderId', event.orderId);
    subsegment.addAnnotation('customerId', event.customerId);

    // Add metadata (not indexed, detailed info)
    subsegment.addMetadata('orderDetails', event);

    const result = await processOrder(event);

    subsegment.addAnnotation('status', 'success');
    subsegment.close();

    return result;
  } catch (error) {
    subsegment.addError(error);
    subsegment.close();
    throw error;
  }
};
```

**Using Lambda Powertools Tracer**:

```typescript
import { Tracer } from '@aws-lambda-powertools/tracer';
import { DynamoDBClient, GetItemCommand } from '@aws-sdk/client-dynamodb';

const tracer = new Tracer({ serviceName: 'OrderService' });

// Calls made through the wrapped client are captured and traced automatically
const dynamodb = tracer.captureAWSv3Client(new DynamoDBClient({}));

export const handler = async (event: any) => {
  const result = await dynamodb.send(new GetItemCommand({
    TableName: process.env.TABLE_NAME,
    Key: { orderId: { S: event.orderId } },
  }));

  // Custom annotation and metadata
  tracer.putAnnotation('orderId', event.orderId);
  tracer.putMetadata('orderDetails', event);

  return result;
};
```

### Service Map

**Visualize service dependencies** with X-Ray:

- Shows service-to-service communication
- Identifies latency bottlenecks
- Highlights error rates between services
- Tracks downstream dependencies

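The data behind the console service map is also available from the X-Ray API, which is useful for custom reports or alerts. A minimal sketch, assuming the `@aws-sdk/client-xray` package; the one-hour window and the logging behavior are illustrative:

```typescript
import { XRayClient, GetServiceGraphCommand } from '@aws-sdk/client-xray';

const xray = new XRayClient({});

// Fetch the service graph for the last hour and log any node that reported errors
export async function reportUnhealthyEdges() {
  const now = new Date();

  const { Services } = await xray.send(new GetServiceGraphCommand({
    StartTime: new Date(now.getTime() - 60 * 60 * 1000),
    EndTime: now,
  }));

  for (const service of Services ?? []) {
    const errorCount = service.SummaryStatistics?.ErrorStatistics?.TotalCount ?? 0;
    if (errorCount > 0) {
      console.log(`${service.Name}: ${errorCount} errors in the last hour`);
    }
  }
}
```
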
### Distributed Tracing Best Practices

1. **Enable tracing everywhere**: Lambda, API Gateway, Step Functions
2. **Use annotations for filtering**: Indexed fields for queries
3. **Use metadata for details**: Non-indexed detailed information
4. **Sample appropriately**: 100% for low traffic, sampled for high traffic
5. **Correlate with logs**: Include trace ID in log entries

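To make item 4 concrete, sampling can be managed centrally with an X-Ray sampling rule. A minimal CDK sketch using the L1 `CfnSamplingRule` construct; the rule name, rate, and matching values are illustrative assumptions:

```typescript
import * as xray from 'aws-cdk-lib/aws-xray';

// Sample 5% of requests hitting the order service, with a small guaranteed
// reservoir per second, instead of tracing 100% of high-volume traffic.
new xray.CfnSamplingRule(this, 'OrderServiceSampling', {
  samplingRule: {
    ruleName: 'order-service-5pct',
    priority: 100,
    fixedRate: 0.05,    // sample 5% of requests beyond the reservoir
    reservoirSize: 1,   // always trace at least 1 request per second
    serviceName: 'OrderService',
    serviceType: '*',
    host: '*',
    httpMethod: '*',
    urlPath: '*',
    resourceArn: '*',
    version: 1,
  },
});
```
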
## Unified Observability

### Correlation Between Pillars

**Include trace ID in logs**:

```typescript
export const handler = async (event: any, context: Context) => {
  const traceId = process.env._X_AMZN_TRACE_ID;

  console.log(JSON.stringify({
    level: 'INFO',
    message: 'Processing order',
    traceId,
    requestId: context.awsRequestId,
    orderId: event.orderId,
  }));
};
```

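The `_X_AMZN_TRACE_ID` value is a semicolon-delimited header (`Root=...;Parent=...;Sampled=...`). If you only want the root trace ID for correlation, a small helper can extract it; a sketch based on that documented header format:

```typescript
// Extracts the root trace ID from a trace header such as
// "Root=1-63441c4a-abcdef012345678912345678;Parent=1234abcd;Sampled=1"
export function extractRootTraceId(traceHeader?: string): string | undefined {
  if (!traceHeader) {
    return undefined;
  }
  const rootPart = traceHeader
    .split(';')
    .map((part) => part.trim())
    .find((part) => part.startsWith('Root='));
  return rootPart?.substring('Root='.length);
}

// Usage: include only the root trace ID in each structured log entry
const rootTraceId = extractRootTraceId(process.env._X_AMZN_TRACE_ID);
console.log(JSON.stringify({ level: 'INFO', message: 'Processing order', traceId: rootTraceId }));
```
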
### CloudWatch ServiceLens

**Unified view of traces and metrics**:

- Automatically correlates X-Ray traces with CloudWatch metrics
- Shows service map with metrics overlay
- Identifies performance and availability issues
- Provides end-to-end request view

### Lambda Powertools Integration

**All three pillars in one**:

```typescript
import { Logger } from '@aws-lambda-powertools/logger';
import { Tracer } from '@aws-lambda-powertools/tracer';
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';

const logger = new Logger({ serviceName: 'OrderService' });
const tracer = new Tracer({ serviceName: 'OrderService' });
const metrics = new Metrics({ namespace: 'MyApp', serviceName: 'OrderService' });

export const handler = async (event: any, context: Context) => {
  // Automatically adds Lambda context to logs
  logger.addContext(context);

  logger.info('Processing order', { orderId: event.orderId });

  // Add trace annotations
  tracer.putAnnotation('orderId', event.orderId);

  // Add metrics
  metrics.addMetric('Invocation', MetricUnits.Count, 1);

  const startTime = Date.now();

  try {
    const result = await processOrder(event);

    metrics.addMetric('Success', MetricUnits.Count, 1);
    metrics.addMetric('Duration', MetricUnits.Milliseconds, Date.now() - startTime);

    logger.info('Order processed', { orderId: event.orderId });

    return result;
  } catch (error) {
    metrics.addMetric('Error', MetricUnits.Count, 1);
    logger.error('Processing failed', { orderId: event.orderId, error });
    throw error;
  } finally {
    metrics.publishStoredMetrics();
  }
};
```

## Alerting

### Effective Alerting Strategy

**Alert on what matters**:

- **Critical**: Customer-impacting issues (errors, high latency)
- **Warning**: Approaching thresholds (80% capacity)
- **Info**: Trends and anomalies (cost spikes)

**Alarm fatigue prevention**:

- Tune thresholds based on actual patterns
- Use composite alarms to reduce noise
- Set appropriate evaluation periods
- Include clear remediation steps

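Alarms only reduce fatigue if they reach the right audience with enough context to act. One common pattern is to fan critical alarms out through a dedicated SNS topic and put the runbook link in the alarm description. A minimal CDK sketch, assuming the `fn` Lambda function from the earlier snippets; the topic name, threshold, and runbook URL are illustrative:

```typescript
import * as sns from 'aws-cdk-lib/aws-sns';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import * as cw_actions from 'aws-cdk-lib/aws-cloudwatch-actions';

// Dedicated topic for critical, customer-impacting alarms
// (warning-level alarms would go to a separate, lower-urgency topic)
const criticalTopic = new sns.Topic(this, 'CriticalAlerts');

const criticalErrorAlarm = new cloudwatch.Alarm(this, 'CriticalErrorAlarm', {
  metric: fn.metricErrors({ statistic: 'Sum' }),
  threshold: 10,
  evaluationPeriods: 1,
  // Put the remediation link where responders will see it
  alarmDescription: 'Error spike in OrderService. Runbook: https://wiki.example.com/runbooks/order-errors',
});

// Notify when the alarm fires and again when it recovers
criticalErrorAlarm.addAlarmAction(new cw_actions.SnsAction(criticalTopic));
criticalErrorAlarm.addOkAction(new cw_actions.SnsAction(criticalTopic));
```
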
### CloudWatch Alarms

**Common alarm patterns**:

```typescript
// Error rate alarm
new cloudwatch.Alarm(this, 'ErrorRateAlarm', {
  metric: new cloudwatch.MathExpression({
    expression: 'errors / invocations * 100',
    usingMetrics: {
      errors: fn.metricErrors({ statistic: 'Sum' }),
      invocations: fn.metricInvocations({ statistic: 'Sum' }),
    },
  }),
  threshold: 1, // 1% error rate
  evaluationPeriods: 2,
  alarmDescription: 'Error rate exceeded 1%',
});

// Latency alarm (p99)
new cloudwatch.Alarm(this, 'LatencyAlarm', {
  metric: fn.metricDuration({
    statistic: 'p99',
    period: Duration.minutes(5),
  }),
  threshold: 1000, // 1 second
  evaluationPeriods: 2,
  alarmDescription: 'p99 latency exceeded 1 second',
});

// Concurrent executions approaching limit
new cloudwatch.Alarm(this, 'ConcurrencyAlarm', {
  metric: fn.metricConcurrentExecutions({
    statistic: 'Maximum',
  }),
  threshold: 800, // 80% of 1000 default limit
  evaluationPeriods: 1,
  alarmDescription: 'Approaching concurrency limit',
});
```

### Composite Alarms

**Reduce alert noise**:

```typescript
const errorAlarm = new cloudwatch.Alarm(this, 'Errors', {
  metric: fn.metricErrors(),
  threshold: 10,
  evaluationPeriods: 1,
});

const throttleAlarm = new cloudwatch.Alarm(this, 'Throttles', {
  metric: fn.metricThrottles(),
  threshold: 5,
  evaluationPeriods: 1,
});

const latencyAlarm = new cloudwatch.Alarm(this, 'Latency', {
  metric: fn.metricDuration({ statistic: 'p99' }),
  threshold: 2000,
  evaluationPeriods: 2,
});

// Composite alarm (any of the above)
new cloudwatch.CompositeAlarm(this, 'ServiceHealthAlarm', {
  compositeAlarmName: 'order-service-health',
  alarmRule: cloudwatch.AlarmRule.anyOf(
    errorAlarm,
    throttleAlarm,
    latencyAlarm
  ),
  alarmDescription: 'Overall service health degraded',
});
```

## Dashboard Best Practices

### Service Dashboard Layout

**Recommended sections**:

1. **Overview**:
   - Total invocations
   - Error rate percentage
   - P50, P95, P99 latency
   - Availability percentage

2. **Resource Utilization**:
   - Concurrent executions
   - Memory utilization
   - Duration distribution
   - Throttles

3. **Business Metrics**:
   - Orders processed
   - Revenue per minute
   - Customer activity
   - Feature usage

4. **Errors and Alerts**:
   - Error count by type
   - Active alarms
   - DLQ message count
   - Failed transactions

### CloudWatch Dashboard CDK

```typescript
const dashboard = new cloudwatch.Dashboard(this, 'ServiceDashboard', {
  dashboardName: 'order-service',
});

// Row 1: Overview
dashboard.addWidgets(
  new cloudwatch.GraphWidget({
    title: 'Invocations',
    left: [fn.metricInvocations()],
  }),
  new cloudwatch.SingleValueWidget({
    title: 'Error Rate',
    metrics: [
      new cloudwatch.MathExpression({
        expression: 'errors / invocations * 100',
        usingMetrics: {
          errors: fn.metricErrors({ statistic: 'Sum' }),
          invocations: fn.metricInvocations({ statistic: 'Sum' }),
        },
      }),
    ],
  }),
  new cloudwatch.GraphWidget({
    title: 'Latency (p50, p95, p99)',
    left: [
      fn.metricDuration({ statistic: 'p50', label: 'p50' }),
      fn.metricDuration({ statistic: 'p95', label: 'p95' }),
      fn.metricDuration({ statistic: 'p99', label: 'p99' }),
    ],
  })
);

// Row 2: Errors
dashboard.addWidgets(
  new cloudwatch.LogQueryWidget({
    title: 'Recent Errors',
    logGroupNames: [fn.logGroup.logGroupName],
    queryLines: [
      'fields @timestamp, @message',
      'filter level = "ERROR"',
      'sort @timestamp desc',
      'limit 20',
    ],
  })
);
```

## Monitoring Serverless Architectures

### End-to-End Monitoring

**Monitor the entire flow**:

```
API Gateway → Lambda → DynamoDB → EventBridge → Lambda
     ↓           ↓          ↓           ↓           ↓
  Metrics     Traces     Metrics     Metrics      Logs
```

**Key metrics per service**:

| Service | Key Metrics |
|---------|-------------|
| API Gateway | Count, 4XXError, 5XXError, Latency, CacheHitCount |
| Lambda | Invocations, Errors, Duration, Throttles, ConcurrentExecutions |
| DynamoDB | ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, UserErrors, SystemErrors |
| SQS | NumberOfMessagesSent, NumberOfMessagesReceived, ApproximateAgeOfOldestMessage |
| EventBridge | Invocations, FailedInvocations, TriggeredRules |
| Step Functions | ExecutionsStarted, ExecutionsFailed, ExecutionTime |

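Most of these metrics can be alarmed on directly from CDK. For example, the age of the oldest message in a dead-letter queue is an early signal that failed events are piling up unprocessed. A short sketch that creates a DLQ and alarms on that metric; names and thresholds are illustrative:

```typescript
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as cloudwatch from 'aws-cdk-lib/aws-cloudwatch';
import { Duration } from 'aws-cdk-lib';

const dlq = new sqs.Queue(this, 'OrderDlq', {
  retentionPeriod: Duration.days(14),
});

// Alarm when the oldest message in the DLQ has been waiting more than 15 minutes,
// which usually means failed events are not being redriven or investigated
new cloudwatch.Alarm(this, 'DlqAgeAlarm', {
  metric: dlq.metricApproximateAgeOfOldestMessage({
    statistic: 'Maximum',
    period: Duration.minutes(5),
  }),
  threshold: 15 * 60, // seconds
  evaluationPeriods: 1,
  alarmDescription: 'Messages are accumulating in the order DLQ',
});
```
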
### Synthetic Monitoring

**Use CloudWatch Synthetics for API monitoring**:

```typescript
import { Canary, Test, Code, Runtime, Schedule } from '@aws-cdk/aws-synthetics-alpha';

new Canary(this, 'ApiCanary', {
  canaryName: 'api-health-check',
  schedule: Schedule.rate(Duration.minutes(5)),
  runtime: Runtime.SYNTHETICS_NODEJS_PUPPETEER_6_2,
  test: Test.custom({
    code: Code.fromInline(`
      const synthetics = require('Synthetics');

      const apiCanaryBlueprint = async function () {
        // Fail the step (and the canary) if the health check does not return HTTP 200
        await synthetics.executeHttpStep('Verify API', {
          hostname: 'api.example.com',
          path: '/health',
          method: 'GET',
          protocol: 'https:',
        }, async (response) => {
          if (response.statusCode !== 200) {
            throw new Error('Health check failed with status ' + response.statusCode);
          }
        });
      };

      exports.handler = async () => {
        return await apiCanaryBlueprint();
      };
    `),
    handler: 'index.handler',
  }),
});
```

## OpenTelemetry Integration

### Amazon Distro for OpenTelemetry (ADOT)

**Use ADOT for vendor-neutral observability**:

```typescript
// Lambda Layer with ADOT
const adotLayer = lambda.LayerVersion.fromLayerVersionArn(
  this,
  'AdotLayer',
  `arn:aws:lambda:${this.region}:901920570463:layer:aws-otel-nodejs-amd64-ver-1-18-1:4`
);

new NodejsFunction(this, 'Function', {
  entry: 'src/handler.ts',
  layers: [adotLayer],
  tracing: lambda.Tracing.ACTIVE,
  environment: {
    AWS_LAMBDA_EXEC_WRAPPER: '/opt/otel-handler',
    OPENTELEMETRY_COLLECTOR_CONFIG_FILE: '/var/task/collector.yaml',
  },
});
```

**Benefits of ADOT**:

- Vendor-neutral (works with Datadog, New Relic, Honeycomb, etc.)
- Automatic instrumentation
- Consistent format across services
- Export to multiple backends

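With the ADOT layer handling export, application code can stay on the vendor-neutral OpenTelemetry API. A minimal sketch, assuming the `@opentelemetry/api` package and the `processOrder` business function from earlier examples; the span and attribute names are illustrative:

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

// The tracer provider is supplied by the ADOT layer's auto-instrumentation,
// so this code depends only on the vendor-neutral API
const tracer = trace.getTracer('order-service');

export async function processOrderTraced(order: { orderId: string }) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', order.orderId);
      const result = await processOrder(order);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.recordException(err as Error);
      span.setStatus({ code: SpanStatusCode.ERROR });
      throw err;
    } finally {
      span.end();
    }
  });
}
```
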
## Best Practices Summary

### Metrics

- ✅ Use CloudWatch Embedded Metric Format (EMF)
- ✅ Track business metrics, not just technical metrics
- ✅ Set alarms on error rate, latency, and throughput
- ✅ Use p99 for latency, not average
- ✅ Create dashboards for key services

### Logging

- ✅ Use structured JSON logging
- ✅ Include correlation IDs (request ID, trace ID)
- ✅ Use appropriate log levels
- ✅ Never log sensitive data (PII, secrets)
- ✅ Use CloudWatch Logs Insights for analysis

### Tracing

- ✅ Enable X-Ray tracing on all services
- ✅ Instrument AWS SDK calls
- ✅ Add custom annotations for business context
- ✅ Use service map to understand dependencies
- ✅ Correlate traces with logs and metrics

### Alerting

- ✅ Alert on customer-impacting issues
- ✅ Tune thresholds to reduce false positives
- ✅ Use composite alarms to reduce noise
- ✅ Include clear remediation steps
- ✅ Escalate critical alarms appropriately

### Tools

- ✅ Use Lambda Powertools for unified observability
- ✅ Use CloudWatch ServiceLens for service view
- ✅ Use Synthetics for proactive monitoring
- ✅ Consider ADOT for vendor-neutral observability