# Serverless Observability Best Practices
Comprehensive observability patterns for serverless applications based on AWS best practices.
## Table of Contents
- [Three Pillars of Observability](#three-pillars-of-observability)
- [Metrics](#metrics)
- [Logging](#logging)
- [Tracing](#tracing)
- [Unified Observability](#unified-observability)
- [Alerting](#alerting)
- [Dashboard Best Practices](#dashboard-best-practices)
- [Monitoring Serverless Architectures](#monitoring-serverless-architectures)
- [OpenTelemetry Integration](#opentelemetry-integration)
- [Best Practices Summary](#best-practices-summary)
## Three Pillars of Observability
### Metrics
**Numeric data measured at intervals (time series)**
- Request rate, error rate, duration
- CPU%, memory%, disk%
- Custom business metrics
- Service Level Indicators (SLIs)
### Logs
**Timestamped records of discrete events**
- Application events and errors
- State transformations
- Debugging information
- Audit trails
### Traces
**A single request's journey across services**
- Request flow through distributed system
- Service dependencies
- Latency breakdown
- Error propagation
## Metrics
### CloudWatch Metrics for Lambda
**Out-of-the-box metrics** (automatically available):
```
- Invocations
- Errors
- Throttles
- Duration
- ConcurrentExecutions
- IteratorAge (for streams)
```
**CDK Configuration**:
```typescript
const fn = new NodejsFunction(this, 'Function', {
entry: 'src/handler.ts',
});
// Create alarms on metrics
new cloudwatch.Alarm(this, 'ErrorAlarm', {
metric: fn.metricErrors({
statistic: 'Sum',
period: Duration.minutes(5),
}),
threshold: 10,
evaluationPeriods: 1,
});
new cloudwatch.Alarm(this, 'DurationAlarm', {
metric: fn.metricDuration({
statistic: 'p99',
period: Duration.minutes(5),
}),
threshold: 1000, // 1 second
evaluationPeriods: 2,
});
```
### Custom Metrics
**Use CloudWatch Embedded Metric Format (EMF)**:
```typescript
export const handler = async (event: any) => {
const startTime = Date.now();
try {
const result = await processOrder(event);
// Emit custom metrics
console.log(JSON.stringify({
_aws: {
Timestamp: Date.now(),
CloudWatchMetrics: [{
Namespace: 'MyApp/Orders',
Dimensions: [['ServiceName', 'Operation']],
Metrics: [
{ Name: 'ProcessingTime', Unit: 'Milliseconds' },
{ Name: 'OrderValue', Unit: 'None' },
],
}],
},
ServiceName: 'OrderService',
Operation: 'ProcessOrder',
ProcessingTime: Date.now() - startTime,
OrderValue: result.amount,
}));
return result;
} catch (error) {
// Emit error metric
console.log(JSON.stringify({
_aws: {
Timestamp: Date.now(),
CloudWatchMetrics: [{
Namespace: 'MyApp/Orders',
Dimensions: [['ServiceName']],
Metrics: [{ Name: 'Errors', Unit: 'Count' }],
}],
},
ServiceName: 'OrderService',
Errors: 1,
}));
throw error;
}
};
```
**Using Lambda Powertools**:
```typescript
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';
const metrics = new Metrics({
namespace: 'MyApp',
serviceName: 'OrderService',
});
export const handler = async (event: any) => {
metrics.addMetric('Invocation', MetricUnits.Count, 1);
const startTime = Date.now();
try {
const result = await processOrder(event);
metrics.addMetric('Success', MetricUnits.Count, 1);
metrics.addMetric('ProcessingTime', MetricUnits.Milliseconds, Date.now() - startTime);
metrics.addMetric('OrderValue', MetricUnits.None, result.amount);
return result;
} catch (error) {
metrics.addMetric('Error', MetricUnits.Count, 1);
throw error;
} finally {
metrics.publishStoredMetrics();
}
};
```
## Logging
### Structured Logging
**Use JSON format for logs**:
```typescript
// ✅ GOOD - Structured JSON logging
export const handler = async (event: any, context: Context) => {
const startTime = Date.now();
console.log(JSON.stringify({
level: 'INFO',
message: 'Processing order',
orderId: event.orderId,
customerId: event.customerId,
timestamp: new Date().toISOString(),
requestId: context.awsRequestId,
}));
try {
const result = await processOrder(event);
console.log(JSON.stringify({
level: 'INFO',
message: 'Order processed successfully',
orderId: event.orderId,
duration: Date.now() - startTime,
timestamp: new Date().toISOString(),
}));
return result;
} catch (error) {
console.error(JSON.stringify({
level: 'ERROR',
message: 'Order processing failed',
orderId: event.orderId,
error: {
name: error.name,
message: error.message,
stack: error.stack,
},
timestamp: new Date().toISOString(),
}));
throw error;
}
};
// ❌ BAD - Unstructured logging
console.log('Processing order ' + orderId + ' for customer ' + customerId);
```
**Using Lambda Powertools Logger**:
```typescript
import { Logger } from '@aws-lambda-powertools/logger';
const logger = new Logger({
serviceName: 'OrderService',
logLevel: 'INFO',
});
export const handler = async (event: any, context: Context) => {
logger.addContext(context);
logger.info('Processing order', {
orderId: event.orderId,
customerId: event.customerId,
});
try {
const result = await processOrder(event);
logger.info('Order processed', {
orderId: event.orderId,
amount: result.amount,
});
return result;
} catch (error) {
logger.error('Order processing failed', {
orderId: event.orderId,
error,
});
throw error;
}
};
```
### Log Levels
**Use appropriate log levels**:
- **ERROR**: Errors requiring immediate attention
- **WARN**: Warnings or recoverable errors
- **INFO**: Important business events
- **DEBUG**: Detailed debugging information (disable in production)
```typescript
const logger = new Logger({
serviceName: 'OrderService',
logLevel: process.env.LOG_LEVEL || 'INFO',
});
logger.debug('Detailed processing info', { data });
logger.info('Business event occurred', { event });
logger.warn('Recoverable error', { error });
logger.error('Critical failure', { error });
```
### Log Insights Queries
**Common CloudWatch Logs Insights queries**:
```
# Find errors in last hour
fields @timestamp, @message, level, error.message
| filter level = "ERROR"
| sort @timestamp desc
| limit 100
# Count errors by type
stats count(*) as errorCount by error.name
| sort errorCount desc
# Calculate p99 latency
stats percentile(duration, 99) by serviceName
# Find slow requests
fields @timestamp, orderId, duration
| filter duration > 1000
| sort duration desc
| limit 50
# Track specific customer requests
fields @timestamp, @message, orderId
| filter customerId = "customer-123"
| sort @timestamp desc
```
## Tracing
### Enable X-Ray Tracing
**Configure X-Ray for Lambda**:
```typescript
const fn = new NodejsFunction(this, 'Function', {
entry: 'src/handler.ts',
tracing: lambda.Tracing.ACTIVE, // Enable X-Ray
});
// API Gateway tracing
const api = new apigateway.RestApi(this, 'Api', {
deployOptions: {
tracingEnabled: true,
},
});
// Step Functions tracing
new stepfunctions.StateMachine(this, 'StateMachine', {
definition,
tracingEnabled: true,
});
```
**Instrument application code**:
```typescript
import { captureAWSv3Client } from 'aws-xray-sdk-core';
import { DynamoDBClient } from '@aws-sdk/client-dynamodb';
// Wrap AWS SDK clients
const client = captureAWSv3Client(new DynamoDBClient({}));
// Custom segments
import AWSXRay from 'aws-xray-sdk-core';
export const handler = async (event: any) => {
const segment = AWSXRay.getSegment();
// Custom subsegment
const subsegment = segment.addNewSubsegment('ProcessOrder');
try {
// Add annotations (indexed for filtering)
subsegment.addAnnotation('orderId', event.orderId);
subsegment.addAnnotation('customerId', event.customerId);
// Add metadata (not indexed, detailed info)
subsegment.addMetadata('orderDetails', event);
const result = await processOrder(event);
subsegment.addAnnotation('status', 'success');
subsegment.close();
return result;
} catch (error) {
subsegment.addError(error);
subsegment.close();
throw error;
}
};
```
**Using Lambda Powertools Tracer**:
```typescript
import { Tracer } from '@aws-lambda-powertools/tracer';
import { DynamoDB } from '@aws-sdk/client-dynamodb';

const tracer = new Tracer({ serviceName: 'OrderService' });

// Wrap the SDK client once at init so every call is traced as a subsegment
const dynamodb = tracer.captureAWSv3Client(new DynamoDB({}));

export const handler = async (event: any) => {
  const result = await dynamodb.getItem({
    TableName: process.env.TABLE_NAME!,
    Key: { orderId: { S: event.orderId } },
  });

  // Custom annotation (indexed) and metadata (not indexed)
  tracer.putAnnotation('orderId', event.orderId);
  tracer.putMetadata('orderDetails', event);

  return result;
};
```
### Service Map
**Visualize service dependencies** with X-Ray:
- Shows service-to-service communication
- Identifies latency bottlenecks
- Highlights error rates between services
- Tracks downstream dependencies
### Distributed Tracing Best Practices
1. **Enable tracing everywhere**: Lambda, API Gateway, Step Functions
2. **Use annotations for filtering**: Indexed fields for queries (see the filter expression examples after this list)
3. **Use metadata for details**: Non-indexed detailed information
4. **Sample appropriately**: 100% for low traffic, sampled for high traffic
5. **Correlate with logs**: Include trace ID in log entries
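**Querying traces by annotation** (illustrative only): annotations such as `orderId` and `customerId` from the handlers above can be used in X-Ray filter expressions, either in the console trace search or via the `GetTraceSummaries` API. Each entry below is a separate expression:
```
# Traces for a specific order (annotations are indexed)
annotation.orderId = "order-123"
# Traces for one customer that took longer than 1 second end to end
annotation.customerId = "customer-123" AND responsetime > 1
```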
## Unified Observability
### Correlation Between Pillars
**Include trace ID in logs**:
```typescript
export const handler = async (event: any, context: Context) => {
const traceId = process.env._X_AMZN_TRACE_ID;
console.log(JSON.stringify({
level: 'INFO',
message: 'Processing order',
traceId,
requestId: context.awsRequestId,
orderId: event.orderId,
}));
};
```
### CloudWatch ServiceLens
**Unified view of traces and metrics**:
- Automatically correlates X-Ray traces with CloudWatch metrics
- Shows service map with metrics overlay
- Identifies performance and availability issues
- Provides end-to-end request view
### Lambda Powertools Integration
**All three pillars in one**:
```typescript
import { Logger } from '@aws-lambda-powertools/logger';
import { Tracer } from '@aws-lambda-powertools/tracer';
import { Metrics, MetricUnits } from '@aws-lambda-powertools/metrics';
const logger = new Logger({ serviceName: 'OrderService' });
const tracer = new Tracer({ serviceName: 'OrderService' });
const metrics = new Metrics({ namespace: 'MyApp', serviceName: 'OrderService' });
export const handler = async (event: any, context: Context) => {
// Automatically adds trace context to logs
logger.addContext(context);
logger.info('Processing order', { orderId: event.orderId });
// Add trace annotations
tracer.putAnnotation('orderId', event.orderId);
// Add metrics
metrics.addMetric('Invocation', MetricUnits.Count, 1);
const startTime = Date.now();
try {
const result = await processOrder(event);
metrics.addMetric('Success', MetricUnits.Count, 1);
metrics.addMetric('Duration', MetricUnits.Milliseconds, Date.now() - startTime);
logger.info('Order processed', { orderId: event.orderId });
return result;
} catch (error) {
metrics.addMetric('Error', MetricUnits.Count, 1);
logger.error('Processing failed', { orderId: event.orderId, error });
throw error;
} finally {
metrics.publishStoredMetrics();
}
};
```
## Alerting
### Effective Alerting Strategy
**Alert on what matters**:
- **Critical**: Customer-impacting issues (errors, high latency)
- **Warning**: Approaching thresholds (80% capacity)
- **Info**: Trends and anomalies (cost spikes)
**Alarm fatigue prevention**:
- Tune thresholds based on actual patterns
- Use composite alarms to reduce noise
- Set appropriate evaluation periods
- Include clear remediation steps
### CloudWatch Alarms
**Common alarm patterns**:
```typescript
// Error rate alarm
new cloudwatch.Alarm(this, 'ErrorRateAlarm', {
metric: new cloudwatch.MathExpression({
expression: 'errors / invocations * 100',
usingMetrics: {
errors: fn.metricErrors({ statistic: 'Sum' }),
invocations: fn.metricInvocations({ statistic: 'Sum' }),
},
}),
threshold: 1, // 1% error rate
evaluationPeriods: 2,
alarmDescription: 'Error rate exceeded 1%',
});
// Latency alarm (p99)
new cloudwatch.Alarm(this, 'LatencyAlarm', {
metric: fn.metricDuration({
statistic: 'p99',
period: Duration.minutes(5),
}),
threshold: 1000, // 1 second
evaluationPeriods: 2,
alarmDescription: 'p99 latency exceeded 1 second',
});
// Concurrent executions approaching limit
new cloudwatch.Alarm(this, 'ConcurrencyAlarm', {
metric: fn.metricConcurrentExecutions({
statistic: 'Maximum',
}),
threshold: 800, // 80% of 1000 default limit
evaluationPeriods: 1,
alarmDescription: 'Approaching concurrency limit',
});
```
### Composite Alarms
**Reduce alert noise**:
```typescript
const errorAlarm = new cloudwatch.Alarm(this, 'Errors', {
metric: fn.metricErrors(),
threshold: 10,
evaluationPeriods: 1,
});
const throttleAlarm = new cloudwatch.Alarm(this, 'Throttles', {
metric: fn.metricThrottles(),
threshold: 5,
evaluationPeriods: 1,
});
const latencyAlarm = new cloudwatch.Alarm(this, 'Latency', {
metric: fn.metricDuration({ statistic: 'p99' }),
threshold: 2000,
evaluationPeriods: 2,
});
// Composite alarm (any of the above)
new cloudwatch.CompositeAlarm(this, 'ServiceHealthAlarm', {
compositeAlarmName: 'order-service-health',
alarmRule: cloudwatch.AlarmRule.anyOf(
errorAlarm,
throttleAlarm,
latencyAlarm
),
alarmDescription: 'Overall service health degraded',
});
```
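**Route alarms to people**: a minimal sketch of notifying an SNS topic when the composite alarm above fires, assuming it is captured in a `compositeAlarm` variable (the topic name and email address are placeholders):
```typescript
import * as sns from 'aws-cdk-lib/aws-sns';
import * as subscriptions from 'aws-cdk-lib/aws-sns-subscriptions';
import * as cw_actions from 'aws-cdk-lib/aws-cloudwatch-actions';

// Topic that fans out to the on-call channel
const alertTopic = new sns.Topic(this, 'AlertTopic', {
  topicName: 'order-service-alerts',
});
alertTopic.addSubscription(
  new subscriptions.EmailSubscription('oncall@example.com')
);

// Notify when the alarm fires and again when it recovers
compositeAlarm.addAlarmAction(new cw_actions.SnsAction(alertTopic));
compositeAlarm.addOkAction(new cw_actions.SnsAction(alertTopic));
```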
## Dashboard Best Practices
### Service Dashboard Layout
**Recommended sections**:
1. **Overview**:
- Total invocations
- Error rate percentage
- P50, P95, P99 latency
- Availability percentage
2. **Resource Utilization**:
- Concurrent executions
- Memory utilization
- Duration distribution
- Throttles
3. **Business Metrics**:
- Orders processed
- Revenue per minute
- Customer activity
- Feature usage
4. **Errors and Alerts**:
- Error count by type
- Active alarms
- DLQ message count
- Failed transactions
### CloudWatch Dashboard CDK
```typescript
const dashboard = new cloudwatch.Dashboard(this, 'ServiceDashboard', {
dashboardName: 'order-service',
});
dashboard.addWidgets(
// Row 1: Overview
new cloudwatch.GraphWidget({
title: 'Invocations',
left: [fn.metricInvocations()],
}),
new cloudwatch.SingleValueWidget({
title: 'Error Rate',
metrics: [
new cloudwatch.MathExpression({
expression: 'errors / invocations * 100',
usingMetrics: {
errors: fn.metricErrors({ statistic: 'Sum' }),
invocations: fn.metricInvocations({ statistic: 'Sum' }),
},
}),
],
}),
new cloudwatch.GraphWidget({
title: 'Latency (p50, p95, p99)',
left: [
fn.metricDuration({ statistic: 'p50', label: 'p50' }),
fn.metricDuration({ statistic: 'p95', label: 'p95' }),
fn.metricDuration({ statistic: 'p99', label: 'p99' }),
],
})
);
// Row 2: Errors
dashboard.addWidgets(
new cloudwatch.LogQueryWidget({
title: 'Recent Errors',
logGroupNames: [fn.logGroup.logGroupName],
queryLines: [
'fields @timestamp, @message',
'filter level = "ERROR"',
'sort @timestamp desc',
'limit 20',
],
})
);
```
## Monitoring Serverless Architectures
### End-to-End Monitoring
**Monitor the entire flow**:
```
API Gateway → Lambda → DynamoDB → EventBridge → Lambda
     ↓          ↓          ↓           ↓           ↓
  Metrics    Traces    Metrics     Metrics       Logs
```
**Key metrics per service**:
| Service | Key Metrics |
|---------|-------------|
| API Gateway | Count, 4XXError, 5XXError, Latency, CacheHitCount |
| Lambda | Invocations, Errors, Duration, Throttles, ConcurrentExecutions |
| DynamoDB | ConsumedReadCapacityUnits, ConsumedWriteCapacityUnits, UserErrors, SystemErrors |
| SQS | NumberOfMessagesSent, NumberOfMessagesReceived, ApproximateAgeOfOldestMessage |
| EventBridge | Invocations, FailedInvocations, TriggeredRules |
| Step Functions | ExecutionsStarted, ExecutionsFailed, ExecutionTime |
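**Alarming beyond Lambda** (a hedged sketch): the same CDK alarm pattern applies to the other services in the table. The example assumes `queue` is an `sqs.Queue` and `api` is an `apigateway.RestApi` defined elsewhere in the stack:
```typescript
// Oldest message has waited more than 5 minutes (backlog building up)
new cloudwatch.Alarm(this, 'QueueBacklogAlarm', {
  metric: queue.metricApproximateAgeOfOldestMessage({
    statistic: 'Maximum',
    period: Duration.minutes(5),
  }),
  threshold: 300, // seconds
  evaluationPeriods: 1,
  alarmDescription: 'Messages are aging in the queue',
});

// API Gateway is returning 5XX responses
new cloudwatch.Alarm(this, 'Api5xxAlarm', {
  metric: api.metricServerError({
    statistic: 'Sum',
    period: Duration.minutes(5),
  }),
  threshold: 5,
  evaluationPeriods: 1,
  alarmDescription: 'API is returning server errors',
});
```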
### Synthetic Monitoring
**Use CloudWatch Synthetics for API monitoring**:
```typescript
import { Canary, Code, Runtime, Schedule, Test } from '@aws-cdk/aws-synthetics-alpha';
new Canary(this, 'ApiCanary', {
canaryName: 'api-health-check',
schedule: Schedule.rate(Duration.minutes(5)),
test: Test.custom({
code: Code.fromInline(`
const synthetics = require('Synthetics');
const apiCanaryBlueprint = async function () {
const response = await synthetics.executeHttpStep('Verify API', {
url: 'https://api.example.com/health',
method: 'GET',
});
return response.statusCode === 200 ? 'success' : 'failure';
};
exports.handler = async () => {
return await apiCanaryBlueprint();
};
`),
handler: 'index.handler',
}),
runtime: Runtime.SYNTHETICS_NODEJS_PUPPETEER_6_2,
});
```
## OpenTelemetry Integration
### Amazon Distro for OpenTelemetry (ADOT)
**Use ADOT for vendor-neutral observability**:
```typescript
// Lambda Layer with ADOT
const adotLayer = lambda.LayerVersion.fromLayerVersionArn(
this,
'AdotLayer',
`arn:aws:lambda:${this.region}:901920570463:layer:aws-otel-nodejs-amd64-ver-1-18-1:4`
);
new NodejsFunction(this, 'Function', {
entry: 'src/handler.ts',
layers: [adotLayer],
tracing: lambda.Tracing.ACTIVE,
environment: {
AWS_LAMBDA_EXEC_WRAPPER: '/opt/otel-handler',
OPENTELEMETRY_COLLECTOR_CONFIG_FILE: '/var/task/collector.yaml',
},
});
```
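**Collector configuration**: the `OPENTELEMETRY_COLLECTOR_CONFIG_FILE` variable above points at a config file packaged with the function. A minimal sketch of such a `collector.yaml`, receiving OTLP data and exporting traces to X-Ray; treat it as an illustrative starting point, since receivers and exporters depend on the chosen backend:
```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:

exporters:
  awsxray:

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [awsxray]
```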
**Benefits of ADOT**:
- Vendor-neutral (works with Datadog, New Relic, Honeycomb, etc.)
- Automatic instrumentation
- Consistent format across services
- Export to multiple backends
## Best Practices Summary
### Metrics
- ✅ Use CloudWatch Embedded Metric Format (EMF)
- ✅ Track business metrics, not just technical metrics
- ✅ Set alarms on error rate, latency, and throughput
- ✅ Use p99 for latency, not average
- ✅ Create dashboards for key services
### Logging
- ✅ Use structured JSON logging
- ✅ Include correlation IDs (request ID, trace ID)
- ✅ Use appropriate log levels
- ✅ Never log sensitive data (PII, secrets)
- ✅ Use CloudWatch Logs Insights for analysis
### Tracing
- ✅ Enable X-Ray tracing on all services
- ✅ Instrument AWS SDK calls
- ✅ Add custom annotations for business context
- ✅ Use service map to understand dependencies
- ✅ Correlate traces with logs and metrics
### Alerting
- ✅ Alert on customer-impacting issues
- ✅ Tune thresholds to reduce false positives
- ✅ Use composite alarms to reduce noise
- ✅ Include clear remediation steps
- ✅ Escalate critical alarms appropriately
### Tools
- ✅ Use Lambda Powertools for unified observability
- ✅ Use CloudWatch ServiceLens for service view
- ✅ Use Synthetics for proactive monitoring
- ✅ Consider ADOT for vendor-neutral observability