# Logging Guide

## Structured Logging

### Why Structured Logs?

**Unstructured** (text):

```
2024-10-28 14:32:15 User john@example.com logged in from 192.168.1.1
```

**Structured** (JSON):

```json
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "info",
  "message": "User logged in",
  "user": "john@example.com",
  "ip": "192.168.1.1",
  "event_type": "user_login"
}
```

**Benefits**:

- Easy to parse and query
- Consistent format across services
- Machine-readable
- Efficient storage and indexing
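
To illustrate the first benefit: structured entries can be filtered with a few lines of code instead of brittle regexes. A minimal sketch, assuming JSON-lines logs in a file named `app.log` (the path and field names are illustrative):

```python
import json

# Load one JSON event per line, skipping blank lines
with open("app.log") as f:
    events = [json.loads(line) for line in f if line.strip()]

# Filter on discrete fields -- no regex needed
logins = [e for e in events
          if e.get("event_type") == "user_login"
          and e.get("user") == "john@example.com"]
```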

---

## Log Levels

Use appropriate log levels for better filtering and alerting.

### DEBUG

**When**: Development, troubleshooting

**Examples**:

- Function entry/exit
- Variable values
- Internal state changes

```python
logger.debug("Processing request", extra={
    "request_id": req_id,
    "params": params
})
```

### INFO

**When**: Important business events

**Examples**:

- User actions (login, purchase)
- System state changes (started, stopped)
- Significant milestones

```python
logger.info("Order placed", extra={
    "order_id": "12345",
    "user_id": "user123",
    "amount": 99.99
})
```

### WARN

**When**: Potentially problematic situations

**Examples**:

- Deprecated API usage
- Slow operations (that still succeed)
- Retry attempts
- Resource usage approaching limits

```python
logger.warning("API response slow", extra={
    "endpoint": "/api/users",
    "duration_ms": 2500,
    "threshold_ms": 1000
})
```

### ERROR

**When**: Error conditions that need attention

**Examples**:

- Failed requests
- Exceptions caught and handled
- Integration failures
- Data validation errors

```python
# Inside an except block, where `e` is the caught exception
logger.error("Payment processing failed", extra={
    "order_id": "12345",
    "error": str(e),
    "payment_gateway": "stripe"
}, exc_info=True)
```

### FATAL/CRITICAL

**When**: Severe errors causing shutdown

**Examples**:

- Database connection lost
- Out of memory
- Configuration errors preventing startup

```python
logger.critical("Database connection lost", extra={
    "database": "postgres",
    "host": "db.example.com",
    "attempt": 3
})
```
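
The active threshold is typically set per environment so that DEBUG noise never reaches production. A minimal sketch using the standard library (the `LOG_LEVEL` environment variable name is an assumption):

```python
import logging
import os

# Read the level threshold from the environment, defaulting to INFO
level_name = os.getenv("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))

logger = logging.getLogger("app")
logger.debug("Hidden unless LOG_LEVEL=DEBUG")
logger.info("Visible at the default level")
```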

---

## Required Fields

Every log entry should include:

### 1. Timestamp

ISO 8601 format with timezone:

```json
{
  "timestamp": "2024-10-28T14:32:15.123Z"
}
```

### 2. Level

Standard levels: debug, info, warn, error, critical

```json
{
  "level": "error"
}
```

### 3. Message

Human-readable description:

```json
{
  "message": "User authentication failed"
}
```

### 4. Service/Application

What component logged this:

```json
{
  "service": "api-gateway",
  "version": "1.2.3"
}
```

### 5. Environment

Where the component is running (development, staging, production):

```json
{
  "environment": "production"
}
```
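
Putting the five required fields together, one complete entry might look like this (values are illustrative):

```json
{
  "timestamp": "2024-10-28T14:32:15.123Z",
  "level": "error",
  "message": "User authentication failed",
  "service": "api-gateway",
  "version": "1.2.3",
  "environment": "production"
}
```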

---

## Recommended Fields

### Request Context

```json
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "session_id": "sess_abc",
  "ip_address": "192.168.1.1",
  "user_agent": "Mozilla/5.0..."
}
```
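
Rather than passing these fields to every log call, they can be bound once per request. A sketch using structlog's contextvars support (the handler shape and attribute names are illustrative, and `structlog.contextvars.merge_contextvars` must be in the processor chain for the bound values to appear):

```python
import uuid

import structlog

def handle_request(request):
    # Bind request-scoped fields once; every log call below inherits them
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        request_id=str(uuid.uuid4()),
        user_id=request.user_id,
        ip_address=request.client_ip,
    )
    logger = structlog.get_logger()
    logger.info("request_started")  # request_id, user_id, ip_address included
```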

### Performance Metrics

```json
{
  "duration_ms": 245,
  "response_size_bytes": 1024
}
```

### Error Details

```json
{
  "error_type": "ValidationError",
  "error_message": "Invalid email format",
  "stack_trace": "...",
  "error_code": "VAL_001"
}
```

### Business Context

```json
{
  "order_id": "ORD-12345",
  "customer_id": "CUST-789",
  "transaction_amount": 99.99,
  "payment_method": "credit_card"
}
```

---

## Implementation Examples

### Python (using structlog)

```python
import structlog

# Configure structured logging before creating loggers
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()

# Usage
logger.info(
    "user_logged_in",
    user_id="user123",
    ip_address="192.168.1.1",
    login_method="oauth"
)
```

### Node.js (using Winston)

```javascript
const winston = require('winston');

const logger = winston.createLogger({
  // Add a timestamp field, then render the record as JSON
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  defaultMeta: { service: 'api-gateway' },
  transports: [
    new winston.transports.Console()
  ]
});

logger.info('User logged in', {
  userId: 'user123',
  ipAddress: '192.168.1.1',
  loginMethod: 'oauth'
});
```

### Go (using zap)

```go
package main

import "go.uber.org/zap"

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	logger.Info("User logged in",
		zap.String("userId", "user123"),
		zap.String("ipAddress", "192.168.1.1"),
		zap.String("loginMethod", "oauth"),
	)
}
```

### Java (using Logback with JSON)

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import net.logstash.logback.argument.StructuredArguments;

public class MyClass {
    private static final Logger logger = LoggerFactory.getLogger(MyClass.class);

    public void logLogin() {
        logger.info("User logged in",
            StructuredArguments.kv("userId", "user123"),
            StructuredArguments.kv("ipAddress", "192.168.1.1"),
            StructuredArguments.kv("loginMethod", "oauth")
        );
    }
}
```

---

## Log Aggregation Patterns

### Pattern 1: ELK Stack (Elasticsearch, Logstash, Kibana)

**Architecture**:

```
Application → Filebeat → Logstash → Elasticsearch → Kibana
```

**filebeat.yml**:

```yaml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    json.keys_under_root: true
    json.add_error_key: true

output.logstash:
  hosts: ["logstash:5044"]
```

**logstash.conf**:

```
input {
  beats {
    port => 5044
  }
}

filter {
  json {
    source => "message"
  }

  date {
    match => ["timestamp", "ISO8601"]
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```

### Pattern 2: Loki (Grafana Loki)

**Architecture**:

```
Application → Promtail → Loki → Grafana
```

**promtail-config.yml**:

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: app
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          __path__: /var/log/app/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
            timestamp: timestamp
      - labels:
          level:
          service:
      - timestamp:
          source: timestamp
          format: RFC3339
```

**Query in Grafana**:

```logql
{job="app"} |= "error" | json | level="error"
```

### Pattern 3: CloudWatch Logs

**CloudWatch agent configuration**:

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app/*.log",
            "log_group_name": "/aws/app/production",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          }
        ]
      }
    }
  }
}
```

**Query with CloudWatch Insights**:

```
fields @timestamp, level, message, user_id
| filter level = "error"
| sort @timestamp desc
| limit 100
```

### Pattern 4: Fluentd/Fluent Bit

**fluent-bit.conf**:

```
[INPUT]
    Name    tail
    Path    /var/log/app/*.log
    Parser  json
    Tag     app.*

[FILTER]
    Name    record_modifier
    Match   *
    Record  hostname ${HOSTNAME}
    Record  cluster production

[OUTPUT]
    Name    es
    Match   *
    Host    elasticsearch
    Port    9200
    Index   app-logs
    Type    _doc
```

---

## Query Patterns

### Find Errors in Time Range

**Elasticsearch**:

```json
GET /app-logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": {
          "gte": "now-1h",
          "lte": "now"
        }}}
      ]
    }
  }
}
```

**Loki (LogQL)**:

```logql
{job="app", level="error"}
```

**CloudWatch Insights**:

```
fields @timestamp, @message
| filter level = "error"
| filter @timestamp > ago(1h)
```

### Count Errors by Type

**Elasticsearch**:

```json
GET /app-logs-*/_search
{
  "size": 0,
  "query": { "match": { "level": "error" } },
  "aggs": {
    "error_types": {
      "terms": { "field": "error_type.keyword" }
    }
  }
}
```

**Loki**:

```logql
sum by (error_type) (count_over_time({job="app", level="error"}[1h]))
```

### Find Slow Requests

**Elasticsearch**:

```json
GET /app-logs-*/_search
{
  "query": {
    "range": { "duration_ms": { "gte": 1000 } }
  },
  "sort": [ { "duration_ms": "desc" } ]
}
```
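
A roughly equivalent Loki query, assuming `duration_ms` is a field in the JSON log body:

```logql
{job="app"} | json | duration_ms > 1000
```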

### Trace Request Through Services

**Elasticsearch** (using request_id):

```json
GET /_search
{
  "query": {
    "match": { "request_id": "550e8400-e29b-41d4-a716-446655440000" }
  },
  "sort": [ { "@timestamp": "asc" } ]
}
```

---

## Sampling and Rate Limiting

### When to Sample

- **High-volume services**: > 10,000 logs/second
- **Debug logs in production**: sample 1-10%
- **Cost optimization**: reduce storage and indexing costs

### Sampling Strategies

**1. Random Sampling**:

```python
import random

if random.random() < 0.1:  # Sample 10%
    logger.debug("Debug message", ...)
```

**2. Rate Limiting** (using a token-bucket limiter such as the sketch below):

```python
limiter = RateLimiter(max_per_second=100)

if limiter.allow():
    logger.info("Rate limited log", ...)
```
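
`RateLimiter` is not a standard-library class; here is a minimal token-bucket sketch that satisfies the interface used above (not thread-safe; wrap `allow()` in a lock for multi-threaded loggers):

```python
import time

class RateLimiter:
    """Token bucket: allows bursts up to max_per_second, refills continuously."""

    def __init__(self, max_per_second: float):
        self.rate = max_per_second
        self.capacity = max_per_second
        self.tokens = max_per_second
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```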

**3. Error-Biased Sampling**:

```python
# Always log errors, sample 1% of successful requests
if level == "error" or random.random() < 0.01:
    logger.log(level, message, ...)
```

**4. Head-Based Sampling** (trace-aware):

```python
# If the trace was sampled at its root, emit every related log
# so that sampled traces always have complete logs
if trace_context.is_sampled():
    logger.info("Traced log", trace_id=trace_context.trace_id)
```

---

## Log Retention

### Retention Strategy

**Hot tier** (fast SSD): 7-30 days

- Recent logs
- Full query performance
- High cost

**Warm tier** (regular disk): 30-90 days

- Older logs
- Slower queries acceptable
- Medium cost

**Cold tier** (object storage): 90+ days

- Archive logs
- Query via restore
- Low cost

### Example: Elasticsearch ILM Policy

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": { "number_of_replicas": 1 },
          "shrink": { "number_of_shards": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": { "require": { "box_type": "cold" } }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```

---

## Security and Compliance

### PII Redaction

**Before logging**:

```python
import re

def redact_pii(data: str) -> str:
    # Redact email addresses
    data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                  '[EMAIL]', data)
    # Redact credit card numbers
    data = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
                  '[CARD]', data)
    # Redact SSNs
    data = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', data)
    return data

logger.info("User data", user_input=redact_pii(user_input))
```

**In Logstash**:

```
filter {
  mutate {
    gsub => [
      "message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]",
      "message", "\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b", "[CARD]"
    ]
  }
}
```

### Access Control

**Elasticsearch** (with Security enabled):

```yaml
# Role for developers: read-only, development logs only
dev_logs:
  indices:
    - names: ['app-logs-*']
      privileges: ['read']
      query: '{"match": {"environment": "development"}}'
```

**CloudWatch** (IAM policy statement):

```json
{
  "Effect": "Allow",
  "Action": [
    "logs:DescribeLogGroups",
    "logs:GetLogEvents",
    "logs:FilterLogEvents"
  ],
  "Resource": "arn:aws:logs:*:*:log-group:/aws/app/production:*"
}
```

---

## Common Pitfalls

### 1. Logging Sensitive Data

❌ `logger.info("Login", password=password)`

✅ `logger.info("Login", user_id=user_id)`

### 2. Excessive Logging

❌ Logging every iteration of a loop

✅ Log aggregate results or sample

### 3. Not Including Context

❌ `logger.error("Failed")`

✅ `logger.error("Payment failed", order_id=order_id, error=str(e))`

### 4. Inconsistent Formats

❌ Mix of JSON and plain text

✅ Pick one format and stick to it

### 5. No Request IDs

❌ Can't trace a request across services

✅ Generate and propagate request_id
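
A minimal sketch of generating and propagating a request ID in HTTP middleware (framework-agnostic; the `X-Request-ID` header name is a common convention, and the request/response objects are illustrative):

```python
import uuid

def request_id_middleware(handler):
    def wrapped(request):
        # Reuse an inbound ID from an upstream service, or mint a new one
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        request.headers["X-Request-ID"] = request_id   # pass downstream
        response = handler(request)
        response.headers["X-Request-ID"] = request_id  # echo to the caller
        return response
    return wrapped
```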

### 6. Logging to Multiple Places

❌ Log to file AND stdout AND syslog

✅ Log to stdout, let the agent handle routing

### 7. Blocking on Log Writes

❌ Synchronous writes to remote systems

✅ Asynchronous buffered writes

---

## Performance Optimization

### 1. Async Logging

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

# The queue decouples log calls from slow handler I/O
log_queue = queue.Queue()

# Application threads only enqueue records (fast, non-blocking)
logger = logging.getLogger("app")
logger.addHandler(QueueHandler(log_queue))

# A background thread drains the queue into the real handlers
handlers = [logging.StreamHandler()]
listener = QueueListener(log_queue, *handlers)
listener.start()
```

### 2. Conditional Logging

```python
# Avoid expensive serialization when DEBUG is disabled
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("Details", data=expensive_serialization(obj))
```

### 3. Batching

```python
# Batch logs before sending to the aggregator
batch = []
for log in logs:
    batch.append(log)
    if len(batch) >= 100:
        send_to_aggregator(batch)
        batch = []
if batch:  # flush the final partial batch
    send_to_aggregator(batch)
```

### 4. Compression

```yaml
# Filebeat with compressed output to Logstash
output.logstash:
  hosts: ["logstash:5044"]
  compression_level: 3
```

---

## Monitoring the Log Pipeline

Track pipeline health with metrics (the metric names below are illustrative):

```promql
# Log ingestion rate
rate(logs_ingested_total[5m])

# Pipeline lag
log_processing_lag_seconds

# Dropped logs
rate(logs_dropped_total[5m])

# Parse error rate
rate(logs_parse_errors_total[5m])
```

Alert on:

- Sudden drop in log volume (service down?)
- High parse error rate (format changed?)
- Pipeline lag > 1 minute (capacity issue?)
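
As a sketch, the first alert could be expressed as a Prometheus alerting rule (the metric name and thresholds are illustrative):

```yaml
groups:
  - name: log-pipeline
    rules:
      - alert: LogVolumeDropped
        # Current ingestion rate is under 50% of the rate one hour ago
        expr: rate(logs_ingested_total[5m]) < 0.5 * rate(logs_ingested_total[5m] offset 1h)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Log volume dropped sharply -- is the service down?"
```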