# Logging Guide
## Structured Logging
### Why Structured Logs?
**Unstructured** (text):
```
2024-10-28 14:32:15 User john@example.com logged in from 192.168.1.1
```
**Structured** (JSON):
```json
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "info",
  "message": "User logged in",
  "user": "john@example.com",
  "ip": "192.168.1.1",
  "event_type": "user_login"
}
```
**Benefits**:
- Easy to parse and query
- Consistent format
- Machine-readable
- Efficient storage and indexing
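
For example, because each record is a single JSON object, a few lines of Python can answer questions that would need fragile regexes against plain text. A minimal sketch, assuming a newline-delimited JSON file and the field names from the record above:

```python
import json

# Count login events for one user in a newline-delimited JSON log file.
# "app.log", "event_type" and "user" mirror the illustrative record above.
with open("app.log") as f:
    entries = [json.loads(line) for line in f if line.strip()]

count = sum(
    1 for entry in entries
    if entry.get("event_type") == "user_login"
    and entry.get("user") == "john@example.com"
)
print(f"user_login events: {count}")
```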
---
## Log Levels
Use appropriate log levels for better filtering and alerting.
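The configured threshold decides which of the levels below are actually emitted. A minimal sketch with Python's standard `logging` module (the logger name and messages are illustrative):

```python
import logging

# Records below the configured level are dropped before reaching any handler.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("orders")

logger.debug("Cache warm-up details")    # suppressed at INFO
logger.info("Order placed")              # emitted
logger.warning("Retrying payment call")  # emitted
```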
### DEBUG
**When**: Development, troubleshooting
**Examples**:
- Function entry/exit
- Variable values
- Internal state changes
```python
logger.debug("Processing request", extra={
    "request_id": req_id,
    "params": params
})
```
### INFO
**When**: Important business events
**Examples**:
- User actions (login, purchase)
- System state changes (started, stopped)
- Significant milestones
```python
logger.info("Order placed", extra={
    "order_id": "12345",
    "user_id": "user123",
    "amount": 99.99
})
```
### WARN
**When**: Potentially problematic situations
**Examples**:
- Deprecated API usage
- Slow operations (but not failing)
- Retry attempts
- Resource usage approaching limits
```python
logger.warning("API response slow", extra={
    "endpoint": "/api/users",
    "duration_ms": 2500,
    "threshold_ms": 1000
})
```
### ERROR
**When**: Error conditions that need attention
**Examples**:
- Failed requests
- Exceptions caught and handled
- Integration failures
- Data validation errors
```python
logger.error("Payment processing failed", extra={
    "order_id": "12345",
    "error": str(e),
    "payment_gateway": "stripe"
}, exc_info=True)
```
### FATAL/CRITICAL
**When**: Severe errors causing shutdown
**Examples**:
- Database connection lost
- Out of memory
- Configuration errors preventing startup
```python
logger.critical("Database connection lost", extra={
    "database": "postgres",
    "host": "db.example.com",
    "attempt": 3
})
```
---
## Required Fields
Every log entry should include:
### 1. Timestamp
ISO 8601 format with timezone:
```json
{
  "timestamp": "2024-10-28T14:32:15.123Z"
}
```
### 2. Level
Standard levels: debug, info, warn, error, critical
```json
{
  "level": "error"
}
```
### 3. Message
Human-readable description:
```json
{
  "message": "User authentication failed"
}
```
### 4. Service/Application
What component logged this:
```json
{
  "service": "api-gateway",
  "version": "1.2.3"
}
```
### 5. Environment
```json
{
  "environment": "production"
}
```
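
Taken together, one way to guarantee the static required fields on every entry is to bind them once at startup. A sketch using structlog, which also appears in the implementation examples below (the field values are illustrative):

```python
import structlog

# Timestamp and level come from structlog's processors (see the
# configuration example later in this guide); the rest is bound once.
logger = structlog.get_logger().bind(
    service="api-gateway",
    version="1.2.3",
    environment="production",
)

logger.info("User logged in")  # carries service, version and environment
```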
---
## Recommended Fields
### Request Context
```json
{
"request_id": "550e8400-e29b-41d4-a716-446655440000",
"user_id": "user123",
"session_id": "sess_abc",
"ip_address": "192.168.1.1",
"user_agent": "Mozilla/5.0..."
}
```
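
To attach request context to every line emitted while a request is being handled, context-local binding avoids passing fields around explicitly. A sketch using structlog's `contextvars` helpers (the `handle_request` shape and attribute access are assumptions):

```python
import uuid
import structlog

def handle_request(request):
    # Bind request-scoped fields once; every log call during this request
    # carries them, provided structlog.contextvars.merge_contextvars is in
    # the processor chain.
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        request_id=str(uuid.uuid4()),
        user_id=getattr(request, "user_id", None),
        ip_address=getattr(request, "remote_addr", None),
    )
    structlog.get_logger().info("Request received")
```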
### Performance Metrics
```json
{
  "duration_ms": 245,
  "response_size_bytes": 1024
}
```
### Error Details
```json
{
"error_type": "ValidationError",
"error_message": "Invalid email format",
"stack_trace": "...",
"error_code": "VAL_001"
}
```
### Business Context
```json
{
"order_id": "ORD-12345",
"customer_id": "CUST-789",
"transaction_amount": 99.99,
"payment_method": "credit_card"
}
```
---
## Implementation Examples
### Python (using structlog)
```python
import structlog

logger = structlog.get_logger()

# Configure structured logging
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ]
)

# Usage
logger.info(
    "user_logged_in",
    user_id="user123",
    ip_address="192.168.1.1",
    login_method="oauth"
)
```
### Node.js (using Winston)
```javascript
const winston = require('winston');

const logger = winston.createLogger({
  format: winston.format.json(),
  defaultMeta: { service: 'api-gateway' },
  transports: [
    new winston.transports.Console()
  ]
});

logger.info('User logged in', {
  userId: 'user123',
  ipAddress: '192.168.1.1',
  loginMethod: 'oauth'
});
```
### Go (using zap)
```go
import "go.uber.org/zap"

logger, _ := zap.NewProduction()
defer logger.Sync()

logger.Info("User logged in",
    zap.String("userId", "user123"),
    zap.String("ipAddress", "192.168.1.1"),
    zap.String("loginMethod", "oauth"),
)
```
### Java (using Logback with JSON)
```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import net.logstash.logback.argument.StructuredArguments;

Logger logger = LoggerFactory.getLogger(MyClass.class);

logger.info("User logged in",
    StructuredArguments.kv("userId", "user123"),
    StructuredArguments.kv("ipAddress", "192.168.1.1"),
    StructuredArguments.kv("loginMethod", "oauth")
);
```
---
## Log Aggregation Patterns
### Pattern 1: ELK Stack (Elasticsearch, Logstash, Kibana)
**Architecture**:
```
Application → Filebeat → Logstash → Elasticsearch → Kibana
```
**filebeat.yml**:
```yaml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    json.keys_under_root: true
    json.add_error_key: true

output.logstash:
  hosts: ["logstash:5044"]
```
**logstash.conf**:
```
input {
  beats {
    port => 5044
  }
}

filter {
  json {
    source => "message"
  }
  date {
    match => ["timestamp", "ISO8601"]
  }
  # Only needed if some inputs are plain text (e.g. access logs);
  # remove this block when everything arriving is already JSON.
  grok {
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```
### Pattern 2: Loki (Grafana Loki)
**Architecture**:
```
Application → Promtail → Loki → Grafana
```
**promtail-config.yml**:
```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: app
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          __path__: /var/log/app/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            timestamp: timestamp
      - labels:
          level:
          service:
      - timestamp:
          source: timestamp
          format: RFC3339
```
**Query in Grafana**:
```logql
{job="app"} |= "error" | json | level="error"
```
### Pattern 3: CloudWatch Logs
**Install CloudWatch agent**:
```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app/*.log",
            "log_group_name": "/aws/app/production",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          }
        ]
      }
    }
  }
}
```
**Query with CloudWatch Insights**:
```
fields @timestamp, level, message, user_id
| filter level = "error"
| sort @timestamp desc
| limit 100
```
### Pattern 4: Fluentd/Fluent Bit
**fluent-bit.conf**:
```
[INPUT]
    Name    tail
    Path    /var/log/app/*.log
    Parser  json
    Tag     app.*

[FILTER]
    Name    record_modifier
    Match   *
    Record  hostname ${HOSTNAME}
    Record  cluster production

[OUTPUT]
    Name    es
    Match   *
    Host    elasticsearch
    Port    9200
    Index   app-logs
    Type    _doc
```
---
## Query Patterns
### Find Errors in Time Range
**Elasticsearch**:
```json
GET /app-logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": {
          "gte": "now-1h",
          "lte": "now"
        }}}
      ]
    }
  }
}
```
**Loki (LogQL)**:
```logql
{job="app", level="error"} |= "error"
```
**CloudWatch Insights**:
```
fields @timestamp, @message
| filter level = "error"
| filter @timestamp > ago(1h)
```
### Count Errors by Type
**Elasticsearch**:
```json
GET /app-logs-*/_search
{
  "size": 0,
  "query": { "match": { "level": "error" } },
  "aggs": {
    "error_types": {
      "terms": { "field": "error_type.keyword" }
    }
  }
}
```
**Loki**:
```logql
sum by (error_type) (count_over_time({job="app", level="error"} | json [1h]))
```
### Find Slow Requests
**Elasticsearch**:
```json
GET /app-logs-*/_search
{
  "query": {
    "range": { "duration_ms": { "gte": 1000 } }
  },
  "sort": [ { "duration_ms": "desc" } ]
}
```
### Trace Request Through Services
**Elasticsearch** (using request_id):
```json
GET /_search
{
  "query": {
    "match": { "request_id": "550e8400-e29b-41d4-a716-446655440000" }
  },
  "sort": [ { "@timestamp": "asc" } ]
}
```
---
## Sampling and Rate Limiting
### When to Sample
- **High volume services**: > 10,000 logs/second
- **Debug logs in production**: Sample 1-10%
- **Cost optimization**: Reduce storage costs
### Sampling Strategies
**1. Random Sampling**:
```python
import random
if random.random() < 0.1:  # Sample 10%
    logger.debug("Debug message", ...)
```
**2. Rate Limiting**:
```python
from rate_limiter import RateLimiter  # illustrative helper, not a standard library

limiter = RateLimiter(max_per_second=100)

if limiter.allow():
    logger.info("Rate limited log", ...)
```
**3. Error-Biased Sampling**:
```python
# Always log errors, sample successful requests
if level == "error" or random.random() < 0.01:
    logger.log(level, message, ...)
```
**4. Head-Based Sampling** (trace-aware):
```python
# If trace is sampled, log all related logs
if trace_context.is_sampled():
    logger.info("Traced log", trace_id=trace_context.trace_id)
```
---
## Log Retention
### Retention Strategy
**Hot tier** (fast SSD): 7-30 days
- Recent logs
- Full query performance
- High cost
**Warm tier** (regular disk): 30-90 days
- Older logs
- Slower queries acceptable
- Medium cost
**Cold tier** (object storage): 90+ days
- Archive logs
- Query via restore
- Low cost
### Example: Elasticsearch ILM Policy
```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": { "number_of_replicas": 1 },
          "shrink": { "number_of_shards": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": { "require": { "box_type": "cold" } }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```
---
## Security and Compliance
### PII Redaction
**Before logging**:
```python
import re

def redact_pii(data):
    # Redact email addresses
    data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                  '[EMAIL]', data)
    # Redact credit card numbers
    data = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
                  '[CARD]', data)
    # Redact SSNs
    data = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', data)
    return data

logger.info("User data", user_input=redact_pii(user_input))
```
**In Logstash**:
```
filter {
  mutate {
    gsub => [
      "message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]",
      "message", "\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b", "[CARD]"
    ]
  }
}
```
### Access Control
**Elasticsearch** (with Security):
```yaml
# Role for developers
dev_logs:
  indices:
    - names: ['app-logs-*']
      privileges: ['read']
      query: '{"match": {"environment": "development"}}'
```
**CloudWatch** (IAM Policy):
```json
{
  "Effect": "Allow",
  "Action": [
    "logs:DescribeLogGroups",
    "logs:GetLogEvents",
    "logs:FilterLogEvents"
  ],
  "Resource": "arn:aws:logs:*:*:log-group:/aws/app/production:*"
}
```
---
## Common Pitfalls
### 1. Logging Sensitive Data
❌ `logger.info("Login", password=password)`
✅ `logger.info("Login", user_id=user_id)`
### 2. Excessive Logging
❌ Logging every iteration of a loop
✅ Log aggregate results or sample
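
For example, replace a per-iteration log line with a single summary entry per batch. A sketch where `items` and `process` are placeholders for the real work:

```python
# ❌ one line per item floods the pipeline:
# for item in items:
#     logger.debug("Processed item", item_id=item.id)

# ✅ one summary line per batch:
processed, failed = 0, 0
for item in items:
    try:
        process(item)
        processed += 1
    except Exception:
        failed += 1

logger.info("Batch processed", processed=processed, failed=failed)
```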
### 3. Not Including Context
❌ `logger.error("Failed")`
✅ `logger.error("Payment failed", order_id=order_id, error=str(e))`
### 4. Inconsistent Formats
❌ Mix of JSON and plain text
✅ Pick one format and stick to it
### 5. No Request IDs
❌ Can't trace request across services
✅ Generate and propagate request_id
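
A common pattern is to reuse an incoming `X-Request-ID` header when present, generate one otherwise, and forward it on every outbound call. A rough sketch (the header name and helper names are assumptions):

```python
import uuid

REQUEST_ID_HEADER = "X-Request-ID"  # assumed header name

def get_or_create_request_id(headers):
    # Reuse the caller's ID so all services log the same identifier.
    return headers.get(REQUEST_ID_HEADER) or str(uuid.uuid4())

def call_downstream(session, url, request_id):
    # Propagate the ID so the next service can log it too.
    return session.get(url, headers={REQUEST_ID_HEADER: request_id})
```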
### 6. Logging to Multiple Places
❌ Log to file AND stdout AND syslog
✅ Log to stdout, let agent handle routing
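
A minimal Python setup that writes everything to stdout and leaves shipping to the collection agent (the formatter string is illustrative):

```python
import logging
import sys

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(
    logging.Formatter('{"level": "%(levelname)s", "message": "%(message)s"}')
)

root = logging.getLogger()
root.handlers = [handler]   # single destination; the agent handles routing
root.setLevel(logging.INFO)
```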
### 7. Blocking on Log Writes
❌ Synchronous writes to remote systems
✅ Asynchronous buffered writes
---
## Performance Optimization
### 1. Async Logging
```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

logger = logging.getLogger(__name__)

# Create queue
log_queue = queue.Queue()

# The application thread writes to a cheap in-memory queue handler
queue_handler = QueueHandler(log_queue)
logger.addHandler(queue_handler)

# A background thread drains the queue into the real (possibly slow) handlers
listener = QueueListener(log_queue, logging.StreamHandler())
listener.start()
```
### 2. Conditional Logging
```python
# Avoid expensive operations if not logging
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("Details", data=expensive_serialization(obj))
```
### 3. Batching
```python
# Batch logs before sending
batch = []
for log in logs:
    batch.append(log)
    if len(batch) >= 100:
        send_to_aggregator(batch)  # placeholder for the shipping call
        batch = []

if batch:  # flush the final partial batch
    send_to_aggregator(batch)
```
### 4. Compression
```yaml
# Filebeat with compression
output.logstash:
  hosts: ["logstash:5044"]
  compression_level: 3
```
---
## Monitoring Log Pipeline
Track pipeline health with metrics:
```promql
# Log ingestion rate
rate(logs_ingested_total[5m])
# Pipeline lag
log_processing_lag_seconds
# Dropped logs
rate(logs_dropped_total[5m])
# Error parsing rate
rate(logs_parse_errors_total[5m])
```
Alert on:
- Sudden drop in log volume (service down?)
- High parse error rate (format changed?)
- Pipeline lag > 1 minute (capacity issue?)