# Logging Guide

## Structured Logging

### Why Structured Logs?

**Unstructured** (text):

```
2024-10-28 14:32:15 User john@example.com logged in from 192.168.1.1
```

**Structured** (JSON):

```json
{
  "timestamp": "2024-10-28T14:32:15Z",
  "level": "info",
  "message": "User logged in",
  "user": "john@example.com",
  "ip": "192.168.1.1",
  "event_type": "user_login"
}
```

**Benefits**:

- Easy to parse and query
- Consistent format across services
- Machine-readable
- Efficient storage and indexing
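
To illustrate the first benefit: structured entries can be filtered with a few lines of code instead of brittle regexes. A minimal sketch, assuming JSON-lines logs in a file named `app.log` (the path and field names are illustrative):

```python
import json

# Load one JSON event per line, skipping blank lines
with open("app.log") as f:
    events = [json.loads(line) for line in f if line.strip()]

# Filter on discrete fields -- no regex needed
logins = [e for e in events
          if e.get("event_type") == "user_login"
          and e.get("user") == "john@example.com"]
```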

---

## Log Levels

Use appropriate log levels for better filtering and alerting.

### DEBUG

**When**: Development, troubleshooting

**Examples**:

- Function entry/exit
- Variable values
- Internal state changes

```python
logger.debug("Processing request", extra={
    "request_id": req_id,
    "params": params
})
```

### INFO

**When**: Important business events

**Examples**:

- User actions (login, purchase)
- System state changes (started, stopped)
- Significant milestones

```python
logger.info("Order placed", extra={
    "order_id": "12345",
    "user_id": "user123",
    "amount": 99.99
})
```

### WARN

**When**: Potentially problematic situations

**Examples**:

- Deprecated API usage
- Slow operations (that still succeed)
- Retry attempts
- Resource usage approaching limits

```python
logger.warning("API response slow", extra={
    "endpoint": "/api/users",
    "duration_ms": 2500,
    "threshold_ms": 1000
})
```

### ERROR

**When**: Error conditions that need attention

**Examples**:

- Failed requests
- Exceptions caught and handled
- Integration failures
- Data validation errors

```python
# Inside an except block, where `e` is the caught exception
logger.error("Payment processing failed", extra={
    "order_id": "12345",
    "error": str(e),
    "payment_gateway": "stripe"
}, exc_info=True)
```

### FATAL/CRITICAL

**When**: Severe errors causing shutdown

**Examples**:

- Database connection lost
- Out of memory
- Configuration errors preventing startup

```python
logger.critical("Database connection lost", extra={
    "database": "postgres",
    "host": "db.example.com",
    "attempt": 3
})
```
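
The active threshold is typically set per environment so that DEBUG noise never reaches production. A minimal sketch using the standard library (the `LOG_LEVEL` environment variable name is an assumption):

```python
import logging
import os

# Read the level threshold from the environment, defaulting to INFO
level_name = os.getenv("LOG_LEVEL", "INFO").upper()
logging.basicConfig(level=getattr(logging, level_name, logging.INFO))

logger = logging.getLogger("app")
logger.debug("Hidden unless LOG_LEVEL=DEBUG")
logger.info("Visible at the default level")
```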

---

## Required Fields

Every log entry should include:

### 1. Timestamp

ISO 8601 format with timezone:

```json
{
  "timestamp": "2024-10-28T14:32:15.123Z"
}
```

### 2. Level

Standard levels: debug, info, warn, error, critical

```json
{
  "level": "error"
}
```

### 3. Message

Human-readable description:

```json
{
  "message": "User authentication failed"
}
```

### 4. Service/Application

What component logged this:

```json
{
  "service": "api-gateway",
  "version": "1.2.3"
}
```

### 5. Environment

Where the component is running (development, staging, production):

```json
{
  "environment": "production"
}
```
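
Putting the five required fields together, one complete entry might look like this (values are illustrative):

```json
{
  "timestamp": "2024-10-28T14:32:15.123Z",
  "level": "error",
  "message": "User authentication failed",
  "service": "api-gateway",
  "version": "1.2.3",
  "environment": "production"
}
```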

---

## Recommended Fields

### Request Context

```json
{
  "request_id": "550e8400-e29b-41d4-a716-446655440000",
  "user_id": "user123",
  "session_id": "sess_abc",
  "ip_address": "192.168.1.1",
  "user_agent": "Mozilla/5.0..."
}
```
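
Rather than passing these fields to every log call, they can be bound once per request. A sketch using structlog's contextvars support (the handler shape and attribute names are illustrative, and `structlog.contextvars.merge_contextvars` must be in the processor chain for the bound values to appear):

```python
import uuid

import structlog

def handle_request(request):
    # Bind request-scoped fields once; every log call below inherits them
    structlog.contextvars.clear_contextvars()
    structlog.contextvars.bind_contextvars(
        request_id=str(uuid.uuid4()),
        user_id=request.user_id,
        ip_address=request.client_ip,
    )
    logger = structlog.get_logger()
    logger.info("request_started")  # request_id, user_id, ip_address included
```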

### Performance Metrics

```json
{
  "duration_ms": 245,
  "response_size_bytes": 1024
}
```

### Error Details

```json
{
  "error_type": "ValidationError",
  "error_message": "Invalid email format",
  "stack_trace": "...",
  "error_code": "VAL_001"
}
```

### Business Context

```json
{
  "order_id": "ORD-12345",
  "customer_id": "CUST-789",
  "transaction_amount": 99.99,
  "payment_method": "credit_card"
}
```

---

## Implementation Examples

### Python (using structlog)

```python
import structlog

# Configure structured logging before creating loggers
structlog.configure(
    processors=[
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.add_log_level,
        structlog.processors.JSONRenderer()
    ]
)

logger = structlog.get_logger()

# Usage
logger.info(
    "user_logged_in",
    user_id="user123",
    ip_address="192.168.1.1",
    login_method="oauth"
)
```

### Node.js (using Winston)

```javascript
const winston = require('winston');

const logger = winston.createLogger({
  // Add a timestamp field, then render the record as JSON
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.json()
  ),
  defaultMeta: { service: 'api-gateway' },
  transports: [
    new winston.transports.Console()
  ]
});

logger.info('User logged in', {
  userId: 'user123',
  ipAddress: '192.168.1.1',
  loginMethod: 'oauth'
});
```

### Go (using zap)

```go
package main

import "go.uber.org/zap"

func main() {
	logger, _ := zap.NewProduction()
	defer logger.Sync()

	logger.Info("User logged in",
		zap.String("userId", "user123"),
		zap.String("ipAddress", "192.168.1.1"),
		zap.String("loginMethod", "oauth"),
	)
}
```

### Java (using Logback with JSON)

```java
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import net.logstash.logback.argument.StructuredArguments;

public class MyClass {
    private static final Logger logger = LoggerFactory.getLogger(MyClass.class);

    public void logLogin() {
        logger.info("User logged in",
            StructuredArguments.kv("userId", "user123"),
            StructuredArguments.kv("ipAddress", "192.168.1.1"),
            StructuredArguments.kv("loginMethod", "oauth")
        );
    }
}
```

---

## Log Aggregation Patterns

### Pattern 1: ELK Stack (Elasticsearch, Logstash, Kibana)

**Architecture**:

```
Application → Filebeat → Logstash → Elasticsearch → Kibana
```

**filebeat.yml**:

```yaml
filebeat.inputs:
  - type: log
    enabled: true
    paths:
      - /var/log/app/*.log
    json.keys_under_root: true
    json.add_error_key: true

output.logstash:
  hosts: ["logstash:5044"]
```

**logstash.conf**:

```
input {
  beats {
    port => 5044
  }
}

filter {
  json {
    source => "message"
  }

  date {
    match => ["timestamp", "ISO8601"]
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "app-logs-%{+YYYY.MM.dd}"
  }
}
```

### Pattern 2: Loki (Grafana Loki)

**Architecture**:

```
Application → Promtail → Loki → Grafana
```

**promtail-config.yml**:

```yaml
server:
  http_listen_port: 9080

positions:
  filename: /tmp/positions.yaml

clients:
  - url: http://loki:3100/loki/api/v1/push

scrape_configs:
  - job_name: app
    static_configs:
      - targets:
          - localhost
        labels:
          job: app
          __path__: /var/log/app/*.log
    pipeline_stages:
      - json:
          expressions:
            level: level
            service: service
            timestamp: timestamp
      - labels:
          level:
          service:
      - timestamp:
          source: timestamp
          format: RFC3339
```

**Query in Grafana**:

```logql
{job="app"} |= "error" | json | level="error"
```

### Pattern 3: CloudWatch Logs

**CloudWatch agent configuration**:

```json
{
  "logs": {
    "logs_collected": {
      "files": {
        "collect_list": [
          {
            "file_path": "/var/log/app/*.log",
            "log_group_name": "/aws/app/production",
            "log_stream_name": "{instance_id}",
            "timezone": "UTC"
          }
        ]
      }
    }
  }
}
```

**Query with CloudWatch Insights**:

```
fields @timestamp, level, message, user_id
| filter level = "error"
| sort @timestamp desc
| limit 100
```

### Pattern 4: Fluentd/Fluent Bit

**fluent-bit.conf**:

```
[INPUT]
    Name    tail
    Path    /var/log/app/*.log
    Parser  json
    Tag     app.*

[FILTER]
    Name    record_modifier
    Match   *
    Record  hostname ${HOSTNAME}
    Record  cluster production

[OUTPUT]
    Name    es
    Match   *
    Host    elasticsearch
    Port    9200
    Index   app-logs
    Type    _doc
```

---

## Query Patterns

### Find Errors in Time Range

**Elasticsearch**:

```json
GET /app-logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": {
          "gte": "now-1h",
          "lte": "now"
        }}}
      ]
    }
  }
}
```

**Loki (LogQL)**:

```logql
{job="app", level="error"}
```

**CloudWatch Insights**:

```
fields @timestamp, @message
| filter level = "error"
| filter @timestamp > ago(1h)
```

### Count Errors by Type

**Elasticsearch**:

```json
GET /app-logs-*/_search
{
  "size": 0,
  "query": { "match": { "level": "error" } },
  "aggs": {
    "error_types": {
      "terms": { "field": "error_type.keyword" }
    }
  }
}
```

**Loki**:

```logql
sum by (error_type) (count_over_time({job="app", level="error"}[1h]))
```

### Find Slow Requests

**Elasticsearch**:

```json
GET /app-logs-*/_search
{
  "query": {
    "range": { "duration_ms": { "gte": 1000 } }
  },
  "sort": [ { "duration_ms": "desc" } ]
}
```
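
A roughly equivalent Loki query, assuming `duration_ms` is a field in the JSON log body:

```logql
{job="app"} | json | duration_ms > 1000
```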

### Trace Request Through Services

**Elasticsearch** (using request_id):

```json
GET /_search
{
  "query": {
    "match": { "request_id": "550e8400-e29b-41d4-a716-446655440000" }
  },
  "sort": [ { "@timestamp": "asc" } ]
}
```

---

## Sampling and Rate Limiting

### When to Sample

- **High-volume services**: > 10,000 logs/second
- **Debug logs in production**: sample 1-10%
- **Cost optimization**: reduce storage and indexing costs

### Sampling Strategies

**1. Random Sampling**:

```python
import random

if random.random() < 0.1:  # Sample 10%
    logger.debug("Debug message", ...)
```

**2. Rate Limiting** (using a token-bucket limiter such as the sketch below):

```python
limiter = RateLimiter(max_per_second=100)

if limiter.allow():
    logger.info("Rate limited log", ...)
```
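
`RateLimiter` is not a standard-library class; here is a minimal token-bucket sketch that satisfies the interface used above (not thread-safe; wrap `allow()` in a lock for multi-threaded loggers):

```python
import time

class RateLimiter:
    """Token bucket: allows bursts up to max_per_second, refills continuously."""

    def __init__(self, max_per_second: float):
        self.rate = max_per_second
        self.capacity = max_per_second
        self.tokens = max_per_second
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```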

**3. Error-Biased Sampling**:

```python
# Always log errors, sample 1% of successful requests
if level == "error" or random.random() < 0.01:
    logger.log(level, message, ...)
```

**4. Head-Based Sampling** (trace-aware):

```python
# If the trace was sampled at its root, emit every related log
# so that sampled traces always have complete logs
if trace_context.is_sampled():
    logger.info("Traced log", trace_id=trace_context.trace_id)
```

---

## Log Retention

### Retention Strategy

**Hot tier** (fast SSD): 7-30 days

- Recent logs
- Full query performance
- High cost

**Warm tier** (regular disk): 30-90 days

- Older logs
- Slower queries acceptable
- Medium cost

**Cold tier** (object storage): 90+ days

- Archive logs
- Query via restore
- Low cost

### Example: Elasticsearch ILM Policy

```json
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_size": "50GB",
            "max_age": "1d"
          }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": {
          "allocate": { "number_of_replicas": 1 },
          "shrink": { "number_of_shards": 1 }
        }
      },
      "cold": {
        "min_age": "30d",
        "actions": {
          "allocate": { "require": { "box_type": "cold" } }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": {
          "delete": {}
        }
      }
    }
  }
}
```

---

## Security and Compliance

### PII Redaction

**Before logging**:

```python
import re

def redact_pii(data: str) -> str:
    # Redact email addresses
    data = re.sub(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b',
                  '[EMAIL]', data)
    # Redact credit card numbers
    data = re.sub(r'\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b',
                  '[CARD]', data)
    # Redact SSNs
    data = re.sub(r'\b\d{3}-\d{2}-\d{4}\b', '[SSN]', data)
    return data

logger.info("User data", user_input=redact_pii(user_input))
```

**In Logstash**:

```
filter {
  mutate {
    gsub => [
      "message", "\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b", "[EMAIL]",
      "message", "\b\d{4}[\s-]?\d{4}[\s-]?\d{4}[\s-]?\d{4}\b", "[CARD]"
    ]
  }
}
```

### Access Control

**Elasticsearch** (with Security enabled):

```yaml
# Role for developers: read-only, development logs only
dev_logs:
  indices:
    - names: ['app-logs-*']
      privileges: ['read']
      query: '{"match": {"environment": "development"}}'
```

**CloudWatch** (IAM policy statement):

```json
{
  "Effect": "Allow",
  "Action": [
    "logs:DescribeLogGroups",
    "logs:GetLogEvents",
    "logs:FilterLogEvents"
  ],
  "Resource": "arn:aws:logs:*:*:log-group:/aws/app/production:*"
}
```

---

## Common Pitfalls

### 1. Logging Sensitive Data

❌ `logger.info("Login", password=password)`

✅ `logger.info("Login", user_id=user_id)`

### 2. Excessive Logging

❌ Logging every iteration of a loop

✅ Log aggregate results or sample

### 3. Not Including Context

❌ `logger.error("Failed")`

✅ `logger.error("Payment failed", order_id=order_id, error=str(e))`

### 4. Inconsistent Formats

❌ Mix of JSON and plain text

✅ Pick one format and stick to it

### 5. No Request IDs

❌ Can't trace a request across services

✅ Generate and propagate request_id
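
A minimal sketch of generating and propagating a request ID in HTTP middleware (framework-agnostic; the `X-Request-ID` header name is a common convention, and the request/response objects are illustrative):

```python
import uuid

def request_id_middleware(handler):
    def wrapped(request):
        # Reuse an inbound ID from an upstream service, or mint a new one
        request_id = request.headers.get("X-Request-ID") or str(uuid.uuid4())
        request.headers["X-Request-ID"] = request_id   # pass downstream
        response = handler(request)
        response.headers["X-Request-ID"] = request_id  # echo to the caller
        return response
    return wrapped
```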

### 6. Logging to Multiple Places

❌ Log to file AND stdout AND syslog

✅ Log to stdout, let the agent handle routing

### 7. Blocking on Log Writes

❌ Synchronous writes to remote systems

✅ Asynchronous buffered writes

---

## Performance Optimization

### 1. Async Logging

```python
import logging
import queue
from logging.handlers import QueueHandler, QueueListener

# The queue decouples log calls from slow handler I/O
log_queue = queue.Queue()

# Application threads only enqueue records (fast, non-blocking)
logger = logging.getLogger("app")
logger.addHandler(QueueHandler(log_queue))

# A background thread drains the queue into the real handlers
handlers = [logging.StreamHandler()]
listener = QueueListener(log_queue, *handlers)
listener.start()
```

### 2. Conditional Logging

```python
# Avoid expensive serialization when DEBUG is disabled
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("Details", data=expensive_serialization(obj))
```

### 3. Batching

```python
# Batch logs before sending to the aggregator
batch = []
for log in logs:
    batch.append(log)
    if len(batch) >= 100:
        send_to_aggregator(batch)
        batch = []
if batch:  # flush the final partial batch
    send_to_aggregator(batch)
```

### 4. Compression

```yaml
# Filebeat with compressed output to Logstash
output.logstash:
  hosts: ["logstash:5044"]
  compression_level: 3
```

---

## Monitoring the Log Pipeline

Track pipeline health with metrics (the metric names below are illustrative):

```promql
# Log ingestion rate
rate(logs_ingested_total[5m])

# Pipeline lag
log_processing_lag_seconds

# Dropped logs
rate(logs_dropped_total[5m])

# Parse error rate
rate(logs_parse_errors_total[5m])
```

Alert on:

- Sudden drop in log volume (service down?)
- High parse error rate (format changed?)
- Pipeline lag > 1 minute (capacity issue?)
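
As a sketch, the first alert could be expressed as a Prometheus alerting rule (the metric name and thresholds are illustrative):

```yaml
groups:
  - name: log-pipeline
    rules:
      - alert: LogVolumeDropped
        # Current ingestion rate is under 50% of the rate one hour ago
        expr: rate(logs_ingested_total[5m]) < 0.5 * rate(logs_ingested_total[5m] offset 1h)
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Log volume dropped sharply -- is the service down?"
```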