Playbook: Data Corruption

Symptoms

  • Users report incorrect data
  • Database integrity constraint violations
  • Foreign key errors
  • Application errors due to unexpected data
  • Failed backups (checksum mismatch)
  • Monitoring alert: "Data integrity check failed"

Severity

  • SEV1 - Critical data corrupted (financial, health, legal)
  • SEV2 - Non-critical data corrupted (user profiles, cache)
  • SEV3 - Recoverable corruption (can restore from backup)

Diagnosis

Step 1: Confirm Corruption

Database Integrity Check (PostgreSQL):

-- Confirm the database is reachable and its catalog entry is intact
SELECT * FROM pg_catalog.pg_database WHERE datname = 'your_database';

-- Check whether data checksums are enabled and whether failures have been recorded (PostgreSQL 12+)
SHOW data_checksums;
SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database
WHERE datname = 'your_database';

-- List the largest tables; unexpected growth can point to bloat or runaway writes
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
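
For deeper verification, the amcheck extension (and the pg_amcheck command-line tool shipped with PostgreSQL 14+) can scan relations for structural corruption. A minimal sketch, assuming the tool is installed and your_database is the target; the index name is illustrative:

# Scan the target database and report any corruption found
pg_amcheck --dbname=your_database --progress

# On older versions, use the amcheck extension directly to check a specific index
# psql -d your_database -c "CREATE EXTENSION IF NOT EXISTS amcheck; SELECT bt_index_check('users_pkey');"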

Database Integrity Check (MySQL):

-- Check a table for corruption
CHECK TABLE users;

-- Repair table (MyISAM/ARCHIVE/CSV only; InnoDB is not supported).
-- Take a backup first - see the "DO NOT" list under Mitigation before running this.
REPAIR TABLE users;

-- Optimize table (rebuilds and defragments; for InnoDB this maps to ALTER TABLE ... FORCE)
OPTIMIZE TABLE users;

Step 2: Identify Scope

Questions to answer:

  • Which tables/data are affected?
  • How many records are corrupted? (see the sizing sketch after this list)
  • When did corruption start?
  • What's the impact on users?
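
A minimal sketch for sizing the damage, assuming a hypothetical orders table with a user_id foreign key and an updated_at timestamp (PostgreSQL shown; adjust names to your schema):

-- How many child rows point at a parent that no longer exists?
SELECT count(*)
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE u.id IS NULL;

-- When did the bad rows start appearing?
SELECT date_trunc('hour', o.updated_at) AS hour, count(*)
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE u.id IS NULL
GROUP BY 1
ORDER BY 1;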

Check Database Logs:

# PostgreSQL
grep "ERROR\|FATAL\|PANIC" /var/log/postgresql/postgresql.log

# MySQL
grep "ERROR" /var/log/mysql/error.log

# Look for:
# - Constraint violations
# - Foreign key errors
# - Checksum errors
# - Disk I/O errors

Step 3: Determine Root Cause

Common causes:

Cause               Symptoms
Disk corruption     I/O errors in dmesg, checksum failures
Application bug     Logical corruption (wrong data, not random)
Failed migration    Schema mismatch, foreign key violations
Concurrent writes   Race condition, duplicate records
Hardware failure    Random corruption, unrelated records
Malicious attack    Deliberate data modification

Check for Disk Errors:

# Check disk errors
dmesg | grep -i "I/O error\|disk error"

# Check SMART status
smartctl -a /dev/sda

# Look for: Reallocated_Sector_Ct, Current_Pending_Sector

Mitigation

Immediate (Now - 5 min)

CRITICAL: Preserve Evidence

# 1. STOP ALL WRITES (prevent further corruption)
# Put application in read-only mode OR take application offline
# (a database-level fallback is sketched after this block)

# 2. Snapshot/backup current state (even if corrupted)
# PostgreSQL:
pg_dump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql

# MySQL:
mysqldump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql

# Note: logical dumps can fail on badly corrupted pages; the disk snapshot below still preserves state

# 3. Snapshot disk (cloud)
# AWS:
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0 --description "Corruption snapshot"

# Impact: Preserves evidence for forensics
# Risk: None (read-only operations)
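
If the application cannot be switched to read-only quickly, writes can also be blocked at the database layer. A minimal sketch, assuming administrative access; the PostgreSQL setting applies to new sessions only, and the MySQL setting does not restrict SUPER/CONNECTION_ADMIN users:

# PostgreSQL: default new transactions against this database to read-only
psql -c "ALTER DATABASE your_database SET default_transaction_read_only = on;"

# MySQL: make the server read-only for ordinary clients
mysql -e "SET GLOBAL read_only = ON;"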

CRITICAL: DO NOT:

  • Delete corrupted data (may need for forensics)
  • Run REPAIR TABLE (may destroy evidence)
  • Restart database (may clear logs)

Short-term (5 min - 1 hour)

Option A: Restore from Backup (if recent clean backup)

# 1. Identify last known good backup
ls -lh /backup/*.sql

# Example:
# backup-20251026-0200.sql  ← Clean backup (before corruption)
# backup-20251026-0800.sql  ← Corrupted

# 2. Restore from clean backup (into a freshly created database so corrupted objects are not mixed in)
# PostgreSQL:
dropdb your_database && createdb your_database
psql your_database < /backup/backup-20251026-0200.sql

# MySQL:
mysql -e "DROP DATABASE your_database; CREATE DATABASE your_database;"
mysql your_database < /backup/backup-20251026-0200.sql

# 3. Verify data integrity
# Run application tests
# Check user-reported issues

# Impact: Data restored to clean state
# Risk: Medium (lose any data written after the backup was taken)

Option B: Repair Corrupted Records (if isolated corruption)

-- Identify corrupted records (example: email should never be NULL)
SELECT * FROM users WHERE email IS NULL;

-- Fix corrupted records inside a transaction so the change can be rolled back if verification fails
BEGIN;
UPDATE users SET email = 'unknown@example.com' WHERE email IS NULL;

-- Verify fix before committing
SELECT count(*) FROM users WHERE email IS NULL;  -- Should be 0
COMMIT;

-- Impact: Corruption fixed
-- Risk: Low (if corruption is known and fixable)

Option C: Point-in-Time Recovery (PostgreSQL)

# If WAL (Write-Ahead Log) archiving is enabled:

# 1. Determine recovery point (before corruption)
# 2025-10-26 07:00:00 (corruption detected at 08:00)

# 2. Restore a base backup into a fresh data directory
pg_basebackup -D /var/lib/postgresql/data-recovery

# 3. Configure the recovery target
# PostgreSQL 12+: set recovery_target_time in postgresql.conf and create an empty recovery.signal file
# PostgreSQL 11 and older: put it in recovery.conf
# recovery_target_time = '2025-10-26 07:00:00'

# 4. Start PostgreSQL (will replay WAL up to the target time)
systemctl start postgresql

# Impact: Restore to exact point before corruption
# Risk: Low (if WAL available)

Long-term (1 hour+)

Root Cause Analysis:

If disk corruption:

  • Replace disk immediately
  • Check RAID status
  • Run filesystem check (fsck)
  • Enable database checksums
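
For the last item, checksums can be enabled offline with pg_checksums (PostgreSQL 12+). A minimal sketch, assuming a default data directory path; enabling rewrites every block, so expect downtime proportional to cluster size:

# The cluster must be cleanly shut down before enabling checksums
systemctl stop postgresql
pg_checksums --enable --pgdata=/var/lib/postgresql/data
systemctl start postgresql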

If application bug:

  • Fix bug in application code
  • Add data validation
  • Add integrity checks
  • Add regression test

If failed migration:

  • Review migration script
  • Test migrations in staging first
  • Add rollback plan
  • Use transaction-based migrations
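
A minimal sketch of a transaction-based migration (PostgreSQL, where DDL is transactional; most MySQL DDL commits implicitly, so this pattern mainly protects PostgreSQL migrations). Table and column names are illustrative:

BEGIN;

-- Hypothetical migration: add a column and backfill it
ALTER TABLE users ADD COLUMN status text NOT NULL DEFAULT 'active';
UPDATE users SET status = 'inactive' WHERE deleted_at IS NOT NULL;

-- Any error before this point aborts the transaction and rolls the schema change back
COMMIT;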

If concurrent writes:

  • Add locking (row-level, table-level)
  • Use optimistic locking (version column; sketched after this list)
  • Review transaction isolation level
  • Add unique constraints
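
A minimal sketch of optimistic locking with a version column (names and values are illustrative): the UPDATE succeeds only if no one else changed the row since it was read, and an affected-row count of 0 signals a conflict the application should retry.

-- Read the current data together with its version
SELECT id, email, version FROM users WHERE id = 42;

-- Write back only if the version is unchanged; bump it atomically
UPDATE users
SET email = 'new@example.com', version = version + 1
WHERE id = 42 AND version = 7;  -- 7 = version read above; 0 rows updated means conflict, retry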

Prevention

Backups:

  • Daily automated backups
  • Test restore process monthly
  • Multiple backup locations (local + S3)
  • Point-in-time recovery enabled (WAL archiving; example configuration after this list)
  • Retention: 30 days
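
A minimal sketch of enabling point-in-time recovery in postgresql.conf, assuming WAL is archived to /backup/wal (any durable location or archiving tool works); archive_mode requires a restart, and periodic base backups (pg_basebackup) are still needed as the starting point for replay:

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'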

Monitoring:

  • Data integrity checks (checksums)
  • Foreign key violation alerts
  • Disk error monitoring (SMART)
  • Backup success/failure alerts
  • Application-level data validation

Data Validation:

  • Database constraints (NOT NULL, FOREIGN KEY, CHECK; example after this list)
  • Application-level validation
  • Schema migrations in transactions
  • Automated data quality tests
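
A minimal sketch of the first item, with illustrative table and column names: let the database reject bad data instead of discovering it later.

CREATE TABLE orders (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id    bigint NOT NULL REFERENCES users (id),       -- foreign key: no orphaned orders
    amount     numeric(12,2) NOT NULL CHECK (amount >= 0),  -- CHECK: no negative amounts
    created_at timestamptz NOT NULL DEFAULT now()
);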

Redundancy:

  • Database replication (primary + replica)
  • RAID for disk redundancy
  • Multi-AZ deployment (cloud)

Escalation

Escalate to DBA if:

  • Database-level corruption
  • Need expert for recovery

Escalate to developer if:

  • Application bug causing corruption
  • Need code fix

Escalate to security team if:

  • Suspected malicious attack
  • Unauthorized data modification

Escalate to management if:

  • Critical data lost
  • Legal/compliance implications
  • Data breach

Legal/Compliance

If critical data corrupted:

  • Notify legal team
  • Notify compliance team
  • Check notification requirements:
    • GDPR: 72 hours for breach notification
    • HIPAA: 60 days for breach notification
    • PCI-DSS: Immediate notification
  • Document incident timeline (for audit)
  • Preserve evidence (forensics)


Post-Incident

After resolving:

  • Create post-mortem (MANDATORY for SEV1)
  • Root cause analysis (what, why, how)
  • Identify affected users/records
  • User communication (if needed)
  • Action items (prevent recurrence)
  • Update backup/recovery procedures
  • Update this runbook if needed

Useful Commands Reference

# PostgreSQL integrity check
psql -c "SELECT * FROM pg_catalog.pg_database"

# MySQL table check
mysqlcheck -c your_database

# Backup
pg_dump your_database > backup.sql
mysqldump your_database > backup.sql

# Restore
psql your_database < backup.sql
mysql your_database < backup.sql

# Disk check
dmesg | grep -i "I/O error"
smartctl -a /dev/sda
fsck /dev/sda1   # run only on an unmounted filesystem

# Snapshot (AWS)
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0