# Playbook: Data Corruption
## Symptoms
- Users report incorrect data
- Database integrity constraint violations
- Foreign key errors
- Application errors due to unexpected data
- Failed backups (checksum mismatch)
- Monitoring alert: "Data integrity check failed"
## Severity
- **SEV1** - Critical data corrupted (financial, health, legal)
- **SEV2** - Non-critical data corrupted (user profiles, cache)
- **SEV3** - Recoverable corruption (can restore from backup)
## Diagnosis
### Step 1: Confirm Corruption
**Database Integrity Check** (PostgreSQL):
```sql
-- Check for checksum failures reported by PostgreSQL (12+, requires data checksums)
SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database
WHERE datname = 'your_database';
-- Confirm data checksums are enabled (off = page corruption may go undetected)
SHOW data_checksums;
-- List largest tables to prioritise which relations to verify first
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
```
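For a deeper check, force PostgreSQL to read every page so corruption surfaces as an error. A minimal sketch, assuming PostgreSQL 14+ for `pg_amcheck` (the `pg_dump` fallback works on older versions):
```bash
# Verify heap and B-tree index consistency across the database (PostgreSQL 14+)
pg_amcheck --database=your_database --heapallindexed
# Fallback: a full logical read of every table; corrupted pages raise errors
# while the dump output itself is discarded
pg_dump your_database > /dev/null
```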
**Database Integrity Check** (MySQL):
```sql
-- Check table for corruption (non-destructive)
CHECK TABLE users;
-- REPAIR TABLE applies only to MyISAM/ARCHIVE/CSV tables (not InnoDB) and can
-- destroy evidence; only run it AFTER the current state has been backed up
REPAIR TABLE users;
-- Optimize table (rebuilds/defragments; also only after evidence is preserved)
OPTIMIZE TABLE users;
```
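The same checks can be run from the shell across many tables at once; note that InnoDB corruption usually shows up in the server error log rather than being fixable with `REPAIR TABLE`. A minimal sketch (credentials omitted, as elsewhere in this playbook):
```bash
# Non-destructive check of every table in one database
mysqlcheck --check your_database
# Or every database on the server
mysqlcheck --check --all-databases
# InnoDB corruption is typically reported here
grep -i "corrupt" /var/log/mysql/error.log
```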
---
### Step 2: Identify Scope
**Questions to answer**:
- Which tables/data are affected?
- How many records are corrupted?
- When did corruption start?
- What's the impact on users?
**Check Database Logs**:
```bash
# PostgreSQL
grep "ERROR\|FATAL\|PANIC" /var/log/postgresql/postgresql.log
# MySQL
grep "ERROR" /var/log/mysql/error.log
# Look for:
# - Constraint violations
# - Foreign key errors
# - Checksum errors
# - Disk I/O errors
```
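Once a corruption signature is known (e.g. NULLs in a column that must never be NULL), a quick aggregate narrows down how many rows are affected and roughly when it started. A minimal sketch, assuming a hypothetical `users` table with an `updated_at` column; substitute your own table, column, and predicate:
```bash
psql your_database -c "
  SELECT count(*)        AS corrupted_rows,
         min(updated_at) AS first_seen,
         max(updated_at) AS last_seen
  FROM users
  WHERE email IS NULL;   -- replace with your corruption signature
"
```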
---
### Step 3: Determine Root Cause
**Common causes**:
| Cause | Symptoms |
|-------|----------|
| Disk corruption | I/O errors in dmesg, checksum failures |
| Application bug | Logical corruption (wrong data, not random) |
| Failed migration | Schema mismatch, foreign key violations |
| Concurrent writes | Race condition, duplicate records |
| Hardware failure | Random corruption, unrelated records |
| Malicious attack | Deliberate data modification |
**Check for Disk Errors**:
```bash
# Check disk errors
dmesg | grep -i "I/O error\|disk error"
# Check SMART status
smartctl -a /dev/sda
# Look for: Reallocated_Sector_Ct, Current_Pending_Sector
```
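If the host uses software RAID or ZFS, also confirm the array itself is healthy; a degraded or resyncing array is a strong hint. A minimal sketch (device names are illustrative):
```bash
# Linux software RAID (mdadm): look for [U_] instead of [UU], or "degraded"
cat /proc/mdstat
mdadm --detail /dev/md0
# ZFS: per-device checksum error counters
zpool status -v
# Filesystem-level errors logged by the kernel
dmesg | grep -i "ext4\|xfs\|filesystem error"
```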
---
## Mitigation
### Immediate (Now - 5 min)
**CRITICAL: Preserve Evidence**
```bash
# 1. STOP ALL WRITES (prevent further corruption)
# Put application in read-only mode OR
# Take application offline
# 2. Snapshot/backup current state (even if corrupted)
# PostgreSQL:
pg_dump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql
# MySQL:
mysqldump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql
# 3. Snapshot disk (cloud)
# AWS:
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0 --description "Corruption snapshot"
# Impact: Preserves evidence for forensics
# Risk: None (read-only operations)
```
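If the application has no built-in read-only switch, the database itself can refuse writes while diagnosis continues. A minimal sketch for PostgreSQL; note this is a soft guard (it affects new sessions by default, and a session can still override it explicitly):
```bash
# Make new transactions in this database read-only by default
psql -c "ALTER DATABASE your_database SET default_transaction_read_only = on;"
# Revert once mitigation is complete
psql -c "ALTER DATABASE your_database RESET default_transaction_read_only;"
```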
**CRITICAL: DO NOT**:
- Delete corrupted data (may need for forensics)
- Run REPAIR TABLE (may destroy evidence)
- Restart database (may clear logs)
---
### Short-term (5 min - 1 hour)
**Option A: Restore from Backup** (if recent clean backup)
```bash
# 1. Identify last known good backup
ls -lht /backup/*.sql
# Example:
# backup-20251026-0200.sql ← Clean backup (before corruption)
# backup-20251026-0800.sql ← Corrupted
# 2. Restore from clean backup
# PostgreSQL:
psql your_database < /backup/backup-20251026-0200.sql
# MySQL:
mysql your_database < /backup/backup-20251026-0200.sql
# 3. Verify data integrity
# Run application tests
# Check user-reported issues
# Impact: Data restored to clean state
# Risk: Medium (lose data after backup time)
```
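Before reopening writes, a few cheap sanity checks confirm the restore looks reasonable. A minimal sketch (the `users` table and the corruption signature are illustrative; reuse the ones from diagnosis):
```bash
# Row counts in key tables: compare against values from before the incident
psql your_database -c "SELECT count(*) FROM users;"
# Re-run the corruption signature from Step 2: should return 0
psql your_database -c "SELECT count(*) FROM users WHERE email IS NULL;"
# Any constraints left unvalidated?
psql your_database -c "SELECT conname FROM pg_constraint WHERE NOT convalidated;"
```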
**Option B: Repair Corrupted Records** (if isolated corruption)
```sql
-- Identify corrupted records
SELECT * FROM users WHERE email IS NULL; -- Should not be null
-- Fix corrupted records
UPDATE users SET email = 'unknown@example.com' WHERE email IS NULL;
-- Verify fix
SELECT count(*) FROM users WHERE email IS NULL; -- Should be 0
-- Impact: Corruption fixed
-- Risk: Low (if corruption is known and fixable)
```
**Option C: Point-in-Time Recovery** (PostgreSQL)
```bash
# Requires WAL archiving and a base backup taken BEFORE the corruption.
# 1. Determine recovery target (just before corruption)
#    e.g. 2025-10-26 07:00:00 (corruption detected at 08:00)
# 2. Restore the pre-corruption base backup into a fresh data directory (illustrative path)
tar -xzf /backup/base-20251026-0200.tar.gz -C /var/lib/postgresql/data-recovery
# 3. Configure the recovery target
#    PostgreSQL 12+: set in postgresql.conf and create a recovery.signal file
#    (older versions: recovery.conf)
#    restore_command = 'cp /backup/wal/%f %p'
#    recovery_target_time = '2025-10-26 07:00:00'
touch /var/lib/postgresql/data-recovery/recovery.signal
# 4. Point PostgreSQL at the recovery data directory and start it
#    (it replays archived WAL up to the target time, then pauses/promotes)
systemctl start postgresql
# Impact: Restores to the exact point before corruption
# Risk: Low (if a pre-corruption base backup and archived WAL are available)
```
---
### Long-term (1 hour+)
**Root Cause Analysis**:
**If disk corruption**:
- [ ] Replace disk immediately
- [ ] Check RAID status
- [ ] Run filesystem check (fsck)
- [ ] Enable database checksums
**If application bug**:
- [ ] Fix bug in application code
- [ ] Add data validation
- [ ] Add integrity checks
- [ ] Add regression test
**If failed migration**:
- [ ] Review migration script
- [ ] Test migrations in staging first
- [ ] Add rollback plan
- [ ] Use transaction-based migrations
**If concurrent writes**:
- [ ] Add locking (row-level, table-level)
- [ ] Use optimistic locking (version column); see the sketch after this list
- [ ] Review transaction isolation level
- [ ] Add unique constraints
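A minimal sketch of the optimistic-locking item above, using a plain integer version column (table and column names are illustrative):
```bash
psql your_database <<'SQL'
-- One-time schema change: add a version counter
ALTER TABLE users ADD COLUMN IF NOT EXISTS version integer NOT NULL DEFAULT 0;
-- Application update: only succeeds if nobody changed the row since it was read
UPDATE users
SET email = 'new@example.com', version = version + 1
WHERE id = 42 AND version = 7;  -- 7 = version read earlier; 0 rows updated means retry
SQL
```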
---
## Prevention
**Backups**:
- [ ] Daily automated backups (see the sketch after this list)
- [ ] Test restore process monthly
- [ ] Multiple backup locations (local + S3)
- [ ] Point-in-time recovery enabled (WAL)
- [ ] Retention: 30 days
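A minimal sketch of the daily-backup item above, with an off-site copy and 30-day local retention (paths, database, and bucket names are illustrative):
```bash
#!/usr/bin/env bash
# Run daily from cron, e.g. "0 2 * * *" as the database user
set -euo pipefail
STAMP=$(date +%Y%m%d-%H%M%S)
pg_dump your_database | gzip > "/backup/backup-${STAMP}.sql.gz"
aws s3 cp "/backup/backup-${STAMP}.sql.gz" "s3://your-backup-bucket/postgres/"
find /backup -name 'backup-*.sql.gz' -mtime +30 -delete   # local 30-day retention
```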
**Monitoring**:
- [ ] Data integrity checks (checksums)
- [ ] Foreign key violation alerts
- [ ] Disk error monitoring (SMART)
- [ ] Backup success/failure alerts
- [ ] Application-level data validation
**Data Validation**:
- [ ] Database constraints (NOT NULL, FOREIGN KEY, CHECK); see the sketch after this list
- [ ] Application-level validation
- [ ] Schema migrations in transactions
- [ ] Automated data quality tests
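A minimal sketch of the constraints item above (schema is illustrative; adding constraints to large tables locks them, so schedule a maintenance window):
```bash
psql your_database <<'SQL'
-- Reject rows that violate basic invariants at the database layer
ALTER TABLE users
  ALTER COLUMN email SET NOT NULL,
  ADD CONSTRAINT users_email_check CHECK (email LIKE '%@%');
-- Orphaned child rows become impossible once the foreign key exists
ALTER TABLE orders
  ADD CONSTRAINT orders_user_fk FOREIGN KEY (user_id) REFERENCES users (id);
SQL
```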
**Redundancy**:
- [ ] Database replication (primary + replica)
- [ ] RAID for disk redundancy
- [ ] Multi-AZ deployment (cloud)
---
## Escalation
**Escalate to DBA if**:
- Database-level corruption
- Need expert for recovery
**Escalate to developer if**:
- Application bug causing corruption
- Need code fix
**Escalate to security team if**:
- Suspected malicious attack
- Unauthorized data modification
**Escalate to management if**:
- Critical data lost
- Legal/compliance implications
- Data breach
---
## Legal/Compliance
**If critical data corrupted**:
- [ ] Notify legal team
- [ ] Notify compliance team
- [ ] Check notification requirements:
- GDPR: 72 hours for breach notification
- HIPAA: 60 days for breach notification
- PCI-DSS: Immediate notification
- [ ] Document incident timeline (for audit)
- [ ] Preserve evidence (forensics)
---
## Related Runbooks
- [07-service-down.md](07-service-down.md) - If database down
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
- [../modules/security-incidents.md](../modules/security-incidents.md) - If malicious attack
---
## Post-Incident
After resolving:
- [ ] Create post-mortem (MANDATORY for SEV1)
- [ ] Root cause analysis (what, why, how)
- [ ] Identify affected users/records
- [ ] User communication (if needed)
- [ ] Action items (prevent recurrence)
- [ ] Update backup/recovery procedures
- [ ] Update this runbook if needed
---
## Useful Commands Reference
```bash
# PostgreSQL integrity check (checksum failures, PostgreSQL 12+)
psql -c "SELECT datname, checksum_failures FROM pg_stat_database"
# MySQL table check
mysqlcheck -c your_database
# Backup
pg_dump your_database > backup.sql
mysqldump your_database > backup.sql
# Restore
psql your_database < backup.sql
mysql your_database < backup.sql
# Disk check
dmesg | grep -i "I/O error"
smartctl -a /dev/sda
fsck /dev/sda1
# Snapshot (AWS)
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0
```