Initial commit

2025-11-29 17:56:41 +08:00
commit 9427ed1eea
40 changed files with 15189 additions and 0 deletions
--- a/agents/sre/playbooks/08-data-corruption.md
+++ b/agents/sre/playbooks/08-data-corruption.md
@@ -0,0 +1,337 @@
+# Playbook: Data Corruption
+
+## Symptoms
+
+- Users report incorrect data
+- Database integrity constraint violations
+- Foreign key errors
+- Application errors due to unexpected data
+- Failed backups (checksum mismatch)
+- Monitoring alert: "Data integrity check failed"
+
+## Severity
+
+- **SEV1** - Critical data corrupted (financial, health, legal)
+- **SEV2** - Non-critical data corrupted (user profiles, cache)
+- **SEV3** - Recoverable corruption (can restore from backup)
+
+## Diagnosis
+
+### Step 1: Confirm Corruption
+
+**Database Integrity Check** (PostgreSQL):
+```sql
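+-- Confirm the database exists and accepts connections (this does not detect corruption by itself)
+SELECT datname, datallowconn FROM pg_catalog.pg_database WHERE datname = 'your_database';
+
+-- Check whether data checksums are enabled
+SHOW data_checksums;
+
+-- Checksum failures recorded so far (PostgreSQL 12+, requires checksums enabled)
+SELECT datname, checksum_failures, checksum_last_failure
+FROM pg_stat_database
+WHERE datname = 'your_database';
+
+-- List tables by total size (a size overview, not a bloat or corruption check)
+SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass))
+FROM pg_tables
+ORDER BY pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass) DESC;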
+```
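+
+Catalog and statistics queries cannot see a damaged index page. For a structural check, one option is the `amcheck` contrib extension. The sketch below assumes the extension is available on the server and that the tables of interest live in the `public` schema (an assumption; adjust as needed). `bt_index_check` takes only a lightweight lock, but treat this as a starting point rather than a complete verification.
+
+```sql
+-- Assumption: the amcheck contrib extension can be installed in this database
+CREATE EXTENSION IF NOT EXISTS amcheck;
+
+-- Verify the structure of every valid B-tree index in the public schema.
+-- The query raises an error at the first index that fails verification.
+SELECT c.relname AS index_name,
+       bt_index_check(index => c.oid)
+FROM pg_index i
+JOIN pg_class c ON c.oid = i.indexrelid
+JOIN pg_am am ON am.oid = c.relam
+JOIN pg_namespace n ON n.oid = c.relnamespace
+WHERE am.amname = 'btree'
+  AND n.nspname = 'public'
+  AND c.relpersistence <> 't'        -- skip temporary tables from other sessions
+  AND i.indisready AND i.indisvalid  -- skip indexes that are still being built
+ORDER BY c.relname;
+```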
+
+**Database Integrity Check** (MySQL):
+```sql
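+-- Check table for corruption (read-only)
+CHECK TABLE users;
+
+-- Repair table (MyISAM, ARCHIVE, and CSV only; not supported for InnoDB)
+-- Do NOT run during diagnosis: repair rewrites the table and may destroy evidence
+REPAIR TABLE users;
+
+-- Optimize table (defragment); maintenance only, it neither detects nor fixes corruption
+OPTIMIZE TABLE users;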
+```
+
+---
+
+### Step 2: Identify Scope
+
+**Questions to answer**:
+- Which tables/data are affected?
+- How many records corrupted?
+- When did corruption start?
+- What's the impact on users?
+
+**Check Database Logs**:
+```bash
+# PostgreSQL (log path varies by distribution and version)
+grep "ERROR\|FATAL\|PANIC" /var/log/postgresql/postgresql.log
+
+# MySQL
+grep "ERROR" /var/log/mysql/error.log
+
+# Look for:
+# - Constraint violations
+# - Foreign key errors
+# - Checksum errors
+# - Disk I/O errors
+```
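+
+The scoping questions in this step (which rows, how many, since when) can often be answered with a few counting queries. The sketch below is illustrative only: the `orders` table, the `id` and `user_id` columns, and the `updated_at` timestamp are hypothetical names, and "email is NULL" stands in for whatever marks a row as corrupted in your case.
+
+```sql
+-- How many child rows point at a parent that no longer exists?
+-- (orders, id, and user_id are hypothetical names)
+SELECT count(*) AS orphaned_orders
+FROM orders o
+LEFT JOIN users u ON u.id = o.user_id
+WHERE o.user_id IS NOT NULL
+  AND u.id IS NULL;
+
+-- How many rows are affected, and over what time window?
+-- (assumes a hypothetical updated_at column maintained by the application)
+SELECT count(*)        AS affected_rows,
+       min(updated_at) AS first_seen,
+       max(updated_at) AS last_seen
+FROM users
+WHERE email IS NULL;
+```
+
+Run these against a replica or a restored copy where possible, so the investigation adds no load to the damaged primary.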
+
+---
+
+### Step 3: Determine Root Cause
+
+**Common causes**:
+
+| Cause | Symptoms |
+|-------|----------|
+| Disk corruption | I/O errors in dmesg, checksum failures |
+| Application bug | Logical corruption (wrong data, not random) |
+| Failed migration | Schema mismatch, foreign key violations |
+| Concurrent writes | Race condition, duplicate records |
+| Hardware failure | Random corruption, unrelated records |
+| Malicious attack | Deliberate data modification |
+
+**Check for Disk Errors**:
+```bash
+# Check disk errors
+dmesg | grep -i "I/O error\|disk error"
+
+# Check SMART status
+smartctl -a /dev/sda
+
+# Look for: Reallocated_Sector_Ct, Current_Pending_Sector
+```
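+
+For the logical causes in the table above, the data itself usually carries the trace. As one example, duplicates left behind by a race between concurrent writes can be surfaced with a grouping query. This is a minimal sketch that assumes `users.email` is meant to be unique.
+
+```sql
+-- Values that appear more than once in a column that should be unique
+SELECT email, count(*) AS copies
+FROM users
+GROUP BY email
+HAVING count(*) > 1
+ORDER BY copies DESC;
+```
+
+An empty result points away from concurrent writes and toward the other causes in the table.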
+
+---
+
+## Mitigation
+
+### Immediate (Now - 5 min)
+
+**CRITICAL: Preserve Evidence**
+```bash
+# 1. STOP ALL WRITES (prevent further corruption)
+# Put application in read-only mode OR
+# Take application offline
+
+# 2. Snapshot/backup current state (even if corrupted)
+# PostgreSQL:
+pg_dump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql
+
+# MySQL:
+mysqldump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql
+
+# 3. Snapshot disk (cloud)
+# AWS:
+aws ec2 create-snapshot --volume-id vol-1234567890abcdef0 --description "Corruption snapshot"
+
+# Impact: Preserves evidence for forensics
+# Risk: Low (read-only, but dumps add load and may abort on unreadable pages; take the disk snapshot regardless)
+```
+
+**CRITICAL: DO NOT**:
+- Delete corrupted data (may need for forensics)
+- Run REPAIR TABLE (may destroy evidence)
+- Restart database (may clear logs)
+
+---
+
+### Short-term (5 min - 1 hour)
+
+**Option A: Restore from Backup** (if recent clean backup)
+```bash
+# 1. Identify last known good backup
+ls -lh /backup/backup-*.sql
+
+# Example:
+# backup-20251026-0200.sql  ← Clean backup (before corruption)
+# backup-20251026-0800.sql  ← Corrupted
+
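+# 2. Restore from clean backup into a fresh, empty database
+# (loading a dump over the corrupted database fails on existing objects or mixes old and new rows)
+# your_database_restored is an example name; point the application at it once verified
+# PostgreSQL:
+createdb your_database_restored
+psql -v ON_ERROR_STOP=1 your_database_restored < /backup/backup-20251026-0200.sql
+
+# MySQL:
+mysql -e "CREATE DATABASE your_database_restored"
+mysql your_database_restored < /backup/backup-20251026-0200.sql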
+
+# 3. Verify data integrity
+# Run application tests
+# Check user-reported issues
+
+# Impact: Data restored to clean state
+# Risk: Medium (data written after the backup was taken is lost)
+```
+
+**Option B: Repair Corrupted Records** (if isolated corruption)
+```sql
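+-- Identify corrupted records
+SELECT * FROM users WHERE email IS NULL;  -- Should not be null
+
+-- Keep a copy of the affected rows before changing them (table name is an example)
+CREATE TABLE users_corrupted_backup AS SELECT * FROM users WHERE email IS NULL;
+
+-- Fix corrupted records inside a transaction so the change can be rolled back
+BEGIN;
+UPDATE users SET email = 'unknown@example.com' WHERE email IS NULL;
+
+-- Verify fix before committing
+SELECT count(*) FROM users WHERE email IS NULL;  -- Should be 0
+COMMIT;  -- or ROLLBACK if the result is unexpected
+
+-- Impact: Corruption fixed
+-- Risk: Low (if corruption is known and fixable); the shared placeholder fails if email has a UNIQUE constraint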
+```
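+
+When a clean backup exists, a placeholder value is not the only option: the original values can be copied back row by row. The sketch below uses PostgreSQL's `UPDATE ... FROM` syntax and assumes the backup has been loaded into a separate schema named `restored` (a hypothetical name) and that rows are matched on an `id` primary key (also an assumption).
+
+```sql
+BEGIN;
+
+-- Copy the original value back from the restored copy, only where the
+-- live row is damaged and the restored row actually has a value
+UPDATE users AS u
+SET email = b.email
+FROM restored.users AS b
+WHERE b.id = u.id
+  AND u.email IS NULL
+  AND b.email IS NOT NULL;
+
+-- Rows still damaged after the repair (created after the backup was taken)
+SELECT count(*) AS still_null FROM users WHERE email IS NULL;
+
+COMMIT;  -- or ROLLBACK if the remaining count is unexpected
+```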
+
+**Option C: Point-in-Time Recovery** (PostgreSQL)
+```bash
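+# Requires WAL archiving (archive_mode = on) and a base backup taken before the corruption
+
+# 1. Determine recovery point (before corruption)
+# 2025-10-26 07:00:00 (corruption detected at 08:00)
+
+# 2. Restore the existing base backup into a new data directory
+# (do not run pg_basebackup now: it would copy the already corrupted cluster)
+# Example, assuming a tar-format base backup at a path of your choosing:
+# tar -xf /path/to/base.tar -C /var/lib/postgresql/data-recovery
+
+# 3. Configure recovery
+# PostgreSQL 12+: set these in postgresql.conf and create an empty recovery.signal file in the data directory
+# PostgreSQL 11 and earlier: set these in recovery.conf
+# restore_command = 'cp /path/to/wal-archive/%f %p'
+# recovery_target_time = '2025-10-26 07:00:00'
+
+# 4. Start PostgreSQL on the recovery data directory (will replay WAL until target time)
+systemctl start postgresql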
+
+# Impact: Restore to exact point before corruption
+# Risk: Low (if WAL available)
+```
+
+---
+
+### Long-term (1 hour+)
+
+**Root Cause Analysis**:
+
+**If disk corruption**:
+- [ ] Replace disk immediately
+- [ ] Check RAID status
+- [ ] Run filesystem check (fsck) on the unmounted filesystem
+- [ ] Enable database checksums
+
+**If application bug**:
+- [ ] Fix bug in application code
+- [ ] Add data validation
+- [ ] Add integrity checks
+- [ ] Add regression test
+
+**If failed migration**:
+- [ ] Review migration script
+- [ ] Test migrations in staging first
+- [ ] Add rollback plan
+- [ ] Use transaction-based migrations
+
+**If concurrent writes**:
+- [ ] Add locking (row-level, table-level)
+- [ ] Use optimistic locking (version column)
+- [ ] Review transaction isolation level
+- [ ] Add unique constraints
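+
+A minimal sketch of two of the items above, optimistic locking and a unique constraint. The `version` column, the constraint name, and the literal values are illustrative assumptions, as is the `id` primary key.
+
+```sql
+-- Optimistic locking: every row carries a version counter
+ALTER TABLE users ADD COLUMN version integer NOT NULL DEFAULT 0;
+
+-- The application reads the row and remembers its version (say, 7),
+-- then writes only if nobody else has changed the row in the meantime
+UPDATE users
+SET email = 'new@example.com',
+    version = version + 1
+WHERE id = 42
+  AND version = 7;
+-- If this reports 0 rows updated, another writer got there first:
+-- re-read the row and retry instead of overwriting
+
+-- Unique constraint: the database rejects the duplicate even if the
+-- application check loses a race
+ALTER TABLE users ADD CONSTRAINT users_email_key UNIQUE (email);
+```
+
+Adding the unique constraint fails while duplicates still exist, so clean those up first.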
+
+---
+
+## Prevention
+
+**Backups**:
+- [ ] Daily automated backups
+- [ ] Test restore process monthly
+- [ ] Multiple backup locations (local + S3)
+- [ ] Point-in-time recovery enabled (WAL archiving)
+- [ ] Retention: 30 days
+
+**Monitoring**:
+- [ ] Data integrity checks (checksums)
+- [ ] Foreign key violation alerts
+- [ ] Disk error monitoring (SMART)
+- [ ] Backup success/failure alerts
+- [ ] Application-level data validation
+
+**Data Validation**:
+- [ ] Database constraints (NOT NULL, FOREIGN KEY, CHECK)
+- [ ] Application-level validation
+- [ ] Schema migrations in transactions
+- [ ] Automated data quality tests
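+
+The first and third items above can be combined: constraints added inside a single transaction either all apply or none do. The sketch uses PostgreSQL syntax, where DDL is transactional (MySQL commits DDL statements implicitly, so this pattern does not carry over). The `orders` table, its `user_id` column, and the constraint names are hypothetical.
+
+```sql
+BEGIN;
+
+-- NOT NULL: reject rows without an email
+ALTER TABLE users ALTER COLUMN email SET NOT NULL;
+
+-- CHECK: a deliberately loose sanity check, not full email validation
+ALTER TABLE users
+  ADD CONSTRAINT users_email_has_at CHECK (position('@' in email) > 1);
+
+-- FOREIGN KEY: every order must reference an existing user
+ALTER TABLE orders
+  ADD CONSTRAINT orders_user_id_fkey FOREIGN KEY (user_id) REFERENCES users (id);
+
+COMMIT;
+```
+
+If existing rows violate any of these, the statement fails and the transaction aborts, leaving the schema unchanged.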
+
+**Redundancy**:
+- [ ] Database replication (primary + replica)
+- [ ] RAID for disk redundancy
+- [ ] Multi-AZ deployment (cloud)
+
+---
+
+## Escalation
+
+**Escalate to DBA if**:
+- Database-level corruption
+- Need expert for recovery
+
+**Escalate to developer if**:
+- Application bug causing corruption
+- Need code fix
+
+**Escalate to security team if**:
+- Suspected malicious attack
+- Unauthorized data modification
+
+**Escalate to management if**:
+- Critical data lost
+- Legal/compliance implications
+- Data breach
+
+---
+
+## Legal/Compliance
+
+**If critical data corrupted**:
+- [ ] Notify legal team
+- [ ] Notify compliance team
+- [ ] Check notification requirements:
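+  - GDPR: notify the supervisory authority within 72 hours of becoming aware of a personal data breach
+  - HIPAA: notify affected individuals without unreasonable delay, no later than 60 days after discovery of a breach
+  - PCI-DSS: notify the acquirer and card brands immediately on a suspected cardholder data compromise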
+- [ ] Document incident timeline (for audit)
+- [ ] Preserve evidence (forensics)
+
+---
+
+## Related Runbooks
+
+- [07-service-down.md](07-service-down.md) - If database down
+- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
+- [../modules/security-incidents.md](../modules/security-incidents.md) - If malicious attack
+
+---
+
+## Post-Incident
+
+After resolving:
+- [ ] Create post-mortem (MANDATORY for SEV1)
+- [ ] Root cause analysis (what, why, how)
+- [ ] Identify affected users/records
+- [ ] User communication (if needed)
+- [ ] Action items (prevent recurrence)
+- [ ] Update backup/recovery procedures
+- [ ] Update this runbook if needed
+
+---
+
+## Useful Commands Reference
+
+```bash
+# PostgreSQL checksum failures (PostgreSQL 12+, requires data checksums)
+psql -c "SELECT datname, checksum_failures FROM pg_stat_database"
+
+# MySQL table check
+mysqlcheck -c your_database
+
+# Backup
+pg_dump your_database > backup.sql
+mysqldump your_database > backup.sql
+
+# Restore
+psql your_database < backup.sql
+mysql your_database < backup.sql
+
+# Disk check
+dmesg | grep -i "I/O error"
+smartctl -a /dev/sda
+fsck /dev/sda1  # run only on an unmounted filesystem
+
+# Snapshot (AWS)
+aws ec2 create-snapshot --volume-id vol-1234567890abcdef0
+```