Initial commit
agents/sre/playbooks/08-data-corruption.md
# Playbook: Data Corruption

## Symptoms

- Users report incorrect data
- Database integrity constraint violations
- Foreign key errors
- Application errors due to unexpected data
- Failed backups (checksum mismatch)
- Monitoring alert: "Data integrity check failed"

## Severity

- **SEV1** - Critical data corrupted (financial, health, legal)
- **SEV2** - Non-critical data corrupted (user profiles, cache)
- **SEV3** - Recoverable corruption (can restore from backup)

## Diagnosis

### Step 1: Confirm Corruption

**Database Integrity Check** (PostgreSQL):
```sql
-- Check whether data checksums are enabled for the cluster
SHOW data_checksums;

-- Check for recorded checksum failures (PostgreSQL 12+, requires checksums enabled)
SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database
WHERE datname = 'your_database';

-- List tables by total size (context for scoping; size alone does not indicate corruption)
SELECT schemaname, tablename,
       pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass)) AS total_size
FROM pg_tables
ORDER BY pg_total_relation_size(format('%I.%I', schemaname, tablename)::regclass) DESC;
```

**Database Integrity Check** (MySQL):
```sql
-- Check table for corruption
CHECK TABLE users;

-- Repair table (MyISAM, ARCHIVE, and CSV tables only; not supported for InnoDB).
-- Do not run this during an incident until evidence is preserved (see Mitigation).
REPAIR TABLE users;

-- Optimize table (defragment); maintenance only, not a corruption check
OPTIMIZE TABLE users;
```

---

### Step 2: Identify Scope

**Questions to answer**:
- Which tables/data are affected?
- How many records are corrupted?
- When did the corruption start?
- What's the impact on users?

**Check Database Logs**:
```bash
# PostgreSQL
grep "ERROR\|FATAL\|PANIC" /var/log/postgresql/postgresql.log

# MySQL
grep "ERROR" /var/log/mysql/error.log

# Look for:
# - Constraint violations
# - Foreign key errors
# - Checksum errors
# - Disk I/O errors
```

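To get a rough first answer to "which kind of error, and how many", the log check can be wrapped in a small counting helper. This is a minimal sketch, not part of the original playbook: the function name `summarize_log_errors` and the four search strings are assumptions chosen to mirror the "Look for" list, and a line that matches several strings is counted once per category. Call it as `summarize_log_errors /var/log/postgresql/postgresql.log`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count log lines per corruption-related category.
# The search strings are illustrative; adjust them to your log format.
summarize_log_errors() {
  local log_file="$1" entry label pattern count
  [ -r "$log_file" ] || { echo "cannot read: $log_file" >&2; return 1; }
  for entry in \
    "constraint=constraint" \
    "foreign_key=foreign key" \
    "checksum=checksum" \
    "disk_io=I/O error"; do
    label="${entry%%=*}"      # text before the first "="
    pattern="${entry#*=}"     # text after the first "="
    # -c counts matching lines, -i ignores case, -F treats the pattern literally.
    # grep exits non-zero when there are no matches, so "|| true" keeps the 0 count.
    count=$(grep -c -i -F -- "$pattern" "$log_file" || true)
    printf '%s=%s\n' "$label" "$count"
  done
}
```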
---

### Step 3: Determine Root Cause

**Common causes**:

| Cause | Symptoms |
|-------|----------|
| Disk corruption | I/O errors in dmesg, checksum failures |
| Application bug | Logical corruption (wrong data, not random) |
| Failed migration | Schema mismatch, foreign key violations |
| Concurrent writes | Race condition, duplicate records |
| Hardware failure | Random corruption, unrelated records |
| Malicious attack | Deliberate data modification |

**Check for Disk Errors**:
```bash
# Check disk errors
dmesg | grep -i "I/O error\|disk error"

# Check SMART status
smartctl -a /dev/sda

# Look for: Reallocated_Sector_Ct, Current_Pending_Sector
```

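The two SMART attributes named above can be checked mechanically instead of by eye. The sketch below assumes the ATA attribute table printed by `smartctl -A`, where the attribute name is the second column and the raw value is the tenth; NVMe devices report health differently and are not covered. The function name `check_smart_attributes` is hypothetical. A typical call would be `smartctl -A /dev/sda | check_smart_attributes`, which exits non-zero when either attribute has a non-zero raw value.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `smartctl -A` output on stdin and warn when
# Reallocated_Sector_Ct or Current_Pending_Sector has a non-zero raw value.
check_smart_attributes() {
  awk '
    $2 == "Reallocated_Sector_Ct" || $2 == "Current_Pending_Sector" {
      raw = $10 + 0                      # column 10 is RAW_VALUE in the ATA table
      if (raw > 0) {
        printf "WARN %s raw=%d\n", $2, raw
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }           # exit 1 if any watched attribute is non-zero
  '
}
```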
---

## Mitigation

### Immediate (Now - 5 min)

**CRITICAL: Preserve Evidence**
```bash
# 1. STOP ALL WRITES (prevent further corruption)
# Put application in read-only mode OR
# Take application offline

# 2. Snapshot/backup current state (even if corrupted)
# PostgreSQL:
pg_dump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql

# MySQL (--single-transaction avoids locking InnoDB tables during the dump):
mysqldump --single-transaction your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql

# 3. Snapshot disk (cloud)
# AWS:
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0 --description "Corruption snapshot"

# Impact: Preserves evidence for forensics
# Risk: Low (read-only operations; a logical dump can fail or add load on a damaged
#       database, so rely on the disk snapshot when physical corruption is suspected)
```

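Evidence is only useful if it can later be shown to be unchanged. One common practice, sketched here as an assumption rather than something the playbook prescribes, is to record a SHA-256 checksum next to each dump and make the dump read-only. The function names are hypothetical, and GNU coreutils `sha256sum` is assumed to be installed. Example: `record_evidence_checksum /backup/corrupted-20251026-080000.sql`, followed later by `verify_evidence_checksum` on the same path.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for evidence handling; assumes GNU coreutils sha256sum.

# Write <file>.sha256 next to the evidence file and remove write permission.
record_evidence_checksum() {
  local evidence_file="$1" dir base
  [ -f "$evidence_file" ] || { echo "not a file: $evidence_file" >&2; return 1; }
  dir=$(dirname "$evidence_file")
  base=$(basename "$evidence_file")
  # Run inside the directory so the manifest stores a relative file name.
  ( cd "$dir" && sha256sum "$base" > "$base.sha256" ) || return 1
  chmod a-w "$evidence_file"
}

# Return 0 if the file still matches its recorded checksum, non-zero otherwise.
verify_evidence_checksum() {
  local evidence_file="$1" dir base
  dir=$(dirname "$evidence_file")
  base=$(basename "$evidence_file")
  ( cd "$dir" && sha256sum --check --quiet "$base.sha256" )
}
```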
**CRITICAL: DO NOT**:
- Delete corrupted data (may be needed for forensics)
- Run REPAIR TABLE (may destroy evidence)
- Restart database (may clear logs)

---

### Short-term (5 min - 1 hour)

**Option A: Restore from Backup** (if recent clean backup)
```bash
# 1. Identify last known good backup
ls -lh /backup/backup-*.sql

# Example:
# backup-20251026-0200.sql ← Clean backup (before corruption)
# backup-20251026-0800.sql ← Corrupted

# 2. Restore from clean backup
# PostgreSQL (restore into an empty database: a plain pg_dump file taken
# without --clean does not drop existing objects):
psql your_database < /backup/backup-20251026-0200.sql

# MySQL:
mysql your_database < /backup/backup-20251026-0200.sql

# 3. Verify data integrity
# Run application tests
# Check user-reported issues

# Impact: Data restored to clean state
# Risk: Medium (lose data written after the backup time)
```

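Step 1 above can be scripted once the corruption start time is known. The sketch below assumes backups follow the `backup-YYYYMMDD-HHMM.sql` naming used in the example; the function name `last_backup_before` is hypothetical. It compares timestamps in the file names only, so the selected backup still has to be verified before it is trusted as clean. For example, `last_backup_before /backup 20251026-0700` would print the path of the 02:00 backup from the example listing.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the newest backup-YYYYMMDD-HHMM.sql in a directory
# whose file-name timestamp is strictly earlier than the cutoff (same format).
last_backup_before() {
  local backup_dir="$1" cutoff="$2"
  local path stamp best_path="" best_stamp=""
  for path in "$backup_dir"/backup-*.sql; do
    [ -e "$path" ] || continue                          # glob matched nothing
    stamp="${path##*/backup-}"                          # strip directory and prefix
    stamp="${stamp%.sql}"                               # strip extension
    [[ "$stamp" =~ ^[0-9]{8}-[0-9]{4}$ ]] || continue   # skip unexpected names
    # Fixed-width timestamps order correctly when compared as strings.
    if [[ "$stamp" < "$cutoff" && "$stamp" > "$best_stamp" ]]; then
      best_stamp="$stamp"
      best_path="$path"
    fi
  done
  # Returns non-zero and prints nothing when no backup predates the cutoff.
  [ -n "$best_path" ] && printf '%s\n' "$best_path"
}
```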
**Option B: Repair Corrupted Records** (if isolated corruption)
```sql
-- Identify corrupted records
SELECT * FROM users WHERE email IS NULL; -- Should not be null

-- Fix corrupted records inside a transaction so the change can be rolled back
BEGIN;
UPDATE users SET email = 'unknown@example.com' WHERE email IS NULL;

-- Verify fix
SELECT count(*) FROM users WHERE email IS NULL; -- Should be 0

-- COMMIT if the count is 0, otherwise ROLLBACK
COMMIT;

-- Impact: Corruption fixed
-- Risk: Low (if corruption is known and fixable)
```

**Option C: Point-in-Time Recovery** (PostgreSQL)
```bash
# If WAL (Write-Ahead Logging) archiving was enabled before the corruption:

# 1. Determine recovery point (before corruption)
# 2025-10-26 07:00:00 (corruption detected at 08:00)

# 2. Restore an existing base backup taken BEFORE the recovery point into a new data directory
# (pg_basebackup creates a new backup of the running server; it does not restore one.)
# Example, assuming a tar-format base backup at the hypothetical path /backup/base/base.tar.gz:
mkdir -p /var/lib/postgresql/data-recovery
tar -xzf /backup/base/base.tar.gz -C /var/lib/postgresql/data-recovery

# 3. Configure recovery
# PostgreSQL 12+: set these in postgresql.conf and create an empty recovery.signal file
# PostgreSQL 11 and earlier: set these in recovery.conf
# restore_command = 'cp /path/to/wal_archive/%f %p'
# recovery_target_time = '2025-10-26 07:00:00'

# 4. Start PostgreSQL on the recovery data directory (replays WAL until the target time)
systemctl start postgresql

# Impact: Restore to exact point before corruption
# Risk: Low (if a base backup and archived WAL are available)
```

---

### Long-term (1 hour+)

**Root Cause Analysis**:

**If disk corruption**:
- [ ] Replace disk immediately
- [ ] Check RAID status
- [ ] Run filesystem check (fsck)
- [ ] Enable database checksums

**If application bug**:
- [ ] Fix bug in application code
- [ ] Add data validation
- [ ] Add integrity checks
- [ ] Add regression test

**If failed migration**:
- [ ] Review migration script
- [ ] Test migrations in staging first
- [ ] Add rollback plan
- [ ] Use transaction-based migrations

**If concurrent writes**:
- [ ] Add locking (row-level, table-level)
- [ ] Use optimistic locking (version column)
- [ ] Review transaction isolation level
- [ ] Add unique constraints

---

## Prevention

**Backups**:
- [ ] Daily automated backups
- [ ] Test restore process monthly
- [ ] Multiple backup locations (local + S3)
- [ ] Point-in-time recovery enabled (WAL)
- [ ] Retention: 30 days

**Monitoring**:
- [ ] Data integrity checks (checksums)
- [ ] Foreign key violation alerts
- [ ] Disk error monitoring (SMART)
- [ ] Backup success/failure alerts
- [ ] Application-level data validation

**Data Validation**:
- [ ] Database constraints (NOT NULL, FOREIGN KEY, CHECK)
- [ ] Application-level validation
- [ ] Schema migrations in transactions
- [ ] Automated data quality tests

**Redundancy**:
- [ ] Database replication (primary + replica)
- [ ] RAID for disk redundancy
- [ ] Multi-AZ deployment (cloud)

---

## Escalation

**Escalate to DBA if**:
- Database-level corruption
- Need expert for recovery

**Escalate to developer if**:
- Application bug causing corruption
- Need code fix

**Escalate to security team if**:
- Suspected malicious attack
- Unauthorized data modification

**Escalate to management if**:
- Critical data lost
- Legal/compliance implications
- Data breach

---

## Legal/Compliance

**If critical data corrupted**:
- [ ] Notify legal team
- [ ] Notify compliance team
- [ ] Check notification requirements (these apply when the incident qualifies as a breach):
  - GDPR: notify the supervisory authority within 72 hours of becoming aware of the breach
  - HIPAA: notify affected individuals no later than 60 days after discovery
  - PCI-DSS: Immediate notification
- [ ] Document incident timeline (for audit)
- [ ] Preserve evidence (forensics)

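The notification windows above translate into concrete deadlines once the detection time is fixed. The sketch below is an illustration, not legal guidance: the function name `notification_deadline` is hypothetical, GNU `date` is assumed, and the moment the clock starts is a determination for the legal and compliance teams. For a detection time of `2025-10-26 08:00:00` UTC, `notification_deadline "2025-10-26 08:00:00" 72` prints `2025-10-29T08:00:00Z`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the UTC deadline that falls a given number of
# hours after the detection time. Assumes GNU date (-d and @epoch support).
notification_deadline() {
  local detected_utc="$1" window_hours="$2" start_epoch
  # Interpret the detection time as UTC and convert it to epoch seconds.
  start_epoch=$(date -u -d "$detected_utc" +%s) || return 1
  # Add the window in seconds and format the result as an ISO 8601 UTC timestamp.
  date -u -d "@$((start_epoch + window_hours * 3600))" +%Y-%m-%dT%H:%M:%SZ
}

# Example usage (commented out so sourcing this file has no side effects):
#   notification_deadline "2025-10-26 08:00:00" 72            # 72-hour window
#   notification_deadline "2025-10-26 08:00:00" $((60 * 24))  # 60-day window
```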
---

## Related Runbooks

- [07-service-down.md](07-service-down.md) - If database down
- [../modules/database-diagnostics.md](../modules/database-diagnostics.md) - Database troubleshooting
- [../modules/security-incidents.md](../modules/security-incidents.md) - If malicious attack

---

## Post-Incident

After resolving:
- [ ] Create post-mortem (MANDATORY for SEV1)
- [ ] Root cause analysis (what, why, how)
- [ ] Identify affected users/records
- [ ] User communication (if needed)
- [ ] Action items (prevent recurrence)
- [ ] Update backup/recovery procedures
- [ ] Update this runbook if needed

---

## Useful Commands Reference

```bash
# PostgreSQL checksum failures (PostgreSQL 12+, data checksums enabled)
psql -c "SELECT datname, checksum_failures FROM pg_stat_database"

# MySQL table check
mysqlcheck -c your_database

# Backup
pg_dump your_database > backup.sql
mysqldump your_database > backup.sql

# Restore
psql your_database < backup.sql
mysql your_database < backup.sql

# Disk check
dmesg | grep -i "I/O error"
smartctl -a /dev/sda
fsck /dev/sda1   # run only on an unmounted filesystem

# Snapshot (AWS)
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0
```