Playbook: Data Corruption

Symptoms

  • Users report incorrect data
  • Database integrity constraint violations
  • Foreign key errors
  • Application errors due to unexpected data
  • Failed backups (checksum mismatch)
  • Monitoring alert: "Data integrity check failed"

Severity

  • SEV1 - Critical data corrupted (financial, health, legal)
  • SEV2 - Non-critical data corrupted (user profiles, cache)
  • SEV3 - Recoverable corruption (can restore from backup)

Diagnosis

Step 1: Confirm Corruption

Database Integrity Check (PostgreSQL):

-- Confirm the database is reachable and its catalog entry is intact
SELECT * FROM pg_catalog.pg_database WHERE datname = 'your_database';

-- Check whether data checksums are enabled and whether failures have been recorded (PostgreSQL 12+)
SHOW data_checksums;
SELECT datname, checksum_failures, checksum_last_failure
FROM pg_stat_database
WHERE datname = 'your_database';

-- List the largest tables; unexpected growth can point to bloat or runaway writes
SELECT schemaname, tablename, pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename))
FROM pg_tables
ORDER BY pg_total_relation_size(schemaname||'.'||tablename) DESC;
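
For deeper verification, the amcheck extension (and the pg_amcheck command-line tool shipped with PostgreSQL 14+) can scan relations for structural corruption. A minimal sketch, assuming the tool is installed and your_database is the target; the index name is illustrative:

# Scan the target database and report any corruption found
pg_amcheck --dbname=your_database --progress

# On older versions, use the amcheck extension directly to check a specific index
# psql -d your_database -c "CREATE EXTENSION IF NOT EXISTS amcheck; SELECT bt_index_check('users_pkey');"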

Database Integrity Check (MySQL):

-- Check a table for corruption
CHECK TABLE users;

-- Repair table (MyISAM/ARCHIVE/CSV only; InnoDB is not supported).
-- Take a backup first - see the "DO NOT" list under Mitigation before running this.
REPAIR TABLE users;

-- Optimize table (rebuilds and defragments; for InnoDB this maps to ALTER TABLE ... FORCE)
OPTIMIZE TABLE users;

Step 2: Identify Scope

Questions to answer:

  • Which tables/data are affected?
  • How many records are corrupted? (see the sizing sketch after this list)
  • When did corruption start?
  • What's the impact on users?
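
A minimal sketch for sizing the damage, assuming a hypothetical orders table with a user_id foreign key and an updated_at timestamp (PostgreSQL shown; adjust names to your schema):

-- How many child rows point at a parent that no longer exists?
SELECT count(*)
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE u.id IS NULL;

-- When did the bad rows start appearing?
SELECT date_trunc('hour', o.updated_at) AS hour, count(*)
FROM orders o
LEFT JOIN users u ON u.id = o.user_id
WHERE u.id IS NULL
GROUP BY 1
ORDER BY 1;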

Check Database Logs:

# PostgreSQL
grep "ERROR\|FATAL\|PANIC" /var/log/postgresql/postgresql.log

# MySQL
grep "ERROR" /var/log/mysql/error.log

# Look for:
# - Constraint violations
# - Foreign key errors
# - Checksum errors
# - Disk I/O errors

Step 3: Determine Root Cause

Common causes:

Cause               Symptoms
Disk corruption     I/O errors in dmesg, checksum failures
Application bug     Logical corruption (wrong data, not random)
Failed migration    Schema mismatch, foreign key violations
Concurrent writes   Race condition, duplicate records
Hardware failure    Random corruption, unrelated records
Malicious attack    Deliberate data modification

Check for Disk Errors:

# Check disk errors
dmesg | grep -i "I/O error\|disk error"

# Check SMART status
smartctl -a /dev/sda

# Look for: Reallocated_Sector_Ct, Current_Pending_Sector

Mitigation

Immediate (Now - 5 min)

CRITICAL: Preserve Evidence

# 1. STOP ALL WRITES (prevent further corruption)
# Put application in read-only mode OR take application offline
# (a database-level fallback is sketched after this block)

# 2. Snapshot/backup current state (even if corrupted)
# PostgreSQL:
pg_dump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql

# MySQL:
mysqldump your_database > /backup/corrupted-$(date +%Y%m%d-%H%M%S).sql

# Note: logical dumps can fail on badly corrupted pages; the disk snapshot below still preserves state

# 3. Snapshot disk (cloud)
# AWS:
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0 --description "Corruption snapshot"

# Impact: Preserves evidence for forensics
# Risk: None (read-only operations)
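
If the application cannot be switched to read-only quickly, writes can also be blocked at the database layer. A minimal sketch, assuming administrative access; the PostgreSQL setting applies to new sessions only, and the MySQL setting does not restrict SUPER/CONNECTION_ADMIN users:

# PostgreSQL: default new transactions against this database to read-only
psql -c "ALTER DATABASE your_database SET default_transaction_read_only = on;"

# MySQL: make the server read-only for ordinary clients
mysql -e "SET GLOBAL read_only = ON;"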

CRITICAL: DO NOT:

  • Delete corrupted data (may need for forensics)
  • Run REPAIR TABLE (may destroy evidence)
  • Restart database (may clear logs)

Short-term (5 min - 1 hour)

Option A: Restore from Backup (if recent clean backup)

# 1. Identify last known good backup
ls -lh /backup/*.sql

# Example:
# backup-20251026-0200.sql  ← Clean backup (before corruption)
# backup-20251026-0800.sql  ← Corrupted

# 2. Restore from clean backup (into a freshly created database so corrupted objects are not mixed in)
# PostgreSQL:
dropdb your_database && createdb your_database
psql your_database < /backup/backup-20251026-0200.sql

# MySQL:
mysql -e "DROP DATABASE your_database; CREATE DATABASE your_database;"
mysql your_database < /backup/backup-20251026-0200.sql

# 3. Verify data integrity
# Run application tests
# Check user-reported issues

# Impact: Data restored to clean state
# Risk: Medium (lose any data written after the backup was taken)

Option B: Repair Corrupted Records (if isolated corruption)

-- Identify corrupted records (example: email should never be NULL)
SELECT * FROM users WHERE email IS NULL;

-- Fix corrupted records inside a transaction so the change can be rolled back if verification fails
BEGIN;
UPDATE users SET email = 'unknown@example.com' WHERE email IS NULL;

-- Verify fix before committing
SELECT count(*) FROM users WHERE email IS NULL;  -- Should be 0
COMMIT;

-- Impact: Corruption fixed
-- Risk: Low (if corruption is known and fixable)

Option C: Point-in-Time Recovery (PostgreSQL)

# If WAL (Write-Ahead Log) archiving is enabled:

# 1. Determine recovery point (before corruption)
# 2025-10-26 07:00:00 (corruption detected at 08:00)

# 2. Restore a base backup into a fresh data directory
pg_basebackup -D /var/lib/postgresql/data-recovery

# 3. Configure the recovery target
# PostgreSQL 12+: set recovery_target_time in postgresql.conf and create an empty recovery.signal file
# PostgreSQL 11 and older: put it in recovery.conf
# recovery_target_time = '2025-10-26 07:00:00'

# 4. Start PostgreSQL (will replay WAL up to the target time)
systemctl start postgresql

# Impact: Restore to exact point before corruption
# Risk: Low (if WAL available)

Long-term (1 hour+)

Root Cause Analysis:

If disk corruption:

  • Replace disk immediately
  • Check RAID status
  • Run filesystem check (fsck)
  • Enable database checksums
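
For the last item, checksums can be enabled offline with pg_checksums (PostgreSQL 12+). A minimal sketch, assuming a default data directory path; enabling rewrites every block, so expect downtime proportional to cluster size:

# The cluster must be cleanly shut down before enabling checksums
systemctl stop postgresql
pg_checksums --enable --pgdata=/var/lib/postgresql/data
systemctl start postgresql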

If application bug:

  • Fix bug in application code
  • Add data validation
  • Add integrity checks
  • Add regression test

If failed migration:

  • Review migration script
  • Test migrations in staging first
  • Add rollback plan
  • Use transaction-based migrations
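
A minimal sketch of a transaction-based migration (PostgreSQL, where DDL is transactional; most MySQL DDL commits implicitly, so this pattern mainly protects PostgreSQL migrations). Table and column names are illustrative:

BEGIN;

-- Hypothetical migration: add a column and backfill it
ALTER TABLE users ADD COLUMN status text NOT NULL DEFAULT 'active';
UPDATE users SET status = 'inactive' WHERE deleted_at IS NOT NULL;

-- Any error before this point aborts the transaction and rolls the schema change back
COMMIT;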

If concurrent writes:

  • Add locking (row-level, table-level)
  • Use optimistic locking (version column; sketched after this list)
  • Review transaction isolation level
  • Add unique constraints
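
A minimal sketch of optimistic locking with a version column (names and values are illustrative): the UPDATE succeeds only if no one else changed the row since it was read, and an affected-row count of 0 signals a conflict the application should retry.

-- Read the current data together with its version
SELECT id, email, version FROM users WHERE id = 42;

-- Write back only if the version is unchanged; bump it atomically
UPDATE users
SET email = 'new@example.com', version = version + 1
WHERE id = 42 AND version = 7;  -- 7 = version read above; 0 rows updated means conflict, retry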

Prevention

Backups:

  • Daily automated backups
  • Test restore process monthly
  • Multiple backup locations (local + S3)
  • Point-in-time recovery enabled (WAL archiving; example configuration after this list)
  • Retention: 30 days
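
A minimal sketch of enabling point-in-time recovery in postgresql.conf, assuming WAL is archived to /backup/wal (any durable location or archiving tool works); archive_mode requires a restart, and periodic base backups (pg_basebackup) are still needed as the starting point for replay:

# postgresql.conf
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /backup/wal/%f && cp %p /backup/wal/%f'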

Monitoring:

  • Data integrity checks (checksums)
  • Foreign key violation alerts
  • Disk error monitoring (SMART)
  • Backup success/failure alerts
  • Application-level data validation

Data Validation:

  • Database constraints (NOT NULL, FOREIGN KEY, CHECK; example after this list)
  • Application-level validation
  • Schema migrations in transactions
  • Automated data quality tests
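
A minimal sketch of the first item, with illustrative table and column names: let the database reject bad data instead of discovering it later.

CREATE TABLE orders (
    id         bigint GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_id    bigint NOT NULL REFERENCES users (id),       -- foreign key: no orphaned orders
    amount     numeric(12,2) NOT NULL CHECK (amount >= 0),  -- CHECK: no negative amounts
    created_at timestamptz NOT NULL DEFAULT now()
);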

Redundancy:

  • Database replication (primary + replica)
  • RAID for disk redundancy
  • Multi-AZ deployment (cloud)

Escalation

Escalate to DBA if:

  • Database-level corruption
  • Need expert for recovery

Escalate to developer if:

  • Application bug causing corruption
  • Need code fix

Escalate to security team if:

  • Suspected malicious attack
  • Unauthorized data modification

Escalate to management if:

  • Critical data lost
  • Legal/compliance implications
  • Data breach

Legal/Compliance

If critical data corrupted:

  • Notify legal team
  • Notify compliance team
  • Check notification requirements:
    • GDPR: 72 hours for breach notification
    • HIPAA: 60 days for breach notification
    • PCI-DSS: Immediate notification
  • Document incident timeline (for audit)
  • Preserve evidence (forensics)


Post-Incident

After resolving:

  • Create post-mortem (MANDATORY for SEV1)
  • Root cause analysis (what, why, how)
  • Identify affected users/records
  • User communication (if needed)
  • Action items (prevent recurrence)
  • Update backup/recovery procedures
  • Update this runbook if needed

Useful Commands Reference

# PostgreSQL integrity check
psql -c "SELECT * FROM pg_catalog.pg_database"

# MySQL table check
mysqlcheck -c your_database

# Backup
pg_dump your_database > backup.sql
mysqldump your_database > backup.sql

# Restore
psql your_database < backup.sql
mysql your_database < backup.sql

# Disk check
dmesg | grep -i "I/O error"
smartctl -a /dev/sda
fsck /dev/sda1   # run only on an unmounted filesystem

# Snapshot (AWS)
aws ec2 create-snapshot --volume-id vol-1234567890abcdef0