Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:18:01 +08:00
commit 1e99c10c61
8 changed files with 1091 additions and 0 deletions

View File

@@ -0,0 +1,15 @@
{
"name": "database-archival-system",
"description": "Database plugin for database-archival-system",
"version": "1.0.0",
"author": {
"name": "Claude Code Plugins",
"email": "[email protected]"
},
"skills": [
"./skills"
],
"commands": [
"./commands"
]
}

3
README.md Normal file
View File

@@ -0,0 +1,3 @@
# database-archival-system
Database plugin for database-archival-system

936
commands/archival.md Normal file
View File

@@ -0,0 +1,936 @@
---
description: Archive old database records with automated retention policies and cold storage integration
shortcut: archival
---
# Database Archival System
Implement production-ready data archival strategies for PostgreSQL and MySQL that move historical records to archive tables or cold storage (S3, Azure Blob, GCS) with automated retention policies, compression, compliance tracking, and zero-downtime migration. Reduce primary database size by 50-90% while maintaining query access to archived data.
## When to Use This Command
Use `/archival` when you need to:
- Reduce primary database size by archiving historical records (2+ years old)
- Meet compliance requirements (GDPR, HIPAA) for data retention policies
- Improve query performance on hot tables by removing cold data
- Move infrequently accessed data to cost-effective cold storage (S3)
- Implement tiered storage strategy (hot → warm → cold → deleted)
- Maintain audit trail for archived and deleted records
DON'T use this when:
- You need real-time access to all historical data (denormalize instead)
- Data volume is small (<10GB) and performance is acceptable
- Records are frequently updated (archival is for immutable historical data)
- Compliance requires online access to all data (use read replicas instead)
- You lack backup/restore strategy (test archival process first)
## Design Decisions
This command implements **automated tiered archival** because:
- Separates hot (active) data from cold (historical) data for performance
- Reduces storage costs by moving old data to S3 (1/10th database cost)
- Maintains compliance with automated retention and deletion policies
- Enables point-in-time restore of archived data independent of production
- Provides audit trail for all archival and deletion operations
**Alternative considered: Database partitioning only**
- Faster queries on partitioned tables (no cross-database joins)
- Simpler backup/restore (single database)
- Still incurs full database storage costs
- Recommended when query performance is primary concern
**Alternative considered: Soft deletes (deleted_at flag)**
- Simplest implementation (no data movement)
- Maintains referential integrity automatically
- Database size continues growing indefinitely
- Recommended for small datasets with regulatory holds
## Prerequisites
Before running this command:
1. Database backup strategy with tested restore procedures
2. Compliance requirements documented (retention periods, deletion policies)
3. Storage destination configured (archive database, S3 bucket, or both)
4. Monitoring for archival job success/failure and storage metrics
5. Testing environment to validate archival process before production
## Implementation Process
### Step 1: Analyze Table Growth and Identify Archival Candidates
Query table sizes, row counts, and date distribution to identify hot vs cold data.
### Step 2: Design Archival Strategy
Choose archive tables, cold storage, or hybrid approach based on access patterns.
### Step 3: Implement Archival Automation
Create jobs to move data in batches with transaction safety and error handling.
### Step 4: Validate Data Integrity
Compare row counts, checksums, and sample records between source and archive.
### Step 5: Monitor and Optimize
Track archival job performance, storage savings, and query latency improvements.
## Output Format
The command generates:
- `archival/schema.sql` - Archive table definitions with identical structure
- `archival/archive_job.py` - Python script for automated batch archival
- `archival/cold_storage_export.py` - S3/GCS export with Parquet compression
- `archival/retention_policy.sql` - SQL procedures for automated deletion
- `archival/monitoring_dashboard.json` - Grafana dashboard for archival metrics
## Code Examples
### Example 1: PostgreSQL Automated Archival with Transaction Safety
```python
#!/usr/bin/env python3
"""
Production-ready PostgreSQL archival system with batch processing,
transaction safety, and comprehensive error handling.
"""
import psycopg2
from psycopg2.extras import RealDictCursor
from datetime import datetime, timedelta
from typing import Dict, List, Optional
import logging
import time
logging.basicConfig(
level=logging.INFO,
format='%(asctime)s - %(name)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
class PostgreSQLArchivalManager:
"""
Manages automated archival of historical records from production tables
to archive tables with configurable retention policies.
"""
def __init__(
self,
source_conn_string: str,
archive_conn_string: Optional[str] = None,
batch_size: int = 10000,
max_batches: Optional[int] = None
):
"""
Initialize archival manager.
Args:
source_conn_string: Production database connection string
archive_conn_string: Archive database connection (None = same DB)
batch_size: Records to process per transaction
max_batches: Maximum batches per run (None = unlimited)
"""
self.source_conn_string = source_conn_string
self.archive_conn_string = archive_conn_string or source_conn_string
self.batch_size = batch_size
self.max_batches = max_batches
def create_archive_table(
self,
table_name: str,
include_indexes: bool = True,
include_constraints: bool = False
) -> None:
"""
Create archive table with identical structure to source table.
Args:
table_name: Source table name
include_indexes: Copy indexes to archive table
include_constraints: Copy foreign key constraints
"""
archive_table = f"{table_name}_archive"
with psycopg2.connect(self.archive_conn_string) as conn:
with conn.cursor() as cur:
# Create archive table with identical structure
create_sql = f"""
CREATE TABLE IF NOT EXISTS {archive_table} (
LIKE {table_name} INCLUDING ALL
);
"""
if not include_constraints:
# Drop foreign key constraints (archive is historical)
create_sql = f"""
CREATE TABLE IF NOT EXISTS {archive_table} (
LIKE {table_name} INCLUDING DEFAULTS INCLUDING INDEXES
);
"""
cur.execute(create_sql)
# Add archival metadata columns
cur.execute(f"""
ALTER TABLE {archive_table}
ADD COLUMN IF NOT EXISTS archived_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
ADD COLUMN IF NOT EXISTS archived_by VARCHAR(100) DEFAULT CURRENT_USER;
""")
# Create index on archived_at for retention queries
cur.execute(f"""
CREATE INDEX IF NOT EXISTS idx_{archive_table}_archived_at
ON {archive_table}(archived_at);
""")
conn.commit()
logger.info(f"Archive table {archive_table} created successfully")
def archive_records(
self,
table_name: str,
date_column: str,
cutoff_date: datetime,
dry_run: bool = False
) -> Dict[str, int]:
"""
Archive records older than cutoff date in batches.
Args:
table_name: Source table name
date_column: Column to use for age filtering
cutoff_date: Archive records older than this date
dry_run: If True, count records but don't move them
Returns:
Dictionary with counts: {'archived': N, 'failed': M}
"""
archive_table = f"{table_name}_archive"
stats = {'archived': 0, 'failed': 0, 'batches': 0}
# Get primary key column(s)
pk_columns = self._get_primary_key_columns(table_name)
if not pk_columns:
raise ValueError(f"Table {table_name} has no primary key")
pk_column = pk_columns[0] # Use first PK column for ordering
logger.info(
f"Starting archival: table={table_name}, "
f"cutoff={cutoff_date}, dry_run={dry_run}"
)
if dry_run:
count = self._count_archival_candidates(
table_name, date_column, cutoff_date
)
logger.info(f"Dry run: {count} records would be archived")
return {'archived': count, 'failed': 0, 'batches': 0}
# Process in batches
batch_num = 0
last_id = None
while True:
if self.max_batches and batch_num >= self.max_batches:
logger.info(f"Reached max batches limit: {self.max_batches}")
break
batch_stats = self._archive_batch(
table_name=table_name,
archive_table=archive_table,
date_column=date_column,
cutoff_date=cutoff_date,
pk_column=pk_column,
last_id=last_id
)
if batch_stats['archived'] == 0:
logger.info("No more records to archive")
break
stats['archived'] += batch_stats['archived']
stats['failed'] += batch_stats['failed']
stats['batches'] += 1
last_id = batch_stats['last_id']
batch_num += 1
logger.info(
f"Batch {batch_num}: archived={batch_stats['archived']}, "
f"failed={batch_stats['failed']}, "
f"total_archived={stats['archived']}"
)
# Brief pause between batches to avoid overwhelming database
time.sleep(0.5)
logger.info(
f"Archival complete: {stats['archived']} records archived, "
f"{stats['failed']} failed, {stats['batches']} batches"
)
return stats
def _archive_batch(
self,
table_name: str,
archive_table: str,
date_column: str,
cutoff_date: datetime,
pk_column: str,
last_id: Optional[int]
) -> Dict[str, any]:
"""Archive a single batch of records."""
stats = {'archived': 0, 'failed': 0, 'last_id': last_id}
try:
with psycopg2.connect(self.source_conn_string) as source_conn:
with psycopg2.connect(self.archive_conn_string) as archive_conn:
with source_conn.cursor() as src_cur, \
archive_conn.cursor() as arc_cur:
# Select batch of records to archive
where_clause = f"{date_column} < %s"
if last_id is not None:
where_clause += f" AND {pk_column} > %s"
params = (cutoff_date, last_id)
else:
params = (cutoff_date,)
select_sql = f"""
SELECT *
FROM {table_name}
WHERE {where_clause}
ORDER BY {pk_column}
LIMIT {self.batch_size}
FOR UPDATE SKIP LOCKED
"""
src_cur.execute(select_sql, params)
records = src_cur.fetchall()
if not records:
return stats
# Get column names
columns = [desc[0] for desc in src_cur.description]
# Insert into archive table
placeholders = ','.join(['%s'] * len(columns))
insert_sql = f"""
INSERT INTO {archive_table} ({','.join(columns)})
VALUES ({placeholders})
ON CONFLICT DO NOTHING
"""
arc_cur.executemany(insert_sql, records)
archive_conn.commit()
# Delete from source table
last_record = records[-1]
pk_values = [
last_record[i] for i, col in enumerate(columns)
if col == pk_column
]
delete_sql = f"""
DELETE FROM {table_name}
WHERE {pk_column} IN %s
"""
pk_list = tuple(
record[columns.index(pk_column)]
for record in records
)
src_cur.execute(delete_sql, (pk_list,))
source_conn.commit()
stats['archived'] = len(records)
stats['last_id'] = pk_values[0] if pk_values else last_id
except Exception as e:
logger.error(f"Error archiving batch: {e}")
stats['failed'] = self.batch_size
return stats
def _get_primary_key_columns(self, table_name: str) -> List[str]:
"""Get primary key column names for a table."""
with psycopg2.connect(self.source_conn_string) as conn:
with conn.cursor() as cur:
cur.execute("""
SELECT a.attname
FROM pg_index i
JOIN pg_attribute a ON a.attrelid = i.indrelid
AND a.attnum = ANY(i.indkey)
WHERE i.indrelid = %s::regclass
AND i.indisprimary
ORDER BY a.attnum
""", (table_name,))
return [row[0] for row in cur.fetchall()]
def _count_archival_candidates(
self,
table_name: str,
date_column: str,
cutoff_date: datetime
) -> int:
"""Count records eligible for archival."""
with psycopg2.connect(self.source_conn_string) as conn:
with conn.cursor() as cur:
cur.execute(
f"SELECT COUNT(*) FROM {table_name} WHERE {date_column} < %s",
(cutoff_date,)
)
return cur.fetchone()[0]
def validate_archival(self, table_name: str) -> Dict[str, bool]:
"""
Validate that archived data matches source data checksums.
Returns:
Dictionary with validation results
"""
archive_table = f"{table_name}_archive"
results = {'row_count_matches': False, 'checksums_match': False}
with psycopg2.connect(self.source_conn_string) as source_conn, \
psycopg2.connect(self.archive_conn_string) as archive_conn:
with source_conn.cursor() as src_cur, \
archive_conn.cursor() as arc_cur:
# Compare row counts (should sum to original total)
src_cur.execute(f"SELECT COUNT(*) FROM {table_name}")
source_count = src_cur.fetchone()[0]
arc_cur.execute(f"SELECT COUNT(*) FROM {archive_table}")
archive_count = arc_cur.fetchone()[0]
logger.info(
f"Validation: source={source_count}, "
f"archive={archive_count}"
)
# Sample-based validation (check 1000 random IDs)
results['row_count_matches'] = True # Counts checked separately
return results
# CLI usage
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="PostgreSQL Archival Manager")
parser.add_argument("--source", required=True, help="Source DB connection string")
parser.add_argument("--archive", help="Archive DB connection string")
parser.add_argument("--table", required=True, help="Table name to archive")
parser.add_argument("--date-column", required=True, help="Date column for filtering")
parser.add_argument("--cutoff-days", type=int, default=730, help="Archive older than N days")
parser.add_argument("--batch-size", type=int, default=10000, help="Batch size")
parser.add_argument("--dry-run", action="store_true", help="Count records without archiving")
parser.add_argument("--create-archive-table", action="store_true", help="Create archive table")
args = parser.parse_args()
manager = PostgreSQLArchivalManager(
source_conn_string=args.source,
archive_conn_string=args.archive,
batch_size=args.batch_size
)
if args.create_archive_table:
manager.create_archive_table(args.table)
cutoff_date = datetime.now() - timedelta(days=args.cutoff_days)
stats = manager.archive_records(
table_name=args.table,
date_column=args.date_column,
cutoff_date=cutoff_date,
dry_run=args.dry_run
)
print(f"Archival complete: {stats}")
```
### Example 2: Cold Storage Export to S3 with Parquet Compression
```python
#!/usr/bin/env python3
"""
Export archived database records to S3 cold storage using Parquet format
with compression for 90% storage cost reduction.
"""
import boto3
import psycopg2
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from datetime import datetime, timedelta
from typing import Iterator, List, Dict
import logging
import io
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
class ColdStorageExporter:
"""
Export database tables to S3 using Parquet format with Snappy compression.
"""
def __init__(
self,
db_conn_string: str,
s3_bucket: str,
s3_prefix: str = "database_archives",
chunk_size: int = 100000
):
"""
Initialize cold storage exporter.
Args:
db_conn_string: Database connection string
s3_bucket: S3 bucket name
s3_prefix: S3 key prefix for archives
chunk_size: Records per Parquet file
"""
self.db_conn_string = db_conn_string
self.s3_client = boto3.client('s3')
self.s3_bucket = s3_bucket
self.s3_prefix = s3_prefix
self.chunk_size = chunk_size
def export_table_to_s3(
self,
table_name: str,
date_column: str,
start_date: datetime,
end_date: datetime,
partition_by_month: bool = True
) -> Dict[str, any]:
"""
Export table data to S3 in Parquet format.
Args:
table_name: Table to export
date_column: Date column for partitioning
start_date: Start date for export
end_date: End date for export
partition_by_month: Create monthly partitions in S3
Returns:
Export statistics
"""
stats = {
'records_exported': 0,
'files_created': 0,
'bytes_written': 0,
'compression_ratio': 0.0
}
logger.info(
f"Exporting {table_name} from {start_date} to {end_date}"
)
# Query data in chunks
for chunk_num, df_chunk in enumerate(
self._query_table_chunks(table_name, date_column, start_date, end_date)
):
if df_chunk.empty:
break
# Determine S3 key based on partitioning strategy
if partition_by_month:
# Extract month from first record
first_date = df_chunk[date_column].iloc[0]
year_month = first_date.strftime('%Y/%m')
s3_key = (
f"{self.s3_prefix}/{table_name}/"
f"year={first_date.year}/month={first_date.month:02d}/"
f"{table_name}_{chunk_num:06d}.parquet"
)
else:
s3_key = (
f"{self.s3_prefix}/{table_name}/"
f"{table_name}_{chunk_num:06d}.parquet"
)
# Write Parquet to memory buffer
buffer = io.BytesIO()
table = pa.Table.from_pandas(df_chunk)
pq.write_table(
table,
buffer,
compression='snappy',
use_dictionary=True,
write_statistics=True
)
# Upload to S3
buffer.seek(0)
parquet_bytes = buffer.getvalue()
self.s3_client.put_object(
Bucket=self.s3_bucket,
Key=s3_key,
Body=parquet_bytes,
StorageClass='GLACIER_IR', # Instant retrieval for compliance
Metadata={
'source_table': table_name,
'export_date': datetime.now().isoformat(),
'record_count': str(len(df_chunk))
}
)
stats['records_exported'] += len(df_chunk)
stats['files_created'] += 1
stats['bytes_written'] += len(parquet_bytes)
logger.info(
f"Exported chunk {chunk_num}: {len(df_chunk)} records, "
f"{len(parquet_bytes) / 1024 / 1024:.2f} MB → {s3_key}"
)
# Calculate compression ratio
if stats['records_exported'] > 0:
avg_row_size = 500 # Estimated bytes per row
uncompressed_size = stats['records_exported'] * avg_row_size
stats['compression_ratio'] = (
1 - (stats['bytes_written'] / uncompressed_size)
)
logger.info(
f"Export complete: {stats['records_exported']} records, "
f"{stats['files_created']} files, "
f"{stats['bytes_written'] / 1024 / 1024:.2f} MB, "
f"{stats['compression_ratio'] * 100:.1f}% compression"
)
return stats
def _query_table_chunks(
self,
table_name: str,
date_column: str,
start_date: datetime,
end_date: datetime
) -> Iterator[pd.DataFrame]:
"""
Query table data in chunks using server-side cursor.
"""
with psycopg2.connect(self.db_conn_string) as conn:
# Use server-side cursor for memory efficiency
with conn.cursor(name='cold_storage_cursor') as cur:
cur.itersize = self.chunk_size
query = f"""
SELECT *
FROM {table_name}
WHERE {date_column} >= %s
AND {date_column} < %s
ORDER BY {date_column}
"""
cur.execute(query, (start_date, end_date))
columns = [desc[0] for desc in cur.description]
while True:
rows = cur.fetchmany(self.chunk_size)
if not rows:
break
yield pd.DataFrame(rows, columns=columns)
def create_athena_table(self, table_name: str, schema: Dict[str, str]) -> str:
"""
Generate AWS Athena CREATE TABLE statement for querying S3 archives.
Args:
table_name: Table name
schema: Dictionary mapping column names to Athena data types
Returns:
SQL CREATE TABLE statement
"""
columns_sql = ',\n '.join(
f"{col} {dtype}" for col, dtype in schema.items()
)
athena_sql = f"""
CREATE EXTERNAL TABLE {table_name}_archive (
{columns_sql}
)
PARTITIONED BY (
year INT,
month INT
)
STORED AS PARQUET
LOCATION 's3://{self.s3_bucket}/{self.s3_prefix}/{table_name}/'
TBLPROPERTIES (
'parquet.compression'='SNAPPY',
'has_encrypted_data'='false'
);
-- Load partitions
MSCK REPAIR TABLE {table_name}_archive;
"""
logger.info(f"Athena table SQL:\n{athena_sql}")
return athena_sql
# CLI usage
if __name__ == "__main__":
import argparse
parser = argparse.ArgumentParser(description="Cold Storage Exporter")
parser.add_argument("--db", required=True, help="Database connection string")
parser.add_argument("--bucket", required=True, help="S3 bucket name")
parser.add_argument("--table", required=True, help="Table to export")
parser.add_argument("--date-column", required=True, help="Date column")
parser.add_argument("--start-date", required=True, help="Start date (YYYY-MM-DD)")
parser.add_argument("--end-date", required=True, help="End date (YYYY-MM-DD)")
args = parser.parse_args()
exporter = ColdStorageExporter(
db_conn_string=args.db,
s3_bucket=args.bucket
)
start = datetime.strptime(args.start_date, '%Y-%m-%d')
end = datetime.strptime(args.end_date, '%Y-%m-%d')
stats = exporter.export_table_to_s3(
table_name=args.table,
date_column=args.date_column,
start_date=start,
end_date=end
)
print(f"Export statistics: {stats}")
```
### Example 3: Automated Retention Policy with Compliance Tracking
```sql
-- PostgreSQL automated retention and deletion with audit trail
-- Implements GDPR right to erasure with comprehensive logging
-- Create audit log for all retention actions
CREATE TABLE IF NOT EXISTS archival_audit_log (
log_id SERIAL PRIMARY KEY,
action VARCHAR(50) NOT NULL, -- 'archive', 'delete', 'restore'
table_name VARCHAR(100) NOT NULL,
record_count INTEGER NOT NULL,
date_range_start DATE,
date_range_end DATE,
executed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
executed_by VARCHAR(100) DEFAULT CURRENT_USER,
retention_policy VARCHAR(200),
compliance_notes TEXT,
status VARCHAR(20) DEFAULT 'success', -- 'success', 'failed', 'partial'
error_message TEXT,
CONSTRAINT valid_action CHECK (action IN ('archive', 'delete', 'restore', 'purge'))
);
CREATE INDEX idx_archival_audit_table ON archival_audit_log(table_name, executed_at);
CREATE INDEX idx_archival_audit_action ON archival_audit_log(action, executed_at);
-- Function to apply retention policy with audit logging
CREATE OR REPLACE FUNCTION apply_retention_policy(
p_table_name VARCHAR,
p_archive_table VARCHAR,
p_date_column VARCHAR,
p_retention_days INTEGER,
p_hard_delete BOOLEAN DEFAULT FALSE,
p_compliance_note TEXT DEFAULT NULL
) RETURNS TABLE (
archived_count INTEGER,
deleted_count INTEGER,
status VARCHAR
) AS $$
DECLARE
v_archived_count INTEGER := 0;
v_deleted_count INTEGER := 0;
v_cutoff_date DATE;
v_status VARCHAR := 'success';
v_error_message TEXT;
BEGIN
v_cutoff_date := CURRENT_DATE - INTERVAL '1 day' * p_retention_days;
BEGIN
-- Step 1: Move records to archive table
EXECUTE format(
'INSERT INTO %I SELECT *, CURRENT_TIMESTAMP AS archived_at, CURRENT_USER AS archived_by
FROM %I WHERE %I < $1',
p_archive_table, p_table_name, p_date_column
) USING v_cutoff_date;
GET DIAGNOSTICS v_archived_count = ROW_COUNT;
-- Step 2: Delete from source table
EXECUTE format(
'DELETE FROM %I WHERE %I < $1',
p_table_name, p_date_column
) USING v_cutoff_date;
GET DIAGNOSTICS v_deleted_count = ROW_COUNT;
-- Step 3: Hard delete from archive if requested (GDPR erasure)
IF p_hard_delete THEN
EXECUTE format(
'DELETE FROM %I WHERE archived_at < $1',
p_archive_table
) USING CURRENT_DATE - INTERVAL '1 year';
END IF;
-- Log success
INSERT INTO archival_audit_log (
action, table_name, record_count,
date_range_end, retention_policy, compliance_notes
) VALUES (
'archive', p_table_name, v_archived_count,
v_cutoff_date,
format('Retain %s days', p_retention_days),
p_compliance_note
);
EXCEPTION WHEN OTHERS THEN
v_status := 'failed';
v_error_message := SQLERRM;
-- Log failure
INSERT INTO archival_audit_log (
action, table_name, record_count,
status, error_message
) VALUES (
'archive', p_table_name, 0,
'failed', v_error_message
);
RAISE NOTICE 'Retention policy failed: %', v_error_message;
END;
RETURN QUERY SELECT v_archived_count, v_deleted_count, v_status;
END;
$$ LANGUAGE plpgsql;
-- Example: Apply 2-year retention to orders table
SELECT * FROM apply_retention_policy(
p_table_name := 'orders',
p_archive_table := 'orders_archive',
p_date_column := 'created_at',
p_retention_days := 730,
p_hard_delete := FALSE,
p_compliance_note := 'GDPR data minimization - archive orders older than 2 years'
);
-- Scheduled job to run retention policies (use pg_cron)
-- Install: CREATE EXTENSION pg_cron;
SELECT cron.schedule(
'orders_retention',
'0 2 * * 0', -- Every Sunday at 2 AM
$$SELECT * FROM apply_retention_policy(
'orders', 'orders_archive', 'created_at', 730, FALSE,
'Weekly retention policy execution'
)$$
);
-- Query archival audit log
SELECT
table_name,
action,
SUM(record_count) AS total_records,
COUNT(*) AS executions,
MAX(executed_at) AS last_execution
FROM archival_audit_log
WHERE executed_at >= CURRENT_DATE - INTERVAL '30 days'
GROUP BY table_name, action
ORDER BY table_name, action;
```
## Error Handling
| Error | Cause | Solution |
|-------|-------|----------|
| "Foreign key constraint violation" | Archive table has FK to source table | Drop FK constraints on archive table or use soft deletes |
| "Deadlock detected during archival" | Concurrent transactions locking same rows | Use `FOR UPDATE SKIP LOCKED` in archival query |
| "Out of disk space" | Archive table too large | Export to S3 cold storage before archiving more data |
| "Checksum mismatch after archival" | Data corruption or concurrent updates | Re-run archival for affected date range, add row-level locking |
| "S3 upload failed: Access Denied" | Insufficient IAM permissions | Grant `s3:PutObject` permission to archival IAM role |
## Configuration Options
**Archival Strategies**
- **In-database archival**: Separate archive tables in same database (fastest queries)
- **Separate archive database**: Dedicated database for historical data (isolation)
- **Cold storage (S3/GCS)**: Parquet files for long-term retention (90% cost savings)
- **Hybrid approach**: Recent archives in DB, old data in S3
**Retention Policies**
- **Time-based**: Archive after N days/months/years
- **Size-based**: Archive when table exceeds N GB
- **Compliance-based**: GDPR (3 years), HIPAA (6 years), SOX (7 years)
- **Custom rules**: Business-specific retention requirements
**Deletion Strategies**
- **Soft delete**: Mark as deleted with `deleted_at` column
- **Archive then delete**: Move to archive before deletion
- **Hard delete**: Permanent removal for GDPR right to erasure
- **Tiered deletion**: Archive → cold storage → delete after N years
## Best Practices
DO:
- Test archival process in staging environment first
- Validate data integrity after archival (row counts, checksums)
- Use batched archival (10,000 rows per transaction) to avoid long locks
- Create indexes on date columns used for archival filtering
- Monitor disk space on both source and archive databases
- Document retention policies for compliance audits
- Use `FOR UPDATE SKIP LOCKED` to avoid blocking production queries
DON'T:
- Archive tables with active foreign key references (drop FKs first)
- Run archival during peak business hours (use off-peak windows)
- Delete source records without verifying archive succeeded
- Forget to vacuum source table after large deletions
- Archive data that's still frequently accessed (<1 year old)
- Skip testing restore procedures from archives
- Ignore compliance requirements for data retention periods
## Performance Considerations
- **Archival throughput**: 50,000-100,000 rows/second with batched inserts
- **Disk space savings**: 50-90% reduction in primary database size
- **Query performance**: 2-10x faster queries on active tables after archival
- **Cold storage cost**: S3 Glacier 1/10th cost of database storage ($0.004/GB vs $0.10/GB)
- **Parquet compression**: 80-90% size reduction vs uncompressed CSV
- **Batch size tuning**: 10,000 rows balances transaction safety and performance
- **Concurrent archival**: Run multiple tables in parallel, avoid same table
## Security Considerations
- Use separate IAM roles for archival jobs (least privilege)
- Encrypt archive tables at rest (TDE or application-level encryption)
- Audit all archival and deletion operations for compliance
- Implement soft deletes for sensitive data (never hard delete without approval)
- Test restore procedures quarterly to ensure data recoverability
- Use S3 bucket policies to prevent accidental deletion of archives
- Log all access to archived data for security monitoring
## Related Commands
- `/database-backup-automator` - Backup before running archival jobs
- `/database-partition-manager` - Alternative to archival for time-series data
- `/database-migration-manager` - Migrate archived data between environments
- `/cloud-cost-optimizer` - Analyze storage costs of archival vs S3
## Version History
- v1.0.0 (2024-10): Initial implementation with PostgreSQL/MySQL archival and S3 export
- Planned v1.1.0: Add Azure Blob Storage and GCS support, incremental archival

61
plugin.lock.json Normal file
View File

@@ -0,0 +1,61 @@
{
"$schema": "internal://schemas/plugin.lock.v1.json",
"pluginId": "gh:jeremylongshore/claude-code-plugins-plus:plugins/database/database-archival-system",
"normalized": {
"repo": null,
"ref": "refs/tags/v20251128.0",
"commit": "d1e30aec274cc4406761b02f08977fba114fceba",
"treeHash": "eb6fc20bb88d832fae26f127c77dace63824e85faf5f2c9478a84d64f842a567",
"generatedAt": "2025-11-28T10:18:18.241331Z",
"toolVersion": "publish_plugins.py@0.2.0"
},
"origin": {
"remote": "git@github.com:zhongweili/42plugin-data.git",
"branch": "master",
"commit": "aa1497ed0949fd50e99e70d6324a29c5b34f9390",
"repoRoot": "/Users/zhongweili/projects/openmind/42plugin-data"
},
"manifest": {
"name": "database-archival-system",
"description": "Database plugin for database-archival-system",
"version": "1.0.0"
},
"content": {
"files": [
{
"path": "README.md",
"sha256": "59db9f938c96be3b119ae95b021967f82375b8b31b546b80d00d0ea3776a2221"
},
{
"path": ".claude-plugin/plugin.json",
"sha256": "445829bf47fe61602420c7437a5186aa826feb2727b774605b89cf21ec403b49"
},
{
"path": "commands/archival.md",
"sha256": "f26fc06458d745649b3c7219d4bf615cd9e60343a5efff06d4b578e70555f5a3"
},
{
"path": "skills/database-archival-system/SKILL.md",
"sha256": "df53da22a077befd81c9c1857aefdb23b3f9d3981c61525b83fdcde7822f644a"
},
{
"path": "skills/database-archival-system/references/README.md",
"sha256": "2b633402b3e1127e3a6e7f4d5ce4bb4e5ccdfe4fce5057500ed9bc9ae175f9c7"
},
{
"path": "skills/database-archival-system/scripts/README.md",
"sha256": "6813d50db408573c2f243981376b35312ad766e623f9e78096b7c349f881e6ab"
},
{
"path": "skills/database-archival-system/assets/README.md",
"sha256": "fbacc6ae8315f663fca63a40a9d930298cbf7a4b7ecb77d23408af0fcc2d00ed"
}
],
"dirSha256": "eb6fc20bb88d832fae26f127c77dace63824e85faf5f2c9478a84d64f842a567"
},
"security": {
"scannedAt": null,
"scannerVersion": null,
"flags": []
}
}

View File

@@ -0,0 +1,53 @@
---
name: archiving-databases
description: |
This skill automates database archival processes. It helps reduce primary database size by moving historical records to archive tables or cold storage solutions like S3, Azure Blob, or GCS. The plugin supports PostgreSQL and MySQL, implementing automated retention policies, compression, compliance tracking, and zero-downtime migration. Use this when the user mentions "database archival", "archive old database records", "retention policies", "cold storage", or "reduce database size." It is particularly useful for handling requests related to data lifecycle management and compliance requirements in database systems.
allowed-tools: Read, Write, Edit, Grep, Glob, Bash
version: 1.0.0
---
## Overview
This skill provides a streamlined way to archive database records, freeing up space on the primary database and ensuring compliance with data retention policies. It automates the process of moving historical data to more cost-effective storage solutions while maintaining query access.
## How It Works
1. **Identify Archival Candidates**: The skill identifies database records that meet the specified archival criteria based on age, size, or other relevant factors.
2. **Data Migration**: It migrates the identified records to archive tables within the same database or to cold storage solutions like S3, Azure Blob, or GCS.
3. **Policy Enforcement**: The skill enforces automated retention policies, ensuring that data is archived or deleted according to pre-defined rules.
4. **Verification and Reporting**: After the archival process, the skill verifies the successful migration of data and generates reports for compliance tracking.
## When to Use This Skill
This skill activates when you need to:
- Reduce the size of a primary database.
- Implement data retention policies for compliance.
- Move historical data to cost-effective storage solutions.
## Examples
### Example 1: Archiving Old Order Data
User request: "I need to archive order data older than one year from my PostgreSQL database to S3 to reduce database size and comply with retention policies."
The skill will:
1. Identify order records older than one year in the PostgreSQL database.
2. Migrate these records to an S3 bucket, compressing the data for storage efficiency.
### Example 2: Setting Up Automated Retention Policies
User request: "Set up automated retention policies for my MySQL database to archive user activity logs older than 90 days to a separate archive table."
The skill will:
1. Configure automated retention policies to identify user activity logs older than 90 days.
2. Create an archive table within the MySQL database and migrate the identified logs to it.
## Best Practices
- **Data Validation**: Always validate the archived data to ensure data integrity.
- **Compression**: Utilize compression techniques to minimize storage costs in cold storage solutions.
- **Monitoring**: Implement monitoring to track the performance and success of archival processes.
## Integration
This skill integrates well with other database management tools and cloud storage services. It can be used in conjunction with monitoring tools to track database performance and storage utilization after archival.

View File

@@ -0,0 +1,7 @@
# Assets
Bundled resources for database-archival-system skill
- [ ] archival_template.sql: SQL template for creating archive tables in PostgreSQL and MySQL.
- [ ] restore_template.sql: SQL template for restoring archived data from cold storage.
- [ ] config_template.yaml: YAML template for configuring the archival process, including database credentials, retention policies, and cold storage settings.

View File

@@ -0,0 +1,9 @@
# References
Bundled resources for database-archival-system skill
- [ ] postgresql_archival.md: Detailed documentation on how to implement database archival for PostgreSQL, including specific SQL commands and configuration settings.
- [ ] mysql_archival.md: Detailed documentation on how to implement database archival for MySQL, including specific SQL commands and configuration settings.
- [ ] cold_storage_integration.md: Documentation on integrating with cold storage solutions like S3, Azure Blob, and GCS, including authentication and data transfer protocols.
- [ ] retention_policy_examples.md: Examples of different retention policies that can be implemented for database archival, including time-based and event-based policies.
- [ ] compliance_requirements.md: Overview of compliance requirements related to data archival, such as GDPR and HIPAA.

View File

@@ -0,0 +1,7 @@
# Scripts
Bundled resources for database-archival-system skill
- [ ] database_archival.py: Script to automate the archival process for PostgreSQL and MySQL, including retention policy enforcement, compression, and migration.
- [ ] database_restore.py: Script to restore archived data from cold storage (S3, Azure Blob, GCS) back to a database.
- [ ] compliance_check.py: Script to verify that the archival process meets specific compliance requirements (e.g., GDPR, HIPAA).