# Data Processing Skill Pattern

Use this pattern when your skill **processes, analyzes, or transforms** data to extract insights.

## When to Use

- Skill ingests data from files or APIs
- Performs analysis or transformation
- Generates insights, reports, or visualizations
- Examples: cc-insights (conversation analysis)

## Structure

### Data Flow Architecture

Define a clear data pipeline:

```
Input Sources → Processing → Storage → Query/Analysis → Output
```

Example:

```
JSONL files → Parser → SQLite + Vector DB → Search/Analytics → Reports/Dashboard
```

### Processing Modes

**Batch Processing:**
- Process all data at once
- Good for: Initial setup, complete reprocessing
- Trade-off: Slow startup, but complete data

**Incremental Processing:**
- Process only new/changed data
- Good for: Regular updates, performance
- Trade-off: Complex state tracking

**Streaming Processing:**
- Process data as it arrives
- Good for: Real-time updates
- Trade-off: Complex implementation

### Storage Strategy

Choose appropriate storage:

**SQLite:**
- Structured metadata
- Fast queries
- Relational data
- Good for: Indexes, aggregations

**Vector Database (ChromaDB):**
- Semantic embeddings
- Similarity search
- Good for: RAG, semantic queries

**File System:**
- Raw data
- Large blobs
- Good for: Backups, archives

## Example: CC Insights

**Input**: Claude Code conversation JSONL files

**Processing Pipeline:**
1. JSONL Parser - Decode base64, extract messages
2. Metadata Extractor - Timestamps, files, tools
3. Embeddings Generator - Vector representations
4. Pattern Detector - Identify trends

**Storage:**
- SQLite: Conversation metadata, fast queries
- ChromaDB: Vector embeddings, semantic search
- Cache: Processed conversation data

**Query Interfaces:**
1. CLI Search - Command-line semantic search
2. Insight Generator - Pattern-based reports
3. Dashboard - Interactive web UI

**Outputs:**
- Search results with similarity scores
- Weekly activity reports
- File heatmaps
- Tool usage analytics

## Data Processing Workflow

### Phase 1: Ingestion

```markdown
1. **Discover Data Sources**
   - Locate input files/APIs
   - Validate accessibility
   - Calculate scope (file count, size)

2. **Initial Validation**
   - Check format validity
   - Verify schema compliance
   - Estimate processing time

3. **State Management**
   - Track what's been processed
   - Support incremental updates
   - Handle failures gracefully
```

### Phase 2: Processing

```markdown
1. **Parse/Transform**
   - Read raw data
   - Apply transformations
   - Handle errors and edge cases

2. **Extract Features**
   - Generate metadata
   - Calculate metrics
   - Create embeddings (if semantic search)

3. **Store Results**
   - Write to database(s)
   - Update indexes
   - Maintain consistency
```

### Phase 3: Analysis

```markdown
1. **Query Interface**
   - Support multiple query types
   - Optimize for common patterns
   - Return ranked results

2. **Pattern Detection**
   - Aggregate data
   - Identify trends
   - Generate insights

3. **Visualization**
   - Format for human consumption
   - Support multiple output formats
   - Interactive when possible
```
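As a minimal sketch of Phase 1 and the incremental processing it relies on, the loop below discovers JSONL sources, skips files whose modification time has not changed, and persists its state after each file so a failed run can resume. The state-file location and the `process_file` helper are illustrative assumptions, not part of any specific skill:

```python
# Minimal incremental ingestion sketch (illustrative; helper names and the
# state-file location are assumptions, not a specific skill's API).
import json
from pathlib import Path

STATE_FILE = Path(".processed_state.json")  # assumed location for state tracking

def load_state() -> dict:
    """Return {path: mtime} for files that have already been processed."""
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state, indent=2))

def process_file(path: Path) -> None:
    """Placeholder for the parse/transform/store pipeline (Phase 2)."""
    for line in path.open():
        record = json.loads(line)
        # ... extract metadata, generate embeddings, write to storage ...

def ingest(input_dir: str) -> None:
    state = load_state()
    for src in sorted(Path(input_dir).glob("*.jsonl")):  # discover data sources
        mtime = src.stat().st_mtime
        if state.get(str(src)) == mtime:
            continue  # unchanged since last run: skip (incremental update)
        try:
            process_file(src)
        except Exception as exc:
            print(f"Failed to process {src}: {exc}")
            continue  # handle failures gracefully, move on to the next file
        state[str(src)] = mtime
        save_state(state)  # persist after each file so a crashed run can resume
```

Persisting state after every file trades a little extra I/O for crash-safe resumption, which is what keeps incremental runs cheap.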
## Performance Characteristics

Document expected performance:

```markdown
### Performance Characteristics

- **Initial indexing**: ~1-2 minutes for 100 records
- **Incremental updates**: <5 seconds for new records
- **Search latency**: <1 second for queries
- **Report generation**: <10 seconds for standard reports
- **Memory usage**: ~200MB for 1000 records
```

## Best Practices

1. **Incremental Processing**: Don't reprocess everything on each run
2. **State Tracking**: Track what's been processed to avoid duplicates
3. **Batch Operations**: Process in batches for memory efficiency
4. **Progress Indicators**: Show progress for long operations
5. **Error Recovery**: Handle failures gracefully; resume where you left off
6. **Data Validation**: Validate inputs before expensive processing
7. **Index Optimization**: Optimize databases for common queries
8. **Memory Management**: Stream large files; don't load everything into memory
9. **Parallel Processing**: Use parallelism where possible
10. **Cache Wisely**: Cache expensive computations

## Scripts Structure

For data processing skills, provide helper scripts:

```
scripts/
├── processor.py    # Main data processing script
├── indexer.py      # Build indexes/embeddings
├── query.py        # Query interface (CLI)
└── generator.py    # Report/insight generation
```

### Script Best Practices

```python
# Good patterns for processing scripts:

# 1. Use click for the CLI
import click

@click.command()
@click.option('--input', help='Input path')
@click.option('--reindex', is_flag=True)
def process(input, reindex):
    """Process data from the input source."""
    pass

# 2. Show progress
from tqdm import tqdm

for item in tqdm(items, desc="Processing"):
    process_item(item)

# 3. Handle errors gracefully
for item in items:
    try:
        result = process_item(item)
    except Exception as e:
        logger.error(f"Failed to process {item}: {e}")
        continue  # continue with the next item

# 4. Support incremental updates
for item in items:
    if not reindex and is_already_processed(item):
        continue
    process_item(item)

# 5. Use batch processing
for batch in chunks(items, batch_size=32):
    process_batch(batch)
```

## Storage Schema

Document your data schema:

```sql
-- Example SQLite schema
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    timestamp INTEGER,
    message_count INTEGER,
    files_modified TEXT,  -- JSON array
    tools_used TEXT       -- JSON array
);

CREATE INDEX idx_timestamp ON conversations(timestamp);
CREATE INDEX idx_files ON conversations(files_modified);
```

## Output Formats

Support multiple output formats:

1. **Markdown**: Human-readable reports
2. **JSON**: Machine-readable for integration
3. **CSV**: Spreadsheet-compatible data
4. **HTML**: Styled reports with charts
5. **Interactive**: Web dashboards (optional)
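As a rough sketch of serving several output formats from one result set, the helpers below render the same rows as Markdown, JSON, and CSV. The `ReportRow` fields are hypothetical, chosen only to mirror the schema above:

```python
# Illustrative multi-format output sketch; ReportRow and its fields are assumptions.
import csv
import io
import json
from dataclasses import dataclass, asdict

@dataclass
class ReportRow:
    conversation_id: str
    message_count: int
    score: float

def to_markdown(rows: list[ReportRow]) -> str:
    """Human-readable Markdown table."""
    lines = ["| id | messages | score |", "| --- | --- | --- |"]
    lines += [f"| {r.conversation_id} | {r.message_count} | {r.score:.2f} |" for r in rows]
    return "\n".join(lines)

def to_json(rows: list[ReportRow]) -> str:
    """Machine-readable JSON for integration."""
    return json.dumps([asdict(r) for r in rows], indent=2)

def to_csv(rows: list[ReportRow]) -> str:
    """Spreadsheet-compatible CSV."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["conversation_id", "message_count", "score"])
    writer.writeheader()
    writer.writerows(asdict(r) for r in rows)
    return buf.getvalue()
```

Generating every format from the same in-memory rows keeps the Markdown report, JSON export, and CSV download consistent with one another.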