# Data Processing Skill Pattern
Use this pattern when your skill processes, analyzes, or transforms data to extract insights.
## When to Use
- Skill ingests data from files or APIs
- Performs analysis or transformation
- Generates insights, reports, or visualizations
- Examples: cc-insights (conversation analysis)
## Structure

### Data Flow Architecture

Define a clear data pipeline:

```
Input Sources → Processing → Storage → Query/Analysis → Output
```

Example:

```
JSONL files → Parser → SQLite + Vector DB → Search/Analytics → Reports/Dashboard
```
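
As a rough sketch of how these stages can line up in code (assuming JSONL inputs whose records carry `id` and `timestamp` fields; every function name here is illustrative, not part of an existing skill):

```python
import json
from pathlib import Path

def parse_jsonl(path: Path) -> list[dict]:
    """Processing: decode one JSONL file into a list of record dicts."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

def run_pipeline(input_dir: Path) -> list[dict]:
    """Input Sources -> Processing -> Storage -> Query/Analysis -> Output."""
    records = []
    for path in sorted(input_dir.glob("*.jsonl")):        # Input Sources
        records.extend(parse_jsonl(path))                 # Processing
    store = {r.get("id"): r for r in records}             # Storage (in-memory stand-in)
    recent = sorted(store.values(),                       # Query/Analysis
                    key=lambda r: r.get("timestamp", 0),
                    reverse=True)[:10]
    return recent                                         # Output (feed into a report)
```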
### Processing Modes

**Batch Processing:**
- Process all data at once
- Good for: Initial setup, complete reprocessing
- Trade-off: Slow startup, but complete data

**Incremental Processing** (sketched below):
- Process only new/changed data
- Good for: Regular updates, performance
- Trade-off: Complex state tracking

**Streaming Processing:**
- Process data as it arrives
- Good for: Real-time updates
- Trade-off: Complex implementation
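
A minimal sketch of the incremental mode, assuming file modification times are a good enough change signal; the state-file location and helper names are made up for illustration:

```python
import json
from pathlib import Path

STATE_FILE = Path(".processed_state.json")  # hypothetical state location

def load_state() -> dict:
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state))

def incremental_run(input_dir: Path, process_file) -> None:
    """Process only files that are new or changed since the last run."""
    state = load_state()
    for path in sorted(input_dir.glob("*.jsonl")):
        mtime = path.stat().st_mtime
        if state.get(str(path)) == mtime:
            continue                      # unchanged since last run -> skip
        process_file(path)                # new or changed -> (re)process
        state[str(path)] = mtime          # record the new state
    save_state(state)
```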
### Storage Strategy

Choose appropriate storage (a combined sketch follows this list):

**SQLite:**
- Structured metadata
- Fast queries
- Relational data
- Good for: Indexes, aggregations

**Vector Database (ChromaDB):**
- Semantic embeddings
- Similarity search
- Good for: RAG, semantic queries

**File System:**
- Raw data
- Large blobs
- Good for: Backups, archives
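
The stores are often combined. A sketch of one hybrid write path, assuming a record dict with `id`, `timestamp`, and `text` fields, the `chromadb` package installed, and ChromaDB's default embedding function:

```python
import sqlite3
import chromadb  # assumes the chromadb package is available

def store_record(db_path: str, chroma_path: str, record: dict) -> None:
    """Hypothetical hybrid store: metadata in SQLite, text in a vector collection."""
    # Structured metadata -> SQLite (fast filtering and aggregation)
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, timestamp INTEGER)"
    )
    con.execute(
        "INSERT OR REPLACE INTO records VALUES (?, ?)",
        (record["id"], record["timestamp"]),
    )
    con.commit()
    con.close()

    # Raw text -> ChromaDB (embedded for similarity search)
    client = chromadb.PersistentClient(path=chroma_path)
    collection = client.get_or_create_collection(name="records")
    collection.add(
        ids=[record["id"]],
        documents=[record["text"]],
        metadatas=[{"timestamp": record["timestamp"]}],
    )
```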
## Example: CC Insights

**Input:** Claude Code conversation JSONL files

**Processing Pipeline:**
- **JSONL Parser** - Decode base64, extract messages
- **Metadata Extractor** - Timestamps, files, tools
- **Embeddings Generator** - Vector representations
- **Pattern Detector** - Identify trends

**Storage:**
- SQLite: Conversation metadata, fast queries
- ChromaDB: Vector embeddings, semantic search
- Cache: Processed conversation data

**Query Interfaces:**
- **CLI Search** - Command-line semantic search
- **Insight Generator** - Pattern-based reports
- **Dashboard** - Interactive web UI

**Outputs:**
- Search results with similarity scores
- Weekly activity reports
- File heatmaps
- Tool usage analytics
## Data Processing Workflow

### Phase 1: Ingestion
1. **Discover Data Sources**
- Locate input files/APIs
- Validate accessibility
- Calculate scope (file count, size)
2. **Initial Validation**
- Check format validity
- Verify schema compliance
- Estimate processing time
3. **State Management**
- Track what's been processed
- Support incremental updates
- Handle failures gracefully
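
A minimal sketch of this ingestion phase for JSONL sources (the directory layout and return shape are assumptions for illustration):

```python
import json
from pathlib import Path

def discover_and_validate(input_dir: Path) -> dict:
    """Phase 1 sketch: find JSONL sources, spot-check them, and report scope."""
    sources, invalid, total_bytes = [], [], 0
    for path in sorted(input_dir.glob("*.jsonl")):        # discover data sources
        total_bytes += path.stat().st_size
        try:
            with path.open() as f:                        # initial validation:
                first = f.readline()                      # parse the first line only
                if first.strip():
                    json.loads(first)
            sources.append(path)
        except (OSError, json.JSONDecodeError):
            invalid.append(path)                          # keep failures for reporting
    return {
        "sources": sources,
        "invalid": invalid,
        "file_count": len(sources),
        "total_mib": round(total_bytes / 2**20, 1),       # scope estimate
    }
```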
### Phase 2: Processing
1. **Parse/Transform**
- Read raw data
- Apply transformations
- Handle errors and edge cases
2. **Extract Features**
- Generate metadata
- Calculate metrics
- Create embeddings (if semantic search)
3. **Store Results**
- Write to database(s)
- Update indexes
- Maintain consistency
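
One way to keep step 3 consistent is to write each batch inside a single SQLite transaction; the `features` table and its columns here are hypothetical:

```python
import sqlite3

def store_batch(db_path: str, rows: list[tuple]) -> None:
    """Phase 2 sketch: write one batch of extracted features atomically."""
    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on error -> store stays consistent
            con.execute(
                "CREATE TABLE IF NOT EXISTS features "
                "(id TEXT PRIMARY KEY, message_count INTEGER, tool_count INTEGER)"
            )
            con.executemany(
                "INSERT OR REPLACE INTO features VALUES (?, ?, ?)", rows
            )
    finally:
        con.close()
```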
### Phase 3: Analysis
1. **Query Interface**
- Support multiple query types
- Optimize for common patterns
- Return ranked results
2. **Pattern Detection**
- Aggregate data
- Identify trends
- Generate insights
3. **Visualization**
- Format for human consumption
- Support multiple output formats
- Interactive when possible
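
A small pattern-detection sketch against the SQLite schema shown later in this document; weekly bucketing via `strftime` is one of several reasonable choices:

```python
import sqlite3

def weekly_activity(db_path: str) -> list[tuple[str, int]]:
    """Phase 3 sketch: aggregate conversation metadata into a weekly trend."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            # assumes the conversations table with a unix-epoch timestamp column
            "SELECT strftime('%Y-%W', timestamp, 'unixepoch') AS week, "
            "       COUNT(*) AS conversations "
            "FROM conversations GROUP BY week ORDER BY week"
        ).fetchall()
    finally:
        con.close()
    return rows  # ordered (week, count) rows ready to format into a report
```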
## Performance Characteristics

Document expected performance:

```markdown
### Performance Characteristics
- **Initial indexing**: ~1-2 minutes for 100 records
- **Incremental updates**: <5 seconds for new records
- **Search latency**: <1 second for queries
- **Report generation**: <10 seconds for standard reports
- **Memory usage**: ~200MB for 1000 records
```
## Best Practices

- **Incremental Processing**: Don't reprocess everything on each run
- **State Tracking**: Track what has been processed to avoid duplicates
- **Batch Operations**: Process in batches for memory efficiency
- **Progress Indicators**: Show progress for long operations
- **Error Recovery**: Handle failures gracefully and resume where processing left off
- **Data Validation**: Validate inputs before expensive processing
- **Index Optimization**: Optimize databases for common queries
- **Memory Management**: Stream large files instead of loading everything into memory (see the sketch after this list)
- **Parallel Processing**: Use parallelism when possible
- **Cache Wisely**: Cache expensive computations
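
A sketch combining the memory-management and batching practices; `process_batch` is assumed to exist elsewhere (see the script patterns below):

```python
import json
from pathlib import Path
from typing import Iterable, Iterator

def stream_records(path: Path) -> Iterator[dict]:
    """Yield records one line at a time instead of loading the whole file."""
    with path.open() as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def chunks(items: Iterable, batch_size: int = 32) -> Iterator[list]:
    """Group any iterable into fixed-size batches for memory-bounded processing."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Usage: for batch in chunks(stream_records(Path("data.jsonl"))): process_batch(batch)
```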
## Scripts Structure

For data processing skills, provide helper scripts:

```
scripts/
├── processor.py   # Main data processing script
├── indexer.py     # Build indexes/embeddings
├── query.py       # Query interface (CLI)
└── generator.py   # Report/insight generation
```
### Script Best Practices

```python
# Good patterns for processing scripts:

# 1. Use click for the CLI
import click

@click.command()
@click.option('--input', 'input_path', help='Input path')
@click.option('--reindex', is_flag=True)
def process(input_path, reindex):
    """Process data from the input source."""
    pass

# 2. Show progress for long operations
from tqdm import tqdm

for item in tqdm(items, desc="Processing"):
    process_item(item)

# 3. Handle errors gracefully (inside the processing loop)
try:
    result = process_item(item)
except Exception as e:
    logger.error(f"Failed to process {item}: {e}")
    continue  # Continue with the next item

# 4. Support incremental updates (inside the processing loop)
if not reindex and is_already_processed(item):
    continue

# 5. Use batch processing
for batch in chunks(items, batch_size=32):
    process_batch(batch)
```
## Storage Schema

Document your data schema:

```sql
-- Example SQLite schema
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    timestamp INTEGER,
    message_count INTEGER,
    files_modified TEXT,  -- JSON array
    tools_used TEXT       -- JSON array
);

CREATE INDEX idx_timestamp ON conversations(timestamp);
CREATE INDEX idx_files ON conversations(files_modified);
```
## Output Formats

Support multiple output formats (a small rendering sketch follows this list):

- **Markdown**: Human-readable reports
- **JSON**: Machine-readable for integration
- **CSV**: Spreadsheet-compatible data
- **HTML**: Styled reports with charts
- **Interactive**: Web dashboards (optional)
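
A minimal rendering helper for the first three formats, assuming results arrive as a list of flat dicts (all names here are illustrative):

```python
import csv
import io
import json

def render(rows: list[dict], fmt: str = "markdown") -> str:
    """Render the same result rows as Markdown, JSON, or CSV."""
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(rows[0]) if rows else [])
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    # default: a simple Markdown table
    if not rows:
        return "_No results._"
    headers = list(rows[0])
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(str(r.get(h, "")) for h in headers) + " |" for r in rows]
    return "\n".join(lines)
```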