# Data Processing Skill Pattern

Use this pattern when your skill **processes, analyzes, or transforms** data to extract insights.

## When to Use

- Skill ingests data from files or APIs
- Performs analysis or transformation
- Generates insights, reports, or visualizations
- Examples: cc-insights (conversation analysis)

## Structure

### Data Flow Architecture

Define a clear data pipeline:

```
Input Sources → Processing → Storage → Query/Analysis → Output
```

Example:

```
JSONL files → Parser → SQLite + Vector DB → Search/Analytics → Reports/Dashboard
```
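
As a minimal sketch of this flow in Python (the stage functions and the `data/` path are illustrative, not prescribed by the pattern):

```python
# Minimal pipeline sketch: each stage is a plain function, composed in order.
import json
from pathlib import Path

def ingest(source_dir):
    """Input sources: yield raw records from JSONL files."""
    for path in Path(source_dir).glob("*.jsonl"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

def transform(records):
    """Processing: keep only the fields downstream stages need."""
    for rec in records:
        yield {"id": rec.get("id"), "text": rec.get("text", "")}

def store(records, out_path):
    """Storage: persist processed records (here, back to JSONL)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

# Query/Analysis and Output stages would read from storage.
store(transform(ingest("data/")), "processed.jsonl")
```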

### Processing Modes

**Batch Processing:**
- Process all data at once
- Good for: Initial setup, complete reprocessing
- Trade-off: Slow startup, but complete data

**Incremental Processing:**
- Process only new/changed data
- Good for: Regular updates, performance
- Trade-off: Complex state tracking

**Streaming Processing:**
- Process data as it arrives
- Good for: Real-time updates
- Trade-off: Complex implementation
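
A minimal sketch of incremental processing, assuming a hypothetical manifest file that records which inputs have already been handled:

```python
# Incremental processing sketch: skip inputs whose mtime hasn't changed.
import json
from pathlib import Path

MANIFEST = Path("state/processed.json")  # hypothetical state file

def load_manifest():
    return json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}

def save_manifest(manifest):
    MANIFEST.parent.mkdir(exist_ok=True)
    MANIFEST.write_text(json.dumps(manifest))

def incremental_process(source_dir, process_file):
    manifest = load_manifest()
    for path in sorted(Path(source_dir).glob("*.jsonl")):
        mtime = path.stat().st_mtime
        if manifest.get(str(path)) == mtime:
            continue  # unchanged since last run
        process_file(path)           # caller-supplied work
        manifest[str(path)] = mtime  # record only after success
        save_manifest(manifest)      # persist per file for crash recovery
```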

### Storage Strategy

Choose appropriate storage:

**SQLite:**
- Structured metadata
- Fast queries
- Relational data
- Good for: Indexes, aggregations

**Vector Database (ChromaDB):**
- Semantic embeddings
- Similarity search
- Good for: RAG, semantic queries

**File System:**
- Raw data
- Large blobs
- Good for: Backups, archives
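
A sketch of the hybrid SQLite-plus-vector-store approach, assuming the chromadb Python client (0.4+) alongside the standard-library sqlite3 module:

```python
# Hybrid storage sketch: metadata in SQLite, embeddings in ChromaDB.
import sqlite3
import chromadb  # assumed dependency

sql = sqlite3.connect("insights.db")
sql.execute(
    "CREATE TABLE IF NOT EXISTS conversations (id TEXT PRIMARY KEY, timestamp INTEGER)"
)

chroma = chromadb.PersistentClient(path="./chroma")
collection = chroma.get_or_create_collection(name="conversations")

def store(conv_id, timestamp, text):
    # Structured metadata goes to SQLite for fast relational queries...
    sql.execute("INSERT OR REPLACE INTO conversations VALUES (?, ?)", (conv_id, timestamp))
    sql.commit()
    # ...while the document goes to the vector store for semantic search.
    collection.upsert(ids=[conv_id], documents=[text])

def semantic_search(query, n=5):
    return collection.query(query_texts=[query], n_results=n)
```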

## Example: CC Insights

**Input**: Claude Code conversation JSONL files

**Processing Pipeline:**
1. JSONL Parser - Decode base64, extract messages
2. Metadata Extractor - Timestamps, files, tools
3. Embeddings Generator - Vector representations
4. Pattern Detector - Identify trends

**Storage:**
- SQLite: Conversation metadata, fast queries
- ChromaDB: Vector embeddings, semantic search
- Cache: Processed conversation data

**Query Interfaces:**
1. CLI Search - Command-line semantic search
2. Insight Generator - Pattern-based reports
3. Dashboard - Interactive web UI

**Outputs:**
- Search results with similarity scores
- Weekly activity reports
- File heatmaps
- Tool usage analytics
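
A sketch of pipeline step 1, with hypothetical field names (`content`, `encoding`, `role`) standing in for the actual JSONL layout:

```python
# JSONL parser sketch: decode base64 payloads and extract messages.
import base64
import json

def parse_conversation(path):
    messages = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            payload = entry.get("content", "")      # hypothetical field name
            if entry.get("encoding") == "base64":   # hypothetical flag
                payload = base64.b64decode(payload).decode("utf-8")
            messages.append({"role": entry.get("role"), "text": payload})
    return messages
```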

## Data Processing Workflow

### Phase 1: Ingestion
```markdown
1. **Discover Data Sources**
   - Locate input files/APIs
   - Validate accessibility
   - Calculate scope (file count, size)

2. **Initial Validation**
   - Check format validity
   - Verify schema compliance
   - Estimate processing time

3. **State Management**
   - Track what's been processed
   - Support incremental updates
   - Handle failures gracefully
```
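
A sketch of the discovery and validation steps; the cheap first-lines check stands in for whatever schema validation your skill actually needs:

```python
# Ingestion sketch: discover sources, validate format, report scope.
import json
from pathlib import Path

def discover(source_dir):
    """Locate input files and calculate scope (count, total size)."""
    files = sorted(Path(source_dir).glob("*.jsonl"))
    total_bytes = sum(f.stat().st_size for f in files)
    print(f"Found {len(files)} files, {total_bytes / 1e6:.1f} MB")
    return files

def validate(path, max_lines=10):
    """Cheap format check: the first few lines must parse as JSON."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
            try:
                json.loads(line)
            except json.JSONDecodeError:
                return False
    return True

files = [f for f in discover("data/") if validate(f)]
```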

### Phase 2: Processing
```markdown
1. **Parse/Transform**
   - Read raw data
   - Apply transformations
   - Handle errors and edge cases

2. **Extract Features**
   - Generate metadata
   - Calculate metrics
   - Create embeddings (if semantic search is needed)

3. **Store Results**
   - Write to database(s)
   - Update indexes
   - Maintain consistency
```
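
One way to implement the feature-extraction step, assuming the sentence-transformers library for embeddings (any embedding backend works here):

```python
# Feature extraction sketch: metadata plus embeddings per record.
from sentence_transformers import SentenceTransformer  # assumed dependency

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model

def extract_features(records):
    texts = [r["text"] for r in records]
    embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)
    for rec, emb in zip(records, embeddings):
        yield {
            "id": rec["id"],
            "length": len(rec["text"]),  # simple metric
            "embedding": emb.tolist(),   # vector representation
        }
```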

### Phase 3: Analysis
```markdown
1. **Query Interface**
   - Support multiple query types
   - Optimize for common patterns
   - Return ranked results

2. **Pattern Detection**
   - Aggregate data
   - Identify trends
   - Generate insights

3. **Visualization**
   - Format for human consumption
   - Support multiple output formats
   - Interactive when possible
```
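
A sketch of pattern detection as plain SQL aggregation, using the example schema from the Storage Schema section below and assuming `timestamp` holds Unix-epoch seconds:

```python
# Weekly activity report sketch: aggregate conversations per week.
import sqlite3

def weekly_report(db_path):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT strftime('%Y-%W', timestamp, 'unixepoch') AS week,
               COUNT(*)           AS conversations,
               SUM(message_count) AS messages
        FROM conversations
        GROUP BY week
        ORDER BY week
        """
    ).fetchall()
    conn.close()
    for week, convs, msgs in rows:
        print(f"{week}: {convs} conversations, {msgs} messages")
```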

## Performance Characteristics

Document expected performance:

```markdown
### Performance Characteristics

- **Initial indexing**: ~1-2 minutes for 100 records
- **Incremental updates**: <5 seconds for new records
- **Search latency**: <1 second for queries
- **Report generation**: <10 seconds for standard reports
- **Memory usage**: ~200MB for 1000 records
```

## Best Practices

1. **Incremental Processing**: Don't reprocess everything on each run
2. **State Tracking**: Track what's been processed to avoid duplicates
3. **Batch Operations**: Process in batches for memory efficiency (see the sketch after this list)
4. **Progress Indicators**: Show progress for long operations
5. **Error Recovery**: Handle failures gracefully; resume where you left off
6. **Data Validation**: Validate inputs before expensive processing
7. **Index Optimization**: Optimize databases for common queries
8. **Memory Management**: Stream large files; don't load everything into memory
9. **Parallel Processing**: Use parallelism when possible
10. **Cache Wisely**: Cache expensive computations
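
A sketch of practices 3 and 8 together: stream a large JSONL file lazily and hand records to the processor in fixed-size batches, so memory stays constant regardless of file size (`process_batch` is an assumed helper):

```python
# Streaming + batching sketch: constant memory regardless of file size.
import itertools
import json

def stream_records(path):
    """Yield one record at a time; never loads the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def batched(iterable, size):
    """Yield lists of up to `size` items from any iterator."""
    it = iter(iterable)
    while batch := list(itertools.islice(it, size)):
        yield batch

for batch in batched(stream_records("big.jsonl"), 32):
    process_batch(batch)  # assumed helper from your processor script
```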

## Scripts Structure

For data processing skills, provide helper scripts:

```
scripts/
├── processor.py    # Main data processing script
├── indexer.py      # Build indexes/embeddings
├── query.py        # Query interface (CLI)
└── generator.py    # Report/insight generation
```

### Script Best Practices

```python
# Good patterns for processing scripts, combined into one sketch.
# (discover_items, is_already_processed, process_item, and process_batch
# are assumed helpers from your skill.)

import logging

import click
from tqdm import tqdm

logger = logging.getLogger(__name__)


def chunks(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


# 1. Use click for the CLI
@click.command()
@click.option('--input', 'input_path', help='Input path')
@click.option('--reindex', is_flag=True, help='Reprocess everything')
def process(input_path, reindex):
    """Process data from the input source."""
    items = discover_items(input_path)

    # 2. Show progress for long-running loops
    for item in tqdm(items, desc="Processing"):
        # 4. Support incremental updates: skip already-processed items
        if not reindex and is_already_processed(item):
            continue

        # 3. Handle errors gracefully; keep going on failure
        try:
            process_item(item)
        except Exception as e:
            logger.error(f"Failed to process {item}: {e}")
            continue  # continue with the next item


# 5. Use batch processing for memory efficiency
def process_all(items):
    for batch in chunks(items, batch_size=32):
        process_batch(batch)
```

## Storage Schema

Document your data schema:

```sql
-- Example SQLite schema
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    timestamp INTEGER,
    message_count INTEGER,
    files_modified TEXT,  -- JSON array
    tools_used TEXT       -- JSON array
);

CREATE INDEX idx_timestamp ON conversations(timestamp);
CREATE INDEX idx_files ON conversations(files_modified);
```
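
Since `files_modified` stores a JSON array, a plain index only helps exact matches; with a SQLite build that includes the JSON1 functions (standard in modern builds) you can query array members directly:

```python
# Query sketch: find conversations that touched a given file.
import sqlite3

def conversations_touching(db_path, filename):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT c.id, c.timestamp
        FROM conversations AS c, json_each(c.files_modified) AS f
        WHERE f.value = ?
        ORDER BY c.timestamp DESC
        """,
        (filename,),
    ).fetchall()
    conn.close()
    return rows
```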

## Output Formats

Support multiple output formats:

1. **Markdown**: Human-readable reports
2. **JSON**: Machine-readable for integration
3. **CSV**: Spreadsheet-compatible data
4. **HTML**: Styled reports with charts
5. **Interactive**: Web dashboards (optional)
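
A sketch of a format-agnostic report writer covering the first three formats; the report structure here is illustrative:

```python
# Output formats sketch: one report dict, several renderers.
import csv
import io
import json

def to_markdown(report):
    lines = [f"# {report['title']}", ""]
    lines += [f"- **{k}**: {v}" for k, v in report["metrics"].items()]
    return "\n".join(lines)

def to_json(report):
    return json.dumps(report, indent=2)

def to_csv(report):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["metric", "value"])
    writer.writerows(report["metrics"].items())
    return buf.getvalue()

report = {"title": "Weekly Activity", "metrics": {"conversations": 42, "messages": 1337}}
print(to_markdown(report))
```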