# Data Processing Skill Pattern

Use this pattern when your skill **processes, analyzes, or transforms** data to extract insights.

## When to Use

- Skill ingests data from files or APIs
- Performs analysis or transformation
- Generates insights, reports, or visualizations
- Examples: cc-insights (conversation analysis)

## Structure

### Data Flow Architecture

Define a clear data pipeline:

```
Input Sources → Processing → Storage → Query/Analysis → Output
```

Example:

```
JSONL files → Parser → SQLite + Vector DB → Search/Analytics → Reports/Dashboard
```
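
As a minimal sketch of this flow in Python (the stage functions and the `data/` path are illustrative, not prescribed by the pattern):

```python
# Minimal pipeline sketch: each stage is a plain function, composed in order.
import json
from pathlib import Path

def ingest(source_dir):
    """Input sources: yield raw records from JSONL files."""
    for path in Path(source_dir).glob("*.jsonl"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield json.loads(line)

def transform(records):
    """Processing: keep only the fields downstream stages need."""
    for rec in records:
        yield {"id": rec.get("id"), "text": rec.get("text", "")}

def store(records, out_path):
    """Storage: persist processed records (here, back to JSONL)."""
    with open(out_path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

# Query/Analysis and Output stages would read from storage.
store(transform(ingest("data/")), "processed.jsonl")
```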

### Processing Modes

**Batch Processing:**
- Process all data at once
- Good for: Initial setup, complete reprocessing
- Trade-off: Slow startup, but complete data

**Incremental Processing:**
- Process only new/changed data
- Good for: Regular updates, performance
- Trade-off: Complex state tracking

**Streaming Processing:**
- Process data as it arrives
- Good for: Real-time updates
- Trade-off: Complex implementation
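
A minimal sketch of incremental processing, assuming a hypothetical manifest file that records which inputs have already been handled:

```python
# Incremental processing sketch: skip inputs whose mtime hasn't changed.
import json
from pathlib import Path

MANIFEST = Path("state/processed.json")  # hypothetical state file

def load_manifest():
    return json.loads(MANIFEST.read_text()) if MANIFEST.exists() else {}

def save_manifest(manifest):
    MANIFEST.parent.mkdir(exist_ok=True)
    MANIFEST.write_text(json.dumps(manifest))

def incremental_process(source_dir, process_file):
    manifest = load_manifest()
    for path in sorted(Path(source_dir).glob("*.jsonl")):
        mtime = path.stat().st_mtime
        if manifest.get(str(path)) == mtime:
            continue  # unchanged since last run
        process_file(path)           # caller-supplied work
        manifest[str(path)] = mtime  # record only after success
        save_manifest(manifest)      # persist per file for crash recovery
```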

### Storage Strategy

Choose appropriate storage:

**SQLite:**
- Structured metadata
- Fast queries
- Relational data
- Good for: Indexes, aggregations

**Vector Database (ChromaDB):**
- Semantic embeddings
- Similarity search
- Good for: RAG, semantic queries

**File System:**
- Raw data
- Large blobs
- Good for: Backups, archives
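
A sketch of the hybrid SQLite-plus-vector-store approach, assuming the chromadb Python client (0.4+) alongside the standard-library sqlite3 module:

```python
# Hybrid storage sketch: metadata in SQLite, embeddings in ChromaDB.
import sqlite3
import chromadb  # assumed dependency

sql = sqlite3.connect("insights.db")
sql.execute(
    "CREATE TABLE IF NOT EXISTS conversations (id TEXT PRIMARY KEY, timestamp INTEGER)"
)

chroma = chromadb.PersistentClient(path="./chroma")
collection = chroma.get_or_create_collection(name="conversations")

def store(conv_id, timestamp, text):
    # Structured metadata goes to SQLite for fast relational queries...
    sql.execute("INSERT OR REPLACE INTO conversations VALUES (?, ?)", (conv_id, timestamp))
    sql.commit()
    # ...while the document goes to the vector store for semantic search.
    collection.upsert(ids=[conv_id], documents=[text])

def semantic_search(query, n=5):
    return collection.query(query_texts=[query], n_results=n)
```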

## Example: CC Insights

**Input**: Claude Code conversation JSONL files

**Processing Pipeline:**
1. JSONL Parser - Decode base64, extract messages
2. Metadata Extractor - Timestamps, files, tools
3. Embeddings Generator - Vector representations
4. Pattern Detector - Identify trends

**Storage:**
- SQLite: Conversation metadata, fast queries
- ChromaDB: Vector embeddings, semantic search
- Cache: Processed conversation data

**Query Interfaces:**
1. CLI Search - Command-line semantic search
2. Insight Generator - Pattern-based reports
3. Dashboard - Interactive web UI

**Outputs:**
- Search results with similarity scores
- Weekly activity reports
- File heatmaps
- Tool usage analytics
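
A sketch of pipeline step 1, with hypothetical field names (`content`, `encoding`, `role`) standing in for the actual JSONL layout:

```python
# JSONL parser sketch: decode base64 payloads and extract messages.
import base64
import json

def parse_conversation(path):
    messages = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            payload = entry.get("content", "")      # hypothetical field name
            if entry.get("encoding") == "base64":   # hypothetical flag
                payload = base64.b64decode(payload).decode("utf-8")
            messages.append({"role": entry.get("role"), "text": payload})
    return messages
```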

## Data Processing Workflow

### Phase 1: Ingestion
```markdown
1. **Discover Data Sources**
   - Locate input files/APIs
   - Validate accessibility
   - Calculate scope (file count, size)

2. **Initial Validation**
   - Check format validity
   - Verify schema compliance
   - Estimate processing time

3. **State Management**
   - Track what's been processed
   - Support incremental updates
   - Handle failures gracefully
```
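
A sketch of the discovery and validation steps; the cheap first-lines check stands in for whatever schema validation your skill actually needs:

```python
# Ingestion sketch: discover sources, validate format, report scope.
import json
from pathlib import Path

def discover(source_dir):
    """Locate input files and calculate scope (count, total size)."""
    files = sorted(Path(source_dir).glob("*.jsonl"))
    total_bytes = sum(f.stat().st_size for f in files)
    print(f"Found {len(files)} files, {total_bytes / 1e6:.1f} MB")
    return files

def validate(path, max_lines=10):
    """Cheap format check: the first few lines must parse as JSON."""
    with open(path, encoding="utf-8") as f:
        for i, line in enumerate(f):
            if i >= max_lines:
                break
            try:
                json.loads(line)
            except json.JSONDecodeError:
                return False
    return True

files = [f for f in discover("data/") if validate(f)]
```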

### Phase 2: Processing
```markdown
1. **Parse/Transform**
   - Read raw data
   - Apply transformations
   - Handle errors and edge cases

2. **Extract Features**
   - Generate metadata
   - Calculate metrics
   - Create embeddings (if semantic search is needed)

3. **Store Results**
   - Write to database(s)
   - Update indexes
   - Maintain consistency
```
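
One way to implement the feature-extraction step, assuming the sentence-transformers library for embeddings (any embedding backend works here):

```python
# Feature extraction sketch: metadata plus embeddings per record.
from sentence_transformers import SentenceTransformer  # assumed dependency

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose model

def extract_features(records):
    texts = [r["text"] for r in records]
    embeddings = model.encode(texts, batch_size=32, show_progress_bar=True)
    for rec, emb in zip(records, embeddings):
        yield {
            "id": rec["id"],
            "length": len(rec["text"]),  # simple metric
            "embedding": emb.tolist(),   # vector representation
        }
```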

### Phase 3: Analysis
```markdown
1. **Query Interface**
   - Support multiple query types
   - Optimize for common patterns
   - Return ranked results

2. **Pattern Detection**
   - Aggregate data
   - Identify trends
   - Generate insights

3. **Visualization**
   - Format for human consumption
   - Support multiple output formats
   - Interactive when possible
```
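
A sketch of pattern detection as plain SQL aggregation, using the example schema from the Storage Schema section below and assuming `timestamp` holds Unix-epoch seconds:

```python
# Weekly activity report sketch: aggregate conversations per week.
import sqlite3

def weekly_report(db_path):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT strftime('%Y-%W', timestamp, 'unixepoch') AS week,
               COUNT(*)           AS conversations,
               SUM(message_count) AS messages
        FROM conversations
        GROUP BY week
        ORDER BY week
        """
    ).fetchall()
    conn.close()
    for week, convs, msgs in rows:
        print(f"{week}: {convs} conversations, {msgs} messages")
```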

## Performance Characteristics

Document expected performance:

```markdown
### Performance Characteristics

- **Initial indexing**: ~1-2 minutes for 100 records
- **Incremental updates**: <5 seconds for new records
- **Search latency**: <1 second for queries
- **Report generation**: <10 seconds for standard reports
- **Memory usage**: ~200MB for 1000 records
```

## Best Practices

1. **Incremental Processing**: Don't reprocess everything on each run
2. **State Tracking**: Track what's been processed to avoid duplicates
3. **Batch Operations**: Process in batches for memory efficiency (see the sketch after this list)
4. **Progress Indicators**: Show progress for long operations
5. **Error Recovery**: Handle failures gracefully; resume where you left off
6. **Data Validation**: Validate inputs before expensive processing
7. **Index Optimization**: Optimize databases for common queries
8. **Memory Management**: Stream large files; don't load everything into memory
9. **Parallel Processing**: Use parallelism when possible
10. **Cache Wisely**: Cache expensive computations
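
A sketch of practices 3 and 8 together: stream a large JSONL file lazily and hand records to the processor in fixed-size batches, so memory stays constant regardless of file size (`process_batch` is an assumed helper):

```python
# Streaming + batching sketch: constant memory regardless of file size.
import itertools
import json

def stream_records(path):
    """Yield one record at a time; never loads the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def batched(iterable, size):
    """Yield lists of up to `size` items from any iterator."""
    it = iter(iterable)
    while batch := list(itertools.islice(it, size)):
        yield batch

for batch in batched(stream_records("big.jsonl"), 32):
    process_batch(batch)  # assumed helper from your processor script
```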

## Scripts Structure

For data processing skills, provide helper scripts:

```
scripts/
├── processor.py    # Main data processing script
├── indexer.py      # Build indexes/embeddings
├── query.py        # Query interface (CLI)
└── generator.py    # Report/insight generation
```

### Script Best Practices

```python
# Good patterns for processing scripts, combined into one sketch.
# (discover_items, is_already_processed, process_item, and process_batch
# are assumed helpers from your skill.)

import logging

import click
from tqdm import tqdm

logger = logging.getLogger(__name__)


def chunks(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]


# 1. Use click for the CLI
@click.command()
@click.option('--input', 'input_path', help='Input path')
@click.option('--reindex', is_flag=True, help='Reprocess everything')
def process(input_path, reindex):
    """Process data from the input source."""
    items = discover_items(input_path)

    # 2. Show progress for long-running loops
    for item in tqdm(items, desc="Processing"):
        # 4. Support incremental updates: skip already-processed items
        if not reindex and is_already_processed(item):
            continue

        # 3. Handle errors gracefully; keep going on failure
        try:
            process_item(item)
        except Exception as e:
            logger.error(f"Failed to process {item}: {e}")
            continue  # continue with the next item


# 5. Use batch processing for memory efficiency
def process_all(items):
    for batch in chunks(items, batch_size=32):
        process_batch(batch)
```

## Storage Schema

Document your data schema:

```sql
-- Example SQLite schema
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    timestamp INTEGER,
    message_count INTEGER,
    files_modified TEXT,  -- JSON array
    tools_used TEXT       -- JSON array
);

CREATE INDEX idx_timestamp ON conversations(timestamp);
CREATE INDEX idx_files ON conversations(files_modified);
```
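
Since `files_modified` stores a JSON array, a plain index only helps exact matches; with a SQLite build that includes the JSON1 functions (standard in modern builds) you can query array members directly:

```python
# Query sketch: find conversations that touched a given file.
import sqlite3

def conversations_touching(db_path, filename):
    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        """
        SELECT c.id, c.timestamp
        FROM conversations AS c, json_each(c.files_modified) AS f
        WHERE f.value = ?
        ORDER BY c.timestamp DESC
        """,
        (filename,),
    ).fetchall()
    conn.close()
    return rows
```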

## Output Formats

Support multiple output formats:

1. **Markdown**: Human-readable reports
2. **JSON**: Machine-readable for integration
3. **CSV**: Spreadsheet-compatible data
4. **HTML**: Styled reports with charts
5. **Interactive**: Web dashboards (optional)
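
A sketch of a format-agnostic report writer covering the first three formats; the report structure here is illustrative:

```python
# Output formats sketch: one report dict, several renderers.
import csv
import io
import json

def to_markdown(report):
    lines = [f"# {report['title']}", ""]
    lines += [f"- **{k}**: {v}" for k, v in report["metrics"].items()]
    return "\n".join(lines)

def to_json(report):
    return json.dumps(report, indent=2)

def to_csv(report):
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["metric", "value"])
    writer.writerows(report["metrics"].items())
    return buf.getvalue()

report = {"title": "Weekly Activity", "metrics": {"conversations": 42, "messages": 1337}}
print(to_markdown(report))
```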