# Data Processing Skill Pattern
Use this pattern when your skill processes, analyzes, or transforms data to extract insights.
## When to Use
- Skill ingests data from files or APIs
- Performs analysis or transformation
- Generates insights, reports, or visualizations
- Examples: cc-insights (conversation analysis)
## Structure

### Data Flow Architecture

Define a clear data pipeline:

```
Input Sources → Processing → Storage → Query/Analysis → Output
```

Example:

```
JSONL files → Parser → SQLite + Vector DB → Search/Analytics → Reports/Dashboard
```
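
As a rough sketch of how these stages can line up in code (assuming JSONL inputs whose records carry `id` and `timestamp` fields; every function name here is illustrative, not part of an existing skill):

```python
import json
from pathlib import Path

def parse_jsonl(path: Path) -> list[dict]:
    """Processing: decode one JSONL file into a list of record dicts."""
    with path.open() as f:
        return [json.loads(line) for line in f if line.strip()]

def run_pipeline(input_dir: Path) -> list[dict]:
    """Input Sources -> Processing -> Storage -> Query/Analysis -> Output."""
    records = []
    for path in sorted(input_dir.glob("*.jsonl")):        # Input Sources
        records.extend(parse_jsonl(path))                 # Processing
    store = {r.get("id"): r for r in records}             # Storage (in-memory stand-in)
    recent = sorted(store.values(),                       # Query/Analysis
                    key=lambda r: r.get("timestamp", 0),
                    reverse=True)[:10]
    return recent                                         # Output (feed into a report)
```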
### Processing Modes

**Batch Processing:**
- Process all data at once
- Good for: Initial setup, complete reprocessing
- Trade-off: Slow startup, but complete data

**Incremental Processing** (sketched below):
- Process only new/changed data
- Good for: Regular updates, performance
- Trade-off: Complex state tracking

**Streaming Processing:**
- Process data as it arrives
- Good for: Real-time updates
- Trade-off: Complex implementation
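
A minimal sketch of the incremental mode, assuming file modification times are a good enough change signal; the state-file location and helper names are made up for illustration:

```python
import json
from pathlib import Path

STATE_FILE = Path(".processed_state.json")  # hypothetical state location

def load_state() -> dict:
    return json.loads(STATE_FILE.read_text()) if STATE_FILE.exists() else {}

def save_state(state: dict) -> None:
    STATE_FILE.write_text(json.dumps(state))

def incremental_run(input_dir: Path, process_file) -> None:
    """Process only files that are new or changed since the last run."""
    state = load_state()
    for path in sorted(input_dir.glob("*.jsonl")):
        mtime = path.stat().st_mtime
        if state.get(str(path)) == mtime:
            continue                      # unchanged since last run -> skip
        process_file(path)                # new or changed -> (re)process
        state[str(path)] = mtime          # record the new state
    save_state(state)
```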
### Storage Strategy

Choose appropriate storage (a combined sketch follows this list):

**SQLite:**
- Structured metadata
- Fast queries
- Relational data
- Good for: Indexes, aggregations

**Vector Database (ChromaDB):**
- Semantic embeddings
- Similarity search
- Good for: RAG, semantic queries

**File System:**
- Raw data
- Large blobs
- Good for: Backups, archives
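
The stores are often combined. A sketch of one hybrid write path, assuming a record dict with `id`, `timestamp`, and `text` fields, the `chromadb` package installed, and ChromaDB's default embedding function:

```python
import sqlite3
import chromadb  # assumes the chromadb package is available

def store_record(db_path: str, chroma_path: str, record: dict) -> None:
    """Hypothetical hybrid store: metadata in SQLite, text in a vector collection."""
    # Structured metadata -> SQLite (fast filtering and aggregation)
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS records (id TEXT PRIMARY KEY, timestamp INTEGER)"
    )
    con.execute(
        "INSERT OR REPLACE INTO records VALUES (?, ?)",
        (record["id"], record["timestamp"]),
    )
    con.commit()
    con.close()

    # Raw text -> ChromaDB (embedded for similarity search)
    client = chromadb.PersistentClient(path=chroma_path)
    collection = client.get_or_create_collection(name="records")
    collection.add(
        ids=[record["id"]],
        documents=[record["text"]],
        metadatas=[{"timestamp": record["timestamp"]}],
    )
```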
## Example: CC Insights

**Input:** Claude Code conversation JSONL files

**Processing Pipeline:**
- **JSONL Parser** - Decode base64, extract messages
- **Metadata Extractor** - Timestamps, files, tools
- **Embeddings Generator** - Vector representations
- **Pattern Detector** - Identify trends

**Storage:**
- SQLite: Conversation metadata, fast queries
- ChromaDB: Vector embeddings, semantic search
- Cache: Processed conversation data

**Query Interfaces:**
- **CLI Search** - Command-line semantic search
- **Insight Generator** - Pattern-based reports
- **Dashboard** - Interactive web UI

**Outputs:**
- Search results with similarity scores
- Weekly activity reports
- File heatmaps
- Tool usage analytics
## Data Processing Workflow

### Phase 1: Ingestion
1. **Discover Data Sources**
- Locate input files/APIs
- Validate accessibility
- Calculate scope (file count, size)
2. **Initial Validation**
- Check format validity
- Verify schema compliance
- Estimate processing time
3. **State Management**
- Track what's been processed
- Support incremental updates
- Handle failures gracefully
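
A minimal sketch of this ingestion phase for JSONL sources (the directory layout and return shape are assumptions for illustration):

```python
import json
from pathlib import Path

def discover_and_validate(input_dir: Path) -> dict:
    """Phase 1 sketch: find JSONL sources, spot-check them, and report scope."""
    sources, invalid, total_bytes = [], [], 0
    for path in sorted(input_dir.glob("*.jsonl")):        # discover data sources
        total_bytes += path.stat().st_size
        try:
            with path.open() as f:                        # initial validation:
                first = f.readline()                      # parse the first line only
                if first.strip():
                    json.loads(first)
            sources.append(path)
        except (OSError, json.JSONDecodeError):
            invalid.append(path)                          # keep failures for reporting
    return {
        "sources": sources,
        "invalid": invalid,
        "file_count": len(sources),
        "total_mib": round(total_bytes / 2**20, 1),       # scope estimate
    }
```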
### Phase 2: Processing
1. **Parse/Transform**
- Read raw data
- Apply transformations
- Handle errors and edge cases
2. **Extract Features**
- Generate metadata
- Calculate metrics
- Create embeddings (if semantic search)
3. **Store Results**
- Write to database(s)
- Update indexes
- Maintain consistency
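
One way to keep step 3 consistent is to write each batch inside a single SQLite transaction; the `features` table and its columns here are hypothetical:

```python
import sqlite3

def store_batch(db_path: str, rows: list[tuple]) -> None:
    """Phase 2 sketch: write one batch of extracted features atomically."""
    con = sqlite3.connect(db_path)
    try:
        with con:  # commits on success, rolls back on error -> store stays consistent
            con.execute(
                "CREATE TABLE IF NOT EXISTS features "
                "(id TEXT PRIMARY KEY, message_count INTEGER, tool_count INTEGER)"
            )
            con.executemany(
                "INSERT OR REPLACE INTO features VALUES (?, ?, ?)", rows
            )
    finally:
        con.close()
```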
### Phase 3: Analysis
1. **Query Interface**
- Support multiple query types
- Optimize for common patterns
- Return ranked results
2. **Pattern Detection**
- Aggregate data
- Identify trends
- Generate insights
3. **Visualization**
- Format for human consumption
- Support multiple output formats
- Interactive when possible
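
A small pattern-detection sketch against the SQLite schema shown later in this document; weekly bucketing via `strftime` is one of several reasonable choices:

```python
import sqlite3

def weekly_activity(db_path: str) -> list[tuple[str, int]]:
    """Phase 3 sketch: aggregate conversation metadata into a weekly trend."""
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            # assumes the conversations table with a unix-epoch timestamp column
            "SELECT strftime('%Y-%W', timestamp, 'unixepoch') AS week, "
            "       COUNT(*) AS conversations "
            "FROM conversations GROUP BY week ORDER BY week"
        ).fetchall()
    finally:
        con.close()
    return rows  # ordered (week, count) rows ready to format into a report
```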
## Performance Characteristics

Document expected performance:

```markdown
### Performance Characteristics
- **Initial indexing**: ~1-2 minutes for 100 records
- **Incremental updates**: <5 seconds for new records
- **Search latency**: <1 second for queries
- **Report generation**: <10 seconds for standard reports
- **Memory usage**: ~200MB for 1000 records
```
## Best Practices

- **Incremental Processing**: Don't reprocess everything on each run
- **State Tracking**: Track what has been processed to avoid duplicates
- **Batch Operations**: Process in batches for memory efficiency
- **Progress Indicators**: Show progress for long operations
- **Error Recovery**: Handle failures gracefully and resume where processing left off
- **Data Validation**: Validate inputs before expensive processing
- **Index Optimization**: Optimize databases for common queries
- **Memory Management**: Stream large files instead of loading everything into memory (see the sketch after this list)
- **Parallel Processing**: Use parallelism when possible
- **Cache Wisely**: Cache expensive computations
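
A sketch combining the memory-management and batching practices; `process_batch` is assumed to exist elsewhere (see the script patterns below):

```python
import json
from pathlib import Path
from typing import Iterable, Iterator

def stream_records(path: Path) -> Iterator[dict]:
    """Yield records one line at a time instead of loading the whole file."""
    with path.open() as f:
        for line in f:
            if line.strip():
                yield json.loads(line)

def chunks(items: Iterable, batch_size: int = 32) -> Iterator[list]:
    """Group any iterable into fixed-size batches for memory-bounded processing."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# Usage: for batch in chunks(stream_records(Path("data.jsonl"))): process_batch(batch)
```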
## Scripts Structure

For data processing skills, provide helper scripts:

```
scripts/
├── processor.py   # Main data processing script
├── indexer.py     # Build indexes/embeddings
├── query.py       # Query interface (CLI)
└── generator.py   # Report/insight generation
```
### Script Best Practices

```python
# Good patterns for processing scripts:

# 1. Use click for the CLI
import click

@click.command()
@click.option('--input', 'input_path', help='Input path')
@click.option('--reindex', is_flag=True)
def process(input_path, reindex):
    """Process data from the input source."""
    pass

# 2. Show progress for long operations
from tqdm import tqdm

for item in tqdm(items, desc="Processing"):
    process_item(item)

# 3. Handle errors gracefully (inside the processing loop)
try:
    result = process_item(item)
except Exception as e:
    logger.error(f"Failed to process {item}: {e}")
    continue  # Continue with the next item

# 4. Support incremental updates (inside the processing loop)
if not reindex and is_already_processed(item):
    continue

# 5. Use batch processing
for batch in chunks(items, batch_size=32):
    process_batch(batch)
```
## Storage Schema

Document your data schema:

```sql
-- Example SQLite schema
CREATE TABLE conversations (
    id TEXT PRIMARY KEY,
    timestamp INTEGER,
    message_count INTEGER,
    files_modified TEXT,  -- JSON array
    tools_used TEXT       -- JSON array
);

CREATE INDEX idx_timestamp ON conversations(timestamp);
CREATE INDEX idx_files ON conversations(files_modified);
```
## Output Formats

Support multiple output formats (a small rendering sketch follows this list):

- **Markdown**: Human-readable reports
- **JSON**: Machine-readable for integration
- **CSV**: Spreadsheet-compatible data
- **HTML**: Styled reports with charts
- **Interactive**: Web dashboards (optional)
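
A minimal rendering helper for the first three formats, assuming results arrive as a list of flat dicts (all names here are illustrative):

```python
import csv
import io
import json

def render(rows: list[dict], fmt: str = "markdown") -> str:
    """Render the same result rows as Markdown, JSON, or CSV."""
    if fmt == "json":
        return json.dumps(rows, indent=2)
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=list(rows[0]) if rows else [])
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    # default: a simple Markdown table
    if not rows:
        return "_No results._"
    headers = list(rows[0])
    lines = ["| " + " | ".join(headers) + " |",
             "| " + " | ".join("---" for _ in headers) + " |"]
    lines += ["| " + " | ".join(str(r.get(h, "")) for h in headers) + " |" for r in rows]
    return "\n".join(lines)
```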