zhongwei/gh-cskiro-claudex-meta-tools

Zhongwei Li 8a3d331e04 Initial commit

2025-11-29 18:16:56 +08:00

Phase 2: Smart Clustering

Purpose: Group related insights using similarity analysis to identify skill candidates.

Steps

1. Load clustering configuration

Read data/clustering-config.yaml for weights and thresholds
Similarity weights:
- Same category: +0.3
- Shared keyword: +0.1 per keyword
- Temporal proximity (within 7 days): +0.05
- Title similarity: up to +0.15 (scaled by title word overlap)
- Content overlap: up to +0.2 (scaled by shared concepts)
Clustering threshold: 0.6 minimum to group
Standalone quality threshold: 0.8 for single-insight skills
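
As a rough illustration, the weights and thresholds above could be held in a small typed structure once the YAML file has been parsed. This is a minimal sketch: the class name ClusteringConfig, its field names, and the helper config_from_mapping are hypothetical and are not taken from data/clustering-config.yaml, whose actual key names are not shown here. The helper takes an already-parsed mapping so that no particular YAML library is assumed.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ClusteringConfig:
    """Defaults mirror the values listed in step 1; field names are hypothetical."""
    same_category: float = 0.3
    per_shared_keyword: float = 0.1
    temporal_proximity: float = 0.05
    temporal_window_days: int = 7
    title_similarity: float = 0.15
    content_overlap: float = 0.2
    clustering_threshold: float = 0.6
    standalone_quality_threshold: float = 0.8


def config_from_mapping(raw: dict) -> ClusteringConfig:
    """Build a config from a parsed mapping, rejecting unknown keys."""
    known = {f.name for f in fields(ClusteringConfig)}
    unknown = set(raw) - known
    if unknown:
        raise ValueError(f"Unknown config keys: {sorted(unknown)}")
    # Keys that are absent fall back to the defaults above.
    return ClusteringConfig(**raw)
```

Rejecting unknown keys is a design choice in this sketch: a misspelled weight name fails loudly instead of silently falling back to a default.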

2. Extract keywords from each insight

Normalize text (lowercase, remove punctuation)
Extract significant words from title (weight 2x)
Extract significant words from body (weight 1x)
Filter out common stop words
Apply category-specific keyword boosting
Build keyword vector for each insight
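
The extraction steps above can be sketched as follows. The stop-word list, the function names, and the representation of category boosting as a per-word multiplier are all assumptions made for illustration; the source does not specify how boosting is applied.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real list would be longer).
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "for", "is", "on", "with", "it"}
)


def normalize(text: str) -> list:
    """Lowercase the text, replace punctuation with spaces, and split into words."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def keyword_vector(title: str, body: str, boosts: dict = None) -> dict:
    """Weight title words 2x and body words 1x, then apply optional boosts."""
    vector = Counter()
    for word in normalize(title):
        if word not in STOP_WORDS:
            vector[word] += 2.0
    for word in normalize(body):
        if word not in STOP_WORDS:
            vector[word] += 1.0
    # Hypothetical category-specific boosting: multiply the weight of listed words.
    for word, factor in (boosts or {}).items():
        if word in vector:
            vector[word] *= factor
    return dict(vector)
```

A word that appears in both the title and the body accumulates both weights, so it ends up ranked above words that appear in only one place.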

3. Calculate pairwise similarity scores

For each pair of insights (i, j):

Base score = 0
If same category: +0.3
For each shared keyword: +0.1
If dates within 7 days: +0.05
Calculate title word overlap: shared_words / total_words * 0.15
Calculate content concept overlap: shared_concepts / total_concepts * 0.2
Final score = sum of all components
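
A minimal sketch of the scoring rule above, under two stated assumptions: "total_words" and "total_concepts" are read as the size of the union of the two sets, and the Insight record with its field names is hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Insight:
    """Hypothetical record; field names are illustrative only."""
    title: str
    category: str
    created: date
    keywords: frozenset
    concepts: frozenset


def overlap_ratio(a: set, b: set) -> float:
    """Shared items divided by total distinct items (assumed to mean the union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(a: Insight, b: Insight) -> float:
    score = 0.0
    if a.category == b.category:
        score += 0.3
    score += 0.1 * len(a.keywords & b.keywords)
    if abs((a.created - b.created).days) <= 7:
        score += 0.05
    title_a = set(a.title.lower().split())
    title_b = set(b.title.lower().split())
    score += overlap_ratio(title_a, title_b) * 0.15
    score += overlap_ratio(a.concepts, b.concepts) * 0.2
    return score
```

For example, two insights in the same category, four days apart, sharing one keyword, half of their title words, and half of their concepts score 0.3 + 0.1 + 0.05 + 0.075 + 0.1 = 0.625, just above the 0.6 clustering threshold. Because the keyword term is not capped, pairs with many shared keywords can score above 1.0.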

4. Build clusters

Start with highest similarity pairs
Group insights with similarity >= 0.6
Use connected components algorithm
Identify standalone insights (those that do not cluster with any other insight)
For each standalone insight, check whether its quality score is >= 0.8 to qualify as a single-insight skill
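
One way to realize the connected-components step is a union-find structure over the pairs that meet the threshold. The sketch below assumes pairwise scores arrive as a mapping from ID pairs to scores and quality scores as a mapping from ID to score. Both shapes, and the function name build_clusters, are hypothetical.

```python
def build_clusters(ids, scores, quality, threshold=0.6, standalone_min=0.8):
    """Return (clusters, standalone_candidates) using connected components."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Visit pairs from highest to lowest similarity and stop below the threshold.
    for (a, b), score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
        if score < threshold:
            break
        parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)

    clusters = [members for members in groups.values() if len(members) > 1]
    standalone = [
        members[0]
        for members in groups.values()
        if len(members) == 1 and quality.get(members[0], 0.0) >= standalone_min
    ]
    return clusters, standalone
```

Connected components are transitive: if A links to B and B links to C, all three land in one cluster even when A and C score below the threshold against each other. That property is one way a single giant cluster can form.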

5. Assess cluster characteristics

For each cluster:

Count insights
Identify dominant category
Extract common keywords
Assess complexity (lines, code examples, etc.)
Recommend skill complexity (minimal/standard/complex)
Suggest skill pattern (phase-based/mode-based/validation)
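
The assessment can be sketched as a small summary function. The line-count and code-example cutoffs used to choose between minimal, standard, and complex are placeholders invented for this example, as are the dictionary keys; the source does not state the actual criteria.

```python
from collections import Counter


def assess_cluster(insights: list) -> dict:
    """Summarize a non-empty cluster given as a list of dicts (hypothetical keys)."""
    categories = Counter(item["category"] for item in insights)
    dominant = categories.most_common(1)[0][0]

    # Common keywords are those present in every insight of the cluster.
    common = set.intersection(*(set(item["keywords"]) for item in insights))

    total_lines = sum(item["lines"] for item in insights)
    code_examples = sum(item["code_examples"] for item in insights)

    # Placeholder cutoffs, chosen only for illustration.
    if total_lines < 50 and code_examples == 0:
        complexity = "minimal"
    elif total_lines < 200:
        complexity = "standard"
    else:
        complexity = "complex"

    return {
        "size": len(insights),
        "dominant_category": dominant,
        "common_keywords": sorted(common),
        "complexity": complexity,
    }
```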

6. Handle large clusters (>5 insights)

Attempt sub-clustering by:
- Temporal splits (early vs. late insights)
- Sub-topic splits (different keyword groups)
- Complexity splits (simple vs. complex insights)
Ask the user whether to split the cluster or keep it as one comprehensive skill
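
As an illustration of the temporal option, a large cluster can be cut at its chronological midpoint. This sketch assumes each insight is an (id, date) pair and that "early vs. late" means the two halves of the date-sorted list. The source does not define the split point, so the midpoint is an assumption.

```python
from datetime import date


def temporal_split(insights: list, max_size: int = 5) -> list:
    """Split a cluster of (id, date) pairs into early and late halves.

    Clusters at or below max_size are returned unchanged as a single group.
    """
    if len(insights) <= max_size:
        return [list(insights)]
    ordered = sorted(insights, key=lambda pair: pair[1])
    mid = len(ordered) // 2
    return [ordered[:mid], ordered[mid:]]
```

The result is only a proposal: per the step above, the user still decides whether to accept the split or keep one comprehensive skill.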

7. Present clustering results interactively

For each cluster, show:

Cluster ID and size
Suggested skill name (from keywords)
Dominant category
Insight titles in cluster
Similarity scores
Recommended complexity
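
A plain-text summary of the fields listed above might be produced as follows. The layout, the function names, and the rule of joining the first few keywords into a skill name are assumptions made for this sketch.

```python
def suggest_skill_name(keywords: list, max_words: int = 3) -> str:
    """Join the leading keywords with hyphens (a hypothetical naming rule)."""
    return "-".join(keywords[:max_words]) or "unnamed-skill"


def format_cluster(cluster_id, titles, keywords, category, scores, complexity) -> str:
    """Render one cluster as indented plain text for interactive review."""
    lines = [
        f"Cluster {cluster_id} ({len(titles)} insights)",
        f"  Suggested skill name: {suggest_skill_name(keywords)}",
        f"  Dominant category: {category}",
        "  Insights:",
    ]
    lines += [f"    - {title}" for title in titles]
    if scores:
        lines.append(f"  Similarity: min {min(scores):.2f}, max {max(scores):.2f}")
    lines.append(f"  Recommended complexity: {complexity}")
    return "\n".join(lines)
```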

Ask user to:

Review proposed clusters
Accept/reject/modify groupings
Combine or split clusters
Remove low-value insights

Output

Validated clusters of insights, each representing a skill candidate.

Common Issues

All insights are unrelated (no clusters): Offer to generate standalone skills or exit
One giant cluster: Suggest sub-clustering or mode-based skill
Too many standalone insights: Suggest lowering the clustering threshold (so more pairs qualify for grouping) or manual grouping