# Phase 2: Smart Clustering

**Purpose**: Group related insights using similarity analysis to identify skill candidates.

## Steps

### 1. Load clustering configuration

- Read `data/clustering-config.yaml` for weights and thresholds
- Similarity weights:
  - Same category: +0.3
  - Shared keyword: +0.1 per keyword
  - Temporal proximity (within 7 days): +0.05
  - Title similarity: up to +0.15 (scaled by word overlap)
  - Content overlap: up to +0.2 (scaled by concept overlap)
- Clustering threshold: 0.6 minimum to group
- Standalone quality threshold: 0.8 for single-insight skills

### 2. Extract keywords from each insight

- Normalize text (lowercase, remove punctuation)
- Extract significant words from title (weight 2x)
- Extract significant words from body (weight 1x)
- Filter out common stop words
- Apply category-specific keyword boosting
- Build keyword vector for each insight

### 3. Calculate pairwise similarity scores

For each pair of insights (i, j):

- Base score = 0
- If same category: +0.3
- For each shared keyword: +0.1
- If dates within 7 days: +0.05
- Calculate title word overlap: `shared_words / total_words * 0.15`
- Calculate content concept overlap: `shared_concepts / total_concepts * 0.2`
- Final score = sum of all components

### 4. Build clusters

- Start with highest similarity pairs
- Group insights with similarity >= 0.6
- Use connected components algorithm
- Identify standalone insights (those that don't cluster with any others)
- For standalone insights, check if quality score >= 0.8

### 5. Assess cluster characteristics

For each cluster:

- Count insights
- Identify dominant category
- Extract common keywords
- Assess complexity (lines, code examples, etc.)
- Recommend skill complexity (minimal/standard/complex)
- Suggest skill pattern (phase-based/mode-based/validation)

### 6. Handle large clusters (>5 insights)

- Attempt sub-clustering by:
  - Temporal splits (early vs. late insights)
  - Sub-topic splits (different keyword groups)
  - Complexity splits (simple vs. complex insights)
- Ask user if they want to split or keep as comprehensive skill

### 7. Present clustering results interactively

For each cluster, show:

- Cluster ID and size
- Suggested skill name (from keywords)
- Dominant category
- Insight titles in cluster
- Similarity scores
- Recommended complexity

Ask user to:

- Review proposed clusters
- Accept/reject/modify groupings
- Combine or split clusters
- Remove low-value insights

## Output

Validated clusters of insights, each representing a skill candidate.

## Common Issues

- **All insights are unrelated** (no clusters): Offer to generate standalone skills or exit
- **One giant cluster**: Suggest sub-clustering or mode-based skill
- **Too many standalone insights**: Suggest lowering the similarity threshold or grouping manually
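
## Example sketch: scoring and clustering

As an illustration of steps 3 and 4, the scoring and grouping logic might look like the following Python sketch. It is not the skill's actual implementation. The names `Insight`, `overlap`, `similarity`, and `build_clusters` are hypothetical, keywords and concepts are assumed to be extracted already (step 2), the weights are hard-coded from step 1 rather than loaded from `data/clustering-config.yaml`, and `total_words` / `total_concepts` are assumed to mean the size of the union of both sets.

```python
from dataclasses import dataclass
from datetime import date
from itertools import combinations

# Values copied from step 1; a real run would read them from the config file.
CLUSTER_THRESHOLD = 0.6
STANDALONE_QUALITY = 0.8


@dataclass
class Insight:
    """Hypothetical container for one insight after keyword extraction."""
    id: str
    category: str
    title: str
    day: date
    keywords: frozenset = frozenset()
    concepts: frozenset = frozenset()
    quality: float = 0.0


def overlap(a, b):
    """Shared items divided by total distinct items (assumed: union)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(a, b):
    """Sum the weighted components listed in step 3."""
    score = 0.0
    if a.category == b.category:
        score += 0.3
    score += 0.1 * len(a.keywords & b.keywords)
    if abs((a.day - b.day).days) <= 7:
        score += 0.05
    title_a = set(a.title.lower().split())
    title_b = set(b.title.lower().split())
    score += 0.15 * overlap(title_a, title_b)
    score += 0.2 * overlap(a.concepts, b.concepts)
    return score


def build_clusters(insights):
    """Group insights via connected components over edges scoring >= 0.6.

    Returns (clusters, standalone): clusters are lists of two or more
    insights; standalone holds unclustered insights whose quality meets
    the standalone threshold.
    """
    parent = {i.id: i.id for i in insights}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(insights, 2):
        if similarity(a, b) >= CLUSTER_THRESHOLD:
            parent[find(a.id)] = find(b.id)

    groups = {}
    for i in insights:
        groups.setdefault(find(i.id), []).append(i)

    clusters = [g for g in groups.values() if len(g) > 1]
    standalone = [
        g[0] for g in groups.values()
        if len(g) == 1 and g[0].quality >= STANDALONE_QUALITY
    ]
    return clusters, standalone
```

Because grouping uses connected components, two insights can land in the same cluster without scoring 0.6 against each other directly, as long as a chain of qualifying pairs links them. That transitivity is one way the "one giant cluster" issue can arise.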