# Phase 2: Smart Clustering

**Purpose**: Group related insights using similarity analysis to identify skill candidates.

## Steps

### 1. Load clustering configuration

- Read `data/clustering-config.yaml` for weights and thresholds
- Similarity weights:
  - Same category: +0.3
  - Shared keyword: +0.1 per keyword
  - Temporal proximity (within 7 days): +0.05
  - Title similarity: up to +0.15, scaled by word overlap
  - Content overlap: up to +0.2, scaled by concept overlap
- Clustering threshold: 0.6 minimum to group
- Standalone quality threshold: 0.8 for single-insight skills

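As a rough illustration of the configuration above, the sketch below holds the listed weights and thresholds in a small Python dataclass. The field names and the `config_from_mapping` helper are hypothetical; the actual key names in `data/clustering-config.yaml` are not shown in this document. The helper takes an already-parsed mapping, such as the result of parsing the YAML file with a YAML library.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ClusteringConfig:
    # Defaults mirror the values listed in this step; field names are hypothetical.
    same_category: float = 0.3
    per_shared_keyword: float = 0.1
    temporal_proximity: float = 0.05
    temporal_window_days: int = 7
    title_similarity: float = 0.15
    content_overlap: float = 0.2
    clustering_threshold: float = 0.6
    standalone_quality_threshold: float = 0.8


def config_from_mapping(raw: dict) -> ClusteringConfig:
    """Overlay values from a parsed config mapping onto the defaults."""
    known = {f.name for f in fields(ClusteringConfig)}
    unknown = set(raw) - known
    if unknown:
        # Fail loudly on typos instead of silently ignoring them.
        raise KeyError(f"unknown config keys: {sorted(unknown)}")
    return ClusteringConfig(**raw)
```

Rejecting unknown keys is a design choice in this sketch: a misspelled threshold name would otherwise fall back to the default without warning.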
### 2. Extract keywords from each insight

- Normalize text (lowercase, remove punctuation)
- Extract significant words from title (weight 2x)
- Extract significant words from body (weight 1x)
- Filter out common stop words
- Apply category-specific keyword boosting
- Build keyword vector for each insight

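The extraction steps above could look roughly like the following minimal sketch. The stop-word list, the minimum word length of three characters, and the `CATEGORY_BOOSTS` table are all assumptions made for illustration; the document does not specify them.

```python
import re
from collections import Counter

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = frozenset({
    "a", "an", "and", "the", "of", "to", "in", "is", "was", "were",
    "because", "for", "on", "with", "it", "this", "that",
})

# Hypothetical per-category multipliers for keywords that matter more there.
CATEGORY_BOOSTS = {"testing": {"flaky": 1.5}}


def tokenize(text: str) -> list:
    """Lowercase, drop punctuation, and keep words that are not stop words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS and len(w) > 2]


def keyword_vector(title: str, body: str, category: str = "") -> dict:
    """Weight title words 2x and body words 1x, then apply category boosts."""
    weights = Counter()
    for word in tokenize(title):
        weights[word] += 2.0
    for word in tokenize(body):
        weights[word] += 1.0
    for word, factor in CATEGORY_BOOSTS.get(category, {}).items():
        if word in weights:
            weights[word] *= factor
    return dict(weights)
```

For example, `keyword_vector("Fix Flaky Tests", "The tests were flaky because of timing.", "testing")` gives `flaky` a weight of 4.5: 2.0 from the title plus 1.0 from the body, multiplied by the 1.5 boost.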
### 3. Calculate pairwise similarity scores

For each pair of insights (i, j):

- Base score = 0
- If same category: +0.3
- For each shared keyword: +0.1
- If dates within 7 days: +0.05
- Calculate title word overlap: `shared_words / total_words * 0.15`
- Calculate content concept overlap: `shared_concepts / total_concepts * 0.2`
- Final score = sum of all components

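The scoring rules above can be sketched as a single function. The `Insight` record and its field names are hypothetical, and this sketch assumes that `total_words` and `total_concepts` mean the size of the union of both insights' sets (a Jaccard-style ratio); the document does not define them precisely.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Insight:
    # Hypothetical record shape, for illustration only.
    title: str
    category: str
    created: date
    keywords: frozenset
    concepts: frozenset


def overlap_ratio(a: frozenset, b: frozenset) -> float:
    """Shared items divided by all distinct items; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(a: Insight, b: Insight) -> float:
    score = 0.0
    if a.category == b.category:
        score += 0.3
    score += 0.1 * len(a.keywords & b.keywords)
    if abs((a.created - b.created).days) <= 7:
        score += 0.05
    title_a = frozenset(a.title.lower().split())
    title_b = frozenset(b.title.lower().split())
    score += 0.15 * overlap_ratio(title_a, title_b)
    score += 0.2 * overlap_ratio(a.concepts, b.concepts)
    return score
```

Because each shared keyword adds a flat +0.1, the final score has no upper bound of 1.0 under these rules; two insights with many shared keywords can exceed it.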
### 4. Build clusters

- Start with highest similarity pairs
- Group insights with similarity >= 0.6
- Use connected components algorithm
- Identify standalone insights (those that do not cluster with any other insight)
- For standalone insights, check whether quality score >= 0.8; only those qualify as single-insight skills

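One way to realize the connected-components grouping is a small union-find structure, sketched below. The function names and the shape of the `scores` mapping (pairs of insight IDs mapped to similarity scores) are assumptions for illustration.

```python
def build_clusters(ids, scores, threshold=0.6):
    """Group IDs whose pairwise score meets the threshold, transitively.

    `scores` maps (id_a, id_b) pairs to similarity scores.
    Returns (clusters, standalone), where clusters have two or more members.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), score in scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)

    clusters = [g for g in groups.values() if len(g) > 1]
    standalone = [g[0] for g in groups.values() if len(g) == 1]
    return clusters, standalone


def qualifying_standalones(standalone, quality, min_quality=0.8):
    """Keep standalone insights whose quality score meets the threshold."""
    return [i for i in standalone if quality.get(i, 0.0) >= min_quality]
```

With connected components, the order in which pairs are processed does not change the final grouping, so sorting by highest similarity first mainly matters for presentation. Note also that grouping is transitive: two insights can land in the same cluster through a chain of links even if their direct score is below 0.6.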
### 5. Assess cluster characteristics

For each cluster:

- Count insights
- Identify dominant category
- Extract common keywords
- Assess complexity (lines, code examples, etc.)
- Recommend skill complexity (minimal/standard/complex)
- Suggest skill pattern (phase-based/mode-based/validation)

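The first three assessments can be sketched with simple counting. In this minimal example, insights are plain dicts with hypothetical `category` and `keywords` fields, and a keyword counts as "common" when it appears in at least half of the cluster's insights; that `min_share` cutoff is an assumption. The complexity and pattern recommendations are left out because the document does not state the rules behind them.

```python
from collections import Counter


def summarize_cluster(insights, min_share=0.5):
    """Return size, dominant category, and common keywords for one cluster."""
    if not insights:
        raise ValueError("cluster must contain at least one insight")

    categories = Counter(i["category"] for i in insights)
    # Count each keyword once per insight, however often it repeats inside it.
    keyword_counts = Counter(k for i in insights for k in set(i["keywords"]))

    needed = min_share * len(insights)
    common = sorted(k for k, n in keyword_counts.items() if n >= needed)

    return {
        "size": len(insights),
        "dominant_category": categories.most_common(1)[0][0],
        "common_keywords": common,
    }
```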
### 6. Handle large clusters (>5 insights)

- Attempt sub-clustering by:
  - Temporal splits (early vs. late insights)
  - Sub-topic splits (different keyword groups)
  - Complexity splits (simple vs. complex insights)
- Ask the user whether to split the cluster or keep it as one comprehensive skill

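As one possible reading of the temporal split, the sketch below cuts an oversized cluster at the largest gap between consecutive insight dates. Cutting at the largest gap is an assumption; the document only says "early vs. late", and a midpoint or median split would be equally valid.

```python
from datetime import date


def temporal_split(dated, max_size=5):
    """Split a cluster into early and late halves at the largest date gap.

    `dated` is a list of (insight_id, date) pairs. Clusters at or below
    `max_size` are returned unchanged as a single group.
    """
    if len(dated) <= max_size:
        return [list(dated)]

    ordered = sorted(dated, key=lambda pair: pair[1])
    gaps = [
        (ordered[k + 1][1] - ordered[k][1]).days
        for k in range(len(ordered) - 1)
    ]
    # Cut just after the insight that precedes the largest gap.
    cut = gaps.index(max(gaps)) + 1
    return [ordered[:cut], ordered[cut:]]
```

The result is only a proposal: per the last bullet above, the user still decides whether to accept the split or keep the cluster whole.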
### 7. Present clustering results interactively

For each cluster, show:

- Cluster ID and size
- Suggested skill name (from keywords)
- Dominant category
- Insight titles in cluster
- Similarity scores
- Recommended complexity

Ask the user to:

- Review proposed clusters
- Accept/reject/modify groupings
- Combine or split clusters
- Remove low-value insights

## Output

Validated clusters of insights, each representing a skill candidate.

## Common Issues

- **All insights are unrelated** (no clusters): Offer to generate standalone skills or exit
- **One giant cluster**: Suggest sub-clustering or a mode-based skill
- **Too many standalone insights**: Suggest lowering the clustering threshold (so more pairs qualify for grouping) or grouping manually