140 lines
4.4 KiB
Markdown
140 lines
4.4 KiB
Markdown
# Phase 1: Insight Discovery and Parsing
|
|
|
|
**Purpose**: Locate, read, deduplicate, and structure all insights from the project's lessons-learned directory.
|
|
|
|
## Steps
|
|
|
|
### 1. Verify project structure
|
|
- Ask user for project root directory (default: current working directory)
|
|
- Check if `docs/lessons-learned/` exists
|
|
- If not found, explain the expected structure and offer to search alternative locations
|
|
- List all categories found (testing, configuration, hooks-and-events, etc.)
|
|
|
|
### 2. Scan and catalog insight files
|
|
|
|
**File Naming Convention**:
|
|
Files MUST follow: `YYYY-MM-DD-descriptive-slug.md`
|
|
- Date prefix for chronological sorting
|
|
- Descriptive slug (3-5 words) summarizing the insight topic
|
|
- Examples:
|
|
- `2025-11-21-jwt-refresh-token-pattern.md`
|
|
- `2025-11-20-vitest-mocking-best-practices.md`
|
|
- `2025-11-19-react-testing-library-queries.md`
|
|
|
|
**Scanning**:
|
|
- Use Glob tool to find all markdown files: `docs/lessons-learned/**/*.md`
|
|
- For each file found, extract:
|
|
- File path and category (from directory name)
|
|
- Creation date (from filename prefix)
|
|
- Descriptive title (from filename slug)
|
|
- File size and line count
|
|
- Build initial inventory report
|
|
|
|
### 3. Deduplicate insights (CRITICAL)
|
|
|
|
**Why**: The extraction hook may create duplicate entries within files.
|
|
|
|
**Deduplication Algorithm**:
|
|
```python
|
|
def deduplicate_insights(insights):
|
|
seen_hashes = set()
|
|
unique_insights = []
|
|
|
|
for insight in insights:
|
|
# Create hash from normalized content
|
|
content_hash = hash(normalize(insight.title + insight.content[:200]))
|
|
|
|
if content_hash not in seen_hashes:
|
|
seen_hashes.add(content_hash)
|
|
unique_insights.append(insight)
|
|
else:
|
|
log_duplicate(insight)
|
|
|
|
return unique_insights
|
|
```
|
|
|
|
**Deduplication Checks**:
|
|
- Exact title match → duplicate
|
|
- First 200 chars content match → duplicate
|
|
- Same code blocks in same order → duplicate
|
|
- Report: "Found X insights, removed Y duplicates (Z unique)"
|
|
|
|
### 4. Parse individual insights
|
|
- Read each file using Read tool
|
|
- Extract session metadata (session ID, timestamp from file headers)
|
|
- Split file content on `---` separator (insights are separated by horizontal rules)
|
|
- For each insight section:
|
|
- Extract title (first line, often wrapped in `**bold**`)
|
|
- Extract body content (remaining markdown)
|
|
- Identify code blocks
|
|
- Extract actionable items (lines starting with `- [ ]` or numbered lists)
|
|
- Note any warnings/cautions
|
|
|
|
### 5. Apply quality filters
|
|
|
|
**Filter out low-depth insights** that are:
|
|
- Basic explanatory notes without actionable steps
|
|
- Simple definitions or concept explanations
|
|
- Single-paragraph observations
|
|
|
|
**Keep insights that have**:
|
|
- Actionable workflows (numbered steps, checklists)
|
|
- Decision frameworks (trade-offs, when to use X vs Y)
|
|
- Code patterns with explanation of WHY
|
|
- Troubleshooting guides with solutions
|
|
- Best practices with concrete examples
|
|
|
|
**Quality Score Calculation**:
|
|
```
|
|
score = 0
|
|
if has_actionable_items: score += 3
|
|
if has_code_examples: score += 2
|
|
if has_numbered_steps: score += 2
|
|
if word_count > 200: score += 1
|
|
if has_warnings_or_notes: score += 1
|
|
|
|
# Minimum score for skill consideration: 4
|
|
```
|
|
|
|
### 6. Build structured insight inventory
|
|
```
|
|
{
|
|
id: unique_id,
|
|
title: string,
|
|
content: string,
|
|
category: string,
|
|
date: ISO_date,
|
|
session_id: string,
|
|
source_file: path,
|
|
code_examples: [{ language, code }],
|
|
action_items: [string],
|
|
keywords: [string],
|
|
quality_score: int,
|
|
paragraph_count: int,
|
|
line_count: int
|
|
}
|
|
```
|
|
|
|
### 7. Present discovery summary
|
|
- Total insights found (before deduplication)
|
|
- Duplicates removed
|
|
- Low-quality insights filtered
|
|
- **Final count**: Unique, quality insights
|
|
- Category breakdown
|
|
- Date range (earliest to latest)
|
|
- Preview of top 5 insights by quality score
|
|
|
|
## Output
|
|
|
|
Deduplicated, quality-filtered inventory of insights with metadata and categorization.
|
|
|
|
## Common Issues
|
|
|
|
- **No lessons-learned directory**: Ask if user wants to search elsewhere or exit
|
|
- **Empty files**: Skip and report count of empty files
|
|
- **Malformed markdown**: Log warning but continue parsing (best effort)
|
|
- **Missing session metadata**: Use filename date as fallback
|
|
- **High duplicate count**: Indicates extraction hook bug - warn user
|
|
- **All insights filtered as low-quality**: Lower threshold or suggest manual curation
|
|
- **Files without descriptive names**: Suggest renaming for better organization
|