Phase 1: Insight Discovery and Parsing
Purpose: Locate, read, deduplicate, and structure all insights from the project's lessons-learned directory.
Steps
1. Verify project structure
- Ask user for project root directory (default: current working directory)
- Check if `docs/lessons-learned/` exists
- If not found, explain the expected structure and offer to search alternative locations
- List all categories found (testing, configuration, hooks-and-events, etc.)
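A minimal sketch of this verification step, assuming the default `docs/lessons-learned/` layout with one subdirectory per category (directory names here are illustrative):

```python
from pathlib import Path

def verify_structure(project_root: str = ".") -> list[str]:
    """Return category directory names under docs/lessons-learned/, or [] if the directory is missing."""
    lessons_dir = Path(project_root) / "docs" / "lessons-learned"
    if not lessons_dir.is_dir():
        # Caller should explain the expected layout and offer to search alternative locations
        return []
    return sorted(p.name for p in lessons_dir.iterdir() if p.is_dir())
```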
2. Scan and catalog insight files
File Naming Convention:
Files MUST follow: `YYYY-MM-DD-descriptive-slug.md`
- Date prefix for chronological sorting
- Descriptive slug (3-5 words) summarizing the insight topic
- Examples:
  - `2025-11-21-jwt-refresh-token-pattern.md`
  - `2025-11-20-vitest-mocking-best-practices.md`
  - `2025-11-19-react-testing-library-queries.md`
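One way to validate and parse this convention (a sketch; the regex and helper name are assumptions, not part of the spec):

```python
import re

FILENAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$")

def parse_insight_filename(name: str):
    """Return (date, title-from-slug) if the filename follows the convention, else None."""
    m = FILENAME_RE.match(name)
    if not m:
        return None
    date, slug = m.groups()
    return date, slug.replace("-", " ")
```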
Scanning:
- Use Glob tool to find all markdown files: `docs/lessons-learned/**/*.md`
- For each file found, extract:
  - File path and category (from directory name)
  - Creation date (from filename prefix)
  - Descriptive title (from filename slug)
  - File size and line count
- Build initial inventory report
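A sketch of the cataloging pass, assuming the glob pattern above and the hypothetical `parse_insight_filename` helper from the previous sketch:

```python
from pathlib import Path

def catalog_insight_files(project_root: str = ".") -> list[dict]:
    inventory = []
    for path in Path(project_root).glob("docs/lessons-learned/**/*.md"):
        parsed = parse_insight_filename(path.name)
        text = path.read_text(encoding="utf-8")
        inventory.append({
            "path": str(path),
            "category": path.parent.name,                 # directory name, e.g. "testing"
            "date": parsed[0] if parsed else None,
            "title": parsed[1] if parsed else path.stem,
            "size_bytes": path.stat().st_size,
            "line_count": text.count("\n") + 1,
        })
    return inventory
```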
3. Deduplicate insights (CRITICAL)
Why: The extraction hook may create duplicate entries within files.
Deduplication Algorithm:
```python
def deduplicate_insights(insights):
    seen_hashes = set()
    unique_insights = []
    for insight in insights:
        # Create hash from normalized content
        content_hash = hash(normalize(insight.title + insight.content[:200]))
        if content_hash not in seen_hashes:
            seen_hashes.add(content_hash)
            unique_insights.append(insight)
        else:
            log_duplicate(insight)
    return unique_insights
```
Deduplication Checks:
- Exact title match → duplicate
- First 200 chars content match → duplicate
- Same code blocks in same order → duplicate
- Report: "Found X insights, removed Y duplicates (Z unique)"
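The `normalize` step referenced above is not defined in the source; a reasonable assumption is case-folding plus whitespace collapsing, and a stable digest can stand in for Python's salted `hash()` so results are reproducible across runs:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences don't defeat dedup."""
    return re.sub(r"\s+", " ", text.strip().lower())

def content_fingerprint(title: str, content: str) -> str:
    # sha256 is stable across processes, unlike the built-in hash()
    return hashlib.sha256(normalize(title + content[:200]).encode("utf-8")).hexdigest()
```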
4. Parse individual insights
- Read each file using Read tool
- Extract session metadata (session ID, timestamp from file headers)
- Split file content on the `---` separator (insights are separated by horizontal rules)
- For each insight section:
  - Extract title (first line, often wrapped in `**bold**`)
  - Extract body content (remaining markdown)
  - Identify code blocks
  - Extract actionable items (lines starting with `- [ ]` or numbered lists)
  - Note any warnings/cautions
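A sketch of the per-file parse, assuming insights are separated by `---` on its own line and that code blocks use standard fences (helper name and heuristics are illustrative):

```python
import re

def parse_insight_sections(file_text: str) -> list[dict]:
    sections = []
    for chunk in re.split(r"\n-{3,}\n", file_text):   # assumes --- sits on its own line
        chunk = chunk.strip()
        if not chunk:
            continue
        lines = chunk.splitlines()
        title = lines[0].strip("* ").strip()           # first line, often **bold**
        body = "\n".join(lines[1:]).strip()
        code_blocks = re.findall(r"`{3}(\w*)\n(.*?)`{3}", chunk, re.DOTALL)
        # Only checkbox items shown here; numbered lists would need a second pattern
        action_items = [line for line in lines if line.lstrip().startswith("- [ ]")]
        sections.append({
            "title": title,
            "content": body,
            "code_examples": [{"language": lang or None, "code": code} for lang, code in code_blocks],
            "action_items": action_items,
        })
    return sections
```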
5. Apply quality filters
Filter out low-depth insights that are:
- Basic explanatory notes without actionable steps
- Simple definitions or concept explanations
- Single-paragraph observations
Keep insights that have:
- Actionable workflows (numbered steps, checklists)
- Decision frameworks (trade-offs, when to use X vs Y)
- Code patterns with explanation of WHY
- Troubleshooting guides with solutions
- Best practices with concrete examples
Quality Score Calculation:
```python
score = 0
if has_actionable_items: score += 3
if has_code_examples: score += 2
if has_numbered_steps: score += 2
if word_count > 200: score += 1
if has_warnings_or_notes: score += 1
# Minimum score for skill consideration: 4
```
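One way to make those flags concrete against a parsed insight record (the regex heuristics are assumptions; tune them to the actual content):

```python
import re

def quality_score(insight: dict) -> int:
    content = insight["content"]
    score = 0
    if insight["action_items"]:                                  score += 3
    if insight["code_examples"]:                                 score += 2
    if re.search(r"^\s*\d+\.\s", content, re.MULTILINE):         score += 2  # numbered steps
    if len(content.split()) > 200:                               score += 1  # word count
    if re.search(r"(?i)\b(warning|caution|note):", content):     score += 1
    return score  # minimum score for skill consideration: 4
```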
6. Build structured insight inventory
```
{
  id: unique_id,
  title: string,
  content: string,
  category: string,
  date: ISO_date,
  session_id: string,
  source_file: path,
  code_examples: [{ language, code }],
  action_items: [string],
  keywords: [string],
  quality_score: int,
  paragraph_count: int,
  line_count: int
}
```
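As a typed sketch of the same record (field names follow the schema above; concrete types are assumptions where the schema is loose):

```python
from typing import Optional, TypedDict

class CodeExample(TypedDict):
    language: Optional[str]
    code: str

class Insight(TypedDict):
    id: str
    title: str
    content: str
    category: str
    date: str                      # ISO date, e.g. "2025-11-21"
    session_id: Optional[str]      # may be missing; filename date is the fallback
    source_file: str
    code_examples: list[CodeExample]
    action_items: list[str]
    keywords: list[str]
    quality_score: int
    paragraph_count: int
    line_count: int
```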
7. Present discovery summary
- Total insights found (before deduplication)
- Duplicates removed
- Low-quality insights filtered
- Final count of unique, quality-filtered insights
- Category breakdown
- Date range (earliest to latest)
- Preview of top 5 insights by quality score
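A sketch of assembling that summary from the inventory (counts are passed in from the earlier passes; field names match the record above):

```python
from collections import Counter

def discovery_summary(insights: list[dict], total_found: int, duplicates: int, filtered: int) -> str:
    categories = Counter(i["category"] for i in insights)
    dates = sorted(i["date"] for i in insights if i["date"])
    top5 = sorted(insights, key=lambda i: i["quality_score"], reverse=True)[:5]
    lines = [
        f"Insights found: {total_found} (removed {duplicates} duplicates, filtered {filtered} low-quality)",
        f"Final count: {len(insights)}",
        "Categories: " + ", ".join(f"{c} ({n})" for c, n in categories.most_common()),
        f"Date range: {dates[0]} to {dates[-1]}" if dates else "Date range: unknown",
        "Top insights: " + "; ".join(i["title"] for i in top5),
    ]
    return "\n".join(lines)
```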
Output
Deduplicated, quality-filtered inventory of insights with metadata and categorization.
Common Issues
- No lessons-learned directory: Ask if user wants to search elsewhere or exit
- Empty files: Skip and report count of empty files
- Malformed markdown: Log warning but continue parsing (best effort)
- Missing session metadata: Use filename date as fallback
- High duplicate count: Indicates a bug in the extraction hook; warn the user
- All insights filtered as low-quality: Lower threshold or suggest manual curation
- Files without descriptive names: Suggest renaming for better organization