# Phase 1: Insight Discovery and Parsing

**Purpose**: Locate, read, deduplicate, and structure all insights from the project's lessons-learned directory.

## Steps

### 1. Verify project structure

- Ask user for project root directory (default: current working directory)
- Check if `docs/lessons-learned/` exists
- If not found, explain the expected structure and offer to search alternative locations
- List all categories found (testing, configuration, hooks-and-events, etc.)

### 2. Scan and catalog insight files

**File Naming Convention**: Files MUST follow: `YYYY-MM-DD-descriptive-slug.md`
- Date prefix for chronological sorting
- Descriptive slug (3-5 words) summarizing the insight topic
- Examples:
  - `2025-11-21-jwt-refresh-token-pattern.md`
  - `2025-11-20-vitest-mocking-best-practices.md`
  - `2025-11-19-react-testing-library-queries.md`

**Scanning**:
- Use Glob tool to find all markdown files: `docs/lessons-learned/**/*.md`
- For each file found, extract:
  - File path and category (from directory name)
  - Creation date (from filename prefix)
  - Descriptive title (from filename slug)
  - File size and line count
- Build initial inventory report

### 3. Deduplicate insights (CRITICAL)

**Why**: The extraction hook may create duplicate entries within files.

**Deduplication Algorithm**:
```python
def deduplicate_insights(insights):
    seen_hashes = set()
    unique_insights = []

    for insight in insights:
        # Create hash from normalized content
        content_hash = hash(normalize(insight.title + insight.content[:200]))

        if content_hash not in seen_hashes:
            seen_hashes.add(content_hash)
            unique_insights.append(insight)
        else:
            log_duplicate(insight)

    return unique_insights
```

**Deduplication Checks**:
- Exact title match → duplicate
- First 200 chars content match → duplicate
- Same code blocks in same order → duplicate
- Report: "Found X insights, removed Y duplicates (Z unique)"

### 4. Parse individual insights

- Read each file using Read tool
- Extract session metadata (session ID, timestamp from file headers)
- Split file content on `---` separator (insights are separated by horizontal rules)
- For each insight section:
  - Extract title (first line, often wrapped in `**bold**`)
  - Extract body content (remaining markdown)
  - Identify code blocks
  - Extract actionable items (lines starting with `- [ ]` or numbered lists)
  - Note any warnings/cautions

### 5. Apply quality filters

**Filter out low-depth insights** that are:
- Basic explanatory notes without actionable steps
- Simple definitions or concept explanations
- Single-paragraph observations

**Keep insights that have**:
- Actionable workflows (numbered steps, checklists)
- Decision frameworks (trade-offs, when to use X vs Y)
- Code patterns with explanation of WHY
- Troubleshooting guides with solutions
- Best practices with concrete examples

**Quality Score Calculation**:
```
score = 0
if has_actionable_items: score += 3
if has_code_examples: score += 2
if has_numbered_steps: score += 2
if word_count > 200: score += 1
if has_warnings_or_notes: score += 1

# Minimum score for skill consideration: 4
```

### 6. Build structured insight inventory

```
{
  id: unique_id,
  title: string,
  content: string,
  category: string,
  date: ISO_date,
  session_id: string,
  source_file: path,
  code_examples: [{ language, code }],
  action_items: [string],
  keywords: [string],
  quality_score: int,
  paragraph_count: int,
  line_count: int
}
```

### 7. Present discovery summary

- Total insights found (before deduplication)
- Duplicates removed
- Low-quality insights filtered
- **Final count**: Unique, quality insights
- Category breakdown
- Date range (earliest to latest)
- Preview of top 5 insights by quality score

## Output

Deduplicated, quality-filtered inventory of insights with metadata and categorization.

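The splitting, deduplication, and scoring steps (3 to 5) can be sketched end to end in Python. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the `Insight` dataclass, `normalize`, `fingerprint`, `split_insights`, and the regexes used to detect action items, numbered steps, and warnings are all hypothetical. It uses `hashlib` instead of the built-in `hash` (which is salted per process for strings) so fingerprints stay stable across runs, and it fingerprints the title and the first 200 characters separately so that either match flags a duplicate, as the deduplication checks describe. The "same code blocks in same order" check is left out for brevity.

```python
import hashlib
import re
from dataclasses import dataclass

FENCE = "`" * 3   # markdown code fence marker
MIN_SCORE = 4     # minimum score for skill consideration (from step 5)


@dataclass
class Insight:  # hypothetical container, not defined in the original
    title: str
    content: str


def normalize(text: str) -> str:
    """Assumed normalization: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def fingerprint(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def split_insights(file_text: str) -> list[Insight]:
    """Split on lines holding only '---'; the first line of each section is the title."""
    insights = []
    for section in re.split(r"^[ \t]*---[ \t]*$", file_text, flags=re.MULTILINE):
        lines = section.strip().splitlines()
        if not lines:
            continue  # skip empty sections
        title = lines[0].strip().strip("*#").strip()  # drop bold/heading markers
        insights.append(Insight(title, "\n".join(lines[1:]).strip()))
    return insights


def deduplicate(insights: list[Insight]) -> tuple[list[Insight], list[Insight]]:
    """Flag an insight as a duplicate if its title OR its first 200 chars were seen."""
    seen: set[tuple[str, str]] = set()
    unique, duplicates = [], []
    for insight in insights:
        keys = {("title", fingerprint(insight.title))}
        if insight.content.strip():  # empty bodies must not collide with each other
            keys.add(("body", fingerprint(insight.content[:200])))
        if keys & seen:
            duplicates.append(insight)
        else:
            seen |= keys
            unique.append(insight)
    return unique, duplicates


def quality_score(insight: Insight) -> int:
    """Apply the step 5 weights using simple (assumed) regex heuristics."""
    body = insight.content
    numbered = re.search(r"^\s*\d+\.\s", body, re.MULTILINE) is not None
    checkbox = re.search(r"^\s*- \[[ xX]\]", body, re.MULTILINE) is not None
    score = 0
    if checkbox or numbered:          # has_actionable_items
        score += 3
    if FENCE in body:                 # has_code_examples
        score += 2
    if numbered:                      # has_numbered_steps
        score += 2
    if len(body.split()) > 200:       # word_count > 200
        score += 1
    if re.search(r"\b(warning|caution|note)\b", body, re.IGNORECASE):
        score += 1                    # has_warnings_or_notes
    return score
```

A caller would typically run `split_insights` per file, pass the combined list to `deduplicate`, report `len(duplicates)`, and keep only insights where `quality_score(insight) >= MIN_SCORE`. Note that this naive split also breaks on `---` lines inside code fences or front matter, so a real parser would need to track fence state.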
## Common Issues

- **No lessons-learned directory**: Ask if user wants to search elsewhere or exit
- **Empty files**: Skip and report count of empty files
- **Malformed markdown**: Log warning but continue parsing (best effort)
- **Missing session metadata**: Use filename date as fallback
- **High duplicate count**: Indicates extraction hook bug - warn user
- **All insights filtered as low-quality**: Lower threshold or suggest manual curation
- **Files without descriptive names**: Suggest renaming for better organization
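
Two of these cases, the filename-date fallback and files that need renaming, can be detected from the path alone. The sketch below checks a path against the `YYYY-MM-DD-descriptive-slug.md` convention and derives the category, date, and title from it. `FileInfo` and `parse_insight_filename` are hypothetical names, and restricting the slug to lowercase letters, digits, and hyphens is an assumption based on the example filenames rather than a stated rule.

```python
import re
from datetime import date
from pathlib import PurePosixPath
from typing import NamedTuple, Optional

# Assumed pattern: date prefix, then a lowercase hyphen-separated slug.
FILENAME_RE = re.compile(r"^(\d{4})-(\d{2})-(\d{2})-([a-z0-9]+(?:-[a-z0-9]+)*)\.md$")


class FileInfo(NamedTuple):  # hypothetical record for the inventory
    category: str
    created: Optional[date]
    title: str
    follows_convention: bool


def parse_insight_filename(path: str) -> FileInfo:
    """Derive category, date, and title from a lessons-learned file path."""
    p = PurePosixPath(path)
    category = p.parent.name  # category comes from the directory name
    match = FILENAME_RE.match(p.name)
    if match is None:
        # Name does not follow the convention: no date, flag for renaming.
        return FileInfo(category, None, p.stem, False)
    year, month, day, slug = match.groups()
    try:
        created = date(int(year), int(month), int(day))
    except ValueError:
        # Digits matched the pattern but do not form a real calendar date.
        return FileInfo(category, None, p.stem, False)
    return FileInfo(category, created, slug.replace("-", " "), True)
```

Files where `follows_convention` is false can be collected into the renaming suggestion, while `created` serves as the fallback date when session metadata is missing.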