Phase 1: Insight Discovery and Parsing

Purpose: Locate, read, deduplicate, and structure all insights from the project's lessons-learned directory.

Steps

1. Verify project structure

  • Ask user for project root directory (default: current working directory)
  • Check if docs/lessons-learned/ exists
  • If not found, explain the expected structure and offer to search alternative locations
  • List all categories found (testing, configuration, hooks-and-events, etc.)
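
For illustration, a minimal Python sketch of this check (assuming pathlib and the default layout above; names are illustrative):

from pathlib import Path

def verify_structure(project_root="."):
    # Resolve the expected lessons-learned directory
    lessons_dir = Path(project_root) / "docs" / "lessons-learned"
    if not lessons_dir.is_dir():
        # Expected layout missing: explain it and offer to search elsewhere
        return None, []
    # Each immediate subdirectory is treated as a category
    categories = sorted(p.name for p in lessons_dir.iterdir() if p.is_dir())
    return lessons_dir, categories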

2. Scan and catalog insight files

File Naming Convention: Files MUST follow: YYYY-MM-DD-descriptive-slug.md

  • Date prefix for chronological sorting
  • Descriptive slug (3-5 words) summarizing the insight topic
  • Examples:
    • 2025-11-21-jwt-refresh-token-pattern.md
    • 2025-11-20-vitest-mocking-best-practices.md
    • 2025-11-19-react-testing-library-queries.md
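
A hypothetical validator for this convention (the regex and helper name are illustrative, not part of the skill):

import re

FILENAME_RE = re.compile(r"^(\d{4}-\d{2}-\d{2})-([a-z0-9-]+)\.md$")

def parse_filename(name):
    m = FILENAME_RE.match(name)
    if not m:
        return None  # does not follow the convention; flag for renaming
    date, slug = m.groups()
    # Slug becomes a human-readable title:
    # "jwt-refresh-token-pattern" -> "jwt refresh token pattern"
    return date, slug.replace("-", " ")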

Scanning:

  • Use Glob tool to find all markdown files: docs/lessons-learned/**/*.md
  • For each file found, extract:
    • File path and category (from directory name)
    • Creation date (from filename prefix)
    • Descriptive title (from filename slug)
    • File size and line count
  • Build initial inventory report
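
A sketch of the scan, assuming the parse_filename helper above and files nested one level under a category directory:

from pathlib import Path

def scan_insight_files(lessons_dir):
    inventory = []
    for path in Path(lessons_dir).glob("**/*.md"):
        parsed = parse_filename(path.name)
        text = path.read_text(encoding="utf-8")
        inventory.append({
            "path": str(path),
            "category": path.parent.name,  # category from directory name
            "date": parsed[0] if parsed else None,
            "title": parsed[1] if parsed else path.stem,
            "size": path.stat().st_size,
            "line_count": text.count("\n") + 1,
        })
    return inventory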

3. Deduplicate insights (CRITICAL)

Why: The extraction hook may create duplicate entries within files.

Deduplication Algorithm:

import hashlib

def normalize(text):
    # Lowercase and collapse whitespace so formatting noise
    # does not defeat duplicate detection
    return " ".join(text.lower().split())

def log_duplicate(insight):
    # Record skipped duplicates for the summary report
    print(f"Duplicate skipped: {insight.title}")

def deduplicate_insights(insights):
    seen_hashes = set()
    unique_insights = []

    for insight in insights:
        # Stable hash of the normalized title plus first 200 chars of content
        key = normalize(insight.title + insight.content[:200])
        content_hash = hashlib.sha256(key.encode("utf-8")).hexdigest()

        if content_hash not in seen_hashes:
            seen_hashes.add(content_hash)
            unique_insights.append(insight)
        else:
            log_duplicate(insight)

    return unique_insights

Deduplication Checks:

  • Exact title match → duplicate
  • First 200 chars content match → duplicate
  • Same code blocks in same order → duplicate
  • Report: "Found X insights, removed Y duplicates (Z unique)"
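
A hypothetical rendering of that report using the function above:

unique = deduplicate_insights(all_insights)
removed = len(all_insights) - len(unique)
print(f"Found {len(all_insights)} insights, removed {removed} duplicates ({len(unique)} unique)")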

4. Parse individual insights

  • Read each file using Read tool
  • Extract session metadata (session ID, timestamp from file headers)
  • Split file content on --- separator (insights are separated by horizontal rules)
  • For each insight section:
    • Extract title (first line, often wrapped in **bold**)
    • Extract body content (remaining markdown)
    • Identify code blocks
    • Extract actionable items (lines starting with - [ ] or items in numbered lists); see the parsing sketch below
    • Note any warnings/cautions
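
A simplified parsing sketch, assuming insights are separated by --- on its own line and returning plain dicts:

import re

def parse_insight_file(text):
    # Insights within a file are separated by horizontal rules (---)
    sections = [s.strip() for s in re.split(r"\n---+\n", text) if s.strip()]
    insights = []
    for section in sections:
        lines = section.splitlines()
        # Title is the first line, often wrapped in **bold**
        title = lines[0].strip().strip("*").strip()
        content = "\n".join(lines[1:]).strip()
        # Actionable items: checkbox lines or numbered-list items
        actions = [l.strip() for l in lines
                   if l.strip().startswith("- [ ]") or re.match(r"\s*\d+\.\s", l)]
        insights.append({"title": title, "content": content, "action_items": actions})
    return insights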

5. Apply quality filters

Filter out shallow insights that are:

  • Basic explanatory notes without actionable steps
  • Simple definitions or concept explanations
  • Single-paragraph observations

Keep insights that have:

  • Actionable workflows (numbered steps, checklists)
  • Decision frameworks (trade-offs, when to use X vs Y)
  • Code patterns with explanation of WHY
  • Troubleshooting guides with solutions
  • Best practices with concrete examples
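
Hypothetical predicates for these checks (names and patterns are assumptions, used by the score calculation below):

import re

def has_actionable_items(text):
    # Markdown checklist items signal an actionable workflow
    return "- [ ]" in text

def has_numbered_steps(text):
    return re.search(r"^\s*\d+\.\s", text, re.MULTILINE) is not None

def has_code_examples(text):
    # Fenced code blocks in the source markdown
    return "```" in text

def has_warnings_or_notes(text):
    # Assumes warnings are flagged with words like "warning" or "caution"
    return re.search(r"\b(warning|caution|note)\b", text, re.IGNORECASE) is not None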

Quality Score Calculation:

def quality_score(insight):
    score = 0
    if has_actionable_items(insight.content): score += 3
    if has_code_examples(insight.content): score += 2
    if has_numbered_steps(insight.content): score += 2
    if len(insight.content.split()) > 200: score += 1
    if has_warnings_or_notes(insight.content): score += 1
    return score

# Minimum score for skill consideration: 4
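
Applied to the deduplicated list from step 3, for example:

MIN_SKILL_SCORE = 4
candidates = [i for i in unique if quality_score(i) >= MIN_SKILL_SCORE]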

6. Build structured insight inventory

{
  id: unique_id,
  title: string,
  content: string,
  category: string,
  date: ISO_date,
  session_id: string,
  source_file: path,
  code_examples: [{ language, code }],
  action_items: [string],
  keywords: [string],
  quality_score: int,
  paragraph_count: int,
  line_count: int
}
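
A sketch of the same shape as a Python dataclass, with fields mirroring the structure above:

from dataclasses import dataclass, field

@dataclass
class Insight:
    id: str
    title: str
    content: str
    category: str
    date: str            # ISO date, e.g. "2025-11-21"
    session_id: str
    source_file: str
    code_examples: list = field(default_factory=list)  # [{language, code}]
    action_items: list = field(default_factory=list)
    keywords: list = field(default_factory=list)
    quality_score: int = 0
    paragraph_count: int = 0
    line_count: int = 0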

7. Present discovery summary

  • Total insights found (before deduplication)
  • Duplicates removed
  • Low-quality insights filtered
  • Final count of unique, quality-filtered insights
  • Category breakdown
  • Date range (earliest to latest)
  • Preview of top 5 insights by quality score
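
One way the summary might be assembled, assuming the Insight dataclass above (output format is illustrative):

from collections import Counter

def discovery_summary(insights, total_found, duplicates_removed, low_quality_filtered):
    by_category = Counter(i.category for i in insights)
    dates = sorted(i.date for i in insights)
    top5 = sorted(insights, key=lambda i: i.quality_score, reverse=True)[:5]
    print(f"Found {total_found}, removed {duplicates_removed} duplicates, "
          f"filtered {low_quality_filtered} low-quality ({len(insights)} remain)")
    print("Categories:", dict(by_category))
    if dates:
        print(f"Date range: {dates[0]} to {dates[-1]}")
    for i in top5:
        print(f"  [{i.quality_score}] {i.title}")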

Output

Deduplicated, quality-filtered inventory of insights with metadata and categorization.

Common Issues

  • No lessons-learned directory: Ask if user wants to search elsewhere or exit
  • Empty files: Skip and report count of empty files
  • Malformed markdown: Log warning but continue parsing (best effort)
  • Missing session metadata: Use filename date as fallback
  • High duplicate count: Indicates a bug in the extraction hook; warn the user
  • All insights filtered as low-quality: Lower threshold or suggest manual curation
  • Files without descriptive names: Suggest renaming for better organization