# Knowledge Capture - Evaluation Scenarios

This directory contains test scenarios for validating the Knowledge Capture skill across different Claude models.

## Overview

These evaluations ensure that Knowledge Capture works consistently and effectively with:

- Claude 3 Haiku
- Claude 3 Sonnet
- Claude 3 Opus

Each scenario includes (an assumed file layout is sketched after this list):

- **Input**: The conversation or content to capture
- **Expected Behavior**: What the skill should do
- **Success Criteria**: How to verify success
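
As a rough illustration, the sketch below shows one way a scenario file might be laid out. The field names (`input`, `expected_behavior`, `success_criteria`, and so on) are assumptions for illustration, not the confirmed schema of the scenario JSON files.

```python
import json

# Hypothetical scenario layout; the field names are illustrative only and
# may not match the actual scenario-*.json files.
example_scenario = {
    "name": "scenario-1-meeting-notes",
    "type": "Meeting Summary Capture",
    "complexity": "Medium",
    "input": "Raw, unstructured meeting notes pasted here...",
    "expected_behavior": "Produce a structured meeting summary with decisions and action items.",
    "success_criteria": [
        "All action items identified",
        "Decisions documented",
        "Participants listed",
    ],
}

print(json.dumps(example_scenario, indent=2))
```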

## Running Evaluations

### Manual Testing

1. Copy the conversation/input from a scenario (a small helper for this step is sketched below)
2. Activate the Knowledge Capture skill in Claude Code
3. Request to capture it as specified
4. Evaluate the result against the success criteria
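
For step 1, a helper along these lines can pull the input text out of a scenario file so it can be pasted into Claude Code. It relies on the hypothetical `input` field sketched above, which may not match the actual file schema.

```python
import json
import sys
from pathlib import Path


def print_scenario_input(path: str) -> None:
    """Print a scenario's input so it can be copied into Claude Code."""
    scenario = json.loads(Path(path).read_text())
    print(scenario["input"])  # assumes the illustrative "input" field


if __name__ == "__main__":
    print_scenario_input(sys.argv[1])  # e.g. scenario-1-meeting-notes.json
```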

### Test Coverage

- Basic conversation capture
- Different documentation types
- Complex multi-topic discussions
- Action item extraction
- Decision documentation
- Error handling

## Scenarios

### scenario-1-meeting-notes.json

- **Type**: Meeting Summary Capture
- **Complexity**: Medium
- **Input**: Unstructured meeting notes from a team discussion
- **Expected Output**: Structured meeting summary with decisions and action items
- **Success Criteria**: All action items identified, decisions documented, participants listed

### scenario-2-architecture-discussion.json

- **Type**: Decision Record Capture
- **Complexity**: Medium-High
- **Input**: Complex technical architecture discussion
- **Expected Output**: Decision record with options considered and rationale
- **Success Criteria**: All alternatives documented, clear decision stated, consequences identified

### scenario-3-quick-context.json

- **Type**: Quick Capture
- **Complexity**: Low
- **Input**: Brief status update
- **Expected Output**: Quick reference document
- **Success Criteria**: Key points captured, properly formatted

### scenario-4-action-items-focus.json

- **Type**: Action Item Extraction
- **Complexity**: Medium
- **Input**: Meeting with many scattered action items
- **Expected Output**: Clear list of who, what, and when
- **Success Criteria**: All action items found, owners assigned, deadlines noted

### scenario-5-learning-from-incident.json

- **Type**: Learning Document
- **Complexity**: High
- **Input**: Post-incident discussion about what went wrong
- **Expected Output**: Learning document with best practices
- **Success Criteria**: Root causes identified, lessons extracted, preventive measures noted

## Evaluation Criteria

### Comprehensiveness

- Did the skill capture all key information?
- Were important details missed?
- Is the output complete?

### Structure

- Is the output well-organized?
- Are headings and sections appropriate?
- Does it follow the documentation type's structure?

### Actionability

- Are action items clear and assignable?
- Are next steps defined?
- Can someone understand what to do next?

### Linking

- Are cross-references included?
- Could this be discovered from related docs?
- Are important connections made?

### Accuracy

- Is the captured information accurate?
- Were details misinterpreted?
- Does it faithfully represent the source?
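
When recording results, a lightweight scorecard along the lines sketched below can keep runs comparable across models. The structure is an assumption for illustration; the criterion names simply mirror the headings above.

```python
from dataclasses import dataclass, field

# Criterion names mirror the Evaluation Criteria headings above.
CRITERIA = ("comprehensiveness", "structure", "actionability", "linking", "accuracy")


@dataclass
class ScenarioResult:
    """Illustrative per-run scorecard; not something the skill itself defines."""
    scenario: str                  # e.g. "scenario-3-quick-context"
    model: str                     # "haiku", "sonnet", or "opus"
    scores: dict[str, bool] = field(default_factory=dict)  # criterion -> met?
    notes: str = ""                # free-form observations

    def passed(self) -> bool:
        """A run passes only if every criterion was met."""
        return all(self.scores.get(c, False) for c in CRITERIA)


# Example usage with placeholder values:
result = ScenarioResult("scenario-3-quick-context", "sonnet",
                        {c: True for c in CRITERIA})
print(result.passed())  # True
```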

## Model-Specific Notes

### Haiku

- May need slightly more structured input
- Evaluate on accuracy and completeness
- Good for quick captures

### Sonnet

- Handles complex discussions well
- Good at inferring structure
- Balanced performance across all scenario types

### Opus

- Excels at inferring structure from unstructured input
- Best for complex, multi-topic discussions
- Highest accuracy on nuanced extraction

## Failure Modes to Watch For

1. **Orphaning**: Documentation created without linking to hub pages
2. **Structure Loss**: Loss of hierarchical organization
3. **Missing Details**: Key decisions or action items not captured
4. **Wrong Type**: Documentation created as the wrong type (e.g., an FAQ instead of a Decision Record)
5. **Broken Context**: Missing important contextual information
6. **Inaccuracy**: Misinterpretation of what was discussed

## Updating Scenarios

When updating scenarios:

1. Ensure they reflect real-world use cases
2. Include edge cases and challenging inputs
3. Keep success criteria clear and measurable
4. Test with all three model sizes
5. Document any model-specific behavior

## Expected Performance

**Target Success Rate**: 90%+ on Sonnet and Opus, 85%+ on Haiku
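
One simple way to tally this: a model's success rate is the fraction of its runs that met all success criteria, compared against the targets above. The snippet below is illustrative only, with placeholder outcomes rather than real results.

```python
TARGETS = {"haiku": 0.85, "sonnet": 0.90, "opus": 0.90}


def success_rate(passed_runs: list[bool]) -> float:
    """Fraction of evaluation runs that met all of their success criteria."""
    return sum(passed_runs) / len(passed_runs) if passed_runs else 0.0


# Placeholder outcomes, not real results: four passes and one failure.
rate = success_rate([True, True, True, True, False])
print(f"sonnet: {rate:.0%} (target {TARGETS['sonnet']:.0%})")  # sonnet: 80% (target 90%)
```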

When a scenario fails:

1. Review the specific failure mode
2. Check if it's a consistent issue or model-specific
3. Update the skill prompt if needed
4. Document the failure and resolution