Knowledge Capture - Evaluation Scenarios
This directory contains test scenarios for validating the Knowledge Capture skill across different Claude models.
Overview
These evaluations ensure that Knowledge Capture works consistently and effectively with:
- Claude 3 Haiku
- Claude 3 Sonnet
- Claude 3 Opus
Each scenario includes:
- Input: The conversation or content to capture
- Expected Behavior: What the skill should do
- Success Criteria: How to verify success
Running Evaluations
Manual Testing
1. Copy the conversation or input from a scenario
2. Activate the Knowledge Capture skill in Claude Code
3. Request a capture as the scenario specifies
4. Evaluate the result against the success criteria
Test Coverage
- Basic conversation capture
- Different documentation types
- Complex multi-topic discussions
- Action item extraction
- Decision documentation
- Error handling
Scenarios
scenario-1-meeting-notes.json
- Type: Meeting Summary Capture
- Complexity: Medium
- Input: Unstructured meeting notes from a team discussion
- Expected Output: Structured meeting summary with decisions and action items
- Success Criteria: All action items identified, decisions documented, participants listed
scenario-2-architecture-discussion.json
- Type: Decision Record Capture
- Complexity: Medium-High
- Input: Complex technical architecture discussion
- Expected Output: Decision record with options considered and rationale
- Success Criteria: All alternatives documented, clear decision stated, consequences identified
scenario-3-quick-context.json
- Type: Quick Capture
- Complexity: Low
- Input: Brief status update
- Expected Output: Quick reference document
- Success Criteria: Key points captured, properly formatted
scenario-4-action-items-focus.json
- Type: Action Item Extraction
- Complexity: Medium
- Input: Meeting with many scattered action items
- Expected Output: Clear list of who, what, when
- Success Criteria: All action items found, owners assigned, deadlines noted
scenario-5-learning-from-incident.json
- Type: Learning Document
- Complexity: High
- Input: Post-incident discussion about what went wrong
- Expected Output: Learning document with best practices
- Success Criteria: Root causes identified, lessons extracted, preventive measures noted
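Scenario files can be sanity-checked before a test run. A minimal sketch in Python; the field names below are assumptions inferred from the scenario descriptions above, not a documented schema:

```python
import json

# Fields each scenario file is assumed to carry. These names are
# hypothetical, inferred from the Type/Complexity/Input/Expected Output/
# Success Criteria structure above, not from a real schema definition.
REQUIRED_FIELDS = {"type", "complexity", "input", "expected_output", "success_criteria"}

def load_scenario(path):
    """Load a scenario JSON file and verify the assumed required fields exist."""
    with open(path) as f:
        scenario = json.load(f)
    missing = REQUIRED_FIELDS - scenario.keys()
    if missing:
        raise ValueError(f"{path}: missing fields {sorted(missing)}")
    return scenario
```

A check like this catches malformed scenario files before any model time is spent on them.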
Evaluation Criteria
Comprehensiveness
- Did the skill capture all key information?
- Were important details missed?
- Is the output complete?
Structure
- Is the output well-organized?
- Are headings and sections appropriate?
- Does it follow the documentation type's structure?
Actionability
- Are action items clear and assignable?
- Are next steps defined?
- Can someone understand what to do next?
Linking
- Are cross-references included?
- Could this be discovered from related docs?
- Are important connections made?
Accuracy
- Is the captured information accurate?
- Were details misinterpreted?
- Does it faithfully represent the source?
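The five criteria above can be tracked as a simple per-scenario rubric. A sketch, assuming each criterion is scored pass/fail by a human reviewer; the structure is illustrative, not part of the skill itself:

```python
# The five evaluation criteria from the section above.
CRITERIA = ["comprehensiveness", "structure", "actionability", "linking", "accuracy"]

def score_capture(results):
    """results maps criterion name -> True (pass) or False (fail).

    Returns the fraction of criteria passed. A scenario would count as a
    full success only when every criterion passes (score of 1.0).
    """
    unknown = set(results) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    passed = sum(results.get(c, False) for c in CRITERIA)
    return passed / len(CRITERIA)
```

Recording a per-criterion score (rather than a single pass/fail) makes it easier to spot which failure modes recur across models.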
Model-Specific Notes
Haiku
- May need slightly more structured input
- Evaluate primarily for accuracy and completeness
- Good for quick captures
Sonnet
- Handles complex discussions well
- Good at inferring structure
- Balanced performance across all scenario types
Opus
- Excels at inferring structure from unstructured input
- Best for complex, multi-topic discussions
- Highest accuracy on nuanced extraction
Failure Modes to Watch For
- Orphaning: Documentation created without linking to hub pages
- Structure Loss: Loss of hierarchical organization
- Missing Details: Key decisions or action items not captured
- Wrong Type: Documentation created as the wrong type (e.g., an FAQ instead of a Decision Record)
- Broken Context: Missing important contextual information
- Inaccuracy: Misinterpretation of what was discussed
Updating Scenarios
When updating scenarios:
- Ensure they reflect real-world use cases
- Include edge cases and challenging inputs
- Keep success criteria clear and measurable
- Test with all three model sizes
- Document any model-specific behavior
Expected Performance
Target Success Rate: 90%+ on Sonnet and Opus, 85%+ on Haiku
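Once per-model pass/fail results are collected, the targets above can be checked mechanically. A sketch; the target rates come from this document, everything else is illustrative:

```python
# Target success rates stated above: 85%+ on Haiku, 90%+ on Sonnet and Opus.
TARGETS = {"haiku": 0.85, "sonnet": 0.90, "opus": 0.90}

def meets_target(model, outcomes):
    """outcomes is a list of booleans, one per scenario run for a model.

    Returns (success_rate, True if the rate meets that model's target).
    """
    rate = sum(outcomes) / len(outcomes)
    return rate, rate >= TARGETS[model]
```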
When a scenario fails:
- Review the specific failure mode
- Check if it's a consistent issue or model-specific
- Update the skill prompt if needed
- Document the failure and resolution