# Knowledge Capture - Evaluation Scenarios

This directory contains test scenarios for validating the Knowledge Capture skill across different Claude models.

## Overview

These evaluations ensure that Knowledge Capture works consistently and effectively with:

- Claude 3 Haiku
- Claude 3 Sonnet
- Claude 3 Opus

Each scenario includes (an assumed file layout is sketched after this list):

- **Input**: The conversation or content to capture
- **Expected Behavior**: What the skill should do
- **Success Criteria**: How to verify success
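
As a rough illustration, the sketch below shows one way a scenario file might be laid out. The field names (`input`, `expected_behavior`, `success_criteria`, and so on) are assumptions for illustration, not the confirmed schema of the scenario JSON files.

```python
import json

# Hypothetical scenario layout; the field names are illustrative only and
# may not match the actual scenario-*.json files.
example_scenario = {
    "name": "scenario-1-meeting-notes",
    "type": "Meeting Summary Capture",
    "complexity": "Medium",
    "input": "Raw, unstructured meeting notes pasted here...",
    "expected_behavior": "Produce a structured meeting summary with decisions and action items.",
    "success_criteria": [
        "All action items identified",
        "Decisions documented",
        "Participants listed",
    ],
}

print(json.dumps(example_scenario, indent=2))
```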

## Running Evaluations

### Manual Testing

1. Copy the conversation/input from a scenario (a small helper for this step is sketched below)
2. Activate the Knowledge Capture skill in Claude Code
3. Request to capture it as specified
4. Evaluate the result against the success criteria
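
For step 1, a helper along these lines can pull the input text out of a scenario file so it can be pasted into Claude Code. It relies on the hypothetical `input` field sketched above, which may not match the actual file schema.

```python
import json
import sys
from pathlib import Path


def print_scenario_input(path: str) -> None:
    """Print a scenario's input so it can be copied into Claude Code."""
    scenario = json.loads(Path(path).read_text())
    print(scenario["input"])  # assumes the illustrative "input" field


if __name__ == "__main__":
    print_scenario_input(sys.argv[1])  # e.g. scenario-1-meeting-notes.json
```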

### Test Coverage

- Basic conversation capture
- Different documentation types
- Complex multi-topic discussions
- Action item extraction
- Decision documentation
- Error handling

## Scenarios

### scenario-1-meeting-notes.json

- **Type**: Meeting Summary Capture
- **Complexity**: Medium
- **Input**: Unstructured meeting notes from a team discussion
- **Expected Output**: Structured meeting summary with decisions and action items
- **Success Criteria**: All action items identified, decisions documented, participants listed

### scenario-2-architecture-discussion.json

- **Type**: Decision Record Capture
- **Complexity**: Medium-High
- **Input**: Complex technical architecture discussion
- **Expected Output**: Decision record with options considered and rationale
- **Success Criteria**: All alternatives documented, clear decision stated, consequences identified

### scenario-3-quick-context.json

- **Type**: Quick Capture
- **Complexity**: Low
- **Input**: Brief status update
- **Expected Output**: Quick reference document
- **Success Criteria**: Key points captured, properly formatted

### scenario-4-action-items-focus.json

- **Type**: Action Item Extraction
- **Complexity**: Medium
- **Input**: Meeting with many scattered action items
- **Expected Output**: Clear list of who, what, and when
- **Success Criteria**: All action items found, owners assigned, deadlines noted

### scenario-5-learning-from-incident.json

- **Type**: Learning Document
- **Complexity**: High
- **Input**: Post-incident discussion about what went wrong
- **Expected Output**: Learning document with best practices
- **Success Criteria**: Root causes identified, lessons extracted, preventive measures noted

## Evaluation Criteria

### Comprehensiveness

- Did the skill capture all key information?
- Were important details missed?
- Is the output complete?

### Structure

- Is the output well-organized?
- Are headings and sections appropriate?
- Does it follow the documentation type's structure?

### Actionability

- Are action items clear and assignable?
- Are next steps defined?
- Can someone understand what to do next?

### Linking

- Are cross-references included?
- Could this be discovered from related docs?
- Are important connections made?

### Accuracy

- Is the captured information accurate?
- Were details misinterpreted?
- Does it faithfully represent the source?
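
When recording results, a lightweight scorecard along the lines sketched below can keep runs comparable across models. The structure is an assumption for illustration; the criterion names simply mirror the headings above.

```python
from dataclasses import dataclass, field

# Criterion names mirror the Evaluation Criteria headings above.
CRITERIA = ("comprehensiveness", "structure", "actionability", "linking", "accuracy")


@dataclass
class ScenarioResult:
    """Illustrative per-run scorecard; not something the skill itself defines."""
    scenario: str                  # e.g. "scenario-3-quick-context"
    model: str                     # "haiku", "sonnet", or "opus"
    scores: dict[str, bool] = field(default_factory=dict)  # criterion -> met?
    notes: str = ""                # free-form observations

    def passed(self) -> bool:
        """A run passes only if every criterion was met."""
        return all(self.scores.get(c, False) for c in CRITERIA)


# Example usage with placeholder values:
result = ScenarioResult("scenario-3-quick-context", "sonnet",
                        {c: True for c in CRITERIA})
print(result.passed())  # True
```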

## Model-Specific Notes

### Haiku

- May need slightly more structured input
- Evaluate on accuracy and completeness
- Good for quick captures

### Sonnet

- Handles complex discussions well
- Good at inferring structure
- Balanced performance across all scenario types

### Opus

- Excels at inferring structure from unstructured input
- Best for complex, multi-topic discussions
- Highest accuracy on nuanced extraction

## Failure Modes to Watch For

1. **Orphaning**: Documentation created without linking to hub pages
2. **Structure Loss**: Loss of hierarchical organization
3. **Missing Details**: Key decisions or action items not captured
4. **Wrong Type**: Documentation created as the wrong type (e.g., an FAQ instead of a Decision Record)
5. **Broken Context**: Missing important contextual information
6. **Inaccuracy**: Misinterpretation of what was discussed

## Updating Scenarios

When updating scenarios:

1. Ensure they reflect real-world use cases
2. Include edge cases and challenging inputs
3. Keep success criteria clear and measurable
4. Test with all three model sizes
5. Document any model-specific behavior

## Expected Performance

**Target Success Rate**: 90%+ on Sonnet and Opus, 85%+ on Haiku
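
One simple way to tally this: a model's success rate is the fraction of its runs that met all success criteria, compared against the targets above. The snippet below is illustrative only, with placeholder outcomes rather than real results.

```python
TARGETS = {"haiku": 0.85, "sonnet": 0.90, "opus": 0.90}


def success_rate(passed_runs: list[bool]) -> float:
    """Fraction of evaluation runs that met all of their success criteria."""
    return sum(passed_runs) / len(passed_runs) if passed_runs else 0.0


# Placeholder outcomes, not real results: four passes and one failure.
rate = success_rate([True, True, True, True, False])
print(f"sonnet: {rate:.0%} (target {TARGETS['sonnet']:.0%})")  # sonnet: 80% (target 90%)
```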

When a scenario fails:

1. Review the specific failure mode
2. Check if it's a consistent issue or model-specific
3. Update the skill prompt if needed
4. Document the failure and resolution