# Knowledge Capture - Evaluation Scenarios

This directory contains test scenarios for validating the Knowledge Capture skill across different Claude models.

## Overview

These evaluations ensure that Knowledge Capture works consistently and effectively with:

- Claude 3 Haiku
- Claude 3 Sonnet
- Claude 3 Opus

Each scenario includes (see the example sketch after this list):

- **Input**: The conversation or content to capture
- **Expected Behavior**: What the skill should do
- **Success Criteria**: How to verify success
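
Concretely, a scenario file might look like the following sketch, written here as a Python literal for illustration; the field names (`name`, `input`, `expected_behavior`, `success_criteria`) are assumptions, not the canonical schema of the real `scenario-*.json` files:

```python
import json

# A scenario expressed as a Python literal. Field names are
# illustrative assumptions, not the canonical schema of the real files.
scenario = {
    "name": "scenario-1-meeting-notes",
    "input": "Unstructured meeting notes to capture...",
    "expected_behavior": "Produce a structured meeting summary",
    "success_criteria": [
        "All action items identified",
        "Decisions documented",
        "Participants listed",
    ],
}

# Serialized, this is roughly the shape a scenario-*.json file could take.
print(json.dumps(scenario, indent=2))
```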
## Running Evaluations

### Manual Testing

1. Copy the conversation/input from a scenario (the helper sketch below can print it)
2. Activate the Knowledge Capture skill in Claude Code
3. Ask Claude to capture the input as the scenario specifies
4. Evaluate the result against the success criteria
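
The capture step itself is interactive, but steps 1 and 4 can be supported with a small helper script. A minimal sketch, assuming the illustrative field names from the Overview:

```python
import json
import sys

def show_scenario(path: str) -> None:
    """Print a scenario's input (to paste into Claude Code) and its
    success criteria as a manual checklist."""
    with open(path) as f:
        scenario = json.load(f)  # assumes the illustrative fields sketched above
    print("=== Input to paste ===")
    print(scenario["input"])
    print("\n=== Success criteria ===")
    for criterion in scenario["success_criteria"]:
        print(f"[ ] {criterion}")

if __name__ == "__main__":
    show_scenario(sys.argv[1])  # e.g. scenario-1-meeting-notes.json
```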
### Test Coverage

- Basic conversation capture
- Different documentation types
- Complex multi-topic discussions
- Action item extraction
- Decision documentation
- Error handling
## Scenarios

### scenario-1-meeting-notes.json

**Type**: Meeting Summary Capture

**Complexity**: Medium

**Input**: Unstructured meeting notes from a team discussion

**Expected Output**: Structured meeting summary with decisions and action items

**Success Criteria**: All action items identified, decisions documented, participants listed
### scenario-2-architecture-discussion.json

**Type**: Decision Record Capture

**Complexity**: Medium-High

**Input**: Complex technical architecture discussion

**Expected Output**: Decision record with options considered and rationale

**Success Criteria**: All alternatives documented, clear decision stated, consequences identified
### scenario-3-quick-context.json

**Type**: Quick Capture

**Complexity**: Low

**Input**: Brief status update

**Expected Output**: Quick reference document

**Success Criteria**: Key points captured, properly formatted
### scenario-4-action-items-focus.json

**Type**: Action Item Extraction

**Complexity**: Medium

**Input**: Meeting with many scattered action items

**Expected Output**: Clear list of who, what, when

**Success Criteria**: All action items found, owners assigned, deadlines noted
### scenario-5-learning-from-incident.json

**Type**: Learning Document

**Complexity**: High

**Input**: Post-incident discussion about what went wrong

**Expected Output**: Learning document with best practices

**Success Criteria**: Root causes identified, lessons extracted, preventive measures noted
## Evaluation Criteria

Judge each run on the five dimensions below; a scoring sketch follows the Accuracy checklist.

### Comprehensiveness

- Did the skill capture all key information?
- Were important details missed?
- Is the output complete?

### Structure

- Is the output well-organized?
- Are headings and sections appropriate?
- Does it follow the documentation type's structure?

### Actionability

- Are action items clear and assignable?
- Are next steps defined?
- Can someone understand what to do next?

### Linking

- Are cross-references included?
- Could this document be discovered from related docs?
- Are important connections made?

### Accuracy

- Is the captured information accurate?
- Were details misinterpreted?
- Does it faithfully represent the source?
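
As a minimal sketch of how a run might be recorded against these five dimensions, scored pass/fail per dimension (the encoding is an assumption, not a prescribed scoring scheme):

```python
# The five dimensions above, scored pass/fail per run.
CRITERIA = ["comprehensiveness", "structure", "actionability", "linking", "accuracy"]

def score_run(results: dict) -> float:
    """Return the fraction of evaluation criteria passed for one scenario run."""
    return sum(bool(results[c]) for c in CRITERIA) / len(CRITERIA)

# Example: a run where action items lacked clear owners.
run = {
    "comprehensiveness": True,
    "structure": True,
    "actionability": False,
    "linking": True,
    "accuracy": True,
}
print(f"Criteria passed: {score_run(run):.0%}")  # Criteria passed: 80%
```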
## Model-Specific Notes

### Haiku

- May need slightly more structured input
- Evaluate on accuracy and completeness
- Good for quick captures

### Sonnet

- Handles complex discussions well
- Good at inferring structure
- Balanced performance across all scenario types

### Opus

- Excels at inferring structure from unstructured input
- Best for complex, multi-topic discussions
- Highest accuracy on nuanced extraction
## Failure Modes to Watch For

When reviewing captured output, watch for the following (a tagging sketch follows the list):

1. **Orphaning**: Documentation created without links to hub pages
2. **Structure Loss**: Hierarchical organization lost in the output
3. **Missing Details**: Key decisions or action items not captured
4. **Wrong Type**: Documentation created as the wrong type (e.g., a FAQ instead of a Decision Record)
5. **Broken Context**: Important contextual information missing
6. **Inaccuracy**: Misinterpretation of what was discussed
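
When documenting failures, consistent tags make patterns easier to spot across runs. A hypothetical tagging sketch; the enum and its values are illustrative, not part of the skill:

```python
from enum import Enum

class FailureMode(Enum):
    # Values mirror the numbered list above; the enum itself is hypothetical.
    ORPHANING = "orphaning"
    STRUCTURE_LOSS = "structure-loss"
    MISSING_DETAILS = "missing-details"
    WRONG_TYPE = "wrong-type"
    BROKEN_CONTEXT = "broken-context"
    INACCURACY = "inaccuracy"

# Example log entry: a run that produced a FAQ instead of a Decision Record.
entry = {"scenario": "scenario-2-architecture-discussion", "mode": FailureMode.WRONG_TYPE}
print(f"{entry['scenario']}: {entry['mode'].value}")
```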
## Updating Scenarios

When updating scenarios:

1. Ensure they reflect real-world use cases
2. Include edge cases and challenging inputs
3. Keep success criteria clear and measurable
4. Test with all three model sizes
5. Document any model-specific behavior
## Expected Performance

**Target Success Rate**: 90%+ on Sonnet and Opus, 85%+ on Haiku
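
Observed results can be checked against these targets mechanically. A minimal sketch; only the target percentages come from this section, the run counts are hypothetical:

```python
# Targets come from this section; the run counts are hypothetical.
TARGETS = {"haiku": 0.85, "sonnet": 0.90, "opus": 0.90}

passed = {"haiku": 17, "sonnet": 19, "opus": 19}  # runs meeting all criteria
total = 20  # hypothetical number of runs per model

for model, target in TARGETS.items():
    rate = passed[model] / total
    flag = "meets target" if rate >= target else "below target"
    print(f"{model}: {rate:.0%} (target {target:.0%}, {flag})")
```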
When a scenario fails:

1. Review the specific failure mode
2. Check if it's a consistent issue or model-specific
3. Update the skill prompt if needed
4. Document the failure and resolution
|