# Knowledge Capture - Evaluation Scenarios
This directory contains test scenarios for validating the Knowledge Capture skill across different Claude models.
## Overview
These evaluations ensure that Knowledge Capture works consistently and effectively with:
- Claude 3 Haiku
- Claude 3 Sonnet
- Claude 3 Opus
Each scenario includes:
- **Input**: The conversation or content to capture
- **Expected Behavior**: What the skill should do
- **Success Criteria**: How to verify success
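A scenario file might look roughly like the sketch below. The exact schema is not prescribed here, so the field names (`name`, `type`, `complexity`, `input`, `expected_behavior`, `success_criteria`) are illustrative assumptions rather than the actual format used by the JSON files in this directory.
```json
{
  "name": "scenario-1-meeting-notes",
  "type": "Meeting Summary Capture",
  "complexity": "medium",
  "input": "Raw notes: Alex says we ship Friday; Priya to update the runbook; revisit caching next sprint...",
  "expected_behavior": "Produce a structured meeting summary with decisions and action items",
  "success_criteria": [
    "All action items identified",
    "Decisions documented",
    "Participants listed"
  ]
}
```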
## Running Evaluations
### Manual Testing
1. Copy the conversation/input from a scenario
2. Activate Knowledge Capture skill in Claude Code
3. Ask Claude to capture the content as the scenario specifies
4. Evaluate the output against the success criteria
### Test Coverage
- Basic conversation capture
- Different documentation types
- Complex multi-topic discussions
- Action item extraction
- Decision documentation
- Error handling
## Scenarios
### scenario-1-meeting-notes.json
**Type**: Meeting Summary Capture
**Complexity**: Medium
**Input**: Unstructured meeting notes from a team discussion
**Expected Output**: Structured meeting summary with decisions and action items
**Success Criteria**: All action items identified, decisions documented, participants listed
### scenario-2-architecture-discussion.json
**Type**: Decision Record Capture
**Complexity**: Medium-High
**Input**: Complex technical architecture discussion
**Expected Output**: Decision record with options considered and rationale
**Success Criteria**: All alternatives documented, clear decision stated, consequences identified
### scenario-3-quick-context.json
**Type**: Quick Capture
**Complexity**: Low
**Input**: Brief status update
**Expected Output**: Quick reference document
**Success Criteria**: Key points captured, properly formatted
### scenario-4-action-items-focus.json
**Type**: Action Item Extraction
**Complexity**: Medium
**Input**: Meeting with many scattered action items
**Expected Output**: A clear list of action items covering who, what, and when
**Success Criteria**: All action items found, owners assigned, deadlines noted
### scenario-5-learning-from-incident.json
**Type**: Learning Document
**Complexity**: High
**Input**: Post-incident discussion about what went wrong
**Expected Output**: Learning document with best practices
**Success Criteria**: Root causes identified, lessons extracted, preventive measures noted
## Evaluation Criteria
### Comprehensiveness
- Did the skill capture all key information?
- Were important details missed?
- Is the output complete?
### Structure
- Is the output well-organized?
- Are headings and sections appropriate?
- Does it follow the documentation type's structure?
### Actionability
- Are action items clear and assignable?
- Are next steps defined?
- Can someone understand what to do next?
### Linking
- Are cross-references included?
- Could this be discovered from related docs?
- Are important connections made?
### Accuracy
- Is the captured information accurate?
- Were details misinterpreted?
- Does it faithfully represent the source?
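One way to apply these criteria consistently is to record a small score sheet per run. The layout below is only a suggestion; the field names (`scores`, `pass`, `notes`) and the 1-5 scale are assumptions, not an existing convention in this repository.
```json
{
  "scenario": "scenario-2-architecture-discussion",
  "model": "claude-3-sonnet",
  "scores": {
    "comprehensiveness": 4,
    "structure": 5,
    "actionability": 4,
    "linking": 3,
    "accuracy": 5
  },
  "pass": true,
  "notes": "One alternative was captured but not linked to the related design doc."
}
```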
## Model-Specific Notes
### Haiku
- May need slightly more structured input
- Evaluate primarily on accuracy and completeness
- Good for quick captures
### Sonnet
- Handles complex discussions well
- Good at inferring structure
- Balanced performance across all scenario types
### Opus
- Excels at inferring structure from unstructured input
- Best for complex, multi-topic discussions
- Highest accuracy on nuanced extraction
## Failure Modes to Watch For
1. **Orphaning**: Documentation created without linking to hub pages
2. **Structure Loss**: Loss of hierarchical organization
3. **Missing Details**: Key decisions or action items not captured
4. **Wrong Type**: Documentation created as the wrong type (e.g., a FAQ instead of a Decision Record)
5. **Broken Context**: Missing important contextual information
6. **Inaccuracy**: Misinterpretation of what was discussed
## Updating Scenarios
When updating scenarios:
1. Ensure they reflect real-world use cases
2. Include edge cases and challenging inputs
3. Keep success criteria clear and measurable
4. Test with all three model sizes
5. Document any model-specific behavior
## Expected Performance
**Target Success Rate**: 90%+ on Sonnet and Opus, 85%+ on Haiku
When a scenario fails:
1. Review the specific failure mode
2. Check whether the failure is consistent across models or specific to one model
3. Update the skill prompt if needed
4. Document the failure and resolution
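For step 4, a small structured log entry keeps failures and their fixes traceable over time. The shape below is a suggestion; none of these field names come from an existing format in this directory.
```json
{
  "date": "2025-11-30",
  "scenario": "scenario-4-action-items-focus",
  "model": "claude-3-haiku",
  "failure_mode": "Missing Details",
  "description": "Two action items without explicit owners were dropped from the output.",
  "consistent_across_models": false,
  "resolution": "Adjusted the skill prompt to list action items even when no owner is stated."
}
```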