Initial commit

Zhongwei Li
2025-11-30 09:02:23 +08:00
commit 5f44d04a64
18 changed files with 1976 additions and 0 deletions


@@ -0,0 +1,142 @@
# Knowledge Capture - Evaluation Scenarios
This directory contains test scenarios for validating the Knowledge Capture skill across different Claude models.
## Overview
These evaluations ensure that Knowledge Capture works consistently and effectively with:
- Claude 3 Haiku
- Claude 3 Sonnet
- Claude 3 Opus
Each scenario includes:
- **Input**: The conversation or content to capture
- **Expected Behavior**: What the skill should do
- **Success Criteria**: How to verify success
## Running Evaluations
### Manual Testing
1. Copy the conversation input from a scenario file (the helper below prints it)
2. Activate the Knowledge Capture skill in Claude Code
3. Ask it to capture the content as the documentation type the scenario specifies
4. Evaluate the result against the scenario's success criteria
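
For the copy step, a small helper can pull the raw input out of a scenario file. This is a minimal sketch, assuming the scenario JSON sits next to this README and follows the field layout of scenario-1-meeting-notes.json:

```python
import json
from pathlib import Path


def load_scenario(path: str) -> dict:
    """Load one scenario JSON file into a dict."""
    return json.loads(Path(path).read_text())


def show_input(scenario: dict) -> None:
    """Print the conversation text to paste into Claude Code."""
    print(f"Scenario: {scenario['name']} (complexity: {scenario['complexity']})")
    print(f"Context:  {scenario['input']['context']}")
    print()
    print(scenario["input"]["content"])


if __name__ == "__main__":
    show_input(load_scenario("scenario-1-meeting-notes.json"))
```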
### Test Coverage
- Basic conversation capture
- Different documentation types
- Complex multi-topic discussions
- Action item extraction
- Decision documentation
- Error handling
## Scenarios
### scenario-1-meeting-notes.json
**Type**: Meeting Summary Capture
**Complexity**: Medium
**Input**: Unstructured meeting notes from a team discussion
**Expected Output**: Structured meeting summary with decisions and action items
**Success Criteria**: All action items identified, decisions documented, participants listed
### scenario-2-architecture-discussion.json
**Type**: Decision Record Capture
**Complexity**: Medium-High
**Input**: Complex technical architecture discussion
**Expected Output**: Decision record with options considered and rationale
**Success Criteria**: All alternatives documented, clear decision stated, consequences identified
### scenario-3-quick-context.json
**Type**: Quick Capture
**Complexity**: Low
**Input**: Brief status update
**Expected Output**: Quick reference document
**Success Criteria**: Key points captured, properly formatted
### scenario-4-action-items-focus.json
**Type**: Action Item Extraction
**Complexity**: Medium
**Input**: Meeting with many scattered action items
**Expected Output**: A clear list of owners, tasks, and deadlines
**Success Criteria**: All action items found, owners assigned, deadlines noted
### scenario-5-learning-from-incident.json
**Type**: Learning Document
**Complexity**: High
**Input**: Post-incident discussion about what went wrong
**Expected Output**: Learning document with best practices
**Success Criteria**: Root causes identified, lessons extracted, preventive measures noted
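
All five files follow the same top-level shape. Here is a sketch of that schema as Python TypedDicts, with field names taken from scenario-1-meeting-notes.json (assuming the other four files follow it):

```python
from typing import TypedDict


class ScenarioInput(TypedDict):
    context: str  # one-line setting for the conversation
    content: str  # the raw text to capture


class SuccessCriteria(TypedDict):
    must_capture: list[str]    # hard requirements
    should_capture: list[str]  # nice-to-have details
    must_not_do: list[str]     # disqualifying behaviors


class Scenario(TypedDict):
    id: str
    name: str
    description: str
    complexity: str        # "low", "medium", "medium-high", or "high"
    input_type: str        # e.g. "meeting_notes"
    output_type: str       # e.g. "meeting_summary"
    input: ScenarioInput
    expected_output: dict  # shape varies with output_type
    success_criteria: SuccessCriteria
    evaluation_notes: dict
```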
## Evaluation Criteria
### Comprehensiveness
- Did the skill capture all key information?
- Were important details missed?
- Is the output complete?
### Structure
- Is the output well-organized?
- Are headings and sections appropriate?
- Does it follow the documentation type's structure?
### Actionability
- Are action items clear and assignable?
- Are next steps defined?
- Can someone understand what to do next?
### Linking
- Are cross-references included?
- Could this be discovered from related docs?
- Are important connections made?
### Accuracy
- Is the captured information accurate?
- Were details misinterpreted?
- Does it faithfully represent the source?
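
These criteria can be spot-checked automatically before the human pass. A rough sketch, assuming the captured output is plain text and the criteria dict comes straight from a scenario's success_criteria block (the keyword heuristic is our assumption, not part of the skill):

```python
def keywords(item: str) -> list[str]:
    """Distinctive words of a criterion; short filler words are dropped."""
    return [w for w in item.lower().split() if len(w) > 3]


def score_output(output: str, criteria: dict) -> dict:
    """First-pass check: a must_capture item counts as present when at
    least half of its distinctive words appear in the output. This is a
    rough proxy; borderline results still need a human read."""
    text = output.lower()
    missing = []
    for item in criteria["must_capture"]:
        words = keywords(item)
        hits = sum(w in text for w in words)
        if words and hits < len(words) / 2:
            missing.append(item)
    return {"passed": not missing, "missing": missing}
```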
## Model-Specific Notes
### Haiku
- May need slightly more structured input
- Evaluate primarily on accuracy and completeness
- Good for quick captures
### Sonnet
- Handles complex discussions well
- Good at inferring structure
- Balanced performance across all scenario types
### Opus
- Excels at inferring structure from unstructured input
- Best for complex, multi-topic discussions
- Highest accuracy on nuanced extraction
## Failure Modes to Watch For
1. **Orphaning**: Documentation created without linking to hub pages
2. **Structure Loss**: Loss of hierarchical organization
3. **Missing Details**: Key decisions or action items not captured
4. **Wrong Type**: Output captured as the wrong documentation type (e.g., an FAQ instead of a Decision Record)
5. **Broken Context**: Missing important contextual information
6. **Inaccuracy**: Misinterpretation of what was discussed
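
When logging failed runs (see the failure workflow under Expected Performance), consistent labels make tracking easier. A minimal sketch mapping the six modes above to an enum (the string values are hypothetical tags, not an existing convention):

```python
from enum import Enum


class FailureMode(Enum):
    """Tags for the six failure modes above, used when logging runs."""
    ORPHANING = "orphaning"
    STRUCTURE_LOSS = "structure_loss"
    MISSING_DETAILS = "missing_details"
    WRONG_TYPE = "wrong_type"
    BROKEN_CONTEXT = "broken_context"
    INACCURACY = "inaccuracy"
```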
## Updating Scenarios
When updating scenarios:
1. Ensure they reflect real-world use cases
2. Include edge cases and challenging inputs
3. Keep success criteria clear and measurable
4. Test with all three model sizes
5. Document any model-specific behavior
## Expected Performance
**Target Success Rate**: 90%+ on Sonnet and Opus, 85%+ on Haiku
When a scenario fails:
1. Review the specific failure mode
2. Check whether the failure is consistent across models or model-specific
3. Update the skill prompt if needed
4. Document the failure and resolution
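
A sketch of how the target rates might be checked, assuming each run is recorded as a (model, scenario_id, passed) tuple (the record format is an assumption):

```python
TARGETS = {"haiku": 0.85, "sonnet": 0.90, "opus": 0.90}


def success_rates(runs: list[tuple[str, str, bool]]) -> dict[str, float]:
    """Fraction of passing runs per model, from (model, scenario, passed) records."""
    totals: dict[str, list[int]] = {}
    for model, _scenario, passed in runs:
        totals.setdefault(model, [0, 0])
        totals[model][0] += passed
        totals[model][1] += 1
    return {m: passing / n for m, (passing, n) in totals.items()}


def below_target(rates: dict[str, float]) -> dict[str, float]:
    """Models whose measured rate misses the targets above."""
    return {m: r for m, r in rates.items() if r < TARGETS.get(m, 0.90)}
```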


@@ -0,0 +1,99 @@
{
"id": "scenario-1-meeting-notes",
"name": "Meeting Notes Capture",
"description": "Convert unstructured meeting notes into a structured meeting summary",
"complexity": "medium",
"input_type": "meeting_notes",
"output_type": "meeting_summary",
"input": {
"context": "Team planning meeting about Q4 roadmap",
"content": "Had a meeting today with product, engineering, and design about Q4 priorities. We talked about the fact that customers have been asking for dark mode for a while - it's our most requested feature. Sarah (PM) said we should prioritize it, and Mike (design) showed mockups that look great. The engineering team (me, Alex, and Jordan) think we can get it done in 2 sprints if we focus. We also talked about the mobile app - James mentioned that iOS performance is slow when rendering large lists. We didn't decide on a timeline for that yet. Alex volunteered to look into virtualization. For dark mode, we decided to ship it with a toggle in settings that persists to the database. Next steps: Sarah will write the spec this week, Mike will finalize designs by end of week, and our team will start on implementation next sprint. We also have some tech debt from the old auth system that Mike thinks is slowing us down, but we deferred that to next quarter. Finally, we need a QA plan for dark mode - Jessica said she'd put together a testing plan."
},
"expected_output": {
"document_type": "meeting_summary",
"structure": {
"title": "Q4 Roadmap Planning - Team Meeting",
"sections": [
"Meeting Details",
"Key Decisions",
"Action Items",
"Topics Discussed",
"Deferred Items"
]
},
"key_decisions": [
{
"decision": "Prioritize dark mode implementation",
"rationale": "Most frequently requested customer feature"
},
{
"decision": "Dark mode will use toggle in settings with database persistence",
"rationale": "User preference persistence requirement"
}
],
"action_items": [
{
"owner": "Sarah",
"task": "Write dark mode specification",
"deadline": "This week"
},
{
"owner": "Mike",
"task": "Finalize dark mode design mockups",
"deadline": "End of week"
},
{
"owner": "Engineering team",
"task": "Begin dark mode implementation",
"deadline": "Next sprint"
},
{
"owner": "Alex",
"task": "Investigate list virtualization for iOS performance",
"deadline": "TBD"
},
{
"owner": "Jessica",
"task": "Create QA test plan for dark mode",
"deadline": "TBD"
}
],
"attendees": [
"Sarah (Product)",
"Mike (Design)",
"Engineering: self, Alex, Jordan",
"James (mentioned iOS issues)",
"Jessica (QA)"
]
},
"success_criteria": {
"must_capture": [
"Meeting purpose and date",
"All participants/roles",
"Both major decisions with rationale",
"All 5 action items with owners and deadlines",
"Topics discussed (dark mode, mobile performance, tech debt)",
"Implementation timeline (2 sprints for dark mode)"
],
"should_capture": [
"Design mockups mentioned",
"Tech debt deferred to Q1",
"Connection to customer requests"
],
"must_not_do": [
"Invent action items not mentioned",
"Misassign ownership",
"Lose deadline information",
"Create orphaned document without linking context"
]
},
"evaluation_notes": {
"difficulty": "Medium - requires extracting structure from unstructured narrative",
"key_challenge": "Identifying implicit decisions and action items from conversation flow",
"model_variations": {
"haiku": "May miss some implicit items, but should catch main points",
"sonnet": "Should capture all explicit items and most implicit ones",
"opus": "Should capture full context with implicit relationships"
}
}
}