# Knowledge Capture - Evaluation Scenarios
This directory contains test scenarios for validating the Knowledge Capture skill across different Claude models.
## Overview
These evaluations ensure that Knowledge Capture works consistently and effectively with:
- Claude 3 Haiku
- Claude 3 Sonnet
- Claude 3 Opus
Each scenario includes:
- **Input**: The conversation or content to capture
- **Expected Behavior**: What the skill should do
- **Success Criteria**: How to verify success
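A scenario file might look roughly like the sketch below. The exact schema is not prescribed here, so the field names (`name`, `type`, `complexity`, `input`, `expected_behavior`, `success_criteria`) are illustrative assumptions rather than the actual format used by the JSON files in this directory.
```json
{
  "name": "scenario-1-meeting-notes",
  "type": "Meeting Summary Capture",
  "complexity": "medium",
  "input": "Raw notes: Alex says we ship Friday; Priya to update the runbook; revisit caching next sprint...",
  "expected_behavior": "Produce a structured meeting summary with decisions and action items",
  "success_criteria": [
    "All action items identified",
    "Decisions documented",
    "Participants listed"
  ]
}
```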
## Running Evaluations
### Manual Testing
1. Copy the conversation/input from a scenario
2. Activate Knowledge Capture skill in Claude Code
3. Ask Claude to capture the content as the scenario specifies
4. Evaluate the output against the success criteria
### Test Coverage
- Basic conversation capture
- Different documentation types
- Complex multi-topic discussions
- Action item extraction
- Decision documentation
- Error handling
## Scenarios
### scenario-1-meeting-notes.json
**Type**: Meeting Summary Capture
**Complexity**: Medium
**Input**: Unstructured meeting notes from a team discussion
**Expected Output**: Structured meeting summary with decisions and action items
**Success Criteria**: All action items identified, decisions documented, participants listed
### scenario-2-architecture-discussion.json
**Type**: Decision Record Capture
**Complexity**: Medium-High
**Input**: Complex technical architecture discussion
**Expected Output**: Decision record with options considered and rationale
**Success Criteria**: All alternatives documented, clear decision stated, consequences identified
### scenario-3-quick-context.json
**Type**: Quick Capture
**Complexity**: Low
**Input**: Brief status update
**Expected Output**: Quick reference document
**Success Criteria**: Key points captured, properly formatted
### scenario-4-action-items-focus.json
**Type**: Action Item Extraction
**Complexity**: Medium
**Input**: Meeting with many scattered action items
**Expected Output**: A clear list of action items covering who, what, and when
**Success Criteria**: All action items found, owners assigned, deadlines noted
### scenario-5-learning-from-incident.json
**Type**: Learning Document
**Complexity**: High
**Input**: Post-incident discussion about what went wrong
**Expected Output**: Learning document with best practices
**Success Criteria**: Root causes identified, lessons extracted, preventive measures noted
## Evaluation Criteria
### Comprehensiveness
- Did the skill capture all key information?
- Were important details missed?
- Is the output complete?
### Structure
- Is the output well-organized?
- Are headings and sections appropriate?
- Does it follow the documentation type's structure?
### Actionability
- Are action items clear and assignable?
- Are next steps defined?
- Can someone understand what to do next?
### Linking
- Are cross-references included?
- Could this be discovered from related docs?
- Are important connections made?
### Accuracy
- Is the captured information accurate?
- Were details misinterpreted?
- Does it faithfully represent the source?
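One way to apply these criteria consistently is to record a small score sheet per run. The layout below is only a suggestion; the field names (`scores`, `pass`, `notes`) and the 1-5 scale are assumptions, not an existing convention in this repository.
```json
{
  "scenario": "scenario-2-architecture-discussion",
  "model": "claude-3-sonnet",
  "scores": {
    "comprehensiveness": 4,
    "structure": 5,
    "actionability": 4,
    "linking": 3,
    "accuracy": 5
  },
  "pass": true,
  "notes": "One alternative was captured but not linked to the related design doc."
}
```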
## Model-Specific Notes
### Haiku
- May need slightly more structured input
- Evaluate primarily on accuracy and completeness
- Good for quick captures
### Sonnet
- Handles complex discussions well
- Good at inferring structure
- Balanced performance across all scenario types
### Opus
- Excels at inferring structure from unstructured input
- Best for complex, multi-topic discussions
- Highest accuracy on nuanced extraction
## Failure Modes to Watch For
1. **Orphaning**: Documentation created without linking to hub pages
2. **Structure Loss**: Loss of hierarchical organization
3. **Missing Details**: Key decisions or action items not captured
4. **Wrong Type**: Documentation created as the wrong type (e.g., a FAQ instead of a Decision Record)
5. **Broken Context**: Missing important contextual information
6. **Inaccuracy**: Misinterpretation of what was discussed
## Updating Scenarios
When updating scenarios:
1. Ensure they reflect real-world use cases
2. Include edge cases and challenging inputs
3. Keep success criteria clear and measurable
4. Test with all three model sizes
5. Document any model-specific behavior
## Expected Performance
**Target Success Rate**: 90%+ on Sonnet and Opus, 85%+ on Haiku
When a scenario fails:
1. Review the specific failure mode
2. Check whether the failure is consistent across models or specific to one model
3. Update the skill prompt if needed
4. Document the failure and resolution
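For step 4, a small structured log entry keeps failures and their fixes traceable over time. The shape below is a suggestion; none of these field names come from an existing format in this directory.
```json
{
  "date": "2025-11-30",
  "scenario": "scenario-4-action-items-focus",
  "model": "claude-3-haiku",
  "failure_mode": "Missing Details",
  "description": "Two action items without explicit owners were dropped from the output.",
  "consistent_across_models": false,
  "resolution": "Adjusted the skill prompt to list action items even when no owner is stated."
}
```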