Knowledge Capture - Evaluation Scenarios
This directory contains test scenarios for validating the Knowledge Capture skill across different Claude models.
Overview
These evaluations ensure that Knowledge Capture works consistently and effectively with:
- Claude 3 Haiku
- Claude 3 Sonnet
- Claude 3 Opus
Each scenario includes:
- Input: The conversation or content to capture
- Expected Behavior: What the skill should do
- Success Criteria: How to verify success
Running Evaluations
Manual Testing
1. Copy the conversation or input from a scenario
2. Activate the Knowledge Capture skill in Claude Code
3. Request a capture as the scenario specifies
4. Evaluate the result against the success criteria
Test Coverage
- Basic conversation capture
- Different documentation types
- Complex multi-topic discussions
- Action item extraction
- Decision documentation
- Error handling
Scenarios
scenario-1-meeting-notes.json
- Type: Meeting Summary Capture
- Complexity: Medium
- Input: Unstructured meeting notes from a team discussion
- Expected Output: Structured meeting summary with decisions and action items
- Success Criteria: All action items identified, decisions documented, participants listed
scenario-2-architecture-discussion.json
- Type: Decision Record Capture
- Complexity: Medium-High
- Input: Complex technical architecture discussion
- Expected Output: Decision record with options considered and rationale
- Success Criteria: All alternatives documented, clear decision stated, consequences identified
scenario-3-quick-context.json
- Type: Quick Capture
- Complexity: Low
- Input: Brief status update
- Expected Output: Quick reference document
- Success Criteria: Key points captured, properly formatted
scenario-4-action-items-focus.json
- Type: Action Item Extraction
- Complexity: Medium
- Input: Meeting with many scattered action items
- Expected Output: Clear list of who, what, when
- Success Criteria: All action items found, owners assigned, deadlines noted
scenario-5-learning-from-incident.json
- Type: Learning Document
- Complexity: High
- Input: Post-incident discussion about what went wrong
- Expected Output: Learning document with best practices
- Success Criteria: Root causes identified, lessons extracted, preventive measures noted
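Scenario files can be sanity-checked before a test run. A minimal sketch in Python; the field names below are assumptions inferred from the scenario descriptions above, not a documented schema:

```python
import json

# Fields each scenario file is assumed to carry. These names are
# hypothetical, inferred from the Type/Complexity/Input/Expected Output/
# Success Criteria structure above, not from a real schema definition.
REQUIRED_FIELDS = {"type", "complexity", "input", "expected_output", "success_criteria"}

def load_scenario(path):
    """Load a scenario JSON file and verify the assumed required fields exist."""
    with open(path) as f:
        scenario = json.load(f)
    missing = REQUIRED_FIELDS - scenario.keys()
    if missing:
        raise ValueError(f"{path}: missing fields {sorted(missing)}")
    return scenario
```

A check like this catches malformed scenario files before any model time is spent on them.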
Evaluation Criteria
Comprehensiveness
- Did the skill capture all key information?
- Were important details missed?
- Is the output complete?
Structure
- Is the output well-organized?
- Are headings and sections appropriate?
- Does it follow the documentation type's structure?
Actionability
- Are action items clear and assignable?
- Are next steps defined?
- Can someone understand what to do next?
Linking
- Are cross-references included?
- Could this be discovered from related docs?
- Are important connections made?
Accuracy
- Is the captured information accurate?
- Were details misinterpreted?
- Does it faithfully represent the source?
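The five criteria above can be tracked as a simple per-scenario rubric. A sketch, assuming each criterion is scored pass/fail by a human reviewer; the structure is illustrative, not part of the skill itself:

```python
# The five evaluation criteria from the section above.
CRITERIA = ["comprehensiveness", "structure", "actionability", "linking", "accuracy"]

def score_capture(results):
    """results maps criterion name -> True (pass) or False (fail).

    Returns the fraction of criteria passed. A scenario would count as a
    full success only when every criterion passes (score of 1.0).
    """
    unknown = set(results) - set(CRITERIA)
    if unknown:
        raise ValueError(f"unknown criteria: {sorted(unknown)}")
    passed = sum(results.get(c, False) for c in CRITERIA)
    return passed / len(CRITERIA)
```

Recording a per-criterion score (rather than a single pass/fail) makes it easier to spot which failure modes recur across models.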
Model-Specific Notes
Haiku
- May need slightly more structured input
- Evaluate primarily for accuracy and completeness
- Good for quick captures
Sonnet
- Handles complex discussions well
- Good at inferring structure
- Balanced performance across all scenario types
Opus
- Excels at inferring structure from unstructured input
- Best for complex, multi-topic discussions
- Highest accuracy on nuanced extraction
Failure Modes to Watch For
- Orphaning: Documentation created without linking to hub pages
- Structure Loss: Loss of hierarchical organization
- Missing Details: Key decisions or action items not captured
- Wrong Type: Documentation created as the wrong type (e.g., an FAQ instead of a Decision Record)
- Broken Context: Missing important contextual information
- Inaccuracy: Misinterpretation of what was discussed
Updating Scenarios
When updating scenarios:
- Ensure they reflect real-world use cases
- Include edge cases and challenging inputs
- Keep success criteria clear and measurable
- Test with all three model sizes
- Document any model-specific behavior
Expected Performance
Target Success Rate: 90%+ on Sonnet and Opus, 85%+ on Haiku
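Once per-model pass/fail results are collected, the targets above can be checked mechanically. A sketch; the target rates come from this document, everything else is illustrative:

```python
# Target success rates stated above: 85%+ on Haiku, 90%+ on Sonnet and Opus.
TARGETS = {"haiku": 0.85, "sonnet": 0.90, "opus": 0.90}

def meets_target(model, outcomes):
    """outcomes is a list of booleans, one per scenario run for a model.

    Returns (success_rate, True if the rate meets that model's target).
    """
    rate = sum(outcomes) / len(outcomes)
    return rate, rate >= TARGETS[model]
```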
When a scenario fails:
- Review the specific failure mode
- Check if it's a consistent issue or model-specific
- Update the skill prompt if needed
- Document the failure and resolution