Knowledge Capture - Evaluation Scenarios

This directory contains test scenarios for validating the Knowledge Capture skill across different Claude models.

Overview

These evaluations ensure that Knowledge Capture works consistently and effectively with:

  • Claude 3 Haiku
  • Claude 3 Sonnet
  • Claude 3 Opus

Each scenario includes:

  • Input: The conversation or content to capture
  • Expected Behavior: What the skill should do
  • Success Criteria: How to verify success
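
A scenario file might look like the following minimal sketch. The field names are illustrative assumptions, not the repository's confirmed schema; check the actual JSON files in this directory.

```python
import json

# Hypothetical scenario layout; the field names are assumptions,
# not the confirmed schema of the JSON files in this directory.
scenario = {
    "name": "scenario-1-meeting-notes",
    "type": "Meeting Summary Capture",
    "complexity": "Medium",
    "input": "Unstructured meeting notes from a team discussion",
    "expected_behavior": "Structured meeting summary with decisions and action items",
    "success_criteria": [
        "All action items identified",
        "Decisions documented",
        "Participants listed",
    ],
}

print(json.dumps(scenario, indent=2))
```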

Running Evaluations

Manual Testing

  1. Copy the conversation or input from a scenario
  2. Activate the Knowledge Capture skill in Claude Code
  3. Ask Claude to capture the content as the scenario specifies
  4. Evaluate the output against the scenario's success criteria
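
To streamline steps 1 and 4, a small helper can load a scenario and print its input and success criteria. This is a sketch that assumes the illustrative field names from the example above; adjust them to the real schema.

```python
import json
import sys

def show_scenario(path: str) -> None:
    """Print a scenario's input and success criteria for manual testing."""
    with open(path, encoding="utf-8") as f:
        scenario = json.load(f)
    # "input" and "success_criteria" are assumed field names.
    print("=== Input (paste into Claude Code) ===")
    print(scenario.get("input", "<no input field>"))
    print("\n=== Success criteria ===")
    for criterion in scenario.get("success_criteria", []):
        print(f"- {criterion}")

if __name__ == "__main__":
    show_scenario(sys.argv[1])  # e.g. scenario-1-meeting-notes.json
```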

Test Coverage

  • Basic conversation capture
  • Different documentation types
  • Complex multi-topic discussions
  • Action item extraction
  • Decision documentation
  • Error handling

Scenarios

scenario-1-meeting-notes.json

  • Type: Meeting Summary Capture
  • Complexity: Medium
  • Input: Unstructured meeting notes from a team discussion
  • Expected Output: Structured meeting summary with decisions and action items
  • Success Criteria: All action items identified, decisions documented, participants listed

scenario-2-architecture-discussion.json

  • Type: Decision Record Capture
  • Complexity: Medium-High
  • Input: Complex technical architecture discussion
  • Expected Output: Decision record with options considered and rationale
  • Success Criteria: All alternatives documented, clear decision stated, consequences identified

scenario-3-quick-context.json

  • Type: Quick Capture
  • Complexity: Low
  • Input: Brief status update
  • Expected Output: Quick reference document
  • Success Criteria: Key points captured, properly formatted

scenario-4-action-items-focus.json

  • Type: Action Item Extraction
  • Complexity: Medium
  • Input: Meeting with many scattered action items
  • Expected Output: Clear list of who, what, when
  • Success Criteria: All action items found, owners assigned, deadlines noted

scenario-5-learning-from-incident.json

  • Type: Learning Document
  • Complexity: High
  • Input: Post-incident discussion about what went wrong
  • Expected Output: Learning document with best practices
  • Success Criteria: Root causes identified, lessons extracted, preventive measures noted

Evaluation Criteria

Comprehensiveness

  • Did the skill capture all key information?
  • Were important details missed?
  • Is the output complete?

Structure

  • Is the output well-organized?
  • Are headings and sections appropriate?
  • Does it follow the documentation type's structure?

Actionability

  • Are action items clear and assignable?
  • Are next steps defined?
  • Can someone understand what to do next?

Linking

  • Are cross-references included?
  • Could this page be discovered from related docs?
  • Are important connections made?

Accuracy

  • Is the captured information accurate?
  • Were details misinterpreted?
  • Does it faithfully represent the source?
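
One way to keep manual evaluation consistent is to record these five criteria as a simple rubric per run. A minimal sketch; the 0-2 scale and the pass threshold are assumptions, not part of the skill:

```python
from dataclasses import dataclass, asdict

@dataclass
class RubricScore:
    """Scores for one scenario run: 0 = fails, 1 = partial, 2 = fully meets."""
    comprehensiveness: int
    structure: int
    actionability: int
    linking: int
    accuracy: int

    def passed(self, threshold: float = 0.8) -> bool:
        # A run passes if it earns at least `threshold` of the maximum score.
        scores = list(asdict(self).values())
        return sum(scores) / (2 * len(scores)) >= threshold

# Example: a strong run that missed some cross-references.
run = RubricScore(comprehensiveness=2, structure=2, actionability=2, linking=1, accuracy=2)
print(run.passed())  # True: 9/10 = 90%
```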

Model-Specific Notes

Haiku

  • May need slightly more structured input
  • Evaluate primarily on accuracy and completeness
  • Good for quick captures

Sonnet

  • Handles complex discussions well
  • Good at inferring structure
  • Balanced performance across all scenario types

Opus

  • Excels at inferring structure from unstructured input
  • Best for complex, multi-topic discussions
  • Highest accuracy on nuanced extraction

Failure Modes to Watch For

  1. Orphaning: Documentation created without linking to hub pages
  2. Structure Loss: Hierarchical organization flattened or dropped
  3. Missing Details: Key decisions or action items not captured
  4. Wrong Type: Output created as the wrong documentation type (e.g., a FAQ instead of a Decision Record)
  5. Broken Context: Missing important contextual information
  6. Inaccuracy: Misinterpretation of what was discussed
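
Some of these failure modes can be spot-checked mechanically before a human review. A rough sketch for orphaning (1) and missing details (3); the heuristics are assumptions and would need the captured page's content, e.g. exported as Markdown:

```python
def is_orphaned(page_markdown: str) -> bool:
    """Failure mode 1: flag pages with no outbound links at all."""
    return "](" not in page_markdown and "http" not in page_markdown

def missing_action_items(page_markdown: str, expected: list[str]) -> list[str]:
    """Failure mode 3: return expected action items that never appear."""
    text = page_markdown.lower()
    return [item for item in expected if item.lower() not in text]
```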

Updating Scenarios

When updating scenarios:

  1. Ensure they reflect real-world use cases
  2. Include edge cases and challenging inputs
  3. Keep success criteria clear and measurable
  4. Test with all three model sizes
  5. Document any model-specific behavior
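
A quick validation pass after editing keeps the scenario files consistent. A sketch, again assuming the illustrative field names:

```python
import json
from pathlib import Path

# Assumed required fields; align with the actual scenario schema.
REQUIRED_FIELDS = {"type", "complexity", "input", "expected_behavior", "success_criteria"}

def validate_scenarios(directory: str = ".") -> None:
    for path in sorted(Path(directory).glob("scenario-*.json")):
        scenario = json.loads(path.read_text(encoding="utf-8"))
        missing = REQUIRED_FIELDS - scenario.keys()
        if missing:
            print(f"{path.name}: missing fields {sorted(missing)}")
        elif not scenario["success_criteria"]:
            print(f"{path.name}: success criteria must not be empty")

validate_scenarios()
```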

Expected Performance

Target Success Rate: 90%+ on Sonnet and Opus, 85%+ on Haiku
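
Success rate is simply passing runs divided by total runs per model, compared against these targets. A minimal sketch:

```python
TARGETS = {"haiku": 0.85, "sonnet": 0.90, "opus": 0.90}

def meets_target(model: str, passes: int, total: int) -> bool:
    """Compare a model's pass rate against its target threshold."""
    rate = passes / total
    print(f"{model}: {passes}/{total} = {rate:.0%} (target {TARGETS[model]:.0%})")
    return rate >= TARGETS[model]

meets_target("haiku", 4, 5)  # 80%, below the 85% target
```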

When a scenario fails:

  1. Review the specific failure mode
  2. Check if it's a consistent issue or model-specific
  3. Update the skill prompt if needed
  4. Document the failure and resolution