Validation and Quality Assurance Guide

Overview

Validation quantifies extraction accuracy using precision, recall, and F1 metrics by comparing automated extraction against manually annotated ground truth.

When to Validate

  • Before production use - Establish baseline quality
  • After schema changes - Verify improvements
  • When comparing models - Test Haiku vs Sonnet vs Ollama
  • For publication - Report extraction quality metrics

Recommended Validation Set Size

  • Small projects (<100 papers): 10-20 papers
  • Medium projects (100-500 papers): 20-50 papers
  • Large projects (>500 papers): 50-100 papers

Step 7: Prepare Validation Set

Sample papers for manual annotation using one of three strategies.

Random Sampling (General Quality)

python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy random \
  --output validation_set.json

Provides an overall quality estimate but may miss rare cases.

Stratified Sampling (Identify Weaknesses)

python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy stratified \
  --output validation_set.json

Samples papers with different characteristics:

  • Papers with no records
  • Papers with few records (1-2)
  • Papers with medium records (3-5)
  • Papers with many records (6+)

Best for identifying weak points in extraction.
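
For intuition, the following is a minimal sketch of stratification by record count, assuming cleaned_data.json maps paper IDs to objects with a records list; the real logic lives in scripts/07_prepare_validation_set.py and may differ in its details.

import json
import random

random.seed(42)  # reproducible sampling

with open("cleaned_data.json") as f:
    results = json.load(f)  # assumed shape: {paper_id: {"records": [...]}}

def bin_for(n_records):
    if n_records == 0:
        return "none"
    if n_records <= 2:
        return "few"
    if n_records <= 5:
        return "medium"
    return "many"

bins = {"none": [], "few": [], "medium": [], "many": []}
for paper_id, data in results.items():
    bins[bin_for(len(data.get("records", [])))].append(paper_id)

# Draw roughly equal numbers of papers from each bin
sample_size = 20
per_bin = max(1, sample_size // len(bins))
validation_sample = []
for papers in bins.values():
    validation_sample.extend(random.sample(papers, min(per_bin, len(papers))))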

Diverse Sampling (Comprehensive)

python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy diverse \
  --output validation_set.json

Maximizes diversity across different paper types.

Step 8: Manual Annotation

Annotation Process

  1. Open validation file:

    # Use your preferred JSON editor
    code validation_set.json  # VS Code
    vim validation_set.json   # Vim
    
  2. For each paper in validation_papers:

    • Locate and read the original PDF
    • Extract data according to the schema
    • Fill the ground_truth field with the correct extraction
    • Its structure should match automated_extraction
  3. Fill metadata fields:

    • annotator: Your name
    • annotation_date: YYYY-MM-DD
    • notes: Any ambiguous cases or comments

Annotation Tips

Be thorough:

  • Extract ALL relevant information, even if automated extraction missed it
  • This ensures accurate recall calculation

Be precise:

  • Use exact values as they appear in the paper
  • Follow the same schema structure as automated extraction

Be consistent:

  • Apply the same interpretation rules across all papers
  • Document interpretation decisions in notes

Mark ambiguities:

  • If a field is unclear, note it and make your best judgment
  • Consider having multiple annotators for inter-rater reliability

Example Annotation

{
  "paper_id_123": {
    "automated_extraction": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil"
        }
      ]
    },
    "ground_truth": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil",
          "state_province": "São Paulo"  // Automated missed this
        },
        {
          "species": "Bombus terrestris",  // Automated missed this record
          "location": "Brazil",
          "state_province": "São Paulo"
        }
      ]
    },
    "notes": "Automated extraction missed the state and second species",
    "annotator": "John Doe",
    "annotation_date": "2025-01-15"
  }
}
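
Before calculating metrics, a quick completeness check over the annotations can catch papers that were skipped. This is an optional sketch assuming the structure shown above; it is not part of the provided scripts.

import json

with open("validation_set.json") as f:
    validation = json.load(f)

# Papers may sit at the top level (as in the example above) or under a
# "validation_papers" key; handle both. The structure is assumed, not guaranteed.
papers = validation.get("validation_papers", validation)

for paper_id, entry in papers.items():
    missing = [k for k in ("ground_truth", "annotator", "annotation_date")
               if not entry.get(k)]
    if missing:
        print(f"{paper_id}: missing {', '.join(missing)}")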

Step 9: Calculate Validation Metrics

Basic Metrics Calculation

python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --output validation_metrics.json \
  --report validation_report.txt

Advanced Options

Fuzzy string matching:

python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --fuzzy-strings \
  --output validation_metrics.json

Normalizes whitespace and case for string comparisons.

Numeric tolerance:

python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --numeric-tolerance 0.01 \
  --output validation_metrics.json

Allows numeric values to differ from ground truth by up to the specified tolerance.

Ordered list comparison:

python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --list-order-matters \
  --output validation_metrics.json

Treats lists as ordered sequences instead of sets.
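
The three options above correspond roughly to the following comparison rules (an illustrative sketch; see scripts/08_calculate_validation_metrics.py for the actual implementation).

def strings_match(a, b, fuzzy=False):
    if fuzzy:
        # --fuzzy-strings: normalize whitespace and case before comparing
        a = " ".join(a.split()).lower()
        b = " ".join(b.split()).lower()
    return a == b

def numbers_match(a, b, tolerance=0.0):
    # --numeric-tolerance: allow small differences between values
    return abs(a - b) <= tolerance

def lists_match(a, b, order_matters=False):
    # --list-order-matters: compare as sequences; the default is set comparison
    return a == b if order_matters else set(a) == set(b)

print(strings_match(" São  Paulo", "são paulo", fuzzy=True))  # True
print(numbers_match(1.001, 1.0, tolerance=0.01))              # True
print(lists_match(["a", "b"], ["b", "a"]))                    # True (set comparison)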

Understanding the Metrics

Precision

Definition: Of the items extracted, what percentage are correct?

Formula: TP / (TP + FP)

Example: Extracted 10 species, 8 were correct → Precision = 80%

High precision, low recall: Conservative extraction (misses data)

Recall

Definition: Of the true items, what percentage were extracted?

Formula: TP / (TP + FN)

Example: Paper has 12 species, extracted 8 → Recall = 67%

Low precision, high recall: Liberal extraction (includes errors)

F1 Score

Definition: Harmonic mean of precision and recall

Formula: 2 × (Precision × Recall) / (Precision + Recall)

Use: Single metric balancing precision and recall
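
Combining the two examples above (10 items extracted with 8 correct, 12 true items in the paper), a minimal worked calculation:

def precision_recall_f1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 8 correct extractions, 2 incorrect, 4 missed
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"Precision {p:.1%}, Recall {r:.1%}, F1 {f1:.1%}")
# Precision 80.0%, Recall 66.7%, F1 72.7%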

Field-Level Metrics

Metrics are calculated for each field type:

Boolean fields:

  • True positives, false positives, false negatives

Numeric fields:

  • Exact match or within tolerance

String fields:

  • Exact or fuzzy match

List fields:

  • Set-based comparison (default)
  • Items in both (TP), in automated only (FP), in truth only (FN); see the sketch after this list

Nested objects:

  • Recursive field-by-field comparison
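
For list fields, the default set-based comparison can be sketched as follows (illustrative only; the example values are hypothetical):

def score_list(automated, truth):
    auto_set, truth_set = set(automated), set(truth)
    tp = len(auto_set & truth_set)   # items in both
    fp = len(auto_set - truth_set)   # in automated only
    fn = len(truth_set - auto_set)   # in ground truth only
    return tp, fp, fn

tp, fp, fn = score_list(
    ["Apis mellifera", "Apis cerana"],
    ["Apis mellifera", "Bombus terrestris"],
)
print(tp, fp, fn)  # 1 1 1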

Interpreting Results

Validation Report Structure

OVERALL METRICS
  Papers evaluated: 20
  Precision: 87.3%
  Recall: 79.2%
  F1 Score: 83.1%

METRICS BY FIELD
  Field                  Precision    Recall       F1
  species               95.2%        89.1%        92.0%
  location              82.3%        75.4%        78.7%
  method                91.0%        68.2%        77.9%

COMMON ISSUES
  Fields with low recall (missed information):
  - method: 68.2% recall, 12 missed items

  Fields with low precision (incorrect extractions):
  - location: 82.3% precision, 8 incorrect items

Using Results to Improve

Low Recall (Missing Information):

  • Review extraction prompt instructions
  • Add examples of the missed pattern
  • Emphasize completeness in prompt
  • Consider using a more capable model (Haiku → Sonnet)

Low Precision (Incorrect Extractions):

  • Add validation rules to prompt
  • Provide clearer field definitions
  • Add negative examples
  • Tighten extraction criteria

Field-Specific Issues:

  • Identify problematic field types
  • Revise schema definitions
  • Add field-specific instructions
  • Update examples

Inter-Rater Reliability (Optional)

For critical applications, have multiple annotators:

  1. Split validation set:

    • 10 papers: Single annotator
    • 10 papers: Both annotators independently
  2. Calculate agreement (see the sketch after this list):

    python scripts/08_calculate_validation_metrics.py \
      --annotations annotator1.json \
      --compare-with annotator2.json \
      --output agreement_metrics.json
    
  3. Resolve disagreements:

    • Discuss discrepancies
    • Establish interpretation guidelines
    • Re-annotate if needed
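
As a rough illustration of agreement, the sketch below computes simple percent agreement on the has_relevant_data flag for papers annotated by both people. The file structure is assumed from the example annotation above, and the script may use different or additional measures.

import json

with open("annotator1.json") as f:
    a1 = json.load(f)
with open("annotator2.json") as f:
    a2 = json.load(f)

shared = set(a1) & set(a2)  # papers annotated by both
if shared:
    agree = sum(
        a1[p]["ground_truth"]["has_relevant_data"]
        == a2[p]["ground_truth"]["has_relevant_data"]
        for p in shared
    )
    print(f"Agreement on has_relevant_data: {agree / len(shared):.1%} "
          f"({agree}/{len(shared)} papers)")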

Iterative Improvement Workflow

  1. Baseline: Run extraction with initial schema
  2. Validate: Calculate metrics on sample
  3. Analyze: Identify weak fields and error patterns
  4. Revise: Update schema, prompts, or model
  5. Re-extract: Run extraction with improvements
  6. Re-validate: Calculate new metrics
  7. Compare: Check if metrics improved
  8. Repeat: Until acceptable quality achieved

Reporting Validation in Publications

Include in methods section:

Extraction quality was assessed on a stratified random sample of
20 papers. Automated extraction achieved 87.3% precision (95% CI:
81.2-93.4%) and 79.2% recall (95% CI: 72.8-85.6%), with an F1
score of 83.1%. Field-level metrics ranged from 77.9% (method
descriptions) to 92.0% (species names).

Consider reporting:

  • Sample size and sampling strategy
  • Overall precision, recall, F1
  • Field-level metrics for key fields
  • Confidence intervals
  • Common error types
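
Confidence intervals like those in the example can be obtained, for instance, by bootstrap resampling over papers. The sketch below illustrates the idea for precision using hypothetical per-paper counts; it is not part of the provided scripts.

import random

random.seed(0)

# Hypothetical per-paper (true positive, false positive) counts
paper_counts = [(5, 1), (3, 0), (0, 0), (8, 2), (4, 1), (2, 1)]

def precision(papers):
    tp = sum(t for t, _ in papers)
    fp = sum(f for _, f in papers)
    return tp / (tp + fp) if tp + fp else 0.0

# Resample papers with replacement and recompute precision many times
estimates = sorted(
    precision(random.choices(paper_counts, k=len(paper_counts)))
    for _ in range(10_000)
)
lo, hi = estimates[int(0.025 * len(estimates))], estimates[int(0.975 * len(estimates))]
print(f"Precision: {precision(paper_counts):.1%} (95% CI: {lo:.1%}-{hi:.1%})")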