Validation and Quality Assurance Guide
Overview
Validation quantifies extraction accuracy using precision, recall, and F1 metrics by comparing automated extraction against manually annotated ground truth.
When to Validate
- Before production use - Establish baseline quality
- After schema changes - Verify improvements
- When comparing models - Test Haiku vs Sonnet vs Ollama
- For publication - Report extraction quality metrics
Recommended Sample Sizes
- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers
Step 7: Prepare Validation Set
Sample papers for manual annotation using one of three strategies.
Random Sampling (General Quality)
python scripts/07_prepare_validation_set.py \
--extraction-results cleaned_data.json \
--schema my_schema.json \
--sample-size 20 \
--strategy random \
--output validation_set.json
Provides overall quality estimate but may miss rare cases.
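If you want to reproduce the random strategy outside the script, the core operation is a simple random sample of paper IDs. A minimal sketch, assuming cleaned_data.json is a JSON object keyed by paper ID (the script's actual input and output layouts may differ):
import json
import random

with open("cleaned_data.json") as f:
    extractions = json.load(f)  # assumed: one entry per paper, keyed by paper ID

random.seed(42)  # fixed seed so the sample can be reproduced
sample_ids = random.sample(sorted(extractions), k=min(20, len(extractions)))

# Stub out a validation entry for each sampled paper.
validation_set = {
    pid: {
        "automated_extraction": extractions[pid],
        "ground_truth": None,
        "notes": "",
        "annotator": "",
        "annotation_date": "",
    }
    for pid in sample_ids
}

with open("validation_set.json", "w", encoding="utf-8") as f:
    json.dump(validation_set, f, indent=2, ensure_ascii=False)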
Stratified Sampling (Identify Weaknesses)
python scripts/07_prepare_validation_set.py \
--extraction-results cleaned_data.json \
--schema my_schema.json \
--sample-size 20 \
--strategy stratified \
--output validation_set.json
Samples papers with different characteristics:
- Papers with no records
- Papers with few records (1-2)
- Papers with medium records (3-5)
- Papers with many records (6+)
Best for identifying weak points in extraction.
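The stratification above amounts to binning papers by record count and sampling from each bin. A rough sketch of that idea (the script's actual binning rules may differ):
import json
import math
import random

def bin_label(n_records):
    # Mirror the strata described above: none, few, medium, many.
    if n_records == 0:
        return "none"
    if n_records <= 2:
        return "few"
    if n_records <= 5:
        return "medium"
    return "many"

with open("cleaned_data.json") as f:
    extractions = json.load(f)  # assumed: one entry per paper, keyed by paper ID

bins = {}
for pid, result in extractions.items():
    bins.setdefault(bin_label(len(result.get("records", []))), []).append(pid)

random.seed(42)
per_bin = math.ceil(20 / len(bins))  # spread the target sample size across strata
sample_ids = [pid for ids in bins.values()
              for pid in random.sample(ids, min(per_bin, len(ids)))]
# Note: the sample may slightly exceed the target when strata are uneven.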
Diverse Sampling (Comprehensive)
python scripts/07_prepare_validation_set.py \
--extraction-results cleaned_data.json \
--schema my_schema.json \
--sample-size 20 \
--strategy diverse \
--output validation_set.json
Maximizes diversity across different paper types.
Step 8: Manual Annotation
Annotation Process
- Open the validation file:
  # Use your preferred JSON editor
  code validation_set.json   # VS Code
  vim validation_set.json    # Vim
- For each paper in validation_papers:
  - Locate and read the original PDF
  - Extract data according to the schema
  - Fill the ground_truth field with the correct extraction
  - The structure should match automated_extraction
- Fill metadata fields:
  - annotator: Your name
  - annotation_date: YYYY-MM-DD
  - notes: Any ambiguous cases or comments
Annotation Tips
Be thorough:
- Extract ALL relevant information, even if automated extraction missed it
- This ensures accurate recall calculation
Be precise:
- Use exact values as they appear in the paper
- Follow the same schema structure as automated extraction
Be consistent:
- Apply the same interpretation rules across all papers
- Document interpretation decisions in notes
Mark ambiguities:
- If a field is unclear, note it and make your best judgment
- Consider having multiple annotators for inter-rater reliability
Example Annotation
{
"paper_id_123": {
"automated_extraction": {
"has_relevant_data": true,
"records": [
{
"species": "Apis mellifera",
"location": "Brazil"
}
]
},
"ground_truth": {
"has_relevant_data": true,
"records": [
{
"species": "Apis mellifera",
"location": "Brazil",
"state_province": "São Paulo" // Automated missed this
},
{
"species": "Bombus terrestris", // Automated missed this record
"location": "Brazil",
"state_province": "São Paulo"
}
]
},
"notes": "Automated extraction missed the state and second species",
"annotator": "John Doe",
"annotation_date": "2025-01-15"
}
}
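While annotating, a small helper can show which papers still lack a ground_truth entry. A sketch, assuming the validation file is a flat dict keyed by paper ID as in the example above (adjust if your file nests papers under a validation_papers key):
import json

with open("validation_set.json") as f:
    papers = json.load(f)

# Report papers whose ground_truth has not been filled in yet.
pending = [pid for pid, entry in papers.items() if not entry.get("ground_truth")]
print(f"{len(pending)} of {len(papers)} papers still need annotation:")
for pid in pending:
    print(" -", pid)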
Step 9: Calculate Validation Metrics
Basic Metrics Calculation
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--output validation_metrics.json \
--report validation_report.txt
Advanced Options
Fuzzy string matching:
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--fuzzy-strings \
--output validation_metrics.json
Normalizes whitespace and case for string comparisons.
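Conceptually, this fuzzy comparison amounts to normalizing both strings before testing equality. A minimal sketch of that normalization (the script may apply additional rules):
import re

def normalize(s):
    # Collapse runs of whitespace and ignore case differences.
    return re.sub(r"\s+", " ", s).strip().lower()

def fuzzy_equal(a, b):
    return normalize(a) == normalize(b)

print(fuzzy_equal("Apis  mellifera", "apis mellifera"))  # True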
Numeric tolerance:
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--numeric-tolerance 0.01 \
--output validation_metrics.json
Allows small differences in numeric values.
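The tolerance check is likely an absolute-difference comparison (it could also be relative; check the script). As a sketch:
def numeric_match(a, b, tolerance=0.01):
    # Treat two numbers as equal if they differ by at most the tolerance.
    return abs(a - b) <= tolerance

print(numeric_match(3.141, 3.14, tolerance=0.01))  # True
print(numeric_match(3.20, 3.14, tolerance=0.01))   # False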
Ordered list comparison:
python scripts/08_calculate_validation_metrics.py \
--annotations validation_set.json \
--list-order-matters \
--output validation_metrics.json
Treats lists as ordered sequences instead of sets.
Understanding the Metrics
Precision
Definition: Of the items extracted, what percentage are correct?
Formula: TP / (TP + FP)
Example: Extracted 10 species, 8 were correct → Precision = 80%
High precision, low recall: Conservative extraction (misses data)
Recall
Definition: Of the true items, what percentage were extracted?
Formula: TP / (TP + FN)
Example: Paper has 12 species, extracted 8 → Recall = 67%
Low precision, high recall: Liberal extraction (includes errors)
F1 Score
Definition: Harmonic mean of precision and recall
Formula: 2 × (Precision × Recall) / (Precision + Recall)
Use: Single metric balancing precision and recall
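Putting the three formulas together, as a worked sketch:
def precision_recall_f1(tp, fp, fn):
    # Guard against division by zero when nothing was extracted or expected.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return precision, recall, f1

# Worked example: 8 correct extractions, 2 wrong, 4 missed.
print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)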
Field-Level Metrics
Metrics are calculated for each field type:
Boolean fields:
- True positives, false positives, false negatives
Numeric fields:
- Exact match or within tolerance
String fields:
- Exact or fuzzy match
List fields:
- Set-based comparison (default; see the sketch after this list)
- Items in both (TP), in automated only (FP), in truth only (FN)
Nested objects:
- Recursive field-by-field comparison
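For list fields, the set-based comparison reduces to set intersection and differences. A sketch of how the TP/FP/FN counts can be derived (the script's internals may differ):
def list_field_counts(automated, truth):
    # Set-based comparison: order is ignored and duplicates collapse.
    auto_set, truth_set = set(automated), set(truth)
    tp = len(auto_set & truth_set)   # items in both
    fp = len(auto_set - truth_set)   # extracted but not in ground truth
    fn = len(truth_set - auto_set)   # in ground truth but missed
    return tp, fp, fn

print(list_field_counts(["Apis mellifera"],
                        ["Apis mellifera", "Bombus terrestris"]))  # (1, 0, 1)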
Interpreting Results
Validation Report Structure
OVERALL METRICS
Papers evaluated: 20
Precision: 87.3%
Recall: 79.2%
F1 Score: 83.1%
METRICS BY FIELD
Field Precision Recall F1
species 95.2% 89.1% 92.0%
location 82.3% 75.4% 78.7%
method 91.0% 68.2% 77.9%
COMMON ISSUES
Fields with low recall (missed information):
- method: 68.2% recall, 12 missed items
Fields with low precision (incorrect extractions):
- location: 82.3% precision, 8 incorrect items
Using Results to Improve
Low Recall (Missing Information):
- Review extraction prompt instructions
- Add examples of the missed pattern
- Emphasize completeness in prompt
- Consider using more capable model (Haiku → Sonnet)
Low Precision (Incorrect Extractions):
- Add validation rules to prompt
- Provide clearer field definitions
- Add negative examples
- Tighten extraction criteria
Field-Specific Issues:
- Identify problematic field types
- Revise schema definitions
- Add field-specific instructions
- Update examples
Inter-Rater Reliability (Optional)
For critical applications, have multiple annotators:
- Split the validation set:
  - 10 papers: Single annotator
  - 10 papers: Both annotators independently
- Calculate agreement:
  python scripts/08_calculate_validation_metrics.py \
    --annotations annotator1.json \
    --compare-with annotator2.json \
    --output agreement_metrics.json
- Resolve disagreements:
  - Discuss discrepancies
  - Establish interpretation guidelines
  - Re-annotate if needed
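If you want a quick agreement number before running the comparison command above, a coarse per-paper check works as a starting point (Cohen's kappa or a field-level comparison would be stronger). A sketch, assuming both annotation files are flat dicts keyed by paper ID as in the Step 8 example:
import json

def load_ground_truth(path):
    with open(path) as f:
        papers = json.load(f)
    return {pid: entry.get("ground_truth") for pid, entry in papers.items()}

a = load_ground_truth("annotator1.json")
b = load_ground_truth("annotator2.json")

shared = set(a) & set(b)
# Coarse agreement: fraction of doubly-annotated papers with identical ground truth.
agree = sum(a[pid] == b[pid] for pid in shared)
print(f"Exact agreement on {agree}/{len(shared)} papers")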
Iterative Improvement Workflow
- Baseline: Run extraction with initial schema
- Validate: Calculate metrics on sample
- Analyze: Identify weak fields and error patterns
- Revise: Update schema, prompts, or model
- Re-extract: Run extraction with improvements
- Re-validate: Calculate new metrics
- Compare: Check if metrics improved (see the sketch after this list)
- Repeat: Until acceptable quality achieved
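To track whether an iteration actually helped, it can be convenient to diff two metrics files. A sketch, assuming each metrics JSON exposes overall precision, recall, and f1 under an "overall" key (the real field names in validation_metrics.json may differ):
import json

def overall_metrics(path):
    with open(path) as f:
        return json.load(f)["overall"]  # assumed key; adjust to the real file layout

baseline = overall_metrics("validation_metrics_v1.json")
revised = overall_metrics("validation_metrics_v2.json")
for metric in ("precision", "recall", "f1"):
    delta = revised[metric] - baseline[metric]
    print(f"{metric:9} {baseline[metric]:.3f} -> {revised[metric]:.3f} ({delta:+.3f})")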
Reporting Validation in Publications
Include in methods section:
Extraction quality was assessed on a stratified random sample of
20 papers. Automated extraction achieved 87.3% precision (95% CI:
81.2-93.4%) and 79.2% recall (95% CI: 72.8-85.6%), with an F1
score of 83.1%. Field-level metrics ranged from 77.9% (method
descriptions) to 92.0% (species names).
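Confidence intervals like those above can be estimated by bootstrapping over the validation papers. A sketch for precision, assuming you have per-paper (TP, FP, FN) counts (the numbers below are illustrative, not output of the script):
import random

# Hypothetical per-paper (tp, fp, fn) counts from the validation set.
paper_counts = [(5, 1, 2), (3, 0, 1), (8, 2, 1), (0, 0, 0), (4, 1, 3)]

def precision(counts):
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    return tp / (tp + fp) if (tp + fp) else 0.0

random.seed(0)
boot = []
for _ in range(2000):
    # Resample papers with replacement and recompute the metric.
    resample = [random.choice(paper_counts) for _ in paper_counts]
    boot.append(precision(resample))
boot.sort()
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(f"Precision {precision(paper_counts):.1%} (95% CI: {lo:.1%}-{hi:.1%})")
The same resampling loop can be reused for recall and F1 by swapping in the corresponding formula.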
Consider reporting:
- Sample size and sampling strategy
- Overall precision, recall, F1
- Field-level metrics for key fields
- Confidence intervals
- Common error types