# Validation and Quality Assurance Guide

## Overview

Validation quantifies extraction accuracy using precision, recall, and F1 metrics by comparing automated extraction against manually annotated ground truth.

## When to Validate

- **Before production use** - Establish baseline quality
- **After schema changes** - Verify improvements
- **When comparing models** - Test Haiku vs Sonnet vs Ollama
- **For publication** - Report extraction quality metrics

## Recommended Sample Sizes

- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers

## Step 7: Prepare Validation Set

Sample papers for manual annotation using one of three strategies.

### Random Sampling (General Quality)

```bash
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy random \
  --output validation_set.json
```

Provides overall quality estimate but may miss rare cases.

### Stratified Sampling (Identify Weaknesses)

```bash
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy stratified \
  --output validation_set.json
```

Samples papers with different characteristics:
- Papers with no records
- Papers with few records (1-2)
- Papers with medium records (3-5)
- Papers with many records (6+)

Best for identifying weak points in extraction.
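
To see roughly what the stratified strategy does (or to reproduce it without the script), the sketch below bins papers by record count using the same four bins listed above and samples from each bin. The structure assumed for `cleaned_data.json` (paper IDs mapping to results with a `records` list) is an assumption; check your own file and prefer the script for real runs.

```python
import json
import random

# Assumed layout: {"paper_id": {"has_relevant_data": bool, "records": [...]}, ...}
# (hypothetical; verify against your cleaned_data.json)
with open("cleaned_data.json") as f:
    extractions = json.load(f)

def bin_label(n_records: int) -> str:
    """Same bins as the stratified strategy described above."""
    if n_records == 0:
        return "none"
    if n_records <= 2:
        return "few"
    if n_records <= 5:
        return "medium"
    return "many"

# Group paper IDs by record-count bin.
bins = {}
for paper_id, result in extractions.items():
    label = bin_label(len(result.get("records", [])))
    bins.setdefault(label, []).append(paper_id)

# Draw roughly equal numbers from each bin.
sample_size = 20
per_bin = max(1, sample_size // len(bins))

sampled = []
for label, papers in bins.items():
    sampled.extend(random.sample(papers, min(per_bin, len(papers))))

print(f"Sampled {len(sampled)} papers across {len(bins)} strata")
```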

### Diverse Sampling (Comprehensive)

```bash
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy diverse \
  --output validation_set.json
```

Maximizes diversity across different paper types.

## Step 8: Manual Annotation

### Annotation Process

1. **Open validation file:**
   ```bash
   # Use your preferred JSON editor
   code validation_set.json   # VS Code
   vim validation_set.json    # Vim
   ```

2. **For each paper in `validation_papers`:**
   - Locate and read the original PDF
   - Extract data according to the schema
   - Fill the `ground_truth` field with the correct extraction
   - The structure should match `automated_extraction`

3. **Fill metadata fields:**
   - `annotator`: Your name
   - `annotation_date`: YYYY-MM-DD
   - `notes`: Any ambiguous cases or comments

### Annotation Tips

**Be thorough:**
- Extract ALL relevant information, even if automated extraction missed it
- This ensures accurate recall calculation

**Be precise:**
- Use exact values as they appear in the paper
- Follow the same schema structure as automated extraction

**Be consistent:**
- Apply the same interpretation rules across all papers
- Document interpretation decisions in notes

**Mark ambiguities:**
- If a field is unclear, note it and make your best judgment
- Consider having multiple annotators for inter-rater reliability

### Example Annotation

```json
{
  "paper_id_123": {
    "automated_extraction": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil"
        }
      ]
    },
    "ground_truth": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil",
          "state_province": "São Paulo"  // Automated missed this
        },
        {
          "species": "Bombus terrestris",  // Automated missed this record
          "location": "Brazil",
          "state_province": "São Paulo"
        }
      ]
    },
    "notes": "Automated extraction missed the state and second species",
    "annotator": "John Doe",
    "annotation_date": "2025-01-15"
  }
}
```
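
Before moving on to Step 9, it can help to confirm that every sampled paper has actually been annotated. A minimal check, assuming the file is shaped like the example above with papers either under a `validation_papers` key or at the top level (that layout is an assumption; adjust the keys to your file):

```python
import json

with open("validation_set.json") as f:
    data = json.load(f)

# Papers may sit under "validation_papers" or at the top level (assumption).
papers = data.get("validation_papers", data)

missing = []
for paper_id, entry in papers.items():
    # Treat an empty ground_truth or a missing annotator as "not yet annotated".
    if not entry.get("ground_truth") or not entry.get("annotator"):
        missing.append(paper_id)

if missing:
    print(f"{len(missing)} papers still need annotation:")
    for paper_id in missing:
        print(f"  - {paper_id}")
else:
    print("All papers annotated.")
```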

## Step 9: Calculate Validation Metrics

### Basic Metrics Calculation

```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --output validation_metrics.json \
  --report validation_report.txt
```

### Advanced Options

**Fuzzy string matching:**
```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --fuzzy-strings \
  --output validation_metrics.json
```

Normalizes whitespace and case for string comparisons.

**Numeric tolerance:**
```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --numeric-tolerance 0.01 \
  --output validation_metrics.json
```

Allows small differences in numeric values.

**Ordered list comparison:**
```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --list-order-matters \
  --output validation_metrics.json
```

Treats lists as ordered sequences instead of sets.
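
For intuition, the comparison behaviors behind these flags can be approximated as below. This is only a sketch of the general idea, not the script's actual implementation; in particular, whether the numeric tolerance is absolute or relative is an assumption here.

```python
def strings_match(a: str, b: str, fuzzy: bool = True) -> bool:
    """Fuzzy matching as described above: normalize whitespace and case before comparing."""
    if fuzzy:
        a = " ".join(a.lower().split())
        b = " ".join(b.lower().split())
    return a == b

def numbers_match(a: float, b: float, tolerance: float = 0.01) -> bool:
    """Numeric tolerance, treated as an absolute difference here (assumption)."""
    return abs(a - b) <= tolerance

print(strings_match("Apis  mellifera ", "apis mellifera"))  # True
print(numbers_match(3.141, 3.149))                          # True (within 0.01)
```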

## Understanding the Metrics

### Precision
**Definition:** Of the items extracted, what percentage are correct?

**Formula:** TP / (TP + FP)

**Example:** Extracted 10 species, 8 were correct → Precision = 80%

**High precision, low recall:** Conservative extraction (misses data)

### Recall
**Definition:** Of the true items, what percentage were extracted?

**Formula:** TP / (TP + FN)

**Example:** Paper has 12 species, extracted 8 → Recall = 67%

**Low precision, high recall:** Liberal extraction (includes errors)

### F1 Score
**Definition:** Harmonic mean of precision and recall

**Formula:** 2 × (Precision × Recall) / (Precision + Recall)

**Use:** Single metric balancing precision and recall
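
As a quick sanity check on the formulas, a few lines of Python reproduce the numbers from the examples above (10 items extracted, 8 of them correct, out of 12 true items):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute the three metrics from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 10 items extracted, 8 correct (TP=8, FP=2); 12 true items, so 4 were missed (FN=4).
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"Precision={p:.1%}  Recall={r:.1%}  F1={f1:.1%}")
# Precision=80.0%  Recall=66.7%  F1=72.7%
```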

### Field-Level Metrics

Metrics are calculated for each field type:

**Boolean fields:**
- True positives, false positives, false negatives

**Numeric fields:**
- Exact match or within tolerance

**String fields:**
- Exact or fuzzy match

**List fields:**
- Set-based comparison (default); see the sketch after this list
- Items in both (TP), in automated only (FP), in truth only (FN)

**Nested objects:**
- Recursive field-by-field comparison
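
For list fields, the set-based comparison amounts to intersecting the automated and ground-truth lists. A minimal sketch, assuming the list items are simple strings (the script may normalize values first):

```python
def compare_lists(automated: list[str], ground_truth: list[str]):
    """Set-based list comparison: shared items are TP, extras are FP, misses are FN."""
    auto, truth = set(automated), set(ground_truth)
    tp = len(auto & truth)   # in both
    fp = len(auto - truth)   # in automated only
    fn = len(truth - auto)   # in ground truth only
    return tp, fp, fn

tp, fp, fn = compare_lists(
    automated=["Apis mellifera"],
    ground_truth=["Apis mellifera", "Bombus terrestris"],
)
print(tp, fp, fn)  # 1 0 1 -> the second species was missed
```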

## Interpreting Results

### Validation Report Structure

```
OVERALL METRICS
Papers evaluated: 20
Precision: 87.3%
Recall: 79.2%
F1 Score: 83.1%

METRICS BY FIELD
Field       Precision   Recall   F1
species     95.2%       89.1%    92.0%
location    82.3%       75.4%    78.7%
method      91.0%       68.2%    77.9%

COMMON ISSUES
Fields with low recall (missed information):
- method: 68.2% recall, 12 missed items

Fields with low precision (incorrect extractions):
- location: 82.3% precision, 8 incorrect items
```

### Using Results to Improve

**Low Recall (Missing Information):**
- Review extraction prompt instructions
- Add examples of the missed pattern
- Emphasize completeness in the prompt
- Consider using a more capable model (Haiku → Sonnet)

**Low Precision (Incorrect Extractions):**
- Add validation rules to the prompt
- Provide clearer field definitions
- Add negative examples
- Tighten extraction criteria

**Field-Specific Issues:**
- Identify problematic field types
- Revise schema definitions
- Add field-specific instructions
- Update examples

## Inter-Rater Reliability (Optional)

For critical applications, have multiple annotators:

1. **Split the validation set:**
   - 10 papers: Single annotator
   - 10 papers: Both annotators independently

2. **Calculate agreement** (see also the sketch after this list):
   ```bash
   python scripts/08_calculate_validation_metrics.py \
     --annotations annotator1.json \
     --compare-with annotator2.json \
     --output agreement_metrics.json
   ```

3. **Resolve disagreements:**
   - Discuss discrepancies
   - Establish interpretation guidelines
   - Re-annotate if needed
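
Before running the comparison above, a rough first look at agreement is to compare the two annotators' `ground_truth` entries directly. The sketch below reports simple exact-match agreement on the papers both annotators covered; it assumes both files follow the annotation format shown earlier, and it is no substitute for the script's metrics or for chance-corrected measures such as Cohen's kappa.

```python
import json

def load_papers(path):
    """Load an annotation file; papers may sit under 'validation_papers' or at the top level (assumption)."""
    with open(path) as f:
        data = json.load(f)
    return data.get("validation_papers", data)

a = load_papers("annotator1.json")
b = load_papers("annotator2.json")

shared = sorted(set(a) & set(b))
if shared:
    agree = sum(1 for pid in shared if a[pid].get("ground_truth") == b[pid].get("ground_truth"))
    print(f"Exact agreement on {agree}/{len(shared)} shared papers ({agree / len(shared):.0%})")
else:
    print("No shared papers to compare")
```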

## Iterative Improvement Workflow

1. **Baseline:** Run extraction with initial schema
2. **Validate:** Calculate metrics on sample
3. **Analyze:** Identify weak fields and error patterns
4. **Revise:** Update schema, prompts, or model
5. **Re-extract:** Run extraction with improvements
6. **Re-validate:** Calculate new metrics
7. **Compare:** Check if metrics improved
8. **Repeat:** Until acceptable quality is achieved

## Reporting Validation in Publications

Include in the methods section:

```
Extraction quality was assessed on a stratified random sample of
20 papers. Automated extraction achieved 87.3% precision (95% CI:
81.2-93.4%) and 79.2% recall (95% CI: 72.8-85.6%), with an F1
score of 83.1%. Field-level metrics ranged from 77.9% (method
descriptions) to 92.0% (species names).
```

Consider reporting:
- Sample size and sampling strategy
- Overall precision, recall, F1
- Field-level metrics for key fields
- Confidence intervals (see the bootstrap sketch after this list)
- Common error types
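
The confidence intervals shown in the example text can be obtained by bootstrapping over papers. A minimal sketch that resamples per-paper true/false positive counts and reports a 95% percentile interval for precision; the `paper_counts` values here are made up for illustration and would come from your annotated validation set:

```python
import random

# Hypothetical per-paper (true positives, false positives) from the validation set.
paper_counts = [(5, 1), (3, 0), (8, 2), (2, 1), (6, 0), (4, 1), (7, 2), (3, 1)]

def precision(counts):
    tp = sum(t for t, _ in counts)
    fp = sum(f for _, f in counts)
    return tp / (tp + fp) if tp + fp else 0.0

random.seed(42)
boot = []
for _ in range(10_000):
    # Resample papers with replacement and recompute precision each time.
    resample = [random.choice(paper_counts) for _ in paper_counts]
    boot.append(precision(resample))

boot.sort()
lo, hi = boot[int(0.025 * len(boot))], boot[int(0.975 * len(boot))]
print(f"Precision {precision(paper_counts):.1%} (95% bootstrap CI: {lo:.1%}-{hi:.1%})")
```

The same resampling loop works for recall and F1 by carrying false negative counts per paper as well.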