# Validation and Quality Assurance Guide
## Overview
Validation quantifies extraction accuracy by comparing automated extraction against manually annotated ground truth, reporting precision, recall, and F1 metrics.
## When to Validate
- **Before production use** - Establish baseline quality
- **After schema changes** - Verify improvements
- **When comparing models** - Compare Haiku, Sonnet, or local models served via Ollama
- **For publication** - Report extraction quality metrics
## Recommended Sample Sizes
- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers
## Step 7: Prepare Validation Set
Sample papers for manual annotation using one of three strategies.
### Random Sampling (General Quality)
```bash
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy random \
  --output validation_set.json
```
Provides an overall quality estimate but may miss rare cases.
### Stratified Sampling (Identify Weaknesses)
```bash
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy stratified \
  --output validation_set.json
```
Samples papers with different characteristics:
- Papers with no records
- Papers with few records (1-2)
- Papers with medium records (3-5)
- Papers with many records (6+)
Best for identifying weak points in extraction.
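To make the strategy concrete, here is a minimal sketch of the bucketing idea, assuming extraction results keyed by paper ID with a `records` list per paper; the bucket boundaries mirror the groups above, but the actual script may implement sampling differently.
```python
import random

# Illustrative only: bucket papers by how many records were extracted,
# then sample evenly from each bucket. Field names are assumptions.
def stratified_sample(results: dict, sample_size: int, seed: int = 42) -> list:
    buckets = {"none": [], "few": [], "medium": [], "many": []}
    for paper_id, extraction in results.items():
        n = len(extraction.get("records", []))
        key = ("none" if n == 0 else
               "few" if n <= 2 else
               "medium" if n <= 5 else
               "many")
        buckets[key].append(paper_id)

    rng = random.Random(seed)
    per_bucket = max(1, sample_size // len(buckets))
    sampled = []
    for ids in buckets.values():
        sampled.extend(rng.sample(ids, min(per_bucket, len(ids))))
    return sampled[:sample_size]
```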
### Diverse Sampling (Comprehensive)
```bash
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy diverse \
  --output validation_set.json
```
Maximizes diversity across different paper types.
## Step 8: Manual Annotation
### Annotation Process
1. **Open validation file:**
```bash
# Use your preferred JSON editor
code validation_set.json # VS Code
vim validation_set.json # Vim
```
2. **For each paper in `validation_papers`:**
- Locate and read the original PDF
- Extract data according to the schema
- Fill the `ground_truth` field with the correct extraction
- The structure should match `automated_extraction`
3. **Fill metadata fields:**
- `annotator`: Your name
- `annotation_date`: YYYY-MM-DD
- `notes`: Any ambiguous cases or comments
### Annotation Tips
**Be thorough:**
- Extract ALL relevant information, even if automated extraction missed it
- This ensures accurate recall calculation
**Be precise:**
- Use exact values as they appear in the paper
- Follow the same schema structure as automated extraction
**Be consistent:**
- Apply the same interpretation rules across all papers
- Document interpretation decisions in notes
**Mark ambiguities:**
- If a field is unclear, note it and make your best judgment
- Consider having multiple annotators for inter-rater reliability
### Example Annotation
```json
{
  "paper_id_123": {
    "automated_extraction": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil"
        }
      ]
    },
    "ground_truth": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil",
          "state_province": "São Paulo"  // Automated missed this
        },
        {
          "species": "Bombus terrestris",  // Automated missed this record
          "location": "Brazil",
          "state_province": "São Paulo"
        }
      ]
    },
    "notes": "Automated extraction missed the state and second species",
    "annotator": "John Doe",
    "annotation_date": "2025-01-15"
  }
}
```
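Before moving on to metrics, a quick sanity check of the annotated file can catch papers that were skipped. This is an optional sketch, assuming papers sit under a top-level `validation_papers` key (falling back to the flat layout shown above); it is not part of the pipeline scripts.
```python
import json

# Quick sanity check before computing metrics. Layout assumptions noted above.
with open("validation_set.json") as f:
    data = json.load(f)

papers = data.get("validation_papers", data)  # tolerate either layout
for paper_id, entry in papers.items():
    missing = [key for key in ("ground_truth", "annotator", "annotation_date")
               if not entry.get(key)]
    if missing:
        print(f"{paper_id}: missing or empty fields: {missing}")
```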
## Step 9: Calculate Validation Metrics
### Basic Metrics Calculation
```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --output validation_metrics.json \
  --report validation_report.txt
```
### Advanced Options
**Fuzzy string matching:**
```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --fuzzy-strings \
  --output validation_metrics.json
```
Normalizes whitespace and case for string comparisons.
**Numeric tolerance:**
```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --numeric-tolerance 0.01 \
  --output validation_metrics.json
```
Treats numeric values as matching when they differ by no more than the specified tolerance.
**Ordered list comparison:**
```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --list-order-matters \
  --output validation_metrics.json
```
Treats lists as ordered sequences instead of sets.
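The comparison semantics behind these flags can be illustrated with a few small helpers. This is a rough sketch of the behavior described above, not the script's actual implementation:
```python
import math
import re

def strings_match(a: str, b: str, fuzzy: bool = False) -> bool:
    if fuzzy:
        # --fuzzy-strings: collapse whitespace and ignore case
        norm = lambda s: re.sub(r"\s+", " ", s).strip().lower()
        return norm(a) == norm(b)
    return a == b

def numbers_match(a: float, b: float, tolerance: float = 0.0) -> bool:
    # --numeric-tolerance: accept small absolute differences
    return math.isclose(a, b, abs_tol=tolerance)

def lists_match(a: list, b: list, order_matters: bool = False) -> bool:
    # --list-order-matters: compare as sequences instead of multisets
    return a == b if order_matters else sorted(map(str, a)) == sorted(map(str, b))
```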
## Understanding the Metrics
### Precision
**Definition:** Of the items extracted, what percentage are correct?
**Formula:** TP / (TP + FP)
**Example:** Extracted 10 species, 8 were correct → Precision = 80%
**High precision, low recall:** Conservative extraction (misses data)
### Recall
**Definition:** Of the true items, what percentage were extracted?
**Formula:** TP / (TP + FN)
**Example:** Paper has 12 species, extracted 8 → Recall = 67%
**Low precision, high recall:** Liberal extraction (includes errors)
### F1 Score
**Definition:** Harmonic mean of precision and recall
**Formula:** 2 × (Precision × Recall) / (Precision + Recall)
**Use:** Single metric balancing precision and recall
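To make the arithmetic concrete, the following sketch computes all three metrics from raw counts, using the species example above (10 extracted, 8 correct, 12 truly present):
```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# 10 species extracted, 8 correct, paper actually has 12 -> TP=8, FP=2, FN=4
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=4)
print(f"Precision {p:.0%}, Recall {r:.0%}, F1 {f1:.0%}")  # Precision 80%, Recall 67%, F1 73%
```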
### Field-Level Metrics
Metrics are calculated for each field type:
**Boolean fields:**
- True positives, false positives, false negatives
**Numeric fields:**
- Exact match or within tolerance
**String fields:**
- Exact or fuzzy match
**List fields:**
- Set-based comparison (default)
- Items in both (TP), in the automated extraction only (FP), in the ground truth only (FN); see the sketch below
**Nested objects:**
- Recursive field-by-field comparison
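For list fields, the default set-based comparison reduces to simple set operations. A minimal sketch (values are assumed to be hashable strings; the script itself may normalize them first):
```python
def list_field_counts(automated: list[str], truth: list[str]) -> dict:
    """Set-based comparison of a list field (default behavior described above)."""
    auto_set, truth_set = set(automated), set(truth)
    return {
        "tp": len(auto_set & truth_set),   # items in both
        "fp": len(auto_set - truth_set),   # extracted but not in ground truth
        "fn": len(truth_set - auto_set),   # in ground truth but missed
    }

counts = list_field_counts(
    automated=["Apis mellifera"],
    truth=["Apis mellifera", "Bombus terrestris"],
)
# -> {'tp': 1, 'fp': 0, 'fn': 1}
```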
## Interpreting Results
### Validation Report Structure
```
OVERALL METRICS
  Papers evaluated: 20
  Precision: 87.3%
  Recall: 79.2%
  F1 Score: 83.1%

METRICS BY FIELD
  Field      Precision   Recall   F1
  species    95.2%       89.1%    92.0%
  location   82.3%       75.4%    78.7%
  method     91.0%       68.2%    77.9%

COMMON ISSUES
  Fields with low recall (missed information):
  - method: 68.2% recall, 12 missed items
  Fields with low precision (incorrect extractions):
  - location: 82.3% precision, 8 incorrect items
```
### Using Results to Improve
**Low Recall (Missing Information):**
- Review extraction prompt instructions
- Add examples of the missed pattern
- Emphasize completeness in prompt
- Consider using a more capable model (Haiku → Sonnet)
**Low Precision (Incorrect Extractions):**
- Add validation rules to prompt
- Provide clearer field definitions
- Add negative examples
- Tighten extraction criteria
**Field-Specific Issues:**
- Identify problematic field types
- Revise schema definitions
- Add field-specific instructions
- Update examples
## Inter-Rater Reliability (Optional)
For critical applications, have multiple annotators:
1. **Split validation set:**
- 10 papers: Single annotator
- 10 papers: Both annotators independently
2. **Calculate agreement** (a rough sketch of one approach follows this list):
```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations annotator1.json \
  --compare-with annotator2.json \
  --output agreement_metrics.json
```
3. **Resolve disagreements:**
- Discuss discrepancies
- Establish interpretation guidelines
- Re-annotate if needed
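If you want a quick sense of agreement before running the script, simple percent agreement on a shared field can be computed directly. This sketch assumes the annotation layout shown in Step 8 and only compares the `has_relevant_data` flag; the `--compare-with` option may compute agreement differently:
```python
import json

def percent_agreement(file_a: str, file_b: str) -> float:
    def load(path: str) -> dict:
        with open(path) as f:
            data = json.load(f)
        return data.get("validation_papers", data)  # tolerate either layout

    a, b = load(file_a), load(file_b)
    shared = set(a) & set(b)  # papers annotated by both annotators
    agree = sum(
        a[pid]["ground_truth"]["has_relevant_data"]
        == b[pid]["ground_truth"]["has_relevant_data"]
        for pid in shared
    )
    return agree / len(shared) if shared else 0.0

print(f"Agreement: {percent_agreement('annotator1.json', 'annotator2.json'):.0%}")
```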
## Iterative Improvement Workflow
1. **Baseline:** Run extraction with initial schema
2. **Validate:** Calculate metrics on sample
3. **Analyze:** Identify weak fields and error patterns
4. **Revise:** Update schema, prompts, or model
5. **Re-extract:** Run extraction with improvements
6. **Re-validate:** Calculate new metrics
7. **Compare:** Check if metrics improved
8. **Repeat:** Continue until acceptable quality is achieved
## Reporting Validation in Publications
Include in the methods section:
```
Extraction quality was assessed on a stratified random sample of
20 papers. Automated extraction achieved 87.3% precision (95% CI:
81.2-93.4%) and 79.2% recall (95% CI: 72.8-85.6%), with an F1
score of 83.1%. Field-level metrics ranged from 77.9% (method
descriptions) to 92.0% (species names).
```
Consider reporting:
- Sample size and sampling strategy
- Overall precision, recall, F1
- Field-level metrics for key fields
- Confidence intervals (see the sketch below)
- Common error types
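If the metrics script does not report confidence intervals directly, a paper-level bootstrap is one common way to obtain them. This is a sketch under the assumption that you can collect per-paper (TP, FP, FN) counts; it is not part of the pipeline scripts:
```python
import random

# Paper-level bootstrap CI for precision. `counts` is a list of per-paper
# (tp, fp, fn) tuples; purely illustrative.
def bootstrap_precision_ci(counts, n_boot=2000, alpha=0.05, seed=42):
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        resample = [rng.choice(counts) for _ in counts]
        tp = sum(c[0] for c in resample)
        fp = sum(c[1] for c in resample)
        estimates.append(tp / (tp + fp) if tp + fp else 0.0)
    estimates.sort()
    lo = estimates[int(alpha / 2 * n_boot)]
    hi = estimates[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```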