# Validation and Quality Assurance Guide

## Overview

Validation quantifies extraction accuracy using precision, recall, and F1 metrics by comparing automated extraction against manually annotated ground truth.

## When to Validate

- **Before production use** - Establish baseline quality
- **After schema changes** - Verify improvements
- **When comparing models** - Test Haiku vs Sonnet vs Ollama
- **For publication** - Report extraction quality metrics

## Recommended Sample Sizes

- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers

## Step 7: Prepare Validation Set

Sample papers for manual annotation using one of three strategies.

### Random Sampling (General Quality)

```bash
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy random \
  --output validation_set.json
```

Provides overall quality estimate but may miss rare cases.

### Stratified Sampling (Identify Weaknesses)

```bash
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy stratified \
  --output validation_set.json
```

Samples papers with different characteristics:

- Papers with no records
- Papers with few records (1-2)
- Papers with medium records (3-5)
- Papers with many records (6+)

Best for identifying weak points in extraction.

### Diverse Sampling (Comprehensive)

```bash
python scripts/07_prepare_validation_set.py \
  --extraction-results cleaned_data.json \
  --schema my_schema.json \
  --sample-size 20 \
  --strategy diverse \
  --output validation_set.json
```

Maximizes diversity across different paper types.

## Step 8: Manual Annotation

### Annotation Process

1. **Open validation file:**

   ```bash
   # Use your preferred JSON editor
   code validation_set.json   # VS Code
   vim validation_set.json    # Vim
   ```

2. **For each paper in `validation_papers`:**
   - Locate and read the original PDF
   - Extract data according to the schema
   - Fill the `ground_truth` field with correct extraction
   - The structure should match `automated_extraction`

3. **Fill metadata fields:**
   - `annotator`: Your name
   - `annotation_date`: YYYY-MM-DD
   - `notes`: Any ambiguous cases or comments
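As you work through these steps, it helps to know which papers in the validation set still need attention. Below is a minimal sketch of such a check, assuming `validation_set.json` stores a `validation_papers` object keyed by paper ID (as in the example annotation later in this guide); this helper is illustrative and not one of the pipeline's scripts.

```python
# Minimal sketch: list validation papers that still need annotation.
# Assumes validation_set.json has a "validation_papers" object keyed by paper ID;
# this helper is illustrative and not part of the pipeline's scripts.
import json

with open("validation_set.json") as f:
    validation_papers = json.load(f)["validation_papers"]

for paper_id, entry in validation_papers.items():
    missing = [
        field
        for field in ("ground_truth", "annotator", "annotation_date")
        if not entry.get(field)
    ]
    if missing:
        print(f"{paper_id}: missing {', '.join(missing)}")
```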
### Annotation Tips

**Be thorough:**
- Extract ALL relevant information, even if automated extraction missed it
- This ensures accurate recall calculation

**Be precise:**
- Use exact values as they appear in the paper
- Follow the same schema structure as automated extraction

**Be consistent:**
- Apply the same interpretation rules across all papers
- Document interpretation decisions in notes

**Mark ambiguities:**
- If a field is unclear, note it and make your best judgment
- Consider having multiple annotators for inter-rater reliability

### Example Annotation

In this example, the ground truth adds the missing `state_province` value and a second record (*Bombus terrestris*) that the automated extraction did not capture; such differences are summarized in the `notes` field so the file remains valid JSON.

```json
{
  "paper_id_123": {
    "automated_extraction": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil"
        }
      ]
    },
    "ground_truth": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil",
          "state_province": "São Paulo"
        },
        {
          "species": "Bombus terrestris",
          "location": "Brazil",
          "state_province": "São Paulo"
        }
      ]
    },
    "notes": "Automated extraction missed the state and second species",
    "annotator": "John Doe",
    "annotation_date": "2025-01-15"
  }
}
```

## Step 9: Calculate Validation Metrics

### Basic Metrics Calculation

```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --output validation_metrics.json \
  --report validation_report.txt
```

### Advanced Options

**Fuzzy string matching:**

```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --fuzzy-strings \
  --output validation_metrics.json
```

Normalizes whitespace and case for string comparisons.

**Numeric tolerance:**

```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --numeric-tolerance 0.01 \
  --output validation_metrics.json
```

Allows small differences in numeric values.

**Ordered list comparison:**

```bash
python scripts/08_calculate_validation_metrics.py \
  --annotations validation_set.json \
  --list-order-matters \
  --output validation_metrics.json
```

Treats lists as ordered sequences instead of sets.
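The `--fuzzy-strings` and `--numeric-tolerance` options above relax exact matching. The sketch below illustrates the kind of comparison rule they describe; it is a simplified stand-in, not the script's actual implementation.

```python
# Illustrative sketch of tolerant value comparison (not the script's actual code).
def values_match(automated, truth,
                 fuzzy_strings: bool = True,
                 numeric_tolerance: float = 0.01) -> bool:
    """Return True when an automated value should count as matching ground truth."""
    # Numeric fields: allow a small absolute difference (exclude bools, which are ints in Python).
    if (isinstance(automated, (int, float)) and not isinstance(automated, bool)
            and isinstance(truth, (int, float)) and not isinstance(truth, bool)):
        return abs(automated - truth) <= numeric_tolerance
    # String fields: optionally normalize case and whitespace before comparing.
    if isinstance(automated, str) and isinstance(truth, str):
        if fuzzy_strings:
            def norm(s: str) -> str:
                return " ".join(s.lower().split())
            return norm(automated) == norm(truth)
        return automated == truth
    # Everything else (booleans, None): exact equality.
    return automated == truth

# These pairs count as matches under the tolerant settings:
assert values_match("Apis  mellifera", "apis mellifera")
assert values_match(0.504, 0.50)
```

Note that in this sketch a type mismatch (for example a number extracted as the string "12") falls through to exact equality and counts as an error.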
## Understanding the Metrics

### Precision

**Definition:** Of the items extracted, what percentage are correct?

**Formula:** TP / (TP + FP)

**Example:** Extracted 10 species, 8 were correct → Precision = 80%

**High precision, low recall:** Conservative extraction (misses data)

### Recall

**Definition:** Of the true items, what percentage were extracted?

**Formula:** TP / (TP + FN)

**Example:** Paper has 12 species, extracted 8 → Recall = 67%

**Low precision, high recall:** Liberal extraction (includes errors)

### F1 Score

**Definition:** Harmonic mean of precision and recall

**Formula:** 2 × (Precision × Recall) / (Precision + Recall)

**Use:** Single metric balancing precision and recall

### Field-Level Metrics

Metrics are calculated for each field type:

**Boolean fields:**
- True positives, false positives, false negatives

**Numeric fields:**
- Exact match or within tolerance

**String fields:**
- Exact or fuzzy match

**List fields:**
- Set-based comparison (default)
- Items in both (TP), in automated only (FP), in truth only (FN)

**Nested objects:**
- Recursive field-by-field comparison

## Interpreting Results

### Validation Report Structure

```
OVERALL METRICS
  Papers evaluated: 20
  Precision: 87.3%
  Recall: 79.2%
  F1 Score: 83.1%

METRICS BY FIELD
  Field       Precision   Recall    F1
  species     95.2%       89.1%     92.0%
  location    82.3%       75.4%     78.7%
  method      91.0%       68.2%     77.9%

COMMON ISSUES
  Fields with low recall (missed information):
  - method: 68.2% recall, 12 missed items

  Fields with low precision (incorrect extractions):
  - location: 82.3% precision, 8 incorrect items
```

### Using Results to Improve

**Low Recall (Missing Information):**
- Review extraction prompt instructions
- Add examples of the missed pattern
- Emphasize completeness in the prompt
- Consider using a more capable model (Haiku → Sonnet)

**Low Precision (Incorrect Extractions):**
- Add validation rules to the prompt
- Provide clearer field definitions
- Add negative examples
- Tighten extraction criteria

**Field-Specific Issues:**
- Identify problematic field types
- Revise schema definitions
- Add field-specific instructions
- Update examples

## Inter-Rater Reliability (Optional)

For critical applications, have multiple annotators:

1. **Split validation set:**
   - 10 papers: Single annotator
   - 10 papers: Both annotators independently

2. **Calculate agreement:**

   ```bash
   python scripts/08_calculate_validation_metrics.py \
     --annotations annotator1.json \
     --compare-with annotator2.json \
     --output agreement_metrics.json
   ```

3. **Resolve disagreements:**
   - Discuss discrepancies
   - Establish interpretation guidelines
   - Re-annotate if needed

## Iterative Improvement Workflow

1. **Baseline:** Run extraction with the initial schema
2. **Validate:** Calculate metrics on a sample
3. **Analyze:** Identify weak fields and error patterns
4. **Revise:** Update the schema, prompts, or model
5. **Re-extract:** Run extraction with the improvements
6. **Re-validate:** Calculate new metrics
7. **Compare:** Check whether the metrics improved
8. **Repeat:** Until acceptable quality is achieved

## Reporting Validation in Publications

Include in the methods section:

```
Extraction quality was assessed on a stratified random sample of 20 papers.
Automated extraction achieved 87.3% precision (95% CI: 81.2-93.4%) and
79.2% recall (95% CI: 72.8-85.6%), with an F1 score of 83.1%. Field-level
metrics ranged from 77.9% (method descriptions) to 92.0% (species names).
```

Consider reporting:

- Sample size and sampling strategy
- Overall precision, recall, and F1
- Field-level metrics for key fields
- Confidence intervals (see the sketch below)
- Common error types
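For the confidence intervals shown in the example methods text, one straightforward approach is bootstrapping over papers. The sketch below does this for precision; the per-paper count structure is a hypothetical input, not the layout of `validation_metrics.json`.

```python
# Minimal sketch: percentile bootstrap CI for precision, resampling papers.
# The per-paper TP/FP counts below are hypothetical inputs, not the metrics file format.
import random

def precision(papers: list[dict]) -> float:
    tp = sum(p["tp"] for p in papers)
    fp = sum(p["fp"] for p in papers)
    return tp / (tp + fp) if (tp + fp) else 0.0

def bootstrap_ci(papers: list[dict], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 42) -> tuple[float, tuple[float, float]]:
    """Resample papers with replacement and take percentile bounds of precision."""
    random.seed(seed)
    stats = sorted(
        precision([random.choice(papers) for _ in papers]) for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return precision(papers), (lower, upper)

# Hypothetical per-paper counts:
papers = [{"tp": 8, "fp": 1}, {"tp": 5, "fp": 2}, {"tp": 9, "fp": 0}]
point, (low, high) = bootstrap_ci(papers)
print(f"Precision {point:.1%} (95% CI: {low:.1%}-{high:.1%})")
```

The same resampling applies to recall and F1 by swapping in the corresponding counts.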