Initial commit
406 skills/extract_from_pdfs/references/api_reference.md Normal file
@@ -0,0 +1,406 @@
# External API Validation Reference

## Overview

Step 5 validates and enriches extracted data using external scientific databases. This ensures taxonomic names are standardized, locations are geocoded, and chemical/gene identifiers are canonical.

## Available APIs

### Biological Taxonomy

#### GBIF (Global Biodiversity Information Facility)

**Use for:** General biological taxonomy (animals, plants, fungi, etc.)

**Function:** `validate_gbif_taxonomy(scientific_name)`

**Returns:**
- Matched canonical name
- Full scientific name with authority
- Taxonomic hierarchy (kingdom, phylum, class, order, family, genus)
- GBIF ID
- Match confidence and type
- Taxonomic status

**Example:**
```python
validate_gbif_taxonomy("Apis mellifera")
# Returns (a misspelled input such as "Apis melifera" would come back
# with match_type "FUZZY" and lower confidence instead):
{
    "matched_name": "Apis mellifera",
    "scientific_name": "Apis mellifera Linnaeus, 1758",
    "rank": "SPECIES",
    "kingdom": "Animalia",
    "phylum": "Arthropoda",
    "class": "Insecta",
    "order": "Hymenoptera",
    "family": "Apidae",
    "genus": "Apis",
    "gbif_id": 1340278,
    "confidence": 100,
    "match_type": "EXACT"
}
```

**No API key required** - Free and unlimited

**Documentation:** https://www.gbif.org/developer/species
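
For orientation, the lookup can be reproduced directly against GBIF's public species-match endpoint. The sketch below is illustrative only; `gbif_match_sketch` is a hypothetical name, not the pipeline's implementation in `scripts/05_validate_with_apis.py`:

```python
from typing import Optional

import requests


def gbif_match_sketch(scientific_name: str) -> Optional[dict]:
    """Query GBIF's free species-match endpoint (no API key needed)."""
    resp = requests.get(
        "https://api.gbif.org/v1/species/match",
        params={"name": scientific_name},
        timeout=10,
    )
    resp.raise_for_status()
    data = resp.json()
    if data.get("matchType") == "NONE":
        return None  # GBIF found no usable match
    return {
        "matched_name": data.get("canonicalName"),
        "scientific_name": data.get("scientificName"),
        "rank": data.get("rank"),
        "gbif_id": data.get("usageKey"),
        "confidence": data.get("confidence"),
        "match_type": data.get("matchType"),
    }
```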

#### World Flora Online (WFO)

**Use for:** Plant taxonomy specifically

**Function:** `validate_wfo_plant(scientific_name)`

**Returns:**
- Matched name
- Scientific name with authors
- Family
- WFO ID
- Taxonomic status

**Example:**
```python
validate_wfo_plant("Magnolia grandiflora")
# Returns:
{
    "matched_name": "Magnolia grandiflora",
    "scientific_name": "Magnolia grandiflora L.",
    "authors": "L.",
    "family": "Magnoliaceae",
    "wfo_id": "wfo-0000988234",
    "status": "Accepted"
}
```

**No API key required** - Free

**Documentation:** http://www.worldfloraonline.org/

### Geography

#### GeoNames

**Use for:** Location validation and standardization

**Function:** `validate_geonames(location, country=None)`

**Returns:**
- Matched place name
- Country name and code
- Administrative divisions (state, province)
- Latitude/longitude
- GeoNames ID

**Example:**
```python
validate_geonames("São Paulo", country="BR")
# Returns:
{
    "matched_name": "São Paulo",
    "country": "Brazil",
    "country_code": "BR",
    "admin1": "São Paulo",
    "admin2": None,
    "latitude": "-23.5475",
    "longitude": "-46.63611",
    "geonames_id": 3448439
}
```

**Requires free account:** Register at https://www.geonames.org/login

**Setup:**
1. Create account
2. Enable web services in account settings
3. Set environment variable: `export GEONAMES_USERNAME='your-username'`

**Rate limit:** The free tier allows roughly 20,000 credits per day (with an hourly cap of about 1,000); check your account page for current quotas

**Documentation:** https://www.geonames.org/export/web-services.html
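
GeoNames exposes this lookup through its `searchJSON` endpoint. A minimal illustrative sketch, not the script's actual code; it assumes `GEONAMES_USERNAME` is set as described above:

```python
import os
from typing import Optional

import requests


def geonames_search_sketch(location: str, country: str = None) -> Optional[dict]:
    """Look up a place via GeoNames' searchJSON endpoint (username required)."""
    params = {
        "q": location,
        "maxRows": 1,
        "username": os.environ["GEONAMES_USERNAME"],
    }
    if country:
        params["country"] = country  # two-letter ISO code, e.g. "BR"
    resp = requests.get("http://api.geonames.org/searchJSON", params=params, timeout=10)
    results = resp.json().get("geonames", [])
    if not results:
        return None
    top = results[0]
    return {
        "matched_name": top.get("name"),
        "country": top.get("countryName"),
        "country_code": top.get("countryCode"),
        "admin1": top.get("adminName1"),
        "latitude": top.get("lat"),
        "longitude": top.get("lng"),
        "geonames_id": top.get("geonameId"),
    }
```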

#### OpenStreetMap Nominatim

**Use for:** Geocoding addresses to coordinates

**Function:** `geocode_location(address)`

**Returns:**
- Display name (formatted address)
- Latitude/longitude
- OSM type and ID
- Place rank

**Example:**
```python
geocode_location("Field Museum, Chicago, IL")
# Returns:
{
    "display_name": "Field Museum, 1400, South Lake Shore Drive, Chicago, Illinois, 60605, United States",
    "latitude": "41.8662",
    "longitude": "-87.6169",
    "osm_type": "way",
    "osm_id": 54856789,
    "place_rank": 30
}
```

**No API key required** - Free

**Important:** Add 1-second delays between requests (implemented in script)

**Documentation:** https://nominatim.org/release-docs/latest/api/Overview/
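
Nominatim's `/search` endpoint returns these fields directly. An illustrative sketch, not the script's actual code; Nominatim's usage policy requires an identifying User-Agent, and the contact address below is a placeholder:

```python
import time
from typing import Optional

import requests


def nominatim_geocode_sketch(address: str) -> Optional[dict]:
    """Geocode a free-text address via the public Nominatim instance."""
    resp = requests.get(
        "https://nominatim.openstreetmap.org/search",
        params={"q": address, "format": "jsonv2", "limit": 1},
        headers={"User-Agent": "pdf-extraction-pipeline/0.1 (contact@example.org)"},
        timeout=10,
    )
    results = resp.json()
    time.sleep(1.0)  # usage policy: at most 1 request per second
    if not results:
        return None
    top = results[0]
    return {
        "display_name": top.get("display_name"),
        "latitude": top.get("lat"),
        "longitude": top.get("lon"),
        "osm_type": top.get("osm_type"),
        "osm_id": top.get("osm_id"),
        "place_rank": top.get("place_rank"),
    }
```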

### Chemistry

#### PubChem

**Use for:** Chemical compound validation

**Function:** `validate_pubchem_compound(compound_name)`

**Returns:**
- PubChem CID (compound ID)
- Molecular formula
- PubChem URL

**Example:**
```python
validate_pubchem_compound("aspirin")
# Returns:
{
    "cid": 2244,
    "molecular_formula": "C9H8O4",
    "pubchem_url": "https://pubchem.ncbi.nlm.nih.gov/compound/2244"
}
```

**No API key required** - Free

**Documentation:** https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
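
PubChem's PUG REST interface resolves names to CIDs in a single request. An illustrative sketch, not the script's actual code:

```python
from typing import Optional
from urllib.parse import quote

import requests


def pubchem_lookup_sketch(compound_name: str) -> Optional[dict]:
    """Resolve a compound name to a CID and molecular formula via PUG REST."""
    url = (
        "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
        f"{quote(compound_name)}/property/MolecularFormula/JSON"
    )
    resp = requests.get(url, timeout=10)
    if resp.status_code != 200:
        return None  # PUG REST answers 404 when the name doesn't resolve
    props = resp.json()["PropertyTable"]["Properties"][0]
    return {
        "cid": props["CID"],
        "molecular_formula": props["MolecularFormula"],
        "pubchem_url": f"https://pubchem.ncbi.nlm.nih.gov/compound/{props['CID']}",
    }
```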

### Genetics

#### NCBI Gene

**Use for:** Gene validation

**Function:** `validate_ncbi_gene(gene_symbol, organism=None)`

**Returns:**
- NCBI Gene ID
- NCBI URL

**Example:**
```python
validate_ncbi_gene("BRCA1", organism="Homo sapiens")
# Returns:
{
    "gene_id": "672",
    "ncbi_url": "https://www.ncbi.nlm.nih.gov/gene/672"
}
```

**No API key required** - Free

**Rate limit:** Max 3 requests/second

**Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25500/
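
The lookup maps onto NCBI's E-utilities `esearch` endpoint. An illustrative sketch, not the script's actual code:

```python
from typing import Optional

import requests


def ncbi_gene_search_sketch(gene_symbol: str, organism: str = None) -> Optional[dict]:
    """Find a gene ID via E-utilities esearch (max 3 requests/second without a key)."""
    term = f"{gene_symbol}[sym]"
    if organism:
        term += f" AND {organism}[orgn]"
    resp = requests.get(
        "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi",
        params={"db": "gene", "term": term, "retmode": "json"},
        timeout=10,
    )
    ids = resp.json()["esearchresult"]["idlist"]
    if not ids:
        return None
    return {
        "gene_id": ids[0],
        "ncbi_url": f"https://www.ncbi.nlm.nih.gov/gene/{ids[0]}",
    }
```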

## Configuration

### API Config File Structure

Create `my_api_config.json` based on `assets/api_config_template.json`:

```json
{
    "field_mappings": {
        "species": {
            "api": "gbif_taxonomy",
            "output_field": "validated_species",
            "description": "Validate species names against GBIF"
        },
        "location": {
            "api": "geocode",
            "output_field": "coordinates"
        }
    },

    "nested_field_mappings": {
        "records.plant_species": {
            "api": "wfo_plants",
            "output_field": "validated_plant_taxonomy"
        },
        "records.location": {
            "api": "geocode",
            "output_field": "coordinates"
        }
    }
}
```

### Field Mapping Parameters

**Required:**
- `api` - API name (see list above)
- `output_field` - Name for validated data

**Optional:**
- `description` - Documentation
- `extra_params` - Additional API-specific parameters

## Adding Custom APIs

To add a new validation API:

1. **Create validator function** in `scripts/05_validate_with_apis.py`:

```python
def validate_custom_api(value: str, extra_param: str = None) -> Optional[Dict]:
    """
    Validate value using custom API.

    Args:
        value: The value to validate
        extra_param: Optional additional parameter

    Returns:
        Dictionary with validated data or None if not found
    """
    try:
        # Make API request
        response = requests.get(f"https://api.example.com/{value}")
        if response.status_code == 200:
            data = response.json()
            return {
                'validated_value': data.get('canonical_name'),
                'api_id': data.get('id'),
                'additional_info': data.get('info')
            }
    except Exception as e:
        print(f"Custom API error: {e}")

    return None
```

2. **Register in API_VALIDATORS** dictionary:

```python
API_VALIDATORS = {
    'gbif_taxonomy': validate_gbif_taxonomy,
    'wfo_plants': validate_wfo_plant,
    # ... existing validators ...
    'custom_api': validate_custom_api,  # Add here
}
```

3. **Use in config file:**

```json
{
    "field_mappings": {
        "your_field": {
            "api": "custom_api",
            "output_field": "validated_field",
            "extra_params": {
                "extra_param": "value"
            }
        }
    }
}
```

## Rate Limiting

The script implements rate limiting to respect API usage policies:

**Default delays (built into script):**
- GeoNames: 0.5 seconds
- Nominatim: 1.0 second (required)
- WFO: 1.0 second
- Others: 0.5 seconds

**Modify delays if needed** in `scripts/05_validate_with_apis.py`:

```python
# In main() function
if not args.skip_validation:
    time.sleep(0.5)  # Adjust this value
```

## Error Handling

APIs may fail for various reasons:

**Common errors:**
- Connection timeout
- Rate limit exceeded
- Invalid API key
- Malformed query
- No match found

**Script behavior:**
- Continues processing on error
- Logs the error to the console
- Sets the validated field to None
- Preserves the original extracted value

**Retry logic:**
- 3 retries with exponential backoff
- Implemented for network errors
- Not for "no match found" errors
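
Conceptually, the retry behavior looks like the following sketch (illustrative, not the script's exact code):

```python
import time

import requests


def with_retries(validator, *args, max_retries: int = 3, base_delay: float = 1.0, **kwargs):
    """Call a validator, retrying network failures with exponential backoff (1s, 2s, 4s).

    A validator that simply finds no match returns None without raising,
    so "no match found" is never retried.
    """
    for attempt in range(max_retries + 1):
        try:
            return validator(*args, **kwargs)
        except (requests.ConnectionError, requests.Timeout) as exc:
            if attempt == max_retries:
                print(f"Giving up after {max_retries} retries: {exc}")
                return None
            time.sleep(base_delay * 2 ** attempt)
```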

## Best Practices

1. **Start with test run:**
   ```bash
   python scripts/05_validate_with_apis.py \
       --input cleaned_data.json \
       --apis my_api_config.json \
       --skip-validation \
       --output test_structure.json
   ```

2. **Validate subset first:**
   - Test on 10 papers before full run
   - Verify API connections work
   - Check output structure

3. **Monitor API usage:**
   - Track request counts for paid APIs
   - Respect rate limits
   - Consider caching results (see the sketch after this list)

4. **Handle failures gracefully:**
   - Original data is never lost
   - Can re-run validation separately
   - Manually fix failed validations if needed

5. **Optimize API calls:**
   - Only validate fields that need standardization
   - Use cached results when re-running
   - Batch similar queries when possible
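
For the caching suggestion above, a minimal sketch of a JSON-file cache wrapped around any validator; `api_cache.json` and `cached_call` are hypothetical names, not part of the pipeline:

```python
import json
from pathlib import Path

CACHE_PATH = Path("api_cache.json")  # hypothetical cache location


def load_cache() -> dict:
    """Read the cache from disk, or start empty."""
    return json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}


def cached_call(key: str, validator, cache: dict):
    """Return the cached result for key, calling the validator only on a miss."""
    if key not in cache:
        cache[key] = validator(key)
        CACHE_PATH.write_text(json.dumps(cache))  # persist after each new result
    return cache[key]

# Usage: cached_call("Apis mellifera", validate_gbif_taxonomy, load_cache())
```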

## Troubleshooting

### GeoNames "Service disabled" error
- Check account email is verified
- Enable web services in account settings
- Wait up to 1 hour after enabling

### Nominatim rate limit errors
- Script includes 1-second delays
- Don't run multiple instances
- Consider using local Nominatim instance

### NCBI errors
- Reduce request frequency
- Add longer delays
- Use E-utilities API key (optional, increases limit)

### No matches found
- Check spelling and formatting
- Try variations of name
- Some names may not be in database
- Consider manual curation for important cases

147 skills/extract_from_pdfs/references/setup_guide.md Normal file
@@ -0,0 +1,147 @@
# Setup Guide for PDF Data Extraction

## Installation

### Using Conda (Recommended)

Create a dedicated environment for the extraction pipeline:

```bash
conda env create -f environment.yml
conda activate pdf_extraction
```

### Using pip

```bash
pip install -r requirements.txt
```

## Required Dependencies

### Core Dependencies
- `anthropic>=0.40.0` - Anthropic API client
- `pybtex>=0.24.0` - BibTeX file handling
- `rispy>=0.6.0` - RIS file handling
- `json-repair>=0.25.0` - JSON repair and validation
- `jsonschema>=4.20.0` - JSON schema validation
- `pandas>=2.0.0` - Data processing
- `requests>=2.31.0` - HTTP requests for APIs

### Export Dependencies
- `openpyxl>=3.1.0` - Excel export
- `pyreadr>=0.5.0` - R RDS export

## API Keys Setup

### Anthropic API Key (Required for Claude backends)

```bash
export ANTHROPIC_API_KEY='your-api-key-here'
```

Add to your shell profile (~/.bashrc, ~/.zshrc) for persistence:

```bash
echo 'export ANTHROPIC_API_KEY="your-api-key-here"' >> ~/.bashrc
source ~/.bashrc
```

### GeoNames Username (Optional - for geographic validation)

1. Register at https://www.geonames.org/login
2. Enable web services in your account
3. Set environment variable:

```bash
export GEONAMES_USERNAME='your-username'
```

## Local Model Setup (Ollama)

For free, private, offline abstract filtering:

### Installation

**macOS:**
```bash
brew install ollama
```

**Linux:**
```bash
curl -fsSL https://ollama.com/install.sh | sh
```

**Windows:**
Download from https://ollama.com/download

### Pulling Models

```bash
# Recommended models
ollama pull llama3.1:8b    # Good balance (8GB RAM)
ollama pull mistral:7b     # Fast, simple filtering
ollama pull qwen2.5:7b     # Multilingual support
ollama pull llama3.1:70b   # Best accuracy (64GB RAM)
```

### Starting Ollama Server

The server usually starts automatically, but it can be started manually:

```bash
ollama serve
```

The server runs at http://localhost:11434 by default.

## Verifying Installation

Test that all components are properly installed:

```bash
# Test Python dependencies
python -c "import anthropic, pybtex, rispy, json_repair, pandas; print('All dependencies OK')"

# Test that the Anthropic client initializes (checks ANTHROPIC_API_KEY is set,
# not that the key itself is valid)
python -c "from anthropic import Anthropic; client = Anthropic(); print('Client initialized')"

# Test Ollama (if using)
curl http://localhost:11434/api/tags
```

## Directory Structure

The skill will work with PDFs and metadata organized in various ways:

### Option A: Reference Manager Export
```
project/
├── library.bib          # BibTeX export
└── pdfs/
    ├── Smith2020.pdf
    ├── Jones2021.pdf
    └── ...
```

### Option B: Simple Directory
```
project/
└── pdfs/
    ├── paper1.pdf
    ├── paper2.pdf
    └── ...
```

### Option C: DOI List
```
project/
└── dois.txt             # One DOI per line
```

## Next Steps

After installation, proceed to the workflow guide to start extracting data from your PDFs.

See: `references/workflow_guide.md`

329 skills/extract_from_pdfs/references/validation_guide.md Normal file
@@ -0,0 +1,329 @@
# Validation and Quality Assurance Guide

## Overview

Validation quantifies extraction accuracy using precision, recall, and F1 metrics by comparing automated extraction against manually annotated ground truth.

## When to Validate

- **Before production use** - Establish baseline quality
- **After schema changes** - Verify improvements
- **When comparing models** - Test Haiku vs Sonnet vs Ollama
- **For publication** - Report extraction quality metrics

## Recommended Sample Sizes

- Small projects (<100 papers): 10-20 papers
- Medium projects (100-500 papers): 20-50 papers
- Large projects (>500 papers): 50-100 papers

## Step 7: Prepare Validation Set

Sample papers for manual annotation using one of three strategies.

### Random Sampling (General Quality)

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy random \
    --output validation_set.json
```

Provides overall quality estimate but may miss rare cases.

### Stratified Sampling (Identify Weaknesses)

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy stratified \
    --output validation_set.json
```

Samples papers with different characteristics:
- Papers with no records
- Papers with few records (1-2)
- Papers with medium records (3-5)
- Papers with many records (6+)

Best for identifying weak points in extraction.

### Diverse Sampling (Comprehensive)

```bash
python scripts/07_prepare_validation_set.py \
    --extraction-results cleaned_data.json \
    --schema my_schema.json \
    --sample-size 20 \
    --strategy diverse \
    --output validation_set.json
```

Maximizes diversity across different paper types.

## Step 8: Manual Annotation

### Annotation Process

1. **Open validation file:**
   ```bash
   # Use your preferred JSON editor
   code validation_set.json   # VS Code
   vim validation_set.json    # Vim
   ```

2. **For each paper in `validation_papers`:**
   - Locate and read the original PDF
   - Extract data according to the schema
   - Fill the `ground_truth` field with correct extraction
   - The structure should match `automated_extraction`

3. **Fill metadata fields:**
   - `annotator`: Your name
   - `annotation_date`: YYYY-MM-DD
   - `notes`: Any ambiguous cases or comments

### Annotation Tips

**Be thorough:**
- Extract ALL relevant information, even if automated extraction missed it
- This ensures accurate recall calculation

**Be precise:**
- Use exact values as they appear in the paper
- Follow the same schema structure as automated extraction

**Be consistent:**
- Apply the same interpretation rules across all papers
- Document interpretation decisions in notes

**Mark ambiguities:**
- If a field is unclear, note it and make your best judgment
- Consider having multiple annotators for inter-rater reliability

### Example Annotation

```json
{
  "paper_id_123": {
    "automated_extraction": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil"
        }
      ]
    },
    "ground_truth": {
      "has_relevant_data": true,
      "records": [
        {
          "species": "Apis mellifera",
          "location": "Brazil",
          "state_province": "São Paulo"  // Automated missed this
        },
        {
          "species": "Bombus terrestris",  // Automated missed this record
          "location": "Brazil",
          "state_province": "São Paulo"
        }
      ]
    },
    "notes": "Automated extraction missed the state and second species",
    "annotator": "John Doe",
    "annotation_date": "2025-01-15"
  }
}
```

## Step 9: Calculate Validation Metrics

### Basic Metrics Calculation

```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --output validation_metrics.json \
    --report validation_report.txt
```

### Advanced Options

**Fuzzy string matching:**
```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --fuzzy-strings \
    --output validation_metrics.json
```

Normalizes whitespace and case for string comparisons.

**Numeric tolerance:**
```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --numeric-tolerance 0.01 \
    --output validation_metrics.json
```

Allows small differences in numeric values.

**Ordered list comparison:**
```bash
python scripts/08_calculate_validation_metrics.py \
    --annotations validation_set.json \
    --list-order-matters \
    --output validation_metrics.json
```

Treats lists as ordered sequences instead of sets.

## Understanding the Metrics

### Precision
**Definition:** Of the items extracted, what percentage are correct?

**Formula:** TP / (TP + FP)

**Example:** Extracted 10 species, 8 were correct → Precision = 80%

**High precision, low recall:** Conservative extraction (misses data)

### Recall
**Definition:** Of the true items, what percentage were extracted?

**Formula:** TP / (TP + FN)

**Example:** Paper has 12 species, 8 were correctly extracted → Recall = 67%

**Low precision, high recall:** Liberal extraction (includes errors)

### F1 Score
**Definition:** Harmonic mean of precision and recall

**Formula:** 2 × (Precision × Recall) / (Precision + Recall)

**Use:** Single metric balancing precision and recall
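
In code, the three metrics and the worked examples above reduce to a few lines (illustrative sketch):

```python
def precision_recall_f1(tp: int, fp: int, fn: int):
    """Compute precision, recall, and F1 from TP/FP/FN counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# The worked examples above: 8 of 10 extracted species were correct,
# and the paper actually contained 12 species (so 4 were missed).
print(precision_recall_f1(tp=8, fp=2, fn=4))  # (0.8, 0.666..., 0.727...)
```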

### Field-Level Metrics

Metrics are calculated for each field type:

**Boolean fields:**
- True positives, false positives, false negatives

**Numeric fields:**
- Exact match or within tolerance

**String fields:**
- Exact or fuzzy match

**List fields:**
- Set-based comparison (default); see the sketch after this list
- Items in both (TP), in automated only (FP), in truth only (FN)

**Nested objects:**
- Recursive field-by-field comparison
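
Conceptually, the default set-based comparison for list fields works like this sketch (illustrative, not the script's exact code):

```python
def compare_list_field(automated: list, truth: list) -> dict:
    """Set-based comparison for list fields (the default; order is ignored)."""
    auto_set, truth_set = set(automated), set(truth)
    return {
        "tp": len(auto_set & truth_set),  # items in both
        "fp": len(auto_set - truth_set),  # in automated only
        "fn": len(truth_set - auto_set),  # in ground truth only
    }


compare_list_field(["Apis mellifera"], ["Apis mellifera", "Bombus terrestris"])
# {'tp': 1, 'fp': 0, 'fn': 1}  -- matches the annotation example above
```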

## Interpreting Results

### Validation Report Structure

```
OVERALL METRICS
Papers evaluated: 20
Precision: 87.3%
Recall: 79.2%
F1 Score: 83.1%

METRICS BY FIELD
Field        Precision   Recall   F1
species      95.2%       89.1%    92.0%
location     82.3%       75.4%    78.7%
method       91.0%       68.2%    77.9%

COMMON ISSUES
Fields with low recall (missed information):
- method: 68.2% recall, 12 missed items

Fields with low precision (incorrect extractions):
- location: 82.3% precision, 8 incorrect items
```

### Using Results to Improve

**Low Recall (Missing Information):**
- Review extraction prompt instructions
- Add examples of the missed pattern
- Emphasize completeness in prompt
- Consider using more capable model (Haiku → Sonnet)

**Low Precision (Incorrect Extractions):**
- Add validation rules to prompt
- Provide clearer field definitions
- Add negative examples
- Tighten extraction criteria

**Field-Specific Issues:**
- Identify problematic field types
- Revise schema definitions
- Add field-specific instructions
- Update examples

## Inter-Rater Reliability (Optional)

For critical applications, have multiple annotators:

1. **Split validation set:**
   - 10 papers: Single annotator
   - 10 papers: Both annotators independently

2. **Calculate agreement:**
   ```bash
   python scripts/08_calculate_validation_metrics.py \
       --annotations annotator1.json \
       --compare-with annotator2.json \
       --output agreement_metrics.json
   ```

3. **Resolve disagreements:**
   - Discuss discrepancies
   - Establish interpretation guidelines
   - Re-annotate if needed

## Iterative Improvement Workflow

1. **Baseline:** Run extraction with initial schema
2. **Validate:** Calculate metrics on sample
3. **Analyze:** Identify weak fields and error patterns
4. **Revise:** Update schema, prompts, or model
5. **Re-extract:** Run extraction with improvements
6. **Re-validate:** Calculate new metrics
7. **Compare:** Check if metrics improved
8. **Repeat:** Until acceptable quality achieved

## Reporting Validation in Publications

Include in methods section:

```
Extraction quality was assessed on a stratified random sample of
20 papers. Automated extraction achieved 87.3% precision (95% CI:
81.2-93.4%) and 79.2% recall (95% CI: 72.8-85.6%), with an F1
score of 83.1%. Field-level metrics ranged from 77.9% (method
descriptions) to 92.0% (species names).
```

Consider reporting:
- Sample size and sampling strategy
- Overall precision, recall, F1
- Field-level metrics for key fields
- Confidence intervals
- Common error types

328 skills/extract_from_pdfs/references/workflow_guide.md Normal file
@@ -0,0 +1,328 @@
# Complete Workflow Guide

This guide provides step-by-step instructions for the complete PDF extraction pipeline.

## Overview

The pipeline consists of 6 main steps plus optional validation:

1. **Organize Metadata** - Standardize PDF and metadata organization
2. **Filter Papers** - Identify relevant papers by abstract (optional)
3. **Extract Data** - Extract structured data from PDFs
4. **Repair JSON** - Validate and repair JSON outputs
5. **Validate with APIs** - Enrich with external databases
6. **Export** - Convert to analysis format

**Optional:** Steps 7-9 for quality validation

## Step 1: Organize Metadata

Standardize PDF organization and metadata from various sources.

### From BibTeX (Zotero, JabRef, etc.)

```bash
python scripts/01_organize_metadata.py \
    --source-type bibtex \
    --source path/to/library.bib \
    --pdf-dir path/to/pdfs \
    --organize-pdfs \
    --output metadata.json
```

### From RIS (Mendeley, EndNote, etc.)

```bash
python scripts/01_organize_metadata.py \
    --source-type ris \
    --source path/to/library.ris \
    --pdf-dir path/to/pdfs \
    --organize-pdfs \
    --output metadata.json
```

### From PDF Directory

```bash
python scripts/01_organize_metadata.py \
    --source-type directory \
    --source path/to/pdfs \
    --output metadata.json
```

### From DOI List

```bash
python scripts/01_organize_metadata.py \
    --source-type doi_list \
    --source dois.txt \
    --output metadata.json
```

**Outputs:**
- `metadata.json` - Standardized metadata file
- `organized_pdfs/` - Renamed PDFs (if `--organize-pdfs` used)

## Step 2: Filter Papers (Optional but Recommended)

Filter papers by analyzing abstracts to reduce PDF processing costs.

### Backend Selection

**Option A: Claude Haiku (Fast & Cheap)**
- Cost: ~$0.25 per million input tokens
- Speed: Very fast with batches API
- Accuracy: Good for most filtering tasks

```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-haiku \
    --use-batches \
    --output filtered_papers.json
```

**Option B: Claude Sonnet (More Accurate)**
- Cost: ~$3 per million input tokens
- Speed: Fast with batches API
- Accuracy: Higher for complex criteria

```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend anthropic-sonnet \
    --use-batches \
    --output filtered_papers.json
```

**Option C: Local Ollama (FREE & Private)**
- Cost: $0 (runs locally)
- Speed: Depends on hardware
- Accuracy: Good with llama3.1:8b or better

```bash
python scripts/02_filter_abstracts.py \
    --metadata metadata.json \
    --backend ollama \
    --ollama-model llama3.1:8b \
    --output filtered_papers.json
```

**Before running:** Customize the filtering prompt in `scripts/02_filter_abstracts.py` (line 74) to match your criteria.

**Outputs:**
- `filtered_papers.json` - Papers marked as relevant/irrelevant

## Step 3: Extract Data from PDFs

Extract structured data using Claude's PDF vision capabilities.

### Schema Preparation

1. Copy schema template:
   ```bash
   cp assets/schema_template.json my_schema.json
   ```

2. Customize for your domain:
   - Update `objective` with your extraction goal
   - Define `output_schema` structure
   - Add domain-specific `instructions`
   - Provide an `output_example`

See `assets/example_flower_visitors_schema.json` for a real-world example.

### Run Extraction

```bash
python scripts/03_extract_from_pdfs.py \
    --metadata filtered_papers.json \
    --schema my_schema.json \
    --method batches \
    --output extracted_data.json
```

**Processing methods:**
- `batches` - Most efficient for many PDFs
- `base64` - Sequential processing

**Optional flags:**
- `--filter-results filtered_papers.json` - Only process relevant papers
- `--test` - Process only 3 PDFs for testing
- `--model claude-3-5-sonnet-20241022` - Change model

**Outputs:**
- `extracted_data.json` - Raw extraction results with token counts

## Step 4: Repair and Validate JSON

Repair malformed JSON and validate against schema.

```bash
python scripts/04_repair_json.py \
    --input extracted_data.json \
    --schema my_schema.json \
    --output cleaned_data.json
```

**Optional flags:**
- `--strict` - Reject records that fail validation

**Outputs:**
- `cleaned_data.json` - Repaired and validated extractions

## Step 5: Validate with External APIs

Enrich data using external scientific databases.

### API Configuration

1. Copy API config template:
   ```bash
   cp assets/api_config_template.json my_api_config.json
   ```

2. Map fields to validation APIs:
   - `gbif_taxonomy` - GBIF for biological taxonomy
   - `wfo_plants` - World Flora Online for plant names
   - `geonames` - GeoNames for locations (requires account)
   - `geocode` - OpenStreetMap for geocoding (free)
   - `pubchem` - PubChem for chemical compounds
   - `ncbi_gene` - NCBI Gene database

See `assets/example_api_config_ecology.json` for an ecology example.

### Run Validation

```bash
python scripts/05_validate_with_apis.py \
    --input cleaned_data.json \
    --apis my_api_config.json \
    --output validated_data.json
```

**Optional flags:**
- `--skip-validation` - Skip API calls, only structure data

**Outputs:**
- `validated_data.json` - Data enriched with validated taxonomy, geography, etc.

## Step 6: Export to Analysis Format

Convert to format for your analysis environment.

### Python (pandas)

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format python \
    --flatten \
    --output results
```

Outputs:
- `results.pkl` - pandas DataFrame
- `results.py` - Loading script
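
Loading the exported DataFrame later is a one-liner with pandas:

```python
import pandas as pd

df = pd.read_pickle("results.pkl")  # DataFrame written by the export step
print(df.head())
```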

### R

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format r \
    --flatten \
    --output results
```

Outputs:
- `results.rds` - R data frame
- `results.R` - Loading script

### CSV

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format csv \
    --flatten \
    --output results.csv
```

### Excel

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format excel \
    --flatten \
    --output results.xlsx
```

### SQLite Database

```bash
python scripts/06_export_database.py \
    --input validated_data.json \
    --format sqlite \
    --flatten \
    --output results.db
```

Outputs:
- `results.db` - SQLite database
- `results.sql` - Example queries
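
The database can also be queried from Python's standard library; table names depend on the export, so list them first (illustrative):

```python
import sqlite3

con = sqlite3.connect("results.db")
# Table names depend on the export, so discover them before querying
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)
con.close()
```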

**Flags:**
- `--flatten` - Flatten nested JSON for tabular format
- `--include-metadata` - Include paper metadata in output

## Cost Estimation

### Example: 100 papers, 10 pages each

**With Filtering (Recommended):**
1. Filter (Haiku): 100 abstracts × 500 tokens × $0.25/M = **$0.01**
2. Extract (Sonnet): ~50 relevant papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.76**

**Without Filtering:**
1. Extract (Sonnet): 100 papers × 10 pages × 2,500 tokens × $3/M = **$7.50**

**With Local Ollama:**
1. Filter (Ollama): **$0**
2. Extract (Sonnet): ~50 papers × 10 pages × 2,500 tokens × $3/M = **$3.75**
3. **Total: ~$3.75**

### Token Usage by Step
- Abstract (~200 words): ~500 tokens
- PDF page (text-heavy): ~1,500-3,000 tokens
- Extraction prompt: ~500-1,000 tokens
- Schema/context: ~500-1,000 tokens
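
The arithmetic above reduces to one formula; a small helper (illustrative) makes it easy to re-run with your own numbers:

```python
def input_cost(n_docs: int, tokens_per_doc: int, usd_per_million_tokens: float) -> float:
    """Estimated input-token cost in USD."""
    return n_docs * tokens_per_doc * usd_per_million_tokens / 1_000_000


# The scenarios above:
print(input_cost(100, 500, 0.25))        # 0.0125 -> filtering 100 abstracts with Haiku
print(input_cost(50, 10 * 2_500, 3.0))   # 3.75   -> extracting 50 filtered papers with Sonnet
print(input_cost(100, 10 * 2_500, 3.0))  # 7.5    -> extracting all 100 papers unfiltered
```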

**Tips to reduce costs:**
- Use abstract filtering (Step 2)
- Use Haiku for filtering instead of Sonnet
- Use local Ollama for filtering (free)
- Enable prompt caching with `--use-caching`
- Process in batches with `--use-batches`

## Common Issues

### PDF Not Found
Check PDF paths in metadata.json match actual file locations.

### JSON Parsing Errors
Run Step 4 (repair JSON) - the json_repair library handles most issues.

### API Rate Limits
Scripts include delays, but check specific API documentation for limits.

### Ollama Connection Error
Ensure Ollama server is running: `ollama serve`

## Next Steps

For quality assurance, proceed to the validation workflow to calculate precision and recall metrics.

See: `references/validation_guide.md`