Initial commit
This commit is contained in:
406
skills/extract_from_pdfs/references/api_reference.md
Normal file
406
skills/extract_from_pdfs/references/api_reference.md
Normal file
@@ -0,0 +1,406 @@
|
||||
# External API Validation Reference
|
||||
|
||||
## Overview
|
||||
|
||||
Step 5 validates and enriches extracted data using external scientific databases. This ensures taxonomic names are standardized, locations are geocoded, and chemical/gene identifiers are canonical.
|
||||
|
||||
## Available APIs
|
||||
|
||||
### Biological Taxonomy
|
||||
|
||||
#### GBIF (Global Biodiversity Information Facility)
|
||||
|
||||
**Use for:** General biological taxonomy (animals, plants, fungi, etc.)
|
||||
|
||||
**Function:** `validate_gbif_taxonomy(scientific_name)`
|
||||
|
||||
**Returns:**
|
||||
- Matched canonical name
|
||||
- Full scientific name with authority
|
||||
- Taxonomic hierarchy (kingdom, phylum, class, order, family, genus)
|
||||
- GBIF ID
|
||||
- Match confidence and type
|
||||
- Taxonomic status
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
validate_gbif_taxonomy("Apis melifera")
|
||||
# Returns:
|
||||
{
|
||||
"matched_name": "Apis mellifera",
|
||||
"scientific_name": "Apis mellifera Linnaeus, 1758",
|
||||
"rank": "SPECIES",
|
||||
"kingdom": "Animalia",
|
||||
"phylum": "Arthropoda",
|
||||
"class": "Insecta",
|
||||
"order": "Hymenoptera",
|
||||
"family": "Apidae",
|
||||
"genus": "Apis",
|
||||
"gbif_id": 1340278,
|
||||
"confidence": 100,
|
||||
"match_type": "EXACT"
|
||||
}
|
||||
```
|
||||
|
||||
**No API key required** - Free and unlimited
|
||||
|
||||
**Documentation:** https://www.gbif.org/developer/species
|
||||
|
||||
#### World Flora Online (WFO)
|
||||
|
||||
**Use for:** Plant taxonomy specifically
|
||||
|
||||
**Function:** `validate_wfo_plant(scientific_name)`
|
||||
|
||||
**Returns:**
|
||||
- Matched name
|
||||
- Scientific name with authors
|
||||
- Family
|
||||
- WFO ID
|
||||
- Taxonomic status
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
validate_wfo_plant("Magnolia grandiflora")
|
||||
# Returns:
|
||||
{
|
||||
"matched_name": "Magnolia grandiflora",
|
||||
"scientific_name": "Magnolia grandiflora L.",
|
||||
"authors": "L.",
|
||||
"family": "Magnoliaceae",
|
||||
"wfo_id": "wfo-0000988234",
|
||||
"status": "Accepted"
|
||||
}
|
||||
```
|
||||
|
||||
**No API key required** - Free
|
||||
|
||||
**Documentation:** http://www.worldfloraonline.org/
|
||||
|
||||
### Geography
|
||||
|
||||
#### GeoNames
|
||||
|
||||
**Use for:** Location validation and standardization
|
||||
|
||||
**Function:** `validate_geonames(location, country=None)`
|
||||
|
||||
**Returns:**
|
||||
- Matched place name
|
||||
- Country name and code
|
||||
- Administrative divisions (state, province)
|
||||
- Latitude/longitude
|
||||
- GeoNames ID
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
validate_geonames("São Paulo", country="BR")
|
||||
# Returns:
|
||||
{
|
||||
"matched_name": "São Paulo",
|
||||
"country": "Brazil",
|
||||
"country_code": "BR",
|
||||
"admin1": "São Paulo",
|
||||
"admin2": None,
|
||||
"latitude": "-23.5475",
|
||||
"longitude": "-46.63611",
|
||||
"geonames_id": 3448439
|
||||
}
|
||||
```
|
||||
|
||||
**Requires free account:** Register at https://www.geonames.org/login
|
||||
|
||||
**Setup:**
|
||||
1. Create account
|
||||
2. Enable web services in account settings
|
||||
3. Set environment variable: `export GEONAMES_USERNAME='your-username'`
|
||||
|
||||
**Rate limit:** Free tier allows reasonable usage
|
||||
|
||||
**Documentation:** https://www.geonames.org/export/web-services.html
|
||||
|
||||
#### OpenStreetMap Nominatim
|
||||
|
||||
**Use for:** Geocoding addresses to coordinates
|
||||
|
||||
**Function:** `geocode_location(address)`
|
||||
|
||||
**Returns:**
|
||||
- Display name (formatted address)
|
||||
- Latitude/longitude
|
||||
- OSM type and ID
|
||||
- Place rank
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
geocode_location("Field Museum, Chicago, IL")
|
||||
# Returns:
|
||||
{
|
||||
"display_name": "Field Museum, 1400, South Lake Shore Drive, Chicago, Illinois, 60605, United States",
|
||||
"latitude": "41.8662",
|
||||
"longitude": "-87.6169",
|
||||
"osm_type": "way",
|
||||
"osm_id": 54856789,
|
||||
"place_rank": 30
|
||||
}
|
||||
```
|
||||
|
||||
**No API key required** - Free
|
||||
|
||||
**Important:** Add 1-second delays between requests (implemented in script)
|
||||
|
||||
**Documentation:** https://nominatim.org/release-docs/latest/api/Overview/
|
||||
|
||||
### Chemistry
|
||||
|
||||
#### PubChem
|
||||
|
||||
**Use for:** Chemical compound validation
|
||||
|
||||
**Function:** `validate_pubchem_compound(compound_name)`
|
||||
|
||||
**Returns:**
|
||||
- PubChem CID (compound ID)
|
||||
- Molecular formula
|
||||
- PubChem URL
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
validate_pubchem_compound("aspirin")
|
||||
# Returns:
|
||||
{
|
||||
"cid": 2244,
|
||||
"molecular_formula": "C9H8O4",
|
||||
"pubchem_url": "https://pubchem.ncbi.nlm.nih.gov/compound/2244"
|
||||
}
|
||||
```
|
||||
|
||||
**No API key required** - Free
|
||||
|
||||
**Documentation:** https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
|
||||
|
||||
### Genetics
|
||||
|
||||
#### NCBI Gene
|
||||
|
||||
**Use for:** Gene validation
|
||||
|
||||
**Function:** `validate_ncbi_gene(gene_symbol, organism=None)`
|
||||
|
||||
**Returns:**
|
||||
- NCBI Gene ID
|
||||
- NCBI URL
|
||||
|
||||
**Example:**
|
||||
```python
|
||||
validate_ncbi_gene("BRCA1", organism="Homo sapiens")
|
||||
# Returns:
|
||||
{
|
||||
"gene_id": "672",
|
||||
"ncbi_url": "https://www.ncbi.nlm.nih.gov/gene/672"
|
||||
}
|
||||
```
|
||||
|
||||
**No API key required** - Free
|
||||
|
||||
**Rate limit:** Max 3 requests/second
|
||||
|
||||
**Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25500/
|
||||
|
||||
## Configuration
|
||||
|
||||
### API Config File Structure
|
||||
|
||||
Create `my_api_config.json` based on `assets/api_config_template.json`:
|
||||
|
||||
```json
|
||||
{
|
||||
"field_mappings": {
|
||||
"species": {
|
||||
"api": "gbif_taxonomy",
|
||||
"output_field": "validated_species",
|
||||
"description": "Validate species names against GBIF"
|
||||
},
|
||||
"location": {
|
||||
"api": "geocode",
|
||||
"output_field": "coordinates"
|
||||
}
|
||||
},
|
||||
|
||||
"nested_field_mappings": {
|
||||
"records.plant_species": {
|
||||
"api": "wfo_plants",
|
||||
"output_field": "validated_plant_taxonomy"
|
||||
},
|
||||
"records.location": {
|
||||
"api": "geocode",
|
||||
"output_field": "coordinates"
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
### Field Mapping Parameters
|
||||
|
||||
**Required:**
|
||||
- `api` - API name (see list above)
|
||||
- `output_field` - Name for validated data
|
||||
|
||||
**Optional:**
|
||||
- `description` - Documentation
|
||||
- `extra_params` - Additional API-specific parameters
|
||||
|
||||
## Adding Custom APIs
|
||||
|
||||
To add a new validation API:
|
||||
|
||||
1. **Create validator function** in `scripts/05_validate_with_apis.py`:
|
||||
|
||||
```python
|
||||
def validate_custom_api(value: str, extra_param: str = None) -> Optional[Dict]:
|
||||
"""
|
||||
Validate value using custom API.
|
||||
|
||||
Args:
|
||||
value: The value to validate
|
||||
extra_param: Optional additional parameter
|
||||
|
||||
Returns:
|
||||
Dictionary with validated data or None if not found
|
||||
"""
|
||||
try:
|
||||
# Make API request
|
||||
response = requests.get(f"https://api.example.com/{value}")
|
||||
if response.status_code == 200:
|
||||
data = response.json()
|
||||
return {
|
||||
'validated_value': data.get('canonical_name'),
|
||||
'api_id': data.get('id'),
|
||||
'additional_info': data.get('info')
|
||||
}
|
||||
except Exception as e:
|
||||
print(f"Custom API error: {e}")
|
||||
|
||||
return None
|
||||
```
|
||||
|
||||
2. **Register in API_VALIDATORS** dictionary:
|
||||
|
||||
```python
|
||||
API_VALIDATORS = {
|
||||
'gbif_taxonomy': validate_gbif_taxonomy,
|
||||
'wfo_plants': validate_wfo_plant,
|
||||
# ... existing validators ...
|
||||
'custom_api': validate_custom_api, # Add here
|
||||
}
|
||||
```
|
||||
|
||||
3. **Use in config file:**
|
||||
|
||||
```json
|
||||
{
|
||||
"field_mappings": {
|
||||
"your_field": {
|
||||
"api": "custom_api",
|
||||
"output_field": "validated_field",
|
||||
"extra_params": {
|
||||
"extra_param": "value"
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
## Rate Limiting
|
||||
|
||||
The script implements rate limiting to respect API usage policies:
|
||||
|
||||
**Default delays (built into script):**
|
||||
- GeoNames: 0.5 seconds
|
||||
- Nominatim: 1.0 second (required)
|
||||
- WFO: 1.0 second
|
||||
- Others: 0.5 seconds
|
||||
|
||||
**Modify delays if needed** in `scripts/05_validate_with_apis.py`:
|
||||
|
||||
```python
|
||||
# In main() function
|
||||
if not args.skip_validation:
|
||||
time.sleep(0.5) # Adjust this value
|
||||
```
|
||||
|
||||
## Error Handling
|
||||
|
||||
APIs may fail for various reasons:
|
||||
|
||||
**Common errors:**
|
||||
- Connection timeout
|
||||
- Rate limit exceeded
|
||||
- Invalid API key
|
||||
- Malformed query
|
||||
- No match found
|
||||
|
||||
**Script behavior:**
|
||||
- Continues processing on error
|
||||
- Logs error to console
|
||||
- Sets validated field to None
|
||||
- Original extracted value preserved
|
||||
|
||||
**Retry logic:**
|
||||
- 3 retries with exponential backoff
|
||||
- Implemented for network errors
|
||||
- Not for "no match found" errors
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Start with test run:**
|
||||
```bash
|
||||
python scripts/05_validate_with_apis.py \
|
||||
--input cleaned_data.json \
|
||||
--apis my_api_config.json \
|
||||
--skip-validation \
|
||||
--output test_structure.json
|
||||
```
|
||||
|
||||
2. **Validate subset first:**
|
||||
- Test on 10 papers before full run
|
||||
- Verify API connections work
|
||||
- Check output structure
|
||||
|
||||
3. **Monitor API usage:**
|
||||
- Track request counts for paid APIs
|
||||
- Respect rate limits
|
||||
- Consider caching results
|
||||
|
||||
4. **Handle failures gracefully:**
|
||||
- Original data is never lost
|
||||
- Can re-run validation separately
|
||||
- Manually fix failed validations if needed
|
||||
|
||||
5. **Optimize API calls:**
|
||||
- Only validate fields that need standardization
|
||||
- Use cached results when re-running
|
||||
- Batch similar queries when possible
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### GeoNames "Service disabled" error
|
||||
- Check account email is verified
|
||||
- Enable web services in account settings
|
||||
- Wait up to 1 hour after enabling
|
||||
|
||||
### Nominatim rate limit errors
|
||||
- Script includes 1-second delays
|
||||
- Don't run multiple instances
|
||||
- Consider using local Nominatim instance
|
||||
|
||||
### NCBI errors
|
||||
- Reduce request frequency
|
||||
- Add longer delays
|
||||
- Use E-utilities API key (optional, increases limit)
|
||||
|
||||
### No matches found
|
||||
- Check spelling and formatting
|
||||
- Try variations of name
|
||||
- Some names may not be in database
|
||||
- Consider manual curation for important cases
|
||||
Reference in New Issue
Block a user