External API Validation Reference
Overview
Step 5 validates and enriches extracted data using external scientific databases. This ensures taxonomic names are standardized, locations are geocoded, and chemical/gene identifiers are canonical.
Available APIs
Biological Taxonomy
GBIF (Global Biodiversity Information Facility)
Use for: General biological taxonomy (animals, plants, fungi, etc.)
Function: validate_gbif_taxonomy(scientific_name)
Returns:
- Matched canonical name
- Full scientific name with authority
- Taxonomic hierarchy (kingdom, phylum, class, order, family, genus)
- GBIF ID
- Match confidence and type
- Taxonomic status
Example:
validate_gbif_taxonomy("Apis melifera")
# Returns:
{
"matched_name": "Apis mellifera",
"scientific_name": "Apis mellifera Linnaeus, 1758",
"rank": "SPECIES",
"kingdom": "Animalia",
"phylum": "Arthropoda",
"class": "Insecta",
"order": "Hymenoptera",
"family": "Apidae",
"genus": "Apis",
"gbif_id": 1340278,
"confidence": 100,
"match_type": "EXACT"
}
No API key required - Free and unlimited
Documentation: https://www.gbif.org/developer/species
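To illustrate where the returned fields could come from, the sketch below maps a response from GBIF's public species-match endpoint (/v1/species/match) onto the dictionary shape shown above. The helper names parse_gbif_match and fetch_gbif_match are hypothetical, and the real validate_gbif_taxonomy in the script may be implemented differently.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, Optional

GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def parse_gbif_match(payload: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map a GBIF match response onto the documented output shape (assumed mapping)."""
    # GBIF reports matchType "NONE" when it finds no plausible match.
    if payload.get("matchType", "NONE") == "NONE":
        return None
    result = {
        "matched_name": payload.get("canonicalName"),
        "scientific_name": payload.get("scientificName"),
        "rank": payload.get("rank"),
        "gbif_id": payload.get("usageKey"),
        "confidence": payload.get("confidence"),
        "match_type": payload.get("matchType"),
        "status": payload.get("status"),
    }
    # The hierarchy levels use the same key names in the response and the output.
    for level in ("kingdom", "phylum", "class", "order", "family", "genus"):
        result[level] = payload.get(level)
    return result


def fetch_gbif_match(scientific_name: str, timeout: float = 10.0) -> Optional[Dict[str, Any]]:
    """Query GBIF over the network and parse the response."""
    query = urllib.parse.urlencode({"name": scientific_name})
    with urllib.request.urlopen(f"{GBIF_MATCH_URL}?{query}", timeout=timeout) as resp:
        return parse_gbif_match(json.load(resp))
```

Keeping the parsing step separate from the network call makes the mapping easy to check against a saved response without contacting GBIF.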
World Flora Online (WFO)
Use for: Plant taxonomy specifically
Function: validate_wfo_plant(scientific_name)
Returns:
- Matched name
- Scientific name with authors
- Family
- WFO ID
- Taxonomic status
Example:
validate_wfo_plant("Magnolia grandiflora")
# Returns:
{
"matched_name": "Magnolia grandiflora",
"scientific_name": "Magnolia grandiflora L.",
"authors": "L.",
"family": "Magnoliaceae",
"wfo_id": "wfo-0000988234",
"status": "Accepted"
}
No API key required - Free
Documentation: http://www.worldfloraonline.org/
Geography
GeoNames
Use for: Location validation and standardization
Function: validate_geonames(location, country=None)
Returns:
- Matched place name
- Country name and code
- Administrative divisions (state, province)
- Latitude/longitude
- GeoNames ID
Example:
validate_geonames("São Paulo", country="BR")
# Returns:
{
"matched_name": "São Paulo",
"country": "Brazil",
"country_code": "BR",
"admin1": "São Paulo",
"admin2": None,
"latitude": "-23.5475",
"longitude": "-46.63611",
"geonames_id": 3448439
}
Requires free account: Register at https://www.geonames.org/login
Setup:
- Create account
- Enable web services in account settings
- Set environment variable:
export GEONAMES_USERNAME='your-username'
Rate limit: The free tier allows up to 20,000 credits per day and 1,000 credits per hour per username
Documentation: https://www.geonames.org/export/web-services.html
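The setup steps above matter because every GeoNames request carries the account username. As a minimal sketch, assuming the script calls the searchJSON web service, the request URL could be built as follows. build_geonames_url is a hypothetical helper, and the env parameter exists only to make the example testable without touching the real environment.

```python
import os
import urllib.parse
from typing import Mapping, Optional

GEONAMES_SEARCH_URL = "http://api.geonames.org/searchJSON"


def build_geonames_url(
    location: str,
    country: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Build a searchJSON URL, failing early if the username is missing."""
    env = os.environ if env is None else env
    username = env.get("GEONAMES_USERNAME")
    if not username:
        raise RuntimeError("GEONAMES_USERNAME is not set")
    # maxRows=1 asks for the single best match only.
    params = {"q": location, "maxRows": 1, "username": username}
    if country:
        params["country"] = country  # ISO 3166 two-letter code, e.g. "BR"
    return f"{GEONAMES_SEARCH_URL}?{urllib.parse.urlencode(params)}"
```

Failing before the request is sent gives a clearer message than the service's own authentication error.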
OpenStreetMap Nominatim
Use for: Geocoding addresses to coordinates
Function: geocode_location(address)
Returns:
- Display name (formatted address)
- Latitude/longitude
- OSM type and ID
- Place rank
Example:
geocode_location("Field Museum, Chicago, IL")
# Returns:
{
"display_name": "Field Museum, 1400, South Lake Shore Drive, Chicago, Illinois, 60605, United States",
"latitude": "41.8662",
"longitude": "-87.6169",
"osm_type": "way",
"osm_id": 54856789,
"place_rank": 30
}
No API key required - Free
Important: The public Nominatim server allows at most 1 request per second, so keep the 1-second delay between requests (implemented in script)
Documentation: https://nominatim.org/release-docs/latest/api/Overview/
Chemistry
PubChem
Use for: Chemical compound validation
Function: validate_pubchem_compound(compound_name)
Returns:
- PubChem CID (compound ID)
- Molecular formula
- PubChem URL
Example:
validate_pubchem_compound("aspirin")
# Returns:
{
"cid": 2244,
"molecular_formula": "C9H8O4",
"pubchem_url": "https://pubchem.ncbi.nlm.nih.gov/compound/2244"
}
No API key required - Free
Documentation: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
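For orientation, the sketch below shows one way a compound name could be resolved through PUG REST's name-to-property lookup and turned into the dictionary shown above. The helper names are hypothetical and the script's actual request may use a different PUG REST operation.

```python
import urllib.parse
from typing import Any, Dict, Optional

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def pubchem_property_url(compound_name: str) -> str:
    """URL that asks PUG REST for the molecular formula of a named compound."""
    # Percent-encode everything so spaces and slashes in names stay in one path segment.
    name = urllib.parse.quote(compound_name, safe="")
    return f"{PUG_REST}/compound/name/{name}/property/MolecularFormula/JSON"


def parse_pubchem_properties(payload: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Take the first entry of the returned property table, or None if it is empty."""
    props = payload.get("PropertyTable", {}).get("Properties", [])
    if not props:
        return None
    first = props[0]
    cid = first.get("CID")
    return {
        "cid": cid,
        "molecular_formula": first.get("MolecularFormula"),
        "pubchem_url": f"https://pubchem.ncbi.nlm.nih.gov/compound/{cid}",
    }
```

A name can map to several compounds; this sketch simply takes the first, which is an assumption rather than documented script behavior.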
Genetics
NCBI Gene
Use for: Gene validation
Function: validate_ncbi_gene(gene_symbol, organism=None)
Returns:
- NCBI Gene ID
- NCBI URL
Example:
validate_ncbi_gene("BRCA1", organism="Homo sapiens")
# Returns:
{
"gene_id": "672",
"ncbi_url": "https://www.ncbi.nlm.nih.gov/gene/672"
}
No API key required - Free
Rate limit: Max 3 requests/second without an API key (10 requests/second with one)
Documentation: https://www.ncbi.nlm.nih.gov/books/NBK25500/
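As a rough sketch of how a gene symbol and organism could be turned into an E-utilities query, the example below builds an esearch request against the gene database and reads the first ID from the JSON response. The helper names are hypothetical, and the choice of the [Gene Name] and [Organism] field tags is an assumption about how the script phrases its search.

```python
import urllib.parse
from typing import Any, Dict, Optional

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_term(gene_symbol: str, organism: Optional[str] = None) -> str:
    """Compose an Entrez search term, optionally restricted to one organism."""
    term = f"{gene_symbol}[Gene Name]"
    if organism:
        term += f" AND {organism}[Organism]"
    return term


def esearch_url(term: str, api_key: Optional[str] = None) -> str:
    """Build the esearch URL; an API key is optional and raises the rate limit."""
    params = {"db": "gene", "term": term, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    return f"{ESEARCH_URL}?{urllib.parse.urlencode(params)}"


def parse_esearch(payload: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Return the first matching gene ID with its URL, or None when nothing matched."""
    ids = payload.get("esearchresult", {}).get("idlist", [])
    if not ids:
        return None
    gene_id = ids[0]
    return {"gene_id": gene_id, "ncbi_url": f"https://www.ncbi.nlm.nih.gov/gene/{gene_id}"}
```

Restricting by organism matters because the same symbol is often used for orthologs in many species.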
Configuration
API Config File Structure
Create my_api_config.json based on assets/api_config_template.json:
{
  "field_mappings": {
    "species": {
      "api": "gbif_taxonomy",
      "output_field": "validated_species",
      "description": "Validate species names against GBIF"
    },
    "location": {
      "api": "geocode",
      "output_field": "coordinates"
    }
  },
  "nested_field_mappings": {
    "records.plant_species": {
      "api": "wfo_plants",
      "output_field": "validated_plant_taxonomy"
    },
    "records.location": {
      "api": "geocode",
      "output_field": "coordinates"
    }
  }
}
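The keys under nested_field_mappings are dotted paths such as records.plant_species. The sketch below shows one plausible reading of such a path, assuming that records is a list of dictionaries and that the validated value is written next to the source field. iter_parents and apply_nested are hypothetical helpers; the script's actual traversal may differ.

```python
from typing import Any, Callable, Dict, Iterator, List


def iter_parents(data: Any, path_parts: List[str]) -> Iterator[Dict[str, Any]]:
    """Yield every dict that directly holds the final key of a dotted path."""
    if isinstance(data, list):
        # Lists are traversed element by element without consuming a path part.
        for item in data:
            yield from iter_parents(item, path_parts)
    elif isinstance(data, dict):
        if len(path_parts) == 1:
            if path_parts[0] in data:
                yield data
        else:
            yield from iter_parents(data.get(path_parts[0]), path_parts[1:])


def apply_nested(
    paper: Dict[str, Any],
    dotted_path: str,
    output_field: str,
    validator: Callable[[Any], Any],
) -> None:
    """Validate each value found at dotted_path and store it beside the original."""
    parts = dotted_path.split(".")
    for parent in iter_parents(paper, parts):
        parent[output_field] = validator(parent[parts[-1]])
```

Entries that lack the field are skipped, so a record without plant_species is left unchanged.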
Field Mapping Parameters
Required:
- api - API name (see list above)
- output_field - Name for validated data
Optional:
- description - Documentation
- extra_params - Additional API-specific parameters
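A config file can be checked against these required and optional keys before any API is called. The following is a minimal sketch of such a check; check_api_config is a hypothetical helper, not part of the script, and the list of known API names has to be supplied by the caller.

```python
from typing import Any, Dict, Iterable, List

REQUIRED_KEYS = ("api", "output_field")
OPTIONAL_KEYS = ("description", "extra_params")


def check_api_config(config: Dict[str, Any], known_apis: Iterable[str]) -> List[str]:
    """Return human-readable problems found in the config (empty list if none)."""
    known = set(known_apis)
    allowed = set(REQUIRED_KEYS) | set(OPTIONAL_KEYS)
    problems: List[str] = []
    for section in ("field_mappings", "nested_field_mappings"):
        for field, mapping in config.get(section, {}).items():
            where = f"{section}.{field}"
            for key in REQUIRED_KEYS:
                if key not in mapping:
                    problems.append(f"{where}: missing required key '{key}'")
            if "api" in mapping and mapping["api"] not in known:
                problems.append(f"{where}: unknown api '{mapping['api']}'")
            # Flag misspelled or unsupported keys instead of silently ignoring them.
            for key in sorted(set(mapping) - allowed):
                problems.append(f"{where}: unrecognized key '{key}'")
    return problems
```

Running a check like this first surfaces typos such as a misspelled API name before a long validation run starts.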
Adding Custom APIs
To add a new validation API:
- Create validator function in scripts/05_validate_with_apis.py:
import requests
from typing import Dict, Optional
def validate_custom_api(value: str, extra_param: Optional[str] = None) -> Optional[Dict]:
    """
    Validate value using custom API.
    Args:
        value: The value to validate
        extra_param: Optional additional parameter
    Returns:
        Dictionary with validated data or None if not found
    """
    try:
        # Make API request; the timeout keeps a stalled server from hanging the run
        response = requests.get(f"https://api.example.com/{value}", timeout=10)
        if response.status_code == 200:
            data = response.json()
            return {
                'validated_value': data.get('canonical_name'),
                'api_id': data.get('id'),
                'additional_info': data.get('info')
            }
    except Exception as e:
        print(f"Custom API error: {e}")
    # Reached on a non-200 response or after an error
    return None
- Register in API_VALIDATORS dictionary:
API_VALIDATORS = {
    'gbif_taxonomy': validate_gbif_taxonomy,
    'wfo_plants': validate_wfo_plant,
    # ... existing validators ...
    'custom_api': validate_custom_api,  # Add here
}
- Use in config file:
{
  "field_mappings": {
    "your_field": {
      "api": "custom_api",
      "output_field": "validated_field",
      "extra_params": {
        "extra_param": "value"
      }
    }
  }
}
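The three steps above fit together roughly as follows: the api name selects a function from API_VALIDATORS, and extra_params is passed through as keyword arguments. This is a sketch under assumptions. apply_field_mapping is a hypothetical helper, and the validator here is an offline stand-in so the example runs without a network.

```python
from typing import Any, Callable, Dict, Optional


def validate_custom_api(value: str, extra_param: Optional[str] = None) -> Optional[Dict]:
    """Offline stand-in for the network-backed validator (illustration only)."""
    return {"validated_value": value.strip().lower(), "extra_param": extra_param}


API_VALIDATORS: Dict[str, Callable[..., Optional[Dict]]] = {
    "custom_api": validate_custom_api,
}


def apply_field_mapping(record: Dict[str, Any], field: str, mapping: Dict[str, Any]) -> Dict[str, Any]:
    """Run the mapped validator on one field and store the result in output_field."""
    validator = API_VALIDATORS[mapping["api"]]
    value = record.get(field)
    extra = mapping.get("extra_params", {})
    # Missing or empty values are not sent to the API; the output is simply None.
    record[mapping["output_field"]] = validator(value, **extra) if value else None
    return record


record = apply_field_mapping(
    {"your_field": "  Some Value "},
    "your_field",
    {"api": "custom_api", "output_field": "validated_field", "extra_params": {"extra_param": "value"}},
)
print(record["validated_field"])
```

Because extra_params is unpacked as keyword arguments, its keys need to match the validator's parameter names exactly.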
Rate Limiting
The script implements rate limiting to respect API usage policies:
Default delays (built into script):
- GeoNames: 0.5 seconds
- Nominatim: 1.0 second (required)
- WFO: 1.0 second
- Others: 0.5 seconds
Modify delays if needed in scripts/05_validate_with_apis.py:
# In main() function
if not args.skip_validation:
    time.sleep(0.5)  # Adjust this value
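If a single fixed sleep is too coarse, the per-API delays listed above can be enforced separately for each service. The following is a sketch, not the script's implementation. The Throttle class is hypothetical, and the dictionary keys are assumed API names ("geonames" in particular does not appear in the config examples).

```python
import time
from typing import Callable, Dict, Optional

# Delays in seconds, taken from the defaults listed above; key names are assumed.
API_DELAYS: Dict[str, float] = {"geonames": 0.5, "geocode": 1.0, "wfo_plants": 1.0}
DEFAULT_DELAY = 0.5


class Throttle:
    """Enforce a minimum interval between consecutive calls to the same API."""

    def __init__(
        self,
        delays: Dict[str, float],
        default: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._delays = delays
        self._default = default
        self._clock = clock
        self._sleep = sleep
        self._last: Dict[str, float] = {}

    def wait(self, api: str) -> None:
        """Block until the delay for this API has passed since its previous call."""
        delay = self._delays.get(api, self._default)
        last: Optional[float] = self._last.get(api)
        if last is not None:
            remaining = delay - (self._clock() - last)
            if remaining > 0:
                self._sleep(remaining)
        self._last[api] = self._clock()
```

Calling wait("geocode") before each Nominatim request would then sleep only for the part of the second that has not already elapsed.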
Error Handling
APIs may fail for various reasons:
Common errors:
- Connection timeout
- Rate limit exceeded
- Invalid API key
- Malformed query
- No match found
Script behavior:
- Continues processing on error
- Logs error to console
- Sets validated field to None
- Original extracted value preserved
Retry logic:
- 3 retries with exponential backoff
- Implemented for network errors
- Not for "no match found" errors
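The retry rules above can be sketched as follows: network errors are retried up to three times with doubling delays, while a None result (no match) is returned immediately. call_with_retries and the one-second base delay are assumptions made for illustration. With the requests library, the caught exceptions would be requests.exceptions.ConnectionError and requests.exceptions.Timeout rather than the built-in ones used here.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Built-in exception types stand in for the HTTP client's network errors.
NETWORK_ERRORS = (ConnectionError, TimeoutError)


def call_with_retries(
    func: Callable[[], Optional[T]],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call func, retrying on network errors with delays of 1, 2, 4, ... times base_delay."""
    for attempt in range(retries + 1):
        try:
            # Any returned value, including None for "no match", ends the loop.
            return func()
        except NETWORK_ERRORS:
            if attempt == retries:
                raise  # retries exhausted; let the caller log the error and move on
            sleep(base_delay * 2 ** attempt)
    return None  # not reached; keeps the return type explicit
```

Treating "no match" as a normal result avoids spending three extra requests on a name the database simply does not contain.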
Best Practices
- Start with test run:
  python scripts/05_validate_with_apis.py \
    --input cleaned_data.json \
    --apis my_api_config.json \
    --skip-validation \
    --output test_structure.json
- Validate subset first:
  - Test on 10 papers before full run
  - Verify API connections work
  - Check output structure
- Monitor API usage:
  - Track request counts for paid APIs
  - Respect rate limits
  - Consider caching results
- Handle failures gracefully:
  - Original data is never lost
  - Can re-run validation separately
  - Manually fix failed validations if needed
- Optimize API calls:
  - Only validate fields that need standardization
  - Use cached results when re-running
  - Batch similar queries when possible
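The caching advice above can be sketched with a small JSON-backed cache keyed by API name and input value. JsonCache is a hypothetical helper and is not part of the script. Note that this sketch also stores None results, so names with no match are not queried again on a re-run.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Union


class JsonCache:
    """Remember validator results on disk so re-runs skip repeated API calls."""

    def __init__(self, path: Union[str, Path]) -> None:
        self.path = Path(path)
        self.data: Dict[str, Any] = {}
        if self.path.exists():
            self.data = json.loads(self.path.read_text(encoding="utf-8"))

    def get_or_call(self, api: str, value: str, validator: Callable[[str], Any]) -> Any:
        """Return the cached result, calling the validator only on a cache miss."""
        key = f"{api}::{value}"
        if key not in self.data:
            self.data[key] = validator(value)
        return self.data[key]

    def save(self) -> None:
        """Write the cache back to disk; results must be JSON-serializable."""
        self.path.write_text(
            json.dumps(self.data, ensure_ascii=False, indent=2), encoding="utf-8"
        )
```

A typical use would be to create one cache per run, wrap each validator call in get_or_call, and call save() once at the end.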
Troubleshooting
GeoNames "Service disabled" error
- Check account email is verified
- Enable web services in account settings
- Wait up to 1 hour after enabling
Nominatim rate limit errors
- Script includes 1-second delays
- Don't run multiple instances
- Consider using local Nominatim instance
NCBI errors
- Reduce request frequency
- Add longer delays
- Use E-utilities API key (optional, increases limit)
No matches found
- Check spelling and formatting
- Try variations of name
- Some names may not be in database
- Consider manual curation for important cases