407 lines
8.6 KiB
Markdown
407 lines
8.6 KiB
Markdown
# External API Validation Reference
|
|
|
|
## Overview
|
|
|
|
Step 5 validates and enriches extracted data using external scientific databases. This ensures taxonomic names are standardized, locations are geocoded, and chemical/gene identifiers are canonical.
|
|
|
|
## Available APIs
|
|
|
|
### Biological Taxonomy
|
|
|
|
#### GBIF (Global Biodiversity Information Facility)
|
|
|
|
**Use for:** General biological taxonomy (animals, plants, fungi, etc.)
|
|
|
|
**Function:** `validate_gbif_taxonomy(scientific_name)`
|
|
|
|
**Returns:**
|
|
- Matched canonical name
|
|
- Full scientific name with authority
|
|
- Taxonomic hierarchy (kingdom, phylum, class, order, family, genus)
|
|
- GBIF ID
|
|
- Match confidence and type
|
|
- Taxonomic status
|
|
|
|
**Example:**
|
|
```python
|
|
validate_gbif_taxonomy("Apis melifera")
|
|
# Returns:
|
|
{
|
|
"matched_name": "Apis mellifera",
|
|
"scientific_name": "Apis mellifera Linnaeus, 1758",
|
|
"rank": "SPECIES",
|
|
"kingdom": "Animalia",
|
|
"phylum": "Arthropoda",
|
|
"class": "Insecta",
|
|
"order": "Hymenoptera",
|
|
"family": "Apidae",
|
|
"genus": "Apis",
|
|
"gbif_id": 1340278,
|
|
"confidence": 100,
|
|
"match_type": "EXACT"
|
|
}
|
|
```
|
|
|
|
**No API key required** - Free and unlimited
|
|
|
|
**Documentation:** https://www.gbif.org/developer/species
|
|
|
|
#### World Flora Online (WFO)
|
|
|
|
**Use for:** Plant taxonomy specifically
|
|
|
|
**Function:** `validate_wfo_plant(scientific_name)`
|
|
|
|
**Returns:**
|
|
- Matched name
|
|
- Scientific name with authors
|
|
- Family
|
|
- WFO ID
|
|
- Taxonomic status
|
|
|
|
**Example:**
|
|
```python
|
|
validate_wfo_plant("Magnolia grandiflora")
|
|
# Returns:
|
|
{
|
|
"matched_name": "Magnolia grandiflora",
|
|
"scientific_name": "Magnolia grandiflora L.",
|
|
"authors": "L.",
|
|
"family": "Magnoliaceae",
|
|
"wfo_id": "wfo-0000988234",
|
|
"status": "Accepted"
|
|
}
|
|
```
|
|
|
|
**No API key required** - Free
|
|
|
|
**Documentation:** http://www.worldfloraonline.org/
|
|
|
|
### Geography
|
|
|
|
#### GeoNames
|
|
|
|
**Use for:** Location validation and standardization
|
|
|
|
**Function:** `validate_geonames(location, country=None)`
|
|
|
|
**Returns:**
|
|
- Matched place name
|
|
- Country name and code
|
|
- Administrative divisions (state, province)
|
|
- Latitude/longitude
|
|
- GeoNames ID
|
|
|
|
**Example:**
|
|
```python
|
|
validate_geonames("São Paulo", country="BR")
|
|
# Returns:
|
|
{
|
|
"matched_name": "São Paulo",
|
|
"country": "Brazil",
|
|
"country_code": "BR",
|
|
"admin1": "São Paulo",
|
|
"admin2": None,
|
|
"latitude": "-23.5475",
|
|
"longitude": "-46.63611",
|
|
"geonames_id": 3448439
|
|
}
|
|
```
|
|
|
|
**Requires free account:** Register at https://www.geonames.org/login
|
|
|
|
**Setup:**
|
|
1. Create account
|
|
2. Enable web services in account settings
|
|
3. Set environment variable: `export GEONAMES_USERNAME='your-username'`
|
|
|
|
**Rate limit:** Free tier allows reasonable usage
|
|
|
|
**Documentation:** https://www.geonames.org/export/web-services.html
|
|
|
|
#### OpenStreetMap Nominatim
|
|
|
|
**Use for:** Geocoding addresses to coordinates
|
|
|
|
**Function:** `geocode_location(address)`
|
|
|
|
**Returns:**
|
|
- Display name (formatted address)
|
|
- Latitude/longitude
|
|
- OSM type and ID
|
|
- Place rank
|
|
|
|
**Example:**
|
|
```python
|
|
geocode_location("Field Museum, Chicago, IL")
|
|
# Returns:
|
|
{
|
|
"display_name": "Field Museum, 1400, South Lake Shore Drive, Chicago, Illinois, 60605, United States",
|
|
"latitude": "41.8662",
|
|
"longitude": "-87.6169",
|
|
"osm_type": "way",
|
|
"osm_id": 54856789,
|
|
"place_rank": 30
|
|
}
|
|
```
|
|
|
|
**No API key required** - Free
|
|
|
|
**Important:** Add 1-second delays between requests (implemented in script)
|
|
|
|
**Documentation:** https://nominatim.org/release-docs/latest/api/Overview/
|
|
|
|
### Chemistry
|
|
|
|
#### PubChem
|
|
|
|
**Use for:** Chemical compound validation
|
|
|
|
**Function:** `validate_pubchem_compound(compound_name)`
|
|
|
|
**Returns:**
|
|
- PubChem CID (compound ID)
|
|
- Molecular formula
|
|
- PubChem URL
|
|
|
|
**Example:**
|
|
```python
|
|
validate_pubchem_compound("aspirin")
|
|
# Returns:
|
|
{
|
|
"cid": 2244,
|
|
"molecular_formula": "C9H8O4",
|
|
"pubchem_url": "https://pubchem.ncbi.nlm.nih.gov/compound/2244"
|
|
}
|
|
```
|
|
|
|
**No API key required** - Free
|
|
|
|
**Documentation:** https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
|
|
|
|
### Genetics
|
|
|
|
#### NCBI Gene
|
|
|
|
**Use for:** Gene validation
|
|
|
|
**Function:** `validate_ncbi_gene(gene_symbol, organism=None)`
|
|
|
|
**Returns:**
|
|
- NCBI Gene ID
|
|
- NCBI URL
|
|
|
|
**Example:**
|
|
```python
|
|
validate_ncbi_gene("BRCA1", organism="Homo sapiens")
|
|
# Returns:
|
|
{
|
|
"gene_id": "672",
|
|
"ncbi_url": "https://www.ncbi.nlm.nih.gov/gene/672"
|
|
}
|
|
```
|
|
|
|
**No API key required** - Free
|
|
|
|
**Rate limit:** Max 3 requests/second
|
|
|
|
**Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25500/
|
|
|
|
## Configuration
|
|
|
|
### API Config File Structure
|
|
|
|
Create `my_api_config.json` based on `assets/api_config_template.json`:
|
|
|
|
```json
|
|
{
|
|
"field_mappings": {
|
|
"species": {
|
|
"api": "gbif_taxonomy",
|
|
"output_field": "validated_species",
|
|
"description": "Validate species names against GBIF"
|
|
},
|
|
"location": {
|
|
"api": "geocode",
|
|
"output_field": "coordinates"
|
|
}
|
|
},
|
|
|
|
"nested_field_mappings": {
|
|
"records.plant_species": {
|
|
"api": "wfo_plants",
|
|
"output_field": "validated_plant_taxonomy"
|
|
},
|
|
"records.location": {
|
|
"api": "geocode",
|
|
"output_field": "coordinates"
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
### Field Mapping Parameters
|
|
|
|
**Required:**
|
|
- `api` - API name (see list above)
|
|
- `output_field` - Name for validated data
|
|
|
|
**Optional:**
|
|
- `description` - Documentation
|
|
- `extra_params` - Additional API-specific parameters
|
|
|
|
## Adding Custom APIs
|
|
|
|
To add a new validation API:
|
|
|
|
1. **Create validator function** in `scripts/05_validate_with_apis.py`:
|
|
|
|
```python
|
|
def validate_custom_api(value: str, extra_param: str = None) -> Optional[Dict]:
|
|
"""
|
|
Validate value using custom API.
|
|
|
|
Args:
|
|
value: The value to validate
|
|
extra_param: Optional additional parameter
|
|
|
|
Returns:
|
|
Dictionary with validated data or None if not found
|
|
"""
|
|
try:
|
|
# Make API request
|
|
response = requests.get(f"https://api.example.com/{value}")
|
|
if response.status_code == 200:
|
|
data = response.json()
|
|
return {
|
|
'validated_value': data.get('canonical_name'),
|
|
'api_id': data.get('id'),
|
|
'additional_info': data.get('info')
|
|
}
|
|
except Exception as e:
|
|
print(f"Custom API error: {e}")
|
|
|
|
return None
|
|
```
|
|
|
|
2. **Register in API_VALIDATORS** dictionary:
|
|
|
|
```python
|
|
API_VALIDATORS = {
|
|
'gbif_taxonomy': validate_gbif_taxonomy,
|
|
'wfo_plants': validate_wfo_plant,
|
|
# ... existing validators ...
|
|
'custom_api': validate_custom_api, # Add here
|
|
}
|
|
```
|
|
|
|
3. **Use in config file:**
|
|
|
|
```json
|
|
{
|
|
"field_mappings": {
|
|
"your_field": {
|
|
"api": "custom_api",
|
|
"output_field": "validated_field",
|
|
"extra_params": {
|
|
"extra_param": "value"
|
|
}
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
## Rate Limiting
|
|
|
|
The script implements rate limiting to respect API usage policies:
|
|
|
|
**Default delays (built into script):**
|
|
- GeoNames: 0.5 seconds
|
|
- Nominatim: 1.0 second (required)
|
|
- WFO: 1.0 second
|
|
- Others: 0.5 seconds
|
|
|
|
**Modify delays if needed** in `scripts/05_validate_with_apis.py`:
|
|
|
|
```python
|
|
# In main() function
|
|
if not args.skip_validation:
|
|
time.sleep(0.5) # Adjust this value
|
|
```
|
|
|
|
## Error Handling
|
|
|
|
APIs may fail for various reasons:
|
|
|
|
**Common errors:**
|
|
- Connection timeout
|
|
- Rate limit exceeded
|
|
- Invalid API key
|
|
- Malformed query
|
|
- No match found
|
|
|
|
**Script behavior:**
|
|
- Continues processing on error
|
|
- Logs error to console
|
|
- Sets validated field to None
|
|
- Original extracted value preserved
|
|
|
|
**Retry logic:**
|
|
- 3 retries with exponential backoff
|
|
- Implemented for network errors
|
|
- Not for "no match found" errors
|
|
|
|
## Best Practices
|
|
|
|
1. **Start with test run:**
|
|
```bash
|
|
python scripts/05_validate_with_apis.py \
|
|
--input cleaned_data.json \
|
|
--apis my_api_config.json \
|
|
--skip-validation \
|
|
--output test_structure.json
|
|
```
|
|
|
|
2. **Validate subset first:**
|
|
- Test on 10 papers before full run
|
|
- Verify API connections work
|
|
- Check output structure
|
|
|
|
3. **Monitor API usage:**
|
|
- Track request counts for paid APIs
|
|
- Respect rate limits
|
|
- Consider caching results
|
|
|
|
4. **Handle failures gracefully:**
|
|
- Original data is never lost
|
|
- Can re-run validation separately
|
|
- Manually fix failed validations if needed
|
|
|
|
5. **Optimize API calls:**
|
|
- Only validate fields that need standardization
|
|
- Use cached results when re-running
|
|
- Batch similar queries when possible
|
|
|
|
## Troubleshooting
|
|
|
|
### GeoNames "Service disabled" error
|
|
- Check account email is verified
|
|
- Enable web services in account settings
|
|
- Wait up to 1 hour after enabling
|
|
|
|
### Nominatim rate limit errors
|
|
- Script includes 1-second delays
|
|
- Don't run multiple instances
|
|
- Consider using local Nominatim instance
|
|
|
|
### NCBI errors
|
|
- Reduce request frequency
|
|
- Add longer delays
|
|
- Use E-utilities API key (optional, increases limit)
|
|
|
|
### No matches found
|
|
- Check spelling and formatting
|
|
- Try variations of name
|
|
- Some names may not be in database
|
|
- Consider manual curation for important cases
|