# External API Validation Reference
## Overview
Step 5 validates and enriches extracted data using external scientific databases. This ensures taxonomic names are standardized, locations are geocoded, and chemical/gene identifiers are canonical.
## Available APIs
### Biological Taxonomy
#### GBIF (Global Biodiversity Information Facility)
**Use for:** General biological taxonomy (animals, plants, fungi, etc.)
**Function:** `validate_gbif_taxonomy(scientific_name)`
**Returns:**
- Matched canonical name
- Full scientific name with authority
- Taxonomic hierarchy (kingdom, phylum, class, order, family, genus)
- GBIF ID
- Match confidence and type
- Taxonomic status
**Example:**
```python
validate_gbif_taxonomy("Apis mellifera")
# Returns (a misspelled query such as "Apis melifera" typically resolves to the
# same taxon, but GBIF then reports match_type "FUZZY" and a lower confidence):
{
    "matched_name": "Apis mellifera",
    "scientific_name": "Apis mellifera Linnaeus, 1758",
    "rank": "SPECIES",
    "kingdom": "Animalia",
    "phylum": "Arthropoda",
    "class": "Insecta",
    "order": "Hymenoptera",
    "family": "Apidae",
    "genus": "Apis",
    "gbif_id": 1340278,
    "confidence": 100,
    "match_type": "EXACT"
}
```
**No API key required** - Free to use, though not strictly unlimited: GBIF may throttle very heavy request volumes
**Documentation:** https://www.gbif.org/developer/species
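To show where the returned fields could come from, here is a minimal sketch that maps a response from GBIF's species match endpoint (`/v1/species/match`) onto the dictionary above. The helper names `build_gbif_match_url` and `parse_gbif_match` are hypothetical; the actual `validate_gbif_taxonomy` in the script may be written differently.

```python
from typing import Any, Dict, Optional
from urllib.parse import urlencode

GBIF_MATCH_URL = "https://api.gbif.org/v1/species/match"


def build_gbif_match_url(scientific_name: str) -> str:
    """Build the species-match request URL for a name (hypothetical helper)."""
    return f"{GBIF_MATCH_URL}?{urlencode({'name': scientific_name})}"


def parse_gbif_match(payload: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map a GBIF match response onto the fields listed above.

    Returns None when GBIF reports no match (matchType == "NONE").
    """
    if payload.get("matchType", "NONE") == "NONE":
        return None
    result = {
        "matched_name": payload.get("canonicalName"),
        "scientific_name": payload.get("scientificName"),
        "rank": payload.get("rank"),
        "gbif_id": payload.get("usageKey"),
        "confidence": payload.get("confidence"),
        "match_type": payload.get("matchType"),
        "status": payload.get("status"),
    }
    # Copy the taxonomic hierarchy; missing levels simply become None.
    for level in ("kingdom", "phylum", "class", "order", "family", "genus"):
        result[level] = payload.get(level)
    return result
```

Keeping URL construction and response parsing separate from the HTTP call makes both easy to check without network access.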
#### World Flora Online (WFO)
**Use for:** Plant taxonomy specifically
**Function:** `validate_wfo_plant(scientific_name)`
**Returns:**
- Matched name
- Scientific name with authors
- Family
- WFO ID
- Taxonomic status
**Example:**
```python
validate_wfo_plant("Magnolia grandiflora")
# Returns:
{
    "matched_name": "Magnolia grandiflora",
    "scientific_name": "Magnolia grandiflora L.",
    "authors": "L.",
    "family": "Magnoliaceae",
    "wfo_id": "wfo-0000988234",
    "status": "Accepted"
}
```
**No API key required** - Free
**Documentation:** http://www.worldfloraonline.org/
### Geography
#### GeoNames
**Use for:** Location validation and standardization
**Function:** `validate_geonames(location, country=None)`
**Returns:**
- Matched place name
- Country name and code
- Administrative divisions (state, province)
- Latitude/longitude
- GeoNames ID
**Example:**
```python
validate_geonames("São Paulo", country="BR")
# Returns:
{
    "matched_name": "São Paulo",
    "country": "Brazil",
    "country_code": "BR",
    "admin1": "São Paulo",
    "admin2": None,
    "latitude": "-23.5475",
    "longitude": "-46.63611",
    "geonames_id": 3448439
}
```
**Requires free account:** Register at https://www.geonames.org/login
**Setup:**
1. Create account
2. Enable web services in account settings
3. Set environment variable: `export GEONAMES_USERNAME='your-username'`
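**Rate limit:** The free tier is metered in credits, with hourly and daily caps per username; check the current limits on the GeoNames web services page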
**Documentation:** https://www.geonames.org/export/web-services.html
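A minimal sketch of how the request parameters and the result mapping might look, assuming the GeoNames `searchJSON` endpoint. The helper names are hypothetical and not taken from the script. `adminName2` is usually absent unless a fuller response style is requested, which is consistent with `admin2` being `None` in the example above.

```python
import os
from typing import Any, Dict, Optional

GEONAMES_SEARCH_URL = "http://api.geonames.org/searchJSON"


def build_geonames_params(location: str, country: Optional[str] = None) -> Dict[str, Any]:
    """Assemble query parameters, reading the username from the environment."""
    username = os.environ.get("GEONAMES_USERNAME")
    if not username:
        raise RuntimeError("Set GEONAMES_USERNAME before calling GeoNames")
    params: Dict[str, Any] = {"q": location, "maxRows": 1, "username": username}
    if country:
        params["country"] = country  # ISO 3166 two-letter code, e.g. "BR"
    return params


def parse_geonames_search(payload: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map the top search hit onto the fields listed above, or None if no hit."""
    if "status" in payload:  # GeoNames reports errors in a "status" object
        raise RuntimeError(payload["status"].get("message", "GeoNames error"))
    hits = payload.get("geonames", [])
    if not hits:
        return None
    top = hits[0]
    return {
        "matched_name": top.get("name"),
        "country": top.get("countryName"),
        "country_code": top.get("countryCode"),
        "admin1": top.get("adminName1"),
        "admin2": top.get("adminName2"),
        "latitude": top.get("lat"),
        "longitude": top.get("lng"),
        "geonames_id": top.get("geonameId"),
    }
```

Raising on the `status` object rather than returning `None` keeps account problems (disabled web services, exhausted credits) distinguishable from a genuine "no match".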
#### OpenStreetMap Nominatim
**Use for:** Geocoding addresses to coordinates
**Function:** `geocode_location(address)`
**Returns:**
- Display name (formatted address)
- Latitude/longitude
- OSM type and ID
- Place rank
**Example:**
```python
geocode_location("Field Museum, Chicago, IL")
# Returns:
{
    "display_name": "Field Museum, 1400, South Lake Shore Drive, Chicago, Illinois, 60605, United States",
    "latitude": "41.8662",
    "longitude": "-87.6169",
    "osm_type": "way",
    "osm_id": 54856789,
    "place_rank": 30
}
```
**No API key required** - Free
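**Important:** The public Nominatim server allows at most 1 request per second and expects an identifying User-Agent header; the script already waits 1 second between requests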
**Documentation:** https://nominatim.org/release-docs/latest/api/Overview/
### Chemistry
#### PubChem
**Use for:** Chemical compound validation
**Function:** `validate_pubchem_compound(compound_name)`
**Returns:**
- PubChem CID (compound ID)
- Molecular formula
- PubChem URL
**Example:**
```python
validate_pubchem_compound("aspirin")
# Returns:
{
    "cid": 2244,
    "molecular_formula": "C9H8O4",
    "pubchem_url": "https://pubchem.ncbi.nlm.nih.gov/compound/2244"
}
```
**No API key required** - Free
**Documentation:** https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest
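For illustration, a compound-name lookup through PUG REST could be sketched as below. The helper names are hypothetical, and the script's own `validate_pubchem_compound` may request different properties.

```python
from typing import Any, Dict, Optional
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_pubchem_url(compound_name: str) -> str:
    """Build a PUG REST URL asking for the molecular formula of a named compound."""
    name = quote(compound_name, safe="")  # escape spaces, slashes, etc.
    return f"{PUG_REST}/compound/name/{name}/property/MolecularFormula/JSON"


def parse_pubchem_properties(payload: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map the first property record onto the fields listed above."""
    properties = payload.get("PropertyTable", {}).get("Properties", [])
    if not properties:
        return None
    first = properties[0]
    cid = first.get("CID")
    return {
        "cid": cid,
        "molecular_formula": first.get("MolecularFormula"),
        "pubchem_url": f"https://pubchem.ncbi.nlm.nih.gov/compound/{cid}",
    }
```

A name can resolve to more than one CID; this sketch keeps only the first record, which is an assumption about the desired behavior.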
### Genetics
#### NCBI Gene
**Use for:** Gene validation
**Function:** `validate_ncbi_gene(gene_symbol, organism=None)`
**Returns:**
- NCBI Gene ID
- NCBI URL
**Example:**
```python
validate_ncbi_gene("BRCA1", organism="Homo sapiens")
# Returns:
{
    "gene_id": "672",
    "ncbi_url": "https://www.ncbi.nlm.nih.gov/gene/672"
}
```
**No API key required** - Free
**Rate limit:** Max 3 requests/second without an API key (10 requests/second with one)
**Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25500/
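A gene lookup of this kind is commonly done with the E-utilities `esearch` endpoint against the `gene` database. The sketch below shows one way to build the query and read the result. The helper names are hypothetical, and taking the first ID in the result list is an assumption that may not always select the intended gene.

```python
from typing import Any, Dict, Optional

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gene_search_params(
    gene_symbol: str,
    organism: Optional[str] = None,
    api_key: Optional[str] = None,
) -> Dict[str, str]:
    """Build esearch parameters restricted to gene name and, optionally, organism."""
    term = f"{gene_symbol}[Gene Name]"
    if organism:
        term += f" AND {organism}[Organism]"
    params = {"db": "gene", "term": term, "retmode": "json", "retmax": "1"}
    if api_key:
        params["api_key"] = api_key  # optional; raises the per-second limit
    return params


def parse_gene_search(payload: Dict[str, Any]) -> Optional[Dict[str, str]]:
    """Return the first Gene ID from an esearch JSON response, or None."""
    ids = payload.get("esearchresult", {}).get("idlist", [])
    if not ids:
        return None
    gene_id = ids[0]
    return {"gene_id": gene_id, "ncbi_url": f"https://www.ncbi.nlm.nih.gov/gene/{gene_id}"}
```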
## Configuration
### API Config File Structure
Create `my_api_config.json` based on `assets/api_config_template.json`:
```json
{
  "field_mappings": {
    "species": {
      "api": "gbif_taxonomy",
      "output_field": "validated_species",
      "description": "Validate species names against GBIF"
    },
    "location": {
      "api": "geocode",
      "output_field": "coordinates"
    }
  },
  "nested_field_mappings": {
    "records.plant_species": {
      "api": "wfo_plants",
      "output_field": "validated_plant_taxonomy"
    },
    "records.location": {
      "api": "geocode",
      "output_field": "coordinates"
    }
  }
}
```
### Field Mapping Parameters
**Required:**
- `api` - API name: a key registered in `API_VALIDATORS`, such as `gbif_taxonomy`, `wfo_plants`, or `geocode`
- `output_field` - Name for validated data
**Optional:**
- `description` - Documentation
- `extra_params` - Additional API-specific parameters
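Because a typo in `api` or a missing `output_field` would otherwise surface only mid-run, it can help to check the config before validating anything. The following is a minimal sketch of such a check; `check_api_config` is a hypothetical helper, not part of the script.

```python
from typing import Any, Dict, Iterable, List

REQUIRED_KEYS = ("api", "output_field")
OPTIONAL_KEYS = ("description", "extra_params")


def check_api_config(config: Dict[str, Any], known_apis: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means the config looks fine."""
    known = set(known_apis)
    problems: List[str] = []
    for section in ("field_mappings", "nested_field_mappings"):
        for field, mapping in config.get(section, {}).items():
            where = f"{section}.{field}"
            for key in REQUIRED_KEYS:
                if key not in mapping:
                    problems.append(f"{where}: missing required key '{key}'")
            api = mapping.get("api")
            if api is not None and api not in known:
                problems.append(f"{where}: unknown api '{api}'")
            for key in mapping:
                if key not in REQUIRED_KEYS + OPTIONAL_KEYS:
                    problems.append(f"{where}: unexpected key '{key}'")
    return problems
```

In practice `known_apis` could be the keys of the `API_VALIDATORS` dictionary, so the check stays in sync with whatever validators are registered.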
## Adding Custom APIs
To add a new validation API:
1. **Create validator function** in `scripts/05_validate_with_apis.py`:
```python
from typing import Dict, Optional

import requests


def validate_custom_api(value: str, extra_param: Optional[str] = None) -> Optional[Dict]:
    """
    Validate value using custom API.

    Args:
        value: The value to validate
        extra_param: Optional additional parameter

    Returns:
        Dictionary with validated data or None if not found
    """
    try:
        # Make API request (the timeout keeps a stalled server from hanging the run)
        response = requests.get(f"https://api.example.com/{value}", timeout=10)
        if response.status_code == 200:
            data = response.json()
            return {
                'validated_value': data.get('canonical_name'),
                'api_id': data.get('id'),
                'additional_info': data.get('info')
            }
    except Exception as e:
        print(f"Custom API error: {e}")
    # Reached on non-200 responses and on errors
    return None
```
2. **Register in API_VALIDATORS** dictionary:
```python
API_VALIDATORS = {
    'gbif_taxonomy': validate_gbif_taxonomy,
    'wfo_plants': validate_wfo_plant,
    # ... existing validators ...
    'custom_api': validate_custom_api,  # Add here
}
```
3. **Use in config file:**
```json
{
  "field_mappings": {
    "your_field": {
      "api": "custom_api",
      "output_field": "validated_field",
      "extra_params": {
        "extra_param": "value"
      }
    }
  }
}
```
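How a mapping like this reaches the validator is not spelled out here. One plausible reading, sketched below, is that `extra_params` is passed to the validator as keyword arguments and the result is stored under `output_field`. The function `apply_field_mappings` is hypothetical, and it handles only top-level `field_mappings`, not nested ones.

```python
from typing import Any, Callable, Dict, Optional

Validator = Callable[..., Optional[Dict[str, Any]]]


def apply_field_mappings(
    record: Dict[str, Any],
    mappings: Dict[str, Dict[str, Any]],
    validators: Dict[str, Validator],
) -> Dict[str, Any]:
    """Return a copy of record with one validated field added per mapping."""
    enriched = dict(record)  # the original extracted values stay untouched
    for field, mapping in mappings.items():
        value = record.get(field)
        if value is None:
            continue  # nothing to validate for this record
        validator = validators[mapping["api"]]
        extra = mapping.get("extra_params", {})
        try:
            enriched[mapping["output_field"]] = validator(value, **extra)
        except Exception as exc:
            # Mirror the documented behavior: log, set None, keep going.
            print(f"{mapping['api']} failed for {field!r}: {exc}")
            enriched[mapping["output_field"]] = None
    return enriched
```

With the config above, `validate_custom_api` would be called as `validate_custom_api(record["your_field"], extra_param="value")` under this reading.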
## Rate Limiting
The script implements rate limiting to respect API usage policies:
**Default delays (built into script):**
- GeoNames: 0.5 seconds
- Nominatim: 1.0 second (required)
- WFO: 1.0 second
- Others: 0.5 seconds
**Modify delays if needed** in `scripts/05_validate_with_apis.py`:
```python
# In main() function
if not args.skip_validation:
    time.sleep(0.5)  # Adjust this value
```
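A single fixed `time.sleep` applies the same pause to every API. If per-API delays are wanted, the idea can be sketched as a small throttle keyed by API name. This is an illustrative alternative, not the script's implementation; the `Throttle` class and the `geonames` key are assumptions, while the delay values come from the list above.

```python
import time
from typing import Callable, Dict

# Delays in seconds, taken from the defaults listed above.
API_DELAYS: Dict[str, float] = {"geonames": 0.5, "geocode": 1.0, "wfo_plants": 1.0}
DEFAULT_DELAY = 0.5


class Throttle:
    """Enforce a minimum interval between consecutive calls to the same API."""

    def __init__(
        self,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._clock = clock
        self._sleep = sleep
        self._last_call: Dict[str, float] = {}

    def wait(self, api: str) -> float:
        """Sleep just long enough to respect the API's delay; return seconds slept."""
        delay = API_DELAYS.get(api, DEFAULT_DELAY)
        now = self._clock()
        last = self._last_call.get(api)
        pause = 0.0 if last is None else max(0.0, delay - (now - last))
        if pause > 0:
            self._sleep(pause)
        self._last_call[api] = now + pause
        return pause
```

Calling `throttle.wait("geocode")` immediately before each Nominatim request keeps those requests at least one second apart without slowing down calls to other services.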
## Error Handling
APIs may fail for various reasons:
**Common errors:**
- Connection timeout
- Rate limit exceeded
- Invalid API key
- Malformed query
- No match found
**Script behavior:**
- Continues processing on error
- Logs error to console
- Sets validated field to None
- Original extracted value preserved
**Retry logic:**
- 3 retries with exponential backoff
- Implemented for network errors
- Not for "no match found" errors
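The retry policy above can be sketched as follows. This is a minimal illustration under assumptions: the 1-second base delay and the name `call_with_retries` are not taken from the script, and when using `requests` the caught exception types would be `requests.exceptions.ConnectionError` and `requests.exceptions.Timeout` rather than the built-ins used here.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 1.0,  # assumed value; the script's delay may differ
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying network errors up to `retries` times with exponential backoff.

    A normal return value, including None for "no match found", is passed
    through immediately and never triggers a retry.
    """
    for attempt in range(retries + 1):
        try:
            return func()
        except (ConnectionError, TimeoutError):
            if attempt == retries:
                raise  # out of retries: let the caller log and move on
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s by default
    raise AssertionError("unreachable")
```

A caller would wrap a single lookup, for example `call_with_retries(lambda: validate_gbif_taxonomy(name))`, and treat a raised exception as a failed validation for that field.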
## Best Practices
1. **Start with a test run** (with `--skip-validation`, to check the output structure before making API calls):
```bash
python scripts/05_validate_with_apis.py \
--input cleaned_data.json \
--apis my_api_config.json \
--skip-validation \
--output test_structure.json
```
2. **Validate subset first:**
- Test on 10 papers before full run
- Verify API connections work
- Check output structure
3. **Monitor API usage:**
- Track request counts for quota-limited APIs (for example, GeoNames credits)
- Respect rate limits
- Consider caching results
4. **Handle failures gracefully:**
- Original data is never lost
- Can re-run validation separately
- Manually fix failed validations if needed
5. **Optimize API calls:**
- Only validate fields that need standardization
- Use cached results when re-running
- Batch similar queries when possible
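Caching is mentioned twice above without a concrete mechanism. One simple option is a JSON file keyed by API name and input value, sketched below. `ValidationCache` is a hypothetical helper, not part of the script, and it assumes validator results are JSON-serializable.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, Optional, Union


class ValidationCache:
    """Remember validator results across runs in a small JSON file."""

    def __init__(self, path: Union[str, Path]) -> None:
        self.path = Path(path)
        self._data: Dict[str, Any] = {}
        if self.path.exists():
            self._data = json.loads(self.path.read_text(encoding="utf-8"))

    def lookup(
        self,
        api: str,
        value: str,
        validator: Callable[[str], Optional[Dict[str, Any]]],
    ) -> Optional[Dict[str, Any]]:
        """Return the cached result, calling the validator only on a cache miss."""
        key = f"{api}::{value}"
        if key not in self._data:
            # None ("no match") is cached too; exceptions propagate uncached.
            self._data[key] = validator(value)
        return self._data[key]

    def save(self) -> None:
        """Write the cache back to disk so the next run can reuse it."""
        self.path.write_text(
            json.dumps(self._data, ensure_ascii=False, indent=2), encoding="utf-8"
        )
```

Because many papers mention the same species or localities, even this simple cache can remove a large share of repeated requests. Delete the file to force fresh lookups.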
## Troubleshooting
### GeoNames "Service disabled" error
- Check account email is verified
- Enable web services in account settings
- Wait up to 1 hour after enabling
### Nominatim rate limit errors
- Script includes 1-second delays
- Don't run multiple instances
- Consider using local Nominatim instance
### NCBI errors
- Reduce request frequency
- Add longer delays
- Use an E-utilities API key (optional; raises the limit from 3 to 10 requests/second)
### No matches found
- Check spelling and formatting
- Try variations of name
- Some names may not be in database
- Consider manual curation for important cases
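Many failed matches come from formatting rather than from a missing record. Before retrying a lookup, a light normalization pass along these lines may help. The function is a hypothetical sketch that assumes names are given without an authority string, since it lowercases everything after the genus.

```python
import re


def normalize_scientific_name(raw: str) -> str:
    """Collapse whitespace and apply binomial capitalization before lookup."""
    parts = re.sub(r"\s+", " ", raw).strip().split(" ")
    if parts == [""]:
        return ""
    genus = parts[0].capitalize()
    # Species and infraspecific epithets are conventionally lowercase.
    rest = [part.lower() for part in parts[1:]]
    return " ".join([genus, *rest])
```

For example, `normalize_scientific_name("  apis   MELLIFERA ")` gives `"Apis mellifera"`.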