Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

@@ -0,0 +1,173 @@
---
name: gene-database
description: "Query NCBI Gene via E-utilities/Datasets API. Search by symbol/ID, retrieve gene info (RefSeqs, GO, locations, phenotypes), batch lookups, for gene annotation and functional analysis."
---
# Gene Database
## Overview
NCBI Gene is a comprehensive database integrating gene information from diverse species. It provides nomenclature, reference sequences (RefSeqs), chromosomal maps, biological pathways, genetic variations, phenotypes, and cross-references to global genomic resources.
## When to Use This Skill
This skill should be used when working with gene data including searching by gene symbol or ID, retrieving gene sequences and metadata, analyzing gene functions and pathways, or performing batch gene lookups.
## Quick Start
NCBI provides two main APIs for gene data access:
1. **E-utilities** (Traditional): Full-featured API for all Entrez databases with flexible querying
2. **NCBI Datasets API** (Newer): Optimized for gene data retrieval with simplified workflows
Choose E-utilities for complex queries and cross-database searches. Choose Datasets API for straightforward gene data retrieval with metadata and sequences in a single request.
## Common Workflows
### Search Genes by Symbol or Name
To search for genes by symbol or name across organisms:
1. Use the `scripts/query_gene.py` script with E-utilities ESearch
2. Specify the gene symbol and organism (e.g., "BRCA1 in human")
3. The script returns matching Gene IDs
Example query patterns:
- Gene symbol: `insulin[gene name] AND human[organism]`
- Gene with disease: `dystrophin[gene name] AND muscular dystrophy[disease]`
- Chromosome location: `human[organism] AND 17q21[chromosome]`
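The query patterns above can be turned into an ESearch request URL. The following is a minimal sketch using only the standard library; `build_esearch_url` is a hypothetical helper for illustration and is not taken from `scripts/query_gene.py`. The term is URL-encoded because field tags contain spaces and square brackets.

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(term, retmax=20, api_key=None):
    """Build an ESearch URL for the Gene database (hypothetical helper)."""
    params = {"db": "gene", "term": term, "retmax": retmax, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    # urlencode percent-encodes the brackets and turns spaces into '+'.
    return f"{EUTILS_BASE}esearch.fcgi?{urlencode(params)}"


url = build_esearch_url("insulin[gene name] AND human[organism]")
print(url)
```

The same helper works for the disease and chromosome patterns; only the `term` string changes.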
### Retrieve Gene Information by ID
To fetch detailed information for known Gene IDs:
1. Use `scripts/fetch_gene_data.py` with the Datasets API for comprehensive data
2. Alternatively, use `scripts/query_gene.py` with E-utilities EFetch for specific formats
3. Specify desired output format (JSON, XML, or text)
The Datasets API returns:
- Gene nomenclature and aliases
- Reference sequences (RefSeqs) for transcripts and proteins
- Chromosomal location and mapping
- Gene Ontology (GO) annotations
- Associated publications
### Batch Gene Lookups
For multiple genes simultaneously:
1. Use `scripts/batch_gene_lookup.py` for efficient batch processing
2. Provide a list of gene symbols or IDs
3. Specify the organism for symbol-based queries
4. The script handles rate limiting automatically (10 requests/second with API key)
This workflow is useful for:
- Validating gene lists
- Retrieving metadata for gene panels
- Cross-referencing gene identifiers
- Building gene annotation tables
### Search by Biological Context
To find genes associated with specific biological functions or phenotypes:
1. Use E-utilities with Gene Ontology (GO) terms or phenotype keywords
2. Query by pathway names or disease associations
3. Filter by organism, chromosome, or other attributes
Example searches:
- By GO term: `GO:0006915[biological process]` (apoptosis)
- By phenotype: `diabetes[phenotype] AND mouse[organism]`
- By pathway: `insulin signaling pathway[pathway]`
### API Access Patterns
**Rate Limits:**
- Without API key: 3 requests/second for E-utilities, 5 requests/second for Datasets API
- With API key: 10 requests/second for both APIs
**Authentication:**
Register for a free NCBI API key at https://www.ncbi.nlm.nih.gov/account/ to increase rate limits.
**Error Handling:**
Both APIs return standard HTTP status codes. Common errors include:
- 400: Malformed query or invalid parameters
- 429: Rate limit exceeded
- 404: Gene ID not found
Retry failed requests with exponential backoff.
## Script Usage
### query_gene.py
Query NCBI Gene using E-utilities (ESearch, ESummary, EFetch).
```bash
python scripts/query_gene.py --search "BRCA1" --organism "human"
python scripts/query_gene.py --id 672 --format json
python scripts/query_gene.py --search "insulin[gene] AND diabetes[disease]"
```
### fetch_gene_data.py
Fetch comprehensive gene data using NCBI Datasets API.
```bash
python scripts/fetch_gene_data.py --gene-id 672
python scripts/fetch_gene_data.py --symbol BRCA1 --taxon human
python scripts/fetch_gene_data.py --symbol TP53 --taxon "Homo sapiens" --output json
```
### batch_gene_lookup.py
Process multiple gene queries efficiently.
```bash
python scripts/batch_gene_lookup.py --file gene_list.txt --organism human
python scripts/batch_gene_lookup.py --ids 672,7157,5594 --output results.json
```
## API References
For detailed API documentation including endpoints, parameters, response formats, and examples, refer to:
- `references/api_reference.md` - Comprehensive API documentation for E-utilities and Datasets API
- `references/common_workflows.md` - Additional examples and use case patterns
Search these references when needing specific API endpoint details, parameter options, or response structure information.
## Data Formats
NCBI Gene data can be retrieved in multiple formats:
- **JSON**: Structured data ideal for programmatic processing
- **XML**: Detailed hierarchical format with full metadata
- **GenBank**: Sequence data with annotations
- **FASTA**: Sequence data only
- **Text**: Human-readable summaries
Choose JSON for modern applications, XML for legacy systems requiring detailed metadata, and FASTA for sequence analysis workflows.
## Best Practices
1. **Always specify organism** when searching by gene symbol to avoid ambiguity
2. **Use Gene IDs** for precise lookups when available
3. **Batch requests** when working with multiple genes to minimize API calls
4. **Cache results** locally to reduce redundant queries
5. **Include API key** in scripts for higher rate limits
6. **Handle errors gracefully** with retry logic for transient failures
7. **Validate gene symbols** before batch processing to catch typos
## Resources
This skill includes:
### scripts/
- `query_gene.py` - Query genes using E-utilities (ESearch, ESummary, EFetch)
- `fetch_gene_data.py` - Fetch gene data using NCBI Datasets API
- `batch_gene_lookup.py` - Handle multiple gene queries efficiently
### references/
- `api_reference.md` - Detailed API documentation for both E-utilities and Datasets API
- `common_workflows.md` - Examples of common gene queries and use cases

@@ -0,0 +1,404 @@
# NCBI Gene API Reference
This document provides detailed API documentation for accessing NCBI Gene database programmatically.
## Table of Contents
1. [E-utilities API](#e-utilities-api)
2. [NCBI Datasets API](#ncbi-datasets-api)
3. [Authentication and Rate Limits](#authentication-and-rate-limits)
4. [Error Handling](#error-handling)
---
## E-utilities API
E-utilities (Entrez Programming Utilities) provide a stable interface to NCBI's Entrez databases.
### Base URL
```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
```
### Common Parameters
- `db` - Database name (use `gene` for Gene database)
- `api_key` - API key for higher rate limits
- `retmode` - Return format (json, xml, text)
- `retmax` - Maximum number of records to return
### ESearch - Search Database
Search for genes matching a text query.
**Endpoint:** `esearch.fcgi`
**Parameters:**
- `db=gene` (required) - Database to search
- `term` (required) - Search query
- `retmax` - Maximum results (default: 20)
- `retmode` - json or xml (default: xml)
- `usehistory=y` - Store results on history server for large result sets
**Query Syntax:**
- Gene symbol: `BRCA1[gene]` or `BRCA1[gene name]`
- Organism: `human[organism]` or `9606[taxid]`
- Combine terms: `BRCA1[gene] AND human[organism]`
- Disease: `muscular dystrophy[disease]`
- Chromosome: `17q21[chromosome]`
- GO terms: `GO:0006915[biological process]`
**Example Request:**
```bash
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term=BRCA1%5Bgene%5D+AND+human%5Borganism%5D&retmode=json"
```
**Response Format (JSON):**
```json
{
"esearchresult": {
"count": "1",
"retmax": "1",
"retstart": "0",
"idlist": ["672"],
"translationset": [],
"querytranslation": "BRCA1[Gene Name] AND human[Organism]"
}
}
```
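A short sketch of reading this response, assuming the JSON shape shown above. `parse_esearch` is a hypothetical name; note that ESearch reports `count` as a string, so it is converted before use.

```python
import json


def parse_esearch(payload):
    """Return (total_count, gene_ids) from an ESearch JSON response body."""
    result = json.loads(payload).get("esearchresult", {})
    # Counts arrive as strings, so convert before comparing or paging.
    total = int(result.get("count", "0"))
    gene_ids = result.get("idlist", [])
    return total, gene_ids


sample = (
    '{"esearchresult": {"count": "1", "retmax": "1", '
    '"retstart": "0", "idlist": ["672"]}}'
)
total, gene_ids = parse_esearch(sample)
print(total, gene_ids)
```

When `total` is larger than the length of `gene_ids`, the result was truncated by `retmax` and further pages are needed.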
### ESummary - Document Summaries
Retrieve document summaries for Gene IDs.
**Endpoint:** `esummary.fcgi`
**Parameters:**
- `db=gene` (required) - Database
- `id` (required) - Comma-separated Gene IDs (up to 500)
- `retmode` - json or xml (default: xml)
**Example Request:**
```bash
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=672&retmode=json"
```
**Response Format (JSON):**
```json
{
"result": {
"672": {
"uid": "672",
"name": "BRCA1",
"description": "BRCA1 DNA repair associated",
"organism": {
"scientificname": "Homo sapiens",
"commonname": "human",
"taxid": 9606
},
"chromosome": "17",
"geneticsource": "genomic",
"maplocation": "17q21.31",
"nomenclaturesymbol": "BRCA1",
"nomenclaturename": "BRCA1 DNA repair associated"
}
}
}
```
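The response above can be flattened into table-like rows. This is a sketch under the assumption that the response follows the shape shown; real ESummary JSON also carries a `uids` index list inside `result`, which the hypothetical `summarize_genes` helper skips.

```python
def summarize_genes(esummary_json):
    """Flatten an ESummary JSON response into a list of row dictionaries."""
    rows = []
    for uid, record in esummary_json.get("result", {}).items():
        if uid == "uids":  # index list of IDs, not a gene record
            continue
        rows.append({
            "gene_id": uid,
            "symbol": record.get("name", "N/A"),
            "description": record.get("description", "N/A"),
            "organism": record.get("organism", {}).get("scientificname", "N/A"),
            "map_location": record.get("maplocation", "N/A"),
        })
    return rows


sample = {
    "result": {
        "uids": ["672"],
        "672": {
            "uid": "672",
            "name": "BRCA1",
            "description": "BRCA1 DNA repair associated",
            "organism": {"scientificname": "Homo sapiens", "taxid": 9606},
            "maplocation": "17q21.31",
        },
    }
}
rows = summarize_genes(sample)
print(rows)
```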
### EFetch - Full Records
Fetch detailed gene records in various formats.
**Endpoint:** `efetch.fcgi`
**Parameters:**
- `db=gene` (required) - Database
- `id` (required) - Comma-separated Gene IDs
- `retmode` - xml, text, asn.1 (default: xml)
- `rettype` - gene_table, docsum
**Example Request:**
```bash
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=672&retmode=xml"
```
**XML Response:** Contains detailed gene information including:
- Gene nomenclature
- Sequence locations
- Transcript variants
- Protein products
- Gene Ontology annotations
- Cross-references
- Publications
### ELink - Related Records
Find related records in Gene or other databases.
**Endpoint:** `elink.fcgi`
**Parameters:**
- `dbfrom=gene` (required) - Source database
- `db` (required) - Target database (gene, nuccore, protein, pubmed, etc.)
- `id` (required) - Gene ID(s)
**Example Request:**
```bash
# Get related PubMed articles
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id=672&retmode=json"
```
### EInfo - Database Information
Get information about the Gene database.
**Endpoint:** `einfo.fcgi`
**Parameters:**
- `db=gene` - Database to query
**Example Request:**
```bash
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=gene&retmode=json"
```
---
## NCBI Datasets API
The Datasets API provides streamlined access to gene data with metadata and sequences.
### Base URL
```
https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene
```
### Authentication
Include API key in request headers:
```
api-key: YOUR_API_KEY
```
### Get Gene by ID
Retrieve gene data by Gene ID.
**Endpoint:** `GET /id/{gene_id}` (relative to the base URL above)
**Example Request:**
```bash
curl "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene/id/672"
```
**Response Format (JSON):**
```json
{
"genes": [
{
"gene": {
"gene_id": "672",
"symbol": "BRCA1",
"description": "BRCA1 DNA repair associated",
"tax_name": "Homo sapiens",
"taxid": 9606,
"chromosomes": ["17"],
"type": "protein-coding",
"synonyms": ["BRCC1", "FANCS", "PNCA4", "RNF53"],
"nomenclature_authority": {
"authority": "HGNC",
"identifier": "HGNC:1100"
},
"genomic_ranges": [
{
"accession_version": "NC_000017.11",
"range": [
{
"begin": 43044295,
"end": 43170245,
"orientation": "minus"
}
]
}
],
"transcripts": [
{
"accession_version": "NM_007294.4",
"length": 7207
}
]
}
}
]
}
```
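A sketch of pulling genomic coordinates out of a response, assuming it has exactly the shape shown above (the field names are taken from this example, not verified against a live response). `gene_spans` is a hypothetical helper, and the length calculation assumes 1-based inclusive coordinates.

```python
def gene_spans(response):
    """Return (symbol, accession, begin, end, strand, length) tuples."""
    spans = []
    for entry in response.get("genes", []):
        gene = entry.get("gene", {})
        for placement in gene.get("genomic_ranges", []):
            for rng in placement.get("range", []):
                # int() tolerates coordinates delivered as strings.
                begin, end = int(rng["begin"]), int(rng["end"])
                spans.append((
                    gene.get("symbol"),
                    placement.get("accession_version"),
                    begin,
                    end,
                    rng.get("orientation"),
                    end - begin + 1,  # assumes inclusive coordinates
                ))
    return spans


sample = {
    "genes": [{
        "gene": {
            "symbol": "BRCA1",
            "genomic_ranges": [{
                "accession_version": "NC_000017.11",
                "range": [{"begin": 43044295, "end": 43170245,
                           "orientation": "minus"}],
            }],
        }
    }]
}
print(gene_spans(sample))
```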
### Get Gene by Symbol
Retrieve gene data by symbol and organism.
**Endpoint:** `GET /symbol/{symbol}/taxon/{taxon}` (relative to the base URL above)
**Parameters:**
- `{symbol}` - Gene symbol (e.g., BRCA1)
- `{taxon}` - Taxon ID (e.g., 9606 for human)
**Example Request:**
```bash
curl "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene/symbol/BRCA1/taxon/9606"
```
### Get Multiple Genes
Retrieve data for multiple genes.
**Endpoint:** `POST /id` (relative to the base URL above)
**Request Body:**
```json
{
"gene_ids": ["672", "7157", "5594"]
}
```
**Example Request:**
```bash
curl -X POST "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene/id" \
-H "Content-Type: application/json" \
-d '{"gene_ids": ["672", "7157", "5594"]}'
```
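The same POST request can be prepared in Python with the standard library. This sketch only builds the request object and does not send it; `build_batch_request` is a hypothetical helper, and the URL, body, and `api-key` header simply mirror the examples in this document.

```python
import json
import urllib.request

DATASETS_GENE_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene"


def build_batch_request(gene_ids, api_key=None):
    """Prepare (but do not send) a POST request for several Gene IDs."""
    body = json.dumps({"gene_ids": [str(g) for g in gene_ids]}).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    if api_key:
        headers["api-key"] = api_key
    return urllib.request.Request(
        f"{DATASETS_GENE_BASE}/id", data=body, headers=headers, method="POST"
    )


req = build_batch_request([672, 7157, 5594])
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would perform the call; keeping construction separate makes the request easy to inspect before any network traffic occurs.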
---
## Authentication and Rate Limits
### Obtaining an API Key
1. Create an NCBI account at https://www.ncbi.nlm.nih.gov/account/
2. Navigate to Settings → API Key Management
3. Generate a new API key
4. Include the key in requests
### Rate Limits
**E-utilities:**
- Without API key: 3 requests/second
- With API key: 10 requests/second
**Datasets API:**
- Without API key: 5 requests/second
- With API key: 10 requests/second
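One way to stay under these limits is to enforce a minimum interval between requests. The class below is a minimal sketch, not the mechanism used by the bundled scripts; `MinIntervalLimiter` and `limiter_for` are hypothetical names, and the clock and sleep functions are injectable so the behavior can be checked without waiting.

```python
import time


class MinIntervalLimiter:
    """Block so that consecutive calls are at least 1/rate seconds apart."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def limiter_for(api_key=None):
    # E-utilities limits from the list above: 10/s with a key, 3/s without.
    return MinIntervalLimiter(10 if api_key else 3)
```

Call `limiter.wait()` immediately before each request.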
### Usage Guidelines
1. **Include email in requests:** Add `&email=your@email.com` to E-utilities requests
2. **Implement rate limiting:** Use delays between requests
3. **Use POST for large queries:** When working with many IDs
4. **Cache results:** Store frequently accessed data locally
5. **Handle errors gracefully:** Implement retry logic with exponential backoff
---
## Error Handling
### HTTP Status Codes
- `200 OK` - Successful request
- `400 Bad Request` - Invalid parameters or malformed query
- `404 Not Found` - Gene ID or symbol not found
- `429 Too Many Requests` - Rate limit exceeded
- `500 Internal Server Error` - Server error (retry with backoff)
### E-utilities Error Messages
E-utilities return errors in the response body:
**XML format:**
```xml
<ERROR>Empty id list - nothing to do</ERROR>
```
**JSON format:**
```json
{
"error": "Invalid db name"
}
```
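Because these errors can arrive in the response body, it helps to check the body before using it. The sketch below assumes the two error shapes shown above; `extract_eutils_error` is a hypothetical helper name.

```python
import json
import xml.etree.ElementTree as ET


def extract_eutils_error(body, retmode="json"):
    """Return the error message found in a response body, or None."""
    if retmode == "json":
        data = json.loads(body)
        return data.get("error") if isinstance(data, dict) else None
    root = ET.fromstring(body)
    # The ERROR element may be the root or nested inside a result element.
    node = root if root.tag == "ERROR" else root.find(".//ERROR")
    return node.text if node is not None else None


print(extract_eutils_error('{"error": "Invalid db name"}'))
print(extract_eutils_error("<ERROR>Empty id list - nothing to do</ERROR>", "xml"))
```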
### Common Errors
1. **Empty Result Set**
- Cause: Gene symbol or ID not found
- Solution: Verify spelling, check organism filter
2. **Rate Limit Exceeded**
- Cause: Too many requests
- Solution: Add delays, use API key
3. **Invalid Query Syntax**
- Cause: Malformed search term
- Solution: Use proper field tags (e.g., `[gene]`, `[organism]`)
4. **Timeout**
- Cause: Large result set or slow connection
- Solution: Use History Server, reduce result size
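For the timeout case, a large result set can be stored with `usehistory=y` and then read back in pages. The sketch below only generates the ESummary parameter sets; `history_pages` is a hypothetical helper, and it assumes the ESearch response supplied a WebEnv token and query key, which are passed back through the `WebEnv` and `query_key` parameters.

```python
def history_pages(webenv, query_key, total, page_size=500):
    """Yield ESummary parameter dicts that page through a stored result set."""
    for start in range(0, total, page_size):
        yield {
            "db": "gene",
            "WebEnv": webenv,
            "query_key": query_key,
            "retstart": start,
            "retmax": page_size,
            "retmode": "json",
        }


pages = list(history_pages("WEBENV_TOKEN", "1", total=1200))
print([p["retstart"] for p in pages])
```

Each dictionary can be URL-encoded and appended to `esummary.fcgi`, with a delay between requests to respect the rate limits.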
### Retry Strategy
Implement exponential backoff for failed requests:
```python
import time

def retry_request(func, max_attempts=3):
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt < max_attempts - 1:
                wait_time = 2 ** attempt  # 1s, 2s, 4s, ...
                time.sleep(wait_time)
            else:
                # Out of attempts: re-raise the last error
                raise
```
---
## Common Taxon IDs
| Organism | Scientific Name | Taxon ID |
|----------|----------------|----------|
| Human | Homo sapiens | 9606 |
| Mouse | Mus musculus | 10090 |
| Rat | Rattus norvegicus | 10116 |
| Zebrafish | Danio rerio | 7955 |
| Fruit fly | Drosophila melanogaster | 7227 |
| C. elegans | Caenorhabditis elegans | 6239 |
| Yeast | Saccharomyces cerevisiae | 4932 |
| Arabidopsis | Arabidopsis thaliana | 3702 |
| E. coli | Escherichia coli | 562 |
---
## Additional Resources
- **E-utilities Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25501/
- **Datasets API Documentation:** https://www.ncbi.nlm.nih.gov/datasets/docs/v2/
- **Gene Database Help:** https://www.ncbi.nlm.nih.gov/gene/
- **API Key Registration:** https://www.ncbi.nlm.nih.gov/account/

@@ -0,0 +1,428 @@
# Common Gene Database Workflows
This document provides examples of common workflows and use cases for working with NCBI Gene database.
## Table of Contents
1. [Disease Gene Discovery](#disease-gene-discovery)
2. [Gene Annotation Pipeline](#gene-annotation-pipeline)
3. [Cross-Species Gene Comparison](#cross-species-gene-comparison)
4. [Pathway Analysis](#pathway-analysis)
5. [Variant Analysis](#variant-analysis)
6. [Publication Mining](#publication-mining)
---
## Disease Gene Discovery
### Use Case
Identify genes associated with a specific disease or phenotype.
### Workflow
1. **Search by disease name**
```bash
# Find genes associated with Alzheimer's disease
python scripts/query_gene.py --search "Alzheimer disease[disease]" --organism human --max-results 50
```
2. **Filter by chromosome location**
```bash
# Find genes on chromosome 17 associated with breast cancer
python scripts/query_gene.py --search "breast cancer[disease] AND 17[chromosome]" --organism human
```
3. **Retrieve detailed information**
```python
# Python example: Get gene details for disease-associated genes
import json
from scripts.query_gene import esearch, esummary
# Search for genes
query = "diabetes[disease] AND human[organism]"
gene_ids = esearch(query, retmax=100, api_key="YOUR_KEY")
# Get summaries
summaries = esummary(gene_ids, api_key="YOUR_KEY")
# Extract relevant information
for gene_id in gene_ids:
    if gene_id in summaries['result']:
        gene = summaries['result'][gene_id]
        print(f"{gene['name']}: {gene['description']}")
```
### Expected Output
- List of genes with disease associations
- Gene symbols, descriptions, and chromosomal locations
- Related publications and clinical annotations
---
## Gene Annotation Pipeline
### Use Case
Annotate a list of gene identifiers with comprehensive metadata.
### Workflow
1. **Prepare gene list**
Create a file `genes.txt` with gene symbols (one per line):
```
BRCA1
TP53
EGFR
KRAS
```
2. **Batch lookup**
```bash
python scripts/batch_gene_lookup.py --file genes.txt --organism human --output annotations.json --api-key YOUR_KEY
```
3. **Parse results**
```python
import json
with open('annotations.json', 'r') as f:
    genes = json.load(f)
for gene in genes:
    # Entries for symbols that were not found carry no 'gene_id'
    if 'gene_id' in gene:
        print(f"Symbol: {gene['symbol']}")
        print(f"ID: {gene['gene_id']}")
        print(f"Description: {gene['description']}")
        print(f"Location: chr{gene['chromosome']}:{gene['map_location']}")
        print()
```
4. **Enrich with sequence data**
```bash
# Get detailed data including sequences for specific genes
python scripts/fetch_gene_data.py --gene-id 672 --verbose > BRCA1_detailed.json
```
### Use Cases
- Creating gene annotation tables for publications
- Validating gene lists before analysis
- Building gene reference databases
- Quality control for genomic pipelines
---
## Cross-Species Gene Comparison
### Use Case
Find orthologs or compare the same gene across different species.
### Workflow
1. **Search for gene in multiple organisms**
```bash
# Find TP53 in human
python scripts/fetch_gene_data.py --symbol TP53 --taxon human
# Find the mouse ortholog (official symbol: Trp53)
python scripts/fetch_gene_data.py --symbol Trp53 --taxon mouse
# Find the zebrafish ortholog (official symbol: tp53)
python scripts/fetch_gene_data.py --symbol tp53 --taxon zebrafish
```
2. **Compare gene IDs across species**
```python
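# Compare gene information across species
from scripts.fetch_gene_data import fetch_gene_by_symbol

# Official symbols differ between species, so pair each taxon ID with its symbol
species = {
    'human': ('TP53', '9606'),
    'mouse': ('Trp53', '10090'),
    'rat': ('Tp53', '10116'),
}
for organism, (gene_symbol, taxon_id) in species.items():
    # Assumes fetch_gene_by_symbol accepts an NCBI taxon ID as its taxon argument
    gene_data = fetch_gene_by_symbol(gene_symbol, taxon_id)
    print(f"{organism}: {gene_data}")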
```
3. **Find orthologs using ELink**
```bash
# Get HomoloGene links for a gene
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=homologene&id=7157&retmode=json"
```
### Applications
- Evolutionary studies
- Model organism research
- Comparative genomics
- Cross-species experimental design
---
## Pathway Analysis
### Use Case
Identify genes involved in specific biological pathways or processes.
### Workflow
1. **Search by Gene Ontology (GO) term**
```bash
# Find genes involved in apoptosis
python scripts/query_gene.py --search "GO:0006915[biological process]" --organism human --max-results 100
```
2. **Search by pathway name**
```bash
# Find genes in insulin signaling pathway
python scripts/query_gene.py --search "insulin signaling pathway[pathway]" --organism human
```
3. **Get pathway-related genes**
```python
# Example: Get all genes in a specific pathway
import json
import urllib.parse
import urllib.request

# Search for pathway genes
query = "MAPK signaling pathway[pathway] AND human[organism]"
# URL-encode the term: raw spaces and brackets make urlopen reject the URL
params = urllib.parse.urlencode({'db': 'gene', 'term': query, 'retmode': 'json', 'retmax': 200})
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode())
gene_ids = data['esearchresult']['idlist']
print(f"Found {len(gene_ids)} genes in MAPK signaling pathway")
```
4. **Batch retrieve gene details**
```bash
# Get details for all pathway genes
python scripts/batch_gene_lookup.py --ids 5594,5595,5603,5604 --output mapk_genes.json
```
### Applications
- Pathway enrichment analysis
- Gene set analysis
- Systems biology studies
- Drug target identification
---
## Variant Analysis
### Use Case
Find genes with clinically relevant variants or disease-associated mutations.
### Workflow
1. **Search for genes with clinical variants**
```bash
# Find genes with pathogenic variants
python scripts/query_gene.py --search "pathogenic[clinical significance]" --organism human --max-results 50
```
2. **Link to ClinVar database**
```bash
# Get ClinVar records for a gene
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=clinvar&id=672&retmode=json"
```
3. **Search for pharmacogenomic genes**
```bash
# Find genes associated with drug response
python scripts/query_gene.py --search "pharmacogenomic[property]" --organism human
```
4. **Get variant summary data**
```python
# Example: Get genes with known variants
from scripts.query_gene import esearch, efetch
# Search for genes with variants
gene_ids = esearch("has variants[filter] AND human[organism]", retmax=100)
# Fetch detailed records
for gene_id in gene_ids[:10]:  # First 10
    data = efetch([gene_id], retmode='xml')
    # Parse XML for variant information
    print(f"Gene {gene_id} variant data...")
```
### Applications
- Clinical genetics
- Precision medicine
- Pharmacogenomics
- Genetic counseling
---
## Publication Mining
### Use Case
Find genes mentioned in recent publications or link genes to literature.
### Workflow
1. **Search genes mentioned in specific publications**
```bash
# Find genes mentioned in papers about CRISPR
python scripts/query_gene.py --search "CRISPR[text word]" --organism human --max-results 100
```
2. **Get PubMed articles for a gene**
```bash
# Get all publications for BRCA1
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id=672&retmode=json"
```
3. **Search by author or journal**
```bash
# Find genes studied by specific research group
python scripts/query_gene.py --search "Smith J[author] AND 2024[pdat]" --organism human
```
4. **Extract gene-publication relationships**
```python
# Example: Build gene-publication network
from scripts.query_gene import esearch, esummary
import urllib.request
import json
# Get gene
gene_id = '672'
# Get publications for gene
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id={gene_id}&retmode=json"
with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode())
# Extract PMIDs
pmids = []
for linkset in data.get('linksets', []):
    for linksetdb in linkset.get('linksetdbs', []):
        pmids.extend(linksetdb.get('links', []))
print(f"Gene {gene_id} has {len(pmids)} publications")
```
### Applications
- Literature reviews
- Grant writing
- Knowledge base construction
- Trend analysis in genomics research
---
## Advanced Patterns
### Combining Multiple Searches
```python
# Example: Find genes at intersection of multiple criteria
from scripts.query_gene import esearch

def find_genes_multi_criteria(organism='human'):
    # Each esearch call returns at most retmax IDs, so raise retmax
    # (or combine the terms with AND in one query) for complete results.
    # Criteria 1: Disease association
    disease_genes = set(esearch(f"diabetes[disease] AND {organism}[organism]"))
    # Criteria 2: Chromosome location
    chr_genes = set(esearch(f"11[chromosome] AND {organism}[organism]"))
    # Criteria 3: Gene type
    coding_genes = set(esearch(f"protein coding[gene type] AND {organism}[organism]"))
    # Intersection
    candidates = disease_genes & chr_genes & coding_genes
    return list(candidates)
```
### Rate-Limited Batch Processing
```python
import time
from scripts.query_gene import esummary

def process_genes_with_rate_limit(gene_ids, batch_size=200, delay=0.34):
    # delay=0.34 stays under 3 requests/second; 0.1 is enough with an API key
    results = []
    for i in range(0, len(gene_ids), batch_size):
        batch = gene_ids[i:i + batch_size]
        # Process batch
        batch_results = esummary(batch)
        results.append(batch_results)
        # Rate limit
        time.sleep(delay)
    return results
```
### Error Handling and Retry
```python
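import time
from scripts.fetch_gene_data import fetch_gene_by_id

def robust_gene_fetch(gene_id, max_retries=3):
    for attempt in range(max_retries):
        try:
            data = fetch_gene_by_id(gene_id)
            # fetch_gene_by_id returns {} on failure instead of raising
            if data:
                return data
        except Exception as e:
            print(f"Attempt {attempt + 1} for gene {gene_id} failed: {e}")
        if attempt < max_retries - 1:
            time.sleep(2 ** attempt)  # Exponential backoff
    print(f"Failed to fetch gene {gene_id} after {max_retries} attempts")
    return None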
```
---
## Tips and Best Practices
1. **Start Specific, Then Broaden**: Begin with precise queries and expand if needed
2. **Use Organism Filters**: Always specify organism for gene symbol searches
3. **Validate Results**: Check gene IDs and symbols for accuracy
4. **Cache Frequently Used Data**: Store common queries locally
5. **Monitor Rate Limits**: Use API keys and implement delays
6. **Combine APIs**: Use E-utilities for search, Datasets API for detailed data
7. **Handle Ambiguity**: Gene symbols may refer to different genes in different species
8. **Check Data Currency**: Gene annotations are updated regularly
9. **Use Batch Operations**: Process multiple genes together when possible
10. **Document Your Queries**: Keep records of search terms and parameters

@@ -0,0 +1,298 @@
#!/usr/bin/env python3
"""
Batch gene lookup using NCBI APIs.
This script efficiently processes multiple gene queries with proper
rate limiting and error handling.
"""
import argparse
import json
import sys
import time
import urllib.parse
import urllib.request
from typing import Optional, List, Dict, Any


def read_gene_list(filepath: str) -> List[str]:
    """
    Read gene identifiers from a file (one per line).
    Args:
        filepath: Path to file containing gene symbols or IDs
    Returns:
        List of gene identifiers
    """
    try:
        with open(filepath, 'r') as f:
            genes = [line.strip() for line in f if line.strip()]
        return genes
    except FileNotFoundError:
        print(f"Error: File '{filepath}' not found", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error reading file: {e}", file=sys.stderr)
        sys.exit(1)


def batch_esearch(queries: List[str], organism: Optional[str] = None,
                  api_key: Optional[str] = None) -> Dict[str, str]:
    """
    Search for multiple gene symbols and return their IDs.
    Args:
        queries: List of gene symbols
        organism: Optional organism filter
        api_key: Optional NCBI API key
    Returns:
        Dictionary mapping gene symbol to Gene ID (or 'NOT_FOUND')
    """
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    results = {}
    # Rate limiting
    delay = 0.1 if api_key else 0.34  # 10 req/sec with key, 3 req/sec without
    for query in queries:
        # Build search term
        search_term = f"{query}[gene]"
        if organism:
            search_term += f" AND {organism}[organism]"
        params = {
            'db': 'gene',
            'term': search_term,
            'retmax': 1,
            'retmode': 'json'
        }
        if api_key:
            params['api_key'] = api_key
        url = f"{base_url}esearch.fcgi?{urllib.parse.urlencode(params)}"
        try:
            with urllib.request.urlopen(url) as response:
                data = json.loads(response.read().decode())
            if 'esearchresult' in data and 'idlist' in data['esearchresult']:
                id_list = data['esearchresult']['idlist']
                results[query] = id_list[0] if id_list else 'NOT_FOUND'
            else:
                results[query] = 'ERROR'
        except Exception as e:
            print(f"Error searching for {query}: {e}", file=sys.stderr)
            results[query] = 'ERROR'
        time.sleep(delay)
    return results


def batch_esummary(gene_ids: List[str], api_key: Optional[str] = None,
                   chunk_size: int = 200) -> Dict[str, Dict[str, Any]]:
    """
    Get summaries for multiple genes in batches.
    Args:
        gene_ids: List of Gene IDs
        api_key: Optional NCBI API key
        chunk_size: Number of IDs per request (max 500)
    Returns:
        Dictionary mapping Gene ID to summary data
    """
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    all_results = {}
    # Rate limiting
    delay = 0.1 if api_key else 0.34
    # Process in chunks
    for i in range(0, len(gene_ids), chunk_size):
        chunk = gene_ids[i:i + chunk_size]
        params = {
            'db': 'gene',
            'id': ','.join(chunk),
            'retmode': 'json'
        }
        if api_key:
            params['api_key'] = api_key
        url = f"{base_url}esummary.fcgi?{urllib.parse.urlencode(params)}"
        try:
            with urllib.request.urlopen(url) as response:
                data = json.loads(response.read().decode())
            if 'result' in data:
                for gene_id in chunk:
                    if gene_id in data['result']:
                        all_results[gene_id] = data['result'][gene_id]
        except Exception as e:
            print(f"Error fetching summaries for chunk: {e}", file=sys.stderr)
        time.sleep(delay)
    return all_results


def batch_lookup_by_ids(gene_ids: List[str], api_key: Optional[str] = None) -> List[Dict[str, Any]]:
    """
    Lookup genes by IDs and return structured data.
    Args:
        gene_ids: List of Gene IDs
        api_key: Optional NCBI API key
    Returns:
        List of gene information dictionaries
    """
    summaries = batch_esummary(gene_ids, api_key=api_key)
    results = []
    for gene_id in gene_ids:
        if gene_id in summaries:
            gene = summaries[gene_id]
            results.append({
                'gene_id': gene_id,
                'symbol': gene.get('name', 'N/A'),
                'description': gene.get('description', 'N/A'),
                'organism': gene.get('organism', {}).get('scientificname', 'N/A'),
                'chromosome': gene.get('chromosome', 'N/A'),
                'map_location': gene.get('maplocation', 'N/A'),
                'type': gene.get('geneticsource', 'N/A')
            })
        else:
            results.append({
                'gene_id': gene_id,
                'error': 'Not found or error fetching'
            })
    return results


def batch_lookup_by_symbols(gene_symbols: List[str], organism: str,
                            api_key: Optional[str] = None) -> List[Dict[str, Any]]:
    """
    Lookup genes by symbols and return structured data.
    Args:
        gene_symbols: List of gene symbols
        organism: Organism name
        api_key: Optional NCBI API key
    Returns:
        List of gene information dictionaries
    """
    # First, search for IDs
    print(f"Searching for {len(gene_symbols)} gene symbols...", file=sys.stderr)
    symbol_to_id = batch_esearch(gene_symbols, organism=organism, api_key=api_key)
    # Filter to valid IDs
    valid_ids = [gid for gid in symbol_to_id.values() if gid not in ['NOT_FOUND', 'ERROR']]
    if not valid_ids:
        print("No genes found", file=sys.stderr)
        return []
    print(f"Found {len(valid_ids)} genes, fetching details...", file=sys.stderr)
    # Fetch summaries
    summaries = batch_esummary(valid_ids, api_key=api_key)
    # Build results
    results = []
    for symbol, gene_id in symbol_to_id.items():
        if gene_id == 'NOT_FOUND':
            results.append({
                'query_symbol': symbol,
                'status': 'not_found'
            })
        elif gene_id == 'ERROR':
            results.append({
                'query_symbol': symbol,
                'status': 'error'
            })
        elif gene_id in summaries:
            gene = summaries[gene_id]
            results.append({
                'query_symbol': symbol,
                'gene_id': gene_id,
                'symbol': gene.get('name', 'N/A'),
                'description': gene.get('description', 'N/A'),
                'organism': gene.get('organism', {}).get('scientificname', 'N/A'),
                'chromosome': gene.get('chromosome', 'N/A'),
                'map_location': gene.get('maplocation', 'N/A'),
                'type': gene.get('geneticsource', 'N/A')
            })
        else:
            # ID was found but its summary could not be fetched; report it
            # instead of silently dropping the symbol from the output
            results.append({
                'query_symbol': symbol,
                'status': 'error'
            })
    return results


def main():
    parser = argparse.ArgumentParser(
        description='Batch gene lookup using NCBI APIs',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Lookup by gene IDs
  %(prog)s --ids 672,7157,5594

  # Lookup by symbols from a file
  %(prog)s --file genes.txt --organism human

  # Lookup with API key and save to file
  %(prog)s --ids 672,7157,5594 --api-key YOUR_KEY --output results.json
        """
    )
    parser.add_argument('--ids', '-i', help='Comma-separated Gene IDs')
    parser.add_argument('--file', '-f', help='File containing gene symbols (one per line)')
    parser.add_argument('--organism', '-o', help='Organism name (required with --file)')
    parser.add_argument('--output', '-O', help='Output file path (JSON format)')
    parser.add_argument('--api-key', '-k', help='NCBI API key')
    parser.add_argument('--pretty', '-p', action='store_true',
                        help='Pretty-print JSON output')

    args = parser.parse_args()

    if not args.ids and not args.file:
        parser.error("Either --ids or --file must be provided")
    if args.file and not args.organism:
        parser.error("--organism is required when using --file")

    # Process genes
    if args.ids:
        # Skip empty entries, e.g. those left behind by a trailing comma
        gene_ids = [gid.strip() for gid in args.ids.split(',') if gid.strip()]
        results = batch_lookup_by_ids(gene_ids, api_key=args.api_key)
    else:
        gene_symbols = read_gene_list(args.file)
        results = batch_lookup_by_symbols(gene_symbols, args.organism, api_key=args.api_key)

    # Output results
    indent = 2 if args.pretty else None
    json_output = json.dumps(results, indent=indent)

    if args.output:
        try:
            with open(args.output, 'w', encoding='utf-8') as f:
                f.write(json_output)
            print(f"Results written to {args.output}", file=sys.stderr)
        except OSError as e:
            print(f"Error writing output file: {e}", file=sys.stderr)
            sys.exit(1)
    else:
        print(json_output)


if __name__ == '__main__':
    main()
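
A long ID list passed to `--ids` can be split into smaller batches before any lookup is made. The following is a minimal sketch of that idea, not part of the script: the helper name `chunked` and the batch size are hypothetical and chosen only for illustration.

```python
from typing import Iterator, List, Sequence


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


if __name__ == "__main__":
    # Hypothetical input: Gene IDs as they might arrive from --ids
    ids = ["672", "7157", "5594", "1956", "348"]
    for batch in chunked(ids, 2):
        # Each batch could be joined and handed to a single lookup call
        print(",".join(batch))
```

Each batch could then be passed to a function such as `batch_lookup_by_ids`, with a pause between calls to stay within NCBI's rate limits.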


@@ -0,0 +1,277 @@
#!/usr/bin/env python3
"""
Fetch gene data from NCBI using the Datasets API.
This script provides access to the NCBI Datasets API for retrieving
comprehensive gene information including metadata and sequences.
"""
import argparse
import json
import sys
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional, Dict, Any, List

# Base path for gene endpoints of the NCBI Datasets REST API (v2)
DATASETS_API_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2/gene"


def get_taxon_id(taxon_name: str) -> Optional[str]:
    """
    Convert taxon name to NCBI taxon ID.

    Args:
        taxon_name: Common or scientific name (e.g., "human", "Homo sapiens")

    Returns:
        Taxon ID as string, or None if not found
    """
    # Common mappings
    common_taxa = {
        'human': '9606',
        'homo sapiens': '9606',
        'mouse': '10090',
        'mus musculus': '10090',
        'rat': '10116',
        'rattus norvegicus': '10116',
        'zebrafish': '7955',
        'danio rerio': '7955',
        'fruit fly': '7227',
        'drosophila melanogaster': '7227',
        'c. elegans': '6239',
        'caenorhabditis elegans': '6239',
        'yeast': '4932',
        'saccharomyces cerevisiae': '4932',
        'arabidopsis': '3702',
        'arabidopsis thaliana': '3702',
        'e. coli': '562',
        'escherichia coli': '562',
    }

    taxon_lower = taxon_name.lower().strip()
    return common_taxa.get(taxon_lower)


def fetch_gene_by_id(gene_id: str, api_key: Optional[str] = None) -> Dict[str, Any]:
    """
    Fetch gene data by Gene ID.

    Args:
        gene_id: NCBI Gene ID
        api_key: Optional NCBI API key

    Returns:
        Gene data as dictionary, or an empty dict on error
    """
    # Percent-encode the ID so stray characters cannot alter the URL path
    url = f"{DATASETS_API_BASE}/id/{urllib.parse.quote(str(gene_id), safe='')}"
    headers = {}
    if api_key:
        headers['api-key'] = api_key

    try:
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as response:
            return json.loads(response.read().decode())
    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code}: {e.reason}", file=sys.stderr)
        if e.code == 404:
            print(f"Gene ID {gene_id} not found", file=sys.stderr)
        return {}
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return {}


def fetch_gene_by_symbol(symbol: str, taxon: str, api_key: Optional[str] = None) -> Dict[str, Any]:
    """
    Fetch gene data by gene symbol and taxon.

    Args:
        symbol: Gene symbol (e.g., "BRCA1")
        taxon: Organism name or taxon ID
        api_key: Optional NCBI API key

    Returns:
        Gene data as dictionary, or an empty dict on error
    """
    # Convert taxon name to ID if known; otherwise use the value as-is
    # (it might already be a taxon ID)
    taxon_id = get_taxon_id(taxon) or taxon

    # Percent-encode both path segments so spaces or slashes cannot break the URL
    safe_symbol = urllib.parse.quote(symbol, safe='')
    safe_taxon = urllib.parse.quote(str(taxon_id), safe='')
    url = f"{DATASETS_API_BASE}/symbol/{safe_symbol}/taxon/{safe_taxon}"
    headers = {}
    if api_key:
        headers['api-key'] = api_key

    try:
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as response:
            return json.loads(response.read().decode())
    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code}: {e.reason}", file=sys.stderr)
        if e.code == 404:
            print(f"Gene symbol '{symbol}' not found for taxon {taxon}", file=sys.stderr)
        return {}
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return {}


def fetch_multiple_genes(gene_ids: List[str], api_key: Optional[str] = None) -> Dict[str, Any]:
    """
    Fetch data for multiple genes by ID.

    Args:
        gene_ids: List of Gene IDs
        api_key: Optional NCBI API key

    Returns:
        Combined gene data as dictionary, or an empty dict on error
    """
    # The gene/id endpoint takes a comma-separated ID list in the URL path,
    # so a single GET request covers the whole batch.
    id_list = ','.join(urllib.parse.quote(str(gid), safe='') for gid in gene_ids)
    url = f"{DATASETS_API_BASE}/id/{id_list}"
    headers = {}
    if api_key:
        headers['api-key'] = api_key

    try:
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as response:
            return json.loads(response.read().decode())
    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code}: {e.reason}", file=sys.stderr)
        return {}
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return {}


def display_gene_info(data: Dict[str, Any], verbose: bool = False) -> None:
    """
    Display gene information in human-readable format.

    Args:
        data: Gene data dictionary from API
        verbose: Show detailed information
    """
    # Datasets v2 responses wrap records in 'reports'; 'genes' is kept as a
    # fallback for the older response layout.
    records = data.get('reports') or data.get('genes')
    if not records:
        print("No gene data found in response")
        return

    for record in records:
        gene_info = record.get('gene', {})

        print(f"Gene ID: {gene_info.get('gene_id', 'N/A')}")
        print(f"Symbol: {gene_info.get('symbol', 'N/A')}")
        print(f"Description: {gene_info.get('description', 'N/A')}")

        # The organism field may be spelled 'taxname' or 'tax_name'
        organism = gene_info.get('taxname') or gene_info.get('tax_name')
        if organism:
            print(f"Organism: {organism}")

        if gene_info.get('chromosomes'):
            chromosomes = ', '.join(gene_info['chromosomes'])
            print(f"Chromosome(s): {chromosomes}")

        # Nomenclature
        if 'nomenclature_authority' in gene_info:
            auth = gene_info['nomenclature_authority']
            print(f"Nomenclature: {auth.get('authority', 'N/A')}")

        # Synonyms
        if gene_info.get('synonyms'):
            print(f"Synonyms: {', '.join(gene_info['synonyms'])}")

        if verbose:
            # Gene type
            if 'type' in gene_info:
                print(f"Type: {gene_info['type']}")

            # Genomic locations
            if 'genomic_ranges' in gene_info:
                print("\nGenomic Locations:")
                for range_info in gene_info['genomic_ranges']:
                    accession = range_info.get('accession_version', 'N/A')
                    # 'range' may be missing or empty, so fall back to a blank entry
                    first = (range_info.get('range') or [{}])[0]
                    start = first.get('begin', 'N/A')
                    end = first.get('end', 'N/A')
                    strand = first.get('orientation', range_info.get('orientation', 'N/A'))
                    print(f"  {accession}: {start}-{end} ({strand})")

            # Transcripts
            if 'transcripts' in gene_info:
                print(f"\nTranscripts: {len(gene_info['transcripts'])}")
                for transcript in gene_info['transcripts'][:5]:  # Show first 5
                    print(f"  {transcript.get('accession_version', 'N/A')}")

        print()


def main():
    parser = argparse.ArgumentParser(
        description='Fetch gene data from NCBI Datasets API',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Fetch by Gene ID
  %(prog)s --gene-id 672

  # Fetch by gene symbol and organism
  %(prog)s --symbol BRCA1 --taxon human

  # Fetch multiple genes
  %(prog)s --gene-id 672,7157,5594

  # Get JSON output
  %(prog)s --symbol TP53 --taxon "Homo sapiens" --output json

  # Verbose output with details
  %(prog)s --gene-id 672 --verbose
        """
    )
    parser.add_argument('--gene-id', '-g', help='Gene ID(s), comma-separated')
    parser.add_argument('--symbol', '-s', help='Gene symbol')
    parser.add_argument('--taxon', '-t', help='Organism name or taxon ID (required with --symbol)')
    parser.add_argument('--output', '-o', choices=['pretty', 'json'], default='pretty',
                        help='Output format (default: pretty)')
    parser.add_argument('--verbose', '-v', action='store_true',
                        help='Show detailed information')
    parser.add_argument('--api-key', '-k', help='NCBI API key')

    args = parser.parse_args()

    if not args.gene_id and not args.symbol:
        parser.error("Either --gene-id or --symbol must be provided")
    if args.symbol and not args.taxon:
        parser.error("--taxon is required when using --symbol")

    # Fetch data
    if args.gene_id:
        # Skip empty entries, e.g. those left behind by a trailing comma
        gene_ids = [gid.strip() for gid in args.gene_id.split(',') if gid.strip()]
        if len(gene_ids) == 1:
            data = fetch_gene_by_id(gene_ids[0], api_key=args.api_key)
        else:
            data = fetch_multiple_genes(gene_ids, api_key=args.api_key)
    else:
        data = fetch_gene_by_symbol(args.symbol, args.taxon, api_key=args.api_key)

    if not data:
        sys.exit(1)

    # Output
    if args.output == 'json':
        print(json.dumps(data, indent=2))
    else:
        display_gene_info(data, verbose=args.verbose)


if __name__ == '__main__':
    main()
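
When the JSON output is consumed by another program rather than read by a person, it can help to flatten the response into simple rows first. The sketch below shows one way to do that offline. It assumes the response wraps records in a `reports` list (with `genes` as a fallback) and that the organism appears as `taxname` or `tax_name`; the function name `summarize_reports` and the `SAMPLE` payload are hypothetical and hand-written, not real API output.

```python
from typing import Any, Dict, List


def summarize_reports(data: Dict[str, Any]) -> List[Dict[str, str]]:
    """Flatten a Datasets-style gene response into one small dict per gene."""
    # Assumption: records live under 'reports' (or 'genes' in older layouts)
    records = data.get("reports") or data.get("genes") or []
    rows = []
    for record in records:
        gene = record.get("gene", {})
        rows.append({
            "gene_id": str(gene.get("gene_id", "N/A")),
            "symbol": gene.get("symbol", "N/A"),
            "organism": gene.get("taxname") or gene.get("tax_name") or "N/A",
        })
    return rows


# Hand-written sample shaped like a response; for illustration only.
SAMPLE = {
    "reports": [
        {"gene": {"gene_id": "672", "symbol": "BRCA1", "taxname": "Homo sapiens"}}
    ]
}

if __name__ == "__main__":
    for row in summarize_reports(SAMPLE):
        print(row)
```

Keeping the flattening step separate from the network call makes it possible to test the parsing logic without contacting NCBI.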


@@ -0,0 +1,251 @@
#!/usr/bin/env python3
"""
Query NCBI Gene database using E-utilities.
This script provides access to ESearch, ESummary, and EFetch functions
for searching and retrieving gene information.
"""
import argparse
import json
import sys
import time
import urllib.error
import urllib.parse
import urllib.request
from typing import Optional, Dict, List, Any
from xml.etree import ElementTree as ET
BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
DB = "gene"


def esearch(query: str, retmax: int = 20, api_key: Optional[str] = None) -> List[str]:
    """
    Search NCBI Gene database and return list of Gene IDs.

    Args:
        query: Search query (e.g., "BRCA1[gene] AND human[organism]")
        retmax: Maximum number of results to return
        api_key: Optional NCBI API key for higher rate limits

    Returns:
        List of Gene IDs as strings (empty on error)
    """
    params = {
        'db': DB,
        'term': query,
        'retmax': retmax,
        'retmode': 'json'
    }
    if api_key:
        params['api_key'] = api_key

    url = f"{BASE_URL}esearch.fcgi?{urllib.parse.urlencode(params)}"

    try:
        with urllib.request.urlopen(url) as response:
            data = json.loads(response.read().decode())

        if 'esearchresult' in data and 'idlist' in data['esearchresult']:
            return data['esearchresult']['idlist']
        print("Error: Unexpected response format", file=sys.stderr)
        return []
    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code}: {e.reason}", file=sys.stderr)
        return []
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return []


def esummary(gene_ids: List[str], api_key: Optional[str] = None) -> Dict[str, Any]:
    """
    Get document summaries for Gene IDs.

    Args:
        gene_ids: List of Gene IDs
        api_key: Optional NCBI API key

    Returns:
        Dictionary of gene summaries
    """
    params = {
        'db': DB,
        'id': ','.join(gene_ids),
        'retmode': 'json'
    }
    if api_key:
        params['api_key'] = api_key

    url = f"{BASE_URL}esummary.fcgi?{urllib.parse.urlencode(params)}"

    try:
        with urllib.request.urlopen(url) as response:
            data = json.loads(response.read().decode())
            return data
    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code}: {e.reason}", file=sys.stderr)
        return {}
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return {}


def efetch(gene_ids: List[str], retmode: str = 'xml', api_key: Optional[str] = None) -> str:
    """
    Fetch full gene records.

    Args:
        gene_ids: List of Gene IDs
        retmode: Return format ('xml', 'text', 'asn.1')
        api_key: Optional NCBI API key

    Returns:
        Gene records as string in requested format
    """
    params = {
        'db': DB,
        'id': ','.join(gene_ids),
        'retmode': retmode
    }
    if api_key:
        params['api_key'] = api_key

    url = f"{BASE_URL}efetch.fcgi?{urllib.parse.urlencode(params)}"

    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode()
    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code}: {e.reason}", file=sys.stderr)
        return ""
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return ""


def search_and_summarize(query: str, organism: Optional[str] = None,
                         max_results: int = 20, api_key: Optional[str] = None) -> None:
    """
    Search for genes and display summaries.

    Args:
        query: Gene search query
        organism: Optional organism filter
        max_results: Maximum number of results
        api_key: Optional NCBI API key
    """
    # Add organism filter if provided and not already part of the query
    if organism and '[organism]' not in query.lower():
        query = f"{query} AND {organism}[organism]"

    print(f"Searching for: {query}")
    print("-" * 80)

    # Search for gene IDs
    gene_ids = esearch(query, retmax=max_results, api_key=api_key)

    if not gene_ids:
        print("No results found.")
        return

    print(f"Found {len(gene_ids)} gene(s)")
    print()

    # Get summaries
    summaries = esummary(gene_ids, api_key=api_key)

    if 'result' in summaries:
        for gene_id in gene_ids:
            if gene_id in summaries['result']:
                gene = summaries['result'][gene_id]
                print(f"Gene ID: {gene_id}")
                print(f"  Symbol: {gene.get('name', 'N/A')}")
                print(f"  Description: {gene.get('description', 'N/A')}")
                print(f"  Organism: {gene.get('organism', {}).get('scientificname', 'N/A')}")
                print(f"  Chromosome: {gene.get('chromosome', 'N/A')}")
                print(f"  Map Location: {gene.get('maplocation', 'N/A')}")
                # 'geneticsource' describes the source of the record (e.g. genomic),
                # not the gene type, so label it accordingly
                print(f"  Genetic Source: {gene.get('geneticsource', 'N/A')}")
                print()

    # Respect rate limits
    time.sleep(0.34)  # ~3 requests per second


def fetch_by_id(gene_ids: List[str], output_format: str = 'json',
                api_key: Optional[str] = None) -> None:
    """
    Fetch and display gene information by ID.

    Args:
        gene_ids: List of Gene IDs
        output_format: Output format ('json', 'xml', 'text')
        api_key: Optional NCBI API key
    """
    if output_format == 'json':
        # Get summaries in JSON format
        summaries = esummary(gene_ids, api_key=api_key)
        print(json.dumps(summaries, indent=2))
    else:
        # Fetch full records
        data = efetch(gene_ids, retmode=output_format, api_key=api_key)
        print(data)

    # Respect rate limits
    time.sleep(0.34)


def main():
    parser = argparse.ArgumentParser(
        description='Query NCBI Gene database using E-utilities',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Search for gene by symbol
  %(prog)s --search "BRCA1" --organism "human"

  # Fetch gene by ID
  %(prog)s --id 672 --format json

  # Complex search query
  %(prog)s --search "insulin[gene] AND diabetes[disease]"

  # Multiple gene IDs
  %(prog)s --id 672,7157,5594
        """
    )
    parser.add_argument('--search', '-s', help='Search query')
    parser.add_argument('--organism', '-o', help='Organism filter')
    parser.add_argument('--id', '-i', help='Gene ID(s), comma-separated')
    parser.add_argument('--format', '-f', default='json',
                        choices=['json', 'xml', 'text'],
                        help='Output format (default: json)')
    parser.add_argument('--max-results', '-m', type=int, default=20,
                        help='Maximum number of search results (default: 20)')
    parser.add_argument('--api-key', '-k', help='NCBI API key for higher rate limits')

    args = parser.parse_args()

    if not args.search and not args.id:
        parser.error("Either --search or --id must be provided")

    if args.id:
        # Fetch by ID, skipping empty entries left by stray commas
        gene_ids = [gid.strip() for gid in args.id.split(',') if gid.strip()]
        fetch_by_id(gene_ids, output_format=args.format, api_key=args.api_key)
    else:
        # Search and summarize
        search_and_summarize(args.search, organism=args.organism,
                             max_results=args.max_results, api_key=args.api_key)


if __name__ == '__main__':
    main()
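
To check how a search term and organism filter end up in the request, the ESearch URL can be built and inspected without sending it. This is a minimal sketch under stated assumptions: `build_esearch_url` is a hypothetical helper, not a function in the script, and it mirrors the parameter names the script passes to `esearch.fcgi`.

```python
import urllib.parse
from typing import Optional

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_esearch_url(query: str, organism: Optional[str] = None,
                      retmax: int = 20) -> str:
    """Return the ESearch URL for a Gene query without performing the request."""
    # Append the organism filter only when the query does not already carry one
    if organism and "[organism]" not in query.lower():
        query = f"{query} AND {organism}[organism]"
    params = {"db": "gene", "term": query, "retmax": retmax, "retmode": "json"}
    # urlencode percent-encodes brackets and turns spaces into '+'
    return f"{EUTILS_BASE}esearch.fcgi?{urllib.parse.urlencode(params)}"


if __name__ == "__main__":
    print(build_esearch_url("BRCA1", organism="human"))
```

Printing the URL first is a cheap way to confirm that field tags such as `[organism]` survive encoding before spending a request against the rate limit.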