Initial commit
173
skills/gene-database/SKILL.md
Normal file
@@ -0,0 +1,173 @@
---
name: gene-database
description: "Query NCBI Gene via E-utilities/Datasets API. Search by symbol/ID, retrieve gene info (RefSeqs, GO, locations, phenotypes), batch lookups, for gene annotation and functional analysis."
---

# Gene Database

## Overview

NCBI Gene is a comprehensive database integrating gene information from diverse species. It provides nomenclature, reference sequences (RefSeqs), chromosomal maps, biological pathways, genetic variations, phenotypes, and cross-references to global genomic resources.

## When to Use This Skill

This skill should be used when working with gene data including searching by gene symbol or ID, retrieving gene sequences and metadata, analyzing gene functions and pathways, or performing batch gene lookups.

## Quick Start

NCBI provides two main APIs for gene data access:

1. **E-utilities** (Traditional): Full-featured API for all Entrez databases with flexible querying
2. **NCBI Datasets API** (Newer): Optimized for gene data retrieval with simplified workflows

Choose E-utilities for complex queries and cross-database searches. Choose Datasets API for straightforward gene data retrieval with metadata and sequences in a single request.

## Common Workflows

### Search Genes by Symbol or Name

To search for genes by symbol or name across organisms:

1. Use the `scripts/query_gene.py` script with E-utilities ESearch
2. Specify the gene symbol and organism (e.g., "BRCA1 in human")
3. The script returns matching Gene IDs

Example query patterns:
- Gene symbol: `insulin[gene name] AND human[organism]`
- Gene with disease: `dystrophin[gene name] AND muscular dystrophy[disease]`
- Chromosome location: `human[organism] AND 17q21[chromosome]`

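The query patterns above can be turned into an ESearch URL with a few lines of standard-library Python. This is a minimal sketch, not the implementation in `scripts/query_gene.py`: the helper names `build_gene_term` and `build_esearch_url` are hypothetical, and the only parameters used are the ones listed in the API reference (`db`, `term`, `retmode`, `retmax`, `api_key`).

```python
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_gene_term(symbol, organism=None):
    """Combine a gene symbol and an optional organism into an Entrez query term."""
    term = f"{symbol}[gene name]"
    if organism:
        term += f" AND {organism}[organism]"
    return term


def build_esearch_url(term, retmax=20, api_key=None):
    """Return a fully encoded ESearch URL for the Gene database (hypothetical helper)."""
    params = {"db": "gene", "term": term, "retmode": "json", "retmax": retmax}
    if api_key:
        params["api_key"] = api_key
    # urlencode percent-encodes brackets and turns spaces into '+'
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


url = build_esearch_url(build_gene_term("BRCA1", "human"))
print(url)
```

Encoding the term through `urlencode` avoids hand-escaping the square brackets and spaces that field-tagged queries contain.
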
### Retrieve Gene Information by ID

To fetch detailed information for known Gene IDs:

1. Use `scripts/fetch_gene_data.py` with the Datasets API for comprehensive data
2. Alternatively, use `scripts/query_gene.py` with E-utilities EFetch for specific formats
3. Specify desired output format (JSON, XML, or text)

The Datasets API returns:
- Gene nomenclature and aliases
- Reference sequences (RefSeqs) for transcripts and proteins
- Chromosomal location and mapping
- Gene Ontology (GO) annotations
- Associated publications

### Batch Gene Lookups

To look up multiple genes at once:

1. Use `scripts/batch_gene_lookup.py` for efficient batch processing
2. Provide a list of gene symbols or IDs
3. Specify the organism for symbol-based queries
4. The script handles rate limiting automatically (10 requests/second with an API key, 3 requests/second without)

This workflow is useful for:
- Validating gene lists
- Retrieving metadata for gene panels
- Cross-referencing gene identifiers
- Building gene annotation tables

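Before submitting a batch, it can help to clean the input list. The sketch below is one possible pre-processing step, not part of `scripts/batch_gene_lookup.py`: the function name `normalize_gene_list` is hypothetical, and treating lines that start with `#` as comments and de-duplicating case-insensitively are assumptions made for illustration.

```python
def normalize_gene_list(raw_entries):
    """Split raw entries into numeric Gene IDs and symbols.

    Blank entries, '#' comment lines, and case-insensitive duplicates are dropped.
    The original casing of the first occurrence is preserved.
    """
    gene_ids, symbols, seen = [], [], set()
    for entry in raw_entries:
        token = entry.strip()
        if not token or token.startswith("#"):
            continue
        key = token.upper()
        if key in seen:
            continue
        seen.add(key)
        if token.isdigit():
            gene_ids.append(token)  # purely numeric entries are treated as Gene IDs
        else:
            symbols.append(token)
    return gene_ids, symbols


ids, syms = normalize_gene_list(["BRCA1", " tp53 ", "672", "", "brca1", "# panel v2"])
print(ids, syms)
```

Separating IDs from symbols up front matters because only symbol lookups need an organism filter.
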
### Search by Biological Context

To find genes associated with specific biological functions or phenotypes:

1. Use E-utilities with Gene Ontology (GO) terms or phenotype keywords
2. Query by pathway names or disease associations
3. Filter by organism, chromosome, or other attributes

Example searches:
- By GO term: `GO:0006915[biological process]` (apoptosis)
- By phenotype: `diabetes[phenotype] AND mouse[organism]`
- By pathway: `insulin signaling pathway[pathway]`

### API Access Patterns

**Rate Limits:**
- Without API key: 3 requests/second for E-utilities, 5 requests/second for Datasets API
- With API key: 10 requests/second for both APIs

**Authentication:**
Register for a free NCBI API key at https://www.ncbi.nlm.nih.gov/account/ to increase rate limits.

**Error Handling:**
Both APIs return standard HTTP status codes. Common errors include:
- 400: Malformed query or invalid parameters
- 429: Rate limit exceeded
- 404: Gene ID not found

Retry transient failures (such as 429 and server errors) with exponential backoff. A 400 or 404 will not succeed on retry, so fix the request instead.

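The distinction between retryable and non-retryable failures can be sketched as a small policy. This is an illustration under stated assumptions, not code from the bundled scripts: `RETRYABLE_STATUS`, `should_retry`, and `backoff_delays` are hypothetical names, and including 502, 503, and 504 alongside 429 and 500 is a common convention rather than something NCBI documents here.

```python
# Status codes treated as transient (assumption: 502/503/504 added by convention)
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


def should_retry(status_code):
    """Return True when a failed request is worth repeating."""
    return status_code in RETRYABLE_STATUS


def backoff_delays(max_attempts, base=1.0):
    """Seconds to wait before each retry: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(max_attempts - 1)]


print(should_retry(429), should_retry(404))
print(backoff_delays(3))
```

With three attempts there are only two waits, because nothing follows the final attempt.
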
## Script Usage

### query_gene.py

Query NCBI Gene using E-utilities (ESearch, ESummary, EFetch).

```bash
python scripts/query_gene.py --search "BRCA1" --organism "human"
python scripts/query_gene.py --id 672 --format json
python scripts/query_gene.py --search "insulin[gene] AND diabetes[disease]"
```

### fetch_gene_data.py

Fetch comprehensive gene data using NCBI Datasets API.

```bash
python scripts/fetch_gene_data.py --gene-id 672
python scripts/fetch_gene_data.py --symbol BRCA1 --taxon human
python scripts/fetch_gene_data.py --symbol TP53 --taxon "Homo sapiens" --output json
```

### batch_gene_lookup.py

Process multiple gene queries efficiently.

```bash
python scripts/batch_gene_lookup.py --file gene_list.txt --organism human
python scripts/batch_gene_lookup.py --ids 672,7157,5594 --output results.json
```

## API References

For detailed API documentation including endpoints, parameters, response formats, and examples, refer to:

- `references/api_reference.md` - Comprehensive API documentation for E-utilities and Datasets API
- `references/common_workflows.md` - Additional examples and use case patterns

Search these references when needing specific API endpoint details, parameter options, or response structure information.

## Data Formats

NCBI Gene data can be retrieved in multiple formats:

- **JSON**: Structured data ideal for programmatic processing
- **XML**: Detailed hierarchical format with full metadata
- **GenBank**: Sequence data with annotations
- **FASTA**: Sequence data only
- **Text**: Human-readable summaries

Choose JSON for modern applications, XML for legacy systems requiring detailed metadata, and FASTA for sequence analysis workflows.

## Best Practices

1. **Always specify organism** when searching by gene symbol to avoid ambiguity
2. **Use Gene IDs** for precise lookups when available
3. **Batch requests** when working with multiple genes to minimize API calls
4. **Cache results** locally to reduce redundant queries
5. **Include API key** in scripts for higher rate limits
6. **Handle errors gracefully** with retry logic for transient failures
7. **Validate gene symbols** before batch processing to catch typos

## Resources

This skill includes:

### scripts/
- `query_gene.py` - Query genes using E-utilities (ESearch, ESummary, EFetch)
- `fetch_gene_data.py` - Fetch gene data using NCBI Datasets API
- `batch_gene_lookup.py` - Handle multiple gene queries efficiently

### references/
- `api_reference.md` - Detailed API documentation for both E-utilities and Datasets API
- `common_workflows.md` - Examples of common gene queries and use cases

404
skills/gene-database/references/api_reference.md
Normal file
@@ -0,0 +1,404 @@
# NCBI Gene API Reference

This document provides detailed API documentation for accessing the NCBI Gene database programmatically.

## Table of Contents

1. [E-utilities API](#e-utilities-api)
2. [NCBI Datasets API](#ncbi-datasets-api)
3. [Authentication and Rate Limits](#authentication-and-rate-limits)
4. [Error Handling](#error-handling)

---

## E-utilities API

E-utilities (Entrez Programming Utilities) provide a stable interface to NCBI's Entrez databases.

### Base URL

```
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
```

### Common Parameters

- `db` - Database name (use `gene` for the Gene database)
- `api_key` - API key for higher rate limits
- `retmode` - Return format (json, xml, text)
- `retmax` - Maximum number of records to return

### ESearch - Search Database

Search for genes matching a text query.

**Endpoint:** `esearch.fcgi`

**Parameters:**
- `db=gene` (required) - Database to search
- `term` (required) - Search query
- `retmax` - Maximum results (default: 20)
- `retmode` - json or xml (default: xml)
- `usehistory=y` - Store results on history server for large result sets

**Query Syntax:**
- Gene symbol: `BRCA1[gene]` or `BRCA1[gene name]`
- Organism: `human[organism]` or `9606[taxid]`
- Combine terms: `BRCA1[gene] AND human[organism]`
- Disease: `muscular dystrophy[disease]`
- Chromosome: `17q21[chromosome]`
- GO terms: `GO:0006915[biological process]`

**Example Request:**

```bash
# Square brackets are percent-encoded (%5B, %5D) so curl does not treat them as glob ranges
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term=BRCA1%5Bgene%5D+AND+human%5Borganism%5D&retmode=json"
```

**Response Format (JSON):**

```json
{
  "esearchresult": {
    "count": "1",
    "retmax": "1",
    "retstart": "0",
    "idlist": ["672"],
    "translationset": [],
    "querytranslation": "BRCA1[Gene Name] AND human[Organism]"
  }
}
```

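A response of the shape shown above can be unpacked as follows. This is a sketch based only on the example payload in this section; `parse_esearch` is a hypothetical helper name. Note that the counts arrive as strings and need converting before comparison.

```python
def parse_esearch(payload):
    """Return (total_count, gene_ids, truncated) from an ESearch JSON payload."""
    result = payload.get("esearchresult", {})
    total = int(result.get("count", 0))  # counts are strings in the JSON
    gene_ids = list(result.get("idlist", []))
    truncated = total > len(gene_ids)  # more matches exist than were returned
    return total, gene_ids, truncated


# Sample copied from the response format shown in this section
sample = {
    "esearchresult": {
        "count": "1",
        "retmax": "1",
        "retstart": "0",
        "idlist": ["672"],
        "translationset": [],
        "querytranslation": "BRCA1[Gene Name] AND human[Organism]",
    }
}

total, gene_ids, truncated = parse_esearch(sample)
print(total, gene_ids, truncated)
```

When `truncated` is true, raising `retmax` or using `usehistory=y` is the usual next step.
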
### ESummary - Document Summaries

Retrieve document summaries for Gene IDs.

**Endpoint:** `esummary.fcgi`

**Parameters:**
- `db=gene` (required) - Database
- `id` (required) - Comma-separated Gene IDs (up to 500)
- `retmode` - json or xml (default: xml)

**Example Request:**

```bash
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=672&retmode=json"
```

**Response Format (JSON):**

```json
{
  "result": {
    "672": {
      "uid": "672",
      "name": "BRCA1",
      "description": "BRCA1 DNA repair associated",
      "organism": {
        "scientificname": "Homo sapiens",
        "commonname": "human",
        "taxid": 9606
      },
      "chromosome": "17",
      "geneticsource": "genomic",
      "maplocation": "17q21.31",
      "nomenclaturesymbol": "BRCA1",
      "nomenclaturename": "BRCA1 DNA repair associated"
    }
  }
}
```

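The summary records can be flattened into rows for a table. The sketch below relies only on the keys visible in the example response; `summarize_genes` is a hypothetical name. It skips any entry under `result` that is not a record, on the assumption that real responses may carry bookkeeping entries (such as a list of returned UIDs) next to the per-gene objects.

```python
def summarize_genes(payload):
    """Flatten an ESummary JSON payload into a list of simple row dicts."""
    rows = []
    for uid, record in payload.get("result", {}).items():
        if not isinstance(record, dict):
            continue  # skip non-record entries, e.g. a list of returned UIDs
        organism = record.get("organism", {})
        rows.append({
            "gene_id": uid,
            "symbol": record.get("name", ""),
            "description": record.get("description", ""),
            "location": record.get("maplocation", ""),
            "species": organism.get("scientificname", ""),
        })
    return rows


sample = {
    "result": {
        "uids": ["672"],  # assumed bookkeeping entry, included to show it is skipped
        "672": {
            "uid": "672",
            "name": "BRCA1",
            "description": "BRCA1 DNA repair associated",
            "organism": {"scientificname": "Homo sapiens", "commonname": "human", "taxid": 9606},
            "chromosome": "17",
            "maplocation": "17q21.31",
        },
    }
}

for row in summarize_genes(sample):
    print(row)
```
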
### EFetch - Full Records

Fetch detailed gene records in various formats.

**Endpoint:** `efetch.fcgi`

**Parameters:**
- `db=gene` (required) - Database
- `id` (required) - Comma-separated Gene IDs
- `retmode` - xml, text, asn.1 (default: xml)
- `rettype` - gene_table, docsum

**Example Request:**

```bash
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=672&retmode=xml"
```

**XML Response:** Contains detailed gene information including:
- Gene nomenclature
- Sequence locations
- Transcript variants
- Protein products
- Gene Ontology annotations
- Cross-references
- Publications

### ELink - Related Records

Find related records in Gene or other databases.

**Endpoint:** `elink.fcgi`

**Parameters:**
- `dbfrom=gene` (required) - Source database
- `db` (required) - Target database (gene, nuccore, protein, pubmed, etc.)
- `id` (required) - Gene ID(s)

**Example Request:**

```bash
# Get related PubMed articles
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id=672&retmode=json"
```

### EInfo - Database Information

Get information about the Gene database.

**Endpoint:** `einfo.fcgi`

**Parameters:**
- `db=gene` - Database to query

**Example Request:**

```bash
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=gene&retmode=json"
```

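EInfo is a convenient way to check which field tags a database actually accepts before using them in a query. The sketch below assumes the JSON response nests a `fieldlist` of `name`/`fullname` entries under `einforesult` → `dbinfo`; that layout, the helper name `list_search_fields`, and the trimmed sample payload are all assumptions for illustration, so compare against a live response before relying on it.

```python
def list_search_fields(payload):
    """Map field abbreviations to full names from an EInfo JSON payload."""
    dbinfo = payload.get("einforesult", {}).get("dbinfo", {})
    if isinstance(dbinfo, list):  # tolerate either a list of databases or a single object
        dbinfo = dbinfo[0] if dbinfo else {}
    return {
        field["name"]: field.get("fullname", "")
        for field in dbinfo.get("fieldlist", [])
        if "name" in field
    }


# Hand-written, trimmed sample (not captured API output)
sample = {
    "einforesult": {
        "dbinfo": [
            {
                "dbname": "gene",
                "fieldlist": [
                    {"name": "GENE", "fullname": "Gene Name"},
                    {"name": "ORGN", "fullname": "Organism"},
                ],
            }
        ]
    }
}

for tag, full_name in list_search_fields(sample).items():
    print(f"[{tag}] {full_name}")
```
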
---

## NCBI Datasets API

The Datasets API provides streamlined access to gene data with metadata and sequences.

### Base URL

Endpoint paths in this section (for example `/gene/id/{gene_id}`) are appended to this base URL:

```
https://api.ncbi.nlm.nih.gov/datasets/v2alpha
```

### Authentication

Include API key in request headers:

```
api-key: YOUR_API_KEY
```

### Get Gene by ID

Retrieve gene data by Gene ID.

**Endpoint:** `GET /gene/id/{gene_id}`

**Example Request:**

```bash
curl "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene/id/672"
```

**Response Format (JSON):**

```json
{
  "genes": [
    {
      "gene": {
        "gene_id": "672",
        "symbol": "BRCA1",
        "description": "BRCA1 DNA repair associated",
        "tax_name": "Homo sapiens",
        "taxid": 9606,
        "chromosomes": ["17"],
        "type": "protein-coding",
        "synonyms": ["BRCC1", "FANCS", "PNCA4", "RNF53"],
        "nomenclature_authority": {
          "authority": "HGNC",
          "identifier": "HGNC:1100"
        },
        "genomic_ranges": [
          {
            "accession_version": "NC_000017.11",
            "range": [
              {
                "begin": 43044295,
                "end": 43170245,
                "orientation": "minus"
              }
            ]
          }
        ],
        "transcripts": [
          {
            "accession_version": "NM_007294.4",
            "length": 7207
          }
        ]
      }
    }
  ]
}
```

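Genomic coordinates can be pulled out of a payload shaped like the example above. This sketch uses only the keys shown in that example, and `extract_gene_locations` is a hypothetical name. Live responses may be wrapped or named differently depending on the API version, so treat the key paths as assumptions to verify.

```python
def extract_gene_locations(payload):
    """Return (symbol, accession, begin, end, orientation) tuples for each genomic range."""
    rows = []
    for item in payload.get("genes", []):
        gene = item.get("gene", {})
        for genomic_range in gene.get("genomic_ranges", []):
            for span in genomic_range.get("range", []):
                rows.append((
                    gene.get("symbol"),
                    genomic_range.get("accession_version"),
                    int(span["begin"]),
                    int(span["end"]),
                    span.get("orientation"),
                ))
    return rows


# Trimmed copy of the example response in this section
sample = {
    "genes": [
        {
            "gene": {
                "gene_id": "672",
                "symbol": "BRCA1",
                "genomic_ranges": [
                    {
                        "accession_version": "NC_000017.11",
                        "range": [{"begin": 43044295, "end": 43170245, "orientation": "minus"}],
                    }
                ],
            }
        }
    ]
}

for symbol, accession, begin, end, orientation in extract_gene_locations(sample):
    print(f"{symbol}: {accession}:{begin}-{end} ({orientation}), {end - begin + 1} bp")
```
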
### Get Gene by Symbol

Retrieve gene data by symbol and organism.

**Endpoint:** `GET /gene/symbol/{symbol}/taxon/{taxon}`

**Parameters:**
- `{symbol}` - Gene symbol (e.g., BRCA1)
- `{taxon}` - Taxon ID (e.g., 9606 for human)

**Example Request:**

```bash
curl "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene/symbol/BRCA1/taxon/9606"
```

### Get Multiple Genes

Retrieve data for multiple genes.

**Endpoint:** `POST /gene/id`

**Request Body:**

```json
{
  "gene_ids": ["672", "7157", "5594"]
}
```

**Example Request:**

```bash
curl -X POST "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene/id" \
  -H "Content-Type: application/json" \
  -d '{"gene_ids": ["672", "7157", "5594"]}'
```

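The same POST can be prepared from Python with the standard library. This is a sketch that only constructs the request object and does not send it; `build_batch_request` is a hypothetical name, the endpoint and body mirror the curl example above, and the `api-key` header follows the Authentication section.

```python
import json
import urllib.request

DATASETS_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha"


def build_batch_request(gene_ids, api_key=None):
    """Build (but do not send) a POST request for several Gene IDs."""
    body = json.dumps({"gene_ids": [str(g) for g in gene_ids]}).encode("utf-8")
    headers = {"Content-Type": "application/json", "Accept": "application/json"}
    if api_key:
        headers["api-key"] = api_key
    return urllib.request.Request(
        f"{DATASETS_BASE}/gene/id", data=body, headers=headers, method="POST"
    )


request = build_batch_request([672, 7157, 5594])
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the object to `urllib.request.urlopen(request)` would perform the call; keeping construction separate makes the request easy to inspect and test offline.
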
---

## Authentication and Rate Limits

### Obtaining an API Key

1. Create an NCBI account at https://www.ncbi.nlm.nih.gov/account/
2. Navigate to Settings → API Key Management
3. Generate a new API key
4. Include the key in requests (the `api_key` parameter for E-utilities, the `api-key` header for the Datasets API)

### Rate Limits

**E-utilities:**
- Without API key: 3 requests/second
- With API key: 10 requests/second

**Datasets API:**
- Without API key: 5 requests/second
- With API key: 10 requests/second

### Usage Guidelines

1. **Include email in requests:** Add `&email=your@email.com` to E-utilities requests
2. **Implement rate limiting:** Use delays between requests
3. **Use POST for large queries:** When working with many IDs
4. **Cache results:** Store frequently accessed data locally
5. **Handle errors gracefully:** Implement retry logic with exponential backoff

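Guideline 2 can be implemented with a small throttle that enforces a minimum interval between calls. This is one possible design, not the mechanism used by the bundled scripts: the `Throttle` class and its injectable `clock` and `sleep` arguments are hypothetical and exist mainly so the behaviour can be tested without real waiting.

```python
import time


class Throttle:
    """Block so that consecutive calls to wait() are at least 1/rate seconds apart."""

    def __init__(self, requests_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first one

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


# 3 requests/second without an API key, 10 with one (per the limits above)
throttle = Throttle(requests_per_second=3)
# Call throttle.wait() immediately before each request.
```

A fixed minimum interval is simpler than a token bucket and is enough when requests are issued sequentially from one process.
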
---

## Error Handling

### HTTP Status Codes

- `200 OK` - Successful request
- `400 Bad Request` - Invalid parameters or malformed query
- `404 Not Found` - Gene ID or symbol not found
- `429 Too Many Requests` - Rate limit exceeded
- `500 Internal Server Error` - Server error (retry with backoff)

### E-utilities Error Messages

E-utilities return errors in the response body:

**XML format:**
```xml
<ERROR>Empty id list - nothing to do</ERROR>
```

**JSON format:**
```json
{
  "error": "Invalid db name"
}
```

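Because an error can arrive inside an otherwise successful HTTP response, it is worth inspecting the decoded body before using it. The sketch below checks the top-level `error` key shown above; the additional check for an `ERROR` key nested under `esearchresult` is an assumption about how some ESearch failures are reported, and `find_api_error` is a hypothetical helper name.

```python
def find_api_error(payload):
    """Return an error message found in a decoded JSON payload, or None."""
    if not isinstance(payload, dict):
        return None
    if "error" in payload:
        return str(payload["error"])
    nested = payload.get("esearchresult", {})
    # Assumption: some ESearch errors are reported under esearchresult.ERROR
    if isinstance(nested, dict) and "ERROR" in nested:
        return str(nested["ERROR"])
    return None


print(find_api_error({"error": "Invalid db name"}))
print(find_api_error({"esearchresult": {"count": "1", "idlist": ["672"]}}))
```
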
### Common Errors

1. **Empty Result Set**
   - Cause: Gene symbol or ID not found
   - Solution: Verify spelling, check organism filter

2. **Rate Limit Exceeded**
   - Cause: Too many requests
   - Solution: Add delays, use API key

3. **Invalid Query Syntax**
   - Cause: Malformed search term
   - Solution: Use proper field tags (e.g., `[gene]`, `[organism]`)

4. **Timeout**
   - Cause: Large result set or slow connection
   - Solution: Use History Server, reduce result size

### Retry Strategy

Implement exponential backoff for failed requests:

```python
import time

def retry_request(func, max_attempts=3):
    """Call func(), retrying with exponential backoff; re-raise after the last attempt."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt < max_attempts - 1:
                wait_time = 2 ** attempt  # waits 1s, then 2s with the default max_attempts=3
                time.sleep(wait_time)
            else:
                raise
```

---

## Common Taxon IDs

| Organism | Scientific Name | Taxon ID |
|----------|----------------|----------|
| Human | Homo sapiens | 9606 |
| Mouse | Mus musculus | 10090 |
| Rat | Rattus norvegicus | 10116 |
| Zebrafish | Danio rerio | 7955 |
| Fruit fly | Drosophila melanogaster | 7227 |
| C. elegans | Caenorhabditis elegans | 6239 |
| Yeast | Saccharomyces cerevisiae | 4932 |
| Arabidopsis | Arabidopsis thaliana | 3702 |
| E. coli | Escherichia coli | 562 |

---

## Additional Resources

- **E-utilities Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25501/
- **Datasets API Documentation:** https://www.ncbi.nlm.nih.gov/datasets/docs/v2/
- **Gene Database Help:** https://www.ncbi.nlm.nih.gov/gene/
- **API Key Registration:** https://www.ncbi.nlm.nih.gov/account/

428
skills/gene-database/references/common_workflows.md
Normal file
@@ -0,0 +1,428 @@
# Common Gene Database Workflows

This document provides examples of common workflows and use cases for working with the NCBI Gene database.

## Table of Contents

1. [Disease Gene Discovery](#disease-gene-discovery)
2. [Gene Annotation Pipeline](#gene-annotation-pipeline)
3. [Cross-Species Gene Comparison](#cross-species-gene-comparison)
4. [Pathway Analysis](#pathway-analysis)
5. [Variant Analysis](#variant-analysis)
6. [Publication Mining](#publication-mining)

---

## Disease Gene Discovery

### Use Case

Identify genes associated with a specific disease or phenotype.

### Workflow

1. **Search by disease name**

```bash
# Find genes associated with Alzheimer's disease
python scripts/query_gene.py --search "Alzheimer disease[disease]" --organism human --max-results 50
```

2. **Filter by chromosome location**

```bash
# Find genes on chromosome 17 associated with breast cancer
python scripts/query_gene.py --search "breast cancer[disease] AND 17[chromosome]" --organism human
```

3. **Retrieve detailed information**

```python
# Python example: Get gene details for disease-associated genes
import json
from scripts.query_gene import esearch, esummary

# Search for genes
query = "diabetes[disease] AND human[organism]"
gene_ids = esearch(query, retmax=100, api_key="YOUR_KEY")

# Get summaries
summaries = esummary(gene_ids, api_key="YOUR_KEY")

# Extract relevant information
for gene_id in gene_ids:
    if gene_id in summaries['result']:
        gene = summaries['result'][gene_id]
        print(f"{gene['name']}: {gene['description']}")
```

### Expected Output

- List of genes with disease associations
- Gene symbols, descriptions, and chromosomal locations
- Related publications and clinical annotations

---

## Gene Annotation Pipeline

### Use Case

Annotate a list of gene identifiers with comprehensive metadata.

### Workflow

1. **Prepare gene list**

Create a file `genes.txt` with gene symbols (one per line):
```
BRCA1
TP53
EGFR
KRAS
```

2. **Batch lookup**

```bash
python scripts/batch_gene_lookup.py --file genes.txt --organism human --output annotations.json --api-key YOUR_KEY
```

3. **Parse results**

```python
import json

with open('annotations.json', 'r') as f:
    genes = json.load(f)

for gene in genes:
    if 'gene_id' in gene:
        print(f"Symbol: {gene['symbol']}")
        print(f"ID: {gene['gene_id']}")
        print(f"Description: {gene['description']}")
        print(f"Location: chr{gene['chromosome']}:{gene['map_location']}")
        print()
```

4. **Enrich with sequence data**

```bash
# Get detailed data including sequences for specific genes
python scripts/fetch_gene_data.py --gene-id 672 --verbose > BRCA1_detailed.json
```

### Use Cases

- Creating gene annotation tables for publications
- Validating gene lists before analysis
- Building gene reference databases
- Quality control for genomic pipelines

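For the annotation-table use case, the parsed records can be written out as CSV. This sketch assumes each record is a flat dict with the keys used in the parsing step above (`symbol`, `gene_id`, `description`, `chromosome`, `map_location`); the function name `annotations_to_csv` and the shape of the failed-lookup record in the sample are hypothetical.

```python
import csv
import io

DEFAULT_COLUMNS = ("symbol", "gene_id", "description", "chromosome", "map_location")


def annotations_to_csv(genes, columns=DEFAULT_COLUMNS):
    """Render gene annotation records as CSV text, skipping entries without a gene_id."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=columns, lineterminator="\n")
    writer.writeheader()
    for gene in genes:
        if "gene_id" not in gene:
            continue  # lookup failed for this entry
        writer.writerow({column: gene.get(column, "") for column in columns})
    return buffer.getvalue()


records = [
    {
        "symbol": "BRCA1",
        "gene_id": "672",
        "description": "BRCA1 DNA repair associated",
        "chromosome": "17",
        "map_location": "17q21.31",
    },
    {"symbol": "FAKE1", "error": "NOT_FOUND"},  # hypothetical failed lookup
]

print(annotations_to_csv(records))
```
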
---

## Cross-Species Gene Comparison

### Use Case

Find orthologs or compare the same gene across different species.

### Workflow

1. **Search for gene in multiple organisms**

Official symbols differ between species (human TP53, mouse Trp53, rat Tp53, zebrafish tp53), so use the species-specific symbol for each lookup.

```bash
# Find TP53 in human
python scripts/fetch_gene_data.py --symbol TP53 --taxon human

# Find the mouse ortholog (symbol Trp53)
python scripts/fetch_gene_data.py --symbol Trp53 --taxon mouse

# Find the zebrafish ortholog (symbol tp53)
python scripts/fetch_gene_data.py --symbol tp53 --taxon zebrafish
```

2. **Compare gene IDs across species**

```python
# Compare gene information across species.
# fetch_gene_by_symbol(symbol, taxon_id) is assumed to be provided by
# scripts/fetch_gene_data.py; adjust the import and call to the actual helper.
from scripts.fetch_gene_data import fetch_gene_by_symbol

# Each species carries its own taxon ID and official symbol
species = {
    'human': ('9606', 'TP53'),
    'mouse': ('10090', 'Trp53'),
    'rat': ('10116', 'Tp53')
}

for organism, (taxon_id, gene_symbol) in species.items():
    # Fetch gene data
    gene_data = fetch_gene_by_symbol(gene_symbol, taxon_id)
    print(f"{organism}: {gene_data}")
```

3. **Find orthologs using ELink**

```bash
# Get HomoloGene links for a gene
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=homologene&id=7157&retmode=json"
```

### Applications

- Evolutionary studies
- Model organism research
- Comparative genomics
- Cross-species experimental design

---

## Pathway Analysis

### Use Case

Identify genes involved in specific biological pathways or processes.

### Workflow

1. **Search by Gene Ontology (GO) term**

```bash
# Find genes involved in apoptosis
python scripts/query_gene.py --search "GO:0006915[biological process]" --organism human --max-results 100
```

2. **Search by pathway name**

```bash
# Find genes in insulin signaling pathway
python scripts/query_gene.py --search "insulin signaling pathway[pathway]" --organism human
```

3. **Get pathway-related genes**

```python
# Example: Get all genes in a specific pathway
import json
import urllib.parse
import urllib.request

# Search for pathway genes
query = "MAPK signaling pathway[pathway] AND human[organism]"
# URL-encode the parameters: raw spaces and brackets are not valid in a URL
params = urllib.parse.urlencode({"db": "gene", "term": query, "retmode": "json", "retmax": 200})
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode())
    gene_ids = data['esearchresult']['idlist']

# At most retmax (200) IDs are returned
print(f"Found {len(gene_ids)} genes in MAPK signaling pathway")
```

4. **Batch retrieve gene details**

```bash
# Get details for all pathway genes
python scripts/batch_gene_lookup.py --ids 5594,5595,5603,5604 --output mapk_genes.json
```

### Applications

- Pathway enrichment analysis
- Gene set analysis
- Systems biology studies
- Drug target identification

---

## Variant Analysis

### Use Case

Find genes with clinically relevant variants or disease-associated mutations.

### Workflow

1. **Search for genes with clinical variants**

```bash
# Find genes with pathogenic variants
python scripts/query_gene.py --search "pathogenic[clinical significance]" --organism human --max-results 50
```

2. **Link to ClinVar database**

```bash
# Get ClinVar records for a gene
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=clinvar&id=672&retmode=json"
```

3. **Search for pharmacogenomic genes**

```bash
# Find genes associated with drug response
python scripts/query_gene.py --search "pharmacogenomic[property]" --organism human
```

4. **Get variant summary data**

```python
# Example: Get genes with known variants
from scripts.query_gene import esearch, efetch

# Search for genes with variants
gene_ids = esearch("has variants[filter] AND human[organism]", retmax=100)

# Fetch detailed records
for gene_id in gene_ids[:10]:  # First 10
    data = efetch([gene_id], retmode='xml')
    # Parse XML for variant information
    print(f"Gene {gene_id} variant data...")
```

### Applications

- Clinical genetics
- Precision medicine
- Pharmacogenomics
- Genetic counseling

---

## Publication Mining

### Use Case

Find genes mentioned in recent publications or link genes to literature.

### Workflow

1. **Search genes mentioned in specific publications**

```bash
# Find genes mentioned in papers about CRISPR
python scripts/query_gene.py --search "CRISPR[text word]" --organism human --max-results 100
```

2. **Get PubMed articles for a gene**

```bash
# Get all publications for BRCA1
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id=672&retmode=json"
```

3. **Search by author or journal**

```bash
# Find genes studied by specific research group
python scripts/query_gene.py --search "Smith J[author] AND 2024[pdat]" --organism human
```

4. **Extract gene-publication relationships**

```python
# Example: Build gene-publication network
from scripts.query_gene import esearch, esummary
import urllib.request
import json

# Get gene
gene_id = '672'

# Get publications for gene
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id={gene_id}&retmode=json"

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode())

# Extract PMIDs
pmids = []
for linkset in data.get('linksets', []):
    for linksetdb in linkset.get('linksetdbs', []):
        pmids.extend(linksetdb.get('links', []))

print(f"Gene {gene_id} has {len(pmids)} publications")
```

### Applications

- Literature reviews
- Grant writing
- Knowledge base construction
- Trend analysis in genomics research

---

## Advanced Patterns

### Combining Multiple Searches

```python
# Example: Find genes at intersection of multiple criteria
from scripts.query_gene import esearch

def find_genes_multi_criteria(organism='human'):
    # Each search only returns up to its retmax IDs, so the intersection
    # is limited to what the individual searches returned.

    # Criteria 1: Disease association
    disease_genes = set(esearch(f"diabetes[disease] AND {organism}[organism]"))

    # Criteria 2: Chromosome location
    chr_genes = set(esearch(f"11[chromosome] AND {organism}[organism]"))

    # Criteria 3: Gene type
    coding_genes = set(esearch(f"protein coding[gene type] AND {organism}[organism]"))

    # Intersection
    candidates = disease_genes & chr_genes & coding_genes

    return list(candidates)
```

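An alternative to intersecting result sets on the client is to let the server do it by joining the criteria with `AND` in a single term, which avoids missing genes that fall outside each search's result cap. The sketch below only builds the query string; `combine_criteria` is a hypothetical helper, and the field tags are the ones used in the example above.

```python
def combine_criteria(*criteria, organism=None):
    """Join individual Entrez criteria into one AND-ed query term."""
    parts = [f"({criterion})" for criterion in criteria if criterion]
    if organism:
        parts.append(f"{organism}[organism]")
    return " AND ".join(parts)


term = combine_criteria(
    "diabetes[disease]",
    "11[chromosome]",
    "protein coding[gene type]",
    organism="human",
)
print(term)
```

The resulting term can be passed to a single ESearch call in place of three separate searches.
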
### Rate-Limited Batch Processing

```python
import time
from scripts.query_gene import esummary

def process_genes_with_rate_limit(gene_ids, batch_size=200, delay=0.1):
    results = []

    for i in range(0, len(gene_ids), batch_size):
        batch = gene_ids[i:i + batch_size]

        # Process batch
        batch_results = esummary(batch)
        results.append(batch_results)

        # Rate limit (0.1s suits 10 req/sec with an API key; use 0.34s without)
        time.sleep(delay)

    return results
```

### Error Handling and Retry

```python
import time
# fetch_gene_by_id is assumed to be provided by scripts/fetch_gene_data.py;
# adjust the import to match the actual helper.
from scripts.fetch_gene_data import fetch_gene_by_id

def robust_gene_fetch(gene_id, max_retries=3):
    for attempt in range(max_retries):
        try:
            data = fetch_gene_by_id(gene_id)
            return data
        except Exception as e:
            if attempt < max_retries - 1:
                wait = 2 ** attempt  # Exponential backoff
                time.sleep(wait)
            else:
                # Out of retries: report the failure and return None
                print(f"Failed to fetch gene {gene_id}: {e}")
                return None
```

---

## Tips and Best Practices
|
||||
|
||||
1. **Start Specific, Then Broaden**: Begin with precise queries and expand if needed
|
||||
2. **Use Organism Filters**: Always specify organism for gene symbol searches
|
||||
3. **Validate Results**: Check gene IDs and symbols for accuracy
|
||||
4. **Cache Frequently Used Data**: Store common queries locally
|
||||
5. **Monitor Rate Limits**: Use API keys and implement delays
|
||||
6. **Combine APIs**: Use E-utilities for search, Datasets API for detailed data
|
||||
7. **Handle Ambiguity**: Gene symbols may refer to different genes in different species
|
||||
8. **Check Data Currency**: Gene annotations are updated regularly
|
||||
9. **Use Batch Operations**: Process multiple genes together when possible
|
||||
10. **Document Your Queries**: Keep records of search terms and parameters
|
||||
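
Tips 4, 8, and 10 can be combined in a small local cache that stores each query string with the time it was fetched and refetches once an entry is older than a chosen age. This is only a sketch under stated assumptions: `QueryCache`, its JSON file layout, and the one-week default age are hypothetical choices, not something the skill's scripts provide.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable, Dict


class QueryCache:
    """JSON-file cache keyed by the exact query string (hypothetical helper)."""

    def __init__(self, path: str, max_age_seconds: float = 7 * 24 * 3600) -> None:
        self.path = Path(path)
        self.max_age = max_age_seconds  # assumed default: refresh entries after one week
        self._entries: Dict[str, Dict[str, Any]] = {}
        if self.path.exists():
            self._entries = json.loads(self.path.read_text())

    def get_or_fetch(self, query: str, fetch: Callable[[str], Any]) -> Any:
        entry = self._entries.get(query)
        if entry is not None and time.time() - entry["fetched_at"] < self.max_age:
            return entry["result"]  # fresh enough, no network call
        result = fetch(query)  # result must be JSON-serializable
        # Storing the query text and timestamp doubles as a record of what was searched
        self._entries[query] = {"fetched_at": time.time(), "result": result}
        self.path.write_text(json.dumps(self._entries, indent=2))
        return result
```

For example, `QueryCache("gene_cache.json").get_or_fetch("BRCA1[gene] AND human[organism]", esearch)` would call the `esearch` function from `scripts/query_gene.py` only when no fresh entry exists for that query.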
298
skills/gene-database/scripts/batch_gene_lookup.py
Normal file
@@ -0,0 +1,298 @@
#!/usr/bin/env python3
"""
Batch gene lookup using NCBI APIs.

This script efficiently processes multiple gene queries with proper
rate limiting and error handling.
"""

import argparse
import json
import sys
import time
import urllib.parse
import urllib.request
from typing import Optional, List, Dict, Any


def read_gene_list(filepath: str) -> List[str]:
    """
    Read gene identifiers from a file (one per line).

    Args:
        filepath: Path to file containing gene symbols or IDs

    Returns:
        List of gene identifiers
    """
    try:
        with open(filepath, 'r') as f:
            genes = [line.strip() for line in f if line.strip()]
        return genes
    except FileNotFoundError:
        print(f"Error: File '{filepath}' not found", file=sys.stderr)
        sys.exit(1)
    except Exception as e:
        print(f"Error reading file: {e}", file=sys.stderr)
        sys.exit(1)


def batch_esearch(queries: List[str], organism: Optional[str] = None,
                  api_key: Optional[str] = None) -> Dict[str, str]:
    """
    Search for multiple gene symbols and return their IDs.

    Args:
        queries: List of gene symbols
        organism: Optional organism filter
        api_key: Optional NCBI API key

    Returns:
        Dictionary mapping gene symbol to Gene ID, 'NOT_FOUND' when the
        search returns no hits, or 'ERROR' when the request fails
    """
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    results = {}

    # Rate limiting
    delay = 0.1 if api_key else 0.34  # 10 req/sec with key, 3 req/sec without

    for query in queries:
        # Build search term
        search_term = f"{query}[gene]"
        if organism:
            search_term += f" AND {organism}[organism]"

        params = {
            'db': 'gene',
            'term': search_term,
            'retmax': 1,  # only the top hit is kept for each symbol
            'retmode': 'json'
        }

        if api_key:
            params['api_key'] = api_key

        url = f"{base_url}esearch.fcgi?{urllib.parse.urlencode(params)}"

        try:
            with urllib.request.urlopen(url) as response:
                data = json.loads(response.read().decode())

                if 'esearchresult' in data and 'idlist' in data['esearchresult']:
                    id_list = data['esearchresult']['idlist']
                    results[query] = id_list[0] if id_list else 'NOT_FOUND'
                else:
                    results[query] = 'ERROR'

        except Exception as e:
            print(f"Error searching for {query}: {e}", file=sys.stderr)
            results[query] = 'ERROR'

        time.sleep(delay)

    return results


def batch_esummary(gene_ids: List[str], api_key: Optional[str] = None,
                   chunk_size: int = 200) -> Dict[str, Dict[str, Any]]:
    """
    Get summaries for multiple genes in batches.

    Args:
        gene_ids: List of Gene IDs
        api_key: Optional NCBI API key
        chunk_size: Number of IDs per request (max 500)

    Returns:
        Dictionary mapping Gene ID to summary data
    """
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
    all_results = {}

    # Rate limiting
    delay = 0.1 if api_key else 0.34

    # Process in chunks
    for i in range(0, len(gene_ids), chunk_size):
        chunk = gene_ids[i:i + chunk_size]

        params = {
            'db': 'gene',
            'id': ','.join(chunk),
            'retmode': 'json'
        }

        if api_key:
            params['api_key'] = api_key

        url = f"{base_url}esummary.fcgi?{urllib.parse.urlencode(params)}"

        try:
            with urllib.request.urlopen(url) as response:
                data = json.loads(response.read().decode())

                if 'result' in data:
                    for gene_id in chunk:
                        if gene_id in data['result']:
                            all_results[gene_id] = data['result'][gene_id]

        except Exception as e:
            print(f"Error fetching summaries for chunk: {e}", file=sys.stderr)

        time.sleep(delay)

    return all_results


def batch_lookup_by_ids(gene_ids: List[str], api_key: Optional[str] = None) -> List[Dict[str, Any]]:
    """
    Lookup genes by IDs and return structured data.

    Args:
        gene_ids: List of Gene IDs
        api_key: Optional NCBI API key

    Returns:
        List of gene information dictionaries
    """
    summaries = batch_esummary(gene_ids, api_key=api_key)

    results = []
    for gene_id in gene_ids:
        if gene_id in summaries:
            gene = summaries[gene_id]
            results.append({
                'gene_id': gene_id,
                'symbol': gene.get('name', 'N/A'),
                'description': gene.get('description', 'N/A'),
                'organism': gene.get('organism', {}).get('scientificname', 'N/A'),
                'chromosome': gene.get('chromosome', 'N/A'),
                'map_location': gene.get('maplocation', 'N/A'),
                'type': gene.get('geneticsource', 'N/A')
            })
        else:
            results.append({
                'gene_id': gene_id,
                'error': 'Not found or error fetching'
            })

    return results


def batch_lookup_by_symbols(gene_symbols: List[str], organism: str,
                            api_key: Optional[str] = None) -> List[Dict[str, Any]]:
    """
    Lookup genes by symbols and return structured data.

    Args:
        gene_symbols: List of gene symbols
        organism: Organism name
        api_key: Optional NCBI API key

    Returns:
        List of gene information dictionaries
    """
    # First, search for IDs
    print(f"Searching for {len(gene_symbols)} gene symbols...", file=sys.stderr)
    symbol_to_id = batch_esearch(gene_symbols, organism=organism, api_key=api_key)

    # Filter to valid IDs
    valid_ids = [gene_id for gene_id in symbol_to_id.values()
                 if gene_id not in ['NOT_FOUND', 'ERROR']]

    if not valid_ids:
        print("No genes found", file=sys.stderr)
        return []

    print(f"Found {len(valid_ids)} genes, fetching details...", file=sys.stderr)

    # Fetch summaries
    summaries = batch_esummary(valid_ids, api_key=api_key)

    # Build results
    results = []
    for symbol, gene_id in symbol_to_id.items():
        if gene_id == 'NOT_FOUND':
            results.append({
                'query_symbol': symbol,
                'status': 'not_found'
            })
        elif gene_id == 'ERROR':
            results.append({
                'query_symbol': symbol,
                'status': 'error'
            })
        elif gene_id in summaries:
            gene = summaries[gene_id]
            results.append({
                'query_symbol': symbol,
                'gene_id': gene_id,
                'symbol': gene.get('name', 'N/A'),
                'description': gene.get('description', 'N/A'),
                'organism': gene.get('organism', {}).get('scientificname', 'N/A'),
                'chromosome': gene.get('chromosome', 'N/A'),
                'map_location': gene.get('maplocation', 'N/A'),
                'type': gene.get('geneticsource', 'N/A')
            })
        else:
            # ESearch returned an ID but no summary came back for it;
            # report it instead of silently dropping the symbol
            results.append({
                'query_symbol': symbol,
                'gene_id': gene_id,
                'status': 'error'
            })

    return results


def main():
    parser = argparse.ArgumentParser(
        description='Batch gene lookup using NCBI APIs',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Lookup by gene IDs
  %(prog)s --ids 672,7157,5594

  # Lookup by symbols from a file
  %(prog)s --file genes.txt --organism human

  # Lookup with API key and save to file
  %(prog)s --ids 672,7157,5594 --api-key YOUR_KEY --output results.json
        """
    )

    parser.add_argument('--ids', '-i', help='Comma-separated Gene IDs')
    parser.add_argument('--file', '-f', help='File containing gene symbols (one per line)')
    parser.add_argument('--organism', '-o', help='Organism name (required with --file)')
    parser.add_argument('--output', '-O', help='Output file path (JSON format)')
    parser.add_argument('--api-key', '-k', help='NCBI API key')
    parser.add_argument('--pretty', '-p', action='store_true',
                        help='Pretty-print JSON output')

    args = parser.parse_args()

    if not args.ids and not args.file:
        parser.error("Either --ids or --file must be provided")

    if args.file and not args.organism:
        parser.error("--organism is required when using --file")

    # Process genes
    if args.ids:
        gene_ids = [id.strip() for id in args.ids.split(',')]
        results = batch_lookup_by_ids(gene_ids, api_key=args.api_key)
    else:
        gene_symbols = read_gene_list(args.file)
        results = batch_lookup_by_symbols(gene_symbols, args.organism, api_key=args.api_key)

    # Output results
    indent = 2 if args.pretty else None
    json_output = json.dumps(results, indent=indent)

    if args.output:
        try:
            with open(args.output, 'w') as f:
                f.write(json_output)
            print(f"Results written to {args.output}", file=sys.stderr)
        except Exception as e:
            print(f"Error writing output file: {e}", file=sys.stderr)
            sys.exit(1)
    else:
        print(json_output)


if __name__ == '__main__':
    main()
277
skills/gene-database/scripts/fetch_gene_data.py
Normal file
@@ -0,0 +1,277 @@
#!/usr/bin/env python3
"""
Fetch gene data from NCBI using the Datasets API.

This script provides access to the NCBI Datasets API for retrieving
comprehensive gene information including metadata and sequences.
"""

import argparse
import json
import sys
import urllib.error  # imported explicitly because HTTPError is caught below
import urllib.parse
import urllib.request
from typing import Optional, Dict, Any, List


DATASETS_API_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene"


def get_taxon_id(taxon_name: str) -> Optional[str]:
    """
    Convert taxon name to NCBI taxon ID.

    Args:
        taxon_name: Common or scientific name (e.g., "human", "Homo sapiens")

    Returns:
        Taxon ID as string, or None if not found
    """
    # Common mappings
    common_taxa = {
        'human': '9606',
        'homo sapiens': '9606',
        'mouse': '10090',
        'mus musculus': '10090',
        'rat': '10116',
        'rattus norvegicus': '10116',
        'zebrafish': '7955',
        'danio rerio': '7955',
        'fruit fly': '7227',
        'drosophila melanogaster': '7227',
        'c. elegans': '6239',
        'caenorhabditis elegans': '6239',
        'yeast': '4932',
        'saccharomyces cerevisiae': '4932',
        'arabidopsis': '3702',
        'arabidopsis thaliana': '3702',
        'e. coli': '562',
        'escherichia coli': '562',
    }

    taxon_lower = taxon_name.lower().strip()
    return common_taxa.get(taxon_lower)


def fetch_gene_by_id(gene_id: str, api_key: Optional[str] = None) -> Dict[str, Any]:
    """
    Fetch gene data by Gene ID.

    Args:
        gene_id: NCBI Gene ID
        api_key: Optional NCBI API key

    Returns:
        Gene data as dictionary
    """
    url = f"{DATASETS_API_BASE}/id/{gene_id}"

    headers = {}
    if api_key:
        headers['api-key'] = api_key

    try:
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as response:
            return json.loads(response.read().decode())
    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code}: {e.reason}", file=sys.stderr)
        if e.code == 404:
            print(f"Gene ID {gene_id} not found", file=sys.stderr)
        return {}
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return {}


def fetch_gene_by_symbol(symbol: str, taxon: str, api_key: Optional[str] = None) -> Dict[str, Any]:
    """
    Fetch gene data by gene symbol and taxon.

    Args:
        symbol: Gene symbol (e.g., "BRCA1")
        taxon: Organism name or taxon ID
        api_key: Optional NCBI API key

    Returns:
        Gene data as dictionary
    """
    # Convert taxon name to ID if needed
    taxon_id = get_taxon_id(taxon)
    if not taxon_id:
        # Try to use as-is (might already be a taxon ID)
        taxon_id = taxon

    # Percent-encode both path segments so names containing spaces or
    # other reserved characters still produce a valid URL
    url = (f"{DATASETS_API_BASE}/symbol/{urllib.parse.quote(symbol, safe='')}"
           f"/taxon/{urllib.parse.quote(str(taxon_id), safe='')}")

    headers = {}
    if api_key:
        headers['api-key'] = api_key

    try:
        req = urllib.request.Request(url, headers=headers)
        with urllib.request.urlopen(req) as response:
            return json.loads(response.read().decode())
    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code}: {e.reason}", file=sys.stderr)
        if e.code == 404:
            print(f"Gene symbol '{symbol}' not found for taxon {taxon}", file=sys.stderr)
        return {}
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return {}


def fetch_multiple_genes(gene_ids: List[str], api_key: Optional[str] = None) -> Dict[str, Any]:
    """
    Fetch data for multiple genes by ID.

    Args:
        gene_ids: List of Gene IDs
        api_key: Optional NCBI API key

    Returns:
        Combined gene data as dictionary
    """
    # For multiple genes, use POST request
    url = f"{DATASETS_API_BASE}/id"

    data = json.dumps({"gene_ids": gene_ids}).encode('utf-8')
    headers = {'Content-Type': 'application/json'}

    if api_key:
        headers['api-key'] = api_key

    try:
        req = urllib.request.Request(url, data=data, headers=headers, method='POST')
        with urllib.request.urlopen(req) as response:
            return json.loads(response.read().decode())
    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code}: {e.reason}", file=sys.stderr)
        return {}
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return {}


def display_gene_info(data: Dict[str, Any], verbose: bool = False) -> None:
    """
    Display gene information in human-readable format.

    Args:
        data: Gene data dictionary from API
        verbose: Show detailed information
    """
    if 'genes' not in data:
        print("No gene data found in response")
        return

    for gene in data['genes']:
        gene_info = gene.get('gene', {})

        print(f"Gene ID: {gene_info.get('gene_id', 'N/A')}")
        print(f"Symbol: {gene_info.get('symbol', 'N/A')}")
        print(f"Description: {gene_info.get('description', 'N/A')}")

        if 'tax_name' in gene_info:
            print(f"Organism: {gene_info['tax_name']}")

        if 'chromosomes' in gene_info:
            chromosomes = ', '.join(gene_info['chromosomes'])
            print(f"Chromosome(s): {chromosomes}")

        # Nomenclature
        if 'nomenclature_authority' in gene_info:
            auth = gene_info['nomenclature_authority']
            print(f"Nomenclature: {auth.get('authority', 'N/A')}")

        # Synonyms
        if 'synonyms' in gene_info and gene_info['synonyms']:
            print(f"Synonyms: {', '.join(gene_info['synonyms'])}")

        if verbose:
            # Gene type
            if 'type' in gene_info:
                print(f"Type: {gene_info['type']}")

            # Genomic locations
            if 'genomic_ranges' in gene_info:
                print("\nGenomic Locations:")
                for range_info in gene_info['genomic_ranges']:
                    accession = range_info.get('accession_version', 'N/A')
                    start = range_info.get('range', [{}])[0].get('begin', 'N/A')
                    end = range_info.get('range', [{}])[0].get('end', 'N/A')
                    strand = range_info.get('orientation', 'N/A')
                    print(f"  {accession}: {start}-{end} ({strand})")

            # Transcripts
            if 'transcripts' in gene_info:
                print(f"\nTranscripts: {len(gene_info['transcripts'])}")
                for transcript in gene_info['transcripts'][:5]:  # Show first 5
                    print(f"  {transcript.get('accession_version', 'N/A')}")

        print()


def main():
    parser = argparse.ArgumentParser(
        description='Fetch gene data from NCBI Datasets API',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Fetch by Gene ID
  %(prog)s --gene-id 672

  # Fetch by gene symbol and organism
  %(prog)s --symbol BRCA1 --taxon human

  # Fetch multiple genes
  %(prog)s --gene-id 672,7157,5594

  # Get JSON output
  %(prog)s --symbol TP53 --taxon "Homo sapiens" --output json

  # Verbose output with details
  %(prog)s --gene-id 672 --verbose
        """
    )

    parser.add_argument('--gene-id', '-g', help='Gene ID(s), comma-separated')
    parser.add_argument('--symbol', '-s', help='Gene symbol')
    parser.add_argument('--taxon', '-t', help='Organism name or taxon ID (required with --symbol)')
    parser.add_argument('--output', '-o', choices=['pretty', 'json'], default='pretty',
                        help='Output format (default: pretty)')
    parser.add_argument('--verbose', '-v', action='store_true',
                        help='Show detailed information')
    parser.add_argument('--api-key', '-k', help='NCBI API key')

    args = parser.parse_args()

    if not args.gene_id and not args.symbol:
        parser.error("Either --gene-id or --symbol must be provided")

    if args.symbol and not args.taxon:
        parser.error("--taxon is required when using --symbol")

    # Fetch data
    if args.gene_id:
        gene_ids = [id.strip() for id in args.gene_id.split(',')]
        if len(gene_ids) == 1:
            data = fetch_gene_by_id(gene_ids[0], api_key=args.api_key)
        else:
            data = fetch_multiple_genes(gene_ids, api_key=args.api_key)
    else:
        data = fetch_gene_by_symbol(args.symbol, args.taxon, api_key=args.api_key)

    if not data:
        sys.exit(1)

    # Output
    if args.output == 'json':
        print(json.dumps(data, indent=2))
    else:
        display_gene_info(data, verbose=args.verbose)


if __name__ == '__main__':
    main()
251
skills/gene-database/scripts/query_gene.py
Normal file
@@ -0,0 +1,251 @@
#!/usr/bin/env python3
"""
Query NCBI Gene database using E-utilities.

This script provides access to ESearch, ESummary, and EFetch functions
for searching and retrieving gene information.
"""

import argparse
import json
import sys
import time
import urllib.error  # imported explicitly because HTTPError is caught below
import urllib.parse
import urllib.request
from typing import Optional, Dict, List, Any
from xml.etree import ElementTree as ET


BASE_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"
DB = "gene"


def esearch(query: str, retmax: int = 20, api_key: Optional[str] = None) -> List[str]:
    """
    Search NCBI Gene database and return list of Gene IDs.

    Args:
        query: Search query (e.g., "BRCA1[gene] AND human[organism]")
        retmax: Maximum number of results to return
        api_key: Optional NCBI API key for higher rate limits

    Returns:
        List of Gene IDs as strings
    """
    params = {
        'db': DB,
        'term': query,
        'retmax': retmax,
        'retmode': 'json'
    }

    if api_key:
        params['api_key'] = api_key

    url = f"{BASE_URL}esearch.fcgi?{urllib.parse.urlencode(params)}"

    try:
        with urllib.request.urlopen(url) as response:
            data = json.loads(response.read().decode())

            if 'esearchresult' in data and 'idlist' in data['esearchresult']:
                return data['esearchresult']['idlist']
            else:
                print("Error: Unexpected response format", file=sys.stderr)
                return []

    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code}: {e.reason}", file=sys.stderr)
        return []
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return []


def esummary(gene_ids: List[str], api_key: Optional[str] = None) -> Dict[str, Any]:
    """
    Get document summaries for Gene IDs.

    Args:
        gene_ids: List of Gene IDs
        api_key: Optional NCBI API key

    Returns:
        Dictionary of gene summaries
    """
    params = {
        'db': DB,
        'id': ','.join(gene_ids),
        'retmode': 'json'
    }

    if api_key:
        params['api_key'] = api_key

    url = f"{BASE_URL}esummary.fcgi?{urllib.parse.urlencode(params)}"

    try:
        with urllib.request.urlopen(url) as response:
            data = json.loads(response.read().decode())
            return data
    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code}: {e.reason}", file=sys.stderr)
        return {}
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return {}


def efetch(gene_ids: List[str], retmode: str = 'xml', api_key: Optional[str] = None) -> str:
    """
    Fetch full gene records.

    Args:
        gene_ids: List of Gene IDs
        retmode: Return format ('xml', 'text', 'asn.1')
        api_key: Optional NCBI API key

    Returns:
        Gene records as string in requested format
    """
    params = {
        'db': DB,
        'id': ','.join(gene_ids),
        'retmode': retmode
    }

    if api_key:
        params['api_key'] = api_key

    url = f"{BASE_URL}efetch.fcgi?{urllib.parse.urlencode(params)}"

    try:
        with urllib.request.urlopen(url) as response:
            return response.read().decode()
    except urllib.error.HTTPError as e:
        print(f"HTTP Error {e.code}: {e.reason}", file=sys.stderr)
        return ""
    except Exception as e:
        print(f"Error: {e}", file=sys.stderr)
        return ""


def search_and_summarize(query: str, organism: Optional[str] = None,
                         max_results: int = 20, api_key: Optional[str] = None) -> None:
    """
    Search for genes and display summaries.

    Args:
        query: Gene search query
        organism: Optional organism filter
        max_results: Maximum number of results
        api_key: Optional NCBI API key
    """
    # Add organism filter if provided
    if organism:
        if '[organism]' not in query.lower():
            query = f"{query} AND {organism}[organism]"

    print(f"Searching for: {query}")
    print("-" * 80)

    # Search for gene IDs
    gene_ids = esearch(query, retmax=max_results, api_key=api_key)

    if not gene_ids:
        print("No results found.")
        return

    print(f"Found {len(gene_ids)} gene(s)")
    print()

    # Get summaries
    summaries = esummary(gene_ids, api_key=api_key)

    if 'result' in summaries:
        for gene_id in gene_ids:
            if gene_id in summaries['result']:
                gene = summaries['result'][gene_id]
                print(f"Gene ID: {gene_id}")
                print(f"  Symbol: {gene.get('name', 'N/A')}")
                print(f"  Description: {gene.get('description', 'N/A')}")
                print(f"  Organism: {gene.get('organism', {}).get('scientificname', 'N/A')}")
                print(f"  Chromosome: {gene.get('chromosome', 'N/A')}")
                print(f"  Map Location: {gene.get('maplocation', 'N/A')}")
                print(f"  Type: {gene.get('geneticsource', 'N/A')}")
                print()

    # Respect rate limits
    time.sleep(0.34)  # ~3 requests per second


def fetch_by_id(gene_ids: List[str], output_format: str = 'json',
                api_key: Optional[str] = None) -> None:
    """
    Fetch and display gene information by ID.

    Args:
        gene_ids: List of Gene IDs
        output_format: Output format ('json', 'xml', 'text')
        api_key: Optional NCBI API key
    """
    if output_format == 'json':
        # Get summaries in JSON format
        summaries = esummary(gene_ids, api_key=api_key)
        print(json.dumps(summaries, indent=2))
    else:
        # Fetch full records
        data = efetch(gene_ids, retmode=output_format, api_key=api_key)
        print(data)

    # Respect rate limits
    time.sleep(0.34)


def main():
    parser = argparse.ArgumentParser(
        description='Query NCBI Gene database using E-utilities',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Search for gene by symbol
  %(prog)s --search "BRCA1" --organism "human"

  # Fetch gene by ID
  %(prog)s --id 672 --format json

  # Complex search query
  %(prog)s --search "insulin[gene] AND diabetes[disease]"

  # Multiple gene IDs
  %(prog)s --id 672,7157,5594
        """
    )

    parser.add_argument('--search', '-s', help='Search query')
    parser.add_argument('--organism', '-o', help='Organism filter')
    parser.add_argument('--id', '-i', help='Gene ID(s), comma-separated')
    parser.add_argument('--format', '-f', default='json',
                        choices=['json', 'xml', 'text'],
                        help='Output format (default: json)')
    parser.add_argument('--max-results', '-m', type=int, default=20,
                        help='Maximum number of search results (default: 20)')
    parser.add_argument('--api-key', '-k', help='NCBI API key for higher rate limits')

    args = parser.parse_args()

    if not args.search and not args.id:
        parser.error("Either --search or --id must be provided")

    if args.id:
        # Fetch by ID
        gene_ids = [id.strip() for id in args.id.split(',')]
        fetch_by_id(gene_ids, output_format=args.format, api_key=args.api_key)
    else:
        # Search and summarize
        search_and_summarize(args.search, organism=args.organism,
                             max_results=args.max_results, api_key=args.api_key)


if __name__ == '__main__':
    main()