# Common Gene Database Workflows

This document provides examples of common workflows and use cases for working with the NCBI Gene database.

## Table of Contents

1. [Disease Gene Discovery](#disease-gene-discovery)
2. [Gene Annotation Pipeline](#gene-annotation-pipeline)
3. [Cross-Species Gene Comparison](#cross-species-gene-comparison)
4. [Pathway Analysis](#pathway-analysis)
5. [Variant Analysis](#variant-analysis)
6. [Publication Mining](#publication-mining)

---

## Disease Gene Discovery

### Use Case

Identify genes associated with a specific disease or phenotype.

### Workflow

1. **Search by disease name**

```bash
# Find genes associated with Alzheimer's disease
python scripts/query_gene.py --search "Alzheimer disease[disease]" --organism human --max-results 50
```

2. **Filter by chromosome location**

```bash
# Find genes on chromosome 17 associated with breast cancer
python scripts/query_gene.py --search "breast cancer[disease] AND 17[chromosome]" --organism human
```

3. **Retrieve detailed information**

```python
# Python example: Get gene details for disease-associated genes
import json
from scripts.query_gene import esearch, esummary

# Search for genes
query = "diabetes[disease] AND human[organism]"
gene_ids = esearch(query, retmax=100, api_key="YOUR_KEY")

# Get summaries
summaries = esummary(gene_ids, api_key="YOUR_KEY")

# Extract relevant information
for gene_id in gene_ids:
    if gene_id in summaries['result']:
        gene = summaries['result'][gene_id]
        print(f"{gene['name']}: {gene['description']}")
```

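The extraction loop in step 3 can be factored into a small helper that turns an ESummary payload into rows. This is a minimal sketch, assuming the JSON layout that E-utilities returns for `db=gene` (a `result` object keyed by gene ID, plus a `uids` index list). The helper name `summaries_to_rows` is hypothetical and not part of `scripts/query_gene.py`, and the `chromosome` and `maplocation` keys are read defensively in case a record lacks them.

```python
def summaries_to_rows(summaries, gene_ids):
    """Return one dict per gene ID found in an ESummary-style payload."""
    result = summaries.get("result", {})
    rows = []
    for gene_id in gene_ids:
        gene = result.get(str(gene_id))
        if not isinstance(gene, dict):
            # IDs missing from the payload are skipped rather than raising.
            continue
        rows.append({
            "gene_id": str(gene_id),
            "symbol": gene.get("name", ""),
            "description": gene.get("description", ""),
            "chromosome": gene.get("chromosome", ""),
            "map_location": gene.get("maplocation", ""),
        })
    return rows


# Hand-written sample payload (illustrative, not captured API output).
sample = {
    "result": {
        "uids": ["672"],
        "672": {
            "name": "BRCA1",
            "description": "BRCA1 DNA repair associated",
            "chromosome": "17",
            "maplocation": "17q21.31",
        },
    }
}

for row in summaries_to_rows(sample, ["672", "0"]):
    print(f"{row['symbol']} (chr{row['chromosome']}): {row['description']}")
```

Returning plain dicts keeps the rows easy to pass to `json.dump` or `csv.DictWriter` later in a pipeline.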
### Expected Output

- List of genes with disease associations
- Gene symbols, descriptions, and chromosomal locations
- Related publications and clinical annotations

---

## Gene Annotation Pipeline

### Use Case

Annotate a list of gene identifiers with comprehensive metadata.

### Workflow

1. **Prepare gene list**

Create a file `genes.txt` with gene symbols (one per line):
```
BRCA1
TP53
EGFR
KRAS
```

2. **Batch lookup**

```bash
python scripts/batch_gene_lookup.py --file genes.txt --organism human --output annotations.json --api-key YOUR_KEY
```

3. **Parse results**

```python
import json

with open('annotations.json', 'r') as f:
    genes = json.load(f)

for gene in genes:
    if 'gene_id' in gene:
        print(f"Symbol: {gene['symbol']}")
        print(f"ID: {gene['gene_id']}")
        print(f"Description: {gene['description']}")
        print(f"Location: chr{gene['chromosome']}:{gene['map_location']}")
        print()
```

4. **Enrich with sequence data**

```bash
# Get detailed data including sequences for specific genes
python scripts/fetch_gene_data.py --gene-id 672 --verbose > BRCA1_detailed.json
```

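For annotation tables, the parsed records from step 3 can be written out as tab-separated text. The sketch below assumes each record in `annotations.json` carries the same keys that the parsing example reads (`symbol`, `gene_id`, `description`, `chromosome`, `map_location`) and that failed lookups lack `gene_id`. The function name `annotations_to_tsv` is hypothetical.

```python
import csv
import io

FIELDS = ["symbol", "gene_id", "description", "chromosome", "map_location"]


def annotations_to_tsv(genes, out):
    """Write annotated genes to `out` as TSV; return the number of rows written."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    written = 0
    for gene in genes:
        if "gene_id" not in gene:
            # Skip entries that the batch lookup could not resolve.
            continue
        writer.writerow({field: gene.get(field, "") for field in FIELDS})
        written += 1
    return written


# Illustrative records: one resolved gene and one failed lookup.
records = [
    {"symbol": "BRCA1", "gene_id": "672",
     "description": "BRCA1 DNA repair associated",
     "chromosome": "17", "map_location": "17q21.31"},
    {"symbol": "NOT_A_GENE", "error": "not found"},
]
buffer = io.StringIO()
count = annotations_to_tsv(records, buffer)
print(f"Wrote {count} row(s)")
print(buffer.getvalue())
```

Passing an open file handle instead of `io.StringIO()` writes the table to disk.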
### Applications

- Creating gene annotation tables for publications
- Validating gene lists before analysis
- Building gene reference databases
- Quality control for genomic pipelines

---

## Cross-Species Gene Comparison

### Use Case

Find orthologs or compare the same gene across different species.

### Workflow

1. **Search for gene in multiple organisms**

```bash
# Find TP53 in human
python scripts/fetch_gene_data.py --symbol TP53 --taxon human

# Find TP53 in mouse (the official mouse symbol is Trp53; use it if the lookup is symbol-sensitive)
python scripts/fetch_gene_data.py --symbol TP53 --taxon mouse

# Find TP53 in zebrafish (the official zebrafish symbol is tp53)
python scripts/fetch_gene_data.py --symbol TP53 --taxon zebrafish
```

2. **Compare gene IDs across species**

```python
# Compare gene information across species
# Assumption: fetch_gene_by_symbol lives in scripts/fetch_gene_data.py and
# accepts a symbol and a taxon ID; adjust the import and call to match the script.
from scripts.fetch_gene_data import fetch_gene_by_symbol

species = {
    'human': '9606',
    'mouse': '10090',
    'rat': '10116'
}

gene_symbol = 'TP53'

for organism, taxon_id in species.items():
    # Fetch gene data
    gene_data = fetch_gene_by_symbol(gene_symbol, taxon_id)
    print(f"{organism}: {gene_data}")
```

3. **Find orthologs using ELink**

```bash
# Get HomoloGene links for a gene
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=homologene&id=7157&retmode=json"
```

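Because official symbols differ between species (TP53 in human, Trp53 in mouse, Tp53 in rat), it can help to build one search term per species and restrict each by taxonomy ID. The sketch below only constructs the query strings. The function names are hypothetical, and the `[Gene Name]` and `txid...[Organism]` field syntax is an assumption that should be checked against the Gene search-field help before use.

```python
# Taxonomy IDs from the comparison example above.
SPECIES_TAXON_IDS = {"human": "9606", "mouse": "10090", "rat": "10116"}


def symbol_query(symbol, taxon_id):
    """Build a Gene search term for one symbol restricted to one taxon."""
    return f"{symbol}[Gene Name] AND txid{taxon_id}[Organism]"


def queries_for_species(symbols_by_species, taxon_ids=SPECIES_TAXON_IDS):
    """Map each species name to its search term; unknown species raise KeyError."""
    return {
        species: symbol_query(symbol, taxon_ids[species])
        for species, symbol in symbols_by_species.items()
    }


# Species-specific official symbols for the p53 gene.
p53_symbols = {"human": "TP53", "mouse": "Trp53", "rat": "Tp53"}
for species, term in queries_for_species(p53_symbols).items():
    print(f"{species}: {term}")
```

Each resulting term can be passed to the same search helper used elsewhere in this document.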
### Applications

- Evolutionary studies
- Model organism research
- Comparative genomics
- Cross-species experimental design

---

## Pathway Analysis

### Use Case

Identify genes involved in specific biological pathways or processes.

### Workflow

1. **Search by Gene Ontology (GO) term**

```bash
# Find genes involved in apoptosis
python scripts/query_gene.py --search "GO:0006915[biological process]" --organism human --max-results 100
```

2. **Search by pathway name**

```bash
# Find genes in insulin signaling pathway
python scripts/query_gene.py --search "insulin signaling pathway[pathway]" --organism human
```

3. **Get pathway-related genes**

```python
# Example: Get all genes in a specific pathway
import json
import urllib.parse
import urllib.request

# Search for pathway genes
query = "MAPK signaling pathway[pathway] AND human[organism]"

# URL-encode the parameters: the query contains spaces and brackets,
# which urlopen rejects in a raw URL
params = urllib.parse.urlencode({
    "db": "gene",
    "term": query,
    "retmode": "json",
    "retmax": 200,
})
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode())
    gene_ids = data['esearchresult']['idlist']

print(f"Found {len(gene_ids)} genes in MAPK signaling pathway")
```

4. **Batch retrieve gene details**

```bash
# Get details for all pathway genes
python scripts/batch_gene_lookup.py --ids 5594,5595,5603,5604 --output mapk_genes.json
```

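The length of `idlist` is capped by `retmax`, so it can undercount the genes that match a pathway query. An ESearch JSON response also carries a `count` field with the total number of matches. The sketch below, with the hypothetical helper `parse_esearch`, reads both and flags a truncated result; it assumes the request used `retstart=0`.

```python
def parse_esearch(data):
    """Return (ids, total, truncated) from an ESearch JSON payload."""
    result = data.get("esearchresult", {})
    ids = list(result.get("idlist", []))
    # "count" arrives as a string; fall back to the ID list length if absent.
    total = int(result.get("count", len(ids)))
    return ids, total, total > len(ids)


# Hand-written sample payload (illustrative, not captured API output).
sample = {
    "esearchresult": {
        "count": "5",
        "retmax": "2",
        "retstart": "0",
        "idlist": ["5594", "5595"],
    }
}

ids, total, truncated = parse_esearch(sample)
print(f"Retrieved {len(ids)} of {total} matching genes")
if truncated:
    print("Result is truncated: raise retmax or page with retstart")
```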
### Applications

- Pathway enrichment analysis
- Gene set analysis
- Systems biology studies
- Drug target identification

---

## Variant Analysis

### Use Case

Find genes with clinically relevant variants or disease-associated mutations.

### Workflow

1. **Search for genes with clinical variants**

```bash
# Find genes with pathogenic variants
python scripts/query_gene.py --search "pathogenic[clinical significance]" --organism human --max-results 50
```

2. **Link to ClinVar database**

```bash
# Get ClinVar records for a gene
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=clinvar&id=672&retmode=json"
```

3. **Search for pharmacogenomic genes**

```bash
# Find genes associated with drug response
python scripts/query_gene.py --search "pharmacogenomic[property]" --organism human
```

4. **Get variant summary data**

```python
# Example: Get genes with known variants
from scripts.query_gene import esearch, efetch

# Search for genes with variants
gene_ids = esearch("has variants[filter] AND human[organism]", retmax=100)

# Fetch detailed records
for gene_id in gene_ids[:10]:  # First 10
    data = efetch([gene_id], retmode='xml')
    # Parse XML for variant information
    print(f"Gene {gene_id} variant data...")
```

### Applications

- Clinical genetics
- Precision medicine
- Pharmacogenomics
- Genetic counseling

---

## Publication Mining

### Use Case

Find genes mentioned in recent publications or link genes to literature.

### Workflow

1. **Search genes mentioned in specific publications**

```bash
# Find genes mentioned in papers about CRISPR
python scripts/query_gene.py --search "CRISPR[text word]" --organism human --max-results 100
```

2. **Get PubMed articles for a gene**

```bash
# Get all publications for BRCA1
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id=672&retmode=json"
```

3. **Search by author or journal**

```bash
# Find genes studied by specific research group
python scripts/query_gene.py --search "Smith J[author] AND 2024[pdat]" --organism human
```

4. **Extract gene-publication relationships**

```python
# Example: Build gene-publication network
import json
import urllib.request

# Get gene
gene_id = '672'

# Get publications for gene
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id={gene_id}&retmode=json"

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode())

# Extract PMIDs; a set is used because the same PMID can appear
# under more than one link name in the response
pmids = set()
for linkset in data.get('linksets', []):
    for linksetdb in linkset.get('linksetdbs', []):
        pmids.update(linksetdb.get('links', []))

print(f"Gene {gene_id} has {len(pmids)} publications")
```

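An ELink response from Gene to PubMed can contain several link sets, each with its own `linkname` (for example `gene_pubmed` and `gene_pubmed_rif`). Grouping by link name keeps those relationship types separate when building a gene-publication network. This is a sketch with a hypothetical helper name, `links_by_name`, run against a hand-written payload that mirrors the ELink JSON structure used in step 4.

```python
from collections import defaultdict


def links_by_name(elink_data):
    """Group linked IDs by ELink link name, removing duplicates within each group."""
    grouped = defaultdict(set)
    for linkset in elink_data.get("linksets", []):
        for linksetdb in linkset.get("linksetdbs", []):
            name = linksetdb.get("linkname", "unknown")
            grouped[name].update(str(link) for link in linksetdb.get("links", []))
    return {name: sorted(ids) for name, ids in grouped.items()}


# Hand-written sample payload with placeholder PMIDs (not real API output).
sample = {
    "linksets": [
        {
            "dbfrom": "gene",
            "ids": ["672"],
            "linksetdbs": [
                {"dbto": "pubmed", "linkname": "gene_pubmed",
                 "links": ["111", "222"]},
                {"dbto": "pubmed", "linkname": "gene_pubmed_rif",
                 "links": ["222", "333"]},
            ],
        }
    ]
}

grouped = links_by_name(sample)
for name, ids in grouped.items():
    print(f"{name}: {len(ids)} PMIDs")

unique_pmids = set().union(*grouped.values())
print(f"{len(unique_pmids)} unique PMIDs across all link names")
```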
### Applications

- Literature reviews
- Grant writing
- Knowledge base construction
- Trend analysis in genomics research

---

## Advanced Patterns

### Combining Multiple Searches

```python
# Example: Find genes at intersection of multiple criteria
from scripts.query_gene import esearch

def find_genes_multi_criteria(organism='human'):
    # Note: each esearch call returns at most retmax IDs, so the intersection
    # is complete only if every result set fits within that limit.

    # Criterion 1: Disease association
    disease_genes = set(esearch(f"diabetes[disease] AND {organism}[organism]"))

    # Criterion 2: Chromosome location
    chr_genes = set(esearch(f"11[chromosome] AND {organism}[organism]"))

    # Criterion 3: Gene type
    coding_genes = set(esearch(f"protein coding[gene type] AND {organism}[organism]"))

    # Intersection
    candidates = disease_genes & chr_genes & coding_genes

    return list(candidates)
```

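When all criteria target the same database, an alternative to intersecting result sets client-side is to join the terms with `AND` and let the server evaluate the intersection in a single search. The sketch below only builds the combined term; `combine_terms` is a hypothetical helper, and the field tags are copied from the example above.

```python
def combine_terms(*terms, operator="AND"):
    """Join search terms with a boolean operator, parenthesising each term."""
    cleaned = [term.strip() for term in terms if term and term.strip()]
    if not cleaned:
        raise ValueError("at least one search term is required")
    return f" {operator} ".join(f"({term})" for term in cleaned)


combined = combine_terms(
    "diabetes[disease]",
    "11[chromosome]",
    "protein coding[gene type]",
    "human[organism]",
)
print(combined)
```

A single combined query needs one request instead of three and does not depend on every intermediate result set fitting within `retmax`.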
### Rate-Limited Batch Processing

```python
import time

from scripts.query_gene import esummary

# NCBI allows 3 requests per second without an API key and 10 with one,
# so the default delay is sized for keyless use; 0.1 is enough with a key.
def process_genes_with_rate_limit(gene_ids, batch_size=200, delay=0.34, api_key=None):
    results = []

    for i in range(0, len(gene_ids), batch_size):
        batch = gene_ids[i:i + batch_size]

        # Process batch
        batch_results = esummary(batch, api_key=api_key)
        results.append(batch_results)

        # Rate limit
        time.sleep(delay)

    return results
```

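The function above returns one payload per batch. To look genes up by ID afterwards, the per-batch payloads can be merged into a single mapping. This is a minimal sketch with a hypothetical helper, `merge_summaries`, assuming each payload follows the ESummary JSON layout (a `result` object keyed by gene ID plus a `uids` list).

```python
def merge_summaries(batches):
    """Merge per-batch ESummary payloads into one {gene_id: record} dict."""
    merged = {}
    for payload in batches:
        for key, record in payload.get("result", {}).items():
            if key == "uids":
                # "uids" is the index list for the batch, not a gene record.
                continue
            merged[key] = record
    return merged


# Hand-written sample batches (illustrative, not captured API output).
batches = [
    {"result": {"uids": ["672"], "672": {"name": "BRCA1"}}},
    {"result": {"uids": ["7157"], "7157": {"name": "TP53"}}},
]

genes = merge_summaries(batches)
for gene_id, record in genes.items():
    print(gene_id, record["name"])
```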
### Error Handling and Retry

```python
import time

# Assumption: fetch_gene_by_id is provided by scripts/fetch_gene_data.py;
# adjust the import to wherever the helper is defined.
from scripts.fetch_gene_data import fetch_gene_by_id

def robust_gene_fetch(gene_id, max_retries=3):
    for attempt in range(max_retries):
        try:
            data = fetch_gene_by_id(gene_id)
            return data
        except Exception as e:
            if attempt < max_retries - 1:
                wait = 2 ** attempt  # Exponential backoff: 1s, 2s, 4s, ...
                time.sleep(wait)
            else:
                # Out of retries: report the failure and give up
                print(f"Failed to fetch gene {gene_id}: {e}")
                return None
```

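The same backoff pattern can be written as a reusable wrapper around any fetch function. The sketch below is one possible variant, not the scripts' own implementation: `call_with_retry` is a hypothetical name, it retries only on `OSError` by default (which covers `urllib.error.URLError`), re-raises the last error instead of returning `None`, and takes the sleep function as a parameter so the example runs without waiting.

```python
import time


def call_with_retry(func, *args, max_retries=3, base_delay=1.0,
                    retry_on=(OSError,), sleep=time.sleep):
    """Call func(*args), retrying with exponential backoff on selected errors."""
    if max_retries < 1:
        raise ValueError("max_retries must be at least 1")
    for attempt in range(max_retries):
        try:
            return func(*args)
        except retry_on:
            if attempt == max_retries - 1:
                # Out of retries: let the caller see the original error.
                raise
            sleep(base_delay * 2 ** attempt)


# Usage with a stand-in fetch function that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(gene_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("temporary network failure")
    return {"gene_id": gene_id}


waits = []
record = call_with_retry(flaky_fetch, "672", sleep=waits.append)
print(record)
print(f"Backoff delays requested: {waits}")
```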
---

## Tips and Best Practices

1. **Start Specific, Then Broaden**: Begin with precise queries and expand if needed
2. **Use Organism Filters**: Always specify organism for gene symbol searches
3. **Validate Results**: Check gene IDs and symbols for accuracy
4. **Cache Frequently Used Data**: Store the results of common queries locally
5. **Monitor Rate Limits**: Use API keys and implement delays (NCBI allows 3 requests per second without an API key and 10 with one)
6. **Combine APIs**: Use E-utilities for search and the NCBI Datasets API for detailed data
7. **Handle Ambiguity**: Gene symbols may refer to different genes in different species
8. **Check Data Currency**: Gene annotations are updated regularly
9. **Use Batch Operations**: Process multiple genes together when possible
10. **Document Your Queries**: Keep records of search terms and parameters