# Common Gene Database Workflows

This document provides examples of common workflows and use cases for working with the NCBI Gene database.

## Table of Contents

1. [Disease Gene Discovery](#disease-gene-discovery)
2. [Gene Annotation Pipeline](#gene-annotation-pipeline)
3. [Cross-Species Gene Comparison](#cross-species-gene-comparison)
4. [Pathway Analysis](#pathway-analysis)
5. [Variant Analysis](#variant-analysis)
6. [Publication Mining](#publication-mining)
7. [Advanced Patterns](#advanced-patterns)
8. [Tips and Best Practices](#tips-and-best-practices)

---

## Disease Gene Discovery

### Use Case

Identify genes associated with a specific disease or phenotype.

### Workflow

1. **Search by disease name**

```bash
# Find genes associated with Alzheimer's disease
python scripts/query_gene.py --search "Alzheimer disease[disease]" --organism human --max-results 50
```

2. **Filter by chromosome location**

```bash
# Find genes on chromosome 17 associated with breast cancer
python scripts/query_gene.py --search "breast cancer[disease] AND 17[chromosome]" --organism human
```

3. **Retrieve detailed information**

```python
# Python example: Get gene details for disease-associated genes
import json
from scripts.query_gene import esearch, esummary

# Search for genes
query = "diabetes[disease] AND human[organism]"
gene_ids = esearch(query, retmax=100, api_key="YOUR_KEY")

# Get summaries
summaries = esummary(gene_ids, api_key="YOUR_KEY")

# Extract relevant information
for gene_id in gene_ids:
    if gene_id in summaries['result']:
        gene = summaries['result'][gene_id]
        print(f"{gene['name']}: {gene['description']}")
```

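The search strings in steps 1 to 3 all follow the same `term[field]` pattern joined with `AND`. A minimal sketch of composing them programmatically is shown here; `build_gene_query` is a hypothetical helper (not part of `scripts/query_gene.py`), and the field names are simply the ones used in this document's examples.

```python
def build_gene_query(filters, organism=None):
    """Join {field: term} pairs into one Entrez query string using AND.

    Hypothetical helper for illustration; field names are passed through
    unchanged, so it is up to the caller to use tags the database accepts.
    """
    parts = [f"{term}[{field}]" for field, term in filters.items()]
    if organism:
        parts.append(f"{organism}[organism]")
    return " AND ".join(parts)


query = build_gene_query(
    {"disease": "breast cancer", "chromosome": "17"},
    organism="human",
)
print(query)
# breast cancer[disease] AND 17[chromosome] AND human[organism]
```

The resulting string can be passed to `--search` or to `esearch` in the same way as the hand-written queries above.
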
### Expected Output

- List of genes with disease associations
- Gene symbols, descriptions, and chromosomal locations
- Related publications and clinical annotations

---

## Gene Annotation Pipeline

### Use Case

Annotate a list of gene identifiers with comprehensive metadata.

### Workflow

1. **Prepare gene list**

Create a file `genes.txt` with gene symbols (one per line):
```
BRCA1
TP53
EGFR
KRAS
```

2. **Batch lookup**

```bash
python scripts/batch_gene_lookup.py --file genes.txt --organism human --output annotations.json --api-key YOUR_KEY
```

3. **Parse results**

```python
import json

with open('annotations.json', 'r') as f:
    genes = json.load(f)

for gene in genes:
    if 'gene_id' in gene:
        print(f"Symbol: {gene['symbol']}")
        print(f"ID: {gene['gene_id']}")
        print(f"Description: {gene['description']}")
        print(f"Location: chr{gene['chromosome']}:{gene['map_location']}")
        print()
```

4. **Enrich with sequence data**

```bash
# Get detailed data including sequences for specific genes
python scripts/fetch_gene_data.py --gene-id 672 --verbose > BRCA1_detailed.json
```

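For annotation tables, the parsed records from step 3 can be written out as tab-separated text. The sketch below assumes each record carries the keys printed in step 3 (`symbol`, `gene_id`, `description`, `chromosome`, `map_location`) and that failed lookups lack `gene_id`; the `error` key in the sample is invented for illustration.

```python
import csv
import io

FIELDS = ["symbol", "gene_id", "description", "chromosome", "map_location"]


def annotations_to_tsv(genes, out):
    """Write successfully annotated records to `out` as tab-separated rows."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    written = 0
    for gene in genes:
        if "gene_id" not in gene:  # skip failed lookups, as in step 3
            continue
        # Missing optional fields become empty cells instead of raising
        writer.writerow({key: gene.get(key, "") for key in FIELDS})
        written += 1
    return written


# Illustrative records only; real values come from annotations.json
sample = [
    {"symbol": "BRCA1", "gene_id": "672",
     "description": "BRCA1 DNA repair associated",
     "chromosome": "17", "map_location": "17q21.31"},
    {"symbol": "NOT_A_GENE", "error": "not found"},
]
buf = io.StringIO()
count = annotations_to_tsv(sample, buf)
print(buf.getvalue())
```

Passing an open file instead of `io.StringIO()` writes the table to disk.
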
### Applications

- Creating gene annotation tables for publications
- Validating gene lists before analysis
- Building gene reference databases
- Quality control for genomic pipelines

---

## Cross-Species Gene Comparison

### Use Case

Find orthologs or compare the same gene across different species.

### Workflow

1. **Search for gene in multiple organisms**

```bash
# Find TP53 in human
python scripts/fetch_gene_data.py --symbol TP53 --taxon human

# Find the mouse ortholog (official symbol: Trp53)
python scripts/fetch_gene_data.py --symbol Trp53 --taxon mouse

# Find the zebrafish ortholog (official symbol: tp53)
python scripts/fetch_gene_data.py --symbol tp53 --taxon zebrafish
```

2. **Compare gene IDs across species**

```python
# Compare gene information across species
# Assumption: scripts/fetch_gene_data.py exposes fetch_gene_by_symbol(symbol, taxon);
# adjust the import and the call to match the actual script.
from scripts.fetch_gene_data import fetch_gene_by_symbol

# Official symbols differ between species, so store one per organism
species = {
    'human': ('9606', 'TP53'),
    'mouse': ('10090', 'Trp53'),
    'rat': ('10116', 'Tp53'),
}

for organism, (taxon_id, gene_symbol) in species.items():
    # Fetch gene data
    gene_data = fetch_gene_by_symbol(gene_symbol, taxon_id)
    print(f"{organism}: {gene_data}")
```

3. **Find orthologs using ELink**

```bash
# Get HomoloGene links for a gene
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=homologene&id=7157&retmode=json"
```

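The ELink request in step 3 has the same shape as the ClinVar and PubMed requests used later in this document; only the target database and gene ID change. A small sketch of building that URL in Python follows. `elink_url` is a hypothetical helper, and `api_key` is the standard E-utilities parameter, added only when supplied.

```python
from urllib.parse import urlencode

ELINK_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_url(gene_id, db, api_key=None):
    """Build an ELink URL that links a Gene record to records in `db`."""
    params = {"dbfrom": "gene", "db": db, "id": gene_id, "retmode": "json"}
    if api_key:
        params["api_key"] = api_key
    # urlencode escapes any characters that are not URL-safe
    return f"{ELINK_BASE}?{urlencode(params)}"


print(elink_url("7157", "homologene"))
```

With these arguments the function reproduces the URL from the `curl` command above.
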
### Applications

- Evolutionary studies
- Model organism research
- Comparative genomics
- Cross-species experimental design

---

## Pathway Analysis

### Use Case

Identify genes involved in specific biological pathways or processes.

### Workflow

1. **Search by Gene Ontology (GO) term**

```bash
# Find genes involved in apoptosis
python scripts/query_gene.py --search "GO:0006915[biological process]" --organism human --max-results 100
```

2. **Search by pathway name**

```bash
# Find genes in insulin signaling pathway
python scripts/query_gene.py --search "insulin signaling pathway[pathway]" --organism human
```

3. **Get pathway-related genes**

```python
# Example: Get all genes in a specific pathway
import json
import urllib.parse
import urllib.request

# Search for pathway genes
query = "MAPK signaling pathway[pathway] AND human[organism]"

# URL-encode the parameters: the raw query contains spaces and brackets,
# which are not valid in a URL
params = urllib.parse.urlencode({
    "db": "gene",
    "term": query,
    "retmode": "json",
    "retmax": 200,
})
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode())
    gene_ids = data['esearchresult']['idlist']

print(f"Found {len(gene_ids)} genes in MAPK signaling pathway")
```

4. **Batch retrieve gene details**

```bash
# Get details for all pathway genes
python scripts/batch_gene_lookup.py --ids 5594,5595,5603,5604 --output mapk_genes.json
```

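Steps 3 and 4 can be connected by turning the `idlist` from the search into `--ids` arguments. The sketch below splits a long ID list into chunks and prints one command per chunk; `batch_lookup_commands` and the numbered output file names are illustrative assumptions, while the `--ids` and `--output` flags are the ones shown in step 4.

```python
import shlex


def batch_lookup_commands(gene_ids, output_prefix, chunk_size=200):
    """Return one batch_gene_lookup.py command line per chunk of gene IDs."""
    commands = []
    starts = range(0, len(gene_ids), chunk_size)
    for n, start in enumerate(starts, start=1):
        chunk = gene_ids[start:start + chunk_size]
        args = [
            "python", "scripts/batch_gene_lookup.py",
            "--ids", ",".join(chunk),
            "--output", f"{output_prefix}_{n}.json",
        ]
        # shlex.join quotes arguments where the shell would need it
        commands.append(shlex.join(args))
    return commands


for cmd in batch_lookup_commands(["5594", "5595", "5603", "5604"],
                                 "mapk_genes", chunk_size=2):
    print(cmd)
```
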
### Applications

- Pathway enrichment analysis
- Gene set analysis
- Systems biology studies
- Drug target identification

---

## Variant Analysis

### Use Case

Find genes with clinically relevant variants or disease-associated mutations.

### Workflow

1. **Search for genes with clinical variants**

```bash
# Find genes with pathogenic variants
python scripts/query_gene.py --search "pathogenic[clinical significance]" --organism human --max-results 50
```

2. **Link to ClinVar database**

```bash
# Get ClinVar records for a gene
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=clinvar&id=672&retmode=json"
```

3. **Search for pharmacogenomic genes**

```bash
# Find genes associated with drug response
python scripts/query_gene.py --search "pharmacogenomic[property]" --organism human
```

4. **Get variant summary data**

```python
# Example: Get genes with known variants
from scripts.query_gene import esearch, efetch

# Search for genes with variants
gene_ids = esearch("has variants[filter] AND human[organism]", retmax=100)

# Fetch detailed records
for gene_id in gene_ids[:10]:  # First 10
    data = efetch([gene_id], retmode='xml')
    # Parse XML for variant information
    print(f"Gene {gene_id} variant data...")
```

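When the ClinVar link request from step 2 is issued for several genes, it helps to keep track of which linked records belong to which gene. The sketch below groups linked IDs per source gene. It assumes the ELink JSON response contains `linksets`, each with an `ids` list and `linksetdbs` entries carrying `dbto` and `links`; the sample payload and its ClinVar IDs are placeholders, not real records.

```python
def links_by_gene(elink_json, dbto="clinvar"):
    """Map each source gene ID to the IDs it links to in database `dbto`."""
    mapping = {}
    for linkset in elink_json.get("linksets", []):
        linked = []
        for linksetdb in linkset.get("linksetdbs", []):
            if linksetdb.get("dbto") == dbto:
                linked.extend(str(uid) for uid in linksetdb.get("links", []))
        # Genes without any links still get an (empty) entry
        for gene_id in linkset.get("ids", []):
            mapping.setdefault(str(gene_id), []).extend(linked)
    return mapping


# Placeholder payload in the assumed response shape
sample_response = {
    "linksets": [
        {"dbfrom": "gene", "ids": ["672"],
         "linksetdbs": [{"dbto": "clinvar", "linkname": "gene_clinvar",
                         "links": ["1001", "1002"]}]},
        {"dbfrom": "gene", "ids": ["7157"], "linksetdbs": []},
    ]
}
print(links_by_gene(sample_response))
# {'672': ['1001', '1002'], '7157': []}
```
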
### Applications

- Clinical genetics
- Precision medicine
- Pharmacogenomics
- Genetic counseling

---

## Publication Mining

### Use Case

Find genes mentioned in recent publications or link genes to literature.

### Workflow

1. **Search genes mentioned in specific publications**

```bash
# Find genes mentioned in papers about CRISPR
python scripts/query_gene.py --search "CRISPR[text word]" --organism human --max-results 100
```

2. **Get PubMed articles for a gene**

```bash
# Get all publications for BRCA1
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id=672&retmode=json"
```

3. **Search by author or journal**

```bash
# Find genes studied by specific research group
python scripts/query_gene.py --search "Smith J[author] AND 2024[pdat]" --organism human
```

4. **Extract gene-publication relationships**

```python
# Example: Build gene-publication network
import json
import urllib.request

# Get gene
gene_id = '672'

# Get publications for gene
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id={gene_id}&retmode=json"

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode())

# Extract PMIDs; the same PMID can appear under several link names,
# so collect them in a set to avoid double counting
pmids = set()
for linkset in data.get('linksets', []):
    for linksetdb in linkset.get('linksetdbs', []):
        pmids.update(linksetdb.get('links', []))

print(f"Gene {gene_id} has {len(pmids)} publications")
```

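Once PMIDs have been collected for several genes as in step 4, a simple gene-publication network can be built by inverting the mapping. The sketch below finds publications linked to more than one gene; `shared_publications` is a hypothetical helper and the PMIDs in the sample are placeholders.

```python
from collections import defaultdict


def shared_publications(gene_to_pmids, min_genes=2):
    """Return {pmid: sorted gene IDs} for PMIDs linked to >= min_genes genes."""
    pmid_to_genes = defaultdict(set)
    for gene_id, pmids in gene_to_pmids.items():
        for pmid in pmids:
            pmid_to_genes[pmid].add(gene_id)
    return {
        pmid: sorted(genes)
        for pmid, genes in pmid_to_genes.items()
        if len(genes) >= min_genes
    }


# Placeholder PMIDs for illustration only
gene_to_pmids = {
    "672": ["111", "222"],
    "7157": ["222", "333"],
}
print(shared_publications(gene_to_pmids))
# {'222': ['672', '7157']}
```

Each returned entry is an edge set in the network: one publication connected to every gene listed for it.
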
### Applications

- Literature reviews
- Grant writing
- Knowledge base construction
- Trend analysis in genomics research

---

## Advanced Patterns

### Combining Multiple Searches

```python
# Example: Find genes at intersection of multiple criteria
from scripts.query_gene import esearch

def find_genes_multi_criteria(organism='human'):
    # Criterion 1: Disease association
    disease_genes = set(esearch(f"diabetes[disease] AND {organism}[organism]"))

    # Criterion 2: Chromosome location
    chr_genes = set(esearch(f"11[chromosome] AND {organism}[organism]"))

    # Criterion 3: Gene type
    coding_genes = set(esearch(f"protein coding[gene type] AND {organism}[organism]"))

    # Intersection (only complete if each search returned all of its matches;
    # raise retmax if the default truncates the results)
    candidates = disease_genes & chr_genes & coding_genes

    return list(candidates)
```

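The set intersection above is only reliable when every search returns its full result set. One way to get complete lists is to page through results with the E-utilities `retstart` and `retmax` parameters. The sketch below takes the page-fetching function as an argument, because whether the `esearch` helper in `scripts/query_gene.py` accepts `retstart` is not stated here; `collect_all_ids` and `fake_search_page` are illustrative names.

```python
def collect_all_ids(search_page, query, page_size=500):
    """Fetch every ID for `query` by requesting consecutive pages.

    `search_page(query, retstart, retmax)` must return a list of IDs;
    a page shorter than `page_size` marks the end of the results.
    """
    ids = []
    retstart = 0
    while True:
        page = search_page(query, retstart, page_size)
        ids.extend(page)
        if len(page) < page_size:
            return ids
        retstart += page_size


# Stand-in for a real search call, so the sketch runs offline
fake_hits = [str(n) for n in range(1, 1201)]

def fake_search_page(query, retstart, retmax):
    return fake_hits[retstart:retstart + retmax]


all_ids = collect_all_ids(fake_search_page, "11[chromosome] AND human[organism]")
print(len(all_ids))
# 1200
```
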
### Rate-Limited Batch Processing

```python
import time

from scripts.query_gene import esummary

def process_genes_with_rate_limit(gene_ids, batch_size=200, delay=0.1):
    # delay=0.1 s matches NCBI's limit of 10 requests/second with an API key;
    # without a key the limit is 3 requests/second, so use delay >= 0.34
    results = []

    for i in range(0, len(gene_ids), batch_size):
        batch = gene_ids[i:i + batch_size]

        # Process batch
        batch_results = esummary(batch)
        results.append(batch_results)

        # Rate limit
        time.sleep(delay)

    return results
```

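`process_genes_with_rate_limit` returns one response per batch, so downstream code usually needs them merged into a single lookup table. A minimal sketch, assuming each batch has the ESummary JSON shape used earlier in this document (a `result` object keyed by gene ID, plus a `uids` list); `merge_summaries` is a hypothetical helper and the sample values are abbreviated.

```python
def merge_summaries(batch_results):
    """Combine per-batch ESummary responses into one {gene_id: summary} dict."""
    merged = {}
    for batch in batch_results:
        result = batch.get("result", {})
        # 'uids' lists the gene IDs present in this batch
        for uid in result.get("uids", []):
            if uid in result:
                merged[uid] = result[uid]
    return merged


# Abbreviated sample batches in the assumed response shape
batches = [
    {"result": {"uids": ["672"], "672": {"name": "BRCA1"}}},
    {"result": {"uids": ["7157"], "7157": {"name": "TP53"}}},
]
summaries = merge_summaries(batches)
print(sorted(summaries))
# ['672', '7157']
```
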
### Error Handling and Retry

```python
import time

# Assumption: fetch_gene_by_id is provided by scripts/fetch_gene_data.py
from scripts.fetch_gene_data import fetch_gene_by_id

def robust_gene_fetch(gene_id, max_retries=3):
    for attempt in range(max_retries):
        try:
            data = fetch_gene_by_id(gene_id)
            return data
        except Exception as e:
            if attempt < max_retries - 1:
                wait = 2 ** attempt  # Exponential backoff: 1 s, 2 s, 4 s, ...
                time.sleep(wait)
            else:
                print(f"Failed to fetch gene {gene_id}: {e}")
                return None
```

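Retrying on every exception also retries errors that cannot succeed, such as an invalid gene ID. A possible variant retries only failures that look transient and re-raises everything else. The set of retryable HTTP status codes is an assumption chosen for illustration, and the fetch function is passed in so the sketch does not depend on any particular script.

```python
import time
import urllib.error

# Assumed set of status codes worth retrying (rate limiting, server errors)
TRANSIENT_HTTP = {429, 500, 502, 503, 504}


def is_transient(exc):
    """Return True for errors that may succeed on a later attempt."""
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code in TRANSIENT_HTTP
    return isinstance(exc, (urllib.error.URLError, TimeoutError))


def fetch_with_retry(fetch, gene_id, max_retries=3, base_delay=1.0,
                     sleep=time.sleep):
    """Call fetch(gene_id), backing off exponentially on transient errors."""
    for attempt in range(max_retries):
        try:
            return fetch(gene_id)
        except Exception as exc:
            # Give up on permanent errors and on the final attempt
            if not is_transient(exc) or attempt == max_retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in fetch that fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch(gene_id):
    calls["n"] += 1
    if calls["n"] < 3:
        raise urllib.error.URLError("temporary failure")
    return {"gene_id": gene_id}


result = fetch_with_retry(flaky_fetch, "672", sleep=lambda seconds: None)
print(result, calls["n"])
# {'gene_id': '672'} 3
```
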
---

## Tips and Best Practices

1. **Start Specific, Then Broaden**: Begin with precise queries and expand if needed
2. **Use Organism Filters**: Always specify organism for gene symbol searches
3. **Validate Results**: Check gene IDs and symbols for accuracy
4. **Cache Frequently Used Data**: Store common queries locally
5. **Monitor Rate Limits**: Use API keys and implement delays (NCBI allows 3 requests per second without a key and 10 with one)
6. **Combine APIs**: Use E-utilities for search and the NCBI Datasets API for detailed data
7. **Handle Ambiguity**: Gene symbols may refer to different genes in different species
8. **Check Data Currency**: Gene annotations are updated regularly
9. **Use Batch Operations**: Process multiple genes together when possible
10. **Document Your Queries**: Keep records of search terms and parameters

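Tip 4 can be sketched as a small on-disk cache keyed by the query string. This is one possible approach, not part of the provided scripts: `cached_query` is a hypothetical helper, the search function is passed in, and the sketch has no expiry, so stale entries need to be cleared manually when annotations change (tip 8).

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_query(cache_dir, query, run_query):
    """Return the cached result for `query`, calling run_query on a miss."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Hash the query so any search string maps to a safe file name
    key = hashlib.sha256(query.encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = run_query(query)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


# Stand-in search returning placeholder IDs, so the sketch runs offline
search_calls = []

def fake_search(query):
    search_calls.append(query)
    return ["101", "102"]


with tempfile.TemporaryDirectory() as tmp:
    first = cached_query(tmp, "diabetes[disease] AND human[organism]", fake_search)
    second = cached_query(tmp, "diabetes[disease] AND human[organism]", fake_search)

print(first == second, len(search_calls))
# True 1
```
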