# Common Gene Database Workflows

This document provides examples of common workflows and use cases for working with the NCBI Gene database.

## Table of Contents

1. [Disease Gene Discovery](#disease-gene-discovery)
2. [Gene Annotation Pipeline](#gene-annotation-pipeline)
3. [Cross-Species Gene Comparison](#cross-species-gene-comparison)
4. [Pathway Analysis](#pathway-analysis)
5. [Variant Analysis](#variant-analysis)
6. [Publication Mining](#publication-mining)
7. [Advanced Patterns](#advanced-patterns)
8. [Tips and Best Practices](#tips-and-best-practices)

---

## Disease Gene Discovery

### Use Case

Identify genes associated with a specific disease or phenotype.

### Workflow

1. **Search by disease name**

   ```bash
   # Find genes associated with Alzheimer's disease
   python scripts/query_gene.py --search "Alzheimer disease[disease]" --organism human --max-results 50
   ```

2. **Filter by chromosome location**

   ```bash
   # Find genes on chromosome 17 associated with breast cancer
   python scripts/query_gene.py --search "breast cancer[disease] AND 17[chromosome]" --organism human
   ```

3. **Retrieve detailed information**

   ```python
   # Python example: Get gene details for disease-associated genes
   from scripts.query_gene import esearch, esummary

   # Search for genes
   query = "diabetes[disease] AND human[organism]"
   gene_ids = esearch(query, retmax=100, api_key="YOUR_KEY")

   # Get summaries
   summaries = esummary(gene_ids, api_key="YOUR_KEY")

   # Extract relevant information
   for gene_id in gene_ids:
       if gene_id in summaries['result']:
           gene = summaries['result'][gene_id]
           print(f"{gene['name']}: {gene['description']}")
   ```

### Expected Output

- List of genes with disease associations
- Gene symbols, descriptions, and chromosomal locations
- Related publications and clinical annotations

---

## Gene Annotation Pipeline

### Use Case

Annotate a list of gene identifiers with comprehensive metadata.

### Workflow

1. **Prepare gene list**

   Create a file `genes.txt` with gene symbols (one per line):

   ```
   BRCA1
   TP53
   EGFR
   KRAS
   ```

2.
   **Batch lookup**

   ```bash
   python scripts/batch_gene_lookup.py --file genes.txt --organism human --output annotations.json --api-key YOUR_KEY
   ```

3. **Parse results**

   ```python
   import json

   with open('annotations.json', 'r') as f:
       genes = json.load(f)

   for gene in genes:
       # Entries without a gene_id are symbols that could not be resolved
       if 'gene_id' in gene:
           print(f"Symbol: {gene['symbol']}")
           print(f"ID: {gene['gene_id']}")
           print(f"Description: {gene['description']}")
           print(f"Location: chr{gene['chromosome']}:{gene['map_location']}")
           print()
   ```

4. **Enrich with sequence data**

   ```bash
   # Get detailed data including sequences for specific genes
   python scripts/fetch_gene_data.py --gene-id 672 --verbose > BRCA1_detailed.json
   ```

### Applications

- Creating gene annotation tables for publications
- Validating gene lists before analysis
- Building gene reference databases
- Quality control for genomic pipelines

---

## Cross-Species Gene Comparison

### Use Case

Find orthologs or compare the same gene across different species.

### Workflow

1. **Search for gene in multiple organisms**

   Official symbols differ between species, so use the symbol assigned in each organism.

   ```bash
   # Find TP53 in human
   python scripts/fetch_gene_data.py --symbol TP53 --taxon human

   # Find the TP53 ortholog in mouse (official symbol: Trp53)
   python scripts/fetch_gene_data.py --symbol Trp53 --taxon mouse

   # Find the TP53 ortholog in zebrafish (official symbol: tp53)
   python scripts/fetch_gene_data.py --symbol tp53 --taxon zebrafish
   ```

2. **Compare gene IDs across species**

   ```python
   # Compare gene information across species.
   # Assumption: fetch_gene_by_symbol is importable from scripts/fetch_gene_data.py
   # and takes (symbol, taxon); adjust to match the script's actual signature.
   from scripts.fetch_gene_data import fetch_gene_by_symbol

   # NCBI Taxonomy ID and official gene symbol for each species
   species = {
       'human': ('9606', 'TP53'),
       'mouse': ('10090', 'Trp53'),
       'rat': ('10116', 'Tp53'),
   }

   for organism, (taxon_id, symbol) in species.items():
       # Fetch gene data
       gene_data = fetch_gene_by_symbol(symbol, taxon_id)
       print(f"{organism}: {gene_data}")
   ```

3.
   **Find orthologs using ELink**

   ```bash
   # Get HomoloGene links for a gene
   curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=homologene&id=7157&retmode=json"
   ```

### Applications

- Evolutionary studies
- Model organism research
- Comparative genomics
- Cross-species experimental design

---

## Pathway Analysis

### Use Case

Identify genes involved in specific biological pathways or processes.

### Workflow

1. **Search by Gene Ontology (GO) term**

   ```bash
   # Find genes involved in apoptosis
   python scripts/query_gene.py --search "GO:0006915[biological process]" --organism human --max-results 100
   ```

2. **Search by pathway name**

   ```bash
   # Find genes in insulin signaling pathway
   python scripts/query_gene.py --search "insulin signaling pathway[pathway]" --organism human
   ```

3. **Get pathway-related genes**

   ```python
   # Example: Get all genes in a specific pathway
   import json
   import urllib.parse
   import urllib.request

   # Search for pathway genes
   query = "MAPK signaling pathway[pathway] AND human[organism]"

   # URL-encode the parameters: the raw query contains spaces and brackets,
   # which urlopen rejects in an unencoded URL
   params = urllib.parse.urlencode({
       'db': 'gene',
       'term': query,
       'retmode': 'json',
       'retmax': 200,
   })
   url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"

   with urllib.request.urlopen(url) as response:
       data = json.loads(response.read().decode())

   # idlist holds at most retmax IDs
   gene_ids = data['esearchresult']['idlist']
   print(f"Found {len(gene_ids)} genes in MAPK signaling pathway")
   ```

4. **Batch retrieve gene details**

   ```bash
   # Get details for all pathway genes
   python scripts/batch_gene_lookup.py --ids 5594,5595,5603,5604 --output mapk_genes.json
   ```

### Applications

- Pathway enrichment analysis
- Gene set analysis
- Systems biology studies
- Drug target identification

---

## Variant Analysis

### Use Case

Find genes with clinically relevant variants or disease-associated mutations.

### Workflow

1. **Search for genes with clinical variants**

   ```bash
   # Find genes with pathogenic variants
   python scripts/query_gene.py --search "pathogenic[clinical significance]" --organism human --max-results 50
   ```

2.
   **Link to ClinVar database**

   ```bash
   # Get ClinVar records for a gene
   curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=clinvar&id=672&retmode=json"
   ```

3. **Search for pharmacogenomic genes**

   ```bash
   # Find genes associated with drug response
   python scripts/query_gene.py --search "pharmacogenomic[property]" --organism human
   ```

4. **Get variant summary data**

   ```python
   # Example: Get genes with known variants
   from scripts.query_gene import esearch, efetch

   # Search for genes with variants
   gene_ids = esearch("has variants[filter] AND human[organism]", retmax=100)

   # Fetch detailed records
   for gene_id in gene_ids[:10]:  # First 10
       data = efetch([gene_id], retmode='xml')
       # Parse XML for variant information
       print(f"Gene {gene_id} variant data...")
   ```

### Applications

- Clinical genetics
- Precision medicine
- Pharmacogenomics
- Genetic counseling

---

## Publication Mining

### Use Case

Find genes mentioned in recent publications or link genes to literature.

### Workflow

1. **Search genes mentioned in specific publications**

   ```bash
   # Find genes mentioned in papers about CRISPR
   python scripts/query_gene.py --search "CRISPR[text word]" --organism human --max-results 100
   ```

2. **Get PubMed articles for a gene**

   ```bash
   # Get all publications for BRCA1
   curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id=672&retmode=json"
   ```

3. **Search by author or journal**

   ```bash
   # Find genes studied by specific research group
   python scripts/query_gene.py --search "Smith J[author] AND 2024[pdat]" --organism human
   ```

4.
   **Extract gene-publication relationships**

   ```python
   # Example: Build gene-publication network
   import json
   import urllib.request

   # Get gene
   gene_id = '672'

   # Get publications for gene
   url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id={gene_id}&retmode=json"

   with urllib.request.urlopen(url) as response:
       data = json.loads(response.read().decode())

   # Extract PMIDs; a set is used because the response can contain several
   # link groups that list the same PMID
   pmids = set()
   for linkset in data.get('linksets', []):
       for linksetdb in linkset.get('linksetdbs', []):
           pmids.update(linksetdb.get('links', []))

   print(f"Gene {gene_id} has {len(pmids)} publications")
   ```

### Applications

- Literature reviews
- Grant writing
- Knowledge base construction
- Trend analysis in genomics research

---

## Advanced Patterns

### Combining Multiple Searches

```python
# Example: Find genes at intersection of multiple criteria
from scripts.query_gene import esearch

def find_genes_multi_criteria(organism='human', retmax=10000):
    # Each search returns at most retmax IDs. If any result set is truncated,
    # the intersection can miss genes; a single combined query
    # ("A AND B AND C") avoids that problem.

    # Criteria 1: Disease association
    disease_genes = set(esearch(f"diabetes[disease] AND {organism}[organism]", retmax=retmax))

    # Criteria 2: Chromosome location
    chr_genes = set(esearch(f"11[chromosome] AND {organism}[organism]", retmax=retmax))

    # Criteria 3: Gene type
    coding_genes = set(esearch(f"protein coding[gene type] AND {organism}[organism]", retmax=retmax))

    # Intersection
    candidates = disease_genes & chr_genes & coding_genes
    return list(candidates)
```

### Rate-Limited Batch Processing

```python
import time

from scripts.query_gene import esummary

def process_genes_with_rate_limit(gene_ids, batch_size=200, delay=0.34):
    # NCBI allows about 3 requests per second without an API key and 10 with
    # one; the default delay stays under the keyless limit.
    results = []
    for i in range(0, len(gene_ids), batch_size):
        batch = gene_ids[i:i + batch_size]

        # Process batch
        batch_results = esummary(batch)
        results.append(batch_results)

        # Rate limit
        time.sleep(delay)
    return results
```

### Error Handling and Retry

```python
import time

# Assumption: fetch_gene_by_id is provided by scripts/fetch_gene_data.py
from scripts.fetch_gene_data import fetch_gene_by_id

def robust_gene_fetch(gene_id, max_retries=3):
    for attempt in range(max_retries):
        try:
            data = fetch_gene_by_id(gene_id)
            return data
        except Exception as e:
            if attempt < max_retries - 1:
                wait = 2 ** attempt  # Exponential backoff: 1s, 2s, ...
                time.sleep(wait)
            else:
                print(f"Failed to fetch gene {gene_id}: {e}")
                return None
```

---

## Tips and Best Practices

1. **Start Specific, Then Broaden**: Begin with precise queries and expand if needed
2. **Use Organism Filters**: Always specify organism for gene symbol searches
3. **Validate Results**: Check gene IDs and symbols for accuracy
4. **Cache Frequently Used Data**: Store common queries locally
5. **Monitor Rate Limits**: Use API keys and implement delays
6. **Combine APIs**: Use E-utilities for search, Datasets API for detailed data
7. **Handle Ambiguity**: Gene symbols may refer to different genes in different species
8. **Check Data Currency**: Gene annotations are updated regularly
9. **Use Batch Operations**: Process multiple genes together when possible
10. **Document Your Queries**: Keep records of search terms and parameters
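
Caching, rate limiting, and query documentation can live in one small helper. The sketch below is one possible shape for it, not part of the bundled scripts: `build_esearch_url`, `CachedSearcher`, and `http_get` are hypothetical names introduced here, the cache is in-memory only, and the 0.34-second default interval assumes NCBI's keyless limit of roughly three requests per second.

```python
import json
import time
import urllib.parse
import urllib.request

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, retmax=20, api_key=None):
    """Build an ESearch URL for the Gene database with URL-encoded parameters."""
    params = {"db": "gene", "term": term, "retmode": "json", "retmax": retmax}
    if api_key:
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/esearch.fcgi?{urllib.parse.urlencode(params)}"


def http_get(url):
    """Default fetcher: return the response body of a GET request as text."""
    with urllib.request.urlopen(url) as response:
        return response.read().decode()


class CachedSearcher:
    """Hypothetical helper: caches results, spaces out requests, logs queries."""

    def __init__(self, fetch=http_get, min_interval=0.34, api_key=None):
        self.fetch = fetch                # callable taking a URL, returning text
        self.min_interval = min_interval  # minimum seconds between requests
        self.api_key = api_key
        self.cache = {}                   # URL -> list of gene IDs
        self.log = []                     # (term, retmax, cache_hit) records
        self._last_request = None

    def search(self, term, retmax=20):
        url = build_esearch_url(term, retmax, self.api_key)
        cache_hit = url in self.cache
        self.log.append((term, retmax, cache_hit))
        if cache_hit:
            return self.cache[url]

        # Sleep only as long as needed to respect the minimum interval
        if self._last_request is not None:
            elapsed = time.monotonic() - self._last_request
            if elapsed < self.min_interval:
                time.sleep(self.min_interval - elapsed)

        body = self.fetch(url)
        self._last_request = time.monotonic()
        gene_ids = json.loads(body)["esearchresult"]["idlist"]
        self.cache[url] = gene_ids
        return gene_ids
```

Calling `CachedSearcher().search("diabetes[disease] AND human[organism]", retmax=100)` twice would send one request and answer the second call from the cache. Passing a custom `fetch` callable makes the helper usable offline, and writing `log` to a file alongside the results gives a record of the search terms and parameters used.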