zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

Common Gene Database Workflows

This document provides example workflows and use cases for working with the NCBI Gene database.

Disease Gene Discovery
Gene Annotation Pipeline
Cross-Species Gene Comparison
Pathway Analysis
Variant Analysis
Publication Mining

Disease Gene Discovery

Use Case

Identify genes associated with a specific disease or phenotype.

Workflow

Search by disease name

# Find genes associated with Alzheimer's disease
python scripts/query_gene.py --search "Alzheimer disease[disease]" --organism human --max-results 50

Filter by chromosome location

# Find genes on chromosome 17 associated with breast cancer
python scripts/query_gene.py --search "breast cancer[disease] AND 17[chromosome]" --organism human

Retrieve detailed information

# Python example: Get gene details for disease-associated genes
import json
from scripts.query_gene import esearch, esummary

# Search for genes
query = "diabetes[disease] AND human[organism]"
gene_ids = esearch(query, retmax=100, api_key="YOUR_KEY")

# Get summaries
summaries = esummary(gene_ids, api_key="YOUR_KEY")

# Extract relevant information
for gene_id in gene_ids:
    if gene_id in summaries['result']:
        gene = summaries['result'][gene_id]
        print(f"{gene['name']}: {gene['description']}")

Expected Output

List of genes with disease associations
Gene symbols, descriptions, and chromosomal locations
Related publications and clinical annotations
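
The fields listed above can be pulled out of an ESummary response with a small helper. This is a minimal sketch, assuming the JSON layout used in the Python example (a `result` mapping keyed by gene ID) and assuming the per-gene keys `chromosome` and `maplocation`, which the original example does not show. The name `summaries_to_rows` is hypothetical.

```python
def summaries_to_rows(summaries):
    """Flatten an ESummary-style dict into a list of row dicts."""
    result = summaries.get("result", {})
    # "uids", when present, lists the gene IDs in server order.
    gene_ids = result.get("uids") or [k for k in result if k != "uids"]
    rows = []
    for gene_id in gene_ids:
        gene = result.get(gene_id)
        if not isinstance(gene, dict):
            continue  # ID listed but no record returned
        rows.append({
            "gene_id": gene_id,
            "symbol": gene.get("name", ""),
            "description": gene.get("description", ""),
            # Assumed key names; absent keys become empty strings.
            "chromosome": gene.get("chromosome", ""),
            "map_location": gene.get("maplocation", ""),
        })
    return rows


# Hand-written sample in the assumed shape (not captured API output).
sample = {
    "result": {
        "uids": ["672", "7157"],
        "672": {"name": "BRCA1", "description": "BRCA1 DNA repair associated",
                "chromosome": "17", "maplocation": "17q21.31"},
        "7157": {"name": "TP53", "description": "tumor protein p53",
                 "chromosome": "17", "maplocation": "17p13.1"},
    }
}

for row in summaries_to_rows(sample):
    print(f"{row['symbol']}\tchr{row['chromosome']}\t{row['map_location']}\t{row['description']}")
```

Using `.get` with defaults keeps the helper from raising when a record lacks one of the assumed keys.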

Gene Annotation Pipeline

Use Case

Annotate a list of gene identifiers with comprehensive metadata.

Workflow

Prepare gene list

Create a file genes.txt with gene symbols (one per line):

BRCA1
TP53
EGFR
KRAS

Batch lookup

python scripts/batch_gene_lookup.py --file genes.txt --organism human --output annotations.json --api-key YOUR_KEY

Parse results

import json

with open('annotations.json', 'r') as f:
    genes = json.load(f)

for gene in genes:
    if 'gene_id' in gene:
        print(f"Symbol: {gene['symbol']}")
        print(f"ID: {gene['gene_id']}")
        print(f"Description: {gene['description']}")
        # map_location is a cytogenetic band (e.g. 17q21.31), not a coordinate
        print(f"Location: chr{gene['chromosome']} ({gene['map_location']})")
        print()

Enrich with sequence data

# Get detailed data including sequences for specific genes
python scripts/fetch_gene_data.py --gene-id 672 --verbose > BRCA1_detailed.json

Use Cases

Creating gene annotation tables for publications
Validating gene lists before analysis
Building gene reference databases
Quality control for genomic pipelines
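
For the annotation-table use case, the parsed records can be written out as tab-separated text. This is a sketch only, assuming the record keys used in the parsing step (`symbol`, `gene_id`, `description`, `chromosome`, `map_location`) and assuming, as that step does, that unresolved symbols come back without a `gene_id`. The function name and the `error` key in the sample are hypothetical.

```python
import csv
import io

FIELDS = ["symbol", "gene_id", "description", "chromosome", "map_location"]


def annotations_to_tsv(genes, out):
    """Write resolved gene records to `out` as TSV; return the row count."""
    writer = csv.DictWriter(out, fieldnames=FIELDS, delimiter="\t",
                            extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    written = 0
    for gene in genes:
        if "gene_id" not in gene:
            continue  # skip symbols that were not resolved
        writer.writerow({field: gene.get(field, "") for field in FIELDS})
        written += 1
    return written


# Hand-written sample records; the second one mimics a failed lookup.
genes = [
    {"symbol": "BRCA1", "gene_id": 672, "description": "BRCA1 DNA repair associated",
     "chromosome": "17", "map_location": "17q21.31"},
    {"symbol": "NOTAGENE", "error": "not found"},
]

buffer = io.StringIO()
count = annotations_to_tsv(genes, buffer)
print(f"Wrote {count} row(s)")
print(buffer.getvalue(), end="")
```

Passing an open file instead of `io.StringIO` writes the same table to disk.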

Cross-Species Gene Comparison

Use Case

Find orthologs or compare the same gene across different species.

Workflow

Search for gene in multiple organisms

# Find TP53 in human
python scripts/fetch_gene_data.py --symbol TP53 --taxon human

# Find the mouse ortholog (official symbol: Trp53)
python scripts/fetch_gene_data.py --symbol Trp53 --taxon mouse

# Find the zebrafish ortholog (official symbol: tp53)
python scripts/fetch_gene_data.py --symbol tp53 --taxon zebrafish

Compare gene IDs across species

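# Compare gene information across species
# Assumption: fetch_gene_by_symbol lives in scripts/fetch_gene_data.py and
# takes (symbol, taxon); adjust the import and call to match the script.
from scripts.fetch_gene_data import fetch_gene_by_symbol

# Taxonomy ID and official symbol per species (orthologs are named differently)
species = {
    'human': ('9606', 'TP53'),
    'mouse': ('10090', 'Trp53'),
    'rat': ('10116', 'Tp53'),
}

for organism, (taxon_id, gene_symbol) in species.items():
    # Fetch gene data
    gene_data = fetch_gene_by_symbol(gene_symbol, taxon_id)
    print(f"{organism}: {gene_data}")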

Find orthologs using ELink

# Get HomoloGene links for a gene
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=homologene&id=7157&retmode=json"

Applications

Evolutionary studies
Model organism research
Comparative genomics
Cross-species experimental design
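
The ELink call in the workflow can also be built programmatically, which helps when looking up several genes at once. A minimal sketch, assuming only the standard E-utilities parameters (`dbfrom`, `db`, `id`, `retmode`, `api_key`); the helper name `elink_url` is hypothetical.

```python
from urllib.parse import urlencode

ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_url(gene_ids, db, dbfrom="gene", api_key=None):
    """Build an ELink URL with one `id` parameter per gene."""
    params = [("dbfrom", dbfrom), ("db", db)]
    # Repeating `id` asks ELink for a separate link set per input ID.
    params += [("id", str(gene_id)) for gene_id in gene_ids]
    params.append(("retmode", "json"))
    if api_key:
        params.append(("api_key", api_key))
    return f"{ELINK}?{urlencode(params)}"


# Same request as the curl command in the workflow (TP53, gene ID 7157).
print(elink_url(["7157"], "homologene"))
```

The same helper covers the ClinVar and PubMed link requests by changing the `db` argument.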

Pathway Analysis

Use Case

Identify genes involved in specific biological pathways or processes.

Workflow

Search by Gene Ontology (GO) term

# Find genes involved in apoptosis
python scripts/query_gene.py --search "GO:0006915[biological process]" --organism human --max-results 100

Search by pathway name

# Find genes in insulin signaling pathway
python scripts/query_gene.py --search "insulin signaling pathway[pathway]" --organism human

Get pathway-related genes

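# Example: Get all genes in a specific pathway
import json
import urllib.parse
import urllib.request

# Search for pathway genes
query = "MAPK signaling pathway[pathway] AND human[organism]"
# URL-encode the parameters; the raw query contains spaces and brackets
params = urllib.parse.urlencode({"db": "gene", "term": query, "retmode": "json", "retmax": 200})
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode())
    gene_ids = data['esearchresult']['idlist']
    total = data['esearchresult']['count']  # total matches; idlist is capped at retmax

print(f"Retrieved {len(gene_ids)} of {total} genes in MAPK signaling pathway")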

Batch retrieve gene details

# Get details for all pathway genes
python scripts/batch_gene_lookup.py --ids 5594,5595,5603,5604 --output mapk_genes.json

Applications

Pathway enrichment analysis
Gene set analysis
Systems biology studies
Drug target identification

Variant Analysis

Use Case

Find genes with clinically relevant variants or disease-associated mutations.

Workflow

Search for genes with clinical variants

# Find genes with pathogenic variants
python scripts/query_gene.py --search "pathogenic[clinical significance]" --organism human --max-results 50

Link to ClinVar database

# Get ClinVar records for a gene
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=clinvar&id=672&retmode=json"

Search for pharmacogenomic genes

# Find genes associated with drug response
python scripts/query_gene.py --search "pharmacogenomic[property]" --organism human

Get variant summary data

# Example: Get genes with known variants
from scripts.query_gene import esearch, efetch

# Search for genes with variants
gene_ids = esearch("has variants[filter] AND human[organism]", retmax=100)

# Fetch detailed records
for gene_id in gene_ids[:10]:  # First 10
    data = efetch([gene_id], retmode='xml')
    # Parse XML for variant information
    print(f"Gene {gene_id} variant data...")

Applications

Clinical genetics
Precision medicine
Pharmacogenomics
Genetic counseling
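
The ClinVar link request in the workflow returns JSON that still has to be unpacked. A minimal sketch of that step, assuming the ELink JSON layout of `linksets` containing `linksetdbs`, each with a `linkname` and a `links` list. The sample response is hand-written with placeholder IDs and an illustrative link name; `extract_link_ids` is a hypothetical helper.

```python
def extract_link_ids(elink_json, linkname=None):
    """Collect linked record IDs, de-duplicated, in first-seen order."""
    seen = set()
    ids = []
    for linkset in elink_json.get("linksets", []):
        for linksetdb in linkset.get("linksetdbs", []):
            if linkname is not None and linksetdb.get("linkname") != linkname:
                continue
            for link in linksetdb.get("links", []):
                link = str(link)
                if link not in seen:
                    seen.add(link)
                    ids.append(link)
    return ids


# Placeholder response in the assumed shape (IDs are not real records).
sample = {
    "linksets": [{
        "dbfrom": "gene",
        "ids": ["672"],
        "linksetdbs": [
            {"dbto": "clinvar", "linkname": "gene_clinvar", "links": ["1001", "1002"]},
            {"dbto": "clinvar", "linkname": "example_other_link", "links": ["1002", "1003"]},
        ],
    }]
}

print(extract_link_ids(sample))
print(extract_link_ids(sample, linkname="gene_clinvar"))
```

Filtering by `linkname` matters when a gene has several kinds of links to the same database.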

Publication Mining

Use Case

Find genes mentioned in recent publications or link genes to literature.

Workflow

Search genes mentioned in specific publications

# Find genes mentioned in papers about CRISPR
python scripts/query_gene.py --search "CRISPR[text word]" --organism human --max-results 100

Get PubMed articles for a gene

# Get all publications for BRCA1
curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id=672&retmode=json"

Search by author or journal

# Find genes studied by specific research group
python scripts/query_gene.py --search "Smith J[author] AND 2024[pdat]" --organism human

Extract gene-publication relationships

# Example: Build gene-publication network
from scripts.query_gene import esearch, esummary
import urllib.request
import json

# Get gene
gene_id = '672'

# Get publications for gene
url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id={gene_id}&retmode=json"

with urllib.request.urlopen(url) as response:
    data = json.loads(response.read().decode())

# Extract PMIDs (a gene can have several link sets, so de-duplicate)
pmids = set()
for linkset in data.get('linksets', []):
    for linksetdb in linkset.get('linksetdbs', []):
        pmids.update(linksetdb.get('links', []))

print(f"Gene {gene_id} has {len(pmids)} publications")

Applications

Literature reviews
Grant writing
Knowledge base construction
Trend analysis in genomics research
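
Once PMIDs have been collected for several genes, the gene-publication network can be inverted to find papers that mention more than one gene. This is a sketch under the assumption that the input is a plain mapping of gene ID to PMID list; the PMIDs below are placeholders and `shared_publications` is a hypothetical name.

```python
from collections import defaultdict


def shared_publications(gene_to_pmids, min_genes=2):
    """Map each PMID to the genes citing it, keeping PMIDs shared by >= min_genes."""
    pmid_to_genes = defaultdict(set)
    for gene_id, pmids in gene_to_pmids.items():
        for pmid in pmids:
            pmid_to_genes[str(pmid)].add(str(gene_id))
    return {
        pmid: sorted(genes)
        for pmid, genes in pmid_to_genes.items()
        if len(genes) >= min_genes
    }


# Placeholder PMIDs for two genes (672 = BRCA1, 675 = BRCA2).
network = {
    "672": ["111", "222", "333"],
    "675": ["222", "333", "444"],
}

for pmid, genes in sorted(shared_publications(network).items()):
    print(f"PMID {pmid} links genes {', '.join(genes)}")
```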

Advanced Patterns

Combining Multiple Searches

# Example: Find genes at intersection of multiple criteria
from scripts.query_gene import esearch

def find_genes_multi_criteria(organism='human', retmax=1000):
    # Each search returns at most retmax IDs, so the intersection is only
    # complete when every result set fits within that limit.

    # Criteria 1: Disease association
    disease_genes = set(esearch(f"diabetes[disease] AND {organism}[organism]", retmax=retmax))

    # Criteria 2: Chromosome location
    chr_genes = set(esearch(f"11[chromosome] AND {organism}[organism]", retmax=retmax))

    # Criteria 3: Gene type
    coding_genes = set(esearch(f"protein coding[gene type] AND {organism}[organism]", retmax=retmax))

    # Intersection
    candidates = disease_genes & chr_genes & coding_genes

    return list(candidates)
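
An alternative to intersecting result sets on the client is to join the criteria into one query and let the server do the intersection, which avoids truncation by `retmax` on the individual searches. A minimal sketch; `build_query` is a hypothetical helper and the field tags are the ones used in the function above.

```python
def build_query(*terms):
    """Join non-empty search terms with AND, parenthesising any OR groups."""
    cleaned = [term.strip() for term in terms if term and term.strip()]
    return " AND ".join(
        f"({term})" if " OR " in term else term for term in cleaned
    )


query = build_query(
    "diabetes[disease]",
    "11[chromosome]",
    "protein coding[gene type]",
    "human[organism]",
)
print(query)
```

The resulting string can be passed to a single search call in place of the three separate ones.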

Rate-Limited Batch Processing

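import time
from scripts.query_gene import esummary

def process_genes_with_rate_limit(gene_ids, batch_size=200, delay=0.34):
    # NCBI allows 3 requests/second without an API key and 10/second with one;
    # the default delay stays under the keyless limit.
    results = []

    for i in range(0, len(gene_ids), batch_size):
        batch = gene_ids[i:i + batch_size]

        # Process batch
        batch_results = esummary(batch)
        results.append(batch_results)

        # Rate limit
        time.sleep(delay)

    return results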
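
The function above returns one response per batch, so callers usually want them merged into a single lookup table. This is a sketch assuming each batch has the ESummary shape used earlier (`result` keyed by gene ID) plus a `uids` list, which is an assumption here; `merge_summary_batches` is a hypothetical name.

```python
def merge_summary_batches(batches):
    """Combine per-batch ESummary-style dicts into one {gene_id: record} dict."""
    merged = {}
    for batch in batches:
        result = batch.get("result", {})
        for uid in result.get("uids", []):
            if uid in result:
                merged[uid] = result[uid]
    return merged


# Two hand-written batches in the assumed shape.
batches = [
    {"result": {"uids": ["672"], "672": {"name": "BRCA1"}}},
    {"result": {"uids": ["7157", "999"], "7157": {"name": "TP53"}}},
]

genes = merge_summary_batches(batches)
print(sorted(genes))
```

IDs listed in `uids` without a matching record, such as `999` in the sample, are skipped rather than raising.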

Error Handling and Retry

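import time
# Assumption: fetch_gene_by_id is provided by scripts/fetch_gene_data.py
from scripts.fetch_gene_data import fetch_gene_by_id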

def robust_gene_fetch(gene_id, max_retries=3):
    for attempt in range(max_retries):
        try:
            data = fetch_gene_by_id(gene_id)
            return data
        except Exception as e:
            if attempt < max_retries - 1:
                wait = 2 ** attempt  # Exponential backoff
                time.sleep(wait)
            else:
                print(f"Failed to fetch gene {gene_id}: {e}")
                return None
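
The same backoff idea can be written as a reusable wrapper that works with any fetch function. This is a sketch, not the scripts' own implementation: `retry` is a hypothetical helper, it retries only on `OSError` (which covers `urllib.error.URLError`), and the `sleep` argument is injectable so the example runs without waiting.

```python
import time


def retry(func, max_retries=3, base_delay=1.0, sleep=time.sleep,
          retry_on=(OSError,)):
    """Call func(); on a listed error, wait base_delay * 2**attempt and retry."""
    for attempt in range(max_retries):
        try:
            return func()
        except retry_on:
            if attempt == max_retries - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Simulated fetch that fails twice before succeeding.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError("temporary network failure")
    return {"gene_id": "672"}


delays = []
data = retry(flaky_fetch, sleep=delays.append)  # record delays instead of sleeping
print(data, delays)
```

Re-raising after the final attempt lets the caller decide how to handle a persistent failure, instead of returning `None`.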

Tips and Best Practices

Start Specific, Then Broaden: Begin with precise queries and expand if needed
Use Organism Filters: Always specify organism for gene symbol searches
Validate Results: Check gene IDs and symbols for accuracy
Cache Frequently Used Data: Store common queries locally
Monitor Rate Limits: Use an API key (10 requests per second instead of 3) and add delays between requests
Combine APIs: Use E-utilities for search, Datasets API for detailed data
Handle Ambiguity: Gene symbols may refer to different genes in different species
Check Data Currency: Gene annotations are updated regularly
Use Batch Operations: Process multiple genes together when possible
Document Your Queries: Keep records of search terms and parameters
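
The caching tip can be implemented with a small file-based cache keyed by the search term. This is a minimal sketch with no expiry, so it does not address data currency. `cached_query` and the `fetch` callback are hypothetical, and the callback stands in for whatever search function is used.

```python
import hashlib
import json
import os
import tempfile


def cached_query(term, fetch, cache_dir):
    """Return fetch(term), storing the JSON result on disk for later calls."""
    os.makedirs(cache_dir, exist_ok=True)
    key = hashlib.sha256(term.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, f"{key}.json")
    if os.path.exists(path):
        with open(path, "r", encoding="utf-8") as handle:
            return json.load(handle)
    data = fetch(term)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(data, handle)
    return data


# Stand-in for a real search call; counts how often it is invoked.
calls = []


def fake_search(term):
    calls.append(term)
    return {"term": term, "ids": ["672"]}


with tempfile.TemporaryDirectory() as cache_dir:
    first = cached_query("BRCA1[sym] AND human[organism]", fake_search, cache_dir)
    second = cached_query("BRCA1[sym] AND human[organism]", fake_search, cache_dir)

print(first == second, len(calls))
```

Hashing the term gives a filesystem-safe file name even when the query contains brackets and spaces.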
