Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/gene-database/references/common_workflows.md
+++ b/skills/gene-database/references/common_workflows.md
@@ -0,0 +1,428 @@
+# Common Gene Database Workflows
+
+This document provides examples of common workflows and use cases for working with the NCBI Gene database.
+
+## Table of Contents
+
+1. [Disease Gene Discovery](#disease-gene-discovery)
+2. [Gene Annotation Pipeline](#gene-annotation-pipeline)
+3. [Cross-Species Gene Comparison](#cross-species-gene-comparison)
+4. [Pathway Analysis](#pathway-analysis)
+5. [Variant Analysis](#variant-analysis)
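+6. [Publication Mining](#publication-mining)
+7. [Advanced Patterns](#advanced-patterns)
+8. [Tips and Best Practices](#tips-and-best-practices)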
+
+---
+
+## Disease Gene Discovery
+
+### Use Case
+
+Identify genes associated with a specific disease or phenotype.
+
+### Workflow
+
+1. **Search by disease name**
+
+```bash
+# Find genes associated with Alzheimer's disease
+python scripts/query_gene.py --search "Alzheimer disease[disease]" --organism human --max-results 50
+```
+
+2. **Filter by chromosome location**
+
+```bash
+# Find genes on chromosome 17 associated with breast cancer
+python scripts/query_gene.py --search "breast cancer[disease] AND 17[chromosome]" --organism human
+```
+
+3. **Retrieve detailed information**
+
+```python
+# Python example: Get gene details for disease-associated genes
+import json
+from scripts.query_gene import esearch, esummary
+
+# Search for genes
+query = "diabetes[disease] AND human[organism]"
+gene_ids = esearch(query, retmax=100, api_key="YOUR_KEY")
+
+# Get summaries
+summaries = esummary(gene_ids, api_key="YOUR_KEY")
+
+# Extract relevant information
+for gene_id in gene_ids:
+    if gene_id in summaries['result']:
+        gene = summaries['result'][gene_id]
+        print(f"{gene['name']}: {gene['description']}")
+```
+
+### Expected Output
+
+- List of genes with disease associations
+- Gene symbols, descriptions, and chromosomal locations
+- Related publications and clinical annotations
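+
+The expected output above can be collected into a flat table. The following is a minimal sketch, assuming the ESummary JSON layout used in step 3 (a `result` object keyed by gene ID, plus a `uids` list) and the `name`, `description`, `chromosome`, and `maplocation` fields. `summaries_to_rows` and `rows_to_tsv` are hypothetical helpers, and the sample payload is hand-written for illustration rather than captured from the API.
+
+```python
+import csv
+import io
+
+FIELDS = ["gene_id", "symbol", "description", "chromosome", "map_location"]
+
+def summaries_to_rows(summaries):
+    """Flatten an ESummary-style payload into a list of row dicts."""
+    result = summaries.get("result", {})
+    rows = []
+    for gene_id in result.get("uids", []):
+        gene = result.get(gene_id, {})
+        rows.append({
+            "gene_id": gene_id,
+            "symbol": gene.get("name", ""),
+            "description": gene.get("description", ""),
+            "chromosome": gene.get("chromosome", ""),
+            "map_location": gene.get("maplocation", ""),
+        })
+    return rows
+
+def rows_to_tsv(rows):
+    """Serialize rows as tab-separated text with a header line."""
+    buffer = io.StringIO()
+    writer = csv.DictWriter(buffer, fieldnames=FIELDS, delimiter="\t", lineterminator="\n")
+    writer.writeheader()
+    writer.writerows(rows)
+    return buffer.getvalue()
+
+# Hand-written sample in the assumed layout (not real API output)
+sample = {
+    "result": {
+        "uids": ["3630"],
+        "3630": {"name": "INS", "description": "insulin",
+                 "chromosome": "11", "maplocation": "11p15.5"},
+    }
+}
+print(rows_to_tsv(summaries_to_rows(sample)))
+```
+
+Using `.get` with empty-string defaults keeps the table rectangular when a record lacks one of the fields.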
+
+---
+
+## Gene Annotation Pipeline
+
+### Use Case
+
+Annotate a list of gene identifiers with comprehensive metadata.
+
+### Workflow
+
+1. **Prepare gene list**
+
+Create a file `genes.txt` with gene symbols (one per line):
+```
+BRCA1
+TP53
+EGFR
+KRAS
+```
+
+2. **Batch lookup**
+
+```bash
+python scripts/batch_gene_lookup.py --file genes.txt --organism human --output annotations.json --api-key YOUR_KEY
+```
+
+3. **Parse results**
+
+```python
+import json
+
+with open('annotations.json', 'r') as f:
+    genes = json.load(f)
+
+for gene in genes:
+    if 'gene_id' in gene:
+        print(f"Symbol: {gene['symbol']}")
+        print(f"ID: {gene['gene_id']}")
+        print(f"Description: {gene['description']}")
+        print(f"Location: {gene['map_location']} (chromosome {gene['chromosome']})")
+        print()
+```
+
+4. **Enrich with sequence data**
+
+```bash
+# Get detailed data including sequences for specific genes
+python scripts/fetch_gene_data.py --gene-id 672 --verbose > BRCA1_detailed.json
+```
+
+### Applications
+
+- Creating gene annotation tables for publications
+- Validating gene lists before analysis
+- Building gene reference databases
+- Quality control for genomic pipelines
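+
+For the gene-list validation use case, it can help to separate symbols that resolved from those that did not. This is a minimal sketch, assuming the record schema implied by the parsing step above, where a resolved record carries a `gene_id` key and an unresolved one does not. `split_lookup_results` is a hypothetical helper and the records are hand-written samples.
+
+```python
+def split_lookup_results(records):
+    """Separate resolved records from ones the lookup could not resolve."""
+    found, missing = [], []
+    for record in records:
+        # Assumption: only resolved records contain a 'gene_id' key
+        if "gene_id" in record:
+            found.append(record)
+        else:
+            missing.append(record)
+    return found, missing
+
+# Hand-written sample records in the assumed schema
+records = [
+    {"symbol": "BRCA1", "gene_id": "672"},
+    {"symbol": "TP53", "gene_id": "7157"},
+    {"symbol": "NOTAGENE"},
+]
+
+found, missing = split_lookup_results(records)
+print(f"Resolved {len(found)} of {len(records)} symbols")
+for record in missing:
+    print(f"Unresolved: {record.get('symbol', '<unknown>')}")
+```
+
+Reviewing the unresolved list before downstream analysis catches typos and outdated symbols early.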
+
+---
+
+## Cross-Species Gene Comparison
+
+### Use Case
+
+Find orthologs or compare the same gene across different species.
+
+### Workflow
+
+1. **Search for gene in multiple organisms**
+
+```bash
+# Find TP53 in human
+python scripts/fetch_gene_data.py --symbol TP53 --taxon human
+
+# Find the mouse ortholog (official symbol: Trp53)
+python scripts/fetch_gene_data.py --symbol Trp53 --taxon mouse
+
+# Find the zebrafish ortholog (official symbol: tp53)
+python scripts/fetch_gene_data.py --symbol tp53 --taxon zebrafish
+```
+
+2. **Compare gene IDs across species**
+
+```python
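+# Compare gene information across species
+# Assumption: fetch_gene_by_symbol is importable from scripts/fetch_gene_data.py
+# and accepts (symbol, taxon); adjust to the actual signature.
+from scripts.fetch_gene_data import fetch_gene_by_symbol
+
+# Official symbols differ between species, so store one per organism
+species = {
+    'human': ('9606', 'TP53'),
+    'mouse': ('10090', 'Trp53'),
+    'rat': ('10116', 'Tp53')
+}
+
+for organism, (taxon_id, gene_symbol) in species.items():
+    # Fetch gene data for this organism's symbol and taxon ID
+    gene_data = fetch_gene_by_symbol(gene_symbol, taxon_id)
+    print(f"{organism}: {gene_data}")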
+```
+
+3. **Find orthologs using ELink**
+
+```bash
+# Get HomoloGene links for a gene
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=homologene&id=7157&retmode=json"
+```
+
+### Applications
+
+- Evolutionary studies
+- Model organism research
+- Comparative genomics
+- Cross-species experimental design
+
+---
+
+## Pathway Analysis
+
+### Use Case
+
+Identify genes involved in specific biological pathways or processes.
+
+### Workflow
+
+1. **Search by Gene Ontology (GO) term**
+
+```bash
+# Find genes involved in apoptosis
+python scripts/query_gene.py --search "GO:0006915[biological process]" --organism human --max-results 100
+```
+
+2. **Search by pathway name**
+
+```bash
+# Find genes in insulin signaling pathway
+python scripts/query_gene.py --search "insulin signaling pathway[pathway]" --organism human
+```
+
+3. **Get pathway-related genes**
+
+```python
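+# Example: Get all genes in a specific pathway
+import json
+import urllib.parse
+import urllib.request
+
+# Search for pathway genes
+query = "MAPK signaling pathway[pathway] AND human[organism]"
+# URL-encode the parameters: the raw query contains spaces and brackets
+params = urllib.parse.urlencode({"db": "gene", "term": query, "retmode": "json", "retmax": 200})
+url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?{params}"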
+
+with urllib.request.urlopen(url) as response:
+    data = json.loads(response.read().decode())
+    gene_ids = data['esearchresult']['idlist']
+
+print(f"Found {len(gene_ids)} genes in MAPK signaling pathway")
+```
+
+4. **Batch retrieve gene details**
+
+```bash
+# Get details for a set of pathway genes by Gene ID
+python scripts/batch_gene_lookup.py --ids 5594,5595,5603,5604 --output mapk_genes.json
+```
+
+### Applications
+
+- Pathway enrichment analysis
+- Gene set analysis
+- Systems biology studies
+- Drug target identification
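+
+A pathway query can match more genes than a single `retmax` window returns, in which case the search has to be paged. The sketch below builds ESearch URLs with the standard `db`, `term`, `retmode`, `retmax`, `retstart`, and `api_key` parameters. `build_esearch_url` and `page_urls` are hypothetical helpers, and the field tag in the sample query is copied from the workflow above rather than verified here. No request is sent.
+
+```python
+from urllib.parse import parse_qs, urlencode, urlsplit
+
+ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
+
+def build_esearch_url(term, retmax=200, retstart=0, api_key=None):
+    """Return a fully URL-encoded ESearch URL for the Gene database."""
+    params = {
+        "db": "gene",
+        "term": term,
+        "retmode": "json",
+        "retmax": retmax,
+        "retstart": retstart,
+    }
+    if api_key:
+        params["api_key"] = api_key
+    return f"{ESEARCH_URL}?{urlencode(params)}"
+
+def page_urls(term, total, page_size=200):
+    """Return one URL per page needed to cover `total` matching records."""
+    return [
+        build_esearch_url(term, retmax=page_size, retstart=start)
+        for start in range(0, total, page_size)
+    ]
+
+url = build_esearch_url("MAPK signaling pathway[pathway] AND human[organism]")
+print(url)
+
+# Round-trip check: the encoded term decodes back to the original query
+print(parse_qs(urlsplit(url).query)["term"][0])
+```
+
+The total to pass to `page_urls` would come from the `count` field of the first ESearch response.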
+
+---
+
+## Variant Analysis
+
+### Use Case
+
+Find genes with clinically relevant variants or disease-associated mutations.
+
+### Workflow
+
+1. **Search for genes with clinical variants**
+
+```bash
+# Find genes with pathogenic variants
+python scripts/query_gene.py --search "pathogenic[clinical significance]" --organism human --max-results 50
+```
+
+2. **Link to ClinVar database**
+
+```bash
+# Get ClinVar records for a gene
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=clinvar&id=672&retmode=json"
+```
+
+3. **Search for pharmacogenomic genes**
+
+```bash
+# Find genes associated with drug response
+python scripts/query_gene.py --search "pharmacogenomic[property]" --organism human
+```
+
+4. **Get variant summary data**
+
+```python
+# Example: Get genes with known variants
+from scripts.query_gene import esearch, efetch
+
+# Search for genes with variants
+gene_ids = esearch("has variants[filter] AND human[organism]", retmax=100)
+
+# Fetch detailed records
+for gene_id in gene_ids[:10]:  # First 10
+    data = efetch([gene_id], retmode='xml')
+    # Parse XML for variant information
+    print(f"Gene {gene_id} variant data...")
+```
+
+### Applications
+
+- Clinical genetics
+- Precision medicine
+- Pharmacogenomics
+- Genetic counseling
+
+---
+
+## Publication Mining
+
+### Use Case
+
+Find genes mentioned in recent publications or link genes to literature.
+
+### Workflow
+
+1. **Search genes mentioned in specific publications**
+
+```bash
+# Find gene records whose text mentions CRISPR
+python scripts/query_gene.py --search "CRISPR[text word]" --organism human --max-results 100
+```
+
+2. **Get PubMed articles for a gene**
+
+```bash
+# Get all publications for BRCA1
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id=672&retmode=json"
+```
+
+3. **Search by author or journal**
+
+```bash
+# Find genes studied by specific research group
+python scripts/query_gene.py --search "Smith J[author] AND 2024[pdat]" --organism human
+```
+
+4. **Extract gene-publication relationships**
+
+```python
+# Example: Build gene-publication network
+import json
+import urllib.request
+
+# Get gene
+gene_id = '672'
+
+# Get publications for gene
+url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id={gene_id}&retmode=json"
+
+with urllib.request.urlopen(url) as response:
+    data = json.loads(response.read().decode())
+
+# Extract PMIDs
+pmids = []
+for linkset in data.get('linksets', []):
+    for linksetdb in linkset.get('linksetdbs', []):
+        pmids.extend(linksetdb.get('links', []))
+
+print(f"Gene {gene_id} has {len(pmids)} publications")
+```
+
+### Applications
+
+- Literature reviews
+- Grant writing
+- Knowledge base construction
+- Trend analysis in genomics research
+
+---
+
+## Advanced Patterns
+
+### Combining Multiple Searches
+
+```python
+# Example: Find genes at intersection of multiple criteria
+from scripts.query_gene import esearch
+
+def find_genes_multi_criteria(organism='human'):
+    # Each search returns at most retmax IDs, so raise retmax (or page)
+    # when a criterion can match more genes than the default limit.
+
+    # Criterion 1: Disease association
+    disease_genes = set(esearch(f"diabetes[disease] AND {organism}[organism]"))
+
+    # Criterion 2: Chromosome location
+    chr_genes = set(esearch(f"11[chromosome] AND {organism}[organism]"))
+
+    # Criterion 3: Gene type
+    coding_genes = set(esearch(f"protein coding[gene type] AND {organism}[organism]"))
+
+    # Intersection
+    candidates = disease_genes & chr_genes & coding_genes
+
+    return list(candidates)
+```
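+
+An alternative to intersecting ID sets on the client is to let the server evaluate one combined query, which avoids losing matches when any individual result list is truncated at `retmax`. This is a minimal sketch of building such a query string. `combine_terms` is a hypothetical helper, and the field tags are copied from the example above rather than verified here.
+
+```python
+def combine_terms(*terms, organism=None):
+    """Join search terms with AND, optionally adding an organism filter."""
+    # Parenthesize each term so internal ORs keep their grouping
+    parts = [f"({term})" for term in terms if term]
+    if organism:
+        parts.append(f"{organism}[organism]")
+    return " AND ".join(parts)
+
+query = combine_terms(
+    "diabetes[disease]",
+    "11[chromosome]",
+    "protein coding[gene type]",
+    organism="human",
+)
+print(query)
+```
+
+The resulting string can be passed to a single search call in place of the three separate searches.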
+
+### Rate-Limited Batch Processing
+
+```python
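+import time
+from scripts.query_gene import esummary
+
+def process_genes_with_rate_limit(gene_ids, batch_size=200, delay=0.34):
+    # NCBI allows 3 requests/second without an API key and 10 with one;
+    # the 0.34 s default stays under the keyless limit.
+    results = []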
+
+    for i in range(0, len(gene_ids), batch_size):
+        batch = gene_ids[i:i + batch_size]
+
+        # Process batch
+        batch_results = esummary(batch)
+        results.append(batch_results)
+
+        # Rate limit
+        time.sleep(delay)
+
+    return results
+```
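+
+The batching and pacing logic can be separated from the request itself, which makes it testable without network access. The sketch below assumes NCBI's published limits of 3 requests per second without an API key and 10 with one. `chunked`, `min_delay`, and `run_batches` are hypothetical helpers, and the handler here is a stand-in for a real request function.
+
+```python
+import time
+
+def chunked(items, size):
+    """Yield successive slices of at most `size` items."""
+    if size < 1:
+        raise ValueError("size must be at least 1")
+    for start in range(0, len(items), size):
+        yield items[start:start + size]
+
+def min_delay(has_api_key):
+    """Return the shortest pause that respects the assumed request limits."""
+    return 1 / 10 if has_api_key else 1 / 3
+
+def run_batches(items, size, handler, has_api_key=False, sleep=time.sleep):
+    """Call `handler` once per batch, pausing between consecutive calls."""
+    results = []
+    for index, batch in enumerate(chunked(items, size)):
+        if index > 0:
+            sleep(min_delay(has_api_key))  # no pause before the first request
+        results.append(handler(batch))
+    return results
+
+# Stand-in handler that reports the batch size instead of sending a request
+sizes = run_batches(list(range(450)), 200, handler=len, sleep=lambda seconds: None)
+print(sizes)  # [200, 200, 50]
+```
+
+Passing `sleep` as a parameter lets tests replace the real delay with a no-op.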
+
+### Error Handling and Retry
+
+```python
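+import time
+# Assumption: fetch_gene_by_id is importable from scripts/fetch_gene_data.py
+from scripts.fetch_gene_data import fetch_gene_by_id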
+
+def robust_gene_fetch(gene_id, max_retries=3):
+    for attempt in range(max_retries):
+        try:
+            data = fetch_gene_by_id(gene_id)
+            return data
+        except Exception as e:
+            if attempt < max_retries - 1:
+                wait = 2 ** attempt  # Exponential backoff
+                time.sleep(wait)
+            else:
+                print(f"Failed to fetch gene {gene_id}: {e}")
+                return None
+```
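+
+Catching every `Exception` also retries programming errors that will never succeed. A narrower variant, sketched below, retries only on network-level failures and re-raises anything else immediately. `retry` is a hypothetical helper, the choice of `urllib.error.URLError` and `TimeoutError` as retryable is an assumption about how the fetch function fails, and `flaky` simulates a call that succeeds on its third attempt.
+
+```python
+import time
+import urllib.error
+
+def retry(func, *args, max_retries=3, base_delay=1.0,
+          retry_on=(urllib.error.URLError, TimeoutError), sleep=time.sleep):
+    """Call func(*args), retrying retryable errors with exponential backoff."""
+    for attempt in range(max_retries):
+        try:
+            return func(*args)
+        except retry_on:
+            if attempt == max_retries - 1:
+                raise  # out of attempts: surface the last error
+            sleep(base_delay * 2 ** attempt)
+
+# Simulated flaky call: fails twice, then succeeds
+calls = {"count": 0}
+
+def flaky(gene_id):
+    calls["count"] += 1
+    if calls["count"] < 3:
+        raise urllib.error.URLError("temporary failure")
+    return {"gene_id": gene_id}
+
+print(retry(flaky, "672", sleep=lambda seconds: None))
+```
+
+Re-raising on the final attempt, rather than returning `None`, lets the caller decide whether a failed gene should be skipped or should stop the run.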
+
+---
+
+## Tips and Best Practices
+
+1. **Start Specific, Then Broaden**: Begin with precise queries and expand if needed
+2. **Use Organism Filters**: Always specify organism for gene symbol searches
+3. **Validate Results**: Check gene IDs and symbols for accuracy
+4. **Cache Frequently Used Data**: Store common queries locally
+5. **Monitor Rate Limits**: Use API keys and implement delays (NCBI allows 3 requests per second without a key and 10 with one)
+6. **Combine APIs**: Use E-utilities for search, Datasets API for detailed data
+7. **Handle Ambiguity**: Gene symbols may refer to different genes in different species
+8. **Check Data Currency**: Gene annotations are updated regularly
+9. **Use Batch Operations**: Process multiple genes together when possible
+10. **Document Your Queries**: Keep records of search terms and parameters
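+
+Tip 4 can be sketched as a small on-disk cache keyed by the query string. This is an illustrative sketch only: `cached_query` is a hypothetical helper, the one-JSON-file-per-query layout is an assumption, and `fake_fetch` stands in for a real search function.
+
+```python
+import hashlib
+import json
+import tempfile
+from pathlib import Path
+
+def cached_query(term, fetch, cache_dir):
+    """Return the cached result for `term`, calling `fetch` only on a miss."""
+    cache_dir = Path(cache_dir)
+    cache_dir.mkdir(parents=True, exist_ok=True)
+    # Hash the query so any search string maps to a safe file name
+    key = hashlib.sha256(term.encode("utf-8")).hexdigest()
+    path = cache_dir / f"{key}.json"
+    if path.exists():
+        return json.loads(path.read_text(encoding="utf-8"))
+    result = fetch(term)
+    path.write_text(json.dumps(result), encoding="utf-8")
+    return result
+
+# Stand-in fetcher that records how often it is called
+calls = []
+
+def fake_fetch(term):
+    calls.append(term)
+    return ["672", "7157"]
+
+with tempfile.TemporaryDirectory() as tmp:
+    first = cached_query("example[query]", fake_fetch, tmp)
+    second = cached_query("example[query]", fake_fetch, tmp)
+    print(first == second, len(calls))  # True 1
+```
+
+This sketch never expires entries, so in light of tip 8 a real cache would also need a way to refresh stale results.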