Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/gene-database/references/api_reference.md
+++ b/skills/gene-database/references/api_reference.md
@@ -0,0 +1,404 @@
+# NCBI Gene API Reference
+
+This document provides detailed API documentation for accessing NCBI Gene database programmatically.
+
+## Table of Contents
+
+1. [E-utilities API](#e-utilities-api)
+2. [NCBI Datasets API](#ncbi-datasets-api)
+3. [Authentication and Rate Limits](#authentication-and-rate-limits)
+4. [Error Handling](#error-handling)
+
+---
+
+## E-utilities API
+
+E-utilities (Entrez Programming Utilities) provide a stable interface to NCBI's Entrez databases.
+
+### Base URL
+
+```
+https://eutils.ncbi.nlm.nih.gov/entrez/eutils/
+```
+
+### Common Parameters
+
+- `db` - Database name (use `gene` for Gene database)
+- `api_key` - API key for higher rate limits
+- `retmode` - Return format (json, xml, text)
+- `retmax` - Maximum number of records to return
+
+### ESearch - Search Database
+
+Search for genes matching a text query.
+
+**Endpoint:** `esearch.fcgi`
+
+**Parameters:**
+- `db=gene` (required) - Database to search
+- `term` (required) - Search query
+- `retmax` - Maximum results (default: 20)
+- `retmode` - json or xml (default: xml)
+- `usehistory=y` - Store results on history server for large result sets
+
+**Query Syntax:**
+- Gene symbol: `BRCA1[gene]` or `BRCA1[gene name]`
+- Organism: `human[organism]` or `9606[taxid]`
+- Combine terms: `BRCA1[gene] AND human[organism]`
+- Disease: `muscular dystrophy[disease]`
+- Chromosome: `17q21[chromosome]`
+- GO terms: `GO:0006915[biological process]`
+
+**Example Request:**
+
+```bash
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term=BRCA1[gene]+AND+human[organism]&retmode=json"
+```
+
+**Response Format (JSON):**
+
+```json
+{
+  "esearchresult": {
+    "count": "1",
+    "retmax": "1",
+    "retstart": "0",
+    "idlist": ["672"],
+    "translationset": [],
+    "querytranslation": "BRCA1[Gene Name] AND human[Organism]"
+  }
+}
+```
+
+### ESummary - Document Summaries
+
+Retrieve document summaries for Gene IDs.
+
+**Endpoint:** `esummary.fcgi`
+
+**Parameters:**
+- `db=gene` (required) - Database
+- `id` (required) - Comma-separated Gene IDs (up to 500)
+- `retmode` - json or xml (default: xml)
+
+**Example Request:**
+
+```bash
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=gene&id=672&retmode=json"
+```
+
+**Response Format (JSON):**
+
+```json
+{
+  "result": {
+    "672": {
+      "uid": "672",
+      "name": "BRCA1",
+      "description": "BRCA1 DNA repair associated",
+      "organism": {
+        "scientificname": "Homo sapiens",
+        "commonname": "human",
+        "taxid": 9606
+      },
+      "chromosome": "17",
+      "geneticsource": "genomic",
+      "maplocation": "17q21.31",
+      "nomenclaturesymbol": "BRCA1",
+      "nomenclaturename": "BRCA1 DNA repair associated"
+    }
+  }
+}
+```
+
+### EFetch - Full Records
+
+Fetch detailed gene records in various formats.
+
+**Endpoint:** `efetch.fcgi`
+
+**Parameters:**
+- `db=gene` (required) - Database
+- `id` (required) - Comma-separated Gene IDs
+- `retmode` - xml, text, asn.1 (default: xml)
+- `rettype` - gene_table, docsum
+
+**Example Request:**
+
+```bash
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=gene&id=672&retmode=xml"
+```
+
+**XML Response:** Contains detailed gene information including:
+- Gene nomenclature
+- Sequence locations
+- Transcript variants
+- Protein products
+- Gene Ontology annotations
+- Cross-references
+- Publications
+
+### ELink - Related Records
+
+Find related records in Gene or other databases.
+
+**Endpoint:** `elink.fcgi`
+
+**Parameters:**
+- `dbfrom=gene` (required) - Source database
+- `db` (required) - Target database (gene, nuccore, protein, pubmed, etc.)
+- `id` (required) - Gene ID(s)
+
+**Example Request:**
+
+```bash
+# Get related PubMed articles
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id=672&retmode=json"
+```
+
+### EInfo - Database Information
+
+Get information about the Gene database.
+
+**Endpoint:** `einfo.fcgi`
+
+**Parameters:**
+- `db=gene` - Database to query
+
+**Example Request:**
+
+```bash
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=gene&retmode=json"
+```
+
+---
+
+## NCBI Datasets API
+
+The Datasets API provides streamlined access to gene data with metadata and sequences.
+
+### Base URL
+
+```
+https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene
+```
+
+### Authentication
+
+Include API key in request headers:
+
+```
+api-key: YOUR_API_KEY
+```
+
+### Get Gene by ID
+
+Retrieve gene data by Gene ID.
+
+**Endpoint:** `GET /gene/id/{gene_id}`
+
+**Example Request:**
+
+```bash
+curl "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene/id/672"
+```
+
+**Response Format (JSON):**
+
+```json
+{
+  "genes": [
+    {
+      "gene": {
+        "gene_id": "672",
+        "symbol": "BRCA1",
+        "description": "BRCA1 DNA repair associated",
+        "tax_name": "Homo sapiens",
+        "taxid": 9606,
+        "chromosomes": ["17"],
+        "type": "protein-coding",
+        "synonyms": ["BRCC1", "FANCS", "PNCA4", "RNF53"],
+        "nomenclature_authority": {
+          "authority": "HGNC",
+          "identifier": "HGNC:1100"
+        },
+        "genomic_ranges": [
+          {
+            "accession_version": "NC_000017.11",
+            "range": [
+              {
+                "begin": 43044295,
+                "end": 43170245,
+                "orientation": "minus"
+              }
+            ]
+          }
+        ],
+        "transcripts": [
+          {
+            "accession_version": "NM_007294.4",
+            "length": 7207
+          }
+        ]
+      }
+    }
+  ]
+}
+```
+
+### Get Gene by Symbol
+
+Retrieve gene data by symbol and organism.
+
+**Endpoint:** `GET /gene/symbol/{symbol}/taxon/{taxon}`
+
+**Parameters:**
+- `{symbol}` - Gene symbol (e.g., BRCA1)
+- `{taxon}` - Taxon ID (e.g., 9606 for human)
+
+**Example Request:**
+
+```bash
+curl "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene/symbol/BRCA1/taxon/9606"
+```
+
+### Get Multiple Genes
+
+Retrieve data for multiple genes.
+
+**Endpoint:** `POST /gene/id`
+
+**Request Body:**
+
+```json
+{
+  "gene_ids": ["672", "7157", "5594"]
+}
+```
+
+**Example Request:**
+
+```bash
+curl -X POST "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/gene/id" \
+  -H "Content-Type: application/json" \
+  -d '{"gene_ids": ["672", "7157", "5594"]}'
+```
+
+---
+
+## Authentication and Rate Limits
+
+### Obtaining an API Key
+
+1. Create an NCBI account at https://www.ncbi.nlm.nih.gov/account/
+2. Navigate to Settings → API Key Management
+3. Generate a new API key
+4. Include the key in requests
+
+### Rate Limits
+
+**E-utilities:**
+- Without API key: 3 requests/second
+- With API key: 10 requests/second
+
+**Datasets API:**
+- Without API key: 5 requests/second
+- With API key: 10 requests/second
+
+### Usage Guidelines
+
+1. **Include email in requests:** Add `&email=your@email.com` to E-utilities requests
+2. **Implement rate limiting:** Use delays between requests
+3. **Use POST for large queries:** When working with many IDs
+4. **Cache results:** Store frequently accessed data locally
+5. **Handle errors gracefully:** Implement retry logic with exponential backoff
+
+---
+
+## Error Handling
+
+### HTTP Status Codes
+
+- `200 OK` - Successful request
+- `400 Bad Request` - Invalid parameters or malformed query
+- `404 Not Found` - Gene ID or symbol not found
+- `429 Too Many Requests` - Rate limit exceeded
+- `500 Internal Server Error` - Server error (retry with backoff)
+
+### E-utilities Error Messages
+
+E-utilities return errors in the response body:
+
+**XML format:**
+```xml
+<ERROR>Empty id list - nothing to do</ERROR>
+```
+
+**JSON format:**
+```json
+{
+  "error": "Invalid db name"
+}
+```
+
+### Common Errors
+
+1. **Empty Result Set**
+   - Cause: Gene symbol or ID not found
+   - Solution: Verify spelling, check organism filter
+
+2. **Rate Limit Exceeded**
+   - Cause: Too many requests
+   - Solution: Add delays, use API key
+
+3. **Invalid Query Syntax**
+   - Cause: Malformed search term
+   - Solution: Use proper field tags (e.g., `[gene]`, `[organism]`)
+
+4. **Timeout**
+   - Cause: Large result set or slow connection
+   - Solution: Use History Server, reduce result size
+
+### Retry Strategy
+
+Implement exponential backoff for failed requests:
+
+```python
+import time
+
+def retry_request(func, max_attempts=3):
+    for attempt in range(max_attempts):
+        try:
+            return func()
+        except Exception as e:
+            if attempt < max_attempts - 1:
+                wait_time = 2 ** attempt  # 1s, 2s, 4s
+                time.sleep(wait_time)
+            else:
+                raise
+```
+
+---
+
+## Common Taxon IDs
+
+| Organism | Scientific Name | Taxon ID |
+|----------|----------------|----------|
+| Human | Homo sapiens | 9606 |
+| Mouse | Mus musculus | 10090 |
+| Rat | Rattus norvegicus | 10116 |
+| Zebrafish | Danio rerio | 7955 |
+| Fruit fly | Drosophila melanogaster | 7227 |
+| C. elegans | Caenorhabditis elegans | 6239 |
+| Yeast | Saccharomyces cerevisiae | 4932 |
+| Arabidopsis | Arabidopsis thaliana | 3702 |
+| E. coli | Escherichia coli | 562 |
+
+---
+
+## Additional Resources
+
+- **E-utilities Documentation:** https://www.ncbi.nlm.nih.gov/books/NBK25501/
+- **Datasets API Documentation:** https://www.ncbi.nlm.nih.gov/datasets/docs/v2/
+- **Gene Database Help:** https://www.ncbi.nlm.nih.gov/gene/
+- **API Key Registration:** https://www.ncbi.nlm.nih.gov/account/
--- a/skills/gene-database/references/common_workflows.md
+++ b/skills/gene-database/references/common_workflows.md
@@ -0,0 +1,428 @@
+# Common Gene Database Workflows
+
+This document provides examples of common workflows and use cases for working with NCBI Gene database.
+
+## Table of Contents
+
+1. [Disease Gene Discovery](#disease-gene-discovery)
+2. [Gene Annotation Pipeline](#gene-annotation-pipeline)
+3. [Cross-Species Gene Comparison](#cross-species-gene-comparison)
+4. [Pathway Analysis](#pathway-analysis)
+5. [Variant Analysis](#variant-analysis)
+6. [Publication Mining](#publication-mining)
+
+---
+
+## Disease Gene Discovery
+
+### Use Case
+
+Identify genes associated with a specific disease or phenotype.
+
+### Workflow
+
+1. **Search by disease name**
+
+```bash
+# Find genes associated with Alzheimer's disease
+python scripts/query_gene.py --search "Alzheimer disease[disease]" --organism human --max-results 50
+```
+
+2. **Filter by chromosome location**
+
+```bash
+# Find genes on chromosome 17 associated with breast cancer
+python scripts/query_gene.py --search "breast cancer[disease] AND 17[chromosome]" --organism human
+```
+
+3. **Retrieve detailed information**
+
+```python
+# Python example: Get gene details for disease-associated genes
+import json
+from scripts.query_gene import esearch, esummary
+
+# Search for genes
+query = "diabetes[disease] AND human[organism]"
+gene_ids = esearch(query, retmax=100, api_key="YOUR_KEY")
+
+# Get summaries
+summaries = esummary(gene_ids, api_key="YOUR_KEY")
+
+# Extract relevant information
+for gene_id in gene_ids:
+    if gene_id in summaries['result']:
+        gene = summaries['result'][gene_id]
+        print(f"{gene['name']}: {gene['description']}")
+```
+
+### Expected Output
+
+- List of genes with disease associations
+- Gene symbols, descriptions, and chromosomal locations
+- Related publications and clinical annotations
+
+---
+
+## Gene Annotation Pipeline
+
+### Use Case
+
+Annotate a list of gene identifiers with comprehensive metadata.
+
+### Workflow
+
+1. **Prepare gene list**
+
+Create a file `genes.txt` with gene symbols (one per line):
+```
+BRCA1
+TP53
+EGFR
+KRAS
+```
+
+2. **Batch lookup**
+
+```bash
+python scripts/batch_gene_lookup.py --file genes.txt --organism human --output annotations.json --api-key YOUR_KEY
+```
+
+3. **Parse results**
+
+```python
+import json
+
+with open('annotations.json', 'r') as f:
+    genes = json.load(f)
+
+for gene in genes:
+    if 'gene_id' in gene:
+        print(f"Symbol: {gene['symbol']}")
+        print(f"ID: {gene['gene_id']}")
+        print(f"Description: {gene['description']}")
+        print(f"Location: chr{gene['chromosome']}:{gene['map_location']}")
+        print()
+```
+
+4. **Enrich with sequence data**
+
+```bash
+# Get detailed data including sequences for specific genes
+python scripts/fetch_gene_data.py --gene-id 672 --verbose > BRCA1_detailed.json
+```
+
+### Use Cases
+
+- Creating gene annotation tables for publications
+- Validating gene lists before analysis
+- Building gene reference databases
+- Quality control for genomic pipelines
+
+---
+
+## Cross-Species Gene Comparison
+
+### Use Case
+
+Find orthologs or compare the same gene across different species.
+
+### Workflow
+
+1. **Search for gene in multiple organisms**
+
+```bash
+# Find TP53 in human
+python scripts/fetch_gene_data.py --symbol TP53 --taxon human
+
+# Find TP53 in mouse
+python scripts/fetch_gene_data.py --symbol TP53 --taxon mouse
+
+# Find TP53 in zebrafish
+python scripts/fetch_gene_data.py --symbol TP53 --taxon zebrafish
+```
+
+2. **Compare gene IDs across species**
+
+```python
+# Compare gene information across species
+species = {
+    'human': '9606',
+    'mouse': '10090',
+    'rat': '10116'
+}
+
+gene_symbol = 'TP53'
+
+for organism, taxon_id in species.items():
+    # Fetch gene data
+    # ... (use fetch_gene_by_symbol)
+    print(f"{organism}: {gene_data}")
+```
+
+3. **Find orthologs using ELink**
+
+```bash
+# Get HomoloGene links for a gene
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=homologene&id=7157&retmode=json"
+```
+
+### Applications
+
+- Evolutionary studies
+- Model organism research
+- Comparative genomics
+- Cross-species experimental design
+
+---
+
+## Pathway Analysis
+
+### Use Case
+
+Identify genes involved in specific biological pathways or processes.
+
+### Workflow
+
+1. **Search by Gene Ontology (GO) term**
+
+```bash
+# Find genes involved in apoptosis
+python scripts/query_gene.py --search "GO:0006915[biological process]" --organism human --max-results 100
+```
+
+2. **Search by pathway name**
+
+```bash
+# Find genes in insulin signaling pathway
+python scripts/query_gene.py --search "insulin signaling pathway[pathway]" --organism human
+```
+
+3. **Get pathway-related genes**
+
+```python
+# Example: Get all genes in a specific pathway
+import urllib.request
+import json
+
+# Search for pathway genes
+query = "MAPK signaling pathway[pathway] AND human[organism]"
+url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=gene&term={query}&retmode=json&retmax=200"
+
+with urllib.request.urlopen(url) as response:
+    data = json.loads(response.read().decode())
+    gene_ids = data['esearchresult']['idlist']
+
+print(f"Found {len(gene_ids)} genes in MAPK signaling pathway")
+```
+
+4. **Batch retrieve gene details**
+
+```bash
+# Get details for all pathway genes
+python scripts/batch_gene_lookup.py --ids 5594,5595,5603,5604 --output mapk_genes.json
+```
+
+### Applications
+
+- Pathway enrichment analysis
+- Gene set analysis
+- Systems biology studies
+- Drug target identification
+
+---
+
+## Variant Analysis
+
+### Use Case
+
+Find genes with clinically relevant variants or disease-associated mutations.
+
+### Workflow
+
+1. **Search for genes with clinical variants**
+
+```bash
+# Find genes with pathogenic variants
+python scripts/query_gene.py --search "pathogenic[clinical significance]" --organism human --max-results 50
+```
+
+2. **Link to ClinVar database**
+
+```bash
+# Get ClinVar records for a gene
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=clinvar&id=672&retmode=json"
+```
+
+3. **Search for pharmacogenomic genes**
+
+```bash
+# Find genes associated with drug response
+python scripts/query_gene.py --search "pharmacogenomic[property]" --organism human
+```
+
+4. **Get variant summary data**
+
+```python
+# Example: Get genes with known variants
+from scripts.query_gene import esearch, efetch
+
+# Search for genes with variants
+gene_ids = esearch("has variants[filter] AND human[organism]", retmax=100)
+
+# Fetch detailed records
+for gene_id in gene_ids[:10]:  # First 10
+    data = efetch([gene_id], retmode='xml')
+    # Parse XML for variant information
+    print(f"Gene {gene_id} variant data...")
+```
+
+### Applications
+
+- Clinical genetics
+- Precision medicine
+- Pharmacogenomics
+- Genetic counseling
+
+---
+
+## Publication Mining
+
+### Use Case
+
+Find genes mentioned in recent publications or link genes to literature.
+
+### Workflow
+
+1. **Search genes mentioned in specific publications**
+
+```bash
+# Find genes mentioned in papers about CRISPR
+python scripts/query_gene.py --search "CRISPR[text word]" --organism human --max-results 100
+```
+
+2. **Get PubMed articles for a gene**
+
+```bash
+# Get all publications for BRCA1
+curl "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id=672&retmode=json"
+```
+
+3. **Search by author or journal**
+
+```bash
+# Find genes studied by specific research group
+python scripts/query_gene.py --search "Smith J[author] AND 2024[pdat]" --organism human
+```
+
+4. **Extract gene-publication relationships**
+
+```python
+# Example: Build gene-publication network
+from scripts.query_gene import esearch, esummary
+import urllib.request
+import json
+
+# Get gene
+gene_id = '672'
+
+# Get publications for gene
+url = f"https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=gene&db=pubmed&id={gene_id}&retmode=json"
+
+with urllib.request.urlopen(url) as response:
+    data = json.loads(response.read().decode())
+
+# Extract PMIDs
+pmids = []
+for linkset in data.get('linksets', []):
+    for linksetdb in linkset.get('linksetdbs', []):
+        pmids.extend(linksetdb.get('links', []))
+
+print(f"Gene {gene_id} has {len(pmids)} publications")
+```
+
+### Applications
+
+- Literature reviews
+- Grant writing
+- Knowledge base construction
+- Trend analysis in genomics research
+
+---
+
+## Advanced Patterns
+
+### Combining Multiple Searches
+
+```python
+# Example: Find genes at intersection of multiple criteria
+def find_genes_multi_criteria(organism='human'):
+    # Criteria 1: Disease association
+    disease_genes = set(esearch("diabetes[disease] AND human[organism]"))
+
+    # Criteria 2: Chromosome location
+    chr_genes = set(esearch("11[chromosome] AND human[organism]"))
+
+    # Criteria 3: Gene type
+    coding_genes = set(esearch("protein coding[gene type] AND human[organism]"))
+
+    # Intersection
+    candidates = disease_genes & chr_genes & coding_genes
+
+    return list(candidates)
+```
+
+### Rate-Limited Batch Processing
+
+```python
+import time
+
+def process_genes_with_rate_limit(gene_ids, batch_size=200, delay=0.1):
+    results = []
+
+    for i in range(0, len(gene_ids), batch_size):
+        batch = gene_ids[i:i + batch_size]
+
+        # Process batch
+        batch_results = esummary(batch)
+        results.append(batch_results)
+
+        # Rate limit
+        time.sleep(delay)
+
+    return results
+```
+
+### Error Handling and Retry
+
+```python
+import time
+
+def robust_gene_fetch(gene_id, max_retries=3):
+    for attempt in range(max_retries):
+        try:
+            data = fetch_gene_by_id(gene_id)
+            return data
+        except Exception as e:
+            if attempt < max_retries - 1:
+                wait = 2 ** attempt  # Exponential backoff
+                time.sleep(wait)
+            else:
+                print(f"Failed to fetch gene {gene_id}: {e}")
+                return None
+```
+
+---
+
+## Tips and Best Practices
+
+1. **Start Specific, Then Broaden**: Begin with precise queries and expand if needed
+2. **Use Organism Filters**: Always specify organism for gene symbol searches
+3. **Validate Results**: Check gene IDs and symbols for accuracy
+4. **Cache Frequently Used Data**: Store common queries locally
+5. **Monitor Rate Limits**: Use API keys and implement delays
+6. **Combine APIs**: Use E-utilities for search, Datasets API for detailed data
+7. **Handle Ambiguity**: Gene symbols may refer to different genes in different species
+8. **Check Data Currency**: Gene annotations are updated regularly
+9. **Use Batch Operations**: Process multiple genes together when possible
+10. **Document Your Queries**: Keep records of search terms and parameters