Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/bioservices/references/identifier_mapping.md
+++ b/skills/bioservices/references/identifier_mapping.md
@@ -0,0 +1,685 @@
+# BioServices: Identifier Mapping Guide
+
+This guide shows how to convert identifiers between biological databases with BioServices: protein and gene IDs through UniProt, compound IDs through UniChem, and cross-references embedded in KEGG entries.
+
+## Table of Contents
+
+1. [Overview](#overview)
+2. [UniProt Mapping Service](#uniprot-mapping-service)
+3. [UniChem Compound Mapping](#unichem-compound-mapping)
+4. [KEGG Identifier Conversions](#kegg-identifier-conversions)
+5. [Common Mapping Patterns](#common-mapping-patterns)
+6. [Troubleshooting](#troubleshooting)
+7. [Best Practices](#best-practices)
+
+---
+
+## Overview
+
+Biological databases use different identifier systems. Cross-referencing requires mapping between these systems. BioServices provides multiple approaches:
+
+1. **UniProt Mapping**: Comprehensive protein/gene ID conversion
+2. **UniChem**: Chemical compound ID mapping
+3. **KEGG**: Built-in cross-references in entries
+4. **PICR**: Protein identifier cross-reference service (not covered further in this guide)
+
+---
+
+## UniProt Mapping Service
+
+The UniProt mapping service is the most comprehensive tool for protein and gene identifier conversion.
+
+### Basic Usage
+
+```python
+from bioservices import UniProt
+
+u = UniProt()
+
+# Map single ID
+result = u.mapping(
+    fr="UniProtKB_AC-ID",    # Source database
+    to="KEGG",                # Target database
+    query="P43403"            # Identifier to convert
+)
+
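+print(result)
+# Output in the form used throughout this guide: {'P43403': ['hsa:7535']}
+# Some bioservices releases may instead return a dict of the form
+# {'results': [{'from': 'P43403', 'to': 'hsa:7535'}], 'failedIds': []}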
+```
+
+### Batch Mapping
+
+```python
+# Map multiple IDs (comma-separated)
+ids = ["P43403", "P04637", "P53779"]
+result = u.mapping(
+    fr="UniProtKB_AC-ID",
+    to="KEGG",
+    query=",".join(ids)
+)
+
+for uniprot_id, kegg_ids in result.items():
+    print(f"{uniprot_id} → {kegg_ids}")
+```
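+
+The examples in this guide assume `mapping()` returns `{source_id: [target_ids]}`. Depending on the installed bioservices release, the call may instead return a dict with a `results` list of `from`/`to` pairs. A minimal sketch of a normalizer follows; `normalize_mapping` is a hypothetical helper, not part of bioservices, and both input shapes are assumptions to check against your installed version:
+
+```python
+def normalize_mapping(raw):
+    """Return {source_id: [target_id, ...]} from either assumed result shape."""
+    if not raw:
+        return {}
+    if "results" not in raw:
+        # Assumed older shape: already {source_id: [target_id, ...]}
+        return {src: list(targets) for src, targets in raw.items()}
+    normalized = {}
+    for pair in raw["results"]:
+        # Assumed newer shape: one {'from': ..., 'to': ...} dict per pair
+        normalized.setdefault(pair["from"], []).append(pair["to"])
+    return normalized
+
+
+old_style = {"P43403": ["hsa:7535"]}
+new_style = {"results": [{"from": "P43403", "to": "hsa:7535"}], "failedIds": []}
+print(normalize_mapping(old_style) == normalize_mapping(new_style))
+```
+
+Wrapping each `u.mapping(...)` call in such a helper keeps the downstream loops in this guide unchanged regardless of the shape returned.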
+
+### Supported Database Pairs
+
+UniProt supports mapping between 100+ database pairs. Key ones include:
+
+#### Protein/Gene Databases
+
+| Source Format | Code | Target Format | Code |
+|---------------|------|---------------|------|
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | KEGG | `KEGG` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl | `Ensembl` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl Protein | `Ensembl_Protein` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl Transcript | `Ensembl_Transcript` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | RefSeq Protein | `RefSeq_Protein` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | RefSeq Nucleotide | `RefSeq_Nucleotide` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | GeneID (Entrez) | `GeneID` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | HGNC | `HGNC` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | MGI | `MGI` |
+| KEGG | `KEGG` | UniProtKB | `UniProtKB` |
+| Ensembl | `Ensembl` | UniProtKB | `UniProtKB` |
+| GeneID | `GeneID` | UniProtKB | `UniProtKB` |
+
+#### Structural Databases
+
+| Source | Code | Target | Code |
+|--------|------|--------|------|
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | PDB | `PDB` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | Pfam | `Pfam` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | InterPro | `InterPro` |
+| PDB | `PDB` | UniProtKB | `UniProtKB` |
+
+#### Expression & Proteomics
+
+| Source | Code | Target | Code |
+|--------|------|--------|------|
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | PRIDE | `PRIDE` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | ProteomicsDB | `ProteomicsDB` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | PaxDb | `PaxDb` |
+
+#### Organism-Specific
+
+| Source | Code | Target | Code |
+|--------|------|--------|------|
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | FlyBase | `FlyBase` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | WormBase | `WormBase` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | SGD | `SGD` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | ZFIN | `ZFIN` |
+
+#### Other Useful Mappings
+
+| Source | Code | Target | Code |
+|--------|------|--------|------|
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | GO | `GO` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | Reactome | `Reactome` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | STRING | `STRING` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | BioGRID | `BioGRID` |
+| UniProtKB AC/ID | `UniProtKB_AC-ID` | OMA | `OMA` |
+
+### Complete List of Database Codes
+
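+The authoritative, up-to-date list of codes is published in the UniProt REST API documentation (ID mapping section). The snippet below only summarizes the naming patterns used in this guide:
+
+```python
+from bioservices import UniProt
+
+u = UniProt()
+
+# Naming patterns for the `fr` and `to` arguments of u.mapping():
+# - UniProtKB as a source is "UniProtKB_AC-ID"; as a target it is "UniProtKB"
+# - Most other databases use their standard abbreviation (e.g. "KEGG", "PDB")
+# - Sub-types are joined with an underscore (e.g. "Ensembl_Protein")
+```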
+
+### Common Database Codes Reference
+
+**Gene/Protein Identifiers:**
+- `UniProtKB_AC-ID`: UniProt accession or entry name (used as the source code)
+- `UniProtKB`: UniProtKB entry (used as the target code)
+- `KEGG`: KEGG gene IDs (e.g., hsa:7535)
+- `GeneID`: NCBI Gene (Entrez) IDs
+- `Ensembl`: Ensembl gene IDs
+- `Ensembl_Protein`: Ensembl protein IDs
+- `Ensembl_Transcript`: Ensembl transcript IDs
+- `RefSeq_Protein`: RefSeq protein IDs (NP_)
+- `RefSeq_Nucleotide`: RefSeq nucleotide IDs (NM_)
+
+**Gene Nomenclature:**
+- `HGNC`: HUGO Gene Nomenclature Committee (human gene symbols)
+- `MGI`: Mouse Genome Informatics
+- `RGD`: Rat Genome Database
+- `SGD`: Saccharomyces Genome Database
+- `FlyBase`: Drosophila database
+- `WormBase`: C. elegans database
+- `ZFIN`: Zebrafish database
+
+**Structure:**
+- `PDB`: Protein Data Bank
+- `Pfam`: Protein families
+- `InterPro`: Protein domains
+- `SUPFAM`: Superfamily
+- `PROSITE`: Protein motifs
+
+**Pathways & Networks:**
+- `Reactome`: Reactome pathways
+- `BioCyc`: BioCyc pathways
+- `PathwayCommons`: Pathway Commons
+- `STRING`: Protein-protein networks
+- `BioGRID`: Interaction database
+
+### Mapping Examples
+
+#### UniProt → KEGG
+
+```python
+from bioservices import UniProt
+
+u = UniProt()
+
+# Single mapping
+result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
+print(result)  # {'P43403': ['hsa:7535']}
+```
+
+#### KEGG → UniProt
+
+```python
+# Reverse mapping
+result = u.mapping(fr="KEGG", to="UniProtKB", query="hsa:7535")
+print(result)  # {'hsa:7535': ['P43403']}
+```
+
+#### UniProt → Ensembl
+
+```python
+# To Ensembl gene IDs
+result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query="P43403")
+print(result)  # {'P43403': ['ENSG00000115085']}
+
+# To Ensembl protein IDs
+result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl_Protein", query="P43403")
+print(result)  # {'P43403': ['ENSP00000381359']}
+```
+
+#### UniProt → PDB
+
+```python
+# Find 3D structures
+result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
+print(result)  # {'P04637': ['1A1U', '1AIE', '1C26', ...]}
+```
+
+#### UniProt → RefSeq
+
+```python
+# Get RefSeq protein IDs
+result = u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query="P43403")
+print(result)  # {'P43403': ['NP_001070.2']}
+```
+
+#### Gene Name → UniProt (via search, then mapping)
+
+```python
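+# First search for gene
+# (query fields and formats depend on the UniProt REST API version; recent
+# versions may expect organism_id:9606, frmt="tsv" and columns="accession")
+search_result = u.search("gene:ZAP70 AND organism:9606", frmt="tab", columns="id")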
+lines = search_result.strip().split("\n")
+if len(lines) > 1:
+    uniprot_id = lines[1].split("\t")[0]
+
+    # Then map to other databases
+    kegg_id = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
+    print(kegg_id)
+```
+
+---
+
+## UniChem Compound Mapping
+
+UniChem specializes in mapping chemical compound identifiers across databases.
+
+### Source Database IDs
+
+| Source ID | Database |
+|-----------|----------|
+| 1 | ChEMBL |
+| 2 | DrugBank |
+| 3 | PDB |
+| 4 | IUPHAR/BPS Guide to Pharmacology |
+| 5 | PubChem ("Drugs of the Future" subset) |
+| 6 | KEGG |
+| 7 | ChEBI |
+| 8 | NIH Clinical Collection |
+| 14 | FDA/SRS |
+| 22 | PubChem (compounds) |
+
+### Basic Usage
+
+```python
+from bioservices import UniChem
+
+u = UniChem()
+
+# Get ChEMBL ID from KEGG compound ID
+chembl_id = u.get_compound_id_from_kegg("C11222")
+print(chembl_id)  # CHEMBL278315
+```
+
+### All Compound IDs
+
+```python
+# Get all identifiers for a compound
+# src_compound_id: compound ID, src_id: source database ID
+all_ids = u.get_all_compound_ids("CHEMBL278315", src_id=1)  # 1 = ChEMBL
+
+for mapping in all_ids:
+    src_name = mapping['src_name']
+    src_compound_id = mapping['src_compound_id']
+    print(f"{src_name}: {src_compound_id}")
+```
+
+### Specific Database Conversion
+
+```python
+# Convert between specific databases
+# from_src_id=6 (KEGG), to_src_id=1 (ChEMBL)
+result = u.get_src_compound_ids("C11222", from_src_id=6, to_src_id=1)
+print(result)
+```
+
+### Common Compound Mappings
+
+#### KEGG → ChEMBL
+
+```python
+u = UniChem()
+chembl_id = u.get_compound_id_from_kegg("C00031")  # D-Glucose
+print(f"ChEMBL: {chembl_id}")
+```
+
+#### ChEMBL → PubChem
+
+```python
+result = u.get_src_compound_ids("CHEMBL278315", from_src_id=1, to_src_id=22)
+if result:
+    pubchem_id = result[0]['src_compound_id']
+    print(f"PubChem: {pubchem_id}")
+```
+
+#### ChEBI → DrugBank
+
+```python
+result = u.get_src_compound_ids("5292", from_src_id=7, to_src_id=2)
+if result:
+    drugbank_id = result[0]['src_compound_id']
+    print(f"DrugBank: {drugbank_id}")
+```
+
+---
+
+## KEGG Identifier Conversions
+
+KEGG entries contain cross-references that can be extracted by parsing the flat-file text returned by `k.get()`.
+
+### Extract Database Links from KEGG Entry
+
+```python
+from bioservices import KEGG
+
+k = KEGG()
+
+# Get compound entry
+entry = k.get("cpd:C11222")
+
+# Parse for specific database
+chebi_id = None
+uniprot_ids = []
+
+for line in entry.split("\n"):
+    if "ChEBI:" in line:
+        # Extract ChEBI ID
+        parts = line.split("ChEBI:")
+        if len(parts) > 1:
+            chebi_id = parts[1].strip().split()[0]
+
+# For genes/proteins: the UniProt line sits in the DBLINKS block and
+# may list several accessions separated by spaces
+gene_entry = k.get("hsa:7535")
+for line in gene_entry.split("\n"):
+    if "UniProt:" in line:
+        uniprot_ids.extend(line.split("UniProt:")[1].split())
+```
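+
+The line-by-line checks above can be generalized into one pass that collects every cross-reference in the `DBLINKS` block. This is a sketch under the assumption that the entry follows the usual KEGG flat-file layout (a 12-character key column followed by the value); `parse_dblinks` is a hypothetical helper and the sample text is hand-written, not fetched from KEGG:
+
+```python
+def parse_dblinks(entry_text):
+    """Return {database: [ids]} from the DBLINKS block of a KEGG flat file."""
+    links = {}
+    in_dblinks = False
+    for line in entry_text.splitlines():
+        key, value = line[:12].strip(), line[12:].strip()
+        if key:
+            # A non-empty key column starts a new section
+            in_dblinks = key == "DBLINKS"
+        if in_dblinks and ":" in value:
+            db, ids = value.split(":", 1)
+            links[db.strip()] = ids.split()
+    return links
+
+
+# Hand-written sample in the assumed layout
+sample = "\n".join([
+    "ENTRY".ljust(12) + "7535  CDS  T01001",
+    "DBLINKS".ljust(12) + "NCBI-GeneID: 7535",
+    "".ljust(12) + "Ensembl: ENSG00000115085",
+    "".ljust(12) + "UniProt: P43403",
+    "AASEQ".ljust(12) + "619",
+])
+print(parse_dblinks(sample))
+```
+
+With a live entry, the same function could be applied to the string returned by `k.get("hsa:7535")`.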
+
+### KEGG Gene ID Components
+
+KEGG gene IDs have format `organism:gene_id`:
+
+```python
+kegg_id = "hsa:7535"
+organism, gene_id = kegg_id.split(":")
+
+print(f"Organism: {organism}")  # hsa (human)
+print(f"Gene ID: {gene_id}")    # 7535
+```
+
+### KEGG Pathway to Genes
+
+```python
+k = KEGG()
+
+# Get pathway entry
+pathway = k.get("path:hsa04660")
+
+# Parse for gene list
+genes = []
+in_gene_section = False
+
+for line in pathway.split("\n"):
+    if line.startswith("GENE"):
+        in_gene_section = True   # the first gene shares this line
+    elif in_gene_section and not line.startswith(" "):
+        break                    # next section header ends the list
+
+    if in_gene_section:
+        # Columns 0-11 hold the section key; the gene ID follows
+        parts = line[12:].split()
+        if parts:
+            genes.append(f"hsa:{parts[0]}")
+
+print(f"Found {len(genes)} genes")
+```
+
+---
+
+## Common Mapping Patterns
+
+### Pattern 1: Gene Symbol → Multiple Database IDs
+
+```python
+from bioservices import UniProt
+
+def gene_symbol_to_ids(gene_symbol, organism="9606"):
+    """Convert gene symbol to multiple database IDs."""
+    u = UniProt()
+
+    # Search for gene
+    query = f"gene:{gene_symbol} AND organism:{organism}"
+    result = u.search(query, frmt="tab", columns="id")
+
+    lines = result.strip().split("\n")
+    if len(lines) < 2:
+        return None
+
+    uniprot_id = lines[1].split("\t")[0]
+
+    # Map to multiple databases
+    ids = {
+        'uniprot': uniprot_id,
+        'kegg': u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id),
+        'ensembl': u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query=uniprot_id),
+        'refseq': u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query=uniprot_id),
+        'pdb': u.mapping(fr="UniProtKB_AC-ID", to="PDB", query=uniprot_id)
+    }
+
+    return ids
+
+# Usage
+ids = gene_symbol_to_ids("ZAP70")
+print(ids)
+```
+
+### Pattern 2: Compound Name → All Database IDs
+
+```python
+from bioservices import KEGG, UniChem
+
+def compound_name_to_ids(compound_name):
+    """Search compound and get all database IDs."""
+    k = KEGG()
+
+    # Search KEGG
+    results = k.find("compound", compound_name)
+    if not results:
+        return None
+
+    # Extract KEGG ID
+    kegg_id = results.strip().split("\n")[0].split("\t")[0].replace("cpd:", "")
+
+    # Get KEGG entry for ChEBI
+    entry = k.get(f"cpd:{kegg_id}")
+    chebi_id = None
+    for line in entry.split("\n"):
+        if "ChEBI:" in line:
+            parts = line.split("ChEBI:")
+            if len(parts) > 1:
+                chebi_id = parts[1].strip().split()[0]
+                break
+
+    # Get ChEMBL from UniChem
+    u = UniChem()
+    try:
+        chembl_id = u.get_compound_id_from_kegg(kegg_id)
+    except Exception:  # no UniChem entry or service error
+        chembl_id = None
+
+    return {
+        'kegg': kegg_id,
+        'chebi': chebi_id,
+        'chembl': chembl_id
+    }
+
+# Usage
+ids = compound_name_to_ids("Geldanamycin")
+print(ids)
+```
+
+### Pattern 3: Batch ID Conversion with Error Handling
+
+```python
+from bioservices import UniProt
+
+def safe_batch_mapping(ids, from_db, to_db, chunk_size=100):
+    """Safely map IDs with error handling and chunking."""
+    u = UniProt()
+    all_results = {}
+
+    for i in range(0, len(ids), chunk_size):
+        chunk = ids[i:i+chunk_size]
+        query = ",".join(chunk)
+
+        try:
+            results = u.mapping(fr=from_db, to=to_db, query=query)
+            all_results.update(results)
+            print(f"✓ Processed {min(i+chunk_size, len(ids))}/{len(ids)}")
+
+        except Exception as e:
+            print(f"✗ Error at chunk {i}: {e}")
+
+            # Try individual IDs in failed chunk
+            for single_id in chunk:
+                try:
+                    result = u.mapping(fr=from_db, to=to_db, query=single_id)
+                    all_results.update(result)
+                except Exception:  # record the failure and keep going
+                    all_results[single_id] = None
+
+    return all_results
+
+# Usage
+uniprot_ids = ["P43403", "P04637", "P53779", "INVALID123"]
+mapping = safe_batch_mapping(uniprot_ids, "UniProtKB_AC-ID", "KEGG")
+```
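+
+Obviously malformed inputs such as `INVALID123` can be filtered out before any request is sent. The sketch below checks IDs against the accession pattern published by UniProt; `split_valid_accessions` is a hypothetical helper, and it deliberately covers accessions only, so entry names such as `ZAP70_HUMAN` would be reported as invalid:
+
+```python
+import re
+
+# UniProtKB accession pattern (6 or 10 characters)
+ACCESSION_RE = re.compile(
+    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
+    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
+)
+
+
+def split_valid_accessions(ids):
+    """Return (valid, invalid) lists, de-duplicated, in input order."""
+    valid, invalid = [], []
+    seen = set()
+    for raw in ids:
+        acc = raw.strip().upper()
+        if acc in seen:
+            continue  # skip repeats so each ID is queried once
+        seen.add(acc)
+        (valid if ACCESSION_RE.match(acc) else invalid).append(acc)
+    return valid, invalid
+
+
+valid, invalid = split_valid_accessions(
+    ["P43403", "P04637", "P53779", "INVALID123"]
+)
+print(valid, invalid)
+```
+
+Only the `valid` list would then be passed to `safe_batch_mapping`, while `invalid` can be logged for review.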
+
+### Pattern 4: Multi-Hop Mapping
+
+Sometimes you need to map through intermediate databases:
+
+```python
+from bioservices import KEGG, UniProt
+
+def multi_hop_mapping(gene_symbol, organism="9606"):
+    """Gene symbol → UniProt → KEGG → Pathways."""
+    u = UniProt()
+    k = KEGG()
+
+    # Step 1: Gene symbol → UniProt
+    query = f"gene:{gene_symbol} AND organism:{organism}"
+    result = u.search(query, frmt="tab", columns="id")
+
+    lines = result.strip().split("\n")
+    if len(lines) < 2:
+        return None
+
+    uniprot_id = lines[1].split("\t")[0]
+
+    # Step 2: UniProt → KEGG
+    kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
+    if not kegg_mapping or uniprot_id not in kegg_mapping:
+        return None
+
+    kegg_id = kegg_mapping[uniprot_id][0]
+
+    # Step 3: KEGG → Pathways
+    organism_code, gene_id = kegg_id.split(":")
+    pathways = k.get_pathway_by_gene(gene_id, organism_code)
+
+    return {
+        'gene': gene_symbol,
+        'uniprot': uniprot_id,
+        'kegg': kegg_id,
+        'pathways': pathways
+    }
+
+# Usage
+result = multi_hop_mapping("TP53")
+print(result)
+```
+
+---
+
+## Troubleshooting
+
+### Issue 1: No Mapping Found
+
+**Symptom:** Mapping returns empty or None
+
+**Solutions:**
+1. Verify source ID exists in source database
+2. Check database code spelling
+3. Try reverse mapping
+4. Some IDs may not have mappings in all databases
+
+```python
+result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
+
+if not result or 'P43403' not in result:
+    print("No mapping found. Try:")
+    print("1. Verify ID exists: u.search('P43403')")
+    print("2. Check if protein has KEGG annotation")
+```
+
+### Issue 2: Too Many IDs in Batch
+
+**Symptom:** Batch mapping fails or times out
+
+**Solution:** Split into smaller chunks
+
+```python
+def chunked_mapping(ids, from_db, to_db, chunk_size=50):
+    """Map IDs in chunks; assumes `u = UniProt()` from the examples above."""
+    all_results = {}
+
+    for i in range(0, len(ids), chunk_size):
+        chunk = ids[i:i+chunk_size]
+        result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
+        all_results.update(result)
+
+    return all_results
+```
+
+### Issue 3: Multiple Target IDs
+
+**Symptom:** One source ID maps to multiple target IDs
+
+**Solution:** Handle as list
+
+```python
+result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
+# Result: {'P04637': ['1A1U', '1AIE', '1C26', ...]}
+
+pdb_ids = result['P04637']
+print(f"Found {len(pdb_ids)} PDB structures")
+
+for pdb_id in pdb_ids:
+    print(f"  {pdb_id}")
+```
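+
+When downstream code expects exactly one target per source, it can help to separate unambiguous mappings from those that need a manual decision. A minimal sketch, assuming the `{source_id: [target_ids]}` shape used in this guide; `pick_single_targets` is a hypothetical helper:
+
+```python
+def pick_single_targets(mapping):
+    """Split a mapping into unambiguous picks and ambiguous sources."""
+    unique, ambiguous = {}, {}
+    for source, targets in mapping.items():
+        # De-duplicate; sources with no targets land in neither dict
+        targets = sorted(set(targets or []))
+        if len(targets) == 1:
+            unique[source] = targets[0]
+        elif targets:
+            ambiguous[source] = targets
+    return unique, ambiguous
+
+
+unique, ambiguous = pick_single_targets({
+    "P43403": ["hsa:7535"],
+    "P04637": ["1A1U", "1AIE", "1C26"],
+    "P53779": None,
+})
+print(unique)
+print(ambiguous)
+```
+
+Sources with no targets are dropped from both dictionaries, so they can be detected by comparing against the input keys.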
+
+### Issue 4: Organism Ambiguity
+
+**Symptom:** Gene symbol maps to multiple organisms
+
+**Solution:** Always specify organism in searches
+
+```python
+# Bad: Ambiguous
+result = u.search("gene:TP53")  # Many organisms have TP53
+
+# Good: Specific
+result = u.search("gene:TP53 AND organism:9606")  # Human only
+```
+
+### Issue 5: Deprecated IDs
+
+**Symptom:** Old database IDs don't map
+
+**Solution:** Update to current IDs first
+
+```python
+# Check if ID is current
+entry = u.retrieve("P43403", frmt="txt")
+
+# Look for secondary accessions
+for line in entry.split("\n"):
+    if line.startswith("AC"):
+        print(line)  # Shows primary and secondary accessions
+```
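+
+To act on the `AC` lines rather than just print them, the primary accession (listed first) can be separated from the secondary ones. This is a sketch assuming the standard UniProt flat-file layout, where accessions are separated by semicolons and may span several `AC` lines; `parse_accessions` is a hypothetical helper and the sample text is hand-written:
+
+```python
+def parse_accessions(entry_text):
+    """Return (primary, [secondary, ...]) from the AC lines of a txt entry."""
+    accessions = []
+    for line in entry_text.splitlines():
+        if line.startswith("AC "):
+            # Drop the two-letter line code, then split on semicolons
+            accessions.extend(
+                acc.strip() for acc in line[2:].split(";") if acc.strip()
+            )
+    if not accessions:
+        return None, []
+    return accessions[0], accessions[1:]
+
+
+sample = "\n".join([
+    "ID   ZAP70_HUMAN             Reviewed;         619 AA.",
+    "AC   P43403; A6NFP4;",
+    "AC   Q6PIA4;",
+])
+primary, secondary = parse_accessions(sample)
+print(primary, secondary)
+```
+
+If the ID you queried appears among the secondary accessions, re-running the mapping with the primary accession is a reasonable next step.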
+
+---
+
+## Best Practices
+
+1. **Always validate inputs** before batch processing
+2. **Handle None/empty results** gracefully
+3. **Use chunking** for large ID lists (50-100 per chunk)
+4. **Cache results** for repeated queries
+5. **Specify organism** when possible to avoid ambiguity
+6. **Log failures** in batch processing for later retry
+7. **Add delays** between large batches to respect API limits
+
+```python
+import time
+
+from bioservices import UniProt
+
+u = UniProt()
+
+def polite_batch_mapping(ids, from_db, to_db):
+    """Batch mapping with rate limiting."""
+    results = {}
+
+    for i in range(0, len(ids), 50):
+        chunk = ids[i:i+50]
+        result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
+        results.update(result)
+
+        time.sleep(0.5)  # Be nice to the API
+
+    return results
+```
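+
+Caching (practice 4) can be sketched as a thin wrapper around any single-ID lookup function. `make_cached_mapper` is a hypothetical helper; the stand-in `fake_lookup` below replaces a real call such as `u.mapping(...)` so the example runs without network access:
+
+```python
+def make_cached_mapper(map_one):
+    """Wrap a single-ID lookup so each distinct ID is fetched only once."""
+    cache = {}
+
+    def cached(source_id):
+        if source_id not in cache:
+            cache[source_id] = map_one(source_id)
+        return cache[source_id]
+
+    return cached
+
+
+calls = []
+
+
+def fake_lookup(source_id):
+    """Stand-in for a remote lookup; records every real call."""
+    calls.append(source_id)
+    return [f"kegg:{source_id}"]
+
+
+lookup = make_cached_mapper(fake_lookup)
+lookup("P43403")
+lookup("P43403")  # served from the cache, no second call
+print(len(calls))
+```
+
+An in-memory dict is lost when the process exits; persisting results to disk is a separate design choice not covered here.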
+
+---
+
+For complete working examples, see:
+- `scripts/batch_id_converter.py`: Command-line batch conversion tool
+- `workflow_patterns.md`: Integration into larger workflows