# BioServices: Identifier Mapping Guide This document provides comprehensive information about converting identifiers between different biological databases using BioServices. ## Table of Contents 1. [Overview](#overview) 2. [UniProt Mapping Service](#uniprot-mapping-service) 3. [UniChem Compound Mapping](#unichem-compound-mapping) 4. [KEGG Identifier Conversions](#kegg-identifier-conversions) 5. [Common Mapping Patterns](#common-mapping-patterns) 6. [Troubleshooting](#troubleshooting) --- ## Overview Biological databases use different identifier systems. Cross-referencing requires mapping between these systems. BioServices provides multiple approaches: 1. **UniProt Mapping**: Comprehensive protein/gene ID conversion 2. **UniChem**: Chemical compound ID mapping 3. **KEGG**: Built-in cross-references in entries 4. **PICR**: Protein identifier cross-reference service --- ## UniProt Mapping Service The UniProt mapping service is the most comprehensive tool for protein and gene identifier conversion. ### Basic Usage ```python from bioservices import UniProt u = UniProt() # Map single ID result = u.mapping( fr="UniProtKB_AC-ID", # Source database to="KEGG", # Target database query="P43403" # Identifier to convert ) print(result) # Output: {'P43403': ['hsa:7535']} ``` ### Batch Mapping ```python # Map multiple IDs (comma-separated) ids = ["P43403", "P04637", "P53779"] result = u.mapping( fr="UniProtKB_AC-ID", to="KEGG", query=",".join(ids) ) for uniprot_id, kegg_ids in result.items(): print(f"{uniprot_id} → {kegg_ids}") ``` ### Supported Database Pairs UniProt supports mapping between 100+ database pairs. Key ones include: #### Protein/Gene Databases | Source Format | Code | Target Format | Code | |---------------|------|---------------|------| | UniProtKB AC/ID | `UniProtKB_AC-ID` | KEGG | `KEGG` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl | `Ensembl` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl Protein | `Ensembl_Protein` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl Transcript | `Ensembl_Transcript` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | RefSeq Protein | `RefSeq_Protein` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | RefSeq Nucleotide | `RefSeq_Nucleotide` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | GeneID (Entrez) | `GeneID` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | HGNC | `HGNC` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | MGI | `MGI` | | KEGG | `KEGG` | UniProtKB | `UniProtKB` | | Ensembl | `Ensembl` | UniProtKB | `UniProtKB` | | GeneID | `GeneID` | UniProtKB | `UniProtKB` | #### Structural Databases | Source | Code | Target | Code | |--------|------|--------|------| | UniProtKB AC/ID | `UniProtKB_AC-ID` | PDB | `PDB` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | Pfam | `Pfam` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | InterPro | `InterPro` | | PDB | `PDB` | UniProtKB | `UniProtKB` | #### Expression & Proteomics | Source | Code | Target | Code | |--------|------|--------|------| | UniProtKB AC/ID | `UniProtKB_AC-ID` | PRIDE | `PRIDE` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | ProteomicsDB | `ProteomicsDB` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | PaxDb | `PaxDb` | #### Organism-Specific | Source | Code | Target | Code | |--------|------|--------|------| | UniProtKB AC/ID | `UniProtKB_AC-ID` | FlyBase | `FlyBase` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | WormBase | `WormBase` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | SGD | `SGD` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | ZFIN | `ZFIN` | #### Other Useful Mappings | Source | Code | Target | Code | |--------|------|--------|------| | UniProtKB AC/ID | `UniProtKB_AC-ID` | GO | `GO` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | Reactome | `Reactome` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | STRING | `STRING` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | BioGRID | `BioGRID` | | UniProtKB AC/ID | `UniProtKB_AC-ID` | OMA | `OMA` | ### Complete List of Database Codes To get the complete, up-to-date list: ```python from bioservices import UniProt u = UniProt() # This information is in the UniProt REST API documentation # Common patterns: # - Source databases typically end in source database name # - UniProtKB uses "UniProtKB_AC-ID" or "UniProtKB" # - Most other databases use their standard abbreviation ``` ### Common Database Codes Reference **Gene/Protein Identifiers:** - `UniProtKB_AC-ID`: UniProt accession/ID - `UniProtKB`: UniProt accession - `KEGG`: KEGG gene IDs (e.g., hsa:7535) - `GeneID`: NCBI Gene (Entrez) IDs - `Ensembl`: Ensembl gene IDs - `Ensembl_Protein`: Ensembl protein IDs - `Ensembl_Transcript`: Ensembl transcript IDs - `RefSeq_Protein`: RefSeq protein IDs (NP_) - `RefSeq_Nucleotide`: RefSeq nucleotide IDs (NM_) **Gene Nomenclature:** - `HGNC`: Human Gene Nomenclature Committee - `MGI`: Mouse Genome Informatics - `RGD`: Rat Genome Database - `SGD`: Saccharomyces Genome Database - `FlyBase`: Drosophila database - `WormBase`: C. elegans database - `ZFIN`: Zebrafish database **Structure:** - `PDB`: Protein Data Bank - `Pfam`: Protein families - `InterPro`: Protein domains - `SUPFAM`: Superfamily - `PROSITE`: Protein motifs **Pathways & Networks:** - `Reactome`: Reactome pathways - `BioCyc`: BioCyc pathways - `PathwayCommons`: Pathway Commons - `STRING`: Protein-protein networks - `BioGRID`: Interaction database ### Mapping Examples #### UniProt → KEGG ```python from bioservices import UniProt u = UniProt() # Single mapping result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403") print(result) # {'P43403': ['hsa:7535']} ``` #### KEGG → UniProt ```python # Reverse mapping result = u.mapping(fr="KEGG", to="UniProtKB", query="hsa:7535") print(result) # {'hsa:7535': ['P43403']} ``` #### UniProt → Ensembl ```python # To Ensembl gene IDs result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query="P43403") print(result) # {'P43403': ['ENSG00000115085']} # To Ensembl protein IDs result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl_Protein", query="P43403") print(result) # {'P43403': ['ENSP00000381359']} ``` #### UniProt → PDB ```python # Find 3D structures result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637") print(result) # {'P04637': ['1A1U', '1AIE', '1C26', ...]} ``` #### UniProt → RefSeq ```python # Get RefSeq protein IDs result = u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query="P43403") print(result) # {'P43403': ['NP_001070.2']} ``` #### Gene Name → UniProt (via search, then mapping) ```python # First search for gene search_result = u.search("gene:ZAP70 AND organism:9606", frmt="tab", columns="id") lines = search_result.strip().split("\n") if len(lines) > 1: uniprot_id = lines[1].split("\t")[0] # Then map to other databases kegg_id = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id) print(kegg_id) ``` --- ## UniChem Compound Mapping UniChem specializes in mapping chemical compound identifiers across databases. ### Source Database IDs | Source ID | Database | |-----------|----------| | 1 | ChEMBL | | 2 | DrugBank | | 3 | PDB | | 4 | IUPHAR/BPS Guide to Pharmacology | | 5 | PubChem | | 6 | KEGG | | 7 | ChEBI | | 8 | NIH Clinical Collection | | 14 | FDA/SRS | | 22 | PubChem | ### Basic Usage ```python from bioservices import UniChem u = UniChem() # Get ChEMBL ID from KEGG compound ID chembl_id = u.get_compound_id_from_kegg("C11222") print(chembl_id) # CHEMBL278315 ``` ### All Compound IDs ```python # Get all identifiers for a compound # src_compound_id: compound ID, src_id: source database ID all_ids = u.get_all_compound_ids("CHEMBL278315", src_id=1) # 1 = ChEMBL for mapping in all_ids: src_name = mapping['src_name'] src_compound_id = mapping['src_compound_id'] print(f"{src_name}: {src_compound_id}") ``` ### Specific Database Conversion ```python # Convert between specific databases # from_src_id=6 (KEGG), to_src_id=1 (ChEMBL) result = u.get_src_compound_ids("C11222", from_src_id=6, to_src_id=1) print(result) ``` ### Common Compound Mappings #### KEGG → ChEMBL ```python u = UniChem() chembl_id = u.get_compound_id_from_kegg("C00031") # D-Glucose print(f"ChEMBL: {chembl_id}") ``` #### ChEMBL → PubChem ```python result = u.get_src_compound_ids("CHEMBL278315", from_src_id=1, to_src_id=22) if result: pubchem_id = result[0]['src_compound_id'] print(f"PubChem: {pubchem_id}") ``` #### ChEBI → DrugBank ```python result = u.get_src_compound_ids("5292", from_src_id=7, to_src_id=2) if result: drugbank_id = result[0]['src_compound_id'] print(f"DrugBank: {drugbank_id}") ``` --- ## KEGG Identifier Conversions KEGG entries contain cross-references that can be extracted by parsing. ### Extract Database Links from KEGG Entry ```python from bioservices import KEGG k = KEGG() # Get compound entry entry = k.get("cpd:C11222") # Parse for specific database chebi_id = None uniprot_ids = [] for line in entry.split("\n"): if "ChEBI:" in line: # Extract ChEBI ID parts = line.split("ChEBI:") if len(parts) > 1: chebi_id = parts[1].strip().split()[0] # For genes/proteins gene_entry = k.get("hsa:7535") for line in gene_entry.split("\n"): if line.startswith(" "): # Database links section if "UniProt:" in line: parts = line.split("UniProt:") if len(parts) > 1: uniprot_id = parts[1].strip() uniprot_ids.append(uniprot_id) ``` ### KEGG Gene ID Components KEGG gene IDs have format `organism:gene_id`: ```python kegg_id = "hsa:7535" organism, gene_id = kegg_id.split(":") print(f"Organism: {organism}") # hsa (human) print(f"Gene ID: {gene_id}") # 7535 ``` ### KEGG Pathway to Genes ```python k = KEGG() # Get pathway entry pathway = k.get("path:hsa04660") # Parse for gene list genes = [] in_gene_section = False for line in pathway.split("\n"): if line.startswith("GENE"): in_gene_section = True if in_gene_section: if line.startswith(" " * 12): # Gene line parts = line.strip().split() if parts: gene_id = parts[0] genes.append(f"hsa:{gene_id}") elif not line.startswith(" "): break print(f"Found {len(genes)} genes") ``` --- ## Common Mapping Patterns ### Pattern 1: Gene Symbol → Multiple Database IDs ```python from bioservices import UniProt def gene_symbol_to_ids(gene_symbol, organism="9606"): """Convert gene symbol to multiple database IDs.""" u = UniProt() # Search for gene query = f"gene:{gene_symbol} AND organism:{organism}" result = u.search(query, frmt="tab", columns="id") lines = result.strip().split("\n") if len(lines) < 2: return None uniprot_id = lines[1].split("\t")[0] # Map to multiple databases ids = { 'uniprot': uniprot_id, 'kegg': u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id), 'ensembl': u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query=uniprot_id), 'refseq': u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query=uniprot_id), 'pdb': u.mapping(fr="UniProtKB_AC-ID", to="PDB", query=uniprot_id) } return ids # Usage ids = gene_symbol_to_ids("ZAP70") print(ids) ``` ### Pattern 2: Compound Name → All Database IDs ```python from bioservices import KEGG, UniChem, ChEBI def compound_name_to_ids(compound_name): """Search compound and get all database IDs.""" k = KEGG() # Search KEGG results = k.find("compound", compound_name) if not results: return None # Extract KEGG ID kegg_id = results.strip().split("\n")[0].split("\t")[0].replace("cpd:", "") # Get KEGG entry for ChEBI entry = k.get(f"cpd:{kegg_id}") chebi_id = None for line in entry.split("\n"): if "ChEBI:" in line: parts = line.split("ChEBI:") if len(parts) > 1: chebi_id = parts[1].strip().split()[0] break # Get ChEMBL from UniChem u = UniChem() try: chembl_id = u.get_compound_id_from_kegg(kegg_id) except: chembl_id = None return { 'kegg': kegg_id, 'chebi': chebi_id, 'chembl': chembl_id } # Usage ids = compound_name_to_ids("Geldanamycin") print(ids) ``` ### Pattern 3: Batch ID Conversion with Error Handling ```python from bioservices import UniProt def safe_batch_mapping(ids, from_db, to_db, chunk_size=100): """Safely map IDs with error handling and chunking.""" u = UniProt() all_results = {} for i in range(0, len(ids), chunk_size): chunk = ids[i:i+chunk_size] query = ",".join(chunk) try: results = u.mapping(fr=from_db, to=to_db, query=query) all_results.update(results) print(f"✓ Processed {min(i+chunk_size, len(ids))}/{len(ids)}") except Exception as e: print(f"✗ Error at chunk {i}: {e}") # Try individual IDs in failed chunk for single_id in chunk: try: result = u.mapping(fr=from_db, to=to_db, query=single_id) all_results.update(result) except: all_results[single_id] = None return all_results # Usage uniprot_ids = ["P43403", "P04637", "P53779", "INVALID123"] mapping = safe_batch_mapping(uniprot_ids, "UniProtKB_AC-ID", "KEGG") ``` ### Pattern 4: Multi-Hop Mapping Sometimes you need to map through intermediate databases: ```python from bioservices import UniProt def multi_hop_mapping(gene_symbol, organism="9606"): """Gene symbol → UniProt → KEGG → Pathways.""" u = UniProt() k = KEGG() # Step 1: Gene symbol → UniProt query = f"gene:{gene_symbol} AND organism:{organism}" result = u.search(query, frmt="tab", columns="id") lines = result.strip().split("\n") if len(lines) < 2: return None uniprot_id = lines[1].split("\t")[0] # Step 2: UniProt → KEGG kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id) if not kegg_mapping or uniprot_id not in kegg_mapping: return None kegg_id = kegg_mapping[uniprot_id][0] # Step 3: KEGG → Pathways organism_code, gene_id = kegg_id.split(":") pathways = k.get_pathway_by_gene(gene_id, organism_code) return { 'gene': gene_symbol, 'uniprot': uniprot_id, 'kegg': kegg_id, 'pathways': pathways } # Usage result = multi_hop_mapping("TP53") print(result) ``` --- ## Troubleshooting ### Issue 1: No Mapping Found **Symptom:** Mapping returns empty or None **Solutions:** 1. Verify source ID exists in source database 2. Check database code spelling 3. Try reverse mapping 4. Some IDs may not have mappings in all databases ```python result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403") if not result or 'P43403' not in result: print("No mapping found. Try:") print("1. Verify ID exists: u.search('P43403')") print("2. Check if protein has KEGG annotation") ``` ### Issue 2: Too Many IDs in Batch **Symptom:** Batch mapping fails or times out **Solution:** Split into smaller chunks ```python def chunked_mapping(ids, from_db, to_db, chunk_size=50): all_results = {} for i in range(0, len(ids), chunk_size): chunk = ids[i:i+chunk_size] result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk)) all_results.update(result) return all_results ``` ### Issue 3: Multiple Target IDs **Symptom:** One source ID maps to multiple target IDs **Solution:** Handle as list ```python result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637") # Result: {'P04637': ['1A1U', '1AIE', '1C26', ...]} pdb_ids = result['P04637'] print(f"Found {len(pdb_ids)} PDB structures") for pdb_id in pdb_ids: print(f" {pdb_id}") ``` ### Issue 4: Organism Ambiguity **Symptom:** Gene symbol maps to multiple organisms **Solution:** Always specify organism in searches ```python # Bad: Ambiguous result = u.search("gene:TP53") # Many organisms have TP53 # Good: Specific result = u.search("gene:TP53 AND organism:9606") # Human only ``` ### Issue 5: Deprecated IDs **Symptom:** Old database IDs don't map **Solution:** Update to current IDs first ```python # Check if ID is current entry = u.retrieve("P43403", frmt="txt") # Look for secondary accessions for line in entry.split("\n"): if line.startswith("AC"): print(line) # Shows primary and secondary accessions ``` --- ## Best Practices 1. **Always validate inputs** before batch processing 2. **Handle None/empty results** gracefully 3. **Use chunking** for large ID lists (50-100 per chunk) 4. **Cache results** for repeated queries 5. **Specify organism** when possible to avoid ambiguity 6. **Log failures** in batch processing for later retry 7. **Add delays** between large batches to respect API limits ```python import time def polite_batch_mapping(ids, from_db, to_db): """Batch mapping with rate limiting.""" results = {} for i in range(0, len(ids), 50): chunk = ids[i:i+50] result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk)) results.update(result) time.sleep(0.5) # Be nice to the API return results ``` --- For complete working examples, see: - `scripts/batch_id_converter.py`: Command-line batch conversion tool - `workflow_patterns.md`: Integration into larger workflows