zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Files

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

17 KiB

Raw Blame History

BioServices: Identifier Mapping Guide

This document provides comprehensive information about converting identifiers between different biological databases using BioServices.

Overview
UniProt Mapping Service
UniChem Compound Mapping
KEGG Identifier Conversions
Common Mapping Patterns
Troubleshooting

Overview

Biological databases use different identifier systems. Cross-referencing requires mapping between these systems. BioServices provides multiple approaches:

UniProt Mapping: Comprehensive protein/gene ID conversion
UniChem: Chemical compound ID mapping
KEGG: Built-in cross-references in entries
PICR: Protein identifier cross-reference service

UniProt Mapping Service

The UniProt mapping service is the most comprehensive tool for protein and gene identifier conversion.

Basic Usage

from bioservices import UniProt

u = UniProt()

# Map single ID
result = u.mapping(
    fr="UniProtKB_AC-ID",    # Source database
    to="KEGG",                # Target database
    query="P43403"            # Identifier to convert
)

print(result)
# Output: {'P43403': ['hsa:7535']}

Batch Mapping

# Map multiple IDs (comma-separated)
ids = ["P43403", "P04637", "P53779"]
result = u.mapping(
    fr="UniProtKB_AC-ID",
    to="KEGG",
    query=",".join(ids)
)

for uniprot_id, kegg_ids in result.items():
    print(f"{uniprot_id} → {kegg_ids}")

Supported Database Pairs

UniProt supports mapping between 100+ database pairs. Key ones include:

Protein/Gene Databases

Source Format	Code	Target Format	Code
UniProtKB AC/ID	`UniProtKB_AC-ID`	KEGG	`KEGG`
UniProtKB AC/ID	`UniProtKB_AC-ID`	Ensembl	`Ensembl`
UniProtKB AC/ID	`UniProtKB_AC-ID`	Ensembl Protein	`Ensembl_Protein`
UniProtKB AC/ID	`UniProtKB_AC-ID`	Ensembl Transcript	`Ensembl_Transcript`
UniProtKB AC/ID	`UniProtKB_AC-ID`	RefSeq Protein	`RefSeq_Protein`
UniProtKB AC/ID	`UniProtKB_AC-ID`	RefSeq Nucleotide	`RefSeq_Nucleotide`
UniProtKB AC/ID	`UniProtKB_AC-ID`	GeneID (Entrez)	`GeneID`
UniProtKB AC/ID	`UniProtKB_AC-ID`	HGNC	`HGNC`
UniProtKB AC/ID	`UniProtKB_AC-ID`	MGI	`MGI`
KEGG	`KEGG`	UniProtKB	`UniProtKB`
Ensembl	`Ensembl`	UniProtKB	`UniProtKB`
GeneID	`GeneID`	UniProtKB	`UniProtKB`

Structural Databases

Source	Code	Target	Code
UniProtKB AC/ID	`UniProtKB_AC-ID`	PDB	`PDB`
UniProtKB AC/ID	`UniProtKB_AC-ID`	Pfam	`Pfam`
UniProtKB AC/ID	`UniProtKB_AC-ID`	InterPro	`InterPro`
PDB	`PDB`	UniProtKB	`UniProtKB`

Expression & Proteomics

Source	Code	Target	Code
UniProtKB AC/ID	`UniProtKB_AC-ID`	PRIDE	`PRIDE`
UniProtKB AC/ID	`UniProtKB_AC-ID`	ProteomicsDB	`ProteomicsDB`
UniProtKB AC/ID	`UniProtKB_AC-ID`	PaxDb	`PaxDb`

Organism-Specific

Source	Code	Target	Code
UniProtKB AC/ID	`UniProtKB_AC-ID`	FlyBase	`FlyBase`
UniProtKB AC/ID	`UniProtKB_AC-ID`	WormBase	`WormBase`
UniProtKB AC/ID	`UniProtKB_AC-ID`	SGD	`SGD`
UniProtKB AC/ID	`UniProtKB_AC-ID`	ZFIN	`ZFIN`

Other Useful Mappings

Source	Code	Target	Code
UniProtKB AC/ID	`UniProtKB_AC-ID`	GO	`GO`
UniProtKB AC/ID	`UniProtKB_AC-ID`	Reactome	`Reactome`
UniProtKB AC/ID	`UniProtKB_AC-ID`	STRING	`STRING`
UniProtKB AC/ID	`UniProtKB_AC-ID`	BioGRID	`BioGRID`
UniProtKB AC/ID	`UniProtKB_AC-ID`	OMA	`OMA`

Complete List of Database Codes

To get the complete, up-to-date list:

from bioservices import UniProt

u = UniProt()

# This information is in the UniProt REST API documentation
# Common patterns:
# - Source databases typically end in source database name
# - UniProtKB uses "UniProtKB_AC-ID" or "UniProtKB"
# - Most other databases use their standard abbreviation

Common Database Codes Reference

Gene/Protein Identifiers:

UniProtKB_AC-ID: UniProt accession/ID
UniProtKB: UniProt accession
KEGG: KEGG gene IDs (e.g., hsa:7535)
GeneID: NCBI Gene (Entrez) IDs
Ensembl: Ensembl gene IDs
Ensembl_Protein: Ensembl protein IDs
Ensembl_Transcript: Ensembl transcript IDs
RefSeq_Protein: RefSeq protein IDs (NP_)
RefSeq_Nucleotide: RefSeq nucleotide IDs (NM_)

Gene Nomenclature:

HGNC: Human Gene Nomenclature Committee
MGI: Mouse Genome Informatics
RGD: Rat Genome Database
SGD: Saccharomyces Genome Database
FlyBase: Drosophila database
WormBase: C. elegans database
ZFIN: Zebrafish database

Structure:

PDB: Protein Data Bank
Pfam: Protein families
InterPro: Protein domains
SUPFAM: Superfamily
PROSITE: Protein motifs

Pathways & Networks:

Reactome: Reactome pathways
BioCyc: BioCyc pathways
PathwayCommons: Pathway Commons
STRING: Protein-protein networks
BioGRID: Interaction database

Mapping Examples

UniProt → KEGG

from bioservices import UniProt

u = UniProt()

# Single mapping
result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
print(result)  # {'P43403': ['hsa:7535']}

KEGG → UniProt

# Reverse mapping
result = u.mapping(fr="KEGG", to="UniProtKB", query="hsa:7535")
print(result)  # {'hsa:7535': ['P43403']}

UniProt → Ensembl

# To Ensembl gene IDs
result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query="P43403")
print(result)  # {'P43403': ['ENSG00000115085']}

# To Ensembl protein IDs
result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl_Protein", query="P43403")
print(result)  # {'P43403': ['ENSP00000381359']}

UniProt → PDB

# Find 3D structures
result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
print(result)  # {'P04637': ['1A1U', '1AIE', '1C26', ...]}

UniProt → RefSeq

# Get RefSeq protein IDs
result = u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query="P43403")
print(result)  # {'P43403': ['NP_001070.2']}

Gene Name → UniProt (via search, then mapping)

# First search for gene
search_result = u.search("gene:ZAP70 AND organism:9606", frmt="tab", columns="id")
lines = search_result.strip().split("\n")
if len(lines) > 1:
    uniprot_id = lines[1].split("\t")[0]

    # Then map to other databases
    kegg_id = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
    print(kegg_id)

UniChem Compound Mapping

UniChem specializes in mapping chemical compound identifiers across databases.

Source Database IDs

Source ID	Database
1	ChEMBL
2	DrugBank
3	PDB
4	IUPHAR/BPS Guide to Pharmacology
5	PubChem
6	KEGG
7	ChEBI
8	NIH Clinical Collection
14	FDA/SRS
22	PubChem

Basic Usage

from bioservices import UniChem

u = UniChem()

# Get ChEMBL ID from KEGG compound ID
chembl_id = u.get_compound_id_from_kegg("C11222")
print(chembl_id)  # CHEMBL278315

All Compound IDs

# Get all identifiers for a compound
# src_compound_id: compound ID, src_id: source database ID
all_ids = u.get_all_compound_ids("CHEMBL278315", src_id=1)  # 1 = ChEMBL

for mapping in all_ids:
    src_name = mapping['src_name']
    src_compound_id = mapping['src_compound_id']
    print(f"{src_name}: {src_compound_id}")

Specific Database Conversion

# Convert between specific databases
# from_src_id=6 (KEGG), to_src_id=1 (ChEMBL)
result = u.get_src_compound_ids("C11222", from_src_id=6, to_src_id=1)
print(result)

Common Compound Mappings

KEGG → ChEMBL

u = UniChem()
chembl_id = u.get_compound_id_from_kegg("C00031")  # D-Glucose
print(f"ChEMBL: {chembl_id}")

ChEMBL → PubChem

result = u.get_src_compound_ids("CHEMBL278315", from_src_id=1, to_src_id=22)
if result:
    pubchem_id = result[0]['src_compound_id']
    print(f"PubChem: {pubchem_id}")

ChEBI → DrugBank

result = u.get_src_compound_ids("5292", from_src_id=7, to_src_id=2)
if result:
    drugbank_id = result[0]['src_compound_id']
    print(f"DrugBank: {drugbank_id}")

KEGG Identifier Conversions

KEGG entries contain cross-references that can be extracted by parsing.

Extract Database Links from KEGG Entry

from bioservices import KEGG

k = KEGG()

# Get compound entry
entry = k.get("cpd:C11222")

# Parse for specific database
chebi_id = None
uniprot_ids = []

for line in entry.split("\n"):
    if "ChEBI:" in line:
        # Extract ChEBI ID
        parts = line.split("ChEBI:")
        if len(parts) > 1:
            chebi_id = parts[1].strip().split()[0]

# For genes/proteins
gene_entry = k.get("hsa:7535")
for line in gene_entry.split("\n"):
    if line.startswith("            "):  # Database links section
        if "UniProt:" in line:
            parts = line.split("UniProt:")
            if len(parts) > 1:
                uniprot_id = parts[1].strip()
                uniprot_ids.append(uniprot_id)

KEGG Gene ID Components

KEGG gene IDs have format organism:gene_id:

kegg_id = "hsa:7535"
organism, gene_id = kegg_id.split(":")

print(f"Organism: {organism}")  # hsa (human)
print(f"Gene ID: {gene_id}")    # 7535

KEGG Pathway to Genes

k = KEGG()

# Get pathway entry
pathway = k.get("path:hsa04660")

# Parse for gene list
genes = []
in_gene_section = False

for line in pathway.split("\n"):
    if line.startswith("GENE"):
        in_gene_section = True

    if in_gene_section:
        if line.startswith(" " * 12):  # Gene line
            parts = line.strip().split()
            if parts:
                gene_id = parts[0]
                genes.append(f"hsa:{gene_id}")
        elif not line.startswith(" "):
            break

print(f"Found {len(genes)} genes")

Common Mapping Patterns

Pattern 1: Gene Symbol → Multiple Database IDs

from bioservices import UniProt

def gene_symbol_to_ids(gene_symbol, organism="9606"):
    """Convert gene symbol to multiple database IDs."""
    u = UniProt()

    # Search for gene
    query = f"gene:{gene_symbol} AND organism:{organism}"
    result = u.search(query, frmt="tab", columns="id")

    lines = result.strip().split("\n")
    if len(lines) < 2:
        return None

    uniprot_id = lines[1].split("\t")[0]

    # Map to multiple databases
    ids = {
        'uniprot': uniprot_id,
        'kegg': u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id),
        'ensembl': u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query=uniprot_id),
        'refseq': u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query=uniprot_id),
        'pdb': u.mapping(fr="UniProtKB_AC-ID", to="PDB", query=uniprot_id)
    }

    return ids

# Usage
ids = gene_symbol_to_ids("ZAP70")
print(ids)

Pattern 2: Compound Name → All Database IDs

from bioservices import KEGG, UniChem, ChEBI

def compound_name_to_ids(compound_name):
    """Search compound and get all database IDs."""
    k = KEGG()

    # Search KEGG
    results = k.find("compound", compound_name)
    if not results:
        return None

    # Extract KEGG ID
    kegg_id = results.strip().split("\n")[0].split("\t")[0].replace("cpd:", "")

    # Get KEGG entry for ChEBI
    entry = k.get(f"cpd:{kegg_id}")
    chebi_id = None
    for line in entry.split("\n"):
        if "ChEBI:" in line:
            parts = line.split("ChEBI:")
            if len(parts) > 1:
                chebi_id = parts[1].strip().split()[0]
                break

    # Get ChEMBL from UniChem
    u = UniChem()
    try:
        chembl_id = u.get_compound_id_from_kegg(kegg_id)
    except:
        chembl_id = None

    return {
        'kegg': kegg_id,
        'chebi': chebi_id,
        'chembl': chembl_id
    }

# Usage
ids = compound_name_to_ids("Geldanamycin")
print(ids)

Pattern 3: Batch ID Conversion with Error Handling

from bioservices import UniProt

def safe_batch_mapping(ids, from_db, to_db, chunk_size=100):
    """Safely map IDs with error handling and chunking."""
    u = UniProt()
    all_results = {}

    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i+chunk_size]
        query = ",".join(chunk)

        try:
            results = u.mapping(fr=from_db, to=to_db, query=query)
            all_results.update(results)
            print(f"✓ Processed {min(i+chunk_size, len(ids))}/{len(ids)}")

        except Exception as e:
            print(f"✗ Error at chunk {i}: {e}")

            # Try individual IDs in failed chunk
            for single_id in chunk:
                try:
                    result = u.mapping(fr=from_db, to=to_db, query=single_id)
                    all_results.update(result)
                except:
                    all_results[single_id] = None

    return all_results

# Usage
uniprot_ids = ["P43403", "P04637", "P53779", "INVALID123"]
mapping = safe_batch_mapping(uniprot_ids, "UniProtKB_AC-ID", "KEGG")

Pattern 4: Multi-Hop Mapping

Sometimes you need to map through intermediate databases:

from bioservices import UniProt

def multi_hop_mapping(gene_symbol, organism="9606"):
    """Gene symbol → UniProt → KEGG → Pathways."""
    u = UniProt()
    k = KEGG()

    # Step 1: Gene symbol → UniProt
    query = f"gene:{gene_symbol} AND organism:{organism}"
    result = u.search(query, frmt="tab", columns="id")

    lines = result.strip().split("\n")
    if len(lines) < 2:
        return None

    uniprot_id = lines[1].split("\t")[0]

    # Step 2: UniProt → KEGG
    kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
    if not kegg_mapping or uniprot_id not in kegg_mapping:
        return None

    kegg_id = kegg_mapping[uniprot_id][0]

    # Step 3: KEGG → Pathways
    organism_code, gene_id = kegg_id.split(":")
    pathways = k.get_pathway_by_gene(gene_id, organism_code)

    return {
        'gene': gene_symbol,
        'uniprot': uniprot_id,
        'kegg': kegg_id,
        'pathways': pathways
    }

# Usage
result = multi_hop_mapping("TP53")
print(result)

Troubleshooting

Issue 1: No Mapping Found

Symptom: Mapping returns empty or None

Solutions:

Verify source ID exists in source database
Check database code spelling
Try reverse mapping
Some IDs may not have mappings in all databases

result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")

if not result or 'P43403' not in result:
    print("No mapping found. Try:")
    print("1. Verify ID exists: u.search('P43403')")
    print("2. Check if protein has KEGG annotation")

Issue 2: Too Many IDs in Batch

Symptom: Batch mapping fails or times out

Solution: Split into smaller chunks

def chunked_mapping(ids, from_db, to_db, chunk_size=50):
    all_results = {}

    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i+chunk_size]
        result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
        all_results.update(result)

    return all_results

Issue 3: Multiple Target IDs

Symptom: One source ID maps to multiple target IDs

Solution: Handle as list

result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
# Result: {'P04637': ['1A1U', '1AIE', '1C26', ...]}

pdb_ids = result['P04637']
print(f"Found {len(pdb_ids)} PDB structures")

for pdb_id in pdb_ids:
    print(f"  {pdb_id}")

Issue 4: Organism Ambiguity

Symptom: Gene symbol maps to multiple organisms

Solution: Always specify organism in searches

# Bad: Ambiguous
result = u.search("gene:TP53")  # Many organisms have TP53

# Good: Specific
result = u.search("gene:TP53 AND organism:9606")  # Human only

Issue 5: Deprecated IDs

Symptom: Old database IDs don't map

Solution: Update to current IDs first

# Check if ID is current
entry = u.retrieve("P43403", frmt="txt")

# Look for secondary accessions
for line in entry.split("\n"):
    if line.startswith("AC"):
        print(line)  # Shows primary and secondary accessions

Best Practices

Always validate inputs before batch processing
Handle None/empty results gracefully
Use chunking for large ID lists (50-100 per chunk)
Cache results for repeated queries
Specify organism when possible to avoid ambiguity
Log failures in batch processing for later retry
Add delays between large batches to respect API limits

import time

def polite_batch_mapping(ids, from_db, to_db):
    """Batch mapping with rate limiting."""
    results = {}

    for i in range(0, len(ids), 50):
        chunk = ids[i:i+50]
        result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
        results.update(result)

        time.sleep(0.5)  # Be nice to the API

    return results

For complete working examples, see:

scripts/batch_id_converter.py: Command-line batch conversion tool
workflow_patterns.md: Integration into larger workflows

17 KiB Raw Blame History

BioServices: Identifier Mapping Guide

Table of Contents

Overview

UniProt Mapping Service

Basic Usage

Batch Mapping

Supported Database Pairs

Protein/Gene Databases

Structural Databases

Expression & Proteomics

Organism-Specific

Other Useful Mappings

Complete List of Database Codes

Common Database Codes Reference

Mapping Examples

UniProt → KEGG

KEGG → UniProt

UniProt → Ensembl

UniProt → PDB

UniProt → RefSeq

Gene Name → UniProt (via search, then mapping)

UniChem Compound Mapping

Source Database IDs

Basic Usage

All Compound IDs

Specific Database Conversion

Common Compound Mappings

KEGG → ChEMBL

ChEMBL → PubChem

ChEBI → DrugBank

KEGG Identifier Conversions

Extract Database Links from KEGG Entry

KEGG Gene ID Components

KEGG Pathway to Genes

Common Mapping Patterns

Pattern 1: Gene Symbol → Multiple Database IDs

Pattern 2: Compound Name → All Database IDs

Pattern 3: Batch ID Conversion with Error Handling

Pattern 4: Multi-Hop Mapping

Troubleshooting

Issue 1: No Mapping Found

Issue 2: Too Many IDs in Batch

Issue 3: Multiple Target IDs

Issue 4: Organism Ambiguity

Issue 5: Deprecated IDs

Best Practices

17 KiB

Raw Blame History