Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

# BioServices: Identifier Mapping Guide
This document provides comprehensive information about converting identifiers between different biological databases using BioServices.
## Table of Contents
1. [Overview](#overview)
2. [UniProt Mapping Service](#uniprot-mapping-service)
3. [UniChem Compound Mapping](#unichem-compound-mapping)
4. [KEGG Identifier Conversions](#kegg-identifier-conversions)
5. [Common Mapping Patterns](#common-mapping-patterns)
6. [Troubleshooting](#troubleshooting)
7. [Best Practices](#best-practices)
---
## Overview
Biological databases use different identifier systems. Cross-referencing requires mapping between these systems. BioServices provides multiple approaches:
1. **UniProt Mapping**: Comprehensive protein/gene ID conversion
2. **UniChem**: Chemical compound ID mapping
3. **KEGG**: Built-in cross-references in entries
4. **PICR**: Protein identifier cross-reference service
---
## UniProt Mapping Service
The UniProt mapping service is the most comprehensive tool for protein and gene identifier conversion.
### Basic Usage
```python
from bioservices import UniProt
u = UniProt()
# Map single ID
result = u.mapping(
    fr="UniProtKB_AC-ID",  # Source database
    to="KEGG",             # Target database
    query="P43403"         # Identifier to convert
)
print(result)
# Output: {'P43403': ['hsa:7535']}
```
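
The output comment above shows a flat dictionary keyed by source ID. Depending on the installed bioservices release, `mapping()` may instead return a JSON-like dictionary with a `results` list of `from`/`to` rows. The sketch below normalizes both shapes into `{source_id: [target_ids]}`. The helper name `normalize_mapping`, the exact layout of the `results` shape, and the `primaryAccession` key for nested records are assumptions to check against your own output, not documented guarantees.

```python
from collections import defaultdict


def normalize_mapping(result):
    """Return {source_id: [target_id, ...]} for either result shape."""
    if not result:
        return {}
    # Assumed newer shape: {"results": [{"from": ..., "to": ...}], "failedIds": [...]}
    if isinstance(result, dict) and "results" in result:
        grouped = defaultdict(list)
        for row in result["results"]:
            target = row["to"]
            # Assumption: nested records carry the accession under "primaryAccession".
            if isinstance(target, dict):
                target = target.get("primaryAccession", str(target))
            grouped[row["from"]].append(target)
        return dict(grouped)
    # Flat shape used in this guide's comments: {"P43403": ["hsa:7535"]}
    return {
        key: [value] if isinstance(value, str) else list(value)
        for key, value in result.items()
    }
```

Passing every `mapping()` result through a helper like this keeps downstream code independent of the library version.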
### Batch Mapping
```python
# Map multiple IDs (comma-separated)
ids = ["P43403", "P04637", "P53779"]
result = u.mapping(
    fr="UniProtKB_AC-ID",
    to="KEGG",
    query=",".join(ids)
)
for uniprot_id, kegg_ids in result.items():
    print(f"{uniprot_id} → {kegg_ids}")
```
### Supported Database Pairs
UniProt supports mapping between 100+ database pairs. Key ones include:
#### Protein/Gene Databases
| Source Format | Code | Target Format | Code |
|---------------|------|---------------|------|
| UniProtKB AC/ID | `UniProtKB_AC-ID` | KEGG | `KEGG` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl | `Ensembl` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl Protein | `Ensembl_Protein` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl Transcript | `Ensembl_Transcript` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | RefSeq Protein | `RefSeq_Protein` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | RefSeq Nucleotide | `RefSeq_Nucleotide` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | GeneID (Entrez) | `GeneID` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | HGNC | `HGNC` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | MGI | `MGI` |
| KEGG | `KEGG` | UniProtKB | `UniProtKB` |
| Ensembl | `Ensembl` | UniProtKB | `UniProtKB` |
| GeneID | `GeneID` | UniProtKB | `UniProtKB` |
#### Structural Databases
| Source | Code | Target | Code |
|--------|------|--------|------|
| UniProtKB AC/ID | `UniProtKB_AC-ID` | PDB | `PDB` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Pfam | `Pfam` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | InterPro | `InterPro` |
| PDB | `PDB` | UniProtKB | `UniProtKB` |
#### Expression & Proteomics
| Source | Code | Target | Code |
|--------|------|--------|------|
| UniProtKB AC/ID | `UniProtKB_AC-ID` | PRIDE | `PRIDE` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | ProteomicsDB | `ProteomicsDB` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | PaxDb | `PaxDb` |
#### Organism-Specific
| Source | Code | Target | Code |
|--------|------|--------|------|
| UniProtKB AC/ID | `UniProtKB_AC-ID` | FlyBase | `FlyBase` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | WormBase | `WormBase` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | SGD | `SGD` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | ZFIN | `ZFIN` |
#### Other Useful Mappings
| Source | Code | Target | Code |
|--------|------|--------|------|
| UniProtKB AC/ID | `UniProtKB_AC-ID` | GO | `GO` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Reactome | `Reactome` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | STRING | `STRING` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | BioGRID | `BioGRID` |
| UniProtKB AC/ID | `UniProtKB_AC-ID` | OMA | `OMA` |
### Complete List of Database Codes
The complete, up-to-date list of codes is published in the UniProt REST API documentation for the ID mapping service; the snippet below only summarizes the naming conventions:
```python
from bioservices import UniProt
u = UniProt()
# Naming conventions for database codes:
# - UniProtKB is "UniProtKB_AC-ID" as a source and "UniProtKB" as a target
# - Most other databases use their standard abbreviation (e.g. "KEGG", "PDB")
# - Sub-types are appended with an underscore (e.g. "Ensembl_Protein")
```
### Common Database Codes Reference
**Gene/Protein Identifiers:**
- `UniProtKB_AC-ID`: UniProt accession/ID
- `UniProtKB`: UniProt accession
- `KEGG`: KEGG gene IDs (e.g., hsa:7535)
- `GeneID`: NCBI Gene (Entrez) IDs
- `Ensembl`: Ensembl gene IDs
- `Ensembl_Protein`: Ensembl protein IDs
- `Ensembl_Transcript`: Ensembl transcript IDs
- `RefSeq_Protein`: RefSeq protein IDs (NP_)
- `RefSeq_Nucleotide`: RefSeq nucleotide IDs (NM_)
**Gene Nomenclature:**
- `HGNC`: Human Gene Nomenclature Committee
- `MGI`: Mouse Genome Informatics
- `RGD`: Rat Genome Database
- `SGD`: Saccharomyces Genome Database
- `FlyBase`: Drosophila database
- `WormBase`: C. elegans database
- `ZFIN`: Zebrafish database
**Structure:**
- `PDB`: Protein Data Bank
- `Pfam`: Protein families
- `InterPro`: Protein domains
- `SUPFAM`: Superfamily
- `PROSITE`: Protein motifs
**Pathways & Networks:**
- `Reactome`: Reactome pathways
- `BioCyc`: BioCyc pathways
- `PathwayCommons`: Pathway Commons
- `STRING`: Protein-protein networks
- `BioGRID`: Interaction database
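
Database codes are case-sensitive strings, so a misspelled code is an easy mistake to make. A minimal sketch of a local spell-check follows; `KNOWN_CODES` and `suggest_code` are hypothetical names, and the set contains only codes copied from this guide, not the full UniProt list.

```python
import difflib

# Codes copied from the reference lists in this guide (deliberately incomplete).
KNOWN_CODES = {
    "UniProtKB_AC-ID", "UniProtKB", "KEGG", "GeneID", "Ensembl",
    "Ensembl_Protein", "Ensembl_Transcript", "RefSeq_Protein",
    "RefSeq_Nucleotide", "HGNC", "MGI", "RGD", "SGD", "FlyBase",
    "WormBase", "ZFIN", "PDB", "Pfam", "InterPro", "SUPFAM", "PROSITE",
    "Reactome", "BioCyc", "PathwayCommons", "STRING", "BioGRID",
}


def suggest_code(code, known=KNOWN_CODES):
    """Return [code] if it is known, else up to three close spellings."""
    if code in known:
        return [code]
    return difflib.get_close_matches(code, sorted(known), n=3, cutoff=0.6)


print(suggest_code("Ensembl_protein"))  # wrong capitalization of "Protein"
```

An empty list means nothing similar is in the local set, which is a cue to consult the UniProt documentation rather than proof that the code is invalid.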
### Mapping Examples
#### UniProt → KEGG
```python
from bioservices import UniProt
u = UniProt()
# Single mapping
result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
print(result) # {'P43403': ['hsa:7535']}
```
#### KEGG → UniProt
```python
# Reverse mapping
result = u.mapping(fr="KEGG", to="UniProtKB", query="hsa:7535")
print(result) # {'hsa:7535': ['P43403']}
```
#### UniProt → Ensembl
```python
# To Ensembl gene IDs
result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query="P43403")
print(result) # {'P43403': ['ENSG00000115085']}
# To Ensembl protein IDs
result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl_Protein", query="P43403")
print(result) # {'P43403': ['ENSP00000381359']}
```
#### UniProt → PDB
```python
# Find 3D structures
result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
print(result) # {'P04637': ['1A1U', '1AIE', '1C26', ...]}
```
#### UniProt → RefSeq
```python
# Get RefSeq protein IDs
result = u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query="P43403")
print(result) # {'P43403': ['NP_001070.2']}
```
#### Gene Name → UniProt (via search, then mapping)
```python
# First search for gene
# (newer bioservices releases may expect frmt="tsv" and columns="accession")
search_result = u.search("gene:ZAP70 AND organism:9606", frmt="tab", columns="id")
lines = search_result.strip().split("\n")
if len(lines) > 1:  # The first line is the column header
    uniprot_id = lines[1].split("\t")[0]
    # Then map to other databases
    kegg_id = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
    print(kegg_id)
```
---
## UniChem Compound Mapping
UniChem specializes in mapping chemical compound identifiers across databases.
### Source Database IDs
| Source ID | Database |
|-----------|----------|
| 1 | ChEMBL |
| 2 | DrugBank |
| 3 | PDB |
| 4 | IUPHAR/BPS Guide to Pharmacology |
| 5 | PubChem (Drugs of the Future subset) |
| 6 | KEGG |
| 7 | ChEBI |
| 8 | NIH Clinical Collection |
| 14 | FDA/SRS |
| 22 | PubChem |
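
Numeric source IDs are easy to mix up in calls such as `from_src_id=6`. One option is to keep the table in a small lookup so call sites can use names. This is only a sketch: the dictionary keys and the `src_id` helper are hypothetical labels chosen here, not UniChem identifiers, and ID 5 is left out (the ChEMBL → PubChem example in this guide uses 22).

```python
# Source IDs taken from the table above; the keys are labels chosen for this sketch.
UNICHEM_SOURCES = {
    "chembl": 1,
    "drugbank": 2,
    "pdb": 3,
    "iuphar": 4,
    "kegg": 6,
    "chebi": 7,
    "nih_ncc": 8,
    "fda_srs": 14,
    "pubchem": 22,
}


def src_id(name):
    """Look up a UniChem source ID by label, ignoring case and padding."""
    key = name.strip().lower()
    if key not in UNICHEM_SOURCES:
        raise ValueError(f"Unknown UniChem source label: {name!r}")
    return UNICHEM_SOURCES[key]


print(src_id("KEGG"), src_id("ChEMBL"))
```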
### Basic Usage
```python
from bioservices import UniChem
u = UniChem()
# Get ChEMBL ID from KEGG compound ID
chembl_id = u.get_compound_id_from_kegg("C11222")
print(chembl_id) # CHEMBL278315
```
### All Compound IDs
```python
# Get all identifiers for a compound
# src_compound_id: compound ID, src_id: source database ID
all_ids = u.get_all_compound_ids("CHEMBL278315", src_id=1) # 1 = ChEMBL
for mapping in all_ids:
    src_name = mapping['src_name']
    src_compound_id = mapping['src_compound_id']
    print(f"{src_name}: {src_compound_id}")
```
### Specific Database Conversion
```python
# Convert between specific databases
# from_src_id=6 (KEGG), to_src_id=1 (ChEMBL)
result = u.get_src_compound_ids("C11222", from_src_id=6, to_src_id=1)
print(result)
```
### Common Compound Mappings
#### KEGG → ChEMBL
```python
u = UniChem()
chembl_id = u.get_compound_id_from_kegg("C00031") # D-Glucose
print(f"ChEMBL: {chembl_id}")
```
#### ChEMBL → PubChem
```python
result = u.get_src_compound_ids("CHEMBL278315", from_src_id=1, to_src_id=22)
if result:
    pubchem_id = result[0]['src_compound_id']
    print(f"PubChem: {pubchem_id}")
```
#### ChEBI → DrugBank
```python
result = u.get_src_compound_ids("5292", from_src_id=7, to_src_id=2)
if result:
    drugbank_id = result[0]['src_compound_id']
    print(f"DrugBank: {drugbank_id}")
```
---
## KEGG Identifier Conversions
KEGG entries contain cross-references that can be extracted by parsing.
### Extract Database Links from KEGG Entry
```python
from bioservices import KEGG
k = KEGG()
# Get compound entry
entry = k.get("cpd:C11222")
# Parse for specific database
chebi_id = None
uniprot_ids = []
for line in entry.split("\n"):
    if "ChEBI:" in line:
        # Extract the first ChEBI ID listed on the line
        parts = line.split("ChEBI:")
        if len(parts) > 1:
            chebi_id = parts[1].strip().split()[0]
# For genes/proteins
gene_entry = k.get("hsa:7535")
for line in gene_entry.split("\n"):
    if line.startswith(" "):  # Indented continuation lines (database links)
        if "UniProt:" in line:
            parts = line.split("UniProt:")
            if len(parts) > 1:
                # A line may list several accessions separated by spaces
                uniprot_ids.extend(parts[1].split())
```
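
The substring scans above rely on the KEGG flat-file layout, in which a section keyword starts in the first column and continuation lines are indented. A slightly more general sketch groups lines by section first and then reads `DBLINKS`. The function names are hypothetical, and `SAMPLE` is a hand-written entry with placeholder IDs, not real KEGG data.

```python
def split_kegg_sections(entry_text):
    """Group the lines of a KEGG flat-file entry by section keyword."""
    sections = {}
    current = None
    for line in entry_text.splitlines():
        if not line.strip() or line.startswith("///"):
            continue  # Skip blank lines and the end-of-entry marker
        if not line.startswith(" "):
            # A new section: keyword in the first column, content after it
            keyword, _, line = line.partition(" ")
            current = keyword
            sections.setdefault(current, [])
        if current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


def dblinks(entry_text):
    """Return {database: [ids]} from the DBLINKS section, if present."""
    links = {}
    for item in split_kegg_sections(entry_text).get("DBLINKS", []):
        name, _, ids = item.partition(":")
        links[name.strip()] = ids.split()
    return links


# Hand-written sample in KEGG flat-file layout; all IDs are placeholders.
SAMPLE = """ENTRY       C00000                      Compound
NAME        Example compound
DBLINKS     CAS: 000-00-0
            ChEBI: 00000
///
"""

print(dblinks(SAMPLE))
```

Grouping by section first avoids matching a database name that happens to appear in free text elsewhere in the entry.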
### KEGG Gene ID Components
KEGG gene IDs have format `organism:gene_id`:
```python
kegg_id = "hsa:7535"
organism, gene_id = kegg_id.split(":")
print(f"Organism: {organism}") # hsa (human)
print(f"Gene ID: {gene_id}") # 7535
```
### KEGG Pathway to Genes
```python
from bioservices import KEGG
k = KEGG()
# Get pathway entry
pathway = k.get("path:hsa04660")
# Parse for gene list
genes = []
in_gene_section = False
for line in pathway.split("\n"):
    if line.startswith("GENE"):
        # The first gene shares its line with the section keyword
        in_gene_section = True
        parts = line[len("GENE"):].split()
    elif in_gene_section and line.startswith(" "):
        # Continuation lines of the GENE section are indented
        parts = line.split()
    elif in_gene_section:
        break  # Next section keyword: the gene list is over
    else:
        continue
    if parts:
        genes.append(f"hsa:{parts[0]}")
print(f"Found {len(genes)} genes")
```
---
## Common Mapping Patterns
### Pattern 1: Gene Symbol → Multiple Database IDs
```python
from bioservices import UniProt
def gene_symbol_to_ids(gene_symbol, organism="9606"):
    """Convert gene symbol to multiple database IDs."""
    u = UniProt()
    # Search for gene
    query = f"gene:{gene_symbol} AND organism:{organism}"
    result = u.search(query, frmt="tab", columns="id")
    lines = result.strip().split("\n")
    if len(lines) < 2:
        return None
    uniprot_id = lines[1].split("\t")[0]
    # Map to multiple databases
    ids = {
        'uniprot': uniprot_id,
        'kegg': u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id),
        'ensembl': u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query=uniprot_id),
        'refseq': u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query=uniprot_id),
        'pdb': u.mapping(fr="UniProtKB_AC-ID", to="PDB", query=uniprot_id)
    }
    return ids

# Usage
ids = gene_symbol_to_ids("ZAP70")
print(ids)
```
### Pattern 2: Compound Name → All Database IDs
```python
from bioservices import KEGG, UniChem

def compound_name_to_ids(compound_name):
    """Search compound and get all database IDs."""
    k = KEGG()
    # Search KEGG
    results = k.find("compound", compound_name)
    if not results:
        return None
    # Extract KEGG ID from the first hit
    kegg_id = results.strip().split("\n")[0].split("\t")[0].replace("cpd:", "")
    # Get KEGG entry for ChEBI
    entry = k.get(f"cpd:{kegg_id}")
    chebi_id = None
    for line in entry.split("\n"):
        if "ChEBI:" in line:
            parts = line.split("ChEBI:")
            if len(parts) > 1:
                chebi_id = parts[1].strip().split()[0]
                break
    # Get ChEMBL from UniChem
    u = UniChem()
    try:
        chembl_id = u.get_compound_id_from_kegg(kegg_id)
    except Exception:
        # Treat any lookup failure as "no ChEMBL mapping"
        chembl_id = None
    return {
        'kegg': kegg_id,
        'chebi': chebi_id,
        'chembl': chembl_id
    }

# Usage
ids = compound_name_to_ids("Geldanamycin")
print(ids)
```
### Pattern 3: Batch ID Conversion with Error Handling
```python
from bioservices import UniProt
def safe_batch_mapping(ids, from_db, to_db, chunk_size=100):
    """Safely map IDs with error handling and chunking."""
    u = UniProt()
    all_results = {}
    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i+chunk_size]
        query = ",".join(chunk)
        try:
            results = u.mapping(fr=from_db, to=to_db, query=query)
            all_results.update(results)
            print(f"✓ Processed {min(i+chunk_size, len(ids))}/{len(ids)}")
        except Exception as e:
            print(f"✗ Error at chunk {i}: {e}")
            # Try individual IDs in failed chunk
            for single_id in chunk:
                try:
                    result = u.mapping(fr=from_db, to=to_db, query=single_id)
                    all_results.update(result)
                except Exception:
                    # Record the failure so it can be retried later
                    all_results[single_id] = None
    return all_results

# Usage
uniprot_ids = ["P43403", "P04637", "P53779", "INVALID123"]
mapping = safe_batch_mapping(uniprot_ids, "UniProtKB_AC-ID", "KEGG")
```
### Pattern 4: Multi-Hop Mapping
Sometimes you need to map through intermediate databases:
```python
from bioservices import KEGG, UniProt

def multi_hop_mapping(gene_symbol, organism="9606"):
    """Gene symbol → UniProt → KEGG → Pathways."""
    u = UniProt()
    k = KEGG()
    # Step 1: Gene symbol → UniProt
    query = f"gene:{gene_symbol} AND organism:{organism}"
    result = u.search(query, frmt="tab", columns="id")
    lines = result.strip().split("\n")
    if len(lines) < 2:
        return None
    uniprot_id = lines[1].split("\t")[0]
    # Step 2: UniProt → KEGG
    kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
    if not kegg_mapping or uniprot_id not in kegg_mapping:
        return None
    kegg_id = kegg_mapping[uniprot_id][0]
    # Step 3: KEGG → Pathways
    organism_code, gene_id = kegg_id.split(":")
    pathways = k.get_pathway_by_gene(gene_id, organism_code)
    return {
        'gene': gene_symbol,
        'uniprot': uniprot_id,
        'kegg': kegg_id,
        'pathways': pathways
    }

# Usage
result = multi_hop_mapping("TP53")
print(result)
```
---
## Troubleshooting
### Issue 1: No Mapping Found
**Symptom:** Mapping returns empty or None
**Solutions:**
1. Verify source ID exists in source database
2. Check database code spelling
3. Try reverse mapping
4. Some IDs may not have mappings in all databases
```python
result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
if not result or 'P43403' not in result:
    print("No mapping found. Try:")
    print("1. Verify ID exists: u.search('P43403')")
    print("2. Check if protein has KEGG annotation")
```
### Issue 2: Too Many IDs in Batch
**Symptom:** Batch mapping fails or times out
**Solution:** Split into smaller chunks
```python
def chunked_mapping(ids, from_db, to_db, chunk_size=50):
    all_results = {}
    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i+chunk_size]
        result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
        all_results.update(result)
    return all_results
```
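
Chunking needs no network access, so it can be factored out and tested on its own. The sketch below removes duplicate IDs before chunking (fewer IDs per request) and works with any iterable, not only lists. `deduplicate` and `chunked` are hypothetical helper names.

```python
from itertools import islice


def deduplicate(ids):
    """Drop repeated IDs while keeping first-seen order."""
    return list(dict.fromkeys(ids))


def chunked(ids, size):
    """Yield successive lists of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(ids)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


ids = ["P43403", "P04637", "P43403", "P53779"]
for chunk in chunked(deduplicate(ids), 2):
    print(",".join(chunk))  # Each line is one comma-separated query string
```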
### Issue 3: Multiple Target IDs
**Symptom:** One source ID maps to multiple target IDs
**Solution:** Handle as list
```python
result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
# Result: {'P04637': ['1A1U', '1AIE', '1C26', ...]}
pdb_ids = result['P04637']
print(f"Found {len(pdb_ids)} PDB structures")
for pdb_id in pdb_ids:
    print(f"  {pdb_id}")
```
### Issue 4: Organism Ambiguity
**Symptom:** Gene symbol maps to multiple organisms
**Solution:** Always specify organism in searches
```python
# Bad: Ambiguous
result = u.search("gene:TP53") # Many organisms have TP53
# Good: Specific
result = u.search("gene:TP53 AND organism:9606") # Human only
```
### Issue 5: Deprecated IDs
**Symptom:** Old database IDs don't map
**Solution:** Update to current IDs first
```python
# Check if ID is current
entry = u.retrieve("P43403", frmt="txt")
# Look for secondary accessions
for line in entry.split("\n"):
    if line.startswith("AC"):
        print(line)  # Shows primary and secondary accessions
```
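
In the UniProt flat-file format, `AC` lines hold a semicolon-separated list in which the first accession is the primary one and the rest are secondary. A minimal sketch for separating them follows; `parse_accessions` is a hypothetical helper, and `SAMPLE` uses placeholder accessions rather than a real entry.

```python
def parse_accessions(flat_text):
    """Return (primary, [secondary, ...]) from the AC lines of a flat-file entry."""
    accessions = []
    for line in flat_text.splitlines():
        if line.startswith("AC"):
            # Drop the two-letter line code, then split the ';'-separated list
            for token in line[2:].split(";"):
                token = token.strip()
                if token:
                    accessions.append(token)
    if not accessions:
        return None, []
    return accessions[0], accessions[1:]


# Placeholder entry fragment; the accessions are made up for illustration.
SAMPLE = """ID   EXAMPLE_HUMAN           Reviewed;         100 AA.
AC   P00000; Q00000;
AC   Q11111;
DE   RecName: Full=Example protein;
"""

primary, secondary = parse_accessions(SAMPLE)
print(primary, secondary)
```

If the ID you queried appears among the secondary accessions, switching to the primary accession before mapping is the likely fix.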
---
## Best Practices
1. **Always validate inputs** before batch processing
2. **Handle None/empty results** gracefully
3. **Use chunking** for large ID lists (50-100 per chunk)
4. **Cache results** for repeated queries
5. **Specify organism** when possible to avoid ambiguity
6. **Log failures** in batch processing for later retry
7. **Add delays** between large batches to respect API limits
```python
import time
def polite_batch_mapping(ids, from_db, to_db):
    """Batch mapping with rate limiting."""
    results = {}
    for i in range(0, len(ids), 50):
        chunk = ids[i:i+50]
        result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
        results.update(result)
        time.sleep(0.5)  # Be nice to the API
    return results
```
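
Caching (practice 4) can be sketched with the standard library alone. The wrapper below memoizes any single-ID lookup callable; `make_cached_mapper` is a hypothetical helper, and `fake_map_one` is a stand-in that records calls so the example runs offline. In real use you would pass a function that calls `u.mapping` for one identifier.

```python
from functools import lru_cache


def make_cached_mapper(map_one):
    """Wrap a single-ID mapping callable so repeated queries hit a local cache."""
    @lru_cache(maxsize=None)
    def cached(from_db, to_db, identifier):
        # Store tuples so callers cannot mutate the cached value
        return tuple(map_one(from_db, to_db, identifier))
    return cached


calls = []


def fake_map_one(from_db, to_db, identifier):
    """Stand-in for a network lookup; returns a made-up target ID."""
    calls.append(identifier)
    return [f"{to_db}:{identifier}"]


cached_map = make_cached_mapper(fake_map_one)
first = cached_map("UniProtKB_AC-ID", "KEGG", "P43403")
second = cached_map("UniProtKB_AC-ID", "KEGG", "P43403")
print(len(calls))  # The stand-in was invoked once; the second call was cached
```

An in-memory cache lasts only for the current process; persisting results to disk is a separate design choice not covered here.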
---
For complete working examples, see:
- `scripts/batch_id_converter.py`: Command-line batch conversion tool
- `workflow_patterns.md`: Integration into larger workflows