Initial commit
This commit is contained in:
685
skills/bioservices/references/identifier_mapping.md
Normal file
685
skills/bioservices/references/identifier_mapping.md
Normal file
@@ -0,0 +1,685 @@
|
||||
# BioServices: Identifier Mapping Guide
|
||||
|
||||
This document provides comprehensive information about converting identifiers between different biological databases using BioServices.
|
||||
|
||||
## Table of Contents
|
||||
|
||||
1. [Overview](#overview)
|
||||
2. [UniProt Mapping Service](#uniprot-mapping-service)
|
||||
3. [UniChem Compound Mapping](#unichem-compound-mapping)
|
||||
4. [KEGG Identifier Conversions](#kegg-identifier-conversions)
|
||||
5. [Common Mapping Patterns](#common-mapping-patterns)
|
||||
6. [Troubleshooting](#troubleshooting)
|
||||
|
||||
---
|
||||
|
||||
## Overview
|
||||
|
||||
Biological databases use different identifier systems. Cross-referencing requires mapping between these systems. BioServices provides multiple approaches:
|
||||
|
||||
1. **UniProt Mapping**: Comprehensive protein/gene ID conversion
|
||||
2. **UniChem**: Chemical compound ID mapping
|
||||
3. **KEGG**: Built-in cross-references in entries
|
||||
4. **PICR**: Protein identifier cross-reference service
|
||||
|
||||
---
|
||||
|
||||
## UniProt Mapping Service
|
||||
|
||||
The UniProt mapping service is the most comprehensive tool for protein and gene identifier conversion.
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
u = UniProt()
|
||||
|
||||
# Map single ID
|
||||
result = u.mapping(
|
||||
fr="UniProtKB_AC-ID", # Source database
|
||||
to="KEGG", # Target database
|
||||
query="P43403" # Identifier to convert
|
||||
)
|
||||
|
||||
print(result)
|
||||
# Output: {'P43403': ['hsa:7535']}
|
||||
```
|
||||
|
||||
### Batch Mapping
|
||||
|
||||
```python
|
||||
# Map multiple IDs (comma-separated)
|
||||
ids = ["P43403", "P04637", "P53779"]
|
||||
result = u.mapping(
|
||||
fr="UniProtKB_AC-ID",
|
||||
to="KEGG",
|
||||
query=",".join(ids)
|
||||
)
|
||||
|
||||
for uniprot_id, kegg_ids in result.items():
|
||||
print(f"{uniprot_id} → {kegg_ids}")
|
||||
```
|
||||
|
||||
### Supported Database Pairs
|
||||
|
||||
UniProt supports mapping between 100+ database pairs. Key ones include:
|
||||
|
||||
#### Protein/Gene Databases
|
||||
|
||||
| Source Format | Code | Target Format | Code |
|
||||
|---------------|------|---------------|------|
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | KEGG | `KEGG` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl | `Ensembl` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl Protein | `Ensembl_Protein` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Ensembl Transcript | `Ensembl_Transcript` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | RefSeq Protein | `RefSeq_Protein` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | RefSeq Nucleotide | `RefSeq_Nucleotide` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | GeneID (Entrez) | `GeneID` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | HGNC | `HGNC` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | MGI | `MGI` |
|
||||
| KEGG | `KEGG` | UniProtKB | `UniProtKB` |
|
||||
| Ensembl | `Ensembl` | UniProtKB | `UniProtKB` |
|
||||
| GeneID | `GeneID` | UniProtKB | `UniProtKB` |
|
||||
|
||||
#### Structural Databases
|
||||
|
||||
| Source | Code | Target | Code |
|
||||
|--------|------|--------|------|
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | PDB | `PDB` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Pfam | `Pfam` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | InterPro | `InterPro` |
|
||||
| PDB | `PDB` | UniProtKB | `UniProtKB` |
|
||||
|
||||
#### Expression & Proteomics
|
||||
|
||||
| Source | Code | Target | Code |
|
||||
|--------|------|--------|------|
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | PRIDE | `PRIDE` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | ProteomicsDB | `ProteomicsDB` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | PaxDb | `PaxDb` |
|
||||
|
||||
#### Organism-Specific
|
||||
|
||||
| Source | Code | Target | Code |
|
||||
|--------|------|--------|------|
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | FlyBase | `FlyBase` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | WormBase | `WormBase` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | SGD | `SGD` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | ZFIN | `ZFIN` |
|
||||
|
||||
#### Other Useful Mappings
|
||||
|
||||
| Source | Code | Target | Code |
|
||||
|--------|------|--------|------|
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | GO | `GO` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | Reactome | `Reactome` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | STRING | `STRING` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | BioGRID | `BioGRID` |
|
||||
| UniProtKB AC/ID | `UniProtKB_AC-ID` | OMA | `OMA` |
|
||||
|
||||
### Complete List of Database Codes
|
||||
|
||||
To get the complete, up-to-date list:
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
u = UniProt()
|
||||
|
||||
# This information is in the UniProt REST API documentation
|
||||
# Common patterns:
|
||||
# - Source databases typically end in source database name
|
||||
# - UniProtKB uses "UniProtKB_AC-ID" or "UniProtKB"
|
||||
# - Most other databases use their standard abbreviation
|
||||
```
|
||||
|
||||
### Common Database Codes Reference
|
||||
|
||||
**Gene/Protein Identifiers:**
|
||||
- `UniProtKB_AC-ID`: UniProt accession/ID
|
||||
- `UniProtKB`: UniProt accession
|
||||
- `KEGG`: KEGG gene IDs (e.g., hsa:7535)
|
||||
- `GeneID`: NCBI Gene (Entrez) IDs
|
||||
- `Ensembl`: Ensembl gene IDs
|
||||
- `Ensembl_Protein`: Ensembl protein IDs
|
||||
- `Ensembl_Transcript`: Ensembl transcript IDs
|
||||
- `RefSeq_Protein`: RefSeq protein IDs (NP_)
|
||||
- `RefSeq_Nucleotide`: RefSeq nucleotide IDs (NM_)
|
||||
|
||||
**Gene Nomenclature:**
|
||||
- `HGNC`: Human Gene Nomenclature Committee
|
||||
- `MGI`: Mouse Genome Informatics
|
||||
- `RGD`: Rat Genome Database
|
||||
- `SGD`: Saccharomyces Genome Database
|
||||
- `FlyBase`: Drosophila database
|
||||
- `WormBase`: C. elegans database
|
||||
- `ZFIN`: Zebrafish database
|
||||
|
||||
**Structure:**
|
||||
- `PDB`: Protein Data Bank
|
||||
- `Pfam`: Protein families
|
||||
- `InterPro`: Protein domains
|
||||
- `SUPFAM`: Superfamily
|
||||
- `PROSITE`: Protein motifs
|
||||
|
||||
**Pathways & Networks:**
|
||||
- `Reactome`: Reactome pathways
|
||||
- `BioCyc`: BioCyc pathways
|
||||
- `PathwayCommons`: Pathway Commons
|
||||
- `STRING`: Protein-protein networks
|
||||
- `BioGRID`: Interaction database
|
||||
|
||||
### Mapping Examples
|
||||
|
||||
#### UniProt → KEGG
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
u = UniProt()
|
||||
|
||||
# Single mapping
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
|
||||
print(result) # {'P43403': ['hsa:7535']}
|
||||
```
|
||||
|
||||
#### KEGG → UniProt
|
||||
|
||||
```python
|
||||
# Reverse mapping
|
||||
result = u.mapping(fr="KEGG", to="UniProtKB", query="hsa:7535")
|
||||
print(result) # {'hsa:7535': ['P43403']}
|
||||
```
|
||||
|
||||
#### UniProt → Ensembl
|
||||
|
||||
```python
|
||||
# To Ensembl gene IDs
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query="P43403")
|
||||
print(result) # {'P43403': ['ENSG00000115085']}
|
||||
|
||||
# To Ensembl protein IDs
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="Ensembl_Protein", query="P43403")
|
||||
print(result) # {'P43403': ['ENSP00000381359']}
|
||||
```
|
||||
|
||||
#### UniProt → PDB
|
||||
|
||||
```python
|
||||
# Find 3D structures
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
|
||||
print(result) # {'P04637': ['1A1U', '1AIE', '1C26', ...]}
|
||||
```
|
||||
|
||||
#### UniProt → RefSeq
|
||||
|
||||
```python
|
||||
# Get RefSeq protein IDs
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query="P43403")
|
||||
print(result) # {'P43403': ['NP_001070.2']}
|
||||
```
|
||||
|
||||
#### Gene Name → UniProt (via search, then mapping)
|
||||
|
||||
```python
|
||||
# First search for gene
|
||||
search_result = u.search("gene:ZAP70 AND organism:9606", frmt="tab", columns="id")
|
||||
lines = search_result.strip().split("\n")
|
||||
if len(lines) > 1:
|
||||
uniprot_id = lines[1].split("\t")[0]
|
||||
|
||||
# Then map to other databases
|
||||
kegg_id = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
|
||||
print(kegg_id)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## UniChem Compound Mapping
|
||||
|
||||
UniChem specializes in mapping chemical compound identifiers across databases.
|
||||
|
||||
### Source Database IDs
|
||||
|
||||
| Source ID | Database |
|
||||
|-----------|----------|
|
||||
| 1 | ChEMBL |
|
||||
| 2 | DrugBank |
|
||||
| 3 | PDB |
|
||||
| 4 | IUPHAR/BPS Guide to Pharmacology |
|
||||
| 5 | PubChem |
|
||||
| 6 | KEGG |
|
||||
| 7 | ChEBI |
|
||||
| 8 | NIH Clinical Collection |
|
||||
| 14 | FDA/SRS |
|
||||
| 22 | PubChem |
|
||||
|
||||
### Basic Usage
|
||||
|
||||
```python
|
||||
from bioservices import UniChem
|
||||
|
||||
u = UniChem()
|
||||
|
||||
# Get ChEMBL ID from KEGG compound ID
|
||||
chembl_id = u.get_compound_id_from_kegg("C11222")
|
||||
print(chembl_id) # CHEMBL278315
|
||||
```
|
||||
|
||||
### All Compound IDs
|
||||
|
||||
```python
|
||||
# Get all identifiers for a compound
|
||||
# src_compound_id: compound ID, src_id: source database ID
|
||||
all_ids = u.get_all_compound_ids("CHEMBL278315", src_id=1) # 1 = ChEMBL
|
||||
|
||||
for mapping in all_ids:
|
||||
src_name = mapping['src_name']
|
||||
src_compound_id = mapping['src_compound_id']
|
||||
print(f"{src_name}: {src_compound_id}")
|
||||
```
|
||||
|
||||
### Specific Database Conversion
|
||||
|
||||
```python
|
||||
# Convert between specific databases
|
||||
# from_src_id=6 (KEGG), to_src_id=1 (ChEMBL)
|
||||
result = u.get_src_compound_ids("C11222", from_src_id=6, to_src_id=1)
|
||||
print(result)
|
||||
```
|
||||
|
||||
### Common Compound Mappings
|
||||
|
||||
#### KEGG → ChEMBL
|
||||
|
||||
```python
|
||||
u = UniChem()
|
||||
chembl_id = u.get_compound_id_from_kegg("C00031") # D-Glucose
|
||||
print(f"ChEMBL: {chembl_id}")
|
||||
```
|
||||
|
||||
#### ChEMBL → PubChem
|
||||
|
||||
```python
|
||||
result = u.get_src_compound_ids("CHEMBL278315", from_src_id=1, to_src_id=22)
|
||||
if result:
|
||||
pubchem_id = result[0]['src_compound_id']
|
||||
print(f"PubChem: {pubchem_id}")
|
||||
```
|
||||
|
||||
#### ChEBI → DrugBank
|
||||
|
||||
```python
|
||||
result = u.get_src_compound_ids("5292", from_src_id=7, to_src_id=2)
|
||||
if result:
|
||||
drugbank_id = result[0]['src_compound_id']
|
||||
print(f"DrugBank: {drugbank_id}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## KEGG Identifier Conversions
|
||||
|
||||
KEGG entries contain cross-references that can be extracted by parsing.
|
||||
|
||||
### Extract Database Links from KEGG Entry
|
||||
|
||||
```python
|
||||
from bioservices import KEGG
|
||||
|
||||
k = KEGG()
|
||||
|
||||
# Get compound entry
|
||||
entry = k.get("cpd:C11222")
|
||||
|
||||
# Parse for specific database
|
||||
chebi_id = None
|
||||
uniprot_ids = []
|
||||
|
||||
for line in entry.split("\n"):
|
||||
if "ChEBI:" in line:
|
||||
# Extract ChEBI ID
|
||||
parts = line.split("ChEBI:")
|
||||
if len(parts) > 1:
|
||||
chebi_id = parts[1].strip().split()[0]
|
||||
|
||||
# For genes/proteins
|
||||
gene_entry = k.get("hsa:7535")
|
||||
for line in gene_entry.split("\n"):
|
||||
if line.startswith(" "): # Database links section
|
||||
if "UniProt:" in line:
|
||||
parts = line.split("UniProt:")
|
||||
if len(parts) > 1:
|
||||
uniprot_id = parts[1].strip()
|
||||
uniprot_ids.append(uniprot_id)
|
||||
```
|
||||
|
||||
### KEGG Gene ID Components
|
||||
|
||||
KEGG gene IDs have format `organism:gene_id`:
|
||||
|
||||
```python
|
||||
kegg_id = "hsa:7535"
|
||||
organism, gene_id = kegg_id.split(":")
|
||||
|
||||
print(f"Organism: {organism}") # hsa (human)
|
||||
print(f"Gene ID: {gene_id}") # 7535
|
||||
```
|
||||
|
||||
### KEGG Pathway to Genes
|
||||
|
||||
```python
|
||||
k = KEGG()
|
||||
|
||||
# Get pathway entry
|
||||
pathway = k.get("path:hsa04660")
|
||||
|
||||
# Parse for gene list
|
||||
genes = []
|
||||
in_gene_section = False
|
||||
|
||||
for line in pathway.split("\n"):
|
||||
if line.startswith("GENE"):
|
||||
in_gene_section = True
|
||||
|
||||
if in_gene_section:
|
||||
if line.startswith(" " * 12): # Gene line
|
||||
parts = line.strip().split()
|
||||
if parts:
|
||||
gene_id = parts[0]
|
||||
genes.append(f"hsa:{gene_id}")
|
||||
elif not line.startswith(" "):
|
||||
break
|
||||
|
||||
print(f"Found {len(genes)} genes")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Common Mapping Patterns
|
||||
|
||||
### Pattern 1: Gene Symbol → Multiple Database IDs
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
def gene_symbol_to_ids(gene_symbol, organism="9606"):
|
||||
"""Convert gene symbol to multiple database IDs."""
|
||||
u = UniProt()
|
||||
|
||||
# Search for gene
|
||||
query = f"gene:{gene_symbol} AND organism:{organism}"
|
||||
result = u.search(query, frmt="tab", columns="id")
|
||||
|
||||
lines = result.strip().split("\n")
|
||||
if len(lines) < 2:
|
||||
return None
|
||||
|
||||
uniprot_id = lines[1].split("\t")[0]
|
||||
|
||||
# Map to multiple databases
|
||||
ids = {
|
||||
'uniprot': uniprot_id,
|
||||
'kegg': u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id),
|
||||
'ensembl': u.mapping(fr="UniProtKB_AC-ID", to="Ensembl", query=uniprot_id),
|
||||
'refseq': u.mapping(fr="UniProtKB_AC-ID", to="RefSeq_Protein", query=uniprot_id),
|
||||
'pdb': u.mapping(fr="UniProtKB_AC-ID", to="PDB", query=uniprot_id)
|
||||
}
|
||||
|
||||
return ids
|
||||
|
||||
# Usage
|
||||
ids = gene_symbol_to_ids("ZAP70")
|
||||
print(ids)
|
||||
```
|
||||
|
||||
### Pattern 2: Compound Name → All Database IDs
|
||||
|
||||
```python
|
||||
from bioservices import KEGG, UniChem, ChEBI
|
||||
|
||||
def compound_name_to_ids(compound_name):
|
||||
"""Search compound and get all database IDs."""
|
||||
k = KEGG()
|
||||
|
||||
# Search KEGG
|
||||
results = k.find("compound", compound_name)
|
||||
if not results:
|
||||
return None
|
||||
|
||||
# Extract KEGG ID
|
||||
kegg_id = results.strip().split("\n")[0].split("\t")[0].replace("cpd:", "")
|
||||
|
||||
# Get KEGG entry for ChEBI
|
||||
entry = k.get(f"cpd:{kegg_id}")
|
||||
chebi_id = None
|
||||
for line in entry.split("\n"):
|
||||
if "ChEBI:" in line:
|
||||
parts = line.split("ChEBI:")
|
||||
if len(parts) > 1:
|
||||
chebi_id = parts[1].strip().split()[0]
|
||||
break
|
||||
|
||||
# Get ChEMBL from UniChem
|
||||
u = UniChem()
|
||||
try:
|
||||
chembl_id = u.get_compound_id_from_kegg(kegg_id)
|
||||
except:
|
||||
chembl_id = None
|
||||
|
||||
return {
|
||||
'kegg': kegg_id,
|
||||
'chebi': chebi_id,
|
||||
'chembl': chembl_id
|
||||
}
|
||||
|
||||
# Usage
|
||||
ids = compound_name_to_ids("Geldanamycin")
|
||||
print(ids)
|
||||
```
|
||||
|
||||
### Pattern 3: Batch ID Conversion with Error Handling
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
def safe_batch_mapping(ids, from_db, to_db, chunk_size=100):
|
||||
"""Safely map IDs with error handling and chunking."""
|
||||
u = UniProt()
|
||||
all_results = {}
|
||||
|
||||
for i in range(0, len(ids), chunk_size):
|
||||
chunk = ids[i:i+chunk_size]
|
||||
query = ",".join(chunk)
|
||||
|
||||
try:
|
||||
results = u.mapping(fr=from_db, to=to_db, query=query)
|
||||
all_results.update(results)
|
||||
print(f"✓ Processed {min(i+chunk_size, len(ids))}/{len(ids)}")
|
||||
|
||||
except Exception as e:
|
||||
print(f"✗ Error at chunk {i}: {e}")
|
||||
|
||||
# Try individual IDs in failed chunk
|
||||
for single_id in chunk:
|
||||
try:
|
||||
result = u.mapping(fr=from_db, to=to_db, query=single_id)
|
||||
all_results.update(result)
|
||||
except:
|
||||
all_results[single_id] = None
|
||||
|
||||
return all_results
|
||||
|
||||
# Usage
|
||||
uniprot_ids = ["P43403", "P04637", "P53779", "INVALID123"]
|
||||
mapping = safe_batch_mapping(uniprot_ids, "UniProtKB_AC-ID", "KEGG")
|
||||
```
|
||||
|
||||
### Pattern 4: Multi-Hop Mapping
|
||||
|
||||
Sometimes you need to map through intermediate databases:
|
||||
|
||||
```python
|
||||
from bioservices import UniProt
|
||||
|
||||
def multi_hop_mapping(gene_symbol, organism="9606"):
|
||||
"""Gene symbol → UniProt → KEGG → Pathways."""
|
||||
u = UniProt()
|
||||
k = KEGG()
|
||||
|
||||
# Step 1: Gene symbol → UniProt
|
||||
query = f"gene:{gene_symbol} AND organism:{organism}"
|
||||
result = u.search(query, frmt="tab", columns="id")
|
||||
|
||||
lines = result.strip().split("\n")
|
||||
if len(lines) < 2:
|
||||
return None
|
||||
|
||||
uniprot_id = lines[1].split("\t")[0]
|
||||
|
||||
# Step 2: UniProt → KEGG
|
||||
kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
|
||||
if not kegg_mapping or uniprot_id not in kegg_mapping:
|
||||
return None
|
||||
|
||||
kegg_id = kegg_mapping[uniprot_id][0]
|
||||
|
||||
# Step 3: KEGG → Pathways
|
||||
organism_code, gene_id = kegg_id.split(":")
|
||||
pathways = k.get_pathway_by_gene(gene_id, organism_code)
|
||||
|
||||
return {
|
||||
'gene': gene_symbol,
|
||||
'uniprot': uniprot_id,
|
||||
'kegg': kegg_id,
|
||||
'pathways': pathways
|
||||
}
|
||||
|
||||
# Usage
|
||||
result = multi_hop_mapping("TP53")
|
||||
print(result)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Issue 1: No Mapping Found
|
||||
|
||||
**Symptom:** Mapping returns empty or None
|
||||
|
||||
**Solutions:**
|
||||
1. Verify source ID exists in source database
|
||||
2. Check database code spelling
|
||||
3. Try reverse mapping
|
||||
4. Some IDs may not have mappings in all databases
|
||||
|
||||
```python
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
|
||||
|
||||
if not result or 'P43403' not in result:
|
||||
print("No mapping found. Try:")
|
||||
print("1. Verify ID exists: u.search('P43403')")
|
||||
print("2. Check if protein has KEGG annotation")
|
||||
```
|
||||
|
||||
### Issue 2: Too Many IDs in Batch
|
||||
|
||||
**Symptom:** Batch mapping fails or times out
|
||||
|
||||
**Solution:** Split into smaller chunks
|
||||
|
||||
```python
|
||||
def chunked_mapping(ids, from_db, to_db, chunk_size=50):
|
||||
all_results = {}
|
||||
|
||||
for i in range(0, len(ids), chunk_size):
|
||||
chunk = ids[i:i+chunk_size]
|
||||
result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
|
||||
all_results.update(result)
|
||||
|
||||
return all_results
|
||||
```
|
||||
|
||||
### Issue 3: Multiple Target IDs
|
||||
|
||||
**Symptom:** One source ID maps to multiple target IDs
|
||||
|
||||
**Solution:** Handle as list
|
||||
|
||||
```python
|
||||
result = u.mapping(fr="UniProtKB_AC-ID", to="PDB", query="P04637")
|
||||
# Result: {'P04637': ['1A1U', '1AIE', '1C26', ...]}
|
||||
|
||||
pdb_ids = result['P04637']
|
||||
print(f"Found {len(pdb_ids)} PDB structures")
|
||||
|
||||
for pdb_id in pdb_ids:
|
||||
print(f" {pdb_id}")
|
||||
```
|
||||
|
||||
### Issue 4: Organism Ambiguity
|
||||
|
||||
**Symptom:** Gene symbol maps to multiple organisms
|
||||
|
||||
**Solution:** Always specify organism in searches
|
||||
|
||||
```python
|
||||
# Bad: Ambiguous
|
||||
result = u.search("gene:TP53") # Many organisms have TP53
|
||||
|
||||
# Good: Specific
|
||||
result = u.search("gene:TP53 AND organism:9606") # Human only
|
||||
```
|
||||
|
||||
### Issue 5: Deprecated IDs
|
||||
|
||||
**Symptom:** Old database IDs don't map
|
||||
|
||||
**Solution:** Update to current IDs first
|
||||
|
||||
```python
|
||||
# Check if ID is current
|
||||
entry = u.retrieve("P43403", frmt="txt")
|
||||
|
||||
# Look for secondary accessions
|
||||
for line in entry.split("\n"):
|
||||
if line.startswith("AC"):
|
||||
print(line) # Shows primary and secondary accessions
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always validate inputs** before batch processing
|
||||
2. **Handle None/empty results** gracefully
|
||||
3. **Use chunking** for large ID lists (50-100 per chunk)
|
||||
4. **Cache results** for repeated queries
|
||||
5. **Specify organism** when possible to avoid ambiguity
|
||||
6. **Log failures** in batch processing for later retry
|
||||
7. **Add delays** between large batches to respect API limits
|
||||
|
||||
```python
|
||||
import time
|
||||
|
||||
def polite_batch_mapping(ids, from_db, to_db):
|
||||
"""Batch mapping with rate limiting."""
|
||||
results = {}
|
||||
|
||||
for i in range(0, len(ids), 50):
|
||||
chunk = ids[i:i+50]
|
||||
result = u.mapping(fr=from_db, to=to_db, query=",".join(chunk))
|
||||
results.update(result)
|
||||
|
||||
time.sleep(0.5) # Be nice to the API
|
||||
|
||||
return results
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
For complete working examples, see:
|
||||
- `scripts/batch_id_converter.py`: Command-line batch conversion tool
|
||||
- `workflow_patterns.md`: Integration into larger workflows
|
||||
Reference in New Issue
Block a user