---
name: bioservices
description: "Primary Python tool for 40+ bioinformatics services. Preferred for multi-database workflows: UniProt, KEGG, ChEMBL, PubChem, Reactome, QuickGO. Unified API for queries, ID mapping, pathway analysis. For direct REST control, use individual database skills (uniprot-database, kegg-database)."
---

# BioServices

## Overview

BioServices is a Python package providing programmatic access to approximately 40 bioinformatics web services and databases. Use it to retrieve biological data, perform cross-database queries, map identifiers, analyze sequences, and integrate multiple biological resources in Python workflows. The package handles both REST and SOAP/WSDL protocols transparently.

## When to Use This Skill

This skill should be used when:
- Retrieving protein sequences, annotations, or structures from UniProt, PDB, Pfam
- Analyzing metabolic pathways and gene functions via KEGG or Reactome
- Searching compound databases (ChEBI, ChEMBL, PubChem) for chemical information
- Converting identifiers between different biological databases (KEGG↔UniProt, compound IDs)
- Running sequence similarity searches (BLAST, MUSCLE alignment)
- Querying gene ontology terms (QuickGO, GO annotations)
- Accessing protein-protein interaction data (PSICQUIC, IntactComplex)
- Mining genomic data (BioMart, ArrayExpress, ENA)
- Integrating data from multiple bioinformatics resources in a single workflow

## Core Capabilities

### 1. Protein Analysis

Retrieve protein information, sequences, and functional annotations:

```python
from bioservices import UniProt

u = UniProt(verbose=False)

# Search for a protein by entry name (format and column names follow the current UniProt REST API)
results = u.search("ZAP70_HUMAN", frmt="tsv", columns="accession,id,gene_names,organism_name")

# Retrieve FASTA sequence
sequence = u.retrieve("P43403", "fasta")

# Map identifiers between databases
kegg_ids = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
```

**Key methods:**
- `search()`: Query UniProt with flexible search terms
- `retrieve()`: Get protein entries in various formats (FASTA, XML, TSV)
- `mapping()`: Convert identifiers between databases

Reference: `references/services_reference.md` for complete UniProt API details.

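The tab-separated text returned by `search()` can be turned into records with the standard library alone. The following is a minimal sketch, assuming the response starts with a header row; `parse_tsv` and the sample text are illustrative and not part of BioServices.

```python
import csv
import io


def parse_tsv(text):
    """Convert tab-separated text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [dict(row) for row in reader]


# Hypothetical sample shaped like a UniProt TSV response (not real service output)
sample = "Entry\tEntry Name\tGene Names\nP43403\tZAP70_HUMAN\tZAP70 SRK\n"
rows = parse_tsv(sample)
print(rows[0]["Entry"])  # P43403
```

In a live session the same helper could be applied to the `results` string from `u.search(...)`, provided the call returned text rather than an error code.
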
### 2. Pathway Discovery and Analysis

Access KEGG pathway information for genes and organisms:

```python
from bioservices import KEGG

k = KEGG()
k.organism = "hsa"  # Set to human

# Search for organisms
k.lookfor_organism("droso")  # Find Drosophila species

# Find pathways by name
k.lookfor_pathway("B cell")  # Returns matching pathway IDs

# Get pathways containing specific genes
pathways = k.get_pathway_by_gene("7535", "hsa")  # ZAP70 gene

# Retrieve and parse pathway data
data = k.get("hsa04660")
parsed = k.parse(data)

# Extract pathway interactions
interactions = k.parse_kgml_pathway("hsa04660")
relations = interactions['relations']  # Protein-protein interactions

# Convert to Simple Interaction Format
sif_data = k.pathway2sif("hsa04660")
```

**Key methods:**
- `lookfor_organism()`, `lookfor_pathway()`: Search by name
- `get_pathway_by_gene()`: Find pathways containing genes
- `parse_kgml_pathway()`: Extract structured pathway data
- `pathway2sif()`: Get protein interaction networks

Reference: `references/workflow_patterns.md` for complete pathway analysis workflows.

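Once the relations of a pathway are available, a quick tally of interaction types is often the first summary worth computing. The sketch below assumes each relation is a dict carrying `entry1`, `entry2`, and `name` keys; that shape is an assumption about the parser output, and `relation_type_counts` is a hypothetical helper.

```python
from collections import Counter


def relation_type_counts(relations):
    """Count how often each relation name occurs in a list of relation dicts."""
    return Counter(rel.get("name", "unknown") for rel in relations)


# Hypothetical relations shaped like the assumed parser output
sample_relations = [
    {"entry1": "10", "entry2": "20", "name": "activation"},
    {"entry1": "20", "entry2": "30", "name": "phosphorylation"},
    {"entry1": "10", "entry2": "30", "name": "activation"},
]
counts = relation_type_counts(sample_relations)
print(counts.most_common(1))  # [('activation', 2)]
```
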
### 3. Compound Database Searches

Search and cross-reference compounds across multiple databases:

```python
from bioservices import KEGG, UniChem

k = KEGG()

# Search compounds by name
results = k.find("compound", "Geldanamycin")  # Returns cpd:C11222

# Get compound information with database links
compound_info = k.get("cpd:C11222")  # Includes ChEBI links

# Cross-reference KEGG → ChEMBL using UniChem
u = UniChem()
chembl_id = u.get_compound_id_from_kegg("C11222")  # Returns CHEMBL278315
```

**Common workflow:**
1. Search compound by name in KEGG
2. Extract KEGG compound ID
3. Use UniChem for KEGG → ChEMBL mapping
4. Read the ChEBI ID from the KEGG entry, which often lists it

Reference: `references/identifier_mapping.md` for complete cross-database mapping guide.

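Step 2 of the workflow above, extracting the compound ID, amounts to splitting the search output. This is a minimal sketch, assuming the search result is plain text with one `id<TAB>description` pair per line; `parse_kegg_find` and `strip_prefix` are hypothetical helpers, not BioServices functions.

```python
def parse_kegg_find(text):
    """Split find-style output into (entry_id, description) pairs."""
    pairs = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        entry_id, _, description = line.partition("\t")
        pairs.append((entry_id.strip(), description.strip()))
    return pairs


def strip_prefix(entry_id):
    """Drop a database prefix such as 'cpd:' from an entry ID."""
    return entry_id.split(":", 1)[-1]


# Sample line assumed from the Geldanamycin search shown earlier
sample = "cpd:C11222\tGeldanamycin\n"
entry_id, name = parse_kegg_find(sample)[0]
print(strip_prefix(entry_id))  # C11222
```
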
### 4. Sequence Analysis

Run BLAST searches and sequence alignments:

```python
from bioservices import NCBIblast, UniProt

# Query sequence: here the ZAP70 FASTA record from UniProt
protein_sequence = UniProt(verbose=False).retrieve("P43403", "fasta")

s = NCBIblast(verbose=False)

# Run BLASTP against UniProtKB
jobid = s.run(
    program="blastp",
    sequence=protein_sequence,
    stype="protein",
    database="uniprotkb",
    email="your.email@example.com"  # Required by the EBI service that hosts NCBI BLAST
)

# Check job status and retrieve results
s.getStatus(jobid)
results = s.getResult(jobid, "out")
```

**Note:** BLAST jobs are asynchronous. Poll `getStatus()` until the job has finished before calling `getResult()`.

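Because jobs run asynchronously, a small polling loop keeps the retrieval step from firing too early. The sketch below is generic: `wait_for_job` is a hypothetical helper, and the status strings it treats as "still running" are assumptions rather than a documented list.

```python
import time


def wait_for_job(get_status, jobid, interval=5.0, timeout=600.0, sleep=time.sleep):
    """Poll get_status(jobid) until the job leaves the running state or time runs out."""
    waited = 0.0
    while True:
        status = get_status(jobid)
        # Assumed names for in-progress states; anything else counts as final
        if status not in ("RUNNING", "QUEUED", "PENDING"):
            return status
        if waited >= timeout:
            raise TimeoutError(f"Job {jobid} still {status} after {timeout} s")
        sleep(interval)
        waited += interval


# Demonstration with a fake status source instead of a live service
fake_states = iter(["RUNNING", "RUNNING", "FINISHED"])
final_status = wait_for_job(lambda _: next(fake_states), "demo-job",
                            interval=0, sleep=lambda _: None)
print(final_status)  # FINISHED
```

With a live service this could be called as `wait_for_job(s.getStatus, jobid)`, fetching results only when the returned status indicates success.
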
### 5. Identifier Mapping

Convert identifiers between different biological databases:

```python
from bioservices import UniProt, UniChem

# UniProt mapping (many database pairs supported)
u = UniProt()
results = u.mapping(
    fr="UniProtKB_AC-ID",  # Source database
    to="KEGG",             # Target database
    query="P43403"         # Identifier(s) to convert
)

# KEGG gene ID → UniProt
kegg_to_uniprot = u.mapping(fr="KEGG", to="UniProtKB_AC-ID", query="hsa:7535")

# For compounds, use UniChem (separate variable so the UniProt client stays usable)
unichem = UniChem()
chembl_from_kegg = unichem.get_compound_id_from_kegg("C11222")
```

**Supported mappings (UniProt):**
- UniProtKB ↔ KEGG
- UniProtKB ↔ Ensembl
- UniProtKB ↔ PDB
- UniProtKB ↔ RefSeq
- And many more (see `references/identifier_mapping.md`)

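Mapping output is usually easier to work with once it is grouped by source identifier. The sketch below assumes the response is a dict with a `results` list of `{"from": ..., "to": ...}` items where `to` is a plain string; that shape is an assumption and may differ by target database or library version. `collapse_mapping` is a hypothetical helper.

```python
from collections import defaultdict


def collapse_mapping(response):
    """Group mapping results into {source_id: [target_id, ...]}."""
    grouped = defaultdict(list)
    for item in response.get("results", []):
        grouped[item["from"]].append(item["to"])
    return dict(grouped)


# Hypothetical response shaped like the assumed mapping output
sample = {"results": [{"from": "P43403", "to": "hsa:7535"}]}
print(collapse_mapping(sample))  # {'P43403': ['hsa:7535']}
```
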
### 6. Gene Ontology Queries

Access GO terms and annotations:

```python
from bioservices import QuickGO

g = QuickGO(verbose=False)

# Retrieve GO term information
term_info = g.Term("GO:0003824", frmt="obo")

# Search annotations
annotations = g.Annotation(protein="P43403", format="tsv")
```

### 7. Protein-Protein Interactions

Query interaction databases via PSICQUIC:

```python
from bioservices import PSICQUIC

s = PSICQUIC(verbose=False)

# Query specific database (e.g., MINT)
interactions = s.query("mint", "ZAP70 AND species:9606")

# List available interaction databases
databases = s.activeDBs
```

**Available databases:** MINT, IntAct, BioGRID, DIP, and 30+ others.

## Multi-Service Integration Workflows

BioServices excels at combining multiple services for comprehensive analysis. Common integration patterns:

### Complete Protein Analysis Pipeline

Execute a full protein characterization workflow:

```bash
python scripts/protein_analysis_workflow.py ZAP70_HUMAN your.email@example.com
```

This script demonstrates:
1. UniProt search for protein entry
2. FASTA sequence retrieval
3. BLAST similarity search
4. KEGG pathway discovery
5. PSICQUIC interaction mapping

### Pathway Network Analysis

Analyze all pathways for an organism:

```bash
python scripts/pathway_analysis.py hsa output_directory/
```

Extracts and analyzes:
- All pathway IDs for organism
- Protein-protein interactions per pathway
- Interaction type distributions
- Exports to CSV/SIF formats

### Cross-Database Compound Search

Map compound identifiers across databases:

```bash
python scripts/compound_cross_reference.py Geldanamycin
```

Retrieves:
- KEGG compound ID
- ChEBI identifier
- ChEMBL identifier
- Basic compound properties

### Batch Identifier Conversion

Convert multiple identifiers at once:

```bash
python scripts/batch_id_converter.py input_ids.txt --from UniProtKB_AC-ID --to KEGG
```

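Bulk conversion generally comes down to reading identifiers, dropping duplicates, and submitting them in batches. The sketch below illustrates that idea only; it is not the implementation of `batch_id_converter.py`, and `read_ids`, `chunked`, and the batch size are hypothetical.

```python
from itertools import islice


def read_ids(lines):
    """Return unique, non-empty identifiers in first-seen order."""
    seen = {}
    for line in lines:
        ident = line.strip()
        if ident and not ident.startswith("#"):  # ignore blanks and comments
            seen.setdefault(ident, None)
    return list(seen)


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while batch := list(islice(iterator, size)):
        yield batch


# Lines as they might be read from an input file (illustrative IDs)
ids = read_ids(["P43403\n", "P04637\n", "\n", "P43403\n", "P38398\n"])
batches = list(chunked(ids, 2))
print(batches)  # [['P43403', 'P04637'], ['P38398']]
```

Each batch could then be joined with commas and passed as the `query` argument of `mapping()`, assuming the method accepts a comma-separated string.
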
## Best Practices

### Output Format Handling

Different services return data in various formats:
- **XML**: Parse using BeautifulSoup (most SOAP services)
- **Tab-separated (TSV)**: Pandas DataFrames for tabular data
- **Dictionary/JSON**: Direct Python manipulation
- **FASTA**: BioPython integration for sequence analysis

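When BioPython is not installed, FASTA text can still be split into records with a few lines of standard Python. This is a minimal sketch with a hypothetical `parse_fasta` helper and a toy sequence; for real analysis a dedicated parser remains the better choice.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Toy record for illustration (not a real protein sequence)
sample = ">toy_protein example\nMKTAYIAK\nQRQISFVK\n"
header, seq = parse_fasta(sample)[0]
print(header, len(seq))
```
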
### Rate Limiting and Verbosity

Control API request behavior:

```python
from bioservices import KEGG

k = KEGG(verbose=False)  # Suppress HTTP request details
k.TIMEOUT = 30  # Adjust timeout for slow connections
```

### Error Handling

Wrap service calls in try-except blocks:

```python
from bioservices import UniProt

u = UniProt(verbose=False)

try:
    results = u.search("ambiguous_query")
    if results:
        # Process results
        pass
except Exception as e:
    print(f"Search failed: {e}")
```

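Transient network failures are common with remote services, so a retry wrapper can complement the try-except pattern. The sketch below uses exponential backoff; `call_with_retries` and its parameters are hypothetical and not part of BioServices.

```python
import time


def call_with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func, retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Demonstration with a function that fails twice before succeeding
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


result = call_with_retries(flaky, base_delay=0, sleep=lambda _: None)
print(result, calls["n"])  # ok 3
```

A service call could be wrapped as `call_with_retries(u.search, "ZAP70_HUMAN")`. Some calls may signal failure through their return value instead of raising, so the result is still worth checking.
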
### Organism Codes

Use standard organism abbreviations:
- `hsa`: Homo sapiens (human)
- `mmu`: Mus musculus (mouse)
- `dme`: Drosophila melanogaster
- `sce`: Saccharomyces cerevisiae (yeast)

List all organisms: `k.list("organism")` or `k.organismIds`

### Integration with Other Tools

BioServices works well with:
- **BioPython**: Sequence analysis on retrieved FASTA data
- **Pandas**: Tabular data manipulation
- **PyMOL**: 3D structure visualization (retrieve PDB IDs)
- **NetworkX**: Network analysis of pathway interactions
- **Galaxy**: Custom tool wrappers for workflow platforms

## Resources

### scripts/

Executable Python scripts demonstrating complete workflows:

- `protein_analysis_workflow.py`: End-to-end protein characterization
- `pathway_analysis.py`: KEGG pathway discovery and network extraction
- `compound_cross_reference.py`: Multi-database compound searching
- `batch_id_converter.py`: Bulk identifier mapping utility

Scripts can be executed directly or adapted for specific use cases.

### references/

Detailed documentation loaded as needed:

- `services_reference.md`: Comprehensive list of all 40+ services with methods
- `workflow_patterns.md`: Detailed multi-step analysis workflows
- `identifier_mapping.md`: Complete guide to cross-database ID conversion

Load references when working with specific services or complex integration tasks.

## Installation

```bash
uv pip install bioservices
```

Dependencies are installed automatically. The package is tested on Python 3.9-3.12.

## Additional Information

For detailed API documentation and advanced features, refer to:
- Official documentation: https://bioservices.readthedocs.io/
- Source code: https://github.com/cokelaer/bioservices
- Service-specific references in `references/services_reference.md`