Files
gh-k-dense-ai-claude-scient…/skills/bioservices/SKILL.md
2025-11-30 08:30:10 +08:00

356 lines
9.6 KiB
Markdown

---
name: bioservices
description: "Primary Python tool for 40+ bioinformatics services. Preferred for multi-database workflows: UniProt, KEGG, ChEMBL, PubChem, Reactome, QuickGO. Unified API for queries, ID mapping, pathway analysis. For direct REST control, use individual database skills (uniprot-database, kegg-database)."
---
# BioServices
## Overview
BioServices is a Python package providing programmatic access to approximately 40 bioinformatics web services and databases. Retrieve biological data, perform cross-database queries, map identifiers, analyze sequences, and integrate multiple biological resources in Python workflows. The package handles both REST and SOAP/WSDL protocols transparently.
## When to Use This Skill
This skill should be used when:
- Retrieving protein sequences, annotations, or structures from UniProt, PDB, Pfam
- Analyzing metabolic pathways and gene functions via KEGG or Reactome
- Searching compound databases (ChEBI, ChEMBL, PubChem) for chemical information
- Converting identifiers between different biological databases (KEGG↔UniProt, compound IDs)
- Running sequence similarity searches (BLAST, MUSCLE alignment)
- Querying gene ontology terms (QuickGO, GO annotations)
- Accessing protein-protein interaction data (PSICQUIC, IntactComplex)
- Mining genomic data (BioMart, ArrayExpress, ENA)
- Integrating data from multiple bioinformatics resources in a single workflow
## Core Capabilities
### 1. Protein Analysis
Retrieve protein information, sequences, and functional annotations:
```python
from bioservices import UniProt
u = UniProt(verbose=False)
# Search for protein by name
results = u.search("ZAP70_HUMAN", frmt="tab", columns="id,genes,organism")
# Retrieve FASTA sequence
sequence = u.retrieve("P43403", "fasta")
# Map identifiers between databases
kegg_ids = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query="P43403")
```
**Key methods:**
- `search()`: Query UniProt with flexible search terms
- `retrieve()`: Get protein entries in various formats (FASTA, XML, tab)
- `mapping()`: Convert identifiers between databases
Reference: `references/services_reference.md` for complete UniProt API details.
### 2. Pathway Discovery and Analysis
Access KEGG pathway information for genes and organisms:
```python
from bioservices import KEGG
k = KEGG()
k.organism = "hsa" # Set to human
# Search for organisms
k.lookfor_organism("droso") # Find Drosophila species
# Find pathways by name
k.lookfor_pathway("B cell") # Returns matching pathway IDs
# Get pathways containing specific genes
pathways = k.get_pathway_by_gene("7535", "hsa") # ZAP70 gene
# Retrieve and parse pathway data
data = k.get("hsa04660")
parsed = k.parse(data)
# Extract pathway interactions
interactions = k.parse_kgml_pathway("hsa04660")
relations = interactions['relations'] # Protein-protein interactions
# Convert to Simple Interaction Format
sif_data = k.pathway2sif("hsa04660")
```
**Key methods:**
- `lookfor_organism()`, `lookfor_pathway()`: Search by name
- `get_pathway_by_gene()`: Find pathways containing genes
- `parse_kgml_pathway()`: Extract structured pathway data
- `pathway2sif()`: Get protein interaction networks
Reference: `references/workflow_patterns.md` for complete pathway analysis workflows.
### 3. Compound Database Searches
Search and cross-reference compounds across multiple databases:
```python
from bioservices import KEGG, UniChem
k = KEGG()
# Search compounds by name
results = k.find("compound", "Geldanamycin") # Returns cpd:C11222
# Get compound information with database links
compound_info = k.get("cpd:C11222") # Includes ChEBI links
# Cross-reference KEGG → ChEMBL using UniChem
u = UniChem()
chembl_id = u.get_compound_id_from_kegg("C11222") # Returns CHEMBL278315
```
**Common workflow:**
1. Search compound by name in KEGG
2. Extract KEGG compound ID
3. Use UniChem for KEGG → ChEMBL mapping
4. ChEBI IDs are often provided in KEGG entries
Reference: `references/identifier_mapping.md` for complete cross-database mapping guide.
### 4. Sequence Analysis
Run BLAST searches and sequence alignments:
```python
from bioservices import NCBIblast
s = NCBIblast(verbose=False)
# Run BLASTP against UniProtKB
jobid = s.run(
program="blastp",
sequence=protein_sequence,
stype="protein",
database="uniprotkb",
email="your.email@example.com" # Required by NCBI
)
# Check job status and retrieve results
s.getStatus(jobid)
results = s.getResult(jobid, "out")
```
**Note:** BLAST jobs are asynchronous. Check status before retrieving results.
### 5. Identifier Mapping
Convert identifiers between different biological databases:
```python
from bioservices import UniProt, KEGG
# UniProt mapping (many database pairs supported)
u = UniProt()
results = u.mapping(
fr="UniProtKB_AC-ID", # Source database
to="KEGG", # Target database
query="P43403" # Identifier(s) to convert
)
# KEGG gene ID → UniProt
kegg_to_uniprot = u.mapping(fr="KEGG", to="UniProtKB_AC-ID", query="hsa:7535")
# For compounds, use UniChem
from bioservices import UniChem
u = UniChem()
chembl_from_kegg = u.get_compound_id_from_kegg("C11222")
```
**Supported mappings (UniProt):**
- UniProtKB ↔ KEGG
- UniProtKB ↔ Ensembl
- UniProtKB ↔ PDB
- UniProtKB ↔ RefSeq
- And many more (see `references/identifier_mapping.md`)
### 6. Gene Ontology Queries
Access GO terms and annotations:
```python
from bioservices import QuickGO
g = QuickGO(verbose=False)
# Retrieve GO term information
term_info = g.Term("GO:0003824", frmt="obo")
# Search annotations
annotations = g.Annotation(protein="P43403", format="tsv")
```
### 7. Protein-Protein Interactions
Query interaction databases via PSICQUIC:
```python
from bioservices import PSICQUIC
s = PSICQUIC(verbose=False)
# Query specific database (e.g., MINT)
interactions = s.query("mint", "ZAP70 AND species:9606")
# List available interaction databases
databases = s.activeDBs
```
**Available databases:** MINT, IntAct, BioGRID, DIP, and 30+ others.
## Multi-Service Integration Workflows
BioServices excels at combining multiple services for comprehensive analysis. Common integration patterns:
### Complete Protein Analysis Pipeline
Execute a full protein characterization workflow:
```bash
python scripts/protein_analysis_workflow.py ZAP70_HUMAN your.email@example.com
```
This script demonstrates:
1. UniProt search for protein entry
2. FASTA sequence retrieval
3. BLAST similarity search
4. KEGG pathway discovery
5. PSICQUIC interaction mapping
### Pathway Network Analysis
Analyze all pathways for an organism:
```bash
python scripts/pathway_analysis.py hsa output_directory/
```
Extracts and analyzes:
- All pathway IDs for organism
- Protein-protein interactions per pathway
- Interaction type distributions
- Exports to CSV/SIF formats
### Cross-Database Compound Search
Map compound identifiers across databases:
```bash
python scripts/compound_cross_reference.py Geldanamycin
```
Retrieves:
- KEGG compound ID
- ChEBI identifier
- ChEMBL identifier
- Basic compound properties
### Batch Identifier Conversion
Convert multiple identifiers at once:
```bash
python scripts/batch_id_converter.py input_ids.txt --from UniProtKB_AC-ID --to KEGG
```
## Best Practices
### Output Format Handling
Different services return data in various formats:
- **XML**: Parse using BeautifulSoup (most SOAP services)
- **Tab-separated (TSV)**: Pandas DataFrames for tabular data
- **Dictionary/JSON**: Direct Python manipulation
- **FASTA**: BioPython integration for sequence analysis
### Rate Limiting and Verbosity
Control API request behavior:
```python
from bioservices import KEGG
k = KEGG(verbose=False) # Suppress HTTP request details
k.TIMEOUT = 30 # Adjust timeout for slow connections
```
### Error Handling
Wrap service calls in try-except blocks:
```python
try:
results = u.search("ambiguous_query")
if results:
# Process results
pass
except Exception as e:
print(f"Search failed: {e}")
```
### Organism Codes
Use standard organism abbreviations:
- `hsa`: Homo sapiens (human)
- `mmu`: Mus musculus (mouse)
- `dme`: Drosophila melanogaster
- `sce`: Saccharomyces cerevisiae (yeast)
List all organisms: `k.list("organism")` or `k.organismIds`
### Integration with Other Tools
BioServices works well with:
- **BioPython**: Sequence analysis on retrieved FASTA data
- **Pandas**: Tabular data manipulation
- **PyMOL**: 3D structure visualization (retrieve PDB IDs)
- **NetworkX**: Network analysis of pathway interactions
- **Galaxy**: Custom tool wrappers for workflow platforms
## Resources
### scripts/
Executable Python scripts demonstrating complete workflows:
- `protein_analysis_workflow.py`: End-to-end protein characterization
- `pathway_analysis.py`: KEGG pathway discovery and network extraction
- `compound_cross_reference.py`: Multi-database compound searching
- `batch_id_converter.py`: Bulk identifier mapping utility
Scripts can be executed directly or adapted for specific use cases.
### references/
Detailed documentation loaded as needed:
- `services_reference.md`: Comprehensive list of all 40+ services with methods
- `workflow_patterns.md`: Detailed multi-step analysis workflows
- `identifier_mapping.md`: Complete guide to cross-database ID conversion
Load references when working with specific services or complex integration tasks.
## Installation
```bash
uv pip install bioservices
```
Dependencies are automatically managed. Package is tested on Python 3.9-3.12.
## Additional Information
For detailed API documentation and advanced features, refer to:
- Official documentation: https://bioservices.readthedocs.io/
- Source code: https://github.com/cokelaer/bioservices
- Service-specific references in `references/services_reference.md`