# BioServices: Common Workflow Patterns

This document describes detailed multi-step workflows for common bioinformatics tasks using BioServices.

## Table of Contents

1. [Complete Protein Analysis Pipeline](#complete-protein-analysis-pipeline)
2. [Pathway Discovery and Network Analysis](#pathway-discovery-and-network-analysis)
3. [Compound Multi-Database Search](#compound-multi-database-search)
4. [Batch Identifier Conversion](#batch-identifier-conversion)
5. [Gene Functional Annotation](#gene-functional-annotation)
6. [Protein Interaction Network Construction](#protein-interaction-network-construction)
7. [Multi-Organism Comparative Analysis](#multi-organism-comparative-analysis)

---

## Complete Protein Analysis Pipeline

**Goal:** Given a protein name, retrieve sequence, find homologs, identify pathways, and discover interactions.

**Example:** Analyzing human ZAP70 protein

### Step 1: UniProt Search and Identifier Retrieval

```python
from bioservices import UniProt

u = UniProt(verbose=False)

# Search for protein by name
query = "ZAP70_HUMAN"
results = u.search(query, frmt="tab", columns="id,genes,organism,length")

# Parse results
lines = results.strip().split("\n")
if len(lines) > 1:
    header = lines[0]
    data = lines[1].split("\t")
    uniprot_id = data[0]   # e.g., P43403
    gene_names = data[1]   # e.g., ZAP70

    print(f"UniProt ID: {uniprot_id}")
    print(f"Gene names: {gene_names}")
```

**Output:**
- UniProt accession: P43403
- Gene name: ZAP70

### Step 2: Sequence Retrieval

```python
# Retrieve FASTA sequence
sequence = u.retrieve(uniprot_id, frmt="fasta")
print(sequence)

# Extract just the sequence string (remove header)
seq_lines = sequence.split("\n")
sequence_only = "".join(seq_lines[1:])  # Skip FASTA header
```

**Output:** Complete protein sequence in FASTA format

### Step 3: BLAST Similarity Search

```python
from bioservices import NCBIblast
import time

s = NCBIblast(verbose=False)

# Submit BLAST job
jobid = s.run(
    program="blastp",
    sequence=sequence_only,
    stype="protein",
    database="uniprotkb",
    email="your.email@example.com"
)

print(f"BLAST Job ID: {jobid}")

# Wait for completion
while True:
    status = s.getStatus(jobid)
    print(f"Status: {status}")
    if status == "FINISHED":
        break
    elif status == "ERROR":
        print("BLAST job failed")
        break
    time.sleep(5)

# Retrieve results
if status == "FINISHED":
    blast_results = s.getResult(jobid, "out")
    print(blast_results[:500])  # Print first 500 characters
```

**Output:** BLAST alignment results showing similar proteins

### Step 4: KEGG Pathway Discovery

```python
from bioservices import KEGG

k = KEGG()

# Get KEGG gene ID from UniProt mapping
kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
print(f"KEGG mapping: {kegg_mapping}")

# Extract KEGG gene ID (e.g., hsa:7535)
kegg_gene_id = None
if kegg_mapping:
    kegg_gene_id = kegg_mapping[uniprot_id][0] if uniprot_id in kegg_mapping else None

if kegg_gene_id:
    # Find pathways containing this gene
    organism = kegg_gene_id.split(":")[0]  # e.g., "hsa"
    gene_id = kegg_gene_id.split(":")[1]   # e.g., "7535"

    pathways = k.get_pathway_by_gene(gene_id, organism)
    print(f"Found {len(pathways)} pathways:")

    # Get pathway names
    for pathway_id in pathways:
        pathway_info = k.get(pathway_id)
        # Parse NAME line
        for line in pathway_info.split("\n"):
            if line.startswith("NAME"):
                pathway_name = line.replace("NAME", "").strip()
                print(f"  {pathway_id}: {pathway_name}")
                break
```

**Output:**
- path:hsa04064 - NF-kappa B signaling pathway
- path:hsa04650 - Natural killer cell mediated cytotoxicity
- path:hsa04660 - T cell receptor signaling pathway
- path:hsa04662 - B cell receptor signaling pathway

### Step 5: Protein-Protein Interactions

```python
from bioservices import PSICQUIC

p = PSICQUIC()

# Query MINT database for human (taxid:9606) interactions
query = "ZAP70 AND species:9606"
interactions = p.query("mint", query)

# Parse PSI-MI TAB format results
if interactions:
    interaction_lines = interactions.strip().split("\n")
    print(f"Found {len(interaction_lines)} interactions")

    # Print first few interactions
    for line in interaction_lines[:5]:
        fields = line.split("\t")
        protein_a = fields[0]
        protein_b = fields[1]
        interaction_type = fields[11]
        print(f"  {protein_a} - {protein_b}: {interaction_type}")
```

**Output:** List of proteins that interact with ZAP70

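Each PSI-MI TAB line begins with the identifiers of interactors A and B, typically prefixed with the source database (e.g. `uniprotkb:P43403`). A minimal sketch, assuming that column layout, for collecting the unique partner accessions from the query above:

```python
def mitab_partners(mitab_text, focus=None):
    """Collect unique interactor accessions from PSI-MI TAB lines."""
    partners = set()
    for line in mitab_text.strip().split("\n"):
        fields = line.split("\t")
        if len(fields) < 2:
            continue
        for raw in (fields[0], fields[1]):
            accession = raw.split(":", 1)[-1]  # "uniprotkb:P43403" -> "P43403"
            if accession != focus:
                partners.add(accession)
    return sorted(partners)

# partners = mitab_partners(interactions, focus=uniprot_id)
```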

### Step 6: Gene Ontology Annotation

```python
from bioservices import QuickGO

g = QuickGO()

# Get GO annotations for protein
annotations = g.Annotation(protein=uniprot_id, format="tsv")

if annotations:
    # Parse TSV results
    lines = annotations.strip().split("\n")
    print(f"Found {len(lines) - 1} GO annotations")

    # Display first few annotations
    for line in lines[1:6]:  # Skip header
        fields = line.split("\t")
        go_id = fields[6]
        go_term = fields[7]
        go_aspect = fields[8]
        print(f"  {go_id}: {go_term} [{go_aspect}]")
```

**Output:** GO terms annotating ZAP70 function, process, and location

### Complete Pipeline Summary

**Inputs:** Protein name (e.g., "ZAP70_HUMAN")

**Outputs:**
1. UniProt accession and gene name
2. Protein sequence (FASTA)
3. Similar proteins (BLAST results)
4. Biological pathways (KEGG)
5. Interaction partners (PSICQUIC)
6. Functional annotations (GO terms)

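The script referenced below wraps these steps; as a minimal sketch, the values gathered above can be collected into a single JSON record (variable names refer to the earlier snippets, and the guards assume a step may have returned nothing):

```python
import json

# Collect the results from Steps 1-6 into one record (illustrative only)
summary = {
    "uniprot_id": uniprot_id,
    "gene_names": gene_names,
    "sequence": sequence_only,
    "kegg_pathways": list(pathways) if kegg_gene_id else [],
    "num_interactions": len(interactions.strip().split("\n")) if interactions else 0,
    "num_go_annotations": len(annotations.strip().split("\n")) - 1 if annotations else 0,
}

with open(f"{uniprot_id}_summary.json", "w") as f:
    json.dump(summary, f, indent=2)
```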

**Script:** `scripts/protein_analysis_workflow.py` automates this entire pipeline.

---

## Pathway Discovery and Network Analysis

**Goal:** Analyze all pathways for an organism and extract protein interaction networks.

**Example:** Human (hsa) pathway analysis

### Step 1: Get All Pathways for Organism

```python
from bioservices import KEGG

k = KEGG()
k.organism = "hsa"

# Get all pathway IDs
pathway_ids = k.pathwayIds
print(f"Found {len(pathway_ids)} pathways for {k.organism}")

# Display first few
for pid in pathway_ids[:10]:
    print(f"  {pid}")
```

**Output:** List of ~300 human pathways

### Step 2: Parse Pathway for Interactions

```python
# Analyze specific pathway
pathway_id = "hsa04660"  # T cell receptor signaling

# Get KGML data
kgml_data = k.parse_kgml_pathway(pathway_id)

# Extract entries (genes/proteins)
entries = kgml_data['entries']
print(f"Pathway contains {len(entries)} entries")

# Extract relations (interactions)
relations = kgml_data['relations']
print(f"Found {len(relations)} relations")

# Analyze relation types
relation_types = {}
for rel in relations:
    rel_type = rel.get('name', 'unknown')
    relation_types[rel_type] = relation_types.get(rel_type, 0) + 1

print("\nRelation type distribution:")
for rel_type, count in sorted(relation_types.items()):
    print(f"  {rel_type}: {count}")
```

**Output:**
- Entry count (genes/proteins in pathway)
- Relation count (interactions)
- Distribution of interaction types (activation, inhibition, binding, etc.)

### Step 3: Extract Protein-Protein Interactions

```python
# Filter for specific interaction types
pprel_interactions = [
    rel for rel in relations
    if rel.get('link') == 'PPrel'  # Protein-protein relation
]

print(f"Found {len(pprel_interactions)} protein-protein interactions")

# Extract interaction details
for rel in pprel_interactions[:10]:
    entry1 = rel['entry1']
    entry2 = rel['entry2']
    interaction_type = rel.get('name', 'unknown')

    print(f"  {entry1} -> {entry2}: {interaction_type}")
```

**Output:** Directed protein-protein interactions with types

### Step 4: Convert to Network Format (SIF)

```python
# Get Simple Interaction Format (filters for key interactions)
sif_data = k.pathway2sif(pathway_id)

# SIF format: source, interaction_type, target
print("\nSimple Interaction Format:")
for interaction in sif_data[:10]:
    print(f"  {interaction}")
```

**Output:** Network edges suitable for Cytoscape or NetworkX

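As a minimal sketch, assuming each record unpacks to (source, interaction_type, target) as described above, the SIF edges can be loaded directly into a NetworkX graph:

```python
import networkx as nx

# Build a directed graph from the SIF records returned by pathway2sif()
G = nx.DiGraph()
for source, interaction, target in sif_data:
    G.add_edge(source, target, interaction=interaction)

print(f"{G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
nx.write_gml(G, "hsa04660_sif.gml")  # e.g. for import into Cytoscape
```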

### Step 5: Batch Analysis of All Pathways

```python
import pandas as pd

# Analyze all pathways (this takes time!)
all_results = []

for pathway_id in pathway_ids[:50]:  # Limit for example
    try:
        kgml = k.parse_kgml_pathway(pathway_id)

        result = {
            'pathway_id': pathway_id,
            'num_entries': len(kgml.get('entries', [])),
            'num_relations': len(kgml.get('relations', []))
        }

        all_results.append(result)

    except Exception as e:
        print(f"Error parsing {pathway_id}: {e}")

# Create DataFrame
df = pd.DataFrame(all_results)
print(df.describe())

# Find largest pathways
print("\nLargest pathways:")
print(df.nlargest(10, 'num_entries')[['pathway_id', 'num_entries', 'num_relations']])
```

**Output:** Statistical summary of pathway sizes and interaction densities

**Script:** `scripts/pathway_analysis.py` implements this workflow with export options.

---

## Compound Multi-Database Search

**Goal:** Search for a compound by name and retrieve identifiers across KEGG, ChEBI, and ChEMBL.

**Example:** Geldanamycin (antibiotic)

### Step 1: Search KEGG Compound Database

```python
from bioservices import KEGG

k = KEGG()

# Search by compound name
compound_name = "Geldanamycin"
results = k.find("compound", compound_name)

print(f"KEGG search results for '{compound_name}':")
print(results)

# Extract compound ID
if results:
    lines = results.strip().split("\n")
    if lines:
        kegg_id = lines[0].split("\t")[0]            # e.g., cpd:C11222
        kegg_id_clean = kegg_id.replace("cpd:", "")  # C11222
        print(f"\nKEGG Compound ID: {kegg_id_clean}")
```

**Output:** KEGG ID (e.g., C11222)

### Step 2: Get KEGG Entry with Database Links

```python
# Retrieve compound entry
compound_entry = k.get(kegg_id)

# Parse entry for database links
chebi_id = None
for line in compound_entry.split("\n"):
    if "ChEBI:" in line:
        # Extract ChEBI ID
        parts = line.split("ChEBI:")
        if len(parts) > 1:
            chebi_id = parts[1].strip().split()[0]
            print(f"ChEBI ID: {chebi_id}")
            break

# Display entry snippet
print("\nKEGG Entry (first 500 chars):")
print(compound_entry[:500])
```

**Output:** ChEBI ID (e.g., 5292) and compound information

### Step 3: Cross-Reference to ChEMBL via UniChem

```python
from bioservices import UniChem

u = UniChem()

# Convert KEGG → ChEMBL
try:
    chembl_id = u.get_compound_id_from_kegg(kegg_id_clean)
    print(f"ChEMBL ID: {chembl_id}")
except Exception as e:
    print(f"UniChem lookup failed: {e}")
    chembl_id = None
```

**Output:** ChEMBL ID (e.g., CHEMBL278315)

### Step 4: Retrieve Detailed Information

```python
# Get ChEBI information
if chebi_id:
    from bioservices import ChEBI
    c = ChEBI()

    try:
        chebi_entity = c.getCompleteEntity(f"CHEBI:{chebi_id}")
        print(f"\nChEBI Formula: {chebi_entity.Formulae}")
        print(f"ChEBI Name: {chebi_entity.chebiAsciiName}")
    except Exception as e:
        print(f"ChEBI lookup failed: {e}")

# Get ChEMBL information
if chembl_id:
    from bioservices import ChEMBL
    chembl = ChEMBL()

    try:
        chembl_compound = chembl.get_compound_by_chemblId(chembl_id)
        print(f"\nChEMBL Molecular Weight: {chembl_compound['molecule_properties']['full_mwt']}")
        print(f"ChEMBL SMILES: {chembl_compound['molecule_structures']['canonical_smiles']}")
    except Exception as e:
        print(f"ChEMBL lookup failed: {e}")
```

**Output:** Chemical properties from multiple databases

### Complete Compound Workflow Summary

**Input:** Compound name (e.g., "Geldanamycin")

**Output:**
- KEGG ID: C11222
- ChEBI ID: 5292
- ChEMBL ID: CHEMBL278315
- Chemical formula
- Molecular weight
- SMILES structure

**Script:** `scripts/compound_cross_reference.py` automates this workflow.

---

## Batch Identifier Conversion

**Goal:** Convert multiple identifiers between databases efficiently.

### Batch UniProt → KEGG Mapping

```python
from bioservices import UniProt

u = UniProt()

# List of UniProt IDs
uniprot_ids = ["P43403", "P04637", "P53779", "Q9Y6K9"]

# Batch mapping (comma-separated)
query_string = ",".join(uniprot_ids)
results = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=query_string)

print("UniProt → KEGG mapping:")
for uniprot_id, kegg_ids in results.items():
    print(f"  {uniprot_id} → {kegg_ids}")
```

**Output:** Dictionary mapping each UniProt ID to KEGG gene IDs

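The exact return structure can vary between releases, but with the usage shown above the mapping behaves like a dictionary of lists, one entry per source identifier:

```python
# Illustrative shape only (hsa:7535 is the ZAP70 example used elsewhere
# in this document; other values depend on the current UniProt release)
example_mapping = {
    "P43403": ["hsa:7535"],
    # ... one entry per successfully mapped accession
}
```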

### Batch File Processing

```python
import csv

from bioservices import UniProt

# Read identifiers from file
def read_ids_from_file(filename):
    with open(filename, 'r') as f:
        ids = [line.strip() for line in f if line.strip()]
    return ids

# Process in chunks (API limits)
def batch_convert(ids, from_db, to_db, chunk_size=100):
    u = UniProt()
    all_results = {}

    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i+chunk_size]
        query = ",".join(chunk)

        try:
            results = u.mapping(fr=from_db, to=to_db, query=query)
            all_results.update(results)
            print(f"Processed {min(i+chunk_size, len(ids))}/{len(ids)}")
        except Exception as e:
            print(f"Error processing chunk {i}: {e}")

    return all_results

# Write results to CSV
def write_mapping_to_csv(mapping, output_file):
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Source_ID', 'Target_IDs'])

        for source_id, target_ids in mapping.items():
            target_str = ";".join(target_ids) if target_ids else "No mapping"
            writer.writerow([source_id, target_str])

# Example usage
input_ids = read_ids_from_file("uniprot_ids.txt")
mapping = batch_convert(input_ids, "UniProtKB_AC-ID", "KEGG", chunk_size=50)
write_mapping_to_csv(mapping, "uniprot_to_kegg_mapping.csv")
```

**Script:** `scripts/batch_id_converter.py` provides command-line batch conversion.

---

## Gene Functional Annotation

**Goal:** Retrieve comprehensive functional information for a gene.

### Workflow

```python
from bioservices import UniProt, KEGG, QuickGO

# Gene of interest
gene_symbol = "TP53"

# 1. Find UniProt entry
u = UniProt()
search_results = u.search(f"gene:{gene_symbol} AND organism:9606",
                          frmt="tab",
                          columns="id,genes,protein names")

# Extract UniProt ID
lines = search_results.strip().split("\n")
if len(lines) > 1:
    uniprot_id = lines[1].split("\t")[0]
    protein_name = lines[1].split("\t")[2]
    print(f"Protein: {protein_name}")
    print(f"UniProt ID: {uniprot_id}")

# 2. Get KEGG pathways
kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
if uniprot_id in kegg_mapping:
    kegg_id = kegg_mapping[uniprot_id][0]

    k = KEGG()
    organism, gene_id = kegg_id.split(":")
    pathways = k.get_pathway_by_gene(gene_id, organism)

    print(f"\nPathways ({len(pathways)}):")
    for pathway_id in list(pathways)[:5]:
        print(f"  {pathway_id}")

# 3. Get GO annotations
g = QuickGO()
go_annotations = g.Annotation(protein=uniprot_id, format="tsv")

if go_annotations:
    lines = go_annotations.strip().split("\n")
    print(f"\nGO Annotations ({len(lines) - 1} total):")

    # Group by aspect
    aspects = {"P": [], "F": [], "C": []}
    for line in lines[1:]:
        fields = line.split("\t")
        go_aspect = fields[8]  # P, F, or C
        go_term = fields[7]
        aspects[go_aspect].append(go_term)

    print(f"  Biological Process: {len(aspects['P'])} terms")
    print(f"  Molecular Function: {len(aspects['F'])} terms")
    print(f"  Cellular Component: {len(aspects['C'])} terms")

# 4. Get protein sequence features
full_entry = u.retrieve(uniprot_id, frmt="txt")
print("\nProtein Features:")
for line in full_entry.split("\n"):
    if line.startswith("FT") and "DOMAIN" in line:
        print(f"  {line}")
```

**Output:** Comprehensive annotation including name, pathways, GO terms, and features.

---

## Protein Interaction Network Construction

**Goal:** Build a protein-protein interaction network for a set of proteins.

### Workflow

```python
from bioservices import PSICQUIC
import networkx as nx

# Proteins of interest
proteins = ["ZAP70", "LCK", "LAT", "SLP76", "PLCg1"]

# Initialize PSICQUIC
p = PSICQUIC()

# Build network
G = nx.Graph()

for protein in proteins:
    # Query for human interactions
    query = f"{protein} AND species:9606"

    try:
        results = p.query("intact", query)

        if results:
            lines = results.strip().split("\n")

            for line in lines:
                fields = line.split("\t")
                # Extract protein names (simplified)
                protein_a = fields[4].split(":")[1] if ":" in fields[4] else fields[4]
                protein_b = fields[5].split(":")[1] if ":" in fields[5] else fields[5]

                # Add edge
                G.add_edge(protein_a, protein_b)

    except Exception as e:
        print(f"Error querying {protein}: {e}")

print(f"Network: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

# Analyze network
print("\nNode degrees:")
for node in proteins:
    if node in G:
        print(f"  {node}: {G.degree(node)} interactions")

# Export for visualization
nx.write_gml(G, "protein_network.gml")
print("\nNetwork exported to protein_network.gml")
```

**Output:** NetworkX graph exported in GML format for Cytoscape visualization.

---

## Multi-Organism Comparative Analysis

**Goal:** Compare pathway or gene presence across multiple organisms.

### Workflow

```python
from bioservices import KEGG

k = KEGG()

# Organisms to compare
organisms = ["hsa", "mmu", "dme", "sce"]  # Human, mouse, fly, yeast
organism_names = {
    "hsa": "Human",
    "mmu": "Mouse",
    "dme": "Fly",
    "sce": "Yeast"
}

# Pathway of interest
pathway_name = "cell cycle"

print(f"Searching for '{pathway_name}' pathway across organisms:\n")

for org in organisms:
    k.organism = org

    # Search pathways
    results = k.lookfor_pathway(pathway_name)

    print(f"{organism_names[org]} ({org}):")
    if results:
        for pathway in results[:3]:  # Show first 3
            print(f"  {pathway}")
    else:
        print("  No matches found")
    print()
```

**Output:** Pathway presence/absence across organisms.

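To condense the same `lookfor_pathway` calls into a single table, a small follow-up sketch with pandas (re-using `organisms`, `organism_names`, and `pathway_name` from the loop above):

```python
import pandas as pd

# True if at least one pathway name matches the query for that organism
presence = {}
for org in organisms:
    k.organism = org
    matches = k.lookfor_pathway(pathway_name)
    presence[organism_names[org]] = bool(matches)

print(pd.Series(presence, name=pathway_name))
```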

---

## Best Practices for Workflows

### 1. Error Handling

Always wrap service calls:

```python
try:
    result = service.method(params)
    if result:
        # Process
        pass
except Exception as e:
    print(f"Error: {e}")
```

### 2. Rate Limiting

Add delays for batch processing:

```python
import time

for item in items:
    result = service.query(item)
    time.sleep(0.5)  # 500 ms delay between requests
```
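
Rate limiting and error handling can be combined in a small retry helper with exponential backoff. This is a generic sketch, not a BioServices API:

```python
import time

def call_with_retry(func, *args, retries=3, base_delay=1.0, **kwargs):
    """Retry a service call, doubling the delay after each failure."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception as e:
            if attempt == retries - 1:
                raise
            wait = base_delay * (2 ** attempt)
            print(f"Attempt {attempt + 1} failed ({e}); retrying in {wait:.0f}s")
            time.sleep(wait)

# Example: result = call_with_retry(u.retrieve, "P43403", frmt="fasta")
```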

### 3. Result Validation

Check for empty or unexpected results:

```python
if result and len(result) > 0:
    # Process
    pass
else:
    print("No results returned")
```

### 4. Progress Reporting

For long workflows:

```python
total = len(items)
for i, item in enumerate(items):
    # Process item
    if (i + 1) % 10 == 0:
        print(f"Processed {i + 1}/{total}")
```

### 5. Data Export

Save intermediate results:

```python
import json

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)
```

---

## Integration with Other Tools

### BioPython Integration

```python
from bioservices import UniProt
from Bio import SeqIO
from io import StringIO

u = UniProt()
fasta_data = u.retrieve("P43403", "fasta")

# Parse with BioPython
fasta_io = StringIO(fasta_data)
record = SeqIO.read(fasta_io, "fasta")

print(f"Sequence length: {len(record.seq)}")
print(f"Description: {record.description}")
```

### Pandas Integration

```python
from bioservices import UniProt
import pandas as pd
from io import StringIO

u = UniProt()
results = u.search("zap70", frmt="tab", columns="id,genes,length,organism")

# Load into DataFrame
df = pd.read_csv(StringIO(results), sep="\t")
print(df.head())
print(df.describe())
```

### NetworkX Integration

See Protein Interaction Network Construction above.

---

For complete working examples, see the scripts in the `scripts/` directory.