BioServices: Common Workflow Patterns

This document describes detailed multi-step workflows for common bioinformatics tasks using BioServices.

Table of Contents

  1. Complete Protein Analysis Pipeline
  2. Pathway Discovery and Network Analysis
  3. Compound Multi-Database Search
  4. Batch Identifier Conversion
  5. Gene Functional Annotation
  6. Protein Interaction Network Construction
  7. Multi-Organism Comparative Analysis

Complete Protein Analysis Pipeline

Goal: Given a protein name, retrieve sequence, find homologs, identify pathways, and discover interactions.

Example: Analyzing human ZAP70 protein

Step 1: UniProt Search and Identifier Retrieval

from bioservices import UniProt

u = UniProt(verbose=False)

# Search for protein by name
query = "ZAP70_HUMAN"
results = u.search(query, frmt="tab", columns="id,genes,organism,length")

# Parse results
lines = results.strip().split("\n")
if len(lines) > 1:
    header = lines[0]
    data = lines[1].split("\t")
    uniprot_id = data[0]  # e.g., P43403
    gene_names = data[1]  # e.g., ZAP70

    print(f"UniProt ID: {uniprot_id}")
    print(f"Gene names: {gene_names}")

Output:

  • UniProt accession: P43403
  • Gene name: ZAP70

Step 2: Sequence Retrieval

# Retrieve FASTA sequence
sequence = u.retrieve(uniprot_id, frmt="fasta")
print(sequence)

# Extract just the sequence string (remove header)
seq_lines = sequence.split("\n")
sequence_only = "".join(seq_lines[1:])  # Skip FASTA header

Output: Complete protein sequence in FASTA format

Step 3: BLAST Search for Similar Proteins

from bioservices import NCBIblast
import time

s = NCBIblast(verbose=False)

# Submit BLAST job
jobid = s.run(
    program="blastp",
    sequence=sequence_only,
    stype="protein",
    database="uniprotkb",
    email="your.email@example.com"
)

print(f"BLAST Job ID: {jobid}")

# Wait for completion
while True:
    status = s.getStatus(jobid)
    print(f"Status: {status}")
    if status == "FINISHED":
        break
    elif status == "ERROR":
        print("BLAST job failed")
        break
    time.sleep(5)

# Retrieve results
if status == "FINISHED":
    blast_results = s.getResult(jobid, "out")
    print(blast_results[:500])  # Print first 500 characters

Output: BLAST alignment results showing similar proteins

Step 4: KEGG Pathway Discovery

from bioservices import KEGG

k = KEGG()

# Get KEGG gene ID from UniProt mapping
kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
print(f"KEGG mapping: {kegg_mapping}")

# Extract KEGG gene ID (e.g., hsa:7535)
if kegg_mapping:
    kegg_gene_id = kegg_mapping[uniprot_id][0] if uniprot_id in kegg_mapping else None

    if kegg_gene_id:
        # Find pathways containing this gene
        organism = kegg_gene_id.split(":")[0]  # e.g., "hsa"
        gene_id = kegg_gene_id.split(":")[1]   # e.g., "7535"

        pathways = k.get_pathway_by_gene(gene_id, organism)
        print(f"Found {len(pathways)} pathways:")

        # Get pathway names
        for pathway_id in pathways:
            pathway_info = k.get(pathway_id)
            # Parse NAME line
            for line in pathway_info.split("\n"):
                if line.startswith("NAME"):
                    pathway_name = line.replace("NAME", "").strip()
                    print(f"  {pathway_id}: {pathway_name}")
                    break

Output:

  • path:hsa04064 - NF-kappa B signaling pathway
  • path:hsa04650 - Natural killer cell mediated cytotoxicity
  • path:hsa04660 - T cell receptor signaling pathway
  • path:hsa04662 - B cell receptor signaling pathway

Step 5: Protein-Protein Interactions

from bioservices import PSICQUIC

p = PSICQUIC()

# Query MINT database for human (taxid:9606) interactions
query = f"ZAP70 AND species:9606"
interactions = p.query("mint", query)

# Parse PSI-MI TAB format results
if interactions:
    interaction_lines = interactions.strip().split("\n")
    print(f"Found {len(interaction_lines)} interactions")

    # Print first few interactions
    for line in interaction_lines[:5]:
        fields = line.split("\t")
        protein_a = fields[0]
        protein_b = fields[1]
        interaction_type = fields[11]
        print(f"  {protein_a} - {protein_b}: {interaction_type}")

Output: List of proteins that interact with ZAP70
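
MINT is just one PSICQUIC provider; the same query can be sent to several services for broader coverage. A short sketch, assuming the activeDBs attribute lists the currently responding services (as in recent bioservices releases):

# Count interaction lines returned by each active PSICQUIC service
counts = {}
for db in p.activeDBs[:5]:  # limit to a few services for the example
    try:
        res = p.query(db, "ZAP70 AND species:9606")
        counts[db] = len(res.strip().split("\n")) if res else 0
    except Exception as e:
        counts[db] = 0
        print(f"  {db} failed: {e}")

for db, n in counts.items():
    print(f"  {db}: {n} interaction lines")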

Step 6: Gene Ontology Annotation

from bioservices import QuickGO

g = QuickGO()

# Get GO annotations for protein
annotations = g.Annotation(protein=uniprot_id, format="tsv")

if annotations:
    # Parse TSV results
    lines = annotations.strip().split("\n")
    print(f"Found {len(lines)-1} GO annotations")

    # Display first few annotations
    for line in lines[1:6]:  # Skip header
        fields = line.split("\t")
        go_id = fields[6]
        go_term = fields[7]
        go_aspect = fields[8]
        print(f"  {go_id}: {go_term} [{go_aspect}]")

Output: GO terms annotating ZAP70 function, process, and location

Complete Pipeline Summary

Inputs: Protein name (e.g., "ZAP70_HUMAN")

Outputs:

  1. UniProt accession and gene name
  2. Protein sequence (FASTA)
  3. Similar proteins (BLAST results)
  4. Biological pathways (KEGG)
  5. Interaction partners (PSICQUIC)
  6. Functional annotations (GO terms)

Script: scripts/protein_analysis_workflow.py automates this entire pipeline.
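
The same data flow can be condensed into a single helper when only the core results are needed. A minimal sketch reusing the calls shown above (illustrative only; the bundled script is the complete version):

from bioservices import UniProt, KEGG

def summarize_protein(query="ZAP70_HUMAN"):
    """Return accession, FASTA sequence, and KEGG pathways for a protein query."""
    u = UniProt(verbose=False)
    hits = u.search(query, frmt="tab", columns="id,genes").strip().split("\n")
    if len(hits) < 2:
        return None
    uniprot_id = hits[1].split("\t")[0]

    # FASTA sequence
    sequence = u.retrieve(uniprot_id, frmt="fasta")

    # KEGG pathways via UniProt -> KEGG mapping
    pathways = []
    mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
    if mapping.get(uniprot_id):
        organism, gene_id = mapping[uniprot_id][0].split(":")
        pathways = KEGG().get_pathway_by_gene(gene_id, organism)

    return {"accession": uniprot_id, "fasta": sequence, "pathways": pathways}

print(summarize_protein())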


Pathway Discovery and Network Analysis

Goal: Analyze all pathways for an organism and extract protein interaction networks.

Example: Human (hsa) pathway analysis

Step 1: Get All Pathways for Organism

from bioservices import KEGG

k = KEGG()
k.organism = "hsa"

# Get all pathway IDs
pathway_ids = k.pathwayIds
print(f"Found {len(pathway_ids)} pathways for {k.organism}")

# Display first few
for pid in pathway_ids[:10]:
    print(f"  {pid}")

Output: List of ~300 human pathways

Step 2: Parse Pathway for Interactions

# Analyze specific pathway
pathway_id = "hsa04660"  # T cell receptor signaling

# Get KGML data
kgml_data = k.parse_kgml_pathway(pathway_id)

# Extract entries (genes/proteins)
entries = kgml_data['entries']
print(f"Pathway contains {len(entries)} entries")

# Extract relations (interactions)
relations = kgml_data['relations']
print(f"Found {len(relations)} relations")

# Analyze relation types
relation_types = {}
for rel in relations:
    rel_type = rel.get('name', 'unknown')
    relation_types[rel_type] = relation_types.get(rel_type, 0) + 1

print("\nRelation type distribution:")
for rel_type, count in sorted(relation_types.items()):
    print(f"  {rel_type}: {count}")

Output:

  • Entry count (genes/proteins in pathway)
  • Relation count (interactions)
  • Distribution of interaction types (activation, inhibition, binding, etc.)

Step 3: Extract Protein-Protein Interactions

# Filter for specific interaction types
pprel_interactions = [
    rel for rel in relations
    if rel.get('link') == 'PPrel'  # Protein-protein relation
]

print(f"Found {len(pprel_interactions)} protein-protein interactions")

# Extract interaction details
for rel in pprel_interactions[:10]:
    entry1 = rel['entry1']
    entry2 = rel['entry2']
    interaction_type = rel.get('name', 'unknown')

    print(f"  {entry1} -> {entry2}: {interaction_type}")

Output: Directed protein-protein interactions with types

Step 4: Convert to Network Format (SIF)

# Get Simple Interaction Format (filters for key interactions)
sif_data = k.pathway2sif(pathway_id)

# SIF format: source, interaction_type, target
print("\nSimple Interaction Format:")
for interaction in sif_data[:10]:
    print(f"  {interaction}")

Output: Network edges suitable for Cytoscape or NetworkX
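
The same edges can be loaded straight into NetworkX instead of Cytoscape. A minimal sketch, assuming each sif_data entry is a (source, interaction_type, target) triple as printed above:

import networkx as nx

# Build a directed graph from the SIF triples
pathway_graph = nx.DiGraph()
for source, interaction_type, target in sif_data:
    pathway_graph.add_edge(source, target, interaction=interaction_type)

print(f"Pathway graph: {pathway_graph.number_of_nodes()} nodes, "
      f"{pathway_graph.number_of_edges()} edges")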

Step 5: Batch Analysis of All Pathways

import pandas as pd

# Analyze all pathways (this takes time!)
all_results = []

for pathway_id in pathway_ids[:50]:  # Limit for example
    try:
        kgml = k.parse_kgml_pathway(pathway_id)

        result = {
            'pathway_id': pathway_id,
            'num_entries': len(kgml.get('entries', [])),
            'num_relations': len(kgml.get('relations', []))
        }

        all_results.append(result)

    except Exception as e:
        print(f"Error parsing {pathway_id}: {e}")

# Create DataFrame
df = pd.DataFrame(all_results)
print(df.describe())

# Find largest pathways
print("\nLargest pathways:")
print(df.nlargest(10, 'num_entries')[['pathway_id', 'num_entries', 'num_relations']])

Output: Statistical summary of pathway sizes and interaction densities
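
To keep the summary for reuse in other tools, the DataFrame can be written out directly (the file name is arbitrary):

# Persist the per-pathway summary table
df.to_csv("pathway_summary.csv", index=False)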

Script: scripts/pathway_analysis.py implements this workflow with export options.


Compound Multi-Database Search

Goal: Search for a compound by name and retrieve identifiers across KEGG, ChEBI, and ChEMBL.

Example: Geldanamycin (antibiotic)

Step 1: Search KEGG Compound Database

from bioservices import KEGG

k = KEGG()

# Search by compound name
compound_name = "Geldanamycin"
results = k.find("compound", compound_name)

print(f"KEGG search results for '{compound_name}':")
print(results)

# Extract compound ID
if results:
    lines = results.strip().split("\n")
    if lines:
        kegg_id = lines[0].split("\t")[0]  # e.g., cpd:C11222
        kegg_id_clean = kegg_id.replace("cpd:", "")  # C11222
        print(f"\nKEGG Compound ID: {kegg_id_clean}")

Output: KEGG ID (e.g., C11222)

Step 2: Get KEGG Entry with Database Links

# Retrieve compound entry
compound_entry = k.get(kegg_id)

# Parse entry for database links
chebi_id = None
for line in compound_entry.split("\n"):
    if "ChEBI:" in line:
        # Extract ChEBI ID
        parts = line.split("ChEBI:")
        if len(parts) > 1:
            chebi_id = parts[1].strip().split()[0]
            print(f"ChEBI ID: {chebi_id}")
            break

# Display entry snippet
print("\nKEGG Entry (first 500 chars):")
print(compound_entry[:500])

Output: ChEBI ID (e.g., 5292) and compound information

Step 3: Cross-Reference to ChEMBL via UniChem

from bioservices import UniChem

u = UniChem()

# Convert KEGG → ChEMBL
try:
    chembl_id = u.get_compound_id_from_kegg(kegg_id_clean)
    print(f"ChEMBL ID: {chembl_id}")
except Exception as e:
    print(f"UniChem lookup failed: {e}")
    chembl_id = None

Output: ChEMBL ID (e.g., CHEMBL278315)

Step 4: Retrieve Detailed Information

# Get ChEBI information
if chebi_id:
    from bioservices import ChEBI
    c = ChEBI()

    try:
        chebi_entity = c.getCompleteEntity(f"CHEBI:{chebi_id}")
        print(f"\nChEBI Formula: {chebi_entity.Formulae}")
        print(f"ChEBI Name: {chebi_entity.chebiAsciiName}")
    except Exception as e:
        print(f"ChEBI lookup failed: {e}")

# Get ChEMBL information
if chembl_id:
    from bioservices import ChEMBL
    chembl = ChEMBL()

    try:
        chembl_compound = chembl.get_compound_by_chemblId(chembl_id)
        print(f"\nChEMBL Molecular Weight: {chembl_compound['molecule_properties']['full_mwt']}")
        print(f"ChEMBL SMILES: {chembl_compound['molecule_structures']['canonical_smiles']}")
    except Exception as e:
        print(f"ChEMBL lookup failed: {e}")

Output: Chemical properties from multiple databases

Complete Compound Workflow Summary

Input: Compound name (e.g., "Geldanamycin")

Output:

  • KEGG ID: C11222
  • ChEBI ID: 5292
  • ChEMBL ID: CHEMBL278315
  • Chemical formula
  • Molecular weight
  • SMILES structure

Script: scripts/compound_cross_reference.py automates this workflow.


Batch Identifier Conversion

Goal: Convert multiple identifiers between databases efficiently.

Batch UniProt → KEGG Mapping

from bioservices import UniProt

u = UniProt()

# List of UniProt IDs
uniprot_ids = ["P43403", "P04637", "P53779", "Q9Y6K9"]

# Batch mapping (comma-separated)
query_string = ",".join(uniprot_ids)
results = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=query_string)

print("UniProt → KEGG mapping:")
for uniprot_id, kegg_ids in results.items():
    print(f"  {uniprot_id}{kegg_ids}")

Output: Dictionary mapping each UniProt ID to KEGG gene IDs

Batch File Processing

import csv

# Read identifiers from file
def read_ids_from_file(filename):
    with open(filename, 'r') as f:
        ids = [line.strip() for line in f if line.strip()]
    return ids

# Process in chunks (API limits)
def batch_convert(ids, from_db, to_db, chunk_size=100):
    u = UniProt()
    all_results = {}

    for i in range(0, len(ids), chunk_size):
        chunk = ids[i:i+chunk_size]
        query = ",".join(chunk)

        try:
            results = u.mapping(fr=from_db, to=to_db, query=query)
            all_results.update(results)
            print(f"Processed {min(i+chunk_size, len(ids))}/{len(ids)}")
        except Exception as e:
            print(f"Error processing chunk {i}: {e}")

    return all_results

# Write results to CSV
def write_mapping_to_csv(mapping, output_file):
    with open(output_file, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['Source_ID', 'Target_IDs'])

        for source_id, target_ids in mapping.items():
            target_str = ";".join(target_ids) if target_ids else "No mapping"
            writer.writerow([source_id, target_str])

# Example usage
input_ids = read_ids_from_file("uniprot_ids.txt")
mapping = batch_convert(input_ids, "UniProtKB_AC-ID", "KEGG", chunk_size=50)
write_mapping_to_csv(mapping, "uniprot_to_kegg_mapping.csv")

Script: scripts/batch_id_converter.py provides command-line batch conversion.


Gene Functional Annotation

Goal: Retrieve comprehensive functional information for a gene.

Workflow

from bioservices import UniProt, KEGG, QuickGO

# Gene of interest
gene_symbol = "TP53"

# 1. Find UniProt entry
u = UniProt()
search_results = u.search(f"gene:{gene_symbol} AND organism:9606",
                          frmt="tab",
                          columns="id,genes,protein names")

# Extract UniProt ID
lines = search_results.strip().split("\n")
if len(lines) > 1:
    uniprot_id = lines[1].split("\t")[0]
    protein_name = lines[1].split("\t")[2]
    print(f"Protein: {protein_name}")
    print(f"UniProt ID: {uniprot_id}")

# 2. Get KEGG pathways
kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
if uniprot_id in kegg_mapping:
    kegg_id = kegg_mapping[uniprot_id][0]

    k = KEGG()
    organism, gene_id = kegg_id.split(":")
    pathways = k.get_pathway_by_gene(gene_id, organism)

    print(f"\nPathways ({len(pathways)}):")
    for pathway_id in pathways[:5]:
        print(f"  {pathway_id}")

# 3. Get GO annotations
g = QuickGO()
go_annotations = g.Annotation(protein=uniprot_id, format="tsv")

if go_annotations:
    lines = go_annotations.strip().split("\n")
    print(f"\nGO Annotations ({len(lines)-1} total):")

    # Group by aspect
    aspects = {"P": [], "F": [], "C": []}
    for line in lines[1:]:
        fields = line.split("\t")
        go_aspect = fields[8]  # P, F, or C
        go_term = fields[7]
        aspects[go_aspect].append(go_term)

    print(f"  Biological Process: {len(aspects['P'])} terms")
    print(f"  Molecular Function: {len(aspects['F'])} terms")
    print(f"  Cellular Component: {len(aspects['C'])} terms")

# 4. Get protein sequence features
full_entry = u.retrieve(uniprot_id, frmt="txt")
print("\nProtein Features:")
for line in full_entry.split("\n"):
    if line.startswith("FT   DOMAIN"):
        print(f"  {line}")

Output: Comprehensive annotation including name, pathways, GO terms, and features.


Protein Interaction Network Construction

Goal: Build a protein-protein interaction network for a set of proteins.

Workflow

from bioservices import PSICQUIC
import networkx as nx

# Proteins of interest
proteins = ["ZAP70", "LCK", "LAT", "SLP76", "PLCg1"]

# Initialize PSICQUIC
p = PSICQUIC()

# Build network
G = nx.Graph()

for protein in proteins:
    # Query for human interactions
    query = f"{protein} AND species:9606"

    try:
        results = p.query("intact", query)

        if results:
            lines = results.strip().split("\n")

            for line in lines:
                fields = line.split("\t")
                # Extract protein names (simplified)
                protein_a = fields[4].split(":")[1] if ":" in fields[4] else fields[4]
                protein_b = fields[5].split(":")[1] if ":" in fields[5] else fields[5]

                # Add edge
                G.add_edge(protein_a, protein_b)

    except Exception as e:
        print(f"Error querying {protein}: {e}")

print(f"Network: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

# Analyze network
print("\nNode degrees:")
for node in proteins:
    if node in G:
        print(f"  {node}: {G.degree(node)} interactions")

# Export for visualization
nx.write_gml(G, "protein_network.gml")
print("\nNetwork exported to protein_network.gml")

Output: NetworkX graph exported in GML format for Cytoscape visualization.
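
As a quick follow-up on the assembled graph, degree centrality highlights hub proteins before moving to Cytoscape:

# Rank nodes by degree centrality to spot hub proteins
centrality = nx.degree_centrality(G)
top_hubs = sorted(centrality.items(), key=lambda kv: kv[1], reverse=True)[:5]
print("Top hub proteins:")
for name, score in top_hubs:
    print(f"  {name}: {score:.3f}")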


Multi-Organism Comparative Analysis

Goal: Compare pathway or gene presence across multiple organisms.

Workflow

from bioservices import KEGG

k = KEGG()

# Organisms to compare
organisms = ["hsa", "mmu", "dme", "sce"]  # Human, mouse, fly, yeast
organism_names = {
    "hsa": "Human",
    "mmu": "Mouse",
    "dme": "Fly",
    "sce": "Yeast"
}

# Pathway of interest
pathway_name = "cell cycle"

print(f"Searching for '{pathway_name}' pathway across organisms:\n")

for org in organisms:
    k.organism = org

    # Search pathways
    results = k.lookfor_pathway(pathway_name)

    print(f"{organism_names[org]} ({org}):")
    if results:
        for pathway in results[:3]:  # Show first 3
            print(f"  {pathway}")
    else:
        print("  No matches found")
    print()

Output: Pathway presence/absence across organisms.
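
To turn the per-organism hits into a single comparison table, a small pandas summary works. This sketch reuses the k, organisms, organism_names, and pathway_name variables defined above and simply counts matching pathways:

import pandas as pd

counts = {}
for org in organisms:
    k.organism = org
    hits = k.lookfor_pathway(pathway_name)
    counts[organism_names[org]] = len(hits) if hits else 0

summary = pd.DataFrame.from_dict(counts, orient="index", columns=["matching_pathways"])
print(summary)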


Best Practices for Workflows

1. Error Handling

Always wrap service calls:

try:
    result = service.method(params)
    if result:
        # Process
        pass
except Exception as e:
    print(f"Error: {e}")

2. Rate Limiting

Add delays for batch processing:

import time

for item in items:
    result = service.query(item)
    time.sleep(0.5)  # 500ms delay
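
For transient network failures, a small retry helper with exponential backoff pairs well with the fixed delay (plain Python, not a BioServices feature):

import time

def call_with_retries(func, *args, retries=3, base_delay=1.0, **kwargs):
    """Call func, retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Example: result = call_with_retries(service.query, item)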

3. Result Validation

Check for empty or unexpected results:

if result and len(result) > 0:
    # Process
    pass
else:
    print("No results returned")

4. Progress Reporting

For long workflows:

total = len(items)
for i, item in enumerate(items):
    # Process item
    if (i + 1) % 10 == 0:
        print(f"Processed {i+1}/{total}")

5. Data Export

Save intermediate results:

import json

with open("results.json", "w") as f:
    json.dump(results, f, indent=2)

Integration with Other Tools

BioPython Integration

from bioservices import UniProt
from Bio import SeqIO
from io import StringIO

u = UniProt()
fasta_data = u.retrieve("P43403", "fasta")

# Parse with BioPython
fasta_io = StringIO(fasta_data)
record = SeqIO.read(fasta_io, "fasta")

print(f"Sequence length: {len(record.seq)}")
print(f"Description: {record.description}")

Pandas Integration

from bioservices import UniProt
import pandas as pd
from io import StringIO

u = UniProt()
results = u.search("zap70", frmt="tab", columns="id,genes,length,organism")

# Load into DataFrame
df = pd.read_csv(StringIO(results), sep="\t")
print(df.head())
print(df.describe())

NetworkX Integration

See Protein Interaction Network Construction above.


For complete working examples, see the scripts in scripts/ directory.