Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/bioservices/references/workflow_patterns.md
+++ b/skills/bioservices/references/workflow_patterns.md
@@ -0,0 +1,811 @@
+# BioServices: Common Workflow Patterns
+
+This document describes detailed multi-step workflows for common bioinformatics tasks using BioServices.
+
+## Table of Contents
+
+1. [Complete Protein Analysis Pipeline](#complete-protein-analysis-pipeline)
+2. [Pathway Discovery and Network Analysis](#pathway-discovery-and-network-analysis)
+3. [Compound Multi-Database Search](#compound-multi-database-search)
+4. [Batch Identifier Conversion](#batch-identifier-conversion)
+5. [Gene Functional Annotation](#gene-functional-annotation)
+6. [Protein Interaction Network Construction](#protein-interaction-network-construction)
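+7. [Multi-Organism Comparative Analysis](#multi-organism-comparative-analysis)
+8. [Best Practices for Workflows](#best-practices-for-workflows)
+9. [Integration with Other Tools](#integration-with-other-tools)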
+
+---
+
+## Complete Protein Analysis Pipeline
+
+**Goal:** Given a protein name, retrieve sequence, find homologs, identify pathways, and discover interactions.
+
+**Example:** Analyzing the human ZAP70 protein
+
+### Step 1: UniProt Search and Identifier Retrieval
+
+```python
+from bioservices import UniProt
+
+u = UniProt(verbose=False)
+
+# Search for protein by name
+query = "ZAP70_HUMAN"
+results = u.search(query, frmt="tab", columns="id,genes,organism,length")
+
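+# Parse results (first line is the header, second line is the top hit)
+lines = results.strip().split("\n")
+if len(lines) > 1:
+    data = lines[1].split("\t")
+    uniprot_id = data[0]  # e.g., P43403
+    gene_names = data[1]   # e.g., ZAP70
+    print(f"UniProt ID: {uniprot_id}")
+    print(f"Gene names: {gene_names}")
+else:
+    # Later steps need uniprot_id, so stop here instead of hitting a NameError
+    raise ValueError(f"No UniProt entry found for {query}")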
+```
+
+**Output:**
+- UniProt accession: P43403
+- Gene name: ZAP70
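+
+The positional indexing in Step 1 depends on the column order of the response. A minimal sketch of a header-keyed alternative follows, assuming the service returns tab-delimited text with a header row. `parse_tab_results` and the header names in `sample` are hypothetical and not part of BioServices.
+
+```python
+import csv
+from io import StringIO
+
+
+def parse_tab_results(text):
+    """Parse tab-delimited text with a header row into a list of dicts."""
+    if not text or not text.strip():
+        return []
+    reader = csv.DictReader(StringIO(text.strip()), delimiter="\t")
+    return [dict(row) for row in reader]
+
+
+# Illustrative input shaped like the Step 1 output; header names are assumptions
+sample = "Entry\tGene names\nP43403\tZAP70\n"
+rows = parse_tab_results(sample)
+print(rows[0]["Entry"], rows[0]["Gene names"])
+```
+
+Looking up values by header name keeps the parsing step working if the columns are reordered.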
+
+### Step 2: Sequence Retrieval
+
+```python
+# Retrieve FASTA sequence
+sequence = u.retrieve(uniprot_id, frmt="fasta")
+print(sequence)
+
+# Extract just the sequence string (remove header)
+seq_lines = sequence.split("\n")
+sequence_only = "".join(seq_lines[1:])  # Skip FASTA header
+```
+
+**Output:** Complete protein sequence in FASTA format
+
+### Step 3: BLAST Similarity Search
+
+```python
+from bioservices import NCBIblast
+import time
+
+s = NCBIblast(verbose=False)
+
+# Submit BLAST job
+jobid = s.run(
+    program="blastp",
+    sequence=sequence_only,
+    stype="protein",
+    database="uniprotkb",
+    email="your.email@example.com"
+)
+
+print(f"BLAST Job ID: {jobid}")
+
+# Wait for completion (stop on any terminal state, not only FINISHED)
+while True:
+    status = s.getStatus(jobid)
+    print(f"Status: {status}")
+    if status == "FINISHED":
+        break
+    elif status in ("ERROR", "FAILURE", "NOT_FOUND"):
+        print(f"BLAST job did not complete: {status}")
+        break
+    time.sleep(5)
+
+# Retrieve results
+if status == "FINISHED":
+    blast_results = s.getResult(jobid, "out")
+    print(blast_results[:500])  # Print first 500 characters
+```
+
+**Output:** BLAST alignment results showing similar proteins
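+
+The polling loop in Step 3 has no upper bound on waiting time. The sketch below shows one way to add a timeout, assuming the status strings are those used by the EBI job services (`RUNNING`, `FINISHED`, `ERROR`, `FAILURE`, `NOT_FOUND`). The helper `wait_for_job` and the `"TIMEOUT"` sentinel are hypothetical names introduced here for illustration.
+
+```python
+import time
+
+
+def wait_for_job(get_status, jobid, interval=5.0, timeout=600.0, sleep=time.sleep):
+    """Poll get_status(jobid) until a terminal status or the timeout is reached."""
+    terminal = {"FINISHED", "ERROR", "FAILURE", "NOT_FOUND"}  # assumed status values
+    waited = 0.0
+    while True:
+        status = get_status(jobid)
+        if status in terminal:
+            return status
+        if waited >= timeout:
+            return "TIMEOUT"  # hypothetical sentinel, not a service status
+        sleep(interval)
+        waited += interval
+
+
+# Offline demonstration with a fake status function
+fake_statuses = iter(["RUNNING", "RUNNING", "FINISHED"])
+final_status = wait_for_job(lambda job: next(fake_statuses), "job-1", sleep=lambda s: None)
+print(final_status)
+```
+
+With the objects from Step 3, a call might look like `wait_for_job(s.getStatus, jobid)`.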
+
+### Step 4: KEGG Pathway Discovery
+
+```python
+from bioservices import KEGG
+
+k = KEGG()
+
+# Get KEGG gene ID from UniProt mapping
+kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
+print(f"KEGG mapping: {kegg_mapping}")
+
+# Extract KEGG gene ID (e.g., hsa:7535)
+if kegg_mapping:
+    kegg_gene_id = kegg_mapping[uniprot_id][0] if uniprot_id in kegg_mapping else None
+
+    if kegg_gene_id:
+        # Find pathways containing this gene
+        organism = kegg_gene_id.split(":")[0]  # e.g., "hsa"
+        gene_id = kegg_gene_id.split(":")[1]   # e.g., "7535"
+
+        pathways = k.get_pathway_by_gene(gene_id, organism)
+        print(f"Found {len(pathways)} pathways:")
+
+        # Get pathway names
+        for pathway_id in pathways:
+            pathway_info = k.get(pathway_id)
+            # Parse NAME line
+            for line in pathway_info.split("\n"):
+                if line.startswith("NAME"):
+                    pathway_name = line.replace("NAME", "").strip()
+                    print(f"  {pathway_id}: {pathway_name}")
+                    break
+```
+
+**Output:**
+- path:hsa04064 - NF-kappa B signaling pathway
+- path:hsa04650 - Natural killer cell mediated cytotoxicity
+- path:hsa04660 - T cell receptor signaling pathway
+- path:hsa04662 - B cell receptor signaling pathway
+
+### Step 5: Protein-Protein Interactions
+
+```python
+from bioservices import PSICQUIC
+
+p = PSICQUIC()
+
+# Query the MINT database for human (taxid 9606) interactions
+query = "ZAP70 AND species:9606"
+interactions = p.query("mint", query)
+
+# Parse PSI-MI TAB format results
+if interactions:
+    interaction_lines = interactions.strip().split("\n")
+    print(f"Found {len(interaction_lines)} interactions")
+
+    # Print first few interactions
+    for line in interaction_lines[:5]:
+        fields = line.split("\t")
+        protein_a = fields[0]
+        protein_b = fields[1]
+        interaction_type = fields[11]
+        print(f"  {protein_a} - {protein_b}: {interaction_type}")
+```
+
+**Output:** List of proteins that interact with ZAP70
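+
+Step 5 reads columns 0, 1, and 11 of each PSI-MI TAB line directly, which raises an `IndexError` on short or empty lines. A minimal defensive sketch follows, assuming the MITAB 2.5 layout in which column 12 (index 11) holds the interaction type. `parse_mitab_line` and the sample values are hypothetical.
+
+```python
+def parse_mitab_line(line):
+    """Return the interactor IDs and interaction type, or None for short lines."""
+    fields = line.rstrip("\n").split("\t")
+    if len(fields) < 12:
+        return None
+    return {
+        "id_a": fields[0],
+        "id_b": fields[1],
+        "interaction_type": fields[11],
+    }
+
+
+# Illustrative line: two IDs, nine placeholder columns, then the interaction type
+sample_line = "\t".join(["uniprotkb:P43403", "uniprotkb:EXAMPLE"] + ["-"] * 9 + ["example-type"])
+parsed = parse_mitab_line(sample_line)
+print(parsed)
+```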
+
+### Step 6: Gene Ontology Annotation
+
+```python
+from bioservices import QuickGO
+
+g = QuickGO()
+
+# Get GO annotations for protein
+annotations = g.Annotation(protein=uniprot_id, format="tsv")
+
+if annotations:
+    # Parse TSV results
+    lines = annotations.strip().split("\n")
+    print(f"Found {len(lines)-1} GO annotations")
+
+    # Display first few annotations
+    for line in lines[1:6]:  # Skip header
+        fields = line.split("\t")
+        go_id = fields[6]
+        go_term = fields[7]
+        go_aspect = fields[8]
+        print(f"  {go_id}: {go_term} [{go_aspect}]")
+```
+
+**Output:** GO terms annotating ZAP70 function, process, and location
+
+### Complete Pipeline Summary
+
+**Inputs:** Protein name (e.g., "ZAP70_HUMAN")
+
+**Outputs:**
+1. UniProt accession and gene name
+2. Protein sequence (FASTA)
+3. Similar proteins (BLAST results)
+4. Biological pathways (KEGG)
+5. Interaction partners (PSICQUIC)
+6. Functional annotations (GO terms)
+
+**Script:** `scripts/protein_analysis_workflow.py` automates this entire pipeline.
+
+---
+
+## Pathway Discovery and Network Analysis
+
+**Goal:** Analyze all pathways for an organism and extract protein interaction networks.
+
+**Example:** Human (hsa) pathway analysis
+
+### Step 1: Get All Pathways for Organism
+
+```python
+from bioservices import KEGG
+
+k = KEGG()
+k.organism = "hsa"
+
+# Get all pathway IDs
+pathway_ids = k.pathwayIds
+print(f"Found {len(pathway_ids)} pathways for {k.organism}")
+
+# Display first few
+for pid in pathway_ids[:10]:
+    print(f"  {pid}")
+```
+
+**Output:** List of ~300 human pathways
+
+### Step 2: Parse Pathway for Interactions
+
+```python
+# Analyze specific pathway
+pathway_id = "hsa04660"  # T cell receptor signaling
+
+# Get KGML data
+kgml_data = k.parse_kgml_pathway(pathway_id)
+
+# Extract entries (genes/proteins)
+entries = kgml_data['entries']
+print(f"Pathway contains {len(entries)} entries")
+
+# Extract relations (interactions)
+relations = kgml_data['relations']
+print(f"Found {len(relations)} relations")
+
+# Analyze relation types
+relation_types = {}
+for rel in relations:
+    rel_type = rel.get('name', 'unknown')
+    relation_types[rel_type] = relation_types.get(rel_type, 0) + 1
+
+print("\nRelation type distribution:")
+for rel_type, count in sorted(relation_types.items()):
+    print(f"  {rel_type}: {count}")
+```
+
+**Output:**
+- Entry count (genes/proteins in pathway)
+- Relation count (interactions)
+- Distribution of interaction types (activation, inhibition, binding, etc.)
+
+### Step 3: Extract Protein-Protein Interactions
+
+```python
+# Filter for specific interaction types
+pprel_interactions = [
+    rel for rel in relations
+    if rel.get('link') == 'PPrel'  # Protein-protein relation
+]
+
+print(f"Found {len(pprel_interactions)} protein-protein interactions")
+
+# Extract interaction details
+for rel in pprel_interactions[:10]:
+    entry1 = rel['entry1']
+    entry2 = rel['entry2']
+    interaction_type = rel.get('name', 'unknown')
+
+    print(f"  {entry1} -> {entry2}: {interaction_type}")
+```
+
+**Output:** Directed protein-protein interactions with types
+
+### Step 4: Convert to Network Format (SIF)
+
+```python
+# Get Simple Interaction Format (filters for key interactions)
+sif_data = k.pathway2sif(pathway_id)
+
+# SIF format: source, interaction_type, target
+print("\nSimple Interaction Format:")
+for interaction in sif_data[:10]:
+    print(f"  {interaction}")
+```
+
+**Output:** Network edges suitable for Cytoscape or NetworkX
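+
+To save edges for Cytoscape, they can be written as tab-separated SIF lines. The sketch below assumes the edges are available as `(source, interaction, target)` triples, which may not match the structure that `pathway2sif` returns. `to_sif_text` and the example edges are hypothetical.
+
+```python
+def to_sif_text(edges):
+    """Format (source, interaction, target) triples as tab-separated SIF lines."""
+    return "\n".join(f"{src}\t{kind}\t{dst}" for src, kind, dst in edges)
+
+
+# Hypothetical edges used only to show the output shape
+edges = [("A", "activation", "B"), ("B", "inhibition", "C")]
+sif_text = to_sif_text(edges)
+print(sif_text)
+```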
+
+### Step 5: Batch Analysis of All Pathways
+
+```python
+import pandas as pd
+
+# Analyze all pathways (this takes time!)
+all_results = []
+
+for pathway_id in pathway_ids[:50]:  # Limit for example
+    try:
+        kgml = k.parse_kgml_pathway(pathway_id)
+
+        result = {
+            'pathway_id': pathway_id,
+            'num_entries': len(kgml.get('entries', [])),
+            'num_relations': len(kgml.get('relations', []))
+        }
+
+        all_results.append(result)
+
+    except Exception as e:
+        print(f"Error parsing {pathway_id}: {e}")
+
+# Create DataFrame
+df = pd.DataFrame(all_results)
+print(df.describe())
+
+# Find largest pathways
+print("\nLargest pathways:")
+print(df.nlargest(10, 'num_entries')[['pathway_id', 'num_entries', 'num_relations']])
+```
+
+**Output:** Statistical summary of pathway sizes and interaction densities
+
+**Script:** `scripts/pathway_analysis.py` implements this workflow with export options.
+
+---
+
+## Compound Multi-Database Search
+
+**Goal:** Search for a compound by name and retrieve identifiers across KEGG, ChEBI, and ChEMBL.
+
+**Example:** Geldanamycin (antibiotic)
+
+### Step 1: Search KEGG Compound Database
+
+```python
+from bioservices import KEGG
+
+k = KEGG()
+
+# Search by compound name
+compound_name = "Geldanamycin"
+results = k.find("compound", compound_name)
+
+print(f"KEGG search results for '{compound_name}':")
+print(results)
+
+# Extract compound ID
+if results:
+    lines = results.strip().split("\n")
+    if lines:
+        kegg_id = lines[0].split("\t")[0]  # e.g., cpd:C11222
+        kegg_id_clean = kegg_id.replace("cpd:", "")  # C11222
+        print(f"\nKEGG Compound ID: {kegg_id_clean}")
+```
+
+**Output:** KEGG ID (e.g., C11222)
+
+### Step 2: Get KEGG Entry with Database Links
+
+```python
+# Retrieve compound entry
+compound_entry = k.get(kegg_id)
+
+# Parse entry for database links
+chebi_id = None
+for line in compound_entry.split("\n"):
+    if "ChEBI:" in line:
+        # Extract ChEBI ID
+        parts = line.split("ChEBI:")
+        if len(parts) > 1:
+            chebi_id = parts[1].strip().split()[0]
+            print(f"ChEBI ID: {chebi_id}")
+            break
+
+# Display entry snippet
+print("\nKEGG Entry (first 500 chars):")
+print(compound_entry[:500])
+```
+
+**Output:** ChEBI ID (e.g., 5292) and compound information
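+
+Step 2 finds the ChEBI ID by substring matching on each line. A sketch of collecting every cross-reference from the `DBLINKS` block at once follows, assuming the KEGG flat-file layout in which continuation lines are indented. `parse_dblinks` is a hypothetical helper, and `sample_entry` is an illustrative fragment: only the ChEBI value comes from this document, and `ExampleDB` is a placeholder.
+
+```python
+def parse_dblinks(entry_text):
+    """Collect 'Database: id [id ...]' pairs from the DBLINKS block of a KEGG entry."""
+    links = {}
+    in_block = False
+    for line in entry_text.splitlines():
+        if line.startswith("DBLINKS"):
+            in_block = True
+            content = line[len("DBLINKS"):]
+        elif in_block and line.startswith(" "):
+            content = line  # indented continuation line of the same block
+        else:
+            in_block = False
+            continue
+        db, _, ids = content.strip().partition(":")
+        if ids:
+            links[db.strip()] = ids.split()
+    return links
+
+
+sample_entry = (
+    "ENTRY       C11222\n"
+    "DBLINKS     ChEBI: 5292\n"
+    "            ExampleDB: X1 X2\n"
+    "///\n"
+)
+links = parse_dblinks(sample_entry)
+print(links)
+```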
+
+### Step 3: Cross-Reference to ChEMBL via UniChem
+
+```python
+from bioservices import UniChem
+
+u = UniChem()
+
+# Convert KEGG → ChEMBL
+try:
+    chembl_id = u.get_compound_id_from_kegg(kegg_id_clean)
+    print(f"ChEMBL ID: {chembl_id}")
+except Exception as e:
+    print(f"UniChem lookup failed: {e}")
+    chembl_id = None
+```
+
+**Output:** ChEMBL ID (e.g., CHEMBL278315)
+
+### Step 4: Retrieve Detailed Information
+
+```python
+# Get ChEBI information
+if chebi_id:
+    from bioservices import ChEBI
+    c = ChEBI()
+
+    try:
+        chebi_entity = c.getCompleteEntity(f"CHEBI:{chebi_id}")
+        print(f"\nChEBI Formula: {chebi_entity.Formulae}")
+        print(f"ChEBI Name: {chebi_entity.chebiAsciiName}")
+    except Exception as e:
+        print(f"ChEBI lookup failed: {e}")
+
+# Get ChEMBL information
+if chembl_id:
+    from bioservices import ChEMBL
+    chembl = ChEMBL()
+
+    try:
+        chembl_compound = chembl.get_compound_by_chemblId(chembl_id)
+        print(f"\nChEMBL Molecular Weight: {chembl_compound['molecule_properties']['full_mwt']}")
+        print(f"ChEMBL SMILES: {chembl_compound['molecule_structures']['canonical_smiles']}")
+    except Exception as e:
+        print(f"ChEMBL lookup failed: {e}")
+```
+
+**Output:** Chemical properties from multiple databases
+
+### Complete Compound Workflow Summary
+
+**Input:** Compound name (e.g., "Geldanamycin")
+
+**Output:**
+- KEGG ID: C11222
+- ChEBI ID: 5292
+- ChEMBL ID: CHEMBL278315
+- Chemical formula
+- Molecular weight
+- SMILES structure
+
+**Script:** `scripts/compound_cross_reference.py` automates this workflow.
+
+---
+
+## Batch Identifier Conversion
+
+**Goal:** Convert multiple identifiers between databases efficiently.
+
+### Batch UniProt → KEGG Mapping
+
+```python
+from bioservices import UniProt
+
+u = UniProt()
+
+# List of UniProt IDs
+uniprot_ids = ["P43403", "P04637", "P53779", "Q9Y6K9"]
+
+# Batch mapping (comma-separated)
+query_string = ",".join(uniprot_ids)
+results = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=query_string)
+
+print("UniProt → KEGG mapping:")
+for uniprot_id, kegg_ids in results.items():
+    print(f"  {uniprot_id} → {kegg_ids}")
+```
+
+**Output:** Dictionary mapping each UniProt ID to KEGG gene IDs
+
+### Batch File Processing
+
+```python
+import csv
+
+# Read identifiers from file
+def read_ids_from_file(filename):
+    with open(filename, 'r') as f:
+        ids = [line.strip() for line in f if line.strip()]
+    return ids
+
+# Process in chunks (API limits)
+def batch_convert(ids, from_db, to_db, chunk_size=100):
+    u = UniProt()
+    all_results = {}
+
+    for i in range(0, len(ids), chunk_size):
+        chunk = ids[i:i+chunk_size]
+        query = ",".join(chunk)
+
+        try:
+            results = u.mapping(fr=from_db, to=to_db, query=query)
+            all_results.update(results)
+            print(f"Processed {min(i+chunk_size, len(ids))}/{len(ids)}")
+        except Exception as e:
+            print(f"Error processing chunk {i}: {e}")
+
+    return all_results
+
+# Write results to CSV
+def write_mapping_to_csv(mapping, output_file):
+    with open(output_file, 'w', newline='') as f:
+        writer = csv.writer(f)
+        writer.writerow(['Source_ID', 'Target_IDs'])
+
+        for source_id, target_ids in mapping.items():
+            target_str = ";".join(target_ids) if target_ids else "No mapping"
+            writer.writerow([source_id, target_str])
+
+# Example usage
+input_ids = read_ids_from_file("uniprot_ids.txt")
+mapping = batch_convert(input_ids, "UniProtKB_AC-ID", "KEGG", chunk_size=50)
+write_mapping_to_csv(mapping, "uniprot_to_kegg_mapping.csv")
+```
+
+**Script:** `scripts/batch_id_converter.py` provides command-line batch conversion.
+
+---
+
+## Gene Functional Annotation
+
+**Goal:** Retrieve comprehensive functional information for a gene.
+
+### Workflow
+
+```python
+from bioservices import UniProt, KEGG, QuickGO
+
+# Gene of interest
+gene_symbol = "TP53"
+
+# 1. Find UniProt entry
+u = UniProt()
+search_results = u.search(f"gene:{gene_symbol} AND organism:9606",
+                          frmt="tab",
+                          columns="id,genes,protein names")
+
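+# Extract UniProt ID (first line is the header)
+lines = search_results.strip().split("\n")
+if len(lines) < 2:
+    # The remaining steps need uniprot_id, so stop early if nothing was found
+    raise ValueError(f"No UniProt entry found for {gene_symbol}")
+top_hit = lines[1].split("\t")
+uniprot_id = top_hit[0]
+protein_name = top_hit[2]
+print(f"Protein: {protein_name}")
+print(f"UniProt ID: {uniprot_id}")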
+
+# 2. Get KEGG pathways
+kegg_mapping = u.mapping(fr="UniProtKB_AC-ID", to="KEGG", query=uniprot_id)
+if uniprot_id in kegg_mapping:
+    kegg_id = kegg_mapping[uniprot_id][0]
+
+    k = KEGG()
+    organism, gene_id = kegg_id.split(":")
+    pathways = k.get_pathway_by_gene(gene_id, organism)
+
+    print(f"\nPathways ({len(pathways)}):")
+    for pathway_id in pathways[:5]:
+        print(f"  {pathway_id}")
+
+# 3. Get GO annotations
+g = QuickGO()
+go_annotations = g.Annotation(protein=uniprot_id, format="tsv")
+
+if go_annotations:
+    lines = go_annotations.strip().split("\n")
+    print(f"\nGO Annotations ({len(lines)-1} total):")
+
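+    # Group by aspect
+    aspects = {"P": [], "F": [], "C": []}
+    for line in lines[1:]:
+        fields = line.split("\t")
+        if len(fields) < 9:
+            continue  # Skip malformed rows
+        go_aspect = fields[8]  # P, F, or C
+        go_term = fields[7]
+        # setdefault avoids a KeyError if the service returns another aspect label
+        aspects.setdefault(go_aspect, []).append(go_term)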
+
+    print(f"  Biological Process: {len(aspects['P'])} terms")
+    print(f"  Molecular Function: {len(aspects['F'])} terms")
+    print(f"  Cellular Component: {len(aspects['C'])} terms")
+
+# 4. Get protein sequence features
+full_entry = u.retrieve(uniprot_id, frmt="txt")
+print("\nProtein Features:")
+for line in full_entry.split("\n"):
+    if line.startswith("FT   DOMAIN"):
+        print(f"  {line}")
+```
+
+**Output:** Comprehensive annotation including name, pathways, GO terms, and features.
+
+---
+
+## Protein Interaction Network Construction
+
+**Goal:** Build a protein-protein interaction network for a set of proteins.
+
+### Workflow
+
+```python
+from bioservices import PSICQUIC
+import networkx as nx
+
+# Proteins of interest
+proteins = ["ZAP70", "LCK", "LAT", "SLP76", "PLCg1"]
+
+# Initialize PSICQUIC
+p = PSICQUIC()
+
+# Build network
+G = nx.Graph()
+
+for protein in proteins:
+    # Query for human interactions
+    query = f"{protein} AND species:9606"
+
+    try:
+        results = p.query("intact", query)
+
+        if results:
+            lines = results.strip().split("\n")
+
+            for line in lines:
+                fields = line.split("\t")
+                # Extract protein names (simplified)
+                protein_a = fields[4].split(":")[1] if ":" in fields[4] else fields[4]
+                protein_b = fields[5].split(":")[1] if ":" in fields[5] else fields[5]
+
+                # Add edge
+                G.add_edge(protein_a, protein_b)
+
+    except Exception as e:
+        print(f"Error querying {protein}: {e}")
+
+print(f"Network: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
+
+# Analyze network
+print("\nNode degrees:")
+for node in proteins:
+    if node in G:
+        print(f"  {node}: {G.degree(node)} interactions")
+
+# Export for visualization
+nx.write_gml(G, "protein_network.gml")
+print("\nNetwork exported to protein_network.gml")
+```
+
+**Output:** NetworkX graph exported in GML format for Cytoscape visualization.
+
+---
+
+## Multi-Organism Comparative Analysis
+
+**Goal:** Compare pathway or gene presence across multiple organisms.
+
+### Workflow
+
+```python
+from bioservices import KEGG
+
+k = KEGG()
+
+# Organisms to compare
+organisms = ["hsa", "mmu", "dme", "sce"]  # Human, mouse, fly, yeast
+organism_names = {
+    "hsa": "Human",
+    "mmu": "Mouse",
+    "dme": "Fly",
+    "sce": "Yeast"
+}
+
+# Pathway of interest
+pathway_name = "cell cycle"
+
+print(f"Searching for '{pathway_name}' pathway across organisms:\n")
+
+for org in organisms:
+    k.organism = org
+
+    # Search pathways
+    results = k.lookfor_pathway(pathway_name)
+
+    print(f"{organism_names[org]} ({org}):")
+    if results:
+        for pathway in results[:3]:  # Show first 3
+            print(f"  {pathway}")
+    else:
+        print("  No matches found")
+    print()
+```
+
+**Output:** Pathway presence/absence across organisms.
+
+---
+
+## Best Practices for Workflows
+
+### 1. Error Handling
+
+Wrap service calls in `try`/`except` so that one failed request does not abort the whole workflow:
+```python
+try:
+    result = service.method(params)
+    if result:
+        # Process
+        pass
+except Exception as e:
+    print(f"Error: {e}")
+```
+
+### 2. Rate Limiting
+
+Add delays for batch processing:
+```python
+import time
+
+for item in items:
+    result = service.query(item)
+    time.sleep(0.5)  # 500ms delay
+```
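+
+Fixed delays can be combined with retries for transient failures. The sketch below shows retry with exponential backoff around an arbitrary callable. `call_with_retry` and its parameters are hypothetical and not part of BioServices, and the default values are assumptions to tune per service.
+
+```python
+import time
+
+
+def call_with_retry(func, *args, retries=3, base_delay=0.5, sleep=time.sleep, **kwargs):
+    """Call func, retrying on any exception with delays of base_delay * 2**attempt."""
+    for attempt in range(retries):
+        try:
+            return func(*args, **kwargs)
+        except Exception:
+            if attempt == retries - 1:
+                raise  # Out of attempts: surface the last error
+            sleep(base_delay * 2 ** attempt)
+
+
+# Offline demonstration: a function that fails twice, then succeeds
+attempts = {"count": 0}
+
+
+def flaky_call():
+    attempts["count"] += 1
+    if attempts["count"] < 3:
+        raise RuntimeError("temporary failure")
+    return "ok"
+
+
+outcome = call_with_retry(flaky_call, sleep=lambda s: None)
+print(outcome, attempts["count"])
+```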
+
+### 3. Result Validation
+
+Check for empty or unexpected results:
+```python
+if result:  # Empty strings, lists, and dicts are all falsy
+    # Process
+    pass
+else:
+    print("No results returned")
+```
+
+### 4. Progress Reporting
+
+For long workflows:
+```python
+total = len(items)
+for i, item in enumerate(items):
+    # Process item
+    if (i + 1) % 10 == 0:
+        print(f"Processed {i+1}/{total}")
+```
+
+### 5. Data Export
+
+Save intermediate results:
+```python
+import json
+
+with open("results.json", "w") as f:
+    json.dump(results, f, indent=2)
+```
+
+---
+
+## Integration with Other Tools
+
+### BioPython Integration
+
+```python
+from bioservices import UniProt
+from Bio import SeqIO
+from io import StringIO
+
+u = UniProt()
+fasta_data = u.retrieve("P43403", frmt="fasta")
+
+# Parse with BioPython
+fasta_io = StringIO(fasta_data)
+record = SeqIO.read(fasta_io, "fasta")
+
+print(f"Sequence length: {len(record.seq)}")
+print(f"Description: {record.description}")
+```
+
+### Pandas Integration
+
+```python
+from bioservices import UniProt
+import pandas as pd
+from io import StringIO
+
+u = UniProt()
+results = u.search("zap70", frmt="tab", columns="id,genes,length,organism")
+
+# Load into DataFrame
+df = pd.read_csv(StringIO(results), sep="\t")
+print(df.head())
+print(df.describe())
+```
+
+### NetworkX Integration
+
+See [Protein Interaction Network Construction](#protein-interaction-network-construction) above.
+
+---
+
+For complete working examples, see the scripts in the `scripts/` directory.