Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/biopython/SKILL.md
+++ b/skills/biopython/SKILL.md
@@ -0,0 +1,437 @@
+---
+name: biopython
+description: "Primary Python toolkit for molecular biology. Preferred for Python-based PubMed/NCBI queries (Bio.Entrez), sequence manipulation, file parsing (FASTA, GenBank, FASTQ, PDB), advanced BLAST workflows, structures, phylogenetics. For quick BLAST, use gget. For direct REST API, use pubmed-database."
+---
+
+# Biopython: Computational Molecular Biology in Python
+
+## Overview
+
+Biopython is a comprehensive set of freely available Python tools for biological computation. It provides functionality for sequence manipulation, file I/O, database access, structural bioinformatics, phylogenetics, and many other bioinformatics tasks. The current version is **Biopython 1.85** (released January 2025), which supports Python 3 and requires NumPy.
+
+## When to Use This Skill
+
+Use this skill when:
+
+- Working with biological sequences (DNA, RNA, or protein)
+- Reading, writing, or converting biological file formats (FASTA, GenBank, FASTQ, PDB, mmCIF, etc.)
+- Accessing NCBI databases (GenBank, PubMed, Protein, Gene, etc.) via Entrez
+- Running BLAST searches or parsing BLAST results
+- Performing sequence alignments (pairwise or multiple sequence alignments)
+- Analyzing protein structures from PDB files
+- Creating, manipulating, or visualizing phylogenetic trees
+- Finding sequence motifs or analyzing motif patterns
+- Calculating sequence statistics (GC content, molecular weight, melting temperature, etc.)
+- Performing structural bioinformatics tasks
+- Working with population genetics data
+- Any other computational molecular biology task
+
+## Core Capabilities
+
+Biopython is organized into modular sub-packages, each addressing specific bioinformatics domains:
+
+1. **Sequence Handling** - Bio.Seq and Bio.SeqIO for sequence manipulation and file I/O
+2. **Alignment Analysis** - Bio.Align and Bio.AlignIO for pairwise and multiple sequence alignments
+3. **Database Access** - Bio.Entrez for programmatic access to NCBI databases
+4. **BLAST Operations** - Bio.Blast for running and parsing BLAST searches
+5. **Structural Bioinformatics** - Bio.PDB for working with 3D protein structures
+6. **Phylogenetics** - Bio.Phylo for phylogenetic tree manipulation and visualization
+7. **Advanced Features** - Motifs, population genetics, sequence utilities, and more
+
+## Installation and Setup
+
+Install Biopython using pip (requires Python 3 and NumPy):
+
+```python
+uv pip install biopython
+```
+
+For NCBI database access, always set your email address (required by NCBI):
+
+```python
+from Bio import Entrez
+Entrez.email = "your.email@example.com"
+
+# Optional: API key for higher rate limits (10 req/s instead of 3 req/s)
+Entrez.api_key = "your_api_key_here"
+```
+
+## Using This Skill
+
+This skill provides comprehensive documentation organized by functionality area. When working on a task, consult the relevant reference documentation:
+
+### 1. Sequence Handling (Bio.Seq & Bio.SeqIO)
+
+**Reference:** `references/sequence_io.md`
+
+Use for:
+- Creating and manipulating biological sequences
+- Reading and writing sequence files (FASTA, GenBank, FASTQ, etc.)
+- Converting between file formats
+- Extracting sequences from large files
+- Sequence translation, transcription, and reverse complement
+- Working with SeqRecord objects
+
+**Quick example:**
+```python
+from Bio import SeqIO
+
+# Read sequences from FASTA file
+for record in SeqIO.parse("sequences.fasta", "fasta"):
+    print(f"{record.id}: {len(record.seq)} bp")
+
+# Convert GenBank to FASTA
+SeqIO.convert("input.gb", "genbank", "output.fasta", "fasta")
+```
+
+### 2. Alignment Analysis (Bio.Align & Bio.AlignIO)
+
+**Reference:** `references/alignment.md`
+
+Use for:
+- Pairwise sequence alignment (global and local)
+- Reading and writing multiple sequence alignments
+- Using substitution matrices (BLOSUM, PAM)
+- Calculating alignment statistics
+- Customizing alignment parameters
+
+**Quick example:**
+```python
+from Bio import Align
+
+# Pairwise alignment
+aligner = Align.PairwiseAligner()
+aligner.mode = 'global'
+alignments = aligner.align("ACCGGT", "ACGGT")
+print(alignments[0])
+```
+
+### 3. Database Access (Bio.Entrez)
+
+**Reference:** `references/databases.md`
+
+Use for:
+- Searching NCBI databases (PubMed, GenBank, Protein, Gene, etc.)
+- Downloading sequences and records
+- Fetching publication information
+- Finding related records across databases
+- Batch downloading with proper rate limiting
+
+**Quick example:**
+```python
+from Bio import Entrez
+Entrez.email = "your.email@example.com"
+
+# Search PubMed
+handle = Entrez.esearch(db="pubmed", term="biopython", retmax=10)
+results = Entrez.read(handle)
+handle.close()
+print(f"Found {results['Count']} results")
+```
+
+### 4. BLAST Operations (Bio.Blast)
+
+**Reference:** `references/blast.md`
+
+Use for:
+- Running BLAST searches via NCBI web services
+- Running local BLAST searches
+- Parsing BLAST XML output
+- Filtering results by E-value or identity
+- Extracting hit sequences
+
+**Quick example:**
+```python
+from Bio.Blast import NCBIWWW, NCBIXML
+
+# Run BLAST search
+result_handle = NCBIWWW.qblast("blastn", "nt", "ATCGATCGATCG")
+blast_record = NCBIXML.read(result_handle)
+
+# Display top hits
+for alignment in blast_record.alignments[:5]:
+    print(f"{alignment.title}: E-value={alignment.hsps[0].expect}")
+```
+
+### 5. Structural Bioinformatics (Bio.PDB)
+
+**Reference:** `references/structure.md`
+
+Use for:
+- Parsing PDB and mmCIF structure files
+- Navigating protein structure hierarchy (SMCRA: Structure/Model/Chain/Residue/Atom)
+- Calculating distances, angles, and dihedrals
+- Secondary structure assignment (DSSP)
+- Structure superimposition and RMSD calculation
+- Extracting sequences from structures
+
+**Quick example:**
+```python
+from Bio.PDB import PDBParser
+
+# Parse structure
+parser = PDBParser(QUIET=True)
+structure = parser.get_structure("1crn", "1crn.pdb")
+
+# Calculate distance between alpha carbons
+chain = structure[0]["A"]
+distance = chain[10]["CA"] - chain[20]["CA"]
+print(f"Distance: {distance:.2f} Å")
+```
+
+### 6. Phylogenetics (Bio.Phylo)
+
+**Reference:** `references/phylogenetics.md`
+
+Use for:
+- Reading and writing phylogenetic trees (Newick, NEXUS, phyloXML)
+- Building trees from distance matrices or alignments
+- Tree manipulation (pruning, rerooting, ladderizing)
+- Calculating phylogenetic distances
+- Creating consensus trees
+- Visualizing trees
+
+**Quick example:**
+```python
+from Bio import Phylo
+
+# Read and visualize tree
+tree = Phylo.read("tree.nwk", "newick")
+Phylo.draw_ascii(tree)
+
+# Calculate distance
+distance = tree.distance("Species_A", "Species_B")
+print(f"Distance: {distance:.3f}")
+```
+
+### 7. Advanced Features
+
+**Reference:** `references/advanced.md`
+
+Use for:
+- **Sequence motifs** (Bio.motifs) - Finding and analyzing motif patterns
+- **Population genetics** (Bio.PopGen) - GenePop files, Fst calculations, Hardy-Weinberg tests
+- **Sequence utilities** (Bio.SeqUtils) - GC content, melting temperature, molecular weight, protein analysis
+- **Restriction analysis** (Bio.Restriction) - Finding restriction enzyme sites
+- **Clustering** (Bio.Cluster) - K-means and hierarchical clustering
+- **Genome diagrams** (GenomeDiagram) - Visualizing genomic features
+
+**Quick example:**
+```python
+from Bio.SeqUtils import gc_fraction, molecular_weight
+from Bio.Seq import Seq
+
+seq = Seq("ATCGATCGATCG")
+print(f"GC content: {gc_fraction(seq):.2%}")
+print(f"Molecular weight: {molecular_weight(seq, seq_type='DNA'):.2f} g/mol")
+```
+
+## General Workflow Guidelines
+
+### Reading Documentation
+
+When a user asks about a specific Biopython task:
+
+1. **Identify the relevant module** based on the task description
+2. **Read the appropriate reference file** using the Read tool
+3. **Extract relevant code patterns** and adapt them to the user's specific needs
+4. **Combine multiple modules** when the task requires it
+
+Example search patterns for reference files:
+```bash
+# Find information about specific functions
+grep -n "SeqIO.parse" references/sequence_io.md
+
+# Find examples of specific tasks
+grep -n "BLAST" references/blast.md
+
+# Find information about specific concepts
+grep -n "alignment" references/alignment.md
+```
+
+### Writing Biopython Code
+
+Follow these principles when writing Biopython code:
+
+1. **Import modules explicitly**
+   ```python
+   from Bio import SeqIO, Entrez
+   from Bio.Seq import Seq
+   ```
+
+2. **Set Entrez email** when using NCBI databases
+   ```python
+   Entrez.email = "your.email@example.com"
+   ```
+
+3. **Use appropriate file formats** - Check which format best suits the task
+   ```python
+   # Common formats: "fasta", "genbank", "fastq", "clustal", "phylip"
+   ```
+
+4. **Handle files properly** - Close handles after use or use context managers
+   ```python
+   with open("file.fasta") as handle:
+       records = SeqIO.parse(handle, "fasta")
+   ```
+
+5. **Use iterators for large files** - Avoid loading everything into memory
+   ```python
+   for record in SeqIO.parse("large_file.fasta", "fasta"):
+       # Process one record at a time
+   ```
+
+6. **Handle errors gracefully** - Network operations and file parsing can fail
+   ```python
+   try:
+       handle = Entrez.efetch(db="nucleotide", id=accession)
+   except HTTPError as e:
+       print(f"Error: {e}")
+   ```
+
+## Common Patterns
+
+### Pattern 1: Fetch Sequence from GenBank
+
+```python
+from Bio import Entrez, SeqIO
+
+Entrez.email = "your.email@example.com"
+
+# Fetch sequence
+handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
+record = SeqIO.read(handle, "genbank")
+handle.close()
+
+print(f"Description: {record.description}")
+print(f"Sequence length: {len(record.seq)}")
+```
+
+### Pattern 2: Sequence Analysis Pipeline
+
+```python
+from Bio import SeqIO
+from Bio.SeqUtils import gc_fraction
+
+for record in SeqIO.parse("sequences.fasta", "fasta"):
+    # Calculate statistics
+    gc = gc_fraction(record.seq)
+    length = len(record.seq)
+
+    # Find ORFs, translate, etc.
+    protein = record.seq.translate()
+
+    print(f"{record.id}: {length} bp, GC={gc:.2%}")
+```
+
+### Pattern 3: BLAST and Fetch Top Hits
+
+```python
+from Bio.Blast import NCBIWWW, NCBIXML
+from Bio import Entrez, SeqIO
+
+Entrez.email = "your.email@example.com"
+
+# Run BLAST
+result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
+blast_record = NCBIXML.read(result_handle)
+
+# Get top hit accessions
+accessions = [aln.accession for aln in blast_record.alignments[:5]]
+
+# Fetch sequences
+for acc in accessions:
+    handle = Entrez.efetch(db="nucleotide", id=acc, rettype="fasta", retmode="text")
+    record = SeqIO.read(handle, "fasta")
+    handle.close()
+    print(f">{record.description}")
+```
+
+### Pattern 4: Build Phylogenetic Tree from Sequences
+
+```python
+from Bio import AlignIO, Phylo
+from Bio.Phylo.TreeConstruction import DistanceCalculator, DistanceTreeConstructor
+
+# Read alignment
+alignment = AlignIO.read("alignment.fasta", "fasta")
+
+# Calculate distances
+calculator = DistanceCalculator("identity")
+dm = calculator.get_distance(alignment)
+
+# Build tree
+constructor = DistanceTreeConstructor()
+tree = constructor.nj(dm)
+
+# Visualize
+Phylo.draw_ascii(tree)
+```
+
+## Best Practices
+
+1. **Always read relevant reference documentation** before writing code
+2. **Use grep to search reference files** for specific functions or examples
+3. **Validate file formats** before parsing
+4. **Handle missing data gracefully** - Not all records have all fields
+5. **Cache downloaded data** - Don't repeatedly download the same sequences
+6. **Respect NCBI rate limits** - Use API keys and proper delays
+7. **Test with small datasets** before processing large files
+8. **Keep Biopython updated** to get latest features and bug fixes
+9. **Use appropriate genetic code tables** for translation
+10. **Document analysis parameters** for reproducibility
+
+## Troubleshooting Common Issues
+
+### Issue: "No handlers could be found for logger 'Bio.Entrez'"
+**Solution:** This is just a warning. Set Entrez.email to suppress it.
+
+### Issue: "HTTP Error 400" from NCBI
+**Solution:** Check that IDs/accessions are valid and properly formatted.
+
+### Issue: "ValueError: EOF" when parsing files
+**Solution:** Verify file format matches the specified format string.
+
+### Issue: Alignment fails with "sequences are not the same length"
+**Solution:** Ensure sequences are aligned before using AlignIO or MultipleSeqAlignment.
+
+### Issue: BLAST searches are slow
+**Solution:** Use local BLAST for large-scale searches, or cache results.
+
+### Issue: PDB parser warnings
+**Solution:** Use `PDBParser(QUIET=True)` to suppress warnings, or investigate structure quality.
+
+## Additional Resources
+
+- **Official Documentation**: https://biopython.org/docs/latest/
+- **Tutorial**: https://biopython.org/docs/latest/Tutorial/
+- **Cookbook**: https://biopython.org/docs/latest/Tutorial/ (advanced examples)
+- **GitHub**: https://github.com/biopython/biopython
+- **Mailing List**: biopython@biopython.org
+
+## Quick Reference
+
+To locate information in reference files, use these search patterns:
+
+```bash
+# Search for specific functions
+grep -n "function_name" references/*.md
+
+# Find examples of specific tasks
+grep -n "example" references/sequence_io.md
+
+# Find all occurrences of a module
+grep -n "Bio.Seq" references/*.md
+```
+
+## Summary
+
+Biopython provides comprehensive tools for computational molecular biology. When using this skill:
+
+1. **Identify the task domain** (sequences, alignments, databases, BLAST, structures, phylogenetics, or advanced)
+2. **Consult the appropriate reference file** in the `references/` directory
+3. **Adapt code examples** to the specific use case
+4. **Combine multiple modules** when needed for complex workflows
+5. **Follow best practices** for file handling, error checking, and data management
+
+The modular reference documentation ensures detailed, searchable information for every major Biopython capability.