Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/biopython/references/alignment.md
+++ b/skills/biopython/references/alignment.md
@@ -0,0 +1,362 @@
+# Sequence Alignments with Bio.Align and Bio.AlignIO
+
+## Overview
+
+Bio.Align provides tools for pairwise sequence alignment using various algorithms, while Bio.AlignIO handles reading and writing multiple sequence alignment files in various formats.
+
+## Pairwise Alignment with Bio.Align
+
+### The PairwiseAligner Class
+
+The `PairwiseAligner` class performs pairwise sequence alignments using Needleman-Wunsch (global), Smith-Waterman (local), Gotoh (three-state), and Waterman-Smith-Beyer algorithms. The appropriate algorithm is automatically selected based on gap score parameters.
+
+### Creating an Aligner
+
+```python
+from Bio import Align
+
+# Create aligner with default parameters
+aligner = Align.PairwiseAligner()
+
+# Default scores (as of Biopython 1.85+):
+# - Match score: +1.0
+# - Mismatch score: 0.0
+# - All gap scores: -1.0
+```
+
+### Customizing Alignment Parameters
+
+```python
+# Set scoring parameters
+aligner.match_score = 2.0
+aligner.mismatch_score = -1.0
+aligner.gap_score = -0.5
+
+# Or use separate gap opening/extension penalties
+aligner.open_gap_score = -2.0
+aligner.extend_gap_score = -0.5
+
+# Set internal gap scores separately
+aligner.internal_open_gap_score = -2.0
+aligner.internal_extend_gap_score = -0.5
+
+# Set end gap scores (for semi-global alignment)
+aligner.left_open_gap_score = 0.0
+aligner.left_extend_gap_score = 0.0
+aligner.right_open_gap_score = 0.0
+aligner.right_extend_gap_score = 0.0
+```
+
+### Alignment Modes
+
+```python
+# Global alignment (default)
+aligner.mode = 'global'
+
+# Local alignment
+aligner.mode = 'local'
+```
+
+### Performing Alignments
+
+```python
+from Bio.Seq import Seq
+
+seq1 = Seq("ACCGGT")
+seq2 = Seq("ACGGT")
+
+# Get all optimal alignments
+alignments = aligner.align(seq1, seq2)
+
+# Iterate through alignments
+for alignment in alignments:
+    print(alignment)
+    print(f"Score: {alignment.score}")
+
+# Get just the score
+score = aligner.score(seq1, seq2)
+```
+
+### Using Substitution Matrices
+
+```python
+from Bio.Align import substitution_matrices
+
+# Load a substitution matrix
+matrix = substitution_matrices.load("BLOSUM62")
+aligner.substitution_matrix = matrix
+
+# Align protein sequences
+protein1 = Seq("KEVLA")
+protein2 = Seq("KSVLA")
+alignments = aligner.align(protein1, protein2)
+```
+
+### Available Substitution Matrices
+
+Common matrices include:
+- **BLOSUM** series (BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90)
+- **PAM** series (PAM30, PAM70, PAM250)
+- **MATCH** - Simple match/mismatch matrix
+
+```python
+# List available matrices
+available = substitution_matrices.load()
+print(available)
+```
+
+## Multiple Sequence Alignments with Bio.AlignIO
+
+### Reading Alignments
+
+Bio.AlignIO provides similar API to Bio.SeqIO but for alignment files:
+
+```python
+from Bio import AlignIO
+
+# Read a single alignment
+alignment = AlignIO.read("alignment.aln", "clustal")
+
+# Parse multiple alignments from a file
+for alignment in AlignIO.parse("alignments.aln", "clustal"):
+    print(f"Alignment with {len(alignment)} sequences")
+    print(f"Alignment length: {alignment.get_alignment_length()}")
+```
+
+### Supported Alignment Formats
+
+Common formats include:
+- **clustal** - Clustal format
+- **phylip** - PHYLIP format
+- **phylip-relaxed** - Relaxed PHYLIP (longer names)
+- **stockholm** - Stockholm format
+- **fasta** - FASTA format (aligned)
+- **nexus** - NEXUS format
+- **emboss** - EMBOSS alignment format
+- **msf** - MSF format
+- **maf** - Multiple Alignment Format
+
+### Writing Alignments
+
+```python
+# Write alignment to file
+AlignIO.write(alignment, "output.aln", "clustal")
+
+# Convert between formats
+count = AlignIO.convert("input.aln", "clustal", "output.phy", "phylip")
+```
+
+### Working with Alignment Objects
+
+```python
+from Bio import AlignIO
+
+alignment = AlignIO.read("alignment.aln", "clustal")
+
+# Get alignment properties
+print(f"Number of sequences: {len(alignment)}")
+print(f"Alignment length: {alignment.get_alignment_length()}")
+
+# Access individual sequences
+for record in alignment:
+    print(f"{record.id}: {record.seq}")
+
+# Get alignment column
+column = alignment[:, 0]  # First column
+
+# Get alignment slice
+sub_alignment = alignment[:, 10:20]  # Positions 10-20
+
+# Get specific sequence
+seq_record = alignment[0]  # First sequence
+```
+
+### Alignment Analysis
+
+```python
+# Calculate alignment statistics
+from Bio.Align import AlignInfo
+
+summary = AlignInfo.SummaryInfo(alignment)
+
+# Get consensus sequence
+consensus = summary.gap_consensus(threshold=0.7)
+
+# Position-specific scoring matrix (PSSM)
+pssm = summary.pos_specific_score_matrix(consensus)
+
+# Calculate information content
+from Bio import motifs
+motif = motifs.create([record.seq for record in alignment])
+information = motif.counts.information_content()
+```
+
+## Creating Alignments Programmatically
+
+### From SeqRecord Objects
+
+```python
+from Bio.Align import MultipleSeqAlignment
+from Bio.SeqRecord import SeqRecord
+from Bio.Seq import Seq
+
+# Create records
+records = [
+    SeqRecord(Seq("ACTGCTAGCTAG"), id="seq1"),
+    SeqRecord(Seq("ACT-CTAGCTAG"), id="seq2"),
+    SeqRecord(Seq("ACTGCTA-CTAG"), id="seq3"),
+]
+
+# Create alignment
+alignment = MultipleSeqAlignment(records)
+```
+
+### Adding Sequences to Alignments
+
+```python
+# Start with empty alignment
+alignment = MultipleSeqAlignment([])
+
+# Add sequences (must have same length)
+alignment.append(SeqRecord(Seq("ACTG"), id="seq1"))
+alignment.append(SeqRecord(Seq("ACTG"), id="seq2"))
+
+# Extend with another alignment
+alignment.extend(other_alignment)
+```
+
+## Advanced Alignment Operations
+
+### Removing Gaps
+
+```python
+# Remove all gap-only columns
+from Bio.Align import AlignInfo
+
+no_gaps = []
+for i in range(alignment.get_alignment_length()):
+    column = alignment[:, i]
+    if set(column) != {'-'}:  # Not all gaps
+        no_gaps.append(column)
+```
+
+### Alignment Sorting
+
+```python
+# Sort by sequence ID
+sorted_alignment = sorted(alignment, key=lambda x: x.id)
+alignment = MultipleSeqAlignment(sorted_alignment)
+```
+
+### Computing Pairwise Identities
+
+```python
+def pairwise_identity(seq1, seq2):
+    """Calculate percent identity between two sequences."""
+    matches = sum(a == b for a, b in zip(seq1, seq2) if a != '-' and b != '-')
+    length = sum(1 for a, b in zip(seq1, seq2) if a != '-' and b != '-')
+    return matches / length if length > 0 else 0
+
+# Calculate all pairwise identities
+for i, record1 in enumerate(alignment):
+    for record2 in alignment[i+1:]:
+        identity = pairwise_identity(record1.seq, record2.seq)
+        print(f"{record1.id} vs {record2.id}: {identity:.2%}")
+```
+
+## Running External Alignment Tools
+
+### Clustal Omega (via Command Line)
+
+```python
+from Bio.Align.Applications import ClustalOmegaCommandline
+
+# Setup command
+clustal_cmd = ClustalOmegaCommandline(
+    infile="sequences.fasta",
+    outfile="alignment.aln",
+    verbose=True,
+    auto=True
+)
+
+# Run alignment
+stdout, stderr = clustal_cmd()
+
+# Read result
+alignment = AlignIO.read("alignment.aln", "clustal")
+```
+
+### MUSCLE (via Command Line)
+
+```python
+from Bio.Align.Applications import MuscleCommandline
+
+muscle_cmd = MuscleCommandline(
+    input="sequences.fasta",
+    out="alignment.aln"
+)
+stdout, stderr = muscle_cmd()
+```
+
+## Best Practices
+
+1. **Choose appropriate scoring schemes** - Use BLOSUM62 for proteins, custom scores for DNA
+2. **Consider alignment mode** - Global for similar-length sequences, local for finding conserved regions
+3. **Set gap penalties carefully** - Higher penalties create fewer, longer gaps
+4. **Use appropriate formats** - FASTA for simple alignments, Stockholm for rich annotation
+5. **Validate alignment quality** - Check for conserved regions and percent identity
+6. **Handle large alignments carefully** - Use slicing and iteration for memory efficiency
+7. **Preserve metadata** - Maintain SeqRecord IDs and annotations through alignment operations
+
+## Common Use Cases
+
+### Find Best Local Alignment
+
+```python
+from Bio.Align import PairwiseAligner
+from Bio.Seq import Seq
+
+aligner = PairwiseAligner()
+aligner.mode = 'local'
+aligner.match_score = 2
+aligner.mismatch_score = -1
+
+seq1 = Seq("AGCTTAGCTAGCTAGC")
+seq2 = Seq("CTAGCTAGC")
+
+alignments = aligner.align(seq1, seq2)
+print(alignments[0])
+```
+
+### Protein Sequence Alignment
+
+```python
+from Bio.Align import PairwiseAligner, substitution_matrices
+
+aligner = PairwiseAligner()
+aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
+aligner.open_gap_score = -10
+aligner.extend_gap_score = -0.5
+
+protein1 = Seq("KEVLA")
+protein2 = Seq("KEVLAEQP")
+alignments = aligner.align(protein1, protein2)
+```
+
+### Extract Conserved Regions
+
+```python
+from Bio import AlignIO
+
+alignment = AlignIO.read("alignment.aln", "clustal")
+
+# Find columns with >80% identity
+conserved_positions = []
+for i in range(alignment.get_alignment_length()):
+    column = alignment[:, i]
+    most_common = max(set(column), key=column.count)
+    if column.count(most_common) / len(column) > 0.8:
+        conserved_positions.append(i)
+
+print(f"Conserved positions: {conserved_positions}")
+```