Initial commit
This commit is contained in:
362
skills/biopython/references/alignment.md
Normal file
362
skills/biopython/references/alignment.md
Normal file
@@ -0,0 +1,362 @@
|
||||
# Sequence Alignments with Bio.Align and Bio.AlignIO
|
||||
|
||||
## Overview
|
||||
|
||||
Bio.Align provides tools for pairwise sequence alignment using various algorithms, while Bio.AlignIO handles reading and writing multiple sequence alignment files in various formats.
|
||||
|
||||
## Pairwise Alignment with Bio.Align
|
||||
|
||||
### The PairwiseAligner Class
|
||||
|
||||
The `PairwiseAligner` class performs pairwise sequence alignments using Needleman-Wunsch (global), Smith-Waterman (local), Gotoh (three-state), and Waterman-Smith-Beyer algorithms. The appropriate algorithm is automatically selected based on gap score parameters.
|
||||
|
||||
### Creating an Aligner
|
||||
|
||||
```python
|
||||
from Bio import Align
|
||||
|
||||
# Create aligner with default parameters
|
||||
aligner = Align.PairwiseAligner()
|
||||
|
||||
# Default scores (as of Biopython 1.85+):
|
||||
# - Match score: +1.0
|
||||
# - Mismatch score: 0.0
|
||||
# - All gap scores: -1.0
|
||||
```
|
||||
|
||||
### Customizing Alignment Parameters
|
||||
|
||||
```python
|
||||
# Set scoring parameters
|
||||
aligner.match_score = 2.0
|
||||
aligner.mismatch_score = -1.0
|
||||
aligner.gap_score = -0.5
|
||||
|
||||
# Or use separate gap opening/extension penalties
|
||||
aligner.open_gap_score = -2.0
|
||||
aligner.extend_gap_score = -0.5
|
||||
|
||||
# Set internal gap scores separately
|
||||
aligner.internal_open_gap_score = -2.0
|
||||
aligner.internal_extend_gap_score = -0.5
|
||||
|
||||
# Set end gap scores (for semi-global alignment)
|
||||
aligner.left_open_gap_score = 0.0
|
||||
aligner.left_extend_gap_score = 0.0
|
||||
aligner.right_open_gap_score = 0.0
|
||||
aligner.right_extend_gap_score = 0.0
|
||||
```
|
||||
|
||||
### Alignment Modes
|
||||
|
||||
```python
|
||||
# Global alignment (default)
|
||||
aligner.mode = 'global'
|
||||
|
||||
# Local alignment
|
||||
aligner.mode = 'local'
|
||||
```
|
||||
|
||||
### Performing Alignments
|
||||
|
||||
```python
|
||||
from Bio.Seq import Seq
|
||||
|
||||
seq1 = Seq("ACCGGT")
|
||||
seq2 = Seq("ACGGT")
|
||||
|
||||
# Get all optimal alignments
|
||||
alignments = aligner.align(seq1, seq2)
|
||||
|
||||
# Iterate through alignments
|
||||
for alignment in alignments:
|
||||
print(alignment)
|
||||
print(f"Score: {alignment.score}")
|
||||
|
||||
# Get just the score
|
||||
score = aligner.score(seq1, seq2)
|
||||
```
|
||||
|
||||
### Using Substitution Matrices
|
||||
|
||||
```python
|
||||
from Bio.Align import substitution_matrices
|
||||
|
||||
# Load a substitution matrix
|
||||
matrix = substitution_matrices.load("BLOSUM62")
|
||||
aligner.substitution_matrix = matrix
|
||||
|
||||
# Align protein sequences
|
||||
protein1 = Seq("KEVLA")
|
||||
protein2 = Seq("KSVLA")
|
||||
alignments = aligner.align(protein1, protein2)
|
||||
```
|
||||
|
||||
### Available Substitution Matrices
|
||||
|
||||
Common matrices include:
|
||||
- **BLOSUM** series (BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90)
|
||||
- **PAM** series (PAM30, PAM70, PAM250)
|
||||
- **MATCH** - Simple match/mismatch matrix
|
||||
|
||||
```python
|
||||
# List available matrices
|
||||
available = substitution_matrices.load()
|
||||
print(available)
|
||||
```
|
||||
|
||||
## Multiple Sequence Alignments with Bio.AlignIO
|
||||
|
||||
### Reading Alignments
|
||||
|
||||
Bio.AlignIO provides similar API to Bio.SeqIO but for alignment files:
|
||||
|
||||
```python
|
||||
from Bio import AlignIO
|
||||
|
||||
# Read a single alignment
|
||||
alignment = AlignIO.read("alignment.aln", "clustal")
|
||||
|
||||
# Parse multiple alignments from a file
|
||||
for alignment in AlignIO.parse("alignments.aln", "clustal"):
|
||||
print(f"Alignment with {len(alignment)} sequences")
|
||||
print(f"Alignment length: {alignment.get_alignment_length()}")
|
||||
```
|
||||
|
||||
### Supported Alignment Formats
|
||||
|
||||
Common formats include:
|
||||
- **clustal** - Clustal format
|
||||
- **phylip** - PHYLIP format
|
||||
- **phylip-relaxed** - Relaxed PHYLIP (longer names)
|
||||
- **stockholm** - Stockholm format
|
||||
- **fasta** - FASTA format (aligned)
|
||||
- **nexus** - NEXUS format
|
||||
- **emboss** - EMBOSS alignment format
|
||||
- **msf** - MSF format
|
||||
- **maf** - Multiple Alignment Format
|
||||
|
||||
### Writing Alignments
|
||||
|
||||
```python
|
||||
# Write alignment to file
|
||||
AlignIO.write(alignment, "output.aln", "clustal")
|
||||
|
||||
# Convert between formats
|
||||
count = AlignIO.convert("input.aln", "clustal", "output.phy", "phylip")
|
||||
```
|
||||
|
||||
### Working with Alignment Objects
|
||||
|
||||
```python
|
||||
from Bio import AlignIO
|
||||
|
||||
alignment = AlignIO.read("alignment.aln", "clustal")
|
||||
|
||||
# Get alignment properties
|
||||
print(f"Number of sequences: {len(alignment)}")
|
||||
print(f"Alignment length: {alignment.get_alignment_length()}")
|
||||
|
||||
# Access individual sequences
|
||||
for record in alignment:
|
||||
print(f"{record.id}: {record.seq}")
|
||||
|
||||
# Get alignment column
|
||||
column = alignment[:, 0] # First column
|
||||
|
||||
# Get alignment slice
|
||||
sub_alignment = alignment[:, 10:20] # Positions 10-20
|
||||
|
||||
# Get specific sequence
|
||||
seq_record = alignment[0] # First sequence
|
||||
```
|
||||
|
||||
### Alignment Analysis
|
||||
|
||||
```python
|
||||
# Calculate alignment statistics
|
||||
from Bio.Align import AlignInfo
|
||||
|
||||
summary = AlignInfo.SummaryInfo(alignment)
|
||||
|
||||
# Get consensus sequence
|
||||
consensus = summary.gap_consensus(threshold=0.7)
|
||||
|
||||
# Position-specific scoring matrix (PSSM)
|
||||
pssm = summary.pos_specific_score_matrix(consensus)
|
||||
|
||||
# Calculate information content
|
||||
from Bio import motifs
|
||||
motif = motifs.create([record.seq for record in alignment])
|
||||
information = motif.counts.information_content()
|
||||
```
|
||||
|
||||
## Creating Alignments Programmatically
|
||||
|
||||
### From SeqRecord Objects
|
||||
|
||||
```python
|
||||
from Bio.Align import MultipleSeqAlignment
|
||||
from Bio.SeqRecord import SeqRecord
|
||||
from Bio.Seq import Seq
|
||||
|
||||
# Create records
|
||||
records = [
|
||||
SeqRecord(Seq("ACTGCTAGCTAG"), id="seq1"),
|
||||
SeqRecord(Seq("ACT-CTAGCTAG"), id="seq2"),
|
||||
SeqRecord(Seq("ACTGCTA-CTAG"), id="seq3"),
|
||||
]
|
||||
|
||||
# Create alignment
|
||||
alignment = MultipleSeqAlignment(records)
|
||||
```
|
||||
|
||||
### Adding Sequences to Alignments
|
||||
|
||||
```python
|
||||
# Start with empty alignment
|
||||
alignment = MultipleSeqAlignment([])
|
||||
|
||||
# Add sequences (must have same length)
|
||||
alignment.append(SeqRecord(Seq("ACTG"), id="seq1"))
|
||||
alignment.append(SeqRecord(Seq("ACTG"), id="seq2"))
|
||||
|
||||
# Extend with another alignment
|
||||
alignment.extend(other_alignment)
|
||||
```
|
||||
|
||||
## Advanced Alignment Operations
|
||||
|
||||
### Removing Gaps
|
||||
|
||||
```python
|
||||
# Remove all gap-only columns
|
||||
from Bio.Align import AlignInfo
|
||||
|
||||
no_gaps = []
|
||||
for i in range(alignment.get_alignment_length()):
|
||||
column = alignment[:, i]
|
||||
if set(column) != {'-'}: # Not all gaps
|
||||
no_gaps.append(column)
|
||||
```
|
||||
|
||||
### Alignment Sorting
|
||||
|
||||
```python
|
||||
# Sort by sequence ID
|
||||
sorted_alignment = sorted(alignment, key=lambda x: x.id)
|
||||
alignment = MultipleSeqAlignment(sorted_alignment)
|
||||
```
|
||||
|
||||
### Computing Pairwise Identities
|
||||
|
||||
```python
|
||||
def pairwise_identity(seq1, seq2):
|
||||
"""Calculate percent identity between two sequences."""
|
||||
matches = sum(a == b for a, b in zip(seq1, seq2) if a != '-' and b != '-')
|
||||
length = sum(1 for a, b in zip(seq1, seq2) if a != '-' and b != '-')
|
||||
return matches / length if length > 0 else 0
|
||||
|
||||
# Calculate all pairwise identities
|
||||
for i, record1 in enumerate(alignment):
|
||||
for record2 in alignment[i+1:]:
|
||||
identity = pairwise_identity(record1.seq, record2.seq)
|
||||
print(f"{record1.id} vs {record2.id}: {identity:.2%}")
|
||||
```
|
||||
|
||||
## Running External Alignment Tools
|
||||
|
||||
### Clustal Omega (via Command Line)
|
||||
|
||||
```python
|
||||
from Bio.Align.Applications import ClustalOmegaCommandline
|
||||
|
||||
# Setup command
|
||||
clustal_cmd = ClustalOmegaCommandline(
|
||||
infile="sequences.fasta",
|
||||
outfile="alignment.aln",
|
||||
verbose=True,
|
||||
auto=True
|
||||
)
|
||||
|
||||
# Run alignment
|
||||
stdout, stderr = clustal_cmd()
|
||||
|
||||
# Read result
|
||||
alignment = AlignIO.read("alignment.aln", "clustal")
|
||||
```
|
||||
|
||||
### MUSCLE (via Command Line)
|
||||
|
||||
```python
|
||||
from Bio.Align.Applications import MuscleCommandline
|
||||
|
||||
muscle_cmd = MuscleCommandline(
|
||||
input="sequences.fasta",
|
||||
out="alignment.aln"
|
||||
)
|
||||
stdout, stderr = muscle_cmd()
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Choose appropriate scoring schemes** - Use BLOSUM62 for proteins, custom scores for DNA
|
||||
2. **Consider alignment mode** - Global for similar-length sequences, local for finding conserved regions
|
||||
3. **Set gap penalties carefully** - Higher penalties create fewer, longer gaps
|
||||
4. **Use appropriate formats** - FASTA for simple alignments, Stockholm for rich annotation
|
||||
5. **Validate alignment quality** - Check for conserved regions and percent identity
|
||||
6. **Handle large alignments carefully** - Use slicing and iteration for memory efficiency
|
||||
7. **Preserve metadata** - Maintain SeqRecord IDs and annotations through alignment operations
|
||||
|
||||
## Common Use Cases
|
||||
|
||||
### Find Best Local Alignment
|
||||
|
||||
```python
|
||||
from Bio.Align import PairwiseAligner
|
||||
from Bio.Seq import Seq
|
||||
|
||||
aligner = PairwiseAligner()
|
||||
aligner.mode = 'local'
|
||||
aligner.match_score = 2
|
||||
aligner.mismatch_score = -1
|
||||
|
||||
seq1 = Seq("AGCTTAGCTAGCTAGC")
|
||||
seq2 = Seq("CTAGCTAGC")
|
||||
|
||||
alignments = aligner.align(seq1, seq2)
|
||||
print(alignments[0])
|
||||
```
|
||||
|
||||
### Protein Sequence Alignment
|
||||
|
||||
```python
|
||||
from Bio.Align import PairwiseAligner, substitution_matrices
|
||||
|
||||
aligner = PairwiseAligner()
|
||||
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
|
||||
aligner.open_gap_score = -10
|
||||
aligner.extend_gap_score = -0.5
|
||||
|
||||
protein1 = Seq("KEVLA")
|
||||
protein2 = Seq("KEVLAEQP")
|
||||
alignments = aligner.align(protein1, protein2)
|
||||
```
|
||||
|
||||
### Extract Conserved Regions
|
||||
|
||||
```python
|
||||
from Bio import AlignIO
|
||||
|
||||
alignment = AlignIO.read("alignment.aln", "clustal")
|
||||
|
||||
# Find columns with >80% identity
|
||||
conserved_positions = []
|
||||
for i in range(alignment.get_alignment_length()):
|
||||
column = alignment[:, i]
|
||||
most_common = max(set(column), key=column.count)
|
||||
if column.count(most_common) / len(column) > 0.8:
|
||||
conserved_positions.append(i)
|
||||
|
||||
print(f"Conserved positions: {conserved_positions}")
|
||||
```
|
||||
Reference in New Issue
Block a user