# Sequence Alignments with Bio.Align and Bio.AlignIO

## Overview

Bio.Align provides tools for pairwise sequence alignment using various algorithms, while Bio.AlignIO handles reading and writing multiple sequence alignment files in various formats.

## Pairwise Alignment with Bio.Align

### The PairwiseAligner Class

The `PairwiseAligner` class performs pairwise sequence alignments using the Needleman-Wunsch (global), Smith-Waterman (local), Gotoh (three-state), and Waterman-Smith-Beyer algorithms. The aligner selects the appropriate algorithm automatically from the alignment mode and the gap score parameters.

### Creating an Aligner

```python
from Bio import Align

# Create aligner with default parameters
aligner = Align.PairwiseAligner()

# Default scores (as of Biopython 1.85+):
# - Match score: +1.0
# - Mismatch score: 0.0
# - All gap scores: -1.0
```

### Customizing Alignment Parameters

```python
# Set scoring parameters
aligner.match_score = 2.0
aligner.mismatch_score = -1.0
aligner.gap_score = -0.5

# Or use separate gap opening/extension penalties
aligner.open_gap_score = -2.0
aligner.extend_gap_score = -0.5

# Set internal gap scores separately
aligner.internal_open_gap_score = -2.0
aligner.internal_extend_gap_score = -0.5

# Set end gap scores (for semi-global alignment)
aligner.left_open_gap_score = 0.0
aligner.left_extend_gap_score = 0.0
aligner.right_open_gap_score = 0.0
aligner.right_extend_gap_score = 0.0
```

### Alignment Modes

```python
# Global alignment (default)
aligner.mode = 'global'

# Local alignment
aligner.mode = 'local'
```

### Performing Alignments

```python
from Bio.Seq import Seq

seq1 = Seq("ACCGGT")
seq2 = Seq("ACGGT")

# Get all optimal alignments
alignments = aligner.align(seq1, seq2)

# Iterate through alignments
for alignment in alignments:
    print(alignment)
    print(f"Score: {alignment.score}")

# Get just the score (no alignment objects are built)
score = aligner.score(seq1, seq2)
```

### Using Substitution Matrices

```python
from Bio.Align import substitution_matrices

# Load a substitution matrix
matrix = substitution_matrices.load("BLOSUM62")
aligner.substitution_matrix = matrix

# Align protein sequences
protein1 = Seq("KEVLA")
protein2 = Seq("KSVLA")
alignments = aligner.align(protein1, protein2)
```

### Available Substitution Matrices

Common matrices include:

- **BLOSUM** series (BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, BLOSUM90)
- **PAM** series (PAM30, PAM70, PAM250)
- **MATCH** - Simple match/mismatch matrix

```python
# List available matrices (load() without arguments returns their names)
available = substitution_matrices.load()
print(available)
```

## Multiple Sequence Alignments with Bio.AlignIO

### Reading Alignments

Bio.AlignIO provides an API similar to Bio.SeqIO, but for alignment files:

```python
from Bio import AlignIO

# Read a single alignment (the file must contain exactly one)
alignment = AlignIO.read("alignment.aln", "clustal")

# Parse multiple alignments from a file
for alignment in AlignIO.parse("alignments.aln", "clustal"):
    print(f"Alignment with {len(alignment)} sequences")
    print(f"Alignment length: {alignment.get_alignment_length()}")
```

### Supported Alignment Formats

Common formats include:

- **clustal** - Clustal format
- **phylip** - PHYLIP format
- **phylip-relaxed** - Relaxed PHYLIP (longer names)
- **stockholm** - Stockholm format
- **fasta** - FASTA format (aligned)
- **nexus** - NEXUS format
- **emboss** - EMBOSS alignment format
- **msf** - MSF format
- **maf** - Multiple Alignment Format

### Writing Alignments

```python
# Write alignment to file
AlignIO.write(alignment, "output.aln", "clustal")

# Convert between formats (returns the number of alignments written)
count = AlignIO.convert("input.aln", "clustal", "output.phy", "phylip")
```

### Working with Alignment Objects

```python
from Bio import AlignIO

alignment = AlignIO.read("alignment.aln", "clustal")

# Get alignment properties
print(f"Number of sequences: {len(alignment)}")
print(f"Alignment length: {alignment.get_alignment_length()}")

# Access individual sequences
for record in alignment:
    print(f"{record.id}: {record.seq}")

# Get alignment column (returned as a string)
column = alignment[:, 0]  # First column

# Get alignment slice
sub_alignment = alignment[:, 10:20]  # Columns 10-19 (0-based, end exclusive)

# Get specific sequence
seq_record = alignment[0]  # First sequence
```

### Alignment Analysis

```python
# Calculate alignment statistics
# Note: parts of AlignInfo are deprecated in recent Biopython releases
from Bio.Align import AlignInfo

summary = AlignInfo.SummaryInfo(alignment)

# Get consensus sequence
consensus = summary.gap_consensus(threshold=0.7)

# Position-specific scoring matrix (PSSM)
pssm = summary.pos_specific_score_matrix(consensus)

# Calculate per-column information content
# (assumes ungapped DNA; pass alphabet=... to motifs.create otherwise)
from Bio import motifs

motif = motifs.create([record.seq for record in alignment])
# Relative entropy per column; assumes a Biopython version that
# provides the Motif.relative_entropy property
information = motif.relative_entropy
```

## Creating Alignments Programmatically

### From SeqRecord Objects

```python
from Bio.Align import MultipleSeqAlignment
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

# Create records (all sequences must have the same length)
records = [
    SeqRecord(Seq("ACTGCTAGCTAG"), id="seq1"),
    SeqRecord(Seq("ACT-CTAGCTAG"), id="seq2"),
    SeqRecord(Seq("ACTGCTA-CTAG"), id="seq3"),
]

# Create alignment
alignment = MultipleSeqAlignment(records)
```

### Adding Sequences to Alignments

```python
# Start with empty alignment
alignment = MultipleSeqAlignment([])

# Add sequences (must have same length)
alignment.append(SeqRecord(Seq("ACTG"), id="seq1"))
alignment.append(SeqRecord(Seq("ACTG"), id="seq2"))

# Extend with the rows of another alignment of the same length
other_alignment = MultipleSeqAlignment([SeqRecord(Seq("ACTA"), id="seq3")])
alignment.extend(other_alignment)
```

## Advanced Alignment Operations

### Removing Gaps

```python
# Remove all gap-only columns
no_gaps = None
for i in range(alignment.get_alignment_length()):
    if set(alignment[:, i]) != {'-'}:  # Not all gaps
        # Keep the column as a one-column sub-alignment
        column = alignment[:, i:i + 1]
        # Adding alignments with equal row counts joins them column-wise
        no_gaps = column if no_gaps is None else no_gaps + column
```

### Alignment Sorting

```python
# Sort by sequence ID
sorted_alignment = sorted(alignment, key=lambda x: x.id)
alignment = MultipleSeqAlignment(sorted_alignment)
```

### Computing Pairwise Identities

```python
def pairwise_identity(seq1, seq2):
    """Calculate fractional identity over columns where neither sequence has a gap."""
    matches = sum(a == b for a, b in zip(seq1, seq2) if a != '-' and b != '-')
    length = sum(1 for a, b in zip(seq1, seq2) if a != '-' and b != '-')
    return matches / length if length > 0 else 0

# Calculate all pairwise identities
for i, record1 in enumerate(alignment):
    for record2 in alignment[i+1:]:
        identity = pairwise_identity(record1.seq, record2.seq)
        print(f"{record1.id} vs {record2.id}: {identity:.2%}")
```

## Running External Alignment Tools

The `Bio.Align.Applications` command-line wrappers used below are deprecated in recent Biopython releases and may be absent from the newest ones; calling the tool directly through `subprocess` is the suggested replacement.

### Clustal Omega (via Command Line)

```python
from Bio import AlignIO
from Bio.Align.Applications import ClustalOmegaCommandline

# Setup command
clustal_cmd = ClustalOmegaCommandline(
    infile="sequences.fasta",
    outfile="alignment.aln",
    outfmt="clustal",  # Clustal Omega writes FASTA unless told otherwise
    verbose=True,
    auto=True
)

# Run alignment
stdout, stderr = clustal_cmd()

# Read result
alignment = AlignIO.read("alignment.aln", "clustal")
```

### MUSCLE (via Command Line)

```python
from Bio.Align.Applications import MuscleCommandline

# Options follow the MUSCLE v3 command line (-in / -out)
muscle_cmd = MuscleCommandline(
    input="sequences.fasta",
    out="alignment.aln"
)
stdout, stderr = muscle_cmd()
```

## Best Practices

1. **Choose appropriate scoring schemes** - Use BLOSUM62 for proteins, custom scores for DNA
2. **Consider alignment mode** - Global for sequences expected to align end to end, local for finding conserved regions
3. **Set gap penalties carefully** - A high gap-opening penalty relative to the extension penalty yields fewer, longer gaps
4. **Use appropriate formats** - FASTA for simple alignments, Stockholm for rich annotation
5. **Validate alignment quality** - Check for conserved regions and percent identity
6. **Handle large alignments carefully** - Use slicing and iteration for memory efficiency
7. **Preserve metadata** - Maintain SeqRecord IDs and annotations through alignment operations

## Common Use Cases

### Find Best Local Alignment

```python
from Bio.Align import PairwiseAligner
from Bio.Seq import Seq

aligner = PairwiseAligner()
aligner.mode = 'local'
aligner.match_score = 2
aligner.mismatch_score = -1

seq1 = Seq("AGCTTAGCTAGCTAGC")
seq2 = Seq("CTAGCTAGC")

alignments = aligner.align(seq1, seq2)
print(alignments[0])
```

### Protein Sequence Alignment

```python
from Bio.Align import PairwiseAligner, substitution_matrices
from Bio.Seq import Seq

aligner = PairwiseAligner()
aligner.substitution_matrix = substitution_matrices.load("BLOSUM62")
aligner.open_gap_score = -10
aligner.extend_gap_score = -0.5

protein1 = Seq("KEVLA")
protein2 = Seq("KEVLAEQP")
alignments = aligner.align(protein1, protein2)
```

### Extract Conserved Regions

```python
from Bio import AlignIO

alignment = AlignIO.read("alignment.aln", "clustal")

# Find columns where the most common character (gaps included) exceeds 80%
conserved_positions = []
for i in range(alignment.get_alignment_length()):
    column = alignment[:, i]
    most_common = max(set(column), key=column.count)
    if column.count(most_common) / len(column) > 0.8:
        conserved_positions.append(i)

print(f"Conserved positions: {conserved_positions}")
```
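
The conserved-column check above can be tried without an alignment file. The following is a minimal sketch, not part of Biopython: `conserved_columns` is a hypothetical helper that works on plain strings standing in for aligned rows. It also makes one assumption that differs from the loop above: a gap character is never accepted as the conserved residue, although gaps still count in the denominator.

```python
from collections import Counter


def conserved_columns(rows, threshold=0.8, gap="-"):
    """Return 0-based indices of columns whose most common non-gap
    residue makes up more than `threshold` of all rows.

    Hypothetical helper for illustration; `rows` are equal-length strings.
    """
    if not rows:
        return []
    length = len(rows[0])
    if any(len(row) != length for row in rows):
        raise ValueError("all rows must have the same length")

    conserved = []
    for i in range(length):
        column = [row[i] for row in rows]
        # Count residues only; gaps cannot be the conserved character
        counts = Counter(c for c in column if c != gap)
        if not counts:
            continue  # gap-only column
        _, top_count = counts.most_common(1)[0]
        # Gaps still count in the denominator (all rows)
        if top_count / len(column) > threshold:
            conserved.append(i)
    return conserved


# Same three sequences as in "From SeqRecord Objects"
rows = [
    "ACTGCTAGCTAG",
    "ACT-CTAGCTAG",
    "ACTGCTA-CTAG",
]
positions = conserved_columns(rows)
print(positions)  # [0, 1, 2, 4, 5, 6, 8, 9, 10, 11]
```

With a `MultipleSeqAlignment`, the rows could be passed as `[str(record.seq) for record in alignment]`. Columns 3 and 7 drop out here because one of the three rows has a gap there, leaving the residue at 2/3, which is below the 0.8 threshold.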