281 lines
8.4 KiB
Markdown
281 lines
8.4 KiB
Markdown
# Working with Alignment Files (SAM/BAM/CRAM)
|
|
|
|
## Overview
|
|
|
|
Pysam provides the `AlignmentFile` class for reading and writing SAM/BAM/CRAM formatted files containing aligned sequence data. BAM/CRAM files support compression and random access through indexing.
|
|
|
|
## Opening Alignment Files
|
|
|
|
Specify format via mode qualifier:
|
|
- `"rb"` - Read BAM (binary)
|
|
- `"r"` - Read SAM (text)
|
|
- `"rc"` - Read CRAM (compressed)
|
|
- `"wb"` - Write BAM
|
|
- `"w"` - Write SAM
|
|
- `"wc"` - Write CRAM
|
|
|
|
```python
|
|
import pysam
|
|
|
|
# Reading
|
|
samfile = pysam.AlignmentFile("example.bam", "rb")
|
|
|
|
# Writing (requires template or header)
|
|
outfile = pysam.AlignmentFile("output.bam", "wb", template=samfile)
|
|
```
|
|
|
|
### Stream Processing
|
|
|
|
Use `"-"` as filename for stdin/stdout operations:
|
|
|
|
```python
|
|
# Read from stdin
|
|
infile = pysam.AlignmentFile('-', 'rb')
|
|
|
|
# Write to stdout
|
|
outfile = pysam.AlignmentFile('-', 'w', template=infile)
|
|
```
|
|
|
|
**Important:** Pysam does not support reading/writing from true Python file objects—only stdin/stdout streams are supported.
|
|
|
|
## AlignmentFile Properties
|
|
|
|
**Header Information:**
|
|
- `references` - List of chromosome/contig names
|
|
- `lengths` - Corresponding lengths for each reference
|
|
- `header` - Complete header as dictionary
|
|
|
|
```python
|
|
samfile = pysam.AlignmentFile("example.bam", "rb")
|
|
print(f"References: {samfile.references}")
|
|
print(f"Lengths: {samfile.lengths}")
|
|
```
|
|
|
|
## Reading Reads
|
|
|
|
### fetch() - Region-Based Retrieval
|
|
|
|
Retrieves reads overlapping specified genomic regions using **0-based coordinates**.
|
|
|
|
```python
|
|
# Fetch specific region
|
|
for read in samfile.fetch("chr1", 1000, 2000):
|
|
print(read.query_name, read.reference_start)
|
|
|
|
# Fetch entire contig
|
|
for read in samfile.fetch("chr1"):
|
|
print(read.query_name)
|
|
|
|
# Fetch without index (sequential read)
|
|
for read in samfile.fetch(until_eof=True):
|
|
print(read.query_name)
|
|
```
|
|
|
|
**Important Notes:**
|
|
- Requires index (.bai/.crai) for random access
|
|
- Returns reads that **overlap** the region (may extend beyond boundaries)
|
|
- Use `until_eof=True` for non-indexed files or sequential reading
|
|
- By default, only returns mapped reads
|
|
- For unmapped reads, use `fetch("*")` or `until_eof=True`
|
|
|
|
### Multiple Iterators
|
|
|
|
When using multiple iterators on the same file:
|
|
|
|
```python
|
|
samfile = pysam.AlignmentFile("example.bam", "rb", multiple_iterators=True)
|
|
iter1 = samfile.fetch("chr1", 1000, 2000)
|
|
iter2 = samfile.fetch("chr2", 5000, 6000)
|
|
```
|
|
|
|
Without `multiple_iterators=True`, a new fetch() call repositions the file pointer and breaks existing iterators.
|
|
|
|
### count() - Count Reads in Region
|
|
|
|
```python
|
|
# Count all reads
|
|
num_reads = samfile.count("chr1", 1000, 2000)
|
|
|
|
# Count with quality filter
|
|
num_quality_reads = samfile.count("chr1", 1000, 2000, quality=20)
|
|
```
|
|
|
|
### count_coverage() - Per-Base Coverage
|
|
|
|
Returns four arrays (A, C, G, T) with per-base coverage:
|
|
|
|
```python
|
|
coverage = samfile.count_coverage("chr1", 1000, 2000)
|
|
a_counts, c_counts, g_counts, t_counts = coverage
|
|
```
|
|
|
|
## AlignedSegment Objects
|
|
|
|
Each read is represented as an `AlignedSegment` object with these key attributes:
|
|
|
|
### Read Information
|
|
- `query_name` - Read name/ID
|
|
- `query_sequence` - Read sequence (bases)
|
|
- `query_qualities` - Base quality scores (ASCII-encoded)
|
|
- `query_length` - Length of the read
|
|
|
|
### Mapping Information
|
|
- `reference_name` - Chromosome/contig name
|
|
- `reference_start` - Start position (0-based, inclusive)
|
|
- `reference_end` - End position (0-based, exclusive)
|
|
- `mapping_quality` - MAPQ score
|
|
- `cigarstring` - CIGAR string (e.g., "100M")
|
|
- `cigartuples` - CIGAR as list of (operation, length) tuples
|
|
|
|
**Important:** `cigartuples` format differs from SAM specification. Operations are integers:
|
|
- 0 = M (match/mismatch)
|
|
- 1 = I (insertion)
|
|
- 2 = D (deletion)
|
|
- 3 = N (skipped reference)
|
|
- 4 = S (soft clipping)
|
|
- 5 = H (hard clipping)
|
|
- 6 = P (padding)
|
|
- 7 = = (sequence match)
|
|
- 8 = X (sequence mismatch)
|
|
|
|
### Flags and Status
|
|
- `flag` - SAM flag as integer
|
|
- `is_paired` - Is read paired?
|
|
- `is_proper_pair` - Is read in a proper pair?
|
|
- `is_unmapped` - Is read unmapped?
|
|
- `mate_is_unmapped` - Is mate unmapped?
|
|
- `is_reverse` - Is read on reverse strand?
|
|
- `mate_is_reverse` - Is mate on reverse strand?
|
|
- `is_read1` - Is this read1?
|
|
- `is_read2` - Is this read2?
|
|
- `is_secondary` - Is secondary alignment?
|
|
- `is_qcfail` - Did read fail QC?
|
|
- `is_duplicate` - Is read a duplicate?
|
|
- `is_supplementary` - Is supplementary alignment?
|
|
|
|
### Tags and Optional Fields
|
|
- `get_tag(tag)` - Get value of optional field
|
|
- `set_tag(tag, value)` - Set optional field
|
|
- `has_tag(tag)` - Check if tag exists
|
|
- `get_tags()` - Get all tags as list of tuples
|
|
|
|
```python
|
|
for read in samfile.fetch("chr1", 1000, 2000):
|
|
if read.has_tag("NM"):
|
|
edit_distance = read.get_tag("NM")
|
|
print(f"{read.query_name}: NM={edit_distance}")
|
|
```
|
|
|
|
## Writing Alignment Files
|
|
|
|
### Creating Header
|
|
|
|
```python
|
|
header = {
|
|
'HD': {'VN': '1.0'},
|
|
'SQ': [
|
|
{'LN': 1575, 'SN': 'chr1'},
|
|
{'LN': 1584, 'SN': 'chr2'}
|
|
]
|
|
}
|
|
|
|
outfile = pysam.AlignmentFile("output.bam", "wb", header=header)
|
|
```
|
|
|
|
### Creating AlignedSegment Objects
|
|
|
|
```python
|
|
# Create new read
|
|
a = pysam.AlignedSegment()
|
|
a.query_name = "read001"
|
|
a.query_sequence = "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
|
|
a.flag = 0
|
|
a.reference_id = 0 # Index into header['SQ']
|
|
a.reference_start = 100
|
|
a.mapping_quality = 20
|
|
a.cigar = [(0, 35)] # 35M
|
|
a.query_qualities = pysam.qualitystring_to_array("IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII")
|
|
|
|
# Write to file
|
|
outfile.write(a)
|
|
```
|
|
|
|
### Converting Between Formats
|
|
|
|
```python
|
|
# BAM to SAM
|
|
infile = pysam.AlignmentFile("input.bam", "rb")
|
|
outfile = pysam.AlignmentFile("output.sam", "w", template=infile)
|
|
for read in infile:
|
|
outfile.write(read)
|
|
infile.close()
|
|
outfile.close()
|
|
```
|
|
|
|
## Pileup Analysis
|
|
|
|
The `pileup()` method provides **column-wise** (position-by-position) analysis across a region:
|
|
|
|
```python
|
|
for pileupcolumn in samfile.pileup("chr1", 1000, 2000):
|
|
print(f"Position {pileupcolumn.pos}: coverage = {pileupcolumn.nsegments}")
|
|
|
|
for pileupread in pileupcolumn.pileups:
|
|
if not pileupread.is_del and not pileupread.is_refskip:
|
|
# Query position is the position in the read
|
|
base = pileupread.alignment.query_sequence[pileupread.query_position]
|
|
print(f" {pileupread.alignment.query_name}: {base}")
|
|
```
|
|
|
|
**Key attributes:**
|
|
- `pileupcolumn.pos` - 0-based reference position
|
|
- `pileupcolumn.nsegments` - Number of reads covering position
|
|
- `pileupread.alignment` - The AlignedSegment object
|
|
- `pileupread.query_position` - Position in the read (None for deletions)
|
|
- `pileupread.is_del` - Is this a deletion?
|
|
- `pileupread.is_refskip` - Is this a reference skip (N in CIGAR)?
|
|
|
|
**Important:** Keep iterator references alive. The error "PileupProxy accessed after iterator finished" occurs when iterators go out of scope prematurely.
|
|
|
|
## Coordinate System
|
|
|
|
**Critical:** Pysam uses **0-based, half-open** coordinates (Python convention):
|
|
- `reference_start` is 0-based (first base is 0)
|
|
- `reference_end` is exclusive (not included in range)
|
|
- Region from 1000-2000 includes bases 1000-1999
|
|
|
|
**Exception:** Region strings in `fetch()` and `pileup()` follow samtools conventions (1-based):
|
|
```python
|
|
# These are equivalent:
|
|
samfile.fetch("chr1", 999, 2000) # Python style: 0-based
|
|
samfile.fetch("chr1:1000-2000") # samtools style: 1-based
|
|
```
|
|
|
|
## Indexing
|
|
|
|
Create BAM index:
|
|
```python
|
|
pysam.index("example.bam")
|
|
```
|
|
|
|
Or use command-line interface:
|
|
```python
|
|
pysam.samtools.index("example.bam")
|
|
```
|
|
|
|
## Performance Tips
|
|
|
|
1. **Use indexed access** when querying specific regions repeatedly
|
|
2. **Use `pileup()` for column-wise analysis** instead of repeated fetch operations
|
|
3. **Use `fetch(until_eof=True)` for sequential reading** of non-indexed files
|
|
4. **Avoid multiple iterators** unless necessary (performance cost)
|
|
5. **Use `count()` for simple counting** instead of iterating and counting manually
|
|
|
|
## Common Pitfalls
|
|
|
|
1. **Partial overlaps:** `fetch()` returns reads that overlap region boundaries—implement explicit filtering if exact boundaries are needed
|
|
2. **Quality score editing:** Cannot edit `query_qualities` in place after modifying `query_sequence`. Create a copy first: `quals = read.query_qualities`
|
|
3. **Missing index:** `fetch()` without `until_eof=True` requires an index file
|
|
4. **Thread safety:** While pysam releases GIL during I/O, comprehensive thread-safety hasn't been fully validated
|
|
5. **Iterator scope:** Keep pileup iterator references alive to avoid "PileupProxy accessed after iterator finished" errors
|