Files
gh-k-dense-ai-claude-scient…/skills/pysam/references/alignment_files.md
2025-11-30 08:30:10 +08:00

8.4 KiB

Working with Alignment Files (SAM/BAM/CRAM)

Overview

Pysam provides the AlignmentFile class for reading and writing SAM/BAM/CRAM formatted files containing aligned sequence data. BAM/CRAM files support compression and random access through indexing.

Opening Alignment Files

Specify format via mode qualifier:

  • "rb" - Read BAM (binary)
  • "r" - Read SAM (text)
  • "rc" - Read CRAM (compressed)
  • "wb" - Write BAM
  • "w" - Write SAM
  • "wc" - Write CRAM
import pysam

# Reading
samfile = pysam.AlignmentFile("example.bam", "rb")

# Writing (requires template or header)
outfile = pysam.AlignmentFile("output.bam", "wb", template=samfile)

Stream Processing

Use "-" as filename for stdin/stdout operations:

# Read from stdin
infile = pysam.AlignmentFile('-', 'rb')

# Write to stdout
outfile = pysam.AlignmentFile('-', 'w', template=infile)

Important: Pysam does not support reading/writing from true Python file objects—only stdin/stdout streams are supported.

AlignmentFile Properties

Header Information:

  • references - List of chromosome/contig names
  • lengths - Corresponding lengths for each reference
  • header - Complete header as dictionary
samfile = pysam.AlignmentFile("example.bam", "rb")
print(f"References: {samfile.references}")
print(f"Lengths: {samfile.lengths}")

Reading Reads

fetch() - Region-Based Retrieval

Retrieves reads overlapping specified genomic regions using 0-based coordinates.

# Fetch specific region
for read in samfile.fetch("chr1", 1000, 2000):
    print(read.query_name, read.reference_start)

# Fetch entire contig
for read in samfile.fetch("chr1"):
    print(read.query_name)

# Fetch without index (sequential read)
for read in samfile.fetch(until_eof=True):
    print(read.query_name)

Important Notes:

  • Requires index (.bai/.crai) for random access
  • Returns reads that overlap the region (may extend beyond boundaries)
  • Use until_eof=True for non-indexed files or sequential reading
  • By default, only returns mapped reads
  • For unmapped reads, use fetch("*") or until_eof=True

Multiple Iterators

When using multiple iterators on the same file:

samfile = pysam.AlignmentFile("example.bam", "rb", multiple_iterators=True)
iter1 = samfile.fetch("chr1", 1000, 2000)
iter2 = samfile.fetch("chr2", 5000, 6000)

Without multiple_iterators=True, a new fetch() call repositions the file pointer and breaks existing iterators.

count() - Count Reads in Region

# Count all reads
num_reads = samfile.count("chr1", 1000, 2000)

# Count with quality filter
num_quality_reads = samfile.count("chr1", 1000, 2000, quality=20)

count_coverage() - Per-Base Coverage

Returns four arrays (A, C, G, T) with per-base coverage:

coverage = samfile.count_coverage("chr1", 1000, 2000)
a_counts, c_counts, g_counts, t_counts = coverage

AlignedSegment Objects

Each read is represented as an AlignedSegment object with these key attributes:

Read Information

  • query_name - Read name/ID
  • query_sequence - Read sequence (bases)
  • query_qualities - Base quality scores (ASCII-encoded)
  • query_length - Length of the read

Mapping Information

  • reference_name - Chromosome/contig name
  • reference_start - Start position (0-based, inclusive)
  • reference_end - End position (0-based, exclusive)
  • mapping_quality - MAPQ score
  • cigarstring - CIGAR string (e.g., "100M")
  • cigartuples - CIGAR as list of (operation, length) tuples

Important: cigartuples format differs from SAM specification. Operations are integers:

  • 0 = M (match/mismatch)
  • 1 = I (insertion)
  • 2 = D (deletion)
  • 3 = N (skipped reference)
  • 4 = S (soft clipping)
  • 5 = H (hard clipping)
  • 6 = P (padding)
  • 7 = = (sequence match)
  • 8 = X (sequence mismatch)

Flags and Status

  • flag - SAM flag as integer
  • is_paired - Is read paired?
  • is_proper_pair - Is read in a proper pair?
  • is_unmapped - Is read unmapped?
  • mate_is_unmapped - Is mate unmapped?
  • is_reverse - Is read on reverse strand?
  • mate_is_reverse - Is mate on reverse strand?
  • is_read1 - Is this read1?
  • is_read2 - Is this read2?
  • is_secondary - Is secondary alignment?
  • is_qcfail - Did read fail QC?
  • is_duplicate - Is read a duplicate?
  • is_supplementary - Is supplementary alignment?

Tags and Optional Fields

  • get_tag(tag) - Get value of optional field
  • set_tag(tag, value) - Set optional field
  • has_tag(tag) - Check if tag exists
  • get_tags() - Get all tags as list of tuples
for read in samfile.fetch("chr1", 1000, 2000):
    if read.has_tag("NM"):
        edit_distance = read.get_tag("NM")
        print(f"{read.query_name}: NM={edit_distance}")

Writing Alignment Files

Creating Header

header = {
    'HD': {'VN': '1.0'},
    'SQ': [
        {'LN': 1575, 'SN': 'chr1'},
        {'LN': 1584, 'SN': 'chr2'}
    ]
}

outfile = pysam.AlignmentFile("output.bam", "wb", header=header)

Creating AlignedSegment Objects

# Create new read
a = pysam.AlignedSegment()
a.query_name = "read001"
a.query_sequence = "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
a.flag = 0
a.reference_id = 0  # Index into header['SQ']
a.reference_start = 100
a.mapping_quality = 20
a.cigar = [(0, 35)]  # 35M
a.query_qualities = pysam.qualitystring_to_array("IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII")

# Write to file
outfile.write(a)

Converting Between Formats

# BAM to SAM
infile = pysam.AlignmentFile("input.bam", "rb")
outfile = pysam.AlignmentFile("output.sam", "w", template=infile)
for read in infile:
    outfile.write(read)
infile.close()
outfile.close()

Pileup Analysis

The pileup() method provides column-wise (position-by-position) analysis across a region:

for pileupcolumn in samfile.pileup("chr1", 1000, 2000):
    print(f"Position {pileupcolumn.pos}: coverage = {pileupcolumn.nsegments}")

    for pileupread in pileupcolumn.pileups:
        if not pileupread.is_del and not pileupread.is_refskip:
            # Query position is the position in the read
            base = pileupread.alignment.query_sequence[pileupread.query_position]
            print(f"  {pileupread.alignment.query_name}: {base}")

Key attributes:

  • pileupcolumn.pos - 0-based reference position
  • pileupcolumn.nsegments - Number of reads covering position
  • pileupread.alignment - The AlignedSegment object
  • pileupread.query_position - Position in the read (None for deletions)
  • pileupread.is_del - Is this a deletion?
  • pileupread.is_refskip - Is this a reference skip (N in CIGAR)?

Important: Keep iterator references alive. The error "PileupProxy accessed after iterator finished" occurs when iterators go out of scope prematurely.

Coordinate System

Critical: Pysam uses 0-based, half-open coordinates (Python convention):

  • reference_start is 0-based (first base is 0)
  • reference_end is exclusive (not included in range)
  • Region from 1000-2000 includes bases 1000-1999

Exception: Region strings in fetch() and pileup() follow samtools conventions (1-based):

# These are equivalent:
samfile.fetch("chr1", 999, 2000)  # Python style: 0-based
samfile.fetch("chr1:1000-2000")   # samtools style: 1-based

Indexing

Create BAM index:

pysam.index("example.bam")

Or use command-line interface:

pysam.samtools.index("example.bam")

Performance Tips

  1. Use indexed access when querying specific regions repeatedly
  2. Use pileup() for column-wise analysis instead of repeated fetch operations
  3. Use fetch(until_eof=True) for sequential reading of non-indexed files
  4. Avoid multiple iterators unless necessary (performance cost)
  5. Use count() for simple counting instead of iterating and counting manually

Common Pitfalls

  1. Partial overlaps: fetch() returns reads that overlap region boundaries—implement explicit filtering if exact boundaries are needed
  2. Quality score editing: Cannot edit query_qualities in place after modifying query_sequence. Create a copy first: quals = read.query_qualities
  3. Missing index: fetch() without until_eof=True requires an index file
  4. Thread safety: While pysam releases GIL during I/O, comprehensive thread-safety hasn't been fully validated
  5. Iterator scope: Keep pileup iterator references alive to avoid "PileupProxy accessed after iterator finished" errors