Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

skills/pysam/SKILL.md Normal file

@@ -0,0 +1,259 @@
---
name: pysam
description: "Genomic file toolkit. Read/write SAM/BAM/CRAM alignments, VCF/BCF variants, FASTA/FASTQ sequences, extract regions, calculate coverage, for NGS data processing pipelines."
---
# Pysam
## Overview
Pysam is a Python module for reading, manipulating, and writing genomic datasets. Read/write SAM/BAM/CRAM alignment files, VCF/BCF variant files, and FASTA/FASTQ sequences with a Pythonic interface to htslib. Query tabix-indexed files, perform pileup analysis for coverage, and execute samtools/bcftools commands.
## When to Use This Skill
This skill should be used when:
- Working with sequencing alignment files (BAM/CRAM)
- Analyzing genetic variants (VCF/BCF)
- Extracting reference sequences or gene regions
- Processing raw sequencing data (FASTQ)
- Calculating coverage or read depth
- Implementing bioinformatics analysis pipelines
- Quality control of sequencing data
- Variant calling and annotation workflows
## Quick Start
### Installation
```bash
uv pip install pysam
```
### Basic Examples
**Read alignment file:**
```python
import pysam
# Open BAM file and fetch reads in region
samfile = pysam.AlignmentFile("example.bam", "rb")
for read in samfile.fetch("chr1", 1000, 2000):
print(f"{read.query_name}: {read.reference_start}")
samfile.close()
```
**Read variant file:**
```python
# Open VCF file and iterate variants
vcf = pysam.VariantFile("variants.vcf")
for variant in vcf:
print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}")
vcf.close()
```
**Query reference sequence:**
```python
# Open FASTA and extract sequence
fasta = pysam.FastaFile("reference.fasta")
sequence = fasta.fetch("chr1", 1000, 2000)
print(sequence)
fasta.close()
```
## Core Capabilities
### 1. Alignment File Operations (SAM/BAM/CRAM)
Use the `AlignmentFile` class to work with aligned sequencing reads. This is appropriate for analyzing mapping results, calculating coverage, extracting reads, or quality control.
**Common operations:**
- Open and read BAM/SAM/CRAM files
- Fetch reads from specific genomic regions
- Filter reads by mapping quality, flags, or other criteria
- Write filtered or modified alignments
- Calculate coverage statistics
- Perform pileup analysis (base-by-base coverage)
- Access read sequences, quality scores, and alignment information
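A minimal sketch of a few of these operations (assuming an indexed `example.bam`; file and region names are placeholders):
```python
import pysam

# Keep primary, non-duplicate reads with MAPQ >= 30 from a region
with pysam.AlignmentFile("example.bam", "rb") as infile, \
     pysam.AlignmentFile("filtered.bam", "wb", template=infile) as outfile:
    for read in infile.fetch("chr1", 1000, 2000):
        if (read.mapping_quality >= 30
                and not read.is_duplicate
                and not read.is_secondary
                and not read.is_supplementary):
            outfile.write(read)
```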
**Reference:** See `references/alignment_files.md` for detailed documentation on:
- Opening and reading alignment files
- AlignedSegment attributes and methods
- Region-based fetching with `fetch()`
- Pileup analysis for coverage
- Writing and creating BAM files
- Coordinate systems and indexing
- Performance optimization tips
### 2. Variant File Operations (VCF/BCF)
Use the `VariantFile` class to work with genetic variants from variant calling pipelines. This is appropriate for variant analysis, filtering, annotation, or population genetics.
**Common operations:**
- Read and write VCF/BCF files
- Query variants in specific regions
- Access variant information (position, alleles, quality)
- Extract genotype data for samples
- Filter variants by quality, allele frequency, or other criteria
- Annotate variants with additional information
- Subset samples or regions
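A hedged sketch of basic variant filtering (assuming `variants.vcf` carries a `DP` INFO field and a sample named `sample1`):
```python
import pysam

vcf = pysam.VariantFile("variants.vcf")
for variant in vcf:
    # Keep variants with QUAL >= 30 and, when annotated, depth >= 20
    if variant.qual is not None and variant.qual >= 30:
        if "DP" not in variant.info or variant.info["DP"] >= 20:
            gt = variant.samples["sample1"]["GT"]
            print(variant.chrom, variant.pos, variant.ref, variant.alts, gt)
vcf.close()
```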
**Reference:** See `references/variant_files.md` for detailed documentation on:
- Opening and reading variant files
- VariantRecord attributes and methods
- Accessing INFO and FORMAT fields
- Working with genotypes and samples
- Creating and writing VCF files
- Filtering and subsetting variants
- Multi-sample VCF operations
### 3. Sequence File Operations (FASTA/FASTQ)
Use `FastaFile` for random access to reference sequences and `FastxFile` for reading raw sequencing data. This is appropriate for extracting gene sequences, validating variants against reference, or processing raw reads.
**Common operations:**
- Query reference sequences by genomic coordinates
- Extract sequences for genes or regions of interest
- Read FASTQ files with quality scores
- Validate variant reference alleles
- Calculate sequence statistics
- Filter reads by quality or length
- Convert between FASTA and FASTQ formats
**Reference:** See `references/sequence_files.md` for detailed documentation on:
- FASTA file access and indexing
- Extracting sequences by region
- Handling reverse complement for genes
- Reading FASTQ files sequentially
- Quality score conversion and filtering
- Working with tabix-indexed files (BED, GTF, GFF)
- Common sequence processing patterns
### 4. Integrated Bioinformatics Workflows
Pysam excels at integrating multiple file types for comprehensive genomic analyses. Common workflows combine alignment files, variant files, and reference sequences.
**Common workflows:**
- Calculate coverage statistics for specific regions
- Validate variants against aligned reads
- Annotate variants with coverage information
- Extract sequences around variant positions
- Filter alignments or variants based on multiple criteria
- Generate coverage tracks for visualization
- Quality control across multiple data types
**Reference:** See `references/common_workflows.md` for detailed examples of:
- Quality control workflows (BAM statistics, reference consistency)
- Coverage analysis (per-base coverage, low coverage detection)
- Variant analysis (annotation, filtering by read support)
- Sequence extraction (variant contexts, gene sequences)
- Read filtering and subsetting
- Integration patterns (BAM+VCF, VCF+BED, etc.)
- Performance optimization for complex workflows
## Key Concepts
### Coordinate Systems
**Critical:** Pysam uses **0-based, half-open** coordinates (Python convention):
- Start positions are 0-based (first base is position 0)
- End positions are exclusive (not included in the range)
- Region 1000-2000 includes bases 1000-1999 (1000 bases total)
**Exception:** Region strings in `fetch()` follow samtools convention (1-based):
```python
samfile.fetch("chr1", 999, 2000) # 0-based: positions 999-1999
samfile.fetch("chr1:1000-2000") # 1-based string: positions 1000-2000
```
**VCF files:** Use 1-based coordinates in the file format, but `VariantRecord.start` is 0-based.
### Indexing Requirements
Random access to specific genomic regions requires index files:
- **BAM files**: Require `.bai` index (create with `pysam.index()`)
- **CRAM files**: Require `.crai` index
- **FASTA files**: Require `.fai` index (create with `pysam.faidx()`)
- **VCF.gz files**: Require `.tbi` tabix index (create with `pysam.tabix_index()`)
- **BCF files**: Require `.csi` index
Without an index, use `fetch(until_eof=True)` for sequential reading.
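All of these indexes can be created from Python; a short sketch (file names are placeholders):
```python
import pysam

pysam.index("example.bam")        # creates example.bam.bai
pysam.faidx("reference.fasta")    # creates reference.fasta.fai
# bgzip-compresses and indexes; creates variants.vcf.gz and .tbi
pysam.tabix_index("variants.vcf", preset="vcf", force=True)
```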
### File Modes
Specify format when opening files:
- `"rb"` - Read BAM (binary)
- `"r"` - Read SAM (text)
- `"rc"` - Read CRAM
- `"wb"` - Write BAM
- `"w"` - Write SAM
- `"wc"` - Write CRAM
### Performance Considerations
1. **Always use indexed files** for random access operations
2. **Use `pileup()` for column-wise analysis** instead of repeated fetch operations
3. **Use `count()` for counting** instead of iterating and counting manually
4. **Process regions in parallel** when analyzing independent genomic regions (see the sketch after this list)
5. **Close files explicitly** to free resources
6. **Use `until_eof=True`** for sequential processing without index
7. **Avoid multiple iterators** unless necessary (use `multiple_iterators=True` if needed)
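A minimal sketch of item 4, parallelizing per-chromosome read counts with the standard library (`example.bam` is a placeholder; each worker opens its own file handle, since handles are not safely shared across processes):
```python
import pysam
from multiprocessing import Pool

def count_chrom(args):
    bam_path, chrom = args
    # Each worker opens its own AlignmentFile
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        return chrom, bam.count(chrom)

if __name__ == "__main__":
    with pysam.AlignmentFile("example.bam", "rb") as bam:
        chroms = list(bam.references)
    with Pool(processes=4) as pool:
        for chrom, n in pool.map(count_chrom, [("example.bam", c) for c in chroms]):
            print(chrom, n)
```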
## Common Pitfalls
1. **Coordinate confusion:** Remember 0-based vs 1-based systems in different contexts
2. **Missing indices:** Many operations require index files—create them first
3. **Partial overlaps:** `fetch()` returns reads overlapping region boundaries, not just those fully contained
4. **Iterator scope:** Keep pileup iterator references alive to avoid "PileupProxy accessed after iterator finished" errors
5. **Quality score editing:** Assigning a new `query_sequence` resets `query_qualities` to None—save the scores first, then reassign them after setting the sequence
6. **Stream limitations:** Only stdin/stdout are supported for streaming, not arbitrary Python file objects
7. **Thread safety:** While GIL is released during I/O, comprehensive thread-safety hasn't been fully validated
## Command-Line Tools
Pysam provides access to samtools and bcftools commands:
```python
# Sort BAM file
pysam.samtools.sort("-o", "sorted.bam", "input.bam")
# Index BAM
pysam.samtools.index("sorted.bam")
# View specific region
pysam.samtools.view("-b", "-o", "region.bam", "input.bam", "chr1:1000-2000")
# BCF tools
pysam.bcftools.view("-O", "z", "-o", "output.vcf.gz", "input.vcf")
```
**Error handling:**
```python
try:
pysam.samtools.sort("-o", "output.bam", "input.bam")
except pysam.SamtoolsError as e:
print(f"Error: {e}")
```
## Resources
### references/
Detailed documentation for each major capability:
- **alignment_files.md** - Complete guide to SAM/BAM/CRAM operations, including AlignmentFile class, AlignedSegment attributes, fetch operations, pileup analysis, and writing alignments
- **variant_files.md** - Complete guide to VCF/BCF operations, including VariantFile class, VariantRecord attributes, genotype handling, INFO/FORMAT fields, and multi-sample operations
- **sequence_files.md** - Complete guide to FASTA/FASTQ operations, including FastaFile and FastxFile classes, sequence extraction, quality score handling, and tabix-indexed file access
- **common_workflows.md** - Practical examples of integrated bioinformatics workflows combining multiple file types, including quality control, coverage analysis, variant validation, and sequence extraction
## Getting Help
For detailed information on specific operations, refer to the appropriate reference document:
- Working with BAM files or calculating coverage → `alignment_files.md`
- Analyzing variants or genotypes → `variant_files.md`
- Extracting sequences or processing FASTQ → `sequence_files.md`
- Complex workflows integrating multiple file types → `common_workflows.md`
Official documentation: https://pysam.readthedocs.io/

skills/pysam/references/alignment_files.md Normal file

@@ -0,0 +1,280 @@
# Working with Alignment Files (SAM/BAM/CRAM)
## Overview
Pysam provides the `AlignmentFile` class for reading and writing SAM/BAM/CRAM formatted files containing aligned sequence data. BAM/CRAM files support compression and random access through indexing.
## Opening Alignment Files
Specify format via mode qualifier:
- `"rb"` - Read BAM (binary)
- `"r"` - Read SAM (text)
- `"rc"` - Read CRAM (compressed)
- `"wb"` - Write BAM
- `"w"` - Write SAM
- `"wc"` - Write CRAM
```python
import pysam
# Reading
samfile = pysam.AlignmentFile("example.bam", "rb")
# Writing (requires template or header)
outfile = pysam.AlignmentFile("output.bam", "wb", template=samfile)
```
### Stream Processing
Use `"-"` as filename for stdin/stdout operations:
```python
# Read from stdin
infile = pysam.AlignmentFile('-', 'rb')
# Write to stdout
outfile = pysam.AlignmentFile('-', 'w', template=infile)
```
**Important:** Pysam does not support reading from or writing to arbitrary Python file objects—only stdin/stdout streams are supported.
## AlignmentFile Properties
**Header Information:**
- `references` - List of chromosome/contig names
- `lengths` - Corresponding lengths for each reference
- `header` - Complete header as dictionary
```python
samfile = pysam.AlignmentFile("example.bam", "rb")
print(f"References: {samfile.references}")
print(f"Lengths: {samfile.lengths}")
```
## Reading Reads
### fetch() - Region-Based Retrieval
Retrieves reads overlapping specified genomic regions using **0-based coordinates**.
```python
# Fetch specific region
for read in samfile.fetch("chr1", 1000, 2000):
print(read.query_name, read.reference_start)
# Fetch entire contig
for read in samfile.fetch("chr1"):
print(read.query_name)
# Fetch without index (sequential read)
for read in samfile.fetch(until_eof=True):
print(read.query_name)
```
**Important Notes:**
- Requires index (.bai/.crai) for random access
- Returns reads that **overlap** the region (may extend beyond boundaries)
- Use `until_eof=True` for non-indexed files or sequential reading
- By default, only returns mapped reads
- For unmapped reads, use `fetch("*")` or `until_eof=True`
### Multiple Iterators
When using multiple iterators on the same file:
```python
samfile = pysam.AlignmentFile("example.bam", "rb", multiple_iterators=True)
iter1 = samfile.fetch("chr1", 1000, 2000)
iter2 = samfile.fetch("chr2", 5000, 6000)
```
Without `multiple_iterators=True`, a new fetch() call repositions the file pointer and breaks existing iterators.
### count() - Count Reads in Region
```python
# Count all reads
num_reads = samfile.count("chr1", 1000, 2000)
# Count with a mapping-quality filter: count() has no quality argument,
# so pass a read_callback predicate instead
num_quality_reads = samfile.count(
    "chr1", 1000, 2000,
    read_callback=lambda r: r.mapping_quality >= 20
)
```
### count_coverage() - Per-Base Coverage
Returns four arrays (A, C, G, T) with per-base counts; bases with quality below `quality_threshold` (default 15) are ignored:
```python
coverage = samfile.count_coverage("chr1", 1000, 2000)
a_counts, c_counts, g_counts, t_counts = coverage
```
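Summing the four arrays gives total depth per position; a small sketch:
```python
a, c, g, t = samfile.count_coverage("chr1", 1000, 2000)
# Total depth at each offset is the sum across the four base arrays
total_depth = [a[i] + c[i] + g[i] + t[i] for i in range(len(a))]
print(total_depth[:10])  # depth for positions 1000-1009
```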
## AlignedSegment Objects
Each read is represented as an `AlignedSegment` object with these key attributes:
### Read Information
- `query_name` - Read name/ID
- `query_sequence` - Read sequence (bases)
- `query_qualities` - Base quality scores as an array of Phred values (no ASCII offset)
- `query_length` - Length of the read
### Mapping Information
- `reference_name` - Chromosome/contig name
- `reference_start` - Start position (0-based, inclusive)
- `reference_end` - End position (0-based, exclusive)
- `mapping_quality` - MAPQ score
- `cigarstring` - CIGAR string (e.g., "100M")
- `cigartuples` - CIGAR as list of (operation, length) tuples
**Important:** `cigartuples` encodes each operation as an integer (BAM convention) rather than a SAM character (a parsing sketch follows the list):
- 0 = M (match/mismatch)
- 1 = I (insertion)
- 2 = D (deletion)
- 3 = N (skipped reference)
- 4 = S (soft clipping)
- 5 = H (hard clipping)
- 6 = P (padding)
- 7 = = (sequence match)
- 8 = X (sequence mismatch)
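For instance, a sketch that tallies aligned and soft-clipped bases from `cigartuples`:
```python
for read in samfile.fetch("chr1", 1000, 2000):
    if read.cigartuples is None:  # unmapped reads carry no CIGAR
        continue
    # Ops 0 (M), 7 (=) and 8 (X) align query bases to the reference
    aligned = sum(length for op, length in read.cigartuples if op in (0, 7, 8))
    clipped = sum(length for op, length in read.cigartuples if op == 4)
    print(f"{read.query_name}: {aligned} aligned, {clipped} soft-clipped")
```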
### Flags and Status
- `flag` - SAM flag as integer
- `is_paired` - Is read paired?
- `is_proper_pair` - Is read in a proper pair?
- `is_unmapped` - Is read unmapped?
- `mate_is_unmapped` - Is mate unmapped?
- `is_reverse` - Is read on reverse strand?
- `mate_is_reverse` - Is mate on reverse strand?
- `is_read1` - Is this read1?
- `is_read2` - Is this read2?
- `is_secondary` - Is secondary alignment?
- `is_qcfail` - Did read fail QC?
- `is_duplicate` - Is read a duplicate?
- `is_supplementary` - Is supplementary alignment?
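These flags are commonly combined into a single filter; a minimal sketch:
```python
def is_usable(read):
    """Keep mapped, primary, non-duplicate reads that passed QC."""
    return not (read.is_unmapped or read.is_secondary
                or read.is_supplementary or read.is_duplicate
                or read.is_qcfail)

usable = [r for r in samfile.fetch("chr1", 1000, 2000) if is_usable(r)]
```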
### Tags and Optional Fields
- `get_tag(tag)` - Get value of optional field
- `set_tag(tag, value)` - Set optional field
- `has_tag(tag)` - Check if tag exists
- `get_tags()` - Get all tags as list of tuples
```python
for read in samfile.fetch("chr1", 1000, 2000):
if read.has_tag("NM"):
edit_distance = read.get_tag("NM")
print(f"{read.query_name}: NM={edit_distance}")
```
## Writing Alignment Files
### Creating Header
```python
header = {
'HD': {'VN': '1.0'},
'SQ': [
{'LN': 1575, 'SN': 'chr1'},
{'LN': 1584, 'SN': 'chr2'}
]
}
outfile = pysam.AlignmentFile("output.bam", "wb", header=header)
```
### Creating AlignedSegment Objects
```python
# Create new read
a = pysam.AlignedSegment()
a.query_name = "read001"
a.query_sequence = "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
a.flag = 0
a.reference_id = 0 # Index into header['SQ']
a.reference_start = 100
a.mapping_quality = 20
a.cigartuples = [(0, 35)]  # 35M
a.query_qualities = pysam.qualitystring_to_array("IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII")
# Write to file
outfile.write(a)
```
### Converting Between Formats
```python
# BAM to SAM
infile = pysam.AlignmentFile("input.bam", "rb")
outfile = pysam.AlignmentFile("output.sam", "w", template=infile)
for read in infile:
outfile.write(read)
infile.close()
outfile.close()
```
## Pileup Analysis
The `pileup()` method provides **column-wise** (position-by-position) analysis across a region:
```python
for pileupcolumn in samfile.pileup("chr1", 1000, 2000):
print(f"Position {pileupcolumn.pos}: coverage = {pileupcolumn.nsegments}")
for pileupread in pileupcolumn.pileups:
if not pileupread.is_del and not pileupread.is_refskip:
# Query position is the position in the read
base = pileupread.alignment.query_sequence[pileupread.query_position]
print(f" {pileupread.alignment.query_name}: {base}")
```
**Key attributes:**
- `pileupcolumn.pos` - 0-based reference position
- `pileupcolumn.nsegments` - Number of reads covering position
- `pileupread.alignment` - The AlignedSegment object
- `pileupread.query_position` - Position in the read (None for deletions)
- `pileupread.is_del` - Is this a deletion?
- `pileupread.is_refskip` - Is this a reference skip (N in CIGAR)?
**Important:** Keep iterator references alive. The error "PileupProxy accessed after iterator finished" occurs when iterators go out of scope prematurely.
## Coordinate System
**Critical:** Pysam uses **0-based, half-open** coordinates (Python convention):
- `reference_start` is 0-based (first base is 0)
- `reference_end` is exclusive (not included in range)
- Region from 1000-2000 includes bases 1000-1999
**Exception:** Region strings in `fetch()` and `pileup()` follow samtools conventions (1-based):
```python
# These are equivalent:
samfile.fetch("chr1", 999, 2000) # Python style: 0-based
samfile.fetch("chr1:1000-2000") # samtools style: 1-based
```
## Indexing
Create BAM index:
```python
pysam.index("example.bam")
```
Or use command-line interface:
```python
pysam.samtools.index("example.bam")
```
## Performance Tips
1. **Use indexed access** when querying specific regions repeatedly
2. **Use `pileup()` for column-wise analysis** instead of repeated fetch operations
3. **Use `fetch(until_eof=True)` for sequential reading** of non-indexed files
4. **Avoid multiple iterators** unless necessary (performance cost)
5. **Use `count()` for simple counting** instead of iterating and counting manually
## Common Pitfalls
1. **Partial overlaps:** `fetch()` returns reads that overlap region boundaries—implement explicit filtering if exact boundaries are needed
2. **Quality score editing:** Assigning a new `query_sequence` resets `query_qualities` to None—save them first (`quals = read.query_qualities`), set the sequence, then reassign (see the sketch after this list)
3. **Missing index:** `fetch()` without `until_eof=True` requires an index file
4. **Thread safety:** While pysam releases GIL during I/O, comprehensive thread-safety hasn't been fully validated
5. **Iterator scope:** Keep pileup iterator references alive to avoid "PileupProxy accessed after iterator finished" errors
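For pitfall 2, a short sketch of the safe edit order (given an `AlignedSegment` named `read`):
```python
# Setting query_sequence resets query_qualities to None,
# so save the scores first and restore them afterwards
quals = read.query_qualities
read.query_sequence = read.query_sequence[::-1]  # any same-length sequence
read.query_qualities = quals
```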

skills/pysam/references/common_workflows.md Normal file

@@ -0,0 +1,520 @@
# Common Bioinformatics Workflows with Pysam
## Overview
This document provides practical examples of common bioinformatics workflows using pysam, demonstrating how to combine different file types and operations.
## Quality Control Workflows
### Calculate BAM Statistics
```python
import pysam
def calculate_bam_stats(bam_file):
"""Calculate basic statistics for BAM file."""
samfile = pysam.AlignmentFile(bam_file, "rb")
stats = {
"total_reads": 0,
"mapped_reads": 0,
"unmapped_reads": 0,
"paired_reads": 0,
"proper_pairs": 0,
"duplicates": 0,
"total_bases": 0,
"mapped_bases": 0
}
for read in samfile.fetch(until_eof=True):
stats["total_reads"] += 1
if read.is_unmapped:
stats["unmapped_reads"] += 1
else:
stats["mapped_reads"] += 1
stats["mapped_bases"] += read.query_alignment_length
if read.is_paired:
stats["paired_reads"] += 1
if read.is_proper_pair:
stats["proper_pairs"] += 1
if read.is_duplicate:
stats["duplicates"] += 1
stats["total_bases"] += read.query_length
samfile.close()
# Calculate derived statistics
stats["mapping_rate"] = stats["mapped_reads"] / stats["total_reads"] if stats["total_reads"] > 0 else 0
stats["duplication_rate"] = stats["duplicates"] / stats["total_reads"] if stats["total_reads"] > 0 else 0
return stats
```
### Check Reference Consistency
```python
def check_bam_reference_consistency(bam_file, fasta_file):
"""Verify that BAM reads match reference genome."""
samfile = pysam.AlignmentFile(bam_file, "rb")
fasta = pysam.FastaFile(fasta_file)
mismatches = 0
total_checked = 0
for read in samfile.fetch():
if read.is_unmapped:
continue
# Get reference sequence for aligned region
ref_seq = fasta.fetch(
read.reference_name,
read.reference_start,
read.reference_end
)
# Get read sequence aligned to reference
aligned_pairs = read.get_aligned_pairs(with_seq=True)
for query_pos, ref_pos, ref_base in aligned_pairs:
if query_pos is not None and ref_pos is not None and ref_base is not None:
read_base = read.query_sequence[query_pos]
if read_base.upper() != ref_base.upper():
mismatches += 1
total_checked += 1
        if total_checked >= 10000:  # Stop after sampling roughly 10k positions
            break
samfile.close()
fasta.close()
error_rate = mismatches / total_checked if total_checked > 0 else 0
return {
"positions_checked": total_checked,
"mismatches": mismatches,
"error_rate": error_rate
}
```
## Coverage Analysis
### Calculate Per-Base Coverage
```python
def calculate_coverage(bam_file, chrom, start, end):
"""Calculate coverage for each position in region."""
samfile = pysam.AlignmentFile(bam_file, "rb")
# Initialize coverage array
length = end - start
coverage = [0] * length
# Count coverage at each position
for pileupcolumn in samfile.pileup(chrom, start, end):
if start <= pileupcolumn.pos < end:
coverage[pileupcolumn.pos - start] = pileupcolumn.nsegments
samfile.close()
return coverage
```
### Identify Low Coverage Regions
```python
def find_low_coverage_regions(bam_file, chrom, start, end, min_coverage=10):
"""Find regions with coverage below threshold."""
samfile = pysam.AlignmentFile(bam_file, "rb")
low_coverage_regions = []
in_low_region = False
region_start = None
for pileupcolumn in samfile.pileup(chrom, start, end):
pos = pileupcolumn.pos
if pos < start or pos >= end:
continue
coverage = pileupcolumn.nsegments
if coverage < min_coverage:
if not in_low_region:
region_start = pos
in_low_region = True
else:
if in_low_region:
low_coverage_regions.append((region_start, pos))
in_low_region = False
# Close last region if still open
if in_low_region:
low_coverage_regions.append((region_start, end))
samfile.close()
return low_coverage_regions
```
### Calculate Coverage Statistics
```python
def coverage_statistics(bam_file, chrom, start, end):
"""Calculate coverage statistics for region."""
samfile = pysam.AlignmentFile(bam_file, "rb")
coverages = []
for pileupcolumn in samfile.pileup(chrom, start, end):
if start <= pileupcolumn.pos < end:
coverages.append(pileupcolumn.nsegments)
samfile.close()
if not coverages:
return None
coverages.sort()
n = len(coverages)
return {
"mean": sum(coverages) / n,
"median": coverages[n // 2],
"min": coverages[0],
"max": coverages[-1],
"positions": n
}
```
## Variant Analysis
### Extract Variants in Regions
```python
def extract_variants_in_genes(vcf_file, bed_file):
"""Extract variants overlapping gene regions."""
vcf = pysam.VariantFile(vcf_file)
bed = pysam.TabixFile(bed_file)
variants_by_gene = {}
for gene in bed.fetch(parser=pysam.asBed()):
gene_name = gene.name
variants_by_gene[gene_name] = []
# Find variants in gene region
for variant in vcf.fetch(gene.contig, gene.start, gene.end):
variant_info = {
"chrom": variant.chrom,
"pos": variant.pos,
"ref": variant.ref,
"alt": variant.alts,
"qual": variant.qual
}
variants_by_gene[gene_name].append(variant_info)
vcf.close()
bed.close()
return variants_by_gene
```
### Annotate Variants with Coverage
```python
def annotate_variants_with_coverage(vcf_file, bam_file, output_file):
"""Add coverage information to variants."""
vcf = pysam.VariantFile(vcf_file)
samfile = pysam.AlignmentFile(bam_file, "rb")
# Add DP to header if not present
if "DP" not in vcf.header.info:
vcf.header.info.add("DP", "1", "Integer", "Total Depth from BAM")
outvcf = pysam.VariantFile(output_file, "w", header=vcf.header)
for variant in vcf:
# Get coverage at variant position
coverage = samfile.count(
variant.chrom,
variant.pos - 1, # Convert to 0-based
variant.pos
)
# Add to INFO field
variant.info["DP"] = coverage
outvcf.write(variant)
vcf.close()
samfile.close()
outvcf.close()
```
### Filter Variants by Read Support
```python
def filter_variants_by_support(vcf_file, bam_file, output_file, min_alt_reads=3):
"""Filter variants requiring minimum alternate allele support."""
vcf = pysam.VariantFile(vcf_file)
samfile = pysam.AlignmentFile(bam_file, "rb")
outvcf = pysam.VariantFile(output_file, "w", header=vcf.header)
for variant in vcf:
# Count reads supporting each allele
allele_counts = {variant.ref: 0}
for alt in variant.alts:
allele_counts[alt] = 0
# Pileup at variant position
for pileupcolumn in samfile.pileup(
variant.chrom,
variant.pos - 1,
variant.pos
):
if pileupcolumn.pos == variant.pos - 1: # 0-based
for pileupread in pileupcolumn.pileups:
if not pileupread.is_del and not pileupread.is_refskip:
base = pileupread.alignment.query_sequence[
pileupread.query_position
]
if base in allele_counts:
allele_counts[base] += 1
# Check if any alt allele has sufficient support
has_support = any(
allele_counts.get(alt, 0) >= min_alt_reads
for alt in variant.alts
)
if has_support:
outvcf.write(variant)
vcf.close()
samfile.close()
outvcf.close()
```
## Sequence Extraction
### Extract Sequences Around Variants
```python
def extract_variant_contexts(vcf_file, fasta_file, output_file, window=50):
"""Extract reference sequences around variants."""
vcf = pysam.VariantFile(vcf_file)
fasta = pysam.FastaFile(fasta_file)
with open(output_file, 'w') as out:
for variant in vcf:
# Get sequence context
start = max(0, variant.pos - window - 1) # Convert to 0-based
end = variant.pos + window
context = fasta.fetch(variant.chrom, start, end)
# Mark variant position
var_pos_in_context = variant.pos - 1 - start
out.write(f">{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}\n")
out.write(context[:var_pos_in_context].lower())
out.write(context[var_pos_in_context:var_pos_in_context+len(variant.ref)].upper())
out.write(context[var_pos_in_context+len(variant.ref):].lower())
out.write("\n")
vcf.close()
fasta.close()
```
### Extract Gene Sequences
```python
def extract_gene_sequences(bed_file, fasta_file, output_fasta):
"""Extract sequences for genes from BED file."""
bed = pysam.TabixFile(bed_file)
fasta = pysam.FastaFile(fasta_file)
with open(output_fasta, 'w') as out:
for gene in bed.fetch(parser=pysam.asBed()):
sequence = fasta.fetch(gene.contig, gene.start, gene.end)
# Handle strand
if hasattr(gene, 'strand') and gene.strand == '-':
# Reverse complement
complement = str.maketrans("ATGCatgcNn", "TACGtacgNn")
sequence = sequence.translate(complement)[::-1]
out.write(f">{gene.name} {gene.contig}:{gene.start}-{gene.end}\n")
# Write sequence in 60-character lines
for i in range(0, len(sequence), 60):
out.write(sequence[i:i+60] + "\n")
bed.close()
fasta.close()
```
## Read Filtering and Subsetting
### Filter BAM by Region and Quality
```python
def filter_bam(input_bam, output_bam, chrom, start, end, min_mapq=20):
"""Filter BAM file by region and mapping quality."""
infile = pysam.AlignmentFile(input_bam, "rb")
outfile = pysam.AlignmentFile(output_bam, "wb", template=infile)
for read in infile.fetch(chrom, start, end):
if read.mapping_quality >= min_mapq and not read.is_duplicate:
outfile.write(read)
infile.close()
outfile.close()
# Create index
pysam.index(output_bam)
```
### Extract Reads for Specific Variants
```python
def extract_reads_at_variants(bam_file, vcf_file, output_bam, window=100):
"""Extract reads overlapping variant positions."""
samfile = pysam.AlignmentFile(bam_file, "rb")
vcf = pysam.VariantFile(vcf_file)
outfile = pysam.AlignmentFile(output_bam, "wb", template=samfile)
# Collect all reads (using set to avoid duplicates)
reads_to_keep = set()
for variant in vcf:
start = max(0, variant.pos - window - 1)
end = variant.pos + window
for read in samfile.fetch(variant.chrom, start, end):
reads_to_keep.add(read.query_name)
# Write all reads
samfile.close()
samfile = pysam.AlignmentFile(bam_file, "rb")
for read in samfile.fetch(until_eof=True):
if read.query_name in reads_to_keep:
outfile.write(read)
samfile.close()
vcf.close()
outfile.close()
pysam.index(output_bam)
```
## Integration Workflows
### Create Coverage Track from BAM
```python
def create_coverage_bedgraph(bam_file, output_file, chrom=None):
    """Create bedGraph coverage track from BAM."""
    samfile = pysam.AlignmentFile(bam_file, "rb")
    chroms = [chrom] if chrom else samfile.references
    with open(output_file, 'w') as out:
        out.write("track type=bedGraph name=\"Coverage\"\n")
        for chrom in chroms:
            current_cov = None
            region_start = None
            prev_pos = None
            for pileupcolumn in samfile.pileup(chrom):
                pos = pileupcolumn.pos
                cov = pileupcolumn.nsegments
                # pileup() skips zero-coverage positions, so close the open
                # region at gaps as well as at coverage changes
                if current_cov is not None and (pos != prev_pos + 1 or cov != current_cov):
                    out.write(f"{chrom}\t{region_start}\t{prev_pos + 1}\t{current_cov}\n")
                    current_cov = None
                if current_cov is None:
                    # Start new region
                    current_cov = cov
                    region_start = pos
                prev_pos = pos
            # Write final region
            if current_cov is not None:
                out.write(f"{chrom}\t{region_start}\t{prev_pos + 1}\t{current_cov}\n")
    samfile.close()
```
### Merge Multiple VCF Files
```python
def merge_vcf_samples(vcf_files, output_file):
"""Merge multiple single-sample VCFs."""
# Open all input files
vcf_readers = [pysam.VariantFile(f) for f in vcf_files]
# Create merged header
merged_header = vcf_readers[0].header.copy()
for vcf in vcf_readers[1:]:
for sample in vcf.header.samples:
merged_header.samples.add(sample)
outvcf = pysam.VariantFile(output_file, "w", header=merged_header)
# Get all variant positions
all_variants = {}
for vcf in vcf_readers:
for variant in vcf:
key = (variant.chrom, variant.pos, variant.ref, variant.alts)
if key not in all_variants:
all_variants[key] = []
all_variants[key].append(variant)
# Write merged variants
for key, variants in sorted(all_variants.items()):
# Create merged record from first variant
merged = outvcf.new_record(
contig=variants[0].chrom,
start=variants[0].start,
stop=variants[0].stop,
alleles=variants[0].alleles
)
# Add genotypes from all samples
for variant in variants:
for sample in variant.samples:
merged.samples[sample].update(variant.samples[sample])
outvcf.write(merged)
# Close all files
for vcf in vcf_readers:
vcf.close()
outvcf.close()
```
## Performance Tips for Workflows
1. **Use indexed files** for all random access operations
2. **Process regions in parallel** when analyzing multiple independent regions
3. **Stream data when possible** - avoid loading entire files into memory
4. **Close files explicitly** to free resources
5. **Use `until_eof=True`** for sequential processing of entire files
6. **Batch operations** on the same file to minimize I/O
7. **Consider memory usage** with pileup operations on high-coverage regions
8. **Use count() instead of pileup()** when only counts are needed
## Common Integration Patterns
1. **BAM + Reference**: Verify alignments, extract aligned sequences
2. **BAM + VCF**: Validate variants, calculate allele frequencies
3. **VCF + BED**: Annotate variants with gene/region information
4. **BAM + BED**: Calculate coverage statistics for specific regions
5. **FASTA + VCF**: Extract variant context sequences
6. **Multiple BAMs**: Compare coverage or variants across samples
7. **BAM + FASTQ**: Extract unaligned reads for re-alignment
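For pattern 7, a hedged sketch that dumps unmapped reads to FASTQ for re-alignment (assuming `input.bam`; paired-end mate bookkeeping is omitted for brevity):
```python
import pysam

with pysam.AlignmentFile("input.bam", "rb") as bam, \
        open("unmapped.fastq", "w") as out:
    for read in bam.fetch(until_eof=True):
        if read.is_unmapped:
            # get_forward_* undo any reverse-strand flipping
            seq = read.get_forward_sequence()
            quals = pysam.qualities_to_qualitystring(read.get_forward_qualities())
            out.write(f"@{read.query_name}\n{seq}\n+\n{quals}\n")
```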

skills/pysam/references/sequence_files.md Normal file

@@ -0,0 +1,407 @@
# Working with Sequence Files (FASTA/FASTQ)
## FASTA Files
### Overview
Pysam provides the `FastaFile` class for indexed, random access to FASTA reference sequences. FASTA files must be indexed with `samtools faidx` before use.
### Opening FASTA Files
```python
import pysam
# Open indexed FASTA file
fasta = pysam.FastaFile("reference.fasta")
# Automatically looks for reference.fasta.fai index
```
### Creating FASTA Index
```python
# Create index using pysam
pysam.faidx("reference.fasta")
# Or using samtools command
pysam.samtools.faidx("reference.fasta")
```
This creates a `.fai` index file required for random access.
### FastaFile Properties
```python
fasta = pysam.FastaFile("reference.fasta")
# List of reference sequences
references = fasta.references
print(f"References: {references}")
# Get lengths
lengths = fasta.lengths
print(f"Lengths: {lengths}")
# Get specific sequence length
chr1_length = fasta.get_reference_length("chr1")
```
### Fetching Sequences
#### Fetch by Region
Uses **0-based, half-open** coordinates:
```python
# Fetch specific region
sequence = fasta.fetch("chr1", 1000, 2000)
print(f"Sequence: {sequence}") # Returns 1000 bases
# Fetch entire chromosome
chr1_seq = fasta.fetch("chr1")
# Fetch using region string (1-based)
sequence = fasta.fetch(region="chr1:1001-2000")
```
**Important:** Numeric arguments use 0-based coordinates, region strings use 1-based coordinates (samtools convention).
#### Common Use Cases
```python
# Get sequence at variant position
def get_variant_context(fasta, chrom, pos, window=10):
"""Get sequence context around a variant position (1-based)."""
start = max(0, pos - window - 1) # Convert to 0-based
end = pos + window
return fasta.fetch(chrom, start, end)
# Get sequence for gene coordinates
def get_gene_sequence(fasta, chrom, start, end, strand):
"""Get gene sequence with strand awareness."""
seq = fasta.fetch(chrom, start, end)
if strand == "-":
# Reverse complement
complement = str.maketrans("ATGCatgc", "TACGtacg")
seq = seq.translate(complement)[::-1]
return seq
# Check reference allele
def check_ref_allele(fasta, chrom, pos, expected_ref):
"""Verify reference allele at position (1-based pos)."""
actual = fasta.fetch(chrom, pos-1, pos) # Convert to 0-based
return actual.upper() == expected_ref.upper()
```
### Extracting Multiple Regions
```python
# Extract multiple regions efficiently
regions = [
("chr1", 1000, 2000),
("chr1", 5000, 6000),
("chr2", 10000, 11000)
]
sequences = {}
for chrom, start, end in regions:
seq_id = f"{chrom}:{start}-{end}"
sequences[seq_id] = fasta.fetch(chrom, start, end)
```
### Working with Ambiguous Bases
FASTA files may contain IUPAC ambiguity codes:
- N = any base
- R = A or G (purine)
- Y = C or T (pyrimidine)
- S = G or C (strong)
- W = A or T (weak)
- K = G or T (keto)
- M = A or C (amino)
- B = C, G, or T (not A)
- D = A, G, or T (not C)
- H = A, C, or T (not G)
- V = A, C, or G (not T)
```python
# Handle ambiguous bases
def count_ambiguous(sequence):
"""Count non-ATGC bases."""
return sum(1 for base in sequence.upper() if base not in "ATGC")
# Remove regions with too many Ns
def has_quality_sequence(fasta, chrom, start, end, max_n_frac=0.1):
"""Check if region has acceptable N content."""
seq = fasta.fetch(chrom, start, end)
n_count = seq.upper().count('N')
return (n_count / len(seq)) <= max_n_frac
```
## FASTQ Files
### Overview
Pysam provides `FastxFile` (or `FastqFile`) for reading FASTQ files containing raw sequencing reads with quality scores. FASTQ files do not support random access—only sequential reading.
### Opening FASTQ Files
```python
import pysam
# Open FASTQ file
fastq = pysam.FastxFile("reads.fastq")
# Works with compressed files
fastq_gz = pysam.FastxFile("reads.fastq.gz")
```
### Reading FASTQ Records
```python
fastq = pysam.FastxFile("reads.fastq")
for read in fastq:
print(f"Name: {read.name}")
print(f"Sequence: {read.sequence}")
print(f"Quality: {read.quality}")
print(f"Comment: {read.comment}") # Optional header comment
```
**FastqProxy attributes:**
- `name` - Read identifier (without @ prefix)
- `sequence` - DNA/RNA sequence
- `quality` - ASCII-encoded quality string
- `comment` - Optional comment from header line
- `get_quality_array()` - Convert quality string to numeric array
### Quality Score Conversion
```python
# Convert quality string to numeric values
for read in fastq:
qual_array = read.get_quality_array()
mean_quality = sum(qual_array) / len(qual_array)
print(f"{read.name}: mean Q = {mean_quality:.1f}")
```
Quality scores are Phred-scaled (typically Phred+33 encoding):
- Q = -10 * log10(P_error)
- ASCII 33 ('!') = Q0
- ASCII 43 ('+') = Q10
- ASCII 63 ('?') = Q30
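Decoding by hand follows directly from the encoding; a small sketch:
```python
qual_string = "II?+!"
# Phred+33: subtract 33 from the ASCII code to get the quality
scores = [ord(ch) - 33 for ch in qual_string]
print(scores)  # [40, 40, 30, 10, 0]
```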
### Common FASTQ Processing Workflows
#### Quality Filtering
```python
def filter_by_quality(input_fastq, output_fastq, min_mean_quality=20):
"""Filter reads by mean quality score."""
with pysam.FastxFile(input_fastq) as infile:
with open(output_fastq, 'w') as outfile:
for read in infile:
qual_array = read.get_quality_array()
mean_q = sum(qual_array) / len(qual_array)
if mean_q >= min_mean_quality:
# Write in FASTQ format
outfile.write(f"@{read.name}\n")
outfile.write(f"{read.sequence}\n")
outfile.write("+\n")
outfile.write(f"{read.quality}\n")
```
#### Length Filtering
```python
def filter_by_length(input_fastq, output_fastq, min_length=50):
"""Filter reads by minimum length."""
with pysam.FastxFile(input_fastq) as infile:
with open(output_fastq, 'w') as outfile:
kept = 0
for read in infile:
if len(read.sequence) >= min_length:
outfile.write(f"@{read.name}\n")
outfile.write(f"{read.sequence}\n")
outfile.write("+\n")
outfile.write(f"{read.quality}\n")
kept += 1
print(f"Kept {kept} reads")
```
#### Calculate Quality Statistics
```python
def calculate_fastq_stats(fastq_file):
"""Calculate basic statistics for FASTQ file."""
total_reads = 0
total_bases = 0
quality_sum = 0
with pysam.FastxFile(fastq_file) as fastq:
for read in fastq:
total_reads += 1
read_length = len(read.sequence)
total_bases += read_length
qual_array = read.get_quality_array()
quality_sum += sum(qual_array)
return {
"total_reads": total_reads,
"total_bases": total_bases,
"mean_read_length": total_bases / total_reads if total_reads > 0 else 0,
"mean_quality": quality_sum / total_bases if total_bases > 0 else 0
}
```
#### Extract Reads by Name
```python
def extract_reads_by_name(fastq_file, read_names, output_file):
"""Extract specific reads by name."""
read_set = set(read_names)
with pysam.FastxFile(fastq_file) as infile:
with open(output_file, 'w') as outfile:
for read in infile:
if read.name in read_set:
outfile.write(f"@{read.name}\n")
outfile.write(f"{read.sequence}\n")
outfile.write("+\n")
outfile.write(f"{read.quality}\n")
```
#### Convert FASTQ to FASTA
```python
def fastq_to_fasta(fastq_file, fasta_file):
"""Convert FASTQ to FASTA (discards quality scores)."""
with pysam.FastxFile(fastq_file) as infile:
with open(fasta_file, 'w') as outfile:
for read in infile:
outfile.write(f">{read.name}\n")
outfile.write(f"{read.sequence}\n")
```
#### Subsample FASTQ
```python
import random
def subsample_fastq(input_fastq, output_fastq, fraction=0.1, seed=42):
"""Randomly subsample reads from FASTQ file."""
random.seed(seed)
with pysam.FastxFile(input_fastq) as infile:
with open(output_fastq, 'w') as outfile:
for read in infile:
if random.random() < fraction:
outfile.write(f"@{read.name}\n")
outfile.write(f"{read.sequence}\n")
outfile.write("+\n")
outfile.write(f"{read.quality}\n")
```
## Tabix-Indexed Files
### Overview
Pysam provides `TabixFile` for accessing tabix-indexed genomic data files (BED, GFF, GTF, generic tab-delimited).
### Opening Tabix Files
```python
import pysam
# Open tabix-indexed file
tabix = pysam.TabixFile("annotations.bed.gz")
# File must be bgzip-compressed and tabix-indexed
```
### Creating Tabix Index
```python
# Index a file
pysam.tabix_index("annotations.bed", preset="bed", force=True)
# Creates annotations.bed.gz and annotations.bed.gz.tbi
# Presets available: bed, gff, vcf
```
### Fetching Records
```python
tabix = pysam.TabixFile("annotations.bed.gz")
# Fetch region
for row in tabix.fetch("chr1", 1000000, 2000000):
print(row) # Returns tab-delimited string
# Parse with specific parser
for row in tabix.fetch("chr1", 1000000, 2000000, parser=pysam.asBed()):
print(f"Interval: {row.contig}:{row.start}-{row.end}")
# Available parsers: asBed(), asGTF(), asVCF(), asTuple()
```
### Working with BED Files
```python
bed = pysam.TabixFile("regions.bed.gz")
# Access BED fields by name
for interval in bed.fetch("chr1", 1000000, 2000000, parser=pysam.asBed()):
print(f"Region: {interval.contig}:{interval.start}-{interval.end}")
print(f"Name: {interval.name}")
print(f"Score: {interval.score}")
print(f"Strand: {interval.strand}")
```
### Working with GTF/GFF Files
```python
gtf = pysam.TabixFile("annotations.gtf.gz")
# Access GTF fields
for feature in gtf.fetch("chr1", 1000000, 2000000, parser=pysam.asGTF()):
print(f"Feature: {feature.feature}")
print(f"Gene: {feature.gene_id}")
print(f"Transcript: {feature.transcript_id}")
print(f"Coordinates: {feature.start}-{feature.end}")
```
## Performance Tips
### FASTA
1. **Always use indexed FASTA** files (create .fai with samtools faidx)
2. **Batch fetch operations** when extracting multiple regions
3. **Cache frequently accessed sequences** in memory
4. **Use appropriate window sizes** to avoid loading excessive sequence data
### FASTQ
1. **Stream processing** - FASTQ files are read sequentially, process on-the-fly
2. **Use compressed FASTQ.gz** to save disk space (pysam handles transparently)
3. **Avoid loading entire file** into memory—process read-by-read
4. **For large files**, consider parallel processing with file splitting
### Tabix
1. **Always bgzip and tabix-index** files before region queries
2. **Use appropriate presets** when creating indices
3. **Specify parser** for named field access
4. **Batch queries** to same file to avoid re-opening
## Common Pitfalls
1. **FASTA coordinate system:** fetch() uses 0-based coordinates, region strings use 1-based
2. **Missing index:** FASTA random access requires .fai index file
3. **FASTQ sequential only:** Cannot do random access or region-based queries on FASTQ
4. **Quality encoding:** Assume Phred+33 unless specified otherwise
5. **Tabix compression:** Must use bgzip, not regular gzip, for tabix indexing
6. **Parser requirement:** TabixFile needs explicit parser for named field access
7. **Case sensitivity:** FASTA sequences preserve case—use .upper() or .lower() for consistent comparisons

skills/pysam/references/variant_files.md Normal file

@@ -0,0 +1,365 @@
# Working with Variant Files (VCF/BCF)
## Overview
Pysam provides the `VariantFile` class for reading and writing VCF (Variant Call Format) and BCF (binary VCF) files. These files contain information about genetic variants, including SNPs, indels, and structural variants.
## Opening Variant Files
```python
import pysam
# Reading VCF
vcf = pysam.VariantFile("example.vcf")
# Reading BCF (binary, compressed)
bcf = pysam.VariantFile("example.bcf")
# Reading compressed VCF
vcf_gz = pysam.VariantFile("example.vcf.gz")
# Writing
outvcf = pysam.VariantFile("output.vcf", "w", header=vcf.header)
```
## VariantFile Properties
**Header Information:**
- `header` - Complete VCF header with metadata
- `header.contigs` - Dictionary of contigs/chromosomes
- `header.samples` - List of sample names
- `header.filters` - Dictionary of FILTER definitions
- `header.info` - Dictionary of INFO field definitions
- `header.formats` - Dictionary of FORMAT field definitions
```python
vcf = pysam.VariantFile("example.vcf")
# List samples
print(f"Samples: {list(vcf.header.samples)}")
# List contigs
for contig in vcf.header.contigs:
print(f"{contig}: length={vcf.header.contigs[contig].length}")
# List INFO fields
for info in vcf.header.info:
print(f"{info}: {vcf.header.info[info].description}")
```
## Reading Variant Records
### Iterate All Variants
```python
for variant in vcf:
print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}")
```
### Fetch Specific Region
Requires tabix index (.tbi) for VCF.gz or index for BCF:
```python
# Fetch variants in region (numeric start/stop are 0-based, half-open)
for variant in vcf.fetch("chr1", 1000000, 2000000):
print(f"{variant.chrom}:{variant.pos} {variant.id}")
# Using region string (1-based)
for variant in vcf.fetch("chr1:1000000-2000000"):
print(variant.pos)
```
**Note:** Numeric arguments to `fetch()` follow pysam's **0-based, half-open** convention; only region strings (e.g. `"chr1:1000000-2000000"`) are 1-based, matching the VCF text format.
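A sketch of the two equivalent call styles:
```python
# Equivalent queries: numeric args are 0-based, the region string is 1-based
for variant in vcf.fetch("chr1", 999999, 2000000):
    pass
for variant in vcf.fetch("chr1:1000000-2000000"):
    pass
```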
## VariantRecord Objects
Each variant is represented as a `VariantRecord` object:
### Position Information
- `chrom` - Chromosome/contig name
- `pos` - Position (1-based)
- `start` - Start position (0-based)
- `stop` - Stop position (0-based, exclusive)
- `id` - Variant ID (e.g., rsID)
### Allele Information
- `ref` - Reference allele
- `alts` - Tuple of alternate alleles
- `alleles` - Tuple of all alleles (ref + alts)
### Quality and Filtering
- `qual` - Quality score (QUAL field)
- `filter` - Filter status
### INFO Fields
Access INFO fields as dictionary:
```python
for variant in vcf:
# Check if field exists
if "DP" in variant.info:
depth = variant.info["DP"]
print(f"Depth: {depth}")
# Get all INFO keys
print(f"INFO fields: {variant.info.keys()}")
# Access specific fields
if "AF" in variant.info:
allele_freq = variant.info["AF"]
print(f"Allele frequency: {allele_freq}")
```
### Sample Genotype Data
Access sample data through `samples` dictionary:
```python
for variant in vcf:
for sample_name in variant.samples:
sample = variant.samples[sample_name]
# Genotype (GT field)
gt = sample["GT"]
print(f"{sample_name} genotype: {gt}")
# Other FORMAT fields
if "DP" in sample:
print(f"{sample_name} depth: {sample['DP']}")
if "GQ" in sample:
print(f"{sample_name} quality: {sample['GQ']}")
# Alleles for this genotype
alleles = sample.alleles
print(f"{sample_name} alleles: {alleles}")
# Phasing
if sample.phased:
print(f"{sample_name} is phased")
```
**Genotype representation:**
- `(0, 0)` - Homozygous reference
- `(0, 1)` - Heterozygous
- `(1, 1)` - Homozygous alternate
- `(None, None)` - Missing genotype
- Phased: `(0|1)` vs unphased: `(0/1)`
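A small classification sketch built on these tuples:
```python
def classify_genotype(gt):
    """Classify a GT tuple as missing, hom_ref, het, or hom_alt."""
    if gt is None or None in gt:
        return "missing"
    if all(allele == 0 for allele in gt):
        return "hom_ref"
    if len(set(gt)) == 1:
        return "hom_alt"
    return "het"

for variant in vcf:
    for name in variant.samples:
        print(name, classify_genotype(variant.samples[name]["GT"]))
```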
## Writing Variant Files
### Creating Header
```python
header = pysam.VariantHeader()
# Add contigs
header.contigs.add("chr1", length=248956422)
header.contigs.add("chr2", length=242193529)
# Add INFO fields
header.add_line('##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">')
header.add_line('##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">')
# Add FORMAT fields
header.add_line('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">')
header.add_line('##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">')
# Add samples
header.add_sample("sample1")
header.add_sample("sample2")
# Create output file
outvcf = pysam.VariantFile("output.vcf", "w", header=header)
```
### Creating Variant Records
```python
# Create new variant
record = outvcf.new_record()
record.chrom = "chr1"
record.pos = 100000
record.id = "rs123456"
record.ref = "A"
record.alts = ("G",)
record.qual = 30
record.filter.add("PASS")
# Set INFO fields
record.info["DP"] = 100
record.info["AF"] = (0.25,)
# Set genotype data
record.samples["sample1"]["GT"] = (0, 1)
record.samples["sample1"]["DP"] = 50
record.samples["sample2"]["GT"] = (0, 0)
record.samples["sample2"]["DP"] = 50
# Write to file
outvcf.write(record)
```
## Filtering Variants
### Basic Filtering
```python
# Filter by quality
for variant in vcf:
if variant.qual >= 30:
print(f"High quality variant: {variant.chrom}:{variant.pos}")
# Filter by depth
for variant in vcf:
if "DP" in variant.info and variant.info["DP"] >= 20:
print(f"High depth variant: {variant.chrom}:{variant.pos}")
# Filter by allele frequency
for variant in vcf:
if "AF" in variant.info:
for af in variant.info["AF"]:
if af >= 0.01:
print(f"Common variant: {variant.chrom}:{variant.pos}")
```
### Filtering by Genotype
```python
# Find variants where sample has alternate allele
for variant in vcf:
sample = variant.samples["sample1"]
gt = sample["GT"]
# Check if has alternate allele
    if gt and any(allele is not None and allele > 0 for allele in gt):
print(f"Sample has alt allele: {variant.chrom}:{variant.pos}")
# Check if homozygous alternate
if gt == (1, 1):
print(f"Homozygous alt: {variant.chrom}:{variant.pos}")
```
### Filter Field
```python
# Check FILTER status
for variant in vcf:
if "PASS" in variant.filter or len(variant.filter) == 0:
print(f"Passed filters: {variant.chrom}:{variant.pos}")
else:
print(f"Failed: {variant.filter.keys()}")
```
## Indexing VCF Files
Create tabix index for compressed VCF:
```python
# Compress and index
pysam.tabix_index("example.vcf", preset="vcf", force=True)
# Creates example.vcf.gz and example.vcf.gz.tbi
```
Or use bcftools for BCF:
```python
pysam.bcftools.index("example.bcf")
```
## Common Workflows
### Extract Variants for Specific Samples
```python
invcf = pysam.VariantFile("input.vcf")
samples_to_keep = ["sample1", "sample3"]
# Restrict parsing to the selected samples; this also updates the header.
# Must be called before reading any records.
invcf.subset_samples(samples_to_keep)
outvcf = pysam.VariantFile("output.vcf", "w", header=invcf.header)
for variant in invcf:
    outvcf.write(variant)
invcf.close()
outvcf.close()
```
### Calculate Allele Frequencies
```python
vcf = pysam.VariantFile("example.vcf")
for variant in vcf:
total_alleles = 0
alt_alleles = 0
for sample_name in variant.samples:
gt = variant.samples[sample_name]["GT"]
if gt and None not in gt:
            total_alleles += len(gt)  # count alleles rather than assuming diploid
alt_alleles += sum(1 for allele in gt if allele > 0)
if total_alleles > 0:
af = alt_alleles / total_alleles
print(f"{variant.chrom}:{variant.pos} AF={af:.4f}")
```
### Convert VCF to Summary Table
```python
import csv
vcf = pysam.VariantFile("example.vcf")
with open("variants.csv", "w", newline="") as csvfile:
writer = csv.writer(csvfile)
writer.writerow(["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "DP"])
for variant in vcf:
writer.writerow([
variant.chrom,
variant.pos,
variant.id or ".",
variant.ref,
",".join(variant.alts) if variant.alts else ".",
variant.qual or ".",
variant.info.get("DP", ".")
])
```
## Performance Tips
1. **Use BCF format** for better compression and faster access than VCF
2. **Index files** with tabix for efficient region queries
3. **Filter early** to reduce processing of irrelevant variants
4. **Use INFO fields efficiently** - check existence before accessing
5. **Batch write operations** when creating VCF files
## Common Pitfalls
1. **Coordinate systems:** VCF uses 1-based coordinates, but VariantRecord.start is 0-based
2. **Missing data:** Always check if INFO/FORMAT fields exist before accessing
3. **Genotype tuples:** Genotypes are tuples, not lists—handle None values for missing data
4. **Allele indexing:** In genotype (0, 1), 0=REF, 1=first ALT, 2=second ALT, etc.
5. **Index requirement:** Region-based `fetch()` requires tabix index for VCF.gz
6. **Header modification:** When subsetting samples, call `subset_samples()` (or rebuild the header) before reading records so the header and FORMAT fields stay consistent