Initial commit

**skills/pysam/SKILL.md**

---
name: pysam
description: "Genomic file toolkit. Read/write SAM/BAM/CRAM alignments, VCF/BCF variants, and FASTA/FASTQ sequences; extract regions and calculate coverage for NGS data processing pipelines."
---

# Pysam

## Overview

Pysam is a Python module for reading, manipulating, and writing genomic datasets. Read/write SAM/BAM/CRAM alignment files, VCF/BCF variant files, and FASTA/FASTQ sequences with a Pythonic interface to htslib. Query tabix-indexed files, perform pileup analysis for coverage, and execute samtools/bcftools commands.

## When to Use This Skill

This skill should be used when:
- Working with sequencing alignment files (BAM/CRAM)
- Analyzing genetic variants (VCF/BCF)
- Extracting reference sequences or gene regions
- Processing raw sequencing data (FASTQ)
- Calculating coverage or read depth
- Implementing bioinformatics analysis pipelines
- Quality control of sequencing data
- Variant calling and annotation workflows

## Quick Start

### Installation
```bash
uv pip install pysam
```

### Basic Examples

**Read alignment file:**
```python
import pysam

# Open BAM file and fetch reads in region
samfile = pysam.AlignmentFile("example.bam", "rb")
for read in samfile.fetch("chr1", 1000, 2000):
    print(f"{read.query_name}: {read.reference_start}")
samfile.close()
```

**Read variant file:**
```python
# Open VCF file and iterate variants
vcf = pysam.VariantFile("variants.vcf")
for variant in vcf:
    print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}")
vcf.close()
```

**Query reference sequence:**
```python
# Open FASTA and extract sequence
fasta = pysam.FastaFile("reference.fasta")
sequence = fasta.fetch("chr1", 1000, 2000)
print(sequence)
fasta.close()
```

## Core Capabilities

### 1. Alignment File Operations (SAM/BAM/CRAM)

Use the `AlignmentFile` class to work with aligned sequencing reads. This is appropriate for analyzing mapping results, calculating coverage, extracting reads, or quality control.

**Common operations:**
- Open and read BAM/SAM/CRAM files
- Fetch reads from specific genomic regions
- Filter reads by mapping quality, flags, or other criteria
- Write filtered or modified alignments
- Calculate coverage statistics
- Perform pileup analysis (base-by-base coverage)
- Access read sequences, quality scores, and alignment information

**Reference:** See `references/alignment_files.md` for detailed documentation on:
- Opening and reading alignment files
- AlignedSegment attributes and methods
- Region-based fetching with `fetch()`
- Pileup analysis for coverage
- Writing and creating BAM files
- Coordinate systems and indexing
- Performance optimization tips

### 2. Variant File Operations (VCF/BCF)

Use the `VariantFile` class to work with genetic variants from variant calling pipelines. This is appropriate for variant analysis, filtering, annotation, or population genetics.

**Common operations:**
- Read and write VCF/BCF files
- Query variants in specific regions
- Access variant information (position, alleles, quality)
- Extract genotype data for samples
- Filter variants by quality, allele frequency, or other criteria
- Annotate variants with additional information
- Subset samples or regions

**Reference:** See `references/variant_files.md` for detailed documentation on:
- Opening and reading variant files
- VariantRecord attributes and methods
- Accessing INFO and FORMAT fields
- Working with genotypes and samples
- Creating and writing VCF files
- Filtering and subsetting variants
- Multi-sample VCF operations

### 3. Sequence File Operations (FASTA/FASTQ)

Use `FastaFile` for random access to reference sequences and `FastxFile` for reading raw sequencing data. This is appropriate for extracting gene sequences, validating variants against the reference, or processing raw reads.

**Common operations:**
- Query reference sequences by genomic coordinates
- Extract sequences for genes or regions of interest
- Read FASTQ files with quality scores
- Validate variant reference alleles
- Calculate sequence statistics
- Filter reads by quality or length
- Convert between FASTA and FASTQ formats

**Reference:** See `references/sequence_files.md` for detailed documentation on:
- FASTA file access and indexing
- Extracting sequences by region
- Handling reverse complement for genes
- Reading FASTQ files sequentially
- Quality score conversion and filtering
- Working with tabix-indexed files (BED, GTF, GFF)
- Common sequence processing patterns

### 4. Integrated Bioinformatics Workflows

Pysam excels at integrating multiple file types for comprehensive genomic analyses. Common workflows combine alignment files, variant files, and reference sequences.

**Common workflows:**
- Calculate coverage statistics for specific regions
- Validate variants against aligned reads
- Annotate variants with coverage information
- Extract sequences around variant positions
- Filter alignments or variants based on multiple criteria
- Generate coverage tracks for visualization
- Quality control across multiple data types

**Reference:** See `references/common_workflows.md` for detailed examples of:
- Quality control workflows (BAM statistics, reference consistency)
- Coverage analysis (per-base coverage, low coverage detection)
- Variant analysis (annotation, filtering by read support)
- Sequence extraction (variant contexts, gene sequences)
- Read filtering and subsetting
- Integration patterns (BAM+VCF, VCF+BED, etc.)
- Performance optimization for complex workflows

## Key Concepts

### Coordinate Systems

**Critical:** Pysam uses **0-based, half-open** coordinates (Python convention):
- Start positions are 0-based (first base is position 0)
- End positions are exclusive (not included in the range)
- Region 1000-2000 includes bases 1000-1999 (1000 bases total)

**Exception:** Region strings in `fetch()` follow samtools convention (1-based):
```python
samfile.fetch("chr1", 999, 2000)  # 0-based: positions 999-1999
samfile.fetch("chr1:1000-2000")   # 1-based string: positions 1000-2000
```

**VCF files:** Use 1-based coordinates in the file format, but `VariantRecord.start` is 0-based.
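
A quick check of the two conventions on the same record (the file name is a placeholder):

```python
import pysam

vcf = pysam.VariantFile("variants.vcf")
record = next(iter(vcf))
print(record.pos)    # 1-based position, as written in the VCF text
print(record.start)  # same locus, 0-based (always pos - 1)
vcf.close()
```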

### Indexing Requirements

Random access to specific genomic regions requires index files:
- **BAM files**: Require `.bai` index (create with `pysam.index()`)
- **CRAM files**: Require `.crai` index
- **FASTA files**: Require `.fai` index (create with `pysam.faidx()`)
- **VCF.gz files**: Require `.tbi` tabix index (create with `pysam.tabix_index()`)
- **BCF files**: Require `.csi` index

Without an index, use `fetch(until_eof=True)` for sequential reading.
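
A minimal sketch of the index-then-query pattern (file names are placeholders):

```python
import pysam

pysam.index("example.bam")  # writes example.bam.bai alongside the BAM

with pysam.AlignmentFile("example.bam", "rb") as bam:
    # Random access now works without scanning the whole file
    n = bam.count("chr1", 0, 1_000_000)
print(f"reads in region: {n}")
```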

### File Modes

Specify format when opening files (a CRAM example follows the list):
- `"rb"` - Read BAM (binary)
- `"r"` - Read SAM (text)
- `"rc"` - Read CRAM
- `"wb"` - Write BAM
- `"w"` - Write SAM
- `"wc"` - Write CRAM
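
Because CRAM is reference-compressed, opening CRAM files usually also needs the reference FASTA, passed via the `reference_filename` argument. A sketch (paths are placeholders):

```python
import pysam

# Read CRAM, decoding sequences against the reference
cram = pysam.AlignmentFile("example.cram", "rc",
                           reference_filename="reference.fasta")

# Write a CRAM subset using the same reference
out = pysam.AlignmentFile("subset.cram", "wc", template=cram,
                          reference_filename="reference.fasta")
for read in cram.fetch("chr1", 1000, 2000):
    out.write(read)
out.close()
cram.close()
```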

### Performance Considerations

1. **Always use indexed files** for random access operations
2. **Use `pileup()` for column-wise analysis** instead of repeated fetch operations
3. **Use `count()` for counting** instead of iterating and counting manually
4. **Process regions in parallel** when analyzing independent genomic regions
5. **Close files explicitly** to free resources
6. **Use `until_eof=True`** for sequential processing without index
7. **Avoid multiple iterators** unless necessary (use `multiple_iterators=True` if needed)

## Common Pitfalls

1. **Coordinate confusion:** Remember 0-based vs 1-based systems in different contexts
2. **Missing indices:** Many operations require index files—create them first
3. **Partial overlaps:** `fetch()` returns reads overlapping region boundaries, not just those fully contained
4. **Iterator scope:** Keep pileup iterator references alive to avoid "PileupProxy accessed after iterator finished" errors
5. **Quality score editing:** Cannot modify `query_qualities` in place after changing `query_sequence`—create a copy first
6. **Stream limitations:** Only stdin/stdout are supported for streaming, not arbitrary Python file objects
7. **Thread safety:** While the GIL is released during I/O, comprehensive thread safety has not been fully validated

## Command-Line Tools

Pysam provides access to samtools and bcftools commands:

```python
# Sort BAM file
pysam.samtools.sort("-o", "sorted.bam", "input.bam")

# Index BAM
pysam.samtools.index("sorted.bam")

# View specific region
pysam.samtools.view("-b", "-o", "region.bam", "input.bam", "chr1:1000-2000")

# BCF tools
pysam.bcftools.view("-O", "z", "-o", "output.vcf.gz", "input.vcf")
```

**Error handling:**
```python
try:
    pysam.samtools.sort("-o", "output.bam", "input.bam")
except pysam.SamtoolsError as e:
    print(f"Error: {e}")
```

## Resources

### references/

Detailed documentation for each major capability:

- **alignment_files.md** - Complete guide to SAM/BAM/CRAM operations, including AlignmentFile class, AlignedSegment attributes, fetch operations, pileup analysis, and writing alignments

- **variant_files.md** - Complete guide to VCF/BCF operations, including VariantFile class, VariantRecord attributes, genotype handling, INFO/FORMAT fields, and multi-sample operations

- **sequence_files.md** - Complete guide to FASTA/FASTQ operations, including FastaFile and FastxFile classes, sequence extraction, quality score handling, and tabix-indexed file access

- **common_workflows.md** - Practical examples of integrated bioinformatics workflows combining multiple file types, including quality control, coverage analysis, variant validation, and sequence extraction

## Getting Help

For detailed information on specific operations, refer to the appropriate reference document:

- Working with BAM files or calculating coverage → `alignment_files.md`
- Analyzing variants or genotypes → `variant_files.md`
- Extracting sequences or processing FASTQ → `sequence_files.md`
- Complex workflows integrating multiple file types → `common_workflows.md`

Official documentation: https://pysam.readthedocs.io/

**skills/pysam/references/alignment_files.md**

# Working with Alignment Files (SAM/BAM/CRAM)

## Overview

Pysam provides the `AlignmentFile` class for reading and writing SAM/BAM/CRAM formatted files containing aligned sequence data. BAM/CRAM files support compression and random access through indexing.

## Opening Alignment Files

Specify format via mode qualifier:
- `"rb"` - Read BAM (binary)
- `"r"` - Read SAM (text)
- `"rc"` - Read CRAM (compressed)
- `"wb"` - Write BAM
- `"w"` - Write SAM
- `"wc"` - Write CRAM

```python
import pysam

# Reading
samfile = pysam.AlignmentFile("example.bam", "rb")

# Writing (requires template or header)
outfile = pysam.AlignmentFile("output.bam", "wb", template=samfile)
```

### Stream Processing

Use `"-"` as filename for stdin/stdout operations:

```python
# Read from stdin
infile = pysam.AlignmentFile('-', 'rb')

# Write to stdout
outfile = pysam.AlignmentFile('-', 'w', template=infile)
```

**Important:** Pysam does not support reading/writing from true Python file objects—only stdin/stdout streams are supported.

## AlignmentFile Properties

**Header Information:**
- `references` - List of chromosome/contig names
- `lengths` - Corresponding lengths for each reference
- `header` - Complete header as dictionary

```python
samfile = pysam.AlignmentFile("example.bam", "rb")
print(f"References: {samfile.references}")
print(f"Lengths: {samfile.lengths}")
```

## Reading Reads

### fetch() - Region-Based Retrieval

Retrieves reads overlapping specified genomic regions using **0-based coordinates**.

```python
# Fetch specific region
for read in samfile.fetch("chr1", 1000, 2000):
    print(read.query_name, read.reference_start)

# Fetch entire contig
for read in samfile.fetch("chr1"):
    print(read.query_name)

# Fetch without index (sequential read)
for read in samfile.fetch(until_eof=True):
    print(read.query_name)
```

**Important Notes:**
- Requires index (.bai/.crai) for random access
- Returns reads that **overlap** the region (may extend beyond boundaries)
- Use `until_eof=True` for non-indexed files or sequential reading
- By default, only returns mapped reads
- For unmapped reads, use `fetch("*")` or `until_eof=True`

### Multiple Iterators

When using multiple iterators on the same file:

```python
samfile = pysam.AlignmentFile("example.bam", "rb", multiple_iterators=True)
iter1 = samfile.fetch("chr1", 1000, 2000)
iter2 = samfile.fetch("chr2", 5000, 6000)
```

Without `multiple_iterators=True`, a new fetch() call repositions the file pointer and breaks existing iterators.

### count() - Count Reads in Region

```python
# Count all reads
num_reads = samfile.count("chr1", 1000, 2000)

# Count with a mapping-quality filter via read_callback
num_quality_reads = samfile.count(
    "chr1", 1000, 2000,
    read_callback=lambda read: read.mapping_quality >= 20
)
```

### count_coverage() - Per-Base Coverage

Returns four arrays (A, C, G, T) with per-base coverage:

```python
coverage = samfile.count_coverage("chr1", 1000, 2000)
a_counts, c_counts, g_counts, t_counts = coverage
```
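
The four arrays are ordered A, C, G, T, so total depth at a position is their element-wise sum. A small sketch (file and region are placeholders):

```python
import pysam

with pysam.AlignmentFile("example.bam", "rb") as bam:
    a, c, g, t = bam.count_coverage("chr1", 1000, 2000)

# Total depth per position is the element-wise sum of the four arrays
depth = [a[i] + c[i] + g[i] + t[i] for i in range(len(a))]
print(f"max depth: {max(depth)}, mean depth: {sum(depth) / len(depth):.1f}")
```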

## AlignedSegment Objects

Each read is represented as an `AlignedSegment` object with these key attributes:

### Read Information
- `query_name` - Read name/ID
- `query_sequence` - Read sequence (bases)
- `query_qualities` - Base quality scores as an array of Phred integers (no ASCII offset)
- `query_length` - Length of the read

### Mapping Information
- `reference_name` - Chromosome/contig name
- `reference_start` - Start position (0-based, inclusive)
- `reference_end` - End position (0-based, exclusive)
- `mapping_quality` - MAPQ score
- `cigarstring` - CIGAR string (e.g., "100M")
- `cigartuples` - CIGAR as list of (operation, length) tuples

**Important:** `cigartuples` encodes operations as integers rather than the SAM specification's character codes (a usage sketch follows this list):
- 0 = M (match/mismatch)
- 1 = I (insertion)
- 2 = D (deletion)
- 3 = N (skipped reference)
- 4 = S (soft clipping)
- 5 = H (hard clipping)
- 6 = P (padding)
- 7 = = (sequence match)
- 8 = X (sequence mismatch)
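
As a sketch of how the integer codes are used, the helper below tallies how many reference bases and how many read bases a CIGAR consumes (M/=/X consume both, I/S consume the read only, D/N consume the reference only):

```python
# Reference- and query-consuming CIGAR operations, by integer code
REF_CONSUMING = {0, 2, 3, 7, 8}    # M, D, N, =, X
QUERY_CONSUMING = {0, 1, 4, 7, 8}  # M, I, S, =, X

def cigar_lengths(cigartuples):
    """Return (reference_length, query_length) consumed by a CIGAR."""
    ref_len = sum(length for op, length in cigartuples if op in REF_CONSUMING)
    query_len = sum(length for op, length in cigartuples if op in QUERY_CONSUMING)
    return ref_len, query_len

# e.g. 5S30M2D10M -> [(4, 5), (0, 30), (2, 2), (0, 10)]
print(cigar_lengths([(4, 5), (0, 30), (2, 2), (0, 10)]))  # (42, 45)
```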

### Flags and Status

Boolean flag properties (a combined filtering sketch follows the list):
- `flag` - SAM flag as integer
- `is_paired` - Is read paired?
- `is_proper_pair` - Is read in a proper pair?
- `is_unmapped` - Is read unmapped?
- `mate_is_unmapped` - Is mate unmapped?
- `is_reverse` - Is read on reverse strand?
- `mate_is_reverse` - Is mate on reverse strand?
- `is_read1` - Is this read1?
- `is_read2` - Is this read2?
- `is_secondary` - Is secondary alignment?
- `is_qcfail` - Did read fail QC?
- `is_duplicate` - Is read a duplicate?
- `is_supplementary` - Is supplementary alignment?
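
These flags are typically combined to keep only primary, high-confidence alignments; a minimal sketch (the MAPQ threshold is illustrative):

```python
import pysam

with pysam.AlignmentFile("example.bam", "rb") as bam:
    for read in bam.fetch("chr1", 1000, 2000):
        # Skip unmapped, secondary/supplementary, duplicate, and low-MAPQ reads
        if (read.is_unmapped or read.is_secondary or read.is_supplementary
                or read.is_duplicate or read.mapping_quality < 20):
            continue
        print(read.query_name)
```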

### Tags and Optional Fields
- `get_tag(tag)` - Get value of optional field
- `set_tag(tag, value)` - Set optional field
- `has_tag(tag)` - Check if tag exists
- `get_tags()` - Get all tags as list of tuples

```python
for read in samfile.fetch("chr1", 1000, 2000):
    if read.has_tag("NM"):
        edit_distance = read.get_tag("NM")
        print(f"{read.query_name}: NM={edit_distance}")
```

## Writing Alignment Files

### Creating Header

```python
header = {
    'HD': {'VN': '1.0'},
    'SQ': [
        {'LN': 1575, 'SN': 'chr1'},
        {'LN': 1584, 'SN': 'chr2'}
    ]
}

outfile = pysam.AlignmentFile("output.bam", "wb", header=header)
```

### Creating AlignedSegment Objects

```python
# Create new read
a = pysam.AlignedSegment()
a.query_name = "read001"
a.query_sequence = "AGCTTAGCTAGCTACCTATATCTTGGTCTTGGCCG"
a.flag = 0
a.reference_id = 0  # Index into header['SQ']
a.reference_start = 100
a.mapping_quality = 20
a.cigar = [(0, 35)]  # 35M
a.query_qualities = pysam.qualitystring_to_array("IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII")

# Write to file
outfile.write(a)
```

### Converting Between Formats

```python
# BAM to SAM
infile = pysam.AlignmentFile("input.bam", "rb")
outfile = pysam.AlignmentFile("output.sam", "w", template=infile)
for read in infile:
    outfile.write(read)
infile.close()
outfile.close()
```

## Pileup Analysis

The `pileup()` method provides **column-wise** (position-by-position) analysis across a region:

```python
for pileupcolumn in samfile.pileup("chr1", 1000, 2000):
    print(f"Position {pileupcolumn.pos}: coverage = {pileupcolumn.nsegments}")

    for pileupread in pileupcolumn.pileups:
        if not pileupread.is_del and not pileupread.is_refskip:
            # Query position is the position in the read
            base = pileupread.alignment.query_sequence[pileupread.query_position]
            print(f"  {pileupread.alignment.query_name}: {base}")
```

**Key attributes:**
- `pileupcolumn.pos` - 0-based reference position
- `pileupcolumn.nsegments` - Number of reads covering position
- `pileupread.alignment` - The AlignedSegment object
- `pileupread.query_position` - Position in the read (None for deletions)
- `pileupread.is_del` - Is this a deletion?
- `pileupread.is_refskip` - Is this a reference skip (N in CIGAR)?

**Important:** Keep iterator references alive. The error "PileupProxy accessed after iterator finished" occurs when iterators go out of scope prematurely.
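
A hedged illustration of the safe pattern: copy the plain values you need while iterating, so nothing references pileup proxies after the iterator is exhausted (file and region are placeholders):

```python
import pysam

with pysam.AlignmentFile("example.bam", "rb") as bam:
    # Safe: extract plain ints during iteration instead of keeping
    # PileupColumn/PileupRead proxy objects past the loop
    depths = [(col.pos, col.nsegments)
              for col in bam.pileup("chr1", 1000, 2000)]

print(depths[:5])
```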

## Coordinate System

**Critical:** Pysam uses **0-based, half-open** coordinates (Python convention):
- `reference_start` is 0-based (first base is 0)
- `reference_end` is exclusive (not included in range)
- Region from 1000-2000 includes bases 1000-1999

**Exception:** Region strings in `fetch()` and `pileup()` follow samtools conventions (1-based):
```python
# These are equivalent:
samfile.fetch("chr1", 999, 2000)  # Python style: 0-based
samfile.fetch("chr1:1000-2000")   # samtools style: 1-based
```

## Indexing

Create BAM index:
```python
pysam.index("example.bam")
```

Or use command-line interface:
```python
pysam.samtools.index("example.bam")
```

## Performance Tips

1. **Use indexed access** when querying specific regions repeatedly
2. **Use `pileup()` for column-wise analysis** instead of repeated fetch operations
3. **Use `fetch(until_eof=True)` for sequential reading** of non-indexed files
4. **Avoid multiple iterators** unless necessary (performance cost)
5. **Use `count()` for simple counting** instead of iterating and counting manually

## Common Pitfalls

1. **Partial overlaps:** `fetch()` returns reads that overlap region boundaries—implement explicit filtering if exact boundaries are needed
2. **Quality score editing:** Cannot edit `query_qualities` in place after modifying `query_sequence`. Create a copy first: `quals = read.query_qualities`
3. **Missing index:** `fetch()` without `until_eof=True` requires an index file
4. **Thread safety:** While pysam releases the GIL during I/O, comprehensive thread safety has not been fully validated
5. **Iterator scope:** Keep pileup iterator references alive to avoid "PileupProxy accessed after iterator finished" errors

**skills/pysam/references/common_workflows.md**

# Common Bioinformatics Workflows with Pysam

## Overview

This document provides practical examples of common bioinformatics workflows using pysam, demonstrating how to combine different file types and operations.

## Quality Control Workflows

### Calculate BAM Statistics

```python
import pysam

def calculate_bam_stats(bam_file):
    """Calculate basic statistics for BAM file."""
    samfile = pysam.AlignmentFile(bam_file, "rb")

    stats = {
        "total_reads": 0,
        "mapped_reads": 0,
        "unmapped_reads": 0,
        "paired_reads": 0,
        "proper_pairs": 0,
        "duplicates": 0,
        "total_bases": 0,
        "mapped_bases": 0
    }

    for read in samfile.fetch(until_eof=True):
        stats["total_reads"] += 1

        if read.is_unmapped:
            stats["unmapped_reads"] += 1
        else:
            stats["mapped_reads"] += 1
            stats["mapped_bases"] += read.query_alignment_length

        if read.is_paired:
            stats["paired_reads"] += 1
            if read.is_proper_pair:
                stats["proper_pairs"] += 1

        if read.is_duplicate:
            stats["duplicates"] += 1

        stats["total_bases"] += read.query_length

    samfile.close()

    # Calculate derived statistics
    stats["mapping_rate"] = stats["mapped_reads"] / stats["total_reads"] if stats["total_reads"] > 0 else 0
    stats["duplication_rate"] = stats["duplicates"] / stats["total_reads"] if stats["total_reads"] > 0 else 0

    return stats
```

### Check Reference Consistency

```python
def check_bam_reference_consistency(bam_file, fasta_file):
    """Verify that BAM reads match reference genome."""
    samfile = pysam.AlignmentFile(bam_file, "rb")
    fasta = pysam.FastaFile(fasta_file)

    mismatches = 0
    total_checked = 0

    for read in samfile.fetch():
        if read.is_unmapped:
            continue

        # Get reference sequence for the aligned region
        ref_seq = fasta.fetch(
            read.reference_name,
            read.reference_start,
            read.reference_end
        )

        # Compare read bases to reference bases at matched positions
        for query_pos, ref_pos in read.get_aligned_pairs(matches_only=True):
            read_base = read.query_sequence[query_pos]
            ref_base = ref_seq[ref_pos - read.reference_start]
            if read_base.upper() != ref_base.upper():
                mismatches += 1
            total_checked += 1

        if total_checked >= 10000:  # Sample first ~10k positions
            break

    samfile.close()
    fasta.close()

    error_rate = mismatches / total_checked if total_checked > 0 else 0
    return {
        "positions_checked": total_checked,
        "mismatches": mismatches,
        "error_rate": error_rate
    }
```

## Coverage Analysis

### Calculate Per-Base Coverage

```python
def calculate_coverage(bam_file, chrom, start, end):
    """Calculate coverage for each position in region."""
    samfile = pysam.AlignmentFile(bam_file, "rb")

    # Initialize coverage array
    length = end - start
    coverage = [0] * length

    # Count coverage at each position
    for pileupcolumn in samfile.pileup(chrom, start, end):
        if start <= pileupcolumn.pos < end:
            coverage[pileupcolumn.pos - start] = pileupcolumn.nsegments

    samfile.close()

    return coverage
```

### Identify Low Coverage Regions

```python
def find_low_coverage_regions(bam_file, chrom, start, end, min_coverage=10):
    """Find regions with coverage below threshold."""
    samfile = pysam.AlignmentFile(bam_file, "rb")

    low_coverage_regions = []
    in_low_region = False
    region_start = None

    for pileupcolumn in samfile.pileup(chrom, start, end):
        pos = pileupcolumn.pos
        if pos < start or pos >= end:
            continue

        coverage = pileupcolumn.nsegments

        if coverage < min_coverage:
            if not in_low_region:
                region_start = pos
                in_low_region = True
        else:
            if in_low_region:
                low_coverage_regions.append((region_start, pos))
                in_low_region = False

    # Close last region if still open
    if in_low_region:
        low_coverage_regions.append((region_start, end))

    samfile.close()

    return low_coverage_regions
```

### Calculate Coverage Statistics

```python
def coverage_statistics(bam_file, chrom, start, end):
    """Calculate coverage statistics for region."""
    samfile = pysam.AlignmentFile(bam_file, "rb")

    coverages = []

    for pileupcolumn in samfile.pileup(chrom, start, end):
        if start <= pileupcolumn.pos < end:
            coverages.append(pileupcolumn.nsegments)

    samfile.close()

    if not coverages:
        return None

    coverages.sort()
    n = len(coverages)

    return {
        "mean": sum(coverages) / n,
        "median": coverages[n // 2],
        "min": coverages[0],
        "max": coverages[-1],
        "positions": n
    }
```

## Variant Analysis

### Extract Variants in Regions

```python
def extract_variants_in_genes(vcf_file, bed_file):
    """Extract variants overlapping gene regions."""
    vcf = pysam.VariantFile(vcf_file)
    bed = pysam.TabixFile(bed_file)

    variants_by_gene = {}

    for gene in bed.fetch(parser=pysam.asBed()):
        gene_name = gene.name
        variants_by_gene[gene_name] = []

        # Find variants in gene region
        for variant in vcf.fetch(gene.contig, gene.start, gene.end):
            variant_info = {
                "chrom": variant.chrom,
                "pos": variant.pos,
                "ref": variant.ref,
                "alt": variant.alts,
                "qual": variant.qual
            }
            variants_by_gene[gene_name].append(variant_info)

    vcf.close()
    bed.close()

    return variants_by_gene
```

### Annotate Variants with Coverage

```python
def annotate_variants_with_coverage(vcf_file, bam_file, output_file):
    """Add coverage information to variants."""
    vcf = pysam.VariantFile(vcf_file)
    samfile = pysam.AlignmentFile(bam_file, "rb")

    # Add DP to header if not present
    if "DP" not in vcf.header.info:
        vcf.header.info.add("DP", "1", "Integer", "Total Depth from BAM")

    outvcf = pysam.VariantFile(output_file, "w", header=vcf.header)

    for variant in vcf:
        # Get coverage at variant position
        coverage = samfile.count(
            variant.chrom,
            variant.pos - 1,  # Convert to 0-based
            variant.pos
        )

        # Add to INFO field
        variant.info["DP"] = coverage

        outvcf.write(variant)

    vcf.close()
    samfile.close()
    outvcf.close()
```

### Filter Variants by Read Support

```python
def filter_variants_by_support(vcf_file, bam_file, output_file, min_alt_reads=3):
    """Filter variants requiring minimum alternate allele support."""
    vcf = pysam.VariantFile(vcf_file)
    samfile = pysam.AlignmentFile(bam_file, "rb")
    outvcf = pysam.VariantFile(output_file, "w", header=vcf.header)

    for variant in vcf:
        # Count reads supporting each allele
        allele_counts = {variant.ref: 0}
        for alt in variant.alts:
            allele_counts[alt] = 0

        # Pileup at variant position
        for pileupcolumn in samfile.pileup(
            variant.chrom,
            variant.pos - 1,
            variant.pos
        ):
            if pileupcolumn.pos == variant.pos - 1:  # 0-based
                for pileupread in pileupcolumn.pileups:
                    if not pileupread.is_del and not pileupread.is_refskip:
                        base = pileupread.alignment.query_sequence[
                            pileupread.query_position
                        ]
                        if base in allele_counts:
                            allele_counts[base] += 1

        # Check if any alt allele has sufficient support
        has_support = any(
            allele_counts.get(alt, 0) >= min_alt_reads
            for alt in variant.alts
        )

        if has_support:
            outvcf.write(variant)

    vcf.close()
    samfile.close()
    outvcf.close()
```

## Sequence Extraction

### Extract Sequences Around Variants

```python
def extract_variant_contexts(vcf_file, fasta_file, output_file, window=50):
    """Extract reference sequences around variants."""
    vcf = pysam.VariantFile(vcf_file)
    fasta = pysam.FastaFile(fasta_file)

    with open(output_file, 'w') as out:
        for variant in vcf:
            # Get sequence context
            start = max(0, variant.pos - window - 1)  # Convert to 0-based
            end = variant.pos + window

            context = fasta.fetch(variant.chrom, start, end)

            # Mark variant position
            var_pos_in_context = variant.pos - 1 - start

            out.write(f">{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}\n")
            out.write(context[:var_pos_in_context].lower())
            out.write(context[var_pos_in_context:var_pos_in_context+len(variant.ref)].upper())
            out.write(context[var_pos_in_context+len(variant.ref):].lower())
            out.write("\n")

    vcf.close()
    fasta.close()
```

### Extract Gene Sequences

```python
def extract_gene_sequences(bed_file, fasta_file, output_fasta):
    """Extract sequences for genes from BED file."""
    bed = pysam.TabixFile(bed_file)
    fasta = pysam.FastaFile(fasta_file)

    with open(output_fasta, 'w') as out:
        for gene in bed.fetch(parser=pysam.asBed()):
            sequence = fasta.fetch(gene.contig, gene.start, gene.end)

            # Handle strand
            if hasattr(gene, 'strand') and gene.strand == '-':
                # Reverse complement
                complement = str.maketrans("ATGCatgcNn", "TACGtacgNn")
                sequence = sequence.translate(complement)[::-1]

            out.write(f">{gene.name} {gene.contig}:{gene.start}-{gene.end}\n")

            # Write sequence in 60-character lines
            for i in range(0, len(sequence), 60):
                out.write(sequence[i:i+60] + "\n")

    bed.close()
    fasta.close()
```

## Read Filtering and Subsetting

### Filter BAM by Region and Quality

```python
def filter_bam(input_bam, output_bam, chrom, start, end, min_mapq=20):
    """Filter BAM file by region and mapping quality."""
    infile = pysam.AlignmentFile(input_bam, "rb")
    outfile = pysam.AlignmentFile(output_bam, "wb", template=infile)

    for read in infile.fetch(chrom, start, end):
        if read.mapping_quality >= min_mapq and not read.is_duplicate:
            outfile.write(read)

    infile.close()
    outfile.close()

    # Create index
    pysam.index(output_bam)
```

### Extract Reads for Specific Variants

```python
def extract_reads_at_variants(bam_file, vcf_file, output_bam, window=100):
    """Extract reads overlapping variant positions."""
    samfile = pysam.AlignmentFile(bam_file, "rb")
    vcf = pysam.VariantFile(vcf_file)
    outfile = pysam.AlignmentFile(output_bam, "wb", template=samfile)

    # Collect all reads (using set to avoid duplicates)
    reads_to_keep = set()

    for variant in vcf:
        start = max(0, variant.pos - window - 1)
        end = variant.pos + window

        for read in samfile.fetch(variant.chrom, start, end):
            reads_to_keep.add(read.query_name)

    # Write all reads
    samfile.close()
    samfile = pysam.AlignmentFile(bam_file, "rb")

    for read in samfile.fetch(until_eof=True):
        if read.query_name in reads_to_keep:
            outfile.write(read)

    samfile.close()
    vcf.close()
    outfile.close()

    pysam.index(output_bam)
```

## Integration Workflows

### Create Coverage Track from BAM

```python
def create_coverage_bedgraph(bam_file, output_file, chrom=None):
    """Create bedGraph coverage track from BAM."""
    samfile = pysam.AlignmentFile(bam_file, "rb")

    chroms = [chrom] if chrom else samfile.references

    with open(output_file, 'w') as out:
        out.write("track type=bedGraph name=\"Coverage\"\n")

        for chrom in chroms:
            current_cov = None
            region_start = None

            # Note: pileup() only reports covered positions, so regions
            # written here describe runs of equal nonzero coverage
            for pileupcolumn in samfile.pileup(chrom):
                pos = pileupcolumn.pos
                cov = pileupcolumn.nsegments

                if cov != current_cov:
                    # Write previous region
                    if current_cov is not None:
                        out.write(f"{chrom}\t{region_start}\t{pos}\t{current_cov}\n")

                    # Start new region
                    current_cov = cov
                    region_start = pos

            # Write final region
            if current_cov is not None:
                out.write(f"{chrom}\t{region_start}\t{pos+1}\t{current_cov}\n")

    samfile.close()
```

### Merge Multiple VCF Files

```python
def merge_vcf_samples(vcf_files, output_file):
    """Merge multiple single-sample VCFs."""
    # Open all input files
    vcf_readers = [pysam.VariantFile(f) for f in vcf_files]

    # Create merged header
    merged_header = vcf_readers[0].header.copy()
    for vcf in vcf_readers[1:]:
        for sample in vcf.header.samples:
            merged_header.samples.add(sample)

    outvcf = pysam.VariantFile(output_file, "w", header=merged_header)

    # Get all variant positions
    all_variants = {}
    for vcf in vcf_readers:
        for variant in vcf:
            key = (variant.chrom, variant.pos, variant.ref, variant.alts)
            if key not in all_variants:
                all_variants[key] = []
            all_variants[key].append(variant)

    # Write merged variants
    for key, variants in sorted(all_variants.items()):
        # Create merged record from first variant
        merged = outvcf.new_record(
            contig=variants[0].chrom,
            start=variants[0].start,
            stop=variants[0].stop,
            alleles=variants[0].alleles
        )

        # Add genotypes from all samples
        for variant in variants:
            for sample in variant.samples:
                merged.samples[sample].update(variant.samples[sample])

        outvcf.write(merged)

    # Close all files
    for vcf in vcf_readers:
        vcf.close()
    outvcf.close()
```

## Performance Tips for Workflows

1. **Use indexed files** for all random access operations
2. **Process regions in parallel** when analyzing multiple independent regions (see the sketch after this list)
3. **Stream data when possible** - avoid loading entire files into memory
4. **Close files explicitly** to free resources
5. **Use `until_eof=True`** for sequential processing of entire files
6. **Batch operations** on the same file to minimize I/O
7. **Consider memory usage** with pileup operations on high-coverage regions
8. **Use count() instead of pileup()** when only counts are needed
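
For tip 2, a hedged sketch of per-chromosome parallelism with the standard library; each worker opens its own `AlignmentFile`, since open handles should not be shared across processes (the BAM path is a placeholder and must be indexed):

```python
import pysam
from multiprocessing import Pool

BAM = "example.bam"  # placeholder path; an index must exist

def count_chrom(chrom):
    # Each worker process opens its own file handle
    with pysam.AlignmentFile(BAM, "rb") as bam:
        return chrom, bam.count(chrom)

if __name__ == "__main__":
    with pysam.AlignmentFile(BAM, "rb") as bam:
        chroms = list(bam.references)
    with Pool(4) as pool:
        for chrom, n in pool.map(count_chrom, chroms):
            print(chrom, n)
```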

## Common Integration Patterns

1. **BAM + Reference**: Verify alignments, extract aligned sequences
2. **BAM + VCF**: Validate variants, calculate allele frequencies
3. **VCF + BED**: Annotate variants with gene/region information
4. **BAM + BED**: Calculate coverage statistics for specific regions
5. **FASTA + VCF**: Extract variant context sequences
6. **Multiple BAMs**: Compare coverage or variants across samples
7. **BAM + FASTQ**: Extract unaligned reads for re-alignment

**skills/pysam/references/sequence_files.md**

# Working with Sequence Files (FASTA/FASTQ)

## FASTA Files

### Overview

Pysam provides the `FastaFile` class for indexed, random access to FASTA reference sequences. FASTA files must be indexed with `samtools faidx` before use.

### Opening FASTA Files

```python
import pysam

# Open indexed FASTA file
fasta = pysam.FastaFile("reference.fasta")

# Automatically looks for reference.fasta.fai index
```

### Creating FASTA Index

```python
# Create index using pysam
pysam.faidx("reference.fasta")

# Or using samtools command
pysam.samtools.faidx("reference.fasta")
```

This creates a `.fai` index file required for random access.

### FastaFile Properties

```python
fasta = pysam.FastaFile("reference.fasta")

# List of reference sequences
references = fasta.references
print(f"References: {references}")

# Get lengths
lengths = fasta.lengths
print(f"Lengths: {lengths}")

# Get specific sequence length
chr1_length = fasta.get_reference_length("chr1")
```

### Fetching Sequences

#### Fetch by Region

Uses **0-based, half-open** coordinates:

```python
# Fetch specific region
sequence = fasta.fetch("chr1", 1000, 2000)
print(f"Sequence: {sequence}")  # Returns 1000 bases

# Fetch entire chromosome
chr1_seq = fasta.fetch("chr1")

# Fetch using region string (1-based)
sequence = fasta.fetch(region="chr1:1001-2000")
```

**Important:** Numeric arguments use 0-based coordinates, region strings use 1-based coordinates (samtools convention).

#### Common Use Cases

```python
# Get sequence at variant position
def get_variant_context(fasta, chrom, pos, window=10):
    """Get sequence context around a variant position (1-based)."""
    start = max(0, pos - window - 1)  # Convert to 0-based
    end = pos + window
    return fasta.fetch(chrom, start, end)

# Get sequence for gene coordinates
def get_gene_sequence(fasta, chrom, start, end, strand):
    """Get gene sequence with strand awareness."""
    seq = fasta.fetch(chrom, start, end)

    if strand == "-":
        # Reverse complement
        complement = str.maketrans("ATGCatgc", "TACGtacg")
        seq = seq.translate(complement)[::-1]

    return seq

# Check reference allele
def check_ref_allele(fasta, chrom, pos, expected_ref):
    """Verify reference allele at position (1-based pos)."""
    actual = fasta.fetch(chrom, pos-1, pos)  # Convert to 0-based
    return actual.upper() == expected_ref.upper()
```

### Extracting Multiple Regions

```python
# Extract multiple regions efficiently
regions = [
    ("chr1", 1000, 2000),
    ("chr1", 5000, 6000),
    ("chr2", 10000, 11000)
]

sequences = {}
for chrom, start, end in regions:
    seq_id = f"{chrom}:{start}-{end}"
    sequences[seq_id] = fasta.fetch(chrom, start, end)
```

### Working with Ambiguous Bases

FASTA files may contain IUPAC ambiguity codes:

- N = any base
- R = A or G (purine)
- Y = C or T (pyrimidine)
- S = G or C (strong)
- W = A or T (weak)
- K = G or T (keto)
- M = A or C (amino)
- B = C, G, or T (not A)
- D = A, G, or T (not C)
- H = A, C, or T (not G)
- V = A, C, or G (not T)

```python
# Handle ambiguous bases
def count_ambiguous(sequence):
    """Count non-ATGC bases."""
    return sum(1 for base in sequence.upper() if base not in "ATGC")

# Skip regions with too many Ns
def has_quality_sequence(fasta, chrom, start, end, max_n_frac=0.1):
    """Check if region has acceptable N content."""
    seq = fasta.fetch(chrom, start, end)
    n_count = seq.upper().count('N')
    return (n_count / len(seq)) <= max_n_frac
```

## FASTQ Files

### Overview

Pysam provides `FastxFile` (or `FastqFile`) for reading FASTQ files containing raw sequencing reads with quality scores. FASTQ files do not support random access—only sequential reading.

### Opening FASTQ Files

```python
import pysam

# Open FASTQ file
fastq = pysam.FastxFile("reads.fastq")

# Works with compressed files
fastq_gz = pysam.FastxFile("reads.fastq.gz")
```

### Reading FASTQ Records

```python
fastq = pysam.FastxFile("reads.fastq")

for read in fastq:
    print(f"Name: {read.name}")
    print(f"Sequence: {read.sequence}")
    print(f"Quality: {read.quality}")
    print(f"Comment: {read.comment}")  # Optional header comment
```

**FastqProxy attributes:**
- `name` - Read identifier (without @ prefix)
- `sequence` - DNA/RNA sequence
- `quality` - ASCII-encoded quality string
- `comment` - Optional comment from header line
- `get_quality_array()` - Convert quality string to numeric array

### Quality Score Conversion

```python
# Convert quality string to numeric values
for read in fastq:
    qual_array = read.get_quality_array()
    mean_quality = sum(qual_array) / len(qual_array)
    print(f"{read.name}: mean Q = {mean_quality:.1f}")
```

Quality scores are Phred-scaled (typically Phred+33 encoding; a conversion sketch follows this list):
- Q = -10 * log10(P_error)
- ASCII 33 ('!') = Q0
- ASCII 43 ('+') = Q10
- ASCII 63 ('?') = Q30
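
A small sketch of the arithmetic behind these anchor points, assuming Phred+33:

```python
# Phred+33: score = ord(char) - 33 and char = chr(score + 33)
def char_to_q(c):
    return ord(c) - 33

def q_to_char(q):
    return chr(q + 33)

assert char_to_q("!") == 0   # ASCII 33
assert char_to_q("+") == 10  # ASCII 43
assert char_to_q("?") == 30  # ASCII 63
assert q_to_char(30) == "?"
```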

### Common FASTQ Processing Workflows

#### Quality Filtering

```python
def filter_by_quality(input_fastq, output_fastq, min_mean_quality=20):
    """Filter reads by mean quality score."""
    with pysam.FastxFile(input_fastq) as infile:
        with open(output_fastq, 'w') as outfile:
            for read in infile:
                qual_array = read.get_quality_array()
                mean_q = sum(qual_array) / len(qual_array)

                if mean_q >= min_mean_quality:
                    # Write in FASTQ format
                    outfile.write(f"@{read.name}\n")
                    outfile.write(f"{read.sequence}\n")
                    outfile.write("+\n")
                    outfile.write(f"{read.quality}\n")
```

#### Length Filtering

```python
def filter_by_length(input_fastq, output_fastq, min_length=50):
    """Filter reads by minimum length."""
    with pysam.FastxFile(input_fastq) as infile:
        with open(output_fastq, 'w') as outfile:
            kept = 0
            for read in infile:
                if len(read.sequence) >= min_length:
                    outfile.write(f"@{read.name}\n")
                    outfile.write(f"{read.sequence}\n")
                    outfile.write("+\n")
                    outfile.write(f"{read.quality}\n")
                    kept += 1
            print(f"Kept {kept} reads")
```

#### Calculate Quality Statistics

```python
def calculate_fastq_stats(fastq_file):
    """Calculate basic statistics for FASTQ file."""
    total_reads = 0
    total_bases = 0
    quality_sum = 0

    with pysam.FastxFile(fastq_file) as fastq:
        for read in fastq:
            total_reads += 1
            read_length = len(read.sequence)
            total_bases += read_length

            qual_array = read.get_quality_array()
            quality_sum += sum(qual_array)

    return {
        "total_reads": total_reads,
        "total_bases": total_bases,
        "mean_read_length": total_bases / total_reads if total_reads > 0 else 0,
        "mean_quality": quality_sum / total_bases if total_bases > 0 else 0
    }
```

#### Extract Reads by Name

```python
def extract_reads_by_name(fastq_file, read_names, output_file):
    """Extract specific reads by name."""
    read_set = set(read_names)

    with pysam.FastxFile(fastq_file) as infile:
        with open(output_file, 'w') as outfile:
            for read in infile:
                if read.name in read_set:
                    outfile.write(f"@{read.name}\n")
                    outfile.write(f"{read.sequence}\n")
                    outfile.write("+\n")
                    outfile.write(f"{read.quality}\n")
```

#### Convert FASTQ to FASTA

```python
def fastq_to_fasta(fastq_file, fasta_file):
    """Convert FASTQ to FASTA (discards quality scores)."""
    with pysam.FastxFile(fastq_file) as infile:
        with open(fasta_file, 'w') as outfile:
            for read in infile:
                outfile.write(f">{read.name}\n")
                outfile.write(f"{read.sequence}\n")
```

#### Subsample FASTQ

```python
import random

def subsample_fastq(input_fastq, output_fastq, fraction=0.1, seed=42):
    """Randomly subsample reads from FASTQ file."""
    random.seed(seed)

    with pysam.FastxFile(input_fastq) as infile:
        with open(output_fastq, 'w') as outfile:
            for read in infile:
                if random.random() < fraction:
                    outfile.write(f"@{read.name}\n")
                    outfile.write(f"{read.sequence}\n")
                    outfile.write("+\n")
                    outfile.write(f"{read.quality}\n")
```

## Tabix-Indexed Files

### Overview

Pysam provides `TabixFile` for accessing tabix-indexed genomic data files (BED, GFF, GTF, generic tab-delimited).

### Opening Tabix Files

```python
import pysam

# Open tabix-indexed file
tabix = pysam.TabixFile("annotations.bed.gz")

# File must be bgzip-compressed and tabix-indexed
```

### Creating Tabix Index

```python
# Index a file
pysam.tabix_index("annotations.bed", preset="bed", force=True)
# Creates annotations.bed.gz and annotations.bed.gz.tbi

# Presets available: bed, gff, vcf
```

### Fetching Records

```python
tabix = pysam.TabixFile("annotations.bed.gz")

# Fetch region
for row in tabix.fetch("chr1", 1000000, 2000000):
    print(row)  # Returns tab-delimited string

# Parse with specific parser
for row in tabix.fetch("chr1", 1000000, 2000000, parser=pysam.asBed()):
    print(f"Interval: {row.contig}:{row.start}-{row.end}")

# Available parsers: asBed(), asGTF(), asVCF(), asTuple()
```

### Working with BED Files

```python
bed = pysam.TabixFile("regions.bed.gz")

# Access BED fields by name
for interval in bed.fetch("chr1", 1000000, 2000000, parser=pysam.asBed()):
    print(f"Region: {interval.contig}:{interval.start}-{interval.end}")
    print(f"Name: {interval.name}")
    print(f"Score: {interval.score}")
    print(f"Strand: {interval.strand}")
```

### Working with GTF/GFF Files

```python
gtf = pysam.TabixFile("annotations.gtf.gz")

# Access GTF fields
for feature in gtf.fetch("chr1", 1000000, 2000000, parser=pysam.asGTF()):
    print(f"Feature: {feature.feature}")
    print(f"Gene: {feature.gene_id}")
    print(f"Transcript: {feature.transcript_id}")
    print(f"Coordinates: {feature.start}-{feature.end}")
```

## Performance Tips

### FASTA
1. **Always use indexed FASTA** files (create .fai with samtools faidx)
2. **Batch fetch operations** when extracting multiple regions
3. **Cache frequently accessed sequences** in memory
4. **Use appropriate window sizes** to avoid loading excessive sequence data

### FASTQ
1. **Stream processing** - FASTQ files are read sequentially, process on-the-fly
2. **Use compressed FASTQ.gz** to save disk space (pysam handles transparently)
3. **Avoid loading entire file** into memory—process read-by-read
4. **For large files**, consider parallel processing with file splitting

### Tabix
1. **Always bgzip and tabix-index** files before region queries
2. **Use appropriate presets** when creating indices
3. **Specify parser** for named field access
4. **Batch queries** to same file to avoid re-opening

## Common Pitfalls

1. **FASTA coordinate system:** fetch() uses 0-based coordinates, region strings use 1-based
2. **Missing index:** FASTA random access requires .fai index file
3. **FASTQ sequential only:** Cannot do random access or region-based queries on FASTQ
4. **Quality encoding:** Assume Phred+33 unless specified otherwise
5. **Tabix compression:** Must use bgzip, not regular gzip, for tabix indexing
6. **Parser requirement:** TabixFile needs explicit parser for named field access
7. **Case sensitivity:** FASTA sequences preserve case—use .upper() or .lower() for consistent comparisons

**skills/pysam/references/variant_files.md**

# Working with Variant Files (VCF/BCF)

## Overview

Pysam provides the `VariantFile` class for reading and writing VCF (Variant Call Format) and BCF (binary VCF) files. These files contain information about genetic variants, including SNPs, indels, and structural variants.

## Opening Variant Files

```python
import pysam

# Reading VCF
vcf = pysam.VariantFile("example.vcf")

# Reading BCF (binary, compressed)
bcf = pysam.VariantFile("example.bcf")

# Reading compressed VCF
vcf_gz = pysam.VariantFile("example.vcf.gz")

# Writing
outvcf = pysam.VariantFile("output.vcf", "w", header=vcf.header)
```

## VariantFile Properties

**Header Information:**
- `header` - Complete VCF header with metadata
- `header.contigs` - Dictionary of contigs/chromosomes
- `header.samples` - List of sample names
- `header.filters` - Dictionary of FILTER definitions
- `header.info` - Dictionary of INFO field definitions
- `header.formats` - Dictionary of FORMAT field definitions

```python
vcf = pysam.VariantFile("example.vcf")

# List samples
print(f"Samples: {list(vcf.header.samples)}")

# List contigs
for contig in vcf.header.contigs:
    print(f"{contig}: length={vcf.header.contigs[contig].length}")

# List INFO fields
for info in vcf.header.info:
    print(f"{info}: {vcf.header.info[info].description}")
```

## Reading Variant Records

### Iterate All Variants

```python
for variant in vcf:
    print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}")
```

### Fetch Specific Region

Requires tabix index (.tbi) for VCF.gz or index for BCF:

```python
# Fetch variants in region (numeric arguments are 0-based, half-open)
for variant in vcf.fetch("chr1", 1000000, 2000000):
    print(f"{variant.chrom}:{variant.pos} {variant.id}")

# Using region string (1-based, samtools/VCF convention)
for variant in vcf.fetch("chr1:1000000-2000000"):
    print(variant.pos)
```

**Note:** As elsewhere in pysam, numeric `start`/`stop` arguments to `fetch()` are **0-based, half-open**; only region strings and the `pos` attribute follow the 1-based VCF convention.

## VariantRecord Objects

Each variant is represented as a `VariantRecord` object:

### Position Information
- `chrom` - Chromosome/contig name
- `pos` - Position (1-based)
- `start` - Start position (0-based)
- `stop` - Stop position (0-based, exclusive)
- `id` - Variant ID (e.g., rsID)

### Allele Information
- `ref` - Reference allele
- `alts` - Tuple of alternate alleles
- `alleles` - Tuple of all alleles (ref + alts)

### Quality and Filtering
- `qual` - Quality score (QUAL field)
- `filter` - Filter status

### INFO Fields

Access INFO fields as dictionary:

```python
for variant in vcf:
    # Check if field exists
    if "DP" in variant.info:
        depth = variant.info["DP"]
        print(f"Depth: {depth}")

    # Get all INFO keys
    print(f"INFO fields: {variant.info.keys()}")

    # Access specific fields
    if "AF" in variant.info:
        allele_freq = variant.info["AF"]
        print(f"Allele frequency: {allele_freq}")
```

### Sample Genotype Data

Access sample data through `samples` dictionary:

```python
for variant in vcf:
    for sample_name in variant.samples:
        sample = variant.samples[sample_name]

        # Genotype (GT field)
        gt = sample["GT"]
        print(f"{sample_name} genotype: {gt}")

        # Other FORMAT fields
        if "DP" in sample:
            print(f"{sample_name} depth: {sample['DP']}")
        if "GQ" in sample:
            print(f"{sample_name} quality: {sample['GQ']}")

        # Alleles for this genotype
        alleles = sample.alleles
        print(f"{sample_name} alleles: {alleles}")

        # Phasing
        if sample.phased:
            print(f"{sample_name} is phased")
```

**Genotype representation** (a classification sketch follows this list):
- `(0, 0)` - Homozygous reference
- `(0, 1)` - Heterozygous
- `(1, 1)` - Homozygous alternate
- `(None, None)` - Missing genotype
- Phased: `(0|1)` vs unphased: `(0/1)`
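
A small sketch classifying calls using these tuple conventions:

```python
def classify_genotype(gt):
    """Map a GT tuple to a coarse category."""
    if gt is None or any(allele is None for allele in gt):
        return "missing"
    if all(allele == 0 for allele in gt):
        return "hom_ref"
    if len(set(gt)) == 1:
        return "hom_alt"
    return "het"

print(classify_genotype((0, 1)))        # het
print(classify_genotype((1, 1)))        # hom_alt
print(classify_genotype((None, None)))  # missing
```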

## Writing Variant Files

### Creating Header

```python
header = pysam.VariantHeader()

# Add contigs
header.contigs.add("chr1", length=248956422)
header.contigs.add("chr2", length=242193529)

# Add INFO fields
header.add_line('##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">')
header.add_line('##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">')

# Add FORMAT fields
header.add_line('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">')
header.add_line('##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">')

# Add samples
header.add_sample("sample1")
header.add_sample("sample2")

# Create output file
outvcf = pysam.VariantFile("output.vcf", "w", header=header)
```

### Creating Variant Records

```python
# Create new variant
record = outvcf.new_record()
record.chrom = "chr1"
record.pos = 100000
record.id = "rs123456"
record.ref = "A"
record.alts = ("G",)
record.qual = 30
record.filter.add("PASS")

# Set INFO fields
record.info["DP"] = 100
record.info["AF"] = (0.25,)

# Set genotype data
record.samples["sample1"]["GT"] = (0, 1)
record.samples["sample1"]["DP"] = 50
record.samples["sample2"]["GT"] = (0, 0)
record.samples["sample2"]["DP"] = 50

# Write to file
outvcf.write(record)
```

## Filtering Variants

### Basic Filtering

```python
# Filter by quality
for variant in vcf:
    if variant.qual >= 30:
        print(f"High quality variant: {variant.chrom}:{variant.pos}")

# Filter by depth
for variant in vcf:
    if "DP" in variant.info and variant.info["DP"] >= 20:
        print(f"High depth variant: {variant.chrom}:{variant.pos}")

# Filter by allele frequency
for variant in vcf:
    if "AF" in variant.info:
        for af in variant.info["AF"]:
            if af >= 0.01:
                print(f"Common variant: {variant.chrom}:{variant.pos}")
```

### Filtering by Genotype

```python
# Find variants where sample has alternate allele
for variant in vcf:
    sample = variant.samples["sample1"]
    gt = sample["GT"]

    # Check if has alternate allele (None marks a missing allele)
    if gt and any(allele is not None and allele > 0 for allele in gt):
        print(f"Sample has alt allele: {variant.chrom}:{variant.pos}")

    # Check if homozygous alternate
    if gt == (1, 1):
        print(f"Homozygous alt: {variant.chrom}:{variant.pos}")
```

### Filter Field

```python
# Check FILTER status
for variant in vcf:
    if "PASS" in variant.filter or len(variant.filter) == 0:
        print(f"Passed filters: {variant.chrom}:{variant.pos}")
    else:
        print(f"Failed: {variant.filter.keys()}")
```

## Indexing VCF Files

Create tabix index for compressed VCF:

```python
# Compress and index
pysam.tabix_index("example.vcf", preset="vcf", force=True)
# Creates example.vcf.gz and example.vcf.gz.tbi
```

Or use bcftools for BCF:

```python
pysam.bcftools.index("example.bcf")
```

## Common Workflows

### Extract Variants for Specific Samples

```python
invcf = pysam.VariantFile("input.vcf")
samples_to_keep = ["sample1", "sample3"]

# Create new header with subset of samples
new_header = invcf.header.copy()
new_header.samples.clear()
for sample in samples_to_keep:
    new_header.samples.add(sample)

outvcf = pysam.VariantFile("output.vcf", "w", header=new_header)

for variant in invcf:
    # Create new record
    new_record = outvcf.new_record(
        contig=variant.chrom,
        start=variant.start,
        stop=variant.stop,
        alleles=variant.alleles,
        id=variant.id,
        qual=variant.qual,
        filter=variant.filter,
        info=variant.info
    )

    # Copy genotype data for selected samples
    for sample in samples_to_keep:
        new_record.samples[sample].update(variant.samples[sample])

    outvcf.write(new_record)
```

### Calculate Allele Frequencies

```python
vcf = pysam.VariantFile("example.vcf")

for variant in vcf:
    total_alleles = 0
    alt_alleles = 0

    for sample_name in variant.samples:
        gt = variant.samples[sample_name]["GT"]
        if gt and None not in gt:
            total_alleles += len(gt)  # 2 for a diploid call
            alt_alleles += sum(1 for allele in gt if allele > 0)

    if total_alleles > 0:
        af = alt_alleles / total_alleles
        print(f"{variant.chrom}:{variant.pos} AF={af:.4f}")
```

### Convert VCF to Summary Table

```python
import csv

vcf = pysam.VariantFile("example.vcf")

with open("variants.csv", "w", newline="") as csvfile:
    writer = csv.writer(csvfile)
    writer.writerow(["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "DP"])

    for variant in vcf:
        writer.writerow([
            variant.chrom,
            variant.pos,
            variant.id or ".",
            variant.ref,
            ",".join(variant.alts) if variant.alts else ".",
            variant.qual or ".",
            variant.info.get("DP", ".")
        ])
```

## Performance Tips

1. **Use BCF format** for better compression and faster access than VCF
2. **Index files** with tabix for efficient region queries
3. **Filter early** to reduce processing of irrelevant variants
4. **Use INFO fields efficiently** - check existence before accessing
5. **Batch write operations** when creating VCF files

## Common Pitfalls

1. **Coordinate systems:** VCF uses 1-based coordinates, but VariantRecord.start is 0-based
2. **Missing data:** Always check if INFO/FORMAT fields exist before accessing
3. **Genotype tuples:** Genotypes are tuples, not lists—handle None values for missing data
4. **Allele indexing:** In genotype (0, 1), 0=REF, 1=first ALT, 2=second ALT, etc.
5. **Index requirement:** Region-based `fetch()` requires tabix index for VCF.gz
6. **Header modification:** When subsetting samples, properly update header and copy FORMAT fields