Files
2025-11-30 08:30:10 +08:00

366 lines
9.3 KiB
Markdown

# Working with Variant Files (VCF/BCF)
## Overview
Pysam provides the `VariantFile` class for reading and writing VCF (Variant Call Format) and BCF (binary VCF) files. These files contain information about genetic variants, including SNPs, indels, and structural variants.
## Opening Variant Files
```python
import pysam
# Reading VCF
vcf = pysam.VariantFile("example.vcf")
# Reading BCF (binary, compressed)
bcf = pysam.VariantFile("example.bcf")
# Reading compressed VCF
vcf_gz = pysam.VariantFile("example.vcf.gz")
# Writing
outvcf = pysam.VariantFile("output.vcf", "w", header=vcf.header)
```
## VariantFile Properties
**Header Information:**
- `header` - Complete VCF header with metadata
- `header.contigs` - Dictionary of contigs/chromosomes
- `header.samples` - List of sample names
- `header.filters` - Dictionary of FILTER definitions
- `header.info` - Dictionary of INFO field definitions
- `header.formats` - Dictionary of FORMAT field definitions
```python
vcf = pysam.VariantFile("example.vcf")
# List samples
print(f"Samples: {list(vcf.header.samples)}")
# List contigs
for contig in vcf.header.contigs:
print(f"{contig}: length={vcf.header.contigs[contig].length}")
# List INFO fields
for info in vcf.header.info:
print(f"{info}: {vcf.header.info[info].description}")
```
## Reading Variant Records
### Iterate All Variants
```python
for variant in vcf:
print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}")
```
### Fetch Specific Region
Requires tabix index (.tbi) for VCF.gz or index for BCF:
```python
# Fetch variants in region (1-based coordinates for region string)
for variant in vcf.fetch("chr1", 1000000, 2000000):
print(f"{variant.chrom}:{variant.pos} {variant.id}")
# Using region string (1-based)
for variant in vcf.fetch("chr1:1000000-2000000"):
print(variant.pos)
```
**Note:** Uses **1-based coordinates** in `fetch()` calls to match VCF specification.
## VariantRecord Objects
Each variant is represented as a `VariantRecord` object:
### Position Information
- `chrom` - Chromosome/contig name
- `pos` - Position (1-based)
- `start` - Start position (0-based)
- `stop` - Stop position (0-based, exclusive)
- `id` - Variant ID (e.g., rsID)
### Allele Information
- `ref` - Reference allele
- `alts` - Tuple of alternate alleles
- `alleles` - Tuple of all alleles (ref + alts)
### Quality and Filtering
- `qual` - Quality score (QUAL field)
- `filter` - Filter status
### INFO Fields
Access INFO fields as dictionary:
```python
for variant in vcf:
# Check if field exists
if "DP" in variant.info:
depth = variant.info["DP"]
print(f"Depth: {depth}")
# Get all INFO keys
print(f"INFO fields: {variant.info.keys()}")
# Access specific fields
if "AF" in variant.info:
allele_freq = variant.info["AF"]
print(f"Allele frequency: {allele_freq}")
```
### Sample Genotype Data
Access sample data through `samples` dictionary:
```python
for variant in vcf:
for sample_name in variant.samples:
sample = variant.samples[sample_name]
# Genotype (GT field)
gt = sample["GT"]
print(f"{sample_name} genotype: {gt}")
# Other FORMAT fields
if "DP" in sample:
print(f"{sample_name} depth: {sample['DP']}")
if "GQ" in sample:
print(f"{sample_name} quality: {sample['GQ']}")
# Alleles for this genotype
alleles = sample.alleles
print(f"{sample_name} alleles: {alleles}")
# Phasing
if sample.phased:
print(f"{sample_name} is phased")
```
**Genotype representation:**
- `(0, 0)` - Homozygous reference
- `(0, 1)` - Heterozygous
- `(1, 1)` - Homozygous alternate
- `(None, None)` - Missing genotype
- Phased: `(0|1)` vs unphased: `(0/1)`
## Writing Variant Files
### Creating Header
```python
header = pysam.VariantHeader()
# Add contigs
header.contigs.add("chr1", length=248956422)
header.contigs.add("chr2", length=242193529)
# Add INFO fields
header.add_line('##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">')
header.add_line('##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">')
# Add FORMAT fields
header.add_line('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">')
header.add_line('##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">')
# Add samples
header.add_sample("sample1")
header.add_sample("sample2")
# Create output file
outvcf = pysam.VariantFile("output.vcf", "w", header=header)
```
### Creating Variant Records
```python
# Create new variant
record = outvcf.new_record()
record.chrom = "chr1"
record.pos = 100000
record.id = "rs123456"
record.ref = "A"
record.alts = ("G",)
record.qual = 30
record.filter.add("PASS")
# Set INFO fields
record.info["DP"] = 100
record.info["AF"] = (0.25,)
# Set genotype data
record.samples["sample1"]["GT"] = (0, 1)
record.samples["sample1"]["DP"] = 50
record.samples["sample2"]["GT"] = (0, 0)
record.samples["sample2"]["DP"] = 50
# Write to file
outvcf.write(record)
```
## Filtering Variants
### Basic Filtering
```python
# Filter by quality
for variant in vcf:
if variant.qual >= 30:
print(f"High quality variant: {variant.chrom}:{variant.pos}")
# Filter by depth
for variant in vcf:
if "DP" in variant.info and variant.info["DP"] >= 20:
print(f"High depth variant: {variant.chrom}:{variant.pos}")
# Filter by allele frequency
for variant in vcf:
if "AF" in variant.info:
for af in variant.info["AF"]:
if af >= 0.01:
print(f"Common variant: {variant.chrom}:{variant.pos}")
```
### Filtering by Genotype
```python
# Find variants where sample has alternate allele
for variant in vcf:
sample = variant.samples["sample1"]
gt = sample["GT"]
# Check if has alternate allele
if gt and any(allele and allele > 0 for allele in gt):
print(f"Sample has alt allele: {variant.chrom}:{variant.pos}")
# Check if homozygous alternate
if gt == (1, 1):
print(f"Homozygous alt: {variant.chrom}:{variant.pos}")
```
### Filter Field
```python
# Check FILTER status
for variant in vcf:
if "PASS" in variant.filter or len(variant.filter) == 0:
print(f"Passed filters: {variant.chrom}:{variant.pos}")
else:
print(f"Failed: {variant.filter.keys()}")
```
## Indexing VCF Files
Create tabix index for compressed VCF:
```python
# Compress and index
pysam.tabix_index("example.vcf", preset="vcf", force=True)
# Creates example.vcf.gz and example.vcf.gz.tbi
```
Or use bcftools for BCF:
```python
pysam.bcftools.index("example.bcf")
```
## Common Workflows
### Extract Variants for Specific Samples
```python
invcf = pysam.VariantFile("input.vcf")
samples_to_keep = ["sample1", "sample3"]
# Create new header with subset of samples
new_header = invcf.header.copy()
new_header.samples.clear()
for sample in samples_to_keep:
new_header.samples.add(sample)
outvcf = pysam.VariantFile("output.vcf", "w", header=new_header)
for variant in invcf:
# Create new record
new_record = outvcf.new_record(
contig=variant.chrom,
start=variant.start,
stop=variant.stop,
alleles=variant.alleles,
id=variant.id,
qual=variant.qual,
filter=variant.filter,
info=variant.info
)
# Copy genotype data for selected samples
for sample in samples_to_keep:
new_record.samples[sample].update(variant.samples[sample])
outvcf.write(new_record)
```
### Calculate Allele Frequencies
```python
vcf = pysam.VariantFile("example.vcf")
for variant in vcf:
total_alleles = 0
alt_alleles = 0
for sample_name in variant.samples:
gt = variant.samples[sample_name]["GT"]
if gt and None not in gt:
total_alleles += 2
alt_alleles += sum(1 for allele in gt if allele > 0)
if total_alleles > 0:
af = alt_alleles / total_alleles
print(f"{variant.chrom}:{variant.pos} AF={af:.4f}")
```
### Convert VCF to Summary Table
```python
import csv
vcf = pysam.VariantFile("example.vcf")
with open("variants.csv", "w", newline="") as csvfile:
writer = csv.writer(csvfile)
writer.writerow(["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "DP"])
for variant in vcf:
writer.writerow([
variant.chrom,
variant.pos,
variant.id or ".",
variant.ref,
",".join(variant.alts) if variant.alts else ".",
variant.qual or ".",
variant.info.get("DP", ".")
])
```
## Performance Tips
1. **Use BCF format** for better compression and faster access than VCF
2. **Index files** with tabix for efficient region queries
3. **Filter early** to reduce processing of irrelevant variants
4. **Use INFO fields efficiently** - check existence before accessing
5. **Batch write operations** when creating VCF files
## Common Pitfalls
1. **Coordinate systems:** VCF uses 1-based coordinates, but VariantRecord.start is 0-based
2. **Missing data:** Always check if INFO/FORMAT fields exist before accessing
3. **Genotype tuples:** Genotypes are tuples, not lists—handle None values for missing data
4. **Allele indexing:** In genotype (0, 1), 0=REF, 1=first ALT, 2=second ALT, etc.
5. **Index requirement:** Region-based `fetch()` requires tabix index for VCF.gz
6. **Header modification:** When subsetting samples, properly update header and copy FORMAT fields