Initial commit
This commit is contained in:
365
skills/pysam/references/variant_files.md
Normal file
365
skills/pysam/references/variant_files.md
Normal file
@@ -0,0 +1,365 @@
|
||||
# Working with Variant Files (VCF/BCF)
|
||||
|
||||
## Overview
|
||||
|
||||
Pysam provides the `VariantFile` class for reading and writing VCF (Variant Call Format) and BCF (binary VCF) files. These files contain information about genetic variants, including SNPs, indels, and structural variants.
|
||||
|
||||
## Opening Variant Files
|
||||
|
||||
```python
|
||||
import pysam
|
||||
|
||||
# Reading VCF
|
||||
vcf = pysam.VariantFile("example.vcf")
|
||||
|
||||
# Reading BCF (binary, compressed)
|
||||
bcf = pysam.VariantFile("example.bcf")
|
||||
|
||||
# Reading compressed VCF
|
||||
vcf_gz = pysam.VariantFile("example.vcf.gz")
|
||||
|
||||
# Writing
|
||||
outvcf = pysam.VariantFile("output.vcf", "w", header=vcf.header)
|
||||
```
|
||||
|
||||
## VariantFile Properties
|
||||
|
||||
**Header Information:**
|
||||
- `header` - Complete VCF header with metadata
|
||||
- `header.contigs` - Dictionary of contigs/chromosomes
|
||||
- `header.samples` - List of sample names
|
||||
- `header.filters` - Dictionary of FILTER definitions
|
||||
- `header.info` - Dictionary of INFO field definitions
|
||||
- `header.formats` - Dictionary of FORMAT field definitions
|
||||
|
||||
```python
|
||||
vcf = pysam.VariantFile("example.vcf")
|
||||
|
||||
# List samples
|
||||
print(f"Samples: {list(vcf.header.samples)}")
|
||||
|
||||
# List contigs
|
||||
for contig in vcf.header.contigs:
|
||||
print(f"{contig}: length={vcf.header.contigs[contig].length}")
|
||||
|
||||
# List INFO fields
|
||||
for info in vcf.header.info:
|
||||
print(f"{info}: {vcf.header.info[info].description}")
|
||||
```
|
||||
|
||||
## Reading Variant Records
|
||||
|
||||
### Iterate All Variants
|
||||
|
||||
```python
|
||||
for variant in vcf:
|
||||
print(f"{variant.chrom}:{variant.pos} {variant.ref}>{variant.alts}")
|
||||
```
|
||||
|
||||
### Fetch Specific Region
|
||||
|
||||
Requires tabix index (.tbi) for VCF.gz or index for BCF:
|
||||
|
||||
```python
|
||||
# Fetch variants in region (1-based coordinates for region string)
|
||||
for variant in vcf.fetch("chr1", 1000000, 2000000):
|
||||
print(f"{variant.chrom}:{variant.pos} {variant.id}")
|
||||
|
||||
# Using region string (1-based)
|
||||
for variant in vcf.fetch("chr1:1000000-2000000"):
|
||||
print(variant.pos)
|
||||
```
|
||||
|
||||
**Note:** Uses **1-based coordinates** in `fetch()` calls to match VCF specification.
|
||||
|
||||
## VariantRecord Objects
|
||||
|
||||
Each variant is represented as a `VariantRecord` object:
|
||||
|
||||
### Position Information
|
||||
- `chrom` - Chromosome/contig name
|
||||
- `pos` - Position (1-based)
|
||||
- `start` - Start position (0-based)
|
||||
- `stop` - Stop position (0-based, exclusive)
|
||||
- `id` - Variant ID (e.g., rsID)
|
||||
|
||||
### Allele Information
|
||||
- `ref` - Reference allele
|
||||
- `alts` - Tuple of alternate alleles
|
||||
- `alleles` - Tuple of all alleles (ref + alts)
|
||||
|
||||
### Quality and Filtering
|
||||
- `qual` - Quality score (QUAL field)
|
||||
- `filter` - Filter status
|
||||
|
||||
### INFO Fields
|
||||
|
||||
Access INFO fields as dictionary:
|
||||
|
||||
```python
|
||||
for variant in vcf:
|
||||
# Check if field exists
|
||||
if "DP" in variant.info:
|
||||
depth = variant.info["DP"]
|
||||
print(f"Depth: {depth}")
|
||||
|
||||
# Get all INFO keys
|
||||
print(f"INFO fields: {variant.info.keys()}")
|
||||
|
||||
# Access specific fields
|
||||
if "AF" in variant.info:
|
||||
allele_freq = variant.info["AF"]
|
||||
print(f"Allele frequency: {allele_freq}")
|
||||
```
|
||||
|
||||
### Sample Genotype Data
|
||||
|
||||
Access sample data through `samples` dictionary:
|
||||
|
||||
```python
|
||||
for variant in vcf:
|
||||
for sample_name in variant.samples:
|
||||
sample = variant.samples[sample_name]
|
||||
|
||||
# Genotype (GT field)
|
||||
gt = sample["GT"]
|
||||
print(f"{sample_name} genotype: {gt}")
|
||||
|
||||
# Other FORMAT fields
|
||||
if "DP" in sample:
|
||||
print(f"{sample_name} depth: {sample['DP']}")
|
||||
if "GQ" in sample:
|
||||
print(f"{sample_name} quality: {sample['GQ']}")
|
||||
|
||||
# Alleles for this genotype
|
||||
alleles = sample.alleles
|
||||
print(f"{sample_name} alleles: {alleles}")
|
||||
|
||||
# Phasing
|
||||
if sample.phased:
|
||||
print(f"{sample_name} is phased")
|
||||
```
|
||||
|
||||
**Genotype representation:**
|
||||
- `(0, 0)` - Homozygous reference
|
||||
- `(0, 1)` - Heterozygous
|
||||
- `(1, 1)` - Homozygous alternate
|
||||
- `(None, None)` - Missing genotype
|
||||
- Phased: `(0|1)` vs unphased: `(0/1)`
|
||||
|
||||
## Writing Variant Files
|
||||
|
||||
### Creating Header
|
||||
|
||||
```python
|
||||
header = pysam.VariantHeader()
|
||||
|
||||
# Add contigs
|
||||
header.contigs.add("chr1", length=248956422)
|
||||
header.contigs.add("chr2", length=242193529)
|
||||
|
||||
# Add INFO fields
|
||||
header.add_line('##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">')
|
||||
header.add_line('##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency">')
|
||||
|
||||
# Add FORMAT fields
|
||||
header.add_line('##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">')
|
||||
header.add_line('##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">')
|
||||
|
||||
# Add samples
|
||||
header.add_sample("sample1")
|
||||
header.add_sample("sample2")
|
||||
|
||||
# Create output file
|
||||
outvcf = pysam.VariantFile("output.vcf", "w", header=header)
|
||||
```
|
||||
|
||||
### Creating Variant Records
|
||||
|
||||
```python
|
||||
# Create new variant
|
||||
record = outvcf.new_record()
|
||||
record.chrom = "chr1"
|
||||
record.pos = 100000
|
||||
record.id = "rs123456"
|
||||
record.ref = "A"
|
||||
record.alts = ("G",)
|
||||
record.qual = 30
|
||||
record.filter.add("PASS")
|
||||
|
||||
# Set INFO fields
|
||||
record.info["DP"] = 100
|
||||
record.info["AF"] = (0.25,)
|
||||
|
||||
# Set genotype data
|
||||
record.samples["sample1"]["GT"] = (0, 1)
|
||||
record.samples["sample1"]["DP"] = 50
|
||||
record.samples["sample2"]["GT"] = (0, 0)
|
||||
record.samples["sample2"]["DP"] = 50
|
||||
|
||||
# Write to file
|
||||
outvcf.write(record)
|
||||
```
|
||||
|
||||
## Filtering Variants
|
||||
|
||||
### Basic Filtering
|
||||
|
||||
```python
|
||||
# Filter by quality
|
||||
for variant in vcf:
|
||||
if variant.qual >= 30:
|
||||
print(f"High quality variant: {variant.chrom}:{variant.pos}")
|
||||
|
||||
# Filter by depth
|
||||
for variant in vcf:
|
||||
if "DP" in variant.info and variant.info["DP"] >= 20:
|
||||
print(f"High depth variant: {variant.chrom}:{variant.pos}")
|
||||
|
||||
# Filter by allele frequency
|
||||
for variant in vcf:
|
||||
if "AF" in variant.info:
|
||||
for af in variant.info["AF"]:
|
||||
if af >= 0.01:
|
||||
print(f"Common variant: {variant.chrom}:{variant.pos}")
|
||||
```
|
||||
|
||||
### Filtering by Genotype
|
||||
|
||||
```python
|
||||
# Find variants where sample has alternate allele
|
||||
for variant in vcf:
|
||||
sample = variant.samples["sample1"]
|
||||
gt = sample["GT"]
|
||||
|
||||
# Check if has alternate allele
|
||||
if gt and any(allele and allele > 0 for allele in gt):
|
||||
print(f"Sample has alt allele: {variant.chrom}:{variant.pos}")
|
||||
|
||||
# Check if homozygous alternate
|
||||
if gt == (1, 1):
|
||||
print(f"Homozygous alt: {variant.chrom}:{variant.pos}")
|
||||
```
|
||||
|
||||
### Filter Field
|
||||
|
||||
```python
|
||||
# Check FILTER status
|
||||
for variant in vcf:
|
||||
if "PASS" in variant.filter or len(variant.filter) == 0:
|
||||
print(f"Passed filters: {variant.chrom}:{variant.pos}")
|
||||
else:
|
||||
print(f"Failed: {variant.filter.keys()}")
|
||||
```
|
||||
|
||||
## Indexing VCF Files
|
||||
|
||||
Create tabix index for compressed VCF:
|
||||
|
||||
```python
|
||||
# Compress and index
|
||||
pysam.tabix_index("example.vcf", preset="vcf", force=True)
|
||||
# Creates example.vcf.gz and example.vcf.gz.tbi
|
||||
```
|
||||
|
||||
Or use bcftools for BCF:
|
||||
|
||||
```python
|
||||
pysam.bcftools.index("example.bcf")
|
||||
```
|
||||
|
||||
## Common Workflows
|
||||
|
||||
### Extract Variants for Specific Samples
|
||||
|
||||
```python
|
||||
invcf = pysam.VariantFile("input.vcf")
|
||||
samples_to_keep = ["sample1", "sample3"]
|
||||
|
||||
# Create new header with subset of samples
|
||||
new_header = invcf.header.copy()
|
||||
new_header.samples.clear()
|
||||
for sample in samples_to_keep:
|
||||
new_header.samples.add(sample)
|
||||
|
||||
outvcf = pysam.VariantFile("output.vcf", "w", header=new_header)
|
||||
|
||||
for variant in invcf:
|
||||
# Create new record
|
||||
new_record = outvcf.new_record(
|
||||
contig=variant.chrom,
|
||||
start=variant.start,
|
||||
stop=variant.stop,
|
||||
alleles=variant.alleles,
|
||||
id=variant.id,
|
||||
qual=variant.qual,
|
||||
filter=variant.filter,
|
||||
info=variant.info
|
||||
)
|
||||
|
||||
# Copy genotype data for selected samples
|
||||
for sample in samples_to_keep:
|
||||
new_record.samples[sample].update(variant.samples[sample])
|
||||
|
||||
outvcf.write(new_record)
|
||||
```
|
||||
|
||||
### Calculate Allele Frequencies
|
||||
|
||||
```python
|
||||
vcf = pysam.VariantFile("example.vcf")
|
||||
|
||||
for variant in vcf:
|
||||
total_alleles = 0
|
||||
alt_alleles = 0
|
||||
|
||||
for sample_name in variant.samples:
|
||||
gt = variant.samples[sample_name]["GT"]
|
||||
if gt and None not in gt:
|
||||
total_alleles += 2
|
||||
alt_alleles += sum(1 for allele in gt if allele > 0)
|
||||
|
||||
if total_alleles > 0:
|
||||
af = alt_alleles / total_alleles
|
||||
print(f"{variant.chrom}:{variant.pos} AF={af:.4f}")
|
||||
```
|
||||
|
||||
### Convert VCF to Summary Table
|
||||
|
||||
```python
|
||||
import csv
|
||||
|
||||
vcf = pysam.VariantFile("example.vcf")
|
||||
|
||||
with open("variants.csv", "w", newline="") as csvfile:
|
||||
writer = csv.writer(csvfile)
|
||||
writer.writerow(["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "DP"])
|
||||
|
||||
for variant in vcf:
|
||||
writer.writerow([
|
||||
variant.chrom,
|
||||
variant.pos,
|
||||
variant.id or ".",
|
||||
variant.ref,
|
||||
",".join(variant.alts) if variant.alts else ".",
|
||||
variant.qual or ".",
|
||||
variant.info.get("DP", ".")
|
||||
])
|
||||
```
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Use BCF format** for better compression and faster access than VCF
|
||||
2. **Index files** with tabix for efficient region queries
|
||||
3. **Filter early** to reduce processing of irrelevant variants
|
||||
4. **Use INFO fields efficiently** - check existence before accessing
|
||||
5. **Batch write operations** when creating VCF files
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
1. **Coordinate systems:** VCF uses 1-based coordinates, but VariantRecord.start is 0-based
|
||||
2. **Missing data:** Always check if INFO/FORMAT fields exist before accessing
|
||||
3. **Genotype tuples:** Genotypes are tuples, not lists—handle None values for missing data
|
||||
4. **Allele indexing:** In genotype (0, 1), 0=REF, 1=first ALT, 2=second ALT, etc.
|
||||
5. **Index requirement:** Region-based `fetch()` requires tabix index for VCF.gz
|
||||
6. **Header modification:** When subsetting samples, properly update header and copy FORMAT fields
|
||||
Reference in New Issue
Block a user