# Working with Sequence Files (FASTA/FASTQ)

## FASTA Files

### Overview

Pysam provides the `FastaFile` class for indexed, random access to FASTA reference sequences. FASTA files must be indexed with `samtools faidx` before use.

### Opening FASTA Files

```python
import pysam

# Open indexed FASTA file
fasta = pysam.FastaFile("reference.fasta")

# Automatically looks for the reference.fasta.fai index
```

### Creating FASTA Index

```python
# Create index using pysam's top-level dispatch
pysam.faidx("reference.fasta")

# Equivalent call through the samtools namespace
pysam.samtools.faidx("reference.fasta")
```

This creates the `.fai` index file required for random access.
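
If you are unsure whether an index already exists, you can create one on the fly before opening the file. A minimal sketch; the `ensure_faidx` helper is ours, not a pysam API:

```python
import os

import pysam

def ensure_faidx(fasta_path):
    """Open a FASTA file, building the .fai index first if it is missing.

    Hypothetical convenience helper, not part of pysam.
    """
    if not os.path.exists(fasta_path + ".fai"):
        pysam.faidx(fasta_path)
    return pysam.FastaFile(fasta_path)
```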

### FastaFile Properties

```python
fasta = pysam.FastaFile("reference.fasta")

# List of reference sequence names
references = fasta.references
print(f"References: {references}")

# Lengths of all references, in the same order
lengths = fasta.lengths
print(f"Lengths: {lengths}")

# Length of a specific sequence
chr1_length = fasta.get_reference_length("chr1")
```

### Fetching Sequences

#### Fetch by Region

`fetch()` uses **0-based, half-open** coordinates:

```python
# Fetch specific region
sequence = fasta.fetch("chr1", 1000, 2000)
print(f"Sequence: {sequence}")  # Returns 1000 bases

# Fetch entire chromosome
chr1_seq = fasta.fetch("chr1")

# Fetch using region string (1-based, inclusive)
sequence = fasta.fetch(region="chr1:1001-2000")
```

**Important:** Numeric arguments use 0-based, half-open coordinates; region strings use 1-based, inclusive coordinates (samtools convention).
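
The two conventions address the same bases; for example, these two calls should return identical sequence (assuming the reference has a `chr1` contig):

```python
# The same 1000 bases, addressed two ways
numeric = fasta.fetch("chr1", 1000, 2000)         # 0-based, half-open
by_region = fasta.fetch(region="chr1:1001-2000")  # 1-based, inclusive
assert numeric == by_region
```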

#### Common Use Cases

```python
# Get sequence at variant position
def get_variant_context(fasta, chrom, pos, window=10):
    """Get sequence context around a variant position (1-based)."""
    start = max(0, pos - window - 1)  # Convert to 0-based
    end = pos + window
    return fasta.fetch(chrom, start, end)

# Get sequence for gene coordinates
def get_gene_sequence(fasta, chrom, start, end, strand):
    """Get gene sequence with strand awareness."""
    seq = fasta.fetch(chrom, start, end)

    if strand == "-":
        # Reverse complement
        complement = str.maketrans("ATGCatgc", "TACGtacg")
        seq = seq.translate(complement)[::-1]

    return seq

# Check reference allele
def check_ref_allele(fasta, chrom, pos, expected_ref):
    """Verify the reference allele at a position (1-based pos)."""
    actual = fasta.fetch(chrom, pos - 1, pos)  # Convert to 0-based
    return actual.upper() == expected_ref.upper()
```

### Extracting Multiple Regions

```python
# Extract multiple regions from one open handle
regions = [
    ("chr1", 1000, 2000),
    ("chr1", 5000, 6000),
    ("chr2", 10000, 11000),
]

sequences = {}
for chrom, start, end in regions:
    seq_id = f"{chrom}:{start}-{end}"
    sequences[seq_id] = fasta.fetch(chrom, start, end)
```

### Working with Ambiguous Bases

FASTA files may contain IUPAC ambiguity codes:

- N = any base
- R = A or G (purine)
- Y = C or T (pyrimidine)
- S = G or C (strong)
- W = A or T (weak)
- K = G or T (keto)
- M = A or C (amino)
- B = C, G, or T (not A)
- D = A, G, or T (not C)
- H = A, C, or T (not G)
- V = A, C, or G (not T)

```python
# Handle ambiguous bases
def count_ambiguous(sequence):
    """Count non-ATGC bases."""
    return sum(1 for base in sequence.upper() if base not in "ATGC")

# Skip regions with too many Ns
def has_quality_sequence(fasta, chrom, start, end, max_n_frac=0.1):
    """Check if a region has acceptable N content."""
    seq = fasta.fetch(chrom, start, end)
    if not seq:
        return False  # Empty region: avoid division by zero
    n_count = seq.upper().count('N')
    return (n_count / len(seq)) <= max_n_frac
```

## FASTQ Files

### Overview

Pysam provides `FastxFile` (with `FastqFile` as an alias) for reading FASTQ files containing raw sequencing reads with quality scores. FASTQ files do not support random access; they can only be read sequentially.

### Opening FASTQ Files

```python
import pysam

# Open FASTQ file
fastq = pysam.FastxFile("reads.fastq")

# Works with compressed files too
fastq_gz = pysam.FastxFile("reads.fastq.gz")
```

### Reading FASTQ Records

```python
fastq = pysam.FastxFile("reads.fastq")

for read in fastq:
    print(f"Name: {read.name}")
    print(f"Sequence: {read.sequence}")
    print(f"Quality: {read.quality}")
    print(f"Comment: {read.comment}")  # Optional header comment
```

**FastqProxy attributes:**

- `name` - Read identifier (without the `@` prefix)
- `sequence` - DNA/RNA sequence
- `quality` - ASCII-encoded quality string
- `comment` - Optional comment from the header line
- `get_quality_array()` - Convert the quality string to a numeric array
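
When copying records to an output file, `str(read)` should render the full record (FASTQ format when a quality string is present), which is less error-prone than reassembling the four lines by hand. A minimal sketch, assuming this rendering behavior and an illustrative output file name:

```python
# Copy records verbatim; str(read) formats the whole record
with pysam.FastxFile("reads.fastq") as infile, open("copy.fastq", "w") as outfile:
    for read in infile:
        outfile.write(str(read) + "\n")
```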

### Quality Score Conversion

```python
# Convert quality string to numeric values
for read in fastq:
    qual_array = read.get_quality_array()
    mean_quality = sum(qual_array) / len(qual_array)
    print(f"{read.name}: mean Q = {mean_quality:.1f}")
```

Quality scores are Phred-scaled (typically Phred+33 encoding):

- Q = -10 * log10(P_error)
- ASCII 33 ('!') = Q0
- ASCII 43 ('+') = Q10
- ASCII 63 ('?') = Q30
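
Since the encoding is simply `chr(Q + 33)`, `get_quality_array()` is equivalent to decoding the string by hand:

```python
# Manual Phred+33 decoding; matches read.get_quality_array()
quality = "I?+!"
scores = [ord(c) - 33 for c in quality]
print(scores)  # [40, 30, 10, 0]
```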

### Common FASTQ Processing Workflows

#### Quality Filtering

```python
def filter_by_quality(input_fastq, output_fastq, min_mean_quality=20):
    """Filter reads by mean quality score."""
    with pysam.FastxFile(input_fastq) as infile:
        with open(output_fastq, 'w') as outfile:
            for read in infile:
                qual_array = read.get_quality_array()
                mean_q = sum(qual_array) / len(qual_array)

                if mean_q >= min_mean_quality:
                    # Write in FASTQ format
                    outfile.write(f"@{read.name}\n")
                    outfile.write(f"{read.sequence}\n")
                    outfile.write("+\n")
                    outfile.write(f"{read.quality}\n")
```

#### Length Filtering

```python
def filter_by_length(input_fastq, output_fastq, min_length=50):
    """Filter reads by minimum length."""
    with pysam.FastxFile(input_fastq) as infile:
        with open(output_fastq, 'w') as outfile:
            kept = 0
            for read in infile:
                if len(read.sequence) >= min_length:
                    outfile.write(f"@{read.name}\n")
                    outfile.write(f"{read.sequence}\n")
                    outfile.write("+\n")
                    outfile.write(f"{read.quality}\n")
                    kept += 1
            print(f"Kept {kept} reads")
```

#### Calculate Quality Statistics

```python
def calculate_fastq_stats(fastq_file):
    """Calculate basic statistics for a FASTQ file."""
    total_reads = 0
    total_bases = 0
    quality_sum = 0

    with pysam.FastxFile(fastq_file) as fastq:
        for read in fastq:
            total_reads += 1
            read_length = len(read.sequence)
            total_bases += read_length

            qual_array = read.get_quality_array()
            quality_sum += sum(qual_array)

    return {
        "total_reads": total_reads,
        "total_bases": total_bases,
        "mean_read_length": total_bases / total_reads if total_reads > 0 else 0,
        "mean_quality": quality_sum / total_bases if total_bases > 0 else 0,
    }
```
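
Usage is a single call (the file name here is illustrative):

```python
stats = calculate_fastq_stats("reads.fastq")
print(f"{stats['total_reads']} reads, mean length {stats['mean_read_length']:.1f} bp")
```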

#### Extract Reads by Name

```python
def extract_reads_by_name(fastq_file, read_names, output_file):
    """Extract specific reads by name."""
    read_set = set(read_names)

    with pysam.FastxFile(fastq_file) as infile:
        with open(output_file, 'w') as outfile:
            for read in infile:
                if read.name in read_set:
                    outfile.write(f"@{read.name}\n")
                    outfile.write(f"{read.sequence}\n")
                    outfile.write("+\n")
                    outfile.write(f"{read.quality}\n")
```

#### Convert FASTQ to FASTA

```python
def fastq_to_fasta(fastq_file, fasta_file):
    """Convert FASTQ to FASTA (discards quality scores)."""
    with pysam.FastxFile(fastq_file) as infile:
        with open(fasta_file, 'w') as outfile:
            for read in infile:
                outfile.write(f">{read.name}\n")
                outfile.write(f"{read.sequence}\n")
```

#### Subsample FASTQ

```python
import random

def subsample_fastq(input_fastq, output_fastq, fraction=0.1, seed=42):
    """Randomly subsample reads from a FASTQ file."""
    random.seed(seed)

    with pysam.FastxFile(input_fastq) as infile:
        with open(output_fastq, 'w') as outfile:
            for read in infile:
                if random.random() < fraction:
                    outfile.write(f"@{read.name}\n")
                    outfile.write(f"{read.sequence}\n")
                    outfile.write("+\n")
                    outfile.write(f"{read.quality}\n")
```

## Tabix-Indexed Files

### Overview

Pysam provides `TabixFile` for accessing tabix-indexed genomic data files (BED, GFF, GTF, and generic tab-delimited formats).

### Opening Tabix Files

```python
import pysam

# Open tabix-indexed file
tabix = pysam.TabixFile("annotations.bed.gz")

# The file must be bgzip-compressed and tabix-indexed
```

### Creating Tabix Index

```python
# Compress and index a file in one step
pysam.tabix_index("annotations.bed", preset="bed", force=True)
# Creates annotations.bed.gz and annotations.bed.gz.tbi

# Available presets: bed, gff, vcf
```

### Fetching Records

```python
tabix = pysam.TabixFile("annotations.bed.gz")

# Fetch region
for row in tabix.fetch("chr1", 1000000, 2000000):
    print(row)  # Returns a tab-delimited string

# Parse with specific parser
for row in tabix.fetch("chr1", 1000000, 2000000, parser=pysam.asBed()):
    print(f"Interval: {row.contig}:{row.start}-{row.end}")

# Available parsers: asBed(), asGTF(), asVCF(), asTuple()
```

### Working with BED Files

```python
bed = pysam.TabixFile("regions.bed.gz")

# Access BED fields by name
for interval in bed.fetch("chr1", 1000000, 2000000, parser=pysam.asBed()):
    print(f"Region: {interval.contig}:{interval.start}-{interval.end}")
    print(f"Name: {interval.name}")
    print(f"Score: {interval.score}")
    print(f"Strand: {interval.strand}")
```

### Working with GTF/GFF Files

```python
gtf = pysam.TabixFile("annotations.gtf.gz")

# Access GTF fields
for feature in gtf.fetch("chr1", 1000000, 2000000, parser=pysam.asGTF()):
    print(f"Feature: {feature.feature}")
    print(f"Gene: {feature.gene_id}")
    print(f"Transcript: {feature.transcript_id}")
    print(f"Coordinates: {feature.start}-{feature.end}")
```

## Performance Tips

### FASTA

1. **Always use indexed FASTA files** (create the `.fai` with `samtools faidx`)
2. **Batch fetch operations** when extracting multiple regions
3. **Cache frequently accessed sequences** in memory (see the sketch after this list)
4. **Use appropriate window sizes** to avoid loading excessive sequence data
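
A minimal caching sketch using `functools.lru_cache`; the wrapper function and cache size are our choices, not a pysam feature:

```python
from functools import lru_cache

import pysam

fasta = pysam.FastaFile("reference.fasta")

@lru_cache(maxsize=1024)
def cached_fetch(chrom, start, end):
    """Memoize repeated fetches of the same region (hypothetical helper)."""
    return fasta.fetch(chrom, start, end)
```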

### FASTQ

1. **Stream processing** - FASTQ files are read sequentially; process reads on the fly
2. **Use compressed `.fastq.gz`** to save disk space (pysam handles it transparently)
3. **Avoid loading the entire file** into memory; process read-by-read
4. **For large files**, consider parallel processing with file splitting

### Tabix

1. **Always bgzip and tabix-index** files before region queries (see the sketch after this list)
2. **Use appropriate presets** when creating indices
3. **Specify a parser** for named field access
4. **Batch queries** against the same open file to avoid re-opening it
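
Regular `gzip` output cannot be tabix-indexed. One way to produce bgzip output from Python is `pysam.tabix_compress`; a minimal sketch:

```python
import pysam

# bgzip-compress (plain gzip will NOT work for tabix)
pysam.tabix_compress("annotations.bed", "annotations.bed.gz", force=True)

# Index with the matching preset
pysam.tabix_index("annotations.bed.gz", preset="bed", force=True)
```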

## Common Pitfalls

1. **FASTA coordinate system:** `fetch()` uses 0-based coordinates; region strings use 1-based
2. **Missing index:** FASTA random access requires a `.fai` index file
3. **FASTQ is sequential only:** no random access or region-based queries on FASTQ
4. **Quality encoding:** assume Phred+33 unless specified otherwise
5. **Tabix compression:** use bgzip, not regular gzip, for tabix indexing
6. **Parser requirement:** `TabixFile` needs an explicit parser for named field access
7. **Case sensitivity:** FASTA sequences preserve case; use `.upper()` or `.lower()` for consistent comparisons