zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

Working with Sequence Files (FASTA/FASTQ)

FASTA Files

Overview

Pysam provides the FastaFile class for indexed, random access to FASTA reference sequences. FASTA files must be indexed with samtools faidx before use.

Opening FASTA Files

import pysam

# Open indexed FASTA file
fasta = pysam.FastaFile("reference.fasta")

# Automatically looks for reference.fasta.fai index

Creating FASTA Index

# Create index using pysam
pysam.faidx("reference.fasta")

# Or call the same samtools command through the pysam.samtools namespace
pysam.samtools.faidx("reference.fasta")

This creates a .fai index file required for random access.

FastaFile Properties

fasta = pysam.FastaFile("reference.fasta")

# List of reference sequences
references = fasta.references
print(f"References: {references}")

# Get lengths
lengths = fasta.lengths
print(f"Lengths: {lengths}")

# Get specific sequence length
chr1_length = fasta.get_reference_length("chr1")

Fetching Sequences

Fetch by Region

Uses 0-based, half-open coordinates:

# Fetch specific region
sequence = fasta.fetch("chr1", 1000, 2000)
print(f"Sequence: {sequence}")  # Returns 1000 bases

# Fetch entire chromosome
chr1_seq = fasta.fetch("chr1")

# Fetch using region string (1-based)
sequence = fasta.fetch(region="chr1:1001-2000")

Important: Numeric arguments use 0-based, half-open coordinates; region strings use 1-based, inclusive coordinates (samtools convention). Both calls above return the same 1000 bases.
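
The conversion between the two conventions is easy to get off by one, so a small helper can make it explicit. The following is a minimal sketch in plain Python; the function names `region_to_zero_based` and `zero_based_to_region` are hypothetical and not part of pysam.

```python
import re

# Hypothetical helpers for illustration; pysam does not provide these names.
# The greedy contig group tolerates contig names that themselves contain ':'.
_REGION_RE = re.compile(r"^(?P<chrom>.+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def region_to_zero_based(region):
    """Convert 'chrom:start-end' (1-based, inclusive) to a 0-based, half-open tuple."""
    match = _REGION_RE.match(region)
    if match is None:
        raise ValueError(f"Unrecognised region string: {region!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start < 1 or end < start:
        raise ValueError(f"Invalid coordinates in region: {region!r}")
    # Only the start shifts: a 1-based inclusive end equals a 0-based exclusive end.
    return match["chrom"], start - 1, end


def zero_based_to_region(chrom, start, end):
    """Convert a 0-based, half-open interval to a 1-based, inclusive region string."""
    return f"{chrom}:{start + 1}-{end}"
```

The tuple returned by `region_to_zero_based` can be unpacked straight into `fasta.fetch(chrom, start, end)`.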

Common Use Cases

# Get sequence at variant position
def get_variant_context(fasta, chrom, pos, window=10):
    """Get sequence context around a variant position (1-based)."""
    start = max(0, pos - window - 1)  # Convert to 0-based
    end = pos + window
    return fasta.fetch(chrom, start, end)

# Get sequence for gene coordinates
def get_gene_sequence(fasta, chrom, start, end, strand):
    """Get gene sequence with strand awareness."""
    seq = fasta.fetch(chrom, start, end)

    if strand == "-":
        # Reverse complement
        complement = str.maketrans("ATGCatgc", "TACGtacg")
        seq = seq.translate(complement)[::-1]

    return seq

# Check reference allele
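def check_ref_allele(fasta, chrom, pos, expected_ref):
    """Verify reference allele starting at position (1-based pos)."""
    start = pos - 1  # Convert to 0-based
    # Fetch as many bases as the allele has, so multi-base REF alleles also work
    actual = fasta.fetch(chrom, start, start + len(expected_ref))
    return actual.upper() == expected_ref.upper()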

Extracting Multiple Regions

# Extract multiple regions efficiently
regions = [
    ("chr1", 1000, 2000),
    ("chr1", 5000, 6000),
    ("chr2", 10000, 11000)
]

sequences = {}
for chrom, start, end in regions:
    seq_id = f"{chrom}:{start}-{end}"
    sequences[seq_id] = fasta.fetch(chrom, start, end)
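
Once the regions are collected in a dictionary, a common next step is writing them back out as FASTA text. The sketch below is one way to do that in plain Python; `format_fasta` and its `width` parameter are hypothetical names chosen for illustration.

```python
def format_fasta(sequences, width=60):
    """Render a {name: sequence} mapping as FASTA text with wrapped sequence lines."""
    lines = []
    for name, seq in sequences.items():
        lines.append(f">{name}")
        # Slice into fixed-width chunks; sequences have no whitespace to wrap on.
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)
```

Calling `format_fasta(sequences)` on the dictionary built above and writing the result to a file gives a FASTA file that can itself be indexed with faidx, since every sequence line except the last has the same width.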

Working with Ambiguous Bases

FASTA files may contain IUPAC ambiguity codes:

N = any base
R = A or G (purine)
Y = C or T (pyrimidine)
S = G or C (strong)
W = A or T (weak)
K = G or T (keto)
M = A or C (amino)
B = C, G, or T (not A)
D = A, G, or T (not C)
H = A, C, or T (not G)
V = A, C, or G (not T)

# Handle ambiguous bases
def count_ambiguous(sequence):
    """Count non-ATGC bases."""
    return sum(1 for base in sequence.upper() if base not in "ATGC")

# Remove regions with too many Ns
def has_quality_sequence(fasta, chrom, start, end, max_n_frac=0.1):
    """Check if region has acceptable N content."""
    seq = fasta.fetch(chrom, start, end)
    if not seq:
        return False  # Empty region: nothing to assess, and avoids dividing by zero
    n_count = seq.upper().count('N')
    return (n_count / len(seq)) <= max_n_frac
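
The reverse-complement step in get_gene_sequence translates only A, T, G and C, so ambiguity codes pass through uncomplemented. A minimal sketch of an IUPAC-aware version follows; the name `reverse_complement` is a hypothetical helper, not a pysam function.

```python
# Each IUPAC code maps to the code for the complementary base set,
# e.g. R (A or G) <-> Y (T or C), K (G or T) <-> M (C or A), B (not A) <-> V (not T).
_IUPAC = "ACGTRYSWKMBDHVN"
_IUPAC_COMPLEMENT = "TGCAYRSWMKVHDBN"
_COMPLEMENT_TABLE = str.maketrans(
    _IUPAC + _IUPAC.lower(),
    _IUPAC_COMPLEMENT + _IUPAC_COMPLEMENT.lower(),
)


def reverse_complement(seq):
    """Reverse-complement a DNA string, preserving case and IUPAC ambiguity codes."""
    return seq.translate(_COMPLEMENT_TABLE)[::-1]
```

Characters outside the table, such as gap symbols, are left unchanged by `str.translate`.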

FASTQ Files

Overview

Pysam provides FastxFile (or FastqFile) for reading FASTQ files containing raw sequencing reads with quality scores. FASTQ files do not support random access—only sequential reading.

Opening FASTQ Files

import pysam

# Open FASTQ file
fastq = pysam.FastxFile("reads.fastq")

# Works with compressed files
fastq_gz = pysam.FastxFile("reads.fastq.gz")

Reading FASTQ Records

fastq = pysam.FastxFile("reads.fastq")

for read in fastq:
    print(f"Name: {read.name}")
    print(f"Sequence: {read.sequence}")
    print(f"Quality: {read.quality}")
    print(f"Comment: {read.comment}")  # Optional header comment

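Record attributes (the records are FastxRecord or FastqProxy objects, depending on the pysam version and the persist option):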

name - Read identifier (without @ prefix)
sequence - DNA/RNA sequence
quality - ASCII-encoded quality string
comment - Optional comment from header line
get_quality_array() - Convert quality string to numeric array

Quality Score Conversion

# Convert quality string to numeric values
for read in fastq:
    qual_array = read.get_quality_array()
    mean_quality = sum(qual_array) / len(qual_array)
    print(f"{read.name}: mean Q = {mean_quality:.1f}")

Quality scores are Phred-scaled (typically Phred+33 encoding):

Q = -10 * log10(P_error)
ASCII 33 ('!') = Q0
ASCII 43 ('+') = Q10
ASCII 63 ('?') = Q30
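
To make the encoding concrete, the sketch below decodes a Phred+33 string by hand and converts scores to error probabilities. The function names are hypothetical. It also shows that averaging Q scores directly is not the same as averaging error probabilities, which matters when a read mixes very good and very poor bases.

```python
import math


def phred_scores(quality, offset=33):
    """Decode an ASCII quality string into a list of integer Phred scores."""
    return [ord(ch) - offset for ch in quality]


def error_probability(q):
    """Invert Q = -10 * log10(P_error)."""
    return 10 ** (-q / 10)


def mean_error_quality(quality, offset=33):
    """Q score that corresponds to the mean per-base error probability."""
    scores = phred_scores(quality, offset)
    if not scores:
        raise ValueError("empty quality string")
    mean_p = sum(error_probability(q) for q in scores) / len(scores)
    return -10 * math.log10(mean_p)
```

For the two-base string `"!?"` (Q0 and Q30) the arithmetic mean of the scores is 15, while `mean_error_quality` gives roughly Q3, because the Q0 base dominates the expected error rate.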

Common FASTQ Processing Workflows

Quality Filtering

def filter_by_quality(input_fastq, output_fastq, min_mean_quality=20):
    """Filter reads by mean quality score."""
    with pysam.FastxFile(input_fastq) as infile:
        with open(output_fastq, 'w') as outfile:
            for read in infile:
                qual_array = read.get_quality_array()
                if not qual_array:
                    continue  # Skip empty reads to avoid dividing by zero
                mean_q = sum(qual_array) / len(qual_array)

                if mean_q >= min_mean_quality:
                    # Write in FASTQ format
                    outfile.write(f"@{read.name}\n")
                    outfile.write(f"{read.sequence}\n")
                    outfile.write("+\n")
                    outfile.write(f"{read.quality}\n")
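
The four write calls above recur in each of the workflows that follow, and they drop the optional header comment. One possible refactoring is a small formatting helper, sketched here under the assumption that the name, sequence, quality and comment are plain strings; `format_fastq` is a hypothetical name.

```python
def format_fastq(name, sequence, quality, comment=None):
    """Return one FASTQ record as a four-line string, keeping the header comment if present."""
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality must have the same length")
    header = f"@{name} {comment}" if comment else f"@{name}"
    return f"{header}\n{sequence}\n+\n{quality}\n"
```

Inside a loop over a FastxFile this would be used as `outfile.write(format_fastq(read.name, read.sequence, read.quality, read.comment))`.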

Length Filtering

def filter_by_length(input_fastq, output_fastq, min_length=50):
    """Filter reads by minimum length."""
    with pysam.FastxFile(input_fastq) as infile:
        with open(output_fastq, 'w') as outfile:
            kept = 0
            for read in infile:
                if len(read.sequence) >= min_length:
                    outfile.write(f"@{read.name}\n")
                    outfile.write(f"{read.sequence}\n")
                    outfile.write("+\n")
                    outfile.write(f"{read.quality}\n")
                    kept += 1
    print(f"Kept {kept} reads")

Calculate Quality Statistics

def calculate_fastq_stats(fastq_file):
    """Calculate basic statistics for FASTQ file."""
    total_reads = 0
    total_bases = 0
    quality_sum = 0

    with pysam.FastxFile(fastq_file) as fastq:
        for read in fastq:
            total_reads += 1
            read_length = len(read.sequence)
            total_bases += read_length

            qual_array = read.get_quality_array()
            quality_sum += sum(qual_array)

    return {
        "total_reads": total_reads,
        "total_bases": total_bases,
        "mean_read_length": total_bases / total_reads if total_reads > 0 else 0,
        "mean_quality": quality_sum / total_bases if total_bases > 0 else 0
    }

Extract Reads by Name

def extract_reads_by_name(fastq_file, read_names, output_file):
    """Extract specific reads by name."""
    read_set = set(read_names)

    with pysam.FastxFile(fastq_file) as infile:
        with open(output_file, 'w') as outfile:
            for read in infile:
                if read.name in read_set:
                    outfile.write(f"@{read.name}\n")
                    outfile.write(f"{read.sequence}\n")
                    outfile.write("+\n")
                    outfile.write(f"{read.quality}\n")

Convert FASTQ to FASTA

def fastq_to_fasta(fastq_file, fasta_file):
    """Convert FASTQ to FASTA (discards quality scores)."""
    with pysam.FastxFile(fastq_file) as infile:
        with open(fasta_file, 'w') as outfile:
            for read in infile:
                outfile.write(f">{read.name}\n")
                outfile.write(f"{read.sequence}\n")

Subsample FASTQ

import random

def subsample_fastq(input_fastq, output_fastq, fraction=0.1, seed=42):
    """Randomly subsample reads from FASTQ file."""
    rng = random.Random(seed)  # Local generator; leaves the global random state untouched

    with pysam.FastxFile(input_fastq) as infile:
        with open(output_fastq, 'w') as outfile:
            for read in infile:
                if rng.random() < fraction:
                    outfile.write(f"@{read.name}\n")
                    outfile.write(f"{read.sequence}\n")
                    outfile.write("+\n")
                    outfile.write(f"{read.quality}\n")
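
The function above keeps each read independently with probability `fraction`, so the number of reads written varies from run to run and is only approximately `fraction` times the input size. When an exact number of reads is needed, reservoir sampling (Algorithm R) is a common alternative that still needs only one sequential pass. This is a minimal sketch over any iterable; `reservoir_sample` is a hypothetical name, and it assumes the records remain valid after iteration moves on.

```python
import random


def reservoir_sample(records, k, seed=42):
    """Return k records chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Record i replaces a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive at both ends
            if j < k:
                reservoir[j] = record
    return reservoir
```

The reservoir holds k records in memory, and the sampled reads come back in reservoir order rather than file order. If the input has fewer than k records, all of them are returned.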

Tabix-Indexed Files

Overview

Pysam provides TabixFile for accessing tabix-indexed genomic data files (BED, GFF, GTF, generic tab-delimited).

Opening Tabix Files

import pysam

# Open tabix-indexed file
tabix = pysam.TabixFile("annotations.bed.gz")

# File must be bgzip-compressed and tabix-indexed

Creating Tabix Index

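# Compress with bgzip and index a position-sorted file
pysam.tabix_index("annotations.bed", preset="bed", force=True)
# Creates annotations.bed.gz and annotations.bed.gz.tbi;
# the uncompressed input is removed unless keep_original=True

# Presets available: gff, bed, sam, vcf, psltbl, pileup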
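
Tabix indexing expects records to be grouped by sequence name and sorted by start position, and the indexing call does not sort the file itself. A minimal sketch of an in-memory sort for small BED files follows; `sort_bed_lines` is a hypothetical helper, and for large files an external sort is the usual choice.

```python
def sort_bed_lines(lines):
    """Sort BED records by contig name, start, then end; '#' comment lines stay on top."""
    headers, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # Drop blank lines
        if line.startswith("#"):
            headers.append(line)
        else:
            records.append(line)

    def sort_key(record):
        fields = record.split("\t")
        # Compare coordinates numerically, not as strings.
        return fields[0], int(fields[1]), int(fields[2])

    return headers + sorted(records, key=sort_key)
```

Contigs are ordered lexically here ("chr10" before "chr2"), which is fine for indexing because only grouping and within-contig order matter.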

Fetching Records

tabix = pysam.TabixFile("annotations.bed.gz")

# Fetch region
for row in tabix.fetch("chr1", 1000000, 2000000):
    print(row)  # Returns tab-delimited string

# Parse with specific parser
for row in tabix.fetch("chr1", 1000000, 2000000, parser=pysam.asBed()):
    print(f"Interval: {row.contig}:{row.start}-{row.end}")

# Available parsers: asBed(), asGTF(), asVCF(), asTuple()

Working with BED Files

bed = pysam.TabixFile("regions.bed.gz")

# Access BED fields by name
for interval in bed.fetch("chr1", 1000000, 2000000, parser=pysam.asBed()):
    print(f"Region: {interval.contig}:{interval.start}-{interval.end}")
    print(f"Name: {interval.name}")
    print(f"Score: {interval.score}")
    print(f"Strand: {interval.strand}")

Working with GTF/GFF Files

gtf = pysam.TabixFile("annotations.gtf.gz")

# Access GTF fields
for feature in gtf.fetch("chr1", 1000000, 2000000, parser=pysam.asGTF()):
    print(f"Feature: {feature.feature}")
    print(f"Gene: {feature.gene_id}")
    print(f"Transcript: {feature.transcript_id}")
    print(f"Coordinates: {feature.start}-{feature.end}")

Performance Tips

FASTA

Always use indexed FASTA files (create .fai with samtools faidx)
Batch fetch operations when extracting multiple regions
Cache frequently accessed sequences in memory
Use appropriate window sizes to avoid loading excessive sequence data

FASTQ

Stream processing - FASTQ files are read sequentially, process on-the-fly
Use compressed FASTQ.gz to save disk space (pysam handles transparently)
Avoid loading entire file into memory—process read-by-read
For large files, consider parallel processing with file splitting

Tabix

Always bgzip and tabix-index files before region queries
Use appropriate presets when creating indices
Specify parser for named field access
Batch queries to same file to avoid re-opening

Common Pitfalls

FASTA coordinate system: fetch() uses 0-based coordinates, region strings use 1-based
Missing index: FASTA random access requires .fai index file
FASTQ sequential only: Cannot do random access or region-based queries on FASTQ
Quality encoding: Assume Phred+33 unless specified otherwise
Tabix compression: Must use bgzip, not regular gzip, for tabix indexing
Parser requirement: TabixFile needs explicit parser for named field access
Case sensitivity: FASTA sequences preserve case—use .upper() or .lower() for consistent comparisons
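
The bgzip pitfall can be checked before indexing by inspecting the first bytes of the file. A BGZF block is a gzip member with the extra-field flag set and a "BC" subfield of length 2. The sketch below assumes that "BC" is the first extra subfield, which is how bgzip writes it; the function names are hypothetical.

```python
def looks_like_bgzf(header):
    """Check whether the first 16 bytes of a file match a BGZF block header."""
    return (
        len(header) >= 16
        and header[:3] == b"\x1f\x8b\x08"    # gzip magic bytes + deflate method
        and bool(header[3] & 0x04)           # FLG.FEXTRA: an extra field is present
        and header[12:14] == b"BC"           # BGZF subfield identifier
        and header[14:16] == b"\x02\x00"     # subfield length = 2 (little-endian)
    )


def is_bgzf_file(path):
    """Read the start of a file and test it for a BGZF header."""
    with open(path, "rb") as handle:
        return looks_like_bgzf(handle.read(16))
```

A file produced by regular gzip fails this check because its header normally carries no "BC" extra subfield.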
