Overlap Detection and IGD
The overlaprs module provides efficient overlap detection between genomic intervals using the Integrated Genome Database (IGD) data structure.
IGD Index
IGD is an index structure built for fast interval queries and overlap detection over large sets of genomic regions.
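To make "overlap" concrete, here is a minimal sketch of the pairwise test that any interval index ultimately answers. It assumes BED-style 0-based, half-open coordinates, under which two abutting intervals do not overlap; whether gtars applies exactly this convention is an assumption, and `intervals_overlap` is a hypothetical helper, not part of the library.

```python
def intervals_overlap(start_a: int, end_a: int, start_b: int, end_b: int) -> bool:
    """Return True if half-open intervals [start_a, end_a) and [start_b, end_b)
    share at least one position."""
    return start_a < end_b and start_b < end_a


# Partially overlapping intervals
print(intervals_overlap(1000, 2000, 1500, 2500))
# Abutting intervals: they touch at 2000 but share no position
print(intervals_overlap(1000, 2000, 2000, 3000))
```

An index such as IGD exists to avoid running this test against every stored region for each query.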
Building an IGD Index
Create indexes from genomic region files:
import gtars
# Build IGD index from BED file
igd = gtars.igd.build_index("regions.bed")
# Save index for reuse
igd.save("regions.igd")
# Load existing index
igd = gtars.igd.load_index("regions.igd")
Querying Overlaps
Find overlapping regions efficiently:
# Query a single region
overlaps = igd.query("chr1", 1000, 2000)
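# Query multiple regions
# query_regions: iterable of (chrom, start, end) tuples (example values)
query_regions = [("chr1", 1000, 2000), ("chr2", 5000, 6000)]
results = []
for chrom, start, end in query_regions:
    overlaps = igd.query(chrom, start, end)
    results.append(overlaps)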
# Get overlap counts only
count = igd.count_overlaps("chr1", 1000, 2000)
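The batch loop above iterates over `query_regions` as (chrom, start, end) tuples. One possible way to build that list from BED-formatted text is sketched below in plain Python. `parse_bed_lines` is a hypothetical helper written for this illustration, not a gtars function, and it reads only the first three columns.

```python
def parse_bed_lines(lines):
    """Parse BED-formatted lines into (chrom, start, end) tuples.

    Blank lines, comments, and track/browser header lines are skipped.
    Only the first three columns are used.
    """
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        fields = line.split()
        regions.append((fields[0], int(fields[1]), int(fields[2])))
    return regions


example = ["track name=demo", "chr1\t1000\t2000\tpeak1", "", "chr2\t5000\t6000"]
query_regions = parse_bed_lines(example)
print(query_regions)
```

Because the function accepts any iterable of strings, an open file handle can be passed directly, for example `parse_bed_lines(open("query_regions.bed"))`.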
CLI Usage
The overlaprs command-line tool provides overlap detection:
# Find overlaps between two BED files
gtars overlaprs query --index regions.bed --query query_regions.bed
# Count overlaps
gtars overlaprs count --index regions.bed --query query_regions.bed
# Output overlapping regions
gtars overlaprs overlap --index regions.bed --query query_regions.bed --output overlaps.bed
IGD CLI Commands
Build and query IGD indexes:
# Build IGD index
gtars igd build --input regions.bed --output regions.igd
# Query IGD index
gtars igd query --index regions.igd --region "chr1:1000-2000"
# Batch query from file
gtars igd query --index regions.igd --query-file queries.bed --output results.bed
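When scripting around the query command, it can help to validate region strings of the form `chr1:1000-2000` before passing them on. The following is a minimal sketch of such a check; `parse_region` is a hypothetical helper, and whether the CLI interprets the coordinates as 0-based or 1-based is not stated here, so the function returns the numbers unchanged.

```python
import re

# chrom, then ':', then start-end; thousands separators are tolerated.
_REGION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_region(text):
    """Split a 'chrom:start-end' string into (chrom, start, end)."""
    match = _REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a chrom:start-end region: {text!r}")
    start = int(match["start"].replace(",", ""))
    end = int(match["end"].replace(",", ""))
    if start > end:
        raise ValueError(f"start is greater than end in {text!r}")
    return match["chrom"], start, end


print(parse_region("chr1:1000-2000"))
```

Contig names that themselves contain a colon are rejected by this pattern and would need a more permissive rule.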
Python API
Overlap Detection
Compute overlaps between region sets:
import gtars
# Load two region sets
set_a = gtars.RegionSet.from_bed("regions_a.bed")
set_b = gtars.RegionSet.from_bed("regions_b.bed")
# Find overlaps
overlaps = set_a.overlap(set_b)
# Get regions from A that overlap with B
overlapping_a = set_a.filter_overlapping(set_b)
# Get regions from A that don't overlap with B
non_overlapping_a = set_a.filter_non_overlapping(set_b)
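The two filters above partition set A: every region of A lands in exactly one of the two results. The brute-force sketch below illustrates that intended semantics on plain tuples. It is not the gtars implementation, `split_by_overlap` is a hypothetical name, and half-open coordinates are assumed.

```python
def split_by_overlap(set_a, set_b):
    """Partition regions of set_a by whether they overlap any region of set_b.

    Regions are (chrom, start, end) tuples with half-open coordinates.
    Runs in O(len(set_a) * len(set_b)); an index replaces the inner scan.
    """
    overlapping, non_overlapping = [], []
    for region in set_a:
        chrom, start, end = region
        hit = any(c == chrom and s < end and start < e for c, s, e in set_b)
        (overlapping if hit else non_overlapping).append(region)
    return overlapping, non_overlapping


set_a = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 100, 200)]
set_b = [("chr1", 150, 250), ("chr2", 300, 400)]
overlapping_a, non_overlapping_a = split_by_overlap(set_a, set_b)
print(overlapping_a)
print(non_overlapping_a)
```

A reference like this is mainly useful for checking results on small inputs.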
Overlap Statistics
Calculate overlap metrics:
# Count overlaps
overlap_count = set_a.count_overlaps(set_b)
# Calculate overlap fraction
overlap_fraction = set_a.overlap_fraction(set_b)
# Get overlap coverage
coverage = set_a.overlap_coverage(set_b)
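The exact definitions behind these metrics are not spelled out above. As one plausible reading, the sketch below treats coverage as the number of bases in A that are also covered by B, and the fraction as that number divided by the total bases in A. Both definitions, and the helper names, are assumptions for illustration. Intervals are (start, end) pairs on a single chromosome with half-open coordinates.

```python
def merge_intervals(intervals):
    """Merge overlapping or abutting (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(pair) for pair in merged]


def covered_bases(a, b):
    """Count bases of A that are also covered by at least one interval of B."""
    merged_b = merge_intervals(b)
    total = 0
    for a_start, a_end in merge_intervals(a):
        for b_start, b_end in merged_b:
            total += max(0, min(a_end, b_end) - max(a_start, b_start))
    return total


a = [(100, 200), (300, 400)]
b = [(150, 350)]
shared = covered_bases(a, b)
total_a = sum(end - start for start, end in merge_intervals(a))
print(shared, total_a, shared / total_a)
```

Merging both sets first ensures that no base is counted twice when intervals within a set overlap each other.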
Performance Characteristics
IGD provides efficient querying:
- Index construction: O(n log n), where n is the number of regions
- Query time: O(k + log n), where k is the number of overlaps reported
- Memory: compact representation of genomic intervals
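To give a feel for where a sort-based build cost and a binary-search query cost come from, here is a small sketch of a sorted interval list with a running maximum of end coordinates. This is a simplified stand-in chosen for illustration, not the IGD layout itself, and its backward scan is not guaranteed to meet the query bound listed above in the worst case. `SortedIntervalIndex` is a hypothetical class.

```python
from bisect import bisect_left


class SortedIntervalIndex:
    """Overlap queries over half-open (start, end) intervals on one chromosome."""

    def __init__(self, intervals):
        # Build: sorting dominates the cost, O(n log n).
        self.intervals = sorted(intervals)
        self.starts = [start for start, _ in self.intervals]
        # max_end[i] is the largest end among intervals[0..i].
        self.max_end = []
        running = float("-inf")
        for _, end in self.intervals:
            running = max(running, end)
            self.max_end.append(running)

    def query(self, start, end):
        # Binary search: only intervals[0:hi] have interval.start < end.
        hi = bisect_left(self.starts, end)
        hits = []
        i = hi - 1
        # Walk left; stop once nothing further left can reach past `start`.
        while i >= 0 and self.max_end[i] > start:
            s, e = self.intervals[i]
            if e > start:
                hits.append((s, e))
            i -= 1
        hits.reverse()
        return hits


index = SortedIntervalIndex([(500, 1200), (1800, 2500), (3000, 4000), (0, 100)])
print(index.query(1000, 2000))
```

The running maximum is what lets the scan stop early: once it drops to `start` or below, no interval further left can overlap the query.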
Use Cases
Regulatory Element Analysis
Identify overlap between genomic features:
# Find transcription factor binding sites overlapping promoters
tfbs = gtars.RegionSet.from_bed("chip_seq_peaks.bed")
promoters = gtars.RegionSet.from_bed("promoters.bed")
overlapping_tfbs = tfbs.filter_overlapping(promoters)
print(f"Found {len(overlapping_tfbs)} TFBS in promoters")
Variant Annotation
Annotate variants with overlapping features:
# Check which variants overlap with coding regions
variants = gtars.RegionSet.from_bed("variants.bed")
cds = gtars.RegionSet.from_bed("coding_sequences.bed")
coding_variants = variants.filter_overlapping(cds)
Chromatin State Analysis
Compare chromatin states across samples:
# Find peak regions shared by both samples (a proxy for consistent chromatin state)
sample1 = gtars.RegionSet.from_bed("sample1_peaks.bed")
sample2 = gtars.RegionSet.from_bed("sample2_peaks.bed")
consistent_regions = sample1.overlap(sample2)
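What the overlap call returns (original regions, or only the shared segments) is not specified here. If the shared segments are what is needed, they can be computed with a two-pointer sweep, sketched below under the assumption that each input list is on a single chromosome, uses half-open coordinates, and contains no internally overlapping intervals. `intersect_sorted` is a hypothetical helper.

```python
def intersect_sorted(a, b):
    """Return the segments shared by two lists of (start, end) intervals.

    Each list must be free of internal overlaps; both are sorted here.
    Runs in O(n log n) for the sorts plus O(len(a) + len(b)) for the sweep.
    """
    a, b = sorted(a), sorted(b)
    i = j = 0
    shared = []
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            shared.append((start, end))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return shared


sample1 = [(100, 200), (300, 400)]
sample2 = [(150, 350), (390, 500)]
print(intersect_sorted(sample1, sample2))
```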