Files
2025-11-30 08:30:10 +08:00

3.6 KiB

Overlap Detection and IGD

The overlaprs module provides efficient overlap detection between genomic intervals using the Integrated Genome Database (IGD) data structure.

IGD Index

IGD (Integrated Genome Database) is a specialized data structure for fast genomic interval queries and overlap detection.

Building an IGD Index

Create indexes from genomic region files:

import gtars

# Build IGD index from BED file
igd = gtars.igd.build_index("regions.bed")

# Save index for reuse
igd.save("regions.igd")

# Load existing index
igd = gtars.igd.load_index("regions.igd")

Querying Overlaps

Find overlapping regions efficiently:

# Query a single region
overlaps = igd.query("chr1", 1000, 2000)

# Query multiple regions
results = []
for chrom, start, end in query_regions:
    overlaps = igd.query(chrom, start, end)
    results.append(overlaps)

# Get overlap counts only
count = igd.count_overlaps("chr1", 1000, 2000)

CLI Usage

The overlaprs command-line tool provides overlap detection:

# Find overlaps between two BED files
gtars overlaprs query --index regions.bed --query query_regions.bed

# Count overlaps
gtars overlaprs count --index regions.bed --query query_regions.bed

# Output overlapping regions
gtars overlaprs overlap --index regions.bed --query query_regions.bed --output overlaps.bed

IGD CLI Commands

Build and query IGD indexes:

# Build IGD index
gtars igd build --input regions.bed --output regions.igd

# Query IGD index
gtars igd query --index regions.igd --region "chr1:1000-2000"

# Batch query from file
gtars igd query --index regions.igd --query-file queries.bed --output results.bed

Python API

Overlap Detection

Compute overlaps between region sets:

import gtars

# Load two region sets
set_a = gtars.RegionSet.from_bed("regions_a.bed")
set_b = gtars.RegionSet.from_bed("regions_b.bed")

# Find overlaps
overlaps = set_a.overlap(set_b)

# Get regions from A that overlap with B
overlapping_a = set_a.filter_overlapping(set_b)

# Get regions from A that don't overlap with B
non_overlapping_a = set_a.filter_non_overlapping(set_b)

Overlap Statistics

Calculate overlap metrics:

# Count overlaps
overlap_count = set_a.count_overlaps(set_b)

# Calculate overlap fraction
overlap_fraction = set_a.overlap_fraction(set_b)

# Get overlap coverage
coverage = set_a.overlap_coverage(set_b)

Performance Characteristics

IGD provides efficient querying:

  • Index construction: O(n log n) where n is number of regions
  • Query time: O(k + log n) where k is number of overlaps
  • Memory efficient: Compact representation of genomic intervals

Use Cases

Regulatory Element Analysis

Identify overlap between genomic features:

# Find transcription factor binding sites overlapping promoters
tfbs = gtars.RegionSet.from_bed("chip_seq_peaks.bed")
promoters = gtars.RegionSet.from_bed("promoters.bed")

overlapping_tfbs = tfbs.filter_overlapping(promoters)
print(f"Found {len(overlapping_tfbs)} TFBS in promoters")

Variant Annotation

Annotate variants with overlapping features:

# Check which variants overlap with coding regions
variants = gtars.RegionSet.from_bed("variants.bed")
cds = gtars.RegionSet.from_bed("coding_sequences.bed")

coding_variants = variants.filter_overlapping(cds)

Chromatin State Analysis

Compare chromatin states across samples:

# Find regions with consistent chromatin states
sample1 = gtars.RegionSet.from_bed("sample1_peaks.bed")
sample2 = gtars.RegionSet.from_bed("sample2_peaks.bed")

consistent_regions = sample1.overlap(sample2)