Initial commit
This commit is contained in:
156
skills/gtars/references/overlap.md
Normal file
156
skills/gtars/references/overlap.md
Normal file
@@ -0,0 +1,156 @@
|
||||
# Overlap Detection and IGD
|
||||
|
||||
The overlaprs module provides efficient overlap detection between genomic intervals using the Integrated Genome Database (IGD) data structure.
|
||||
|
||||
## IGD Index
|
||||
|
||||
IGD (Integrated Genome Database) is a specialized data structure for fast genomic interval queries and overlap detection.
|
||||
|
||||
### Building an IGD Index
|
||||
|
||||
Create indexes from genomic region files:
|
||||
|
||||
```python
|
||||
import gtars
|
||||
|
||||
# Build IGD index from BED file
|
||||
igd = gtars.igd.build_index("regions.bed")
|
||||
|
||||
# Save index for reuse
|
||||
igd.save("regions.igd")
|
||||
|
||||
# Load existing index
|
||||
igd = gtars.igd.load_index("regions.igd")
|
||||
```
|
||||
|
||||
### Querying Overlaps
|
||||
|
||||
Find overlapping regions efficiently:
|
||||
|
||||
```python
|
||||
# Query a single region
|
||||
overlaps = igd.query("chr1", 1000, 2000)
|
||||
|
||||
# Query multiple regions
|
||||
results = []
|
||||
for chrom, start, end in query_regions:
|
||||
overlaps = igd.query(chrom, start, end)
|
||||
results.append(overlaps)
|
||||
|
||||
# Get overlap counts only
|
||||
count = igd.count_overlaps("chr1", 1000, 2000)
|
||||
```
|
||||
|
||||
## CLI Usage
|
||||
|
||||
The overlaprs command-line tool provides overlap detection:
|
||||
|
||||
```bash
|
||||
# Find overlaps between two BED files
|
||||
gtars overlaprs query --index regions.bed --query query_regions.bed
|
||||
|
||||
# Count overlaps
|
||||
gtars overlaprs count --index regions.bed --query query_regions.bed
|
||||
|
||||
# Output overlapping regions
|
||||
gtars overlaprs overlap --index regions.bed --query query_regions.bed --output overlaps.bed
|
||||
```
|
||||
|
||||
### IGD CLI Commands
|
||||
|
||||
Build and query IGD indexes:
|
||||
|
||||
```bash
|
||||
# Build IGD index
|
||||
gtars igd build --input regions.bed --output regions.igd
|
||||
|
||||
# Query IGD index
|
||||
gtars igd query --index regions.igd --region "chr1:1000-2000"
|
||||
|
||||
# Batch query from file
|
||||
gtars igd query --index regions.igd --query-file queries.bed --output results.bed
|
||||
```
|
||||
|
||||
## Python API
|
||||
|
||||
### Overlap Detection
|
||||
|
||||
Compute overlaps between region sets:
|
||||
|
||||
```python
|
||||
import gtars
|
||||
|
||||
# Load two region sets
|
||||
set_a = gtars.RegionSet.from_bed("regions_a.bed")
|
||||
set_b = gtars.RegionSet.from_bed("regions_b.bed")
|
||||
|
||||
# Find overlaps
|
||||
overlaps = set_a.overlap(set_b)
|
||||
|
||||
# Get regions from A that overlap with B
|
||||
overlapping_a = set_a.filter_overlapping(set_b)
|
||||
|
||||
# Get regions from A that don't overlap with B
|
||||
non_overlapping_a = set_a.filter_non_overlapping(set_b)
|
||||
```
|
||||
|
||||
### Overlap Statistics
|
||||
|
||||
Calculate overlap metrics:
|
||||
|
||||
```python
|
||||
# Count overlaps
|
||||
overlap_count = set_a.count_overlaps(set_b)
|
||||
|
||||
# Calculate overlap fraction
|
||||
overlap_fraction = set_a.overlap_fraction(set_b)
|
||||
|
||||
# Get overlap coverage
|
||||
coverage = set_a.overlap_coverage(set_b)
|
||||
```
|
||||
|
||||
## Performance Characteristics
|
||||
|
||||
IGD provides efficient querying:
|
||||
- **Index construction**: O(n log n) where n is number of regions
|
||||
- **Query time**: O(k + log n) where k is number of overlaps
|
||||
- **Memory efficient**: Compact representation of genomic intervals
|
||||
|
||||
## Use Cases
|
||||
|
||||
### Regulatory Element Analysis
|
||||
|
||||
Identify overlap between genomic features:
|
||||
|
||||
```python
|
||||
# Find transcription factor binding sites overlapping promoters
|
||||
tfbs = gtars.RegionSet.from_bed("chip_seq_peaks.bed")
|
||||
promoters = gtars.RegionSet.from_bed("promoters.bed")
|
||||
|
||||
overlapping_tfbs = tfbs.filter_overlapping(promoters)
|
||||
print(f"Found {len(overlapping_tfbs)} TFBS in promoters")
|
||||
```
|
||||
|
||||
### Variant Annotation
|
||||
|
||||
Annotate variants with overlapping features:
|
||||
|
||||
```python
|
||||
# Check which variants overlap with coding regions
|
||||
variants = gtars.RegionSet.from_bed("variants.bed")
|
||||
cds = gtars.RegionSet.from_bed("coding_sequences.bed")
|
||||
|
||||
coding_variants = variants.filter_overlapping(cds)
|
||||
```
|
||||
|
||||
### Chromatin State Analysis
|
||||
|
||||
Compare chromatin states across samples:
|
||||
|
||||
```python
|
||||
# Find regions with consistent chromatin states
|
||||
sample1 = gtars.RegionSet.from_bed("sample1_peaks.bed")
|
||||
sample2 = gtars.RegionSet.from_bed("sample2_peaks.bed")
|
||||
|
||||
consistent_regions = sample1.overlap(sample2)
|
||||
```
|
||||
Reference in New Issue
Block a user