Initial commit
222
skills/gtars/references/cli.md
Normal file
@@ -0,0 +1,222 @@
# Command-Line Interface

Gtars provides a comprehensive CLI for genomic interval analysis directly from the terminal.

## Installation

```bash
# Install with all features
cargo install gtars-cli --features "uniwig overlaprs igd bbcache scoring fragsplit"

# Install specific features only
cargo install gtars-cli --features "uniwig overlaprs"
```

## Global Options

```bash
# Display help
gtars --help

# Show version
gtars --version

# Verbose output
gtars --verbose <command>

# Quiet mode
gtars --quiet <command>
```

## IGD Commands

Build and query IGD indexes for overlap detection:

```bash
# Build IGD index
gtars igd build --input regions.bed --output regions.igd

# Query single region
gtars igd query --index regions.igd --region "chr1:1000-2000"

# Query from file
gtars igd query --index regions.igd --query-file queries.bed --output results.bed

# Count overlaps
gtars igd count --index regions.igd --query-file queries.bed
```
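
The `--region` argument uses the common `chrom:start-end` notation. As a rough illustration of what such a string encodes, the sketch below splits it into its three parts. `parse_region` is a hypothetical helper written for this example and is not part of gtars; how gtars itself validates region strings is not documented here.

```python
def parse_region(region: str) -> tuple[str, int, int]:
    """Split 'chrom:start-end' into (chrom, start, end)."""
    # rpartition keeps chromosome names that themselves contain ':' intact.
    chrom, _, span = region.rpartition(":")
    start_text, _, end_text = span.partition("-")
    # Thousands separators such as '1,000' are tolerated (an assumption).
    start = int(start_text.replace(",", ""))
    end = int(end_text.replace(",", ""))
    if not chrom or start < 0 or end <= start:
        raise ValueError(f"invalid region: {region!r}")
    return chrom, start, end


print(parse_region("chr1:1000-2000"))
```

Rejecting empty or reversed intervals early keeps malformed queries from reaching the index.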

## Overlap Commands

Compute overlaps between genomic region sets:

```bash
# Find overlapping regions
gtars overlaprs overlap --set-a regions_a.bed --set-b regions_b.bed --output overlaps.bed

# Count overlaps
gtars overlaprs count --set-a regions_a.bed --set-b regions_b.bed

# Filter regions by overlap
gtars overlaprs filter --input regions.bed --filter overlapping.bed --output filtered.bed

# Subtract regions
gtars overlaprs subtract --set-a regions_a.bed --set-b regions_b.bed --output difference.bed
```

## Uniwig Commands

Generate coverage tracks from genomic intervals:

```bash
# Generate coverage track
gtars uniwig generate --input fragments.bed --output coverage.wig

# Specify resolution
gtars uniwig generate --input fragments.bed --output coverage.wig --resolution 10

# Generate BigWig
gtars uniwig generate --input fragments.bed --output coverage.bw --format bigwig

# Strand-specific coverage
gtars uniwig generate --input fragments.bed --output forward.wig --strand +
```

## BBCache Commands

Cache and manage BED files from BEDbase.org:

```bash
# Cache BED file from bedbase
gtars bbcache fetch --id <bedbase_id> --output cached.bed

# List cached files
gtars bbcache list

# Clear cache
gtars bbcache clear

# Update cache
gtars bbcache update
```

## Scoring Commands

Score fragment overlaps against reference datasets:

```bash
# Score fragments
gtars scoring score --fragments fragments.bed --reference reference.bed --output scores.txt

# Batch scoring
gtars scoring batch --fragments-dir ./fragments/ --reference reference.bed --output-dir ./scores/

# Score with weights
gtars scoring score --fragments fragments.bed --reference reference.bed --weights weights.txt --output scores.txt
```

## FragSplit Commands

Split fragment files by cell barcodes or clusters:

```bash
# Split by barcode
gtars fragsplit split --input fragments.tsv --barcodes barcodes.txt --output-dir ./split/

# Split by clusters
gtars fragsplit cluster-split --input fragments.tsv --clusters clusters.txt --output-dir ./clustered/

# Filter fragments
gtars fragsplit filter --input fragments.tsv --min-fragments 100 --output filtered.tsv
```

## Common Workflows

### Workflow 1: Overlap Analysis Pipeline

```bash
# Step 1: Build IGD index for reference
gtars igd build --input reference_regions.bed --output reference.igd

# Step 2: Query with experimental data
gtars igd query --index reference.igd --query-file experimental.bed --output overlaps.bed

# Step 3: Generate statistics
gtars overlaprs count --set-a experimental.bed --set-b reference_regions.bed
```

### Workflow 2: Coverage Track Generation

```bash
# Step 1: Generate coverage
gtars uniwig generate --input fragments.bed --output coverage.wig --resolution 10

# Step 2: Produce BigWig output (regenerated from the fragments; this does not convert coverage.wig)
gtars uniwig generate --input fragments.bed --output coverage.bw --format bigwig
```

### Workflow 3: Fragment Processing

```bash
# Step 1: Filter fragments
gtars fragsplit filter --input raw_fragments.tsv --min-fragments 100 --output filtered.tsv

# Step 2: Split by clusters
gtars fragsplit cluster-split --input filtered.tsv --clusters clusters.txt --output-dir ./by_cluster/

# Step 3: Score against reference
gtars scoring batch --fragments-dir ./by_cluster/ --reference reference.bed --output-dir ./scores/
```

## Input/Output Formats

### BED Format
Standard 3-column or extended BED format:
```
chr1 1000 2000
chr1 3000 4000
chr2 5000 6000
```
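
BED coordinates are 0-based and half-open, so a region's length is simply `end - start`. The sketch below reads the first three columns of BED text with the standard library only; `read_bed3` is a hypothetical helper for illustration, not a gtars function.

```python
import io


def read_bed3(handle):
    """Yield (chrom, start, end) from BED text, skipping headers and blank lines."""
    for line in handle:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        # Only the first three columns are required; extra columns are ignored.
        chrom, start, end = line.split()[:3]
        yield chrom, int(start), int(end)


bed_text = "chr1\t1000\t2000\nchr1\t3000\t4000\nchr2\t5000\t6000\n"
regions = list(read_bed3(io.StringIO(bed_text)))
total_bp = sum(end - start for _, start, end in regions)
print(regions, total_bp)
```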

### Fragment Format (TSV)
Tab-separated format for single-cell fragments:
```
chr1 1000 2000 BARCODE1
chr1 3000 4000 BARCODE2
chr2 5000 6000 BARCODE1
```
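
To make the barcode column concrete, here is a minimal sketch of grouping fragments by their fourth column, which is the idea behind splitting a fragment file per cell. It is plain Python under the assumption that the barcode is always column four; it does not reproduce what `gtars fragsplit` does internally.

```python
from collections import defaultdict


def group_by_barcode(lines):
    """Map each barcode to the list of (chrom, start, end) fragments carrying it."""
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue  # skip malformed or header lines
        chrom, start, end, barcode = fields[:4]
        groups[barcode].append((chrom, int(start), int(end)))
    return dict(groups)


fragments = [
    "chr1\t1000\t2000\tBARCODE1",
    "chr1\t3000\t4000\tBARCODE2",
    "chr2\t5000\t6000\tBARCODE1",
]
by_barcode = group_by_barcode(fragments)
for barcode, frags in by_barcode.items():
    print(barcode, len(frags))
```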

### WIG Format
Wiggle format for coverage tracks:
```
fixedStep chrom=chr1 start=1000 step=10
12
15
18
```
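
In a `fixedStep` block, the declaration line gives the first position (1-based in the WIG format) and each following value applies `step` bases further along. The sketch below expands such a block into explicit positions; `expand_fixed_step` is a hypothetical helper, and it deliberately ignores `span` and `variableStep` blocks.

```python
def expand_fixed_step(lines):
    """Yield (chrom, position, value) for fixedStep WIG lines; positions are 1-based."""
    chrom, position, step = None, 0, 1
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track")):
            continue
        if line.startswith("fixedStep"):
            # Parse key=value pairs such as chrom=chr1 start=1000 step=10.
            fields = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = fields["chrom"]
            position = int(fields["start"])
            step = int(fields.get("step", 1))
        elif chrom is not None:
            yield chrom, position, float(line)
            position += step


wig = ["fixedStep chrom=chr1 start=1000 step=10", "12", "15", "18"]
points = list(expand_fixed_step(wig))
print(points)
```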

## Performance Options

```bash
# Set thread count
gtars --threads 8 <command>

# Memory limit
gtars --memory-limit 4G <command>

# Buffer size
gtars --buffer-size 10000 <command>
```

## Error Handling

```bash
# Continue on errors
gtars --continue-on-error <command>

# Strict mode (fail on warnings)
gtars --strict <command>

# Log to file
gtars --log-file output.log <command>
```
172
skills/gtars/references/coverage.md
Normal file
@@ -0,0 +1,172 @@
# Coverage Analysis with Uniwig

The uniwig module generates coverage tracks from sequencing data, providing efficient conversion of genomic intervals to coverage profiles.

## Coverage Track Generation

Create coverage tracks from BED files:

```python
import gtars

# Generate coverage from BED file
coverage = gtars.uniwig.coverage_from_bed("fragments.bed")

# Generate coverage with specific resolution
coverage = gtars.uniwig.coverage_from_bed("fragments.bed", resolution=10)

# Generate strand-specific coverage
fwd_coverage = gtars.uniwig.coverage_from_bed("fragments.bed", strand="+")
rev_coverage = gtars.uniwig.coverage_from_bed("fragments.bed", strand="-")
```
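
Conceptually, a coverage track counts how many intervals cover each base. The sketch below computes that with a difference array and a running sum. It is a standalone illustration in plain Python; `per_base_depth` is a hypothetical name and this is not a description of how uniwig is implemented.

```python
from itertools import accumulate


def per_base_depth(intervals, length):
    """Count how many half-open [start, end) intervals cover each base in [0, length)."""
    delta = [0] * (length + 1)
    for start, end in intervals:
        # Clamp to the chromosome bounds so out-of-range input cannot index past the array.
        start = min(max(start, 0), length)
        end = min(max(end, start), length)
        delta[start] += 1  # depth rises where an interval begins
        delta[end] -= 1    # and falls just past its last base
    return list(accumulate(delta[:-1]))


depth = per_base_depth([(2, 6), (4, 8)], length=10)
print(depth)
```

Each interval costs two array updates regardless of its length, so the total work is proportional to the number of intervals plus the chromosome length.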

## CLI Usage

Generate coverage tracks from command line:

```bash
# Generate coverage track
gtars uniwig generate --input fragments.bed --output coverage.wig

# Specify resolution
gtars uniwig generate --input fragments.bed --output coverage.wig --resolution 10

# Generate BigWig format
gtars uniwig generate --input fragments.bed --output coverage.bw --format bigwig

# Strand-specific coverage
gtars uniwig generate --input fragments.bed --output forward.wig --strand +
gtars uniwig generate --input fragments.bed --output reverse.wig --strand -
```

## Working with Coverage Data

### Accessing Coverage Values

Query coverage at specific positions:

```python
# Get coverage at position
cov = coverage.get_coverage("chr1", 1000)

# Get coverage over range
cov_array = coverage.get_coverage_range("chr1", 1000, 2000)

# Get coverage statistics
mean_cov = coverage.mean_coverage("chr1", 1000, 2000)
max_cov = coverage.max_coverage("chr1", 1000, 2000)
```

### Coverage Operations

Perform operations on coverage tracks:

```python
# Normalize coverage
normalized = coverage.normalize()

# Smooth coverage
smoothed = coverage.smooth(window_size=10)

# Combine coverage tracks
combined = coverage1.add(coverage2)

# Compute coverage difference
diff = coverage1.subtract(coverage2)
```

## Output Formats

Uniwig supports multiple output formats:

### WIG Format

Standard wiggle format:
```
fixedStep chrom=chr1 start=1000 step=1
12
15
18
22
...
```

### BigWig Format

Binary format for efficient storage and access:
```bash
# Generate BigWig
gtars uniwig generate --input fragments.bed --output coverage.bw --format bigwig
```

### BedGraph Format

Flexible format for variable coverage:
```
chr1 1000 1001 12
chr1 1001 1002 15
chr1 1002 1003 18
```
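
BedGraph rows use 0-based, half-open coordinates and may span several bases, so runs of identical values can be collapsed into a single row. The sketch below shows that run-length step on a list of per-base values; `to_bedgraph` is a hypothetical helper, and dropping zero-valued runs is a choice made for this example.

```python
from itertools import groupby


def to_bedgraph(chrom, values, offset=0):
    """Collapse per-base values into (chrom, start, end, value) runs, skipping zeros."""
    rows, position = [], offset
    for value, run in groupby(values):
        length = sum(1 for _ in run)
        if value != 0:
            rows.append((chrom, position, position + length, value))
        position += length
    return rows


rows = to_bedgraph("chr1", [12, 12, 15, 0, 0, 18], offset=1000)
for row in rows:
    print(*row, sep="\t")
```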

## Use Cases

### ATAC-seq Analysis

Generate chromatin accessibility profiles:

```python
# Generate ATAC-seq coverage
atac_fragments = gtars.RegionSet.from_bed("atac_fragments.bed")
coverage = gtars.uniwig.coverage_from_bed("atac_fragments.bed", resolution=1)

# Identify accessible regions
peaks = coverage.call_peaks(threshold=10)
```

### ChIP-seq Peak Visualization

Create coverage tracks for ChIP-seq data:

```bash
# Generate coverage for visualization
gtars uniwig generate --input chip_seq_fragments.bed \
    --output chip_coverage.bw \
    --format bigwig
```

### RNA-seq Coverage

Compute read coverage for RNA-seq:

```python
# Generate strand-specific RNA-seq coverage
fwd = gtars.uniwig.coverage_from_bed("rnaseq.bed", strand="+")
rev = gtars.uniwig.coverage_from_bed("rnaseq.bed", strand="-")

# Export for IGV
fwd.to_bigwig("rnaseq_fwd.bw")
rev.to_bigwig("rnaseq_rev.bw")
```

### Differential Coverage Analysis

Compare coverage between samples:

```python
# Generate coverage for two samples
control = gtars.uniwig.coverage_from_bed("control.bed")
treatment = gtars.uniwig.coverage_from_bed("treatment.bed")

# Compute fold change
fold_change = treatment.divide(control)

# Find differential regions
diff_regions = fold_change.find_regions(threshold=2.0)
```
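
A plain ratio is undefined wherever the control coverage is zero. How `divide` treats those positions is not stated here, so the sketch below shows one common convention instead: adding a pseudocount before taking a log2 ratio. The function name and the default pseudocount of 1.0 are assumptions for illustration, not gtars behavior.

```python
import math


def log2_fold_change(treatment, control, pseudocount=1.0):
    """Per-position log2((treatment + p) / (control + p)) for equal-length tracks."""
    return [
        math.log2((t + pseudocount) / (c + pseudocount))
        for t, c in zip(treatment, control)
    ]


lfc = log2_fold_change([3, 0, 7], [1, 0, 0])
print(lfc)
```

On the log2 scale, a value of 1.0 corresponds to a two-fold increase, matching a ratio threshold of 2.0.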

## Performance Optimization

- Use appropriate resolution for data scale
- BigWig format recommended for large datasets
- Parallel processing available for multiple chromosomes
- Memory-efficient streaming for large files
156
skills/gtars/references/overlap.md
Normal file
@@ -0,0 +1,156 @@
# Overlap Detection and IGD

The overlaprs module provides efficient overlap detection between genomic intervals using the Integrated Genome Database (IGD) data structure.

## IGD Index

IGD (Integrated Genome Database) is a specialized data structure for fast genomic interval queries and overlap detection.

### Building an IGD Index

Create indexes from genomic region files:

```python
import gtars

# Build IGD index from BED file
igd = gtars.igd.build_index("regions.bed")

# Save index for reuse
igd.save("regions.igd")

# Load existing index
igd = gtars.igd.load_index("regions.igd")
```

### Querying Overlaps

Find overlapping regions efficiently:

```python
# Query a single region
overlaps = igd.query("chr1", 1000, 2000)

# Query multiple regions
results = []
for chrom, start, end in query_regions:
    overlaps = igd.query(chrom, start, end)
    results.append(overlaps)

# Get overlap counts only
count = igd.count_overlaps("chr1", 1000, 2000)
```

## CLI Usage

The overlaprs command-line tool provides overlap detection:

```bash
# Find overlaps between two BED files
gtars overlaprs query --index regions.bed --query query_regions.bed

# Count overlaps
gtars overlaprs count --index regions.bed --query query_regions.bed

# Output overlapping regions
gtars overlaprs overlap --index regions.bed --query query_regions.bed --output overlaps.bed
```

### IGD CLI Commands

Build and query IGD indexes:

```bash
# Build IGD index
gtars igd build --input regions.bed --output regions.igd

# Query IGD index
gtars igd query --index regions.igd --region "chr1:1000-2000"

# Batch query from file
gtars igd query --index regions.igd --query-file queries.bed --output results.bed
```

## Python API

### Overlap Detection

Compute overlaps between region sets:

```python
import gtars

# Load two region sets
set_a = gtars.RegionSet.from_bed("regions_a.bed")
set_b = gtars.RegionSet.from_bed("regions_b.bed")

# Find overlaps
overlaps = set_a.overlap(set_b)

# Get regions from A that overlap with B
overlapping_a = set_a.filter_overlapping(set_b)

# Get regions from A that don't overlap with B
non_overlapping_a = set_a.filter_non_overlapping(set_b)
```

### Overlap Statistics

Calculate overlap metrics:

```python
# Count overlaps
overlap_count = set_a.count_overlaps(set_b)

# Calculate overlap fraction
overlap_fraction = set_a.overlap_fraction(set_b)

# Get overlap coverage
coverage = set_a.overlap_coverage(set_b)
```

## Performance Characteristics

IGD provides efficient querying:
- **Index construction**: O(n log n) where n is number of regions
- **Query time**: O(k + log n) where k is number of overlaps
- **Memory efficient**: Compact representation of genomic intervals
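
To make the overlap test itself concrete, the sketch below indexes intervals on one chromosome by sorted start position and answers queries with a binary search followed by a filter. It is deliberately simpler than IGD and is not its algorithm: the filter step scans every interval that starts before the query end, so it does not achieve the query bound quoted above. `SortedIntervals` is a hypothetical class for illustration only.

```python
from bisect import bisect_left


class SortedIntervals:
    """Half-open [start, end) intervals on a single chromosome, sorted by start."""

    def __init__(self, intervals):
        self.intervals = sorted(intervals)
        self.starts = [start for start, _ in self.intervals]

    def query(self, start, end):
        """Return stored intervals that overlap [start, end)."""
        # Only intervals that begin before the query ends can overlap it.
        upper = bisect_left(self.starts, end)
        # Of those, keep the ones that end after the query begins.
        return [(s, e) for s, e in self.intervals[:upper] if e > start]


index = SortedIntervals([(100, 200), (150, 300), (400, 500)])
print(index.query(180, 410))
print(index.query(200, 400))
```

Two half-open intervals overlap exactly when each one starts before the other ends, which is the condition the two steps implement together.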

## Use Cases

### Regulatory Element Analysis

Identify overlap between genomic features:

```python
# Find transcription factor binding sites overlapping promoters
tfbs = gtars.RegionSet.from_bed("chip_seq_peaks.bed")
promoters = gtars.RegionSet.from_bed("promoters.bed")

overlapping_tfbs = tfbs.filter_overlapping(promoters)
print(f"Found {len(overlapping_tfbs)} TFBS in promoters")
```

### Variant Annotation

Annotate variants with overlapping features:

```python
# Check which variants overlap with coding regions
variants = gtars.RegionSet.from_bed("variants.bed")
cds = gtars.RegionSet.from_bed("coding_sequences.bed")

coding_variants = variants.filter_overlapping(cds)
```

### Chromatin State Analysis

Compare chromatin states across samples:

```python
# Find regions with consistent chromatin states
sample1 = gtars.RegionSet.from_bed("sample1_peaks.bed")
sample2 = gtars.RegionSet.from_bed("sample2_peaks.bed")

consistent_regions = sample1.overlap(sample2)
```
211
skills/gtars/references/python-api.md
Normal file
@@ -0,0 +1,211 @@
# Python API Reference

Comprehensive reference for gtars Python bindings.

## Installation

```bash
# Install gtars Python package
uv pip install gtars

# Or with pip
pip install gtars
```

## Core Classes

### RegionSet

Manage collections of genomic intervals:

```python
import gtars

# Create from BED file
regions = gtars.RegionSet.from_bed("regions.bed")

# Create from coordinates
regions = gtars.RegionSet([
    ("chr1", 1000, 2000),
    ("chr1", 3000, 4000),
    ("chr2", 5000, 6000)
])

# Access regions
for region in regions:
    print(f"{region.chromosome}:{region.start}-{region.end}")

# Get region count
num_regions = len(regions)

# Get total coverage
total_coverage = regions.total_coverage()
```

### Region Operations

Perform operations on region sets:

```python
# Sort regions
sorted_regions = regions.sort()

# Merge overlapping regions
merged = regions.merge()

# Filter by size
large_regions = regions.filter_by_size(min_size=1000)

# Filter by chromosome
chr1_regions = regions.filter_by_chromosome("chr1")
```

### Set Operations

Perform set operations on genomic regions:

```python
# Load two region sets
set_a = gtars.RegionSet.from_bed("set_a.bed")
set_b = gtars.RegionSet.from_bed("set_b.bed")

# Union
union = set_a.union(set_b)

# Intersection
intersection = set_a.intersect(set_b)

# Difference
difference = set_a.subtract(set_b)

# Symmetric difference
sym_diff = set_a.symmetric_difference(set_b)
```
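
For intuition about what set operations mean at base-pair level, the sketch below implements merge, union and intersection on `(start, end)` tuples for a single chromosome. It is plain Python for illustration; whether the `RegionSet` methods operate per base pair or per region is not specified here, so treat the semantics shown as an assumption.

```python
def merge(intervals):
    """Merge overlapping or touching half-open intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(item) for item in merged]


def union(a, b):
    """Bases covered by either set."""
    return merge(list(a) + list(b))


def intersect(a, b):
    """Bases covered by both sets, found with a two-pointer sweep."""
    a, b = merge(a), merge(b)
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


set_a = [(1000, 2000), (3000, 4000)]
set_b = [(1500, 3500)]
print(union(set_a, set_b))
print(intersect(set_a, set_b))
```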

## Data Export

### Writing BED Files

Export regions to BED format:

```python
# Write to BED file
regions.to_bed("output.bed")

# Write with scores
regions.to_bed("output.bed", scores=score_array)

# Write with names
regions.to_bed("output.bed", names=name_list)
```

### Format Conversion

Convert between formats:

```python
# BED to JSON
regions = gtars.RegionSet.from_bed("input.bed")
regions.to_json("output.json")

# JSON to BED
regions = gtars.RegionSet.from_json("input.json")
regions.to_bed("output.bed")
```

## NumPy Integration

Seamless integration with NumPy arrays:

```python
import numpy as np

# Export to NumPy arrays
starts = regions.starts_array()  # NumPy array of start positions
ends = regions.ends_array()      # NumPy array of end positions
sizes = regions.sizes_array()    # NumPy array of region sizes

# Create from NumPy arrays
chromosomes = ["chr1"] * len(starts)
regions = gtars.RegionSet.from_arrays(chromosomes, starts, ends)
```

## Parallel Processing

Leverage parallel processing for large datasets:

```python
# Enable parallel processing
regions = gtars.RegionSet.from_bed("large_file.bed", parallel=True)

# Parallel operations
result = regions.parallel_apply(custom_function)
```

## Memory Management

Efficient memory usage for large datasets:

```python
# Stream large BED files
for chunk in gtars.RegionSet.stream_bed("large_file.bed", chunk_size=10000):
    process_chunk(chunk)

# Memory-mapped mode
regions = gtars.RegionSet.from_bed("large_file.bed", mmap=True)
```

## Error Handling

Handle common errors:

```python
try:
    regions = gtars.RegionSet.from_bed("file.bed")
except gtars.FileNotFoundError:
    print("File not found")
except gtars.InvalidFormatError as e:
    print(f"Invalid BED format: {e}")
except gtars.ParseError as e:
    print(f"Parse error at line {e.line}: {e.message}")
```

## Configuration

Configure gtars behavior:

```python
# Set global options
gtars.set_option("parallel.threads", 4)
gtars.set_option("memory.limit", "4GB")
gtars.set_option("warnings.strict", True)

# Context manager for temporary options
with gtars.option_context("parallel.threads", 8):
    # Use 8 threads for this block
    regions = gtars.RegionSet.from_bed("large_file.bed", parallel=True)
```

## Logging

Enable logging for debugging:

```python
import logging

# Enable gtars logging
gtars.set_log_level("DEBUG")

# Or use Python logging
logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("gtars")
```

## Performance Tips

- Use parallel processing for large datasets
- Enable memory-mapped mode for very large files
- Stream data when possible to reduce memory usage
- Pre-sort regions before operations when applicable
- Use NumPy arrays for numerical computations
- Cache frequently accessed data
147
skills/gtars/references/refget.md
Normal file
@@ -0,0 +1,147 @@
# Reference Sequence Management

The refget module handles reference sequence retrieval and digest computation, following the refget protocol for sequence identification.

## RefgetStore

RefgetStore manages reference sequences and their digests:

```python
import gtars

# Create RefgetStore
store = gtars.RefgetStore()

# Add sequence
store.add_sequence("chr1", sequence_data)

# Retrieve sequence
seq = store.get_sequence("chr1")

# Get sequence digest
digest = store.get_digest("chr1")
```

## Sequence Digests

Compute and verify sequence digests:

```python
# Compute digest for sequence
from gtars.refget import compute_digest

digest = compute_digest(sequence_data)

# Verify digest matches
is_valid = store.verify_digest("chr1", expected_digest)
```

## Integration with Reference Genomes

Work with standard reference genomes:

```python
# Load reference genome
store = gtars.RefgetStore.from_fasta("hg38.fa")

# Get chromosome sequences
chr1 = store.get_sequence("chr1")
chr2 = store.get_sequence("chr2")

# Get subsequence
region_seq = store.get_subsequence("chr1", 1000, 2000)
```

## CLI Usage

Manage reference sequences from command line:

```bash
# Compute digest for FASTA file
gtars refget digest --input genome.fa --output digests.txt

# Verify sequence digest
gtars refget verify --sequence sequence.fa --digest expected_digest
```

## Refget Protocol Compliance

The refget module follows the GA4GH refget protocol:

### Digest Computation

Digests are computed with the GA4GH `sha512t24u` scheme: SHA-512 truncated to the first 24 bytes, then base64url-encoded to a 32-character string:

```python
# Compute refget-compliant digest
digest = gtars.refget.compute_digest(sequence)
# Returns: "SQ.abc123..."
```
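
The truncated-digest scheme can be reproduced with the Python standard library alone, which is handy for cross-checking a value independently of gtars. The sketch below assumes the sequence is plain ASCII and uppercases it before hashing; the helper name `sha512t24u` mirrors the GA4GH algorithm name and is not a gtars function.

```python
import base64
import hashlib


def sha512t24u(sequence: str) -> str:
    """SHA-512 of the sequence, truncated to 24 bytes and base64url-encoded."""
    # Assumption: sequences are normalized to uppercase ASCII before hashing.
    digest = hashlib.sha512(sequence.upper().encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest[:24]).decode("ascii")


identifier = "SQ." + sha512t24u("ACGT")
print(identifier)
```

Because 24 bytes is a multiple of three, the base64url output is always exactly 32 characters with no `=` padding.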

### Sequence Retrieval

Retrieve sequences by digest:

```python
# Get sequence by refget digest
seq = store.get_sequence_by_digest("SQ.abc123...")
```

## Use Cases

### Reference Validation

Verify reference genome integrity:

```python
# Compute digests for reference
store = gtars.RefgetStore.from_fasta("reference.fa")
digests = {chrom: store.get_digest(chrom) for chrom in store.chromosomes}

# Compare with expected digests
for chrom, expected in expected_digests.items():
    actual = digests[chrom]
    if actual != expected:
        print(f"Mismatch for {chrom}: {actual} != {expected}")
```

### Sequence Extraction

Extract specific genomic regions:

```python
# Extract regions of interest
store = gtars.RefgetStore.from_fasta("hg38.fa")

regions = [
    ("chr1", 1000, 2000),
    ("chr2", 5000, 6000),
    ("chr3", 10000, 11000)
]

sequences = [store.get_subsequence(c, s, e) for c, s, e in regions]
```

### Cross-Reference Comparison

Compare sequences across different references:

```python
# Load two reference versions
hg19 = gtars.RefgetStore.from_fasta("hg19.fa")
hg38 = gtars.RefgetStore.from_fasta("hg38.fa")

# Compare digests
for chrom in hg19.chromosomes:
    digest_19 = hg19.get_digest(chrom)
    digest_38 = hg38.get_digest(chrom)
    if digest_19 != digest_38:
        print(f"{chrom} differs between hg19 and hg38")
```

## Performance Notes

- Sequences loaded on demand
- Digests cached after computation
- Efficient subsequence extraction
- Memory-mapped file support for large genomes
103
skills/gtars/references/tokenizers.md
Normal file
@@ -0,0 +1,103 @@
# Genomic Tokenizers

Tokenizers convert genomic regions into discrete tokens for machine learning applications, particularly useful for training genomic deep learning models.

## Python API

### Creating a Tokenizer

Load tokenizer configurations from various sources:

```python
import gtars

# From BED file
tokenizer = gtars.tokenizers.TreeTokenizer.from_bed_file("regions.bed")

# From configuration file
tokenizer = gtars.tokenizers.TreeTokenizer.from_config("tokenizer_config.yaml")

# From region string
tokenizer = gtars.tokenizers.TreeTokenizer.from_region_string("chr1:1000-2000")
```

### Tokenizing Genomic Regions

Convert genomic coordinates to tokens:

```python
# Tokenize a single region
token = tokenizer.tokenize("chr1", 1000, 2000)

# Tokenize multiple regions
tokens = []
for chrom, start, end in regions:
    token = tokenizer.tokenize(chrom, start, end)
    tokens.append(token)
```
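
One way to picture region tokenization is as a lookup against a fixed vocabulary of regions, where a query is represented by the IDs of the vocabulary regions it overlaps. The sketch below does this with a linear scan in plain Python. It is only a conceptual illustration under that assumption; `build_vocab` and `tokenize` are hypothetical helpers, and `TreeTokenizer` may behave differently.

```python
def build_vocab(universe):
    """Assign consecutive integer IDs to the vocabulary regions."""
    return {region: token_id for token_id, region in enumerate(universe)}


def tokenize(vocab, chrom, start, end):
    """Return IDs of vocabulary regions overlapping the half-open query [start, end)."""
    return [
        token_id
        for (u_chrom, u_start, u_end), token_id in vocab.items()
        if u_chrom == chrom and u_start < end and start < u_end
    ]


universe = [("chr1", 0, 1000), ("chr1", 1000, 2000), ("chr2", 0, 1000)]
vocab = build_vocab(universe)
ids = tokenize(vocab, "chr1", 900, 1100)
print(ids)
```

A query that overlaps no vocabulary region yields an empty list; real tokenizers typically map that case to a dedicated unknown token.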

### Token Properties

Access token information:

```python
# Get token ID
token_id = token.id

# Get genomic coordinates
chrom = token.chromosome
start = token.start
end = token.end

# Get token metadata
metadata = token.metadata
```

## Use Cases

### Machine Learning Preprocessing

Tokenizers are essential for preparing genomic data for ML models:

1. **Sequence modeling**: Convert genomic intervals into discrete tokens for transformer models
2. **Position encoding**: Create consistent positional encodings across datasets
3. **Data augmentation**: Generate alternative tokenizations for training

### Integration with geniml

The tokenizers module integrates seamlessly with the geniml library for genomic ML:

```python
# Tokenize regions for geniml
from gtars.tokenizers import TreeTokenizer
import geniml

tokenizer = TreeTokenizer.from_bed_file("training_regions.bed")
tokens = [tokenizer.tokenize(r.chrom, r.start, r.end) for r in regions]

# Use tokens in geniml models
model = geniml.Model(vocab_size=tokenizer.vocab_size)
```

## Configuration Format

Tokenizer configuration files support YAML format:

```yaml
# tokenizer_config.yaml
type: tree
resolution: 1000  # Token resolution in base pairs
chromosomes:
  - chr1
  - chr2
  - chr3
options:
  overlap_handling: merge
  gap_threshold: 100
```

## Performance Considerations

- TreeTokenizer uses efficient data structures for fast tokenization
- Batch tokenization is recommended for large datasets
- Pre-loading tokenizers reduces overhead for repeated operations