Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

@@ -0,0 +1,103 @@
# Genomic Tokenizers
Tokenizers convert genomic regions into discrete tokens for machine learning applications. They are particularly useful when training genomic deep learning models.
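To make the idea of "regions to discrete tokens" concrete, here is a minimal pure-Python sketch. It is not the gtars implementation: it assumes the vocabulary is a sorted list of non-overlapping regions, that a token ID is simply a region's position in that list, and that a query may map to several tokens. The names `build_index` and `tokenize_region` are hypothetical.

```python
from bisect import bisect_left, bisect_right


def build_index(universe):
    """Group (chrom, start, end) regions by chromosome.

    Assumes `universe` is sorted by start within each chromosome and that
    regions do not overlap. A region's token ID is its position in the list.
    """
    index = {}
    for token_id, (chrom, start, end) in enumerate(universe):
        starts, ends, ids = index.setdefault(chrom, ([], [], []))
        starts.append(start)
        ends.append(end)
        ids.append(token_id)
    return index


def tokenize_region(index, chrom, start, end):
    """Return IDs of vocabulary regions overlapping the half-open [start, end)."""
    if chrom not in index:
        return []
    starts, ends, ids = index[chrom]
    # First vocabulary region whose end lies strictly after the query start.
    lo = bisect_right(ends, start)
    # First vocabulary region whose start is at or beyond the query end.
    hi = bisect_left(starts, end)
    return ids[lo:hi]


universe = [
    ("chr1", 0, 1000),
    ("chr1", 1000, 2000),
    ("chr1", 5000, 6000),
    ("chr2", 0, 500),
]
index = build_index(universe)
print(tokenize_region(index, "chr1", 500, 1500))   # [0, 1]
print(tokenize_region(index, "chr1", 2500, 3000))  # []
```

Binary search keeps each lookup logarithmic in the number of regions per chromosome, which is one reason sorted or tree-based structures are a common choice for this task.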
## Python API
### Creating a Tokenizer
Create a tokenizer from a BED file, a configuration file, or a region string:
```python
import gtars
# From BED file
tokenizer = gtars.tokenizers.TreeTokenizer.from_bed_file("regions.bed")
# From configuration file
tokenizer = gtars.tokenizers.TreeTokenizer.from_config("tokenizer_config.yaml")
# From region string
tokenizer = gtars.tokenizers.TreeTokenizer.from_region_string("chr1:1000-2000")
```
### Tokenizing Genomic Regions
Convert genomic coordinates to tokens:
```python
# Tokenize a single region
token = tokenizer.tokenize("chr1", 1000, 2000)

# Tokenize multiple regions (example input; replace with your own data)
regions = [("chr1", 1000, 2000), ("chr2", 500, 1500)]
tokens = []
for chrom, start, end in regions:
    token = tokenizer.tokenize(chrom, start, end)
    tokens.append(token)
```
### Token Properties
Access token information:
```python
# Get token ID
token_id = token.id
# Get genomic coordinates
chrom = token.chromosome
start = token.start
end = token.end
# Get token metadata
metadata = token.metadata
```
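For testing downstream code without a tokenizer at hand, a stand-in object with the same attribute names can be useful. The sketch below is an assumption-laden mock, not the gtars token type: `RegionToken` and `to_bed_line` are hypothetical names, and the choice to write the token ID into the fourth BED column is illustrative only.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RegionToken:
    """Hypothetical stand-in mirroring the attributes listed above."""

    id: int
    chromosome: str
    start: int
    end: int
    metadata: dict = field(default_factory=dict)

    def to_bed_line(self) -> str:
        # Tab-separated BED-style line; the ID goes in the name column.
        return f"{self.chromosome}\t{self.start}\t{self.end}\t{self.id}"


token = RegionToken(id=7, chromosome="chr1", start=1000, end=2000)
print(token.to_bed_line())
```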
## Use Cases
### Machine Learning Preprocessing
Tokenizers are essential for preparing genomic data for ML models:
1. **Sequence modeling**: Convert genomic intervals into discrete tokens for transformer models
2. **Position encoding**: Create consistent positional encodings across datasets
3. **Data augmentation**: Generate alternative tokenizations for training
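For the sequence-modeling case, token IDs from different samples usually have to be brought to a common length before they can be batched for a transformer. The following is a minimal sketch of that step in plain Python. The function name `pad_batch` and the choice of padding ID are assumptions for illustration and are not part of gtars or geniml.

```python
def pad_batch(sequences, pad_id, max_len=None):
    """Pad or truncate lists of token IDs to one length and build attention masks."""
    if max_len is None:
        max_len = max((len(seq) for seq in sequences), default=0)
    input_ids, attention_mask = [], []
    for seq in sequences:
        seq = list(seq)[:max_len]           # truncate sequences that are too long
        n_pad = max_len - len(seq)          # number of padding positions needed
        input_ids.append(seq + [pad_id] * n_pad)
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


ids, mask = pad_batch([[3, 1, 4], [1]], pad_id=0)
print(ids)   # [[3, 1, 4], [1, 0, 0]]
print(mask)  # [[1, 1, 1], [1, 0, 0]]
```

The attention mask marks real tokens with 1 and padding with 0, so a model can ignore padded positions.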
### Integration with geniml
The tokenizers module can be used together with the geniml library for genomic ML:
```python
from gtars.tokenizers import TreeTokenizer
import geniml

# Tokenize regions for geniml; `regions` is any iterable of objects
# exposing .chrom, .start, and .end attributes
tokenizer = TreeTokenizer.from_bed_file("training_regions.bed")
tokens = [tokenizer.tokenize(r.chrom, r.start, r.end) for r in regions]

# Use tokens in geniml models
model = geniml.Model(vocab_size=tokenizer.vocab_size)
```
## Configuration Format
Tokenizer configuration files are written in YAML:
```yaml
# tokenizer_config.yaml
type: tree
resolution: 1000 # Token resolution in base pairs
chromosomes:
  - chr1
  - chr2
  - chr3
options:
  overlap_handling: merge
  gap_threshold: 100
```
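Checking a configuration before handing it to a tokenizer can catch mistakes early. The sketch below assumes the file has already been parsed into a dictionary (for example with PyYAML's `yaml.safe_load`) and checks only the keys shown in the example. The function name `validate_tokenizer_config` and the specific rules are assumptions, not a schema documented by gtars.

```python
def validate_tokenizer_config(cfg):
    """Return a list of problems found in a parsed config dict (empty if none)."""
    problems = []
    if not isinstance(cfg.get("type"), str):
        problems.append("'type' must be a string such as 'tree'")
    resolution = cfg.get("resolution")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(resolution, bool) or not isinstance(resolution, int) or resolution <= 0:
        problems.append("'resolution' must be a positive integer (base pairs)")
    chromosomes = cfg.get("chromosomes")
    if not (isinstance(chromosomes, list) and chromosomes
            and all(isinstance(c, str) for c in chromosomes)):
        problems.append("'chromosomes' must be a non-empty list of names")
    options = cfg.get("options", {})
    if not isinstance(options, dict):
        problems.append("'options' must be a mapping")
    return problems


config = {
    "type": "tree",
    "resolution": 1000,
    "chromosomes": ["chr1", "chr2", "chr3"],
    "options": {"overlap_handling": "merge", "gap_threshold": 100},
}
print(validate_tokenizer_config(config))  # []
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.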
## Performance Considerations
- TreeTokenizer uses efficient data structures for fast tokenization
- For large datasets, tokenize regions in batches rather than one call at a time
- Load a tokenizer once and reuse it; constructing it again for every operation adds avoidable overhead
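The last two points can be sketched in plain Python. The helpers below are hypothetical (`chunked`, `make_cached_loader`, and `tokenize_in_batches` are not gtars functions): they cache whatever a loader returns so each path is loaded once, and they feed regions to a tokenize callable in fixed-size chunks. Whether gtars offers a native batch call is not stated here, so this only bounds memory use and load overhead on the Python side.

```python
from functools import lru_cache
from itertools import islice


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from `iterable`."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def make_cached_loader(load_fn):
    """Wrap a loader so each distinct path is loaded at most once per process."""
    return lru_cache(maxsize=None)(load_fn)


def tokenize_in_batches(tokenize_fn, regions, batch_size=10_000):
    """Apply tokenize_fn(chrom, start, end) to regions, one chunk at a time."""
    for batch in chunked(regions, batch_size):
        yield [tokenize_fn(chrom, start, end) for chrom, start, end in batch]
```

In practice `load_fn` would be whichever constructor is in use, and `tokenize_fn` would be the bound `tokenize` method of the loaded tokenizer.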