2025-11-30 08:30:10 +08:00

Region2Vec: Genomic Region Embeddings

Overview

Region2Vec generates unsupervised embeddings of genomic regions and region sets from BED files. It maps genomic regions to a vocabulary, creates sentences through concatenation, and applies word2vec training to learn meaningful representations.
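As a rough illustration of the "sentences" idea, the sketch below assumes (as the num_shufflings parameter suggests) that the tokens of each region set are shuffled before being concatenated into a sentence. The function name build_sentences and the chrom_start_end token spelling are hypothetical and chosen only for this example; they are not geniml's API.

```python
import random


def build_sentences(region_sets, num_shufflings, seed=0):
    """Turn each tokenized region set into shuffled 'sentences' of tokens.

    Hypothetical helper for illustration; not part of geniml.
    """
    rng = random.Random(seed)
    sentences = []
    for _ in range(num_shufflings):
        for tokens in region_sets:
            shuffled = list(tokens)  # copy so the input is left untouched
            rng.shuffle(shuffled)
            sentences.append(shuffled)
    return sentences


# Two toy region sets whose regions are already mapped to universe tokens.
region_sets = [
    ["chr1_100_200", "chr1_500_650", "chr2_10_90"],
    ["chr1_500_650", "chr3_40_120"],
]
sentences = build_sentences(region_sets, num_shufflings=2)
```

Each resulting sentence is a list of tokens, which is the input shape a word2vec-style trainer typically expects.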

When to Use

Use Region2Vec when working with:

  • BED file collections requiring dimensionality reduction
  • Genomic region similarity analysis
  • Downstream ML tasks requiring region feature vectors
  • Comparative analysis across multiple genomic datasets

Workflow

Step 1: Prepare Data

Collect the BED files in a single source folder. By default every file in that folder is tokenized; supply a file list to restrict the run to a subset. Also prepare a universe file, which defines the reference vocabulary of regions used for tokenization.
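If a file list is needed, something like the following could generate one. This is a minimal sketch: the helper name write_file_list is hypothetical, and the one-file-name-per-line layout is an assumption, so check the geniml documentation for the format your version expects.

```python
from pathlib import Path
import tempfile


def write_file_list(src_folder, list_path, suffixes=(".bed", ".bed.gz")):
    """Write the names of BED files found in src_folder, one per line.

    Hypothetical helper; the list format is an assumption.
    """
    names = sorted(
        p.name
        for p in Path(src_folder).iterdir()
        if p.is_file() and p.name.endswith(suffixes)
    )
    Path(list_path).write_text("\n".join(names) + ("\n" if names else ""))
    return names


# Demonstration on a throwaway directory with two BED files and one other file.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.bed", "a.bed", "notes.txt"):
        (Path(tmp) / name).write_text("chr1\t0\t100\n")
    listed = write_file_list(tmp, Path(tmp) / "file_list.txt")
```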

Step 2: Tokenization

Run hard tokenization to convert genomic regions into tokens:

from geniml.tokenization import hard_tokenization

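src_folder = '/path/to/raw/bed/files'          # folder containing the input BED files
dst_folder = '/path/to/tokenized_files'        # folder where tokenized files are written
universe_file = '/path/to/universe_file.bed'   # reference vocabulary of regions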

hard_tokenization(src_folder, dst_folder, universe_file, 1e-9)

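The final argument (1e-9) controls how much an input region must overlap a universe region to be mapped to it. In the geniml source it appears to be a minimum overlap fraction handed to bedtools intersect rather than a p-value; confirm this against the hard_tokenization signature in your installed version.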
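Conceptually, hard tokenization replaces each input region with the universe regions it overlaps. The toy function below illustrates that mapping on in-memory tuples. It is a sketch only: hard_tokenize is a hypothetical name, it uses a simple any-overlap rule with no threshold, and it does not reproduce geniml's implementation.

```python
def hard_tokenize(regions, universe):
    """Map each query region to every universe region it overlaps.

    Regions are (chrom, start, end) tuples with half-open BED coordinates.
    Hypothetical illustration; a quadratic scan that is fine for toy inputs only.
    """
    tokens = []
    for chrom, start, end in regions:
        for u_chrom, u_start, u_end in universe:
            # Half-open intervals overlap when each starts before the other ends.
            if chrom == u_chrom and start < u_end and u_start < end:
                tokens.append((u_chrom, u_start, u_end))
    return tokens


universe = [("chr1", 0, 100), ("chr1", 100, 200), ("chr2", 0, 50)]
regions = [("chr1", 90, 110), ("chr2", 60, 80)]
tokens = hard_tokenize(regions, universe)
```

In this example the first region spans two universe regions and yields two tokens, while the second overlaps nothing in the universe and is dropped.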

Step 3: Train Region2Vec Model

Execute Region2Vec training on the tokenized files:

from geniml.region2vec import region2vec

region2vec(
    token_folder=dst_folder,
    save_dir='./region2vec_model',
    num_shufflings=1000,
    embedding_dim=100,
    context_len=50,
    window_size=5,
    init_lr=0.025
)

Key Parameters

Parameter        Description                       Typical Range
init_lr          Initial learning rate             0.01 - 0.05
window_size      Context window size               3 - 10
num_shufflings   Number of shuffling iterations    500 - 2000
embedding_dim    Dimension of output embeddings    50 - 300
context_len      Context length for training       30 - 100
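A quick sanity check against these typical ranges can catch typos before a long training run. The helper below is a hypothetical convenience, not part of geniml; the ranges are copied from the table above and are guidance rather than hard limits.

```python
# Typical ranges taken from the Key Parameters table.
TYPICAL_RANGES = {
    "init_lr": (0.01, 0.05),
    "window_size": (3, 10),
    "num_shufflings": (500, 2000),
    "embedding_dim": (50, 300),
    "context_len": (30, 100),
}


def out_of_range(params):
    """Return the names of parameters that fall outside the typical ranges."""
    flagged = []
    for name, value in params.items():
        if name in TYPICAL_RANGES:
            low, high = TYPICAL_RANGES[name]
            if not low <= value <= high:
                flagged.append(name)
    return flagged


params = {
    "init_lr": 0.025,
    "window_size": 5,
    "num_shufflings": 1000,
    "embedding_dim": 100,
    "context_len": 50,
}
flagged = out_of_range(params)  # empty list when every value is in range
```

A value outside a range is not necessarily wrong; the check only highlights settings worth a second look.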

CLI Usage

geniml region2vec --token-folder /path/to/tokens \
  --save-dir ./region2vec_model \
  --num-shuffle 1000 \
  --embed-dim 100 \
  --context-len 50 \
  --window-size 5 \
  --init-lr 0.025

Best Practices

  • Parameter tuning: init_lr, window_size, num_shufflings, and embedding_dim are the parameters most often worth tuning for your specific dataset
  • Universe file: Use a comprehensive universe file that covers all regions of interest in your analysis
  • Validation: Always validate tokenization output before proceeding to training
  • Resources: Training can be computationally intensive; monitor memory usage with large datasets
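One way to validate the tokenization output is to compare the source and destination folders before training. The sketch below assumes that each tokenized file keeps the name of its source file; that naming is an assumption, and check_tokenized is a hypothetical helper rather than a geniml function.

```python
from pathlib import Path
import tempfile


def check_tokenized(src_folder, dst_folder):
    """Report source files with no tokenized counterpart and empty token files.

    Assumes tokenized files share their source file's name (an assumption).
    """
    src_names = {p.name for p in Path(src_folder).iterdir() if p.is_file()}
    dst_files = {p.name: p for p in Path(dst_folder).iterdir() if p.is_file()}
    missing = sorted(src_names - set(dst_files))
    empty = sorted(n for n, p in dst_files.items() if p.stat().st_size == 0)
    return missing, empty


# Demonstration: one source file was never tokenized, another produced no tokens.
with tempfile.TemporaryDirectory() as src, tempfile.TemporaryDirectory() as dst:
    (Path(src) / "a.bed").write_text("chr1\t0\t100\n")
    (Path(src) / "b.bed").write_text("chr1\t5\t50\n")
    (Path(dst) / "a.bed").write_text("")
    missing, empty = check_tokenized(src, dst)
```

An empty token file often means the regions in that BED file do not overlap the universe, which points back to the universe-file recommendation above.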

Output

Training writes the model to save_dir. Its embeddings can be used for:

  • Similarity searches across genomic regions
  • Clustering region sets
  • Feature vectors for downstream ML tasks
  • Visualization via dimensionality reduction (t-SNE, UMAP)
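As an example of the similarity-search use case, the sketch below ranks regions by cosine similarity. It assumes the embeddings have already been loaded into a plain dictionary of token to vector; how to load them from a saved model is not covered here. The three-dimensional vectors are made-up toy values, and most_similar is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def most_similar(query, embeddings, top_k=2):
    """Return the top_k (token, score) pairs most similar to the query token."""
    scores = [
        (name, cosine_similarity(embeddings[query], vec))
        for name, vec in embeddings.items()
        if name != query
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)[:top_k]


# Toy embeddings for illustration only; real vectors come from a trained model.
embeddings = {
    "chr1_100_200": [1.0, 0.0, 0.0],
    "chr1_500_650": [0.9, 0.1, 0.0],
    "chr2_10_90": [0.0, 1.0, 0.0],
}
neighbours = most_similar("chr1_100_200", embeddings)
```

The same vectors can be fed to a clustering or dimensionality-reduction library for the other uses listed above.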