BEDspace: Joint Region and Metadata Embeddings
Overview
BEDspace applies the StarSpace model to genomic data, jointly training numerical embeddings for region sets and their metadata labels in a shared low-dimensional space. Because both kinds of object live in the same space, a region set can be compared directly with a metadata label as well as with another region set.
When to Use
Use BEDspace when working with:
- Region sets with associated metadata (cell types, tissues, conditions)
- Search tasks requiring metadata-aware similarity
- Cross-modal queries (e.g., "find regions similar to label X")
- Joint analysis of genomic content and experimental conditions
Workflow
BEDspace consists of four sequential operations:
1. Preprocess
Format genomic intervals and metadata for StarSpace training:
geniml bedspace preprocess \
  --input /path/to/regions/ \
  --metadata labels.csv \
  --universe universe.bed \
  --labels "cell_type,tissue" \
  --output preprocessed.txt
Required files:
- Input folder: Directory containing BED files
- Metadata CSV: Must include a file_name column matching BED filenames, plus metadata columns
- Universe file: Reference BED file for tokenization
- Labels: Comma-separated list of metadata columns to use
The preprocessing step adds __label__ prefixes to metadata and converts regions to StarSpace-compatible format.
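To make the label-prefixing step concrete, here is a minimal sketch of how one training line could be assembled. StarSpace reads whitespace-separated tokens and treats tokens that start with `__label__` as labels. The `chr_start_end` region token format and the `to_starspace_line` helper are assumptions for illustration, not geniml's actual implementation.

```python
def to_starspace_line(region_tokens, labels, prefix="__label__"):
    """Join region tokens and prefixed labels into one whitespace-separated line."""
    # Spaces inside a label would split it into several tokens, so replace them.
    label_tokens = [prefix + label.replace(" ", "_") for label in labels]
    return " ".join(list(region_tokens) + label_tokens)


# Hypothetical universe regions overlapped by one BED file.
tokens = ["chr1_100_200", "chr1_500_650"]
line = to_starspace_line(tokens, ["T_cell", "blood"])
print(line)
```

Each BED file would contribute one such line, so inspecting a few lines of the preprocessed output is a quick sanity check before training.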
2. Train
Run the StarSpace model on the preprocessed data:
geniml bedspace train \
  --path-to-starspace /path/to/starspace \
  --input preprocessed.txt \
  --output model/ \
  --dim 100 \
  --epochs 50 \
  --lr 0.05
Key training parameters:
- --dim: Embedding dimension (typical: 50-200)
- --epochs: Training epochs (typical: 20-100)
- --lr: Learning rate (typical: 0.01-0.1)
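When comparing several settings within the typical ranges, it can help to generate the training commands from a small grid. The sketch below only builds and prints the commands, using the flags shown above. The `train_command` helper, the grid values, and the output directory naming scheme are illustrative assumptions.

```python
import itertools
import shlex


def train_command(dim, epochs, lr, starspace="/path/to/starspace",
                  data="preprocessed.txt"):
    """Return the argv list for one `geniml bedspace train` run."""
    return [
        "geniml", "bedspace", "train",
        "--path-to-starspace", starspace,
        "--input", data,
        # Hypothetical naming scheme so that runs do not overwrite each other.
        "--output", f"model_d{dim}_e{epochs}_lr{lr}/",
        "--dim", str(dim),
        "--epochs", str(epochs),
        "--lr", str(lr),
    ]


# A small grid inside the typical ranges listed above.
grid = itertools.product([50, 100, 200], [20, 50], [0.01, 0.05])
commands = [train_command(d, e, lr) for d, e, lr in grid]
for argv in commands:
    print(shlex.join(argv))
```

To actually launch a run, each `argv` list could be passed to `subprocess.run(argv, check=True)`.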
3. Distances
Compute distance metrics between region sets and metadata labels:
geniml bedspace distances \
  --input model/ \
  --metadata labels.csv \
  --universe universe.bed \
  --output distances.pkl
This step creates a distance matrix needed for similarity searches.
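Conceptually, a distance matrix over a joint embedding space pairs every region-set vector with every label vector. The following sketch illustrates the idea with cosine distance on toy two-dimensional vectors. The choice of cosine distance, the vectors, and the dictionary layout are assumptions; the metric and file layout that geniml actually uses are not specified here.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


# Hypothetical embeddings; real ones would have --dim components.
region_sets = {"a.bed": [1.0, 0.0], "b.bed": [0.0, 1.0]}
labels = {"__label__T_cell": [1.0, 0.0], "__label__liver": [1.0, 1.0]}

# One entry per (region set, label) pair.
distances = {
    (name, label): cosine_distance(vec, label_vec)
    for name, vec in region_sets.items()
    for label, label_vec in labels.items()
}

for pair, dist in sorted(distances.items()):
    print(pair, round(dist, 3))
```

A smaller distance means the region set and the label sit closer together in the shared space.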
4. Search
Retrieve similar items across three scenarios:
Region-to-Label (r2l): Query region set → retrieve similar metadata labels
geniml bedspace search -t r2l -d distances.pkl -q query_regions.bed -n 10
Label-to-Region (l2r): Query metadata label → retrieve similar region sets
geniml bedspace search -t l2r -d distances.pkl -q "T_cell" -n 10
Region-to-Region (r2r): Query region set → retrieve similar region sets
geniml bedspace search -t r2r -d distances.pkl -q query_regions.bed -n 10
The -n parameter controls the number of results returned.
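All three search types reduce to the same operation: take the query's row of distances and keep the n smallest. A minimal sketch of that ranking step follows. The nested-dictionary layout and the `top_n` helper are hypothetical and do not reflect the structure of distances.pkl.

```python
import heapq


def top_n(distances, query, n=10):
    """Return the n candidates closest to `query`, nearest first."""
    row = distances[query]
    return heapq.nsmallest(n, row.items(), key=lambda item: item[1])


# Hypothetical label-to-region distances for a single query label.
distances = {"T_cell": {"a.bed": 0.12, "b.bed": 0.80, "c.bed": 0.05}}
hits = top_n(distances, "T_cell", n=2)
print(hits)
```

If n exceeds the number of candidates, `heapq.nsmallest` simply returns all of them in sorted order.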
Python API
from geniml.bedspace import BEDSpaceModel

# Load trained model
model = BEDSpaceModel.load('model/')

# Query similar items
results = model.search(
    query="T_cell",
    search_type="l2r",
    top_k=10
)
Best Practices
- Metadata structure: Ensure the metadata CSV includes a file_name column that exactly matches BED filenames (without path)
- Label selection: Choose informative metadata columns that capture biological variation of interest
- Universe consistency: Use the same universe file across preprocessing, distances, and any subsequent analyses
- Validation: Preprocess and check output format before investing in training
- StarSpace installation: Install StarSpace separately as it's an external dependency
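The first and fourth points can be checked before any training is attempted. The sketch below compares the file_name column of a metadata CSV against the BED files in a folder. The `check_metadata` helper is hypothetical, and it assumes uncompressed files ending in `.bed`; adjust the glob pattern for other naming schemes.

```python
import csv
import tempfile
from pathlib import Path


def check_metadata(csv_path, bed_dir, required_labels=()):
    """Return (listed but not on disk, on disk but not listed) file names."""
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        missing = [c for c in ("file_name", *required_labels) if c not in columns]
        if missing:
            raise ValueError(f"missing columns: {missing}")
        listed = {row["file_name"] for row in reader}
    on_disk = {p.name for p in Path(bed_dir).glob("*.bed")}
    return sorted(listed - on_disk), sorted(on_disk - listed)


# Demonstration on a throwaway directory with deliberately mismatched entries.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "a.bed").write_text("chr1\t100\t200\n")
    (tmp / "b.bed").write_text("chr1\t300\t400\n")
    meta = tmp / "labels.csv"
    meta.write_text("file_name,cell_type\na.bed,T_cell\nc.bed,B_cell\n")
    not_found, unlabeled = check_metadata(meta, tmp, required_labels=["cell_type"])
    print("listed but missing:", not_found)
    print("present but unlabeled:", unlabeled)
```

Both returned lists should be empty before preprocessing; a non-empty list usually points to a path prefix or extension mismatch in the file_name column.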
Output Interpretation
Search results return items ranked by similarity in the joint embedding space:
- r2l: Identifies metadata labels characterizing your query regions
- l2r: Finds region sets matching your metadata criteria
- r2r: Discovers region sets with similar genomic content
Requirements
BEDspace requires StarSpace to be installed separately. Download from: https://github.com/facebookresearch/StarSpace