Files
2025-11-30 08:30:10 +08:00

3.8 KiB

BEDspace: Joint Region and Metadata Embeddings

Overview

BEDspace applies the StarSpace model to genomic data, enabling simultaneous training of numerical embeddings for both region sets and their metadata labels in a shared low-dimensional space. This allows for rich queries across regions and metadata.

When to Use

Use BEDspace when working with:

  • Region sets with associated metadata (cell types, tissues, conditions)
  • Search tasks requiring metadata-aware similarity
  • Cross-modal queries (e.g., "find regions similar to label X")
  • Joint analysis of genomic content and experimental conditions

Workflow

BEDspace consists of four sequential operations:

1. Preprocess

Format genomic intervals and metadata for StarSpace training:

geniml bedspace preprocess \
  --input /path/to/regions/ \
  --metadata labels.csv \
  --universe universe.bed \
  --labels "cell_type,tissue" \
  --output preprocessed.txt

Required files:

  • Input folder: Directory containing BED files
  • Metadata CSV: Must include file_name column matching BED filenames, plus metadata columns
  • Universe file: Reference BED file for tokenization
  • Labels: Comma-separated list of metadata columns to use

The preprocessing step adds __label__ prefixes to metadata and converts regions to StarSpace-compatible format.

2. Train

Execute StarSpace model on preprocessed data:

geniml bedspace train \
  --path-to-starspace /path/to/starspace \
  --input preprocessed.txt \
  --output model/ \
  --dim 100 \
  --epochs 50 \
  --lr 0.05

Key training parameters:

  • --dim: Embedding dimension (typical: 50-200)
  • --epochs: Training epochs (typical: 20-100)
  • --lr: Learning rate (typical: 0.01-0.1)

3. Distances

Compute distance metrics between region sets and metadata labels:

geniml bedspace distances \
  --input model/ \
  --metadata labels.csv \
  --universe universe.bed \
  --output distances.pkl

This step creates a distance matrix needed for similarity searches.

Retrieve similar items across three scenarios:

Region-to-Label (r2l): Query region set → retrieve similar metadata labels

geniml bedspace search -t r2l -d distances.pkl -q query_regions.bed -n 10

Label-to-Region (l2r): Query metadata label → retrieve similar region sets

geniml bedspace search -t l2r -d distances.pkl -q "T_cell" -n 10

Region-to-Region (r2r): Query region set → retrieve similar region sets

geniml bedspace search -t r2r -d distances.pkl -q query_regions.bed -n 10

The -n parameter controls the number of results returned.

Python API

from geniml.bedspace import BEDSpaceModel

# Load trained model
model = BEDSpaceModel.load('model/')

# Query similar items
results = model.search(
    query="T_cell",
    search_type="l2r",
    top_k=10
)

Best Practices

  • Metadata structure: Ensure metadata CSV includes file_name column that exactly matches BED filenames (without path)
  • Label selection: Choose informative metadata columns that capture biological variation of interest
  • Universe consistency: Use the same universe file across preprocessing, distances, and any subsequent analyses
  • Validation: Preprocess and check output format before investing in training
  • StarSpace installation: Install StarSpace separately as it's an external dependency

Output Interpretation

Search results return items ranked by similarity in the joint embedding space:

  • r2l: Identifies metadata labels characterizing your query regions
  • l2r: Finds region sets matching your metadata criteria
  • r2r: Discovers region sets with similar genomic content

Requirements

BEDspace requires StarSpace to be installed separately. Download from: https://github.com/facebookresearch/StarSpace