Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/geniml/references/bedspace.md
+++ b/skills/geniml/references/bedspace.md
@@ -0,0 +1,127 @@
+# BEDspace: Joint Region and Metadata Embeddings
+
+## Overview
+
+BEDspace applies the StarSpace model to genomic data, training numerical embeddings for region sets and their metadata labels together in one shared low-dimensional space. Because both kinds of item live in the same space, a region set can be compared directly with a label, and a label with a region set.
+
+## When to Use
+
+Use BEDspace when working with:
+- Region sets with associated metadata (cell types, tissues, conditions)
+- Search tasks requiring metadata-aware similarity
+- Cross-modal queries (e.g., "find regions similar to label X")
+- Joint analysis of genomic content and experimental conditions
+
+## Workflow
+
+BEDspace consists of four sequential operations:
+
+### 1. Preprocess
+
+Format genomic intervals and metadata for StarSpace training:
+
+```bash
+geniml bedspace preprocess \
+  --input /path/to/regions/ \
+  --metadata labels.csv \
+  --universe universe.bed \
+  --labels "cell_type,tissue" \
+  --output preprocessed.txt
+```
+
+**Required files:**
+- **Input folder**: Directory containing BED files
+- **Metadata CSV**: Must include a `file_name` column matching the BED filenames, plus the metadata columns
+- **Universe file**: Reference BED file for tokenization
+- **Labels**: Comma-separated list of metadata columns to use
+
+The preprocessing step adds `__label__` prefixes to metadata and converts regions to StarSpace-compatible format.
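+
+As a rough illustration of the target format, the sketch below builds one StarSpace-style training line from region tokens and label values. The helper name `to_starspace_line`, the `chr_start_end` token spelling, the space-to-underscore replacement, and the sample values are all assumptions made for illustration; the exact text written by `geniml bedspace preprocess` may differ.
+
+```python
+def to_starspace_line(region_tokens, labels):
+    """Join region tokens and prefixed labels into one space-separated line."""
+    # StarSpace recognizes label tokens by the `__label__` prefix.
+    # Replacing spaces with underscores is an assumption of this sketch,
+    # made so that each label stays a single token.
+    label_tokens = ["__label__" + label.replace(" ", "_") for label in labels]
+    return " ".join(list(region_tokens) + label_tokens)
+
+
+# Hypothetical region tokens and metadata values for one BED file.
+line = to_starspace_line(["chr1_100_200", "chr1_500_900"], ["T_cell", "blood"])
+print(line)
+```
+
+Each BED file would contribute one such line, so the regions and labels that appear together are the ones the model learns to place close to each other.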
+
+### 2. Train
+
+Run the StarSpace model on the preprocessed data:
+
+```bash
+geniml bedspace train \
+  --path-to-starspace /path/to/starspace \
+  --input preprocessed.txt \
+  --output model/ \
+  --dim 100 \
+  --epochs 50 \
+  --lr 0.05
+```
+
+**Key training parameters:**
+- `--dim`: Embedding dimension (typical: 50-200)
+- `--epochs`: Training epochs (typical: 20-100)
+- `--lr`: Learning rate (typical: 0.01-0.1)
+
+### 3. Distances
+
+Compute distances between region sets and metadata labels in the trained embedding space:
+
+```bash
+geniml bedspace distances \
+  --input model/ \
+  --metadata labels.csv \
+  --universe universe.bed \
+  --output distances.pkl
+```
+
+This step creates a distance matrix needed for similarity searches.
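+
+What such a matrix holds can be sketched as follows, assuming cosine distance between embedding vectors. The metric, the matrix layout, and the contents of `distances.pkl` as produced by geniml are not specified here; the vectors and names below are made up.
+
+```python
+import math
+
+
+def cosine_distance(u, v):
+    """Return 1 - cosine similarity of two equal-length, non-zero vectors."""
+    dot = sum(a * b for a, b in zip(u, v))
+    norm_u = math.sqrt(sum(a * a for a in u))
+    norm_v = math.sqrt(sum(b * b for b in v))
+    return 1.0 - dot / (norm_u * norm_v)
+
+
+# Toy 2-dimensional embeddings (hypothetical values).
+region_sets = {"a.bed": [1.0, 0.0], "b.bed": [0.0, 1.0]}
+labels = {"T_cell": [1.0, 0.0], "B_cell": [1.0, 1.0]}
+
+# Nested dict used as a matrix: rows are region sets, columns are labels.
+distances = {
+    name: {label: cosine_distance(vec, lvec) for label, lvec in labels.items()}
+    for name, vec in region_sets.items()
+}
+```
+
+A small distance means the region set and the label point in nearly the same direction in the shared space.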
+
+### 4. Search
+
+Retrieve similar items using one of three query types:
+
+**Region-to-Label (r2l)**: Query region set → retrieve similar metadata labels
+```bash
+geniml bedspace search -t r2l -d distances.pkl -q query_regions.bed -n 10
+```
+
+**Label-to-Region (l2r)**: Query metadata label → retrieve similar region sets
+```bash
+geniml bedspace search -t l2r -d distances.pkl -q "T_cell" -n 10
+```
+
+**Region-to-Region (r2r)**: Query region set → retrieve similar region sets
+```bash
+geniml bedspace search -t r2r -d distances.pkl -q query_regions.bed -n 10
+```
+
+The `-n` parameter controls the number of results returned.
+
+## Python API
+
+```python
+from geniml.bedspace import BEDSpaceModel
+
+# Load trained model
+model = BEDSpaceModel.load('model/')
+
+# Query similar items
+results = model.search(
+    query="T_cell",
+    search_type="l2r",
+    top_k=10
+)
+```
+
+## Best Practices
+
+- **Metadata structure**: Ensure the metadata CSV includes a `file_name` column whose values exactly match the BED filenames (without path)
+- **Label selection**: Choose informative metadata columns that capture biological variation of interest
+- **Universe consistency**: Use the same universe file across preprocessing, distances, and any subsequent analyses
+- **Validation**: Run preprocessing and inspect the output format before spending time on training
+- **StarSpace installation**: Install StarSpace separately as it's an external dependency
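+
+The filename check from the first point can be automated before preprocessing. This is a minimal sketch using only the standard library; the helper name `missing_bed_files` is hypothetical and is not part of geniml.
+
+```python
+import csv
+import os
+
+
+def missing_bed_files(metadata_csv, bed_dir):
+    """Return file_name values from the CSV that have no file in bed_dir."""
+    with open(metadata_csv, newline="") as handle:
+        reader = csv.DictReader(handle)
+        if "file_name" not in (reader.fieldnames or []):
+            raise ValueError("metadata CSV has no 'file_name' column")
+        names = [row["file_name"] for row in reader]
+    # Compare against bare filenames, since the column holds names without paths.
+    present = set(os.listdir(bed_dir))
+    return [name for name in names if name not in present]
+```
+
+An empty list means every row in the metadata has a matching BED file; any names returned point to typos or missing files.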
+
+## Output Interpretation
+
+Search results return items ranked by similarity in the joint embedding space:
+- **r2l**: Identifies metadata labels characterizing your query regions
+- **l2r**: Finds region sets matching your metadata criteria
+- **r2r**: Discovers region sets with similar genomic content
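+
+Ranking by similarity amounts to sorting candidates by ascending distance and keeping the first `n`. The sketch below shows this on made-up distances; the helper name `top_n` and the values are illustrative and do not reflect the actual output format of `geniml bedspace search`.
+
+```python
+def top_n(distances, n):
+    """Return the n (item, distance) pairs with the smallest distance."""
+    return sorted(distances.items(), key=lambda pair: pair[1])[:n]
+
+
+# Hypothetical distances from one query region set to three labels (r2l).
+query_to_labels = {"T_cell": 0.12, "B_cell": 0.40, "hepatocyte": 0.93}
+print(top_n(query_to_labels, 2))
+```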
+
+## Requirements
+
+BEDspace requires StarSpace to be installed separately. Download from: https://github.com/facebookresearch/StarSpace