scEmbed: Single-Cell Embedding Generation

Overview

scEmbed trains Region2Vec models on single-cell ATAC-seq datasets to generate cell embeddings for clustering and analysis. It provides an unsupervised machine learning framework for representing and analyzing scATAC-seq data in low-dimensional space.

When to Use

Use scEmbed when working with:

  • Single-cell ATAC-seq (scATAC-seq) data requiring clustering
  • Cell-type annotation tasks
  • Dimensionality reduction for single-cell chromatin accessibility
  • Integration with scanpy workflows for downstream analysis

Workflow

Step 1: Data Preparation

Input data must be in AnnData format, with .var columns chr, start, and end giving the genomic coordinates of each peak.

Starting from raw data (barcodes.txt, peaks.bed, matrix.mtx):

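import pandas as pd
import scipy.io
import anndata

# Load barcodes (one per line) and peak coordinates (first three BED columns)
barcodes = pd.read_csv('barcodes.txt', header=None, names=['barcode'], dtype=str)
peaks = pd.read_csv('peaks.bed', sep='\t', header=None,
                    usecols=[0, 1, 2], names=['chr', 'start', 'end'])

# Give cells and peaks unique string indices, as AnnData expects
barcodes = barcodes.set_index('barcode')
peaks.index = (peaks['chr'].astype(str) + '_' + peaks['start'].astype(str)
               + '_' + peaks['end'].astype(str))

# matrix.mtx is assumed to be peaks x cells; transpose to cells x peaks
matrix = scipy.io.mmread('matrix.mtx').tocsr()
adata = anndata.AnnData(X=matrix.T.tocsr(), obs=barcodes, var=peaks)
adata.write('scatac_data.h5ad')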
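
Before building the AnnData object, it can help to sanity-check the peak file. The sketch below uses only the standard library; `parse_peaks` is a hypothetical helper written for illustration, not part of geniml, scanpy, or anndata. It parses BED-style lines, rejects malformed intervals, and builds one unique name per peak.

```python
def parse_peaks(lines):
    """Parse BED-style lines into (chr, start, end) tuples, validating each one."""
    peaks = []
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        fields = line.split('\t')
        if len(fields) < 3:
            raise ValueError(f'line {n}: expected at least 3 tab-separated fields')
        chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        if start < 0 or end <= start:
            raise ValueError(f'line {n}: invalid interval {chrom}:{start}-{end}')
        peaks.append((chrom, start, end))
    return peaks


# Toy input standing in for the contents of peaks.bed
example_lines = ['chr1\t100\t600', 'chr1\t900\t1400', 'chr2\t50\t550']
peaks = parse_peaks(example_lines)
peak_names = [f'{c}_{s}_{e}' for c, s, e in peaks]
```

Running a check like this first turns a silent coordinate problem into an explicit error before any training starts.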

Step 2: Pre-tokenization

Convert genomic regions into tokens using gtars utilities. This creates a parquet file with tokenized cells for faster training:

from geniml.io import tokenize_cells

tokenize_cells(
    adata='scatac_data.h5ad',
    universe_file='universe.bed',
    output='tokenized_cells.parquet'
)

Benefits of pre-tokenization:

  • Faster training iterations
  • Reduced memory requirements
  • Reusable tokenized data for multiple training runs
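
To make the idea of tokenization concrete, here is a minimal standard-library sketch of mapping a cell's peaks onto universe regions. It is an illustration only: `build_index` and `tokenize` are hypothetical names, intervals are assumed to be half-open, and the linear scan stands in for the faster interval lookups a real tokenizer would use.

```python
def build_index(universe):
    """Group universe regions by chromosome; a region's position is its token id."""
    index = {}
    for token_id, (chrom, start, end) in enumerate(universe):
        index.setdefault(chrom, []).append((start, end, token_id))
    return index


def tokenize(peaks, index):
    """Return the sorted token ids of universe regions overlapped by any peak."""
    tokens = set()
    for chrom, start, end in peaks:
        for r_start, r_end, token_id in index.get(chrom, []):
            # Half-open intervals overlap when each starts before the other ends
            if r_start < end and r_end > start:
                tokens.add(token_id)
    return sorted(tokens)


universe = [('chr1', 0, 500), ('chr1', 500, 1000), ('chr2', 0, 1000)]
cell_peaks = [('chr1', 450, 550), ('chr2', 2000, 2100)]
cell_tokens = tokenize(cell_peaks, build_index(universe))
```

In this toy case the first peak straddles two universe regions and yields two tokens, while the second peak falls outside the universe and is dropped.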

Step 3: Model Training

Train the scEmbed model using tokenized data:

from geniml.scembed import ScEmbed
from geniml.region2vec import Region2VecDataset

# Load tokenized dataset
dataset = Region2VecDataset('tokenized_cells.parquet')

# Initialize and train model
model = ScEmbed(
    embedding_dim=100,
    window_size=5,
    negative_samples=5
)

model.train(
    dataset=dataset,
    epochs=100,
    batch_size=256,
    learning_rate=0.025
)

# Save model
model.save('scembed_model/')
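
The roles of window_size and negative_samples are easier to see in a small example. The sketch below assumes a word2vec-style skip-gram setup in which a cell's tokens are shuffled before context windows are formed, because a set of regions has no natural word order. It is not geniml's implementation: `context_pairs` and `draw_negatives` are hypothetical helpers, and negatives are drawn uniformly for simplicity.

```python
import random


def context_pairs(tokens, window_size, rng):
    """Shuffle a cell's tokens, then pair each token with its window neighbors."""
    shuffled = list(tokens)
    rng.shuffle(shuffled)  # region sets have no inherent order
    pairs = []
    for i, center in enumerate(shuffled):
        lo = max(0, i - window_size)
        hi = min(len(shuffled), i + window_size + 1)
        pairs.extend((center, shuffled[j]) for j in range(lo, hi) if j != i)
    return pairs


def draw_negatives(vocab_size, positives, k, rng):
    """Draw k distinct token ids uniformly from the vocabulary, excluding positives."""
    candidates = [t for t in range(vocab_size) if t not in positives]
    return rng.sample(candidates, k)


rng = random.Random(0)
cell_tokens = [3, 8, 15, 21, 42, 77]
pairs = context_pairs(cell_tokens, window_size=2, rng=rng)
negatives = draw_negatives(vocab_size=100, positives=set(cell_tokens), k=5, rng=rng)
```

A larger window produces more positive pairs per cell, and a larger number of negatives gives each positive pair more contrast at the cost of more computation per step.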

Step 4: Generate Cell Embeddings

Use the trained model to generate embeddings for cells:

import scanpy as sc
from geniml.scembed import ScEmbed

# Load the AnnData object created in Step 1
adata = sc.read_h5ad('scatac_data.h5ad')

# Load trained model
model = ScEmbed.from_pretrained('scembed_model/')

# Generate one embedding vector per cell
embeddings = model.encode(adata)

# Add to AnnData for downstream analysis
adata.obsm['scembed_X'] = embeddings
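
One way to picture how a cell embedding can be derived from region embeddings is mean pooling: average the vectors of the regions accessible in that cell. The sketch below illustrates that idea under the assumption that pooling is a plain average. `mean_pool` is a hypothetical helper and the vectors are toy values, so this should not be read as the exact computation inside model.encode.

```python
def mean_pool(token_ids, region_vectors):
    """Average the vectors of a cell's known tokens; zeros if none are known."""
    dim = len(next(iter(region_vectors.values())))
    known = [region_vectors[t] for t in token_ids if t in region_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Toy 2-dimensional region embeddings keyed by token id
region_vectors = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [1.0, 1.0]}

cell_a = mean_pool([0, 1], region_vectors)
cell_b = mean_pool([0, 2, 99], region_vectors)  # token 99 is unknown and ignored
```

Cells that share many accessible regions end up with similar averages, which is what makes the embeddings usable for neighbor graphs and clustering.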

Step 5: Downstream Analysis

Integrate with scanpy for clustering and visualization:

import scanpy as sc

# Use scEmbed embeddings for neighborhood graph
sc.pp.neighbors(adata, use_rep='scembed_X')

# Cluster cells
sc.tl.leiden(adata, resolution=0.5)

# Compute UMAP for visualization
sc.tl.umap(adata)

# Plot results
sc.pl.umap(adata, color='leiden')

Key Parameters

Training Parameters

Parameter         Description                    Typical Range
embedding_dim     Dimension of cell embeddings   50 - 200
window_size       Context window for training    3 - 10
negative_samples  Number of negative samples     5 - 20
epochs            Training epochs                50 - 200
batch_size        Training batch size            128 - 512
learning_rate     Initial learning rate          0.01 - 0.05

Tokenization Parameters

  • Universe file: Reference BED file defining the genomic vocabulary
  • Overlap threshold: Minimum fractional overlap required to match a peak to a universe region (typically 1e-9, which in effect accepts any overlap of at least one base pair)
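
The arithmetic behind the overlap threshold can be sketched as follows, assuming the threshold is expressed as a fraction of the peak's length and that intervals are half-open. `overlap_fraction` and `matches` are hypothetical helpers, not library functions.

```python
def overlap_fraction(peak, region):
    """Fraction of the peak [start, end) that is covered by the region."""
    (p_chr, p_start, p_end), (r_chr, r_start, r_end) = peak, region
    if p_chr != r_chr:
        return 0.0
    shared = min(p_end, r_end) - max(p_start, r_start)
    return max(shared, 0) / (p_end - p_start)


def matches(peak, region, min_fraction=1e-9):
    """True when the peak overlaps the region by at least min_fraction."""
    return overlap_fraction(peak, region) >= min_fraction


peak = ('chr1', 100, 600)
touching = ('chr1', 599, 1000)   # shares exactly one base pair with the peak
adjacent = ('chr1', 600, 1000)   # starts where the peak ends, no shared bases

one_bp = matches(peak, touching)                 # 1/500 >= 1e-9
strict = matches(peak, touching, min_fraction=0.5)
no_overlap = matches(peak, adjacent)
```

With a threshold as small as 1e-9, a single shared base pair is enough for a match, whereas a threshold of 0.5 would require half of the peak to be covered.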

Pre-trained Models

Pre-trained scEmbed models are available on Hugging Face for common reference datasets. Load them using:

from geniml.scembed import ScEmbed

# Load pre-trained model
model = ScEmbed.from_pretrained('databio/scembed-pbmc-10k')

# Generate embeddings
embeddings = model.encode(adata)

Best Practices

  • Data quality: Use filtered peak-barcode matrices (cell-called barcodes only), not the raw, unfiltered matrix
  • Pre-tokenization: Always pre-tokenize to improve training efficiency
  • Parameter tuning: Adjust embedding_dim and training epochs based on dataset size
  • Validation: Use known cell-type markers to validate clustering quality
  • Integration: Combine with scanpy for comprehensive single-cell analysis
  • Model sharing: Export trained models to Hugging Face for reproducibility
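
For the validation step, one simple quantitative check is cluster purity: the fraction of cells whose cluster's majority label matches their own known label. The sketch below is a standard-library illustration with toy labels. `cluster_purity` is a hypothetical helper, and the known labels are assumed to come from an independent source such as marker-based annotation.

```python
from collections import Counter, defaultdict


def cluster_purity(cluster_labels, known_labels):
    """Fraction of cells that carry the majority known label of their cluster."""
    by_cluster = defaultdict(Counter)
    for cluster, known in zip(cluster_labels, known_labels):
        by_cluster[cluster][known] += 1
    majority_total = sum(counts.most_common(1)[0][1]
                         for counts in by_cluster.values())
    return majority_total / len(cluster_labels)


# Toy example: cluster '0' is mostly T cells, cluster '1' is all B cells
clusters = ['0', '0', '0', '1', '1', '1']
known = ['T', 'T', 'B', 'B', 'B', 'B']
purity = cluster_purity(clusters, known)
```

Purity rises trivially as the number of clusters grows, so it is best compared across runs that use a similar clustering resolution.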

Example Dataset

The 10x Genomics PBMC 10k dataset (roughly 10,000 peripheral blood mononuclear cells) serves as a standard benchmark:

  • Contains diverse immune cell types
  • Well-characterized cell populations
  • Available from the 10x Genomics website

Cell-Type Annotation

After clustering, annotate cell types by transferring labels from an annotated reference dataset, using k-nearest neighbors (KNN) in the embedding space:

from geniml.scembed import annotate_celltypes

# Annotate using reference
annotations = annotate_celltypes(
    query_adata=adata,
    reference_adata=reference,
    embedding_key='scembed_X',
    k=10
)

adata.obs['cell_type'] = annotations
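
The KNN label transfer itself can be sketched in a few lines. This is a generic majority-vote illustration using Euclidean distance on toy 2-dimensional embeddings. `knn_annotate` is a hypothetical helper, and the distance metric and tie handling used by any particular library may differ.

```python
import math
from collections import Counter


def knn_annotate(query, reference, ref_labels, k=3):
    """Label each query point by majority vote among its k nearest references."""
    labels = []
    for q in query:
        # Rank reference cells by Euclidean distance to the query cell
        ranked = sorted(range(len(reference)),
                        key=lambda i: math.dist(q, reference[i]))
        votes = Counter(ref_labels[i] for i in ranked[:k])
        labels.append(votes.most_common(1)[0][0])
    return labels


reference = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
ref_labels = ['T', 'T', 'T', 'B', 'B', 'B']
query = [[0.2, 0.2], [5.5, 5.5]]
predicted = knn_annotate(query, reference, ref_labels, k=3)
```

For the transfer to be meaningful, query and reference cells need to be embedded in the same space, for example by encoding both with the same trained model.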

Output

scEmbed produces:

  • Low-dimensional cell embeddings (stored in adata.obsm)
  • Trained model files for reuse
  • An AnnData object ready for downstream analysis in scanpy
  • Optional export to Hugging Face for sharing