Initial commit
This commit is contained in:
197
skills/geniml/references/scembed.md
Normal file
197
skills/geniml/references/scembed.md
Normal file
@@ -0,0 +1,197 @@
|
||||
# scEmbed: Single-Cell Embedding Generation
|
||||
|
||||
## Overview
|
||||
|
||||
scEmbed trains Region2Vec models on single-cell ATAC-seq datasets to generate cell embeddings for clustering and analysis. It provides an unsupervised machine learning framework for representing and analyzing scATAC-seq data in low-dimensional space.
|
||||
|
||||
## When to Use
|
||||
|
||||
Use scEmbed when working with:
|
||||
- Single-cell ATAC-seq (scATAC-seq) data requiring clustering
|
||||
- Cell-type annotation tasks
|
||||
- Dimensionality reduction for single-cell chromatin accessibility
|
||||
- Integration with scanpy workflows for downstream analysis
|
||||
|
||||
## Workflow
|
||||
|
||||
### Step 1: Data Preparation
|
||||
|
||||
Input data must be in AnnData format with `.var` attributes containing `chr`, `start`, and `end` values for peaks.
|
||||
|
||||
**Starting from raw data** (barcodes.txt, peaks.bed, matrix.mtx):
|
||||
|
||||
```python
|
||||
import scanpy as sc
|
||||
import pandas as pd
|
||||
import scipy.io
|
||||
import anndata
|
||||
|
||||
# Load data
|
||||
barcodes = pd.read_csv('barcodes.txt', header=None, names=['barcode'])
|
||||
peaks = pd.read_csv('peaks.bed', sep='\t', header=None,
|
||||
names=['chr', 'start', 'end'])
|
||||
matrix = scipy.io.mmread('matrix.mtx').tocsr()
|
||||
|
||||
# Create AnnData
|
||||
adata = anndata.AnnData(X=matrix.T, obs=barcodes, var=peaks)
|
||||
adata.write('scatac_data.h5ad')
|
||||
```
|
||||
|
||||
### Step 2: Pre-tokenization
|
||||
|
||||
Convert genomic regions into tokens using gtars utilities. This creates a parquet file with tokenized cells for faster training:
|
||||
|
||||
```python
|
||||
from geniml.io import tokenize_cells
|
||||
|
||||
tokenize_cells(
|
||||
adata='scatac_data.h5ad',
|
||||
universe_file='universe.bed',
|
||||
output='tokenized_cells.parquet'
|
||||
)
|
||||
```
|
||||
|
||||
**Benefits of pre-tokenization:**
|
||||
- Faster training iterations
|
||||
- Reduced memory requirements
|
||||
- Reusable tokenized data for multiple training runs
|
||||
|
||||
### Step 3: Model Training
|
||||
|
||||
Train the scEmbed model using tokenized data:
|
||||
|
||||
```python
|
||||
from geniml.scembed import ScEmbed
|
||||
from geniml.region2vec import Region2VecDataset
|
||||
|
||||
# Load tokenized dataset
|
||||
dataset = Region2VecDataset('tokenized_cells.parquet')
|
||||
|
||||
# Initialize and train model
|
||||
model = ScEmbed(
|
||||
embedding_dim=100,
|
||||
window_size=5,
|
||||
negative_samples=5
|
||||
)
|
||||
|
||||
model.train(
|
||||
dataset=dataset,
|
||||
epochs=100,
|
||||
batch_size=256,
|
||||
learning_rate=0.025
|
||||
)
|
||||
|
||||
# Save model
|
||||
model.save('scembed_model/')
|
||||
```
|
||||
|
||||
### Step 4: Generate Cell Embeddings
|
||||
|
||||
Use the trained model to generate embeddings for cells:
|
||||
|
||||
```python
|
||||
from geniml.scembed import ScEmbed
|
||||
|
||||
# Load trained model
|
||||
model = ScEmbed.from_pretrained('scembed_model/')
|
||||
|
||||
# Generate embeddings for AnnData object
|
||||
embeddings = model.encode(adata)
|
||||
|
||||
# Add to AnnData for downstream analysis
|
||||
adata.obsm['scembed_X'] = embeddings
|
||||
```
|
||||
|
||||
### Step 5: Downstream Analysis
|
||||
|
||||
Integrate with scanpy for clustering and visualization:
|
||||
|
||||
```python
|
||||
import scanpy as sc
|
||||
|
||||
# Use scEmbed embeddings for neighborhood graph
|
||||
sc.pp.neighbors(adata, use_rep='scembed_X')
|
||||
|
||||
# Cluster cells
|
||||
sc.tl.leiden(adata, resolution=0.5)
|
||||
|
||||
# Compute UMAP for visualization
|
||||
sc.tl.umap(adata)
|
||||
|
||||
# Plot results
|
||||
sc.pl.umap(adata, color='leiden')
|
||||
```
|
||||
|
||||
## Key Parameters
|
||||
|
||||
### Training Parameters
|
||||
|
||||
| Parameter | Description | Typical Range |
|
||||
|-----------|-------------|---------------|
|
||||
| `embedding_dim` | Dimension of cell embeddings | 50 - 200 |
|
||||
| `window_size` | Context window for training | 3 - 10 |
|
||||
| `negative_samples` | Number of negative samples | 5 - 20 |
|
||||
| `epochs` | Training epochs | 50 - 200 |
|
||||
| `batch_size` | Training batch size | 128 - 512 |
|
||||
| `learning_rate` | Initial learning rate | 0.01 - 0.05 |
|
||||
|
||||
### Tokenization Parameters
|
||||
|
||||
- **Universe file**: Reference BED file defining the genomic vocabulary
|
||||
- **Overlap threshold**: Minimum overlap for peak-universe matching (typically 1e-9)
|
||||
|
||||
## Pre-trained Models
|
||||
|
||||
Pre-trained scEmbed models are available on Hugging Face for common reference datasets. Load them using:
|
||||
|
||||
```python
|
||||
from geniml.scembed import ScEmbed
|
||||
|
||||
# Load pre-trained model
|
||||
model = ScEmbed.from_pretrained('databio/scembed-pbmc-10k')
|
||||
|
||||
# Generate embeddings
|
||||
embeddings = model.encode(adata)
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
- **Data quality**: Use filtered peak-barcode matrices, not raw counts
|
||||
- **Pre-tokenization**: Always pre-tokenize to improve training efficiency
|
||||
- **Parameter tuning**: Adjust `embedding_dim` and training epochs based on dataset size
|
||||
- **Validation**: Use known cell-type markers to validate clustering quality
|
||||
- **Integration**: Combine with scanpy for comprehensive single-cell analysis
|
||||
- **Model sharing**: Export trained models to Hugging Face for reproducibility
|
||||
|
||||
## Example Dataset
|
||||
|
||||
The 10x Genomics PBMC 10k dataset (10,000 peripheral blood mononuclear cells) serves as a standard benchmark:
|
||||
- Contains diverse immune cell types
|
||||
- Well-characterized cell populations
|
||||
- Available from 10x Genomics website
|
||||
|
||||
## Cell-Type Annotation
|
||||
|
||||
After clustering, annotate cell types using k-nearest neighbors (KNN) with reference datasets:
|
||||
|
||||
```python
|
||||
from geniml.scembed import annotate_celltypes
|
||||
|
||||
# Annotate using reference
|
||||
annotations = annotate_celltypes(
|
||||
query_adata=adata,
|
||||
reference_adata=reference,
|
||||
embedding_key='scembed_X',
|
||||
k=10
|
||||
)
|
||||
|
||||
adata.obs['cell_type'] = annotations
|
||||
```
|
||||
|
||||
## Output
|
||||
|
||||
scEmbed produces:
|
||||
- Low-dimensional cell embeddings (stored in `adata.obsm`)
|
||||
- Trained model files for reuse
|
||||
- Compatible format for scanpy downstream analysis
|
||||
- Optional export to Hugging Face for sharing
|
||||
Reference in New Issue
Block a user