Initial commit
This commit is contained in:
385
skills/geniml/references/utilities.md
Normal file
385
skills/geniml/references/utilities.md
Normal file
@@ -0,0 +1,385 @@
|
||||
# Geniml Utilities and Additional Tools
|
||||
|
||||
## BBClient: BED File Caching
|
||||
|
||||
### Overview
|
||||
|
||||
BBClient provides efficient caching of BED files from remote sources, enabling faster repeated access and integration with R workflows.
|
||||
|
||||
### When to Use
|
||||
|
||||
Use BBClient when:
|
||||
- Repeatedly accessing BED files from remote databases
|
||||
- Working with BEDbase repositories
|
||||
- Integrating genomic data with R pipelines
|
||||
- Need local caching for performance
|
||||
|
||||
### Python Usage
|
||||
|
||||
```python
|
||||
from geniml.bbclient import BBClient
|
||||
|
||||
# Initialize client
|
||||
client = BBClient(cache_folder='~/.bedcache')
|
||||
|
||||
# Fetch and cache BED file
|
||||
bed_file = client.load_bed(bed_id='GSM123456')
|
||||
|
||||
# Access cached file
|
||||
regions = client.get_regions('GSM123456')
|
||||
```
|
||||
|
||||
### R Integration
|
||||
|
||||
```r
|
||||
library(reticulate)
|
||||
geniml <- import("geniml.bbclient")
|
||||
|
||||
# Initialize client
|
||||
client <- geniml$BBClient(cache_folder='~/.bedcache')
|
||||
|
||||
# Load BED file
|
||||
bed_file <- client$load_bed(bed_id='GSM123456')
|
||||
```
|
||||
|
||||
### Best Practices
|
||||
|
||||
- Configure cache directory with sufficient storage
|
||||
- Use consistent cache locations across analyses
|
||||
- Clear cache periodically to remove unused files
|
||||
|
||||
---
|
||||
|
||||
## BEDshift: BED File Randomization
|
||||
|
||||
### Overview
|
||||
|
||||
BEDshift provides tools for randomizing BED files while preserving genomic context, essential for generating null distributions and statistical testing.
|
||||
|
||||
### When to Use
|
||||
|
||||
Use BEDshift when:
|
||||
- Creating null models for statistical testing
|
||||
- Generating control datasets
|
||||
- Assessing significance of genomic overlaps
|
||||
- Benchmarking analysis methods
|
||||
|
||||
### Usage
|
||||
|
||||
```python
|
||||
from geniml.bedshift import bedshift
|
||||
|
||||
# Randomize BED file preserving chromosome distribution
|
||||
randomized = bedshift(
|
||||
input_bed='peaks.bed',
|
||||
genome='hg38',
|
||||
preserve_chrom=True,
|
||||
n_iterations=100
|
||||
)
|
||||
```
|
||||
|
||||
### CLI Usage
|
||||
|
||||
```bash
|
||||
geniml bedshift \
|
||||
--input peaks.bed \
|
||||
--genome hg38 \
|
||||
--preserve-chrom \
|
||||
--iterations 100 \
|
||||
--output randomized_peaks.bed
|
||||
```
|
||||
|
||||
### Randomization Strategies
|
||||
|
||||
**Preserve chromosome distribution:**
|
||||
```python
|
||||
bedshift(input_bed, genome, preserve_chrom=True)
|
||||
```
|
||||
Maintains regions on same chromosomes as original.
|
||||
|
||||
**Preserve distance distribution:**
|
||||
```python
|
||||
bedshift(input_bed, genome, preserve_distance=True)
|
||||
```
|
||||
Maintains inter-region distances.
|
||||
|
||||
**Preserve region sizes:**
|
||||
```python
|
||||
bedshift(input_bed, genome, preserve_size=True)
|
||||
```
|
||||
Keeps original region lengths.
|
||||
|
||||
### Best Practices
|
||||
|
||||
- Choose randomization strategy matching null hypothesis
|
||||
- Generate multiple iterations for robust statistics
|
||||
- Validate randomized output maintains desired properties
|
||||
- Document randomization parameters for reproducibility
|
||||
|
||||
---
|
||||
|
||||
## Evaluation: Model Assessment Tools
|
||||
|
||||
### Overview
|
||||
|
||||
Geniml provides evaluation utilities for assessing embedding quality and model performance.
|
||||
|
||||
### When to Use
|
||||
|
||||
Use evaluation tools when:
|
||||
- Validating trained embeddings
|
||||
- Comparing different models
|
||||
- Assessing clustering quality
|
||||
- Publishing model results
|
||||
|
||||
### Embedding Evaluation
|
||||
|
||||
```python
|
||||
from geniml.evaluation import evaluate_embeddings
|
||||
|
||||
# Evaluate Region2Vec embeddings
|
||||
metrics = evaluate_embeddings(
|
||||
embeddings_file='region2vec_model/embeddings.npy',
|
||||
labels_file='metadata.csv',
|
||||
metrics=['silhouette', 'davies_bouldin', 'calinski_harabasz']
|
||||
)
|
||||
|
||||
print(f"Silhouette score: {metrics['silhouette']:.3f}")
|
||||
print(f"Davies-Bouldin index: {metrics['davies_bouldin']:.3f}")
|
||||
```
|
||||
|
||||
### Clustering Metrics
|
||||
|
||||
**Silhouette score:** Measures cluster cohesion and separation (-1 to 1, higher better)
|
||||
|
||||
**Davies-Bouldin index:** Average similarity between clusters (≥0, lower better)
|
||||
|
||||
**Calinski-Harabasz score:** Ratio of between/within cluster dispersion (higher better)
|
||||
|
||||
### scEmbed Cell-Type Annotation Evaluation
|
||||
|
||||
```python
|
||||
from geniml.evaluation import evaluate_annotation
|
||||
|
||||
# Evaluate cell-type predictions
|
||||
results = evaluate_annotation(
|
||||
predicted=adata.obs['predicted_celltype'],
|
||||
true=adata.obs['true_celltype'],
|
||||
metrics=['accuracy', 'f1', 'confusion_matrix']
|
||||
)
|
||||
|
||||
print(f"Accuracy: {results['accuracy']:.1%}")
|
||||
print(f"F1 score: {results['f1']:.3f}")
|
||||
```
|
||||
|
||||
### Best Practices
|
||||
|
||||
- Use multiple complementary metrics
|
||||
- Compare against baseline models
|
||||
- Report metrics on held-out test data
|
||||
- Visualize embeddings (UMAP/t-SNE) alongside metrics
|
||||
|
||||
---
|
||||
|
||||
## Tokenization: Region Tokenization Utilities
|
||||
|
||||
### Overview
|
||||
|
||||
Tokenization converts genomic regions into discrete tokens using a reference universe, enabling word2vec-style training.
|
||||
|
||||
### When to Use
|
||||
|
||||
Tokenization is a required preprocessing step for:
|
||||
- Region2Vec training
|
||||
- scEmbed model training
|
||||
- Any embedding method requiring discrete tokens
|
||||
|
||||
### Hard Tokenization
|
||||
|
||||
Strict overlap-based tokenization:
|
||||
|
||||
```python
|
||||
from geniml.tokenization import hard_tokenization
|
||||
|
||||
hard_tokenization(
|
||||
src_folder='bed_files/',
|
||||
dst_folder='tokenized/',
|
||||
universe_file='universe.bed',
|
||||
p_value_threshold=1e-9
|
||||
)
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
- `p_value_threshold`: Significance level for overlap (typically 1e-9 or 1e-6)
|
||||
|
||||
### Soft Tokenization
|
||||
|
||||
Probabilistic tokenization allowing partial matches:
|
||||
|
||||
```python
|
||||
from geniml.tokenization import soft_tokenization
|
||||
|
||||
soft_tokenization(
|
||||
src_folder='bed_files/',
|
||||
dst_folder='tokenized/',
|
||||
universe_file='universe.bed',
|
||||
overlap_threshold=0.5
|
||||
)
|
||||
```
|
||||
|
||||
**Parameters:**
|
||||
- `overlap_threshold`: Minimum overlap fraction (0-1)
|
||||
|
||||
### Universe-Based Tokenization
|
||||
|
||||
Map regions to universe tokens with custom parameters:
|
||||
|
||||
```python
|
||||
from geniml.tokenization import universe_tokenization
|
||||
|
||||
universe_tokenization(
|
||||
bed_file='peaks.bed',
|
||||
universe_file='universe.bed',
|
||||
output_file='tokens.txt',
|
||||
method='hard',
|
||||
threshold=1e-9
|
||||
)
|
||||
```
|
||||
|
||||
### Best Practices
|
||||
|
||||
- **Universe quality**: Use comprehensive, well-constructed universes
|
||||
- **Threshold selection**: More stringent (lower p-value) for higher confidence
|
||||
- **Validation**: Check tokenization coverage (what % of regions tokenized)
|
||||
- **Consistency**: Use same universe and parameters across related analyses
|
||||
|
||||
### Tokenization Coverage
|
||||
|
||||
Check how well regions tokenize:
|
||||
|
||||
```python
|
||||
from geniml.tokenization import check_coverage
|
||||
|
||||
coverage = check_coverage(
|
||||
bed_file='peaks.bed',
|
||||
universe_file='universe.bed',
|
||||
threshold=1e-9
|
||||
)
|
||||
|
||||
print(f"Tokenization coverage: {coverage:.1%}")
|
||||
```
|
||||
|
||||
Aim for >80% coverage for reliable training.
|
||||
|
||||
---
|
||||
|
||||
## Text2BedNN: Search Backend
|
||||
|
||||
### Overview
|
||||
|
||||
Text2BedNN creates neural network-based search backends for querying genomic regions using natural language or metadata.
|
||||
|
||||
### When to Use
|
||||
|
||||
Use Text2BedNN when:
|
||||
- Building search interfaces for genomic databases
|
||||
- Enabling natural language queries over BED files
|
||||
- Creating metadata-aware search systems
|
||||
- Deploying interactive genomic search applications
|
||||
|
||||
### Workflow
|
||||
|
||||
**Step 1: Prepare embeddings**
|
||||
|
||||
Train BEDspace or Region2Vec model with metadata.
|
||||
|
||||
**Step 2: Build search index**
|
||||
|
||||
```python
|
||||
from geniml.search import build_search_index
|
||||
|
||||
build_search_index(
|
||||
embeddings_file='bedspace_model/embeddings.npy',
|
||||
metadata_file='metadata.csv',
|
||||
output_dir='search_backend/'
|
||||
)
|
||||
```
|
||||
|
||||
**Step 3: Query the index**
|
||||
|
||||
```python
|
||||
from geniml.search import SearchBackend
|
||||
|
||||
backend = SearchBackend.load('search_backend/')
|
||||
|
||||
# Natural language query
|
||||
results = backend.query(
|
||||
text="T cell regulatory regions",
|
||||
top_k=10
|
||||
)
|
||||
|
||||
# Metadata query
|
||||
results = backend.query(
|
||||
metadata={'cell_type': 'T_cell', 'tissue': 'blood'},
|
||||
top_k=10
|
||||
)
|
||||
```
|
||||
|
||||
### Best Practices
|
||||
|
||||
- Train embeddings with rich metadata for better search
|
||||
- Index large collections for comprehensive coverage
|
||||
- Validate search relevance on known queries
|
||||
- Deploy with API for interactive applications
|
||||
|
||||
---
|
||||
|
||||
## Additional Tools
|
||||
|
||||
### I/O Utilities
|
||||
|
||||
```python
|
||||
from geniml.io import read_bed, write_bed, load_universe
|
||||
|
||||
# Read BED file
|
||||
regions = read_bed('peaks.bed')
|
||||
|
||||
# Write BED file
|
||||
write_bed(regions, 'output.bed')
|
||||
|
||||
# Load universe
|
||||
universe = load_universe('universe.bed')
|
||||
```
|
||||
|
||||
### Model Utilities
|
||||
|
||||
```python
|
||||
from geniml.models import save_model, load_model
|
||||
|
||||
# Save trained model
|
||||
save_model(model, 'my_model/')
|
||||
|
||||
# Load model
|
||||
model = load_model('my_model/')
|
||||
```
|
||||
|
||||
### Common Patterns
|
||||
|
||||
**Pipeline workflow:**
|
||||
```python
|
||||
# 1. Build universe
|
||||
universe = build_universe(coverage_folder='coverage/', method='cc', cutoff=5)
|
||||
|
||||
# 2. Tokenize
|
||||
hard_tokenization(src_folder='beds/', dst_folder='tokens/',
|
||||
universe_file='universe.bed', p_value_threshold=1e-9)
|
||||
|
||||
# 3. Train embeddings
|
||||
region2vec(token_folder='tokens/', save_dir='model/', num_shufflings=1000)
|
||||
|
||||
# 4. Evaluate
|
||||
metrics = evaluate_embeddings(embeddings_file='model/embeddings.npy',
|
||||
labels_file='metadata.csv')
|
||||
```
|
||||
|
||||
This modular design allows flexible composition of geniml tools for diverse genomic ML workflows.
|
||||
Reference in New Issue
Block a user