Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/geniml/references/utilities.md
+++ b/skills/geniml/references/utilities.md
@@ -0,0 +1,385 @@
+# Geniml Utilities and Additional Tools
+
+## BBClient: BED File Caching
+
+### Overview
+
+BBClient provides efficient caching of BED files from remote sources, enabling faster repeated access and integration with R workflows.
+
+### When to Use
+
+Use BBClient when:
+- Repeatedly accessing BED files from remote databases
+- Working with BEDbase repositories
+- Integrating genomic data with R pipelines
+- Need local caching for performance
+
+### Python Usage
+
+```python
+from geniml.bbclient import BBClient
+
+# Initialize client
+client = BBClient(cache_folder='~/.bedcache')
+
+# Fetch and cache BED file
+bed_file = client.load_bed(bed_id='GSM123456')
+
+# Access cached file
+regions = client.get_regions('GSM123456')
+```
+
+### R Integration
+
+```r
+library(reticulate)
+geniml <- import("geniml.bbclient")
+
+# Initialize client
+client <- geniml$BBClient(cache_folder='~/.bedcache')
+
+# Load BED file
+bed_file <- client$load_bed(bed_id='GSM123456')
+```
+
+### Best Practices
+
+- Configure cache directory with sufficient storage
+- Use consistent cache locations across analyses
+- Clear cache periodically to remove unused files
+
+---
+
+## BEDshift: BED File Randomization
+
+### Overview
+
+BEDshift provides tools for randomizing BED files while preserving genomic context, essential for generating null distributions and statistical testing.
+
+### When to Use
+
+Use BEDshift when:
+- Creating null models for statistical testing
+- Generating control datasets
+- Assessing significance of genomic overlaps
+- Benchmarking analysis methods
+
+### Usage
+
+```python
+from geniml.bedshift import bedshift
+
+# Randomize BED file preserving chromosome distribution
+randomized = bedshift(
+    input_bed='peaks.bed',
+    genome='hg38',
+    preserve_chrom=True,
+    n_iterations=100
+)
+```
+
+### CLI Usage
+
+```bash
+geniml bedshift \
+  --input peaks.bed \
+  --genome hg38 \
+  --preserve-chrom \
+  --iterations 100 \
+  --output randomized_peaks.bed
+```
+
+### Randomization Strategies
+
+**Preserve chromosome distribution:**
+```python
+bedshift(input_bed, genome, preserve_chrom=True)
+```
+Maintains regions on same chromosomes as original.
+
+**Preserve distance distribution:**
+```python
+bedshift(input_bed, genome, preserve_distance=True)
+```
+Maintains inter-region distances.
+
+**Preserve region sizes:**
+```python
+bedshift(input_bed, genome, preserve_size=True)
+```
+Keeps original region lengths.
+
+### Best Practices
+
+- Choose randomization strategy matching null hypothesis
+- Generate multiple iterations for robust statistics
+- Validate randomized output maintains desired properties
+- Document randomization parameters for reproducibility
+
+---
+
+## Evaluation: Model Assessment Tools
+
+### Overview
+
+Geniml provides evaluation utilities for assessing embedding quality and model performance.
+
+### When to Use
+
+Use evaluation tools when:
+- Validating trained embeddings
+- Comparing different models
+- Assessing clustering quality
+- Publishing model results
+
+### Embedding Evaluation
+
+```python
+from geniml.evaluation import evaluate_embeddings
+
+# Evaluate Region2Vec embeddings
+metrics = evaluate_embeddings(
+    embeddings_file='region2vec_model/embeddings.npy',
+    labels_file='metadata.csv',
+    metrics=['silhouette', 'davies_bouldin', 'calinski_harabasz']
+)
+
+print(f"Silhouette score: {metrics['silhouette']:.3f}")
+print(f"Davies-Bouldin index: {metrics['davies_bouldin']:.3f}")
+```
+
+### Clustering Metrics
+
+**Silhouette score:** Measures cluster cohesion and separation (-1 to 1, higher better)
+
+**Davies-Bouldin index:** Average similarity between clusters (≥0, lower better)
+
+**Calinski-Harabasz score:** Ratio of between/within cluster dispersion (higher better)
+
+### scEmbed Cell-Type Annotation Evaluation
+
+```python
+from geniml.evaluation import evaluate_annotation
+
+# Evaluate cell-type predictions
+results = evaluate_annotation(
+    predicted=adata.obs['predicted_celltype'],
+    true=adata.obs['true_celltype'],
+    metrics=['accuracy', 'f1', 'confusion_matrix']
+)
+
+print(f"Accuracy: {results['accuracy']:.1%}")
+print(f"F1 score: {results['f1']:.3f}")
+```
+
+### Best Practices
+
+- Use multiple complementary metrics
+- Compare against baseline models
+- Report metrics on held-out test data
+- Visualize embeddings (UMAP/t-SNE) alongside metrics
+
+---
+
+## Tokenization: Region Tokenization Utilities
+
+### Overview
+
+Tokenization converts genomic regions into discrete tokens using a reference universe, enabling word2vec-style training.
+
+### When to Use
+
+Tokenization is a required preprocessing step for:
+- Region2Vec training
+- scEmbed model training
+- Any embedding method requiring discrete tokens
+
+### Hard Tokenization
+
+Strict overlap-based tokenization:
+
+```python
+from geniml.tokenization import hard_tokenization
+
+hard_tokenization(
+    src_folder='bed_files/',
+    dst_folder='tokenized/',
+    universe_file='universe.bed',
+    p_value_threshold=1e-9
+)
+```
+
+**Parameters:**
+- `p_value_threshold`: Significance level for overlap (typically 1e-9 or 1e-6)
+
+### Soft Tokenization
+
+Probabilistic tokenization allowing partial matches:
+
+```python
+from geniml.tokenization import soft_tokenization
+
+soft_tokenization(
+    src_folder='bed_files/',
+    dst_folder='tokenized/',
+    universe_file='universe.bed',
+    overlap_threshold=0.5
+)
+```
+
+**Parameters:**
+- `overlap_threshold`: Minimum overlap fraction (0-1)
+
+### Universe-Based Tokenization
+
+Map regions to universe tokens with custom parameters:
+
+```python
+from geniml.tokenization import universe_tokenization
+
+universe_tokenization(
+    bed_file='peaks.bed',
+    universe_file='universe.bed',
+    output_file='tokens.txt',
+    method='hard',
+    threshold=1e-9
+)
+```
+
+### Best Practices
+
+- **Universe quality**: Use comprehensive, well-constructed universes
+- **Threshold selection**: More stringent (lower p-value) for higher confidence
+- **Validation**: Check tokenization coverage (what % of regions tokenized)
+- **Consistency**: Use same universe and parameters across related analyses
+
+### Tokenization Coverage
+
+Check how well regions tokenize:
+
+```python
+from geniml.tokenization import check_coverage
+
+coverage = check_coverage(
+    bed_file='peaks.bed',
+    universe_file='universe.bed',
+    threshold=1e-9
+)
+
+print(f"Tokenization coverage: {coverage:.1%}")
+```
+
+Aim for >80% coverage for reliable training.
+
+---
+
+## Text2BedNN: Search Backend
+
+### Overview
+
+Text2BedNN creates neural network-based search backends for querying genomic regions using natural language or metadata.
+
+### When to Use
+
+Use Text2BedNN when:
+- Building search interfaces for genomic databases
+- Enabling natural language queries over BED files
+- Creating metadata-aware search systems
+- Deploying interactive genomic search applications
+
+### Workflow
+
+**Step 1: Prepare embeddings**
+
+Train BEDspace or Region2Vec model with metadata.
+
+**Step 2: Build search index**
+
+```python
+from geniml.search import build_search_index
+
+build_search_index(
+    embeddings_file='bedspace_model/embeddings.npy',
+    metadata_file='metadata.csv',
+    output_dir='search_backend/'
+)
+```
+
+**Step 3: Query the index**
+
+```python
+from geniml.search import SearchBackend
+
+backend = SearchBackend.load('search_backend/')
+
+# Natural language query
+results = backend.query(
+    text="T cell regulatory regions",
+    top_k=10
+)
+
+# Metadata query
+results = backend.query(
+    metadata={'cell_type': 'T_cell', 'tissue': 'blood'},
+    top_k=10
+)
+```
+
+### Best Practices
+
+- Train embeddings with rich metadata for better search
+- Index large collections for comprehensive coverage
+- Validate search relevance on known queries
+- Deploy with API for interactive applications
+
+---
+
+## Additional Tools
+
+### I/O Utilities
+
+```python
+from geniml.io import read_bed, write_bed, load_universe
+
+# Read BED file
+regions = read_bed('peaks.bed')
+
+# Write BED file
+write_bed(regions, 'output.bed')
+
+# Load universe
+universe = load_universe('universe.bed')
+```
+
+### Model Utilities
+
+```python
+from geniml.models import save_model, load_model
+
+# Save trained model
+save_model(model, 'my_model/')
+
+# Load model
+model = load_model('my_model/')
+```
+
+### Common Patterns
+
+**Pipeline workflow:**
+```python
+# 1. Build universe
+universe = build_universe(coverage_folder='coverage/', method='cc', cutoff=5)
+
+# 2. Tokenize
+hard_tokenization(src_folder='beds/', dst_folder='tokens/',
+                   universe_file='universe.bed', p_value_threshold=1e-9)
+
+# 3. Train embeddings
+region2vec(token_folder='tokens/', save_dir='model/', num_shufflings=1000)
+
+# 4. Evaluate
+metrics = evaluate_embeddings(embeddings_file='model/embeddings.npy',
+                               labels_file='metadata.csv')
+```
+
+This modular design allows flexible composition of geniml tools for diverse genomic ML workflows.