zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Files

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

6.2 KiB

Raw Blame History

Consensus Peaks: Universe Building

Overview

Geniml provides tools for building genomic "universes" — standardized reference sets of consensus peaks from collections of BED files. These universes represent genomic regions where analyzed datasets show significant coverage overlap, serving as reference vocabularies for tokenization and analysis.

When to Use

Use consensus peak creation when:

Building reference peak sets from multiple experiments
Creating universe files for Region2Vec or BEDspace tokenization
Standardizing genomic regions across a collection of datasets
Defining regions of interest with statistical significance

Workflow

Step 1: Combine BED Files

Merge all BED files into a single combined file:

cat /path/to/bed/files/*.bed > combined_files.bed

Step 2: Generate Coverage Tracks

Create bigWig coverage tracks using uniwig with a smoothing window:

uniwig -m 25 combined_files.bed chrom.sizes coverage/

Parameters:

-m 25: Smoothing window size (25bp typical for chromatin accessibility)
chrom.sizes: Chromosome sizes file for your genome
coverage/: Output directory for bigWig files

The smoothing window helps reduce noise and creates more robust peak boundaries.

Step 3: Build Universe

Use one of four methods to construct the consensus peaks:

Universe-Building Methods

1. Coverage Cutoff (CC)

The simplest approach using a fixed coverage threshold:

geniml universe build cc \
  --coverage-folder coverage/ \
  --output-file universe_cc.bed \
  --cutoff 5 \
  --merge 100 \
  --filter-size 50

Parameters:

--cutoff: Coverage threshold (1 = union; file count = intersection)
--merge: Distance for merging adjacent peaks (bp)
--filter-size: Minimum peak size for inclusion (bp)

Use when: Simple threshold-based selection is sufficient

2. Coverage Cutoff Flexible (CCF)

Creates confidence intervals around likelihood cutoffs for boundaries and region cores:

geniml universe build ccf \
  --coverage-folder coverage/ \
  --output-file universe_ccf.bed \
  --cutoff 5 \
  --confidence 0.95 \
  --merge 100 \
  --filter-size 50

Additional parameters:

--confidence: Confidence level for flexible boundaries (0-1)

Use when: Uncertainty in peak boundaries should be captured

3. Maximum Likelihood (ML)

Builds probabilistic models accounting for region start/end positions:

geniml universe build ml \
  --coverage-folder coverage/ \
  --output-file universe_ml.bed \
  --merge 100 \
  --filter-size 50 \
  --model-type gaussian

Parameters:

--model-type: Distribution for likelihood estimation (gaussian, poisson)

Use when: Statistical modeling of peak locations is important

4. Hidden Markov Model (HMM)

Models genomic regions as hidden states with coverage as emissions:

geniml universe build hmm \
  --coverage-folder coverage/ \
  --output-file universe_hmm.bed \
  --states 3 \
  --merge 100 \
  --filter-size 50

Parameters:

--states: Number of HMM hidden states (typically 2-5)

Use when: Complex patterns of genomic states should be captured

Python API

from geniml.universe import build_universe

# Build using coverage cutoff method
universe = build_universe(
    coverage_folder='coverage/',
    method='cc',
    cutoff=5,
    merge_distance=100,
    min_size=50,
    output_file='universe.bed'
)

Method Comparison

Method	Complexity	Flexibility	Computational Cost	Best For
CC	Low	Low	Low	Quick reference sets
CCF	Medium	Medium	Medium	Boundary uncertainty
ML	High	High	High	Statistical rigor
HMM	High	High	Very High	Complex patterns

Best Practices

Choosing a Method

Start with CC: Quick and interpretable for initial exploration
Use CCF: When peak boundaries are uncertain or noisy
Apply ML: For publication-quality statistical analysis
Deploy HMM: When modeling complex chromatin states

Parameter Selection

Coverage cutoff:

cutoff = 1: Union of all peaks (most permissive)
cutoff = n_files: Intersection (most stringent)
cutoff = 0.5 * n_files: Moderate consensus (typical choice)

Merge distance:

ATAC-seq: 100-200bp
ChIP-seq (narrow peaks): 50-100bp
ChIP-seq (broad peaks): 500-1000bp

Filter size:

Minimum 30bp to avoid artifacts
50-100bp typical for most assays
Larger for broad histone marks

Quality Control

After building, assess universe quality:

from geniml.evaluation import assess_universe

metrics = assess_universe(
    universe_file='universe.bed',
    coverage_folder='coverage/',
    bed_files='bed_files/'
)

print(f"Number of regions: {metrics['n_regions']}")
print(f"Mean region size: {metrics['mean_size']:.1f}bp")
print(f"Coverage of input peaks: {metrics['coverage']:.1%}")

Key metrics:

Region count: Should capture major features without excessive fragmentation
Size distribution: Should match expected biology (e.g., ~500bp for ATAC-seq)
Input coverage: Proportion of original peaks represented (typically >80%)

Output Format

Consensus peaks are saved as BED files with three required columns:

chr1    1000    1500
chr1    2000    2800
chr2    500     1000

Additional columns may include confidence scores or state annotations depending on the method.

Common Workflows

For Region2Vec

Build universe using preferred method
Use universe as tokenization reference
Tokenize BED files
Train Region2Vec model

For BEDspace

Build universe from all datasets
Use universe in preprocessing step
Train BEDspace with metadata
Query across regions and labels

For scEmbed

Create universe from bulk or aggregated scATAC-seq
Use for cell tokenization
Train scEmbed model
Generate cell embeddings

Troubleshooting

Too few regions: Lower cutoff threshold or reduce filter size

Too many regions: Raise cutoff threshold, increase merge distance, or increase filter size

Noisy boundaries: Use CCF or ML methods instead of CC

Long computation: Start with CC method for quick results, then refine with ML/HMM if needed

6.2 KiB Raw Blame History

Consensus Peaks: Universe Building

Overview

When to Use

Workflow

Step 1: Combine BED Files

Step 2: Generate Coverage Tracks

Step 3: Build Universe

Universe-Building Methods

1. Coverage Cutoff (CC)

2. Coverage Cutoff Flexible (CCF)

3. Maximum Likelihood (ML)

4. Hidden Markov Model (HMM)

Python API

Method Comparison

Best Practices

Choosing a Method

Parameter Selection

Quality Control

Output Format

Common Workflows

For Region2Vec

For BEDspace

For scEmbed

Troubleshooting

6.2 KiB

Raw Blame History