zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Fork 0

Files

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

11 KiB

Raw Blame History

deepTools Normalization Methods

This document explains the various normalization methods available in deepTools and when to use each one.

Why Normalize?

Normalization is essential for:

Comparing samples with different sequencing depths
Accounting for library size differences
Making coverage values interpretable across experiments
Enabling fair comparisons between conditions

Without normalization, a sample with 100 million reads will appear to have higher coverage than a sample with 50 million reads, even if the true biological signal is identical.

Available Normalization Methods

1. RPKM (Reads Per Kilobase per Million mapped reads)

Formula: (Number of reads) / (Length of region in kb × Total mapped reads in millions)

When to use:

Comparing different genomic regions within the same sample
Adjusting for both sequencing depth AND region length
RNA-seq gene expression analysis

Available in: bamCoverage

Example:

bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPKM

Interpretation: RPKM of 10 means 10 reads per kilobase of feature per million mapped reads.

Pros:

Accounts for both region length and library size
Widely used and understood in genomics

Cons:

Not ideal for comparing between samples if total RNA content differs
Can be misleading when comparing samples with very different compositions

2. CPM (Counts Per Million mapped reads)

Formula: (Number of reads) / (Total mapped reads in millions)

Also known as: RPM (Reads Per Million)

When to use:

Comparing the same genomic regions across different samples
When region length is constant or not relevant
ChIP-seq, ATAC-seq, DNase-seq analyses

Available in: bamCoverage, bamCompare

Example:

bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing CPM

Interpretation: CPM of 5 means 5 reads per million mapped reads in that bin.

Pros:

Simple and intuitive
Good for comparing samples with different sequencing depths
Appropriate when comparing fixed-size bins

Cons:

Does not account for region length
Affected by highly abundant regions (e.g., rRNA in RNA-seq)

3. BPM (Bins Per Million mapped reads)

Formula: (Number of reads in bin) / (Sum of all reads in bins in millions)

Key difference from CPM: Only considers reads that fall within the analyzed bins, not all mapped reads.

When to use:

Similar to CPM, but when you want to exclude reads outside analyzed regions
Comparing specific genomic regions while ignoring background

Available in: bamCoverage, bamCompare

Example:

bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing BPM

Interpretation: BPM accounts only for reads in the binned regions.

Pros:

Focuses normalization on analyzed regions
Less affected by reads in unanalyzed areas

Cons:

Less commonly used, may be harder to compare with published data

4. RPGC (Reads Per Genomic Content)

Formula: (Number of reads × Scaling factor) / Effective genome size

Scaling factor: Calculated to achieve 1× genomic coverage (1 read per base)

When to use:

Want comparable coverage values across samples
Need interpretable absolute coverage values
Comparing samples with very different total read counts
ChIP-seq with spike-in normalization context

Available in: bamCoverage, bamCompare

Requires: --effectiveGenomeSize parameter

Example:

bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398

Interpretation: Signal value approximates the coverage depth (e.g., value of 2 ≈ 2× coverage).

Pros:

Produces 1× normalized coverage
Interpretable in terms of genomic coverage
Good for comparing samples with different sequencing depths

Cons:

Requires knowing effective genome size
Assumes uniform coverage (not true for ChIP-seq with peaks)

5. None (No Normalization)

Formula: Raw read counts

When to use:

Preliminary analysis
When samples have identical library sizes (rare)
When downstream tool will perform normalization
Debugging or quality control

Available in: All tools (usually default)

Example:

bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing None

Interpretation: Raw read counts per bin.

Pros:

No assumptions made
Useful for seeing raw data
Fastest computation

Cons:

Cannot fairly compare samples with different sequencing depths
Not suitable for publication figures

6. SES (Selective Enrichment Statistics)

Method: Signal Extraction Scaling - more sophisticated method for comparing ChIP to control

When to use:

ChIP-seq analysis with bamCompare
Want sophisticated background correction
Alternative to simple readCount scaling

Available in: bamCompare only

Example:

bamCompare -b1 chip.bam -b2 input.bam -o output.bw \
    --scaleFactorsMethod SES

Note: SES is specifically designed for ChIP-seq data and may work better than simple read count scaling for noisy data.

7. readCount (Read Count Scaling)

Method: Scale by ratio of total read counts between samples

When to use:

Default for bamCompare
Compensating for sequencing depth differences in comparisons
When you trust that total read counts reflect library size

Available in: bamCompare

Example:

bamCompare -b1 treatment.bam -b2 control.bam -o output.bw \
    --scaleFactorsMethod readCount

How it works: If sample1 has 100M reads and sample2 has 50M reads, sample2 is scaled by 2× before comparison.

Normalization Method Selection Guide

For ChIP-seq Coverage Tracks

Recommended: RPGC or CPM

bamCoverage --bam chip.bam --outFileName chip.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --extendReads 200 \
    --ignoreDuplicates

Reasoning: Accounts for sequencing depth differences; RPGC provides interpretable coverage values.

For ChIP-seq Comparisons (Treatment vs Control)

Recommended: log2 ratio with readCount or SES scaling

bamCompare -b1 chip.bam -b2 input.bam -o ratio.bw \
    --operation log2 \
    --scaleFactorsMethod readCount \
    --extendReads 200 \
    --ignoreDuplicates

Reasoning: Log2 ratio shows enrichment (positive) and depletion (negative); readCount adjusts for depth.

For RNA-seq Coverage Tracks

Recommended: CPM or RPKM

# Strand-specific forward
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
    --normalizeUsing CPM \
    --filterRNAstrand forward

# For gene-level: RPKM accounts for gene length
bamCoverage --bam rnaseq.bam --outFileName output.bw \
    --normalizeUsing RPKM

Reasoning: CPM for comparing fixed-width bins; RPKM for genes (accounts for length).

For ATAC-seq

Recommended: RPGC or CPM

bamCoverage --bam atac_shifted.bam --outFileName atac.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398

Reasoning: Similar to ChIP-seq; want comparable coverage across samples.

For Sample Correlation Analysis

Recommended: CPM or RPGC

multiBamSummary bins \
    --bamfiles sample1.bam sample2.bam sample3.bam \
    -o readCounts.npz

plotCorrelation -in readCounts.npz \
    --corMethod pearson \
    --whatToShow heatmap \
    -o correlation.png

Note: multiBamSummary doesn't explicitly normalize, but correlation analysis is robust to scaling. For very different library sizes, consider normalizing BAM files first or using CPM-normalized bigWig files with multiBigwigSummary.

Advanced Normalization Considerations

Spike-in Normalization

For experiments with spike-in controls (e.g., Drosophila chromatin spike-in for ChIP-seq):

Calculate scaling factors from spike-in reads
Apply custom scaling factors using --scaleFactor parameter

# Calculate spike-in factor (example: 0.8)
SCALE_FACTOR=0.8

bamCoverage --bam chip.bam --outFileName chip_spikenorm.bw \
    --scaleFactor ${SCALE_FACTOR} \
    --extendReads 200

Manual Scaling Factors

You can apply custom scaling factors:

# Apply 2× scaling
bamCoverage --bam input.bam --outFileName output.bw \
    --scaleFactor 2.0

Chromosome Exclusion

Exclude specific chromosomes from normalization calculations:

bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --ignoreForNormalization chrX chrY chrM

When to use: Sex chromosomes in mixed-sex samples, mitochondrial DNA, or chromosomes with unusual coverage.

Common Pitfalls

1. Using RPKM for bin-based data

Problem: RPKM accounts for region length, but all bins are the same size Solution: Use CPM or RPGC instead

2. Comparing unnormalized samples

Problem: Sample with 2× sequencing depth appears to have 2× signal Solution: Always normalize when comparing samples

3. Wrong effective genome size

Problem: Using hg19 genome size for hg38 data Solution: Double-check genome assembly and use correct size

4. Ignoring duplicates after GC correction

Problem: Can introduce bias Solution: Never use --ignoreDuplicates after correctGCBias

5. Using RPGC without effective genome size

Problem: Command fails Solution: Always specify --effectiveGenomeSize with RPGC

Normalization for Different Comparisons

Within-sample comparisons (different regions)

Use: RPKM (accounts for region length)

Between-sample comparisons (same regions)

Use: CPM, RPGC, or BPM (accounts for library size)

Treatment vs Control

Use: bamCompare with log2 ratio and readCount/SES scaling

Multiple samples correlation

Use: CPM or RPGC normalized bigWig files, then multiBigwigSummary

Quick Reference Table

Method	Accounts for Depth	Accounts for Length	Best For	Command
RPKM	✓	✓	RNA-seq genes	`--normalizeUsing RPKM`
CPM	✓	✗	Fixed-size bins	`--normalizeUsing CPM`
BPM	✓	✗	Specific regions	`--normalizeUsing BPM`
RPGC	✓	✗	Interpretable coverage	`--normalizeUsing RPGC --effectiveGenomeSize X`
None	✗	✗	Raw data	`--normalizeUsing None`
SES	✓	✗	ChIP comparisons	`bamCompare --scaleFactorsMethod SES`
readCount	✓	✗	ChIP comparisons	`bamCompare --scaleFactorsMethod readCount`

11 KiB Raw Blame History Unescape Escape

deepTools Normalization Methods

Why Normalize?

Available Normalization Methods

1. RPKM (Reads Per Kilobase per Million mapped reads)

2. CPM (Counts Per Million mapped reads)

3. BPM (Bins Per Million mapped reads)

4. RPGC (Reads Per Genomic Content)

5. None (No Normalization)

6. SES (Selective Enrichment Statistics)

7. readCount (Read Count Scaling)

Normalization Method Selection Guide

For ChIP-seq Coverage Tracks

For ChIP-seq Comparisons (Treatment vs Control)

For RNA-seq Coverage Tracks

For ATAC-seq

For Sample Correlation Analysis

Advanced Normalization Considerations

Spike-in Normalization

Manual Scaling Factors

Chromosome Exclusion

Common Pitfalls

1. Using RPKM for bin-based data

2. Comparing unnormalized samples

3. Wrong effective genome size

4. Ignoring duplicates after GC correction

5. Using RPGC without effective genome size

Normalization for Different Comparisons

Within-sample comparisons (different regions)

Between-sample comparisons (same regions)

Treatment vs Control

Multiple samples correlation

Quick Reference Table

Further Reading

11 KiB

Raw Blame History