2025-11-30 08:30:10 +08:00

deepTools Normalization Methods

This document explains the various normalization methods available in deepTools and when to use each one.

Why Normalize?

Normalization is essential for:

  1. Comparing samples with different sequencing depths
  2. Accounting for library size differences
  3. Making coverage values interpretable across experiments
  4. Enabling fair comparisons between conditions

Without normalization, a sample with 100 million reads will appear to have higher coverage than a sample with 50 million reads, even if the true biological signal is identical.


Available Normalization Methods

1. RPKM (Reads Per Kilobase per Million mapped reads)

Formula: (Number of reads) / (Length of region in kb × Total mapped reads in millions)

When to use:

  • Comparing different genomic regions within the same sample
  • Adjusting for both sequencing depth AND region length
  • RNA-seq gene expression analysis

Available in: bamCoverage, bamCompare (in bamCompare, --normalizeUsing requires --scaleFactorsMethod None)

Example:

bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPKM

Interpretation: RPKM of 10 means 10 reads per kilobase of feature per million mapped reads.
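As a quick check of the formula, here is a minimal Python sketch of the RPKM arithmetic. The function name and the example numbers are hypothetical and are not part of deepTools.

```python
def rpkm(reads_in_region: int, region_length_bp: int, total_mapped_reads: int) -> float:
    """RPKM = reads / (region length in kb * mapped reads in millions)."""
    length_kb = region_length_bp / 1_000
    mapped_millions = total_mapped_reads / 1_000_000
    return reads_in_region / (length_kb * mapped_millions)


# Hypothetical numbers: 200 reads over a 2 kb region, 50 million mapped reads.
print(rpkm(200, 2_000, 50_000_000))  # 200 / (2 * 50) = 2.0
```

Doubling the region length while keeping the read count fixed halves the RPKM, which is the length correction at work.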

Pros:

  • Accounts for both region length and library size
  • Widely used and understood in genomics

Cons:

  • Not ideal for comparing between samples if total RNA content differs
  • Can be misleading when comparing samples with very different compositions

2. CPM (Counts Per Million mapped reads)

Formula: (Number of reads) / (Total mapped reads in millions)

Also known as: RPM (Reads Per Million)

When to use:

  • Comparing the same genomic regions across different samples
  • When region length is constant or not relevant
  • ChIP-seq, ATAC-seq, DNase-seq analyses

Available in: bamCoverage, bamCompare

Example:

bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing CPM

Interpretation: CPM of 5 means 5 reads per million mapped reads in that bin.
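The sketch below works through the CPM formula for two samples sequenced at different depths. The function name and read counts are made up for illustration; they mirror the 100 million vs 50 million read scenario from the introduction.

```python
def cpm(reads_in_bin: int, total_mapped_reads: int) -> float:
    """CPM = reads in bin / (total mapped reads in millions)."""
    return reads_in_bin / (total_mapped_reads / 1_000_000)


# Same underlying signal, but one library is sequenced twice as deeply.
deep = cpm(1_000, 100_000_000)    # 1000 / 100 = 10.0
shallow = cpm(500, 50_000_000)    # 500 / 50 = 10.0
print(deep, shallow)
```

The raw counts differ by a factor of two, yet both bins come out at the same CPM.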

Pros:

  • Simple and intuitive
  • Good for comparing samples with different sequencing depths
  • Appropriate when comparing fixed-size bins

Cons:

  • Does not account for region length
  • Affected by highly abundant regions (e.g., rRNA in RNA-seq)

3. BPM (Bins Per Million mapped reads)

Formula: (Number of reads in bin) / (Sum of all reads in bins in millions)

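Key difference from CPM: The denominator is the sum of reads counted across the analyzed bins rather than the total number of mapped reads. The deepTools documentation describes BPM as the analogue of TPM in RNA-seq.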

When to use:

  • Similar to CPM, but when you want to exclude reads outside analyzed regions
  • Comparing specific genomic regions while ignoring background

Available in: bamCoverage, bamCompare

Example:

bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing BPM

Interpretation: BPM of 5 means the bin holds 5 of every million reads counted across all bins.
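A minimal Python sketch of the BPM formula as stated above, using a toy list of per-bin counts. The function name and counts are hypothetical, and this illustrates the formula only, not the deepTools implementation.

```python
def bpm(bin_counts: list[int]) -> list[float]:
    """BPM per bin = reads in bin / (sum of reads over all bins, in millions)."""
    total = sum(bin_counts)
    return [count * 1_000_000 / total for count in bin_counts]


values = bpm([10, 30, 60])
print(values)       # each bin's share of the binned reads, per million
print(sum(values))  # BPM values over all bins add up to one million
```

Because the denominator is the sum over the bins themselves, the values always total one million regardless of sequencing depth.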

Pros:

  • Focuses normalization on analyzed regions
  • Less affected by reads in unanalyzed areas

Cons:

  • Less commonly used, may be harder to compare with published data

4. RPGC (Reads Per Genomic Content)

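Formula: (Number of reads in bin) / (Scaling factor for 1× average coverage)

Scaling factor: (Total mapped reads × Fragment length) / Effective genome size, i.e. the sample's current average coverage. Dividing by it brings the genome-wide average to 1×.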

When to use:

  • Want comparable coverage values across samples
  • Need interpretable absolute coverage values
  • Comparing samples with very different total read counts
  • ChIP-seq with spike-in normalization context

Available in: bamCoverage, bamCompare

Requires: --effectiveGenomeSize parameter

Example:

bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398

Interpretation: Signal is expressed relative to a genome-wide average of 1× (e.g., a value of 2 means twice the average coverage).
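The following Python sketch shows one way to compute the 1× scaling, based on how the deepTools documentation describes RPGC: estimate the sample's current average coverage from read count, fragment length, and effective genome size, then divide by it. The function name, read count, and fragment length are assumptions chosen for illustration.

```python
def rpgc_scale_factor(mapped_reads: int, fragment_length_bp: int,
                      effective_genome_size: int) -> float:
    """Multiplier that rescales a sample to an average coverage of 1x."""
    current_coverage = mapped_reads * fragment_length_bp / effective_genome_size
    return 1.0 / current_coverage


# Hypothetical library: ~29.1 million reads, 200 bp fragments, GRCh38 size.
factor = rpgc_scale_factor(29_130_224, 200, 2_913_022_398)
print(factor)        # close to 0.5, since current coverage is about 2x
print(4 * factor)    # a bin at 4x raw coverage lands near 2 after scaling
```

A bin that ends up near 2 is covered about twice as deeply as the genome-wide average.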

Pros:

  • Produces 1× normalized coverage
  • Interpretable in terms of genomic coverage
  • Good for comparing samples with different sequencing depths

Cons:

  • Requires knowing effective genome size
  • Assumes uniform coverage (not true for ChIP-seq with peaks)

5. None (No Normalization)

Formula: Raw read counts

When to use:

  • Preliminary analysis
  • When samples have identical library sizes (rare)
  • When downstream tool will perform normalization
  • Debugging or quality control

Available in: bamCoverage, bamCompare (the default value of --normalizeUsing)

Example:

bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing None

Interpretation: Raw read counts per bin.

Pros:

  • No assumptions made
  • Useful for seeing raw data
  • Fastest computation

Cons:

  • Cannot fairly compare samples with different sequencing depths
  • Not suitable for publication figures

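6. SES (Signal Extraction Scaling)

Method: Signal Extraction Scaling. Estimates the scaling factor between ChIP and control from background regions rather than from total read counts, so that genuine enrichment does not distort the comparison.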

When to use:

  • ChIP-seq analysis with bamCompare
  • Want sophisticated background correction
  • Alternative to simple readCount scaling

Available in: bamCompare only

Example:

bamCompare -b1 chip.bam -b2 input.bam -o output.bw \
    --scaleFactorsMethod SES

Note: SES is specifically designed for ChIP-seq data and may work better than simple read count scaling for noisy data.


7. readCount (Read Count Scaling)

Method: Scale by ratio of total read counts between samples

When to use:

  • Default for bamCompare
  • Compensating for sequencing depth differences in comparisons
  • When you trust that total read counts reflect library size

Available in: bamCompare

Example:

bamCompare -b1 treatment.bam -b2 control.bam -o output.bw \
    --scaleFactorsMethod readCount

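How it works: Each sample is multiplied by (smallest read count) / (its own read count), so the deeper sample is scaled down. If sample1 has 100M reads and sample2 has 50M reads, sample1 is scaled by 0.5× and sample2 is left unchanged before comparison.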
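The Python sketch below illustrates read-count scaling followed by a log2 ratio. It assumes the factors are computed as the smallest read count divided by each sample's read count, which is my understanding of the deepTools behavior; scaling the smaller sample up instead would give the same 2-fold relative adjustment. The function names are hypothetical, and the pseudocount of 1 is an assumption modeled on bamCompare's --pseudocount option.

```python
import math


def read_count_scale_factors(mapped_reads_1: int, mapped_reads_2: int) -> tuple[float, float]:
    """Scale both samples down to the depth of the shallower one."""
    smallest = min(mapped_reads_1, mapped_reads_2)
    return smallest / mapped_reads_1, smallest / mapped_reads_2


def log2_ratio(count_1: float, count_2: float, factor_1: float, factor_2: float,
               pseudocount: float = 1.0) -> float:
    """log2 of the scaled counts; the pseudocount avoids division by zero."""
    return math.log2((count_1 * factor_1 + pseudocount) /
                     (count_2 * factor_2 + pseudocount))


f1, f2 = read_count_scale_factors(100_000_000, 50_000_000)
print(f1, f2)                       # 0.5 1.0
# A bin with 40 vs 20 raw reads carries the same signal once depth is removed.
print(log2_ratio(40, 20, f1, f2))   # log2(21 / 21) = 0.0
```

Without the scale factors the same bin would show a spurious enrichment of roughly log2(41 / 21).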


Normalization Method Selection Guide

For ChIP-seq Coverage Tracks

Recommended: RPGC or CPM

bamCoverage --bam chip.bam --outFileName chip.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --extendReads 200 \
    --ignoreDuplicates

Reasoning: Accounts for sequencing depth differences; RPGC provides interpretable coverage values.


For ChIP-seq Comparisons (Treatment vs Control)

Recommended: log2 ratio with readCount or SES scaling

bamCompare -b1 chip.bam -b2 input.bam -o ratio.bw \
    --operation log2 \
    --scaleFactorsMethod readCount \
    --extendReads 200 \
    --ignoreDuplicates

Reasoning: Log2 ratio shows enrichment (positive) and depletion (negative); readCount adjusts for depth.


For RNA-seq Coverage Tracks

Recommended: CPM or RPKM

# Strand-specific forward
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
    --normalizeUsing CPM \
    --filterRNAstrand forward

# RPKM additionally divides by bin length (in kb)
bamCoverage --bam rnaseq.bam --outFileName output.bw \
    --normalizeUsing RPKM

Reasoning: CPM suits fixed-width bins. RPKM matters when region lengths differ (e.g., per-gene summaries); for fixed-width bins it differs from CPM only by a constant factor.


For ATAC-seq

Recommended: RPGC or CPM

bamCoverage --bam atac_shifted.bam --outFileName atac.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398

Reasoning: Similar to ChIP-seq; want comparable coverage across samples.


For Sample Correlation Analysis

Recommended: CPM or RPGC

multiBamSummary bins \
    --bamfiles sample1.bam sample2.bam sample3.bam \
    -o readCounts.npz

plotCorrelation -in readCounts.npz \
    --corMethod pearson \
    --whatToShow heatmap \
    -o correlation.png

Note: multiBamSummary doesn't explicitly normalize, but Pearson and Spearman coefficients are unchanged when a sample's counts are multiplied by a constant, so depth differences alone do not alter the correlation. For very different library sizes, consider normalizing BAM files first or using CPM-normalized bigWig files with multiBigwigSummary.
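To see why a constant depth difference does not change the correlation, the Python sketch below computes Pearson's r by hand for toy bin counts, then again after doubling one sample. All counts are invented, and the helper is a plain textbook implementation, not the code plotCorrelation uses.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Textbook Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


sample1 = [12, 40, 7, 95, 33]
sample2 = [10, 35, 9, 80, 30]
sample2_deeper = [2 * count for count in sample2]  # same library, 2x the depth

r_raw = pearson(sample1, sample2)
r_scaled = pearson(sample1, sample2_deeper)
print(r_raw, r_scaled)  # the two coefficients agree up to rounding
```

The invariance only covers a uniform scale factor; composition differences between libraries still affect the result.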


Advanced Normalization Considerations

Spike-in Normalization

For experiments with spike-in controls (e.g., Drosophila chromatin spike-in for ChIP-seq):

  1. Calculate scaling factors from spike-in reads
  2. Apply custom scaling factors using --scaleFactor parameter

# Spike-in factor calculated beforehand (example: 0.8)
SCALE_FACTOR=0.8

bamCoverage --bam chip.bam --outFileName chip_spikenorm.bw \
    --scaleFactor ${SCALE_FACTOR} \
    --extendReads 200
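Step 1 can be done in several ways. The Python sketch below uses one common convention, assumed here and not prescribed by deepTools: scale every sample relative to the one with the fewest spike-in reads. Sample names and read counts are hypothetical, and other protocols (for example, the reciprocal of spike-in reads in millions) give different absolute values.

```python
def spikein_scale_factors(spikein_reads: dict[str, int]) -> dict[str, float]:
    """Scale factor per sample = smallest spike-in count / sample's spike-in count."""
    smallest = min(spikein_reads.values())
    return {sample: smallest / count for sample, count in spikein_reads.items()}


# Hypothetical counts of reads mapped to the spike-in genome.
spikein_reads = {"chip_control": 1_000_000, "chip_treated": 1_250_000}
factors = spikein_scale_factors(spikein_reads)

for sample, factor in factors.items():
    # Each value would be passed to bamCoverage via --scaleFactor.
    print(f"{sample}: --scaleFactor {factor:.3f}")
```

A sample with more spike-in reads gets a smaller factor, which matches the 0.8 used in the command above.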

Manual Scaling Factors

You can apply custom scaling factors:

# Apply 2× scaling
bamCoverage --bam input.bam --outFileName output.bw \
    --scaleFactor 2.0

Chromosome Exclusion

Exclude specific chromosomes from normalization calculations:

bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --ignoreForNormalization chrX chrY chrM

When to use: Sex chromosomes in mixed-sex samples, mitochondrial DNA, or chromosomes with unusual coverage.
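The effect of excluding chromosomes can be sketched in Python as a change to the read total used as the normalization denominator. The per-chromosome counts and the function name are hypothetical; this illustrates the idea only and is not how deepTools computes it internally.

```python
def reads_for_normalization(reads_per_chrom: dict[str, int],
                            ignored: tuple[str, ...] = ()) -> int:
    """Total reads counted toward normalization, skipping ignored chromosomes."""
    skip = set(ignored)
    return sum(n for chrom, n in reads_per_chrom.items() if chrom not in skip)


# Hypothetical mapped-read counts per chromosome.
counts = {"chr1": 8_000_000, "chr2": 7_000_000, "chrX": 3_000_000, "chrM": 2_000_000}

all_reads = reads_for_normalization(counts)
kept_reads = reads_for_normalization(counts, ("chrX", "chrY", "chrM"))
print(all_reads, kept_reads)  # 20000000 15000000
```

Names absent from the data (chrY here) are simply ignored, and a smaller denominator raises the normalized values on the remaining chromosomes.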


Common Pitfalls

1. Using RPKM for bin-based data

Problem: RPKM accounts for region length, but all bins are the same size
Solution: Use CPM or RPGC instead

2. Comparing unnormalized samples

Problem: Sample with 2× sequencing depth appears to have 2× signal
Solution: Always normalize when comparing samples

3. Wrong effective genome size

Problem: Using hg19 genome size for hg38 data
Solution: Double-check genome assembly and use correct size
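One way to avoid mixing up assemblies is to look the value up by name instead of pasting it. The Python sketch below hard-codes three sizes as I recall them from the deepTools effective genome size table; verify them against the current documentation before relying on them. The helper and alias table are hypothetical.

```python
# Effective genome sizes as recalled from the deepTools documentation (verify before use).
EFFECTIVE_GENOME_SIZES = {
    "GRCh37": 2_864_785_220,
    "GRCh38": 2_913_022_398,
    "GRCm38": 2_652_783_500,
}
# UCSC-style names mapped to the assembly names used above.
ALIASES = {"hg19": "GRCh37", "hg38": "GRCh38", "mm10": "GRCm38"}


def effective_genome_size(assembly: str) -> int:
    """Return the effective genome size, failing loudly for unknown assemblies."""
    name = ALIASES.get(assembly, assembly)
    if name not in EFFECTIVE_GENOME_SIZES:
        raise ValueError(f"No effective genome size recorded for {assembly!r}")
    return EFFECTIVE_GENOME_SIZES[name]


print(effective_genome_size("hg38"))  # 2913022398
print(effective_genome_size("hg19"))  # 2864785220
```

Raising an error for unknown assemblies is deliberate: a silent fallback is how the wrong size ends up in a command.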

4. Ignoring duplicates after GC correction

Problem: correctGCBias evens out coverage by adding or removing reads, so removing duplicates afterwards reintroduces bias
Solution: Never use --ignoreDuplicates after correctGCBias

5. Using RPGC without effective genome size

Problem: Command fails
Solution: Always specify --effectiveGenomeSize with RPGC


Normalization for Different Comparisons

Within-sample comparisons (different regions)

Use: RPKM (accounts for region length)

Between-sample comparisons (same regions)

Use: CPM, RPGC, or BPM (accounts for library size)

Treatment vs Control

Use: bamCompare with log2 ratio and readCount/SES scaling

Multiple samples correlation

Use: CPM or RPGC normalized bigWig files, then multiBigwigSummary


Quick Reference Table

| Method | Accounts for Depth | Accounts for Length | Best For | Command |
|---|---|---|---|---|
| RPKM | Yes | Yes | RNA-seq genes | --normalizeUsing RPKM |
| CPM | Yes | No | Fixed-size bins | --normalizeUsing CPM |
| BPM | Yes | No | Specific regions | --normalizeUsing BPM |
| RPGC | Yes | No | Interpretable coverage | --normalizeUsing RPGC --effectiveGenomeSize X |
| None | No | No | Raw data | --normalizeUsing None |
| SES | Yes | No | ChIP comparisons | bamCompare --scaleFactorsMethod SES |
| readCount | Yes | No | ChIP comparisons | bamCompare --scaleFactorsMethod readCount |

Further Reading

For more details on normalization theory and best practices: