# deepTools Normalization Methods

This document explains the normalization methods available in deepTools and when to use each one.

## Why Normalize?

Normalization is essential for:

1. **Comparing samples with different sequencing depths**
2. **Accounting for library size differences**
3. **Making coverage values interpretable across experiments**
4. **Enabling fair comparisons between conditions**

Without normalization, a sample with 100 million reads will appear to have roughly twice the coverage of a sample with 50 million reads, even if the underlying biological signal is identical.

---

## Available Normalization Methods

### 1. RPKM (Reads Per Kilobase per Million mapped reads)

**Formula:** `(Number of reads) / (Length of region in kb × Total mapped reads in millions)`

**When to use:**
- Comparing different genomic regions within the same sample
- Adjusting for both sequencing depth AND region length
- RNA-seq gene expression analysis

**Available in:** `bamCoverage`, `bamCompare` (with `--scaleFactorsMethod None`)

**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPKM
```

**Interpretation:** An RPKM of 10 means 10 reads per kilobase of feature per million mapped reads.

**Pros:**
- Accounts for both region length and library size
- Widely used and understood in genomics

**Cons:**
- Not ideal for comparing between samples if total RNA content differs
- Can be misleading when comparing samples with very different compositions

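
The RPKM formula above can be checked by hand. The following is a minimal sketch of the arithmetic only, not of how deepTools computes it internally; the read count, bin width, and library size are hypothetical values chosen so the result matches the "RPKM of 10" interpretation.

```bash
# Hypothetical numbers, chosen only to illustrate the formula.
reads_in_bin=50          # reads overlapping one region
bin_length_bp=500        # region length in base pairs
total_mapped=10000000    # total mapped reads in the sample

# RPKM = reads / (length in kb * mapped reads in millions)
rpkm=$(awk -v r="$reads_in_bin" -v l="$bin_length_bp" -v t="$total_mapped" \
    'BEGIN { printf "%.2f", r / ((l / 1000) * (t / 1000000)) }')

echo "RPKM = $rpkm"
```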
---

### 2. CPM (Counts Per Million mapped reads)

**Formula:** `(Number of reads) / (Total mapped reads in millions)`

**Also known as:** RPM (Reads Per Million)

**When to use:**
- Comparing the same genomic regions across different samples
- When region length is constant or not relevant
- ChIP-seq, ATAC-seq, DNase-seq analyses

**Available in:** `bamCoverage`, `bamCompare` (with `--scaleFactorsMethod None`)

**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing CPM
```

**Interpretation:** A CPM of 5 means that bin contains 5 reads for every million mapped reads in the sample.

**Pros:**
- Simple and intuitive
- Good for comparing samples with different sequencing depths
- Appropriate when comparing fixed-size bins

**Cons:**
- Does not account for region length
- Affected by highly abundant regions (e.g., rRNA in RNA-seq)

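
To see why CPM removes the effect of sequencing depth, consider two hypothetical samples with the same underlying signal, one sequenced twice as deeply as the other. This sketch applies the formula above to made-up counts; both samples should end up with the same CPM.

```bash
# Hypothetical counts: sample B is sample A sequenced at 2x depth.
reads_a=50;  total_a=10000000
reads_b=100; total_b=20000000

# CPM = reads in bin / mapped reads in millions
cpm() {
    awk -v r="$1" -v t="$2" 'BEGIN { printf "%.2f", r / (t / 1000000) }'
}

cpm_a=$(cpm "$reads_a" "$total_a")
cpm_b=$(cpm "$reads_b" "$total_b")

echo "CPM sample A = $cpm_a"
echo "CPM sample B = $cpm_b"
```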
---

### 3. BPM (Bins Per Million mapped reads)

**Formula:** `(Number of reads in bin) / (Sum of all reads in bins in millions)`

**Key difference from CPM:** The denominator is the sum of reads over the analyzed bins rather than all mapped reads. The deepTools documentation describes BPM as the per-bin analogue of TPM in RNA-seq.

**When to use:**
- Similar to CPM, but when you want to exclude reads outside analyzed regions
- Comparing specific genomic regions while ignoring background

**Available in:** `bamCoverage`, `bamCompare` (with `--scaleFactorsMethod None`)

**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing BPM
```

**Interpretation:** A BPM of 5 means that bin holds 5 of every million reads counted across all analyzed bins.

**Pros:**
- Focuses normalization on analyzed regions
- Less affected by reads in unanalyzed areas

**Cons:**
- Less commonly used, so values may be harder to compare with published data

---

### 4. RPGC (Reads Per Genomic Content)

**Formula:** `(Number of reads per bin) / (Scaling factor for 1× average coverage)`

**Scaling factor:** `(Total mapped reads × Fragment length) / Effective genome size`, which is the sample's average genome-wide coverage. Dividing by it rescales the track to 1× average coverage.

**When to use:**
- When you want comparable coverage values across samples
- When you need coverage values that are interpretable relative to 1× genome coverage
- Comparing samples with very different total read counts
- ChIP-seq with spike-in normalization context

**Available in:** `bamCoverage`, `bamCompare` (with `--scaleFactorsMethod None`)

**Requires:** `--effectiveGenomeSize` parameter

**Example:**
```bash
# 2913022398 is the effective genome size for GRCh38
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398
```

**Interpretation:** The signal value approximates coverage relative to the genome-wide average (e.g., a value of 2 ≈ twice the 1×-normalized coverage).

**Pros:**
- Produces 1× normalized coverage
- Interpretable in terms of genomic coverage
- Good for comparing samples with different sequencing depths

**Cons:**
- Requires knowing the effective genome size
- Assumes uniform coverage (not true for ChIP-seq with peaks)

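
The RPGC scaling can be worked through with plain arithmetic. This is a rough sketch that assumes the scaling factor is `(total mapped reads × fragment length) / effective genome size`, as described in the deepTools documentation; the read total, fragment length, and per-bin coverage are hypothetical, and only the GRCh38 effective genome size comes from the examples in this document.

```bash
# Hypothetical sample values; only the genome size is taken from this document.
total_mapped=20000000        # total mapped reads
fragment_length=200          # assumed average fragment length in bp
effective_genome_size=2913022398   # GRCh38
raw_coverage=3               # unnormalized coverage in one bin

# Average genome-wide coverage of the sample (the 1x scaling factor)
scale=$(awk -v n="$total_mapped" -v f="$fragment_length" -v g="$effective_genome_size" \
    'BEGIN { printf "%.4f", (n * f) / g }')

# RPGC for the bin: raw coverage divided by the scaling factor
rpgc=$(awk -v c="$raw_coverage" -v s="$scale" 'BEGIN { printf "%.2f", c / s }')

echo "scaling factor = $scale"
echo "RPGC = $rpgc"
```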
---

### 5. None (No Normalization)

**Formula:** Raw read counts

**When to use:**
- Preliminary analysis
- When samples have identical library sizes (rare)
- When a downstream tool will perform normalization
- Debugging or quality control

**Available in:** `bamCoverage`, `bamCompare` (the default value of `--normalizeUsing`)

**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing None
```

**Interpretation:** Raw read counts per bin.

**Pros:**
- No assumptions made
- Useful for seeing raw data
- Fastest computation

**Cons:**
- Cannot fairly compare samples with different sequencing depths
- Not suitable for publication figures that compare samples

---

### 6. SES (Signal Extraction Scaling)

**Method:** Signal Extraction Scaling estimates the scaling factor from background (non-enriched) bins rather than from total read counts, which makes it a more sophisticated way to compare ChIP against control.

**When to use:**
- ChIP-seq analysis with `bamCompare`
- When you want background-aware scaling between ChIP and input
- As an alternative to simple `readCount` scaling

**Available in:** `bamCompare` only

**Example:**
```bash
bamCompare -b1 chip.bam -b2 input.bam -o output.bw \
    --scaleFactorsMethod SES
```

**Note:** SES is specifically designed for ChIP-seq data and may work better than simple read count scaling for noisy data.

---

### 7. readCount (Read Count Scaling)

**Method:** Scale by the ratio of total mapped read counts between samples

**When to use:**
- Default for `bamCompare`
- Compensating for sequencing depth differences in comparisons
- When you trust that total read counts reflect library size

**Available in:** `bamCompare`

**Example:**
```bash
bamCompare -b1 treatment.bam -b2 control.bam -o output.bw \
    --scaleFactorsMethod readCount
```

**How it works:** The sample with more reads is scaled down to match the smaller one. If sample1 has 100M reads and sample2 has 50M reads, sample1 is scaled by 0.5× and sample2 is left unchanged before comparison.

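
The read-count scaling can be sketched as follows. This assumes the convention that each sample's factor is the smaller read total divided by that sample's own total, so the larger library is scaled down; the read totals are the hypothetical 100M and 50M values from the description above.

```bash
# Hypothetical mapped read totals for the two BAM files.
reads_sample1=100000000
reads_sample2=50000000

# Smaller of the two totals
min_reads=$(( reads_sample1 < reads_sample2 ? reads_sample1 : reads_sample2 ))

# Assumed convention: factor = min(total reads) / this sample's total reads
scale1=$(awk -v m="$min_reads" -v n="$reads_sample1" 'BEGIN { printf "%.2f", m / n }')
scale2=$(awk -v m="$min_reads" -v n="$reads_sample2" 'BEGIN { printf "%.2f", m / n }')

echo "sample1 scale factor = $scale1"
echo "sample2 scale factor = $scale2"
```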
---

## Normalization Method Selection Guide

### For ChIP-seq Coverage Tracks

**Recommended:** RPGC or CPM

```bash
bamCoverage --bam chip.bam --outFileName chip.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --extendReads 200 \
    --ignoreDuplicates
```

**Reasoning:** Both methods account for sequencing depth differences; RPGC additionally provides interpretable coverage values.

---

### For ChIP-seq Comparisons (Treatment vs Control)

**Recommended:** log2 ratio with readCount or SES scaling

```bash
bamCompare -b1 chip.bam -b2 input.bam -o ratio.bw \
    --operation log2 \
    --scaleFactorsMethod readCount \
    --extendReads 200 \
    --ignoreDuplicates
```

**Reasoning:** The log2 ratio shows enrichment (positive) and depletion (negative); readCount adjusts for depth.

---

### For RNA-seq Coverage Tracks

**Recommended:** CPM or RPKM

```bash
# Strand-specific forward
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
    --normalizeUsing CPM \
    --filterRNAstrand forward

# RPKM additionally divides by bin length in kb
# (bamCoverage works per bin, not per gene)
bamCoverage --bam rnaseq.bam --outFileName output.bw \
    --normalizeUsing RPKM
```

**Reasoning:** CPM suits comparisons of fixed-width bins. RPKM adds a per-kilobase scaling, which in `bamCoverage` reflects bin length rather than gene length; true gene-level RPKM requires counting reads per gene with a quantification tool.

---

### For ATAC-seq

**Recommended:** RPGC or CPM

```bash
bamCoverage --bam atac_shifted.bam --outFileName atac.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398
```

**Reasoning:** Similar to ChIP-seq; the goal is comparable coverage across samples.

---

### For Sample Correlation Analysis

**Recommended:** CPM or RPGC

```bash
multiBamSummary bins \
    --bamfiles sample1.bam sample2.bam sample3.bam \
    -o readCounts.npz

plotCorrelation -in readCounts.npz \
    --corMethod pearson \
    --whatToPlot heatmap \
    -o correlation.png
```

**Note:** `multiBamSummary` does not explicitly normalize, but correlation coefficients are unaffected by multiplying a sample by a constant factor. For very different library sizes, consider normalizing BAM files first or using CPM-normalized bigWig files with `multiBigwigSummary`.

---

## Advanced Normalization Considerations

### Spike-in Normalization

For experiments with spike-in controls (e.g., *Drosophila* chromatin spike-in for ChIP-seq):

1. Calculate scaling factors from spike-in reads
2. Apply custom scaling factors using the `--scaleFactor` parameter

```bash
# Spike-in factor calculated beforehand (example: 0.8)
SCALE_FACTOR=0.8

bamCoverage --bam chip.bam --outFileName chip_spikenorm.bw \
    --scaleFactor "${SCALE_FACTOR}" \
    --extendReads 200
```

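
Step 1 above can be sketched with simple arithmetic. This example assumes one common convention, in which each sample is scaled by the smallest spike-in read count divided by its own spike-in read count; other conventions exist (for example, scaling to a fixed number of spike-in reads), and the counts below are hypothetical.

```bash
# Hypothetical numbers of reads mapped to the spike-in genome per sample.
spike_a=1000000
spike_b=1250000

# Smallest spike-in count serves as the reference
min_spike=$(( spike_a < spike_b ? spike_a : spike_b ))

# Assumed convention: factor = min(spike-in reads) / this sample's spike-in reads
scale_a=$(awk -v m="$min_spike" -v s="$spike_a" 'BEGIN { printf "%.2f", m / s }')
scale_b=$(awk -v m="$min_spike" -v s="$spike_b" 'BEGIN { printf "%.2f", m / s }')

echo "sample A --scaleFactor $scale_a"
echo "sample B --scaleFactor $scale_b"
```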
---

### Manual Scaling Factors

You can apply custom scaling factors:

```bash
# Apply 2× scaling
bamCoverage --bam input.bam --outFileName output.bw \
    --scaleFactor 2.0
```

---

### Chromosome Exclusion

Exclude specific chromosomes from normalization calculations:

```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --ignoreForNormalization chrX chrY chrM
```

**When to use:** Sex chromosomes in mixed-sex samples, mitochondrial DNA, or chromosomes with unusual coverage.

---

## Common Pitfalls

### 1. Using RPKM for bin-based data
- **Problem:** RPKM accounts for region length, but all bins are the same size, so the length term adds nothing
- **Solution:** Use CPM or RPGC instead

### 2. Comparing unnormalized samples
- **Problem:** A sample with 2× sequencing depth appears to have 2× signal
- **Solution:** Always normalize when comparing samples

### 3. Wrong effective genome size
- **Problem:** Using the hg19 genome size for hg38 data
- **Solution:** Double-check the genome assembly and use the matching size

### 4. Ignoring duplicates after GC correction
- **Problem:** Removing duplicates from a GC-corrected BAM file can introduce bias
- **Solution:** Never use `--ignoreDuplicates` after `correctGCBias`

### 5. Using RPGC without effective genome size
- **Problem:** The command fails
- **Solution:** Always specify `--effectiveGenomeSize` with RPGC

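
One way to guard against pitfall 3 is to look up the effective genome size by assembly name instead of typing the number by hand. The helper below is a hypothetical convenience function, not part of deepTools; the three values are the ones I believe the deepTools documentation lists for these assemblies, so verify them against the documentation for your deepTools version and read length before relying on them.

```bash
# Hypothetical helper: map an assembly name to its effective genome size.
# Values should be verified against the deepTools documentation.
effective_genome_size() {
    case "$1" in
        hg19|GRCh37) echo 2864785220 ;;
        hg38|GRCh38) echo 2913022398 ;;
        mm10|GRCm38) echo 2652783500 ;;
        *) echo "unknown assembly: $1" >&2; return 1 ;;
    esac
}

# Build the option string for an hg38 sample
egs=$(effective_genome_size hg38)
echo "--effectiveGenomeSize $egs"
```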
---

## Normalization for Different Comparisons

### Within-sample comparisons (different regions)
**Use:** RPKM (accounts for region length)

### Between-sample comparisons (same regions)
**Use:** CPM, RPGC, or BPM (accounts for library size)

### Treatment vs Control
**Use:** `bamCompare` with log2 ratio and readCount/SES scaling

### Multiple samples correlation
**Use:** CPM or RPGC normalized bigWig files, then `multiBigwigSummary`

---

## Quick Reference Table

| Method | Accounts for Depth | Accounts for Length | Best For | Command |
|--------|-------------------|---------------------|----------|---------|
| RPKM | ✓ | ✓ | RNA-seq genes | `--normalizeUsing RPKM` |
| CPM | ✓ | ✗ | Fixed-size bins | `--normalizeUsing CPM` |
| BPM | ✓ | ✗ | Specific regions | `--normalizeUsing BPM` |
| RPGC | ✓ | ✗ | Interpretable coverage | `--normalizeUsing RPGC --effectiveGenomeSize X` |
| None | ✗ | ✗ | Raw data | `--normalizeUsing None` |
| SES | ✓ | ✗ | ChIP comparisons | `bamCompare --scaleFactorsMethod SES` |
| readCount | ✓ | ✗ | ChIP comparisons | `bamCompare --scaleFactorsMethod readCount` |

---

## Further Reading

For more details on normalization theory and best practices:
- deepTools documentation: https://deeptools.readthedocs.io/
- ENCODE guidelines for ChIP-seq analysis
- RNA-seq normalization papers (DESeq2, TMM methods)