Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

View File

@@ -0,0 +1,116 @@
# Effective Genome Sizes
## Definition
Effective genome size refers to the length of the "mappable" genome - regions that can be uniquely mapped by sequencing reads. This metric is crucial for proper normalization in many deepTools commands.
## Why It Matters
- Required for RPGC normalization (`--normalizeUsing RPGC`)
- Affects accuracy of coverage calculations
- Must match your data processing approach (filtered vs unfiltered reads)
## Calculation Methods
1. **Non-N bases**: Count of non-N nucleotides in genome sequence
2. **Unique mappability**: Regions of specific size that can be uniquely mapped (may consider edit distance)
## Common Organism Values
### Using Non-N Bases Method
| Organism | Assembly | Effective Size | Full Command |
|----------|----------|----------------|--------------|
| Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` |
| Human | GRCh37/hg19 | 2,864,785,220 | `--effectiveGenomeSize 2864785220` |
| Mouse | GRCm39/mm39 | 2,654,621,837 | `--effectiveGenomeSize 2654621837` |
| Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` |
| Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` |
| *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` |
| *C. elegans* | WBcel235/ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` |
| *C. elegans* | ce10 | 100,258,171 | `--effectiveGenomeSize 100258171` |
### Human (GRCh38) by Read Length
For quality-filtered reads, values vary by read length:
| Read Length | Effective Size |
|-------------|----------------|
| 50bp | ~2.7 billion |
| 75bp | ~2.8 billion |
| 100bp | ~2.8 billion |
| 150bp | ~2.9 billion |
| 250bp | ~2.9 billion |
### Mouse (GRCm38) by Read Length
| Read Length | Effective Size |
|-------------|----------------|
| 50bp | ~2.3 billion |
| 75bp | ~2.5 billion |
| 100bp | ~2.6 billion |
## Usage in deepTools
The effective genome size is most commonly used with:
### bamCoverage with RPGC normalization
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398
```
### bamCompare with RPGC normalization
```bash
bamCompare -b1 treatment.bam -b2 control.bam \
--outFileName comparison.bw \
--scaleFactorsMethod RPGC \
--effectiveGenomeSize 2913022398
```
### computeGCBias / correctGCBias
```bash
computeGCBias --bamfile input.bam \
--effectiveGenomeSize 2913022398 \
--genome genome.2bit \
--fragmentLength 200 \
--biasPlot bias.png
```
## Choosing the Right Value
**For most analyses:** Use the non-N bases method value for your reference genome
**For filtered data:** If you apply strict quality filters or remove multimapping reads, consider using the read-length-specific values
**When unsure:** Use the conservative non-N bases value - it's more widely applicable
## Common Shortcuts
deepTools also accepts these shorthand values in some contexts:
- `hs` or `GRCh38`: 2913022398
- `mm` or `GRCm38`: 2652783500
- `dm` or `dm6`: 142573017
- `ce` or `ce10`: 100286401
Check your specific deepTools version documentation for supported shortcuts.
## Calculating Custom Values
For custom genomes or assemblies, calculate the non-N bases count:
```bash
# Using faCount (UCSC tools)
faCount genome.fa | grep "total" | awk '{print $2-$7}'
# Using seqtk
seqtk comp genome.fa | awk '{x+=$2}END{print x}'
```
## References
For the most up-to-date effective genome sizes and detailed calculation methods, see:
- deepTools documentation: https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
- ENCODE documentation for reference genome details