# Effective Genome Sizes

## Definition

Effective genome size is the length of the "mappable" genome: the regions to which sequencing reads can be uniquely mapped. Several deepTools commands need this value to normalize coverage correctly.

## Why It Matters

- Required for RPGC normalization (`--normalizeUsing RPGC`)
- Affects accuracy of coverage calculations
- Must match your data processing approach (filtered vs unfiltered reads)

## Calculation Methods

1. **Non-N bases**: Count of non-N nucleotides in genome sequence
2. **Unique mappability**: Regions of specific size that can be uniquely mapped (may consider edit distance)

## Common Organism Values

### Using Non-N Bases Method

| Organism | Assembly | Effective Size (bp) | Full Command |
|----------|----------|---------------------|--------------|
| Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` |
| Human | GRCh37/hg19 | 2,864,785,220 | `--effectiveGenomeSize 2864785220` |
| Mouse | GRCm39/mm39 | 2,654,621,837 | `--effectiveGenomeSize 2654621837` |
| Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` |
| Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` |
| *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` |
| *C. elegans* | WBcel235/ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` |
| *C. elegans* | ce10 | 100,258,171 | `--effectiveGenomeSize 100258171` |

### Human (GRCh38) by Read Length

When reads are quality-filtered or multimapping reads are removed, the uniquely mappable size depends on read length:

| Read Length | Effective Size (bp) |
|-------------|---------------------|
| 50 bp | ~2.7 billion |
| 75 bp | ~2.8 billion |
| 100 bp | ~2.8 billion |
| 150 bp | ~2.9 billion |
| 250 bp | ~2.9 billion |

### Mouse (GRCm38) by Read Length

| Read Length | Effective Size (bp) |
|-------------|---------------------|
| 50 bp | ~2.3 billion |
| 75 bp | ~2.4 billion |
| 100 bp | ~2.5 billion |

## Usage in deepTools

The effective genome size is most commonly used with:

### bamCoverage with RPGC normalization

```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398
```
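
To make the role of the effective genome size concrete: the deepTools documentation describes RPGC as scaling coverage to 1x, where the scale factor is the inverse of the sample's sequencing depth, (mapped reads × fragment length) / effective genome size. The sketch below only reproduces that arithmetic. The function name `rpgc_scale_factor`, the read count, and the fragment length are illustrative assumptions, not values from a real dataset.

```bash
# Sketch of the 1x (RPGC) scale factor arithmetic.
# rpgc_scale_factor is a hypothetical helper, not part of deepTools.
rpgc_scale_factor() {
    # $1 = effective genome size (bp)
    # $2 = number of mapped reads
    # $3 = fragment (or read) length in bp
    awk -v egs="$1" -v reads="$2" -v frag="$3" \
        'BEGIN { printf "%.4f\n", egs / (reads * frag) }'
}

# Hypothetical sample: 20 million mapped reads, 200 bp fragments, GRCh38
rpgc_scale_factor 2913022398 20000000 200   # prints 0.7283
```

A factor below 1 means the sample is sequenced deeper than 1x, so its coverage is scaled down. Passing a larger effective genome size raises the factor proportionally, which is why the value should match how the reads were filtered.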

### bamCompare with RPGC normalization

```bash
# --normalizeUsing only takes effect when --scaleFactorsMethod is None
bamCompare -b1 treatment.bam -b2 control.bam \
    --outFileName comparison.bw \
    --scaleFactorsMethod None \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398
```

### computeGCBias / correctGCBias

```bash
# correctGCBias takes the same --effectiveGenomeSize value
computeGCBias --bamfile input.bam \
    --effectiveGenomeSize 2913022398 \
    --genome genome.2bit \
    --fragmentLength 200 \
    --GCbiasFrequenciesFile freq.txt \
    --biasPlot bias.png
```

## Choosing the Right Value

**For most analyses:** Use the non-N bases method value for your reference genome

**For filtered data:** If you apply strict quality filters or remove multimapping reads, consider using the read-length-specific values

**When unsure:** Use the conservative non-N bases value; it is more widely applicable

## Common Shortcuts

These shorthand labels are a convenient way to refer to the non-N values above. `--effectiveGenomeSize` is documented as taking an integer, so pass the number rather than the label:

- `hs` or `GRCh38`: 2913022398
- `mm` or `GRCm38`: 2652783500
- `dm` or `dm6`: 142573017
- `ce` or `ce11`: 100286401

Check the documentation for your deepTools version before relying on any shorthand.
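
If you prefer to write assembly names in your scripts, a small lookup can translate them to the integer values listed in this document. This is a minimal sketch: `egs_for` is a hypothetical helper, not a deepTools command, and it covers only the non-N values from the table above.

```bash
# Hypothetical helper: map an assembly label to its non-N effective genome size.
# Values are copied from the non-N bases table in this document.
egs_for() {
    case "$1" in
        GRCh38|hg38)   echo 2913022398 ;;
        GRCh37|hg19)   echo 2864785220 ;;
        GRCm39|mm39)   echo 2654621837 ;;
        GRCm38|mm10)   echo 2652783500 ;;
        GRCz11)        echo 1368780147 ;;
        dm6)           echo 142573017 ;;
        WBcel235|ce11) echo 100286401 ;;
        ce10)          echo 100258171 ;;
        *) echo "unknown assembly: $1" >&2; return 1 ;;
    esac
}

egs_for mm10   # prints 2652783500
```

It can then be used inline, for example `--effectiveGenomeSize "$(egs_for GRCh38)"`. Failing loudly on an unknown label is deliberate, so a typo does not silently pass an empty value to deepTools.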

## Calculating Custom Values

For custom genomes or assemblies, calculate the non-N bases count:

```bash
# Using faCount (UCSC tools): total length ($2) minus N count ($7)
faCount genome.fa | grep "total" | awk '{print $2-$7}'

# Using seqtk: sum the A, C, G and T columns ($3-$6); $2 is the total length, Ns included
seqtk comp genome.fa | awk '{x+=$3+$4+$5+$6}END{print x}'
```
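
If neither tool is installed, the same count can be approximated with plain `awk`. The sketch below is an assumption-laden fallback rather than a deepTools-endorsed method: `count_non_n` is a hypothetical helper that counts only A, C, G and T (either case), so IUPAC ambiguity codes are excluded along with N. The demo FASTA is made up for illustration.

```bash
# Hypothetical fallback: count A/C/G/T bases in a FASTA file with awk only.
count_non_n() {
    # Skip header lines; gsub returns how many characters it matched on each line.
    awk '!/^>/ { total += gsub(/[ACGTacgt]/, "") } END { print total + 0 }' "$1"
}

# Tiny made-up FASTA: 12 bases, 4 of them N
demo=$(mktemp)
printf '>chr1\nACGTNNNN\nacgt\n' > "$demo"
count_non_n "$demo"   # prints 8
rm -f "$demo"
```

Soft-masked (lowercase) sequence is counted here, which matches the non-N definition. For large genomes `faCount` or `seqtk` will be considerably faster.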

## References

For the most up-to-date effective genome sizes and detailed calculation methods, see:

- deepTools documentation: https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
- ENCODE documentation for reference genome details