Initial commit
525
skills/deeptools/SKILL.md
Normal file
@@ -0,0 +1,525 @@
|
||||
---
name: deeptools
description: "NGS analysis toolkit. BAM to bigWig conversion, QC (correlation, PCA, fingerprints), heatmaps/profiles (TSS, peaks), for ChIP-seq, RNA-seq, ATAC-seq visualization."
---
|
||||
|
||||
# deepTools: NGS Data Analysis Toolkit
|
||||
|
||||
## Overview
|
||||
|
||||
deepTools is a comprehensive suite of Python command-line tools designed for processing and analyzing high-throughput sequencing data. Use deepTools to perform quality control, normalize data, compare samples, and generate publication-quality visualizations for ChIP-seq, RNA-seq, ATAC-seq, MNase-seq, and other NGS experiments.
|
||||
|
||||
**Core capabilities:**
|
||||
- Convert BAM alignments to normalized coverage tracks (bigWig/bedGraph)
|
||||
- Quality control assessment (fingerprint, correlation, coverage)
|
||||
- Sample comparison and correlation analysis
|
||||
- Heatmap and profile plot generation around genomic features
|
||||
- Enrichment analysis and peak region visualization
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
This skill should be used when:
|
||||
|
||||
- **File conversion**: "Convert BAM to bigWig", "generate coverage tracks", "normalize ChIP-seq data"
|
||||
- **Quality control**: "check ChIP quality", "compare replicates", "assess sequencing depth", "QC analysis"
|
||||
- **Visualization**: "create heatmap around TSS", "plot ChIP signal", "visualize enrichment", "generate profile plot"
|
||||
- **Sample comparison**: "compare treatment vs control", "correlate samples", "PCA analysis"
|
||||
- **Analysis workflows**: "analyze ChIP-seq data", "RNA-seq coverage", "ATAC-seq analysis", "complete workflow"
|
||||
- **Working with specific file types**: BAM files, bigWig files, BED region files in genomics context
|
||||
|
||||
## Quick Start
|
||||
|
||||
For users new to deepTools, start with file validation and common workflows:
|
||||
|
||||
### 1. Validate Input Files
|
||||
|
||||
Before running any analysis, validate BAM, bigWig, and BED files using the validation script:
|
||||
|
||||
```bash
python scripts/validate_files.py --bam sample1.bam sample2.bam --bed regions.bed
```
|
||||
|
||||
This checks file existence, BAM indices, and format correctness.
|
||||
|
||||
### 2. Generate Workflow Template
|
||||
|
||||
For standard analyses, use the workflow generator to create customized scripts:
|
||||
|
||||
```bash
# List available workflows
python scripts/workflow_generator.py --list

# Generate ChIP-seq QC workflow
python scripts/workflow_generator.py chipseq_qc -o qc_workflow.sh \
    --input-bam Input.bam --chip-bams "ChIP1.bam ChIP2.bam" \
    --genome-size 2913022398

# Make executable and run
chmod +x qc_workflow.sh
./qc_workflow.sh
```
|
||||
|
||||
### 3. Most Common Operations
|
||||
|
||||
See `assets/quick_reference.md` for frequently used commands and parameters.
|
||||
|
||||
## Installation
|
||||
|
||||
```bash
uv pip install deeptools
```
|
||||
|
||||
## Core Workflows
|
||||
|
||||
deepTools workflows typically follow this pattern: **QC → Normalization → Comparison/Visualization**
|
||||
|
||||
### ChIP-seq Quality Control Workflow
|
||||
|
||||
When users request ChIP-seq QC or quality assessment:
|
||||
|
||||
1. **Generate workflow script** using `scripts/workflow_generator.py chipseq_qc`
2. **Key QC steps**:
   - Sample correlation (multiBamSummary + plotCorrelation)
   - PCA analysis (plotPCA)
   - Coverage assessment (plotCoverage)
   - Fragment size validation (bamPEFragmentSize)
   - ChIP enrichment strength (plotFingerprint)
|
||||
|
||||
**Interpreting results:**
|
||||
- **Correlation**: Replicates should cluster together with high correlation (>0.9)
|
||||
- **Fingerprint**: A strong ChIP curve stays low for most bins and rises steeply at the far right; a curve that follows the diagonal indicates weak or no enrichment
|
||||
- **Coverage**: Assess if sequencing depth is adequate for analysis
|
||||
|
||||
Full workflow details in `references/workflows.md` → "ChIP-seq Quality Control Workflow"
|
||||
|
||||
### ChIP-seq Complete Analysis Workflow
|
||||
|
||||
For full ChIP-seq analysis from BAM to visualizations:
|
||||
|
||||
1. **Generate coverage tracks** with normalization (bamCoverage)
|
||||
2. **Create comparison tracks** (bamCompare for log2 ratio)
|
||||
3. **Compute signal matrices** around features (computeMatrix)
|
||||
4. **Generate visualizations** (plotHeatmap, plotProfile)
|
||||
5. **Enrichment analysis** at peaks (plotEnrichment)
|
||||
|
||||
Use `scripts/workflow_generator.py chipseq_analysis` to generate template.
|
||||
|
||||
Complete command sequences in `references/workflows.md` → "ChIP-seq Analysis Workflow"
|
||||
|
||||
### RNA-seq Coverage Workflow
|
||||
|
||||
For strand-specific RNA-seq coverage tracks:
|
||||
|
||||
Use bamCoverage with `--filterRNAstrand` to separate forward and reverse strands.
|
||||
|
||||
**Important:** NEVER use `--extendReads` for RNA-seq (would extend over splice junctions).
|
||||
|
||||
Use normalization: CPM for fixed bins, RPKM for gene-level analysis.
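
As a minimal sketch of the strand split described above (not the output of `workflow_generator.py`), the function below writes one CPM-normalized bigWig per strand. The function name `strand_tracks`, the `DRY_RUN` switch, and the file names are hypothetical; `--filterRNAstrand`, `--normalizeUsing`, and `--binSize` are bamCoverage options already used in this document. Which strand `forward` selects depends on the library protocol, so it is worth checking a gene of known orientation.

```bash
#!/usr/bin/env bash
# Sketch only: one bigWig per strand from a stranded RNA-seq BAM.
# Usage: strand_tracks BAM OUTPUT_PREFIX
strand_tracks() {
    local bam="$1" prefix="$2" strand
    for strand in forward reverse; do
        # With DRY_RUN set, print the command instead of running it.
        # No --extendReads here: extension would span splice junctions.
        ${DRY_RUN:+echo} bamCoverage --bam "$bam" \
            --outFileName "${prefix}.${strand}.bw" \
            --filterRNAstrand "$strand" \
            --normalizeUsing CPM --binSize 10 || return 1
    done
}

# Preview the two commands without needing deepTools installed:
DRY_RUN=1 strand_tracks rnaseq.bam sample
```

Running the function without `DRY_RUN` executes the two bamCoverage calls for real.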
|
||||
|
||||
Template available: `scripts/workflow_generator.py rnaseq_coverage`
|
||||
|
||||
Details in `references/workflows.md` → "RNA-seq Coverage Workflow"
|
||||
|
||||
### ATAC-seq Analysis Workflow
|
||||
|
||||
ATAC-seq requires Tn5 offset correction:
|
||||
|
||||
1. **Shift reads** using alignmentSieve with `--ATACshift`
|
||||
2. **Generate coverage** with bamCoverage
|
||||
3. **Analyze fragment sizes** (expect nucleosome ladder pattern)
|
||||
4. **Visualize at peaks** if available
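
The first three steps can be sketched as a single shell function. This is an illustration under assumptions, not the template produced by `workflow_generator.py`: the function name `atac_tracks`, the `DRY_RUN` switch, the file names, and the choice of CPM with a 10 bp bin are hypothetical. The re-sort and index step is included because shifting alignments can leave the BAM unsorted.

```bash
#!/usr/bin/env bash
# Sketch only: Tn5 shift, coverage track, and fragment-size plot for paired-end ATAC-seq.
# Usage: atac_tracks BAM OUTPUT_PREFIX [THREADS]
atac_tracks() {
    local bam="$1" prefix="$2" threads="${3:-8}"
    local run=${DRY_RUN:+echo}   # with DRY_RUN set, commands are printed, not run

    # 1. Shift alignments to account for the Tn5 insertion offset
    $run alignmentSieve --bam "$bam" --outFile "${prefix}.shifted.bam" \
        --ATACshift --numberOfProcessors "$threads" || return 1

    # Re-sort and index the shifted BAM before computing coverage
    $run samtools sort -@ "$threads" -o "${prefix}.shifted.sorted.bam" "${prefix}.shifted.bam" || return 1
    $run samtools index "${prefix}.shifted.sorted.bam" || return 1

    # 2. Coverage track from the shifted alignments
    $run bamCoverage --bam "${prefix}.shifted.sorted.bam" --outFileName "${prefix}.bw" \
        --normalizeUsing CPM --binSize 10 --numberOfProcessors "$threads" || return 1

    # 3. Fragment-size histogram (look for the nucleosome ladder)
    $run bamPEFragmentSize --bamfiles "$bam" --histogram "${prefix}.fragments.png" || return 1
}

# Preview the command sequence without running anything:
DRY_RUN=1 atac_tracks atac.bam sample 4
```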
|
||||
|
||||
Template: `scripts/workflow_generator.py atacseq`
|
||||
|
||||
Full workflow in `references/workflows.md` → "ATAC-seq Workflow"
|
||||
|
||||
## Tool Categories and Common Tasks
|
||||
|
||||
### BAM/bigWig Processing
|
||||
|
||||
**Convert BAM to normalized coverage:**
|
||||
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
    --binSize 10 --numberOfProcessors 8
```
|
||||
|
||||
**Compare two samples (log2 ratio):**
|
||||
```bash
bamCompare -b1 treatment.bam -b2 control.bam -o ratio.bw \
    --operation log2 --scaleFactorsMethod readCount
```
|
||||
|
||||
**Key tools:** bamCoverage, bamCompare, multiBamSummary, multiBigwigSummary, correctGCBias, alignmentSieve
|
||||
|
||||
Complete reference: `references/tools_reference.md` → "BAM and bigWig File Processing Tools"
|
||||
|
||||
### Quality Control
|
||||
|
||||
**Check ChIP enrichment:**
|
||||
```bash
plotFingerprint -b input.bam chip.bam -o fingerprint.png \
    --extendReads 200 --ignoreDuplicates
```
|
||||
|
||||
**Sample correlation:**
|
||||
```bash
multiBamSummary bins --bamfiles *.bam -o counts.npz
plotCorrelation -in counts.npz --corMethod pearson \
    --whatToShow heatmap -o correlation.png
```
|
||||
|
||||
**Key tools:** plotFingerprint, plotCoverage, plotCorrelation, plotPCA, bamPEFragmentSize
|
||||
|
||||
Complete reference: `references/tools_reference.md` → "Quality Control Tools"
|
||||
|
||||
### Visualization
|
||||
|
||||
**Create heatmap around TSS:**
|
||||
```bash
# Compute matrix
computeMatrix reference-point -S signal.bw -R genes.bed \
    -b 3000 -a 3000 --referencePoint TSS -o matrix.gz

# Generate heatmap
plotHeatmap -m matrix.gz -o heatmap.png \
    --colorMap RdBu --kmeans 3
```
|
||||
|
||||
**Create profile plot:**
|
||||
```bash
plotProfile -m matrix.gz -o profile.png \
    --plotType lines --colors blue red
```
|
||||
|
||||
**Key tools:** computeMatrix, plotHeatmap, plotProfile, plotEnrichment
|
||||
|
||||
Complete reference: `references/tools_reference.md` → "Visualization Tools"
|
||||
|
||||
## Normalization Methods
|
||||
|
||||
Choosing the correct normalization is critical for valid comparisons. Consult `references/normalization_methods.md` for comprehensive guidance.
|
||||
|
||||
**Quick selection guide:**
|
||||
|
||||
- **ChIP-seq coverage**: Use RPGC or CPM
|
||||
- **ChIP-seq comparison**: Use bamCompare with log2 and readCount
|
||||
- **RNA-seq bins**: Use CPM
|
||||
- **RNA-seq genes**: Use RPKM (accounts for gene length)
|
||||
- **ATAC-seq**: Use RPGC or CPM
|
||||
|
||||
**Normalization methods:**
|
||||
- **RPGC**: 1× genome coverage (requires --effectiveGenomeSize)
|
||||
- **CPM**: Counts per million mapped reads
|
||||
- **RPKM**: Reads per kb per million (accounts for region length)
|
||||
- **BPM**: Bins per million mapped reads (per-bin counts scaled so that all bins sum to one million)
|
||||
- **None**: Raw counts (not recommended for comparisons)
|
||||
|
||||
Full explanation: `references/normalization_methods.md`
|
||||
|
||||
## Effective Genome Sizes
|
||||
|
||||
RPGC normalization requires effective genome size. Common values:
|
||||
|
||||
| Organism | Assembly | Size | Usage |
|----------|----------|------|-------|
| Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` |
| Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` |
| Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` |
| *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` |
| *C. elegans* | WBcel235/ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` |
|
||||
|
||||
Complete table with read-length-specific values: `references/effective_genome_sizes.md`
|
||||
|
||||
## Common Parameters Across Tools
|
||||
|
||||
Many deepTools commands share these options:
|
||||
|
||||
**Performance:**
|
||||
- `--numberOfProcessors, -p`: Enable parallel processing (always use available cores)
|
||||
- `--region`: Process specific regions for testing (e.g., `chr1:1-1000000`)
|
||||
|
||||
**Read Filtering:**
|
||||
- `--ignoreDuplicates`: Remove PCR duplicates (recommended for most analyses)
|
||||
- `--minMappingQuality`: Filter by alignment quality (e.g., `--minMappingQuality 10`)
|
||||
- `--minFragmentLength` / `--maxFragmentLength`: Fragment length bounds
|
||||
- `--samFlagInclude` / `--samFlagExclude`: SAM flag filtering
|
||||
|
||||
**Read Processing:**
|
||||
- `--extendReads`: Extend to fragment length (ChIP-seq: YES, RNA-seq: NO)
|
||||
- `--centerReads`: Center at fragment midpoint for sharper signals
|
||||
|
||||
## Best Practices
|
||||
|
||||
### File Validation
|
||||
**Always validate files first** using `scripts/validate_files.py` to check:
|
||||
- File existence and readability
|
||||
- BAM indices present (.bai files)
|
||||
- BED format correctness
|
||||
- File sizes reasonable
|
||||
|
||||
### Analysis Strategy
|
||||
|
||||
1. **Start with QC**: Run correlation, coverage, and fingerprint analysis before proceeding
|
||||
2. **Test on small regions**: Use `--region chr1:1-10000000` for parameter testing
|
||||
3. **Document commands**: Save full command lines for reproducibility
|
||||
4. **Use consistent normalization**: Apply same method across samples in comparisons
|
||||
5. **Verify genome assembly**: Ensure BAM and BED files use matching genome builds
|
||||
|
||||
### ChIP-seq Specific
|
||||
|
||||
- **Always extend reads** for ChIP-seq: `--extendReads 200`
|
||||
- **Remove duplicates**: Use `--ignoreDuplicates` in most cases
|
||||
- **Check enrichment first**: Run plotFingerprint before detailed analysis
|
||||
- **GC correction**: Only apply if significant bias detected; never use `--ignoreDuplicates` after GC correction
|
||||
|
||||
### RNA-seq Specific
|
||||
|
||||
- **Never extend reads** for RNA-seq (would span splice junctions)
|
||||
- **Strand-specific**: Use `--filterRNAstrand forward/reverse` for stranded libraries
|
||||
- **Normalization**: CPM for bins, RPKM for genes
|
||||
|
||||
### ATAC-seq Specific
|
||||
|
||||
- **Apply Tn5 correction**: Use alignmentSieve with `--ATACshift`
|
||||
- **Fragment filtering**: Set appropriate min/max fragment lengths
|
||||
- **Check nucleosome pattern**: Fragment size plot should show ladder pattern
|
||||
|
||||
### Performance Optimization
|
||||
|
||||
1. **Use multiple processors**: `--numberOfProcessors 8` (or available cores)
|
||||
2. **Increase bin size** for faster processing and smaller files
|
||||
3. **Process chromosomes separately** for memory-limited systems
|
||||
4. **Pre-filter BAM files** using alignmentSieve to create reusable filtered files
|
||||
5. **Use bigWig over bedGraph**: Compressed and faster to process
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
### Common Issues
|
||||
|
||||
**BAM index missing:**
|
||||
```bash
samtools index input.bam
```
|
||||
|
||||
**Out of memory:**
|
||||
Process chromosomes individually using `--region`:
|
||||
```bash
bamCoverage --bam input.bam -o chr1.bw --region chr1
```
|
||||
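
One way to apply this across a whole genome is a loop that calls bamCoverage once per chromosome. This is a sketch: `per_chrom_coverage` and `DRY_RUN` are hypothetical names, and the chromosome list is read from standard input so that it can come from any source (for example, the first column of `samtools idxstats`).

```bash
#!/usr/bin/env bash
# Sketch only: one bigWig per chromosome to limit peak memory use.
# Usage: <chromosome names on stdin> | per_chrom_coverage BAM OUTPUT_PREFIX
per_chrom_coverage() {
    local bam="$1" prefix="$2" chrom
    while read -r chrom; do
        [ -n "$chrom" ] || continue   # skip blank lines
        # With DRY_RUN set, print the command instead of running it.
        # stdin is redirected so the command cannot consume the chromosome list.
        ${DRY_RUN:+echo} bamCoverage --bam "$bam" \
            --outFileName "${prefix}.${chrom}.bw" \
            --region "$chrom" < /dev/null || return 1
    done
}

# Real input could come from:
#   samtools idxstats input.bam | cut -f 1 | grep -v '^\*$'
printf 'chr1\nchr2\n' | DRY_RUN=1 per_chrom_coverage input.bam cov
```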
|
||||
**Slow processing:**
|
||||
Increase `--numberOfProcessors` and/or increase `--binSize`
|
||||
|
||||
**bigWig files too large:**
|
||||
Increase bin size: `--binSize 50` or larger
|
||||
|
||||
### Validation Errors
|
||||
|
||||
Run validation script to identify issues:
|
||||
```bash
python scripts/validate_files.py --bam *.bam --bed regions.bed
```
|
||||
|
||||
Common errors and their solutions are explained in the script output.
|
||||
|
||||
## Reference Documentation
|
||||
|
||||
This skill includes comprehensive reference documentation:
|
||||
|
||||
### references/tools_reference.md
|
||||
Complete documentation of all deepTools commands organized by category:
|
||||
- BAM and bigWig processing tools (9 tools)
|
||||
- Quality control tools (6 tools)
|
||||
- Visualization tools (3 tools)
|
||||
- Miscellaneous tools (2 tools)
|
||||
|
||||
Each tool includes:
|
||||
- Purpose and overview
|
||||
- Key parameters with explanations
|
||||
- Usage examples
|
||||
- Important notes and best practices
|
||||
|
||||
**Use this reference when:** Users ask about specific tools, parameters, or detailed usage.
|
||||
|
||||
### references/workflows.md
|
||||
Complete workflow examples for common analyses:
|
||||
- ChIP-seq quality control workflow
|
||||
- ChIP-seq complete analysis workflow
|
||||
- RNA-seq coverage workflow
|
||||
- ATAC-seq analysis workflow
|
||||
- Multi-sample comparison workflow
|
||||
- Peak region analysis workflow
|
||||
- Troubleshooting and performance tips
|
||||
|
||||
**Use this reference when:** Users need complete analysis pipelines or workflow examples.
|
||||
|
||||
### references/normalization_methods.md
|
||||
Comprehensive guide to normalization methods:
|
||||
- Detailed explanation of each method (RPGC, CPM, RPKM, BPM, etc.)
|
||||
- When to use each method
|
||||
- Formulas and interpretation
|
||||
- Selection guide by experiment type
|
||||
- Common pitfalls and solutions
|
||||
- Quick reference table
|
||||
|
||||
**Use this reference when:** Users ask about normalization, comparing samples, or which method to use.
|
||||
|
||||
### references/effective_genome_sizes.md
|
||||
Effective genome size values and usage:
|
||||
- Common organism values (human, mouse, fly, worm, zebrafish)
|
||||
- Read-length-specific values
|
||||
- Calculation methods
|
||||
- When and how to use in commands
|
||||
- Custom genome calculation instructions
|
||||
|
||||
**Use this reference when:** Users need genome size for RPGC normalization or GC bias correction.
|
||||
|
||||
## Helper Scripts
|
||||
|
||||
### scripts/validate_files.py
|
||||
|
||||
Validates BAM, bigWig, and BED files for deepTools analysis. Checks file existence, indices, and format.
|
||||
|
||||
**Usage:**
|
||||
```bash
python scripts/validate_files.py --bam sample1.bam sample2.bam \
    --bed peaks.bed --bigwig signal.bw
```
|
||||
|
||||
**When to use:** Before starting any analysis, or when troubleshooting errors.
|
||||
|
||||
### scripts/workflow_generator.py
|
||||
|
||||
Generates customizable bash script templates for common deepTools workflows.
|
||||
|
||||
**Available workflows:**
|
||||
- `chipseq_qc`: ChIP-seq quality control
|
||||
- `chipseq_analysis`: Complete ChIP-seq analysis
|
||||
- `rnaseq_coverage`: Strand-specific RNA-seq coverage
|
||||
- `atacseq`: ATAC-seq with Tn5 correction
|
||||
|
||||
**Usage:**
|
||||
```bash
# List workflows
python scripts/workflow_generator.py --list

# Generate workflow
python scripts/workflow_generator.py chipseq_qc -o qc.sh \
    --input-bam Input.bam --chip-bams "ChIP1.bam ChIP2.bam" \
    --genome-size 2913022398 --threads 8

# Run generated workflow
chmod +x qc.sh
./qc.sh
```
|
||||
|
||||
**When to use:** Users request standard workflows or need template scripts to customize.
|
||||
|
||||
## Assets
|
||||
|
||||
### assets/quick_reference.md
|
||||
|
||||
Quick reference card with most common commands, effective genome sizes, and typical workflow pattern.
|
||||
|
||||
**When to use:** Users need quick command examples without detailed documentation.
|
||||
|
||||
## Handling User Requests
|
||||
|
||||
### For New Users
|
||||
|
||||
1. Start with installation verification
|
||||
2. Validate input files using `scripts/validate_files.py`
|
||||
3. Recommend appropriate workflow based on experiment type
|
||||
4. Generate workflow template using `scripts/workflow_generator.py`
|
||||
5. Guide through customization and execution
|
||||
|
||||
### For Experienced Users
|
||||
|
||||
1. Provide specific tool commands for requested operations
|
||||
2. Reference appropriate sections in `references/tools_reference.md`
|
||||
3. Suggest optimizations and best practices
|
||||
4. Offer troubleshooting for issues
|
||||
|
||||
### For Specific Tasks
|
||||
|
||||
**"Convert BAM to bigWig":**
|
||||
- Use bamCoverage with appropriate normalization
|
||||
- Recommend RPGC or CPM based on use case
|
||||
- Provide effective genome size for organism
|
||||
- Suggest relevant parameters (extendReads, ignoreDuplicates, binSize)
|
||||
|
||||
**"Check ChIP quality":**
|
||||
- Run full QC workflow or use plotFingerprint specifically
|
||||
- Explain interpretation of results
|
||||
- Suggest follow-up actions based on results
|
||||
|
||||
**"Create heatmap":**
|
||||
- Guide through two-step process: computeMatrix → plotHeatmap
|
||||
- Help choose appropriate matrix mode (reference-point vs scale-regions)
|
||||
- Suggest visualization parameters and clustering options
|
||||
|
||||
**"Compare samples":**
|
||||
- Recommend bamCompare for two-sample comparison
|
||||
- Suggest multiBamSummary + plotCorrelation for multiple samples
|
||||
- Guide normalization method selection
|
||||
|
||||
### Referencing Documentation
|
||||
|
||||
When users need detailed information:
|
||||
- **Tool details**: Direct to specific sections in `references/tools_reference.md`
|
||||
- **Workflows**: Use `references/workflows.md` for complete analysis pipelines
|
||||
- **Normalization**: Consult `references/normalization_methods.md` for method selection
|
||||
- **Genome sizes**: Reference `references/effective_genome_sizes.md`
|
||||
|
||||
Search references using grep patterns:
|
||||
```bash
# Find tool documentation
grep -A 20 "^### toolname" references/tools_reference.md

# Find workflow
grep -A 50 "^## Workflow Name" references/workflows.md

# Find normalization method
grep -A 15 "^### Method Name" references/normalization_methods.md
```
|
||||
|
||||
## Example Interactions
|
||||
|
||||
**User: "I need to analyze my ChIP-seq data"**
|
||||
|
||||
Response approach:
|
||||
1. Ask about files available (BAM files, peaks, genes)
|
||||
2. Validate files using validation script
|
||||
3. Generate chipseq_analysis workflow template
|
||||
4. Customize for their specific files and organism
|
||||
5. Explain each step as the script runs
|
||||
|
||||
**User: "Which normalization should I use?"**
|
||||
|
||||
Response approach:
|
||||
1. Ask about experiment type (ChIP-seq, RNA-seq, etc.)
|
||||
2. Ask about comparison goal (within-sample or between-sample)
|
||||
3. Consult `references/normalization_methods.md` selection guide
|
||||
4. Recommend appropriate method with justification
|
||||
5. Provide command example with parameters
|
||||
|
||||
**User: "Create a heatmap around TSS"**
|
||||
|
||||
Response approach:
|
||||
1. Verify bigWig and gene BED files available
|
||||
2. Use computeMatrix with reference-point mode at TSS
|
||||
3. Generate plotHeatmap with appropriate visualization parameters
|
||||
4. Suggest clustering if dataset is large
|
||||
5. Offer profile plot as complement
|
||||
|
||||
## Key Reminders
|
||||
|
||||
- **File validation first**: Always validate input files before analysis
|
||||
- **Normalization matters**: Choose appropriate method for comparison type
|
||||
- **Extend reads carefully**: YES for ChIP-seq, NO for RNA-seq
|
||||
- **Use all cores**: Set `--numberOfProcessors` to available cores
|
||||
- **Test on regions**: Use `--region` for parameter testing
|
||||
- **Check QC first**: Run quality control before detailed analysis
|
||||
- **Document everything**: Save commands for reproducibility
|
||||
- **Reference documentation**: Use comprehensive references for detailed guidance
|
||||
58
skills/deeptools/assets/quick_reference.md
Normal file
@@ -0,0 +1,58 @@
|
||||
# deepTools Quick Reference
|
||||
|
||||
## Most Common Commands
|
||||
|
||||
### BAM to bigWig (normalized)
|
||||
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
    --binSize 10 --numberOfProcessors 8
```
|
||||
|
||||
### Compare two BAM files
|
||||
```bash
bamCompare -b1 treatment.bam -b2 control.bam -o ratio.bw \
    --operation log2 --scaleFactorsMethod readCount
```
|
||||
|
||||
### Correlation heatmap
|
||||
```bash
multiBamSummary bins --bamfiles *.bam -o counts.npz
plotCorrelation -in counts.npz --corMethod pearson \
    --whatToShow heatmap -o correlation.png
```
|
||||
|
||||
### Heatmap around TSS
|
||||
```bash
computeMatrix reference-point -S signal.bw -R genes.bed \
    -b 3000 -a 3000 --referencePoint TSS -o matrix.gz

plotHeatmap -m matrix.gz -o heatmap.png
```
|
||||
|
||||
### ChIP enrichment check
|
||||
```bash
plotFingerprint -b input.bam chip.bam -o fingerprint.png \
    --extendReads 200 --ignoreDuplicates
```
|
||||
|
||||
## Effective Genome Sizes
|
||||
|
||||
| Organism | Assembly | Size |
|----------|----------|------|
| Human | hg38 | 2913022398 |
| Mouse | mm10 | 2652783500 |
| Fly | dm6 | 142573017 |
|
||||
|
||||
## Common Normalization Methods
|
||||
|
||||
- **RPGC**: 1× genome coverage (requires --effectiveGenomeSize)
|
||||
- **CPM**: Counts per million (for fixed bins)
|
||||
- **RPKM**: Reads per kb per million (for genes)
|
||||
|
||||
## Typical Workflow
|
||||
|
||||
1. **QC**: plotFingerprint, plotCorrelation
|
||||
2. **Coverage**: bamCoverage with normalization
|
||||
3. **Comparison**: bamCompare for treatment vs control
|
||||
4. **Visualization**: computeMatrix → plotHeatmap/plotProfile
|
||||
116
skills/deeptools/references/effective_genome_sizes.md
Normal file
@@ -0,0 +1,116 @@
|
||||
# Effective Genome Sizes
|
||||
|
||||
## Definition
|
||||
|
||||
Effective genome size refers to the length of the "mappable" genome - regions that can be uniquely mapped by sequencing reads. This metric is crucial for proper normalization in many deepTools commands.
|
||||
|
||||
## Why It Matters
|
||||
|
||||
- Required for RPGC normalization (`--normalizeUsing RPGC`)
|
||||
- Affects accuracy of coverage calculations
|
||||
- Must match your data processing approach (filtered vs unfiltered reads)
|
||||
|
||||
## Calculation Methods
|
||||
|
||||
1. **Non-N bases**: Count of non-N nucleotides in genome sequence
|
||||
2. **Unique mappability**: Regions of specific size that can be uniquely mapped (may consider edit distance)
|
||||
|
||||
## Common Organism Values
|
||||
|
||||
### Using Non-N Bases Method
|
||||
|
||||
| Organism | Assembly | Effective Size | Full Command |
|----------|----------|----------------|--------------|
| Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` |
| Human | GRCh37/hg19 | 2,864,785,220 | `--effectiveGenomeSize 2864785220` |
| Mouse | GRCm39/mm39 | 2,654,621,837 | `--effectiveGenomeSize 2654621837` |
| Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` |
| Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` |
| *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` |
| *C. elegans* | WBcel235/ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` |
| *C. elegans* | ce10 | 100,258,171 | `--effectiveGenomeSize 100258171` |
|
||||
|
||||
### Human (GRCh38) by Read Length
|
||||
|
||||
For quality-filtered reads, values vary by read length:
|
||||
|
||||
| Read Length | Effective Size |
|-------------|----------------|
| 50bp | ~2.7 billion |
| 75bp | ~2.8 billion |
| 100bp | ~2.8 billion |
| 150bp | ~2.9 billion |
| 250bp | ~2.9 billion |
|
||||
|
||||
### Mouse (GRCm38) by Read Length
|
||||
|
||||
| Read Length | Effective Size |
|-------------|----------------|
| 50bp | ~2.3 billion |
| 75bp | ~2.5 billion |
| 100bp | ~2.6 billion |
|
||||
|
||||
## Usage in deepTools
|
||||
|
||||
The effective genome size is most commonly used with:
|
||||
|
||||
### bamCoverage with RPGC normalization
|
||||
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398
```
|
||||
|
||||
### bamCompare with RPGC normalization
|
||||
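```bash
# --scaleFactorsMethod accepts only readCount, SES, or None;
# RPGC is selected with --normalizeUsing, which requires --scaleFactorsMethod None
bamCompare -b1 treatment.bam -b2 control.bam \
    --outFileName comparison.bw \
    --scaleFactorsMethod None \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398
```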
|
||||
|
||||
### computeGCBias / correctGCBias
|
||||
```bash
# The frequencies file is the required output; correctGCBias reads it later
computeGCBias --bamfile input.bam \
    --effectiveGenomeSize 2913022398 \
    --genome genome.2bit \
    --fragmentLength 200 \
    --GCbiasFrequenciesFile freq.txt \
    --biasPlot bias.png
```
|
||||
|
||||
## Choosing the Right Value
|
||||
|
||||
**For most analyses:** Use the non-N bases method value for your reference genome
|
||||
|
||||
**For filtered data:** If you apply strict quality filters or remove multimapping reads, consider using the read-length-specific values
|
||||
|
||||
**When unsure:** Use the conservative non-N bases value - it's more widely applicable
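
For scripted pipelines, the non-N values can be wrapped in a small lookup. The sketch below is a convenience helper, not part of deepTools: the function name `effective_genome_size` is hypothetical, and the numbers are copied from the non-N bases table earlier in this file.

```bash
#!/usr/bin/env bash
# Sketch only: map an assembly name to the non-N effective genome size
# listed in this document. Unknown names return a non-zero status.
effective_genome_size() {
    case "$1" in
        GRCh38|hg38)   echo 2913022398 ;;
        GRCh37|hg19)   echo 2864785220 ;;
        GRCm39|mm39)   echo 2654621837 ;;
        GRCm38|mm10)   echo 2652783500 ;;
        GRCz11)        echo 1368780147 ;;
        dm6)           echo 142573017 ;;
        WBcel235|ce11) echo 100286401 ;;
        ce10)          echo 100258171 ;;
        *) echo "unknown assembly: $1" >&2; return 1 ;;
    esac
}

effective_genome_size hg38   # prints 2913022398
```

The result can be passed straight to a command, for example `--effectiveGenomeSize "$(effective_genome_size mm10)"`.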
|
||||
|
||||
## Common Shortcuts
|
||||
|
||||
deepTools expects `--effectiveGenomeSize` as an integer number of bases. The labels below are a lookup for the non-N values listed above, not strings to pass on the command line (shorthand such as `hs` or `mm` is a MACS2 `-g` convention):

- Human (GRCh38): 2913022398
- Mouse (GRCm38): 2652783500
- *Drosophila* (dm6): 142573017
- *C. elegans* (WBcel235/ce11): 100286401

Check the documentation for your deepTools version if in doubt.
|
||||
|
||||
## Calculating Custom Values
|
||||
|
||||
For custom genomes or assemblies, calculate the non-N bases count:
|
||||
|
||||
```bash
# Using faCount (UCSC tools): total length minus N count
faCount genome.fa | grep "total" | awk '{print $2-$7}'

# Using seqtk: sum the A, C, G, and T columns (column 2 is total length, N included)
seqtk comp genome.fa | awk '{x+=$3+$4+$5+$6}END{print x}'
```
|
||||
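
If neither tool is installed, the same count can be approximated with awk alone. This is a sketch under assumptions: `count_non_n` is a hypothetical helper, it treats every character other than `N`/`n` on a sequence line as a base, and it expects an uncompressed FASTA file.

```bash
#!/usr/bin/env bash
# Sketch only: count non-N bases in an uncompressed FASTA file.
# Usage: count_non_n GENOME_FASTA
count_non_n() {
    awk '!/^>/ {                      # skip header lines
             seq = toupper($0)        # treat soft-masked (lowercase) bases as bases
             gsub(/N/, "", seq)       # drop N characters
             total += length(seq)
         }
         END { print total + 0 }' "$1"
}

# Tiny demonstration on a throwaway file
tmp_fa=$(mktemp)
printf '>chr1\nACGTNNNNacgt\n>chr2\nNNGGCC\n' > "$tmp_fa"
count_non_n "$tmp_fa"   # 8 + 4 non-N bases, prints 12
rm -f "$tmp_fa"
```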
|
||||
## References
|
||||
|
||||
For the most up-to-date effective genome sizes and detailed calculation methods, see:
|
||||
- deepTools documentation: https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
|
||||
- ENCODE documentation for reference genome details
|
||||
410
skills/deeptools/references/normalization_methods.md
Normal file
@@ -0,0 +1,410 @@
|
||||
# deepTools Normalization Methods
|
||||
|
||||
This document explains the various normalization methods available in deepTools and when to use each one.
|
||||
|
||||
## Why Normalize?
|
||||
|
||||
Normalization is essential for:
|
||||
1. **Comparing samples with different sequencing depths**
|
||||
2. **Accounting for library size differences**
|
||||
3. **Making coverage values interpretable across experiments**
|
||||
4. **Enabling fair comparisons between conditions**
|
||||
|
||||
Without normalization, a sample with 100 million reads will appear to have higher coverage than a sample with 50 million reads, even if the true biological signal is identical.
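
A small arithmetic sketch of that point, using CPM as defined later in this document (reads in a bin divided by total mapped reads in millions). The function name `cpm` and the read counts are illustrative only, not deepTools output.

```bash
#!/usr/bin/env bash
# Sketch only: counts per million for a single bin.
# Usage: cpm READS_IN_BIN TOTAL_MAPPED_READS
cpm() {
    awk -v reads="$1" -v total="$2" 'BEGIN { printf "%.2f\n", reads * 1e6 / total }'
}

# Same underlying signal, different depth: raw counts differ two-fold...
cpm 200 100000000   # sample A: 200 reads in the bin, 100 million mapped
cpm 100 50000000    # sample B: 100 reads in the bin, 50 million mapped
# ...but both calls print 2.00 once depth is accounted for.
```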
|
||||
|
||||
---
|
||||
|
||||
## Available Normalization Methods
|
||||
|
||||
### 1. RPKM (Reads Per Kilobase per Million mapped reads)
|
||||
|
||||
**Formula:** `(Number of reads) / (Length of region in kb × Total mapped reads in millions)`
|
||||
|
||||
**When to use:**
|
||||
- Comparing different genomic regions within the same sample
|
||||
- Adjusting for both sequencing depth AND region length
|
||||
- RNA-seq gene expression analysis
|
||||
|
||||
**Available in:** `bamCoverage`, `bamCompare` (via `--normalizeUsing`)
|
||||
|
||||
**Example:**
|
||||
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing RPKM
```
|
||||
|
||||
**Interpretation:** RPKM of 10 means 10 reads per kilobase of feature per million mapped reads.
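
To make the formula concrete, here is a worked calculation as a shell sketch. The helper name `rpkm` and the input numbers are made up for illustration; the arithmetic follows the formula given above.

```bash
#!/usr/bin/env bash
# Sketch only: RPKM for one region.
# Usage: rpkm READS REGION_LENGTH_BP TOTAL_MAPPED_READS
rpkm() {
    awk -v reads="$1" -v len="$2" -v total="$3" \
        'BEGIN { printf "%.2f\n", reads / ((len / 1000) * (total / 1e6)) }'
}

rpkm 500 2000 25000000   # 500 reads / (2 kb x 25 million) -> prints 10.00
rpkm 500 4000 25000000   # same reads over a region twice as long -> prints 5.00
```

The second call shows the length adjustment: the same read count spread over twice the length halves the value.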
|
||||
|
||||
**Pros:**
|
||||
- Accounts for both region length and library size
|
||||
- Widely used and understood in genomics
|
||||
|
||||
**Cons:**
|
||||
- Not ideal for comparing between samples if total RNA content differs
|
||||
- Can be misleading when comparing samples with very different compositions
|
||||
|
||||
---
|
||||
|
||||
### 2. CPM (Counts Per Million mapped reads)
|
||||
|
||||
**Formula:** `(Number of reads) / (Total mapped reads in millions)`
|
||||
|
||||
**Also known as:** RPM (Reads Per Million)
|
||||
|
||||
**When to use:**
|
||||
- Comparing the same genomic regions across different samples
|
||||
- When region length is constant or not relevant
|
||||
- ChIP-seq, ATAC-seq, DNase-seq analyses
|
||||
|
||||
**Available in:** `bamCoverage`, `bamCompare`
|
||||
|
||||
**Example:**
|
||||
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing CPM
```
|
||||
|
||||
**Interpretation:** CPM of 5 means 5 reads per million mapped reads in that bin.
|
||||
|
||||
**Pros:**
|
||||
- Simple and intuitive
|
||||
- Good for comparing samples with different sequencing depths
|
||||
- Appropriate when comparing fixed-size bins
|
||||
|
||||
**Cons:**
|
||||
- Does not account for region length
|
||||
- Affected by highly abundant regions (e.g., rRNA in RNA-seq)
|
||||
|
||||
---
|
||||
|
||||
### 3. BPM (Bins Per Million mapped reads)
|
||||
|
||||
**Formula:** `(Number of reads in bin) / (Sum of all reads in bins in millions)`
|
||||
|
||||
**Key difference from CPM:** Only considers reads that fall within the analyzed bins, not all mapped reads.
|
||||
|
||||
**When to use:**
|
||||
- Similar to CPM, but when you want to exclude reads outside analyzed regions
|
||||
- Comparing specific genomic regions while ignoring background
|
||||
|
||||
**Available in:** `bamCoverage`, `bamCompare`
|
||||
|
||||
**Example:**
|
||||
```bash
bamCoverage --bam input.bam --outFileName output.bw \
    --normalizeUsing BPM
```
|
||||
|
||||
**Interpretation:** BPM accounts only for reads in the binned regions.
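
The formula above can be illustrated with a toy set of three bins. The helper `bpm` is hypothetical and the counts are invented; the point is that each bin is divided by the sum over the analyzed bins, so the resulting values add up to one million.

```bash
#!/usr/bin/env bash
# Sketch only: BPM for a list of per-bin read counts (one count per line on stdin).
# Assumes at least one bin has a non-zero count.
bpm() {
    awk '{ count[NR] = $1; sum += $1 }
         END {
             for (i = 1; i <= NR; i++)
                 printf "%.0f\n", count[i] * 1e6 / sum
         }'
}

# Three bins with 10, 30, and 60 reads (100 reads in total)
printf '10\n30\n60\n' | bpm   # prints 100000, 300000, 600000
```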
|
||||
|
||||
**Pros:**
|
||||
- Focuses normalization on analyzed regions
|
||||
- Less affected by reads in unanalyzed areas
|
||||
|
||||
**Cons:**
|
||||
- Less commonly used, may be harder to compare with published data
|
||||
|
||||
---
|
||||
|
||||
### 4. RPGC (Reads Per Genomic Content)
|
||||
|
||||
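**Formula:** `(Number of reads per bin) / (Scaling factor for 1× average coverage)`

**Scaling factor:** `(Total mapped reads × Fragment length) / Effective genome size`, i.e. the sample's current average coverage; dividing by it rescales the track to 1× average coverage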
|
||||
|
||||
**When to use:**
|
||||
- Want comparable coverage values across samples
|
||||
- Need interpretable absolute coverage values
|
||||
- Comparing samples with very different total read counts
|
||||
- ChIP-seq with spike-in normalization context
|
||||
|
||||
**Available in:** `bamCoverage`, `bamCompare`
|
||||
|
||||
**Requires:** `--effectiveGenomeSize` parameter
|
||||
|
||||
**Example:**
|
||||
```bash
|
||||
bamCoverage --bam input.bam --outFileName output.bw \
|
||||
--normalizeUsing RPGC \
|
||||
--effectiveGenomeSize 2913022398
|
||||
```
|
||||
|
||||
**Interpretation:** Signal value approximates the coverage depth (e.g., value of 2 ≈ 2× coverage).
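
A worked example may help here. The sketch below follows the definition in the deepTools documentation (average coverage = mapped reads × fragment length / effective genome size, and RPGC = bin count divided by that coverage). The helper names, the 30 M read count, and the 200 bp fragment length are assumptions chosen for illustration.

```bash
# Illustrative helpers (not part of deepTools).
# Average coverage of a library: mapped reads x fragment length / effective genome size.
avg_coverage() {
  awk -v reads="$1" -v frag="$2" -v genome="$3" \
    'BEGIN { printf "%.2f\n", reads * frag / genome }'
}

# RPGC for one bin: the bin's read count divided by the average coverage.
rpgc() {
  awk -v bin="$1" -v reads="$2" -v frag="$3" -v genome="$4" \
    'BEGIN { printf "%.2f\n", bin / (reads * frag / genome) }'
}

avg_coverage 30000000 200 2913022398   # 30 M reads, 200 bp fragments -> 2.06
rpgc 10 30000000 200 2913022398        # a bin holding 10 reads
```

A bin whose raw count equals the library's average coverage ends up with an RPGC value of 1.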
|
||||
|
||||
**Pros:**
|
||||
- Produces 1× normalized coverage
|
||||
- Interpretable in terms of genomic coverage
|
||||
- Good for comparing samples with different sequencing depths
|
||||
|
||||
**Cons:**
|
||||
- Requires knowing effective genome size
|
||||
- Assumes uniform coverage (not true for ChIP-seq with peaks)
|
||||
|
||||
---
|
||||
|
||||
### 5. None (No Normalization)
|
||||
|
||||
**Formula:** Raw read counts
|
||||
|
||||
**When to use:**
|
||||
- Preliminary analysis
|
||||
- When samples have identical library sizes (rare)
|
||||
- When downstream tool will perform normalization
|
||||
- Debugging or quality control
|
||||
|
||||
**Available in:** `bamCoverage`, `bamCompare` (the default value of `--normalizeUsing`)
|
||||
|
||||
**Example:**
|
||||
```bash
|
||||
bamCoverage --bam input.bam --outFileName output.bw \
|
||||
--normalizeUsing None
|
||||
```
|
||||
|
||||
**Interpretation:** Raw read counts per bin.
|
||||
|
||||
**Pros:**
|
||||
- No assumptions made
|
||||
- Useful for seeing raw data
|
||||
- Fastest computation
|
||||
|
||||
**Cons:**
|
||||
- Cannot fairly compare samples with different sequencing depths
|
||||
- Not suitable for publication figures
|
||||
|
||||
---
|
||||
|
||||
### 6. SES (Signal Extraction Scaling)
|
||||
|
||||
**Method:** Signal Extraction Scaling, which estimates the scaling factor from background-dominated bins rather than from total read counts when comparing ChIP to control
|
||||
|
||||
**When to use:**
|
||||
- ChIP-seq analysis with bamCompare
|
||||
- Want sophisticated background correction
|
||||
- Alternative to simple readCount scaling
|
||||
|
||||
**Available in:** `bamCompare` only
|
||||
|
||||
**Example:**
|
||||
```bash
|
||||
bamCompare -b1 chip.bam -b2 input.bam -o output.bw \
|
||||
--scaleFactorsMethod SES
|
||||
```
|
||||
|
||||
**Note:** SES is specifically designed for ChIP-seq data and may work better than simple read count scaling for noisy data.
|
||||
|
||||
---
|
||||
|
||||
### 7. readCount (Read Count Scaling)
|
||||
|
||||
**Method:** Scale by ratio of total read counts between samples
|
||||
|
||||
**When to use:**
|
||||
- Default for `bamCompare`
|
||||
- Compensating for sequencing depth differences in comparisons
|
||||
- When you trust that total read counts reflect library size
|
||||
|
||||
**Available in:** `bamCompare`
|
||||
|
||||
**Example:**
|
||||
```bash
|
||||
bamCompare -b1 treatment.bam -b2 control.bam -o output.bw \
|
||||
--scaleFactorsMethod readCount
|
||||
```
|
||||
|
||||
**How it works:** If sample1 has 100M reads and sample2 has 50M reads, sample1 is scaled by 0.5× (the larger library is scaled down to match the smaller one) before comparison.
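
The scaling can be sketched in shell. This assumes factors are computed relative to the smaller library, as in the deepTools source at the time of writing; scaling the larger library down by 0.5 or the smaller one up by 2 gives the same ratio between samples. The `read_count_factors` helper is hypothetical.

```bash
# Illustrative helper (not part of deepTools): depth scale factors for two
# libraries, relative to the smaller one.
#   $1 = mapped reads in sample 1, $2 = mapped reads in sample 2
read_count_factors() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    m = (a < b) ? a : b                    # size of the smaller library
    printf "%.2f %.2f\n", m / a, m / b     # factor for sample 1, sample 2
  }'
}

read_count_factors 100000000 50000000   # -> 0.50 1.00
```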
|
||||
|
||||
---
|
||||
|
||||
## Normalization Method Selection Guide
|
||||
|
||||
### For ChIP-seq Coverage Tracks
|
||||
|
||||
**Recommended:** RPGC or CPM
|
||||
|
||||
```bash
|
||||
bamCoverage --bam chip.bam --outFileName chip.bw \
|
||||
--normalizeUsing RPGC \
|
||||
--effectiveGenomeSize 2913022398 \
|
||||
--extendReads 200 \
|
||||
--ignoreDuplicates
|
||||
```
|
||||
|
||||
**Reasoning:** Accounts for sequencing depth differences; RPGC provides interpretable coverage values.
|
||||
|
||||
---
|
||||
|
||||
### For ChIP-seq Comparisons (Treatment vs Control)
|
||||
|
||||
**Recommended:** log2 ratio with readCount or SES scaling
|
||||
|
||||
```bash
|
||||
bamCompare -b1 chip.bam -b2 input.bam -o ratio.bw \
|
||||
--operation log2 \
|
||||
--scaleFactorsMethod readCount \
|
||||
--extendReads 200 \
|
||||
--ignoreDuplicates
|
||||
```
|
||||
|
||||
**Reasoning:** Log2 ratio shows enrichment (positive) and depletion (negative); readCount adjusts for depth.
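
To make the sign convention concrete, here is a minimal sketch of a per-bin log2 ratio. It assumes the counts are already depth-scaled and that a pseudocount (1 by default in `bamCompare`) is added to both values; the `log2_ratio` helper and the counts are illustrative only.

```bash
# Illustrative helper (not part of deepTools): per-bin log2 ratio.
#   $1 = scaled ChIP count, $2 = scaled input count, $3 = pseudocount (default 1)
log2_ratio() {
  awk -v chip="$1" -v input="$2" -v pc="${3:-1}" \
    'BEGIN { printf "%.2f\n", log((chip + pc) / (input + pc)) / log(2) }'
}

log2_ratio 7 1   # (7+1)/(1+1) = 4    -> 2.00 (enriched)
log2_ratio 0 3   # (0+1)/(3+1) = 0.25 -> -2.00 (depleted)
log2_ratio 0 0   # pseudocount avoids 0/0 -> 0.00
```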
|
||||
|
||||
---
|
||||
|
||||
### For RNA-seq Coverage Tracks
|
||||
|
||||
**Recommended:** CPM or RPKM
|
||||
|
||||
```bash
|
||||
# Strand-specific forward
|
||||
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
|
||||
--normalizeUsing CPM \
|
||||
--filterRNAstrand forward
|
||||
|
||||
# RPKM is computed per bin here: reads per kilobase of bin per million mapped reads
|
||||
bamCoverage --bam rnaseq.bam --outFileName output.bw \
|
||||
--normalizeUsing RPKM
|
||||
```
|
||||
|
||||
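**Reasoning:** CPM suits fixed-width bins. In `bamCoverage`, RPKM is calculated per bin (using the bin length, not the gene length), so with a constant bin size it differs from CPM only by a constant factor.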
|
||||
|
||||
---
|
||||
|
||||
### For ATAC-seq
|
||||
|
||||
**Recommended:** RPGC or CPM
|
||||
|
||||
```bash
|
||||
bamCoverage --bam atac_shifted.bam --outFileName atac.bw \
|
||||
--normalizeUsing RPGC \
|
||||
--effectiveGenomeSize 2913022398
|
||||
```
|
||||
|
||||
**Reasoning:** Similar to ChIP-seq; want comparable coverage across samples.
|
||||
|
||||
---
|
||||
|
||||
### For Sample Correlation Analysis
|
||||
|
||||
**Recommended:** CPM or RPGC
|
||||
|
||||
```bash
|
||||
multiBamSummary bins \
|
||||
--bamfiles sample1.bam sample2.bam sample3.bam \
|
||||
-o readCounts.npz
|
||||
|
||||
plotCorrelation -in readCounts.npz \
|
||||
--corMethod pearson \
|
||||
--whatToShow heatmap \
|
||||
-o correlation.png
|
||||
```
|
||||
|
||||
**Note:** `multiBamSummary` does not normalize its counts, but Pearson and Spearman coefficients are unchanged when a sample is multiplied by a constant, so depth differences alone do not alter the correlation. For very different library sizes, consider normalizing BAM files first or using CPM-normalized bigWig files with `multiBigwigSummary`.
|
||||
|
||||
---
|
||||
|
||||
## Advanced Normalization Considerations
|
||||
|
||||
### Spike-in Normalization
|
||||
|
||||
For experiments with spike-in controls (e.g., *Drosophila* chromatin spike-in for ChIP-seq):
|
||||
|
||||
1. Calculate scaling factors from spike-in reads
|
||||
2. Apply custom scaling factors using `--scaleFactor` parameter
|
||||
|
||||
```bash
|
||||
# Calculate spike-in factor (example: 0.8)
|
||||
SCALE_FACTOR=0.8
|
||||
|
||||
bamCoverage --bam chip.bam --outFileName chip_spikenorm.bw \
|
||||
--scaleFactor ${SCALE_FACTOR} \
|
||||
--extendReads 200
|
||||
```
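
How the factor itself is derived depends on the spike-in protocol. The sketch below shows one possible convention, stated as an assumption: each sample is scaled by the smallest spike-in read count in the experiment divided by its own spike-in read count. The `spikein_factor` helper and the counts are hypothetical, and the spike-in read counts must be obtained separately.

```bash
# Illustrative helper (not part of deepTools): spike-in scale factor, assuming
# factors are taken relative to the sample with the fewest spike-in reads.
#   $1 = spike-in reads in this sample, $2 = fewest spike-in reads in any sample
spikein_factor() {
  awk -v spike="$1" -v ref="$2" 'BEGIN { printf "%.2f\n", ref / spike }'
}

SCALE_FACTOR=$(spikein_factor 500000 400000)
echo "${SCALE_FACTOR}"   # -> 0.80
```

The resulting value can then be passed to `--scaleFactor` as in the command above.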
|
||||
|
||||
---
|
||||
|
||||
### Manual Scaling Factors
|
||||
|
||||
You can apply custom scaling factors:
|
||||
|
||||
```bash
|
||||
# Apply 2× scaling
|
||||
bamCoverage --bam input.bam --outFileName output.bw \
|
||||
--scaleFactor 2.0
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Chromosome Exclusion
|
||||
|
||||
Exclude specific chromosomes from normalization calculations:
|
||||
|
||||
```bash
|
||||
bamCoverage --bam input.bam --outFileName output.bw \
|
||||
--normalizeUsing RPGC \
|
||||
--effectiveGenomeSize 2913022398 \
|
||||
--ignoreForNormalization chrX chrY chrM
|
||||
```
|
||||
|
||||
**When to use:** Sex chromosomes in mixed-sex samples, mitochondrial DNA, or chromosomes with unusual coverage.
|
||||
|
||||
---
|
||||
|
||||
## Common Pitfalls
|
||||
|
||||
### 1. Using RPKM for bin-based data
|
||||
**Problem:** RPKM accounts for region length, but all bins are the same size
|
||||
**Solution:** Use CPM or RPGC instead
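
A small calculation shows why RPKM adds nothing for equal-sized bins. This is a sketch with made-up numbers and hypothetical helper names, using the per-bin definitions of CPM and RPKM.

```bash
# Illustrative helpers (not part of deepTools): per-bin CPM and RPKM.
#   $1 = reads in the bin, $2 = total mapped reads, $3 = bin length in bp
bin_cpm()  { awk -v r="$1" -v t="$2" 'BEGIN { printf "%.2f\n", r * 1e6 / t }'; }
bin_rpkm() { awk -v r="$1" -v t="$2" -v len="$3" \
               'BEGIN { printf "%.2f\n", r * 1e6 / t * 1000 / len }'; }

bin_cpm  100 20000000      # -> 5.00
bin_rpkm 100 20000000 50   # -> 100.00, i.e. CPM x (1000 / 50)
```

With a fixed bin size every bin is multiplied by the same constant, so the track shape is identical to CPM and only the scale changes.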
|
||||
|
||||
### 2. Comparing unnormalized samples
|
||||
**Problem:** Sample with 2× sequencing depth appears to have 2× signal
|
||||
**Solution:** Always normalize when comparing samples
|
||||
|
||||
### 3. Wrong effective genome size
|
||||
**Problem:** Using hg19 genome size for hg38 data
|
||||
**Solution:** Double-check genome assembly and use correct size
|
||||
|
||||
### 4. Ignoring duplicates after GC correction
|
||||
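**Problem:** `correctGCBias` adds reads as duplicates in under-represented regions, so removing duplicates afterwards undoes part of the correction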
|
||||
**Solution:** Never use `--ignoreDuplicates` after `correctGCBias`
|
||||
|
||||
### 5. Using RPGC without effective genome size
|
||||
**Problem:** Command fails
|
||||
**Solution:** Always specify `--effectiveGenomeSize` with RPGC
|
||||
|
||||
---
|
||||
|
||||
## Normalization for Different Comparisons
|
||||
|
||||
### Within-sample comparisons (different regions)
|
||||
**Use:** RPKM (accounts for region length)
|
||||
|
||||
### Between-sample comparisons (same regions)
|
||||
**Use:** CPM, RPGC, or BPM (accounts for library size)
|
||||
|
||||
### Treatment vs Control
|
||||
**Use:** bamCompare with log2 ratio and readCount/SES scaling
|
||||
|
||||
### Multiple samples correlation
|
||||
**Use:** CPM or RPGC normalized bigWig files, then multiBigwigSummary
|
||||
|
||||
---
|
||||
|
||||
## Quick Reference Table
|
||||
|
||||
| Method | Accounts for Depth | Accounts for Length | Best For | Command |
|--------|-------------------|---------------------|----------|---------|
| RPKM | ✓ | ✓ | RNA-seq genes | `--normalizeUsing RPKM` |
| CPM | ✓ | ✗ | Fixed-size bins | `--normalizeUsing CPM` |
| BPM | ✓ | ✗ | Specific regions | `--normalizeUsing BPM` |
| RPGC | ✓ | ✗ | Interpretable coverage | `--normalizeUsing RPGC --effectiveGenomeSize X` |
| None | ✗ | ✗ | Raw data | `--normalizeUsing None` |
| SES | ✓ | ✗ | ChIP comparisons | `bamCompare --scaleFactorsMethod SES` |
| readCount | ✓ | ✗ | ChIP comparisons | `bamCompare --scaleFactorsMethod readCount` |
|
||||
|
||||
---
|
||||
|
||||
## Further Reading
|
||||
|
||||
For more details on normalization theory and best practices:
|
||||
- deepTools documentation: https://deeptools.readthedocs.io/
|
||||
- ENCODE guidelines for ChIP-seq analysis
|
||||
- RNA-seq normalization papers (DESeq2, TMM methods)
|
||||
533
skills/deeptools/references/tools_reference.md
Normal file
@@ -0,0 +1,533 @@
|
||||
# deepTools Complete Tool Reference
|
||||
|
||||
This document provides a comprehensive reference for all deepTools command-line utilities organized by category.
|
||||
|
||||
## BAM and bigWig File Processing Tools
|
||||
|
||||
### multiBamSummary
|
||||
|
||||
Computes read coverages for genomic regions across multiple BAM files, outputting compressed numpy arrays for downstream correlation and PCA analysis.
|
||||
|
||||
**Modes:**
|
||||
- **bins**: Genome-wide analysis using consecutive equal-sized windows (default 10kb)
|
||||
- **BED-file**: Restricts analysis to user-specified genomic regions
|
||||
|
||||
**Key Parameters:**
|
||||
- `--bamfiles, -b`: Indexed BAM files (space-separated, required)
|
||||
- `--outFileName, -o`: Output coverage matrix file (required)
|
||||
- `--BED`: Region specification file (BED-file mode only)
|
||||
- `--binSize`: Window size in bases (default: 10,000)
|
||||
- `--labels`: Custom sample identifiers
|
||||
- `--minMappingQuality`: Quality threshold for read inclusion
|
||||
- `--numberOfProcessors, -p`: Parallel processing cores
|
||||
- `--extendReads`: Fragment size extension
|
||||
- `--ignoreDuplicates`: Remove PCR duplicates
|
||||
- `--outRawCounts`: Export tab-delimited file with coordinate columns and per-sample counts
|
||||
|
||||
**Output:** Compressed numpy array (.npz) for plotCorrelation and plotPCA
|
||||
|
||||
**Common Usage:**
|
||||
```bash
|
||||
# Genome-wide comparison
|
||||
multiBamSummary bins --bamfiles sample1.bam sample2.bam -o results.npz
|
||||
|
||||
# Peak region comparison
|
||||
multiBamSummary BED-file --BED peaks.bed --bamfiles sample1.bam sample2.bam -o results.npz
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### multiBigwigSummary
|
||||
|
||||
Similar to multiBamSummary but operates on bigWig files instead of BAM files. Used for comparing coverage tracks across samples.
|
||||
|
||||
**Modes:**
|
||||
- **bins**: Genome-wide analysis
|
||||
- **BED-file**: Region-specific analysis
|
||||
|
||||
**Key Parameters:** Similar to multiBamSummary but accepts bigWig files
|
||||
|
||||
---
|
||||
|
||||
### bamCoverage
|
||||
|
||||
Converts BAM alignment files into normalized coverage tracks in bigWig or bedGraph formats. Calculates coverage as number of reads per bin.
|
||||
|
||||
**Key Parameters:**
|
||||
- `--bam, -b`: Input BAM file (required)
|
||||
- `--outFileName, -o`: Output filename (required)
|
||||
- `--outFileFormat, -of`: Output type (bigwig or bedgraph)
|
||||
- `--normalizeUsing`: Normalization method
|
||||
- **RPKM**: Reads Per Kilobase per Million mapped reads
|
||||
- **CPM**: Counts Per Million mapped reads
|
||||
- **BPM**: Bins Per Million mapped reads
|
||||
- **RPGC**: Reads per genomic content (requires --effectiveGenomeSize)
|
||||
- **None**: No normalization (default)
|
||||
- `--effectiveGenomeSize`: Mappable genome size (required for RPGC)
|
||||
- `--binSize`: Resolution in base pairs (default: 50)
|
||||
- `--extendReads, -e`: Extend reads to fragment length (recommended for ChIP-seq, NOT for RNA-seq)
|
||||
- `--centerReads`: Center reads at fragment length for sharper signals
|
||||
- `--ignoreDuplicates`: Count identical reads only once
|
||||
- `--minMappingQuality`: Filter reads below quality threshold
|
||||
- `--minFragmentLength / --maxFragmentLength`: Fragment length filtering
|
||||
- `--smoothLength`: Window averaging for noise reduction
|
||||
- `--MNase`: Analyze MNase-seq data for nucleosome positioning
|
||||
- `--Offset`: Position-specific offsets (useful for RiboSeq, GROseq)
|
||||
- `--filterRNAstrand`: Separate forward/reverse strand reads
|
||||
- `--ignoreForNormalization`: Exclude chromosomes from normalization (e.g., sex chromosomes)
|
||||
- `--numberOfProcessors, -p`: Parallel processing
|
||||
|
||||
**Important Notes:**
|
||||
- For RNA-seq: Do NOT use --extendReads (would extend over splice junctions)
|
||||
- For ChIP-seq: Use --extendReads with smaller bin sizes
|
||||
- Never apply --ignoreDuplicates after GC bias correction
|
||||
|
||||
**Common Usage:**
|
||||
```bash
|
||||
# Basic coverage with RPKM normalization
|
||||
bamCoverage --bam input.bam --outFileName coverage.bw --normalizeUsing RPKM
|
||||
|
||||
# ChIP-seq with extension
|
||||
bamCoverage --bam chip.bam --outFileName chip_coverage.bw \
|
||||
--binSize 10 --extendReads 200 --ignoreDuplicates
|
||||
|
||||
# Strand-specific RNA-seq
|
||||
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
|
||||
--filterRNAstrand forward
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### bamCompare
|
||||
|
||||
Compares two BAM files by generating bigWig or bedGraph files, normalizing for sequencing depth differences. Processes genome in equal-sized bins and performs per-bin calculations.
|
||||
|
||||
**Comparison Methods:**
|
||||
- **log2** (default): Log2 ratio of samples
|
||||
- **ratio**: Direct ratio calculation
|
||||
- **subtract**: Difference between files
|
||||
- **add**: Sum of samples
|
||||
- **mean**: Average across samples
|
||||
- **reciprocal_ratio**: Ratio, replaced by its negative inverse when the ratio is below 1
|
||||
- **first/second**: Output scaled signal from single file
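
The reciprocal ratio is the least obvious of these operations, so a short sketch may help. It assumes the behaviour in the deepTools source, where ratios below 1 are replaced by their negative inverse, and that both values are non-zero (for example after the pseudocount has been added). The helper name is hypothetical.

```bash
# Illustrative helper (not part of deepTools): reciprocal ratio of two bin values.
# Ratios of at least 1 are kept; ratios below 1 become the negative inverse.
reciprocal_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    ratio = a / b
    out = (ratio >= 1) ? ratio : -1 / ratio
    printf "%.2f\n", out
  }'
}

reciprocal_ratio 8 2   # ratio 4    -> 4.00
reciprocal_ratio 2 8   # ratio 0.25 -> -4.00
```

This puts a fourfold enrichment and a fourfold depletion at the same distance from zero without a log transform.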
|
||||
|
||||
**Normalization Methods:**
|
||||
- **readCount** (default): Compensates for sequencing depth
|
||||
- **SES**: Signal extraction scaling
|
||||
- **RPKM**: Reads per kilobase per million
|
||||
- **CPM**: Counts per million
|
||||
- **BPM**: Bins per million
|
||||
- **RPGC**: Reads per genomic content (requires --effectiveGenomeSize)
|
||||
|
||||
**Key Parameters:**
|
||||
- `--bamfile1, -b1`: First BAM file (required)
|
||||
- `--bamfile2, -b2`: Second BAM file (required)
|
||||
- `--outFileName, -o`: Output filename (required)
|
||||
- `--outFileFormat`: bigwig or bedgraph
|
||||
- `--operation`: Comparison method (see above)
|
||||
- `--scaleFactorsMethod`: Scaling method: readCount, SES, or None (set None to use `--normalizeUsing` with RPKM, CPM, BPM, or RPGC)
|
||||
- `--binSize`: Bin width for output (default: 50bp)
|
||||
- `--pseudocount`: Avoid division by zero (default: 1)
|
||||
- `--extendReads`: Extend reads to fragment length
|
||||
- `--ignoreDuplicates`: Count identical reads once
|
||||
- `--minMappingQuality`: Quality threshold
|
||||
- `--numberOfProcessors, -p`: Parallelization
|
||||
|
||||
**Common Usage:**
|
||||
```bash
|
||||
# Log2 ratio of treatment vs control
|
||||
bamCompare -b1 treatment.bam -b2 control.bam -o log2ratio.bw
|
||||
|
||||
# Subtract control from treatment
|
||||
bamCompare -b1 treatment.bam -b2 control.bam -o difference.bw \
|
||||
--operation subtract --scaleFactorsMethod readCount
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### correctGCBias / computeGCBias
|
||||
|
||||
**computeGCBias:** Identifies GC-content bias from sequencing and PCR amplification.
|
||||
|
||||
**correctGCBias:** Corrects BAM files for GC bias detected by computeGCBias.
|
||||
|
||||
**Key Parameters (computeGCBias):**
|
||||
- `--bamfile, -b`: Input BAM file
|
||||
- `--effectiveGenomeSize`: Mappable genome size
|
||||
- `--genome, -g`: Reference genome in 2bit format
|
||||
- `--fragmentLength, -l`: Fragment length (for single-end)
|
||||
- `--biasPlot`: Output diagnostic plot
|
||||
|
||||
**Key Parameters (correctGCBias):**
|
||||
- `--bamfile, -b`: Input BAM file
|
||||
- `--effectiveGenomeSize`: Mappable genome size
|
||||
- `--genome, -g`: Reference genome in 2bit format
|
||||
- `--GCbiasFrequenciesFile`: Frequencies from computeGCBias
|
||||
- `--correctedFile, -o`: Output corrected BAM
|
||||
|
||||
**Important:** Never use --ignoreDuplicates after GC bias correction
|
||||
|
||||
---
|
||||
|
||||
### alignmentSieve
|
||||
|
||||
Filters BAM files by various quality metrics on-the-fly. Useful for creating filtered BAM files for specific analyses.
|
||||
|
||||
**Key Parameters:**
|
||||
- `--bam, -b`: Input BAM file
|
||||
- `--outFile, -o`: Output BAM file
|
||||
- `--minMappingQuality`: Minimum mapping quality
|
||||
- `--ignoreDuplicates`: Remove duplicates
|
||||
- `--minFragmentLength / --maxFragmentLength`: Fragment length filters
|
||||
- `--samFlagInclude / --samFlagExclude`: SAM flag filtering
|
||||
- `--shift`: Shift reads (e.g., for ATACseq Tn5 correction)
|
||||
- `--ATACshift`: Automatically shift for ATAC-seq data
|
||||
|
||||
---
|
||||
|
||||
### computeMatrix
|
||||
|
||||
Calculates scores per genomic region and prepares matrices for plotHeatmap and plotProfile. Processes bigWig score files and BED/GTF region files.
|
||||
|
||||
**Modes:**
|
||||
- **reference-point**: Signal distribution relative to specific position (TSS, TES, or center)
|
||||
- **scale-regions**: Signal across regions standardized to uniform lengths
|
||||
|
||||
**Key Parameters:**
|
||||
- `-R`: Region file(s) in BED/GTF format (required)
|
||||
- `-S`: BigWig score file(s) (required)
|
||||
- `-o`: Output matrix file (required)
|
||||
- `-b`: Upstream distance from reference point
|
||||
- `-a`: Downstream distance from reference point
|
||||
- `-m`: Region body length (scale-regions only)
|
||||
- `-bs, --binSize`: Bin size for averaging scores
|
||||
- `--skipZeros`: Skip regions with all zeros
|
||||
- `--minThreshold / --maxThreshold`: Filter by signal intensity
|
||||
- `--sortRegions`: descend, ascend, no, keep
|
||||
- `--sortUsing`: mean, median, max, min, sum, region_length
|
||||
- `-p, --numberOfProcessors`: Parallel processing
|
||||
- `--averageTypeBins`: Statistical method (mean, median, min, max, sum, std)
|
||||
|
||||
**Output Options:**
|
||||
- `--outFileNameMatrix`: Export tab-delimited data
|
||||
- `--outFileSortedRegions`: Save filtered/sorted BED file
|
||||
|
||||
**Common Usage:**
|
||||
```bash
|
||||
# TSS analysis
|
||||
computeMatrix reference-point -S signal.bw -R genes.bed \
|
||||
-o matrix.gz -b 2000 -a 2000 --referencePoint TSS
|
||||
|
||||
# Scaled gene body
|
||||
computeMatrix scale-regions -S signal.bw -R genes.bed \
|
||||
-o matrix.gz -b 1000 -a 1000 -m 3000
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Quality Control Tools
|
||||
|
||||
### plotFingerprint
|
||||
|
||||
Quality control tool primarily for ChIP-seq experiments. Assesses whether antibody enrichment was successful. Generates cumulative read coverage profiles to distinguish signal from noise.
|
||||
|
||||
**Key Parameters:**
|
||||
- `--bamfiles, -b`: Indexed BAM files (required)
|
||||
- `--plotFile, -plot, -o`: Output image filename (required)
|
||||
- `--extendReads, -e`: Extend reads to fragment length
|
||||
- `--ignoreDuplicates`: Count identical reads once
|
||||
- `--minMappingQuality`: Mapping quality filter
|
||||
- `--centerReads`: Center reads at fragment length
|
||||
- `--minFragmentLength / --maxFragmentLength`: Fragment filters
|
||||
- `--outRawCounts`: Save per-bin read counts
|
||||
- `--outQualityMetrics`: Output QC metrics (Jensen-Shannon distance)
|
||||
- `--labels`: Custom sample names
|
||||
- `--numberOfProcessors, -p`: Parallel processing
|
||||
|
||||
**Interpretation:**
|
||||
- Ideal control: Straight diagonal line
|
||||
- Strong ChIP: Steep rise towards highest rank (concentrated reads in few bins)
|
||||
- Weak enrichment: Flatter curve approaching diagonal
|
||||
|
||||
**Common Usage:**
|
||||
```bash
|
||||
plotFingerprint -b input.bam chip1.bam chip2.bam \
|
||||
--labels Input ChIP1 ChIP2 -o fingerprint.png \
|
||||
--extendReads 200 --ignoreDuplicates
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### plotCoverage
|
||||
|
||||
Visualizes average read distribution across the genome. Shows genome coverage and helps determine if sequencing depth is adequate.
|
||||
|
||||
**Key Parameters:**
|
||||
- `--bamfiles, -b`: BAM files to analyze (required)
|
||||
- `--plotFile, -o`: Output plot filename (required)
|
||||
- `--ignoreDuplicates`: Remove PCR duplicates
|
||||
- `--minMappingQuality`: Quality threshold
|
||||
- `--outRawCounts`: Save underlying data
|
||||
- `--labels`: Sample names
|
||||
- `--numberOfSamples`: Number of positions to sample (default: 1,000,000)
|
||||
|
||||
---
|
||||
|
||||
### bamPEFragmentSize
|
||||
|
||||
Determines fragment length distribution for paired-end sequencing data. Essential QC to verify expected fragment sizes from library preparation.
|
||||
|
||||
**Key Parameters:**
|
||||
- `--bamfiles, -b`: BAM files (required)
|
||||
- `--histogram, -hist`: Output histogram filename (required)
|
||||
- `--plotTitle, -T`: Plot title
|
||||
- `--maxFragmentLength`: Maximum length to consider (default: 1000)
|
||||
- `--logScale`: Use logarithmic Y-axis
|
||||
- `--outRawFragmentLengths`: Save raw fragment lengths
|
||||
|
||||
---
|
||||
|
||||
### plotCorrelation
|
||||
|
||||
Analyzes sample correlations from multiBamSummary or multiBigwigSummary outputs. Shows how similar different samples are.
|
||||
|
||||
**Correlation Methods:**
|
||||
- **Pearson**: Measures metric differences; sensitive to outliers; appropriate for normally distributed data
|
||||
- **Spearman**: Rank-based; less influenced by outliers; better for non-normal distributions
|
||||
|
||||
**Visualization Options:**
|
||||
- **heatmap**: Color intensity with hierarchical clustering (complete linkage)
|
||||
- **scatterplot**: Pairwise scatter plots with correlation coefficients
|
||||
|
||||
**Key Parameters:**
|
||||
- `--corData, -in`: Input matrix from multiBamSummary/multiBigwigSummary (required)
|
||||
- `--corMethod`: pearson or spearman (required)
|
||||
- `--whatToShow`: heatmap or scatterplot (required)
|
||||
- `--plotFile, -o`: Output filename (required)
|
||||
- `--skipZeros`: Exclude zero-value regions
|
||||
- `--removeOutliers`: Use median absolute deviation (MAD) filtering
|
||||
- `--outFileCorMatrix`: Export correlation matrix
|
||||
- `--labels`: Custom sample names
|
||||
- `--plotTitle`: Plot title
|
||||
- `--colorMap`: Color scheme (50+ options)
|
||||
- `--plotNumbers`: Display correlation values on heatmap
|
||||
|
||||
**Common Usage:**
|
||||
```bash
|
||||
# Heatmap with Pearson correlation
|
||||
plotCorrelation -in readCounts.npz --corMethod pearson \
|
||||
--whatToShow heatmap -o correlation_heatmap.png --plotNumbers
|
||||
|
||||
# Scatterplot with Spearman correlation
|
||||
plotCorrelation -in readCounts.npz --corMethod spearman \
|
||||
--whatToShow scatterplot -o correlation_scatter.png
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### plotPCA
|
||||
|
||||
Generates principal component analysis plots from multiBamSummary or multiBigwigSummary output. Displays sample relationships in reduced dimensionality.
|
||||
|
||||
**Key Parameters:**
|
||||
- `--corData, -in`: Coverage file from multiBamSummary/multiBigwigSummary (required)
|
||||
- `--plotFile, -o`: Output image (png, eps, pdf, svg) (required)
|
||||
- `--outFileNameData`: Export PCA data (loadings/rotation and eigenvalues)
|
||||
- `--labels, -l`: Custom sample labels
|
||||
- `--plotTitle, -T`: Plot title
|
||||
- `--plotHeight / --plotWidth`: Dimensions in centimeters
|
||||
- `--colors`: Custom symbol colors
|
||||
- `--markers`: Symbol shapes
|
||||
- `--transpose`: Perform PCA on transposed matrix (rows=samples)
|
||||
- `--ntop`: Use top N variable rows (default: 1000)
|
||||
- `--PCs`: Components to plot (default: 1 2)
|
||||
- `--log2`: Log2-transform data before analysis
|
||||
- `--rowCenter`: Center each row at 0
|
||||
|
||||
**Common Usage:**
|
||||
```bash
|
||||
plotPCA -in readCounts.npz -o PCA_plot.png \
|
||||
-T "PCA of read counts" --transpose
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Visualization Tools
|
||||
|
||||
### plotHeatmap
|
||||
|
||||
Creates genomic region heatmaps from computeMatrix output. Generates publication-quality visualizations.
|
||||
|
||||
**Key Parameters:**
|
||||
- `--matrixFile, -m`: Matrix from computeMatrix (required)
|
||||
- `--outFileName, -o`: Output image (png, eps, pdf, svg) (required)
|
||||
- `--outFileSortedRegions`: Save regions after filtering
|
||||
- `--outFileNameMatrix`: Export matrix values
|
||||
- `--interpolationMethod`: auto, nearest, bilinear, bicubic, gaussian
|
||||
- Default: nearest (≤1000 columns), bilinear (>1000 columns)
|
||||
- `--dpi`: Figure resolution
|
||||
|
||||
**Clustering:**
|
||||
- `--kmeans`: k-means clustering
|
||||
- `--hclust`: Hierarchical clustering (slower for >1000 regions)
|
||||
- `--silhouette`: Calculate cluster quality metrics
|
||||
|
||||
**Visual Customization:**
|
||||
- `--heatmapHeight / --heatmapWidth`: Dimensions (3-100 cm)
|
||||
- `--whatToShow`: plot, heatmap, colorbar (combinations)
|
||||
- `--alpha`: Transparency (0-1)
|
||||
- `--colorMap`: 50+ color schemes
|
||||
- `--colorList`: Custom gradient colors
|
||||
- `--zMin / --zMax`: Intensity scale limits
|
||||
- `--boxAroundHeatmaps`: yes/no (default: yes)
|
||||
|
||||
**Labels:**
|
||||
- `--xAxisLabel / --yAxisLabel`: Axis labels
|
||||
- `--regionsLabel`: Region set identifiers
|
||||
- `--samplesLabel`: Sample names
|
||||
- `--refPointLabel`: Reference point label
|
||||
- `--startLabel / --endLabel`: Region boundary labels
|
||||
|
||||
**Common Usage:**
|
||||
```bash
|
||||
# Basic heatmap
|
||||
plotHeatmap -m matrix.gz -o heatmap.png
|
||||
|
||||
# With clustering and custom colors
|
||||
plotHeatmap -m matrix.gz -o heatmap.png \
|
||||
--kmeans 3 --colorMap RdBu --zMin -3 --zMax 3
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### plotProfile
|
||||
|
||||
Generates profile plots showing scores across genomic regions using computeMatrix output.
|
||||
|
||||
**Key Parameters:**
|
||||
- `--matrixFile, -m`: Matrix from computeMatrix (required)
|
||||
- `--outFileName, -o`: Output image (png, eps, pdf, svg) (required)
|
||||
- `--plotType`: lines, fill, se, std, overlapped_lines, heatmap
|
||||
- `--colors`: Color palette (names or hex codes)
|
||||
- `--plotHeight / --plotWidth`: Dimensions in centimeters
|
||||
- `--yMin / --yMax`: Y-axis range
|
||||
- `--averageType`: mean, median, min, max, std, sum
|
||||
|
||||
**Clustering:**
|
||||
- `--kmeans`: k-means clustering
|
||||
- `--hclust`: Hierarchical clustering
|
||||
- `--silhouette`: Cluster quality metrics
|
||||
|
||||
**Labels:**
|
||||
- `--plotTitle`: Main heading
|
||||
- `--regionsLabel`: Region set identifiers
|
||||
- `--samplesLabel`: Sample names
|
||||
- `--startLabel / --endLabel`: Region boundary labels (scale-regions mode)
|
||||
|
||||
**Output Options:**
|
||||
- `--outFileNameData`: Export data as tab-separated values
|
||||
- `--outFileSortedRegions`: Save filtered/sorted regions as BED
|
||||
|
||||
**Common Usage:**
|
||||
```bash
|
||||
# Line plot
|
||||
plotProfile -m matrix.gz -o profile.png --plotType lines
|
||||
|
||||
# With standard error shading
|
||||
plotProfile -m matrix.gz -o profile.png --plotType se \
|
||||
--colors blue red green
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### plotEnrichment
|
||||
|
||||
Calculates and visualizes signal enrichment across genomic regions. Measures the percentage of alignments overlapping region groups. Useful for FRiP (Fraction of Reads in Peaks) scores.
|
||||
|
||||
**Key Parameters:**
|
||||
- `--bamfiles, -b`: Indexed BAM files (required)
|
||||
- `--BED`: Region files in BED/GTF format (required)
|
||||
- `--plotFile, -o`: Output visualization (png, pdf, eps, svg)
|
||||
- `--labels, -l`: Custom sample identifiers
|
||||
- `--outRawCounts`: Export numerical data
|
||||
- `--perSample`: Group the plots by sample rather than by feature type (the default)
|
||||
- `--regionLabels`: Custom region names
|
||||
|
||||
**Read Processing:**
|
||||
- `--minFragmentLength / --maxFragmentLength`: Fragment filters
|
||||
- `--minMappingQuality`: Quality threshold
|
||||
- `--samFlagInclude / --samFlagExclude`: SAM flag filters
|
||||
- `--ignoreDuplicates`: Remove duplicates
|
||||
- `--centerReads`: Center reads for sharper signal
|
||||
|
||||
**Common Usage:**
|
||||
```bash
|
||||
plotEnrichment -b Input.bam H3K4me3.bam \
|
||||
--BED peaks_up.bed peaks_down.bed \
|
||||
--regionLabels "Up regulated" "Down regulated" \
|
||||
-o enrichment.png
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Miscellaneous Tools
|
||||
|
||||
### computeMatrixOperations
|
||||
|
||||
Advanced matrix manipulation tool for combining or subsetting matrices from computeMatrix. Enables complex multi-sample, multi-region analyses.
|
||||
|
||||
**Operations:**
|
||||
- `cbind`: Combine matrices column-wise
|
||||
- `rbind`: Combine matrices row-wise
|
||||
- `subset`: Extract specific samples or regions
|
||||
- `filterStrand`: Keep only regions on specific strand
|
||||
- `filterValues`: Apply signal intensity filters
|
||||
- `sort`: Order regions by various criteria
|
||||
- `dataRange`: Report min/max values
|
||||
|
||||
**Common Usage:**
|
||||
```bash
|
||||
# Combine matrices
|
||||
computeMatrixOperations cbind -m matrix1.gz matrix2.gz -o combined.gz
|
||||
|
||||
# Extract specific samples by label (sample1 and sample3 are placeholder labels)
computeMatrixOperations subset -m matrix.gz --samples sample1 sample3 -o subset.gz
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### estimateReadFiltering
|
||||
|
||||
Predicts the impact of various filtering parameters without actually filtering. Helps optimize filtering strategies before running full analyses.
|
||||
|
||||
**Key Parameters:**
|
||||
- `--bamfiles, -b`: BAM files to analyze
|
||||
- `--sampleSize`: Number of reads to sample (default: 100,000)
|
||||
- `--binSize`: Bin size for analysis
|
||||
- `--distanceBetweenBins`: Spacing between sampled bins
|
||||
|
||||
**Filtration Options to Test:**
|
||||
- `--minMappingQuality`: Test quality thresholds
|
||||
- `--ignoreDuplicates`: Assess duplicate impact
|
||||
- `--minFragmentLength / --maxFragmentLength`: Test fragment filters
|
||||
|
||||
---
|
||||
|
||||
## Common Parameters Across Tools
|
||||
|
||||
Many deepTools commands share these filtering and performance options:
|
||||
|
||||
**Read Filtering:**
|
||||
- `--ignoreDuplicates`: Remove PCR duplicates
|
||||
- `--minMappingQuality`: Filter by alignment confidence
|
||||
- `--samFlagInclude / --samFlagExclude`: SAM format filtering
|
||||
- `--minFragmentLength / --maxFragmentLength`: Fragment length bounds
|
||||
|
||||
**Performance:**
|
||||
- `--numberOfProcessors, -p`: Enable parallel processing
|
||||
- `--region`: Process specific genomic regions (chr:start-end)
|
||||
|
||||
**Read Processing:**
|
||||
- `--extendReads`: Extend to fragment length
|
||||
- `--centerReads`: Center at fragment midpoint
|
||||
- `--ignoreDuplicates`: Count unique reads only
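
Because these options recur across tools, one way to keep them consistent is to hold them in a single mapping and build each command line from it. The sketch below is an assumption about how one might do that; `build_command`, `sample.bam`, and `sample.bw` are hypothetical names, and only flags already listed in this reference are used.

```python
import shlex


def build_command(tool, options):
    """Turn a tool name and an options dict into an argument list.

    A value of True emits a bare flag, a list emits several values,
    and None or False skips the option entirely.
    """
    args = [tool]
    for flag, value in options.items():
        if value is None or value is False:
            continue
        args.append(flag)
        if value is True:
            continue
        values = value if isinstance(value, (list, tuple)) else [value]
        args.extend(str(v) for v in values)
    return args


# Shared filtering and performance settings, defined once
shared = {
    "--minMappingQuality": 30,
    "--ignoreDuplicates": True,
    "--numberOfProcessors": 8,
    "--region": None,  # unset: process the whole genome
}

cmd = build_command("bamCoverage", {"--bam": "sample.bam", "-o": "sample.bw", **shared})
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, which avoids shell quoting issues with unusual file names.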
|
||||
474
skills/deeptools/references/workflows.md
Normal file
@@ -0,0 +1,474 @@
|
||||
# deepTools Common Workflows
|
||||
|
||||
This document provides complete workflow examples for common deepTools analyses.
|
||||
|
||||
## ChIP-seq Quality Control Workflow
|
||||
|
||||
Complete quality control assessment for ChIP-seq experiments.
|
||||
|
||||
### Step 1: Initial Correlation Assessment
|
||||
|
||||
Compare replicates and samples to verify experimental quality:
|
||||
|
||||
```bash
|
||||
# Generate coverage matrix across genome
|
||||
multiBamSummary bins \
|
||||
--bamfiles Input1.bam Input2.bam ChIP1.bam ChIP2.bam \
|
||||
--labels Input_rep1 Input_rep2 ChIP_rep1 ChIP_rep2 \
|
||||
-o readCounts.npz \
|
||||
--numberOfProcessors 8
|
||||
|
||||
# Create correlation heatmap
|
||||
plotCorrelation \
|
||||
-in readCounts.npz \
|
||||
--corMethod pearson \
|
||||
--whatToShow heatmap \
|
||||
--plotFile correlation_heatmap.png \
|
||||
--plotNumbers
|
||||
|
||||
# Generate PCA plot
|
||||
plotPCA \
|
||||
-in readCounts.npz \
|
||||
-o PCA_plot.png \
|
||||
-T "PCA of ChIP-seq samples"
|
||||
```
|
||||
|
||||
**Expected Results:**
|
||||
- Replicates should cluster together
|
||||
- Input samples should be distinct from ChIP samples
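
To make "replicates should cluster together" concrete, the sketch below computes a Pearson correlation on per-bin read counts by hand. The counts are made-up toy values and `pearson` is a hypothetical helper; `plotCorrelation` works on the full matrix from `multiBamSummary`, so treat this only as an illustration of the statistic.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length count vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length with at least 2 values")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant vectors")
    return cov / math.sqrt(var_x * var_y)


# Toy read counts in five genomic bins (hypothetical values)
chip_rep1 = [10, 50, 5, 80, 12]
chip_rep2 = [12, 48, 6, 75, 15]
input_rep1 = [20, 22, 19, 21, 20]

print("ChIP rep1 vs rep2:", round(pearson(chip_rep1, chip_rep2), 3))
print("ChIP rep1 vs Input:", round(pearson(chip_rep1, input_rep1), 3))
```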
|
||||
|
||||
---
|
||||
|
||||
### Step 2: Coverage and Depth Assessment
|
||||
|
||||
```bash
|
||||
# Check sequencing depth and coverage
|
||||
plotCoverage \
|
||||
--bamfiles Input1.bam ChIP1.bam ChIP2.bam \
|
||||
--labels Input ChIP_rep1 ChIP_rep2 \
|
||||
--plotFile coverage.png \
|
||||
--ignoreDuplicates \
|
||||
--numberOfProcessors 8
|
||||
```
|
||||
|
||||
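**Interpretation:** The plot shows what fraction of sampled positions is covered at each read depth. Use it to judge whether sequencing depth is adequate for downstream analysis and comparable across samples.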
|
||||
|
||||
---
|
||||
|
||||
### Step 3: Fragment Size Validation (Paired-end)
|
||||
|
||||
```bash
|
||||
# Verify expected fragment sizes
|
||||
bamPEFragmentSize \
|
||||
--bamfiles Input1.bam ChIP1.bam ChIP2.bam \
|
||||
--histogram fragmentSizes.png \
|
||||
--plotTitle "Fragment Size Distribution"
|
||||
```
|
||||
|
||||
**Expected Results:** Fragment sizes should match library preparation protocols (typically 200-600bp for ChIP-seq).
|
||||
|
||||
---
|
||||
|
||||
### Step 4: GC Bias Detection and Correction
|
||||
|
||||
```bash
|
||||
# Compute GC bias
|
||||
computeGCBias \
|
||||
--bamfile ChIP1.bam \
|
||||
--effectiveGenomeSize 2913022398 \
|
||||
--genome genome.2bit \
|
||||
--fragmentLength 200 \
|
||||
--biasPlot GCbias.png \
|
||||
--frequenciesFile freq.txt
|
||||
|
||||
# If bias detected, correct it
|
||||
correctGCBias \
|
||||
--bamfile ChIP1.bam \
|
||||
--effectiveGenomeSize 2913022398 \
|
||||
--genome genome.2bit \
|
||||
--GCbiasFrequenciesFile freq.txt \
|
||||
--correctedFile ChIP1_GCcorrected.bam
|
||||
```
|
||||
|
||||
**Note:** Only correct if significant bias is observed. Do NOT use `--ignoreDuplicates` with GC-corrected files.
|
||||
|
||||
---
|
||||
|
||||
### Step 5: ChIP Signal Strength Assessment
|
||||
|
||||
```bash
|
||||
# Evaluate ChIP enrichment quality
|
||||
plotFingerprint \
|
||||
--bamfiles Input1.bam ChIP1.bam ChIP2.bam \
|
||||
--labels Input ChIP_rep1 ChIP_rep2 \
|
||||
--plotFile fingerprint.png \
|
||||
--extendReads 200 \
|
||||
--ignoreDuplicates \
|
||||
--numberOfProcessors 8 \
|
||||
--outQualityMetrics fingerprint_metrics.txt
|
||||
```
|
||||
|
||||
**Interpretation:**
|
||||
- Strong ChIP: Curve stays low, then rises steeply at the highest-ranked bins (a small fraction of bins holds most of the reads)
|
||||
- Weak enrichment: Curve close to diagonal (input-like)
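
The shape of the fingerprint curve follows from ranking bins by read count and accumulating. The following is a minimal sketch of that idea with toy counts; `fingerprint_curve` is a hypothetical helper and not the `plotFingerprint` implementation, which samples bins genome-wide.

```python
def fingerprint_curve(bin_counts):
    """Return cumulative read fractions after ranking bins from lowest to highest count."""
    total = sum(bin_counts)
    if total == 0:
        raise ValueError("no reads in any bin")
    running = 0
    curve = []
    for count in sorted(bin_counts):
        running += count
        curve.append(running / total)
    return curve


input_like = [10, 10, 10, 10]  # reads spread evenly: curve follows the diagonal
chip_like = [1, 1, 1, 37]      # reads concentrated in one bin: late, steep rise

print(fingerprint_curve(input_like))
print(fingerprint_curve(chip_like))
```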
|
||||
|
||||
---
|
||||
|
||||
## ChIP-seq Analysis Workflow
|
||||
|
||||
Complete workflow from BAM files to publication-quality visualizations.
|
||||
|
||||
### Step 1: Generate Normalized Coverage Tracks
|
||||
|
||||
```bash
|
||||
# Input control
|
||||
bamCoverage \
|
||||
--bam Input.bam \
|
||||
--outFileName Input_coverage.bw \
|
||||
--normalizeUsing RPGC \
|
||||
--effectiveGenomeSize 2913022398 \
|
||||
--binSize 10 \
|
||||
--extendReads 200 \
|
||||
--ignoreDuplicates \
|
||||
--numberOfProcessors 8
|
||||
|
||||
# ChIP sample
|
||||
bamCoverage \
|
||||
--bam ChIP.bam \
|
||||
--outFileName ChIP_coverage.bw \
|
||||
--normalizeUsing RPGC \
|
||||
--effectiveGenomeSize 2913022398 \
|
||||
--binSize 10 \
|
||||
--extendReads 200 \
|
||||
--ignoreDuplicates \
|
||||
--numberOfProcessors 8
|
||||
```
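
RPGC scales coverage so that the genome-wide average becomes 1x. A simplified version of the arithmetic, assuming the deepTools-documented definition (mapped reads times fragment length, divided by effective genome size) and ignoring any read filtering, looks like this. The function name and the read count of 20 million are illustrative assumptions.

```python
def rpgc_scale_factor(mapped_reads, fragment_length, effective_genome_size):
    """Factor that rescales raw coverage to 1x average genome coverage."""
    current_coverage = mapped_reads * fragment_length / effective_genome_size
    return 1.0 / current_coverage


# Hypothetical library: 20 million reads extended to 200 bp, GRCh38-sized genome
factor = rpgc_scale_factor(20_000_000, 200, 2913022398)
print(f"Multiply raw bin coverage by {factor:.3f}")
```

A factor below 1 means the library already exceeds 1x coverage and is scaled down; a factor above 1 means it is scaled up.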
|
||||
|
||||
---
|
||||
|
||||
### Step 2: Create Log2 Ratio Track
|
||||
|
||||
```bash
|
||||
# Compare ChIP to Input
|
||||
bamCompare \
|
||||
--bamfile1 ChIP.bam \
|
||||
--bamfile2 Input.bam \
|
||||
--outFileName ChIP_vs_Input_log2ratio.bw \
|
||||
--operation log2 \
|
||||
--scaleFactorsMethod readCount \
|
||||
--binSize 10 \
|
||||
--extendReads 200 \
|
||||
--ignoreDuplicates \
|
||||
--numberOfProcessors 8
|
||||
```
|
||||
|
||||
**Result:** Log2 ratio track showing enrichment (positive values) and depletion (negative values).
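
The per-bin value in such a track can be sketched as follows. This is a simplified model under stated assumptions: read-count scaling brings the deeper sample down to the shallower one, and a pseudocount of 1 is added to both sides to avoid division by zero. The function names are hypothetical, and the exact order of operations inside `bamCompare` may differ.

```python
import math


def read_count_scale_factors(total_chip, total_input):
    """Scale the deeper sample down to the depth of the shallower one."""
    smallest = min(total_chip, total_input)
    return smallest / total_chip, smallest / total_input


def log2_ratio(chip_count, input_count, scale_chip=1.0, scale_input=1.0, pseudocount=1.0):
    """log2 of scaled ChIP over scaled Input for a single bin."""
    numerator = chip_count * scale_chip + pseudocount
    denominator = input_count * scale_input + pseudocount
    return math.log2(numerator / denominator)


print(log2_ratio(7, 1))   # enriched bin: positive value
print(log2_ratio(1, 7))   # depleted bin: negative value
print(read_count_scale_factors(20_000_000, 10_000_000))
```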
|
||||
|
||||
---
|
||||
|
||||
### Step 3: Compute Matrix Around TSS
|
||||
|
||||
```bash
|
||||
# Prepare data for heatmap/profile around transcription start sites
|
||||
computeMatrix reference-point \
|
||||
--referencePoint TSS \
|
||||
--scoreFileName ChIP_coverage.bw \
|
||||
--regionsFileName genes.bed \
|
||||
--beforeRegionStartLength 3000 \
|
||||
--afterRegionStartLength 3000 \
|
||||
--binSize 10 \
|
||||
--sortRegions descend \
|
||||
--sortUsing mean \
|
||||
--outFileName matrix_TSS.gz \
|
||||
--outFileNameMatrix matrix_TSS.tab \
|
||||
--numberOfProcessors 8
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 4: Generate Heatmap
|
||||
|
||||
```bash
|
||||
# Create heatmap around TSS
|
||||
plotHeatmap \
|
||||
--matrixFile matrix_TSS.gz \
|
||||
--outFileName heatmap_TSS.png \
|
||||
--colorMap RdBu \
|
||||
--whatToShow 'plot, heatmap and colorbar' \
|
||||
--zMin -3 --zMax 3 \
|
||||
--yAxisLabel "Genes" \
|
||||
--xAxisLabel "Distance from TSS (bp)" \
|
||||
--refPointLabel "TSS" \
|
||||
--heatmapHeight 15 \
|
||||
--kmeans 3
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 5: Generate Profile Plot
|
||||
|
||||
```bash
|
||||
# Create meta-profile around TSS
|
||||
plotProfile \
|
||||
--matrixFile matrix_TSS.gz \
|
||||
--outFileName profile_TSS.png \
|
||||
--plotType lines \
|
||||
--perGroup \
|
||||
--colors blue \
|
||||
--plotTitle "ChIP-seq signal around TSS" \
|
||||
--yAxisLabel "Average signal" \
|
||||
--xAxisLabel "Distance from TSS (bp)" \
|
||||
--refPointLabel "TSS"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 6: Enrichment at Peaks
|
||||
|
||||
```bash
|
||||
# Calculate enrichment in peak regions
|
||||
plotEnrichment \
|
||||
--bamfiles Input.bam ChIP.bam \
|
||||
--BED peaks.bed \
|
||||
--labels Input ChIP \
|
||||
--plotFile enrichment.png \
|
||||
--outRawCounts enrichment_counts.tab \
|
||||
--extendReads 200 \
|
||||
--ignoreDuplicates
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## RNA-seq Coverage Workflow
|
||||
|
||||
Generate strand-specific coverage tracks for RNA-seq data.
|
||||
|
||||
### Forward Strand
|
||||
|
||||
```bash
|
||||
bamCoverage \
|
||||
--bam rnaseq.bam \
|
||||
--outFileName forward_coverage.bw \
|
||||
--filterRNAstrand forward \
|
||||
--normalizeUsing CPM \
|
||||
--binSize 1 \
|
||||
--numberOfProcessors 8
|
||||
```
|
||||
|
||||
### Reverse Strand
|
||||
|
||||
```bash
|
||||
bamCoverage \
|
||||
--bam rnaseq.bam \
|
||||
--outFileName reverse_coverage.bw \
|
||||
--filterRNAstrand reverse \
|
||||
--normalizeUsing CPM \
|
||||
--binSize 1 \
|
||||
--numberOfProcessors 8
|
||||
```
|
||||
|
||||
**Important:** Do NOT use `--extendReads` for RNA-seq (would extend over splice junctions).
|
||||
|
||||
---
|
||||
|
||||
## Multi-Sample Comparison Workflow
|
||||
|
||||
Compare multiple ChIP-seq samples (e.g., different conditions or time points).
|
||||
|
||||
### Step 1: Generate Coverage Files
|
||||
|
||||
```bash
|
||||
# For each sample
|
||||
for sample in Control_ChIP Treated_ChIP; do
|
||||
bamCoverage \
|
||||
--bam ${sample}.bam \
|
||||
--outFileName ${sample}.bw \
|
||||
--normalizeUsing RPGC \
|
||||
--effectiveGenomeSize 2913022398 \
|
||||
--binSize 10 \
|
||||
--extendReads 200 \
|
||||
--ignoreDuplicates \
|
||||
--numberOfProcessors 8
|
||||
done
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 2: Compute Multi-Sample Matrix
|
||||
|
||||
```bash
|
||||
computeMatrix scale-regions \
|
||||
--scoreFileName Control_ChIP.bw Treated_ChIP.bw \
|
||||
--regionsFileName genes.bed \
|
||||
--beforeRegionStartLength 1000 \
|
||||
--afterRegionStartLength 1000 \
|
||||
--regionBodyLength 3000 \
|
||||
--binSize 10 \
|
||||
--sortRegions descend \
|
||||
--sortUsing mean \
|
||||
--outFileName matrix_multi.gz \
|
||||
--numberOfProcessors 8
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 3: Multi-Sample Heatmap
|
||||
|
||||
```bash
|
||||
plotHeatmap \
|
||||
--matrixFile matrix_multi.gz \
|
||||
--outFileName heatmap_comparison.png \
|
||||
--colorMap Blues \
|
||||
--whatToShow 'plot, heatmap and colorbar' \
|
||||
--samplesLabel Control Treated \
|
||||
--yAxisLabel "Genes" \
|
||||
--heatmapHeight 15 \
|
||||
--kmeans 4
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 4: Multi-Sample Profile
|
||||
|
||||
```bash
|
||||
plotProfile \
|
||||
--matrixFile matrix_multi.gz \
|
||||
--outFileName profile_comparison.png \
|
||||
--plotType lines \
|
||||
--perGroup \
|
||||
--colors blue red \
|
||||
--samplesLabel Control Treated \
|
||||
--plotTitle "ChIP-seq signal comparison" \
|
||||
--startLabel "TSS" \
|
||||
--endLabel "TES"
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## ATAC-seq Workflow
|
||||
|
||||
Specialized workflow for ATAC-seq data with Tn5 offset correction.
|
||||
|
||||
### Step 1: Shift Reads for Tn5 Correction
|
||||
|
||||
```bash
|
||||
alignmentSieve \
|
||||
--bam atacseq.bam \
|
||||
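    --outFile atacseq_shifted.unsorted.bam \
    --ATACshift \
    --minFragmentLength 38 \
    --maxFragmentLength 2000 \
    --ignoreDuplicates

# Shifted reads may no longer be coordinate-sorted: sort, then index
samtools sort -o atacseq_shifted.bam atacseq_shifted.unsorted.bam
samtools index atacseq_shifted.bam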
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 2: Generate Coverage Track
|
||||
|
||||
```bash
|
||||
bamCoverage \
|
||||
--bam atacseq_shifted.bam \
|
||||
--outFileName atacseq_coverage.bw \
|
||||
--normalizeUsing RPGC \
|
||||
--effectiveGenomeSize 2913022398 \
|
||||
--binSize 1 \
|
||||
--numberOfProcessors 8
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 3: Fragment Size Analysis
|
||||
|
||||
```bash
|
||||
bamPEFragmentSize \
|
||||
--bamfiles atacseq.bam \
|
||||
--histogram fragmentSizes_atac.png \
|
||||
--maxFragmentLength 1000
|
||||
```
|
||||
|
||||
**Expected Pattern:** Nucleosome ladder with peaks at ~50bp (nucleosome-free), ~200bp (mono-nucleosome), ~400bp (di-nucleosome).
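
One quick way to check for this ladder is to bin fragment lengths into classes and count them. The length windows below are illustrative assumptions chosen around the peaks named above, not deepTools defaults, and `classify_fragment` and `summarize` are hypothetical helpers.

```python
from collections import Counter

# Illustrative length windows in bp (assumed cut-offs, adjust to your data)
FRAGMENT_CLASSES = [
    ("nucleosome_free", 0, 100),
    ("mono_nucleosome", 150, 300),
    ("di_nucleosome", 300, 500),
]


def classify_fragment(length):
    """Return the class name for a fragment length, or 'other' if none matches."""
    for name, low, high in FRAGMENT_CLASSES:
        if low <= length < high:
            return name
    return "other"


def summarize(lengths):
    """Count fragments per class."""
    return Counter(classify_fragment(n) for n in lengths)


# Hypothetical fragment lengths
print(summarize([50, 60, 200, 400, 120]))
```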
|
||||
|
||||
---
|
||||
|
||||
## Peak Region Analysis Workflow
|
||||
|
||||
Analyze ChIP-seq signal specifically at peak regions.
|
||||
|
||||
### Step 1: Matrix at Peaks
|
||||
|
||||
```bash
|
||||
computeMatrix reference-point \
|
||||
--referencePoint center \
|
||||
--scoreFileName ChIP_coverage.bw \
|
||||
--regionsFileName peaks.bed \
|
||||
--beforeRegionStartLength 2000 \
|
||||
--afterRegionStartLength 2000 \
|
||||
--binSize 10 \
|
||||
--outFileName matrix_peaks.gz \
|
||||
--numberOfProcessors 8
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
### Step 2: Heatmap at Peaks
|
||||
|
||||
```bash
|
||||
plotHeatmap \
|
||||
--matrixFile matrix_peaks.gz \
|
||||
--outFileName heatmap_peaks.png \
|
||||
--colorMap YlOrRd \
|
||||
--refPointLabel "Peak Center" \
|
||||
--heatmapHeight 15 \
|
||||
--sortUsing max
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Troubleshooting Common Issues
|
||||
|
||||
### Issue: Out of Memory
|
||||
**Solution:** Use `--region` parameter to process chromosomes individually:
|
||||
```bash
|
||||
bamCoverage --bam input.bam -o chr1.bw --region chr1
|
||||
```
|
||||
|
||||
### Issue: BAM Index Missing
|
||||
**Solution:** Index BAM files before running deepTools:
|
||||
```bash
|
||||
samtools index input.bam
|
||||
```
|
||||
|
||||
### Issue: Slow Processing
|
||||
**Solution:** Increase `--numberOfProcessors`:
|
||||
```bash
|
||||
# Use 8 cores instead of the default of 1
|
||||
--numberOfProcessors 8
|
||||
```
|
||||
|
||||
### Issue: bigWig Files Too Large
|
||||
**Solution:** Increase bin size:
|
||||
```bash
|
||||
--binSize 50  # or larger; defaults differ by tool (e.g. 50 for bamCoverage, 10 for computeMatrix)
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Performance Tips
|
||||
|
||||
1. **Use multiple processors:** Set `--numberOfProcessors` to the number of cores you can spare (deepTools also accepts `max` and `max/2`)
|
||||
2. **Process regions:** Use `--region` for testing or memory-limited environments
|
||||
3. **Adjust bin size:** Larger bins = faster processing and smaller files
|
||||
4. **Pre-filter BAM files:** Use `alignmentSieve` to create filtered BAM files once, then reuse
|
||||
5. **Use bigWig over bedGraph:** bigWig format is compressed and faster to process
|
||||
|
||||
---
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always check QC first:** Run correlation, coverage, and fingerprint analysis before proceeding
|
||||
2. **Document parameters:** Save command lines for reproducibility
|
||||
3. **Use consistent normalization:** Apply same normalization method across samples in a comparison
|
||||
4. **Verify reference genome match:** Ensure BAM files and region files use same genome build
|
||||
5. **Check strand orientation:** For RNA-seq, verify correct strand orientation
|
||||
6. **Test on small regions first:** Use `--region chr1:1-1000000` for testing parameters
|
||||
7. **Keep intermediate files:** Save matrices for regenerating plots with different settings
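
When scripting test runs on small regions, it can help to validate the region string before launching a long job. The parser below is a minimal sketch that handles only the `chr:start-end` form used in this document (and a bare chromosome name); `parse_region` is a hypothetical helper, not part of deepTools.

```python
import re

# Chromosome name, optionally followed by :start-end
REGION_PATTERN = re.compile(r"^(?P<chrom>[^:]+)(?::(?P<start>\d+)-(?P<end>\d+))?$")


def parse_region(region):
    """Split 'chr1:1-1000000' into (chrom, start, end); bare 'chr1' gives (chrom, None, None)."""
    match = REGION_PATTERN.match(region)
    if match is None:
        raise ValueError(f"unrecognized region: {region!r}")
    chrom = match.group("chrom")
    if match.group("start") is None:
        return chrom, None, None
    start = int(match.group("start"))
    end = int(match.group("end"))
    if start >= end:
        raise ValueError(f"region start must be smaller than end: {region!r}")
    return chrom, start, end


print(parse_region("chr1:1-1000000"))
print(parse_region("chr1"))
```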
|
||||
195
skills/deeptools/scripts/validate_files.py
Normal file
@@ -0,0 +1,195 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
deepTools File Validation Script
|
||||
|
||||
Validates BAM, bigWig, and BED files for deepTools analysis.
|
||||
Checks for file existence, proper indexing, and basic format requirements.
|
||||
"""
|
||||
|
||||
import argparse
import os
import sys
|
||||
|
||||
|
||||
def check_file_exists(filepath):
    """Check if file exists and is readable."""
    if not os.path.exists(filepath):
        return False, f"File not found: {filepath}"
    if not os.access(filepath, os.R_OK):
        return False, f"File not readable: {filepath}"
    return True, f"✓ File exists: {filepath}"
|
||||
|
||||
|
||||
def check_bam_index(bam_file):
    """Check if BAM file has an index (<name>.bam.bai or <name>.bai)."""
    bai_file1 = bam_file + ".bai"
    # Swap only the final extension; str.replace would also touch ".bam"
    # elsewhere in the path and return the BAM path itself when ".bam" is absent
    bai_file2 = os.path.splitext(bam_file)[0] + ".bai"

    if os.path.exists(bai_file1):
        return True, f"✓ BAM index found: {bai_file1}"
    elif os.path.exists(bai_file2):
        return True, f"✓ BAM index found: {bai_file2}"
    else:
        return False, f"✗ BAM index missing for: {bam_file}\n  Run: samtools index {bam_file}"
|
||||
|
||||
|
||||
def check_bigwig_file(bw_file):
    """Basic check for bigWig file."""
    # Check file size (bigWig files should have reasonable size)
    file_size = os.path.getsize(bw_file)
    if file_size < 100:
        return False, f"✗ bigWig file suspiciously small: {bw_file} ({file_size} bytes)"
    return True, f"✓ bigWig file appears valid: {bw_file} ({file_size} bytes)"
|
||||
|
||||
|
||||
def check_bed_file(bed_file):
    """Basic validation of BED file format."""
    try:
        with open(bed_file, 'r') as f:
            # Keep the original line numbers so error messages point at the right line;
            # skip blank lines, comments, and UCSC track/browser header lines
            lines = [
                (num, line.strip())
                for num, line in enumerate(f, 1)
                if line.strip() and not line.startswith(('#', 'track', 'browser'))
            ]

        if len(lines) == 0:
            return False, f"✗ BED file is empty: {bed_file}"

        # Check first few lines for basic format
        for i, line in lines[:10]:
            fields = line.split('\t')
            if len(fields) < 3:
                return False, f"✗ BED file format error at line {i}: expected at least 3 columns\n  Line: {line}"

            # Check if start and end are integers
            try:
                start = int(fields[1])
                end = int(fields[2])
                if start >= end:
                    return False, f"✗ BED file error at line {i}: start >= end ({start} >= {end})"
            except ValueError:
                return False, f"✗ BED file format error at line {i}: start and end must be integers\n  Line: {line}"

        return True, f"✓ BED file format appears valid: {bed_file} ({len(lines)} regions)"

    except Exception as e:
        return False, f"✗ Error reading BED file: {bed_file}\n  Error: {str(e)}"
|
||||
|
||||
|
||||
def validate_files(bam_files=None, bigwig_files=None, bed_files=None):
    """
    Validate all provided files.

    Args:
        bam_files: List of BAM file paths
        bigwig_files: List of bigWig file paths
        bed_files: List of BED file paths

    Returns:
        Tuple of (success: bool, messages: list)
    """
    all_success = True
    messages = []

    # Each file type gets an existence check followed by a type-specific check
    checks = [
        ("BAM", bam_files, check_bam_index),
        ("bigWig", bigwig_files, check_bigwig_file),
        ("BED", bed_files, check_bed_file),
    ]

    for label, files, type_check in checks:
        if not files:
            continue
        messages.append(f"\n=== Validating {label} Files ===")
        for path in files:
            # Check existence
            success, msg = check_file_exists(path)
            messages.append(msg)
            if not success:
                all_success = False
                continue

            # Type-specific check (index, size, or format)
            success, msg = type_check(path)
            messages.append(msg)
            if not success:
                all_success = False

    return all_success, messages
|
||||
|
||||
|
||||
def main():
    parser = argparse.ArgumentParser(
        description="Validate files for deepTools analysis",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Validate BAM files
  python validate_files.py --bam sample1.bam sample2.bam

  # Validate all file types
  python validate_files.py --bam input.bam chip.bam --bed peaks.bed --bigwig signal.bw

  # Validate from a directory
  python validate_files.py --bam *.bam --bed *.bed
"""
    )

    parser.add_argument('--bam', nargs='+', help='BAM files to validate')
    parser.add_argument('--bigwig', '--bw', nargs='+', help='bigWig files to validate')
    parser.add_argument('--bed', nargs='+', help='BED files to validate')

    args = parser.parse_args()

    # Check if any files were provided
    if not any([args.bam, args.bigwig, args.bed]):
        parser.print_help()
        sys.exit(1)

    # Run validation
    success, messages = validate_files(
        bam_files=args.bam,
        bigwig_files=args.bigwig,
        bed_files=args.bed
    )

    # Print results
    for msg in messages:
        print(msg)

    # Summary
    print("\n" + "="*50)
    if success:
        print("✓ All validations passed!")
        sys.exit(0)
    else:
        print("✗ Some validations failed. Please fix the issues above.")
        sys.exit(1)
|
||||
|
||||
|
||||
if __name__ == "__main__":
    main()
|
||||
454
skills/deeptools/scripts/workflow_generator.py
Normal file
@@ -0,0 +1,454 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
deepTools Workflow Generator
|
||||
|
||||
Generates bash script templates for common deepTools workflows.
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import sys
|
||||
|
||||
|
||||
WORKFLOWS = {
|
||||
'chipseq_qc': {
|
||||
'name': 'ChIP-seq Quality Control',
|
||||
'description': 'Complete QC workflow for ChIP-seq experiments',
|
||||
},
|
||||
'chipseq_analysis': {
|
||||
'name': 'ChIP-seq Complete Analysis',
|
||||
'description': 'Full ChIP-seq analysis from BAM to heatmaps',
|
||||
},
|
||||
'rnaseq_coverage': {
|
||||
'name': 'RNA-seq Coverage Tracks',
|
||||
'description': 'Generate strand-specific RNA-seq coverage',
|
||||
},
|
||||
'atacseq': {
|
||||
'name': 'ATAC-seq Analysis',
|
||||
'description': 'ATAC-seq workflow with Tn5 correction',
|
||||
},
|
||||
}
|
||||
|
||||
|
||||
def generate_chipseq_qc_workflow(output_file, params):
    """Generate ChIP-seq QC workflow script."""

    script = f"""#!/bin/bash
|
||||
# deepTools ChIP-seq Quality Control Workflow
|
||||
# Generated by deepTools workflow generator
|
||||
|
||||
# Configuration
|
||||
INPUT_BAM="{params.get('input_bam', 'Input.bam')}"
|
||||
# One array element per space-separated BAM path
CHIP_BAM=({params.get('chip_bams', 'ChIP1.bam ChIP2.bam')})
|
||||
GENOME_SIZE={params.get('genome_size', '2913022398')}
|
||||
THREADS={params.get('threads', '8')}
|
||||
OUTPUT_DIR="{params.get('output_dir', 'deeptools_qc')}"
|
||||
|
||||
# Create output directory
|
||||
mkdir -p $OUTPUT_DIR
|
||||
|
||||
echo "=== Starting ChIP-seq QC workflow ==="
|
||||
|
||||
# Step 1: Correlation analysis
|
||||
echo "Step 1: Computing correlation matrix..."
|
||||
multiBamSummary bins \\
|
||||
--bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
|
||||
-o $OUTPUT_DIR/readCounts.npz \\
|
||||
--numberOfProcessors $THREADS
|
||||
|
||||
echo "Step 2: Generating correlation heatmap..."
|
||||
plotCorrelation \\
|
||||
-in $OUTPUT_DIR/readCounts.npz \\
|
||||
--corMethod pearson \\
|
||||
--whatToShow heatmap \\
|
||||
--plotFile $OUTPUT_DIR/correlation_heatmap.png \\
|
||||
--plotNumbers
|
||||
|
||||
echo "Step 3: Generating PCA plot..."
|
||||
plotPCA \\
|
||||
-in $OUTPUT_DIR/readCounts.npz \\
|
||||
-o $OUTPUT_DIR/PCA_plot.png \\
|
||||
-T "PCA of ChIP-seq samples"
|
||||
|
||||
# Step 4: Coverage assessment
|
||||
echo "Step 4: Assessing coverage..."
|
||||
plotCoverage \\
|
||||
--bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
|
||||
--plotFile $OUTPUT_DIR/coverage.png \\
|
||||
--ignoreDuplicates \\
|
||||
--numberOfProcessors $THREADS
|
||||
|
||||
# Step 5: Fragment size (for paired-end data)
|
||||
echo "Step 5: Analyzing fragment sizes..."
|
||||
bamPEFragmentSize \\
|
||||
--bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
|
||||
--histogram $OUTPUT_DIR/fragmentSizes.png \\
|
||||
--plotTitle "Fragment Size Distribution"
|
||||
|
||||
# Step 6: ChIP signal strength
|
||||
echo "Step 6: Evaluating ChIP enrichment..."
|
||||
plotFingerprint \\
|
||||
--bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
|
||||
--plotFile $OUTPUT_DIR/fingerprint.png \\
|
||||
--extendReads 200 \\
|
||||
--ignoreDuplicates \\
|
||||
--numberOfProcessors $THREADS \\
|
||||
--outQualityMetrics $OUTPUT_DIR/fingerprint_metrics.txt
|
||||
|
||||
echo "=== ChIP-seq QC workflow complete ==="
|
||||
echo "Results are in: $OUTPUT_DIR"
|
||||
"""
|
||||
|
||||
    with open(output_file, 'w') as f:
        f.write(script)

    return f"✓ Generated ChIP-seq QC workflow: {output_file}"
|
||||
|
||||
|
||||
def generate_chipseq_analysis_workflow(output_file, params):
    """Generate complete ChIP-seq analysis workflow script."""

    script = f"""#!/bin/bash
|
||||
# deepTools ChIP-seq Complete Analysis Workflow
|
||||
# Generated by deepTools workflow generator
|
||||
|
||||
# Configuration
|
||||
INPUT_BAM="{params.get('input_bam', 'Input.bam')}"
|
||||
CHIP_BAM="{params.get('chip_bam', 'ChIP.bam')}"
|
||||
GENES_BED="{params.get('genes_bed', 'genes.bed')}"
|
||||
PEAKS_BED="{params.get('peaks_bed', 'peaks.bed')}"
|
||||
GENOME_SIZE={params.get('genome_size', '2913022398')}
|
||||
THREADS={params.get('threads', '8')}
|
||||
OUTPUT_DIR="{params.get('output_dir', 'chipseq_analysis')}"
|
||||
|
||||
# Create output directory
|
||||
mkdir -p $OUTPUT_DIR
|
||||
|
||||
echo "=== Starting ChIP-seq analysis workflow ==="
|
||||
|
||||
# Step 1: Generate normalized coverage tracks
|
||||
echo "Step 1: Generating coverage tracks..."
|
||||
|
||||
bamCoverage \\
|
||||
--bam $INPUT_BAM \\
|
||||
--outFileName $OUTPUT_DIR/Input_coverage.bw \\
|
||||
--normalizeUsing RPGC \\
|
||||
--effectiveGenomeSize $GENOME_SIZE \\
|
||||
--binSize 10 \\
|
||||
--extendReads 200 \\
|
||||
--ignoreDuplicates \\
|
||||
--numberOfProcessors $THREADS
|
||||
|
||||
bamCoverage \\
|
||||
--bam $CHIP_BAM \\
|
||||
--outFileName $OUTPUT_DIR/ChIP_coverage.bw \\
|
||||
--normalizeUsing RPGC \\
|
||||
--effectiveGenomeSize $GENOME_SIZE \\
|
||||
--binSize 10 \\
|
||||
--extendReads 200 \\
|
||||
--ignoreDuplicates \\
|
||||
--numberOfProcessors $THREADS
|
||||
|
||||
# Step 2: Create log2 ratio track
|
||||
echo "Step 2: Creating log2 ratio track..."
|
||||
bamCompare \\
|
||||
--bamfile1 $CHIP_BAM \\
|
||||
--bamfile2 $INPUT_BAM \\
|
||||
--outFileName $OUTPUT_DIR/ChIP_vs_Input_log2ratio.bw \\
|
||||
--operation log2 \\
|
||||
--scaleFactorsMethod readCount \\
|
||||
--binSize 10 \\
|
||||
--extendReads 200 \\
|
||||
--ignoreDuplicates \\
|
||||
--numberOfProcessors $THREADS
|
||||
|
||||
# Step 3: Compute matrix around TSS
|
||||
echo "Step 3: Computing matrix around TSS..."
|
||||
computeMatrix reference-point \\
|
||||
--referencePoint TSS \\
|
||||
--scoreFileName $OUTPUT_DIR/ChIP_coverage.bw \\
|
||||
--regionsFileName $GENES_BED \\
|
||||
--beforeRegionStartLength 3000 \\
|
||||
--afterRegionStartLength 3000 \\
|
||||
--binSize 10 \\
|
||||
--sortRegions descend \\
|
||||
--sortUsing mean \\
|
||||
--outFileName $OUTPUT_DIR/matrix_TSS.gz \\
|
||||
--numberOfProcessors $THREADS
|
||||
|
||||
# Step 4: Generate heatmap
|
||||
echo "Step 4: Generating heatmap..."
|
||||
plotHeatmap \\
|
||||
--matrixFile $OUTPUT_DIR/matrix_TSS.gz \\
|
||||
--outFileName $OUTPUT_DIR/heatmap_TSS.png \\
|
||||
--colorMap RdBu \\
|
||||
--whatToShow 'plot, heatmap and colorbar' \\
|
||||
--yAxisLabel "Genes" \\
|
||||
--xAxisLabel "Distance from TSS (bp)" \\
|
||||
--refPointLabel "TSS" \\
|
||||
--heatmapHeight 15 \\
|
||||
--kmeans 3
|
||||
|
||||
# Step 5: Generate profile plot
|
||||
echo "Step 5: Generating profile plot..."
|
||||
plotProfile \\
|
||||
--matrixFile $OUTPUT_DIR/matrix_TSS.gz \\
|
||||
--outFileName $OUTPUT_DIR/profile_TSS.png \\
|
||||
--plotType lines \\
|
||||
--perGroup \\
|
||||
--colors blue \\
|
||||
--plotTitle "ChIP-seq signal around TSS" \\
|
||||
--yAxisLabel "Average signal" \\
|
||||
--refPointLabel "TSS"
|
||||
|
||||
# Step 6: Enrichment at peaks (if peaks provided)
|
||||
if [ -f "$PEAKS_BED" ]; then
|
||||
echo "Step 6: Calculating enrichment at peaks..."
|
||||
plotEnrichment \\
|
||||
--bamfiles $INPUT_BAM $CHIP_BAM \\
|
||||
--BED $PEAKS_BED \\
|
||||
--labels Input ChIP \\
|
||||
--plotFile $OUTPUT_DIR/enrichment.png \\
|
||||
--outRawCounts $OUTPUT_DIR/enrichment_counts.tab \\
|
||||
--extendReads 200 \\
|
||||
--ignoreDuplicates
|
||||
fi
|
||||
|
||||
echo "=== ChIP-seq analysis complete ==="
|
||||
echo "Results are in: $OUTPUT_DIR"
|
||||
"""
|
||||
|
||||
    with open(output_file, 'w') as f:
        f.write(script)

    return f"✓ Generated ChIP-seq analysis workflow: {output_file}"
|
||||
|
||||
|
||||
def generate_rnaseq_coverage_workflow(output_file, params):
    """Generate RNA-seq coverage workflow script."""

    script = f"""#!/bin/bash
|
||||
# deepTools RNA-seq Coverage Workflow
|
||||
# Generated by deepTools workflow generator
|
||||
|
||||
# Configuration
|
||||
RNASEQ_BAM="{params.get('rnaseq_bam', 'rnaseq.bam')}"
|
||||
THREADS={params.get('threads', '8')}
|
||||
OUTPUT_DIR="{params.get('output_dir', 'rnaseq_coverage')}"
|
||||
|
||||
# Create output directory
|
||||
mkdir -p $OUTPUT_DIR
|
||||
|
||||
echo "=== Starting RNA-seq coverage workflow ==="
|
||||
|
||||
# Generate strand-specific coverage tracks
|
||||
echo "Step 1: Generating forward strand coverage..."
|
||||
bamCoverage \\
|
||||
--bam $RNASEQ_BAM \\
|
||||
--outFileName $OUTPUT_DIR/forward_coverage.bw \\
|
||||
--filterRNAstrand forward \\
|
||||
--normalizeUsing CPM \\
|
||||
--binSize 1 \\
|
||||
--numberOfProcessors $THREADS
|
||||
|
||||
echo "Step 2: Generating reverse strand coverage..."
|
||||
bamCoverage \\
|
||||
--bam $RNASEQ_BAM \\
|
||||
--outFileName $OUTPUT_DIR/reverse_coverage.bw \\
|
||||
--filterRNAstrand reverse \\
|
||||
--normalizeUsing CPM \\
|
||||
--binSize 1 \\
|
||||
--numberOfProcessors $THREADS
|
||||
|
||||
echo "=== RNA-seq coverage workflow complete ==="
|
||||
echo "Results are in: $OUTPUT_DIR"
|
||||
echo ""
|
||||
echo "Note: These bigWig files can be loaded into genome browsers"
|
||||
echo "for strand-specific visualization of RNA-seq data."
|
||||
"""
|
||||
|
||||
    with open(output_file, 'w') as f:
        f.write(script)

    return f"✓ Generated RNA-seq coverage workflow: {output_file}"
|
||||
|
||||
|
||||
def generate_atacseq_workflow(output_file, params):
    """Generate ATAC-seq workflow script."""

    script = f"""#!/bin/bash
|
||||
# deepTools ATAC-seq Analysis Workflow
|
||||
# Generated by deepTools workflow generator
|
||||
|
||||
# Configuration
|
||||
ATAC_BAM="{params.get('atac_bam', 'atacseq.bam')}"
|
||||
PEAKS_BED="{params.get('peaks_bed', 'peaks.bed')}"
|
||||
GENOME_SIZE={params.get('genome_size', '2913022398')}
|
||||
THREADS={params.get('threads', '8')}
|
||||
OUTPUT_DIR="{params.get('output_dir', 'atacseq_analysis')}"
|
||||
|
||||
# Create output directory
|
||||
mkdir -p $OUTPUT_DIR
|
||||
|
||||
echo "=== Starting ATAC-seq analysis workflow ==="
|
||||
|
||||
# Step 1: Shift reads for Tn5 correction
|
||||
echo "Step 1: Applying Tn5 offset correction..."
|
||||
alignmentSieve \\
|
||||
--bam $ATAC_BAM \\
|
||||
    --outFile $OUTPUT_DIR/atacseq_shifted.unsorted.bam \\
    --ATACshift \\
    --minFragmentLength 38 \\
    --maxFragmentLength 2000 \\
    --ignoreDuplicates

# Shifted reads may no longer be coordinate-sorted: sort, then index
samtools sort -o $OUTPUT_DIR/atacseq_shifted.bam $OUTPUT_DIR/atacseq_shifted.unsorted.bam
samtools index $OUTPUT_DIR/atacseq_shifted.bam
|
||||
|
||||
# Step 2: Generate coverage track
|
||||
echo "Step 2: Generating coverage track..."
|
||||
bamCoverage \\
|
||||
--bam $OUTPUT_DIR/atacseq_shifted.bam \\
|
||||
--outFileName $OUTPUT_DIR/atacseq_coverage.bw \\
|
||||
--normalizeUsing RPGC \\
|
||||
--effectiveGenomeSize $GENOME_SIZE \\
|
||||
--binSize 1 \\
|
||||
--numberOfProcessors $THREADS
|
||||
|
||||
# Step 3: Fragment size analysis
|
||||
echo "Step 3: Analyzing fragment sizes..."
|
||||
bamPEFragmentSize \\
|
||||
--bamfiles $ATAC_BAM \\
|
||||
--histogram $OUTPUT_DIR/fragmentSizes.png \\
|
||||
--maxFragmentLength 1000
|
||||
|
||||
# Step 4: Compute matrix at peaks (if peaks provided)
if [ -f "$PEAKS_BED" ]; then
    echo "Step 4: Computing matrix at peaks..."
    computeMatrix reference-point \\
        --referencePoint center \\
        --scoreFileName $OUTPUT_DIR/atacseq_coverage.bw \\
        --regionsFileName $PEAKS_BED \\
        --beforeRegionStartLength 2000 \\
        --afterRegionStartLength 2000 \\
        --binSize 10 \\
        --outFileName $OUTPUT_DIR/matrix_peaks.gz \\
        --numberOfProcessors $THREADS

    echo "Step 5: Generating heatmap..."
    plotHeatmap \\
        --matrixFile $OUTPUT_DIR/matrix_peaks.gz \\
        --outFileName $OUTPUT_DIR/heatmap_peaks.png \\
        --colorMap YlOrRd \\
        --refPointLabel "Peak Center" \\
        --heatmapHeight 15
fi

echo "=== ATAC-seq analysis complete ==="
echo "Results are in: $OUTPUT_DIR"
echo ""
echo "Expected fragment size pattern:"
echo " ~50bp: nucleosome-free regions"
echo " ~200bp: mono-nucleosome"
echo " ~400bp: di-nucleosome"
"""

    with open(output_file, 'w') as f:
        f.write(script)

    return f"✓ Generated ATAC-seq workflow: {output_file}"


def main():
    parser = argparse.ArgumentParser(
        description="Generate deepTools workflow scripts",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog=f"""
Available workflows:
{chr(10).join(f"  {key}: {value['name']}" for key, value in WORKFLOWS.items())}

Examples:
  # Generate ChIP-seq QC workflow
  python workflow_generator.py chipseq_qc -o chipseq_qc.sh

  # Generate ChIP-seq analysis with custom parameters
  python workflow_generator.py chipseq_analysis -o analysis.sh \\
      --chip-bam H3K4me3.bam --input-bam Input.bam

  # List all available workflows
  python workflow_generator.py --list
"""
    )

    parser.add_argument('workflow', nargs='?', choices=list(WORKFLOWS.keys()),
                        help='Workflow type to generate')
    parser.add_argument('-o', '--output', default='deeptools_workflow.sh',
                        help='Output script filename (default: deeptools_workflow.sh)')
    parser.add_argument('--list', action='store_true',
                        help='List all available workflows')

    # Common parameters
    parser.add_argument('--threads', type=int, default=8,
                        help='Number of threads (default: 8)')
    parser.add_argument('--genome-size', type=int, default=2913022398,
                        help='Effective genome size (default: 2913022398 for hg38)')
    parser.add_argument('--output-dir', default=None,
                        help='Output directory for results')

    # Workflow-specific parameters
    parser.add_argument('--input-bam', help='Input/control BAM file')
    parser.add_argument('--chip-bam', help='ChIP BAM file')
    parser.add_argument('--chip-bams', help='Multiple ChIP BAM files (space-separated)')
    parser.add_argument('--rnaseq-bam', help='RNA-seq BAM file')
    parser.add_argument('--atac-bam', help='ATAC-seq BAM file')
    parser.add_argument('--genes-bed', help='Genes BED file')
    parser.add_argument('--peaks-bed', help='Peaks BED file')

    args = parser.parse_args()

    # List workflows
    if args.list:
        print("\nAvailable deepTools workflows:\n")
        for key, value in WORKFLOWS.items():
            print(f" {key}")
            print(f" {value['name']}")
            print(f" {value['description']}\n")
        sys.exit(0)

    # Check if workflow was specified
    if not args.workflow:
        parser.print_help()
        sys.exit(1)

    # Prepare parameters
    params = {
        'threads': args.threads,
        'genome_size': args.genome_size,
        'output_dir': args.output_dir or f"{args.workflow}_output",
        'input_bam': args.input_bam,
        'chip_bam': args.chip_bam,
        'chip_bams': args.chip_bams,
        'rnaseq_bam': args.rnaseq_bam,
        'atac_bam': args.atac_bam,
        'genes_bed': args.genes_bed,
        'peaks_bed': args.peaks_bed,
    }
    # Drop options that were not given on the command line. Otherwise the
    # generators' params.get(key, default) calls would return None and write
    # the literal string "None" into the script instead of the default.
    params = {key: value for key, value in params.items() if value is not None}

    # Generate workflow
    if args.workflow == 'chipseq_qc':
        message = generate_chipseq_qc_workflow(args.output, params)
    elif args.workflow == 'chipseq_analysis':
        message = generate_chipseq_analysis_workflow(args.output, params)
    elif args.workflow == 'rnaseq_coverage':
        message = generate_rnaseq_coverage_workflow(args.output, params)
    elif args.workflow == 'atacseq':
        message = generate_atacseq_workflow(args.output, params)

    print(message)
    print("\nTo run the workflow:")
    print(f" chmod +x {args.output}")
    print(f" ./{args.output}")
    print("\nNote: Edit the script to customize file paths and parameters.")


if __name__ == "__main__":
    main()