Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/deeptools/references/workflows.md
+++ b/skills/deeptools/references/workflows.md
@@ -0,0 +1,474 @@
+# deepTools Common Workflows
+
+This document provides complete workflow examples for common deepTools analyses.
+
+## ChIP-seq Quality Control Workflow
+
+Complete quality control assessment for ChIP-seq experiments.
+
+### Step 1: Initial Correlation Assessment
+
+Compare replicates and samples to verify experimental quality:
+
+```bash
+# Generate coverage matrix across genome
+multiBamSummary bins \
+    --bamfiles Input1.bam Input2.bam ChIP1.bam ChIP2.bam \
+    --labels Input_rep1 Input_rep2 ChIP_rep1 ChIP_rep2 \
+    -o readCounts.npz \
+    --numberOfProcessors 8
+
+# Create correlation heatmap
+plotCorrelation \
+    -in readCounts.npz \
+    --corMethod pearson \
+    --whatToShow heatmap \
+    --plotFile correlation_heatmap.png \
+    --plotNumbers
+
+# Generate PCA plot
+plotPCA \
+    -in readCounts.npz \
+    -o PCA_plot.png \
+    -T "PCA of ChIP-seq samples"
+```
+
+**Expected Results:**
+- Replicates should cluster together
+- Input samples should be distinct from ChIP samples
+
+---
+
+### Step 2: Coverage and Depth Assessment
+
+```bash
+# Check sequencing depth and coverage
+plotCoverage \
+    --bamfiles Input1.bam ChIP1.bam ChIP2.bam \
+    --labels Input ChIP_rep1 ChIP_rep2 \
+    --plotFile coverage.png \
+    --ignoreDuplicates \
+    --numberOfProcessors 8
+```
+
+**Interpretation:** Assess whether sequencing depth is adequate for downstream analysis.
+
+---
+
+### Step 3: Fragment Size Validation (Paired-end)
+
+```bash
+# Verify expected fragment sizes
+bamPEFragmentSize \
+    --bamfiles Input1.bam ChIP1.bam ChIP2.bam \
+    --histogram fragmentSizes.png \
+    --plotTitle "Fragment Size Distribution"
+```
+
+**Expected Results:** Fragment sizes should match library preparation protocols (typically 200-600bp for ChIP-seq).
+
+---
+
+### Step 4: GC Bias Detection and Correction
+
+```bash
+# Compute GC bias
+computeGCBias \
+    --bamfile ChIP1.bam \
+    --effectiveGenomeSize 2913022398 \
+    --genome genome.2bit \
+    --fragmentLength 200 \
+    --biasPlot GCbias.png \
+    --frequenciesFile freq.txt
+
+# If bias detected, correct it
+correctGCBias \
+    --bamfile ChIP1.bam \
+    --effectiveGenomeSize 2913022398 \
+    --genome genome.2bit \
+    --GCbiasFrequenciesFile freq.txt \
+    --correctedFile ChIP1_GCcorrected.bam
+```
+
+**Note:** Only correct if significant bias is observed. Do NOT use `--ignoreDuplicates` with GC-corrected files.
+
+---
+
+### Step 5: ChIP Signal Strength Assessment
+
+```bash
+# Evaluate ChIP enrichment quality
+plotFingerprint \
+    --bamfiles Input1.bam ChIP1.bam ChIP2.bam \
+    --labels Input ChIP_rep1 ChIP_rep2 \
+    --plotFile fingerprint.png \
+    --extendReads 200 \
+    --ignoreDuplicates \
+    --numberOfProcessors 8 \
+    --outQualityMetrics fingerprint_metrics.txt
+```
+
+**Interpretation:**
+- Strong ChIP: Steep rise in cumulative curve
+- Weak enrichment: Curve close to diagonal (input-like)
+
+---
+
+## ChIP-seq Analysis Workflow
+
+Complete workflow from BAM files to publication-quality visualizations.
+
+### Step 1: Generate Normalized Coverage Tracks
+
+```bash
+# Input control
+bamCoverage \
+    --bam Input.bam \
+    --outFileName Input_coverage.bw \
+    --normalizeUsing RPGC \
+    --effectiveGenomeSize 2913022398 \
+    --binSize 10 \
+    --extendReads 200 \
+    --ignoreDuplicates \
+    --numberOfProcessors 8
+
+# ChIP sample
+bamCoverage \
+    --bam ChIP.bam \
+    --outFileName ChIP_coverage.bw \
+    --normalizeUsing RPGC \
+    --effectiveGenomeSize 2913022398 \
+    --binSize 10 \
+    --extendReads 200 \
+    --ignoreDuplicates \
+    --numberOfProcessors 8
+```
+
+---
+
+### Step 2: Create Log2 Ratio Track
+
+```bash
+# Compare ChIP to Input
+bamCompare \
+    --bamfile1 ChIP.bam \
+    --bamfile2 Input.bam \
+    --outFileName ChIP_vs_Input_log2ratio.bw \
+    --operation log2 \
+    --scaleFactorsMethod readCount \
+    --binSize 10 \
+    --extendReads 200 \
+    --ignoreDuplicates \
+    --numberOfProcessors 8
+```
+
+**Result:** Log2 ratio track showing enrichment (positive values) and depletion (negative values).
+
+---
+
+### Step 3: Compute Matrix Around TSS
+
+```bash
+# Prepare data for heatmap/profile around transcription start sites
+computeMatrix reference-point \
+    --referencePoint TSS \
+    --scoreFileName ChIP_coverage.bw \
+    --regionsFileName genes.bed \
+    --beforeRegionStartLength 3000 \
+    --afterRegionStartLength 3000 \
+    --binSize 10 \
+    --sortRegions descend \
+    --sortUsing mean \
+    --outFileName matrix_TSS.gz \
+    --outFileNameMatrix matrix_TSS.tab \
+    --numberOfProcessors 8
+```
+
+---
+
+### Step 4: Generate Heatmap
+
+```bash
+# Create heatmap around TSS
+plotHeatmap \
+    --matrixFile matrix_TSS.gz \
+    --outFileName heatmap_TSS.png \
+    --colorMap RdBu \
+    --whatToShow 'plot, heatmap and colorbar' \
+    --zMin -3 --zMax 3 \
+    --yAxisLabel "Genes" \
+    --xAxisLabel "Distance from TSS (bp)" \
+    --refPointLabel "TSS" \
+    --heatmapHeight 15 \
+    --kmeans 3
+```
+
+---
+
+### Step 5: Generate Profile Plot
+
+```bash
+# Create meta-profile around TSS
+plotProfile \
+    --matrixFile matrix_TSS.gz \
+    --outFileName profile_TSS.png \
+    --plotType lines \
+    --perGroup \
+    --colors blue \
+    --plotTitle "ChIP-seq signal around TSS" \
+    --yAxisLabel "Average signal" \
+    --xAxisLabel "Distance from TSS (bp)" \
+    --refPointLabel "TSS"
+```
+
+---
+
+### Step 6: Enrichment at Peaks
+
+```bash
+# Calculate enrichment in peak regions
+plotEnrichment \
+    --bamfiles Input.bam ChIP.bam \
+    --BED peaks.bed \
+    --labels Input ChIP \
+    --plotFile enrichment.png \
+    --outRawCounts enrichment_counts.tab \
+    --extendReads 200 \
+    --ignoreDuplicates
+```
+
+---
+
+## RNA-seq Coverage Workflow
+
+Generate strand-specific coverage tracks for RNA-seq data.
+
+### Forward Strand
+
+```bash
+bamCoverage \
+    --bam rnaseq.bam \
+    --outFileName forward_coverage.bw \
+    --filterRNAstrand forward \
+    --normalizeUsing CPM \
+    --binSize 1 \
+    --numberOfProcessors 8
+```
+
+### Reverse Strand
+
+```bash
+bamCoverage \
+    --bam rnaseq.bam \
+    --outFileName reverse_coverage.bw \
+    --filterRNAstrand reverse \
+    --normalizeUsing CPM \
+    --binSize 1 \
+    --numberOfProcessors 8
+```
+
+**Important:** Do NOT use `--extendReads` for RNA-seq (would extend over splice junctions).
+
+---
+
+## Multi-Sample Comparison Workflow
+
+Compare multiple ChIP-seq samples (e.g., different conditions or time points).
+
+### Step 1: Generate Coverage Files
+
+```bash
+# For each sample
+for sample in Control_ChIP Treated_ChIP; do
+    bamCoverage \
+        --bam ${sample}.bam \
+        --outFileName ${sample}.bw \
+        --normalizeUsing RPGC \
+        --effectiveGenomeSize 2913022398 \
+        --binSize 10 \
+        --extendReads 200 \
+        --ignoreDuplicates \
+        --numberOfProcessors 8
+done
+```
+
+---
+
+### Step 2: Compute Multi-Sample Matrix
+
+```bash
+computeMatrix scale-regions \
+    --scoreFileName Control_ChIP.bw Treated_ChIP.bw \
+    --regionsFileName genes.bed \
+    --beforeRegionStartLength 1000 \
+    --afterRegionStartLength 1000 \
+    --regionBodyLength 3000 \
+    --binSize 10 \
+    --sortRegions descend \
+    --sortUsing mean \
+    --outFileName matrix_multi.gz \
+    --numberOfProcessors 8
+```
+
+---
+
+### Step 3: Multi-Sample Heatmap
+
+```bash
+plotHeatmap \
+    --matrixFile matrix_multi.gz \
+    --outFileName heatmap_comparison.png \
+    --colorMap Blues \
+    --whatToShow 'plot, heatmap and colorbar' \
+    --samplesLabel Control Treated \
+    --yAxisLabel "Genes" \
+    --heatmapHeight 15 \
+    --kmeans 4
+```
+
+---
+
+### Step 4: Multi-Sample Profile
+
+```bash
+plotProfile \
+    --matrixFile matrix_multi.gz \
+    --outFileName profile_comparison.png \
+    --plotType lines \
+    --perGroup \
+    --colors blue red \
+    --samplesLabel Control Treated \
+    --plotTitle "ChIP-seq signal comparison" \
+    --startLabel "TSS" \
+    --endLabel "TES"
+```
+
+---
+
+## ATAC-seq Workflow
+
+Specialized workflow for ATAC-seq data with Tn5 offset correction.
+
+### Step 1: Shift Reads for Tn5 Correction
+
+```bash
+alignmentSieve \
+    --bam atacseq.bam \
+    --outFile atacseq_shifted.bam \
+    --ATACshift \
+    --minFragmentLength 38 \
+    --maxFragmentLength 2000 \
+    --ignoreDuplicates
+```
+
+---
+
+### Step 2: Generate Coverage Track
+
+```bash
+bamCoverage \
+    --bam atacseq_shifted.bam \
+    --outFileName atacseq_coverage.bw \
+    --normalizeUsing RPGC \
+    --effectiveGenomeSize 2913022398 \
+    --binSize 1 \
+    --numberOfProcessors 8
+```
+
+---
+
+### Step 3: Fragment Size Analysis
+
+```bash
+bamPEFragmentSize \
+    --bamfiles atacseq.bam \
+    --histogram fragmentSizes_atac.png \
+    --maxFragmentLength 1000
+```
+
+**Expected Pattern:** Nucleosome ladder with peaks at ~50bp (nucleosome-free), ~200bp (mono-nucleosome), ~400bp (di-nucleosome).
+
+---
+
+## Peak Region Analysis Workflow
+
+Analyze ChIP-seq signal specifically at peak regions.
+
+### Step 1: Matrix at Peaks
+
+```bash
+computeMatrix reference-point \
+    --referencePoint center \
+    --scoreFileName ChIP_coverage.bw \
+    --regionsFileName peaks.bed \
+    --beforeRegionStartLength 2000 \
+    --afterRegionStartLength 2000 \
+    --binSize 10 \
+    --outFileName matrix_peaks.gz \
+    --numberOfProcessors 8
+```
+
+---
+
+### Step 2: Heatmap at Peaks
+
+```bash
+plotHeatmap \
+    --matrixFile matrix_peaks.gz \
+    --outFileName heatmap_peaks.png \
+    --colorMap YlOrRd \
+    --refPointLabel "Peak Center" \
+    --heatmapHeight 15 \
+    --sortUsing max
+```
+
+---
+
+## Troubleshooting Common Issues
+
+### Issue: Out of Memory
+**Solution:** Use `--region` parameter to process chromosomes individually:
+```bash
+bamCoverage --bam input.bam -o chr1.bw --region chr1
+```
+
+### Issue: BAM Index Missing
+**Solution:** Index BAM files before running deepTools:
+```bash
+samtools index input.bam
+```
+
+### Issue: Slow Processing
+**Solution:** Increase `--numberOfProcessors`:
+```bash
+# Use 8 cores instead of default
+--numberOfProcessors 8
+```
+
+### Issue: bigWig Files Too Large
+**Solution:** Increase bin size:
+```bash
+--binSize 50  # or larger (default is 10-50)
+```
+
+---
+
+## Performance Tips
+
+1. **Use multiple processors:** Always set `--numberOfProcessors` to available cores
+2. **Process regions:** Use `--region` for testing or memory-limited environments
+3. **Adjust bin size:** Larger bins = faster processing and smaller files
+4. **Pre-filter BAM files:** Use `alignmentSieve` to create filtered BAM files once, then reuse
+5. **Use bigWig over bedGraph:** bigWig format is compressed and faster to process
+
+---
+
+## Best Practices
+
+1. **Always check QC first:** Run correlation, coverage, and fingerprint analysis before proceeding
+2. **Document parameters:** Save command lines for reproducibility
+3. **Use consistent normalization:** Apply same normalization method across samples in a comparison
+4. **Verify reference genome match:** Ensure BAM files and region files use same genome build
+5. **Check strand orientation:** For RNA-seq, verify correct strand orientation
+6. **Test on small regions first:** Use `--region chr1:1-1000000` for testing parameters
+7. **Keep intermediate files:** Save matrices for regenerating plots with different settings