zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Fork 0

Files

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

11 KiB

Raw Blame History

deepTools Common Workflows

This document provides complete workflow examples for common deepTools analyses.

ChIP-seq Quality Control Workflow

Complete quality control assessment for ChIP-seq experiments.

Step 1: Initial Correlation Assessment

Compare replicates and samples to verify experimental quality:

# Generate coverage matrix across genome
multiBamSummary bins \
    --bamfiles Input1.bam Input2.bam ChIP1.bam ChIP2.bam \
    --labels Input_rep1 Input_rep2 ChIP_rep1 ChIP_rep2 \
    -o readCounts.npz \
    --numberOfProcessors 8

# Create correlation heatmap
plotCorrelation \
    -in readCounts.npz \
    --corMethod pearson \
    --whatToShow heatmap \
    --plotFile correlation_heatmap.png \
    --plotNumbers

# Generate PCA plot
plotPCA \
    -in readCounts.npz \
    -o PCA_plot.png \
    -T "PCA of ChIP-seq samples"

Expected Results:

Replicates should cluster together
Input samples should be distinct from ChIP samples

Step 2: Coverage and Depth Assessment

# Check sequencing depth and coverage
plotCoverage \
    --bamfiles Input1.bam ChIP1.bam ChIP2.bam \
    --labels Input ChIP_rep1 ChIP_rep2 \
    --plotFile coverage.png \
    --ignoreDuplicates \
    --numberOfProcessors 8

Interpretation: Assess whether sequencing depth is adequate for downstream analysis.

Step 3: Fragment Size Validation (Paired-end)

# Verify expected fragment sizes
bamPEFragmentSize \
    --bamfiles Input1.bam ChIP1.bam ChIP2.bam \
    --histogram fragmentSizes.png \
    --plotTitle "Fragment Size Distribution"

Expected Results: Fragment sizes should match library preparation protocols (typically 200-600bp for ChIP-seq).

Step 4: GC Bias Detection and Correction

# Compute GC bias
computeGCBias \
    --bamfile ChIP1.bam \
    --effectiveGenomeSize 2913022398 \
    --genome genome.2bit \
    --fragmentLength 200 \
    --biasPlot GCbias.png \
    --frequenciesFile freq.txt

# If bias detected, correct it
correctGCBias \
    --bamfile ChIP1.bam \
    --effectiveGenomeSize 2913022398 \
    --genome genome.2bit \
    --GCbiasFrequenciesFile freq.txt \
    --correctedFile ChIP1_GCcorrected.bam

Note: Only correct if significant bias is observed. Do NOT use --ignoreDuplicates with GC-corrected files.

Step 5: ChIP Signal Strength Assessment

# Evaluate ChIP enrichment quality
plotFingerprint \
    --bamfiles Input1.bam ChIP1.bam ChIP2.bam \
    --labels Input ChIP_rep1 ChIP_rep2 \
    --plotFile fingerprint.png \
    --extendReads 200 \
    --ignoreDuplicates \
    --numberOfProcessors 8 \
    --outQualityMetrics fingerprint_metrics.txt

Interpretation:

Strong ChIP: Steep rise in cumulative curve
Weak enrichment: Curve close to diagonal (input-like)

ChIP-seq Analysis Workflow

Complete workflow from BAM files to publication-quality visualizations.

Step 1: Generate Normalized Coverage Tracks

# Input control
bamCoverage \
    --bam Input.bam \
    --outFileName Input_coverage.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --binSize 10 \
    --extendReads 200 \
    --ignoreDuplicates \
    --numberOfProcessors 8

# ChIP sample
bamCoverage \
    --bam ChIP.bam \
    --outFileName ChIP_coverage.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --binSize 10 \
    --extendReads 200 \
    --ignoreDuplicates \
    --numberOfProcessors 8

Step 2: Create Log2 Ratio Track

# Compare ChIP to Input
bamCompare \
    --bamfile1 ChIP.bam \
    --bamfile2 Input.bam \
    --outFileName ChIP_vs_Input_log2ratio.bw \
    --operation log2 \
    --scaleFactorsMethod readCount \
    --binSize 10 \
    --extendReads 200 \
    --ignoreDuplicates \
    --numberOfProcessors 8

Result: Log2 ratio track showing enrichment (positive values) and depletion (negative values).

Step 3: Compute Matrix Around TSS

# Prepare data for heatmap/profile around transcription start sites
computeMatrix reference-point \
    --referencePoint TSS \
    --scoreFileName ChIP_coverage.bw \
    --regionsFileName genes.bed \
    --beforeRegionStartLength 3000 \
    --afterRegionStartLength 3000 \
    --binSize 10 \
    --sortRegions descend \
    --sortUsing mean \
    --outFileName matrix_TSS.gz \
    --outFileNameMatrix matrix_TSS.tab \
    --numberOfProcessors 8

Step 4: Generate Heatmap

# Create heatmap around TSS
plotHeatmap \
    --matrixFile matrix_TSS.gz \
    --outFileName heatmap_TSS.png \
    --colorMap RdBu \
    --whatToShow 'plot, heatmap and colorbar' \
    --zMin -3 --zMax 3 \
    --yAxisLabel "Genes" \
    --xAxisLabel "Distance from TSS (bp)" \
    --refPointLabel "TSS" \
    --heatmapHeight 15 \
    --kmeans 3

Step 5: Generate Profile Plot

# Create meta-profile around TSS
plotProfile \
    --matrixFile matrix_TSS.gz \
    --outFileName profile_TSS.png \
    --plotType lines \
    --perGroup \
    --colors blue \
    --plotTitle "ChIP-seq signal around TSS" \
    --yAxisLabel "Average signal" \
    --xAxisLabel "Distance from TSS (bp)" \
    --refPointLabel "TSS"

Step 6: Enrichment at Peaks

# Calculate enrichment in peak regions
plotEnrichment \
    --bamfiles Input.bam ChIP.bam \
    --BED peaks.bed \
    --labels Input ChIP \
    --plotFile enrichment.png \
    --outRawCounts enrichment_counts.tab \
    --extendReads 200 \
    --ignoreDuplicates

RNA-seq Coverage Workflow

Generate strand-specific coverage tracks for RNA-seq data.

Forward Strand

bamCoverage \
    --bam rnaseq.bam \
    --outFileName forward_coverage.bw \
    --filterRNAstrand forward \
    --normalizeUsing CPM \
    --binSize 1 \
    --numberOfProcessors 8

Reverse Strand

bamCoverage \
    --bam rnaseq.bam \
    --outFileName reverse_coverage.bw \
    --filterRNAstrand reverse \
    --normalizeUsing CPM \
    --binSize 1 \
    --numberOfProcessors 8

Important: Do NOT use --extendReads for RNA-seq (would extend over splice junctions).

Multi-Sample Comparison Workflow

Compare multiple ChIP-seq samples (e.g., different conditions or time points).

Step 1: Generate Coverage Files

# For each sample
for sample in Control_ChIP Treated_ChIP; do
    bamCoverage \
        --bam ${sample}.bam \
        --outFileName ${sample}.bw \
        --normalizeUsing RPGC \
        --effectiveGenomeSize 2913022398 \
        --binSize 10 \
        --extendReads 200 \
        --ignoreDuplicates \
        --numberOfProcessors 8
done

Step 2: Compute Multi-Sample Matrix

computeMatrix scale-regions \
    --scoreFileName Control_ChIP.bw Treated_ChIP.bw \
    --regionsFileName genes.bed \
    --beforeRegionStartLength 1000 \
    --afterRegionStartLength 1000 \
    --regionBodyLength 3000 \
    --binSize 10 \
    --sortRegions descend \
    --sortUsing mean \
    --outFileName matrix_multi.gz \
    --numberOfProcessors 8

Step 3: Multi-Sample Heatmap

plotHeatmap \
    --matrixFile matrix_multi.gz \
    --outFileName heatmap_comparison.png \
    --colorMap Blues \
    --whatToShow 'plot, heatmap and colorbar' \
    --samplesLabel Control Treated \
    --yAxisLabel "Genes" \
    --heatmapHeight 15 \
    --kmeans 4

Step 4: Multi-Sample Profile

plotProfile \
    --matrixFile matrix_multi.gz \
    --outFileName profile_comparison.png \
    --plotType lines \
    --perGroup \
    --colors blue red \
    --samplesLabel Control Treated \
    --plotTitle "ChIP-seq signal comparison" \
    --startLabel "TSS" \
    --endLabel "TES"

ATAC-seq Workflow

Specialized workflow for ATAC-seq data with Tn5 offset correction.

Step 1: Shift Reads for Tn5 Correction

alignmentSieve \
    --bam atacseq.bam \
    --outFile atacseq_shifted.bam \
    --ATACshift \
    --minFragmentLength 38 \
    --maxFragmentLength 2000 \
    --ignoreDuplicates

Step 2: Generate Coverage Track

bamCoverage \
    --bam atacseq_shifted.bam \
    --outFileName atacseq_coverage.bw \
    --normalizeUsing RPGC \
    --effectiveGenomeSize 2913022398 \
    --binSize 1 \
    --numberOfProcessors 8

Step 3: Fragment Size Analysis

bamPEFragmentSize \
    --bamfiles atacseq.bam \
    --histogram fragmentSizes_atac.png \
    --maxFragmentLength 1000

Expected Pattern: Nucleosome ladder with peaks at ~50bp (nucleosome-free), ~200bp (mono-nucleosome), ~400bp (di-nucleosome).

Peak Region Analysis Workflow

Analyze ChIP-seq signal specifically at peak regions.

Step 1: Matrix at Peaks

computeMatrix reference-point \
    --referencePoint center \
    --scoreFileName ChIP_coverage.bw \
    --regionsFileName peaks.bed \
    --beforeRegionStartLength 2000 \
    --afterRegionStartLength 2000 \
    --binSize 10 \
    --outFileName matrix_peaks.gz \
    --numberOfProcessors 8

Step 2: Heatmap at Peaks

plotHeatmap \
    --matrixFile matrix_peaks.gz \
    --outFileName heatmap_peaks.png \
    --colorMap YlOrRd \
    --refPointLabel "Peak Center" \
    --heatmapHeight 15 \
    --sortUsing max

Troubleshooting Common Issues

Issue: Out of Memory

Solution: Use --region parameter to process chromosomes individually:

bamCoverage --bam input.bam -o chr1.bw --region chr1

Issue: BAM Index Missing

Solution: Index BAM files before running deepTools:

samtools index input.bam

Issue: Slow Processing

Solution: Increase --numberOfProcessors:

# Use 8 cores instead of default
--numberOfProcessors 8

Issue: bigWig Files Too Large

Solution: Increase bin size:

--binSize 50  # or larger (default is 10-50)

Performance Tips

Use multiple processors: Always set --numberOfProcessors to available cores
Process regions: Use --region for testing or memory-limited environments
Adjust bin size: Larger bins = faster processing and smaller files
Pre-filter BAM files: Use alignmentSieve to create filtered BAM files once, then reuse
Use bigWig over bedGraph: bigWig format is compressed and faster to process

Best Practices

Always check QC first: Run correlation, coverage, and fingerprint analysis before proceeding
Document parameters: Save command lines for reproducibility
Use consistent normalization: Apply same normalization method across samples in a comparison
Verify reference genome match: Ensure BAM files and region files use same genome build
Check strand orientation: For RNA-seq, verify correct strand orientation
Test on small regions first: Use --region chr1:1-1000000 for testing parameters
Keep intermediate files: Save matrices for regenerating plots with different settings

11 KiB Raw Blame History