Files
gh-k-dense-ai-claude-scient…/skills/deeptools/references/tools_reference.md
2025-11-30 08:30:10 +08:00

18 KiB

deepTools Complete Tool Reference

This document provides a comprehensive reference for all deepTools command-line utilities organized by category.

BAM and bigWig File Processing Tools

multiBamSummary

Computes read coverages for genomic regions across multiple BAM files, outputting compressed numpy arrays for downstream correlation and PCA analysis.

Modes:

  • bins: Genome-wide analysis using consecutive equal-sized windows (default 10kb)
  • BED-file: Restricts analysis to user-specified genomic regions

Key Parameters:

  • --bamfiles, -b: Indexed BAM files (space-separated, required)
  • --outFileName, -o: Output coverage matrix file (required)
  • --BED: Region specification file (BED-file mode only)
  • --binSize: Window size in bases (default: 10,000)
  • --labels: Custom sample identifiers
  • --minMappingQuality: Quality threshold for read inclusion
  • --numberOfProcessors, -p: Parallel processing cores
  • --extendReads: Fragment size extension
  • --ignoreDuplicates: Remove PCR duplicates
  • --outRawCounts: Export tab-delimited file with coordinate columns and per-sample counts

Output: Compressed numpy array (.npz) for plotCorrelation and plotPCA

Common Usage:

# Genome-wide comparison
multiBamSummary bins --bamfiles sample1.bam sample2.bam -o results.npz

# Peak region comparison
multiBamSummary BED-file --BED peaks.bed --bamfiles sample1.bam sample2.bam -o results.npz

multiBigwigSummary

Similar to multiBamSummary but operates on bigWig files instead of BAM files. Used for comparing coverage tracks across samples.

Modes:

  • bins: Genome-wide analysis
  • BED-file: Region-specific analysis

Key Parameters: Similar to multiBamSummary but accepts bigWig files


bamCoverage

Converts BAM alignment files into normalized coverage tracks in bigWig or bedGraph formats. Calculates coverage as number of reads per bin.

Key Parameters:

  • --bam, -b: Input BAM file (required)
  • --outFileName, -o: Output filename (required)
  • --outFileFormat, -of: Output type (bigwig or bedgraph)
  • --normalizeUsing: Normalization method
    • RPKM: Reads Per Kilobase per Million mapped reads
    • CPM: Counts Per Million mapped reads
    • BPM: Bins Per Million mapped reads
    • RPGC: Reads per genomic content (requires --effectiveGenomeSize)
    • None: No normalization (default)
  • --effectiveGenomeSize: Mappable genome size (required for RPGC)
  • --binSize: Resolution in base pairs (default: 50)
  • --extendReads, -e: Extend reads to fragment length (recommended for ChIP-seq, NOT for RNA-seq)
  • --centerReads: Center reads at fragment length for sharper signals
  • --ignoreDuplicates: Count identical reads only once
  • --minMappingQuality: Filter reads below quality threshold
  • --minFragmentLength / --maxFragmentLength: Fragment length filtering
  • --smoothLength: Window averaging for noise reduction
  • --MNase: Analyze MNase-seq data for nucleosome positioning
  • --Offset: Position-specific offsets (useful for RiboSeq, GROseq)
  • --filterRNAstrand: Separate forward/reverse strand reads
  • --ignoreForNormalization: Exclude chromosomes from normalization (e.g., sex chromosomes)
  • --numberOfProcessors, -p: Parallel processing

Important Notes:

  • For RNA-seq: Do NOT use --extendReads (would extend over splice junctions)
  • For ChIP-seq: Use --extendReads with smaller bin sizes
  • Never apply --ignoreDuplicates after GC bias correction

Common Usage:

# Basic coverage with RPKM normalization
bamCoverage --bam input.bam --outFileName coverage.bw --normalizeUsing RPKM

# ChIP-seq with extension
bamCoverage --bam chip.bam --outFileName chip_coverage.bw \
    --binSize 10 --extendReads 200 --ignoreDuplicates

# Strand-specific RNA-seq
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
    --filterRNAstrand forward

bamCompare

Compares two BAM files by generating bigWig or bedGraph files, normalizing for sequencing depth differences. Processes genome in equal-sized bins and performs per-bin calculations.

Comparison Methods:

  • log2 (default): Log2 ratio of samples
  • ratio: Direct ratio calculation
  • subtract: Difference between files
  • add: Sum of samples
  • mean: Average across samples
  • reciprocal_ratio: Negative inverse for ratios < 0
  • first/second: Output scaled signal from single file

Normalization Methods:

  • readCount (default): Compensates for sequencing depth
  • SES: Selective enrichment statistics
  • RPKM: Reads per kilobase per million
  • CPM: Counts per million
  • BPM: Bins per million
  • RPGC: Reads per genomic content (requires --effectiveGenomeSize)

Key Parameters:

  • --bamfile1, -b1: First BAM file (required)
  • --bamfile2, -b2: Second BAM file (required)
  • --outFileName, -o: Output filename (required)
  • --outFileFormat: bigwig or bedgraph
  • --operation: Comparison method (see above)
  • --scaleFactorsMethod: Normalization method (see above)
  • --binSize: Bin width for output (default: 50bp)
  • --pseudocount: Avoid division by zero (default: 1)
  • --extendReads: Extend reads to fragment length
  • --ignoreDuplicates: Count identical reads once
  • --minMappingQuality: Quality threshold
  • --numberOfProcessors, -p: Parallelization

Common Usage:

# Log2 ratio of treatment vs control
bamCompare -b1 treatment.bam -b2 control.bam -o log2ratio.bw

# Subtract control from treatment
bamCompare -b1 treatment.bam -b2 control.bam -o difference.bw \
    --operation subtract --scaleFactorsMethod readCount

correctGCBias / computeGCBias

computeGCBias: Identifies GC-content bias from sequencing and PCR amplification.

correctGCBias: Corrects BAM files for GC bias detected by computeGCBias.

Key Parameters (computeGCBias):

  • --bamfile, -b: Input BAM file
  • --effectiveGenomeSize: Mappable genome size
  • --genome, -g: Reference genome in 2bit format
  • --fragmentLength, -l: Fragment length (for single-end)
  • --biasPlot: Output diagnostic plot

Key Parameters (correctGCBias):

  • --bamfile, -b: Input BAM file
  • --effectiveGenomeSize: Mappable genome size
  • --genome, -g: Reference genome in 2bit format
  • --GCbiasFrequenciesFile: Frequencies from computeGCBias
  • --correctedFile, -o: Output corrected BAM

Important: Never use --ignoreDuplicates after GC bias correction


alignmentSieve

Filters BAM files by various quality metrics on-the-fly. Useful for creating filtered BAM files for specific analyses.

Key Parameters:

  • --bam, -b: Input BAM file
  • --outFile, -o: Output BAM file
  • --minMappingQuality: Minimum mapping quality
  • --ignoreDuplicates: Remove duplicates
  • --minFragmentLength / --maxFragmentLength: Fragment length filters
  • --samFlagInclude / --samFlagExclude: SAM flag filtering
  • --shift: Shift reads (e.g., for ATACseq Tn5 correction)
  • --ATACshift: Automatically shift for ATAC-seq data

computeMatrix

Calculates scores per genomic region and prepares matrices for plotHeatmap and plotProfile. Processes bigWig score files and BED/GTF region files.

Modes:

  • reference-point: Signal distribution relative to specific position (TSS, TES, or center)
  • scale-regions: Signal across regions standardized to uniform lengths

Key Parameters:

  • -R: Region file(s) in BED/GTF format (required)
  • -S: BigWig score file(s) (required)
  • -o: Output matrix file (required)
  • -b: Upstream distance from reference point
  • -a: Downstream distance from reference point
  • -m: Region body length (scale-regions only)
  • -bs, --binSize: Bin size for averaging scores
  • --skipZeros: Skip regions with all zeros
  • --minThreshold / --maxThreshold: Filter by signal intensity
  • --sortRegions: ascending, descending, keep, no
  • --sortUsing: mean, median, max, min, sum, region_length
  • -p, --numberOfProcessors: Parallel processing
  • --averageTypeBins: Statistical method (mean, median, min, max, sum, std)

Output Options:

  • --outFileNameMatrix: Export tab-delimited data
  • --outFileSortedRegions: Save filtered/sorted BED file

Common Usage:

# TSS analysis
computeMatrix reference-point -S signal.bw -R genes.bed \
    -o matrix.gz -b 2000 -a 2000 --referencePoint TSS

# Scaled gene body
computeMatrix scale-regions -S signal.bw -R genes.bed \
    -o matrix.gz -b 1000 -a 1000 -m 3000

Quality Control Tools

plotFingerprint

Quality control tool primarily for ChIP-seq experiments. Assesses whether antibody enrichment was successful. Generates cumulative read coverage profiles to distinguish signal from noise.

Key Parameters:

  • --bamfiles, -b: Indexed BAM files (required)
  • --plotFile, -plot, -o: Output image filename (required)
  • --extendReads, -e: Extend reads to fragment length
  • --ignoreDuplicates: Count identical reads once
  • --minMappingQuality: Mapping quality filter
  • --centerReads: Center reads at fragment length
  • --minFragmentLength / --maxFragmentLength: Fragment filters
  • --outRawCounts: Save per-bin read counts
  • --outQualityMetrics: Output QC metrics (Jensen-Shannon distance)
  • --labels: Custom sample names
  • --numberOfProcessors, -p: Parallel processing

Interpretation:

  • Ideal control: Straight diagonal line
  • Strong ChIP: Steep rise towards highest rank (concentrated reads in few bins)
  • Weak enrichment: Flatter curve approaching diagonal

Common Usage:

plotFingerprint -b input.bam chip1.bam chip2.bam \
    --labels Input ChIP1 ChIP2 -o fingerprint.png \
    --extendReads 200 --ignoreDuplicates

plotCoverage

Visualizes average read distribution across the genome. Shows genome coverage and helps determine if sequencing depth is adequate.

Key Parameters:

  • --bamfiles, -b: BAM files to analyze (required)
  • --plotFile, -o: Output plot filename (required)
  • --ignoreDuplicates: Remove PCR duplicates
  • --minMappingQuality: Quality threshold
  • --outRawCounts: Save underlying data
  • --labels: Sample names
  • --numberOfSamples: Number of positions to sample (default: 1,000,000)

bamPEFragmentSize

Determines fragment length distribution for paired-end sequencing data. Essential QC to verify expected fragment sizes from library preparation.

Key Parameters:

  • --bamfiles, -b: BAM files (required)
  • --histogram, -hist: Output histogram filename (required)
  • --plotTitle, -T: Plot title
  • --maxFragmentLength: Maximum length to consider (default: 1000)
  • --logScale: Use logarithmic Y-axis
  • --outRawFragmentLengths: Save raw fragment lengths

plotCorrelation

Analyzes sample correlations from multiBamSummary or multiBigwigSummary outputs. Shows how similar different samples are.

Correlation Methods:

  • Pearson: Measures metric differences; sensitive to outliers; appropriate for normally distributed data
  • Spearman: Rank-based; less influenced by outliers; better for non-normal distributions

Visualization Options:

  • heatmap: Color intensity with hierarchical clustering (complete linkage)
  • scatterplot: Pairwise scatter plots with correlation coefficients

Key Parameters:

  • --corData, -in: Input matrix from multiBamSummary/multiBigwigSummary (required)
  • --corMethod: pearson or spearman (required)
  • --whatToShow: heatmap or scatterplot (required)
  • --plotFile, -o: Output filename (required)
  • --skipZeros: Exclude zero-value regions
  • --removeOutliers: Use median absolute deviation (MAD) filtering
  • --outFileCorMatrix: Export correlation matrix
  • --labels: Custom sample names
  • --plotTitle: Plot title
  • --colorMap: Color scheme (50+ options)
  • --plotNumbers: Display correlation values on heatmap

Common Usage:

# Heatmap with Pearson correlation
plotCorrelation -in readCounts.npz --corMethod pearson \
    --whatToShow heatmap -o correlation_heatmap.png --plotNumbers

# Scatterplot with Spearman correlation
plotCorrelation -in readCounts.npz --corMethod spearman \
    --whatToShow scatterplot -o correlation_scatter.png

plotPCA

Generates principal component analysis plots from multiBamSummary or multiBigwigSummary output. Displays sample relationships in reduced dimensionality.

Key Parameters:

  • --corData, -in: Coverage file from multiBamSummary/multiBigwigSummary (required)
  • --plotFile, -o: Output image (png, eps, pdf, svg) (required)
  • --outFileNameData: Export PCA data (loadings/rotation and eigenvalues)
  • --labels, -l: Custom sample labels
  • --plotTitle, -T: Plot title
  • --plotHeight / --plotWidth: Dimensions in centimeters
  • --colors: Custom symbol colors
  • --markers: Symbol shapes
  • --transpose: Perform PCA on transposed matrix (rows=samples)
  • --ntop: Use top N variable rows (default: 1000)
  • --PCs: Components to plot (default: 1 2)
  • --log2: Log2-transform data before analysis
  • --rowCenter: Center each row at 0

Common Usage:

plotPCA -in readCounts.npz -o PCA_plot.png \
    -T "PCA of read counts" --transpose

Visualization Tools

plotHeatmap

Creates genomic region heatmaps from computeMatrix output. Generates publication-quality visualizations.

Key Parameters:

  • --matrixFile, -m: Matrix from computeMatrix (required)
  • --outFileName, -o: Output image (png, eps, pdf, svg) (required)
  • --outFileSortedRegions: Save regions after filtering
  • --outFileNameMatrix: Export matrix values
  • --interpolationMethod: auto, nearest, bilinear, bicubic, gaussian
    • Default: nearest (≤1000 columns), bilinear (>1000 columns)
  • --dpi: Figure resolution

Clustering:

  • --kmeans: k-means clustering
  • --hclust: Hierarchical clustering (slower for >1000 regions)
  • --silhouette: Calculate cluster quality metrics

Visual Customization:

  • --heatmapHeight / --heatmapWidth: Dimensions (3-100 cm)
  • --whatToShow: plot, heatmap, colorbar (combinations)
  • --alpha: Transparency (0-1)
  • --colorMap: 50+ color schemes
  • --colorList: Custom gradient colors
  • --zMin / --zMax: Intensity scale limits
  • --boxAroundHeatmaps: yes/no (default: yes)

Labels:

  • --xAxisLabel / --yAxisLabel: Axis labels
  • --regionsLabel: Region set identifiers
  • --samplesLabel: Sample names
  • --refPointLabel: Reference point label
  • --startLabel / --endLabel: Region boundary labels

Common Usage:

# Basic heatmap
plotHeatmap -m matrix.gz -o heatmap.png

# With clustering and custom colors
plotHeatmap -m matrix.gz -o heatmap.png \
    --kmeans 3 --colorMap RdBu --zMin -3 --zMax 3

plotProfile

Generates profile plots showing scores across genomic regions using computeMatrix output.

Key Parameters:

  • --matrixFile, -m: Matrix from computeMatrix (required)
  • --outFileName, -o: Output image (png, eps, pdf, svg) (required)
  • --plotType: lines, fill, se, std, overlapped_lines, heatmap
  • --colors: Color palette (names or hex codes)
  • --plotHeight / --plotWidth: Dimensions in centimeters
  • --yMin / --yMax: Y-axis range
  • --averageType: mean, median, min, max, std, sum

Clustering:

  • --kmeans: k-means clustering
  • --hclust: Hierarchical clustering
  • --silhouette: Cluster quality metrics

Labels:

  • --plotTitle: Main heading
  • --regionsLabel: Region set identifiers
  • --samplesLabel: Sample names
  • --startLabel / --endLabel: Region boundary labels (scale-regions mode)

Output Options:

  • --outFileNameData: Export data as tab-separated values
  • --outFileSortedRegions: Save filtered/sorted regions as BED

Common Usage:

# Line plot
plotProfile -m matrix.gz -o profile.png --plotType lines

# With standard error shading
plotProfile -m matrix.gz -o profile.png --plotType se \
    --colors blue red green

plotEnrichment

Calculates and visualizes signal enrichment across genomic regions. Measures percentage of alignments overlapping region groups. Useful for FRiP (Fragment in Peaks) scores.

Key Parameters:

  • --bamfiles, -b: Indexed BAM files (required)
  • --BED: Region files in BED/GTF format (required)
  • --plotFile, -o: Output visualization (png, pdf, eps, svg)
  • --labels, -l: Custom sample identifiers
  • --outRawCounts: Export numerical data
  • --perSample: Group by sample instead of feature (default)
  • --regionLabels: Custom region names

Read Processing:

  • --minFragmentLength / --maxFragmentLength: Fragment filters
  • --minMappingQuality: Quality threshold
  • --samFlagInclude / --samFlagExclude: SAM flag filters
  • --ignoreDuplicates: Remove duplicates
  • --centerReads: Center reads for sharper signal

Common Usage:

plotEnrichment -b Input.bam H3K4me3.bam \
    --BED peaks_up.bed peaks_down.bed \
    --regionLabels "Up regulated" "Down regulated" \
    -o enrichment.png

Miscellaneous Tools

computeMatrixOperations

Advanced matrix manipulation tool for combining or subsetting matrices from computeMatrix. Enables complex multi-sample, multi-region analyses.

Operations:

  • cbind: Combine matrices column-wise
  • rbind: Combine matrices row-wise
  • subset: Extract specific samples or regions
  • filterStrand: Keep only regions on specific strand
  • filterValues: Apply signal intensity filters
  • sort: Order regions by various criteria
  • dataRange: Report min/max values

Common Usage:

# Combine matrices
computeMatrixOperations cbind -m matrix1.gz matrix2.gz -o combined.gz

# Extract specific samples
computeMatrixOperations subset -m matrix.gz --samples 0 2 -o subset.gz

estimateReadFiltering

Predicts the impact of various filtering parameters without actually filtering. Helps optimize filtering strategies before running full analyses.

Key Parameters:

  • --bamfiles, -b: BAM files to analyze
  • --sampleSize: Number of reads to sample (default: 100,000)
  • --binSize: Bin size for analysis
  • --distanceBetweenBins: Spacing between sampled bins

Filtration Options to Test:

  • --minMappingQuality: Test quality thresholds
  • --ignoreDuplicates: Assess duplicate impact
  • --minFragmentLength / --maxFragmentLength: Test fragment filters

Common Parameters Across Tools

Many deepTools commands share these filtering and performance options:

Read Filtering:

  • --ignoreDuplicates: Remove PCR duplicates
  • --minMappingQuality: Filter by alignment confidence
  • --samFlagInclude / --samFlagExclude: SAM format filtering
  • --minFragmentLength / --maxFragmentLength: Fragment length bounds

Performance:

  • --numberOfProcessors, -p: Enable parallel processing
  • --region: Process specific genomic regions (chr:start-end)

Read Processing:

  • --extendReads: Extend to fragment length
  • --centerReads: Center at fragment midpoint
  • --ignoreDuplicates: Count unique reads only