Initial commit

Zhongwei Li · 2025-11-30 08:30:10 +08:00 · commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

skills/deeptools/SKILL.md (new file)

@@ -0,0 +1,525 @@
---
name: deeptools
description: "NGS analysis toolkit. BAM to bigWig conversion, QC (correlation, PCA, fingerprints), heatmaps/profiles (TSS, peaks), for ChIP-seq, RNA-seq, ATAC-seq visualization."
---
# deepTools: NGS Data Analysis Toolkit
## Overview
deepTools is a comprehensive suite of Python command-line tools designed for processing and analyzing high-throughput sequencing data. Use deepTools to perform quality control, normalize data, compare samples, and generate publication-quality visualizations for ChIP-seq, RNA-seq, ATAC-seq, MNase-seq, and other NGS experiments.
**Core capabilities:**
- Convert BAM alignments to normalized coverage tracks (bigWig/bedGraph)
- Quality control assessment (fingerprint, correlation, coverage)
- Sample comparison and correlation analysis
- Heatmap and profile plot generation around genomic features
- Enrichment analysis and peak region visualization
## When to Use This Skill
This skill should be used when:
- **File conversion**: "Convert BAM to bigWig", "generate coverage tracks", "normalize ChIP-seq data"
- **Quality control**: "check ChIP quality", "compare replicates", "assess sequencing depth", "QC analysis"
- **Visualization**: "create heatmap around TSS", "plot ChIP signal", "visualize enrichment", "generate profile plot"
- **Sample comparison**: "compare treatment vs control", "correlate samples", "PCA analysis"
- **Analysis workflows**: "analyze ChIP-seq data", "RNA-seq coverage", "ATAC-seq analysis", "complete workflow"
- **Working with specific file types**: BAM files, bigWig files, BED region files in genomics context
## Quick Start
For users new to deepTools, start with file validation and common workflows:
### 1. Validate Input Files
Before running any analysis, validate BAM, bigWig, and BED files using the validation script:
```bash
python scripts/validate_files.py --bam sample1.bam sample2.bam --bed regions.bed
```
This checks file existence, BAM indices, and format correctness.
### 2. Generate Workflow Template
For standard analyses, use the workflow generator to create customized scripts:
```bash
# List available workflows
python scripts/workflow_generator.py --list
# Generate ChIP-seq QC workflow
python scripts/workflow_generator.py chipseq_qc -o qc_workflow.sh \
--input-bam Input.bam --chip-bams "ChIP1.bam ChIP2.bam" \
--genome-size 2913022398
# Make executable and run
chmod +x qc_workflow.sh
./qc_workflow.sh
```
### 3. Most Common Operations
See `assets/quick_reference.md` for frequently used commands and parameters.
## Installation
```bash
uv pip install deeptools
```
## Core Workflows
deepTools workflows typically follow this pattern: **QC → Normalization → Comparison/Visualization**
### ChIP-seq Quality Control Workflow
When users request ChIP-seq QC or quality assessment:
1. **Generate workflow script** using `scripts/workflow_generator.py chipseq_qc`
2. **Key QC steps**:
- Sample correlation (multiBamSummary + plotCorrelation)
- PCA analysis (plotPCA)
- Coverage assessment (plotCoverage)
- Fragment size validation (bamPEFragmentSize)
- ChIP enrichment strength (plotFingerprint)
**Interpreting results:**
- **Correlation**: Replicates should cluster together with high correlation (>0.9)
- **Fingerprint**: A strong ChIP shows a steep rise toward the highest-ranked bins; a curve that stays close to the diagonal indicates poor enrichment
- **Coverage**: Assess if sequencing depth is adequate for analysis
Full workflow details in `references/workflows.md` → "ChIP-seq Quality Control Workflow"
### ChIP-seq Complete Analysis Workflow
For full ChIP-seq analysis from BAM to visualizations:
1. **Generate coverage tracks** with normalization (bamCoverage)
2. **Create comparison tracks** (bamCompare for log2 ratio)
3. **Compute signal matrices** around features (computeMatrix)
4. **Generate visualizations** (plotHeatmap, plotProfile)
5. **Enrichment analysis** at peaks (plotEnrichment)
Use `scripts/workflow_generator.py chipseq_analysis` to generate template.
Complete command sequences in `references/workflows.md` → "ChIP-seq Analysis Workflow"
### RNA-seq Coverage Workflow
For strand-specific RNA-seq coverage tracks:
Use bamCoverage with `--filterRNAstrand` to separate forward and reverse strands.
**Important:** NEVER use `--extendReads` for RNA-seq (would extend over splice junctions).
Use normalization: CPM for fixed bins, RPKM for gene-level analysis.
Template available: `scripts/workflow_generator.py rnaseq_coverage`
Details in `references/workflows.md` → "RNA-seq Coverage Workflow"
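A minimal sketch (file names are placeholders; whether `forward` or `reverse` corresponds to the sense strand depends on the library protocol):
```bash
bamCoverage --bam rnaseq.bam --outFileName fwd.bw \
    --filterRNAstrand forward --normalizeUsing CPM --binSize 1 -p 8
bamCoverage --bam rnaseq.bam --outFileName rev.bw \
    --filterRNAstrand reverse --normalizeUsing CPM --binSize 1 -p 8
```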
### ATAC-seq Analysis Workflow
ATAC-seq requires Tn5 offset correction:
1. **Shift reads** using alignmentSieve with `--ATACshift`
2. **Generate coverage** with bamCoverage
3. **Analyze fragment sizes** (expect nucleosome ladder pattern)
4. **Visualize at peaks** if available
Template: `scripts/workflow_generator.py atacseq`
Full workflow in `references/workflows.md` → "ATAC-seq Workflow"
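A minimal sketch of steps 1-2 (file names are placeholders; assumes samtools is available, since the shifted BAM must be coordinate-sorted and indexed before bamCoverage can use it):
```bash
alignmentSieve --bam atac.bam --outFile atac_shifted.bam --ATACshift
samtools sort -o atac_shifted.sorted.bam atac_shifted.bam
samtools index atac_shifted.sorted.bam
bamCoverage --bam atac_shifted.sorted.bam --outFileName atac_coverage.bw \
    --normalizeUsing RPGC --effectiveGenomeSize 2913022398 --binSize 1
```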
## Tool Categories and Common Tasks
### BAM/bigWig Processing
**Convert BAM to normalized coverage:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
--binSize 10 --numberOfProcessors 8
```
**Compare two samples (log2 ratio):**
```bash
bamCompare -b1 treatment.bam -b2 control.bam -o ratio.bw \
--operation log2 --scaleFactorsMethod readCount
```
**Key tools:** bamCoverage, bamCompare, multiBamSummary, multiBigwigSummary, correctGCBias, alignmentSieve
Complete reference: `references/tools_reference.md` → "BAM and bigWig File Processing Tools"
### Quality Control
**Check ChIP enrichment:**
```bash
plotFingerprint -b input.bam chip.bam -o fingerprint.png \
--extendReads 200 --ignoreDuplicates
```
**Sample correlation:**
```bash
multiBamSummary bins --bamfiles *.bam -o counts.npz
plotCorrelation -in counts.npz --corMethod pearson \
--whatToShow heatmap -o correlation.png
```
**Key tools:** plotFingerprint, plotCoverage, plotCorrelation, plotPCA, bamPEFragmentSize
Complete reference: `references/tools_reference.md` → "Quality Control Tools"
### Visualization
**Create heatmap around TSS:**
```bash
# Compute matrix
computeMatrix reference-point -S signal.bw -R genes.bed \
-b 3000 -a 3000 --referencePoint TSS -o matrix.gz
# Generate heatmap
plotHeatmap -m matrix.gz -o heatmap.png \
--colorMap RdBu --kmeans 3
```
**Create profile plot:**
```bash
plotProfile -m matrix.gz -o profile.png \
--plotType lines --colors blue red
```
**Key tools:** computeMatrix, plotHeatmap, plotProfile, plotEnrichment
Complete reference: `references/tools_reference.md` → "Visualization Tools"
## Normalization Methods
Choosing the correct normalization is critical for valid comparisons. Consult `references/normalization_methods.md` for comprehensive guidance.
**Quick selection guide:**
- **ChIP-seq coverage**: Use RPGC or CPM
- **ChIP-seq comparison**: Use bamCompare with log2 and readCount
- **RNA-seq bins**: Use CPM
- **RNA-seq genes**: Use RPKM (accounts for gene length)
- **ATAC-seq**: Use RPGC or CPM
**Normalization methods:**
- **RPGC**: 1× genome coverage (requires --effectiveGenomeSize)
- **CPM**: Counts per million mapped reads
- **RPKM**: Reads per kb per million (accounts for region length)
- **BPM**: Bins per million
- **None**: Raw counts (not recommended for comparisons)
Full explanation: `references/normalization_methods.md`
## Effective Genome Sizes
RPGC normalization requires effective genome size. Common values:
| Organism | Assembly | Size | Usage |
|----------|----------|------|-------|
| Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` |
| Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` |
| Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` |
| *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` |
| *C. elegans* | ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` |
Complete table with read-length-specific values: `references/effective_genome_sizes.md`
## Common Parameters Across Tools
Many deepTools commands share these options:
**Performance:**
- `--numberOfProcessors, -p`: Enable parallel processing (always use available cores)
- `--region`: Process specific regions for testing (e.g., `chr1:1-1000000`)
**Read Filtering:**
- `--ignoreDuplicates`: Remove PCR duplicates (recommended for most analyses)
- `--minMappingQuality`: Filter by alignment quality (e.g., `--minMappingQuality 10`)
- `--minFragmentLength` / `--maxFragmentLength`: Fragment length bounds
- `--samFlagInclude` / `--samFlagExclude`: SAM flag filtering
**Read Processing:**
- `--extendReads`: Extend to fragment length (ChIP-seq: YES, RNA-seq: NO)
- `--centerReads`: Center at fragment midpoint for sharper signals
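For illustration, a hypothetical quick test run that combines several of these shared options (file names and thresholds are placeholders):
```bash
bamCoverage --bam chip.bam --outFileName test.bw \
    --region chr1:1-1000000 \
    --ignoreDuplicates --minMappingQuality 10 \
    --extendReads 200 --centerReads \
    --numberOfProcessors 8
```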
## Best Practices
### File Validation
**Always validate files first** using `scripts/validate_files.py` to check:
- File existence and readability
- BAM indices present (.bai files)
- BED format correctness
- File sizes reasonable
### Analysis Strategy
1. **Start with QC**: Run correlation, coverage, and fingerprint analysis before proceeding
2. **Test on small regions**: Use `--region chr1:1-10000000` for parameter testing
3. **Document commands**: Save full command lines for reproducibility
4. **Use consistent normalization**: Apply same method across samples in comparisons
5. **Verify genome assembly**: Ensure BAM and BED files use matching genome builds
### ChIP-seq Specific
- **Always extend reads** for ChIP-seq: `--extendReads 200`
- **Remove duplicates**: Use `--ignoreDuplicates` in most cases
- **Check enrichment first**: Run plotFingerprint before detailed analysis
- **GC correction**: Only apply if significant bias detected; never use `--ignoreDuplicates` after GC correction
### RNA-seq Specific
- **Never extend reads** for RNA-seq (would span splice junctions)
- **Strand-specific**: Use `--filterRNAstrand forward/reverse` for stranded libraries
- **Normalization**: CPM for bins, RPKM for genes
### ATAC-seq Specific
- **Apply Tn5 correction**: Use alignmentSieve with `--ATACshift`
- **Fragment filtering**: Set appropriate min/max fragment lengths
- **Check nucleosome pattern**: Fragment size plot should show ladder pattern
### Performance Optimization
1. **Use multiple processors**: `--numberOfProcessors 8` (or available cores)
2. **Increase bin size** for faster processing and smaller files
3. **Process chromosomes separately** for memory-limited systems
4. **Pre-filter BAM files** using alignmentSieve to create reusable filtered files
5. **Use bigWig over bedGraph**: Compressed and faster to process
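As a sketch of point 4, a filtered BAM can be created once, indexed, and reused by later commands (file names and thresholds are placeholders):
```bash
alignmentSieve --bam sample.bam --outFile sample.filtered.bam \
    --minMappingQuality 10 --ignoreDuplicates --numberOfProcessors 8
samtools index sample.filtered.bam
bamCoverage --bam sample.filtered.bam -o sample.bw --binSize 10
```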
## Troubleshooting
### Common Issues
**BAM index missing:**
```bash
samtools index input.bam
```
**Out of memory:**
Process chromosomes individually using `--region`:
```bash
bamCoverage --bam input.bam -o chr1.bw --region chr1
```
**Slow processing:**
Increase `--numberOfProcessors` and/or increase `--binSize`
**bigWig files too large:**
Increase bin size: `--binSize 50` or larger
### Validation Errors
Run validation script to identify issues:
```bash
python scripts/validate_files.py --bam *.bam --bed regions.bed
```
Common errors and solutions explained in script output.
## Reference Documentation
This skill includes comprehensive reference documentation:
### references/tools_reference.md
Complete documentation of all deepTools commands organized by category:
- BAM and bigWig processing tools (9 tools)
- Quality control tools (6 tools)
- Visualization tools (3 tools)
- Miscellaneous tools (2 tools)
Each tool includes:
- Purpose and overview
- Key parameters with explanations
- Usage examples
- Important notes and best practices
**Use this reference when:** Users ask about specific tools, parameters, or detailed usage.
### references/workflows.md
Complete workflow examples for common analyses:
- ChIP-seq quality control workflow
- ChIP-seq complete analysis workflow
- RNA-seq coverage workflow
- ATAC-seq analysis workflow
- Multi-sample comparison workflow
- Peak region analysis workflow
- Troubleshooting and performance tips
**Use this reference when:** Users need complete analysis pipelines or workflow examples.
### references/normalization_methods.md
Comprehensive guide to normalization methods:
- Detailed explanation of each method (RPGC, CPM, RPKM, BPM, etc.)
- When to use each method
- Formulas and interpretation
- Selection guide by experiment type
- Common pitfalls and solutions
- Quick reference table
**Use this reference when:** Users ask about normalization, comparing samples, or which method to use.
### references/effective_genome_sizes.md
Effective genome size values and usage:
- Common organism values (human, mouse, fly, worm, zebrafish)
- Read-length-specific values
- Calculation methods
- When and how to use in commands
- Custom genome calculation instructions
**Use this reference when:** Users need genome size for RPGC normalization or GC bias correction.
## Helper Scripts
### scripts/validate_files.py
Validates BAM, bigWig, and BED files for deepTools analysis. Checks file existence, indices, and format.
**Usage:**
```bash
python scripts/validate_files.py --bam sample1.bam sample2.bam \
--bed peaks.bed --bigwig signal.bw
```
**When to use:** Before starting any analysis, or when troubleshooting errors.
### scripts/workflow_generator.py
Generates customizable bash script templates for common deepTools workflows.
**Available workflows:**
- `chipseq_qc`: ChIP-seq quality control
- `chipseq_analysis`: Complete ChIP-seq analysis
- `rnaseq_coverage`: Strand-specific RNA-seq coverage
- `atacseq`: ATAC-seq with Tn5 correction
**Usage:**
```bash
# List workflows
python scripts/workflow_generator.py --list
# Generate workflow
python scripts/workflow_generator.py chipseq_qc -o qc.sh \
--input-bam Input.bam --chip-bams "ChIP1.bam ChIP2.bam" \
--genome-size 2913022398 --threads 8
# Run generated workflow
chmod +x qc.sh
./qc.sh
```
**When to use:** Users request standard workflows or need template scripts to customize.
## Assets
### assets/quick_reference.md
Quick reference card with most common commands, effective genome sizes, and typical workflow pattern.
**When to use:** Users need quick command examples without detailed documentation.
## Handling User Requests
### For New Users
1. Start with installation verification
2. Validate input files using `scripts/validate_files.py`
3. Recommend appropriate workflow based on experiment type
4. Generate workflow template using `scripts/workflow_generator.py`
5. Guide through customization and execution
### For Experienced Users
1. Provide specific tool commands for requested operations
2. Reference appropriate sections in `references/tools_reference.md`
3. Suggest optimizations and best practices
4. Offer troubleshooting for issues
### For Specific Tasks
**"Convert BAM to bigWig":**
- Use bamCoverage with appropriate normalization
- Recommend RPGC or CPM based on use case
- Provide effective genome size for organism
- Suggest relevant parameters (extendReads, ignoreDuplicates, binSize)
**"Check ChIP quality":**
- Run full QC workflow or use plotFingerprint specifically
- Explain interpretation of results
- Suggest follow-up actions based on results
**"Create heatmap":**
- Guide through two-step process: computeMatrix → plotHeatmap
- Help choose appropriate matrix mode (reference-point vs scale-regions)
- Suggest visualization parameters and clustering options
**"Compare samples":**
- Recommend bamCompare for two-sample comparison
- Suggest multiBamSummary + plotCorrelation for multiple samples
- Guide normalization method selection
### Referencing Documentation
When users need detailed information:
- **Tool details**: Direct to specific sections in `references/tools_reference.md`
- **Workflows**: Use `references/workflows.md` for complete analysis pipelines
- **Normalization**: Consult `references/normalization_methods.md` for method selection
- **Genome sizes**: Reference `references/effective_genome_sizes.md`
Search references using grep patterns:
```bash
# Find tool documentation
grep -A 20 "^### toolname" references/tools_reference.md
# Find workflow
grep -A 50 "^## Workflow Name" references/workflows.md
# Find normalization method
grep -A 15 "^### Method Name" references/normalization_methods.md
```
## Example Interactions
**User: "I need to analyze my ChIP-seq data"**
Response approach:
1. Ask about files available (BAM files, peaks, genes)
2. Validate files using validation script
3. Generate chipseq_analysis workflow template
4. Customize for their specific files and organism
5. Explain each step as script runs
**User: "Which normalization should I use?"**
Response approach:
1. Ask about experiment type (ChIP-seq, RNA-seq, etc.)
2. Ask about comparison goal (within-sample or between-sample)
3. Consult `references/normalization_methods.md` selection guide
4. Recommend appropriate method with justification
5. Provide command example with parameters
**User: "Create a heatmap around TSS"**
Response approach:
1. Verify bigWig and gene BED files available
2. Use computeMatrix with reference-point mode at TSS
3. Generate plotHeatmap with appropriate visualization parameters
4. Suggest clustering if dataset is large
5. Offer profile plot as complement
## Key Reminders
- **File validation first**: Always validate input files before analysis
- **Normalization matters**: Choose appropriate method for comparison type
- **Extend reads carefully**: YES for ChIP-seq, NO for RNA-seq
- **Use all cores**: Set `--numberOfProcessors` to available cores
- **Test on regions**: Use `--region` for parameter testing
- **Check QC first**: Run quality control before detailed analysis
- **Document everything**: Save commands for reproducibility
- **Reference documentation**: Use comprehensive references for detailed guidance


@@ -0,0 +1,58 @@
# deepTools Quick Reference
## Most Common Commands
### BAM to bigWig (normalized)
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC --effectiveGenomeSize 2913022398 \
--binSize 10 --numberOfProcessors 8
```
### Compare two BAM files
```bash
bamCompare -b1 treatment.bam -b2 control.bam -o ratio.bw \
--operation log2 --scaleFactorsMethod readCount
```
### Correlation heatmap
```bash
multiBamSummary bins --bamfiles *.bam -o counts.npz
plotCorrelation -in counts.npz --corMethod pearson \
--whatToShow heatmap -o correlation.png
```
### Heatmap around TSS
```bash
computeMatrix reference-point -S signal.bw -R genes.bed \
-b 3000 -a 3000 --referencePoint TSS -o matrix.gz
plotHeatmap -m matrix.gz -o heatmap.png
```
### ChIP enrichment check
```bash
plotFingerprint -b input.bam chip.bam -o fingerprint.png \
--extendReads 200 --ignoreDuplicates
```
## Effective Genome Sizes
| Organism | Assembly | Size |
|----------|----------|------|
| Human | hg38 | 2913022398 |
| Mouse | mm10 | 2652783500 |
| Fly | dm6 | 142573017 |
## Common Normalization Methods
- **RPGC**: 1× genome coverage (requires --effectiveGenomeSize)
- **CPM**: Counts per million (for fixed bins)
- **RPKM**: Reads per kb per million (for genes)
## Typical Workflow
1. **QC**: plotFingerprint, plotCorrelation
2. **Coverage**: bamCoverage with normalization
3. **Comparison**: bamCompare for treatment vs control
4. **Visualization**: computeMatrix → plotHeatmap/plotProfile


@@ -0,0 +1,116 @@
# Effective Genome Sizes
## Definition
Effective genome size refers to the length of the "mappable" genome - regions that can be uniquely mapped by sequencing reads. This metric is crucial for proper normalization in many deepTools commands.
## Why It Matters
- Required for RPGC normalization (`--normalizeUsing RPGC`)
- Affects accuracy of coverage calculations
- Must match your data processing approach (filtered vs unfiltered reads)
## Calculation Methods
1. **Non-N bases**: Count of non-N nucleotides in genome sequence
2. **Unique mappability**: Regions of specific size that can be uniquely mapped (may consider edit distance)
## Common Organism Values
### Using Non-N Bases Method
| Organism | Assembly | Effective Size | Full Command |
|----------|----------|----------------|--------------|
| Human | GRCh38/hg38 | 2,913,022,398 | `--effectiveGenomeSize 2913022398` |
| Human | GRCh37/hg19 | 2,864,785,220 | `--effectiveGenomeSize 2864785220` |
| Mouse | GRCm39/mm39 | 2,654,621,837 | `--effectiveGenomeSize 2654621837` |
| Mouse | GRCm38/mm10 | 2,652,783,500 | `--effectiveGenomeSize 2652783500` |
| Zebrafish | GRCz11 | 1,368,780,147 | `--effectiveGenomeSize 1368780147` |
| *Drosophila* | dm6 | 142,573,017 | `--effectiveGenomeSize 142573017` |
| *C. elegans* | WBcel235/ce11 | 100,286,401 | `--effectiveGenomeSize 100286401` |
| *C. elegans* | ce10 | 100,258,171 | `--effectiveGenomeSize 100258171` |
### Human (GRCh38) by Read Length
For quality-filtered reads, values vary by read length:
| Read Length | Effective Size |
|-------------|----------------|
| 50bp | ~2.7 billion |
| 75bp | ~2.8 billion |
| 100bp | ~2.8 billion |
| 150bp | ~2.9 billion |
| 250bp | ~2.9 billion |
### Mouse (GRCm38) by Read Length
| Read Length | Effective Size |
|-------------|----------------|
| 50bp | ~2.3 billion |
| 75bp | ~2.5 billion |
| 100bp | ~2.6 billion |
## Usage in deepTools
The effective genome size is most commonly used with:
### bamCoverage with RPGC normalization
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398
```
### bamCompare with RPGC normalization
```bash
bamCompare -b1 treatment.bam -b2 control.bam \
--outFileName comparison.bw \
--scaleFactorsMethod RPGC \
--effectiveGenomeSize 2913022398
```
### computeGCBias / correctGCBias
```bash
computeGCBias --bamfile input.bam \
--effectiveGenomeSize 2913022398 \
--genome genome.2bit \
--fragmentLength 200 \
--biasPlot bias.png
```
## Choosing the Right Value
**For most analyses:** Use the non-N bases method value for your reference genome
**For filtered data:** If you apply strict quality filters or remove multimapping reads, consider using the read-length-specific values
**When unsure:** Use the conservative non-N bases value - it's more widely applicable
## Common Shortcuts
deepTools also accepts these shorthand values in some contexts:
- `hs` or `GRCh38`: 2913022398
- `mm` or `GRCm38`: 2652783500
- `dm` or `dm6`: 142573017
- `ce` or `ce11`: 100286401
Check your specific deepTools version documentation for supported shortcuts.
## Calculating Custom Values
For custom genomes or assemblies, calculate the non-N bases count:
```bash
# Using faCount (UCSC tools)
faCount genome.fa | grep "total" | awk '{print $2-$7}'
# Using seqtk: sum the A, C, G, T counts (columns 3-6 of `seqtk comp`)
seqtk comp genome.fa | awk '{x+=$3+$4+$5+$6}END{print x}'
```
## References
For the most up-to-date effective genome sizes and detailed calculation methods, see:
- deepTools documentation: https://deeptools.readthedocs.io/en/latest/content/feature/effectiveGenomeSize.html
- ENCODE documentation for reference genome details


@@ -0,0 +1,410 @@
# deepTools Normalization Methods
This document explains the various normalization methods available in deepTools and when to use each one.
## Why Normalize?
Normalization is essential for:
1. **Comparing samples with different sequencing depths**
2. **Accounting for library size differences**
3. **Making coverage values interpretable across experiments**
4. **Enabling fair comparisons between conditions**
Without normalization, a sample with 100 million reads will appear to have higher coverage than a sample with 50 million reads, even if the true biological signal is identical.
---
## Available Normalization Methods
### 1. RPKM (Reads Per Kilobase per Million mapped reads)
**Formula:** `(Number of reads) / (Length of region in kb × Total mapped reads in millions)`
**When to use:**
- Comparing different genomic regions within the same sample
- Adjusting for both sequencing depth AND region length
- RNA-seq gene expression analysis
**Available in:** `bamCoverage`
**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPKM
```
**Interpretation:** RPKM of 10 means 10 reads per kilobase of feature per million mapped reads.
**Pros:**
- Accounts for both region length and library size
- Widely used and understood in genomics
**Cons:**
- Not ideal for comparing between samples if total RNA content differs
- Can be misleading when comparing samples with very different compositions
---
### 2. CPM (Counts Per Million mapped reads)
**Formula:** `(Number of reads) / (Total mapped reads in millions)`
**Also known as:** RPM (Reads Per Million)
**When to use:**
- Comparing the same genomic regions across different samples
- When region length is constant or not relevant
- ChIP-seq, ATAC-seq, DNase-seq analyses
**Available in:** `bamCoverage`, `bamCompare`
**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing CPM
```
**Interpretation:** CPM of 5 means 5 reads per million mapped reads in that bin.
**Pros:**
- Simple and intuitive
- Good for comparing samples with different sequencing depths
- Appropriate when comparing fixed-size bins
**Cons:**
- Does not account for region length
- Affected by highly abundant regions (e.g., rRNA in RNA-seq)
---
### 3. BPM (Bins Per Million mapped reads)
**Formula:** `(Number of reads in bin) / (Sum of all reads in bins in millions)`
**Key difference from CPM:** Only considers reads that fall within the analyzed bins, not all mapped reads.
**When to use:**
- Similar to CPM, but when you want to exclude reads outside analyzed regions
- Comparing specific genomic regions while ignoring background
**Available in:** `bamCoverage`, `bamCompare`
**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing BPM
```
**Interpretation:** BPM accounts only for reads in the binned regions.
**Pros:**
- Focuses normalization on analyzed regions
- Less affected by reads in unanalyzed areas
**Cons:**
- Less commonly used, may be harder to compare with published data
---
### 4. RPGC (Reads Per Genomic Content)
**Formula:** `(Reads per bin) / Scaling factor`, where the scaling factor is `(Total mapped reads × Fragment length) / Effective genome size`
**Result:** Genome-wide average coverage is scaled to 1× (1 read per base)
**When to use:**
- Want comparable coverage values across samples
- Need interpretable absolute coverage values
- Comparing samples with very different total read counts
- ChIP-seq with spike-in normalization context
**Available in:** `bamCoverage`, `bamCompare`
**Requires:** `--effectiveGenomeSize` parameter
**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398
```
**Interpretation:** Signal value approximates the coverage depth (e.g., value of 2 ≈ 2× coverage).
**Pros:**
- Produces 1× normalized coverage
- Interpretable in terms of genomic coverage
- Good for comparing samples with different sequencing depths
**Cons:**
- Requires knowing effective genome size
- Assumes uniform coverage (not true for ChIP-seq with peaks)
---
### 5. None (No Normalization)
**Formula:** Raw read counts
**When to use:**
- Preliminary analysis
- When samples have identical library sizes (rare)
- When downstream tool will perform normalization
- Debugging or quality control
**Available in:** All tools (usually default)
**Example:**
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing None
```
**Interpretation:** Raw read counts per bin.
**Pros:**
- No assumptions made
- Useful for seeing raw data
- Fastest computation
**Cons:**
- Cannot fairly compare samples with different sequencing depths
- Not suitable for publication figures
---
### 6. SES (Signal Extraction Scaling)
**Method:** Signal extraction scaling, a more sophisticated scaling approach for comparing ChIP to control
**When to use:**
- ChIP-seq analysis with bamCompare
- Want sophisticated background correction
- Alternative to simple readCount scaling
**Available in:** `bamCompare` only
**Example:**
```bash
bamCompare -b1 chip.bam -b2 input.bam -o output.bw \
--scaleFactorsMethod SES
```
**Note:** SES is specifically designed for ChIP-seq data and may work better than simple read count scaling for noisy data.
---
### 7. readCount (Read Count Scaling)
**Method:** Scale by ratio of total read counts between samples
**When to use:**
- Default for `bamCompare`
- Compensating for sequencing depth differences in comparisons
- When you trust that total read counts reflect library size
**Available in:** `bamCompare`
**Example:**
```bash
bamCompare -b1 treatment.bam -b2 control.bam -o output.bw \
--scaleFactorsMethod readCount
```
**How it works:** If sample1 has 100M reads and sample2 has 50M reads, sample2 is scaled by 2× before comparison.
---
## Normalization Method Selection Guide
### For ChIP-seq Coverage Tracks
**Recommended:** RPGC or CPM
```bash
bamCoverage --bam chip.bam --outFileName chip.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--extendReads 200 \
--ignoreDuplicates
```
**Reasoning:** Accounts for sequencing depth differences; RPGC provides interpretable coverage values.
---
### For ChIP-seq Comparisons (Treatment vs Control)
**Recommended:** log2 ratio with readCount or SES scaling
```bash
bamCompare -b1 chip.bam -b2 input.bam -o ratio.bw \
--operation log2 \
--scaleFactorsMethod readCount \
--extendReads 200 \
--ignoreDuplicates
```
**Reasoning:** Log2 ratio shows enrichment (positive) and depletion (negative); readCount adjusts for depth.
---
### For RNA-seq Coverage Tracks
**Recommended:** CPM or RPKM
```bash
# Strand-specific forward
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
--normalizeUsing CPM \
--filterRNAstrand forward
# For gene-level: RPKM accounts for gene length
bamCoverage --bam rnaseq.bam --outFileName output.bw \
--normalizeUsing RPKM
```
**Reasoning:** CPM for comparing fixed-width bins; RPKM for genes (accounts for length).
---
### For ATAC-seq
**Recommended:** RPGC or CPM
```bash
bamCoverage --bam atac_shifted.bam --outFileName atac.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398
```
**Reasoning:** Similar to ChIP-seq; want comparable coverage across samples.
---
### For Sample Correlation Analysis
**Recommended:** CPM or RPGC
```bash
multiBamSummary bins \
--bamfiles sample1.bam sample2.bam sample3.bam \
-o readCounts.npz
plotCorrelation -in readCounts.npz \
--corMethod pearson \
--whatToShow heatmap \
-o correlation.png
```
**Note:** `multiBamSummary` doesn't explicitly normalize, but correlation analysis is robust to scaling. For very different library sizes, consider normalizing BAM files first or using CPM-normalized bigWig files with `multiBigwigSummary`.
---
## Advanced Normalization Considerations
### Spike-in Normalization
For experiments with spike-in controls (e.g., *Drosophila* chromatin spike-in for ChIP-seq):
1. Calculate scaling factors from spike-in reads
2. Apply custom scaling factors using `--scaleFactor` parameter
```bash
# Calculate spike-in factor (example: 0.8)
SCALE_FACTOR=0.8
bamCoverage --bam chip.bam --outFileName chip_spikenorm.bw \
--scaleFactor ${SCALE_FACTOR} \
--extendReads 200
```
---
### Manual Scaling Factors
You can apply custom scaling factors:
```bash
# Apply 2× scaling
bamCoverage --bam input.bam --outFileName output.bw \
--scaleFactor 2.0
```
---
### Chromosome Exclusion
Exclude specific chromosomes from normalization calculations:
```bash
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--ignoreForNormalization chrX chrY chrM
```
**When to use:** Sex chromosomes in mixed-sex samples, mitochondrial DNA, or chromosomes with unusual coverage.
---
## Common Pitfalls
### 1. Using RPKM for bin-based data
**Problem:** RPKM accounts for region length, but all bins are the same size
**Solution:** Use CPM or RPGC instead
### 2. Comparing unnormalized samples
**Problem:** Sample with 2× sequencing depth appears to have 2× signal
**Solution:** Always normalize when comparing samples
### 3. Wrong effective genome size
**Problem:** Using hg19 genome size for hg38 data
**Solution:** Double-check genome assembly and use correct size
### 4. Ignoring duplicates after GC correction
**Problem:** Can introduce bias
**Solution:** Never use `--ignoreDuplicates` after `correctGCBias`
### 5. Using RPGC without effective genome size
**Problem:** Command fails
**Solution:** Always specify `--effectiveGenomeSize` with RPGC
---
## Normalization for Different Comparisons
### Within-sample comparisons (different regions)
**Use:** RPKM (accounts for region length)
### Between-sample comparisons (same regions)
**Use:** CPM, RPGC, or BPM (accounts for library size)
### Treatment vs Control
**Use:** bamCompare with log2 ratio and readCount/SES scaling
### Multiple samples correlation
**Use:** CPM or RPGC normalized bigWig files, then multiBigwigSummary
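A sketch of this last case, assuming CPM-normalized bigWig files have already been generated (file names are placeholders):
```bash
multiBigwigSummary bins --bwfiles sample1.CPM.bw sample2.CPM.bw sample3.CPM.bw \
    -o bw_counts.npz
plotCorrelation -in bw_counts.npz --corMethod spearman \
    --whatToShow heatmap -o bw_correlation.png --plotNumbers
```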
---
## Quick Reference Table
| Method | Accounts for Depth | Accounts for Length | Best For | Command |
|--------|-------------------|---------------------|----------|---------|
| RPKM | ✓ | ✓ | RNA-seq genes | `--normalizeUsing RPKM` |
| CPM | ✓ | ✗ | Fixed-size bins | `--normalizeUsing CPM` |
| BPM | ✓ | ✗ | Specific regions | `--normalizeUsing BPM` |
| RPGC | ✓ | ✗ | Interpretable coverage | `--normalizeUsing RPGC --effectiveGenomeSize X` |
| None | ✗ | ✗ | Raw data | `--normalizeUsing None` |
| SES | ✓ | ✗ | ChIP comparisons | `bamCompare --scaleFactorsMethod SES` |
| readCount | ✓ | ✗ | ChIP comparisons | `bamCompare --scaleFactorsMethod readCount` |
---
## Further Reading
For more details on normalization theory and best practices:
- deepTools documentation: https://deeptools.readthedocs.io/
- ENCODE guidelines for ChIP-seq analysis
- RNA-seq normalization papers (DESeq2, TMM methods)


@@ -0,0 +1,533 @@
# deepTools Complete Tool Reference
This document provides a comprehensive reference for all deepTools command-line utilities organized by category.
## BAM and bigWig File Processing Tools
### multiBamSummary
Computes read coverages for genomic regions across multiple BAM files, outputting compressed numpy arrays for downstream correlation and PCA analysis.
**Modes:**
- **bins**: Genome-wide analysis using consecutive equal-sized windows (default 10kb)
- **BED-file**: Restricts analysis to user-specified genomic regions
**Key Parameters:**
- `--bamfiles, -b`: Indexed BAM files (space-separated, required)
- `--outFileName, -o`: Output coverage matrix file (required)
- `--BED`: Region specification file (BED-file mode only)
- `--binSize`: Window size in bases (default: 10,000)
- `--labels`: Custom sample identifiers
- `--minMappingQuality`: Quality threshold for read inclusion
- `--numberOfProcessors, -p`: Parallel processing cores
- `--extendReads`: Fragment size extension
- `--ignoreDuplicates`: Remove PCR duplicates
- `--outRawCounts`: Export tab-delimited file with coordinate columns and per-sample counts
**Output:** Compressed numpy array (.npz) for plotCorrelation and plotPCA
**Common Usage:**
```bash
# Genome-wide comparison
multiBamSummary bins --bamfiles sample1.bam sample2.bam -o results.npz
# Peak region comparison
multiBamSummary BED-file --BED peaks.bed --bamfiles sample1.bam sample2.bam -o results.npz
```
---
### multiBigwigSummary
Similar to multiBamSummary but operates on bigWig files instead of BAM files. Used for comparing coverage tracks across samples.
**Modes:**
- **bins**: Genome-wide analysis
- **BED-file**: Region-specific analysis
**Key Parameters:** Similar to multiBamSummary but accepts bigWig files
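**Common Usage** (a minimal sketch; file names are placeholders):
```bash
# Genome-wide comparison of coverage tracks
multiBigwigSummary bins --bwfiles sample1.bw sample2.bw -o bw_results.npz
# Restrict the comparison to peak regions
multiBigwigSummary BED-file --BED peaks.bed --bwfiles sample1.bw sample2.bw -o bw_peaks.npz
```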
---
### bamCoverage
Converts BAM alignment files into normalized coverage tracks in bigWig or bedGraph formats. Calculates coverage as number of reads per bin.
**Key Parameters:**
- `--bam, -b`: Input BAM file (required)
- `--outFileName, -o`: Output filename (required)
- `--outFileFormat, -of`: Output type (bigwig or bedgraph)
- `--normalizeUsing`: Normalization method
- **RPKM**: Reads Per Kilobase per Million mapped reads
- **CPM**: Counts Per Million mapped reads
- **BPM**: Bins Per Million mapped reads
- **RPGC**: Reads per genomic content (requires --effectiveGenomeSize)
- **None**: No normalization (default)
- `--effectiveGenomeSize`: Mappable genome size (required for RPGC)
- `--binSize`: Resolution in base pairs (default: 50)
- `--extendReads, -e`: Extend reads to fragment length (recommended for ChIP-seq, NOT for RNA-seq)
- `--centerReads`: Center reads at fragment length for sharper signals
- `--ignoreDuplicates`: Count identical reads only once
- `--minMappingQuality`: Filter reads below quality threshold
- `--minFragmentLength / --maxFragmentLength`: Fragment length filtering
- `--smoothLength`: Window averaging for noise reduction
- `--MNase`: Analyze MNase-seq data for nucleosome positioning
- `--Offset`: Position-specific offsets (useful for RiboSeq, GROseq)
- `--filterRNAstrand`: Separate forward/reverse strand reads
- `--ignoreForNormalization`: Exclude chromosomes from normalization (e.g., sex chromosomes)
- `--numberOfProcessors, -p`: Parallel processing
**Important Notes:**
- For RNA-seq: Do NOT use --extendReads (would extend over splice junctions)
- For ChIP-seq: Use --extendReads with smaller bin sizes
- Never apply --ignoreDuplicates after GC bias correction
**Common Usage:**
```bash
# Basic coverage with RPKM normalization
bamCoverage --bam input.bam --outFileName coverage.bw --normalizeUsing RPKM
# ChIP-seq with extension
bamCoverage --bam chip.bam --outFileName chip_coverage.bw \
--binSize 10 --extendReads 200 --ignoreDuplicates
# Strand-specific RNA-seq
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
--filterRNAstrand forward
```
---
### bamCompare
Compares two BAM files by generating bigWig or bedGraph files, normalizing for sequencing depth differences. Processes genome in equal-sized bins and performs per-bin calculations.
**Comparison Methods:**
- **log2** (default): Log2 ratio of samples
- **ratio**: Direct ratio calculation
- **subtract**: Difference between files
- **add**: Sum of samples
- **mean**: Average across samples
- **reciprocal_ratio**: Reports the ratio when ≥ 1 and the negative inverse when < 1, giving symmetric values
- **first/second**: Output scaled signal from single file
**Normalization Methods:**
- **readCount** (default): Compensates for sequencing depth
- **SES**: Signal extraction scaling
- **RPKM**: Reads per kilobase per million
- **CPM**: Counts per million
- **BPM**: Bins per million
- **RPGC**: Reads per genomic content (requires --effectiveGenomeSize)
**Key Parameters:**
- `--bamfile1, -b1`: First BAM file (required)
- `--bamfile2, -b2`: Second BAM file (required)
- `--outFileName, -o`: Output filename (required)
- `--outFileFormat`: bigwig or bedgraph
- `--operation`: Comparison method (see above)
- `--scaleFactorsMethod`: Normalization method (see above)
- `--binSize`: Bin width for output (default: 50bp)
- `--pseudocount`: Avoid division by zero (default: 1)
- `--extendReads`: Extend reads to fragment length
- `--ignoreDuplicates`: Count identical reads once
- `--minMappingQuality`: Quality threshold
- `--numberOfProcessors, -p`: Parallelization
**Common Usage:**
```bash
# Log2 ratio of treatment vs control
bamCompare -b1 treatment.bam -b2 control.bam -o log2ratio.bw
# Subtract control from treatment
bamCompare -b1 treatment.bam -b2 control.bam -o difference.bw \
--operation subtract --scaleFactorsMethod readCount
```
---
### correctGCBias / computeGCBias
**computeGCBias:** Identifies GC-content bias from sequencing and PCR amplification.
**correctGCBias:** Corrects BAM files for GC bias detected by computeGCBias.
**Key Parameters (computeGCBias):**
- `--bamfile, -b`: Input BAM file
- `--effectiveGenomeSize`: Mappable genome size
- `--genome, -g`: Reference genome in 2bit format
- `--fragmentLength, -l`: Fragment length (for single-end)
- `--biasPlot`: Output diagnostic plot
**Key Parameters (correctGCBias):**
- `--bamfile, -b`: Input BAM file
- `--effectiveGenomeSize`: Mappable genome size
- `--genome, -g`: Reference genome in 2bit format
- `--GCbiasFrequenciesFile`: Frequencies from computeGCBias
- `--correctedFile, -o`: Output corrected BAM
**Important:** Never use --ignoreDuplicates after GC bias correction
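**Common Usage** (a sketch; file names are placeholders, and correction should only be applied when the bias plot shows substantial skew):
```bash
# Measure GC bias
computeGCBias --bamfile chip.bam --effectiveGenomeSize 2913022398 \
    --genome genome.2bit --fragmentLength 200 \
    --biasPlot gc_bias.png --GCbiasFrequenciesFile freq.txt
# Correct the BAM file using the measured frequencies
correctGCBias --bamfile chip.bam --effectiveGenomeSize 2913022398 \
    --genome genome.2bit --GCbiasFrequenciesFile freq.txt \
    --correctedFile chip_gc_corrected.bam
```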
---
### alignmentSieve
Filters BAM files by various quality metrics on-the-fly. Useful for creating filtered BAM files for specific analyses.
**Key Parameters:**
- `--bam, -b`: Input BAM file
- `--outFile, -o`: Output BAM file
- `--minMappingQuality`: Minimum mapping quality
- `--ignoreDuplicates`: Remove duplicates
- `--minFragmentLength / --maxFragmentLength`: Fragment length filters
- `--samFlagInclude / --samFlagExclude`: SAM flag filtering
- `--shift`: Shift reads (e.g., for ATACseq Tn5 correction)
- `--ATACshift`: Automatically shift for ATAC-seq data
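**Common Usage** (a sketch; file names and thresholds are placeholders):
```bash
# Write a reusable filtered BAM, then index it for downstream tools
alignmentSieve --bam sample.bam --outFile sample.filtered.bam \
    --minMappingQuality 10 --ignoreDuplicates \
    --minFragmentLength 40 --maxFragmentLength 2000
samtools index sample.filtered.bam
# Apply the Tn5 offset correction for ATAC-seq data
alignmentSieve --bam atac.bam --outFile atac_shifted.bam --ATACshift
```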
---
### computeMatrix
Calculates scores per genomic region and prepares matrices for plotHeatmap and plotProfile. Processes bigWig score files and BED/GTF region files.
**Modes:**
- **reference-point**: Signal distribution relative to specific position (TSS, TES, or center)
- **scale-regions**: Signal across regions standardized to uniform lengths
**Key Parameters:**
- `-R`: Region file(s) in BED/GTF format (required)
- `-S`: BigWig score file(s) (required)
- `-o`: Output matrix file (required)
- `-b`: Upstream distance from reference point
- `-a`: Downstream distance from reference point
- `-m`: Region body length (scale-regions only)
- `-bs, --binSize`: Bin size for averaging scores
- `--skipZeros`: Skip regions with all zeros
- `--minThreshold / --maxThreshold`: Filter by signal intensity
- `--sortRegions`: descend, ascend, keep, no
- `--sortUsing`: mean, median, max, min, sum, region_length
- `-p, --numberOfProcessors`: Parallel processing
- `--averageTypeBins`: Statistical method (mean, median, min, max, sum, std)
**Output Options:**
- `--outFileNameMatrix`: Export tab-delimited data
- `--outFileSortedRegions`: Save filtered/sorted BED file
**Common Usage:**
```bash
# TSS analysis
computeMatrix reference-point -S signal.bw -R genes.bed \
-o matrix.gz -b 2000 -a 2000 --referencePoint TSS
# Scaled gene body
computeMatrix scale-regions -S signal.bw -R genes.bed \
-o matrix.gz -b 1000 -a 1000 -m 3000
```
---
## Quality Control Tools
### plotFingerprint
Quality control tool primarily for ChIP-seq experiments. Assesses whether antibody enrichment was successful. Generates cumulative read coverage profiles to distinguish signal from noise.
**Key Parameters:**
- `--bamfiles, -b`: Indexed BAM files (required)
- `--plotFile, -plot, -o`: Output image filename (required)
- `--extendReads, -e`: Extend reads to fragment length
- `--ignoreDuplicates`: Count identical reads once
- `--minMappingQuality`: Mapping quality filter
- `--centerReads`: Center reads at fragment length
- `--minFragmentLength / --maxFragmentLength`: Fragment filters
- `--outRawCounts`: Save per-bin read counts
- `--outQualityMetrics`: Output QC metrics (Jensen-Shannon distance)
- `--labels`: Custom sample names
- `--numberOfProcessors, -p`: Parallel processing
**Interpretation:**
- Ideal control: Straight diagonal line
- Strong ChIP: Steep rise towards highest rank (concentrated reads in few bins)
- Weak enrichment: Flatter curve approaching diagonal
**Common Usage:**
```bash
plotFingerprint -b input.bam chip1.bam chip2.bam \
--labels Input ChIP1 ChIP2 -o fingerprint.png \
--extendReads 200 --ignoreDuplicates
```
---
### plotCoverage
Visualizes average read distribution across the genome. Shows genome coverage and helps determine if sequencing depth is adequate.
**Key Parameters:**
- `--bamfiles, -b`: BAM files to analyze (required)
- `--plotFile, -o`: Output plot filename (required)
- `--ignoreDuplicates`: Remove PCR duplicates
- `--minMappingQuality`: Quality threshold
- `--outRawCounts`: Save underlying data
- `--labels`: Sample names
- `--numberOfSamples`: Number of positions to sample (default: 1,000,000)
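**Common Usage** (a minimal sketch; file names are placeholders):
```bash
plotCoverage --bamfiles sample1.bam sample2.bam \
    --labels Sample1 Sample2 \
    --plotFile coverage.png \
    --ignoreDuplicates --numberOfProcessors 8
```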
---
### bamPEFragmentSize
Determines fragment length distribution for paired-end sequencing data. Essential QC to verify expected fragment sizes from library preparation.
**Key Parameters:**
- `--bamfiles, -b`: BAM files (required)
- `--histogram, -hist`: Output histogram filename (required)
- `--plotTitle, -T`: Plot title
- `--maxFragmentLength`: Maximum length to consider (default: 1000)
- `--logScale`: Use logarithmic Y-axis
- `--outRawFragmentLengths`: Save raw fragment lengths
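**Common Usage** (a minimal sketch; file names are placeholders):
```bash
bamPEFragmentSize --bamfiles sample1.bam sample2.bam \
    --histogram fragment_sizes.png \
    --plotTitle "Fragment Size Distribution" \
    --maxFragmentLength 1000
```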
---
### plotCorrelation
Analyzes sample correlations from multiBamSummary or multiBigwigSummary outputs. Shows how similar different samples are.
**Correlation Methods:**
- **Pearson**: Measures metric differences; sensitive to outliers; appropriate for normally distributed data
- **Spearman**: Rank-based; less influenced by outliers; better for non-normal distributions
**Visualization Options:**
- **heatmap**: Color intensity with hierarchical clustering (complete linkage)
- **scatterplot**: Pairwise scatter plots with correlation coefficients
**Key Parameters:**
- `--corData, -in`: Input matrix from multiBamSummary/multiBigwigSummary (required)
- `--corMethod`: pearson or spearman (required)
- `--whatToShow`: heatmap or scatterplot (required)
- `--plotFile, -o`: Output filename (required)
- `--skipZeros`: Exclude zero-value regions
- `--removeOutliers`: Use median absolute deviation (MAD) filtering
- `--outFileCorMatrix`: Export correlation matrix
- `--labels`: Custom sample names
- `--plotTitle`: Plot title
- `--colorMap`: Color scheme (50+ options)
- `--plotNumbers`: Display correlation values on heatmap
**Common Usage:**
```bash
# Heatmap with Pearson correlation
plotCorrelation -in readCounts.npz --corMethod pearson \
--whatToShow heatmap -o correlation_heatmap.png --plotNumbers
# Scatterplot with Spearman correlation
plotCorrelation -in readCounts.npz --corMethod spearman \
--whatToShow scatterplot -o correlation_scatter.png
```
---
### plotPCA
Generates principal component analysis plots from multiBamSummary or multiBigwigSummary output. Displays sample relationships in reduced dimensionality.
**Key Parameters:**
- `--corData, -in`: Coverage file from multiBamSummary/multiBigwigSummary (required)
- `--plotFile, -o`: Output image (png, eps, pdf, svg) (required)
- `--outFileNameData`: Export PCA data (loadings/rotation and eigenvalues)
- `--labels, -l`: Custom sample labels
- `--plotTitle, -T`: Plot title
- `--plotHeight / --plotWidth`: Dimensions in centimeters
- `--colors`: Custom symbol colors
- `--markers`: Symbol shapes
- `--transpose`: Perform PCA on transposed matrix (rows=samples)
- `--ntop`: Use top N variable rows (default: 1000)
- `--PCs`: Components to plot (default: 1 2)
- `--log2`: Log2-transform data before analysis
- `--rowCenter`: Center each row at 0
**Common Usage:**
```bash
plotPCA -in readCounts.npz -o PCA_plot.png \
-T "PCA of read counts" --transpose
```
---
## Visualization Tools
### plotHeatmap
Creates genomic region heatmaps from computeMatrix output. Generates publication-quality visualizations.
**Key Parameters:**
- `--matrixFile, -m`: Matrix from computeMatrix (required)
- `--outFileName, -o`: Output image (png, eps, pdf, svg) (required)
- `--outFileSortedRegions`: Save regions after filtering
- `--outFileNameMatrix`: Export matrix values
- `--interpolationMethod`: auto, nearest, bilinear, bicubic, gaussian
- Default: nearest (≤1000 columns), bilinear (>1000 columns)
- `--dpi`: Figure resolution
**Clustering:**
- `--kmeans`: k-means clustering
- `--hclust`: Hierarchical clustering (slower for >1000 regions)
- `--silhouette`: Calculate cluster quality metrics
**Visual Customization:**
- `--heatmapHeight / --heatmapWidth`: Dimensions (3-100 cm)
- `--whatToShow`: plot, heatmap, colorbar (combinations)
- `--alpha`: Transparency (0-1)
- `--colorMap`: 50+ color schemes
- `--colorList`: Custom gradient colors
- `--zMin / --zMax`: Intensity scale limits
- `--boxAroundHeatmaps`: yes/no (default: yes)
**Labels:**
- `--xAxisLabel / --yAxisLabel`: Axis labels
- `--regionsLabel`: Region set identifiers
- `--samplesLabel`: Sample names
- `--refPointLabel`: Reference point label
- `--startLabel / --endLabel`: Region boundary labels
**Common Usage:**
```bash
# Basic heatmap
plotHeatmap -m matrix.gz -o heatmap.png
# With clustering and custom colors
plotHeatmap -m matrix.gz -o heatmap.png \
--kmeans 3 --colorMap RdBu --zMin -3 --zMax 3
```
---
### plotProfile
Generates profile plots showing scores across genomic regions using computeMatrix output.
**Key Parameters:**
- `--matrixFile, -m`: Matrix from computeMatrix (required)
- `--outFileName, -o`: Output image (png, eps, pdf, svg) (required)
- `--plotType`: lines, fill, se, std, overlapped_lines, heatmap
- `--colors`: Color palette (names or hex codes)
- `--plotHeight / --plotWidth`: Dimensions in centimeters
- `--yMin / --yMax`: Y-axis range
- `--averageType`: mean, median, min, max, std, sum
**Clustering:**
- `--kmeans`: k-means clustering
- `--hclust`: Hierarchical clustering
- `--silhouette`: Cluster quality metrics
**Labels:**
- `--plotTitle`: Main heading
- `--regionsLabel`: Region set identifiers
- `--samplesLabel`: Sample names
- `--startLabel / --endLabel`: Region boundary labels (scale-regions mode)
**Output Options:**
- `--outFileNameData`: Export data as tab-separated values
- `--outFileSortedRegions`: Save filtered/sorted regions as BED
**Common Usage:**
```bash
# Line plot
plotProfile -m matrix.gz -o profile.png --plotType lines
# With standard error shading
plotProfile -m matrix.gz -o profile.png --plotType se \
--colors blue red green
```
---
### plotEnrichment
Calculates and visualizes signal enrichment across genomic regions. Measures the percentage of alignments overlapping region groups. Useful for FRiP (Fraction of Reads in Peaks) scores.
**Key Parameters:**
- `--bamfiles, -b`: Indexed BAM files (required)
- `--BED`: Region files in BED/GTF format (required)
- `--plotFile, -o`: Output visualization (png, pdf, eps, svg)
- `--labels, -l`: Custom sample identifiers
- `--outRawCounts`: Export numerical data
- `--perSample`: Group plots by sample rather than by region file (grouping by region file is the default)
- `--regionLabels`: Custom region names
**Read Processing:**
- `--minFragmentLength / --maxFragmentLength`: Fragment filters
- `--minMappingQuality`: Quality threshold
- `--samFlagInclude / --samFlagExclude`: SAM flag filters
- `--ignoreDuplicates`: Remove duplicates
- `--centerReads`: Center reads for sharper signal
**Common Usage:**
```bash
plotEnrichment -b Input.bam H3K4me3.bam \
--BED peaks_up.bed peaks_down.bed \
--regionLabels "Up regulated" "Down regulated" \
-o enrichment.png
```
---
## Miscellaneous Tools
### computeMatrixOperations
Advanced matrix manipulation tool for combining or subsetting matrices from computeMatrix. Enables complex multi-sample, multi-region analyses.
**Operations:**
- `cbind`: Combine matrices column-wise
- `rbind`: Combine matrices row-wise
- `subset`: Extract specific samples or regions
- `filterStrand`: Keep only regions on specific strand
- `filterValues`: Apply signal intensity filters
- `sort`: Order regions by various criteria
- `dataRange`: Report min/max values
**Common Usage:**
```bash
# Combine matrices
computeMatrixOperations cbind -m matrix1.gz matrix2.gz -o combined.gz
# Extract specific samples
computeMatrixOperations subset -m matrix.gz --samples 0 2 -o subset.gz
```
---
### estimateReadFiltering
Predicts the impact of various filtering parameters without actually filtering. Helps optimize filtering strategies before running full analyses.
**Key Parameters:**
- `--bamfiles, -b`: BAM files to analyze
- `--sampleSize`: Number of reads to sample (default: 100,000)
- `--binSize`: Bin size for analysis
- `--distanceBetweenBins`: Spacing between sampled bins
**Filtration Options to Test:**
- `--minMappingQuality`: Test quality thresholds
- `--ignoreDuplicates`: Assess duplicate impact
- `--minFragmentLength / --maxFragmentLength`: Test fragment filters
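**Common Usage** (a minimal sketch; file names and thresholds are placeholders; the summary table prints to the terminal by default, so it is redirected to a file here):
```bash
estimateReadFiltering --bamfiles sample1.bam sample2.bam \
    --minMappingQuality 10 --ignoreDuplicates \
    > filtering_estimates.txt
```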
---
## Common Parameters Across Tools
Many deepTools commands share these filtering and performance options:
**Read Filtering:**
- `--ignoreDuplicates`: Remove PCR duplicates
- `--minMappingQuality`: Filter by alignment confidence
- `--samFlagInclude / --samFlagExclude`: SAM format filtering
- `--minFragmentLength / --maxFragmentLength`: Fragment length bounds
**Performance:**
- `--numberOfProcessors, -p`: Enable parallel processing
- `--region`: Process specific genomic regions (chr:start-end)
**Read Processing:**
- `--extendReads`: Extend to fragment length
- `--centerReads`: Center at fragment midpoint
- `--ignoreDuplicates`: Count unique reads only


@@ -0,0 +1,474 @@
# deepTools Common Workflows
This document provides complete workflow examples for common deepTools analyses.
## ChIP-seq Quality Control Workflow
Complete quality control assessment for ChIP-seq experiments.
### Step 1: Initial Correlation Assessment
Compare replicates and samples to verify experimental quality:
```bash
# Generate coverage matrix across genome
multiBamSummary bins \
--bamfiles Input1.bam Input2.bam ChIP1.bam ChIP2.bam \
--labels Input_rep1 Input_rep2 ChIP_rep1 ChIP_rep2 \
-o readCounts.npz \
--numberOfProcessors 8
# Create correlation heatmap
plotCorrelation \
-in readCounts.npz \
--corMethod pearson \
--whatToShow heatmap \
--plotFile correlation_heatmap.png \
--plotNumbers
# Generate PCA plot
plotPCA \
-in readCounts.npz \
-o PCA_plot.png \
-T "PCA of ChIP-seq samples"
```
**Expected Results:**
- Replicates should cluster together
- Input samples should be distinct from ChIP samples
---
### Step 2: Coverage and Depth Assessment
```bash
# Check sequencing depth and coverage
plotCoverage \
--bamfiles Input1.bam ChIP1.bam ChIP2.bam \
--labels Input ChIP_rep1 ChIP_rep2 \
--plotFile coverage.png \
--ignoreDuplicates \
--numberOfProcessors 8
```
**Interpretation:** Assess whether sequencing depth is adequate for downstream analysis.
---
### Step 3: Fragment Size Validation (Paired-end)
```bash
# Verify expected fragment sizes
bamPEFragmentSize \
--bamfiles Input1.bam ChIP1.bam ChIP2.bam \
--histogram fragmentSizes.png \
--plotTitle "Fragment Size Distribution"
```
**Expected Results:** Fragment sizes should match library preparation protocols (typically 200-600bp for ChIP-seq).
---
### Step 4: GC Bias Detection and Correction
```bash
# Compute GC bias
computeGCBias \
--bamfile ChIP1.bam \
--effectiveGenomeSize 2913022398 \
--genome genome.2bit \
--fragmentLength 200 \
--biasPlot GCbias.png \
  --GCbiasFrequenciesFile freq.txt
# If bias detected, correct it
correctGCBias \
--bamfile ChIP1.bam \
--effectiveGenomeSize 2913022398 \
--genome genome.2bit \
--GCbiasFrequenciesFile freq.txt \
--correctedFile ChIP1_GCcorrected.bam
```
**Note:** Only correct if significant bias is observed. Do NOT use `--ignoreDuplicates` with GC-corrected files.
---
### Step 5: ChIP Signal Strength Assessment
```bash
# Evaluate ChIP enrichment quality
plotFingerprint \
--bamfiles Input1.bam ChIP1.bam ChIP2.bam \
--labels Input ChIP_rep1 ChIP_rep2 \
--plotFile fingerprint.png \
--extendReads 200 \
--ignoreDuplicates \
--numberOfProcessors 8 \
--outQualityMetrics fingerprint_metrics.txt
```
**Interpretation:**
- Strong ChIP: Steep rise in cumulative curve
- Weak enrichment: Curve close to diagonal (input-like)
---
## ChIP-seq Analysis Workflow
Complete workflow from BAM files to publication-quality visualizations.
### Step 1: Generate Normalized Coverage Tracks
```bash
# Input control
bamCoverage \
--bam Input.bam \
--outFileName Input_coverage.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--binSize 10 \
--extendReads 200 \
--ignoreDuplicates \
--numberOfProcessors 8
# ChIP sample
bamCoverage \
--bam ChIP.bam \
--outFileName ChIP_coverage.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--binSize 10 \
--extendReads 200 \
--ignoreDuplicates \
--numberOfProcessors 8
```
---
### Step 2: Create Log2 Ratio Track
```bash
# Compare ChIP to Input
bamCompare \
--bamfile1 ChIP.bam \
--bamfile2 Input.bam \
--outFileName ChIP_vs_Input_log2ratio.bw \
--operation log2 \
--scaleFactorsMethod readCount \
--binSize 10 \
--extendReads 200 \
--ignoreDuplicates \
--numberOfProcessors 8
```
**Result:** Log2 ratio track showing enrichment (positive values) and depletion (negative values).
---
### Step 3: Compute Matrix Around TSS
```bash
# Prepare data for heatmap/profile around transcription start sites
computeMatrix reference-point \
--referencePoint TSS \
--scoreFileName ChIP_coverage.bw \
--regionsFileName genes.bed \
--beforeRegionStartLength 3000 \
--afterRegionStartLength 3000 \
--binSize 10 \
--sortRegions descend \
--sortUsing mean \
--outFileName matrix_TSS.gz \
--outFileNameMatrix matrix_TSS.tab \
--numberOfProcessors 8
```
---
### Step 4: Generate Heatmap
```bash
# Create heatmap around TSS
plotHeatmap \
--matrixFile matrix_TSS.gz \
--outFileName heatmap_TSS.png \
  --colorMap Blues \
  --whatToShow 'plot, heatmap and colorbar' \
--yAxisLabel "Genes" \
--xAxisLabel "Distance from TSS (bp)" \
--refPointLabel "TSS" \
--heatmapHeight 15 \
--kmeans 3
```
---
### Step 5: Generate Profile Plot
```bash
# Create meta-profile around TSS
plotProfile \
--matrixFile matrix_TSS.gz \
--outFileName profile_TSS.png \
--plotType lines \
--perGroup \
--colors blue \
--plotTitle "ChIP-seq signal around TSS" \
--yAxisLabel "Average signal" \
--xAxisLabel "Distance from TSS (bp)" \
--refPointLabel "TSS"
```
---
### Step 6: Enrichment at Peaks
```bash
# Calculate enrichment in peak regions
plotEnrichment \
--bamfiles Input.bam ChIP.bam \
--BED peaks.bed \
--labels Input ChIP \
--plotFile enrichment.png \
--outRawCounts enrichment_counts.tab \
--extendReads 200 \
--ignoreDuplicates
```
---
## RNA-seq Coverage Workflow
Generate strand-specific coverage tracks for RNA-seq data.
### Forward Strand
```bash
bamCoverage \
--bam rnaseq.bam \
--outFileName forward_coverage.bw \
--filterRNAstrand forward \
--normalizeUsing CPM \
--binSize 1 \
--numberOfProcessors 8
```
### Reverse Strand
```bash
bamCoverage \
--bam rnaseq.bam \
--outFileName reverse_coverage.bw \
--filterRNAstrand reverse \
--normalizeUsing CPM \
--binSize 1 \
--numberOfProcessors 8
```
**Important:** Do NOT use `--extendReads` for RNA-seq (would extend over splice junctions).
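For an unstranded library (or when a single track is sufficient), a combined track can be generated with the same normalization (minimal sketch):

```bash
bamCoverage \
  --bam rnaseq.bam \
  --outFileName combined_coverage.bw \
  --normalizeUsing CPM \
  --binSize 1 \
  --numberOfProcessors 8
```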
---
## Multi-Sample Comparison Workflow
Compare multiple ChIP-seq samples (e.g., different conditions or time points).
### Step 1: Generate Coverage Files
```bash
# For each sample
for sample in Control_ChIP Treated_ChIP; do
bamCoverage \
--bam ${sample}.bam \
--outFileName ${sample}.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--binSize 10 \
--extendReads 200 \
--ignoreDuplicates \
--numberOfProcessors 8
done
```
---
### Step 2: Compute Multi-Sample Matrix
```bash
computeMatrix scale-regions \
--scoreFileName Control_ChIP.bw Treated_ChIP.bw \
--regionsFileName genes.bed \
--beforeRegionStartLength 1000 \
--afterRegionStartLength 1000 \
--regionBodyLength 3000 \
--binSize 10 \
--sortRegions descend \
--sortUsing mean \
--outFileName matrix_multi.gz \
--numberOfProcessors 8
```
---
### Step 3: Multi-Sample Heatmap
```bash
plotHeatmap \
--matrixFile matrix_multi.gz \
--outFileName heatmap_comparison.png \
--colorMap Blues \
--whatToShow 'plot, heatmap and colorbar' \
--samplesLabel Control Treated \
--yAxisLabel "Genes" \
--heatmapHeight 15 \
--kmeans 4
```
---
### Step 4: Multi-Sample Profile
```bash
plotProfile \
--matrixFile matrix_multi.gz \
--outFileName profile_comparison.png \
--plotType lines \
--perGroup \
--colors blue red \
--samplesLabel Control Treated \
--plotTitle "ChIP-seq signal comparison" \
--startLabel "TSS" \
--endLabel "TES"
```
---
## ATAC-seq Workflow
Specialized workflow for ATAC-seq data with Tn5 offset correction.
### Step 1: Shift Reads for Tn5 Correction
```bash
alignmentSieve \
  --bam atacseq.bam \
  --outFile atacseq_shifted_unsorted.bam \
  --ATACshift \
  --minFragmentLength 38 \
  --maxFragmentLength 2000 \
  --ignoreDuplicates

# Shifted reads are no longer coordinate-sorted; sort and index before coverage calculation
samtools sort -o atacseq_shifted.bam atacseq_shifted_unsorted.bam
samtools index atacseq_shifted.bam
```
---
### Step 2: Generate Coverage Track
```bash
bamCoverage \
--bam atacseq_shifted.bam \
--outFileName atacseq_coverage.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--binSize 1 \
--numberOfProcessors 8
```
---
### Step 3: Fragment Size Analysis
```bash
bamPEFragmentSize \
--bamfiles atacseq.bam \
--histogram fragmentSizes_atac.png \
--maxFragmentLength 1000
```
**Expected Pattern:** Nucleosome ladder with peaks at ~50bp (nucleosome-free), ~200bp (mono-nucleosome), ~400bp (di-nucleosome).
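To inspect or re-plot the nucleosome ladder with other tools, the underlying fragment-length data can be exported (a sketch; `--table` and `--outRawFragmentLengths` are assumed to be available in your deepTools version):

```bash
bamPEFragmentSize \
  --bamfiles atacseq.bam \
  --histogram fragmentSizes_atac.png \
  --maxFragmentLength 1000 \
  --table fragmentSizes_summary.tsv \
  --outRawFragmentLengths fragmentLengths_raw.tsv
```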
---
## Peak Region Analysis Workflow
Analyze ChIP-seq signal specifically at peak regions.
### Step 1: Matrix at Peaks
```bash
computeMatrix reference-point \
--referencePoint center \
--scoreFileName ChIP_coverage.bw \
--regionsFileName peaks.bed \
--beforeRegionStartLength 2000 \
--afterRegionStartLength 2000 \
--binSize 10 \
--outFileName matrix_peaks.gz \
--numberOfProcessors 8
```
---
### Step 2: Heatmap at Peaks
```bash
plotHeatmap \
--matrixFile matrix_peaks.gz \
--outFileName heatmap_peaks.png \
--colorMap YlOrRd \
--refPointLabel "Peak Center" \
--heatmapHeight 15 \
--sortUsing max
```
---
## Troubleshooting Common Issues
### Issue: Out of Memory
**Solution:** Use `--region` parameter to process chromosomes individually:
```bash
bamCoverage --bam input.bam -o chr1.bw --region chr1
```
### Issue: BAM Index Missing
**Solution:** Index BAM files before running deepTools:
```bash
samtools index input.bam
```
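For a project with many BAM files, indexing can be done in one pass (simple shell loop; adjust the glob to your files):

```bash
for bam in *.bam; do
  samtools index "$bam"
done
```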
### Issue: Slow Processing
**Solution:** Increase `--numberOfProcessors`:
```bash
# Use 8 cores instead of default
--numberOfProcessors 8
```
### Issue: bigWig Files Too Large
**Solution:** Increase bin size:
```bash
--binSize 200  # larger bins produce smaller files (bamCoverage default is 50)
```
---
## Performance Tips
1. **Use multiple processors:** Always set `--numberOfProcessors` to available cores
2. **Process regions:** Use `--region` for testing or memory-limited environments
3. **Adjust bin size:** Larger bins = faster processing and smaller files
4. **Pre-filter BAM files:** Use `alignmentSieve` to create filtered BAM files once, then reuse them across analyses (see the sketch after this list)
5. **Use bigWig over bedGraph:** bigWig format is compressed and faster to process
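A minimal pre-filtering sketch for tip 4, assuming a MAPQ cutoff of 30 and a hypothetical blacklist file `blacklist.bed`; the filtered BAM can then be reused by every downstream deepTools command:

```bash
# Filter once, reuse everywhere
alignmentSieve \
  --bam sample.bam \
  --outFile sample.filtered.bam \
  --minMappingQuality 30 \
  --ignoreDuplicates \
  --blackListFileName blacklist.bed \
  --numberOfProcessors 8

samtools index sample.filtered.bam
```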
---
## Best Practices
1. **Always check QC first:** Run correlation, coverage, and fingerprint analysis before proceeding
2. **Document parameters:** Save command lines for reproducibility
3. **Use consistent normalization:** Apply same normalization method across samples in a comparison
4. **Verify reference genome match:** Ensure BAM files and region files use same genome build
5. **Check strand orientation:** For RNA-seq, verify correct strand orientation
6. **Test on small regions first:** Use `--region chr1:1-1000000` for testing parameters
7. **Keep intermediate files:** Save matrices for regenerating plots with different settings (example below)
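As an example of tip 7, a saved matrix can be re-plotted with different settings without recomputing it (sketch reusing matrix_TSS.gz from the ChIP-seq workflow above):

```bash
plotHeatmap \
  --matrixFile matrix_TSS.gz \
  --outFileName heatmap_TSS_viridis.png \
  --colorMap viridis \
  --kmeans 5
```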

View File

@@ -0,0 +1,195 @@
#!/usr/bin/env python3
"""
deepTools File Validation Script
Validates BAM, bigWig, and BED files for deepTools analysis.
Checks for file existence, proper indexing, and basic format requirements.
"""
import os
import sys
import argparse
from pathlib import Path
def check_file_exists(filepath):
"""Check if file exists and is readable."""
if not os.path.exists(filepath):
return False, f"File not found: {filepath}"
if not os.access(filepath, os.R_OK):
return False, f"File not readable: {filepath}"
return True, f"✓ File exists: {filepath}"
def check_bam_index(bam_file):
"""Check if BAM file has an index (.bai or .bam.bai)."""
bai_file1 = bam_file + ".bai"
    # Only strip a trailing ".bam" so paths containing ".bam" elsewhere are not mangled
    bai_file2 = (bam_file[:-4] if bam_file.endswith(".bam") else bam_file) + ".bai"
if os.path.exists(bai_file1):
return True, f"✓ BAM index found: {bai_file1}"
elif os.path.exists(bai_file2):
return True, f"✓ BAM index found: {bai_file2}"
else:
return False, f"✗ BAM index missing for: {bam_file}\n Run: samtools index {bam_file}"
def check_bigwig_file(bw_file):
"""Basic check for bigWig file."""
# Check file size (bigWig files should have reasonable size)
file_size = os.path.getsize(bw_file)
if file_size < 100:
return False, f"✗ bigWig file suspiciously small: {bw_file} ({file_size} bytes)"
return True, f"✓ bigWig file appears valid: {bw_file} ({file_size} bytes)"
def check_bed_file(bed_file):
"""Basic validation of BED file format."""
try:
with open(bed_file, 'r') as f:
lines = [line.strip() for line in f if line.strip() and not line.startswith('#')]
if len(lines) == 0:
return False, f"✗ BED file is empty: {bed_file}"
# Check first few lines for basic format
for i, line in enumerate(lines[:10], 1):
fields = line.split('\t')
if len(fields) < 3:
return False, f"✗ BED file format error at line {i}: expected at least 3 columns\n Line: {line}"
# Check if start and end are integers
try:
start = int(fields[1])
end = int(fields[2])
if start >= end:
return False, f"✗ BED file error at line {i}: start >= end ({start} >= {end})"
except ValueError:
return False, f"✗ BED file format error at line {i}: start and end must be integers\n Line: {line}"
return True, f"✓ BED file format appears valid: {bed_file} ({len(lines)} regions)"
except Exception as e:
return False, f"✗ Error reading BED file: {bed_file}\n Error: {str(e)}"
def validate_files(bam_files=None, bigwig_files=None, bed_files=None):
"""
Validate all provided files.
Args:
bam_files: List of BAM file paths
bigwig_files: List of bigWig file paths
bed_files: List of BED file paths
Returns:
Tuple of (success: bool, messages: list)
"""
all_success = True
messages = []
# Validate BAM files
if bam_files:
messages.append("\n=== Validating BAM Files ===")
for bam_file in bam_files:
# Check existence
success, msg = check_file_exists(bam_file)
messages.append(msg)
if not success:
all_success = False
continue
# Check index
success, msg = check_bam_index(bam_file)
messages.append(msg)
if not success:
all_success = False
# Validate bigWig files
if bigwig_files:
messages.append("\n=== Validating bigWig Files ===")
for bw_file in bigwig_files:
# Check existence
success, msg = check_file_exists(bw_file)
messages.append(msg)
if not success:
all_success = False
continue
# Basic bigWig check
success, msg = check_bigwig_file(bw_file)
messages.append(msg)
if not success:
all_success = False
# Validate BED files
if bed_files:
messages.append("\n=== Validating BED Files ===")
for bed_file in bed_files:
# Check existence
success, msg = check_file_exists(bed_file)
messages.append(msg)
if not success:
all_success = False
continue
# Check BED format
success, msg = check_bed_file(bed_file)
messages.append(msg)
if not success:
all_success = False
return all_success, messages
def main():
parser = argparse.ArgumentParser(
description="Validate files for deepTools analysis",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Validate BAM files
python validate_files.py --bam sample1.bam sample2.bam
# Validate all file types
python validate_files.py --bam input.bam chip.bam --bed peaks.bed --bigwig signal.bw
# Validate from a directory
python validate_files.py --bam *.bam --bed *.bed
"""
)
parser.add_argument('--bam', nargs='+', help='BAM files to validate')
parser.add_argument('--bigwig', '--bw', nargs='+', help='bigWig files to validate')
parser.add_argument('--bed', nargs='+', help='BED files to validate')
args = parser.parse_args()
# Check if any files were provided
if not any([args.bam, args.bigwig, args.bed]):
parser.print_help()
sys.exit(1)
# Run validation
success, messages = validate_files(
bam_files=args.bam,
bigwig_files=args.bigwig,
bed_files=args.bed
)
# Print results
for msg in messages:
print(msg)
# Summary
print("\n" + "="*50)
if success:
print("✓ All validations passed!")
sys.exit(0)
else:
print("✗ Some validations failed. Please fix the issues above.")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,454 @@
#!/usr/bin/env python3
"""
deepTools Workflow Generator
Generates bash script templates for common deepTools workflows.
"""
import argparse
import sys
WORKFLOWS = {
'chipseq_qc': {
'name': 'ChIP-seq Quality Control',
'description': 'Complete QC workflow for ChIP-seq experiments',
},
'chipseq_analysis': {
'name': 'ChIP-seq Complete Analysis',
'description': 'Full ChIP-seq analysis from BAM to heatmaps',
},
'rnaseq_coverage': {
'name': 'RNA-seq Coverage Tracks',
'description': 'Generate strand-specific RNA-seq coverage',
},
'atacseq': {
'name': 'ATAC-seq Analysis',
'description': 'ATAC-seq workflow with Tn5 correction',
},
}
def generate_chipseq_qc_workflow(output_file, params):
"""Generate ChIP-seq QC workflow script."""
script = f"""#!/bin/bash
# deepTools ChIP-seq Quality Control Workflow
# Generated by deepTools workflow generator
# Configuration
INPUT_BAM="{params.get('input_bam', 'Input.bam')}"
CHIP_BAM=("{params.get('chip_bams', 'ChIP1.bam ChIP2.bam')}")
GENOME_SIZE={params.get('genome_size', '2913022398')}
THREADS={params.get('threads', '8')}
OUTPUT_DIR="{params.get('output_dir', 'deeptools_qc')}"
# Create output directory
mkdir -p $OUTPUT_DIR
echo "=== Starting ChIP-seq QC workflow ==="
# Correlation and PCA analysis
echo "Step 1: Computing correlation matrix..."
multiBamSummary bins \\
--bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
-o $OUTPUT_DIR/readCounts.npz \\
--numberOfProcessors $THREADS
echo "Step 2: Generating correlation heatmap..."
plotCorrelation \\
-in $OUTPUT_DIR/readCounts.npz \\
--corMethod pearson \\
    --whatToPlot heatmap \\
--plotFile $OUTPUT_DIR/correlation_heatmap.png \\
--plotNumbers
echo "Step 3: Generating PCA plot..."
plotPCA \\
-in $OUTPUT_DIR/readCounts.npz \\
-o $OUTPUT_DIR/PCA_plot.png \\
-T "PCA of ChIP-seq samples"
# Coverage assessment
echo "Step 4: Assessing coverage..."
plotCoverage \\
--bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
--plotFile $OUTPUT_DIR/coverage.png \\
--ignoreDuplicates \\
--numberOfProcessors $THREADS
# Fragment size (for paired-end data)
echo "Step 5: Analyzing fragment sizes..."
bamPEFragmentSize \\
--bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
--histogram $OUTPUT_DIR/fragmentSizes.png \\
--plotTitle "Fragment Size Distribution"
# ChIP signal strength
echo "Step 6: Evaluating ChIP enrichment..."
plotFingerprint \\
--bamfiles $INPUT_BAM ${{CHIP_BAM[@]}} \\
--plotFile $OUTPUT_DIR/fingerprint.png \\
--extendReads 200 \\
--ignoreDuplicates \\
--numberOfProcessors $THREADS \\
--outQualityMetrics $OUTPUT_DIR/fingerprint_metrics.txt
echo "=== ChIP-seq QC workflow complete ==="
echo "Results are in: $OUTPUT_DIR"
"""
with open(output_file, 'w') as f:
f.write(script)
return f"✓ Generated ChIP-seq QC workflow: {output_file}"
def generate_chipseq_analysis_workflow(output_file, params):
"""Generate complete ChIP-seq analysis workflow script."""
script = f"""#!/bin/bash
# deepTools ChIP-seq Complete Analysis Workflow
# Generated by deepTools workflow generator
# Configuration
INPUT_BAM="{params.get('input_bam', 'Input.bam')}"
CHIP_BAM="{params.get('chip_bam', 'ChIP.bam')}"
GENES_BED="{params.get('genes_bed', 'genes.bed')}"
PEAKS_BED="{params.get('peaks_bed', 'peaks.bed')}"
GENOME_SIZE={params.get('genome_size', '2913022398')}
THREADS={params.get('threads', '8')}
OUTPUT_DIR="{params.get('output_dir', 'chipseq_analysis')}"
# Create output directory
mkdir -p $OUTPUT_DIR
echo "=== Starting ChIP-seq analysis workflow ==="
# Step 1: Generate normalized coverage tracks
echo "Step 1: Generating coverage tracks..."
bamCoverage \\
--bam $INPUT_BAM \\
--outFileName $OUTPUT_DIR/Input_coverage.bw \\
--normalizeUsing RPGC \\
--effectiveGenomeSize $GENOME_SIZE \\
--binSize 10 \\
--extendReads 200 \\
--ignoreDuplicates \\
--numberOfProcessors $THREADS
bamCoverage \\
--bam $CHIP_BAM \\
--outFileName $OUTPUT_DIR/ChIP_coverage.bw \\
--normalizeUsing RPGC \\
--effectiveGenomeSize $GENOME_SIZE \\
--binSize 10 \\
--extendReads 200 \\
--ignoreDuplicates \\
--numberOfProcessors $THREADS
# Step 2: Create log2 ratio track
echo "Step 2: Creating log2 ratio track..."
bamCompare \\
--bamfile1 $CHIP_BAM \\
--bamfile2 $INPUT_BAM \\
--outFileName $OUTPUT_DIR/ChIP_vs_Input_log2ratio.bw \\
--operation log2 \\
--scaleFactorsMethod readCount \\
--binSize 10 \\
--extendReads 200 \\
--ignoreDuplicates \\
--numberOfProcessors $THREADS
# Step 3: Compute matrix around TSS
echo "Step 3: Computing matrix around TSS..."
computeMatrix reference-point \\
--referencePoint TSS \\
--scoreFileName $OUTPUT_DIR/ChIP_coverage.bw \\
--regionsFileName $GENES_BED \\
--beforeRegionStartLength 3000 \\
--afterRegionStartLength 3000 \\
--binSize 10 \\
--sortRegions descend \\
--sortUsing mean \\
--outFileName $OUTPUT_DIR/matrix_TSS.gz \\
--numberOfProcessors $THREADS
# Step 4: Generate heatmap
echo "Step 4: Generating heatmap..."
plotHeatmap \\
--matrixFile $OUTPUT_DIR/matrix_TSS.gz \\
--outFileName $OUTPUT_DIR/heatmap_TSS.png \\
--colorMap RdBu \\
--whatToShow 'plot, heatmap and colorbar' \\
--yAxisLabel "Genes" \\
--xAxisLabel "Distance from TSS (bp)" \\
--refPointLabel "TSS" \\
--heatmapHeight 15 \\
--kmeans 3
# Step 5: Generate profile plot
echo "Step 5: Generating profile plot..."
plotProfile \\
--matrixFile $OUTPUT_DIR/matrix_TSS.gz \\
--outFileName $OUTPUT_DIR/profile_TSS.png \\
--plotType lines \\
--perGroup \\
--colors blue \\
--plotTitle "ChIP-seq signal around TSS" \\
--yAxisLabel "Average signal" \\
--refPointLabel "TSS"
# Step 6: Enrichment at peaks (if peaks provided)
if [ -f "$PEAKS_BED" ]; then
echo "Step 6: Calculating enrichment at peaks..."
plotEnrichment \\
--bamfiles $INPUT_BAM $CHIP_BAM \\
--BED $PEAKS_BED \\
--labels Input ChIP \\
--plotFile $OUTPUT_DIR/enrichment.png \\
--outRawCounts $OUTPUT_DIR/enrichment_counts.tab \\
--extendReads 200 \\
--ignoreDuplicates
fi
echo "=== ChIP-seq analysis complete ==="
echo "Results are in: $OUTPUT_DIR"
"""
with open(output_file, 'w') as f:
f.write(script)
return f"✓ Generated ChIP-seq analysis workflow: {output_file}"
def generate_rnaseq_coverage_workflow(output_file, params):
"""Generate RNA-seq coverage workflow script."""
script = f"""#!/bin/bash
# deepTools RNA-seq Coverage Workflow
# Generated by deepTools workflow generator
# Configuration
RNASEQ_BAM="{params.get('rnaseq_bam', 'rnaseq.bam')}"
THREADS={params.get('threads', '8')}
OUTPUT_DIR="{params.get('output_dir', 'rnaseq_coverage')}"
# Create output directory
mkdir -p $OUTPUT_DIR
echo "=== Starting RNA-seq coverage workflow ==="
# Generate strand-specific coverage tracks
echo "Step 1: Generating forward strand coverage..."
bamCoverage \\
--bam $RNASEQ_BAM \\
--outFileName $OUTPUT_DIR/forward_coverage.bw \\
--filterRNAstrand forward \\
--normalizeUsing CPM \\
--binSize 1 \\
--numberOfProcessors $THREADS
echo "Step 2: Generating reverse strand coverage..."
bamCoverage \\
--bam $RNASEQ_BAM \\
--outFileName $OUTPUT_DIR/reverse_coverage.bw \\
--filterRNAstrand reverse \\
--normalizeUsing CPM \\
--binSize 1 \\
--numberOfProcessors $THREADS
echo "=== RNA-seq coverage workflow complete ==="
echo "Results are in: $OUTPUT_DIR"
echo ""
echo "Note: These bigWig files can be loaded into genome browsers"
echo "for strand-specific visualization of RNA-seq data."
"""
with open(output_file, 'w') as f:
f.write(script)
return f"✓ Generated RNA-seq coverage workflow: {output_file}"
def generate_atacseq_workflow(output_file, params):
"""Generate ATAC-seq workflow script."""
script = f"""#!/bin/bash
# deepTools ATAC-seq Analysis Workflow
# Generated by deepTools workflow generator
# Configuration
ATAC_BAM="{params.get('atac_bam', 'atacseq.bam')}"
PEAKS_BED="{params.get('peaks_bed', 'peaks.bed')}"
GENOME_SIZE={params.get('genome_size', '2913022398')}
THREADS={params.get('threads', '8')}
OUTPUT_DIR="{params.get('output_dir', 'atacseq_analysis')}"
# Create output directory
mkdir -p $OUTPUT_DIR
echo "=== Starting ATAC-seq analysis workflow ==="
# Step 1: Shift reads for Tn5 correction
echo "Step 1: Applying Tn5 offset correction..."
alignmentSieve \\
    --bam $ATAC_BAM \\
    --outFile $OUTPUT_DIR/atacseq_shifted_unsorted.bam \\
    --ATACshift \\
    --minFragmentLength 38 \\
    --maxFragmentLength 2000 \\
    --ignoreDuplicates

# Shifted reads are no longer coordinate-sorted; sort and index before coverage calculation
samtools sort -@ $THREADS -o $OUTPUT_DIR/atacseq_shifted.bam $OUTPUT_DIR/atacseq_shifted_unsorted.bam
samtools index $OUTPUT_DIR/atacseq_shifted.bam
# Step 2: Generate coverage track
echo "Step 2: Generating coverage track..."
bamCoverage \\
--bam $OUTPUT_DIR/atacseq_shifted.bam \\
--outFileName $OUTPUT_DIR/atacseq_coverage.bw \\
--normalizeUsing RPGC \\
--effectiveGenomeSize $GENOME_SIZE \\
--binSize 1 \\
--numberOfProcessors $THREADS
# Step 3: Fragment size analysis
echo "Step 3: Analyzing fragment sizes..."
bamPEFragmentSize \\
--bamfiles $ATAC_BAM \\
--histogram $OUTPUT_DIR/fragmentSizes.png \\
--maxFragmentLength 1000
# Step 4: Compute matrix at peaks (if peaks provided)
if [ -f "$PEAKS_BED" ]; then
echo "Step 4: Computing matrix at peaks..."
computeMatrix reference-point \\
--referencePoint center \\
--scoreFileName $OUTPUT_DIR/atacseq_coverage.bw \\
--regionsFileName $PEAKS_BED \\
--beforeRegionStartLength 2000 \\
--afterRegionStartLength 2000 \\
--binSize 10 \\
--outFileName $OUTPUT_DIR/matrix_peaks.gz \\
--numberOfProcessors $THREADS
echo "Step 5: Generating heatmap..."
plotHeatmap \\
--matrixFile $OUTPUT_DIR/matrix_peaks.gz \\
--outFileName $OUTPUT_DIR/heatmap_peaks.png \\
--colorMap YlOrRd \\
--refPointLabel "Peak Center" \\
--heatmapHeight 15
fi
echo "=== ATAC-seq analysis complete ==="
echo "Results are in: $OUTPUT_DIR"
echo ""
echo "Expected fragment size pattern:"
echo " ~50bp: nucleosome-free regions"
echo " ~200bp: mono-nucleosome"
echo " ~400bp: di-nucleosome"
"""
with open(output_file, 'w') as f:
f.write(script)
return f"✓ Generated ATAC-seq workflow: {output_file}"
def main():
parser = argparse.ArgumentParser(
description="Generate deepTools workflow scripts",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog=f"""
Available workflows:
{chr(10).join(f" {key}: {value['name']}" for key, value in WORKFLOWS.items())}
Examples:
# Generate ChIP-seq QC workflow
python workflow_generator.py chipseq_qc -o chipseq_qc.sh
# Generate ChIP-seq analysis with custom parameters
python workflow_generator.py chipseq_analysis -o analysis.sh \\
--chip-bam H3K4me3.bam --input-bam Input.bam
# List all available workflows
python workflow_generator.py --list
"""
)
parser.add_argument('workflow', nargs='?', choices=list(WORKFLOWS.keys()),
help='Workflow type to generate')
parser.add_argument('-o', '--output', default='deeptools_workflow.sh',
help='Output script filename (default: deeptools_workflow.sh)')
parser.add_argument('--list', action='store_true',
help='List all available workflows')
# Common parameters
parser.add_argument('--threads', type=int, default=8,
help='Number of threads (default: 8)')
parser.add_argument('--genome-size', type=int, default=2913022398,
help='Effective genome size (default: 2913022398 for hg38)')
parser.add_argument('--output-dir', default=None,
help='Output directory for results')
# Workflow-specific parameters
parser.add_argument('--input-bam', help='Input/control BAM file')
parser.add_argument('--chip-bam', help='ChIP BAM file')
parser.add_argument('--chip-bams', help='Multiple ChIP BAM files (space-separated)')
parser.add_argument('--rnaseq-bam', help='RNA-seq BAM file')
parser.add_argument('--atac-bam', help='ATAC-seq BAM file')
parser.add_argument('--genes-bed', help='Genes BED file')
parser.add_argument('--peaks-bed', help='Peaks BED file')
args = parser.parse_args()
# List workflows
if args.list:
print("\nAvailable deepTools workflows:\n")
for key, value in WORKFLOWS.items():
print(f" {key}")
print(f" {value['name']}")
print(f" {value['description']}\n")
sys.exit(0)
# Check if workflow was specified
if not args.workflow:
parser.print_help()
sys.exit(1)
# Prepare parameters
params = {
'threads': args.threads,
'genome_size': args.genome_size,
'output_dir': args.output_dir or f"{args.workflow}_output",
'input_bam': args.input_bam,
'chip_bam': args.chip_bam,
'chip_bams': args.chip_bams,
'rnaseq_bam': args.rnaseq_bam,
'atac_bam': args.atac_bam,
'genes_bed': args.genes_bed,
'peaks_bed': args.peaks_bed,
}
# Generate workflow
if args.workflow == 'chipseq_qc':
message = generate_chipseq_qc_workflow(args.output, params)
elif args.workflow == 'chipseq_analysis':
message = generate_chipseq_analysis_workflow(args.output, params)
elif args.workflow == 'rnaseq_coverage':
message = generate_rnaseq_coverage_workflow(args.output, params)
elif args.workflow == 'atacseq':
message = generate_atacseq_workflow(args.output, params)
print(message)
print(f"\nTo run the workflow:")
print(f" chmod +x {args.output}")
print(f" ./{args.output}")
print(f"\nNote: Edit the script to customize file paths and parameters.")
if __name__ == "__main__":
main()