Initial commit

2025-11-29 18:02:37 +08:00
commit c1d9dee646
38 changed files with 11210 additions and 0 deletions
--- a/skills/phylo_from_buscos/templates/README.md
+++ b/skills/phylo_from_buscos/templates/README.md
@@ -0,0 +1,125 @@
+# Phylogenomics Workflow Templates
+
+This directory contains template scripts for running the phylogenomics pipeline across different computing environments.
+
+## Directory Structure
+
+```
+templates/
+├── slurm/      # SLURM job scheduler templates
+├── pbs/        # PBS/Torque job scheduler templates
+└── local/      # Local machine templates (with GNU parallel support)
+```
+
+## Template Naming Convention
+
+Templates follow a consistent naming pattern: `NN_step_name[_variant].ext`
+
+- `NN`: Step number (e.g., `02` for compleasm, `08a` for partition search)
+- `step_name`: Descriptive name of the pipeline step
+- `_variant`: Optional variant (e.g., `_first`, `_parallel`, `_serial`)
+- `.ext`: File extension (`.job` for schedulers, `.sh` for local scripts)
+
+## Available Templates
+
+### Step 2: Ortholog Identification (compleasm)
+
+**SLURM:**
+- `02_compleasm_first.job` - Process first genome to download lineage database
+- `02_compleasm_parallel.job` - Array job for remaining genomes
+
+**PBS:**
+- `02_compleasm_first.job` - Process first genome to download lineage database
+- `02_compleasm_parallel.job` - Array job for remaining genomes
+
+**Local:**
+- `02_compleasm_first.sh` - Process first genome to download lineage database
+- `02_compleasm_parallel.sh` - GNU parallel for remaining genomes
+
+### Step 8A: Partition Model Selection
+
+**SLURM:**
+- `08a_partition_search.job` - IQ-TREE partition model search with TESTMERGEONLY
+
+**PBS:**
+- `08a_partition_search.job` - IQ-TREE partition model search with TESTMERGEONLY
+
+**Local:**
+- `08a_partition_search.sh` - IQ-TREE partition model search with TESTMERGEONLY
+
+### Step 8C: Individual Gene Trees
+
+**SLURM:**
+- `08c_gene_trees_array.job` - Array job for parallel gene tree estimation
+
+**PBS:**
+- `08c_gene_trees_array.job` - Array job for parallel gene tree estimation
+
+**Local:**
+- `08c_gene_trees_parallel.sh` - GNU parallel for gene tree estimation
+- `08c_gene_trees_serial.sh` - Serial processing (for debugging/limited resources)
+
+## Placeholders
+
+Templates contain placeholders that must be replaced with user-specific values:
+
+| Placeholder | Description | Example |
+|-------------|-------------|---------|
+| `TOTAL_THREADS` | Total CPU cores available | `64` |
+| `THREADS_PER_JOB` | Threads per concurrent job | `16` |
+| `NUM_GENOMES` | Number of genomes in analysis | `20` |
+| `NUM_LOCI` | Number of loci/alignments | `2795` |
+| `LINEAGE` | BUSCO lineage dataset | `insecta_odb10` |
+| `MODEL_SET` | Comma-separated substitution models | `LG,WAG,JTT,Q.pfam` |
+
+## Usage
+
+### For Claude (LLM)
+
+When a user requests scripts for a specific computing environment:
+
+1. **Read the appropriate template** using the Read tool
+2. **Replace placeholders** with user-specified values
+3. **Present the customized script** to the user
+4. **Provide setup instructions** (e.g., how many genomes, how to calculate thread allocation)
+
+Example:
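+```python
+from pathlib import Path
+
+# Read the template (Claude uses the Read tool; pathlib is shown for illustration)
+template = Path("templates/slurm/02_compleasm_first.job").read_text()
+
+# Replace placeholders, then present the customized script to the user
+script = template.replace("TOTAL_THREADS", "64")
+script = script.replace("LINEAGE", "insecta_odb10")
+print(script)
+```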
+
+### For Users
+
+Templates are not meant to be used directly. Instead:
+
+1. Follow the workflow in `SKILL.md`
+2. Answer Claude's questions about your setup
+3. Claude will fetch the appropriate template and customize it for you
+4. Copy the customized script Claude provides
+
+## Benefits of This Structure
+
+1. **Reduced token usage**: Claude only reads templates when needed
+2. **Easier maintenance**: Update one template file instead of multiple locations in SKILL.md
+3. **Consistency**: All users get the same base template structure
+4. **Clarity**: Separate files are easier to review than inline code
+5. **Extensibility**: Easy to add new templates for additional tools or variants
+
+## Adding New Templates
+
+When adding new templates:
+
+1. **Follow naming convention**: `NN_descriptive_name[_variant].ext`
+2. **Include clear comments**: Explain what the script does
+3. **Use consistent placeholders**: Match existing placeholder names
+4. **Test thoroughly**: Ensure placeholders are complete and correct
+5. **Update this README**: Add the new template to the "Available Templates" section
+6. **Update SKILL.md**: Reference the new template in the appropriate workflow step
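The placeholder replacement described in the README above can be sketched as follows. This is a minimal illustration, not part of the commit: the function name `fill_template` and the choice to raise on a missing value are assumptions. It matches placeholders only as whole words and refuses to return a script that still contains an unfilled placeholder.

```python
import re

# Placeholder names taken from the README table.
PLACEHOLDERS = (
    "TOTAL_THREADS", "THREADS_PER_JOB", "NUM_GENOMES",
    "NUM_LOCI", "LINEAGE", "MODEL_SET",
)
_PATTERN = re.compile(r"\b(" + "|".join(PLACEHOLDERS) + r")\b")


def fill_template(text, values):
    """Replace known placeholders in text (hypothetical helper).

    Raises KeyError if text uses a placeholder that has no value,
    so a half-filled script is never returned.
    """
    used = {match.group(1) for match in _PATTERN.finditer(text)}
    missing = sorted(used - set(values))
    if missing:
        raise KeyError("no value supplied for: " + ", ".join(missing))
    return _PATTERN.sub(lambda match: str(values[match.group(1)]), text)


# Example with two directive lines in the style of the SLURM array template.
template = (
    "#SBATCH --array=2-NUM_GENOMES\n"
    "#SBATCH --cpus-per-task=THREADS_PER_JOB\n"
)
filled = fill_template(template, {"NUM_GENOMES": 20, "THREADS_PER_JOB": 16})
print(filled)
```

Whole-word matching keeps a longer identifier that merely contains a placeholder name from being rewritten by accident.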
--- a/skills/phylo_from_buscos/templates/local/02_compleasm_first.sh
+++ b/skills/phylo_from_buscos/templates/local/02_compleasm_first.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+# run_compleasm_first.sh
+source ~/.bashrc
+conda activate phylo
+
+# User-specified total CPU threads
+threads=TOTAL_THREADS  # Replace with total cores you want to use (e.g., 16, 32, 64)
+echo "Processing first genome with ${threads} CPU threads to download lineage database..."
+
+# Create output directory
+mkdir -p 01_busco_results
+
+# Process FIRST genome only
+first_genome=$(head -n 1 genome_list.txt)
+genome_name=$(basename "${first_genome}" .fasta)
+echo "Processing: ${genome_name}"
+
+compleasm run \
+  -a "${first_genome}" \
+  -o "01_busco_results/${genome_name}_compleasm" \
+  -l LINEAGE \
+  -t "${threads}"
+
+echo ""
+echo "First genome complete! Lineage database is now cached."
+echo "Now run the parallel script for remaining genomes: bash run_compleasm_parallel.sh"
--- a/skills/phylo_from_buscos/templates/local/02_compleasm_parallel.sh
+++ b/skills/phylo_from_buscos/templates/local/02_compleasm_parallel.sh
@@ -0,0 +1,33 @@
+#!/bin/bash
+# run_compleasm_parallel.sh
+source ~/.bashrc
+conda activate phylo
+
+# Threading configuration (adjust based on your system)
+total_threads=TOTAL_THREADS      # Total cores to use (e.g., 64)
+threads_per_job=THREADS_PER_JOB  # Threads per genome (e.g., 16)
+concurrent_jobs=$((total_threads / threads_per_job))  # Calculated automatically
+
+echo "Configuration:"
+echo "  Total threads:      ${total_threads}"
+echo "  Threads per genome: ${threads_per_job}"
+echo "  Concurrent genomes: ${concurrent_jobs}"
+echo ""
+
+# Create output directory
+mkdir -p 01_busco_results
+
+# Process remaining genomes (skip first one) in parallel
+tail -n +2 genome_list.txt | parallel -j "${concurrent_jobs}" '
+  genome_name=$(basename {} .fasta)
+  echo "Processing ${genome_name} with THREADS_PER_JOB threads..."
+
+  compleasm run \
+    -a {} \
+    -o 01_busco_results/${genome_name}_compleasm \
+    -l LINEAGE \
+    -t THREADS_PER_JOB
+'
+
+echo ""
+echo "All genomes processed!"
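The thread arithmetic in the script above, together with the 6 GB per CPU figure used by the PBS and SLURM compleasm templates, can be checked before filling in `TOTAL_THREADS` and `THREADS_PER_JOB`. The sketch below is illustrative only. The function name `plan_resources` and the returned keys are assumptions, not part of the templates.

```python
def plan_resources(total_threads, threads_per_job, gb_per_thread=6):
    """Derive concurrency and per-job memory from the thread settings.

    Hypothetical helper: mirrors CONCURRENT_JOBS = TOTAL_THREADS / THREADS_PER_JOB
    (integer division) and the "ppn x 6GB" memory rule from the job templates.
    """
    if threads_per_job < 1 or threads_per_job > total_threads:
        raise ValueError("threads_per_job must be between 1 and total_threads")
    concurrent_jobs = total_threads // threads_per_job
    return {
        "concurrent_jobs": concurrent_jobs,
        # Cores left unused when the division is not exact.
        "idle_threads": total_threads - concurrent_jobs * threads_per_job,
        "mem_gb_per_job": threads_per_job * gb_per_thread,
    }


plan = plan_resources(64, 16)
print(plan)
```

With 64 total threads and 16 threads per job this gives 4 concurrent jobs and 96 GB per job, which agrees with the `mem=96gb` request in the PBS array template. A nonzero `idle_threads` value suggests picking a `THREADS_PER_JOB` that divides the total evenly.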
--- a/skills/phylo_from_buscos/templates/local/08a_partition_search.sh
+++ b/skills/phylo_from_buscos/templates/local/08a_partition_search.sh
@@ -0,0 +1,20 @@
+#!/bin/bash
+source ~/.bashrc
+conda activate phylo
+
+cd 06_concatenation
+
+iqtree \
+  -s FcC_supermatrix.fas \
+  -spp partition_def.txt \
+  -nt 18 \
+  -safe \
+  -pre partition_search \
+  -m TESTMERGEONLY \
+  -mset MODEL_SET \
+  -msub nuclear \
+  -rcluster 10 \
+  -bb 1000 \
+  -alrt 1000
+
+echo "Partition search complete! Best scheme: partition_search.best_scheme.nex"
--- a/skills/phylo_from_buscos/templates/local/08c_gene_trees_parallel.sh
+++ b/skills/phylo_from_buscos/templates/local/08c_gene_trees_parallel.sh
@@ -0,0 +1,17 @@
+#!/bin/bash
+source ~/.bashrc
+conda activate phylo
+
+cd trimmed_aa
+
+# Create list of alignments
+ls *_trimmed.fas > locus_alignments.txt
+
+# Run IQ-TREE in parallel (adjust -j for number of concurrent jobs)
+cat locus_alignments.txt | parallel -j 4 '
+  prefix=$(basename {} _trimmed.fas)
+  iqtree -s {} -m MFP -bb 1000 -bnni -czb -pre ${prefix} -nt 1
+  echo "Tree complete: ${prefix}"
+'
+
+echo "All gene trees complete!"
--- a/skills/phylo_from_buscos/templates/local/08c_gene_trees_serial.sh
+++ b/skills/phylo_from_buscos/templates/local/08c_gene_trees_serial.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+source ~/.bashrc
+conda activate phylo
+
+cd trimmed_aa
+
+for locus in *_trimmed.fas; do
+    prefix=$(basename ${locus} _trimmed.fas)
+    echo "Processing ${prefix}..."
+    iqtree -s ${locus} -m MFP -bb 1000 -bnni -czb -pre ${prefix} -nt 1
+done
+
+echo "All gene trees complete!"
--- a/skills/phylo_from_buscos/templates/pbs/02_compleasm_first.job
+++ b/skills/phylo_from_buscos/templates/pbs/02_compleasm_first.job
@@ -0,0 +1,27 @@
+#!/bin/bash
+#PBS -N compleasm_first
+#PBS -l nodes=1:ppn=TOTAL_THREADS  # Replace with total available CPUs (e.g., 64)
+#PBS -l mem=384gb  # Adjust based on ppn × 6GB
+#PBS -l walltime=24:00:00
+
+cd $PBS_O_WORKDIR
+source ~/.bashrc
+conda activate phylo
+
+mkdir -p logs
+mkdir -p 01_busco_results
+
+# Process FIRST genome only (downloads lineage database)
+first_genome=$(head -n 1 genome_list.txt)
+genome_name=$(basename ${first_genome} .fasta)
+echo "Processing first genome: ${genome_name} with $PBS_NUM_PPN threads..."
+echo "This will download the BUSCO lineage database for subsequent runs."
+
+compleasm run \
+  -a ${first_genome} \
+  -o 01_busco_results/${genome_name}_compleasm \
+  -l LINEAGE \
+  -t $PBS_NUM_PPN
+
+echo "First genome complete! Lineage database is now cached."
+echo "Submit the parallel job for remaining genomes: qsub run_compleasm_parallel.job"
--- a/skills/phylo_from_buscos/templates/pbs/02_compleasm_parallel.job
+++ b/skills/phylo_from_buscos/templates/pbs/02_compleasm_parallel.job
@@ -0,0 +1,24 @@
+#!/bin/bash
+#PBS -N compleasm_parallel
+#PBS -t 2-NUM_GENOMES  # Start from genome 2 (first genome already processed)
+#PBS -l nodes=1:ppn=THREADS_PER_JOB  # e.g., 16 for 64-core system
+#PBS -l mem=96gb  # Adjust based on ppn × 6GB
+#PBS -l walltime=48:00:00
+
+cd $PBS_O_WORKDIR
+source ~/.bashrc
+conda activate phylo
+
+mkdir -p 01_busco_results
+
+# Get genome for this array task
+genome=$(sed -n "${PBS_ARRAYID}p" genome_list.txt)
+genome_name=$(basename ${genome} .fasta)
+
+echo "Processing ${genome_name} with $PBS_NUM_PPN threads..."
+
+compleasm run \
+  -a ${genome} \
+  -o 01_busco_results/${genome_name}_compleasm \
+  -l LINEAGE \
+  -t $PBS_NUM_PPN
--- a/skills/phylo_from_buscos/templates/pbs/08a_partition_search.job
+++ b/skills/phylo_from_buscos/templates/pbs/08a_partition_search.job
@@ -0,0 +1,22 @@
+#!/bin/bash
+#PBS -N iqtree_partition
+#PBS -l nodes=1:ppn=18
+#PBS -l mem=72gb
+#PBS -l walltime=72:00:00
+
+cd $PBS_O_WORKDIR/06_concatenation
+source ~/.bashrc
+conda activate phylo
+
+iqtree \
+  -s FcC_supermatrix.fas \
+  -spp partition_def.txt \
+  -nt 18 \
+  -safe \
+  -pre partition_search \
+  -m TESTMERGEONLY \
+  -mset MODEL_SET \
+  -msub nuclear \
+  -rcluster 10 \
+  -bb 1000 \
+  -alrt 1000
--- a/skills/phylo_from_buscos/templates/pbs/08c_gene_trees_array.job
+++ b/skills/phylo_from_buscos/templates/pbs/08c_gene_trees_array.job
@@ -0,0 +1,26 @@
+#!/bin/bash
+#PBS -N iqtree_genes
+#PBS -t 1-NUM_LOCI
+#PBS -l nodes=1:ppn=1
+#PBS -l mem=4gb
+#PBS -l walltime=2:00:00
+
+cd $PBS_O_WORKDIR/trimmed_aa
+source ~/.bashrc
+conda activate phylo
+
+# Create list of alignments if not present
+if [ ! -f locus_alignments.txt ]; then
+    ls *_trimmed.fas > locus_alignments.txt
+fi
+
+locus=$(sed -n "${PBS_ARRAYID}p" locus_alignments.txt)
+
+iqtree \
+  -s ${locus} \
+  -m MFP \
+  -bb 1000 \
+  -bnni \
+  -czb \
+  -pre $(basename ${locus} _trimmed.fas) \
+  -nt 1
--- a/skills/phylo_from_buscos/templates/slurm/02_compleasm_first.job
+++ b/skills/phylo_from_buscos/templates/slurm/02_compleasm_first.job
@@ -0,0 +1,28 @@
+#!/bin/bash
+#SBATCH --job-name=compleasm_first
+#SBATCH --cpus-per-task=TOTAL_THREADS  # Replace with total available CPUs (e.g., 64)
+#SBATCH --mem-per-cpu=6G
+#SBATCH --time=24:00:00
+#SBATCH --output=logs/compleasm_first.%j.out
+#SBATCH --error=logs/compleasm_first.%j.err
+
+source ~/.bashrc
+conda activate phylo
+
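+mkdir -p logs  # Note: logs/ must already exist when sbatch is run, or SLURM cannot write the --output/--error files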
+mkdir -p 01_busco_results
+
+# Process FIRST genome only (downloads lineage database)
+first_genome=$(head -n 1 genome_list.txt)
+genome_name=$(basename ${first_genome} .fasta)
+echo "Processing first genome: ${genome_name} with ${SLURM_CPUS_PER_TASK} threads..."
+echo "This will download the BUSCO lineage database for subsequent runs."
+
+compleasm run \
+  -a ${first_genome} \
+  -o 01_busco_results/${genome_name}_compleasm \
+  -l LINEAGE \
+  -t ${SLURM_CPUS_PER_TASK}
+
+echo "First genome complete! Lineage database is now cached."
+echo "Submit the parallel job for remaining genomes: sbatch run_compleasm_parallel.job"
--- a/skills/phylo_from_buscos/templates/slurm/02_compleasm_parallel.job
+++ b/skills/phylo_from_buscos/templates/slurm/02_compleasm_parallel.job
@@ -0,0 +1,25 @@
+#!/bin/bash
+#SBATCH --job-name=compleasm_parallel
+#SBATCH --array=2-NUM_GENOMES  # Start from genome 2 (first genome already processed)
+#SBATCH --cpus-per-task=THREADS_PER_JOB  # e.g., 16 for 64-core system with 4 concurrent jobs
+#SBATCH --mem-per-cpu=6G
+#SBATCH --time=48:00:00
+#SBATCH --output=logs/compleasm.%A_%a.out
+#SBATCH --error=logs/compleasm.%A_%a.err
+
+source ~/.bashrc
+conda activate phylo
+
+mkdir -p 01_busco_results
+
+# Get genome for this array task (skipping the first one)
+genome=$(sed -n "${SLURM_ARRAY_TASK_ID}p" genome_list.txt)
+genome_name=$(basename ${genome} .fasta)
+
+echo "Processing ${genome_name} with ${SLURM_CPUS_PER_TASK} threads..."
+
+compleasm run \
+  -a ${genome} \
+  -o 01_busco_results/${genome_name}_compleasm \
+  -l LINEAGE \
+  -t ${SLURM_CPUS_PER_TASK}
--- a/skills/phylo_from_buscos/templates/slurm/08a_partition_search.job
+++ b/skills/phylo_from_buscos/templates/slurm/08a_partition_search.job
@@ -0,0 +1,27 @@
+#!/bin/bash
+#SBATCH --job-name=iqtree_partition
+#SBATCH --cpus-per-task=18
+#SBATCH --mem-per-cpu=4G
+#SBATCH --time=72:00:00
+#SBATCH --output=logs/partition_search.out
+#SBATCH --error=logs/partition_search.err
+
+source ~/.bashrc
+conda activate phylo
+
+cd 06_concatenation  # Use organized directory structure
+
+iqtree \
+  -s FcC_supermatrix.fas \
+  -spp partition_def.txt \
+  -nt ${SLURM_CPUS_PER_TASK} \
+  -safe \
+  -pre partition_search \
+  -m TESTMERGEONLY \
+  -mset MODEL_SET \
+  -msub nuclear \
+  -rcluster 10 \
+  -bb 1000 \
+  -alrt 1000
+
+# Output: partition_search.best_scheme.nex
--- a/skills/phylo_from_buscos/templates/slurm/08c_gene_trees_array.job
+++ b/skills/phylo_from_buscos/templates/slurm/08c_gene_trees_array.job
@@ -0,0 +1,28 @@
+#!/bin/bash
+#SBATCH --job-name=iqtree_genes
+#SBATCH --array=1-NUM_LOCI
+#SBATCH --cpus-per-task=1
+#SBATCH --mem-per-cpu=4G
+#SBATCH --time=2:00:00
+#SBATCH --output=logs/%A_%a.genetree.out
+
+source ~/.bashrc
+conda activate phylo
+
+cd trimmed_aa
+
+# Create list of alignments if not present
+if [ ! -f locus_alignments.txt ]; then
+    ls *_trimmed.fas > locus_alignments.txt
+fi
+
+locus=$(sed -n "${SLURM_ARRAY_TASK_ID}p" locus_alignments.txt)
+
+iqtree \
+  -s ${locus} \
+  -m MFP \
+  -bb 1000 \
+  -bnni \
+  -czb \
+  -pre $(basename ${locus} _trimmed.fas) \
+  -nt 1
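The array templates above rely on `NUM_LOCI` matching the number of lines in `locus_alignments.txt`, and each task picks its alignment by line number with `sed -n`. The mapping can be sketched as follows. The helper names are assumptions, and Python's `sorted` may order names differently from `ls` under some locales, so treat it as a check rather than a replacement for the list file.

```python
from pathlib import Path


def list_loci(directory):
    """Return the names of *_trimmed.fas alignments in directory, sorted."""
    return sorted(path.name for path in Path(directory).glob("*_trimmed.fas"))


def locus_for_task(loci, task_id):
    """Map a 1-based array task ID to its alignment, like sed -n "${ID}p"."""
    if not 1 <= task_id <= len(loci):
        raise IndexError(f"task ID {task_id} is outside 1-{len(loci)}")
    return loci[task_id - 1]


# Example with a hard-coded list; in practice use list_loci("trimmed_aa").
example_loci = ["100at50557_trimmed.fas", "205at50557_trimmed.fas"]
num_loci = len(example_loci)  # value to substitute for NUM_LOCI
second = locus_for_task(example_loci, 2)
print(num_loci, second)
```

Raising on an out-of-range task ID makes a mismatch between `NUM_LOCI` and the alignment list visible, whereas `sed -n` would return an empty string and IQ-TREE would be started with no input file.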