Initial commit
skills/phylo_from_buscos/templates/README.md
@@ -0,0 +1,125 @@
# Phylogenomics Workflow Templates

This directory contains template scripts for running the phylogenomics pipeline across different computing environments.

## Directory Structure

```
templates/
├── slurm/   # SLURM job scheduler templates
├── pbs/     # PBS/Torque job scheduler templates
└── local/   # Local machine templates (with GNU parallel support)
```

## Template Naming Convention

Templates follow a consistent naming pattern: `NN_step_name[_variant].ext`

- `NN`: Step number, optionally followed by a letter for sub-steps (e.g., `02` for compleasm, `08a` for partition search)
- `step_name`: Descriptive name of the pipeline step
- `_variant`: Optional variant (e.g., `_first`, `_parallel`, `_serial`)
- `.ext`: File extension (`.job` for schedulers, `.sh` for local scripts)

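The naming pattern can be checked mechanically. The following is a minimal sketch, not part of the templates: the helper `parse_template_name` and the `KNOWN_VARIANTS` list are hypothetical, with the variant suffixes assumed from the file names listed in this README.

```python
import re

# Assumed variant suffixes, taken from the file names listed in this README.
KNOWN_VARIANTS = ("first", "parallel", "serial", "array")

# NN_step_name[_variant].ext, where NN may carry a sub-step letter (e.g. 08a).
TEMPLATE_NAME = re.compile(
    r"^(?P<step>\d{2}[a-z]?)_"
    r"(?P<name>[a-z0-9]+(?:_[a-z0-9]+)*?)"
    r"(?:_(?P<variant>" + "|".join(KNOWN_VARIANTS) + r"))?"
    r"\.(?P<ext>job|sh)$"
)


def parse_template_name(filename):
    """Split a template file name into step, name, variant, and extension."""
    match = TEMPLATE_NAME.match(filename)
    if match is None:
        raise ValueError(f"not a valid template name: {filename}")
    return match.groupdict()


print(parse_template_name("08c_gene_trees_parallel.sh"))
```

Because a variant cannot be told apart from the last word of a step name by syntax alone, the sketch relies on an explicit variant list; a new variant would need to be added there.
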
## Available Templates

### Step 2: Ortholog Identification (compleasm)

**SLURM:**
- `02_compleasm_first.job` - Process first genome to download lineage database
- `02_compleasm_parallel.job` - Array job for remaining genomes

**PBS:**
- `02_compleasm_first.job` - Process first genome to download lineage database
- `02_compleasm_parallel.job` - Array job for remaining genomes

**Local:**
- `02_compleasm_first.sh` - Process first genome to download lineage database
- `02_compleasm_parallel.sh` - GNU parallel for remaining genomes

### Step 8A: Partition Model Selection

**SLURM:**
- `08a_partition_search.job` - IQ-TREE partition model search with TESTMERGEONLY

**PBS:**
- `08a_partition_search.job` - IQ-TREE partition model search with TESTMERGEONLY

**Local:**
- `08a_partition_search.sh` - IQ-TREE partition model search with TESTMERGEONLY

### Step 8C: Individual Gene Trees

**SLURM:**
- `08c_gene_trees_array.job` - Array job for parallel gene tree estimation

**PBS:**
- `08c_gene_trees_array.job` - Array job for parallel gene tree estimation

**Local:**
- `08c_gene_trees_parallel.sh` - GNU parallel for gene tree estimation
- `08c_gene_trees_serial.sh` - Serial processing (for debugging/limited resources)

## Placeholders

Templates contain placeholders that must be replaced with user-specific values:

| Placeholder | Description | Example |
|-------------|-------------|---------|
| `TOTAL_THREADS` | Total CPU cores available | `64` |
| `THREADS_PER_JOB` | Threads per concurrent job | `16` |
| `NUM_GENOMES` | Number of genomes in analysis | `20` |
| `NUM_LOCI` | Number of loci/alignments | `2795` |
| `LINEAGE` | BUSCO lineage dataset | `insecta_odb10` |
| `MODEL_SET` | Comma-separated substitution models | `LG,WAG,JTT,Q.pfam` |

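The two thread placeholders are linked: the local parallel template derives the number of concurrent jobs as `TOTAL_THREADS / THREADS_PER_JOB` (integer division), and the scheduler templates suggest budgeting roughly 6 GB of memory per core. The sketch below works through that arithmetic; `plan_threads` and its `gb_per_core` parameter are hypothetical names used only for illustration.

```python
def plan_threads(total_threads, threads_per_job, gb_per_core=6):
    """Derive concurrency and memory figures from the thread placeholders."""
    if not 1 <= threads_per_job <= total_threads:
        raise ValueError("threads_per_job must be between 1 and total_threads")
    # Integer division, as in the shell expression $((TOTAL_THREADS / THREADS_PER_JOB)).
    concurrent_jobs = total_threads // threads_per_job
    return {
        "concurrent_jobs": concurrent_jobs,
        # Threads left unused when the division is not exact.
        "idle_threads": total_threads - concurrent_jobs * threads_per_job,
        # Assumed rule of thumb from the job templates: cores x 6 GB.
        "mem_per_job_gb": threads_per_job * gb_per_core,
    }


print(plan_threads(64, 16))
```

With the example values from the table (64 total threads, 16 per job) this yields 4 concurrent jobs and 96 GB per job, which matches the `mem=96gb` request in the PBS array template.
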
## Usage

### For Claude (LLM)

When a user requests scripts for a specific computing environment:

1. **Read the appropriate template** using the Read tool
2. **Replace placeholders** with user-specified values
3. **Present the customized script** to the user
4. **Provide setup instructions** (e.g., how many genomes, how to calculate thread allocation)

Example:
```python
# Read template (pseudocode: `Read` stands for the Read tool, not a Python built-in)
template = Read("templates/slurm/02_compleasm_first.job")

# Replace placeholders
script = template.replace("TOTAL_THREADS", "64")
script = script.replace("LINEAGE", "insecta_odb10")

# Present to user
print(script)
```

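Plain string replacement works for the SLURM template, but the local templates also use placeholder names as shell variable names (`TOTAL_THREADS=TOTAL_THREADS`, `${TOTAL_THREADS}`), so a blanket replace would produce `64=64`. The following is one possible way to handle that, offered as a sketch under assumptions: `fill_placeholders` is a hypothetical helper, and the lookaround rule (skip `NAME=`, `$NAME`, and `${NAME}`) is an assumption about how the templates in this commit use those names.

```python
import re


def fill_placeholders(template, values):
    """Replace standalone placeholder tokens, leaving shell variable names intact."""
    script = template
    for name, value in values.items():
        # Skip matches that are part of a longer word, a $NAME or ${NAME}
        # expansion, or the left-hand side of an assignment (NAME=...).
        pattern = re.compile(r"(?<![\w${])" + re.escape(name) + r"(?![\w=}])")
        # A function replacement avoids backslash interpretation in the value.
        script = pattern.sub(lambda _m, v=str(value): v, script)
    return script


template = (
    "TOTAL_THREADS=TOTAL_THREADS  # total cores\n"
    "compleasm run -l LINEAGE -t ${TOTAL_THREADS}\n"
)
print(fill_placeholders(template, {"TOTAL_THREADS": 64, "LINEAGE": "insecta_odb10"}))
```

Scheduler directives such as `--cpus-per-task=TOTAL_THREADS` are still filled in, because there the placeholder follows `=` rather than preceding it.
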
### For Users

Templates are not meant to be used directly. Instead:

1. Follow the workflow in `SKILL.md`
2. Answer Claude's questions about your setup
3. Claude will fetch the appropriate template and customize it for you
4. Copy the customized script Claude provides

## Benefits of This Structure

1. **Reduced token usage**: Claude only reads templates when needed
2. **Easier maintenance**: Update one template file instead of multiple locations in SKILL.md
3. **Consistency**: All users get the same base template structure
4. **Clarity**: Separate files are easier to review than inline code
5. **Extensibility**: Easy to add new templates for additional tools or variants

## Adding New Templates

When adding new templates:

1. **Follow naming convention**: `NN_descriptive_name[_variant].ext`
2. **Include clear comments**: Explain what the script does
3. **Use consistent placeholders**: Match existing placeholder names
4. **Test thoroughly**: Ensure placeholders are complete and correct
5. **Update this README**: Add the new template to the "Available Templates" section
6. **Update SKILL.md**: Reference the new template in the appropriate workflow step
@@ -0,0 +1,26 @@
#!/bin/bash
# run_compleasm_first.sh
source ~/.bashrc
conda activate phylo

# User-specified total CPU threads
TOTAL_THREADS=TOTAL_THREADS  # Replace only the right-hand value with the total cores to use (e.g., 16, 32, 64)
echo "Processing first genome with ${TOTAL_THREADS} CPU threads to download lineage database..."

# Create output directory
mkdir -p 01_busco_results

# Process FIRST genome only
first_genome=$(head -n 1 genome_list.txt)
genome_name=$(basename "${first_genome}" .fasta)
echo "Processing: ${genome_name}"

compleasm run \
    -a "${first_genome}" \
    -o "01_busco_results/${genome_name}_compleasm" \
    -l LINEAGE \
    -t "${TOTAL_THREADS}"

echo ""
echo "First genome complete! Lineage database is now cached."
echo "Now run the parallel script for remaining genomes: bash run_compleasm_parallel.sh"
@@ -0,0 +1,33 @@
#!/bin/bash
# run_compleasm_parallel.sh
source ~/.bashrc
conda activate phylo

# Threading configuration (adjust based on your system; replace only the right-hand values)
TOTAL_THREADS=TOTAL_THREADS      # Total cores to use (e.g., 64)
THREADS_PER_JOB=THREADS_PER_JOB  # Threads per genome (e.g., 16)
CONCURRENT_JOBS=$((TOTAL_THREADS / THREADS_PER_JOB))  # Calculated automatically

echo "Configuration:"
echo "  Total threads: ${TOTAL_THREADS}"
echo "  Threads per genome: ${THREADS_PER_JOB}"
echo "  Concurrent genomes: ${CONCURRENT_JOBS}"
echo ""

# Create output directory
mkdir -p 01_busco_results

# Process remaining genomes (skip first one) in parallel; the threads-per-job placeholder inside the quoted command must be replaced too
tail -n +2 genome_list.txt | parallel -j "${CONCURRENT_JOBS}" '
    genome_name=$(basename {} .fasta)
    echo "Processing ${genome_name} with THREADS_PER_JOB threads..."

    compleasm run \
        -a {} \
        -o 01_busco_results/${genome_name}_compleasm \
        -l LINEAGE \
        -t THREADS_PER_JOB
'

echo ""
echo "All genomes processed!"
@@ -0,0 +1,20 @@
#!/bin/bash
source ~/.bashrc
conda activate phylo

cd 06_concatenation

iqtree \
    -s FcC_supermatrix.fas \
    -spp partition_def.txt \
    -nt 18 \
    -safe \
    -pre partition_search \
    -m TESTMERGEONLY \
    -mset MODEL_SET \
    -msub nuclear \
    -rcluster 10 \
    -bb 1000 \
    -alrt 1000

echo "Partition search complete! Best scheme: partition_search.best_scheme.nex"
@@ -0,0 +1,17 @@
#!/bin/bash
source ~/.bashrc
conda activate phylo

cd trimmed_aa

# Create list of alignments
ls *_trimmed.fas > locus_alignments.txt

# Run IQ-TREE in parallel (adjust -j for number of concurrent jobs)
cat locus_alignments.txt | parallel -j 4 '
    prefix=$(basename {} _trimmed.fas)
    iqtree -s {} -m MFP -bb 1000 -bnni -czb -pre ${prefix} -nt 1
    echo "Tree complete: ${prefix}"
'

echo "All gene trees complete!"
@@ -0,0 +1,13 @@
#!/bin/bash
source ~/.bashrc
conda activate phylo

cd trimmed_aa

for locus in *_trimmed.fas; do
    prefix=$(basename "${locus}" _trimmed.fas)
    echo "Processing ${prefix}..."
    iqtree -s "${locus}" -m MFP -bb 1000 -bnni -czb -pre "${prefix}" -nt 1
done

echo "All gene trees complete!"
@@ -0,0 +1,27 @@
#!/bin/bash
#PBS -N compleasm_first
#PBS -l nodes=1:ppn=TOTAL_THREADS # Replace with total available CPUs (e.g., 64)
#PBS -l mem=384gb # Adjust based on ppn × 6GB
#PBS -l walltime=24:00:00

cd "$PBS_O_WORKDIR"
source ~/.bashrc
conda activate phylo

mkdir -p logs
mkdir -p 01_busco_results

# Process FIRST genome only (downloads lineage database)
first_genome=$(head -n 1 genome_list.txt)
genome_name=$(basename "${first_genome}" .fasta)
echo "Processing first genome: ${genome_name} with $PBS_NUM_PPN threads..."
echo "This will download the BUSCO lineage database for subsequent runs."

compleasm run \
    -a "${first_genome}" \
    -o "01_busco_results/${genome_name}_compleasm" \
    -l LINEAGE \
    -t "$PBS_NUM_PPN"

echo "First genome complete! Lineage database is now cached."
echo "Submit the parallel job for remaining genomes: qsub run_compleasm_parallel.job"
@@ -0,0 +1,24 @@
#!/bin/bash
#PBS -N compleasm_parallel
#PBS -t 2-NUM_GENOMES # Start from genome 2 (first genome already processed)
#PBS -l nodes=1:ppn=THREADS_PER_JOB # e.g., 16 for 64-core system
#PBS -l mem=96gb # Adjust based on ppn × 6GB
#PBS -l walltime=48:00:00

cd "$PBS_O_WORKDIR"
source ~/.bashrc
conda activate phylo

mkdir -p 01_busco_results

# Get genome for this array task
genome=$(sed -n "${PBS_ARRAYID}p" genome_list.txt)
genome_name=$(basename "${genome}" .fasta)

echo "Processing ${genome_name} with $PBS_NUM_PPN threads..."

compleasm run \
    -a "${genome}" \
    -o "01_busco_results/${genome_name}_compleasm" \
    -l LINEAGE \
    -t "$PBS_NUM_PPN"
@@ -0,0 +1,22 @@
#!/bin/bash
#PBS -N iqtree_partition
#PBS -l nodes=1:ppn=18
#PBS -l mem=72gb
#PBS -l walltime=72:00:00

cd $PBS_O_WORKDIR/06_concatenation
source ~/.bashrc
conda activate phylo

iqtree \
    -s FcC_supermatrix.fas \
    -spp partition_def.txt \
    -nt 18 \
    -safe \
    -pre partition_search \
    -m TESTMERGEONLY \
    -mset MODEL_SET \
    -msub nuclear \
    -rcluster 10 \
    -bb 1000 \
    -alrt 1000
@@ -0,0 +1,26 @@
#!/bin/bash
#PBS -N iqtree_genes
#PBS -t 1-NUM_LOCI
#PBS -l nodes=1:ppn=1
#PBS -l mem=4gb
#PBS -l walltime=2:00:00

cd "$PBS_O_WORKDIR/trimmed_aa"
source ~/.bashrc
conda activate phylo

# Create list of alignments if not present (safer: create it once before submitting, so array tasks do not race to write it)
if [ ! -f locus_alignments.txt ]; then
    ls *_trimmed.fas > locus_alignments.txt
fi

locus=$(sed -n "${PBS_ARRAYID}p" locus_alignments.txt)

iqtree \
    -s "${locus}" \
    -m MFP \
    -bb 1000 \
    -bnni \
    -czb \
    -pre "$(basename "${locus}" _trimmed.fas)" \
    -nt 1
@@ -0,0 +1,28 @@
#!/bin/bash
#SBATCH --job-name=compleasm_first
#SBATCH --cpus-per-task=TOTAL_THREADS # Replace with total available CPUs (e.g., 64)
#SBATCH --mem-per-cpu=6G
#SBATCH --time=24:00:00
#SBATCH --output=logs/compleasm_first.%j.out
#SBATCH --error=logs/compleasm_first.%j.err

source ~/.bashrc
conda activate phylo

mkdir -p logs  # logs/ must already exist at submission time for the --output/--error paths above
mkdir -p 01_busco_results

# Process FIRST genome only (downloads lineage database)
first_genome=$(head -n 1 genome_list.txt)
genome_name=$(basename "${first_genome}" .fasta)
echo "Processing first genome: ${genome_name} with ${SLURM_CPUS_PER_TASK} threads..."
echo "This will download the BUSCO lineage database for subsequent runs."

compleasm run \
    -a "${first_genome}" \
    -o "01_busco_results/${genome_name}_compleasm" \
    -l LINEAGE \
    -t "${SLURM_CPUS_PER_TASK}"

echo "First genome complete! Lineage database is now cached."
echo "Submit the parallel job for remaining genomes: sbatch run_compleasm_parallel.job"
@@ -0,0 +1,25 @@
#!/bin/bash
#SBATCH --job-name=compleasm_parallel
#SBATCH --array=2-NUM_GENOMES # Start from genome 2 (first genome already processed)
#SBATCH --cpus-per-task=THREADS_PER_JOB # e.g., 16 for 64-core system with 4 concurrent jobs
#SBATCH --mem-per-cpu=6G
#SBATCH --time=48:00:00
#SBATCH --output=logs/compleasm.%A_%a.out
#SBATCH --error=logs/compleasm.%A_%a.err

source ~/.bashrc
conda activate phylo

mkdir -p 01_busco_results

# Get genome for this array task (the array starts at 2, so the first genome is skipped)
genome=$(sed -n "${SLURM_ARRAY_TASK_ID}p" genome_list.txt)
genome_name=$(basename "${genome}" .fasta)

echo "Processing ${genome_name} with ${SLURM_CPUS_PER_TASK} threads..."

compleasm run \
    -a "${genome}" \
    -o "01_busco_results/${genome_name}_compleasm" \
    -l LINEAGE \
    -t "${SLURM_CPUS_PER_TASK}"
@@ -0,0 +1,27 @@
#!/bin/bash
#SBATCH --job-name=iqtree_partition
#SBATCH --cpus-per-task=18
#SBATCH --mem-per-cpu=4G
#SBATCH --time=72:00:00
#SBATCH --output=logs/partition_search.out
#SBATCH --error=logs/partition_search.err

source ~/.bashrc
conda activate phylo

cd 06_concatenation # Use organized directory structure

iqtree \
    -s FcC_supermatrix.fas \
    -spp partition_def.txt \
    -nt ${SLURM_CPUS_PER_TASK} \
    -safe \
    -pre partition_search \
    -m TESTMERGEONLY \
    -mset MODEL_SET \
    -msub nuclear \
    -rcluster 10 \
    -bb 1000 \
    -alrt 1000

# Output: partition_search.best_scheme.nex
@@ -0,0 +1,28 @@
#!/bin/bash
#SBATCH --job-name=iqtree_genes
#SBATCH --array=1-NUM_LOCI
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=2:00:00
#SBATCH --output=logs/%A_%a.genetree.out

source ~/.bashrc
conda activate phylo

cd trimmed_aa

# Create list of alignments if not present (safer: create it once before submitting, so array tasks do not race to write it)
if [ ! -f locus_alignments.txt ]; then
    ls *_trimmed.fas > locus_alignments.txt
fi

locus=$(sed -n "${SLURM_ARRAY_TASK_ID}p" locus_alignments.txt)

iqtree \
    -s "${locus}" \
    -m MFP \
    -bb 1000 \
    -bnni \
    -czb \
    -pre "$(basename "${locus}" _trimmed.fas)" \
    -nt 1