Files
gh-brunoasm-my-claude-skill…/skills/phylo_from_buscos/references/REFERENCE.md
2025-11-29 18:02:37 +08:00

63 KiB
Raw Blame History

BUSCO-based Phylogenomics - Technical Reference

Detailed technical reference for implementing phylogenomic workflows.

Table of Contents

  1. Sample Naming Best Practices
  2. BUSCO Lineage Datasets
  3. Resource Recommendations
  4. Template Job Scripts
  5. Common Issues
  6. Quality Control Guidelines
  7. Aliscore/ALICUT: Detailed Guide
  8. Tool Citations
  9. Software Installation Guide

Sample Naming Best Practices

Sample names appear in your final phylogenetic trees, so choose them carefully!

[ACCESSION]_[SPECIES_NAME]

Examples:

  • GCA000001735_Arabidopsis_thaliana
  • GCF009858895_Apis_mellifera
  • PRJNA12345_Drosophila_melanogaster_strain_w1118

Why This Format?

  1. Accession first = Easy to trace back to original data
  2. Species name = Readable in phylogenetic trees
  3. Underscore-separated = Compatible with all phylogenetics software
  4. No spaces or special characters = Prevents parsing errors

Rules for Sample Names

DO:

  • Use only letters, numbers, underscores, and hyphens
  • Keep names reasonably short (<50 characters)
  • Be consistent across your dataset
  • Include strain/population info if relevant (e.g., GCA123_Species_name_pop1)

DON'T:

  • Use spaces (use underscores instead)
  • Use special characters: ()[]{}|<>@#$%^&*+=;:'",./\
  • Start with numbers (some tools don't like this)
  • Use periods except for version numbers
  • Make names too cryptic (will appear in publications!)

Using the Rename Helper Script

The scripts/rename_genomes.py helper can assist with renaming:

# Create a template mapping file
python scripts/rename_genomes.py --create-template *.fasta > samples.tsv

# Edit samples.tsv to add meaningful names:
# GCA_000001735.2.fasta    GCA000001735_Arabidopsis_thaliana
# GCF_009858895.2.fasta    GCF009858895_Apis_mellifera

# Apply the mapping (with backup)
python scripts/rename_genomes.py --mapping samples.tsv

# Or use interactive mode
python scripts/rename_genomes.py --interactive *.fasta

For NCBI Downloaded Genomes

When downloading from NCBI, genome files are typically in subdirectories like:

ncbi_dataset/data/GCA_000001735.2/GCA_000001735.2_genomic.fna

You'll need to:

  1. Extract assembly accessions and organism names
  2. Create meaningful sample names
  3. Copy and rename files to working directory

Example workflow:

# List assemblies with organism names
for dir in ncbi_dataset/data/GCA_*; do
    acc=$(basename $dir)
    # Extract organism name from metadata
    echo "$acc"
done

# Create mapping file manually or with download_ncbi_genomes.py --list-only

BUSCO Lineage Datasets

General Lineages

  • eukaryota_odb10 - All eukaryotes (255 BUSCOs)
  • bacteria_odb10 - All bacteria (124 BUSCOs)
  • archaea_odb10 - All archaea (194 BUSCOs)

Eukaryotic Kingdoms

  • metazoa_odb10 - Animals (954 BUSCOs)
  • viridiplantae_odb10 - Green plants (425 BUSCOs)
  • fungi_odb10 - Fungi (758 BUSCOs)

Animals (Metazoa)

  • arthropoda_odb10 - Arthropods (1013 BUSCOs)
    • insecta_odb10 - Insects (1367 BUSCOs)
      • diptera_odb10 - Flies (3285 BUSCOs)
      • hymenoptera_odb10 - Bees, wasps, ants (5991 BUSCOs)
      • lepidoptera_odb10 - Moths, butterflies (5286 BUSCOs)
    • arachnida_odb10 - Spiders, mites (2934 BUSCOs)
  • vertebrata_odb10 - Vertebrates (3354 BUSCOs)
    • actinopterygii_odb10 - Ray-finned fish (3640 BUSCOs)
    • mammalia_odb10 - Mammals (9226 BUSCOs)
    • aves_odb10 - Birds (8338 BUSCOs)
  • mollusca_odb10 - Molluscs (5295 BUSCOs)
  • nematoda_odb10 - Roundworms (3131 BUSCOs)

Plants (Viridiplantae)

  • eudicots_odb10 - Eudicots (2326 BUSCOs)
  • liliopsida_odb10 - Monocots (3278 BUSCOs)
  • embryophyta_odb10 - Land plants (1614 BUSCOs)

Fungi

  • ascomycota_odb10 - Ascomycetes (1706 BUSCOs)
  • basidiomycota_odb10 - Basidiomycetes (1335 BUSCOs)

For complete list, see: https://busco-data.ezlab.org/v5/data/lineages/


Resource Recommendations

SLURM/PBS Job Resource Allocations

Step CPUs RAM per CPU Total RAM Walltime Notes
compleasm 4 6 GB 24 GB 24h Increase to 8-10 GB for large genomes (>2 Gbp)
MAFFT (per locus) 1 4 GB 4 GB 24h Can run as large array job
Aliscore 1 4 GB 4 GB 24h Array job
trimAl 1 2 GB 2 GB 2h Very fast
BMGE 1 2 GB 2 GB 4h Moderate speed
ClipKit 1 2 GB 2 GB 2h Very fast
IQ-TREE (gene) 1 4 GB 4 GB 2h Array job for all loci
IQ-TREE (concat) 18-32 4 GB 72-128 GB 72h Main phylogeny job
ASTRAL 1 8 GB 8 GB <1h Usually very fast

Scaling Guidelines

Small dataset (<20 genomes, <1000 loci):

  • Can run on local machine
  • Expect ~2-5 days total runtime

Medium dataset (20-50 genomes, 1000-3000 loci):

  • Cluster recommended
  • Expect ~3-7 days with parallelization

Large dataset (>50 genomes, >3000 loci):

  • Cluster required
  • Expect 1-2 weeks with good parallelization

Template Job Scripts

SLURM Array Template

#!/bin/bash
#SBATCH --job-name=JOB_NAME
#SBATCH --array=1-N
#SBATCH --cpus-per-task=NCPUS
#SBATCH --mem-per-cpu=MEMORY
#SBATCH --time=WALLTIME
#SBATCH --output=logs/%A_%a.JOBNAME.out
#SBATCH --error=logs/%A_%a.JOBNAME.err
#SBATCH --mail-type=FAIL,END
#SBATCH --mail-user=YOUR_EMAIL

source ~/.bashrc
conda activate ENV_NAME

# Parse input file
input=$(sed -n "${SLURM_ARRAY_TASK_ID}p" input_list.txt)

# Run command
COMMAND ${input}

PBS Array Template

#!/bin/bash
#PBS -N JOB_NAME
#PBS -t 1-N
#PBS -l nodes=1:ppn=NCPUS
#PBS -l mem=MEMORY
#PBS -l walltime=WALLTIME
#PBS -j oe
#PBS -m abe
#PBS -M YOUR_EMAIL

cd $PBS_O_WORKDIR
source ~/.bashrc
conda activate ENV_NAME

# Parse input file
input=$(sed -n "${PBS_ARRAYID}p" input_list.txt)

# Run command
COMMAND ${input}

Local Sequential Template

#!/bin/bash
# Sequential execution for local machine

source ~/.bashrc
conda activate ENV_NAME

while read input; do
  echo "Processing ${input}..."
  COMMAND ${input}
done < input_list.txt

echo "All jobs complete"

Common Issues

Problem: compleasm runs out of memory

Solution:

  • Increase --mem-per-cpu to 8 GB or 10 GB
  • Some large/complex genomes need more RAM

Problem: IQ-TREE stalls or runs extremely slowly

Solution:

  • Add -safe flag (enables safe numerical mode, slower but more stable)
  • Reduce number of threads if on shared system
  • Check for very long branches or problematic sequences

Problem: Array job exceeds cluster limits

Solution:

  • Split into batches (e.g., if limit is 1000, run arrays 1-1000, 1001-2000, etc.)
  • Example: #SBATCH --array=1-1000%50 (runs 1000 jobs, max 50 concurrent)

Problem: Missing orthologs in some genomes

Solution:

  • This is normal and expected
  • FASconCAT and IQ-TREE handle missing data automatically
  • If >20% orthologs missing, consider genome quality issues

Problem: Alignment looks poor/misaligned

Solution:

  • Visualize with AliView or Jalview
  • MAFFT L-INS-i is accurate but slow; for very divergent sequences, try E-INS-i
  • Consider stricter trimming parameters
  • Very divergent sequences may not be suitable for phylogenomics

Problem: Gene trees conflict with concatenation tree

Solution:

  • This is common and expected (incomplete lineage sorting, gene flow)
  • ASTRAL species tree accounts for discordance
  • Compare both trees and branch support values
  • Look for systematic vs. random conflicts

Problem: Low bootstrap/posterior support values

Solution:

  • Check alignment quality
  • Try more stringent trimming
  • Evaluate locus informativeness (some may be uninformative)
  • Consider rapid diversification or conflicting signal
  • More data doesn't always help if signal quality is poor

Quality Control Guidelines

Genome Completeness Assessment

Excellent (>95% complete BUSCOs):

  • Highly complete genomes
  • Retain for phylogenomics
  • Expected to contribute many orthologs

Good (90-95% complete):

  • Generally acceptable
  • May be missing some loci
  • Retain unless other quality concerns

Acceptable (85-90% complete):

  • Marginal quality
  • Will have more missing orthologs
  • Consider case-by-case based on biological importance

Questionable (70-85% complete):

  • Poor completeness
  • May introduce noise
  • Recommend excluding unless scientifically critical

Poor (<70% complete):

  • Very incomplete
  • Strong recommend to exclude
  • Likely contaminated, fragmented, or poor assembly

Fragmentation and Duplication

Fragmented BUSCOs:

  • <5%: Excellent
  • 5-10%: Good
  • 10%: Indicates assembly fragmentation issues

Duplicated BUSCOs:

  • <2%: Excellent
  • 2-5%: Good (may indicate recent WGD or heterozygosity)
  • 10%: Likely contamination or assembly issues


Aliscore/ALICUT: Detailed Guide

What is Aliscore/ALICUT?

Aliscore (Alignment Sequence Conservancy Checker) uses Monte Carlo resampling to identify randomly similar sequence (RSS) sections in multiple sequence alignments. These are regions where observed similarity is not significantly different from random expectation, which can mislead phylogenetic inference.

ALICUT (Alignment Cutter) removes the RSS positions identified by Aliscore, producing trimmed alignments suitable for phylogenetic analysis.

When to Use Aliscore/ALICUT

Recommended for:

  • Phylogenomic datasets (hundreds to thousands of loci)
  • Amino acid alignments from single-copy orthologs
  • Mixed-quality alignments with variable conservation
  • Published phylogenomics studies (widely used and cited)

Consider alternatives for:

  • Very short alignments (<100 positions) - not enough data for statistics
  • Perfectly conserved sequences - no trimming needed
  • Time-sensitive analyses - Aliscore can be slow for very long alignments

Key Parameters Explained

Window Size (-w)

Controls the sliding window used for scoring alignment regions.

  • Default: 4 (recommended for most analyses)
  • Smaller (3): More sensitive, may over-trim well-aligned regions (Type I error)
  • Larger (6-8): Less sensitive to short RSS sections, more conservative

Recommendation: Use default -w 4 unless you have specific concerns about over-trimming.

Random Pairs (-r)

Number of pairwise sequence comparisons to perform.

  • Default: 4×N (where N = number of taxa)
  • Higher values: More comprehensive but slower; diminishing returns beyond 4×N
  • Lower values: Faster but less reliable statistics

Recommendation: Use default for most analyses; increase for small datasets (<10 taxa).

Gap Treatment (-N)

Controls how alignment gaps (indels) are interpreted.

  • Without -N: Gaps treated as 5th character state (informative)

    • Use for: Well-aligned conserved proteins where indels are rare
    • Effect: Conserves long indel regions present in most taxa
  • With -N: Gaps treated as ambiguous/missing data

    • Use for: Amino acid sequences (recommended)
    • Effect: More stringent; removes poorly aligned gappy regions

Recommendation: Always use -N for amino acid data from BUSCO/compleasm orthologs.

Tree-Guided Mode (-t)

Use a phylogenetic tree to guide comparisons (compares sister taxa first).

  • Advantages: More phylogenetically informed, can be faster
  • Disadvantages: Requires pre-existing tree, assumes tree is accurate
  • When to use: Large datasets (>100 taxa) where random sampling is slow

Recommendation: Use random mode (default) for initial analyses; tree-guided mode for refinement.

Understanding Aliscore Output

Each Aliscore run generates three files:

  1. [alignment]_List_random.txt

    • Space-separated list of RSS position numbers
    • Input file for ALICUT
    • Empty file = no RSS detected (alignment is clean)
  2. [alignment]_Profile_random.txt

    • Three columns: Position, Positive_Score, Negative_Score
    • Shows quality profile across alignment
    • Negative values indicate RSS positions
  3. [alignment].svg

    • Visual plot of scoring profiles
    • Y-axis: Score (positive = conserved, negative = random)
    • X-axis: Alignment position
    • Useful for manual inspection

Interpreting Trimming Results

After running Aliscore + ALICUT, evaluate trimming statistics:

Positions Removed

  • <10%: Excellent alignment quality, minimal trimming needed
  • 10-20%: Good alignment quality, reasonable trimming
  • 20-35%: Moderate quality, substantial but acceptable trimming
  • 35-50%: Poor alignment quality, consider manual inspection
  • >50%: Very poor alignment, consider excluding entire locus

What to Check

# View summary statistics
cat trimmed_aa/trimming_summary.txt

# Check specific locus
cd aliscore_output/aliscore_[locus]
cat ALICUT_info.xls

Red Flags

  • Uniformly high trimming across all loci: Check alignment quality (MAFFT parameters)
  • One locus with >50% trimmed: Likely paralogous or contamination
  • No RSS detected for most loci: Sequences may be too similar (recent divergence)

ALICUT Options

-r (Remain Stems)

For RNA secondary structure alignments, preserves paired stem positions.

bash scripts/run_alicut.sh aliscore_output/aliscore_16S/ -r -s

When to use: rRNA genes (16S, 18S, 28S) with structure annotation

-c (Remove Codon)

Translates amino acid RSS positions to nucleotide triplets (back-translation).

# After running Aliscore on protein alignment
bash scripts/run_alicut.sh aliscore_output/aliscore_protein/ -c -s

When to use:

  • Analyzed protein alignment, want to trim corresponding nucleotide alignment
  • Requires both protein and nucleotide files with identical names

-3 (Remove 3rd Position)

Removes only 3rd codon positions of identified RSS.

bash scripts/run_alicut.sh aliscore_output/aliscore_protein/ -c -3 -s

When to use:

  • Want to exclude fast-evolving 3rd codon positions
  • Can combine with -c option

Workflow Integration

Typical Usage (Batch Mode)

# Process all aligned amino acids through Aliscore + ALICUT
bash scripts/run_aliscore_alicut_batch.sh aligned_aa/ -N -o trimmed_aa

This is the recommended approach for most phylogenomic analyses.

Array Job Mode (HPC)

For large datasets on compute clusters:

# Step 1: Aliscore array job
cd aligned_aa
ls *.fas > locus_list.txt
# Submit array job (see SKILL.md for templates)

# Step 2: After Aliscore completes, batch process ALICUT
for dir in aliscore_output/aliscore_*/; do
    bash ../scripts/run_alicut.sh "${dir}" -s
done

Quality Control After Trimming

Check Alignment Lengths

# Summary statistics
awk 'NR>1 {sum+=$3; count++} END {print "Mean trimmed length:", sum/count}' \
    trimmed_aa/trimming_summary.txt

# Find very short alignments
awk 'NR>1 && $3<100 {print $1, $3}' trimmed_aa/trimming_summary.txt

Minimum recommended: 100 amino acids after trimming

Visual Inspection

For critical loci, view the SVG plots:

# Open in browser
firefox aliscore_output/aliscore_[locus]/*svg

Look for:

  • Large contiguous RSS regions → May indicate paralogy
  • RSS at alignment ends → Common and acceptable
  • Scattered RSS throughout → Normal for divergent sequences

Troubleshooting

Aliscore Errors

"Can't locate Aliscore_module.pm"

  • Download both Aliscore.02.2.pl and Aliscore_module.pm
  • Keep in same directory

"taxon names of tree and sequence files do not match"

  • Tree mode: ensure tree tip labels exactly match FASTA headers
  • Solution: Use random mode instead (-r option)

"Sequence length too short"

  • Alignment has fewer positions than RSS identified
  • Usually indicates corrupted input file

ALICUT Errors

"Can not find List file"

  • Aliscore didn't complete successfully
  • Check Aliscore logs for errors

"File [alignment] is empty"

  • Aliscore didn't copy alignment file
  • Run from correct directory

All positions removed

  • Extremely poor alignment quality
  • Exclude this locus from analysis

Performance Considerations

Memory: Aliscore typically uses 1-2 GB per alignment

Runtime:

  • Small alignments (10 taxa, 1000 positions): 1-5 minutes
  • Medium (50 taxa, 2000 positions): 10-30 minutes
  • Large (100 taxa, 3000 positions): 30-60 minutes

Parallelization:

  • Aliscore itself is single-threaded
  • Parallelize across loci using array jobs
  • Typical dataset: 1000-2000 loci × 20 minutes = use array jobs

Comparison with Other Trimmers

Tool Method Speed Stringency Best For
Aliscore/ALICUT Monte Carlo RSS detection Slow Moderate-High Phylogenomics (gold standard)
trimAl Gap/conservation thresholds Very fast Customizable Large datasets, quick analyses
BMGE Entropy-based Fast Moderate Balanced speed/quality
ClipKit Parsimony-informative sites Very fast Low-Moderate Maximum data retention

Recommendation: Use Aliscore/ALICUT for final publication-quality analyses; trimAl for initial exploratory work.


Tool Citations

Required Citations

compleasm: Huang, N., & Li, H. (2023). compleasm: a faster and more accurate reimplementation of BUSCO. Bioinformatics, 39(10), btad595. https://doi.org/10.1093/bioinformatics/btad595

BUSCO (if used instead of compleasm): Manni, M., Berkeley, M. R., Seppey, M., Simão, F. A., & Zdobnov, E. M. (2021). BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Molecular Biology and Evolution, 38(10), 4647-4654.

MAFFT: Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Molecular Biology and Evolution, 30(4), 772-780.

IQ-TREE: Minh, B. Q., Schmidt, H. A., Chernomor, O., Schrempf, D., Woodhams, M. D., von Haeseler, A., & Lanfear, R. (2020). IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. Molecular Biology and Evolution, 37(5), 1530-1534.

ASTRAL: Zhang, C., Rabiee, M., Sayyari, E., & Mirarab, S. (2018). ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics, 19(6), 153.

Trimming Tool Citations

Aliscore/ALICUT: Kück, P., Meusemann, K., Dambach, J., Thormann, B., von Reumont, B. M., Wägele, J. W., & Misof, B. (2010). Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. Frontiers in Zoology, 7(1), 10.

trimAl: Capella-Gutiérrez, S., Silla-Martínez, J. M., & Gabaldón, T. (2009). trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. Bioinformatics, 25(15), 1972-1973.

BMGE: Criscuolo, A., & Gribaldo, S. (2010). BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. BMC Evolutionary Biology, 10(1), 210.

ClipKit: Steenwyk, J. L., Buida III, T. J., Li, Y., Shen, X. X., & Rokas, A. (2020). ClipKIT: a multiple sequence alignment trimming software for accurate phylogenomic inference. PLOS Biology, 18(12), e3001007.


Software Installation Guide

This section provides detailed installation instructions for all tools, with options for both conda-based and manual installations. All methods work without sudo/admin access.

Use this automated bash script to set up a complete unified conda environment with all necessary tools. This is the script referenced in SKILL.md STEP 0.

Generate setup_phylo_env.sh with the following content:

#!/bin/bash
# setup_phylo_env.sh
# Sets up unified conda environment for phylogenomics workflow
# Generated by Claude phylo_from_buscos skill

set -e

echo "=========================================="
echo "Phylogenomics Environment Setup"
echo "=========================================="
echo ""

# Check if conda is available
if ! command -v conda &> /dev/null; then
    echo "ERROR: conda not found. Please install Miniconda or Anaconda first."
    echo "Visit: https://docs.conda.io/en/latest/miniconda.html"
    exit 1
fi

# Ask user preference for conda vs mamba
echo "We will use Anaconda/Miniconda to set up the software environment."
echo ""
echo "Package Manager Options:"
echo "  1) mamba (faster, recommended if available)"
echo "  2) conda (standard, always available)"
echo ""
read -p "Enter choice [1-2] (default: 2): " PKG_MGR_CHOICE
PKG_MGR_CHOICE=${PKG_MGR_CHOICE:-2}

if [ "${PKG_MGR_CHOICE}" = "1" ]; then
    if command -v mamba &> /dev/null; then
        PKG_MANAGER="mamba"
        echo "Using mamba for environment creation"
    else
        echo "WARNING: mamba not found. Falling back to conda."
        echo "To install mamba: conda install -n base -c conda-forge mamba"
        PKG_MANAGER="conda"
    fi
else
    PKG_MANAGER="conda"
    echo "Using conda for environment creation"
fi

echo ""

# Environment name
ENV_NAME="phylo"

echo "Creating environment: ${ENV_NAME}"
echo ""

# Create environment with all tools (using chosen package manager)
${PKG_MANAGER} create -n ${ENV_NAME} -y \
    -c conda-forge -c bioconda \
    python=3.9 \
    astral-tree \
    compleasm \
    mafft \
    trimal \
    clipkit \
    bmge \
    iqtree \
    perl \
    perl-bioperl \
    parallel \
    wget \
    ncbi-datasets-cli \
    openjdk

echo ""
echo "Environment created successfully!"
echo ""

# Setup Aliscore and ALICUT Perl scripts
echo "Setting up Aliscore and ALICUT Perl scripts..."
echo ""

# Activate environment
source "$(conda info --base)/etc/profile.d/conda.sh"
conda activate ${ENV_NAME}

# Get the directory where this skill is located
SKILL_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

# Create scripts directory
mkdir -p scripts

# Ask user preference for script source
echo "Aliscore/ALICUT Script Source Options:"
echo "  1) Use predownloaded scripts (from Paul Frandsen's tutorial, tested)"
echo "  2) Download latest versions from official repository"
echo ""
read -p "Enter choice [1-2] (default: 1): " SCRIPT_CHOICE
SCRIPT_CHOICE=${SCRIPT_CHOICE:-1}

if [ "${SCRIPT_CHOICE}" = "1" ]; then
    echo "Using predownloaded Aliscore/ALICUT scripts..."

    # Download predownloaded scripts from GitHub repository
    GITHUB_BASE="https://raw.githubusercontent.com/brunoasm/my_claude_skills/main/phylo_from_buscos/scripts/predownloaded_aliscore_alicut"
    if wget -q "${GITHUB_BASE}/Aliscore.02.2.pl" -O scripts/Aliscore.02.2.pl && \
       wget -q "${GITHUB_BASE}/ALICUT_V2.31.pl" -O scripts/ALICUT_V2.31.pl && \
       wget -q "${GITHUB_BASE}/Aliscore_module.pm" -O scripts/Aliscore_module.pm; then
        chmod +x scripts/Aliscore.02.2.pl scripts/ALICUT_V2.31.pl
        echo "Predownloaded scripts downloaded successfully."
    else
        echo "ERROR: Failed to download predownloaded scripts. Falling back to download option."
        SCRIPT_CHOICE="2"
    fi
fi

if [ "${SCRIPT_CHOICE}" = "2" ]; then
    echo "Downloading latest Aliscore/ALICUT scripts from GitHub..."

    # Try to download from GitHub repository
    if wget -q https://github.com/PatrickKueck/AliCUT/raw/master/Aliscore_v.2.0/Aliscore.02.2.pl -O scripts/Aliscore.02.2.pl && \
       wget -q https://github.com/PatrickKueck/AliCUT/raw/master/ALICUT_V2.3.1/ALICUT_V2.31.pl -O scripts/ALICUT_V2.31.pl && \
       wget -q https://github.com/PatrickKueck/AliCUT/raw/master/Aliscore_v.2.0/Aliscore_module.pm -O scripts/Aliscore_module.pm; then
        chmod +x scripts/Aliscore.02.2.pl scripts/ALICUT_V2.31.pl
        echo "Latest scripts downloaded successfully."
    else
        echo "ERROR: Failed to download scripts from GitHub."
        echo "Please manually download from: https://github.com/PatrickKueck/AliCUT"
        exit 1
    fi
fi

echo ""
echo "=========================================="
echo "Setup Complete!"
echo "=========================================="
echo ""
echo "Conda environment: ${ENV_NAME}"
echo "Perl with BioPerl: installed"
echo "Aliscore script:   scripts/Aliscore.02.2.pl"
echo "ALICUT script:     scripts/ALICUT_V2.31.pl"
echo "Aliscore module:   scripts/Aliscore_module.pm"
echo ""
echo "To activate environment:"
echo "  conda activate ${ENV_NAME}"
echo ""
echo "Key tools installed:"
conda list | grep -E "compleasm|mafft|trimal|clipkit|bmge|iqtree|astral|parallel|perl|openjdk"
echo ""

Usage:

# Run setup script
bash setup_phylo_env.sh

# Activate environment for all workflow steps
conda activate phylo

This unified environment includes:

  • compleasm - BUSCO ortholog identification
  • mafft - Multiple sequence alignment
  • trimal, clipkit, bmge - Alignment trimming
  • iqtree - Phylogenetic inference
  • astral-tree - Species tree inference (coalescent method)
  • openjdk - Java runtime for ASTRAL and other tools
  • perl with BioPerl - Required for Aliscore/ALICUT
  • parallel - GNU parallel for batch processing
  • ncbi-datasets-cli - For NCBI genome downloads

Important Notes:

  • Users can choose between using mamba (faster) or conda (standard) for environment creation
  • Users can choose between predownloaded Aliscore/ALICUT scripts (tested with tutorial) or latest versions from GitHub
  • Predownloaded scripts are downloaded from the GitHub repository
  • All subsequent workflow steps should use conda activate phylo instead of creating separate environments
  • The unified environment simplifies workflow management and reduces disk space usage
  • ASTRAL is installed as astral-tree and accessible via the astral command (no manual download needed)

Most tools are available via conda/bioconda and work on both Linux and macOS (including Apple Silicon with Rosetta).

Initial Setup

# Add channels (one-time setup)
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
conda config --set channel_priority strict

Core Tools (Available via Conda)

# Create main phylogenomics environment
conda create -n phylogenomics -c conda-forge -c bioconda \
    compleasm \
    mafft \
    iqtree \
    trimal \
    bmge \
    clipkit \
    astral-tree \
    perl \
    perl-bioperl \
    python=3.9 \
    biopython

conda activate phylogenomics

Individual installations (if you prefer separate environments):

# Ortholog detection
conda create -n compleasm -c conda-forge -c bioconda compleasm
# Alternative: BUSCO
conda create -n busco -c conda-forge -c bioconda busco

# Alignment
conda create -n mafft -c conda-forge -c bioconda mafft

# Trimming tools (choose one or more)
conda create -n trimal -c bioconda trimal
conda create -n bmge -c bioconda bmge
conda create -n clipkit -c bioconda clipkit

# Phylogenetic inference
conda create -n iqtree -c bioconda iqtree
conda create -n astral -c bioconda astral-tree

# NCBI data download
conda create -n ncbi -c conda-forge ncbi-datasets-cli

Platform-Specific Notes

macOS (Intel and Apple Silicon):

  • Most tools work natively on Intel Macs
  • Apple Silicon (M1/M2/M3) may require Rosetta 2 for some packages
  • If you encounter issues, use: CONDA_SUBDIR=osx-64 conda create -n myenv ...

Linux:

  • All tools work natively
  • HPC systems: Use module load conda or install Miniforge in your home directory

Manual Installation (Tools NOT in Conda)

Some tools require manual download and setup within your conda environment.

1. Aliscore and ALICUT

Aliscore and ALICUT are Perl scripts that need to be installed into your conda environment.

Installation:

# Activate your environment first
conda activate phylogenomics

# Create temporary download directory
mkdir -p /tmp/phylo_downloads
cd /tmp/phylo_downloads

# Download Aliscore scripts
wget https://raw.githubusercontent.com/PatrickKueck/AliCUT/master/Aliscore.02.2.pl
wget https://raw.githubusercontent.com/PatrickKueck/AliCUT/master/Aliscore_module.pm

# Download ALICUT
wget https://raw.githubusercontent.com/PatrickKueck/AliCUT/master/ALICUT_V2.31.pl

# Install into conda environment bin directory
mkdir -p $CONDA_PREFIX/bin
cp Aliscore.02.2.pl $CONDA_PREFIX/bin/
cp Aliscore_module.pm $CONDA_PREFIX/bin/
cp ALICUT_V2.31.pl $CONDA_PREFIX/bin/

# Make scripts executable
chmod +x $CONDA_PREFIX/bin/Aliscore.02.2.pl
chmod +x $CONDA_PREFIX/bin/ALICUT_V2.31.pl

# Create convenient wrapper scripts without .pl extension
cat > $CONDA_PREFIX/bin/aliscore <<'EOF'
#!/bin/bash
perl $(dirname $0)/Aliscore.02.2.pl "$@"
EOF

cat > $CONDA_PREFIX/bin/alicut <<'EOF'
#!/bin/bash
perl $(dirname $0)/ALICUT_V2.31.pl "$@"
EOF

chmod +x $CONDA_PREFIX/bin/aliscore
chmod +x $CONDA_PREFIX/bin/alicut

# Cleanup
cd ~
rm -rf /tmp/phylo_downloads

Verify installation:

# Test with wrapper commands
aliscore
alicut -h

# Or call Perl scripts directly
perl $CONDA_PREFIX/bin/Aliscore.02.2.pl
perl $CONDA_PREFIX/bin/ALICUT_V2.31.pl -h

Note: The Aliscore_module.pm must be in the same directory as Aliscore.02.2.pl. Since both are installed to $CONDA_PREFIX/bin, this is handled automatically.

2. FASconCAT-G

FASconCAT-G is a Perl script for concatenating multiple sequence alignments.

Installation:

# Activate your environment
conda activate phylogenomics

# Create temporary download directory
mkdir -p /tmp/phylo_downloads
cd /tmp/phylo_downloads

# Download FASconCAT-G
wget https://raw.githubusercontent.com/PatrickKueck/FASconCAT-G/master/FASconCAT-G_v1.06.1.pl

# Install into conda environment bin directory
mkdir -p $CONDA_PREFIX/bin
cp FASconCAT-G_v1.06.1.pl $CONDA_PREFIX/bin/

# Make executable
chmod +x $CONDA_PREFIX/bin/FASconCAT-G_v1.06.1.pl

# Create convenient wrapper script
cat > $CONDA_PREFIX/bin/fasconcat <<'EOF'
#!/bin/bash
perl $(dirname $0)/FASconCAT-G_v1.06.1.pl "$@"
EOF

chmod +x $CONDA_PREFIX/bin/fasconcat

# Cleanup
cd ~
rm -rf /tmp/phylo_downloads

Verify installation:

# Test with wrapper command
fasconcat
# Should display the interactive menu

# Or call Perl script directly
perl $CONDA_PREFIX/bin/FASconCAT-G_v1.06.1.pl

3. IQ-TREE (Alternative: Direct Binary Download)

While IQ-TREE is available via conda, you may want the latest version directly from GitHub.

Installation (Linux):

conda activate phylogenomics

# Create temporary download directory
mkdir -p /tmp/phylo_downloads
cd /tmp/phylo_downloads

# Download and extract
wget https://github.com/iqtree/iqtree2/releases/download/v2.3.6/iqtree-2.3.6-Linux-intel.tar.gz
tar -xzf iqtree-2.3.6-Linux-intel.tar.gz

# Install binaries into conda environment
mkdir -p $CONDA_PREFIX/bin
cp iqtree-2.3.6-Linux-intel/bin/iqtree2 $CONDA_PREFIX/bin/
chmod +x $CONDA_PREFIX/bin/iqtree2

# Create symlink for version-agnostic usage
ln -sf $CONDA_PREFIX/bin/iqtree2 $CONDA_PREFIX/bin/iqtree

# Cleanup
cd ~
rm -rf /tmp/phylo_downloads

Installation (macOS Intel):

conda activate phylogenomics

mkdir -p /tmp/phylo_downloads
cd /tmp/phylo_downloads

# Download and extract
wget https://github.com/iqtree/iqtree2/releases/download/v2.3.6/iqtree-2.3.6-macOS-intel.tar.gz
tar -xzf iqtree-2.3.6-macOS-intel.tar.gz

# Install into conda environment
mkdir -p $CONDA_PREFIX/bin
cp iqtree-2.3.6-macOS-intel/bin/iqtree2 $CONDA_PREFIX/bin/
chmod +x $CONDA_PREFIX/bin/iqtree2

# Create symlink for version-agnostic usage
ln -sf $CONDA_PREFIX/bin/iqtree2 $CONDA_PREFIX/bin/iqtree

# Cleanup
cd ~
rm -rf /tmp/phylo_downloads

Installation (macOS Apple Silicon):

conda activate phylogenomics

mkdir -p /tmp/phylo_downloads
cd /tmp/phylo_downloads

# Download and extract
wget https://github.com/iqtree/iqtree2/releases/download/v2.3.6/iqtree-2.3.6-macOS-arm.tar.gz
tar -xzf iqtree-2.3.6-macOS-arm.tar.gz

# Install into conda environment
mkdir -p $CONDA_PREFIX/bin
cp iqtree-2.3.6-macOS-arm/bin/iqtree2 $CONDA_PREFIX/bin/
chmod +x $CONDA_PREFIX/bin/iqtree2

# Create symlink for version-agnostic usage
ln -sf $CONDA_PREFIX/bin/iqtree2 $CONDA_PREFIX/bin/iqtree

# Cleanup
cd ~
rm -rf /tmp/phylo_downloads

Verify installation:

iqtree --version
# or
iqtree2 --version

4. ASTRAL (Alternative: Direct JAR Download)

While ASTRAL is available via conda, you can also install the JAR file directly.

Installation:

conda activate phylogenomics

# Create temporary download directory
mkdir -p /tmp/phylo_downloads
cd /tmp/phylo_downloads

# Download ASTRAL
wget https://github.com/smirarab/ASTRAL/raw/master/Astral.5.7.8.zip
unzip Astral.5.7.8.zip

# Install JAR file into conda environment
mkdir -p $CONDA_PREFIX/share/astral
cp Astral/astral.5.7.8.jar $CONDA_PREFIX/share/astral/
cp -r Astral/lib $CONDA_PREFIX/share/astral/ 2>/dev/null || true

# Create wrapper script
cat > $CONDA_PREFIX/bin/astral-jar <<'EOF'
#!/bin/bash
java -jar $CONDA_PREFIX/share/astral/astral.5.7.8.jar "$@"
EOF

chmod +x $CONDA_PREFIX/bin/astral-jar

# Cleanup
cd ~
rm -rf /tmp/phylo_downloads

Verify installation:

astral-jar -h

Note: This creates an astral-jar command to avoid conflicts with the conda astral package if both are installed.

Verification Script

Create a script to verify all installations:

#!/bin/bash
# verify_installations.sh

echo "=============================================="
echo "Phylogenomics Tools Installation Verification"
echo "=============================================="
echo ""
echo "Conda environment: $CONDA_PREFIX"
echo ""

# Function to check command
check_cmd() {
    if command -v $1 &> /dev/null; then
        version=$($1 --version 2>&1 | head -n 1 || echo "installed")
        echo "✓ $1 is installed ($version)"
        return 0
    else
        echo "✗ $1 is NOT installed"
        return 1
    fi
}

# Function to check file
check_file() {
    if [ -f "$1" ]; then
        echo "✓ $2 is installed at $1"
        return 0
    else
        echo "✗ $2 is NOT installed"
        return 1
    fi
}

echo "Conda-installed tools:"
echo "----------------------"
check_cmd compleasm
check_cmd mafft
check_cmd trimal
check_cmd bmge
check_cmd clipkit
check_cmd iqtree || check_cmd iqtree2
check_cmd astral

echo ""
echo "Manually-installed Perl scripts:"
echo "---------------------------------"

# Check Aliscore
if command -v aliscore &> /dev/null; then
    echo "✓ Aliscore is available (wrapper: aliscore)"
    check_file "$CONDA_PREFIX/bin/Aliscore.02.2.pl" "  Aliscore.02.2.pl"
    check_file "$CONDA_PREFIX/bin/Aliscore_module.pm" "  Aliscore_module.pm"
elif [ -f "$CONDA_PREFIX/bin/Aliscore.02.2.pl" ]; then
    echo "✓ Aliscore.02.2.pl is installed (no wrapper)"
else
    echo "✗ Aliscore is NOT installed"
fi

# Check ALICUT
if command -v alicut &> /dev/null; then
    echo "✓ ALICUT is available (wrapper: alicut)"
    check_file "$CONDA_PREFIX/bin/ALICUT_V2.31.pl" "  ALICUT_V2.31.pl"
elif [ -f "$CONDA_PREFIX/bin/ALICUT_V2.31.pl" ]; then
    echo "✓ ALICUT_V2.31.pl is installed (no wrapper)"
else
    echo "✗ ALICUT is NOT installed"
fi

# Check FASconCAT-G
if command -v fasconcat &> /dev/null; then
    echo "✓ FASconCAT-G is available (wrapper: fasconcat)"
    check_file "$CONDA_PREFIX/bin/FASconCAT-G_v1.06.1.pl" "  FASconCAT-G_v1.06.1.pl"
elif [ -f "$CONDA_PREFIX/bin/FASconCAT-G_v1.06.1.pl" ]; then
    echo "✓ FASconCAT-G_v1.06.1.pl is installed (no wrapper)"
else
    echo "✗ FASconCAT-G is NOT installed"
fi

echo ""
echo "Alternative installations:"
echo "--------------------------"

# Check manually installed IQ-TREE
if [ -f "$CONDA_PREFIX/bin/iqtree2" ] && [ ! -L "$CONDA_PREFIX/bin/iqtree2" ]; then
    echo "✓ IQ-TREE2 binary (manually installed)"
    if [ -L "$CONDA_PREFIX/bin/iqtree" ]; then
        echo "  ✓ iqtree symlink present"
    fi
fi

# Check manually installed ASTRAL
if command -v astral-jar &> /dev/null; then
    echo "✓ ASTRAL JAR (manually installed, command: astral-jar)"
    check_file "$CONDA_PREFIX/share/astral/astral.5.7.8.jar" "  astral.5.7.8.jar"
fi

echo ""
echo "=============================================="
echo "Verification complete!"
echo "=============================================="

Save and run:

# Save the script
cat > verify_installations.sh <<'SCRIPT'
# ... paste the script above ...
SCRIPT

chmod +x verify_installations.sh

# Run it
./verify_installations.sh

Troubleshooting

Problem: Conda is slow

  • Solution: Use mamba instead: conda install -n base mamba -c conda-forge
  • Then use mamba instead of conda for all installations

Problem: Perl script can't find modules

# Install additional Perl dependencies
conda install -c bioconda perl-bioperl perl-file-copy-recursive

Problem: Java not found for ASTRAL

conda install -c conda-forge openjdk

Problem: Permission denied on HPC

# Install Miniforge in home directory
wget https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh
bash Miniforge3-Linux-x86_64.sh -b -p $HOME/miniforge3
source $HOME/miniforge3/etc/profile.d/conda.sh

Problem: Apple Silicon compatibility

# Force x86_64 architecture
CONDA_SUBDIR=osx-64 conda create -n phylogenomics ...
conda activate phylogenomics
conda config --env --set subdir osx-64

Docker Container Specification

If using Docker, here's a complete Dockerfile with all tools:

FROM mambaorg/micromamba:latest

LABEL maintainer="Bruno de Medeiros <Field Museum>"
LABEL description="Complete environment for BUSCO-based phylogenomics"

# Install all phylogenomics tools
RUN micromamba install -y -n base -c conda-forge -c bioconda \
    compleasm \
    busco \
    mafft \
    trimal \
    bmge \
    clipkit \
    iqtree \
    ncbi-datasets-cli \
    python=3.9 \
    biopython \
    perl \
    openjdk \
    wget \
    unzip \
    && micromamba clean --all --yes

# Set working directory
WORKDIR /data

# Set entrypoint
ENTRYPOINT ["/usr/local/bin/_entrypoint.sh"]
CMD ["/bin/bash"]

Build and run:

docker build -t phylogenomics:latest .
docker run -v $(pwd):/data -it phylogenomics:latest

Substitution Model Recommendation Detailed Guide

This section provides the detailed decision matrix and recommendation process for IQ-TREE substitution model selection.

Model Recommendation Matrix

For Nuclear Proteins (most phylogenomics workflows)

Deep Phylogeny (class+ level), Many Taxa (>50):

  1. LG+F+G4 or LG+F+R - Best general model, widely used
  2. WAG+F+G4 - Alternative general model
  3. LG4X or LG4M - Mixture models for heterogeneity
  4. Q.pfam+F+G4 - Database-derived, broad taxonomic sampling
  5. JTT+F+G4 - Classical alternative

Moderate Phylogeny (family/order level), Moderate Taxa (20-50):

  1. LG+F+G4 - Top choice for most analyses
  2. WAG+F+G4 - Reliable alternative
  3. JTT+F+G4 - Classical model
  4. Q.pfam+F+G4 - Database-derived option

Shallow Phylogeny (species/genus level), Few Taxa (<20):

  1. LG+F+G4 - Still recommended default
  2. WAG+F+G4 - Good alternative
  3. JTT+F+G4 - Classical option

Taxonomically-Targeted Models

When applicable, use these taxonomically-specific models:

  • Birds: Q.bird+F+G4
  • Mammals: Q.mammal+F+G4, mtMAM (if mtDNA)
  • Insects: Q.insect+F+G4, mtART (if mtDNA)
  • Plants: Q.plant+F+G4, cpREV (if chloroplast)
  • Yeasts/Fungi: Q.yeast+F+G4

Model Notation Explained

  • LG, WAG, JTT: Empirical exchange rate matrix name
  • +F: Use empirical amino acid frequencies from data (recommended)
  • +G4: Gamma model with 4 rate categories for among-site rate variation
  • +R: FreeRate model (alternative to Gamma, often better but slower)

Presenting Model Recommendations

Format recommendations like this:

### Recommended Substitution Models for Your Dataset

Based on your dataset ([NUMBER] taxa, [TAXONOMIC_SCOPE] phylogenetic breadth, [SEQUENCE_TYPE]):

**Primary Recommendations** (use all in model testing):

1. **LG+F+G4**
   - **Why**: Most widely used modern AA model, performs well across diverse datasets
   - **Citations**: Proven effective in numerous phylogenomic studies

2. **WAG+F+G4**
   - **Why**: Excellent general-purpose alternative, often comparable to LG
   - **Use case**: Good backup if LG shows poor fit

3. **JTT+F+G4**
   - **Why**: Classical model still widely used, allows comparison with older studies
   - **Use case**: Historical comparisons

[Add 2-3 more specific to their data...]

**For IQ-TREE Step 8A**, we'll use these models in the partition search:
```bash
-mset LG,WAG,JTT,Q.pfam  # Model set for testing
-m TESTMERGEONLY          # Test models and merge partitions

For Step 8C (gene trees), we'll use:

-m MFP  # Model Finder Plus (tests models from our set)

Would you like to:

  1. ✓ Use these recommended models (recommended)
  2. Specify a custom model set
  3. Let me fetch more information about your specific taxonomic group

---

## Ortholog Identification Implementation

This section provides detailed implementation scripts for Step 2 (compleasm).

### Threading Allocation Table

| Total Cores | First Genome | Subsequent Genomes | Concurrent Jobs | Threads/Job |
|-------------|--------------|-------------------|-----------------|-------------|
| 8           | 8 threads    | 8 threads (serial)| 1               | 8           |
| 16          | 16 threads   | 8 threads         | 2               | 8           |
| 32          | 32 threads   | 8 threads         | 4               | 8           |
| 64          | 64 threads   | 16 threads        | 4               | 16          |
| 128         | 128 threads  | 16-32 threads     | 4-8             | 16          |

### SLURM Implementation (Option A: Optimized Parallel)

**First genome job** (`run_compleasm_first.job`):
```bash
#!/bin/bash
#SBATCH --job-name=compleasm_first
#SBATCH --cpus-per-task=TOTAL_THREADS  # Replace with all available cores
#SBATCH --mem-per-cpu=6G
#SBATCH --time=24:00:00
#SBATCH --output=logs/compleasm_first.%j.out
#SBATCH --error=logs/compleasm_first.%j.err

source ~/.bashrc
conda activate phylo

mkdir -p logs

# Process first genome only (downloads lineage database)
first_genome=$(head -n 1 genome_list.txt)
genome_name=$(basename ${first_genome} .fasta)

echo "Processing first genome: ${genome_name}"
echo "This will download the BUSCO lineage database..."

compleasm run \
  -a ${first_genome} \
  -o ${genome_name}_compleasm \
  -l LINEAGE \
  -t ${SLURM_CPUS_PER_TASK}

echo "First genome complete. Database downloaded."
echo "Ready to process remaining genomes in parallel."

Parallel genomes job (run_compleasm_parallel.job):

#!/bin/bash
#SBATCH --job-name=compleasm_parallel
#SBATCH --array=2-NUM_GENOMES  # Replace NUM_GENOMES with count
#SBATCH --cpus-per-task=THREADS_PER_JOB  # Replace based on table above
#SBATCH --mem-per-cpu=6G
#SBATCH --time=48:00:00
#SBATCH --output=logs/compleasm.%A_%a.out
#SBATCH --error=logs/compleasm.%A_%a.err

source ~/.bashrc
conda activate phylo

# Get genome for this array task
genome=$(sed -n "${SLURM_ARRAY_TASK_ID}p" genome_list.txt)
genome_name=$(basename ${genome} .fasta)

echo "Processing genome ${SLURM_ARRAY_TASK_ID}: ${genome_name}"

compleasm run \
  -a ${genome} \
  -o ${genome_name}_compleasm \
  -l LINEAGE \
  -t ${SLURM_CPUS_PER_TASK}

echo "Completed: ${genome_name}"

PBS Implementation (Option A: Optimized Parallel)

Similar structure to SLURM, but with PBS directives. See templates in templates/pbs/.

Local Implementation (Option A: Optimized Parallel)

First genome script (run_compleasm_first.sh):

#!/bin/bash
source ~/.bashrc
conda activate phylo

TOTAL_THREADS=TOTAL_THREADS  # Replace with total cores to use

mkdir -p logs

# Process first genome only
first_genome=$(head -n 1 genome_list.txt)
genome_name=$(basename ${first_genome} .fasta)

echo "Processing first genome: ${genome_name}"
echo "Using ${TOTAL_THREADS} threads"
echo "This will download the BUSCO lineage database..."

compleasm run \
  -a ${first_genome} \
  -o ${genome_name}_compleasm \
  -l LINEAGE \
  -t ${TOTAL_THREADS} 2>&1 | tee logs/compleasm_first.log

echo "First genome complete!"

Parallel genomes script (run_compleasm_parallel.sh):

#!/bin/bash
source ~/.bashrc
conda activate phylo

TOTAL_THREADS=TOTAL_THREADS      # Replace with total cores available
THREADS_PER_JOB=THREADS_PER_JOB  # Replace based on table above
CONCURRENT_JOBS=$((TOTAL_THREADS / THREADS_PER_JOB))

echo "Parallel compleasm processing"
echo "Total threads: ${TOTAL_THREADS}"
echo "Threads per genome: ${THREADS_PER_JOB}"
echo "Concurrent genomes: ${CONCURRENT_JOBS}"
echo ""

mkdir -p logs

# Process genomes 2-end in parallel using GNU parallel
tail -n +2 genome_list.txt | parallel -j ${CONCURRENT_JOBS} '
  genome_name=$(basename {} .fasta)
  echo "Processing: ${genome_name}"

  compleasm run \
    -a {} \
    -o ${genome_name}_compleasm \
    -l LINEAGE \
    -t THREADS_PER_JOB 2>&1 | tee logs/compleasm_${genome_name}.log

  echo "Completed: ${genome_name}"
'

echo ""
echo "All genomes processed!"

Simple Serial Implementation (All Platforms)

For users who prefer simplicity over optimization:

SLURM (run_compleasm_serial.job):

#!/bin/bash
#SBATCH --job-name=compleasm_serial
#SBATCH --cpus-per-task=THREADS
#SBATCH --mem-per-cpu=6G
#SBATCH --time=72:00:00
#SBATCH --output=logs/compleasm.%j.out

source ~/.bashrc
conda activate phylo

mkdir -p logs

while read genome; do
  genome_name=$(basename ${genome} .fasta)
  echo "Processing ${genome_name}..."

  compleasm run \
    -a ${genome} \
    -o ${genome_name}_compleasm \
    -l LINEAGE \
    -t ${SLURM_CPUS_PER_TASK}
done < genome_list.txt

echo "All genomes processed!"

Alignment Implementation

This section provides detailed implementation scripts for Step 5 (MAFFT alignment).

SLURM Array Job

#!/bin/bash
#SBATCH --job-name=mafft_array
#SBATCH --array=1-NUM_LOCI  # Replace with actual number
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=24:00:00
#SBATCH --output=logs/%A_%a.mafft.out
#SBATCH --error=logs/%A_%a.mafft.err

source ~/.bashrc
conda activate phylo

cd single_copy_orthologs/unaligned_aa
mkdir -p ../aligned_aa

locus=$(sed -n "${SLURM_ARRAY_TASK_ID}p" locus_names.txt)
output=$(basename ${locus} .fas)_aligned.fas

echo "Aligning: ${locus}"

mafft-linsi ${locus} > ../aligned_aa/${output}

echo "Completed: ${output}"

PBS Array Job

#!/bin/bash
#PBS -N mafft_array
#PBS -t 1-NUM_LOCI
#PBS -l nodes=1:ppn=1
#PBS -l mem=4gb
#PBS -l walltime=24:00:00

cd $PBS_O_WORKDIR/single_copy_orthologs/unaligned_aa
source ~/.bashrc
conda activate phylo

mkdir -p ../aligned_aa

locus=$(sed -n "${PBS_ARRAYID}p" locus_names.txt)
output=$(basename ${locus} .fas)_aligned.fas

mafft-linsi ${locus} > ../aligned_aa/${output}

Local Sequential

#!/bin/bash
source ~/.bashrc
conda activate phylo

cd single_copy_orthologs/unaligned_aa
mkdir -p ../aligned_aa

while read locus; do
  output=$(basename ${locus} .fas)_aligned.fas
  echo "Aligning ${locus}..."

  mafft-linsi ${locus} > ../aligned_aa/${output}
done < locus_names.txt

echo "All alignments complete!"

Local Parallel (GNU parallel)

#!/bin/bash
source ~/.bashrc
conda activate phylo

cd single_copy_orthologs/unaligned_aa
mkdir -p ../aligned_aa

CONCURRENT_JOBS=4  # Adjust based on available cores

cat locus_names.txt | parallel -j ${CONCURRENT_JOBS} '
  output=$(basename {} .fas)_aligned.fas
  echo "Aligning: {}"
  mafft-linsi {} > ../aligned_aa/${output}
  echo "Completed: ${output}"
'

echo "All alignments complete!"

Alignment Trimming Implementation

This section provides detailed implementation scripts for Step 6 (alignment trimming) using all supported methods.

trimAl Implementation

SLURM Array Job:

#!/bin/bash
#SBATCH --job-name=trimal_array
#SBATCH --array=1-NUM_LOCI
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=2:00:00
#SBATCH --output=logs/%A_%a.trimal.out

source ~/.bashrc
conda activate phylo

cd aligned_aa
mkdir -p ../trimmed_aa
ls *.fas > aligned_loci.txt

locus=$(sed -n "${SLURM_ARRAY_TASK_ID}p" aligned_loci.txt)
output=$(basename ${locus} _aligned.fas)_trimmed.fas

trimal -in ${locus} -out ../trimmed_aa/${output} -automated1

PBS Array Job:

#!/bin/bash
#PBS -N trimal_array
#PBS -t 1-NUM_LOCI
#PBS -l nodes=1:ppn=1
#PBS -l mem=2gb
#PBS -l walltime=2:00:00

cd $PBS_O_WORKDIR/aligned_aa
source ~/.bashrc
conda activate phylo

mkdir -p ../trimmed_aa
ls *.fas > aligned_loci.txt

locus=$(sed -n "${PBS_ARRAYID}p" aligned_loci.txt)
output=$(basename ${locus} _aligned.fas)_trimmed.fas

trimal -in ${locus} -out ../trimmed_aa/${output} -automated1

Local with GNU parallel:

#!/bin/bash
source ~/.bashrc
conda activate phylo

cd aligned_aa
mkdir -p ../trimmed_aa

cat aligned_loci.txt | parallel -j 4 '
  output=$(basename {} _aligned.fas)_trimmed.fas
  trimal -in {} -out ../trimmed_aa/${output} -automated1
  echo "Trimmed: ${output}"
'

ClipKit Implementation

SLURM Array Job:

#!/bin/bash
#SBATCH --job-name=clipkit_array
#SBATCH --array=1-NUM_LOCI
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=2:00:00
#SBATCH --output=logs/%A_%a.clipkit.out

source ~/.bashrc
conda activate phylo

cd aligned_aa
mkdir -p ../trimmed_aa

locus=$(sed -n "${SLURM_ARRAY_TASK_ID}p" aligned_loci.txt)
output=$(basename ${locus} _aligned.fas)_trimmed.fas

clipkit ${locus} -o ../trimmed_aa/${output}

Local with GNU parallel:

#!/bin/bash
source ~/.bashrc
conda activate phylo

cd aligned_aa
mkdir -p ../trimmed_aa

cat aligned_loci.txt | parallel -j 4 '
  output=$(basename {} _aligned.fas)_trimmed.fas
  clipkit {} -o ../trimmed_aa/${output}
'

BMGE Implementation

SLURM Array Job:

#!/bin/bash
#SBATCH --job-name=bmge_array
#SBATCH --array=1-NUM_LOCI
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=2G
#SBATCH --time=2:00:00
#SBATCH --output=logs/%A_%a.bmge.out

source ~/.bashrc
conda activate phylo

cd aligned_aa
mkdir -p ../trimmed_aa

locus=$(sed -n "${SLURM_ARRAY_TASK_ID}p" aligned_loci.txt)
output=$(basename ${locus} _aligned.fas)_trimmed.fas

bmge -i ${locus} -t AA -o ../trimmed_aa/${output}

Local with GNU parallel:

#!/bin/bash
source ~/.bashrc
conda activate phylo

cd aligned_aa
mkdir -p ../trimmed_aa

cat aligned_loci.txt | parallel -j 4 '
  output=$(basename {} _aligned.fas)_trimmed.fas
  bmge -i {} -t AA -o ../trimmed_aa/${output}
'

Partition Model Selection Implementation

This section provides detailed implementation scripts for Step 8A (IQ-TREE partition model selection).

SLURM Implementation

#!/bin/bash
#SBATCH --job-name=partition_search
#SBATCH --cpus-per-task=18
#SBATCH --mem-per-cpu=4G
#SBATCH --time=72:00:00
#SBATCH --output=logs/partition_search.%j.out
#SBATCH --error=logs/partition_search.%j.err

source ~/.bashrc
conda activate phylo

cd trimmed_aa

echo "Starting partition model selection..."
echo "Using model set: MODEL_SET"
echo "Threads: ${SLURM_CPUS_PER_TASK}"

iqtree \
  -s FcC_supermatrix.fas \
  -spp partition_def.txt \
  -m TESTMERGEONLY \
  -mset MODEL_SET \
  -msub nuclear \
  -rcluster 10 \
  -bb 1000 \
  -alrt 1000 \
  -nt ${SLURM_CPUS_PER_TASK} \
  -safe \
  -pre partition_search

echo "Partition search complete!"
echo "Best scheme: partition_search.best_scheme.nex"

PBS Implementation

#!/bin/bash
#PBS -N partition_search
#PBS -l nodes=1:ppn=18
#PBS -l mem=72gb
#PBS -l walltime=72:00:00

cd $PBS_O_WORKDIR/trimmed_aa
source ~/.bashrc
conda activate phylo

iqtree \
  -s FcC_supermatrix.fas \
  -spp partition_def.txt \
  -m TESTMERGEONLY \
  -mset MODEL_SET \
  -msub nuclear \
  -rcluster 10 \
  -bb 1000 \
  -alrt 1000 \
  -nt 18 \
  -safe \
  -pre partition_search

Local Implementation

#!/bin/bash
source ~/.bashrc
conda activate phylo

cd trimmed_aa

THREADS=18  # Adjust based on available cores

echo "Starting partition model selection..."
echo "Using ${THREADS} threads"

iqtree \
  -s FcC_supermatrix.fas \
  -spp partition_def.txt \
  -m TESTMERGEONLY \
  -mset MODEL_SET \
  -msub nuclear \
  -rcluster 10 \
  -bb 1000 \
  -alrt 1000 \
  -nt ${THREADS} \
  -safe \
  -pre partition_search

echo "Partition search complete!"

Parameter Explanations

  • -m TESTMERGEONLY: Test models and merge partitions with same best model
  • -mset MODEL_SET: Test only specified models (e.g., "LG,WAG,JTT,Q.pfam")
  • -msub nuclear: Rate heterogeneity appropriate for nuclear genes
  • -rcluster 10: Merge similar partitions (10% relaxed clustering)
  • -bb 1000: 1000 ultrafast bootstrap replicates
  • -alrt 1000: 1000 SH-aLRT replicates (additional branch support)
  • -safe: Safe numerical mode (slower but more stable)

Gene Trees Implementation

This section provides detailed implementation scripts for Step 8C (individual gene tree estimation).

SLURM Array Job

#!/bin/bash
#SBATCH --job-name=gene_trees
#SBATCH --array=1-NUM_LOCI  # Replace with loci count
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=2:00:00
#SBATCH --output=logs/%A_%a.gene_tree.out
#SBATCH --error=logs/%A_%a.gene_tree.err

source ~/.bashrc
conda activate phylo

cd trimmed_aa
ls *_trimmed.fas > locus_list.txt

locus=$(sed -n "${SLURM_ARRAY_TASK_ID}p" locus_list.txt)
prefix=$(basename ${locus} .fas)

echo "Estimating gene tree: ${locus}"

iqtree \
  -s ${locus} \
  -m MFP \
  -bb 1000 \
  -bnni \
  -czb \
  -nt 1 \
  -pre ${prefix}

echo "Completed: ${prefix}.treefile"

PBS Array Job

#!/bin/bash
#PBS -N gene_trees
#PBS -t 1-NUM_LOCI
#PBS -l nodes=1:ppn=1
#PBS -l mem=4gb
#PBS -l walltime=2:00:00

cd $PBS_O_WORKDIR/trimmed_aa
source ~/.bashrc
conda activate phylo

ls *_trimmed.fas > locus_list.txt

locus=$(sed -n "${PBS_ARRAYID}p" locus_list.txt)
prefix=$(basename ${locus} .fas)

iqtree \
  -s ${locus} \
  -m MFP \
  -bb 1000 \
  -bnni \
  -czb \
  -nt 1 \
  -pre ${prefix}

Local Parallel (GNU parallel)

#!/bin/bash
source ~/.bashrc
conda activate phylo

cd trimmed_aa
ls *_trimmed.fas > locus_list.txt

CONCURRENT_JOBS=4  # Adjust based on available cores

echo "Estimating gene trees in parallel..."
echo "Concurrent jobs: ${CONCURRENT_JOBS}"

cat locus_list.txt | parallel -j ${CONCURRENT_JOBS} '
  prefix=$(basename {} .fas)
  echo "Processing: {}"

  iqtree \
    -s {} \
    -m MFP \
    -bb 1000 \
    -bnni \
    -czb \
    -nt 1 \
    -pre ${prefix} > ${prefix}.log 2>&1

  echo "Completed: ${prefix}.treefile"
'

echo "All gene trees estimated!"

Local Serial

#!/bin/bash
source ~/.bashrc
conda activate phylo

cd trimmed_aa

echo "Estimating gene trees (serial processing)..."

for locus in *_trimmed.fas; do
  prefix=$(basename ${locus} .fas)
  echo "Processing: ${locus}"

  iqtree \
    -s ${locus} \
    -m MFP \
    -bb 1000 \
    -bnni \
    -czb \
    -nt 1 \
    -pre ${prefix}

  echo "Completed: ${prefix}.treefile"
done

echo "All gene trees estimated!"

Parameter Explanations

  • -m MFP: Model Finder Plus (automatically select best model)
  • -bb 1000: 1000 ultrafast bootstrap replicates
  • -bnni: Reduce NNI iterations to avoid overestimating bootstrap support
  • -czb: Collapse zero-length branches in final tree
  • -nt 1: Single thread per gene (parallelize across genes, not within)

Methods Paragraph Template

This section provides the complete methods paragraph template for publications.

Use this template to generate a METHODS_PARAGRAPH.md file for users. Customize based on their workflow choices.

# Methods Paragraph for Publication

## Phylogenomic Analysis

[Copy and customize the text below for your manuscript]

---

### Ortholog Identification and Quality Control

We identified single-copy orthologs from [NUMBER] genome assemblies using compleasm v[VERSION] (Huang & Li, 2023) with the [LINEAGE_NAME] BUSCO lineage dataset (v[VERSION]). Genomes with completeness scores below [THRESHOLD]% were excluded from downstream analyses. From the retained high-quality genomes, we extracted [NUMBER] single-copy orthologs present in all species.

### Multiple Sequence Alignment and Trimming

Each orthologous gene set was aligned using MAFFT v7 (Katoh & Standley, 2013) with the L-INS-i algorithm for accurate alignment of conserved protein sequences. Aligned sequences were then trimmed to remove ambiguously aligned regions using [TRIMMING_METHOD]:

- **Aliscore/ALICUT**: We used Aliscore v2.2 and ALICUT v2.31 (Kück et al., 2010) to identify and remove randomly similar sequence (RSS) sections. Aliscore identified RSS positions using Monte Carlo resampling with default parameters (window size = 4, treating gaps as ambiguous characters with -N option), and ALICUT removed these positions from the alignments.

- **trimAl**: We employed trimAl v1.4 (Capella-Gutiérrez et al., 2009) with the -automated1 heuristic method to automatically optimize gap threshold selection.

- **BMGE**: We used BMGE v1.12 (Criscuolo & Gribaldo, 2010) with entropy-based trimming for amino acid sequences (option -t AA).

- **ClipKit**: We applied ClipKit v1.3 (Steenwyk et al., 2020) with the default smart-gap mode for phylogenetically informative position selection.

After trimming, alignments containing fewer than [MIN_LENGTH] informative positions were excluded, resulting in [FINAL_NUMBER] high-quality gene alignments.

### Phylogenetic Inference

#### Concatenated Analysis

Trimmed alignments were concatenated into a supermatrix using FASconCAT-G v1.06.1 (Kück & Longo, 2014), yielding a final alignment of [TOTAL_LENGTH] amino acid positions across [NUMBER] partitions. We performed partitioned maximum likelihood (ML) phylogenetic inference using IQ-TREE v2.3 (Minh et al., 2020). The best-fit partitioning scheme and substitution models were selected using ModelFinder (Kalyaanamoorthy et al., 2017) with the TESTMERGEONLY option and [MODEL_SET] model set. Partitions were merged if they shared the same evolutionary model to reduce model complexity. The final tree was inferred using the selected partition scheme, with branch support assessed using 1,000 ultrafast bootstrap replicates (Hoang et al., 2018). To improve accuracy, we used the -bnni option to reduce potential overestimation of bootstrap support.

#### Coalescent-Based Species Tree

To account for incomplete lineage sorting, we also inferred a species tree using the multispecies coalescent model. Individual gene trees were estimated for each of the [NUMBER] alignments using IQ-TREE v2.3 with automatic model selection and 1,000 ultrafast bootstrap replicates. To improve accuracy, we used the -bnni option to reduce potential overestimation of bootstrap support and -czb to collapse zero-length branches. The resulting gene trees were summarized into a species tree using ASTRAL-III v5.7.8 (Zhang et al., 2018), which estimates the species tree topology that agrees with the largest number of quartet trees induced by the gene trees. Branch support was quantified using local posterior probabilities.

### Software and Reproducibility

All analyses were conducted using conda environments (conda v[VERSION]) to ensure reproducibility. Analysis scripts and detailed workflow documentation are available at [GITHUB_URL or supplementary materials].

---

## Complete Reference List

Capella-Gutiérrez, S., Silla-Martínez, J. M., & Gabaldón, T. (2009). trimAl: a tool for automated alignment trimming in large-scale phylogenetic analyses. *Bioinformatics*, 25(15), 1972-1973. https://doi.org/10.1093/bioinformatics/btp348

Criscuolo, A., & Gribaldo, S. (2010). BMGE (Block Mapping and Gathering with Entropy): a new software for selection of phylogenetic informative regions from multiple sequence alignments. *BMC Evolutionary Biology*, 10(1), 210. https://doi.org/10.1186/1471-2148-10-210

Hoang, D. T., Chernomor, O., von Haeseler, A., Minh, B. Q., & Vinh, L. S. (2018). UFBoot2: improving the ultrafast bootstrap approximation. *Molecular Biology and Evolution*, 35(2), 518-522. https://doi.org/10.1093/molbev/msx281

Huang, N., & Li, H. (2023). compleasm: a faster and more accurate reimplementation of BUSCO. *Bioinformatics*, 39(10), btad595. https://doi.org/10.1093/bioinformatics/btad595

Kalyaanamoorthy, S., Minh, B. Q., Wong, T. K., von Haeseler, A., & Jermiin, L. S. (2017). ModelFinder: fast model selection for accurate phylogenetic estimates. *Nature Methods*, 14(6), 587-589. https://doi.org/10.1038/nmeth.4285

Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7: improvements in performance and usability. *Molecular Biology and Evolution*, 30(4), 772-780. https://doi.org/10.1093/molbev/mst010

Kück, P., & Longo, G. C. (2014). FASconCAT-G: extensive functions for multiple sequence alignment preparations concerning phylogenetic studies. *Frontiers in Zoology*, 11(1), 81. https://doi.org/10.1186/s12983-014-0081-x

Kück, P., Meusemann, K., Dambach, J., Thormann, B., von Reumont, B. M., Wägele, J. W., & Misof, B. (2010). Parametric and non-parametric masking of randomness in sequence alignments can be improved and leads to better resolved trees. *Frontiers in Zoology*, 7(1), 10. https://doi.org/10.1186/1742-9994-7-10

Minh, B. Q., Schmidt, H. A., Chernomor, O., Schrempf, D., Woodhams, M. D., von Haeseler, A., & Lanfear, R. (2020). IQ-TREE 2: new models and efficient methods for phylogenetic inference in the genomic era. *Molecular Biology and Evolution*, 37(5), 1530-1534. https://doi.org/10.1093/molbev/msaa015

Steenwyk, J. L., Buida III, T. J., Li, Y., Shen, X. X., & Rokas, A. (2020). ClipKIT: a multiple sequence alignment trimming software for accurate phylogenomic inference. *PLOS Biology*, 18(12), e3001007. https://doi.org/10.1371/journal.pbio.3001007

Zhang, C., Rabiee, M., Sayyari, E., & Mirarab, S. (2018). ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. *BMC Bioinformatics*, 19(6), 153. https://doi.org/10.1186/s12859-018-2129-y

---

## Instructions for Use

1. **Replace placeholders in brackets** with your actual values:
   - `[NUMBER]`, `[VERSION]`, `[LINEAGE_NAME]`, `[THRESHOLD]`, `[MIN_LENGTH]`, etc.

2. **Remove sections for tools you didn't use**:
   - Delete the trimming method descriptions you didn't use
   - If you only did concatenated OR coalescent analysis, remove the other section

3. **Adjust detail level** based on your target journal:
   - Combine into shorter paragraph for journals with strict word limits
   - Expand with more parameter details for bioinformatics journals

4. **Add to your manuscript**:
   - This goes in your Materials and Methods section
   - Add all references to your bibliography

5. **Update version numbers**:
   - Check actual versions used: `conda list` in your phylo environment
   - Include versions in your methods for reproducibility

This reference guide complements the main BUSCO phylogenomics skill and provides detailed technical specifications for implementation.