Initial commit

Zhongwei Li
2025-11-29 18:02:37 +08:00
commit c1d9dee646
38 changed files with 11210 additions and 0 deletions


@@ -0,0 +1,12 @@
# Exclude development materials from skill packaging
info_to_craft_skill/
# Exclude GitHub documentation (not needed in skill package)
README.md
# Exclude local settings
.claude/
# Exclude git files
.git/
.gitignore


@@ -0,0 +1,99 @@
# BUSCO-based Phylogenomics Skill
A Claude Code skill for phylogenomic analyses, created by Bruno de Medeiros (Field Museum) based on code initially written by Paul Frandsen (Brigham Young University).
It generates a complete phylogenetic workflow from genome assemblies using BUSCO/compleasm-based single-copy orthologs.
**Features:**
- Supports local genome files and NCBI accessions (BioProjects/Assemblies)
- Generates scheduler-specific scripts (SLURM, PBS, cloud, local)
- Uses modern tools (compleasm, MAFFT, IQ-TREE, ASTRAL)
- Multiple alignment trimming options
- Both concatenation and coalescent approaches
- Quality control with recommendations
- Writes a draft methods paragraph describing the pipeline for publications
**Use when you need to:**
- Build phylogenetic trees from multiple genome assemblies
- Extract and align single-copy orthologs across genomes
- Download genomes from NCBI by accession
- Generate ready-to-run scripts for your computing environment
## Installation
See the README in the repository root folder for plugin installation.
## Usage
Once installed, simply describe your phylogenomics task:
```
I need to generate a phylogeny from 20 genome assemblies on a SLURM cluster
```
Claude Code will automatically activate the appropriate skill and guide you through the workflow.
## Workflow Overview
The complete phylogenomics pipeline:
1. **Input Preparation** - Download NCBI genomes if needed
2. **Ortholog Identification** - Run compleasm/BUSCO on all genomes
3. **Quality Control** - Assess genome completeness with recommendations
4. **Ortholog Extraction** - Generate per-locus unaligned FASTA files
5. **Alignment** - Align orthologs with MAFFT
6. **Trimming** - Remove poorly aligned regions (Aliscore/ALICUT, trimAl, BMGE, ClipKit)
7. **Concatenation** - Build supermatrix with partition scheme
8. **Phylogenetic Inference** - Generate ML concatenated tree (IQ-TREE), gene trees, and coalescent species tree (ASTRAL)
## Requirements
Claude Code works better than the web interface for this skill, since Claude can then help you install all requirements.
The skill generates scripts that install and use:
- **compleasm** or BUSCO - ortholog detection
- **MAFFT** - multiple sequence alignment
- **Aliscore/ALICUT, trimAl, BMGE, or ClipKit** - alignment trimming
- **FASconCAT** - alignment concatenation
- **IQ-TREE** - maximum likelihood phylogenetic inference
- **ASTRAL** - coalescent species tree estimation
- **NCBI Datasets CLI** - genome download (if using NCBI accessions)
## Computing Environments
The skill supports multiple computing environments:
- **SLURM clusters** - generates SBATCH array jobs
- **PBS/Torque clusters** - generates PBS array jobs
- **Local machines** - sequential execution scripts
## Attribution
Created by **Bruno de Medeiros** (Curator of Pollinating Insects, Field Museum) based on phylogenomics tutorials by **Paul Frandsen** (Brigham Young University).
## Citation
If you use this skill for published research, please cite this repository and also:
- **compleasm**: Huang, N., & Li, H. (2023). compleasm: a faster and more accurate reimplementation of BUSCO. *Bioinformatics*, 39(10), btad595.
- **MAFFT**: Katoh, K., & Standley, D. M. (2013). MAFFT multiple sequence alignment software version 7. *Molecular Biology and Evolution*, 30(4), 772-780.
- **IQ-TREE**: Minh, B. Q., et al. (2020). IQ-TREE 2: New models and efficient methods for phylogenetic inference. *Molecular Biology and Evolution*, 37(5), 1530-1534.
- **ASTRAL**: Zhang, C., et al. (2018). ASTRAL-III: polynomial time species tree reconstruction. *BMC Bioinformatics*, 19(6), 153.
Plus any trimming tool you use (Aliscore/ALICUT, trimAl, BMGE, or ClipKit).
## License
MIT License - see individual tool licenses for software dependencies.
## Support
For issues or questions:
- Open an issue in this repository
- Contact Bruno de Medeiros at the Field Museum (bdemedeiros@fieldmuseum.org)
## Acknowledgments
Special thanks to Paul Frandsen (BYU) for creating the excellent phylogenomics tutorials that form the foundation of this skill.


@@ -0,0 +1,757 @@
---
name: busco-phylogeny
description: Generate phylogenies from genome assemblies using BUSCO/compleasm-based single-copy orthologs with scheduler-aware workflow generation
---
# BUSCO-based Phylogenomics Workflow Generator
This skill provides phylogenomics expertise for generating comprehensive, scheduler-aware workflows for phylogenetic inference from genome assemblies using single-copy orthologs.
## Purpose
This skill helps users generate phylogenies from genome assemblies by:
1. Handling mixed input (local files and NCBI accessions)
2. Creating scheduler-specific scripts (SLURM, PBS, cloud, local)
3. Setting up complete workflows from raw genomes to final trees
4. Providing quality control and recommendations
5. Supporting flexible software management (bioconda, Docker, custom)
## Available Resources
The skill provides access to these bundled resources:
### Scripts (`scripts/`)
- **`query_ncbi_assemblies.py`** - Query NCBI for available genome assemblies by taxon name (new!)
- **`download_ncbi_genomes.py`** - Download genomes from NCBI using BioProjects or Assembly accessions
- **`rename_genomes.py`** - Rename genome files with meaningful sample names (important!)
- **`generate_qc_report.sh`** - Generate quality control reports from compleasm results
- **`extract_orthologs.sh`** - Extract and reorganize single-copy orthologs
- **`run_aliscore.sh`** - Wrapper for Aliscore to identify randomly similar sequences (RSS)
- **`run_alicut.sh`** - Wrapper for ALICUT to remove RSS positions from alignments
- **`run_aliscore_alicut_batch.sh`** - Batch process all alignments through Aliscore + ALICUT
- **`convert_fasconcat_to_partition.py`** - Convert FASconCAT output to IQ-TREE partition format
- **`predownloaded_aliscore_alicut/`** - Pre-tested Aliscore and ALICUT Perl scripts
### Templates (`templates/`)
- **`slurm/`** - SLURM job scheduler templates
- **`pbs/`** - PBS/Torque job scheduler templates
- **`local/`** - Local machine templates (with GNU parallel)
- **`README.md`** - Complete template documentation
### References (`references/`)
- **`REFERENCE.md`** - Detailed technical reference including:
- Sample naming best practices
- BUSCO lineage datasets (complete list)
- Resource recommendations (memory, CPUs, walltime)
- Detailed step-by-step implementation guides
- Quality control guidelines
- Aliscore/ALICUT detailed guide
- Tool citations and download links
- Software installation guide
- Common issues and troubleshooting
## Workflow Overview
The complete phylogenomics pipeline follows this sequence:
**Input Preparation** → **Ortholog Identification** → **Quality Control** → **Ortholog Extraction** → **Alignment** → **Trimming** → **Concatenation** → **Phylogenetic Inference**
## Initial User Questions
When a user requests phylogeny generation, gather the following information systematically:
### Step 1: Detect Computing Environment
Before asking questions, attempt to detect the local computing environment:
```bash
# Check for job schedulers
command -v sbatch >/dev/null 2>&1 # SLURM
command -v qsub >/dev/null 2>&1 # PBS/Torque
command -v parallel >/dev/null 2>&1 # GNU parallel
```
Report findings to the user, then confirm: **"I detected [X] on this machine. Will you be running the scripts here or on a different system?"**
### Required Information
Ask these questions to gather essential workflow parameters:
1. **Computing Environment**
- Where will these scripts run? (SLURM cluster, PBS/Torque cluster, Cloud computing, Local machine)
2. **Input Data**
- Local genome files, NCBI accessions, or both?
- If NCBI: Do you already have Assembly accessions (GCA_*/GCF_*) or BioProject accessions (PRJNA*/PRJEB*/PRJDA*)?
- If user doesn't have accessions: Offer to help find assemblies using `query_ncbi_assemblies.py` (see "STEP 0A: Query NCBI for Assemblies" below)
- If local files: What are the file paths?
3. **Taxonomic Scope & Dataset Details**
- What taxonomic group? (determines BUSCO lineage dataset)
- How many taxa/genomes will be analyzed?
- What is the approximate phylogenetic breadth? (species-level, genus-level, family-level, order-level, etc.)
- See `references/REFERENCE.md` for complete lineage list
4. **Environment Management**
- Use unified conda environment (default, recommended), or separate environments per tool?
5. **Resource Constraints**
- How many CPU cores/threads to use in total? (Ask user to specify, do not auto-detect)
- Available memory (RAM) per node/machine?
- Maximum walltime for jobs?
- See `references/REFERENCE.md` for resource recommendations
6. **Parallelization Strategy**
Ask the user how they want to handle parallel processing:
- **For job schedulers (SLURM/PBS)**:
- Use array jobs for parallel steps? (Recommended: Yes)
- Which steps to parallelize? (Steps 2, 5, 6, 8C recommended)
- **For local machines**:
- Use GNU parallel for parallel steps? (requires `parallel` installed)
- How many concurrent jobs?
- **For all systems**:
- Optimize for maximum throughput or simplicity?
7. **Scheduler-Specific Configuration** (if using SLURM or PBS)
- Account/Username for compute time charges
- Partition/Queue to submit jobs to
- Email notifications? (address and when: START, END, FAIL, ALL)
- Job dependencies? (Recommended: Yes for linear workflow)
- Output log directory? (Default: `logs/`)
8. **Alignment Trimming Preference**
- Aliscore/ALICUT (traditional, thorough), trimAl (fast), BMGE (entropy-based), or ClipKit (modern)?
9. **Substitution Model Selection** (for IQ-TREE phylogenetic inference)
**Context needed**: Taxonomic breadth, number of taxa, evolutionary rates
**Action**: Fetch IQ-TREE model documentation and suggest appropriate amino acid substitution models based on dataset characteristics.
Use the substitution model recommendation system (see "Substitution Model Recommendation" section below).
10. **Educational Goals**
- Are you learning bioinformatics and would you like comprehensive explanations of each workflow step?
- If yes: After completing each major workflow stage, offer to explain what the step accomplishes, why certain choices were made, and what best practices are being followed.
- Store this preference to use throughout the workflow.
---
## Recommended Directory Structure
Organize analyses with dedicated folders for each pipeline step:
```
project_name/
├── logs/ # All log files
├── 00_genomes/ # Input genome assemblies
├── 01_busco_results/ # BUSCO/compleasm outputs
├── 02_qc/ # Quality control reports
├── 03_extracted_orthologs/ # Extracted single-copy orthologs
├── 04_alignments/ # Multiple sequence alignments
├── 05_trimmed/ # Trimmed alignments
├── 06_concatenation/ # Supermatrix and partition files
├── 07_partition_search/ # Partition model selection
├── 08_concatenated_tree/ # Concatenated ML tree
├── 09_gene_trees/ # Individual gene trees
├── 10_species_tree/ # ASTRAL species tree
└── scripts/ # All analysis scripts
```
**Benefits**: Easy debugging, clear workflow progression, reproducibility, prevents root directory clutter.
---
## Template System
This skill uses a template-based system to reduce token usage and improve maintainability. Script templates are stored in the `templates/` directory and organized by computing environment.
### How to Use Templates
When generating scripts for users:
1. **Read the appropriate template** for their computing environment:
```
Read("templates/slurm/02_compleasm_first.job")
```
2. **Replace placeholders** with user-specific values:
- `TOTAL_THREADS` → e.g., `64`
- `THREADS_PER_JOB` → e.g., `16`
- `NUM_GENOMES` → e.g., `20`
- `NUM_LOCI` → e.g., `2795`
- `LINEAGE` → e.g., `insecta_odb10`
- `MODEL_SET` → e.g., `LG,WAG,JTT,Q.pfam`
3. **Present the customized script** to the user with setup instructions
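For illustration, the placeholder replacement can be done with a single `sed` pass (a minimal sketch; the values are the examples above and the output path is arbitrary):
```bash
# Sketch: fill template placeholders with user-specific values
sed -e 's/TOTAL_THREADS/64/g' \
    -e 's/THREADS_PER_JOB/16/g' \
    -e 's/NUM_GENOMES/20/g' \
    -e 's/LINEAGE/insecta_odb10/g' \
    templates/slurm/02_compleasm_first.job > scripts/02_compleasm_first.job
```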
### Available Templates
Key templates by workflow step:
- **Step 0 (setup)**: Environment setup script in `references/REFERENCE.md`
- **Step 2 (compleasm)**: `02_compleasm_first`, `02_compleasm_parallel`
- **Step 8A (partition search)**: `08a_partition_search`
- **Step 8C (gene trees)**: `08c_gene_trees_array`, `08c_gene_trees_parallel`, `08c_gene_trees_serial`
See `templates/README.md` for complete template documentation.
---
## Substitution Model Recommendation
When asked about substitution model selection (Question 9), use this systematic approach:
### Step 1: Fetch IQ-TREE Documentation
Use WebFetch to retrieve current model information:
```
WebFetch(url="https://iqtree.github.io/doc/Substitution-Models",
prompt="Extract all amino acid substitution models with descriptions and usage guidelines")
```
### Step 2: Analyze Dataset Characteristics
Consider these factors from user responses:
- **Taxonomic Scope**: Species/genus (shallow) vs. family/order (moderate) vs. class/phylum+ (deep)
- **Number of Taxa**: <20 (small), 20-50 (medium), >50 (large)
- **Evolutionary Rates**: Fast-evolving, moderate, or slow-evolving
- **Sequence Type**: Nuclear proteins, mitochondrial, or chloroplast
### Step 3: Recommend Models
Provide 3-5 appropriate models based on dataset characteristics. For detailed model recommendation matrices and taxonomically-targeted models, see `references/REFERENCE.md` section "Substitution Model Recommendation".
**General recommendations**:
- **Nuclear proteins (most common)**: LG, WAG, JTT, Q.pfam
- **Mitochondrial**: mtREV, mtZOA, mtMAM, mtART, mtVer, mtInv
- **Chloroplast**: cpREV
- **Taxonomically-targeted**: Q.bird, Q.mammal, Q.insect, Q.plant, Q.yeast (when applicable)
### Step 4: Present Recommendations
Format recommendations with justifications and explain how models will be used in IQ-TREE steps 8A and 8C.
### Step 5: Store Model Set
Store the final comma-separated model list (e.g., "LG,WAG,JTT,Q.pfam") for use in Step 8 template placeholders.
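For illustration, the stored model set is typically handed to ModelFinder through `-mset` during the Step 8A partition search (a sketch of the core call only; the actual template adds scheduler directives, threading, and resource options):
```bash
# Sketch: restrict ModelFinder to the recommended models during partition merging
iqtree -s FcC_supermatrix.fas -spp partition_def.txt \
       -m MF+MERGE -mset LG,WAG,JTT,Q.pfam \
       -pre partition_search
```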
---
## Workflow Implementation
Once required information is gathered, guide the user through these steps. For each step, use templates where available and refer to `references/REFERENCE.md` for detailed implementation.
### STEP 0: Environment Setup
**ALWAYS start by generating a setup script** for the user's environment.
Use the unified conda environment setup script from `references/REFERENCE.md` (Section: "Software Installation Guide"). This creates a single conda environment with all necessary tools:
- compleasm, MAFFT, trimming tools (trimAl, ClipKit, BMGE)
- IQ-TREE, ASTRAL, Perl with BioPerl, GNU parallel
- Downloads and installs Aliscore/ALICUT Perl scripts
**Key points**:
- Users choose between mamba (faster) or conda
- Users choose between predownloaded Aliscore/ALICUT scripts (tested) or latest from GitHub
- All subsequent steps use `conda activate phylo` (the unified environment)
See `references/REFERENCE.md` for the complete setup script template.
---
### STEP 0A: Query NCBI for Assemblies (Optional)
**Use this step when**: User wants to use NCBI data but doesn't have specific assembly accessions yet.
This optional preliminary step helps users discover available genome assemblies by taxon name before proceeding with the main workflow.
#### When to Offer This Step
Offer this step when:
- User wants to analyze genomes from NCBI
- User doesn't have specific Assembly or BioProject accessions
- User mentions a taxonomic group (e.g., "I want to build a phylogeny for beetles")
#### Workflow
1. **Ask for focal taxon**: Request the taxonomic group of interest
- Examples: "Coleoptera", "Drosophila", "Apis mellifera"
- Can be at any taxonomic level (order, family, genus, species)
2. **Query NCBI using the script**: Use `scripts/query_ncbi_assemblies.py` to search for assemblies
```bash
# Basic query (returns 20 results by default)
python scripts/query_ncbi_assemblies.py --taxon "Coleoptera"
# Query with more results
python scripts/query_ncbi_assemblies.py --taxon "Drosophila" --max-results 50
# Query for RefSeq assemblies only (higher quality, GCF_* accessions)
python scripts/query_ncbi_assemblies.py --taxon "Apis" --refseq-only
# Save accessions to file for later download
python scripts/query_ncbi_assemblies.py --taxon "Coleoptera" --save assembly_accessions.txt
```
3. **Present results to user**: The script displays:
- Assembly accession (GCA_* or GCF_*)
- Organism name
- Assembly level (Chromosome, Scaffold, Contig)
- Assembly name
4. **Help user select assemblies**: Ask user which assemblies they want to include
- Consider assembly level (Chromosome > Scaffold > Contig)
- Consider phylogenetic breadth (species coverage)
- Consider data quality (RefSeq > GenBank when available)
5. **Collect selected accessions**: Compile the list of chosen assembly accessions
6. **Proceed to STEP 1**: Use the selected accessions with `download_ncbi_genomes.py`
#### Tips for Assembly Selection
- **Assembly Level**: Chromosome-level assemblies are most complete, followed by Scaffold, then Contig
- **RefSeq vs GenBank**: RefSeq (GCF_*) assemblies undergo additional curation; GenBank (GCA_*) are submitter-provided
- **Taxonomic Sampling**: For phylogenetics, aim for representative sampling across the taxonomic group
- **Quality over Quantity**: Better to have 20 high-quality assemblies than 100 poor-quality ones
---
### STEP 1: Download NCBI Genomes (if applicable)
If user provided NCBI accessions, use `scripts/download_ncbi_genomes.py`:
**For BioProjects**:
```bash
python scripts/download_ncbi_genomes.py --bioprojects PRJNA12345 -o genomes.zip
unzip genomes.zip
```
**For Assembly Accessions**:
```bash
python scripts/download_ncbi_genomes.py --assemblies GCA_123456789.1 -o genomes.zip
unzip genomes.zip
```
**IMPORTANT**: After download, genomes must be renamed with meaningful sample names (format: `[ACCESSION]_[SPECIES_NAME]`). Sample names appear in final phylogenetic trees.
Generate a script that:
1. Finds all downloaded FASTA files in ncbi_dataset directory structure
2. Moves/renames files to main genomes directory with meaningful names
3. Includes any local genome files
4. Creates final genome_list.txt with ALL genomes (local + downloaded)
See `references/REFERENCE.md` section "Sample Naming Best Practices" for detailed guidelines.
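A minimal sketch of such a collection script, assuming the standard `ncbi_dataset/data/` layout produced by the datasets CLI (the local genome path and the `SPECIES_NAME` placeholder are illustrative and must be filled in per genome):
```bash
#!/bin/bash
# Sketch: collect downloaded and local genomes into 00_genomes/ and build genome_list.txt
mkdir -p 00_genomes
# NCBI downloads unpack as ncbi_dataset/data/<ACCESSION>/<ACCESSION>_*_genomic.fna
for fasta in ncbi_dataset/data/GC*/*_genomic.fna; do
    acc=$(basename "$(dirname "${fasta}")")  # e.g., GCA_123456789.1
    cp "${fasta}" "00_genomes/${acc}_SPECIES_NAME.fasta"  # rename per sample
done
cp /path/to/local_genomes/*.fasta 00_genomes/ 2>/dev/null || true  # any local assemblies
ls 00_genomes/*.fasta > genome_list.txt
```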
---
### STEP 2: Ortholog Identification with compleasm
Activate the unified environment and run compleasm on all genomes to identify single-copy orthologs.
**Key considerations**:
- First genome must run alone to download lineage database
- Remaining genomes can run in parallel
- Thread allocation: Miniprot scales well up to ~16-32 threads per genome
**Threading guidelines**: See `references/REFERENCE.md` for recommended thread allocation table.
**Generate scripts using templates**:
- **SLURM**: Read templates `02_compleasm_first.job` and `02_compleasm_parallel.job`
- **PBS**: Read templates `02_compleasm_first.job` and `02_compleasm_parallel.job`
- **Local**: Read templates `02_compleasm_first.sh` and `02_compleasm_parallel.sh`
Replace placeholders: `TOTAL_THREADS`, `THREADS_PER_JOB`, `NUM_GENOMES`, `LINEAGE`
For detailed implementation examples, see `references/REFERENCE.md` section "Ortholog Identification Implementation".
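For orientation, a single compleasm run looks like this (a sketch; the `-a/-o/-l/-t` options are the ones checked during script validation, and the file names are examples):
```bash
# Sketch: run compleasm on one genome (the first run also downloads the lineage database)
conda activate phylo
compleasm run -a 00_genomes/GCA_123456789.1_Genus_species.fasta \
              -o 01_busco_results/GCA_123456789.1_Genus_species_compleasm \
              -l insecta_odb10 -t 16
```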
---
### STEP 3: Quality Control
After compleasm completes, generate QC report using `scripts/generate_qc_report.sh`:
```bash
bash scripts/generate_qc_report.sh qc_report.csv
```
Provide interpretation:
- **>95% complete**: Excellent, retain
- **90-95% complete**: Good, retain
- **85-90% complete**: Acceptable, case-by-case
- **70-85% complete**: Questionable, consider excluding
- **<70% complete**: Poor, recommend excluding
See `references/REFERENCE.md` section "Quality Control Guidelines" for detailed assessment criteria.
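The report is a plain CSV whose columns follow `scripts/generate_qc_report.sh`; a row looks like this (values illustrative):
```
Genome,Complete_SCO,Fragmented,Duplicated,Missing,Completeness(%)
GCA_123456789.1_Genus_species,2283,45,12,55,95.82
```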
---
### STEP 4: Ortholog Extraction
Use `scripts/extract_orthologs.sh` to extract single-copy orthologs:
```bash
bash scripts/extract_orthologs.sh LINEAGE_NAME
```
This generates per-locus unaligned FASTA files in `single_copy_orthologs/unaligned_aa/`.
---
### STEP 5: Alignment with MAFFT
Activate the unified environment (`conda activate phylo`) which contains MAFFT.
Create locus list, then generate alignment scripts:
```bash
cd single_copy_orthologs/unaligned_aa
ls *.fas > locus_names.txt
num_loci=$(wc -l < locus_names.txt)
```
**Generate scheduler-specific scripts**:
- **SLURM/PBS**: Array job with one task per locus
- **Local**: Sequential processing or GNU parallel
For detailed script templates, see `references/REFERENCE.md` section "Alignment Implementation".
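As a concrete local example, a GNU parallel loop over `locus_names.txt` might look like this (a minimal sketch; the job count and output directory are illustrative):
```bash
# Sketch: align every locus with MAFFT, 8 loci at a time
conda activate phylo
cd single_copy_orthologs/unaligned_aa
mkdir -p ../../04_alignments
parallel -j 8 'mafft --auto --thread 1 {} > ../../04_alignments/{.}_aligned.fas' \
    :::: locus_names.txt
```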
---
### STEP 6: Alignment Trimming
Based on user's preference, provide appropriate trimming method. All tools are available in the unified conda environment.
**Options**:
- **trimAl**: Fast (`-automated1`), recommended for large datasets
- **ClipKit**: Modern, fast (default smart-gap mode)
- **BMGE**: Entropy-based (`-t AA`)
- **Aliscore/ALICUT**: Traditional, thorough (recommended for phylogenomics)
**For Aliscore/ALICUT**:
- Perl scripts were installed in STEP 0
- Use `scripts/run_aliscore_alicut_batch.sh` for batch processing
- Or use array jobs with `scripts/run_aliscore.sh` and `scripts/run_alicut.sh`
- Always use `-N` flag for amino acid sequences
**Generate scripts** using scheduler-appropriate templates (array jobs for SLURM/PBS, parallel or serial for local).
For detailed implementation of each trimming method, see `references/REFERENCE.md` section "Alignment Trimming Implementation".
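For orientation, the per-alignment invocations typically look like this (a sketch assuming the bioconda wrappers; file names are illustrative, and the batch templates wrap these calls in array jobs or parallel loops):
```bash
# Sketch: trim a single alignment with each supported tool (pick one method)
trimal -in locus1_aligned.fas -out locus1_trimmed.fas -automated1
clipkit locus1_aligned.fas -o locus1_trimmed.fas           # default smart-gap mode
bmge -i locus1_aligned.fas -t AA -of locus1_trimmed.fas    # entropy-based, amino acids
```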
---
### STEP 7: Concatenation and Partition Definition
Download FASconCAT-G (Perl script) and run concatenation:
```bash
conda activate phylo # Has Perl installed
wget https://raw.githubusercontent.com/PatrickKueck/FASconCAT-G/master/FASconCAT-G_v1.06.1.pl -O FASconCAT-G.pl
chmod +x FASconCAT-G.pl
cd trimmed_aa
perl ../FASconCAT-G.pl -s -i
```
Convert to IQ-TREE format using `scripts/convert_fasconcat_to_partition.py`:
```bash
python ../scripts/convert_fasconcat_to_partition.py FcC_info.xls partition_def.txt
```
Outputs: `FcC_supermatrix.fas`, `FcC_info.xls`, `partition_def.txt`
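The converter writes one line per locus in IQ-TREE's RAxML-style partition format, for example (locus names illustrative):
```
AA, locus0001 = 1-250
AA, locus0002 = 251-498
```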
---
### STEP 8: Phylogenetic Inference
IQ-TREE is already installed in the unified environment. Activate with `conda activate phylo`.
#### Part 8A: Partition Model Selection
Use the substitution models selected during initial setup (Question 9).
**Generate script using templates**:
- Read appropriate template: `templates/[slurm|pbs|local]/08a_partition_search.[job|sh]`
- Replace `MODEL_SET` placeholder with user's selected models (e.g., "LG,WAG,JTT,Q.pfam")
For detailed implementation, see `references/REFERENCE.md` section "Partition Model Selection Implementation".
#### Part 8B: Concatenated ML Tree
Run IQ-TREE using the best partition scheme from Part 8A:
```bash
iqtree -s FcC_supermatrix.fas -spp partition_search.best_scheme.nex \
-nt 18 -safe -pre concatenated_ML_tree -bb 1000 -bnni
```
Output: `concatenated_ML_tree.treefile`
#### Part 8C: Individual Gene Trees
Estimate gene trees for coalescent-based species tree inference.
**Generate scripts using templates**:
- **SLURM/PBS**: Read `08c_gene_trees_array.job` template
- **Local**: Read `08c_gene_trees_parallel.sh` or `08c_gene_trees_serial.sh` template
- Replace `NUM_LOCI` placeholder
For detailed implementation, see `references/REFERENCE.md` section "Gene Trees Implementation".
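Each array task or parallel job runs one per-locus IQ-TREE analysis along these lines (a sketch; the model set comes from Question 9, and paths follow the Step 8D example, which collects `trimmed_aa/*.treefile`):
```bash
# Sketch: infer one gene tree (a single task of the array/parallel job)
iqtree -s trimmed_aa/locus0001.fas \
       -m MFP -mset LG,WAG,JTT,Q.pfam \
       -nt 2 -pre trimmed_aa/locus0001
```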
#### Part 8D: ASTRAL Species Tree
ASTRAL is already installed in the unified conda environment.
```bash
conda activate phylo
# Concatenate all gene trees
cat trimmed_aa/*.treefile > all_gene_trees.tre
# Run ASTRAL
astral -i all_gene_trees.tre -o astral_species_tree.tre
```
Output: `astral_species_tree.tre`
---
### STEP 9: Generate Methods Paragraph
**ALWAYS generate a methods paragraph** to help users write their publication methods section.
Create `METHODS_PARAGRAPH.md` file with:
- Customized text based on tools and parameters used
- Complete citations for all software
- Placeholders for user-specific values (genome count, loci count, thresholds)
- Instructions for adapting to journal requirements
For the complete methods paragraph template, see `references/REFERENCE.md` section "Methods Paragraph Template".
Pre-fill known values when possible:
- Number of genomes
- BUSCO lineage
- Trimming method used
- Substitution models tested
---
## Final Outputs Summary
Provide users with a summary of outputs:
**Phylogenetic Results**:
1. `concatenated_ML_tree.treefile` - ML tree from concatenated supermatrix
2. `astral_species_tree.tre` - Coalescent species tree
3. `*.treefile` - Individual gene trees
**Data and Quality Control**:
4. `qc_report.csv` - Genome quality statistics
5. `FcC_supermatrix.fas` - Concatenated alignment
6. `partition_search.best_scheme.nex` - Selected partitioning scheme
**Publication Materials**:
7. `METHODS_PARAGRAPH.md` - Ready-to-use methods section with citations
**Visualization tools**: FigTree, iTOL, ggtree (R), ete3/toytree (Python)
---
## Script Validation
**ALWAYS perform validation checks** after generating scripts but before presenting them to the user. This ensures script accuracy, consistency, and proper resource allocation.
### Validation Workflow
For each generated script, perform these validation checks in order:
#### 1. Program Option Verification
**Purpose**: Detect hallucinated or incorrect command-line options that may cause scripts to fail.
**Procedure**:
1. **Extract all command invocations** from the generated script (e.g., `compleasm run`, `iqtree -s`, `mafft --auto`)
2. **Compare against reference sources**:
- First check: Compare against corresponding template in `templates/` directory
- Second check: Compare against examples in `references/REFERENCE.md`
- Third check: If options differ significantly or are uncertain, perform web search for official documentation
3. **Common tools to validate**:
- `compleasm run` - Check `-a`, `-o`, `-l`, `-t` options
- `iqtree` - Verify `-s`, `-p`, `-m`, `-bb`, `-alrt`, `-nt`, `-safe` options
- `mafft` - Check `--auto`, `--thread`, `--reorder` options
- `astral` - Verify `-i`, `-o` options
- Trimming tools (`trimal`, `clipkit`, `BMGE.jar`) - Validate options
**Action on issues**:
- If incorrect options found: Inform user of the issue and ask if they want you to correct it
- If uncertain: Ask user to verify with tool documentation before proceeding
#### 2. Pipeline Continuity Verification
**Purpose**: Ensure outputs from one step correctly feed into inputs of subsequent steps.
**Procedure**:
1. **Map input/output relationships**:
- Step 2 output (`01_busco_results/*_compleasm/`) → Step 3 input (QC script)
- Step 4 output (`single_copy_orthologs/unaligned_aa/`) → Step 5 input (MAFFT)
- Step 5 output (`04_alignments/*.fas`) → Step 6 input (trimming)
- Step 6 output (`05_trimmed/*.fas`) → Step 7 input (FASconCAT-G)
- Step 7 output (`FcC_supermatrix.fas`, partition file) → Step 8A input (IQ-TREE)
- Step 8C output (`*.treefile`) → Step 8D input (ASTRAL)
2. **Check for consistency**:
- File path references match across scripts
- Directory structure follows recommended layout
- Glob patterns correctly match expected files
- Required intermediate files are generated before being used
**Action on issues**:
- If path mismatches found: Inform user and ask if they want you to correct them
- If directory structure inconsistent: Suggest corrections aligned with recommended structure
#### 3. Resource Compatibility Check
**Purpose**: Ensure allocated computational resources are appropriate for the task.
**Procedure**:
1. **Verify resource allocations** against recommendations in `references/REFERENCE.md`:
- **Memory allocation**: Check if memory per CPU (typically 6GB for compleasm, 2-4GB for others) is adequate
- **Thread allocation**: Verify thread counts are reasonable for the number of genomes/loci
- **Walltime**: Ensure walltime is sufficient based on dataset size guidelines
- **Parallelization**: Check that threads per job × concurrent jobs ≤ total threads
2. **Common issues to check**:
- Compleasm: First job needs full thread allocation (downloads database)
- IQ-TREE: `-nt` should match allocated CPUs
- Gene trees: Ensure enough threads per tree × concurrent trees ≤ total available
- Memory: Concatenated tree inference may need 8-16GB per CPU for large datasets
3. **Validate against user-specified constraints**:
- Total CPUs specified by user
- Available memory per node
- Maximum walltime limits
- Scheduler-specific limits (if mentioned)
**Action on issues**:
- If resource allocation issues found: Inform user and suggest corrections with justification
- If uncertain about adequacy: Ask user about typical job performance in their environment
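A quick arithmetic check for the parallelization constraint above (a sketch with illustrative values):
```bash
# Sketch: verify threads per job x concurrent jobs stays within the total allocation
TOTAL_THREADS=64
THREADS_PER_JOB=16
CONCURRENT_JOBS=4
if (( THREADS_PER_JOB * CONCURRENT_JOBS > TOTAL_THREADS )); then
  echo "Oversubscribed: ${THREADS_PER_JOB} x ${CONCURRENT_JOBS} > ${TOTAL_THREADS}" >&2
fi
```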
### Validation Reporting
After completing all validation checks:
1. **If all checks pass**: Inform user briefly: "Scripts validated successfully - options, pipeline flow, and resources verified."
2. **If issues found**: Present a structured report:
```
**Validation Results**
⚠️ Issues found during validation:
1. [Issue category]: [Description]
- Current: [What was generated]
- Suggested: [Recommended fix]
- Reason: [Why this is an issue]
Would you like me to apply these corrections?
```
3. **Always ask before correcting**: Never silently fix issues - always get user confirmation before applying changes.
4. **Document corrections**: If corrections are applied, explain what was changed and why.
---
## Communication Guidelines
- **Always start with STEP 0**: Generate the unified environment setup script
- **Always end with STEP 9**: Generate the customized methods paragraph
- **Always validate scripts**: Perform validation checks before presenting scripts to users
- **Use unified environment by default**: All scripts should use `conda activate phylo`
- **Always ask about CPU allocation**: Never auto-detect cores, always ask user
- **Recommend optimized workflows**: For users with adequate resources, recommend optimized parallel approaches over simple serial approaches
- **Be clear and pedagogical**: Explain why each step is necessary
- **Provide educational explanations when requested**: If user answered yes to educational goals (question 10):
- After completing each major workflow stage, ask: "Would you like me to explain this step?"
- If yes, provide moderate-length explanation (1-2 paragraphs) covering:
- What the step accomplishes biologically and computationally
- Significant choices made and their rationale
- Best practices being followed in the workflow
- Examples of "major workflow stages": STEP 0 (setup), STEP 1 (download), STEP 2 (BUSCO), STEP 3 (QC), STEP 5 (alignment), STEP 6 (trimming), STEP 7 (concatenation), STEP 8 (phylogenetic inference)
- **Provide complete, ready-to-run scripts**: Users should copy-paste and run
- **Adapt to user's environment**: Always generate scheduler-specific scripts
- **Reference supporting files**: Direct users to `references/REFERENCE.md` for details
- **Use helper scripts**: Leverage provided scripts in `scripts/` directory
- **Include error checking**: Add file existence checks and informative error messages
- **Be encouraging**: Phylogenomics is complex; maintain supportive tone
---
## Important Notes
### Mandatory Steps
1. **STEP 0 is mandatory**: Always generate the environment setup script first
2. **STEP 9 is mandatory**: Always generate the methods paragraph file at the end
### Template Usage (IMPORTANT!)
3. **Prefer templates over inline code**: Use `templates/` directory for major scripts
4. **Template workflow**:
- Read: `Read("templates/slurm/02_compleasm_first.job")`
- Replace placeholders: `TOTAL_THREADS`, `LINEAGE`, `NUM_GENOMES`, `MODEL_SET`, etc.
- Present customized script to user
5. **Available templates**: See `templates/README.md` for complete list
6. **Benefits**: Reduces token usage, easier maintenance, consistent structure
### Script Generation
7. **Always adapt scripts** to user's scheduler (SLURM/PBS/local)
8. **Replace all placeholders** before presenting scripts
9. **Never auto-detect CPU cores**: Always ask user to specify
10. **Provide parallelization options**: For each parallelizable step, offer array job, parallel, and serial options
11. **Scheduler-specific configuration**: For SLURM/PBS, always ask about account, partition, email, etc.
### Parallelization Strategy
12. **Ask about preferences**: Let user choose between throughput optimization vs. simplicity
13. **Compleasm optimization**: For ≥2 genomes and ≥16 cores, recommend two-phase approach
14. **Use threading guidelines**: Refer to `references/REFERENCE.md` for thread allocation recommendations
15. **Parallelizable steps**: Steps 2 (compleasm), 5 (MAFFT), 6 (trimming), 8C (gene trees)
### Substitution Model Selection
16. **Always recommend models**: Use the systematic model recommendation process
17. **Fetch current documentation**: Use WebFetch to get IQ-TREE model information
18. **Replace MODEL_SET placeholder**: In Step 8A templates with comma-separated list
19. **Taxonomically-targeted models**: Suggest Q.bird, Q.mammal, Q.insect, Q.plant when applicable
### Reference Material
20. **Direct users to references/REFERENCE.md** for:
- Detailed implementation guides
- BUSCO lineage datasets (complete list)
- Resource recommendations (memory, CPUs, walltime tables)
- Sample naming best practices
- Quality control assessment criteria
- Aliscore/ALICUT detailed guide and parameters
- Tool citations with DOIs
- Software installation instructions
- Common issues and troubleshooting
---
## Attribution
This skill was created by **Bruno de Medeiros** (Curator of Pollinating Insects, Field Museum) based on phylogenomics tutorials by **Paul Frandsen** (Brigham Young University).
## Workflow Entry Point
When a user requests phylogeny generation:
1. Gather required information using the "Initial User Questions" section
2. Generate STEP 0 setup script from `references/REFERENCE.md`
3. If user needs help finding NCBI assemblies, perform STEP 0A using `query_ncbi_assemblies.py`
4. Proceed step-by-step through workflow (STEPS 1-8), using templates and referring to `references/REFERENCE.md` for detailed implementation
5. All workflow scripts should use the unified conda environment (`conda activate phylo`)
6. Validate all generated scripts before presenting to user (see "Script Validation" section)
7. Generate STEP 9 methods paragraph from template in `references/REFERENCE.md`
8. Provide final outputs summary

File diff suppressed because it is too large


@@ -0,0 +1,63 @@
#!/usr/bin/env python3
"""
Convert FASconCAT info file to IQ-TREE partition format
Usage:
python convert_fasconcat_to_partition.py FcC_info.xls [output_file.txt]
Author: Bruno de Medeiros (Field Museum)
Based on tutorials by Paul Frandsen (BYU)
"""
import sys
def convert_fcc_to_partition(fcc_file, output_file="partition_def.txt"):
"""
Convert FASconCAT info file to IQ-TREE partition format
Args:
fcc_file: Path to FcC_info.xls file from FASconCAT
output_file: Path to output partition definition file
"""
try:
with open(fcc_file, 'r') as f:
lines = f.readlines()
except FileNotFoundError:
print(f"Error: File '{fcc_file}' not found")
sys.exit(1)
partitions_written = 0
with open(output_file, 'w') as out:
# Skip first two header lines (FASconCAT INFO and column headers)
for line in lines[2:]:
line = line.strip()
if line:
parts = line.split('\t')
if len(parts) >= 3:
locus = parts[0]
start = parts[1]
end = parts[2]
out.write(f"AA, {locus} = {start}-{end}\n")
partitions_written += 1
print(f"Partition file created: {output_file}")
print(f"Number of partitions: {partitions_written}")
def main():
if len(sys.argv) < 2:
print("Usage: python convert_fasconcat_to_partition.py FcC_info.xls [output_file.txt]")
print("\nConverts FASconCAT info file to IQ-TREE partition format")
sys.exit(1)
fcc_file = sys.argv[1]
output_file = sys.argv[2] if len(sys.argv) > 2 else "partition_def.txt"
convert_fcc_to_partition(fcc_file, output_file)
if __name__ == "__main__":
main()


@@ -0,0 +1,133 @@
#!/usr/bin/env python3
"""
Download genomes from NCBI using BioProject or Assembly accessions
Usage:
python download_ncbi_genomes.py --bioprojects PRJNA12345 PRJEB67890
python download_ncbi_genomes.py --assemblies GCA_123456789.1 GCF_987654321.1
Requires: ncbi-datasets-pylib (pip install ncbi-datasets-pylib)
Author: Bruno de Medeiros (Field Museum)
Based on tutorials by Paul Frandsen (BYU)
"""
import argparse
import sys
import subprocess
def download_using_cli(accessions, output_file="genomes.zip"):
"""
Download genomes using NCBI datasets CLI
Args:
accessions: List of BioProject or Assembly accessions
output_file: Name of output zip file
"""
cmd = ["datasets", "download", "genome", "accession"] + accessions + ["--filename", output_file]
print(f"Running: {' '.join(cmd)}")
print("")
try:
result = subprocess.run(cmd, check=True, capture_output=True, text=True)
print(result.stdout)
print(f"\nDownload complete: {output_file}")
print("Extract with: unzip " + output_file)
return True
except subprocess.CalledProcessError as e:
print(f"Error downloading genomes: {e}", file=sys.stderr)
print(e.stderr, file=sys.stderr)
return False
except FileNotFoundError:
print("Error: 'datasets' command not found", file=sys.stderr)
print("Install with: conda install -c conda-forge ncbi-datasets-cli", file=sys.stderr)
return False
def get_bioproject_assemblies(bioprojects):
"""
Get assembly accessions for given BioProjects using Python API
Args:
bioprojects: List of BioProject accessions
Returns:
List of tuples (assembly_accession, organism_name)
"""
try:
from ncbi.datasets.metadata.genome import get_assembly_metadata_by_bioproject_accessions
except ImportError:
print("Error: ncbi-datasets-pylib not installed", file=sys.stderr)
print("Install with: pip install ncbi-datasets-pylib", file=sys.stderr)
sys.exit(1)
assemblies = []
print(f"Fetching assembly information for {len(bioprojects)} BioProject(s)...")
print("")
for assembly in get_assembly_metadata_by_bioproject_accessions(bioprojects):
acc = assembly.accession
name = assembly.organism.organism_name
assemblies.append((acc, name))
print(f" {name}: {acc}")
print(f"\nFound {len(assemblies)} assemblies")
return assemblies
def main():
parser = argparse.ArgumentParser(
description="Download genomes from NCBI using BioProject or Assembly accessions"
)
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument(
"--bioprojects",
nargs="+",
help="BioProject accessions (e.g., PRJNA12345 PRJEB67890)"
)
group.add_argument(
"--assemblies",
nargs="+",
help="Assembly accessions (e.g., GCA_123456789.1 GCF_987654321.1)"
)
parser.add_argument(
"-o", "--output",
default="genomes.zip",
help="Output zip file name (default: genomes.zip)"
)
parser.add_argument(
"--list-only",
action="store_true",
help="List assemblies without downloading (BioProject mode only)"
)
args = parser.parse_args()
if args.bioprojects:
assemblies = get_bioproject_assemblies(args.bioprojects)
if args.list_only:
print("\nAssembly accessions (use with --assemblies to download):")
for acc, name in assemblies:
print(acc)
return
# Download assemblies
assembly_accs = [acc for acc, name in assemblies]
success = download_using_cli(assembly_accs, args.output)
elif args.assemblies:
success = download_using_cli(args.assemblies, args.output)
sys.exit(0 if success else 1)
if __name__ == "__main__":
main()


@@ -0,0 +1,88 @@
#!/bin/bash
# Extract and reorganize single-copy orthologs from compleasm output
#
# Usage: bash extract_orthologs.sh LINEAGE_NAME
# Example: bash extract_orthologs.sh metazoa
#
# Author: Bruno de Medeiros (Field Museum)
# Based on tutorials by Paul Frandsen (BYU)
if [ $# -lt 1 ]; then
echo "Usage: bash extract_orthologs.sh LINEAGE_NAME"
echo " Example: bash extract_orthologs.sh metazoa"
exit 1
fi
LINEAGE="$1"
echo "Extracting single-copy orthologs for lineage: ${LINEAGE}"
# Create directory for ortholog FASTA files
mkdir -p single_copy_orthologs
# Copy gene_marker.fasta files and rename by species
count=0
for dir in 01_busco_results/*_compleasm; do
if [ ! -d "${dir}" ]; then
continue
fi
genome=$(basename "${dir}" _compleasm)
# Auto-detect the OrthoDB version (odb10, odb11, odb12, etc.)
odb_dirs=("${dir}/${LINEAGE}_odb"*)
if [ -d "${odb_dirs[0]}" ]; then
marker_file="${odb_dirs[0]}/gene_marker.fasta"
else
echo " Warning: No OrthoDB directory found for ${genome}" >&2
continue
fi
if [ -f "${marker_file}" ]; then
cp "${marker_file}" "single_copy_orthologs/${genome}.fasta"
echo " Extracted: ${genome}"
count=$((count + 1))
else
echo " Warning: Marker file not found for ${genome}" >&2
fi
done
if [ ${count} -eq 0 ]; then
echo "Error: No gene_marker.fasta files found. Check lineage name." >&2
exit 1
fi
echo "Extracted ${count} genomes"
echo ""
echo "Now generating per-locus unaligned FASTA files..."
cd single_copy_orthologs || exit 1
mkdir -p unaligned_aa
cd unaligned_aa || exit 1
# AWK script to split sequences by ortholog ID: the text before the first "_"
# in each FASTA header names the per-locus output file, and the source file
# name (the genome) becomes the sequence label
awk 'BEGIN{RS=">"; FS="\n"} {
if (NF > 1) {
split($1, b, "_");                 # locus ID = header text before the first "_"
fnme = b[1] ".fas";                # one output file per locus
n = split(FILENAME, a, "/");
species = a[length(a)];            # taxon label = source genome file name
gsub(".fasta", "", species);
print ">" species "\n" $2 >> fnme; # assumes single-line sequences (compleasm format)
close(fnme);
}
}' ../*.fasta
# Fix headers
if [[ "$OSTYPE" == "darwin"* ]]; then
# macOS
sed -i '' -e 's/.fasta//g' *.fas
else
# Linux
sed -i -e 's/.fasta//g' *.fas
fi
num_loci=$(ls -1 *.fas 2>/dev/null | wc -l)
echo "Unaligned ortholog files generated: ${num_loci} loci"
echo ""
echo "Output directory: single_copy_orthologs/unaligned_aa/"


@@ -0,0 +1,59 @@
#!/bin/bash
# Quality control report generator for compleasm results
#
# Usage: bash generate_qc_report.sh [output_file.csv]
#
# Author: Bruno de Medeiros (Field Museum)
# Based on tutorials by Paul Frandsen (BYU)
OUTPUT_FILE="${1:-qc_report.csv}"
echo "Genome,Complete_SCO,Fragmented,Duplicated,Missing,Completeness(%)" > "${OUTPUT_FILE}"
count=0
for dir in 01_busco_results/*_compleasm; do
if [ ! -d "${dir}" ]; then
continue
fi
genome=$(basename "${dir}" _compleasm)
summary="${dir}/summary.txt"
if [ -f "${summary}" ]; then
# Parse completeness statistics from compleasm format
# compleasm uses: S: (single-copy), D: (duplicated), F: (fragmented), M: (missing)
# Format: "S:80.93%, 2283" where we need the count (2283)
complete=$(grep "^S:" "${summary}" | awk -F',' '{print $2}' | tr -d ' ')
duplicated=$(grep "^D:" "${summary}" | awk -F',' '{print $2}' | tr -d ' ')
fragmented=$(grep "^F:" "${summary}" | awk -F',' '{print $2}' | tr -d ' ')
missing=$(grep "^M:" "${summary}" | awk -F',' '{print $2}' | tr -d ' ')
# Check if all values were successfully extracted
if [ -z "${complete}" ] || [ -z "${duplicated}" ] || [ -z "${fragmented}" ] || [ -z "${missing}" ]; then
echo "Warning: Could not parse statistics for ${genome}" >&2
continue
fi
# Calculate completeness percentage ((Single + Duplicated) / Total * 100)
total=$((complete + duplicated + fragmented + missing))
if command -v bc &> /dev/null; then
completeness=$(echo "scale=2; (${complete} + ${duplicated}) / ${total} * 100" | bc)
else
# Fallback if bc not available
completeness=$(awk "BEGIN {printf \"%.2f\", (${complete} + ${duplicated}) / ${total} * 100}")
fi
echo "${genome},${complete},${fragmented},${duplicated},${missing},${completeness}" >> "${OUTPUT_FILE}"
count=$((count + 1))
else
echo "Warning: Summary file not found for ${genome}" >&2
fi
done
if [ ${count} -eq 0 ]; then
echo "Error: No compleasm output directories found (*_compleasm)" >&2
exit 1
fi
echo "QC report generated: ${OUTPUT_FILE}"
echo "Genomes analyzed: ${count}"


@@ -0,0 +1,742 @@
#!/usr/bin/perl
use strict ;
use File::Copy ;
use Tie::File ;
use Fcntl ;
use Term::Cap ;
use Term::ANSIColor qw(:constants);
use Getopt::Std ;
# updated on 13th february , 2009 by patrick kück
# updated on 2nd april , 2009 by patrick kück
# updated on 15th june , 2009 by patrick kück
# updated on 26th july , 2009 by patrick kück
# updated on 7th september, 2011 by patrick kück (alicut v2.3)
# updated on 22.2.2017, by patrick kück (alicut v2.31) -> correction of initial warning due to line 547, changed some terminal prints, argv handling commands
my @answer_remain_stems = ( 'no', 'yes' ) ;
my @answer_codons = ( 'no', 'yes' ) ;
my @answer_third_pos = ( 'no', 'yes' ) ;
&argv_handling ( \@answer_remain_stems, \@answer_codons, \@answer_third_pos ) ;
&menu ( \@answer_remain_stems, \@answer_codons, \@answer_third_pos ) ;
sub argv_handling{
my $aref_remain_stems = $_[0] ;
my $aref_codons = $_[1] ;
my $aref_third_pos = $_[2] ;
my ( $commandline ) = join "", @ARGV ;
$commandline =~ s/ |\s+// ;
my @commands = split "-", $commandline ;
shift @commands ;
for my $single_command ( sort @commands ){
if ( $single_command =~ /^r$/i ) { @$aref_remain_stems = ( reverse @$aref_remain_stems) }
elsif ( $single_command =~ /^c$/i ) { @$aref_codons = ( reverse @$aref_codons ) }
elsif ( $single_command =~ /^3$/i ) { @$aref_third_pos = ( reverse @$aref_third_pos ) }
elsif ( $single_command =~ /^h$/i ) { &help }
elsif ( $single_command =~ /^p$/i ) { &preface }
elsif ( $single_command =~ /^s$/i ) {
&header ;
&commands( \$aref_remain_stems->[0], \$aref_codons->[0], \$aref_third_pos->[0]) ;
&start (\$aref_remain_stems->[0], \$aref_codons->[0], \$aref_third_pos->[0])
}
else { print "\n\t!COMMAND-ERROR!: unknown command \"-", $single_command, "\"\n" }
}
&menu ( \@$aref_remain_stems, \@$aref_codons, \@$aref_third_pos)
}
sub header{
printf "\n%68s\n", "------------------------------------------------------------" ;
printf "%49s\n" , "Welcome to ALICUT V2.31 !" ;
printf "%60s\n" , "a Perlscript to cut ALISCORE identified RSS" ;
printf "%57s\n" , "written by Patrick Kueck (ZFMK, Bonn)" ;
printf "%68s\n\n", "------------------------------------------------------------" ;
}
sub commands{
my $sref_rem_stems = $_[0] ;
my $sref_reo_codon = $_[1] ;
my $sref_th_posit = $_[2] ;
print "\n\t------------------------------------------------------------" ;
print "\n\tRemain Stem Position :\t", $$sref_rem_stems ;
print "\n\tRemove Codon :\t", $$sref_reo_codon ;
print "\n\tRemove 3rd Position :\t", $$sref_th_posit ;
print "\n\t------------------------------------------------------------\n" ;
}
sub help{
print
<<info;
-------------------------------------------------------------------
-------------------------------------------------------------------
General Information and Usage:
-------------------------------
ALICUT V2.31 removes ALISCORE identified RSS positions
in given FASTA file(s) which are listed in the FASTA file cor-
responding ALISCORE "List" outfile(s). If structure sequences
are implemented, ALICUT V2.3 automatically replaces brackets
of non rss positions by dots when they are paired with rss
identified positions.
Start ALICUT under default
-------------------------------------------------------------------
To remove all ALISCORE identified RSS positions:
Type <s> return (via Menu) or
Type <perl ALICUT_V2.3.pl -s> <enter> (via command line)
R-Option (Remain Stems)
-------------------------------------------------------------------
To remain all stem positions of identified rss within FASTA file(s):
Type <r> <return> <s> <enter> (via Menu)
Type <perl ALICUT_V2.3.pl -r -s> <enter> (via command line)
C-Option (Remove Codon)
-------------------------------------------------------------------
To translate ALISCORE identified RSS positions of amino-acid data
into nucleotide triplet positions before exclusion of randomised
sequence sections:
Type <c> return <s> return (via Menu) or
Type <perl ALICUT_V2.3.pl -c -s> <enter> (via command line)
Note:
This option is only useful if you have analysed amino-acid
data, but wish to exclude the corresponding nucleotide triplet
positions from the matching nucleotide data file.
Be aware, that the name of the nucleotide data file has to be named
equal to the ALISCORE analysed amino-acid data file. The C-option
can not be applied on amino-acid sequences. Otherwise, ALICUT
excludes the original ALISCORE identified sequence sections.
3-Option (Remove 3rd position)
-------------------------------------------------------------------
To remove ALISCORE identified RSS positions only if the sequence
position is a multiple of 3:
Type <3> <return> <s> <return> (via Menu)
Type <perl ALICUT_V2.3.pl -3 -s> <enter> (via command line)
Note:
The 3-Option can be combined with the C-option. In this case,
positions of the ALISCORE "List" outfile(s) are translated into
codon positions from which only the 3rd positions are excluded.
The 3-Option can only be applied on nucleotide data. Otherwise,
ALICUT excludes the original ALISCORE identified sequence sections.
ALICUT IN and OUT files
-------------------------------------------------------------------
ALICUT V2.3 needs the original ALISCORE FASTA infile(s) and "List"
outfile(s) in the same folder as ALICUT V2.3.
The "List" outfile(s) must contain the identified RSS positions
in one single line, separated by whitespace.
e.g. 1 3 5 6 8 9 10 11 123 127 10000 10001
ALICUT V2.0 can handle unlimited FASTA files in one single run.
The sole condition is that the prefix of the ALISCORE "List"
outfile(s) is identical to the associated FASTA infile(s).
ALICUT V2.3 first searches for the ALISCORE "List" outfile(s),
removes the Suffix "_List_random.txt" and searches for the
"List" associated FASTA file(s).
e.g. COI.fas_List_random.txt (ALISCORE "List" outfile)
COI.fas (Associated FASTA infile)
If both files are detected, ALICUT V2.3 excludes the RSS identified
positions of the "List" file(s) in the associated
FASTA file(s) and saves the changes in a new FASTA outfile,
named "ALICUT_FASTAinputname.fas".
Under the C- and 3-Option, removed sequence positions differ from
the original "List" position numbers. Under both options, ALICUT
prints the actually removed positions in separate "ALICUT_LIST"
outfile(s).
ALICUT V2.3 generates also an info file "ALICUT_info". This file
informs about the number and percentage of removed positions, number
of single sequences, single parameter settings, and sequence states
of each restricted FASTA file.
If structure sequences are identified by ALICUT, ALICUT generates
structure info file(s) which lists remaining stem pairs and loop
positions, as well as percentages of both structure elements.
-------------------------------------------------------------------
-------------------------------------------------------------------
info
;
print "\tBACK to ALICUT MAIN-Menu:\t\t type <return>\n" ;
print "\n\t------------------------------------------------------------\n\t" ;
chomp ( my $answer_xy = <STDIN> );
&menu ;
}
sub preface{
print
<<preface
--------------------ALICUT PREFACE---------------------
Version : 2.31
Language : PERL
Last Update : 22nd February, 2017
Author : Patrick Kueck, ZFMK Bonn GERMANY
e-mail : patrick_kueck\@web.de
Homepage : http://www.zfmk.de
This program is free software; you can redistribute it
and/or modify it under the terms of the GNU General Public
License as published by the Free Software Foundation ;
either version 2 of the License, or (at your option) any
later version.
This program is distributed in the hope that it will be
useful, but WITHOUT ANY WARRANTY; without even the
implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for
more details.
You should have received a copy of the GNU General Public
License along with this program; if not, write to the Free
Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139,
USA.
For further free downloadable programs visit:
www.zfmk.de/web/Forschung/Abteilungen/AG_Wgele/index.en.html
------------------------------------------------------------
preface
;
print "\tBACK to ALICUT MAIN-Menu:\t\t type <return>\n" ;
print "\n\t------------------------------------------------------------\n\t" ;
chomp ( my $answer_xy = <STDIN> );
&menu;
}
sub menu{
my $aref_remain_stems = $_[0] ;
my $aref_remove_codon = $_[1] ;
my $aref_third_posit = $_[2] ;
&header ;
print "\n\tSTART ALICUT:\t\ttype <s> <return>" ;
print "\n\tQUIT ALICUT:\t\ttype <q> <return>" ;
print "\n\tREMAIN STEMS:\t\ttype <r> <return>" ;
print "\n\tREMOVE CODON:\t\ttype <c> <return>" ;
print "\n\tREMOVE 3rd:\t\ttype <3> <return>" ;
print "\n\tHELP:\t\t\ttype <h> <return>" ;
print "\n\tPREFACE:\t\ttype <p> <return>" ;
&commands ( \$aref_remain_stems->[0], \$aref_remove_codon->[0], \$aref_third_posit->[0] );
my $answer_opening = &commandline ;
until ( $answer_opening =~ /^s$|^r$|^c$|^p$|^h$|^1$|^2$|^q$|^3$/i ){
print "\n\t!COMMAND-ERROR!: unknown command \"$answer_opening\"!\n" ;
$answer_opening = &commandline ;
}
$answer_opening =~ /^s$/i and do { &start ( \$aref_remain_stems->[0], \$aref_remove_codon->[0], \$aref_third_posit->[0] ) } ;
$answer_opening =~ /^r$/i and do { @$aref_remain_stems = (reverse @$aref_remain_stems ); &menu } ;
$answer_opening =~ /^c$/i and do { @$aref_remove_codon = (reverse @$aref_remove_codon ); &menu } ;
$answer_opening =~ /^3$/i and do { @$aref_third_posit = (reverse @$aref_third_posit ); &menu } ;
$answer_opening =~ /^q$/i and do { exit } ;
$answer_opening =~ /^h$/i and do { &help } ;
$answer_opening =~ /^1$/ and do { &error1 } ;
$answer_opening =~ /^2$/ and do { &error2 } ;
$answer_opening =~ /^p$/i and do { &preface }
}
sub start{
my $sref_stems_remain = $_[0] ;
my $sref_codon_remove = $_[1] ;
my $sref_third_remove = $_[2] ;
my $j = 0 ;
open OUTinfo, ">>ALICUT_info.xls" ;
print OUTinfo "\nUsed List File\tUsed Fasta file\tremove triplets\tremove 3rd position\tnumber taxa\tbp before\tbp after\tremaining bp [%]\tsequence type\n" ;
# Read IN of all List_random.txt files within the same folder as ALICUT and handle it
READING:
foreach my $file ( <*List_*.txt> ) {
# Set counter +1
$j++;
# Read in of the ALISCORE-list outfile
&tie_linefeeds ( \$file ) ;
( open IN, "<$file" ) or die "\n\t!FILE-ERROR!: Can not open listfile $file!\n" ;
my $line = <IN> ; chomp $line ;
# check for correct aliscore list format
unless ( $line =~ /^(\d+ )+\d+$|^\d+$/ ) { warn "\t!FILE-WARN!: $file has no ALISCORE list format!\n" ; next READING }
# Total number of randomized identified positions
my @cut_positions = split " ", $line ; close IN ;
# "filename.fas_List_random.txt" to "filename.fas"
( my $file_fasta = $file ) =~ s/_List_.+// ;
# Read in of the original ALISCORE fasta infile which belongs to the listfile
&tie_linefeeds ( \$file_fasta ) ;
( open INfas, "<$file_fasta" ) or warn "\t!FILE-WARN!: Can not find $file_fasta!\n" and next READING ;
chomp ( my @inputfile = <INfas> ) ; close INfas ;
warn "\t!FILE-WARN!: File $file_fasta is empty!\n" if 0 == @inputfile and next READING ;
# Handle the FASTA file in the way that sequencename and sequence alternate in each line
@inputfile = fas_bearbeiten ( @inputfile ) ;
# Generate a hash: key => taxon, value => sequence
my %sequence = @inputfile ;
my @values = values %sequence ;
# Determine base positions before and after cut. Output of cuttings as total number and in percent
my $number_sequences = keys %sequence ;
my $number_characters_before = length $values[0] ;
# Check for correct FASTA format and handling of structure sequence
my $sequence_state = 'nt' ;
SEQUENCE_CHECK:
for my $raw_taxon ( keys %sequence ){
# if whitespace are between ">" and the next sign within a sequence name, delete these whitespaces
$raw_taxon =~ s/^\>\s*/\>/g ;
# if whitespaces between last sign and newline in sequence name, delete these whitespaces
$raw_taxon =~ s/\s*$//g ;
die "\n\t!FILE-ERROR!: $raw_taxon in $file_fasta is not in FASTA format!\n" if $raw_taxon !~ /^\>/ ;
die "\n\t!FILE-ERROR!: Sequence name missing in $file_fasta!\n" if $raw_taxon =~ /^\>$/ ;
die "\n\t!FILE-ERROR!: Sequence name $raw_taxon in $file_fasta involves forbidden signs!\n" if $raw_taxon !~ /\w/ ;
die "\n\t!FILE-ERROR!: Sequences of $file_fasta have no equal length!\n" if length $sequence{$raw_taxon} != $number_characters_before ;
die "\n\t!FILE-ERROR!: Sequence missing in $file_fasta!\n" if $sequence{$raw_taxon} =~ /^\n$|^$/ ;
die "\n\t!FILE-ERROR!: Sequence length in $file_fasta is too short to cut all positions!\n" if $number_characters_before < $cut_positions[ $#cut_positions ] ;
# Structure handling
if ( $sequence{$raw_taxon} =~ /.*\(.*\).*/ ){
$sequence{$raw_taxon} =~ s/-/./g ;
my @strc_elements = split "" , $sequence{$raw_taxon} ;
for my $str_sign ( @strc_elements ){
unless ( $str_sign =~ /\(|\)|\./ ){ die "\n\t!FILE-ERROR!: Structure string of $file_fasta involves forbidden signs in $raw_taxon!\n" }
}
my $structurestring = $sequence{$raw_taxon} ;
$structurestring =~ s/-/./g ;
$sequence{$raw_taxon} = &structure_handling ( \$structurestring, \$$sref_stems_remain, \@cut_positions, \$file_fasta ); next SEQUENCE_CHECK ;
}
# Check for correct sequence states
$sequence{$raw_taxon} =~ s/(\w+)/\U$1/ig ;
my @seq_elements = split "" , $sequence{$raw_taxon} ;
for my $seq_sign ( @seq_elements ){
unless ( $seq_sign =~ /A|C|G|T|U|-|N|Y|X|R|W|S|K|M|D|V|H|B|Q|E|I|L|F|P|\?/ ){ die "\n\t!FILE-ERROR!: Sequence of $file_fasta involves forbidden signs in $raw_taxon!\n" }
}
if ( $sequence{$raw_taxon} =~ /I|E|L|Q|F|P/ ) { $sequence_state = 'aa' }
}
# Translate cut positions
my @fasta_cut;
&translate_cut_positions( \$$sref_codon_remove, \$$sref_third_remove, \@cut_positions, \$number_characters_before, \@fasta_cut, \$sequence_state, \$file_fasta );
# Calculate percent of remaining positions
my $number_cut_positions = @cut_positions ;
my $number_characters_after = $number_characters_before-$number_cut_positions ;
my $percent_left = sprintf "%.1f", ( $number_characters_after / $number_characters_before ) * 100 ;
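# write the decimal separator as a comma in the info file output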
$percent_left =~ s/\./,/g ;
# Collect the remaining (uncut) positions into @final and print to ALICUT_$file_fasta
if ( $$sref_codon_remove =~ /yes/ && $$sref_third_remove =~ /yes/ ){ open OUT, ">ALICUT_codon_3rd_$file_fasta" }
elsif ( $$sref_codon_remove =~ /yes/ && $$sref_third_remove =~ /no/ ){ open OUT, ">ALICUT_codon_$file_fasta" }
elsif ( $$sref_codon_remove =~ /no/ && $$sref_third_remove =~ /yes/ ){ open OUT, ">ALICUT_3rd_$file_fasta" }
else { open OUT, ">ALICUT_$file_fasta" }
for ( keys %sequence ){
my @bases = split "", $sequence{$_} ;
my @final = map { $bases[$_] } @fasta_cut ;
my $final = $_."\n".( join "", @final )."\n" ;
print OUT "$final" ;
}
close OUT;
# Print Out of extra infos to ALICUT_info
print OUTinfo "$file\t$file_fasta\t$$sref_codon_remove\t$$sref_third_remove\t$number_sequences\t$number_characters_before\t$number_characters_after\t$percent_left\t$sequence_state\n" ;
print "\tDone : $file cut to ALICUT_$file_fasta\n"
}
close OUTinfo ;
# Print the number of correctly handled FASTA files relative to the total number of files
printf "\n%68s\n", "------------------------------------------------------------" ;
printf "%42s\n", "$j FASTA file(s) correctly handled!" ;
printf "%57s\n", "Further infos are printed out in Alicut_info.txt!" ;
printf "\n%63s\n", "ALICUT V2.0 Finished! Thank you and good bye!" ;
printf "%68s\n", "------------------------------------------------------------" ;
&set_timer ;
exit ;
sub tie_linefeeds{
my $sref_filename = $_[0] ;
( open IN , "<$$sref_filename" ) or warn "\tError: can not open $$sref_filename!\n" and next READING ;
(tie ( my @data, 'Tie::File', $$sref_filename )) ;
warn "\t!FILE-WARN!: $$sref_filename is empty!\n" and next READING if 0 == @data ;
map { s/\r\n/\n/g } @data ;
map { s/\r/\n/g } @data ;
untie @data ; close IN ;
}
sub set_timer{
my ( $user, $system, $cuser, $csystem ) = times ;
print <<TIME;
*** time used: $user sec ***
TIME
}
sub translate_cut_positions {
my $sref_command_codon_remove = $_[0] ;
my $sref_command_third_remove = $_[1] ;
my $aref_cut_positions = $_[2] ;
my $sref_number_characters = $_[3] ;
my $aref_remaining_positions = $_[4] ;
my $sref_sequence_state = $_[5] ;
my $sref_filename = $_[6] ;
# Translate identified RSS amino acid positions to nucleotide triplet positions
if ( $$sref_command_codon_remove =~ /yes/ && $$sref_command_third_remove =~ /no/){
unless ( $$sref_sequence_state =~ /aa/ ){
my @fasta_old = @$aref_cut_positions ; @$aref_cut_positions = ();
for my $number( @fasta_old ){
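# amino acid position n corresponds to nucleotide positions 3n-2, 3n-1 and 3n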
my $newno1 = ($number*3)-2;
my $newno2 = $newno1+1;
my $newno3 = $newno2+1;
push @$aref_cut_positions, ( $newno1, $newno2, $newno3 )
}
my $string_cutnumbers = join " ", @$aref_cut_positions ;
open OUTnewcut, ">ALICUT_cut_positions_codon.txt" or die "\n\t!FILE-ERROR!: Can not open File ALICUT_cut_positions_codon.txt" ;
print OUTnewcut $string_cutnumbers ; close OUTnewcut ;
}
else { warn "\n\t!FILE-WARN!: $$sref_filename includes aa sequences!\n\tCodon positions not translated!" }
}
# Translate identified RSS amino acid positions to nucleotide triplet positions, but remove only the third position
elsif ( $$sref_command_codon_remove =~ /yes/ && $$sref_command_third_remove =~ /yes/){
unless ( $$sref_sequence_state =~ /aa/ ){
my @fasta_old = @$aref_cut_positions ; @$aref_cut_positions = ();
for my $number( @fasta_old ){
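# the third codon position of amino acid n is nucleotide position 3n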
push @$aref_cut_positions, ($number*3)
}
my $string_cutnumbers = join " ", @$aref_cut_positions ;
open OUTnewcut, ">ALICUT_cut_positions_codon_3rd.txt" or die "\n\t!FILE-ERROR!: Can not open File ALICUT_cut_positions_codon_3rd.txt" ;
print OUTnewcut $string_cutnumbers ; close OUTnewcut ;
}
else { warn "\n\t!FILE-WARN!: $$sref_filename includes aa sequences!\n\tCodon positions not translated!\n\t3rd codon position not removed!" }
}
# Remove identified RSS positions only where they fall on the third codon position of the original sequence
elsif ( $$sref_command_codon_remove =~ /no/ && $$sref_command_third_remove =~ /yes/){
unless ( $$sref_sequence_state =~ /aa/ ){
my @fasta_old = @$aref_cut_positions ; @$aref_cut_positions = ();
for my $number( @fasta_old ){
if ( $number % 3 == 0 ){ push @$aref_cut_positions, $number }
}
my $string_cutnumbers = join " ", @$aref_cut_positions ;
open OUTnewcut, ">ALICUT_cut_positions_3rd.txt" or die "\n\t!FILE-ERROR!: Can not open File ALICUT_cut_positions_3rd.txt" ;
print OUTnewcut $string_cutnumbers ; close OUTnewcut
}
else { warn "\n\t!FILE-WARN!: $$sref_filename includes aa sequences!\n\tAll RSS positions removed, not only 3rd codon positions!" }
}
# Examine remaining positions
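# convert the 1-based cut positions to 0-based indices and keep every index not marked as cut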
my ( %seen, @zahlenreihe ) ;
for ( 1 .. $$sref_number_characters ) { push @zahlenreihe, $_-1 }
for my $value ( @$aref_cut_positions ){ $seen{$value-1}++ }
for ( @zahlenreihe ){ unless ( $seen{$_} ){ push @$aref_remaining_positions, $_ } }
}
sub fas_bearbeiten{
my @infile = @_ ;
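# Linearize FASTA: tag each header with a tab, strip spaces and newlines,
# then restore line breaks so that each header is followed by its complete
# sequence on a single line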
grep s/(\>.*)/$1\t/, @infile ;
grep s/ //g, @infile ;
grep s/\n//g, @infile ;
grep s/\t/\n/g, @infile ;
grep s/\>/\n\>/g, @infile ;
my $string = join "", @infile ;
@infile = split "\n", $string ;
shift @infile ;
return @infile ;
}
sub structure_handling{
my $sref_string = $_[0] ;
my $sref_answer_remain = $_[1] ;
my $aref_cut_positions = $_[2] ;
my $sref_filename = $_[3] ;
my (
@pair_infos ,
@forward ,
@structurestring ,
@loops ,
@pairs ,
%structure_of_position ,
%seen_struc
);
# Stem assignment
my @structures = split "", $$sref_string ;
my $i = 0 ;
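# Pair stems with a stack: each '(' position is pushed and popped by its
# matching ')'; '.' marks unpaired loop positions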
CHECKING:
for ( @structures ){ $i++ ;
SWITCH:
$structure_of_position{$i} = $_ ;
if ( $_ =~ /\(/ ){ push @forward, $i and next CHECKING }
if ( $_ =~ /\)/ ){ my $pair_1 = pop @forward; push @pairs, ( $pair_1, $i ); push @pair_infos, ( $pair_1.":".$i ); next CHECKING }
if ( $_ =~ /\./ ){ push @loops, $i and next CHECKING }
}
@pair_infos = reverse @pair_infos ;
# Generate listfiles for structure_info file
my $pairlist = join "\n\t\t\t\t\t", @pair_infos ;
my $looplist = join "\n\t\t\t\t\t", @loops ;
# Number and proportion of stem and loop positions for structure info file
my $N_total = @structures ;
my $N_stems = @pair_infos ;
my $N_loops = $N_total - ( $N_stems * 2 ) ;
my $P_loops = sprintf "%.1f", ( $N_loops / $N_total ) * 100 ;
my $P_stems = sprintf "%.1f", 100 - $P_loops ;
# Open structure info outfile
open OUTstruc, ">ALICUT_Struc_info_${$sref_filename}.txt" ;
# Print out
print OUTstruc "\nOriginal structure information identified in $$sref_filename:\n\n" ;
print OUTstruc "- Number of characters:\t\t\t$N_total\n" ;
print OUTstruc "- Number of single loop characters:\t$N_loops [$P_stems %]\n" ;
print OUTstruc "- Number of paired stem characters:\t$N_stems [$P_loops %]\n" ;
print OUTstruc "\n- Paired stem positions:\t\t$pairlist\n\n" ;
print OUTstruc "\n- Loop positions:\t\t\t$looplist\n" ;
close OUTstruc;
if ( $$sref_answer_remain =~ /yes/i ){
my @cut_positions2 = ();
# Keep RSS-identified stem positions within the MSA
for ( @pairs ){ $seen_struc{$_} = 1 }
for ( @$aref_cut_positions ){ unless ( $seen_struc{$_} ){ push @cut_positions2, $_ } }
@$aref_cut_positions = @cut_positions2 ;
}
else{
my %pair = @pairs;
# If one side of a base pair is an RSS-identified position, mark its partner as unpaired (dot)
for my $bp_for ( keys %pair ){
for my $rss ( @$aref_cut_positions ){
if ( $bp_for == $rss ){ $structure_of_position{$pair{$bp_for}} = "." ; last }
if ( $pair{$bp_for} == $rss ){ $structure_of_position{$bp_for} = "." ; last }
}
}
}
for ( my $k=1; $k<=@structures; $k++ ){ push @structurestring, $structure_of_position{$k} }
my $structure_string_neu = join "", @structurestring ;
return $structure_string_neu ;
}
sub commandline{
print "\n\tCOMMAND:\t " ;
chomp ( my $sub_answer_opening = <STDIN> );
print "\n\t------------------------------------------------------------\n" ;
return $sub_answer_opening;
}

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -0,0 +1,174 @@
#!/usr/bin/env python3
"""
Query NCBI for available genome assemblies by taxon name
Usage:
python query_ncbi_assemblies.py --taxon "Coleoptera"
python query_ncbi_assemblies.py --taxon "Drosophila" --max-results 50
python query_ncbi_assemblies.py --taxon "Apis" --refseq-only
Requires: ncbi-datasets-pylib (pip install ncbi-datasets-pylib)
Author: Bruno de Medeiros (Field Museum)
"""
import argparse
import sys
def query_assemblies_by_taxon(taxon, max_results=20, refseq_only=False):
"""
Query NCBI for genome assemblies of a given taxon
Args:
taxon: Taxon name (e.g., "Coleoptera", "Drosophila melanogaster")
max_results: Maximum number of results to return
refseq_only: If True, only return RefSeq assemblies (GCF_*)
Returns:
List of dictionaries with assembly information
"""
try:
from ncbi.datasets import GenomeApi
from ncbi.datasets.openapi import ApiClient, ApiException
except ImportError:
print("Error: ncbi-datasets-pylib not installed", file=sys.stderr)
print("Install with: pip install ncbi-datasets-pylib", file=sys.stderr)
sys.exit(1)
assemblies = []
print(f"Querying NCBI for '{taxon}' genome assemblies...")
print(f"(Limiting to {max_results} results)")
if refseq_only:
print("(RefSeq assemblies only)")
print("")
try:
with ApiClient() as api_client:
api = GenomeApi(api_client)
# Query genome assemblies for the taxon
genome_summary = api.genome_summary_by_taxon(
taxon=taxon,
limit=str(max_results),
filters_refseq_only=refseq_only
)
if not genome_summary.reports:
print(f"No assemblies found for taxon '{taxon}'")
return []
for report in genome_summary.reports:
assembly_info = {
'accession': report.accession,
'organism': report.organism.organism_name,
'assembly_level': report.assembly_info.assembly_level,
'assembly_name': report.assembly_info.assembly_name,
'release_date': report.assembly_info.release_date if hasattr(report.assembly_info, 'release_date') else 'N/A'
}
assemblies.append(assembly_info)
except ApiException as e:
print(f"Error querying NCBI: {e}", file=sys.stderr)
sys.exit(1)
except Exception as e:
print(f"Unexpected error: {e}", file=sys.stderr)
sys.exit(1)
return assemblies
def format_table(assemblies):
"""
Format assemblies as a readable table
Args:
assemblies: List of assembly dictionaries
"""
if not assemblies:
return
print(f"Found {len(assemblies)} assemblies:\n")
# Print header
print(f"{'#':<4} {'Accession':<20} {'Organism':<40} {'Level':<15} {'Assembly Name':<30}")
print("-" * 110)
# Print data rows
for i, asm in enumerate(assemblies, 1):
organism = asm['organism'][:38] + '..' if len(asm['organism']) > 40 else asm['organism']
assembly_name = asm['assembly_name'][:28] + '..' if len(asm['assembly_name']) > 30 else asm['assembly_name']
print(f"{i:<4} {asm['accession']:<20} {organism:<40} {asm['assembly_level']:<15} {assembly_name:<30}")
print("")
def save_accessions(assemblies, output_file):
"""
Save assembly accessions to a file
Args:
assemblies: List of assembly dictionaries
output_file: Output file path
"""
with open(output_file, 'w') as f:
for asm in assemblies:
f.write(f"{asm['accession']}\n")
print(f"Accessions saved to: {output_file}")
print(f"You can download these assemblies using:")
print(f" python download_ncbi_genomes.py --assemblies $(cat {output_file})")
def main():
parser = argparse.ArgumentParser(
description="Query NCBI for available genome assemblies by taxon name",
epilog="Example: python query_ncbi_assemblies.py --taxon 'Coleoptera' --max-results 50"
)
parser.add_argument(
"--taxon",
required=True,
help="Taxon name (e.g., 'Coleoptera', 'Drosophila melanogaster')"
)
parser.add_argument(
"--max-results",
type=int,
default=20,
help="Maximum number of results to return (default: 20)"
)
parser.add_argument(
"--refseq-only",
action="store_true",
help="Only return RefSeq assemblies (GCF_* accessions)"
)
parser.add_argument(
"--save",
metavar="FILE",
help="Save accessions to a file for later download"
)
args = parser.parse_args()
# Query NCBI
assemblies = query_assemblies_by_taxon(
taxon=args.taxon,
max_results=args.max_results,
refseq_only=args.refseq_only
)
# Display results
format_table(assemblies)
# Save if requested
if args.save and assemblies:
save_accessions(assemblies, args.save)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,240 @@
#!/usr/bin/env python3
"""
Rename genome files with clean, meaningful sample names for phylogenomics
This script helps create a mapping between genome files (often with cryptic
accession numbers) and clean species/sample names that will appear in the
final phylogenetic tree.
Usage:
# Interactive mode - prompts for names
python rename_genomes.py --interactive genome1.fasta genome2.fasta
# From mapping file (TSV: old_name<TAB>new_name)
python rename_genomes.py --mapping samples.tsv
# Create template mapping file
python rename_genomes.py --create-template *.fasta > samples.tsv
Author: Bruno de Medeiros (Field Museum)
Based on tutorials by Paul Frandsen (BYU)
"""
import argparse
import os
import sys
import shutil
from pathlib import Path
def sanitize_name(name):
"""
Sanitize a name to be phylogenomics-safe
- Replace spaces with underscores
- Remove special characters
- Keep only alphanumeric, underscore, hyphen
"""
# Replace spaces with underscores
name = name.replace(' ', '_')
# Remove special characters except underscore and hyphen
name = ''.join(c for c in name if c.isalnum() or c in '_-')
return name
def create_template(genome_files, output=sys.stdout):
"""Create a template mapping file"""
output.write("# Sample mapping file\n")
output.write("# Format: original_filename<TAB>new_sample_name\n")
output.write("# Edit the second column with meaningful species/sample names\n")
output.write("# Recommended format: [ACCESSION]_[NAME] (e.g., GCA000123456_Penstemon_eatonii)\n")
output.write("# This keeps accession for traceability while having readable names in trees\n")
output.write("# Names should contain only letters, numbers, underscores, and hyphens\n")
output.write("#\n")
for gfile in genome_files:
basename = Path(gfile).stem # Remove extension
output.write(f"{gfile}\t{basename}\n")
def read_mapping(mapping_file):
"""Read mapping from TSV file"""
mapping = {}
with open(mapping_file, 'r') as f:
for line in f:
line = line.strip()
# Skip comments and empty lines
if not line or line.startswith('#'):
continue
parts = line.split('\t')
if len(parts) != 2:
print(f"Warning: Skipping invalid line: {line}", file=sys.stderr)
continue
old_name, new_name = parts
new_name = sanitize_name(new_name)
mapping[old_name] = new_name
return mapping
def interactive_rename(genome_files):
"""Interactively ask for new names"""
mapping = {}
print("Enter new sample names for each genome file.")
print("Press Enter to keep the current name.")
print("Names will be sanitized (spaces→underscores, special chars removed)\n")
for gfile in genome_files:
current_name = Path(gfile).stem
new_name = input(f"{gfile} → [{current_name}]: ").strip()
if not new_name:
new_name = current_name
new_name = sanitize_name(new_name)
mapping[gfile] = new_name
print(f" Will rename to: {new_name}.fasta\n")
return mapping
def rename_files(mapping, dry_run=False, backup=True):
"""Rename genome files according to mapping"""
renamed = []
errors = []
for old_file, new_name in mapping.items():
if not os.path.exists(old_file):
errors.append(f"File not found: {old_file}")
continue
# Get extension from original file
ext = Path(old_file).suffix
if not ext:
ext = '.fasta'
new_file = f"{new_name}{ext}"
# Check if target exists
if os.path.exists(new_file) and new_file != old_file:
errors.append(f"Target exists: {new_file}")
continue
# Skip if names are the same
if old_file == new_file:
print(f"Skip (no change): {old_file}")
continue
if dry_run:
print(f"[DRY RUN] Would rename: {old_file}{new_file}")
else:
# Backup if requested
if backup:
backup_file = f"{old_file}.backup"
shutil.copy2(old_file, backup_file)
print(f"Backup created: {backup_file}")
# Rename
shutil.move(old_file, new_file)
print(f"Renamed: {old_file}{new_file}")
renamed.append((old_file, new_file))
return renamed, errors
def main():
parser = argparse.ArgumentParser(
description="Rename genome files with meaningful sample names for phylogenomics",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
# Create template mapping file
python rename_genomes.py --create-template *.fasta > samples.tsv
# Edit samples.tsv, then apply mapping
python rename_genomes.py --mapping samples.tsv
# Interactive renaming
python rename_genomes.py --interactive genome1.fasta genome2.fasta
# Dry run (preview changes)
python rename_genomes.py --mapping samples.tsv --dry-run
"""
)
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument(
'--create-template',
nargs='+',
metavar='GENOME',
help='Create a template mapping file from genome files'
)
group.add_argument(
'--mapping',
metavar='FILE',
help='TSV file with mapping (old_name<TAB>new_name)'
)
group.add_argument(
'--interactive',
nargs='+',
metavar='GENOME',
help='Interactively rename genome files'
)
parser.add_argument(
'--dry-run',
action='store_true',
help='Show what would be renamed without actually renaming'
)
parser.add_argument(
'--no-backup',
action='store_true',
help='Do not create backup files'
)
args = parser.parse_args()
# Create template
if args.create_template:
create_template(args.create_template)
return
# Interactive mode
if args.interactive:
mapping = interactive_rename(args.interactive)
# Mapping file mode
elif args.mapping:
mapping = read_mapping(args.mapping)
else:
parser.error("No mode specified")
if not mapping:
print("No files to rename", file=sys.stderr)
return
# Perform renaming
renamed, errors = rename_files(
mapping,
dry_run=args.dry_run,
backup=not args.no_backup
)
# Summary
print("\n" + "="*60)
if args.dry_run:
print("DRY RUN - No files were actually renamed")
else:
print(f"Successfully renamed {len(renamed)} file(s)")
if errors:
print(f"\nErrors ({len(errors)}):")
for error in errors:
print(f" - {error}")
sys.exit(1)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,247 @@
#!/bin/bash
# run_alicut.sh
# Wrapper script for running ALICUT to remove Aliscore-identified RSS positions
# Removes randomly similar sequence sections from alignments
#
# Usage:
# bash run_alicut.sh [aliscore_dir] [options]
#
# Options:
# -r Remain stem positions (for RNA secondary structures)
# -c Remove codon (translate AA positions to nucleotide triplets)
# -3 Remove only 3rd codon positions
# -s Silent mode (non-interactive, use defaults)
#
# Requirements:
# - ALICUT_V2.31.pl in PATH or same directory
# - Perl with File::Copy, Tie::File, Term::Cap modules
# - Aliscore output directory with *_List_*.txt and original .fas file
set -euo pipefail
# Script directory
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Check for ALICUT script
if command -v ALICUT_V2.31.pl &> /dev/null; then
ALICUT_SCRIPT="ALICUT_V2.31.pl"
elif [ -f "${SCRIPT_DIR}/ALICUT_V2.31.pl" ]; then
ALICUT_SCRIPT="${SCRIPT_DIR}/ALICUT_V2.31.pl"
elif [ -f "./ALICUT_V2.31.pl" ]; then
ALICUT_SCRIPT="./ALICUT_V2.31.pl"
else
echo "ERROR: ALICUT_V2.31.pl not found in PATH, script directory, or current directory"
echo "Please download from: https://www.zfmk.de/en/research/research-centres-and-groups/alicut"
exit 1
fi
# Function to display usage
usage() {
cat <<EOF
Usage: $0 [aliscore_dir] [options]
Run ALICUT to remove Aliscore-identified randomly similar sequence sections.
Arguments:
aliscore_dir Directory containing Aliscore output files
Options:
-r Remain stem positions in RNA secondary structure alignments
-c Remove entire codon (translates AA RSS positions to nt triplets)
-3 Remove only 3rd codon position of identified RSS
-s Silent/scripted mode (non-interactive, use defaults)
-h Display this help message
Input Requirements:
The aliscore_dir must contain:
- Original FASTA alignment file (*.fas)
- Aliscore List file (*_List_random.txt or *_List_*.txt)
Examples:
# Basic usage (interactive mode)
bash run_alicut.sh aliscore_alignment1
# Silent mode with defaults
bash run_alicut.sh aliscore_alignment1 -s
# Remain RNA stem positions
bash run_alicut.sh aliscore_16S -r -s
# Remove entire codons (for back-translation)
bash run_alicut.sh aliscore_protein1 -c -s
# Process all Aliscore output directories
for dir in aliscore_*/; do
bash run_alicut.sh "\${dir}" -s
done
Output Files (in aliscore_dir):
- ALICUT_[alignment].fas : Trimmed alignment
- ALICUT_info.xls : Statistics (taxa, positions removed, etc.)
- ALICUT_Struc_info_*.txt : Structure information (if RNA detected)
Citation:
Kück P, Meusemann K, Dambach J, Thormann B, von Reumont BM, Wägele JW,
Misof B (2010) Parametric and non-parametric masking of randomness in
sequence alignments can be improved and leads to better resolved trees.
Front Zool 7:10. doi: 10.1186/1742-9994-7-10
EOF
exit 0
}
# Parse command line arguments
ALISCORE_DIR=""
ALICUT_OPTS=""
SILENT_MODE=false
if [ $# -eq 0 ]; then
usage
fi
ALISCORE_DIR="$1"
shift
# Validate directory exists
if [ ! -d "${ALISCORE_DIR}" ]; then
echo "ERROR: Aliscore directory not found: ${ALISCORE_DIR}"
exit 1
fi
# Parse ALICUT options
while [ $# -gt 0 ]; do
case "$1" in
-h|--help)
usage
;;
-r)
ALICUT_OPTS="${ALICUT_OPTS} -r"
shift
;;
-c)
ALICUT_OPTS="${ALICUT_OPTS} -c"
shift
;;
-3)
ALICUT_OPTS="${ALICUT_OPTS} -3"
shift
;;
-s|--silent)
SILENT_MODE=true
ALICUT_OPTS="${ALICUT_OPTS} -s"
shift
;;
*)
echo "ERROR: Unknown option: $1"
usage
;;
esac
done
# Change to Aliscore output directory
cd "${ALISCORE_DIR}"
echo "Processing Aliscore output in: ${ALISCORE_DIR}"
# Find List file
LIST_FILE=$(ls *_List_*.txt 2>/dev/null | head -n 1 || true)  # '|| true' keeps set -e from aborting before the error message below
if [ -z "${LIST_FILE}" ]; then
echo "ERROR: No Aliscore List file found (*_List_*.txt)"
echo "Make sure Aliscore completed successfully"
exit 1
fi
echo "Found List file: ${LIST_FILE}"
# Find original FASTA file
FASTA_FILE=$(find . -maxdepth 1 \( -name "*.fas" -o -name "*.fasta" \) ! -name "ALICUT_*" -type f | head -n 1 | sed 's|^\./||')
if [ -z "${FASTA_FILE}" ]; then
echo "ERROR: No FASTA alignment file found (*.fas or *.fasta)"
echo "ALICUT requires the original alignment file in the same directory as List file"
exit 1
fi
echo "Found FASTA file: ${FASTA_FILE}"
# Check if List file contains RSS positions
RSS_COUNT=$(wc -w < "${LIST_FILE}" || echo "0")
if [ "${RSS_COUNT}" -eq 0 ]; then
echo "WARNING: List file is empty (no RSS positions identified)"
echo "Aliscore found no randomly similar sequences to remove"
echo "Skipping ALICUT - alignment is already clean"
# Create a symbolic link to indicate no trimming was needed
ln -sf "${FASTA_FILE}" "ALICUT_${FASTA_FILE}"
echo "Created symbolic link: ALICUT_${FASTA_FILE} -> ${FASTA_FILE}"
cd ..
exit 0
fi
echo "Found ${RSS_COUNT} RSS positions to remove"
# Run ALICUT
echo ""
echo "Running ALICUT..."
echo "Options: ${ALICUT_OPTS}"
# Construct ALICUT command
ALICUT_CMD="perl ${ALICUT_SCRIPT} ${ALICUT_OPTS}"
if [ "${SILENT_MODE}" = true ]; then
echo "Command: ${ALICUT_CMD}"
eval ${ALICUT_CMD}
else
echo "Running ALICUT in interactive mode..."
echo "Press 's' and Enter to start with current options"
echo ""
perl "${ALICUT_SCRIPT}" ${ALICUT_OPTS}
fi
# Check if ALICUT completed successfully
if [ $? -eq 0 ]; then
echo ""
echo "ALICUT completed successfully"
# Find output file
OUTPUT_FILE=$(ls ALICUT_*.fas ALICUT_*.fasta 2>/dev/null | head -n 1 || true)
if [ -n "${OUTPUT_FILE}" ]; then
echo ""
echo "Output files:"
ls -lh ALICUT_* 2>/dev/null
# Calculate and report trimming statistics (handle multi-line FASTA format)
if [ -f "${OUTPUT_FILE}" ]; then
ORIGINAL_LENGTH=$(awk '/^>/ {if (seq) {print seq; seq=""}; next} {seq = seq $0} END {if (seq) print seq}' "${FASTA_FILE}" | head -n 1 | tr -d '\n' | wc -c)
TRIMMED_LENGTH=$(awk '/^>/ {if (seq) {print seq; seq=""}; next} {seq = seq $0} END {if (seq) print seq}' "${OUTPUT_FILE}" | head -n 1 | tr -d '\n' | wc -c)
REMOVED_LENGTH=$((ORIGINAL_LENGTH - TRIMMED_LENGTH))
PERCENT_REMOVED=$(awk "BEGIN {printf \"%.1f\", (${REMOVED_LENGTH}/${ORIGINAL_LENGTH})*100}")
echo ""
echo "Trimming statistics:"
echo " Original length: ${ORIGINAL_LENGTH} bp"
echo " Trimmed length: ${TRIMMED_LENGTH} bp"
echo " Removed: ${REMOVED_LENGTH} bp (${PERCENT_REMOVED}%)"
fi
# Check for info file
if [ -f "ALICUT_info.xls" ]; then
echo ""
echo "Detailed statistics in: ALICUT_info.xls"
fi
else
echo "WARNING: Expected output file ALICUT_*.fas not found"
fi
else
echo "ERROR: ALICUT failed"
cd ..
exit 1
fi
# Return to parent directory
cd ..
echo ""
echo "Done: ${ALISCORE_DIR}"

View File

@@ -0,0 +1,248 @@
#!/bin/bash
# run_aliscore.sh
# Wrapper script for running Aliscore on aligned sequences
# Identifies randomly similar sequence sections (RSS) in multiple sequence alignments
#
# Usage:
# bash run_aliscore.sh [alignment.fas] [options]
#
# Options:
# -w INT Window size (default: 4)
# -r INT Number of random pairs to compare (default: 4*N taxa)
# -N Treat gaps as ambiguous characters (recommended for amino acids)
# -t TREE Tree file in Newick format for guided comparisons
# -l LEVEL Node level for tree-based comparisons
# -o TAXA Comma-separated list of outgroup taxa
#
# Array job usage:
# Set SLURM_ARRAY_TASK_ID or PBS_ARRAYID environment variable
# Create locus_list.txt with one alignment file per line
#
# Requirements:
# - Aliscore.02.2.pl in PATH or same directory
# - Perl with Tie::File and Fcntl modules
set -euo pipefail
# Script directory
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Check for Aliscore script
if command -v Aliscore.02.2.pl &> /dev/null; then
ALISCORE_SCRIPT="Aliscore.02.2.pl"
elif [ -f "${SCRIPT_DIR}/Aliscore.02.2.pl" ]; then
ALISCORE_SCRIPT="${SCRIPT_DIR}/Aliscore.02.2.pl"
elif [ -f "./Aliscore.02.2.pl" ]; then
ALISCORE_SCRIPT="./Aliscore.02.2.pl"
else
echo "ERROR: Aliscore.02.2.pl not found in PATH, script directory, or current directory"
echo "Please download from: https://www.zfmk.de/en/research/research-centres-and-groups/aliscore"
exit 1
fi
# Function to display usage
usage() {
cat <<EOF
Usage: $0 [alignment.fas] [options]
Run Aliscore to identify randomly similar sequence sections in alignments.
Options:
-d DIR Base output directory for all Aliscore results (default: aliscore_output)
-w INT Window size for sliding window analysis (default: 4)
-r INT Number of random sequence pairs to compare (default: 4*N taxa)
-N Treat gaps as ambiguous characters (recommended for amino acids)
-t FILE Tree file in Newick format for phylogeny-guided comparisons
-l LEVEL Node level limit for tree-based comparisons (default: all)
-o TAXA Comma-separated list of outgroup taxa for focused comparisons
-h Display this help message
Array Job Mode:
If SLURM_ARRAY_TASK_ID or PBS_ARRAYID is set, reads alignment from locus_list.txt
Create locus_list.txt with: ls *.fas > locus_list.txt
Examples:
# Basic run with defaults (outputs to aliscore_output/)
bash run_aliscore.sh alignment.fas
# Amino acid sequences with gaps as ambiguous
bash run_aliscore.sh protein_alignment.fas -N
# Custom output directory
bash run_aliscore.sh alignment.fas -d my_aliscore_results
# Custom window size and random pairs
bash run_aliscore.sh alignment.fas -w 6 -r 100
# Tree-guided analysis
bash run_aliscore.sh alignment.fas -t species.tre
# Array job on SLURM
ls aligned_aa/*.fas > locus_list.txt
sbatch --array=1-\$(wc -l < locus_list.txt) run_aliscore_array.job
Output Files (in aliscore_output/aliscore_[alignment]/):
- [alignment]_List_random.txt : Positions identified as RSS (for ALICUT)
- [alignment]_Profile_random.txt: Quality profile for each position
- [alignment].svg : Visual plot of scoring profiles
Citation:
Misof B, Misof K (2009) A Monte Carlo approach successfully identifies
randomness in multiple sequence alignments: a more objective means of data
exclusion. Syst Biol 58(1):21-34. doi: 10.1093/sysbio/syp006
EOF
exit 0
}
# Parse command line arguments
ALIGNMENT=""
ALISCORE_OPTS=""
BASE_OUTPUT_DIR="aliscore_output"
if [ $# -eq 0 ]; then
usage
fi
# Check for array job mode
ARRAY_MODE=false
ARRAY_ID=""
if [ -n "${SLURM_ARRAY_TASK_ID:-}" ]; then
ARRAY_MODE=true
ARRAY_ID="${SLURM_ARRAY_TASK_ID}"
elif [ -n "${PBS_ARRAYID:-}" ]; then
ARRAY_MODE=true
ARRAY_ID="${PBS_ARRAYID}"
fi
# If in array mode, get alignment from locus list
if [ "${ARRAY_MODE}" = true ]; then
if [ ! -f "locus_list.txt" ]; then
echo "ERROR: Array job mode requires locus_list.txt"
echo "Create with: ls *.fas > locus_list.txt"
exit 1
fi
ALIGNMENT=$(sed -n "${ARRAY_ID}p" locus_list.txt)
if [ -z "${ALIGNMENT}" ]; then
echo "ERROR: Could not read alignment for array index ${ARRAY_ID}"
exit 1
fi
echo "Array job ${ARRAY_ID}: Processing ${ALIGNMENT}"
# All command-line arguments are parsed below as Aliscore options
else
# First argument is alignment file
ALIGNMENT="$1"
shift
fi
# Validate alignment file exists
if [ ! -f "${ALIGNMENT}" ]; then
echo "ERROR: Alignment file not found: ${ALIGNMENT}"
exit 1
fi
# Parse Aliscore options
while [ $# -gt 0 ]; do
case "$1" in
-h|--help)
usage
;;
-d|--output-dir)
BASE_OUTPUT_DIR="$2"
shift 2
;;
-w)
ALISCORE_OPTS="${ALISCORE_OPTS} -w $2"
shift 2
;;
-r)
ALISCORE_OPTS="${ALISCORE_OPTS} -r $2"
shift 2
;;
-N)
ALISCORE_OPTS="${ALISCORE_OPTS} -N"
shift
;;
-t)
if [ ! -f "$2" ]; then
echo "ERROR: Tree file not found: $2"
exit 1
fi
ALISCORE_OPTS="${ALISCORE_OPTS} -t $2"
shift 2
;;
-l)
ALISCORE_OPTS="${ALISCORE_OPTS} -l $2"
shift 2
;;
-o)
ALISCORE_OPTS="${ALISCORE_OPTS} -o $2"
shift 2
;;
*)
echo "ERROR: Unknown option: $1"
usage
;;
esac
done
# Get alignment name without extension
ALIGNMENT_NAME=$(basename "${ALIGNMENT}" .fas)
ALIGNMENT_NAME=$(basename "${ALIGNMENT_NAME}" .fasta)
# Create base output directory and specific directory for this alignment
mkdir -p "${BASE_OUTPUT_DIR}"
OUTPUT_DIR="${BASE_OUTPUT_DIR}/aliscore_${ALIGNMENT_NAME}"
mkdir -p "${OUTPUT_DIR}"
# Copy alignment to output directory
cp "${ALIGNMENT}" "${OUTPUT_DIR}/"
# Change to output directory
cd "${OUTPUT_DIR}"
# Run Aliscore
echo "Running Aliscore on ${ALIGNMENT}..."
echo "Options: ${ALISCORE_OPTS}"
echo "Aliscore script: ${ALISCORE_SCRIPT}"
# Construct and run Aliscore command
ALISCORE_CMD="perl -I${SCRIPT_DIR} ${ALISCORE_SCRIPT} -i $(basename ${ALIGNMENT}) ${ALISCORE_OPTS}"
echo "Command: ${ALISCORE_CMD}"
ALISCORE_STATUS=0
eval ${ALISCORE_CMD} || ALISCORE_STATUS=$?
# Check if Aliscore completed successfully ('|| ...' above keeps set -e from aborting early)
if [ ${ALISCORE_STATUS} -eq 0 ]; then
echo "Aliscore completed successfully for ${ALIGNMENT}"
# List output files
echo ""
echo "Output files in ${OUTPUT_DIR}:"
ls -lh *List*.txt *Profile*.txt *.svg 2>/dev/null || echo " (some expected files not generated)"
# Report RSS positions if found
if [ -f "$(basename ${ALIGNMENT})_List_random.txt" ]; then
RSS_COUNT=$(wc -w < "$(basename ${ALIGNMENT})_List_random.txt")
echo ""
echo "Identified ${RSS_COUNT} randomly similar sequence positions"
echo "See: ${OUTPUT_DIR}/$(basename ${ALIGNMENT})_List_random.txt"
fi
else
echo "ERROR: Aliscore failed for ${ALIGNMENT}"
cd ..
exit 1
fi
# Return to parent directory
cd ..
echo "Done: ${ALIGNMENT} -> ${OUTPUT_DIR}"

View File

@@ -0,0 +1,270 @@
#!/bin/bash
# run_aliscore_alicut_batch.sh
# Batch processing script for Aliscore + ALICUT alignment trimming
# Processes all alignments in a directory through both tools sequentially
#
# Usage:
# bash run_aliscore_alicut_batch.sh [alignment_dir] [options]
#
# This script:
# 1. Runs Aliscore on all alignments to identify RSS
# 2. Runs ALICUT on each Aliscore output to remove RSS
# 3. Collects trimmed alignments in output directory
#
# Requirements:
# - run_aliscore.sh and run_alicut.sh in same directory or PATH
# - Aliscore.02.2.pl and ALICUT_V2.31.pl available
set -euo pipefail
# Script directory
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# Function to display usage
usage() {
cat <<EOF
Usage: $0 [alignment_dir] [options]
Batch process multiple alignments through Aliscore and ALICUT.
Arguments:
alignment_dir Directory containing aligned FASTA files (*.fas)
Options:
-o DIR Output directory for trimmed alignments (default: aliscore_alicut_trimmed)
-d DIR Base directory for Aliscore outputs (default: aliscore_output)
-w INT Aliscore window size (default: 4)
-r INT Aliscore random pairs (default: 4*N)
-N Aliscore: treat gaps as ambiguous (recommended for AA)
--remain-stems ALICUT: remain RNA stem positions
--remove-codon ALICUT: remove entire codons (for back-translation)
--remove-3rd ALICUT: remove only 3rd codon positions
-h Display this help message
Examples:
# Basic usage for amino acid alignments
bash run_aliscore_alicut_batch.sh aligned_aa/ -N
# Custom window size
bash run_aliscore_alicut_batch.sh aligned_aa/ -w 6 -N
# With RNA structure preservation
bash run_aliscore_alicut_batch.sh aligned_rrna/ --remain-stems
Output:
- aliscore_output/aliscore_[locus]/ : Individual Aliscore results per locus
- aliscore_alicut_trimmed/ : Final trimmed alignments
- aliscore_alicut_trimmed/trimming_summary.txt : Statistics for all loci
EOF
exit 0
}
# Default parameters
ALIGNMENT_DIR=""
OUTPUT_DIR="aliscore_alicut_trimmed"
ALISCORE_BASE_DIR="aliscore_output"
ALISCORE_OPTS=""
ALICUT_OPTS="-s" # Silent mode by default
if [ $# -eq 0 ]; then
usage
fi
ALIGNMENT_DIR="$1"
shift
# Validate alignment directory
if [ ! -d "${ALIGNMENT_DIR}" ]; then
echo "ERROR: Alignment directory not found: ${ALIGNMENT_DIR}"
exit 1
fi
# Parse options
while [ $# -gt 0 ]; do
case "$1" in
-h|--help)
usage
;;
-o|--output)
OUTPUT_DIR="$2"
shift 2
;;
-d|--aliscore-dir)
ALISCORE_BASE_DIR="$2"
shift 2
;;
-w)
ALISCORE_OPTS="${ALISCORE_OPTS} -w $2"
shift 2
;;
-r)
ALISCORE_OPTS="${ALISCORE_OPTS} -r $2"
shift 2
;;
-N)
ALISCORE_OPTS="${ALISCORE_OPTS} -N"
shift
;;
--remain-stems)
ALICUT_OPTS="${ALICUT_OPTS} -r"
shift
;;
--remove-codon)
ALICUT_OPTS="${ALICUT_OPTS} -c"
shift
;;
--remove-3rd)
ALICUT_OPTS="${ALICUT_OPTS} -3"
shift
;;
*)
echo "ERROR: Unknown option: $1"
usage
;;
esac
done
# Check for wrapper scripts
RUN_ALISCORE="${SCRIPT_DIR}/run_aliscore.sh"
RUN_ALICUT="${SCRIPT_DIR}/run_alicut.sh"
if [ ! -f "${RUN_ALISCORE}" ]; then
echo "ERROR: run_aliscore.sh not found: ${RUN_ALISCORE}"
exit 1
fi
if [ ! -f "${RUN_ALICUT}" ]; then
echo "ERROR: run_alicut.sh not found: ${RUN_ALICUT}"
exit 1
fi
# Create output directory
mkdir -p "${OUTPUT_DIR}"
# Find all FASTA files
ALIGNMENTS=($(find "${ALIGNMENT_DIR}" -maxdepth 1 \( -name "*.fas" -o -name "*.fasta" \)))
if [ ${#ALIGNMENTS[@]} -eq 0 ]; then
echo "ERROR: No FASTA files found in ${ALIGNMENT_DIR}"
exit 1
fi
echo "Found ${#ALIGNMENTS[@]} alignments to process"
echo "Aliscore options: ${ALISCORE_OPTS}"
echo "ALICUT options: ${ALICUT_OPTS}"
echo ""
# Initialize summary file
SUMMARY_FILE="${OUTPUT_DIR}/trimming_summary.txt"
echo -e "Locus\tOriginal_Length\tTrimmed_Length\tRemoved_Positions\tPercent_Removed\tRSS_Count" > "${SUMMARY_FILE}"
# Process each alignment
SUCCESS_COUNT=0
FAIL_COUNT=0
for ALIGNMENT in "${ALIGNMENTS[@]}"; do
LOCUS=$(basename "${ALIGNMENT}" .fas)
LOCUS=$(basename "${LOCUS}" .fasta)
echo "=========================================="
echo "Processing: ${LOCUS}"
echo "=========================================="
# Step 1: Run Aliscore
echo ""
echo "Step 1/2: Running Aliscore..."
if bash "${RUN_ALISCORE}" "${ALIGNMENT}" -d "${ALISCORE_BASE_DIR}" ${ALISCORE_OPTS}; then
echo "Aliscore completed for ${LOCUS}"
else
echo "ERROR: Aliscore failed for ${LOCUS}"
FAIL_COUNT=$((FAIL_COUNT + 1))
continue
fi
# Step 2: Run ALICUT
echo ""
echo "Step 2/2: Running ALICUT..."
ALISCORE_DIR="${ALISCORE_BASE_DIR}/aliscore_${LOCUS}"
if [ ! -d "${ALISCORE_DIR}" ]; then
echo "ERROR: Aliscore output directory not found: ${ALISCORE_DIR}"
FAIL_COUNT=$((FAIL_COUNT + 1))
continue
fi
if bash "${RUN_ALICUT}" "${ALISCORE_DIR}" ${ALICUT_OPTS}; then
echo "ALICUT completed for ${LOCUS}"
else
echo "ERROR: ALICUT failed for ${LOCUS}"
FAIL_COUNT=$((FAIL_COUNT + 1))
continue
fi
# Copy trimmed alignment to output directory
TRIMMED_FILE=$(find "${ALISCORE_DIR}" \( -name "ALICUT_*.fas" -o -name "ALICUT_*.fasta" \) | head -n 1)
if [ -n "${TRIMMED_FILE}" ] && [ -f "${TRIMMED_FILE}" ]; then
cp "${TRIMMED_FILE}" "${OUTPUT_DIR}/${LOCUS}_trimmed.fas"
echo "Trimmed alignment: ${OUTPUT_DIR}/${LOCUS}_trimmed.fas"
# Calculate statistics (handle multi-line FASTA format)
ORIGINAL_LENGTH=$(awk '/^>/ {if (seq) {print seq; seq=""}; next} {seq = seq $0} END {if (seq) print seq}' "${ALIGNMENT}" | head -n 1 | tr -d ' \n' | wc -c)
TRIMMED_LENGTH=$(awk '/^>/ {if (seq) {print seq; seq=""}; next} {seq = seq $0} END {if (seq) print seq}' "${TRIMMED_FILE}" | head -n 1 | tr -d ' \n' | wc -c)
REMOVED_LENGTH=$((ORIGINAL_LENGTH - TRIMMED_LENGTH))
PERCENT_REMOVED=$(awk "BEGIN {printf \"%.2f\", (${REMOVED_LENGTH}/${ORIGINAL_LENGTH})*100}")
# Count RSS positions
LIST_FILE=$(find "${ALISCORE_DIR}" -name "*_List_*.txt" | head -n 1)
RSS_COUNT=$(wc -w < "${LIST_FILE}" 2>/dev/null || echo "0")
# Append to summary
echo -e "${LOCUS}\t${ORIGINAL_LENGTH}\t${TRIMMED_LENGTH}\t${REMOVED_LENGTH}\t${PERCENT_REMOVED}\t${RSS_COUNT}" >> "${SUMMARY_FILE}"
SUCCESS_COUNT=$((SUCCESS_COUNT + 1))
else
echo "WARNING: Trimmed file not found for ${LOCUS}"
FAIL_COUNT=$((FAIL_COUNT + 1))
fi
echo ""
done
# Final report
echo "=========================================="
echo "BATCH PROCESSING COMPLETE"
echo "=========================================="
echo ""
echo "Successfully processed: ${SUCCESS_COUNT}/${#ALIGNMENTS[@]} alignments"
echo "Failed: ${FAIL_COUNT}/${#ALIGNMENTS[@]} alignments"
echo ""
echo "Output directory: ${OUTPUT_DIR}"
echo "Trimmed alignments: ${OUTPUT_DIR}/*_trimmed.fas"
echo "Summary statistics: ${SUMMARY_FILE}"
echo ""
# Display summary statistics
if [ ${SUCCESS_COUNT} -gt 0 ]; then
echo "Overall trimming statistics:"
awk 'NR>1 {
total_orig += $2;
total_trim += $3;
total_removed += $4;
count++
}
END {
if (count > 0) {
avg_removed = (total_removed / total_orig) * 100;
printf " Total positions before: %d\n", total_orig;
printf " Total positions after: %d\n", total_trim;
printf " Total removed: %d (%.2f%%)\n", total_removed, avg_removed;
printf " Average per locus: %.2f%% removed\n", avg_removed;
}
}' "${SUMMARY_FILE}"
fi
echo ""
echo "Done!"

View File

@@ -0,0 +1,125 @@
# Phylogenomics Workflow Templates
This directory contains template scripts for running the phylogenomics pipeline across different computing environments.
## Directory Structure
```
templates/
├── slurm/ # SLURM job scheduler templates
├── pbs/ # PBS/Torque job scheduler templates
└── local/ # Local machine templates (with GNU parallel support)
```
## Template Naming Convention
Templates follow a consistent naming pattern: `NN_step_name[_variant].ext`
- `NN`: Step number (e.g., `02` for compleasm, `08a` for partition search)
- `step_name`: Descriptive name of the pipeline step
- `_variant`: Optional variant (e.g., `_first`, `_parallel`, `_serial`)
- `.ext`: File extension (`.job` for schedulers, `.sh` for local scripts)
## Available Templates
### Step 2: Ortholog Identification (compleasm)
**SLURM:**
- `02_compleasm_first.job` - Process first genome to download lineage database
- `02_compleasm_parallel.job` - Array job for remaining genomes
**PBS:**
- `02_compleasm_first.job` - Process first genome to download lineage database
- `02_compleasm_parallel.job` - Array job for remaining genomes
**Local:**
- `02_compleasm_first.sh` - Process first genome to download lineage database
- `02_compleasm_parallel.sh` - GNU parallel for remaining genomes
### Step 8A: Partition Model Selection
**SLURM:**
- `08a_partition_search.job` - IQ-TREE partition model search with TESTMERGEONLY
**PBS:**
- `08a_partition_search.job` - IQ-TREE partition model search with TESTMERGEONLY
**Local:**
- `08a_partition_search.sh` - IQ-TREE partition model search with TESTMERGEONLY
### Step 8C: Individual Gene Trees
**SLURM:**
- `08c_gene_trees_array.job` - Array job for parallel gene tree estimation
**PBS:**
- `08c_gene_trees_array.job` - Array job for parallel gene tree estimation
**Local:**
- `08c_gene_trees_parallel.sh` - GNU parallel for gene tree estimation
- `08c_gene_trees_serial.sh` - Serial processing (for debugging/limited resources)
## Placeholders
Templates contain placeholders that must be replaced with user-specific values:
| Placeholder | Description | Example |
|-------------|-------------|---------|
| `TOTAL_THREADS` | Total CPU cores available | `64` |
| `THREADS_PER_JOB` | Threads per concurrent job | `16` |
| `NUM_GENOMES` | Number of genomes in analysis | `20` |
| `NUM_LOCI` | Number of loci/alignments | `2795` |
| `LINEAGE` | BUSCO lineage dataset | `insecta_odb10` |
| `MODEL_SET` | Comma-separated substitution models | `LG,WAG,JTT,Q.pfam` |
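If you prefer to fill a template by hand rather than through Claude, a plain `sed` substitution is enough; a minimal sketch (file paths and values here are illustrative, not prescribed):

```bash
# Replace placeholders in a SLURM compleasm template with concrete values
sed -e 's/TOTAL_THREADS/64/g' \
    -e 's/LINEAGE/insecta_odb10/g' \
    templates/slurm/02_compleasm_first.job > 02_compleasm_first.job
```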
## Usage
### For Claude (LLM)
When a user requests scripts for a specific computing environment:
1. **Read the appropriate template** using the Read tool
2. **Replace placeholders** with user-specified values
3. **Present the customized script** to the user
4. **Provide setup instructions** (e.g., how many genomes, how to calculate thread allocation)
Example:
```python
# Read template
template = Read("templates/slurm/02_compleasm_first.job")
# Replace placeholders
script = template.replace("TOTAL_THREADS", "64")
script = script.replace("LINEAGE", "insecta_odb10")
# Present to user
print(script)
```
### For Users
Templates are not meant to be used directly. Instead:
1. Follow the workflow in `SKILL.md`
2. Answer Claude's questions about your setup
3. Claude will fetch the appropriate template and customize it for you
4. Copy the customized script Claude provides
## Benefits of This Structure
1. **Reduced token usage**: Claude only reads templates when needed
2. **Easier maintenance**: Update one template file instead of multiple locations in SKILL.md
3. **Consistency**: All users get the same base template structure
4. **Clarity**: Separate files are easier to review than inline code
5. **Extensibility**: Easy to add new templates for additional tools or variants
## Adding New Templates
When adding new templates:
1. **Follow naming convention**: `NN_descriptive_name[_variant].ext`
2. **Include clear comments**: Explain what the script does
3. **Use consistent placeholders**: Match existing placeholder names
4. **Test thoroughly**: Ensure placeholders are complete and correct
5. **Update this README**: Add the new template to the "Available Templates" section
6. **Update SKILL.md**: Reference the new template in the appropriate workflow step

View File

@@ -0,0 +1,26 @@
#!/bin/bash
# run_compleasm_first.sh
source ~/.bashrc
conda activate phylo
# User-specified total CPU threads
TOTAL_THREADS=TOTAL_THREADS # Replace with total cores you want to use (e.g., 16, 32, 64)
echo "Processing first genome with ${TOTAL_THREADS} CPU threads to download lineage database..."
# Create output directory
mkdir -p 01_busco_results
# Process FIRST genome only
first_genome=$(head -n 1 genome_list.txt)
genome_name=$(basename ${first_genome} .fasta)
echo "Processing: ${genome_name}"
compleasm run \
-a ${first_genome} \
-o 01_busco_results/${genome_name}_compleasm \
-l LINEAGE \
-t ${TOTAL_THREADS}
echo ""
echo "First genome complete! Lineage database is now cached."
echo "Now run the parallel script for remaining genomes: bash run_compleasm_parallel.sh"

View File

@@ -0,0 +1,33 @@
#!/bin/bash
# run_compleasm_parallel.sh
source ~/.bashrc
conda activate phylo
# Threading configuration (adjust based on your system)
TOTAL_THREADS=TOTAL_THREADS # Total cores to use (e.g., 64)
THREADS_PER_JOB=THREADS_PER_JOB # Threads per genome (e.g., 16)
CONCURRENT_JOBS=$((TOTAL_THREADS / THREADS_PER_JOB)) # Calculated automatically
echo "Configuration:"
echo " Total threads: ${TOTAL_THREADS}"
echo " Threads per genome: ${THREADS_PER_JOB}"
echo " Concurrent genomes: ${CONCURRENT_JOBS}"
echo ""
# Create output directory
mkdir -p 01_busco_results
# Process remaining genomes (skip first one) in parallel
tail -n +2 genome_list.txt | parallel -j ${CONCURRENT_JOBS} '
genome_name=$(basename {} .fasta)
echo "Processing ${genome_name} with THREADS_PER_JOB threads..."
compleasm run \
-a {} \
-o 01_busco_results/${genome_name}_compleasm \
-l LINEAGE \
-t THREADS_PER_JOB
'
echo ""
echo "All genomes processed!"

View File

@@ -0,0 +1,20 @@
#!/bin/bash
source ~/.bashrc
conda activate phylo
cd 06_concatenation
iqtree \
-s FcC_supermatrix.fas \
-spp partition_def.txt \
-nt 18 \
-safe \
-pre partition_search \
-m TESTMERGEONLY \
-mset MODEL_SET \
-msub nuclear \
-rcluster 10 \
-bb 1000 \
-alrt 1000
echo "Partition search complete! Best scheme: partition_search.best_scheme.nex"

View File

@@ -0,0 +1,17 @@
#!/bin/bash
source ~/.bashrc
conda activate phylo
cd trimmed_aa
# Create list of alignments
ls *_trimmed.fas > locus_alignments.txt
# Run IQ-TREE in parallel (adjust -j for number of concurrent jobs)
cat locus_alignments.txt | parallel -j 4 '
prefix=$(basename {} _trimmed.fas)
iqtree -s {} -m MFP -bb 1000 -bnni -czb -pre ${prefix} -nt 1
echo "Tree complete: ${prefix}"
'
echo "All gene trees complete!"

View File

@@ -0,0 +1,13 @@
#!/bin/bash
source ~/.bashrc
conda activate phylo
cd trimmed_aa
for locus in *_trimmed.fas; do
prefix=$(basename ${locus} _trimmed.fas)
echo "Processing ${prefix}..."
iqtree -s ${locus} -m MFP -bb 1000 -bnni -czb -pre ${prefix} -nt 1
done
echo "All gene trees complete!"

View File

@@ -0,0 +1,27 @@
#!/bin/bash
#PBS -N compleasm_first
#PBS -l nodes=1:ppn=TOTAL_THREADS # Replace with total available CPUs (e.g., 64)
#PBS -l mem=384gb # Adjust based on ppn × 6GB
#PBS -l walltime=24:00:00
cd $PBS_O_WORKDIR
source ~/.bashrc
conda activate phylo
mkdir -p logs
mkdir -p 01_busco_results
# Process FIRST genome only (downloads lineage database)
first_genome=$(head -n 1 genome_list.txt)
genome_name=$(basename ${first_genome} .fasta)
echo "Processing first genome: ${genome_name} with $PBS_NUM_PPN threads..."
echo "This will download the BUSCO lineage database for subsequent runs."
compleasm run \
-a ${first_genome} \
-o 01_busco_results/${genome_name}_compleasm \
-l LINEAGE \
-t $PBS_NUM_PPN
echo "First genome complete! Lineage database is now cached."
echo "Submit the parallel job for remaining genomes: qsub run_compleasm_parallel.job"

View File

@@ -0,0 +1,24 @@
#!/bin/bash
#PBS -N compleasm_parallel
#PBS -t 2-NUM_GENOMES # Start from genome 2 (first genome already processed)
#PBS -l nodes=1:ppn=THREADS_PER_JOB # e.g., 16 for 64-core system
#PBS -l mem=96gb # Adjust based on ppn × 6GB
#PBS -l walltime=48:00:00
cd $PBS_O_WORKDIR
source ~/.bashrc
conda activate phylo
mkdir -p 01_busco_results
# Get genome for this array task
genome=$(sed -n "${PBS_ARRAYID}p" genome_list.txt)
genome_name=$(basename ${genome} .fasta)
echo "Processing ${genome_name} with $PBS_NUM_PPN threads..."
compleasm run \
-a ${genome} \
-o 01_busco_results/${genome_name}_compleasm \
-l LINEAGE \
-t $PBS_NUM_PPN

View File

@@ -0,0 +1,22 @@
#!/bin/bash
#PBS -N iqtree_partition
#PBS -l nodes=1:ppn=18
#PBS -l mem=72gb
#PBS -l walltime=72:00:00
cd $PBS_O_WORKDIR/06_concatenation
source ~/.bashrc
conda activate phylo
iqtree \
-s FcC_supermatrix.fas \
-spp partition_def.txt \
-nt 18 \
-safe \
-pre partition_search \
-m TESTMERGEONLY \
-mset MODEL_SET \
-msub nuclear \
-rcluster 10 \
-bb 1000 \
-alrt 1000

View File

@@ -0,0 +1,26 @@
#!/bin/bash
#PBS -N iqtree_genes
#PBS -t 1-NUM_LOCI
#PBS -l nodes=1:ppn=1
#PBS -l mem=4gb
#PBS -l walltime=2:00:00
cd $PBS_O_WORKDIR/trimmed_aa
source ~/.bashrc
conda activate phylo
# Create list of alignments if not present
if [ ! -f locus_alignments.txt ]; then
ls *_trimmed.fas > locus_alignments.txt
fi
locus=$(sed -n "${PBS_ARRAYID}p" locus_alignments.txt)
iqtree \
-s ${locus} \
-m MFP \
-bb 1000 \
-bnni \
-czb \
-pre $(basename ${locus} _trimmed.fas) \
-nt 1

View File

@@ -0,0 +1,28 @@
#!/bin/bash
#SBATCH --job-name=compleasm_first
#SBATCH --cpus-per-task=TOTAL_THREADS # Replace with total available CPUs (e.g., 64)
#SBATCH --mem-per-cpu=6G
#SBATCH --time=24:00:00
#SBATCH --output=logs/compleasm_first.%j.out
#SBATCH --error=logs/compleasm_first.%j.err
source ~/.bashrc
conda activate phylo
mkdir -p logs
mkdir -p 01_busco_results
# Process FIRST genome only (downloads lineage database)
first_genome=$(head -n 1 genome_list.txt)
genome_name=$(basename ${first_genome} .fasta)
echo "Processing first genome: ${genome_name} with ${SLURM_CPUS_PER_TASK} threads..."
echo "This will download the BUSCO lineage database for subsequent runs."
compleasm run \
-a ${first_genome} \
-o 01_busco_results/${genome_name}_compleasm \
-l LINEAGE \
-t ${SLURM_CPUS_PER_TASK}
echo "First genome complete! Lineage database is now cached."
echo "Submit the parallel job for remaining genomes: sbatch run_compleasm_parallel.job"

View File

@@ -0,0 +1,25 @@
#!/bin/bash
#SBATCH --job-name=compleasm_parallel
#SBATCH --array=2-NUM_GENOMES # Start from genome 2 (first genome already processed)
#SBATCH --cpus-per-task=THREADS_PER_JOB # e.g., 16 for 64-core system with 4 concurrent jobs
#SBATCH --mem-per-cpu=6G
#SBATCH --time=48:00:00
#SBATCH --output=logs/compleasm.%A_%a.out
#SBATCH --error=logs/compleasm.%A_%a.err
source ~/.bashrc
conda activate phylo
mkdir -p 01_busco_results
# Get genome for this array task (skipping the first one)
genome=$(sed -n "${SLURM_ARRAY_TASK_ID}p" genome_list.txt)
genome_name=$(basename ${genome} .fasta)
echo "Processing ${genome_name} with ${SLURM_CPUS_PER_TASK} threads..."
compleasm run \
-a ${genome} \
-o 01_busco_results/${genome_name}_compleasm \
-l LINEAGE \
-t ${SLURM_CPUS_PER_TASK}

View File

@@ -0,0 +1,27 @@
#!/bin/bash
#SBATCH --job-name=iqtree_partition
#SBATCH --cpus-per-task=18
#SBATCH --mem-per-cpu=4G
#SBATCH --time=72:00:00
#SBATCH --output=logs/partition_search.out
#SBATCH --error=logs/partition_search.err
source ~/.bashrc
conda activate phylo
cd 06_concatenation # Use organized directory structure
iqtree \
-s FcC_supermatrix.fas \
-spp partition_def.txt \
-nt ${SLURM_CPUS_PER_TASK} \
-safe \
-pre partition_search \
-m TESTMERGEONLY \
-mset MODEL_SET \
-msub nuclear \
-rcluster 10 \
-bb 1000 \
-alrt 1000
# Output: partition_search.best_scheme.nex

View File

@@ -0,0 +1,28 @@
#!/bin/bash
#SBATCH --job-name=iqtree_genes
#SBATCH --array=1-NUM_LOCI
#SBATCH --cpus-per-task=1
#SBATCH --mem-per-cpu=4G
#SBATCH --time=2:00:00
#SBATCH --output=logs/%A_%a.genetree.out
source ~/.bashrc
conda activate phylo
cd trimmed_aa
# Create list of alignments if not present
if [ ! -f locus_alignments.txt ]; then
ls *_trimmed.fas > locus_alignments.txt
fi
locus=$(sed -n "${SLURM_ARRAY_TASK_ID}p" locus_alignments.txt)
iqtree \
-s ${locus} \
-m MFP \
-bb 1000 \
-bnni \
-czb \
-pre $(basename ${locus} _trimmed.fas) \
-nt 1