# BioGeoBEARS Biogeographic Analysis Skill

A Claude skill for setting up and executing phylogenetic biogeographic analyses using BioGeoBEARS in R.

## Overview

This skill automates the complete workflow for biogeographic analysis on phylogenetic trees, from raw data validation to publication-ready visualizations. It helps users reconstruct ancestral geographic ranges by:

- Validating and reformatting input files (phylogenetic tree + geographic distribution data)
- Setting up organized analysis folder structures
- Generating customized RMarkdown analysis scripts
- Guiding parameter selection (maximum range size, model choices)
- Producing visualizations with pie charts and text labels showing ancestral ranges
- Comparing multiple biogeographic models with statistical tests

## When to Use

Use this skill when you need to:

- Reconstruct ancestral geographic ranges on a phylogeny
- Test different biogeographic models (DEC, DIVALIKE, BAYAREALIKE)
- Analyze how species distributions evolved over time
- Determine whether founder-event speciation (+J parameter) is important
- Generate publication-ready biogeographic visualizations

## Required Inputs

Users must provide:

1. **Phylogenetic tree** (Newick format: .nwk, .tre, or .tree)
   - Must be rooted
   - Tip labels must match the species names in the geography file
   - Branch lengths required

2. **Geographic distribution data** (any tabular format)
   - Species names matching tree tips
   - Presence/absence data for different geographic areas
   - Accepts CSV, TSV, Excel, or PHYLIP format

## What the Skill Does

### 1. Data Validation and Reformatting

The skill includes a Python script (`validate_geography_file.py`) that:

- Validates geography file format (PHYLIP-like with specific tab/spacing requirements)
- Checks for common errors (spaces in species names, tab delimiters, binary code length)
- Reformats CSV/TSV files to proper BioGeoBEARS format
- Cross-validates species names against tree tip labels

### 2. Analysis Setup

Creates an organized directory structure:

```
biogeobears_analysis/
├── input/
│   ├── tree.nwk                 # Phylogenetic tree
│   ├── geography.data           # Validated geography file
│   └── original_data/           # Original input files
├── scripts/
│   └── run_biogeobears.Rmd      # Customized RMarkdown script
├── results/                     # Analysis outputs
│   ├── [MODEL]_result.Rdata     # Saved model results
│   └── plots/                   # Visualizations
│       ├── [MODEL]_pie.pdf
│       └── [MODEL]_text.pdf
└── README.md                    # Documentation
```

### 3. RMarkdown Analysis Template

Generates a complete RMarkdown script that:

- Loads and validates input data
- Fits six biogeographic models:
  - DEC (Dispersal-Extinction-Cladogenesis)
  - DEC+J (DEC with founder-event speciation)
  - DIVALIKE (vicariance-focused)
  - DIVALIKE+J
  - BAYAREALIKE (sympatry-focused)
  - BAYAREALIKE+J
- Compares models using AIC, AICc, and AIC weights
- Performs likelihood ratio tests for nested models (each base model against its +J counterpart)
- Estimates parameters (d = dispersal, e = extinction, j = founder-event speciation)
- Generates visualizations on the phylogeny
- Creates an HTML report with all results

### 4. Visualization

Produces two types of plots:

- **Pie charts**: Show probability distributions for ancestral ranges (conveys uncertainty)
- **Text labels**: Show maximum likelihood ancestral states (cleaner, easier to read)

Colors represent geographic areas:

- Single areas: Bright primary colors
- Multi-area ranges: Blended colors
- All areas: White

## Workflow

1. **Gather information**: Ask the user for the tree file, geography file, and parameters
2. **Validate tree**: Check that it is rooted and extract tip labels
3. **Validate/reformat geography file**: Use the validation script to check the format or convert from CSV/TSV
4. **Set up analysis folder**: Create the organized directory structure
5. **Generate RMarkdown script**: Customize the template with user parameters
6. **Create documentation**: Generate the README and the scripts for running the analysis
7. **Provide instructions**: Give clear steps for running the analysis

## Analysis Parameters

The skill helps users choose:

### Maximum Range Size

- How many areas can a species occupy simultaneously?
- Options: Conservative (number of areas minus 1), Permissive (all areas), Data-driven (maximum observed)
- Larger values sharply increase computation time, because the number of possible ranges grows combinatorially with the number of areas and the maximum range size

### Models to Compare

- Default: All six models (recommended for comprehensive comparison)
- Alternative: Only base models or only +J models
- Rationale: Model comparison is key to biogeographic inference

### Visualization Type

- Pie charts (show probabilities and uncertainty)
- Text labels (show most likely states, cleaner)
- Both (default in template)

## Bundled Resources

### scripts/

**validate_geography_file.py**

- Validates BioGeoBEARS geography file format
- Reformats from CSV/TSV to PHYLIP
- Cross-validates with tree tip labels
- Usage: `python validate_geography_file.py --help`

**biogeobears_analysis_template.Rmd**

- Complete RMarkdown analysis template
- Parameterized via YAML header
- Fits all models, compares, and visualizes
- Generates self-contained HTML report

### references/

**biogeobears_details.md**

- Detailed model descriptions (DEC, DIVALIKE, BAYAREALIKE, +J parameter)
- Input file format specifications with examples
- Parameter interpretation guidelines
- Plotting options and customization
- Complete citations for publications
- Computational considerations and troubleshooting

## Example Output

The analysis produces:

- `biogeobears_report.html` - Interactive HTML report with all results
- `[MODEL]_result.Rdata` - Saved R objects for each model
- `plots/[MODEL]_pie.pdf` - Ancestral ranges shown as pie charts on tree
- `plots/[MODEL]_text.pdf` - Ancestral ranges shown as text labels on tree

## Interpretation Guidance

The skill helps users understand:

### Model Selection

- **AIC weights**: Relative probability that each model is the best of the candidate set
- **ΔAIC thresholds** (rule of thumb): <2 (essentially equivalent), 2-7 (less support), >10 (essentially no support)

### Parameter Estimates

- **d (dispersal)**: Rate of range expansion
- **e (extinction)**: Rate of local extinction (range contraction)
- **j (founder-event)**: Relative weight of jump dispersal at speciation
- **d/e ratio**: >1 favors expansion, <1 favors contraction

### Statistical Tests

- **LRT p < 0.05**: The +J parameter significantly improves fit over the corresponding base model
- **Model uncertainty**: Report results from multiple models if their weights are similar

## Installation Requirements

Users must have:

- R (≥4.0)
- BioGeoBEARS R package
- Supporting R packages: ape, rmarkdown, knitr, kableExtra
- Python 3 (for validation script)

Installation instructions are included in generated README.md files.

## Expected Runtime

**Skill setup time**: 5-10 minutes (file validation and directory setup)

**Analysis runtime** (separate from skill execution):

- Small datasets (<50 tips, ≤5 areas): 10-30 minutes
- Medium datasets (50-100 tips, 5-6 areas): 30-90 minutes
- Large datasets (>100 tips, >5 areas): 1-6 hours

## Common Issues Handled

The skill troubleshoots:

- Species name mismatches between tree and geography file
- Unrooted trees (guides user to root with outgroup)
- Geography file formatting errors (tabs, spaces, binary codes)
- Optimization convergence failures
- Slow runtime with many areas/tips

## Citations

Based on:

- **BioGeoBEARS** package by Nicholas Matzke
- Tutorial resources from http://phylo.wikidot.com/biogeobears
- Example workflows from BioGeoBEARS GitHub repository

## Skill Details

- **Skill Type**: Workflow-based bioinformatics skill
- **Domain**: Phylogenetic biogeography, historical biogeography
- **Output**: Complete analysis setup with scripts, documentation, and ready-to-run workflow
- **Automation Level**: High (validates, reformats, generates all scripts)
- **User Input Required**: File paths and parameter choices via guided questions

## See Also

- [phylo_from_buscos](../phylo_from_buscos/README.md) - Complementary skill for generating phylogenies from genomes
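
As a supplement to the Model Selection and Statistical Tests guidance above, the following is a minimal Python sketch of how AIC, AICc, Akaike weights, and a one-parameter likelihood ratio test relate to the log-likelihoods of fitted models. It is not code from this skill or from BioGeoBEARS (the generated RMarkdown template performs the comparison in R). The function names are hypothetical, the log-likelihoods are made-up numbers for illustration only, and the AICc sample size is assumed here to be the number of tips.

```python
import math


def aic(lnl, k):
    """Akaike information criterion from a log-likelihood and parameter count."""
    return 2 * k - 2 * lnl


def aicc(lnl, k, n):
    """Small-sample corrected AIC; n is the sample size (assumed: number of tips)."""
    return aic(lnl, k) + (2 * k * (k + 1)) / (n - k - 1)


def akaike_weights(scores):
    """Convert a {model: AIC or AICc} mapping into weights that sum to 1."""
    best = min(scores.values())
    rel = {m: math.exp(-0.5 * (s - best)) for m, s in scores.items()}
    total = sum(rel.values())
    return {m: r / total for m, r in rel.items()}


def lrt_pvalue_1df(lnl_null, lnl_alt):
    """P-value of a likelihood ratio test for nested models differing by one parameter."""
    stat = 2 * (lnl_alt - lnl_null)
    if stat <= 0:
        return 1.0  # the larger model fits no better than the nested one
    # Survival function of a chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2))


# Hypothetical values for illustration only (not real analysis results).
N_TIPS = 20
fits = {"DEC": (-40.0, 2), "DEC+J": (-35.0, 3)}  # model: (lnL, free parameters)

aicc_scores = {m: aicc(lnl, k, N_TIPS) for m, (lnl, k) in fits.items()}
weights = akaike_weights(aicc_scores)
p_value = lrt_pvalue_1df(fits["DEC"][0], fits["DEC+J"][0])

for model in fits:
    print(f"{model}: AICc={aicc_scores[model]:.2f}, weight={weights[model]:.3f}")
print(f"LRT DEC vs DEC+J: p={p_value:.4f}")
```

The likelihood ratio test applies only to nested pairs, such as a base model and its +J counterpart, whereas Akaike weights can rank all fitted models at once, which is why the report presents both.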