BioGeoBEARS Biogeographic Analysis Skill
A Claude skill for setting up and executing phylogenetic biogeographic analyses using BioGeoBEARS in R.
Overview
This skill automates the complete workflow for biogeographic analysis on phylogenetic trees, from raw data validation to publication-ready visualizations. It helps users reconstruct ancestral geographic ranges by:
- Validating and reformatting input files (phylogenetic tree + geographic distribution data)
- Setting up organized analysis folder structures
- Generating customized RMarkdown analysis scripts
- Guiding parameter selection (maximum range size, model choices)
- Producing visualizations with pie charts and text labels showing ancestral ranges
- Comparing multiple biogeographic models with statistical tests
When to Use
Use this skill when you need to:
- Reconstruct ancestral geographic ranges on a phylogeny
- Test different biogeographic models (DEC, DIVALIKE, BAYAREALIKE)
- Analyze how species distributions evolved over time
- Determine whether founder-event speciation (+J parameter) is important
- Generate publication-ready biogeographic visualizations
Required Inputs
Users must provide:
- Phylogenetic tree (Newick format: .nwk, .tre, or .tree)
- Must be rooted
- Tip labels must match species in geography file
- Branch lengths required
- Geographic distribution data (any tabular format)
- Species names matching tree tips
- Presence/absence data for different geographic areas
- Accepts CSV, TSV, Excel, or PHYLIP format
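As an illustration of the reformatting involved, the sketch below converts a species-by-area presence/absence CSV into the PHYLIP-like layout commonly used for BioGeoBEARS geography files (a header with taxon count, area count, and area labels, then one tab-separated row per species). It is not the bundled script; the function name `csv_to_geography` and the example table are hypothetical, and it assumes the first CSV column holds species names and the remaining cells are 0 or 1.

```python
import csv
import io


def csv_to_geography(csv_text: str) -> str:
    """Convert a species-by-area presence/absence CSV into PHYLIP-like text."""
    rows = [r for r in csv.reader(io.StringIO(csv_text.strip())) if r]
    header, records = rows[0], rows[1:]
    areas = header[1:]  # assumption: first column holds species names
    # Header line: <ntaxa><TAB><nareas> (<area labels separated by spaces>)
    lines = [f"{len(records)}\t{len(areas)} ({' '.join(areas)})"]
    for species, *flags in records:
        # Spaces in names tend to break parsing, so replace them with underscores
        name = species.strip().replace(" ", "_")
        code = "".join("1" if flag.strip() == "1" else "0" for flag in flags)
        lines.append(f"{name}\t{code}")
    return "\n".join(lines) + "\n"


# Hypothetical input table with three species and three areas
example_csv = """species,A,B,C
Homo sapiens,1,1,0
Pan troglodytes,1,0,0
Pongo abelii,0,0,1
"""
geography_text = csv_to_geography(example_csv)
print(geography_text)
```

Replacing spaces with underscores here mirrors the requirement that species names match the tree tip labels exactly, so the same substitution would need to hold in the tree.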
What the Skill Does
1. Data Validation and Reformatting
The skill includes a Python script (validate_geography_file.py) that:
- Validates geography file format (PHYLIP-like with specific tab/spacing requirements)
- Checks for common errors (spaces in species names, missing tab delimiters, wrong binary code length)
- Reformats CSV/TSV files to proper BioGeoBEARS format
- Cross-validates species names against tree tip labels
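The kinds of checks listed above can be sketched as follows. This is a minimal illustration, not the contents of validate_geography_file.py: the function `check_geography` and its messages are hypothetical, and the header pattern assumes the usual `<ntaxa><TAB><nareas> (A B C)` layout.

```python
import re


def check_geography(text: str) -> list[str]:
    """Return a list of problems found in PHYLIP-like geography text."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["file is empty"]
    header = re.fullmatch(r"(\d+)\t(\d+) \((.+)\)", lines[0])
    if header is None:
        return ["header must look like '<ntaxa><TAB><nareas> (A B C)'"]
    n_taxa, n_areas = int(header.group(1)), int(header.group(2))
    problems = []
    if len(header.group(3).split()) != n_areas:
        problems.append("area label count does not match declared number of areas")
    if len(lines) - 1 != n_taxa:
        problems.append(f"header declares {n_taxa} taxa but {len(lines) - 1} rows found")
    for row in lines[1:]:
        name, sep, code = row.partition("\t")
        if not sep:
            problems.append(f"{row!r}: name and code must be separated by a tab")
        elif " " in name:
            problems.append(f"{name!r}: species name contains spaces")
        elif not re.fullmatch(f"[01]{{{n_areas}}}", code):
            problems.append(f"{name!r}: code must be {n_areas} characters of 0/1")
    return problems


good = "2\t3 (A B C)\nsp_one\t110\nsp_two\t001\n"
bad = "2\t3 (A B C)\nsp one\t110\nsp_two\t01\n"
print(check_geography(good))
print(check_geography(bad))
```

Returning a list of messages rather than raising on the first error lets a user fix every formatting problem in one pass.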
2. Analysis Setup
Creates an organized directory structure:
biogeobears_analysis/
├── input/
│   ├── tree.nwk                  # Phylogenetic tree
│   ├── geography.data            # Validated geography file
│   └── original_data/            # Original input files
├── scripts/
│   └── run_biogeobears.Rmd       # Customized RMarkdown script
├── results/                      # Analysis outputs
│   ├── [MODEL]_result.Rdata      # Saved model results
│   └── plots/                    # Visualizations
│       ├── [MODEL]_pie.pdf
│       └── [MODEL]_text.pdf
└── README.md                     # Documentation
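A layout like this can be created with a few lines of Python. The sketch below is only an illustration of the folder structure shown above; the helper name `create_analysis_dirs` is hypothetical, and it builds the folders in a temporary directory so that running it leaves nothing behind.

```python
from pathlib import Path
import tempfile


def create_analysis_dirs(root: Path) -> Path:
    """Create the analysis folder skeleton under root and return its path."""
    base = root / "biogeobears_analysis"
    for sub in ("input/original_data", "scripts", "results/plots"):
        # parents=True also creates input/ and results/ along the way
        (base / sub).mkdir(parents=True, exist_ok=True)
    return base


with tempfile.TemporaryDirectory() as tmp:
    base = create_analysis_dirs(Path(tmp))
    created = sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_dir()
    )
    print(created)
```

Using `exist_ok=True` makes the setup step safe to rerun on an existing analysis folder.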
3. RMarkdown Analysis Template
Generates a complete RMarkdown script that:
- Loads and validates input data
- Fits 6 biogeographic models:
- DEC (Dispersal-Extinction-Cladogenesis)
- DEC+J (DEC with founder-event speciation)
- DIVALIKE (vicariance-focused)
- DIVALIKE+J
- BAYAREALIKE (sympatry-focused)
- BAYAREALIKE+J
- Compares models using AIC, AICc, and AIC weights
- Performs likelihood ratio tests for nested models
- Estimates parameters (d = dispersal rate, e = extinction rate, j = founder-event weight)
- Generates visualizations on the phylogeny
- Creates HTML report with all results
4. Visualization
Produces two types of plots:
- Pie charts: Show probability distributions for ancestral ranges (conveys uncertainty)
- Text labels: Show maximum likelihood ancestral states (cleaner, easier to read)
Colors represent geographic areas:
- Single areas: Bright primary colors
- Multi-area ranges: Blended colors
- All areas: White
Workflow
- Gather information: Ask user for tree file, geography file, and parameters
- Validate tree: Check if rooted and extract tip labels
- Validate/reformat geography file: Use validation script to check format or convert from CSV/TSV
- Set up analysis folder: Create organized directory structure
- Generate RMarkdown script: Customize template with user parameters
- Create documentation: Generate the README and the scripts used to run the analysis
- Provide instructions: Clear steps for running the analysis
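The tree validation and name cross-checking steps can be approximated without any phylogenetics library. The sketch below is a simplified illustration, not the skill's own code: `tip_labels` and `root_degree` are hypothetical helpers, the regular expression assumes unquoted labels without spaces, and treating a root with exactly two children as "rooted" follows a common convention rather than anything stated in this README.

```python
import re


def tip_labels(newick: str) -> list[str]:
    """Extract tip labels from a Newick string with unquoted labels."""
    # A tip label directly follows "(" or ","; internal labels follow ")"
    return re.findall(r"[(,]\s*([^(),:;\s]+)", newick)


def root_degree(newick: str) -> int:
    """Count the children of the root node (2 suggests a rooted tree)."""
    depth, children = 0, 1
    for ch in newick:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 1:
            children += 1
    return children


# Hypothetical tree and geography species set
tree = "((sp_one:1.0,sp_two:1.0):0.5,sp_three:1.5);"
geography_species = {"sp_one", "sp_two", "sp_four"}

tips = set(tip_labels(tree))
print("root children:", root_degree(tree))
print("only in tree:", sorted(tips - geography_species))
print("only in geography:", sorted(geography_species - tips))
```

Reporting both set differences makes it clear whether a mismatch comes from a missing species or from a spelling difference between the two files.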
Analysis Parameters
The skill helps users choose:
Maximum Range Size
- How many areas can a species occupy simultaneously?
- Options: Conservative (# areas - 1), Permissive (all areas), Data-driven (max observed)
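- Larger values sharply increase computation time, because the number of possible ranges (model states) grows combinatorially with the number of areas and the maximum range size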
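To see why this parameter matters, it helps to count the candidate ranges. With n areas and a maximum range size of m, the number of possible ranges is the sum of the binomial coefficients C(n, k) for k from 0 to m. The sketch below assumes the empty (null) range is counted as a state; the function name `n_ranges` is hypothetical.

```python
from math import comb


def n_ranges(n_areas: int, max_range_size: int) -> int:
    """Count possible ranges, including the empty (null) range."""
    return sum(comb(n_areas, k) for k in range(max_range_size + 1))


# Compare a restrictive and a permissive setting for 6 and 10 areas
for n_areas in (6, 10):
    for max_size in (2, n_areas):
        print(n_areas, "areas, max range size", max_size, "->",
              n_ranges(n_areas, max_size), "states")
```

Because likelihood calculations work with matrices whose dimensions equal the number of states, even a modest reduction in maximum range size can shrink the problem considerably.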
Models to Compare
- Default: All 6 models (recommended for comprehensive comparison)
- Alternative: Only base models or only +J models
- Rationale: Model comparison is key to biogeographic inference
Visualization Type
- Pie charts (show probabilities and uncertainty)
- Text labels (show most likely states, cleaner)
- Both (default in template)
Bundled Resources
scripts/
validate_geography_file.py
- Validates BioGeoBEARS geography file format
- Reformats from CSV/TSV to PHYLIP
- Cross-validates with tree tip labels
- Usage: python validate_geography_file.py --help
biogeobears_analysis_template.Rmd
- Complete RMarkdown analysis template
- Parameterized via YAML header
- Fits all models, compares, and visualizes
- Generates self-contained HTML report
references/
biogeobears_details.md
- Detailed model descriptions (DEC, DIVALIKE, BAYAREALIKE, +J parameter)
- Input file format specifications with examples
- Parameter interpretation guidelines
- Plotting options and customization
- Complete citations for publications
- Computational considerations and troubleshooting
Example Output
The analysis produces:
- biogeobears_report.html - Interactive HTML report with all results
- [MODEL]_result.Rdata - Saved R objects for each model
- plots/[MODEL]_pie.pdf - Ancestral ranges shown as pie charts on tree
- plots/[MODEL]_text.pdf - Ancestral ranges shown as text labels on tree
Interpretation Guidance
The skill helps users understand:
Model Selection
- AIC weights: Relative support for each model, interpretable as the probability that it is the best of the models compared
- ΔAIC thresholds: <2 (equivalent), 2-7 (less support), >10 (no support)
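For readers who want to check these quantities by hand, the sketch below computes AIC, AICc, and Akaike weights from log-likelihoods using the standard formulas. The log-likelihood values are invented for illustration, and using the number of tips as the sample size for AICc is an assumption, not something this README specifies.

```python
from math import exp


def aic(lnl: float, k: int) -> float:
    """AIC = 2k - 2 lnL, where k is the number of free parameters."""
    return 2 * k - 2 * lnl


def aicc(lnl: float, k: int, n: int) -> float:
    """Small-sample corrected AIC; n is the sample size (assumed: number of tips)."""
    return aic(lnl, k) + (2 * k * (k + 1)) / (n - k - 1)


def akaike_weights(scores: dict[str, float]) -> dict[str, float]:
    """Turn AIC (or AICc) scores into weights that sum to 1."""
    best = min(scores.values())
    rel = {model: exp(-(score - best) / 2) for model, score in scores.items()}
    total = sum(rel.values())
    return {model: r / total for model, r in rel.items()}


# Hypothetical results: (log-likelihood, number of free parameters)
fits = {"DEC": (-40.0, 2), "DEC+J": (-35.0, 3)}
scores = {model: aic(lnl, k) for model, (lnl, k) in fits.items()}
weights = akaike_weights(scores)
for model in fits:
    print(model, "AIC =", scores[model], "weight =", round(weights[model], 3))
```

Subtracting the best score before exponentiating keeps the arithmetic stable and makes the ΔAIC values explicit.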
Parameter Estimates
- d (dispersal): Rate of range expansion
- e (extinction): Rate of local extinction
- j (founder-event): Relative weight of jump dispersal at speciation (a per-event weight rather than a rate through time)
- d/e ratio: >1 favors expansion, <1 favors contraction
Statistical Tests
- LRT p < 0.05: +J parameter significantly improves fit
- Model uncertainty: Report results from multiple models if weights similar
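The likelihood ratio test between a base model and its +J counterpart can be sketched with the standard library alone. The example assumes the two models differ by one free parameter (j), so the test statistic is compared with a chi-square distribution with one degree of freedom; the log-likelihoods are invented, and `lrt_one_df` is a hypothetical helper name.

```python
from math import erfc, sqrt


def lrt_one_df(lnl_null: float, lnl_alt: float) -> tuple[float, float]:
    """Likelihood ratio test for nested models differing by one parameter."""
    stat = 2.0 * (lnl_alt - lnl_null)
    # Chi-square survival function with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p_value = erfc(sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods for DEC (null) and DEC+J (alternative)
stat, p_value = lrt_one_df(lnl_null=-40.0, lnl_alt=-35.0)
print("LRT statistic:", stat)
print("p-value:", p_value)
```

The test only applies to nested pairs such as DEC versus DEC+J; comparisons across model families (for example DEC versus DIVALIKE) rely on AIC-based measures instead.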
Installation Requirements
Users must have:
- R (≥4.0)
- BioGeoBEARS R package
- Supporting R packages: ape, rmarkdown, knitr, kableExtra
- Python 3 (for validation script)
Installation instructions are included in generated README.md files.
Expected Runtime
Skill setup time: 5-10 minutes (file validation and directory setup)
Analysis runtime (separate from skill execution):
- Small datasets (<50 tips, ≤5 areas): 10-30 minutes
- Medium datasets (50-100 tips, 5-6 areas): 30-90 minutes
- Large datasets (>100 tips, >5 areas): 1-6 hours
Common Issues Handled
The skill troubleshoots:
- Species name mismatches between tree and geography file
- Unrooted trees (guides user to root with outgroup)
- Geography file formatting errors (tabs, spaces, binary codes)
- Optimization convergence failures
- Slow runtime with many areas/tips
Citations
Based on:
- BioGeoBEARS package by Nicholas Matzke
- Tutorial resources from http://phylo.wikidot.com/biogeobears
- Example workflows from BioGeoBEARS GitHub repository
Skill Details
- Skill Type: Workflow-based bioinformatics skill
- Domain: Phylogenetic biogeography, historical biogeography
- Output: Complete analysis setup with scripts, documentation, and ready-to-run workflow
- Automation Level: High (validates, reformats, generates all scripts)
- User Input Required: File paths and parameter choices via guided questions
See Also
- phylo_from_buscos - Complementary skill for generating phylogenies from genomes