Files
gh-brunoasm-my-claude-skill…/skills/biogeobears/README.md
2025-11-29 18:02:37 +08:00

223 lines
7.9 KiB
Markdown

# BioGeoBEARS Biogeographic Analysis Skill
A Claude skill for setting up and executing phylogenetic biogeographic analyses using BioGeoBEARS in R.
## Overview
This skill automates the complete workflow for biogeographic analysis on phylogenetic trees, from raw data validation to publication-ready visualizations. It helps users reconstruct ancestral geographic ranges by:
- Validating and reformatting input files (phylogenetic tree + geographic distribution data)
- Setting up organized analysis folder structures
- Generating customized RMarkdown analysis scripts
- Guiding parameter selection (maximum range size, model choices)
- Producing visualizations with pie charts and text labels showing ancestral ranges
- Comparing multiple biogeographic models with statistical tests
## When to Use
Use this skill when you need to:
- Reconstruct ancestral geographic ranges on a phylogeny
- Test different biogeographic models (DEC, DIVALIKE, BAYAREALIKE)
- Analyze how species distributions evolved over time
- Determine whether founder-event speciation (+J parameter) is important
- Generate publication-ready biogeographic visualizations
## Required Inputs
Users must provide:
1. **Phylogenetic tree** (Newick format: .nwk, .tre, or .tree)
- Must be rooted
- Tip labels must match species in geography file
- Branch lengths required
2. **Geographic distribution data** (any tabular format)
- Species names matching tree tips
- Presence/absence data for different geographic areas
- Accepts CSV, TSV, Excel, or PHYLIP format
## What the Skill Does
### 1. Data Validation and Reformatting
The skill includes a Python script (`validate_geography_file.py`) that:
- Validates geography file format (PHYLIP-like with specific tab/spacing requirements)
- Checks for common errors (spaces in species names, tab delimiters, binary code length)
- Reformats CSV/TSV files to proper BioGeoBEARS format
- Cross-validates species names against tree tip labels
### 2. Analysis Setup
Creates an organized directory structure:
```
biogeobears_analysis/
├── input/
│ ├── tree.nwk # Phylogenetic tree
│ ├── geography.data # Validated geography file
│ └── original_data/ # Original input files
├── scripts/
│ └── run_biogeobears.Rmd # Customized RMarkdown script
├── results/ # Analysis outputs
│ ├── [MODEL]_result.Rdata # Saved model results
│ └── plots/ # Visualizations
│ ├── [MODEL]_pie.pdf
│ └── [MODEL]_text.pdf
└── README.md # Documentation
```
### 3. RMarkdown Analysis Template
Generates a complete RMarkdown script that:
- Loads and validates input data
- Fits 6 biogeographic models:
- DEC (Dispersal-Extinction-Cladogenesis)
- DEC+J (DEC with founder-event speciation)
- DIVALIKE (vicariance-focused)
- DIVALIKE+J
- BAYAREALIKE (sympatry-focused)
- BAYAREALIKE+J
- Compares models using AIC, AICc, and AIC weights
- Performs likelihood ratio tests for nested models
- Estimates parameters (d=dispersal, e=extinction, j=founder-event rates)
- Generates visualizations on the phylogeny
- Creates HTML report with all results
### 4. Visualization
Produces two types of plots:
- **Pie charts**: Show probability distributions for ancestral ranges (conveys uncertainty)
- **Text labels**: Show maximum likelihood ancestral states (cleaner, easier to read)
Colors represent geographic areas:
- Single areas: Bright primary colors
- Multi-area ranges: Blended colors
- All areas: White
## Workflow
1. **Gather information**: Ask user for tree file, geography file, and parameters
2. **Validate tree**: Check if rooted and extract tip labels
3. **Validate/reformat geography file**: Use validation script to check format or convert from CSV/TSV
4. **Set up analysis folder**: Create organized directory structure
5. **Generate RMarkdown script**: Customize template with user parameters
6. **Create documentation**: Generate README and run scripts
7. **Provide instructions**: Clear steps for running the analysis
## Analysis Parameters
The skill helps users choose:
### Maximum Range Size
- How many areas can a species occupy simultaneously?
- Options: Conservative (# areas - 1), Permissive (all areas), Data-driven (max observed)
- Larger values increase computation time exponentially
### Models to Compare
- Default: All 6 models (recommended for comprehensive comparison)
- Alternative: Only base models or only +J models
- Rationale: Model comparison is key to biogeographic inference
### Visualization Type
- Pie charts (show probabilities and uncertainty)
- Text labels (show most likely states, cleaner)
- Both (default in template)
## Bundled Resources
### scripts/
**validate_geography_file.py**
- Validates BioGeoBEARS geography file format
- Reformats from CSV/TSV to PHYLIP
- Cross-validates with tree tip labels
- Usage: `python validate_geography_file.py --help`
**biogeobears_analysis_template.Rmd**
- Complete RMarkdown analysis template
- Parameterized via YAML header
- Fits all models, compares, and visualizes
- Generates self-contained HTML report
### references/
**biogeobears_details.md**
- Detailed model descriptions (DEC, DIVALIKE, BAYAREALIKE, +J parameter)
- Input file format specifications with examples
- Parameter interpretation guidelines
- Plotting options and customization
- Complete citations for publications
- Computational considerations and troubleshooting
## Example Output
The analysis produces:
- `biogeobears_report.html` - Interactive HTML report with all results
- `[MODEL]_result.Rdata` - Saved R objects for each model
- `plots/[MODEL]_pie.pdf` - Ancestral ranges shown as pie charts on tree
- `plots/[MODEL]_text.pdf` - Ancestral ranges shown as text labels on tree
## Interpretation Guidance
The skill helps users understand:
### Model Selection
- **AIC weights**: Probability each model is best
- **ΔAIC thresholds**: <2 (equivalent), 2-7 (less support), >10 (no support)
### Parameter Estimates
- **d (dispersal)**: Rate of range expansion
- **e (extinction)**: Rate of local extinction
- **j (founder-event)**: Rate of jump dispersal at speciation
- **d/e ratio**: >1 favors expansion, <1 favors contraction
### Statistical Tests
- **LRT p < 0.05**: +J parameter significantly improves fit
- Model uncertainty: Report results from multiple models if weights similar
## Installation Requirements
Users must have:
- R (≥4.0)
- BioGeoBEARS R package
- Supporting R packages: ape, rmarkdown, knitr, kableExtra
- Python 3 (for validation script)
Installation instructions are included in generated README.md files.
## Expected Runtime
**Skill setup time**: 5-10 minutes (file validation and directory setup)
**Analysis runtime** (separate from skill execution):
- Small datasets (<50 tips, ≤5 areas): 10-30 minutes
- Medium datasets (50-100 tips, 5-6 areas): 30-90 minutes
- Large datasets (>100 tips, >5 areas): 1-6 hours
## Common Issues Handled
The skill troubleshoots:
- Species name mismatches between tree and geography file
- Unrooted trees (guides user to root with outgroup)
- Geography file formatting errors (tabs, spaces, binary codes)
- Optimization convergence failures
- Slow runtime with many areas/tips
## Citations
Based on:
- **BioGeoBEARS** package by Nicholas Matzke
- Tutorial resources from http://phylo.wikidot.com/biogeobears
- Example workflows from BioGeoBEARS GitHub repository
## Skill Details
- **Skill Type**: Workflow-based bioinformatics skill
- **Domain**: Phylogenetic biogeography, historical biogeography
- **Output**: Complete analysis setup with scripts, documentation, and ready-to-run workflow
- **Automation Level**: High (validates, reformats, generates all scripts)
- **User Input Required**: File paths and parameter choices via guided questions
## See Also
- [phylo_from_buscos](../phylo_from_buscos/README.md) - Complementary skill for generating phylogenies from genomes