Initial commit
222
skills/biogeobears/README.md
Normal file
@@ -0,0 +1,222 @@
|
||||
# BioGeoBEARS Biogeographic Analysis Skill
|
||||
|
||||
A Claude skill for setting up and executing phylogenetic biogeographic analyses using BioGeoBEARS in R.
|
||||
|
||||
## Overview
|
||||
|
||||
This skill automates the complete workflow for biogeographic analysis on phylogenetic trees, from raw data validation to publication-ready visualizations. It helps users reconstruct ancestral geographic ranges by:
|
||||
|
||||
- Validating and reformatting input files (phylogenetic tree + geographic distribution data)
|
||||
- Setting up organized analysis folder structures
|
||||
- Generating customized RMarkdown analysis scripts
|
||||
- Guiding parameter selection (maximum range size, model choices)
|
||||
- Producing visualizations with pie charts and text labels showing ancestral ranges
|
||||
- Comparing multiple biogeographic models with statistical tests
|
||||
|
||||
## When to Use
|
||||
|
||||
Use this skill when you need to:
|
||||
- Reconstruct ancestral geographic ranges on a phylogeny
|
||||
- Test different biogeographic models (DEC, DIVALIKE, BAYAREALIKE)
|
||||
- Analyze how species distributions evolved over time
|
||||
- Determine whether founder-event speciation (+J parameter) is important
|
||||
- Generate publication-ready biogeographic visualizations
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Users must provide:
|
||||
|
||||
1. **Phylogenetic tree** (Newick format: .nwk, .tre, or .tree)
   - Must be rooted
   - Tip labels must match species in geography file
   - Branch lengths required

2. **Geographic distribution data** (any tabular format)
   - Species names matching tree tips
   - Presence/absence data for different geographic areas
   - Accepts CSV, TSV, Excel, or PHYLIP format
|
||||
|
||||
## What the Skill Does
|
||||
|
||||
### 1. Data Validation and Reformatting
|
||||
|
||||
The skill includes a Python script (`validate_geography_file.py`) that:
|
||||
- Validates geography file format (PHYLIP-like with specific tab/spacing requirements)
|
||||
- Checks for common errors (spaces in species names, tab delimiters, binary code length)
|
||||
- Reformats CSV/TSV files to proper BioGeoBEARS format
|
||||
- Cross-validates species names against tree tip labels
|
||||
|
||||
### 2. Analysis Setup
|
||||
|
||||
Creates an organized directory structure:
|
||||
```
biogeobears_analysis/
├── input/
│   ├── tree.nwk                 # Phylogenetic tree
│   ├── geography.data           # Validated geography file
│   └── original_data/           # Original input files
├── scripts/
│   └── run_biogeobears.Rmd      # Customized RMarkdown script
├── results/                     # Analysis outputs
│   ├── [MODEL]_result.Rdata     # Saved model results
│   └── plots/                   # Visualizations
│       ├── [MODEL]_pie.pdf
│       └── [MODEL]_text.pdf
└── README.md                    # Documentation
```
|
||||
|
||||
### 3. RMarkdown Analysis Template
|
||||
|
||||
Generates a complete RMarkdown script that:
|
||||
- Loads and validates input data
|
||||
- Fits 6 biogeographic models:
  - DEC (Dispersal-Extinction-Cladogenesis)
  - DEC+J (DEC with founder-event speciation)
  - DIVALIKE (vicariance-focused)
  - DIVALIKE+J
  - BAYAREALIKE (sympatry-focused)
  - BAYAREALIKE+J
|
||||
- Compares models using AIC, AICc, and AIC weights
|
||||
- Performs likelihood ratio tests for nested models
|
||||
- Estimates parameters (d=dispersal, e=extinction, j=founder-event rates)
|
||||
- Generates visualizations on the phylogeny
|
||||
- Creates HTML report with all results
|
||||
|
||||
### 4. Visualization
|
||||
|
||||
Produces two types of plots:
|
||||
- **Pie charts**: Show probability distributions for ancestral ranges (conveys uncertainty)
|
||||
- **Text labels**: Show maximum likelihood ancestral states (cleaner, easier to read)
|
||||
|
||||
Colors represent geographic areas:
|
||||
- Single areas: Bright primary colors
|
||||
- Multi-area ranges: Blended colors
|
||||
- All areas: White
|
||||
|
||||
## Workflow
|
||||
|
||||
1. **Gather information**: Ask user for tree file, geography file, and parameters
|
||||
2. **Validate tree**: Check if rooted and extract tip labels
|
||||
3. **Validate/reformat geography file**: Use validation script to check format or convert from CSV/TSV
|
||||
4. **Set up analysis folder**: Create organized directory structure
|
||||
5. **Generate RMarkdown script**: Customize template with user parameters
|
||||
6. **Create documentation**: Generate README and run scripts
|
||||
7. **Provide instructions**: Clear steps for running the analysis
|
||||
|
||||
## Analysis Parameters
|
||||
|
||||
The skill helps users choose:
|
||||
|
||||
### Maximum Range Size
|
||||
- How many areas can a species occupy simultaneously?
|
||||
- Options: Conservative (# areas - 1), Permissive (all areas), Data-driven (max observed)
|
||||
- Larger values increase computation time exponentially
|
||||
|
||||
### Models to Compare
|
||||
- Default: All 6 models (recommended for comprehensive comparison)
|
||||
- Alternative: Only base models or only +J models
|
||||
- Rationale: Model comparison is key to biogeographic inference
|
||||
|
||||
### Visualization Type
|
||||
- Pie charts (show probabilities and uncertainty)
|
||||
- Text labels (show most likely states, cleaner)
|
||||
- Both (default in template)
|
||||
|
||||
## Bundled Resources
|
||||
|
||||
### scripts/
|
||||
|
||||
**validate_geography_file.py**
|
||||
- Validates BioGeoBEARS geography file format
|
||||
- Reformats from CSV/TSV to PHYLIP
|
||||
- Cross-validates with tree tip labels
|
||||
- Usage: `python validate_geography_file.py --help`
|
||||
|
||||
**biogeobears_analysis_template.Rmd**
|
||||
- Complete RMarkdown analysis template
|
||||
- Parameterized via YAML header
|
||||
- Fits all models, compares, and visualizes
|
||||
- Generates self-contained HTML report
|
||||
|
||||
### references/
|
||||
|
||||
**biogeobears_details.md**
|
||||
- Detailed model descriptions (DEC, DIVALIKE, BAYAREALIKE, +J parameter)
|
||||
- Input file format specifications with examples
|
||||
- Parameter interpretation guidelines
|
||||
- Plotting options and customization
|
||||
- Complete citations for publications
|
||||
- Computational considerations and troubleshooting
|
||||
|
||||
## Example Output
|
||||
|
||||
The analysis produces:
|
||||
- `biogeobears_report.html` - Interactive HTML report with all results
|
||||
- `[MODEL]_result.Rdata` - Saved R objects for each model
|
||||
- `plots/[MODEL]_pie.pdf` - Ancestral ranges shown as pie charts on tree
|
||||
- `plots/[MODEL]_text.pdf` - Ancestral ranges shown as text labels on tree
|
||||
|
||||
## Interpretation Guidance
|
||||
|
||||
The skill helps users understand:
|
||||
|
||||
### Model Selection
|
||||
- **AIC weights**: Relative probability that each model is the best of the models compared
- **ΔAIC thresholds**: <2 (essentially equivalent to the best model), 4-7 (considerably less support), >10 (essentially no support)
|
||||
|
||||
### Parameter Estimates
|
||||
- **d (dispersal)**: Rate of range expansion
|
||||
- **e (extinction)**: Rate of local extinction
|
||||
- **j (founder-event)**: Relative weight of jump dispersal at speciation (a per-event weight, not a rate per unit time)
|
||||
- **d/e ratio**: >1 favors expansion, <1 favors contraction
|
||||
|
||||
### Statistical Tests
|
||||
- **LRT p < 0.05**: +J parameter significantly improves fit
|
||||
- Model uncertainty: Report results from multiple models if weights similar
|
||||
|
||||
## Installation Requirements
|
||||
|
||||
Users must have:
|
||||
- R (≥4.0)
|
||||
- BioGeoBEARS R package
|
||||
- Supporting R packages: ape, rmarkdown, knitr, kableExtra
|
||||
- Python 3 (for validation script)
|
||||
|
||||
Installation instructions are included in generated README.md files.
|
||||
|
||||
## Expected Runtime
|
||||
|
||||
**Skill setup time**: 5-10 minutes (file validation and directory setup)
|
||||
|
||||
**Analysis runtime** (separate from skill execution):
|
||||
- Small datasets (<50 tips, ≤5 areas): 10-30 minutes
|
||||
- Medium datasets (50-100 tips, 5-6 areas): 30-90 minutes
|
||||
- Large datasets (>100 tips, >5 areas): 1-6 hours
|
||||
|
||||
## Common Issues Handled
|
||||
|
||||
The skill troubleshoots:
|
||||
- Species name mismatches between tree and geography file
|
||||
- Unrooted trees (guides user to root with outgroup)
|
||||
- Geography file formatting errors (tabs, spaces, binary codes)
|
||||
- Optimization convergence failures
|
||||
- Slow runtime with many areas/tips
|
||||
|
||||
## Citations
|
||||
|
||||
Based on:
|
||||
- **BioGeoBEARS** package by Nicholas Matzke
|
||||
- Tutorial resources from http://phylo.wikidot.com/biogeobears
|
||||
- Example workflows from BioGeoBEARS GitHub repository
|
||||
|
||||
## Skill Details
|
||||
|
||||
- **Skill Type**: Workflow-based bioinformatics skill
|
||||
- **Domain**: Phylogenetic biogeography, historical biogeography
|
||||
- **Output**: Complete analysis setup with scripts, documentation, and ready-to-run workflow
|
||||
- **Automation Level**: High (validates, reformats, generates all scripts)
|
||||
- **User Input Required**: File paths and parameter choices via guided questions
|
||||
|
||||
## See Also
|
||||
|
||||
- [phylo_from_buscos](../phylo_from_buscos/README.md) - Complementary skill for generating phylogenies from genomes
|
||||
581
skills/biogeobears/SKILL.md
Normal file
@@ -0,0 +1,581 @@
|
||||
---
|
||||
name: biogeobears
|
||||
description: Set up and execute phylogenetic biogeographic analyses using BioGeoBEARS in R. Use when users request biogeographic reconstruction, ancestral range estimation, or want to analyze species distributions on phylogenies. Handles input file validation, data reformatting, RMarkdown workflow generation, and result visualization.
|
||||
---
|
||||
|
||||
# BioGeoBEARS Biogeographic Analysis
|
||||
|
||||
## Overview
|
||||
|
||||
BioGeoBEARS (BioGeography with Bayesian and Likelihood Evolutionary Analysis in R Scripts) performs probabilistic inference of ancestral geographic ranges on phylogenetic trees. This skill helps set up complete biogeographic analyses by:
|
||||
|
||||
1. Validating and reformatting input files (phylogenetic tree and geographic distribution data)
|
||||
2. Generating organized analysis folder structure
|
||||
3. Creating customized RMarkdown analysis scripts
|
||||
4. Guiding users through parameter selection and model choices
|
||||
5. Producing publication-ready visualizations
|
||||
|
||||
## When to Use This Skill
|
||||
|
||||
Use this skill when users request:
|
||||
- "Analyze biogeography on my phylogeny"
|
||||
- "Reconstruct ancestral ranges for my species"
|
||||
- "Run BioGeoBEARS analysis"
|
||||
- "Which areas did my ancestors occupy?"
|
||||
- "Test biogeographic models (DEC, DIVALIKE, BAYAREALIKE)"
|
||||
|
||||
The skill triggers when users mention phylogenetic biogeography, ancestral area reconstruction, or provide tree + distribution data.
|
||||
|
||||
## Required Inputs
|
||||
|
||||
Users must provide:
|
||||
|
||||
1. **Phylogenetic tree** (Newick format, .nwk, .tre, or .tree file)
|
||||
- Must be rooted
|
||||
- Tip labels will be matched to geography file
|
||||
- Branch lengths required
|
||||
|
||||
2. **Geographic distribution data** (any tabular format)
|
||||
- Species names (matching tree tips)
|
||||
- Presence/absence data for different geographic areas
|
||||
- Can be CSV, TSV, Excel, or already in PHYLIP format
|
||||
|
||||
## Workflow
|
||||
|
||||
### Step 1: Gather Information
|
||||
|
||||
When a user requests a BioGeoBEARS analysis, ask for:
|
||||
|
||||
1. **Input file paths**:
|
||||
- "What is the path to your phylogenetic tree file?"
|
||||
- "What is the path to your geographic distribution file?"
|
||||
|
||||
2. **Analysis parameters** (if not specified):
|
||||
- Maximum range size (how many areas can a species occupy simultaneously?)
|
||||
- Which models to compare (default: all six - DEC, DEC+J, DIVALIKE, DIVALIKE+J, BAYAREALIKE, BAYAREALIKE+J)
|
||||
- Output directory name (default: "biogeobears_analysis")
|
||||
|
||||
Use the AskUserQuestion tool to gather this information efficiently:
|
||||
|
||||
```
Example questions:
- "Maximum range size" - options based on number of areas (e.g., for 4 areas: "All 4 areas", "3 areas", "2 areas")
- "Models to compare" - options: "All 6 models (recommended)", "Only base models (DEC, DIVALIKE, BAYAREALIKE)", "Only +J models", "Custom selection"
- "Visualization type" - options: "Pie charts (show probabilities)", "Text labels (show most likely states)", "Both"
```
|
||||
|
||||
### Step 2: Validate and Prepare Input Files
|
||||
|
||||
#### Validate Tree File
|
||||
|
||||
Use the Read tool to inspect the tree file, then confirm in R that it parses, is rooted, and has branch lengths:
|
||||
|
||||
```r
# In R, basic validation:
library(ape)
tr <- read.tree("path/to/tree.nwk")
print(paste("Tips:", length(tr$tip.label)))
print(paste("Rooted:", is.rooted(tr)))
# Branch lengths are required; edge.length is NULL when they are missing
print(paste("Has branch lengths:", !is.null(tr$edge.length)))
print(tr$tip.label)  # Check species names
```
|
||||
|
||||
Verify:
|
||||
- File can be parsed as Newick
|
||||
- Tree is rooted (if not, ask user which outgroup to use)
|
||||
- Note the tip labels for geography file validation
|
||||
|
||||
#### Validate and Reformat Geography File
|
||||
|
||||
Use `scripts/validate_geography_file.py` to validate or reformat the geography file.
|
||||
|
||||
**If file is already in PHYLIP format** (starts with numbers):
|
||||
|
||||
```bash
python scripts/validate_geography_file.py path/to/geography.txt --validate --tree path/to/tree.nwk
```
|
||||
|
||||
This checks:
|
||||
- Correct tab delimiters
|
||||
- Species names match tree tips
|
||||
- Binary codes are correct length
|
||||
- No spaces in species names or binary codes
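The checks listed above can be sketched in a few lines of Python. This is not the bundled `validate_geography_file.py`; it is a minimal illustration that assumes the usual BioGeoBEARS layout (a header of the form `n_species<TAB>n_areas (A B C)` followed by one `name<TAB>binary_code` row per species). The function name `validate_geography` is hypothetical.

```python
import re


def validate_geography(text):
    """Return a list of problems found in a PHYLIP-like geography file.

    Assumed layout (not taken from the bundled script):
        line 1:  <n_species><TAB><n_areas> (A B C)
        others:  <species_name><TAB><binary code, one digit per area>
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return ["file is empty"]

    header = re.match(r"^(\d+)\t(\d+)\s+\(([^)]*)\)\s*$", lines[0])
    if header is None:
        return ["header must look like '3<TAB>4 (A B C D)'"]

    problems = []
    n_species, n_areas = int(header.group(1)), int(header.group(2))
    area_names = header.group(3).split()
    if len(area_names) != n_areas:
        problems.append(
            f"header declares {n_areas} areas but names {len(area_names)}"
        )

    rows = lines[1:]
    if len(rows) != n_species:
        problems.append(
            f"header declares {n_species} species but file has {len(rows)}"
        )

    for lineno, row in enumerate(rows, start=2):
        parts = row.split("\t")
        if len(parts) != 2:
            problems.append(f"line {lineno}: expected exactly one tab")
            continue
        name, code = parts
        if " " in name:
            problems.append(f"line {lineno}: space in species name")
        if not re.fullmatch(r"[01]+", code):
            problems.append(f"line {lineno}: code must contain only 0 and 1")
        elif len(code) != n_areas:
            problems.append(
                f"line {lineno}: code has {len(code)} digits, expected {n_areas}"
            )
    return problems
```

An empty list means the file passed every check; otherwise each entry names the offending line, which makes it easy to relay specific fixes to the user.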
|
||||
|
||||
**If file is in CSV/TSV format** (needs reformatting):
|
||||
|
||||
```bash
python scripts/validate_geography_file.py path/to/distribution.csv --reformat -o geography.data --delimiter ","
```
|
||||
|
||||
Or for tab-delimited:
|
||||
|
||||
```bash
python scripts/validate_geography_file.py path/to/distribution.txt --reformat -o geography.data --delimiter tab
```
|
||||
|
||||
The script will:
|
||||
- Detect area names from header row
|
||||
- Convert presence/absence data to binary (handles "1", "present", "TRUE", etc.)
|
||||
- Remove spaces from species names (replace with underscores)
|
||||
- Create properly formatted PHYLIP file
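To make the conversion concrete, here is a rough Python sketch of the same steps. It is an illustration only, not the code in `validate_geography_file.py`: the function name `csv_to_geography` and the exact set of values treated as "present" (`TRUTHY`) are assumptions, and the first CSV column is assumed to hold species names.

```python
import csv
import io

# Values treated as presence; anything else counts as absence (assumed set).
TRUTHY = {"1", "present", "true", "yes", "y", "x"}


def csv_to_geography(csv_text, delimiter=","):
    """Convert a presence/absence table to PHYLIP-like geography text."""
    reader = csv.reader(io.StringIO(csv_text), delimiter=delimiter)
    rows = [r for r in reader if any(cell.strip() for cell in r)]
    header, data = rows[0], rows[1:]

    # Area names come from the header row, skipping the species column.
    areas = [a.strip() for a in header[1:]]
    out = [f"{len(data)}\t{len(areas)} ({' '.join(areas)})"]

    for row in data:
        if len(row) != len(header):
            raise ValueError(f"row for {row[0]!r} has {len(row)} columns")
        # Replace runs of whitespace in the species name with underscores.
        name = "_".join(row[0].split())
        code = "".join(
            "1" if cell.strip().lower() in TRUTHY else "0" for cell in row[1:]
        )
        out.append(f"{name}\t{code}")
    return "\n".join(out) + "\n"
```

For a tab-delimited table, pass `delimiter="\t"`. Treating every unrecognized value as absence is a deliberate simplification; a stricter converter might reject unknown values instead.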
|
||||
|
||||
**Always validate the reformatted file** before proceeding:
|
||||
|
||||
```bash
python scripts/validate_geography_file.py geography.data --validate --tree path/to/tree.nwk
```
|
||||
|
||||
### Step 3: Set Up Analysis Folder Structure
|
||||
|
||||
Create an organized directory for the analysis:
|
||||
|
||||
```
biogeobears_analysis/
├── input/
│   ├── tree.nwk                 # Original or copied tree
│   ├── geography.data           # Validated/reformatted geography file
│   └── original_data/           # Original input files
│       ├── original_tree.nwk
│       └── original_distribution.csv
├── scripts/
│   └── run_biogeobears.Rmd      # Generated RMarkdown script
├── results/                     # Created by analysis (output directory)
│   ├── [MODEL]_result.Rdata     # Saved model results
│   └── plots/                   # Visualization outputs
│       ├── [MODEL]_pie.pdf
│       └── [MODEL]_text.pdf
└── README.md                    # Analysis documentation
```
|
||||
|
||||
Create this structure programmatically:
|
||||
|
||||
```bash
mkdir -p biogeobears_analysis/input/original_data
mkdir -p biogeobears_analysis/scripts
mkdir -p biogeobears_analysis/results/plots

# Copy files
cp path/to/tree.nwk biogeobears_analysis/input/
cp geography.data biogeobears_analysis/input/

# Keep untouched copies of the user's original inputs (substitute the real paths)
cp path/to/tree.nwk biogeobears_analysis/input/original_data/original_tree.nwk
cp path/to/distribution.csv biogeobears_analysis/input/original_data/original_distribution.csv
```
|
||||
|
||||
### Step 4: Generate RMarkdown Analysis Script
|
||||
|
||||
Use the template at `scripts/biogeobears_analysis_template.Rmd` and customize it with user parameters.
|
||||
|
||||
**Copy and customize the template**:
|
||||
|
||||
```bash
cp scripts/biogeobears_analysis_template.Rmd biogeobears_analysis/scripts/run_biogeobears.Rmd
```
|
||||
|
||||
**Create a parameter file** or modify the YAML header in the Rmd to use the user's specific settings:
|
||||
|
||||
Example customization via R code:
|
||||
|
||||
```r
# Edit YAML parameters programmatically or provide as params when rendering
rmarkdown::render(
  "biogeobears_analysis/scripts/run_biogeobears.Rmd",
  params = list(
    tree_file = "../input/tree.nwk",
    geog_file = "../input/geography.data",
    max_range_size = 4,
    models = "DEC,DEC+J,DIVALIKE,DIVALIKE+J,BAYAREALIKE,BAYAREALIKE+J",
    output_dir = "../results"
  ),
  output_file = "../results/biogeobears_report.html"
)
```
|
||||
|
||||
Or create a run script:
|
||||
|
||||
```bash
#!/bin/bash
# biogeobears_analysis/run_analysis.sh
# The shebang must be the first line of the script for it to take effect.
cd "$(dirname "$0")/scripts" || exit 1

R -e "rmarkdown::render('run_biogeobears.Rmd', params = list(
  tree_file = '../input/tree.nwk',
  geog_file = '../input/geography.data',
  max_range_size = 4,
  models = 'DEC,DEC+J,DIVALIKE,DIVALIKE+J,BAYAREALIKE,BAYAREALIKE+J',
  output_dir = '../results'
), output_file = '../results/biogeobears_report.html')"
```
|
||||
|
||||
### Step 5: Create README Documentation
|
||||
|
||||
Generate a README.md in the analysis directory explaining:
|
||||
|
||||
- What files are present
|
||||
- How to run the analysis
|
||||
- What parameters were used
|
||||
- How to interpret results
|
||||
|
||||
Example:
|
||||
|
||||
````markdown
# BioGeoBEARS Analysis

## Overview

Biogeographic analysis of [NUMBER] species across [NUMBER] geographic areas.

## Input Data

- **Tree**: `input/tree.nwk` ([NUMBER] tips)
- **Geography**: `input/geography.data` ([NUMBER] species × [NUMBER] areas)
- **Areas**: [A, B, C, ...]

## Parameters

- Maximum range size: [NUMBER]
- Models tested: [LIST]

## Running the Analysis

### Option 1: Using RMarkdown directly

```r
library(rmarkdown)
render("scripts/run_biogeobears.Rmd",
       output_file = "../results/biogeobears_report.html")
```

### Option 2: Using the run script

```bash
bash run_analysis.sh
```

## Outputs

Results will be saved in `results/`:

- `biogeobears_report.html` - Full analysis report with visualizations
- `[MODEL]_result.Rdata` - Saved R objects for each model
- `plots/[MODEL]_pie.pdf` - Ancestral range reconstructions (pie charts)
- `plots/[MODEL]_text.pdf` - Ancestral range reconstructions (text labels)

## Interpreting Results

The HTML report includes:

1. **Model Comparison** - AIC scores, AIC weights, best-fit model
2. **Parameter Estimates** - Dispersal (d), extinction (e), founder-event (j) rates
3. **Likelihood Ratio Tests** - Statistical comparisons of nested models
4. **Ancestral Range Plots** - Visualizations on phylogeny
5. **Session Info** - R package versions for reproducibility

## Model Descriptions

- **DEC**: Dispersal-Extinction-Cladogenesis (general-purpose)
- **DIVALIKE**: Emphasizes vicariance
- **BAYAREALIKE**: Emphasizes sympatric speciation
- **+J**: Adds founder-event speciation parameter

See `references/biogeobears_details.md` for detailed model descriptions.

## Installation Requirements

```r
# Install BioGeoBEARS
install.packages("rexpokit")
install.packages("cladoRcpp")
install.packages("devtools")  # needed for install_github()
devtools::install_github(repo="nmatzke/BioGeoBEARS")

# Other packages
install.packages(c("ape", "rmarkdown", "knitr", "kableExtra"))
```
````
|
||||
|
||||
### Step 6: Provide User Instructions
|
||||
|
||||
After setting up the analysis, provide clear instructions to the user:
|
||||
|
||||
````
Analysis Setup Complete!

Directory structure created at: biogeobears_analysis/

📁 Files created:
✓ input/tree.nwk - Phylogenetic tree ([N] tips)
✓ input/geography.data - Geographic distribution data (validated)
✓ scripts/run_biogeobears.Rmd - RMarkdown analysis script
✓ README.md - Documentation and instructions
✓ run_analysis.sh - Convenience script to run analysis

📋 Next steps:

1. Review the README.md for analysis details

2. Install BioGeoBEARS if not already installed:
   ```r
   install.packages("rexpokit")
   install.packages("cladoRcpp")
   install.packages("devtools")  # needed for install_github()
   devtools::install_github(repo="nmatzke/BioGeoBEARS")
   ```

3. Run the analysis:
   ```bash
   cd biogeobears_analysis
   bash run_analysis.sh
   ```

   Or in R:
   ```r
   setwd("biogeobears_analysis")
   rmarkdown::render("scripts/run_biogeobears.Rmd",
                     output_file = "../results/biogeobears_report.html")
   ```

4. View results:
   - Open results/biogeobears_report.html in web browser
   - Check results/plots/ for PDF visualizations

⏱️ Expected runtime: [ESTIMATE based on tree size]
- Small trees (<50 tips): 5-15 minutes
- Medium trees (50-100 tips): 15-60 minutes
- Large trees (>100 tips): 1-4 hours

💡 The HTML report includes model comparison, parameter estimates, and visualization of ancestral ranges on your phylogeny.
````
|
||||
|
||||
## Analysis Parameter Guidance
|
||||
|
||||
When users ask for guidance on parameters, consult `references/biogeobears_details.md` and provide recommendations:
|
||||
|
||||
### Maximum Range Size
|
||||
|
||||
**Ask**: "What's the maximum number of areas a species in your group can realistically occupy?"
|
||||
|
||||
Common approaches:
|
||||
- **Conservative**: Number of areas - 1 (prevents unrealistic cosmopolitan ancestral ranges)
|
||||
- **Permissive**: All areas (if biologically plausible)
|
||||
- **Data-driven**: Maximum observed in extant species
|
||||
|
||||
**Impact**: Larger values increase computational time exponentially
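To see why the cost climbs so quickly, it helps to count the possible ranges (states). With `n` areas and a maximum range size of `m`, the state space contains every subset of up to `m` areas. The sketch below counts those subsets; the function name `count_states` is hypothetical, and including the empty (null) range in the count is an assumption about how the state space is set up.

```python
from math import comb


def count_states(n_areas, max_range_size, include_null=True):
    """Number of geographic ranges with at most max_range_size areas.

    include_null counts the empty range as a state (an assumption here).
    """
    start = 0 if include_null else 1
    return sum(comb(n_areas, k) for k in range(start, max_range_size + 1))


if __name__ == "__main__":
    for n_areas, max_size in [(5, 5), (8, 4), (8, 8), (10, 3), (10, 10)]:
        print(n_areas, max_size, count_states(n_areas, max_size))
```

With no cap, the count doubles with every added area (32 states for 5 areas, 1024 for 10), whereas capping 10 areas at a range size of 3 leaves 176 states. Because the model works with a states-by-states transition matrix, reducing the state count pays off more than proportionally.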
|
||||
|
||||
### Model Selection
|
||||
|
||||
**Default recommendation**: Run all 6 models for comprehensive comparison
|
||||
|
||||
- DEC, DIVALIKE, BAYAREALIKE (base models)
|
||||
- DEC+J, DIVALIKE+J, BAYAREALIKE+J (+J variants)
|
||||
|
||||
**Rationale**:
|
||||
- Model comparison is key to inference
|
||||
- +J parameter is often significant
|
||||
- Small additional computational cost
|
||||
|
||||
If computation is a concern, suggest starting with DEC and DEC+J.
|
||||
|
||||
### Visualization Options
|
||||
|
||||
**Pie charts** (`plotwhat = "pie"`):
|
||||
- Show probability distributions across all possible states
|
||||
- Better for conveying uncertainty
|
||||
- Can be cluttered with many areas
|
||||
|
||||
**Text labels** (`plotwhat = "text"`):
|
||||
- Show only maximum likelihood state
|
||||
- Cleaner, easier to read
|
||||
- Doesn't show uncertainty
|
||||
|
||||
**Recommendation**: Generate both in the analysis (template does this automatically)
|
||||
|
||||
## Common Issues and Troubleshooting
|
||||
|
||||
### Species Name Mismatches
|
||||
|
||||
**Symptom**: Error about species in tree not in geography file (or vice versa)
|
||||
|
||||
**Solution**: Use the validation script with `--tree` option to identify mismatches, then either:
|
||||
1. Edit the geography file to match tree tip labels
|
||||
2. Edit tree tip labels to match geography file
|
||||
3. Remove species that aren't in both
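When a quick look at the mismatches is needed before deciding among these options, a comparison along the following lines can help. It is a simplified sketch rather than the validation script's own logic: it assumes unquoted Newick tip labels, a one-line geography header, and tab-separated species rows, and the helper names are hypothetical.

```python
import re


def newick_tip_labels(newick):
    """Extract tip labels from a Newick string.

    Assumes unquoted labels: a tip label directly follows "(" or ",",
    whereas internal node labels follow ")" and are ignored.
    """
    return set(re.findall(r"[(,]\s*([^\s(),:;]+)", newick))


def geography_species(geog_text):
    """Species names from a geography file (first line is the header)."""
    lines = [ln for ln in geog_text.splitlines() if ln.strip()]
    return {ln.split("\t")[0].strip() for ln in lines[1:]}


def compare_names(newick, geog_text):
    """Return (tips missing from geography, species missing from tree)."""
    tips = newick_tip_labels(newick)
    species = geography_species(geog_text)
    return sorted(tips - species), sorted(species - tips)
```

Two empty lists mean the names agree. Names that differ only by spaces versus underscores or by capitalization usually point to option 1 or 2; names present in only one file point to option 3.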
|
||||
|
||||
### Tree Not Rooted
|
||||
|
||||
**Symptom**: Error about unrooted tree
|
||||
|
||||
**Solution**:
|
||||
```r
library(ape)
tr <- read.tree("tree.nwk")
# resolve.root = TRUE gives a bifurcating root, so is.rooted(tr) returns TRUE
tr <- root(tr, outgroup = "outgroup_species_name", resolve.root = TRUE)
write.tree(tr, "tree_rooted.nwk")
```
|
||||
|
||||
Ask user which species to use as outgroup.
|
||||
|
||||
### Formatting Errors in Geography File
|
||||
|
||||
**Symptom**: Validation errors about tabs, spaces, or binary codes
|
||||
|
||||
**Solution**: Use the reformat option:
|
||||
```bash
python scripts/validate_geography_file.py input.csv --reformat -o geography.data
```
|
||||
|
||||
### Optimization Fails to Converge
|
||||
|
||||
**Symptom**: NA values in parameter estimates or very negative log-likelihoods
|
||||
|
||||
**Possible causes**:
|
||||
- Tree and geography data mismatch
|
||||
- All species in same area (no variation)
|
||||
- Unrealistic max_range_size
|
||||
|
||||
**Solution**: Check input data quality and try simpler model first (DEC only)
|
||||
|
||||
### Very Slow Runtime
|
||||
|
||||
**Causes**:
|
||||
- Large number of areas (>6-7 areas gets slow)
|
||||
- Large max_range_size
|
||||
- Many tips (>200)
|
||||
|
||||
**Solutions**:
|
||||
- Reduce max_range_size
|
||||
- Combine geographic areas if appropriate
|
||||
- Use `force_sparse = TRUE` in run object
|
||||
- Run on HPC cluster
|
||||
|
||||
## Resources
|
||||
|
||||
This skill includes:
|
||||
|
||||
### scripts/
|
||||
|
||||
- **validate_geography_file.py** - Validates and reformats geography files
|
||||
- Checks PHYLIP format compliance
|
||||
- Validates against tree tip labels
|
||||
- Reformats from CSV/TSV to PHYLIP
|
||||
- Usage: `python validate_geography_file.py --help`
|
||||
|
||||
- **biogeobears_analysis_template.Rmd** - RMarkdown template for complete analysis
|
||||
- Model fitting for DEC, DIVALIKE, BAYAREALIKE (with/without +J)
|
||||
- Model comparison with AIC, AICc, weights
|
||||
- Likelihood ratio tests
|
||||
- Parameter visualization
|
||||
- Ancestral range plotting
|
||||
- Customizable via YAML parameters
|
||||
|
||||
### references/
|
||||
|
||||
- **biogeobears_details.md** - Comprehensive reference including:
|
||||
- Detailed model descriptions
|
||||
- Input file format specifications
|
||||
- Parameter interpretation guidelines
|
||||
- Plotting options and customization
|
||||
- Citations and further reading
|
||||
- Computational considerations
|
||||
|
||||
Load this reference when:
|
||||
- Users ask about specific models
|
||||
- Need to explain parameter estimates
|
||||
- Troubleshooting complex issues
|
||||
- Users want detailed methodology for publications
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always validate input files** before analysis - saves time debugging later
|
||||
|
||||
2. **Organize analysis in a dedicated directory** - keeps everything together and reproducible
|
||||
|
||||
3. **Run all 6 models by default** - model comparison is crucial for biogeographic inference
|
||||
|
||||
4. **Document parameters and decisions** - analysis README helps with reproducibility
|
||||
|
||||
5. **Generate both visualization types** - pie charts for uncertainty, text labels for clarity
|
||||
|
||||
6. **Save intermediate results** - the RMarkdown template does this automatically
|
||||
|
||||
7. **Check parameter estimates** - unrealistic values suggest data or model issues
|
||||
|
||||
8. **Provide context with visualizations** - explain what dispersal/extinction rates mean for the user's system
|
||||
|
||||
## Output Interpretation
|
||||
|
||||
When presenting results to users, explain:
|
||||
|
||||
### Model Selection
|
||||
|
||||
- **AIC weights** give the relative probability that each model is the best of the models compared
- **ΔAIC < 2**: Models essentially equivalent
- **ΔAIC 4-7**: Considerably less support
|
||||
- **ΔAIC > 10**: Essentially no support
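The quantities in this list follow from each model's log-likelihood and parameter count. The sketch below uses the standard formulas (AIC = 2k - 2 lnL, the small-sample AICc correction, and Akaike weights); the function names and the log-likelihood values in the usage example are made up for illustration, and treating the number of tips as the sample size `n` for AICc is an assumption.

```python
from math import exp


def aic(lnl, k):
    """Akaike Information Criterion for log-likelihood lnl and k parameters."""
    return 2 * k - 2 * lnl


def aicc(lnl, k, n):
    """Small-sample corrected AIC; n is the sample size (assumed: tip count)."""
    return aic(lnl, k) + (2 * k * (k + 1)) / (n - k - 1)


def akaike_weights(aic_values):
    """Map model name -> weight, given a dict of model name -> AIC."""
    best = min(aic_values.values())
    rel = {m: exp(-0.5 * (v - best)) for m, v in aic_values.items()}
    total = sum(rel.values())
    return {m: r / total for m, r in rel.items()}


if __name__ == "__main__":
    # Hypothetical log-likelihoods, not results from any real dataset.
    fits = {"DEC": (-50.0, 2), "DEC+J": (-45.0, 3)}
    scores = {m: aic(lnl, k) for m, (lnl, k) in fits.items()}
    for model, weight in akaike_weights(scores).items():
        print(model, scores[model], round(weight, 3))
```

The weights always sum to 1 across the models compared, so they describe relative support within that set only, not whether any of the models fits well in absolute terms.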
|
||||
|
||||
### Parameter Estimates
|
||||
|
||||
- **d (dispersal rate)**: Higher = more range expansions
|
||||
- **e (extinction rate)**: Higher = more local extinctions
|
||||
- **j (founder-event weight)**: Higher = more jump dispersal at speciation (a relative per-event weight, not a rate per unit time)
|
||||
- **Ratio d/e**: > 1 favors expansion, < 1 favors contraction
|
||||
|
||||
### Ancestral Ranges
|
||||
|
||||
- **Pie charts**: Larger slices = higher probability
|
||||
- **Colors**: Represent areas (single area = bright color, multiple areas = blended)
|
||||
- **Node labels**: Most likely ancestral range
|
||||
- **Split events** (at corners): Range changes at speciation
|
||||
|
||||
### Statistical Tests
|
||||
|
||||
- **LRT p < 0.05**: +J parameter significantly improves fit
|
||||
- **High AIC weight** (>0.7): Strong evidence for one model
|
||||
- **Similar AIC weights**: Model uncertainty - report results from multiple models
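For reference, the likelihood ratio test between a base model and its +J variant can be computed by hand. The sketch below assumes the two models differ by exactly one free parameter (j), so the test statistic is compared against a chi-square distribution with one degree of freedom; the function name `lrt_one_df` is hypothetical, and the p-value uses the identity that this tail probability equals `erfc(sqrt(stat / 2))`.

```python
from math import erfc, sqrt


def lrt_one_df(lnl_null, lnl_alt):
    """Likelihood ratio test for nested models differing by one parameter.

    lnl_null: log-likelihood of the simpler model (e.g. DEC)
    lnl_alt:  log-likelihood of the richer model (e.g. DEC+J)
    Returns (test statistic, p-value).
    """
    stat = 2.0 * (lnl_alt - lnl_null)
    if stat <= 0:
        # The richer model did not improve the fit at all.
        return stat, 1.0
    # Chi-square survival function with 1 degree of freedom.
    return stat, erfc(sqrt(stat / 2.0))
```

A log-likelihood gain of about 1.92 units corresponds to p = 0.05 under these assumptions, which gives a quick way to sanity-check the p-values in the report.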
|
||||
|
||||
## Example Usage
|
||||
|
||||
```
User: "I have a phylogeny of 30 bird species and their distributions across 5 islands. Can you help me figure out where their ancestors lived?"

Claude (using this skill):
1. Ask for tree and distribution file paths
2. Validate tree file (check 30 tips, rooted)
3. Validate/reformat geography file (5 areas)
4. Ask about max_range_size (suggest 4 areas)
5. Ask about models (suggest all 6)
6. Set up biogeobears_analysis/ directory structure
7. Copy template RMarkdown script with parameters
8. Generate README.md and run_analysis.sh
9. Provide clear instructions to run analysis
10. Explain expected outputs and how to interpret them

Result: User has complete, ready-to-run analysis with documentation
```
|
||||
|
||||
## Attribution
|
||||
|
||||
This skill was created based on:
|
||||
- **BioGeoBEARS** package by Nicholas Matzke
|
||||
- Tutorial resources from http://phylo.wikidot.com/biogeobears
|
||||
- Example workflows from the BioGeoBEARS GitHub repository
|
||||
|
||||
## Additional Notes
|
||||
|
||||
**Time estimate for skill execution**:
|
||||
- File validation: 1-2 minutes
|
||||
- Directory setup: < 1 minute
|
||||
- Total setup time: 5-10 minutes
|
||||
|
||||
**Analysis runtime** (separate from skill execution):
|
||||
- Depends on tree size and number of areas
|
||||
- Small datasets (<50 tips, ≤5 areas): 10-30 minutes
|
||||
- Large datasets (>100 tips, >5 areas): 1-6 hours
|
||||
|
||||
**Installation requirements** (user must have):
|
||||
- R (≥4.0)
|
||||
- BioGeoBEARS R package
|
||||
- Supporting packages: ape, rmarkdown, knitr, kableExtra
|
||||
- Python 3 (for validation script)
|
||||
|
||||
**When to consult references/**:
|
||||
- Load `biogeobears_details.md` when users need detailed explanations of models, parameters, or interpretation
|
||||
- Reference it for troubleshooting complex issues
|
||||
- Use it to help users write methods sections for publications
|
||||
358
skills/biogeobears/references/biogeobears_details.md
Normal file
@@ -0,0 +1,358 @@
|
||||
# BioGeoBEARS Detailed Reference
|
||||
|
||||
## Overview
|
||||
|
||||
BioGeoBEARS (BioGeography with Bayesian and Likelihood Evolutionary Analysis in R Scripts) is an R package for probabilistic inference of historical biogeography on phylogenetic trees. It implements various models of range evolution and allows statistical comparison between them.
|
||||
|
||||
## Installation
|
||||
|
||||
```r
# Install dependencies
install.packages("rexpokit")
install.packages("cladoRcpp")

# Install from GitHub (devtools provides install_github())
install.packages("devtools")
devtools::install_github(repo="nmatzke/BioGeoBEARS")
```
|
||||
|
||||
## Biogeographic Models
|
||||
|
||||
BioGeoBEARS implements several models that differ in their assumptions about how species ranges evolve:
|
||||
|
||||
### DEC (Dispersal-Extinction-Cladogenesis)
|
||||
|
||||
The DEC model is based on LAGRANGE and includes:
|
||||
|
||||
- **Anagenetic changes** (along branches):
  - `d` (dispersal): Rate of range expansion into an additional area
  - `e` (extinction): Rate of local extinction (range contraction) in an area

- **Cladogenetic events** (at speciation nodes):
  - Vicariance: Ancestral range splits between the daughter lineages; in DEC, one daughter always receives a single area
  - Subset sympatry: One daughter inherits the full ancestral range, the other a single area within it
  - Range copying (sympatry): Both daughters inherit the ancestral range; in DEC, only when the ancestor occupies a single area
|
||||
|
||||
**Parameters**: 2 (d, e)
|
||||
**Best for**: General-purpose biogeographic inference
|
||||
|
||||
### DIVALIKE (Vicariance-focused)
|
||||
|
||||
Similar to DIVA (Dispersal-Vicariance Analysis):
|
||||
|
||||
- Emphasizes vicariance at speciation events
|
||||
- Fixes subset sympatry probability to 0
|
||||
- Only allows vicariance (including splits in which both daughters are widespread) and single-area range copying at nodes
|
||||
|
||||
**Parameters**: 2 (d, e)
|
||||
**Best for**: Systems where vicariance is the primary speciation mode
|
||||
|
||||
### BAYAREALIKE (Sympatry-focused)
|
||||
|
||||
Based on the BayArea model:
|
||||
|
||||
- Emphasizes sympatric speciation: ranges change only along branches
- Fixes vicariance and subset sympatry probabilities to 0
- Only allows range copying at nodes, for ancestral ranges of any size
|
||||
|
||||
**Parameters**: 2 (d, e)
|
||||
**Best for**: Systems where dispersal and sympatric speciation dominate
|
||||
|
||||
### +J Extension (Founder-event speciation)
|
||||
|
||||
Any of the above models can include a "+J" parameter:
|
||||
|
||||
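- **j**: Relative weight of jump dispersal (founder-event speciation) among cladogenetic events; a per-event weight, not a rate along branches
- Allows one daughter lineage to occupy a new area instantaneously at speciation
- Often significantly improves model fit
- Controversial: Ree & Sanmartín (2018) argue that DEC+J has conceptual and statistical problems and that comparing it with DEC by model selection is unreliable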
|
||||
|
||||
**Examples**: DEC+J, DIVALIKE+J, BAYAREALIKE+J
|
||||
**Additional parameters**: +1 (j)
|
||||
|
||||
## Model Comparison
|
||||
|
||||
### AIC (Akaike Information Criterion)
|
||||
|
||||
```
|
||||
AIC = -2 × ln(L) + 2k
|
||||
```
|
||||
|
||||
Where:
|
||||
- ln(L) = log-likelihood
|
||||
- k = number of parameters
|
||||
|
||||
**Lower AIC = better model**
|
||||
|
||||
### AICc (Corrected AIC)
|
||||
|
||||
Used when the sample size (n) is small relative to the number of parameters (k):
|
||||
|
||||
```
|
||||
AICc = AIC + (2k² + 2k)/(n - k - 1)
|
||||
```
|
||||
|
||||
### AIC Weights
|
||||
|
||||
Relative support for model i within the candidate set (the weights sum to 1):
|
||||
|
||||
```
|
||||
w_i = exp(-0.5 × Δ_i) / Σ exp(-0.5 × Δ_j)
|
||||
```
|
||||
|
||||
Where Δ_i = AIC_i - AIC_min
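
The three formulas above can be checked by hand with a few lines of Python. This is a minimal sketch rather than BioGeoBEARS code: the function names and the log-likelihood values are hypothetical, chosen only to show how a difference in AIC translates into weights.

```python
import math

def aic(lnl, k):
    """AIC = -2 ln(L) + 2k."""
    return -2.0 * lnl + 2.0 * k

def aicc(lnl, k, n):
    """AICc = AIC + (2k^2 + 2k) / (n - k - 1); n is the sample size."""
    return aic(lnl, k) + (2.0 * k**2 + 2.0 * k) / (n - k - 1)

def aic_weights(aic_values):
    """w_i = exp(-0.5 * delta_i) / sum_j exp(-0.5 * delta_j)."""
    best = min(aic_values)
    rel = [math.exp(-0.5 * (a - best)) for a in aic_values]
    total = sum(rel)
    return [r / total for r in rel]

# Hypothetical log-likelihoods and parameter counts, for illustration only
fits = {"DEC": (-40.0, 2), "DEC+J": (-35.0, 3)}
aics = [aic(lnl, k) for lnl, k in fits.values()]
weights = aic_weights(aics)
for name, a, w in zip(fits, aics, weights):
    print(f"{name}: AIC={a:.1f}, weight={w:.3f}")
```

Subtracting the minimum AIC before exponentiating keeps the calculation numerically stable when AIC values are large.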
|
||||
|
||||
### Likelihood Ratio Test (LRT)
|
||||
|
||||
For nested models (e.g., DEC vs DEC+J):
|
||||
|
||||
```
|
||||
LRT = 2 × (ln(L_complex) - ln(L_simple))
|
||||
```
|
||||
|
||||
- Test statistic follows χ² distribution
|
||||
- df = difference in number of parameters
|
||||
- p < 0.05 suggests complex model significantly better
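
For the common case of one extra parameter (for example DEC vs DEC+J, df = 1), the p-value can be computed without any statistics library, because the χ² survival function with 1 degree of freedom equals `erfc(sqrt(x / 2))`. The sketch below uses a hypothetical function name and made-up log-likelihoods.

```python
import math

def lrt_pvalue_df1(lnl_simple, lnl_complex):
    """Return (statistic, p-value) for nested models differing by one parameter."""
    stat = 2.0 * (lnl_complex - lnl_simple)
    stat = max(stat, 0.0)  # guard against tiny negative values from optimizer noise
    # Survival function of chi-square with 1 df: P(X > x) = erfc(sqrt(x / 2))
    p = math.erfc(math.sqrt(stat / 2.0))
    return stat, p

# Hypothetical log-likelihoods, for illustration only
stat, p = lrt_pvalue_df1(lnl_simple=-40.0, lnl_complex=-35.0)
print(f"LRT = {stat:.2f}, p = {p:.4f}")
```

For comparisons with more than one extra parameter, a general χ² implementation (such as `pchisq` in R) is needed instead.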
|
||||
|
||||
## Input File Formats
|
||||
|
||||
### Phylogenetic Tree (Newick format)
|
||||
|
||||
Standard Newick format with:
|
||||
- Branch lengths required
|
||||
- Tip labels must match geography file
|
||||
- Should be rooted and ultrametric (for time-stratified analyses)
|
||||
|
||||
Example:
|
||||
```
|
||||
((A:1.0,B:1.0):0.5,C:1.5);
|
||||
```
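
Because tip labels must match the geography file, it can help to list them before running anything. The following is a rough sketch under a simplifying assumption: labels are unquoted and contain no parentheses, commas, colons, or whitespace. The function name and the geography species set are hypothetical; for real data a proper Newick parser (for example `ape::read.tree` in R) is the safer choice.

```python
import re

def newick_tip_labels(newick):
    """Return tip labels from an unquoted Newick string (simplifying assumption)."""
    # A tip label starts right after "(" or "," and runs until ":", ",", ")" or ";"
    return re.findall(r'[(,]\s*([^\s():,;]+)', newick)

tips = newick_tip_labels("((A:1.0,B:1.0):0.5,C:1.5);")
geography_species = {"A", "B", "D"}  # hypothetical names from a geography file
print("Tips:", tips)
print("In tree only:", sorted(set(tips) - geography_species))
print("In geography file only:", sorted(geography_species - set(tips)))
```

Internal node labels follow a closing parenthesis, so the pattern skips them.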
|
||||
|
||||
### Geography File (PHYLIP-like format)
|
||||
|
||||
**Format structure:**
|
||||
```
|
||||
n_species [TAB] n_areas [TAB] (area1 area2 area3 ...)
|
||||
species1 [TAB] 011
|
||||
species2 [TAB] 110
|
||||
species3 [TAB] 001
|
||||
```
|
||||
|
||||
**Important formatting rules:**
|
||||
|
||||
1. **Line 1 (Header)**:
|
||||
- Number of species (integer)
|
||||
- TAB character
|
||||
- Number of areas (integer)
|
||||
- TAB character
|
||||
- Area names in parentheses, separated by spaces
|
||||
|
||||
2. **Subsequent lines (Species data)**:
|
||||
- Species name (must match tree tip label)
|
||||
- TAB character
|
||||
- Binary presence/absence code (1=present, 0=absent)
|
||||
- NO SPACES in the binary code
|
||||
- NO SPACES in species names (use underscores)
|
||||
|
||||
3. **Common errors to avoid**:
|
||||
- Using spaces instead of tabs
|
||||
- Spaces within binary codes
|
||||
- Species names with spaces
|
||||
- Mismatch between species names in tree and geography file
|
||||
- Wrong number of digits in binary code
|
||||
|
||||
**Example file:**
|
||||
```
|
||||
5 3 (A B C)
|
||||
Sp_alpha 011
|
||||
Sp_beta 010
|
||||
Sp_gamma 111
|
||||
Sp_delta 100
|
||||
Sp_epsilon 001
|
||||
```
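
The formatting rules above are easy to get wrong by hand, so generating the file from structured data is often safer. This is a minimal sketch; `format_geography` is a hypothetical helper, not part of BioGeoBEARS or of the validation script in this skill.

```python
def format_geography(areas, presence):
    """Build geography-file text from {species: set of occupied areas}."""
    lines = [f"{len(presence)}\t{len(areas)}\t({' '.join(areas)})"]
    for species, occupied in presence.items():
        unknown = set(occupied) - set(areas)
        if unknown:
            raise ValueError(f"{species}: unknown areas {sorted(unknown)}")
        name = species.replace(" ", "_")  # no spaces allowed in species names
        code = "".join("1" if a in occupied else "0" for a in areas)
        lines.append(f"{name}\t{code}")
    return "\n".join(lines) + "\n"

text = format_geography(
    ["A", "B", "C"],
    {"Sp alpha": {"B", "C"}, "Sp_beta": {"B"}, "Sp_gamma": {"A", "B", "C"}},
)
print(text, end="")
```

Building each binary code by iterating over `areas` guarantees that every code has exactly one digit per area, in header order.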
|
||||
|
||||
## Key Parameters and Settings
|
||||
|
||||
### max_range_size
|
||||
|
||||
Maximum number of areas a species can occupy simultaneously.
|
||||
|
||||
- **Default**: Often set to number of areas, or number of areas - 1
|
||||
- **Impact**: Larger values = more possible states = longer computation
|
||||
- **Recommendation**: Set based on biological realism
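
To see why `max_range_size` matters, count the possible ranges: every non-empty subset of areas up to the maximum size, plus the null range if it is included. The sketch below uses a hypothetical function name and is meant to mirror what `numstates_from_numareas()` reports in the analysis template.

```python
from math import comb

def num_states(num_areas, max_range_size, include_null_range=True):
    """Count ranges of 1..max_range_size areas, plus the null range if requested."""
    max_size = min(max_range_size, num_areas)
    count = sum(comb(num_areas, k) for k in range(1, max_size + 1))
    return count + (1 if include_null_range else 0)

for n_areas, max_size in [(4, 4), (8, 8), (8, 3), (10, 10), (10, 3)]:
    print(n_areas, max_size, num_states(n_areas, max_size))
```

With no cap, the count doubles with every added area; capping the range size at 3 keeps a 10-area analysis far smaller.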
|
||||
|
||||
### include_null_range
|
||||
|
||||
Whether to include the "null range" (species extinct everywhere).
|
||||
|
||||
- **Default**: TRUE
|
||||
- **Purpose**: Allows extinction along branches
|
||||
- **Recommendation**: Usually keep TRUE
|
||||
|
||||
### force_sparse
|
||||
|
||||
Use sparse matrix operations for speed.
|
||||
|
||||
- **Default**: FALSE
|
||||
- **When to use**: Large state spaces (many areas)
|
||||
- **Note**: May cause numerical issues
|
||||
|
||||
### speedup
|
||||
|
||||
Various speedup options.
|
||||
|
||||
- **Default**: TRUE
|
||||
- **Recommendation**: Usually keep TRUE
|
||||
|
||||
### use_optimx
|
||||
|
||||
Use optimx for parameter optimization.
|
||||
|
||||
- **Default**: TRUE
|
||||
- **Benefit**: More robust optimization
|
||||
- **Recommendation**: Keep TRUE
|
||||
|
||||
### calc_ancprobs
|
||||
|
||||
Calculate ancestral state probabilities.
|
||||
|
||||
- **Default**: FALSE
|
||||
- **Must set to TRUE** if you want ancestral range estimates
|
||||
- **Impact**: Adds computational time
|
||||
|
||||
## Plotting Functions
|
||||
|
||||
### plot_BioGeoBEARS_results()
|
||||
|
||||
Main function for visualizing results.
|
||||
|
||||
**Key parameters:**
|
||||
|
||||
- `plotwhat`: "pie" (probability distributions) or "text" (ML states)
|
||||
- `tipcex`: Tip label text size
|
||||
- `statecex`: Node state text/pie chart size
|
||||
- `splitcex`: Split state text/pie size (at corners)
|
||||
- `titlecex`: Title text size
|
||||
- `plotsplits`: Show cladogenetic events (TRUE/FALSE)
|
||||
- `include_null_range`: Match analysis setting
|
||||
- `label.offset`: Distance of tip labels from tree
|
||||
- `cornercoords_loc`: Directory with corner coordinate files
|
||||
|
||||
**Color scheme:**
|
||||
|
||||
- Single areas: Bright primary colors
|
||||
- Multi-area ranges: Blended colors
|
||||
- All areas: White
|
||||
- Colors automatically assigned and mixed
|
||||
|
||||
## Biogeographical Stochastic Mapping (BSM)
|
||||
|
||||
Extension of BioGeoBEARS that simulates stochastic histories:
|
||||
|
||||
- Generates multiple possible biogeographic histories
|
||||
- Accounts for uncertainty in ancestral ranges
|
||||
- Allows visualization of range evolution dynamics
|
||||
- More computationally intensive
|
||||
|
||||
Not covered in basic workflow but available in package.
|
||||
|
||||
## Common Analysis Workflow
|
||||
|
||||
1. **Prepare inputs**
|
||||
- Phylogenetic tree (Newick)
|
||||
- Geography file (PHYLIP format)
|
||||
- Validate both files
|
||||
|
||||
2. **Setup analysis**
|
||||
- Define max_range_size
|
||||
- Load tree and geography data
|
||||
- Create state space
|
||||
|
||||
3. **Fit models**
|
||||
- DEC, DIVALIKE, BAYAREALIKE
|
||||
- With and without +J
|
||||
- 6 models total is standard
|
||||
|
||||
4. **Compare models**
|
||||
- AIC/AICc scores
|
||||
- AIC weights
|
||||
- LRT for nested comparisons
|
||||
|
||||
5. **Visualize best model**
|
||||
- Pie charts for probabilities
|
||||
- Text labels for ML states
|
||||
- Annotate with split events
|
||||
|
||||
6. **Interpret results**
|
||||
- Ancestral ranges
|
||||
- Dispersal patterns
|
||||
- Speciation modes (if using +J)
|
||||
|
||||
## Interpretation Guidelines
|
||||
|
||||
### Dispersal rate (d)
|
||||
|
||||
- **High d**: Frequent range expansions
|
||||
- **Low d**: Species mostly stay in current ranges
|
||||
- **Units**: Expected dispersal events per lineage per time unit
|
||||
|
||||
### Extinction rate (e)
|
||||
|
||||
- **High e**: Ranges frequently contract
|
||||
- **Low e**: Stable occupancy once established
|
||||
- **Relative to d**: d/e ratio indicates dispersal vs. contraction tendency
|
||||
|
||||
### Founder-event weight (j)

- **High j**: Jump dispersal important in clade evolution
- **Low j** (but model still better): Minor role but statistically supported
- **j ≈ 0** (in +J model): Founder events not supported; the fit collapses to the base model
|
||||
|
||||
### Model selection insights
|
||||
|
||||
- **DEC favored**: Balanced dispersal, extinction, and vicariance
|
||||
- **DIVALIKE favored**: Vicariance-driven diversification
|
||||
- **BAYAREALIKE favored**: Sympatric speciation and dispersal
|
||||
- **+J improves fit**: Founder-event speciation may be important
|
||||
|
||||
## Computational Considerations
|
||||
|
||||
### Runtime factors
|
||||
|
||||
- **Number of tips**: Polynomial scaling
|
||||
- **Number of areas**: Exponential scaling in state space
|
||||
- **max_range_size**: Major impact (lower values shrink the state space)
|
||||
- **Tree depth**: Linear scaling
|
||||
|
||||
### Memory usage
|
||||
|
||||
- Large trees + many areas can require substantial RAM
|
||||
- Sparse matrices help but have trade-offs
|
||||
|
||||
### Optimization issues
|
||||
|
||||
- Complex likelihood surfaces
|
||||
- Multiple local optima possible
|
||||
- May need multiple optimization runs
|
||||
- Check parameter estimates for sensibility
|
||||
|
||||
## Citations
|
||||
|
||||
**Main BioGeoBEARS reference:**
|
||||
Matzke, N. J. (2013). Probabilistic historical biogeography: new models for founder-event speciation, imperfect detection, and fossils allow improved accuracy and model-testing. *Frontiers of Biogeography*, 5(4), 242-248.
|
||||
|
||||
**LAGRANGE (DEC model origin):**
|
||||
Ree, R. H., & Smith, S. A. (2008). Maximum likelihood inference of geographic range evolution by dispersal, local extinction, and cladogenesis. *Systematic Biology*, 57(1), 4-14.
|
||||
|
||||
**+J parameter discussion:**
|
||||
Ree, R. H., & Sanmartín, I. (2018). Conceptual and statistical problems with the DEC+J model of founder-event speciation and its comparison with DEC via model selection. *Journal of Biogeography*, 45(4), 741-749.
|
||||
|
||||
**Model comparison best practices:**
|
||||
Burnham, K. P., & Anderson, D. R. (2002). *Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach* (2nd ed.). Springer.
|
||||
|
||||
## Further Resources
|
||||
|
||||
- **BioGeoBEARS wiki**: http://phylo.wikidot.com/biogeobears
|
||||
- **GitHub repository**: https://github.com/nmatzke/BioGeoBEARS
|
||||
- **Google Group**: biogeobears@googlegroups.com
|
||||
- **Tutorial scripts**: Available in package `inst/extdata/examples/`
|
||||
404
skills/biogeobears/scripts/biogeobears_analysis_template.Rmd
Normal file
@@ -0,0 +1,404 @@
|
||||
---
title: "BioGeoBEARS Biogeographic Analysis"
author: "Generated by Claude Code"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_float: true
    code_folding: show
    theme: flatly
params:
  tree_file: "tree.nwk"
  geog_file: "geography.data"
  max_range_size: 4
  models: "DEC,DEC+J,DIVALIKE,DIVALIKE+J"
  output_dir: "results"
---
|
||||
|
||||
```{r setup, include=FALSE}
|
||||
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
|
||||
library(BioGeoBEARS)
|
||||
library(ape)
|
||||
library(knitr)
|
||||
library(kableExtra)
|
||||
```
|
||||
|
||||
# Analysis Parameters
|
||||
|
||||
```{r parameters, echo=FALSE}
|
||||
params_df <- data.frame(
|
||||
Parameter = c("Tree file", "Geography file", "Max range size", "Models to test", "Output directory"),
|
||||
Value = c(params$tree_file, params$geog_file, params$max_range_size, params$models, params$output_dir)
|
||||
)
|
||||
|
||||
kable(params_df, caption = "Analysis Parameters") %>%
|
||||
kable_styling(bootstrap_options = c("striped", "hover"))
|
||||
```
|
||||
|
||||
# Input Data
|
||||
|
||||
## Phylogenetic Tree
|
||||
|
||||
```{r load-tree}
|
||||
trfn <- params$tree_file
|
||||
tr <- read.tree(trfn)
|
||||
|
||||
cat(paste("Number of tips:", length(tr$tip.label), "\n"))
|
||||
cat(paste("Tree is rooted:", is.rooted(tr), "\n"))
|
||||
cat(paste("Tree is ultrametric:", is.ultrametric(tr), "\n"))
|
||||
|
||||
# Plot tree
|
||||
plot(tr, cex = 0.6, main = "Input Phylogeny")
|
||||
```
|
||||
|
||||
## Geographic Distribution Data
|
||||
|
||||
```{r load-geography}
|
||||
geogfn <- params$geog_file
|
||||
tipranges <- getranges_from_LagrangePHYLIP(lgdata_fn = geogfn)
|
||||
|
||||
cat(paste("Number of species:", nrow(tipranges@df), "\n"))
|
||||
cat(paste("Number of areas:", ncol(tipranges@df), "\n"))
|
||||
cat(paste("Area names:", paste(names(tipranges@df), collapse = ", "), "\n\n"))
|
||||
|
||||
# Display geography matrix
|
||||
kable(tipranges@df, caption = "Species Distribution Matrix (1 = present, 0 = absent)") %>%
|
||||
kable_styling(bootstrap_options = c("striped", "hover"), font_size = 10) %>%
|
||||
scroll_box(height = "400px")
|
||||
```
|
||||
|
||||
## State Space Setup
|
||||
|
||||
```{r state-space}
|
||||
max_range_size <- params$max_range_size
|
||||
numareas <- ncol(tipranges@df)
|
||||
|
||||
num_states <- numstates_from_numareas(numareas = numareas,
|
||||
maxareas = max_range_size,
|
||||
include_null_range = TRUE)
|
||||
|
||||
cat(paste("Maximum range size:", max_range_size, "\n"))
|
||||
cat(paste("Number of possible states:", num_states, "\n"))
|
||||
```
|
||||
|
||||
# Model Fitting
|
||||
|
||||
```{r setup-output}
|
||||
# Create output directory
|
||||
if (!dir.exists(params$output_dir)) {
|
||||
dir.create(params$output_dir, recursive = TRUE)
|
||||
}
|
||||
|
||||
# Parse models to run
|
||||
models_to_run <- unlist(strsplit(params$models, ","))
|
||||
models_to_run <- trimws(models_to_run)
|
||||
|
||||
cat("Models to fit:\n")
|
||||
for (model in models_to_run) {
|
||||
cat(paste(" -", model, "\n"))
|
||||
}
|
||||
```
|
||||
|
||||
```{r model-fitting, results='hide'}
|
||||
# Storage for results
|
||||
results_list <- list()
|
||||
model_comparison <- data.frame(
|
||||
Model = character(),
|
||||
LnL = numeric(),
|
||||
nParams = integer(),
|
||||
AIC = numeric(),
|
||||
AICc = numeric(),
|
||||
d = numeric(),
|
||||
e = numeric(),
|
||||
j = numeric(),
|
||||
stringsAsFactors = FALSE
|
||||
)
|
||||
|
||||
# Helper function to setup and run a model
|
||||
run_biogeobears_model <- function(model_name, BioGeoBEARS_run_object) {
|
||||
cat(paste("\n\nFitting model:", model_name, "\n"))
|
||||
|
||||
  # Configure cladogenesis based on the model name. DEC needs no changes:
  # define_BioGeoBEARS_run() already defaults to DEC (d and e free; y, s, v linked).
  pt <- BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table
  if (grepl("DIVALIKE", model_name)) {
    # DIVALIKE: no subset sympatry; widespread vicariance allowed
    pt["s", "type"] <- "fixed"
    pt["s", c("init", "est")] <- 0.0
    pt["ysv", "type"] <- "2-j"
    pt[c("ys", "y", "v"), "type"] <- "ysv*1/2"
    pt["mx01v", "type"] <- "fixed"
    pt["mx01v", c("init", "est")] <- 0.5
    pt["j", "max"] <- 1.99999  # upper bound on j implied by ysv = 2-j
  } else if (grepl("BAYAREALIKE", model_name)) {
    # BAYAREALIKE: no subset sympatry and no vicariance; exact range copying only
    pt[c("s", "v"), "type"] <- "fixed"
    pt[c("s", "v"), c("init", "est")] <- 0.0
    pt["ysv", "type"] <- "1-j"
    pt["ys", "type"] <- "ysv*1/1"
    pt["y", "type"] <- "1-j"
    pt["mx01y", "type"] <- "fixed"
    pt["mx01y", c("init", "est")] <- 0.9999
    pt["j", "max"] <- 0.99999  # upper bound on j implied by ysv = 1-j
  }
  BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table <- pt
|
||||
|
||||
# Add +J parameter if specified
|
||||
if (grepl("\\+J", model_name)) {
|
||||
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","type"] = "free"
|
||||
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","init"] = 0.01
|
||||
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","est"] = 0.01
|
||||
} else {
|
||||
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","type"] = "fixed"
|
||||
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","init"] = 0.0
|
||||
BioGeoBEARS_run_object$BioGeoBEARS_model_object@params_table["j","est"] = 0.0
|
||||
}
|
||||
|
||||
# Run optimization
|
||||
res <- bears_optim_run(BioGeoBEARS_run_object)
|
||||
|
||||
return(res)
|
||||
}
|
||||
|
||||
# Base run object setup
|
||||
BioGeoBEARS_run_object <- define_BioGeoBEARS_run()
|
||||
BioGeoBEARS_run_object$trfn <- trfn
|
||||
BioGeoBEARS_run_object$geogfn <- geogfn
|
||||
BioGeoBEARS_run_object$max_range_size <- max_range_size
|
||||
BioGeoBEARS_run_object$min_branchlength <- 0.000001
|
||||
BioGeoBEARS_run_object$include_null_range <- TRUE
|
||||
BioGeoBEARS_run_object$force_sparse <- FALSE
|
||||
BioGeoBEARS_run_object$speedup <- TRUE
|
||||
BioGeoBEARS_run_object$use_optimx <- TRUE
|
||||
BioGeoBEARS_run_object$calc_ancprobs <- TRUE
|
||||
BioGeoBEARS_run_object <- readfiles_BioGeoBEARS_run(BioGeoBEARS_run_object)
|
||||
# Sanity-check the run object before fitting (calc_loglike_sp() does not take a run object)
check_BioGeoBEARS_run(BioGeoBEARS_run_object)
|
||||
|
||||
# Fit each model
|
||||
for (model in models_to_run) {
|
||||
tryCatch({
|
||||
res <- run_biogeobears_model(model, BioGeoBEARS_run_object)
|
||||
results_list[[model]] <- res
|
||||
|
||||
# Save result
|
||||
save(res, file = file.path(params$output_dir, paste0(model, "_result.Rdata")))
|
||||
|
||||
    # Extract parameters for comparison
    params_table <- res$outputs@params_table
    lnl <- get_LnL_from_BioGeoBEARS_results_object(res)
    k <- sum(params_table$type == "free")  # number of free parameters
    n <- length(tr$tip.label)              # sample size for AICc: number of tips
    aic <- -2 * lnl + 2 * k
    model_comparison <- rbind(model_comparison, data.frame(
      Model = model,
      LnL = lnl,
      nParams = k,
      AIC = aic,
      AICc = aic + (2 * k^2 + 2 * k) / (n - k - 1),
      d = params_table["d", "est"],
      e = params_table["e", "est"],
      j = params_table["j", "est"],
      stringsAsFactors = FALSE
    ))
|
||||
}, error = function(e) {
|
||||
cat(paste("Error fitting model", model, ":", e$message, "\n"))
|
||||
})
|
||||
}
|
||||
```
|
||||
|
||||
# Model Comparison
|
||||
|
||||
```{r model-comparison}
|
||||
# Calculate AIC weights
|
||||
if (nrow(model_comparison) > 0) {
|
||||
model_comparison$delta_AIC <- model_comparison$AIC - min(model_comparison$AIC)
|
||||
model_comparison$AIC_weight <- exp(-0.5 * model_comparison$delta_AIC) /
|
||||
sum(exp(-0.5 * model_comparison$delta_AIC))
|
||||
|
||||
# Sort by AIC
|
||||
model_comparison <- model_comparison[order(model_comparison$AIC), ]
|
||||
|
||||
  # Model selection summary
  best_model <- model_comparison$Model[1]
  cat(paste("\n\nBest model by AIC:", best_model, "\n"))
  cat(paste("AIC weight:", round(model_comparison$AIC_weight[1], 3), "\n"))

  # Render the table last: knitr only auto-prints the final value of an if block
  kable(model_comparison, digits = 3,
        caption = "Model Comparison (sorted by AIC)") %>%
    kable_styling(bootstrap_options = c("striped", "hover")) %>%
    row_spec(1, bold = TRUE, background = "#d4edda")  # Highlight best model
|
||||
}
|
||||
```
|
||||
|
||||
# Ancestral Range Reconstruction
|
||||
|
||||
## Best Model: `r if(exists('best_model')) best_model else 'TBD'`
|
||||
|
||||
```{r plot-best-model, fig.width=10, fig.height=12}
|
||||
if (exists('best_model') && best_model %in% names(results_list)) {
|
||||
res_best <- results_list[[best_model]]
|
||||
|
||||
# Create plots directory
|
||||
plots_dir <- file.path(params$output_dir, "plots")
|
||||
if (!dir.exists(plots_dir)) {
|
||||
dir.create(plots_dir, recursive = TRUE)
|
||||
}
|
||||
|
||||
# Plot with pie charts
|
||||
pdf(file.path(plots_dir, paste0(best_model, "_pie.pdf")), width = 10, height = 12)
|
||||
|
||||
analysis_titletxt <- paste("BioGeoBEARS:", best_model)
|
||||
|
||||
plot_BioGeoBEARS_results(
|
||||
results_object = res_best,
|
||||
analysis_titletxt = analysis_titletxt,
|
||||
addl_params = list("j"),
|
||||
plotwhat = "pie",
|
||||
label.offset = 0.5,
|
||||
tipcex = 0.7,
|
||||
statecex = 0.7,
|
||||
splitcex = 0.6,
|
||||
titlecex = 0.8,
|
||||
plotsplits = TRUE,
|
||||
include_null_range = TRUE,
|
||||
tr = tr,
|
||||
tipranges = tipranges
|
||||
)
|
||||
|
||||
dev.off()
|
||||
|
||||
# Also create text plot
|
||||
pdf(file.path(plots_dir, paste0(best_model, "_text.pdf")), width = 10, height = 12)
|
||||
|
||||
plot_BioGeoBEARS_results(
|
||||
results_object = res_best,
|
||||
analysis_titletxt = analysis_titletxt,
|
||||
addl_params = list("j"),
|
||||
plotwhat = "text",
|
||||
label.offset = 0.5,
|
||||
tipcex = 0.7,
|
||||
statecex = 0.7,
|
||||
splitcex = 0.6,
|
||||
titlecex = 0.8,
|
||||
plotsplits = TRUE,
|
||||
include_null_range = TRUE,
|
||||
tr = tr,
|
||||
tipranges = tipranges
|
||||
)
|
||||
|
||||
dev.off()
|
||||
|
||||
# Display in notebook (pie chart version)
|
||||
plot_BioGeoBEARS_results(
|
||||
results_object = res_best,
|
||||
analysis_titletxt = analysis_titletxt,
|
||||
addl_params = list("j"),
|
||||
plotwhat = "pie",
|
||||
label.offset = 0.5,
|
||||
tipcex = 0.7,
|
||||
statecex = 0.7,
|
||||
splitcex = 0.6,
|
||||
titlecex = 0.8,
|
||||
plotsplits = TRUE,
|
||||
include_null_range = TRUE,
|
||||
tr = tr,
|
||||
tipranges = tipranges
|
||||
)
|
||||
|
||||
cat(paste("\n\nPlots saved to:", plots_dir, "\n"))
|
||||
}
|
||||
```
|
||||
|
||||
# Parameter Estimates
|
||||
|
||||
```{r parameter-estimates, fig.width=10, fig.height=6}
|
||||
if (nrow(model_comparison) > 0) {
|
||||
# Extract base models (without +J)
|
||||
base_models <- model_comparison[!grepl("\\+J", model_comparison$Model), ]
|
||||
j_models <- model_comparison[grepl("\\+J", model_comparison$Model), ]
|
||||
|
||||
par(mfrow = c(1, 3))
|
||||
|
||||
# Plot d (dispersal) estimates
|
||||
barplot(model_comparison$d, names.arg = model_comparison$Model,
|
||||
main = "Dispersal Rate (d)", ylab = "Rate", las = 2, cex.names = 0.8,
|
||||
col = ifelse(model_comparison$Model == best_model, "darkgreen", "lightblue"))
|
||||
|
||||
# Plot e (extinction) estimates
|
||||
barplot(model_comparison$e, names.arg = model_comparison$Model,
|
||||
main = "Extinction Rate (e)", ylab = "Rate", las = 2, cex.names = 0.8,
|
||||
col = ifelse(model_comparison$Model == best_model, "darkgreen", "lightblue"))
|
||||
|
||||
# Plot j (founder-event) estimates for +J models
|
||||
j_vals <- model_comparison$j
|
||||
j_vals[j_vals == 0] <- NA
|
||||
barplot(j_vals, names.arg = model_comparison$Model,
|
||||
main = "Founder-event Rate (j)", ylab = "Rate", las = 2, cex.names = 0.8,
|
||||
col = ifelse(model_comparison$Model == best_model, "darkgreen", "lightblue"))
|
||||
}
|
||||
```
|
||||
|
||||
# Likelihood Ratio Tests
|
||||
|
||||
```{r lrt-tests}
|
||||
# Compare models with and without +J
|
||||
if (nrow(model_comparison) > 0) {
|
||||
lrt_results <- data.frame(
|
||||
Comparison = character(),
|
||||
Model1 = character(),
|
||||
Model2 = character(),
|
||||
LRT_statistic = numeric(),
|
||||
df = integer(),
|
||||
p_value = numeric(),
|
||||
stringsAsFactors = FALSE
|
||||
)
|
||||
|
||||
base_model_names <- c("DEC", "DIVALIKE", "BAYAREALIKE")
|
||||
|
||||
for (base in base_model_names) {
|
||||
j_model <- paste0(base, "+J")
|
||||
|
||||
if (base %in% model_comparison$Model && j_model %in% model_comparison$Model) {
|
||||
lnl_base <- model_comparison[model_comparison$Model == base, "LnL"]
|
||||
lnl_j <- model_comparison[model_comparison$Model == j_model, "LnL"]
|
||||
|
||||
lrt_stat <- 2 * (lnl_j - lnl_base)
|
||||
df <- 1 # One additional parameter (j)
|
||||
p_val <- pchisq(lrt_stat, df = df, lower.tail = FALSE)
|
||||
|
||||
lrt_results <- rbind(lrt_results, data.frame(
|
||||
Comparison = paste(base, "vs", j_model),
|
||||
Model1 = base,
|
||||
Model2 = j_model,
|
||||
LRT_statistic = lrt_stat,
|
||||
df = df,
|
||||
p_value = p_val,
|
||||
stringsAsFactors = FALSE
|
||||
))
|
||||
}
|
||||
}
|
||||
|
||||
if (nrow(lrt_results) > 0) {
|
||||
    lrt_results$Significant <- ifelse(lrt_results$p_value < 0.05, "Yes*", "No")

    cat("\n* p < 0.05 indicates significant improvement with +J parameter\n")

    # Render the table last: knitr only auto-prints the final value of an if block
    kable(lrt_results, digits = 4,
          caption = "Likelihood Ratio Tests (nested model comparisons)") %>%
      kable_styling(bootstrap_options = c("striped", "hover"))
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
# Session Info
|
||||
|
||||
```{r session-info}
|
||||
sessionInfo()
|
||||
```
|
||||
|
||||
# Outputs
|
||||
|
||||
All results have been saved to: **`r params$output_dir`**
|
||||
|
||||
Files generated:
|
||||
|
||||
- `[MODEL]_result.Rdata` - R data files with complete model results
|
||||
- `plots/[MODEL]_pie.pdf` - Phylogeny with pie charts showing ancestral range probabilities
|
||||
- `plots/[MODEL]_text.pdf` - Phylogeny with text labels showing most likely ancestral ranges
|
||||
- `biogeobears_analysis_template.html` - This HTML report
|
||||
|
||||
To load a saved result in R:
|
||||
```r
|
||||
load("results/DEC+J_result.Rdata")
|
||||
```
|
||||
299
skills/biogeobears/scripts/validate_geography_file.py
Executable file
@@ -0,0 +1,299 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
Validates and optionally reformats a BioGeoBEARS geography file.
|
||||
|
||||
Geography files must follow the PHYLIP-like format:
|
||||
Line 1: n_species [TAB] n_areas [TAB] (area1 area2 area3 ...)
|
||||
Lines 2+: species_name [TAB] binary_string (e.g., 011 for absent in area1, present in area2 and area3)
|
||||
|
||||
Common errors:
|
||||
- Spaces instead of tabs
|
||||
- Spaces in species names
|
||||
- Spaces within binary strings
|
||||
- Species names not matching tree tip labels
|
||||
"""
|
||||
|
||||
import sys
|
||||
import argparse
|
||||
import re
|
||||
from pathlib import Path
|
||||
|
||||
|
||||
def validate_geography_file(filepath, tree_tips=None):
    """
    Validate geography file format.

    Args:
        filepath: Path to geography file
        tree_tips: Optional set of tree tip labels to validate against

    Returns:
        dict with validation results and any errors/warnings
    """
    errors = []
    warnings = []
    info = {}

    with open(filepath, 'r') as f:
        lines = [line.rstrip('\n\r') for line in f.readlines()]

    if not lines:
        errors.append("File is empty")
        return {'valid': False, 'errors': errors, 'warnings': warnings, 'info': info}

    # Parse header line
    header = lines[0]
    if '\t' not in header:
        errors.append("Line 1: Missing tab delimiter (should be: n_species [TAB] n_areas [TAB] (area_names))")
    else:
        parts = header.split('\t')
        if len(parts) < 3:
            errors.append("Line 1: Expected format 'n_species [TAB] n_areas [TAB] (area_names)'")
        else:
            try:
                n_species = int(parts[0])
                n_areas = int(parts[1])

                # Parse area names
                area_part = parts[2].strip()
                areas = []  # stays empty if the area list is malformed
                if not (area_part.startswith('(') and area_part.endswith(')')):
                    errors.append("Line 1: Area names should be in parentheses: (A B C)")
                else:
                    areas = area_part[1:-1].split()
                    if len(areas) != n_areas:
                        errors.append(f"Line 1: Declared {n_areas} areas but found {len(areas)} area names")

                info['n_species'] = n_species
                info['n_areas'] = n_areas
                info['areas'] = areas

                # Validate species lines
                species_found = []
                for i, line in enumerate(lines[1:], start=2):
                    if not line.strip():
                        continue

                    if '\t' not in line:
                        errors.append(f"Line {i}: Missing tab between species name and binary code")
                        continue

                    parts = line.split('\t')
                    if len(parts) != 2:
                        errors.append(f"Line {i}: Expected exactly one tab between species name and binary code")
                        continue

                    species_name = parts[0]
                    binary_code = parts[1]

                    # Check for spaces in species name
                    if ' ' in species_name:
                        errors.append(f"Line {i}: Species name '{species_name}' contains spaces (use underscores instead)")

                    # Check for spaces in binary code
                    if ' ' in binary_code or '\t' in binary_code:
                        errors.append(f"Line {i}: Binary code '{binary_code}' contains spaces or tabs (should be like '011' with no spaces)")

                    # Check binary code length
                    if len(binary_code) != n_areas:
                        errors.append(f"Line {i}: Binary code length ({len(binary_code)}) doesn't match number of areas ({n_areas})")

                    # Check binary code characters
                    if not all(c in '01' for c in binary_code):
                        errors.append(f"Line {i}: Binary code contains invalid characters (only 0 and 1 allowed)")

                    species_found.append(species_name)

                # Check species count
                if len(species_found) != n_species:
                    warnings.append(f"Header declares {n_species} species but found {len(species_found)} data lines")

                info['species'] = species_found

                # Check against tree tips if provided
                if tree_tips:
                    species_set = set(species_found)
                    tree_set = set(tree_tips)

                    missing_in_tree = species_set - tree_set
                    missing_in_geog = tree_set - species_set

                    if missing_in_tree:
                        errors.append(f"Species in geography file but not in tree: {', '.join(sorted(missing_in_tree))}")
                    if missing_in_geog:
                        errors.append(f"Species in tree but not in geography file: {', '.join(sorted(missing_in_geog))}")

            except ValueError:
                errors.append("Line 1: First two fields must be integers (n_species and n_areas)")

    return {
        'valid': len(errors) == 0,
        'errors': errors,
        'warnings': warnings,
        'info': info
    }
|
||||
|
||||
|
||||
def reformat_geography_file(input_path, output_path, delimiter=','):
    """
    Attempt to reformat a geography file from common formats.

    Args:
        input_path: Path to input file
        output_path: Path for output file
        delimiter: Delimiter used in input file (default: comma)
    """
    with open(input_path, 'r') as f:
        # Drop blank lines so the header check below never sees an empty string
        lines = [line.strip() for line in f if line.strip()]

    if not lines:
        raise ValueError(f"Input file is empty: {input_path}")

    # Detect if first line is a header
    header_line = lines[0]
    has_header = not header_line[0].isdigit()

    if has_header:
        # Parse area names from header
        parts = header_line.split(delimiter)
        species_col = parts[0]
        area_names = [p.strip() for p in parts[1:]]
        data_lines = lines[1:]
    else:
        # No header, infer from first data line
        parts = lines[0].split(delimiter)
        n_areas = len(parts) - 1
        area_names = [chr(65 + i) for i in range(n_areas)]  # A, B, C, ...
        data_lines = lines

    # Parse species data
    species_data = []
    for line in data_lines:
        parts = line.split(delimiter)
        if len(parts) < 2:
            continue

        species_name = parts[0].strip().replace(' ', '_')
        presence = ''.join(['1' if p.strip() in ['1', 'present', 'Present', 'TRUE', 'True'] else '0'
                            for p in parts[1:]])
        species_data.append((species_name, presence))

    # Write output
    with open(output_path, 'w') as f:
        # Header line
        n_species = len(species_data)
        n_areas = len(area_names)
        f.write(f"{n_species}\t{n_areas}\t({' '.join(area_names)})\n")

        # Species lines
        for species_name, binary_code in species_data:
            f.write(f"{species_name}\t{binary_code}\n")

    print(f"Reformatted {n_species} species across {n_areas} areas")
    print(f"Output written to: {output_path}")
|
||||
|
||||
|
||||
def main():
    parser = argparse.ArgumentParser(
        description='Validate and reformat BioGeoBEARS geography files',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Validate a geography file
  python validate_geography_file.py input.txt --validate

  # Reformat from CSV to PHYLIP format
  python validate_geography_file.py input.csv --reformat -o output.data

  # Reformat with tab delimiter
  python validate_geography_file.py input.txt --reformat --delimiter tab -o output.data
"""
    )

    parser.add_argument('input', help='Input geography file')
    parser.add_argument('--validate', action='store_true',
                        help='Validate the file format')
    parser.add_argument('--reformat', action='store_true',
                        help='Reformat file to BioGeoBEARS format')
    parser.add_argument('-o', '--output',
                        help='Output file path (required for --reformat)')
    parser.add_argument('--delimiter', default=',',
                        help='Delimiter in input file (default: comma). Use "tab" for tab-delimited.')
    parser.add_argument('--tree',
                        help='Newick tree file to validate species names against')

    args = parser.parse_args()

    if args.delimiter.lower() == 'tab':
        args.delimiter = '\t'

    # Parse tree tips if provided
    tree_tips = None
    if args.tree:
        try:
            with open(args.tree, 'r') as f:
                tree_string = f.read().strip()
            # Extract tip labels using regex. Tips directly follow '(' or ',';
            # labels that follow ')' are internal node labels or support values
            # and are skipped. Works with or without branch lengths.
            tree_tips = re.findall(r'[(,]\s*([^(),:;\s]+)(?=\s*[:,)])', tree_string)
            print(f"Found {len(tree_tips)} tips in tree file")
        except Exception as e:
            print(f"Warning: Could not parse tree file: {e}")

    if args.validate:
        result = validate_geography_file(args.input, tree_tips)

        print(f"\nValidation Results for: {args.input}")
        print("=" * 60)

        if result['info']:
            print("\nFile Info:")
            print(f"  Species: {result['info'].get('n_species', 'unknown')}")
            print(f"  Areas: {result['info'].get('n_areas', 'unknown')}")
            if 'areas' in result['info']:
                print(f"  Area names: {', '.join(result['info']['areas'])}")

        if result['warnings']:
            print(f"\nWarnings ({len(result['warnings'])}):")
            for warning in result['warnings']:
                print(f"  ⚠️  {warning}")

        if result['errors']:
            print(f"\nErrors ({len(result['errors'])}):")
            for error in result['errors']:
                print(f"  ❌ {error}")
        else:
            print("\n✅ File is valid!")

        return 0 if result['valid'] else 1

    elif args.reformat:
        if not args.output:
            print("Error: --output required when using --reformat")
            return 1

        try:
            reformat_geography_file(args.input, args.output, args.delimiter)

            # Validate reformatted file
            result = validate_geography_file(args.output, tree_tips)
            if result['valid']:
                print("✅ Reformatted file is valid!")
            else:
                print("\n⚠️  Reformatted file has validation errors:")
                for error in result['errors']:
                    print(f"  ❌ {error}")
                return 1

        except Exception as e:
            print(f"Error during reformatting: {e}")
            return 1

    else:
        parser.print_help()
        return 1

    return 0


if __name__ == '__main__':
    sys.exit(main())