Initial commit
358
skills/biogeobears/references/biogeobears_details.md
Normal file
@@ -0,0 +1,358 @@
# BioGeoBEARS Detailed Reference

## Overview

BioGeoBEARS (BioGeography with Bayesian and Likelihood Evolutionary Analysis in R Scripts) is an R package for probabilistic inference of historical biogeography on phylogenetic trees. It implements several models of range evolution and supports statistical comparison among them.

## Installation

```r
# Install dependencies from CRAN
install.packages("rexpokit")
install.packages("cladoRcpp")

# Install BioGeoBEARS from GitHub (requires the devtools package)
install.packages("devtools")
devtools::install_github(repo="nmatzke/BioGeoBEARS")
```

## Biogeographic Models

BioGeoBEARS implements several models that differ in their assumptions about how species ranges evolve:

### DEC (Dispersal-Extinction-Cladogenesis)

The DEC model is based on LAGRANGE and includes:

- **Anagenetic changes** (along branches):
  - `d` (dispersal): Rate of range expansion into an additional area
  - `e` (extinction): Rate of local extinction in an area

- **Cladogenetic events** (at speciation nodes):
  - Vicariance: Ancestral range splits between daughter lineages, with one daughter inheriting a single area
  - Subset sympatry: One daughter inherits the full range, the other a single-area subset
  - Range copying (narrow sympatry): Both daughters inherit the ancestral range, which in DEC applies only when that range is a single area

**Parameters**: 2 (d, e)
**Best for**: General-purpose biogeographic inference

### DIVALIKE (Vicariance-focused)

A likelihood version of DIVA (Dispersal-Vicariance Analysis):

- Emphasizes vicariance at speciation events
- Fixes subset sympatry probability to 0
- Allows only vicariance (including splits where both daughters are widespread) and single-area range copying at nodes

**Parameters**: 2 (d, e)
**Best for**: Systems where vicariance is the primary speciation mode

### BAYAREALIKE (Sympatry-focused)

A likelihood version of the BayArea model:

- Emphasizes sympatric speciation
- Fixes vicariance and subset sympatry probabilities to 0
- Allows only range copying at nodes: both daughters inherit the full ancestral range, even when it is widespread

**Parameters**: 2 (d, e)
**Best for**: Systems where dispersal and sympatric speciation dominate

### +J Extension (Founder-event speciation)

Any of the above models can be extended with a "+J" parameter:

- **j**: Relative weight of jump dispersal (founder-event speciation) at cladogenesis
- Allows one daughter lineage to disperse instantaneously to a new area at speciation
- Often significantly improves model fit
- Remains controversial: critics argue that DEC+J has conceptual and statistical problems (Ree & Sanmartín, 2018)

**Examples**: DEC+J, DIVALIKE+J, BAYAREALIKE+J
**Additional parameters**: +1 (j)

## Model Comparison

### AIC (Akaike Information Criterion)

```
AIC = -2 × ln(L) + 2k
```

Where:
- ln(L) = log-likelihood
- k = number of parameters

**Lower AIC = better model**
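
The AIC formula can be checked with a few lines of base R. This is a minimal sketch: the helper name `aic` and the log-likelihood value are illustrative assumptions, not part of BioGeoBEARS.

```r
# AIC from a log-likelihood and a parameter count.
# `aic` is a hypothetical helper name, not a BioGeoBEARS function.
aic <- function(lnL, k) {
  -2 * lnL + 2 * k
}

# Made-up log-likelihood for a 2-parameter model such as DEC
aic_dec <- aic(lnL = -40.0, k = 2)
print(aic_dec)
```

Adding a third parameter (for example `j`) raises the penalty term by 2, so the extra parameter must improve ln(L) by more than 1 unit to lower the AIC.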

### AICc (Corrected AIC)

Used when the sample size is small relative to the number of parameters:

```
AICc = AIC + (2k² + 2k)/(n - k - 1)
```

Where n = sample size (in biogeographic analyses, commonly taken as the number of tips in the tree).
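
A small base-R sketch of the correction term follows. The helper name `aicc` and the input values are assumptions for illustration, and the choice of sample size is left to the analyst.

```r
# AICc = AIC + (2k^2 + 2k) / (n - k - 1)
# `aicc` is a hypothetical helper name, not a BioGeoBEARS function.
aicc <- function(lnL, k, n) {
  # The correction is undefined unless n exceeds k + 1
  stopifnot(n - k - 1 > 0)
  aic_value <- -2 * lnL + 2 * k
  aic_value + (2 * k^2 + 2 * k) / (n - k - 1)
}

# Made-up values: 2 parameters, sample size of 20
aicc_small <- aicc(lnL = -40.0, k = 2, n = 20)
# The same model with a much larger sample size: correction nearly vanishes
aicc_large <- aicc(lnL = -40.0, k = 2, n = 2000)
print(c(aicc_small, aicc_large))
```

As n grows, the correction shrinks toward zero and AICc converges to AIC.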

### AIC Weights

Relative probability that a model is the best among the candidate set:

```
w_i = exp(-0.5 × Δ_i) / Σ_j exp(-0.5 × Δ_j)
```

Where Δ_i = AIC_i - AIC_min
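
The weight calculation can be sketched in base R as below. The helper name `aic_weights` and the AIC scores are made up for illustration and are not results from any real analysis.

```r
# Akaike weights from a named vector of AIC (or AICc) scores.
# `aic_weights` is a hypothetical helper name, not a BioGeoBEARS function.
aic_weights <- function(aic_values) {
  delta <- aic_values - min(aic_values)   # Δ_i = AIC_i - AIC_min
  rel_lik <- exp(-0.5 * delta)            # relative likelihood of each model
  rel_lik / sum(rel_lik)                  # normalize so the weights sum to 1
}

# Illustrative, made-up AIC scores for three models
aic_scores <- c(DEC = 84.0, "DEC+J" = 80.0, DIVALIKE = 86.0)
w <- aic_weights(aic_scores)
print(round(w, 3))
```

Because only the differences Δ_i enter the formula, the weights depend on which models are in the candidate set and should be recomputed whenever a model is added or removed.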

### Likelihood Ratio Test (LRT)

For nested models (e.g., DEC vs DEC+J):

```
LRT = 2 × (ln(L_complex) - ln(L_simple))
```

- Test statistic approximately follows a χ² distribution when the simpler model is true
- df = difference in number of parameters
- p < 0.05 suggests the complex model fits significantly better
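
A minimal base-R sketch of the test, using `pchisq()` from the stats package. The helper name `lrt_test` and the log-likelihoods are illustrative assumptions.

```r
# Likelihood ratio test for two nested models.
# `lrt_test` is a hypothetical helper name, not a BioGeoBEARS function.
lrt_test <- function(lnL_simple, lnL_complex, df = 1) {
  statistic <- 2 * (lnL_complex - lnL_simple)
  # Upper-tail probability of the chi-squared distribution
  p_value <- pchisq(statistic, df = df, lower.tail = FALSE)
  list(statistic = statistic, df = df, p_value = p_value)
}

# Made-up log-likelihoods, e.g. DEC (2 params) vs DEC+J (3 params): df = 1
lrt_result <- lrt_test(lnL_simple = -40.0, lnL_complex = -36.0, df = 1)
print(lrt_result)
```

The test is only meaningful for nested pairs, so DEC vs DEC+J qualifies while DEC vs DIVALIKE does not; for non-nested comparisons use AIC or AICc instead.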

## Input File Formats

### Phylogenetic Tree (Newick format)

Standard Newick format with:
- Branch lengths required
- Tip labels must match geography file
- Should be rooted and ultrametric (for time-stratified analyses)

Example:
```
((A:1.0,B:1.0):0.5,C:1.5);
```
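
To check that tip labels match the geography file, the labels can be pulled out of a simple Newick string in base R. This is a rough sketch under stated assumptions: every tip has a branch length, labels are unquoted, and the helper name `get_tip_labels` is hypothetical. For real trees, a proper parser such as `ape::read.tree()` is the safer choice.

```r
# Extract tip labels from a simple Newick string with a regular expression.
# Assumes unquoted labels and a branch length (":") after every tip.
get_tip_labels <- function(newick) {
  # A tip label follows "(" or "," and is immediately followed by ":"
  matches <- gregexpr("(?<=[(,])[^(),:;]+(?=:)", newick, perl = TRUE)
  regmatches(newick, matches)[[1]]
}

newick <- "((A:1.0,B:1.0):0.5,C:1.5);"
tips <- get_tip_labels(newick)

# Species names as they would appear in the geography file (illustrative)
geog_species <- c("A", "B", "C")

# Both differences should be empty when the two files agree
missing_from_geog <- setdiff(tips, geog_species)
missing_from_tree <- setdiff(geog_species, tips)
print(tips)
```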

### Geography File (PHYLIP-like format)

**Format structure:**
```
n_species [TAB] n_areas [TAB] (area1 area2 area3 ...)
species1 [TAB] 011
species2 [TAB] 110
species3 [TAB] 001
```

**Important formatting rules:**

1. **Line 1 (Header)**:
   - Number of species (integer)
   - TAB character
   - Number of areas (integer)
   - TAB character
   - Area names in parentheses, separated by spaces

2. **Subsequent lines (Species data)**:
   - Species name (must match tree tip label)
   - TAB character
   - Binary presence/absence code (1=present, 0=absent)
   - NO SPACES in the binary code
   - NO SPACES in species names (use underscores)

3. **Common errors to avoid**:
   - Using spaces instead of tabs
   - Spaces within binary codes
   - Species names with spaces
   - Mismatch between species names in tree and geography file
   - Wrong number of digits in binary code

**Example file** (fields are separated by TAB characters, as required above):
```
5	3	(A B C)
Sp_alpha	011
Sp_beta	010
Sp_gamma	111
Sp_delta	100
Sp_epsilon	001
```
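
The formatting rules above can be checked mechanically before running an analysis. The following base-R sketch is one possible validator. The function name `validate_geog_lines` is hypothetical, and it only covers the rules listed in this section, not everything the BioGeoBEARS reader may enforce.

```r
# Check the lines of a geography file against the rules listed above.
# Returns a character vector of problems (empty if none were found).
validate_geog_lines <- function(lines) {
  problems <- character(0)
  header <- strsplit(lines[1], "\t", fixed = TRUE)[[1]]
  if (length(header) != 3) {
    return("Header must contain exactly 3 tab-separated fields")
  }
  n_species <- suppressWarnings(as.integer(header[1]))
  n_areas <- suppressWarnings(as.integer(header[2]))
  if (is.na(n_species) || is.na(n_areas)) {
    return("Header counts must be integers")
  }
  area_names <- strsplit(gsub("[()]", "", header[3]), " +")[[1]]
  if (length(area_names) != n_areas) {
    problems <- c(problems, "Number of area names does not match n_areas")
  }
  rows <- lines[-1]
  rows <- rows[nzchar(rows)]  # ignore blank lines
  if (length(rows) != n_species) {
    problems <- c(problems, "Number of species rows does not match n_species")
  }
  code_pattern <- paste0("^[01]{", n_areas, "}$")
  for (row in rows) {
    fields <- strsplit(row, "\t", fixed = TRUE)[[1]]
    if (length(fields) != 2) {
      problems <- c(problems, paste("Expected name, TAB, code:", row))
      next
    }
    if (grepl(" ", fields[1], fixed = TRUE)) {
      problems <- c(problems, paste("Space in species name:", fields[1]))
    }
    if (!grepl(code_pattern, fields[2])) {
      problems <- c(problems, paste("Bad binary code for", fields[1]))
    }
  }
  problems
}

good <- c("5\t3\t(A B C)", "Sp_alpha\t011", "Sp_beta\t010",
          "Sp_gamma\t111", "Sp_delta\t100", "Sp_epsilon\t001")
bad <- good
bad[3] <- "Sp_beta 010"     # space instead of TAB
bad[4] <- "Sp_gamma\t11"    # wrong number of digits

good_problems <- validate_geog_lines(good)
bad_problems <- validate_geog_lines(bad)
print(bad_problems)
```

For a file on disk, the same function could be applied to `readLines("path/to/geography_file")`, where the path is a placeholder.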

## Key Parameters and Settings

### max_range_size

Maximum number of areas a species can occupy simultaneously.

- **Default**: Often set to number of areas, or number of areas - 1
- **Impact**: Larger values = more possible states = longer computation
- **Recommendation**: Set based on biological realism (at minimum, the largest range observed among the tips)
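
The impact on the state space can be made concrete: the number of possible ranges is the number of ways to choose 1 up to `max_range_size` areas, plus one if the null range is included. The base-R sketch below uses a hypothetical helper name, `count_states`.

```r
# Number of geographic range states for a given number of areas.
# `count_states` is a hypothetical helper name, not a BioGeoBEARS function.
count_states <- function(n_areas, max_range_size = n_areas,
                         include_null_range = TRUE) {
  stopifnot(max_range_size >= 1, max_range_size <= n_areas)
  # Sum of binomial coefficients C(n_areas, 1) ... C(n_areas, max_range_size)
  n_states <- sum(choose(n_areas, 1:max_range_size))
  if (include_null_range) {
    n_states <- n_states + 1
  }
  n_states
}

print(count_states(3))                        # 3 areas, no range limit
print(count_states(10))                       # 10 areas, no range limit
print(count_states(10, max_range_size = 4))   # 10 areas, at most 4 per range
```

With 10 areas, capping ranges at 4 areas cuts the state space from 1024 to 386 states, and the transition matrix (states × states) shrinks by roughly a factor of 7.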

### include_null_range

Whether to include the "null range" (species extinct everywhere).

- **Default**: TRUE
- **Purpose**: Allows extinction along branches
- **Recommendation**: Usually keep TRUE

### force_sparse

Use sparse matrix operations for speed.

- **Default**: FALSE
- **When to use**: Large state spaces (many areas)
- **Note**: May cause numerical issues

### speedup

Enables shortcuts that speed up the maximum-likelihood search.

- **Default**: TRUE
- **Recommendation**: Usually keep TRUE; consider FALSE if the optimization looks unreliable

### use_optimx

Use the `optimx` package for parameter optimization.

- **Default**: TRUE
- **Benefit**: More robust optimization
- **Recommendation**: Keep TRUE

### calc_ancprobs

Calculate ancestral state probabilities.

- **Default**: FALSE
- **Must set to TRUE** if you want ancestral range estimates
- **Impact**: Adds computational time
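
The settings above are fields on the run object returned by `define_BioGeoBEARS_run()`. The sketch below shows one way they might be set together. The wrapper name `setup_run` and its arguments are hypothetical, the file paths are supplied by the caller, and the `j` initialization value of 0.0001 is an assumption rather than a required value.

```r
# Build a BioGeoBEARS run object with the settings discussed above.
# `setup_run` is a hypothetical wrapper, not part of the package.
setup_run <- function(trfn, geogfn, max_range_size, add_j = FALSE) {
  if (!requireNamespace("BioGeoBEARS", quietly = TRUE)) {
    stop("The BioGeoBEARS package is not installed")
  }
  run <- BioGeoBEARS::define_BioGeoBEARS_run()
  run$trfn <- trfn                      # path to the Newick tree file
  run$geogfn <- geogfn                  # path to the geography file
  run$max_range_size <- max_range_size
  run$include_null_range <- TRUE
  run$force_sparse <- FALSE
  run$speedup <- TRUE
  run$use_optimx <- TRUE
  run$calc_ancprobs <- TRUE             # needed for ancestral range estimates
  if (add_j) {
    # Free the j parameter to turn the base model into its +J version
    run$BioGeoBEARS_model_object@params_table["j", "type"] <- "free"
    run$BioGeoBEARS_model_object@params_table["j", "init"] <- 0.0001
    run$BioGeoBEARS_model_object@params_table["j", "est"] <- 0.0001
  }
  run
}

# Typical follow-up steps (not run here; requires library(BioGeoBEARS)):
#   run <- setup_run("tree.newick", "geog.data", max_range_size = 3)
#   run <- readfiles_BioGeoBEARS_run(run)
#   check_BioGeoBEARS_run(run)
#   res <- bears_optim_run(run)
```

Keeping the settings in one function makes it easier to fit the base and +J versions of a model with identical options, which matters when the two fits are later compared.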

## Plotting Functions

### plot_BioGeoBEARS_results()

Main function for visualizing results.

**Key parameters:**

- `plotwhat`: "pie" (probability distributions) or "text" (ML states)
- `tipcex`: Tip label text size
- `statecex`: Node state text/pie chart size
- `splitcex`: Split state text/pie size (at corners)
- `titlecex`: Title text size
- `plotsplits`: Show cladogenetic events (TRUE/FALSE)
- `include_null_range`: Match analysis setting
- `label.offset`: Distance of tip labels from tree
- `cornercoords_loc`: Directory with corner coordinate files

**Color scheme:**

- Single areas: Bright primary colors
- Multi-area ranges: Blended colors
- All areas: White
- Colors automatically assigned and mixed
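
A plotting call using the key parameters listed above might look like the following sketch. The wrapper name `plot_model`, the numeric sizes, and the default `cornercoords_loc` path (the `extdata/a_scripts` directory inside the installed package) are assumptions to adapt, not prescribed values.

```r
# Plot ancestral range estimates from a fitted model.
# `plot_model` is a hypothetical wrapper, not part of the package.
# Load the package with library(BioGeoBEARS) before calling it.
plot_model <- function(results, title, has_j = FALSE, plotwhat = "pie",
                       scriptdir = system.file("extdata/a_scripts",
                                               package = "BioGeoBEARS")) {
  if (!requireNamespace("BioGeoBEARS", quietly = TRUE)) {
    stop("The BioGeoBEARS package is not installed")
  }
  BioGeoBEARS::plot_BioGeoBEARS_results(
    results_object = results,
    analysis_titletxt = title,
    addl_params = if (has_j) list("j") else list(),
    plotwhat = plotwhat,          # "pie" or "text"
    label.offset = 0.45,
    tipcex = 0.7,
    statecex = 0.7,
    splitcex = 0.6,
    titlecex = 0.8,
    plotsplits = TRUE,
    cornercoords_loc = scriptdir,
    include_null_range = TRUE     # must match the analysis setting
  )
}

# Example use (not run here), given a result `res` from bears_optim_run():
#   plot_model(res, "DEC on example data", plotwhat = "text")
```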

## Biogeographical Stochastic Mapping (BSM)

Extension of BioGeoBEARS that simulates stochastic histories:

- Generates multiple possible biogeographic histories
- Accounts for uncertainty in ancestral ranges
- Allows visualization of range evolution dynamics
- More computationally intensive

Not covered in the basic workflow, but available in the package.

## Common Analysis Workflow

1. **Prepare inputs**
   - Phylogenetic tree (Newick)
   - Geography file (PHYLIP-like format)
   - Validate both files

2. **Setup analysis**
   - Define max_range_size
   - Load tree and geography data
   - Create state space

3. **Fit models**
   - DEC, DIVALIKE, BAYAREALIKE
   - With and without +J
   - 6 models total is standard

4. **Compare models**
   - AIC/AICc scores
   - AIC weights
   - LRT for nested comparisons

5. **Visualize best model**
   - Pie charts for probabilities
   - Text labels for ML states
   - Annotate with split events

6. **Interpret results**
   - Ancestral ranges
   - Dispersal patterns
   - Speciation modes (if using +J)

## Interpretation Guidelines

### Dispersal rate (d)

- **High d**: Frequent range expansions
- **Low d**: Species mostly stay in current ranges
- **Units**: Expected dispersal events per lineage per time unit

### Extinction rate (e)

- **High e**: Ranges frequently contract
- **Low e**: Stable occupancy once established
- **Relative to d**: d/e ratio indicates dispersal vs. contraction tendency

### Founder-event parameter (j)

- **High j**: Jump dispersal important in clade evolution
- **Low j** (but model still better): Minor role but statistically supported
- **j = 0** (in +J model): Founder events not supported

### Model selection insights

- **DEC favored**: Balanced dispersal, extinction, and vicariance
- **DIVALIKE favored**: Vicariance-driven diversification
- **BAYAREALIKE favored**: Sympatric speciation and dispersal
- **+J improves fit**: Founder-event speciation may be important

## Computational Considerations

### Runtime factors

- **Number of tips**: Polynomial scaling
- **Number of areas**: Exponential scaling in state space
- **max_range_size**: Major impact (lower values reduce the state space)
- **Tree depth**: Linear scaling

### Memory usage

- Large trees + many areas can require substantial RAM
- Sparse matrices help but have trade-offs

### Optimization issues

- Complex likelihood surfaces
- Multiple local optima possible
- May need multiple optimization runs
- Check parameter estimates for sensibility

## Citations

**Main BioGeoBEARS reference:**
Matzke, N. J. (2013). Probabilistic historical biogeography: new models for founder-event speciation, imperfect detection, and fossils allow improved accuracy and model-testing. *Frontiers of Biogeography*, 5(4), 242-248.

**LAGRANGE (DEC model origin):**
Ree, R. H., & Smith, S. A. (2008). Maximum likelihood inference of geographic range evolution by dispersal, local extinction, and cladogenesis. *Systematic Biology*, 57(1), 4-14.

**+J parameter discussion:**
Ree, R. H., & Sanmartín, I. (2018). Conceptual and statistical problems with the DEC+J model of founder-event speciation and its comparison with DEC via model selection. *Journal of Biogeography*, 45(4), 741-749.

**Model comparison best practices:**
Burnham, K. P., & Anderson, D. R. (2002). *Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach* (2nd ed.). Springer.

## Further Resources

- **BioGeoBEARS wiki**: http://phylo.wikidot.com/biogeobears
- **GitHub repository**: https://github.com/nmatzke/BioGeoBEARS
- **Google Group**: biogeobears@googlegroups.com
- **Tutorial scripts**: Available in package `inst/extdata/examples/`