Initial commit

This commit is contained in:
Zhongwei Li
2025-11-29 18:02:37 +08:00
commit c1d9dee646
38 changed files with 11210 additions and 0 deletions

View File

@@ -0,0 +1,358 @@
# BioGeoBEARS Detailed Reference
## Overview
BioGeoBEARS (BioGeography with Bayesian and Likelihood Evolutionary Analysis in R Scripts) is an R package for probabilistic inference of historical biogeography on phylogenetic trees. It implements various models of range evolution and allows statistical comparison between them.
## Installation
```r
# Install dependencies
install.packages("rexpokit")
install.packages("cladoRcpp")
# Install from GitHub
library(devtools)
devtools::install_github(repo="nmatzke/BioGeoBEARS")
```
## Biogeographic Models
BioGeoBEARS implements several models that differ in their assumptions about how species ranges evolve:
### DEC (Dispersal-Extinction-Cladogenesis)
The DEC model is based on LAGRANGE and includes:
- **Anagenetic changes** (along branches):
- `d` (dispersal): Rate of range expansion into adjacent areas
- `e` (extinction): Rate of local extinction in an area
- **Cladogenetic events** (at speciation nodes):
- Vicariance: Ancestral range splits between daughter lineages
- Subset sympatry: One daughter inherits full range, other subset
- Range copying: Both daughters inherit full ancestral range
**Parameters**: 2 (d, e)
**Best for**: General-purpose biogeographic inference
### DIVALIKE (Vicariance-focused)
Similar to DIVA (Dispersal-Vicariance Analysis):
- Emphasizes vicariance at speciation events
- Fixes subset sympatry probability to 0
- Only allows vicariance and range copying at nodes
**Parameters**: 2 (d, e)
**Best for**: Systems where vicariance is the primary speciation mode
### BAYAREALIKE (Sympatry-focused)
Based on the BayArea model:
- Emphasizes sympatric speciation
- Fixes vicariance probability to 0
- Only allows subset sympatry and range copying
**Parameters**: 2 (d, e)
**Best for**: Systems where dispersal and sympatric speciation dominate
### +J Extension (Founder-event speciation)
Any of the above models can include a "+J" parameter:
- **j**: Jump dispersal / founder-event speciation rate
- Allows instantaneous dispersal to a new area at speciation
- Often significantly improves model fit
- Can be controversial (some argue it's biologically unrealistic)
**Examples**: DEC+J, DIVALIKE+J, BAYAREALIKE+J
**Additional parameters**: +1 (j)
## Model Comparison
### AIC (Akaike Information Criterion)
```
AIC = -2 × ln(L) + 2k
```
Where:
- ln(L) = log-likelihood
- k = number of parameters
**Lower AIC = better model**
### AICc (Corrected AIC)
Used when sample size is small relative to parameters:
```
AICc = AIC + (2k² + 2k)/(n - k - 1)
```
### AIC Weights
Probability that a model is the best among the set:
```
w_i = exp(-0.5 × Δ_i) / Σ exp(-0.5 × Δ_j)
```
Where Δ_i = AIC_i - AIC_min
### Likelihood Ratio Test (LRT)
For nested models (e.g., DEC vs DEC+J):
```
LRT = 2 × (ln(L_complex) - ln(L_simple))
```
- Test statistic follows χ² distribution
- df = difference in number of parameters
- p < 0.05 suggests complex model significantly better
## Input File Formats
### Phylogenetic Tree (Newick format)
Standard Newick format with:
- Branch lengths required
- Tip labels must match geography file
- Should be rooted and ultrametric (for time-stratified analyses)
Example:
```
((A:1.0,B:1.0):0.5,C:1.5);
```
### Geography File (PHYLIP-like format)
**Format structure:**
```
n_species [TAB] n_areas [TAB] (area1 area2 area3 ...)
species1 [TAB] 011
species2 [TAB] 110
species3 [TAB] 001
```
**Important formatting rules:**
1. **Line 1 (Header)**:
- Number of species (integer)
- TAB character
- Number of areas (integer)
- TAB character
- Area names in parentheses, separated by spaces
2. **Subsequent lines (Species data)**:
- Species name (must match tree tip label)
- TAB character
- Binary presence/absence code (1=present, 0=absent)
- NO SPACES in the binary code
- NO SPACES in species names (use underscores)
3. **Common errors to avoid**:
- Using spaces instead of tabs
- Spaces within binary codes
- Species names with spaces
- Mismatch between species names in tree and geography file
- Wrong number of digits in binary code
**Example file:**
```
5 3 (A B C)
Sp_alpha 011
Sp_beta 010
Sp_gamma 111
Sp_delta 100
Sp_epsilon 001
```
## Key Parameters and Settings
### max_range_size
Maximum number of areas a species can occupy simultaneously.
- **Default**: Often set to number of areas, or number of areas - 1
- **Impact**: Larger values = more possible states = longer computation
- **Recommendation**: Set based on biological realism
### include_null_range
Whether to include the "null range" (species extinct everywhere).
- **Default**: TRUE
- **Purpose**: Allows extinction along branches
- **Recommendation**: Usually keep TRUE
### force_sparse
Use sparse matrix operations for speed.
- **Default**: FALSE
- **When to use**: Large state spaces (many areas)
- **Note**: May cause numerical issues
### speedup
Various speedup options.
- **Default**: TRUE
- **Recommendation**: Usually keep TRUE
### use_optimx
Use optimx for parameter optimization.
- **Default**: TRUE
- **Benefit**: More robust optimization
- **Recommendation**: Keep TRUE
### calc_ancprobs
Calculate ancestral state probabilities.
- **Default**: FALSE
- **Must set to TRUE** if you want ancestral range estimates
- **Impact**: Adds computational time
## Plotting Functions
### plot_BioGeoBEARS_results()
Main function for visualizing results.
**Key parameters:**
- `plotwhat`: "pie" (probability distributions) or "text" (ML states)
- `tipcex`: Tip label text size
- `statecex`: Node state text/pie chart size
- `splitcex`: Split state text/pie size (at corners)
- `titlecex`: Title text size
- `plotsplits`: Show cladogenetic events (TRUE/FALSE)
- `include_null_range`: Match analysis setting
- `label.offset`: Distance of tip labels from tree
- `cornercoords_loc`: Directory with corner coordinate files
**Color scheme:**
- Single areas: Bright primary colors
- Multi-area ranges: Blended colors
- All areas: White
- Colors automatically assigned and mixed
## Biogeographical Stochastic Mapping (BSM)
Extension of BioGeoBEARS that simulates stochastic histories:
- Generates multiple possible biogeographic histories
- Accounts for uncertainty in ancestral ranges
- Allows visualization of range evolution dynamics
- More computationally intensive
Not covered in basic workflow but available in package.
## Common Analysis Workflow
1. **Prepare inputs**
- Phylogenetic tree (Newick)
- Geography file (PHYLIP format)
- Validate both files
2. **Setup analysis**
- Define max_range_size
- Load tree and geography data
- Create state space
3. **Fit models**
- DEC, DIVALIKE, BAYAREALIKE
- With and without +J
- 6 models total is standard
4. **Compare models**
- AIC/AICc scores
- AIC weights
- LRT for nested comparisons
5. **Visualize best model**
- Pie charts for probabilities
- Text labels for ML states
- Annotate with split events
6. **Interpret results**
- Ancestral ranges
- Dispersal patterns
- Speciation modes (if using +J)
## Interpretation Guidelines
### Dispersal rate (d)
- **High d**: Frequent range expansions
- **Low d**: Species mostly stay in current ranges
- **Units**: Expected dispersal events per lineage per time unit
### Extinction rate (e)
- **High e**: Ranges frequently contract
- **Low e**: Stable occupancy once established
- **Relative to d**: d/e ratio indicates dispersal vs. contraction tendency
### Founder-event rate (j)
- **High j**: Jump dispersal important in clade evolution
- **Low j** (but model still better): Minor role but statistically supported
- **j = 0** (in +J model): Founder events not supported
### Model selection insights
- **DEC favored**: Balanced dispersal, extinction, and vicariance
- **DIVALIKE favored**: Vicariance-driven diversification
- **BAYAREALIKE favored**: Sympatric speciation and dispersal
- **+J improves fit**: Founder-event speciation may be important
## Computational Considerations
### Runtime factors
- **Number of tips**: Polynomial scaling
- **Number of areas**: Exponential scaling in state space
- **max_range_size**: Major impact (reduces state space)
- **Tree depth**: Linear scaling
### Memory usage
- Large trees + many areas can require substantial RAM
- Sparse matrices help but have trade-offs
### Optimization issues
- Complex likelihood surfaces
- Multiple local optima possible
- May need multiple optimization runs
- Check parameter estimates for sensibility
## Citations
**Main BioGeoBEARS reference:**
Matzke, N. J. (2013). Probabilistic historical biogeography: new models for founder-event speciation, imperfect detection, and fossils allow improved accuracy and model-testing. *Frontiers of Biogeography*, 5(4), 242-248.
**LAGRANGE (DEC model origin):**
Ree, R. H., & Smith, S. A. (2008). Maximum likelihood inference of geographic range evolution by dispersal, local extinction, and cladogenesis. *Systematic Biology*, 57(1), 4-14.
**+J parameter discussion:**
Ree, R. H., & Sanmartín, I. (2018). Conceptual and statistical problems with the DEC+J model of founder-event speciation and its comparison with DEC via model selection. *Journal of Biogeography*, 45(4), 741-749.
**Model comparison best practices:**
Burnham, K. P., & Anderson, D. R. (2002). *Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach* (2nd ed.). Springer.
## Further Resources
- **BioGeoBEARS wiki**: http://phylo.wikidot.com/biogeobears
- **GitHub repository**: https://github.com/nmatzke/BioGeoBEARS
- **Google Group**: biogeobears@googlegroups.com
- **Tutorial scripts**: Available in package `inst/extdata/examples/`