# ETE Toolkit API Reference

## Overview

ETE (Environment for Tree Exploration) is a Python toolkit for phylogenetic tree manipulation, analysis, and visualization. This reference covers the main classes and methods.

## Core Classes

### TreeNode (alias: Tree)

The fundamental class representing tree structures with hierarchical node organization.

**Constructor:**
```python
from ete3 import Tree
t = Tree(newick=None, format=0, dist=None, support=None, name=None)
```

**Parameters:**
- `newick`: Newick string or file path
- `format`: Newick format code (`0`-`9`, or `100`). Common formats:
  - `0`: Flexible, with branch lengths, leaf names, and internal support values (default)
  - `1`: Flexible, with branch lengths and internal node names
  - `2`: All branch lengths, leaf names, and internal support values
  - `5`: All branch lengths and leaf names
  - `8`: All names (leaf and internal), no branch lengths
  - `9`: Leaf names only
  - `100`: Topology only
- `dist`: Branch length to parent (default: 1.0)
- `support`: Bootstrap/confidence value (default: 1.0)
- `name`: Node identifier

### PhyloTree

Specialized class for phylogenetic analysis, extending TreeNode.

**Constructor:**
```python
from ete3 import PhyloTree
t = PhyloTree(newick=None, alignment=None, alg_format='fasta',
              sp_naming_function=None, format=0)
```

**Additional Parameters:**
- `alignment`: Path to alignment file or alignment string
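- `alg_format`: Alignment format: `'fasta'`, `'phylip'`, or `'iphylip'` (interleaved PHYLIP)
- `sp_naming_function`: Function that extracts the species code from a node name (by default, the first three characters of the name)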

### ClusterTree

Class for hierarchical clustering analysis.

**Constructor:**
```python
from ete3 import ClusterTree
t = ClusterTree(newick, text_array=None)
```

**Parameters:**
- `text_array`: Tab-delimited matrix with column headers and row names

### NCBITaxa

Class for NCBI taxonomy database operations.

**Constructor:**
```python
from ete3 import NCBITaxa
ncbi = NCBITaxa(dbfile=None)
```

The first instantiation downloads the NCBI taxonomy dump and builds a local database (roughly 300 MB) at `~/.etetoolkit/taxa.sqlite`. Pass `dbfile` to use a database stored elsewhere.

## Node Properties

### Basic Attributes

| Property | Type | Description | Default |
|----------|------|-------------|---------|
| `name` | str | Node identifier | "" (empty string in ete3; "NoName" in ETE2) |
| `dist` | float | Branch length to parent | 1.0 |
| `support` | float | Bootstrap/confidence value | 1.0 |
| `up` | TreeNode | Parent node reference | None |
| `children` | list | Child nodes | [] |

### Custom Features

Add any custom data to nodes:
```python
node.add_feature("custom_name", value)
node.add_features(feature1=value1, feature2=value2)
```

Access features:
```python
value = node.custom_name
# or
value = getattr(node, "custom_name", default_value)
```

## Navigation & Traversal

### Basic Navigation

```python
# Check node type
node.is_leaf()          # Returns True if terminal node
node.is_root()          # Returns True if root node
len(node)               # Number of leaves under node

# Get relatives
parent = node.up
children = node.children
root = node.get_tree_root()
```

### Traversal Strategies

```python
# Three traversal strategies
for node in tree.traverse("preorder"):    # Root → Left → Right
    print(node.name)

for node in tree.traverse("postorder"):   # Left → Right → Root
    print(node.name)

for node in tree.traverse("levelorder"):  # Level by level
    print(node.name)

# Exclude root
for node in tree.iter_descendants("postorder"):
    print(node.name)
```

### Getting Nodes

```python
# Get all leaves
leaves = tree.get_leaves()
for leaf in tree:  # Shortcut iteration
    print(leaf.name)

# Get all descendants
descendants = tree.get_descendants()

# Get ancestors
ancestors = node.get_ancestors()

# Get specific nodes by attribute
nodes = tree.search_nodes(name="NodeA")
node = tree & "NodeA"  # Shortcut syntax

# Get leaves by name
leaves = tree.get_leaves_by_name("LeafA")

# Get common ancestor
ancestor = tree.get_common_ancestor("LeafA", "LeafB", "LeafC")

# Custom filtering
filtered = [n for n in tree.traverse() if n.dist > 0.5 and n.is_leaf()]
```

### Iterator Methods (Memory Efficient)

```python
# For large trees, iterators avoid building full node lists in memory
for match in tree.iter_search_nodes(name="X"):
    if match.is_leaf():
        break  # Stop at the first matching leaf

for leaf in tree.iter_leaves():
    print(leaf.name)

for descendant in node.iter_descendants():
    print(descendant.name)
```

## Tree Construction & Modification

### Creating Trees from Scratch

```python
# Empty tree
t = Tree()

# Add children
child1 = t.add_child(name="A", dist=1.0)
child2 = t.add_child(name="B", dist=2.0)

# Add siblings
sister = child1.add_sister(name="C", dist=1.5)

# Populate with random topology
t.populate(10)  # Creates 10 random leaves
t.populate(5, names_library=["A", "B", "C", "D", "E"])
```

### Removing & Deleting Nodes

```python
# Detach: removes the node together with its whole subtree
node.detach()
# or, equivalently, from the parent side
parent.remove_child(node)

# Delete: removes only this node and reconnects its children to its parent
node.delete()
```

### Pruning

Keep only specified leaves:
```python
# Keep only these leaves, remove all others
tree.prune(["A", "B", "C"])

# Add the lengths of removed internal branches to the kept ones,
# so distances between the remaining leaves stay unchanged
tree.prune(["A", "B", "C"], preserve_branch_length=True)
```

### Tree Concatenation

```python
# Attach one tree as child of another
t1 = Tree("(A,(B,C));")
t2 = Tree("((D,E),(F,G));")
A = t1 & "A"
A.add_child(t2)
```

### Tree Copying

```python
# Four copy methods
copy1 = tree.copy()  # Default: cpickle (preserves types)
copy2 = tree.copy("newick")  # Fastest: basic topology
copy3 = tree.copy("newick-extended")  # Includes custom features as text
copy4 = tree.copy("deepcopy")  # Slowest: handles complex objects
```

## Tree Operations

### Rooting

```python
# Set outgroup (reroot tree)
outgroup_node = tree & "OutgroupLeaf"
tree.set_outgroup(outgroup_node)

# Midpoint rooting
midpoint = tree.get_midpoint_outgroup()
tree.set_outgroup(midpoint)

# Unroot tree
tree.unroot()
```

### Resolving Polytomies

```python
# Resolve multifurcations to bifurcations
tree.resolve_polytomy(recursive=False)  # Single node only
tree.resolve_polytomy(recursive=True)   # Entire tree
```

### Ladderize

```python
# Sort branches by partition size (number of leaves)
tree.ladderize()             # Smaller partitions first (direction=0)
tree.ladderize(direction=1)  # Larger partitions first
```

### Convert to Ultrametric

```python
# Make all leaves equidistant from root
tree.convert_to_ultrametric()
tree.convert_to_ultrametric(tree_length=100)  # Specific total length
```

## Distance & Comparison

### Distance Calculations

```python
# Branch length distance between nodes
dist = tree.get_distance("A", "B")
dist = nodeA.get_distance(nodeB)

# Topology-only distance (count nodes)
dist = tree.get_distance("A", "B", topology_only=True)

# Farthest node
farthest, distance = node.get_farthest_node()
farthest_leaf, distance = node.get_farthest_leaf()
```

### Monophyly Testing

```python
# Check if values form monophyletic group
is_mono, clade_type, breaking_leaves = tree.check_monophyly(
    values=["A", "B", "C"],
    target_attr="name"
)
# Returns: (bool, "monophyletic"|"paraphyletic"|"polyphyletic", set of leaves breaking monophyly)

# Get all monophyletic clades
monophyletic_nodes = tree.get_monophyletic(
    values=["A", "B", "C"],
    target_attr="name"
)
```

### Tree Comparison

```python
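# Robinson-Foulds distance (ete3 returns seven values; the first two are the distance and its maximum)
rf, max_rf, common_leaves, parts_t1, parts_t2, discarded_t1, discarded_t2 = t1.robinson_foulds(t2)
print(f"RF distance: {rf}/{max_rf}")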

# Normalized RF distance
result = t1.compare(t2)
norm_rf = result["norm_rf"]  # 0.0 to 1.0
ref_edges = result["ref_edges_in_source"]
```

## Input/Output

### Reading Trees

```python
# From string
t = Tree("(A:1,(B:1,(C:1,D:1):0.5):0.5);")

# From file
t = Tree("tree.nw")

# With format
t = Tree("tree.nw", format=1)
```

### Writing Trees

```python
# To string
newick = tree.write()
newick = tree.write(format=1)
newick = tree.write(format=1, features=["support", "custom_feature"])

# To file
tree.write(outfile="output.nw")
tree.write(format=5, outfile="output.nw", features=["name", "dist"])

# Custom leaf function (for collapsing)
def is_leaf(node):
    return len(node) <= 3  # Treat small clades as leaves

newick = tree.write(is_leaf_fn=is_leaf)
```

### Tree Rendering

```python
# Show interactive GUI
tree.show()

# Render to file (PNG, PDF, SVG)
tree.render("tree.png")
tree.render("tree.pdf", w=200, units="mm")
tree.render("tree.svg", dpi=300)

# ASCII representation
print(tree)
print(tree.get_ascii(show_internal=True, compact=False))
```

## Performance Optimization

### Caching Content

For frequent access to node contents:
```python
# Cache all node contents
node2content = tree.get_cached_content()

# Fast lookup
for node in tree.traverse():
    leaves = node2content[node]
    print(f"Node has {len(leaves)} leaves")
```

### Precomputing Distances

```python
# Precompute each node's distance to the root for repeated lookups
node2dist = {}
for node in tree.traverse():
    node2dist[node] = node.get_distance(tree)
```

## PhyloTree-Specific Methods

### Sequence Alignment

```python
# Link alignment
tree.link_to_alignment("alignment.fasta", alg_format="fasta")

# Access sequences
for leaf in tree:
    print(f"{leaf.name}: {leaf.sequence}")
```

### Species Naming

```python
# Default: first 3 letters
# Custom function
def get_species(node_name):
    return node_name.split("_")[0]

tree.set_species_naming_function(get_species)

# Manual setting
for leaf in tree:
    leaf.species = get_species(leaf.name)
```

### Evolutionary Events

```python
# Detect duplication/speciation events
events = tree.get_descendant_evol_events()

for node in tree.traverse():
    if hasattr(node, "evoltype"):
        print(f"{node.name}: {node.evoltype}")  # "D" or "S"

# With a species tree: reconcile the gene tree against it
# (species names must match the species codes of the gene tree leaves)
species_tree = PhyloTree("(human, (chimp, gorilla));")
recon_tree, events = tree.reconcile(species_tree)
```

### Gene Tree Operations

```python
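# Get species trees from duplicated gene families
# Returns (number of trees, number of duplications, iterator over trees)
n_trees, n_dups, species_trees = tree.get_speciation_trees()

# Split by duplication events (returns an iterator over subtrees)
subtrees = tree.split_by_dups()

# Collapse lineage-specific expansions (returns a pruned copy by default)
collapsed = tree.collapse_lineage_specific_expansions()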
```

## NCBITaxa Methods

### Database Operations

```python
from ete3 import NCBITaxa
ncbi = NCBITaxa()

# Update database
ncbi.update_taxonomy_database()
```

### Querying Taxonomy

```python
# Get taxids from names (returns a dict mapping each name to a list of taxids)
name2taxid = ncbi.get_name_translator(["Homo sapiens"])
# Returns: {'Homo sapiens': [9606]}

# Get name from taxid
names = ncbi.get_taxid_translator([9606, 9598])
# Returns: {9606: 'Homo sapiens', 9598: 'Pan troglodytes'}

# Get rank
rank = ncbi.get_rank([9606])
# Returns: {9606: 'species'}

# Get lineage
lineage = ncbi.get_lineage(9606)
# Returns: [1, 131567, 2759, ..., 9606]

# Get descendants
descendants = ncbi.get_descendant_taxa("Primates")
descendants = ncbi.get_descendant_taxa("Primates", collapse_subspecies=True)
```

### Building Taxonomy Trees

```python
# Get minimal tree connecting taxa
tree = ncbi.get_topology([9606, 9598, 9593])  # Human, chimp, gorilla

# Annotate tree with taxonomy (node names must be taxids;
# trees returned by get_topology() are already annotated by default)
tree.annotate_ncbi_taxa()

# Access taxonomy info
for node in tree.traverse():
    print(f"{node.sci_name} ({node.taxid}) - Rank: {node.rank}")
```

## ClusterTree Methods

### Linking to Data

```python
# Link matrix to tree
tree.link_to_arraytable(matrix_string)

# Access profiles
for leaf in tree:
    print(leaf.profile)  # Numerical array
```

### Cluster Metrics

```python
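# Get silhouette coefficient (also returns intra- and inter-cluster distances)
silhouette, intra_dist, inter_dist = tree.get_silhouette()

# Get Dunn index for a set of clusters (here, the root's child nodes)
dunn = tree.get_dunn(tree.get_children())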

# Inter/intra cluster distances
inter = node.intercluster_dist
intra = node.intracluster_dist

# Standard deviation
dev = node.deviation
```

### Distance Metrics

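Cluster metrics are computed with a distance function over node profiles. The `ete3.clustering.clustvalidation` module provides:
- `euclidean_dist`: Euclidean distance
- `pearson_dist`: Pearson correlation distance
- `spearman_dist`: Spearman rank correlation distance

```python
from ete3.clustering import clustvalidation

# Use Pearson correlation distance for subsequent cluster metrics
tree.set_distance_function(clustvalidation.pearson_dist)
```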

## Common Error Handling

```python
# Check if tree is empty
if tree.children:
    print("Tree has children")

# Check if node exists
nodes = tree.search_nodes(name="X")
if nodes:
    node = nodes[0]

# Safe feature access
value = getattr(node, "feature_name", default_value)

# Check format compatibility when parsing
from ete3.parser.newick import NewickError
try:
    tree = Tree("tree.nw", format=1)
except NewickError as err:
    print(f"Could not parse tree with format=1: {err}")
```

## Best Practices

1. **Use appropriate traversal**: Postorder for bottom-up, preorder for top-down
2. **Cache for repeated access**: Use `get_cached_content()` for frequent queries
3. **Use iterators for large trees**: Memory-efficient processing
4. **Preserve branch lengths**: Use `preserve_branch_length=True` when pruning
5. **Choose copy method wisely**: "newick" for speed, "cpickle" for full fidelity
6. **Validate monophyly**: Check returned clade type (monophyletic/paraphyletic/polyphyletic)
7. **Use PhyloTree for phylogenetics**: Specialized methods for evolutionary analysis
8. **Cache NCBI queries**: Store results to avoid repeated database access