# LaminDB Ontology Management
This document covers biological ontology management in LaminDB through the Bionty plugin, including accessing, searching, and annotating data with standardized biological terms.
## Overview
LaminDB integrates the `bionty` plugin to manage standardized biological ontologies, enabling consistent metadata curation and data annotation across research projects. Bionty provides access to 20+ curated biological ontologies covering genes, proteins, cell types, tissues, diseases, and more.
## Available Ontologies
LaminDB provides access to multiple curated ontology sources:
| Registry | Ontology Source | Description |
|----------|----------------|-------------|
| **Gene** | Ensembl | Genes across organisms (human, mouse, etc.) |
| **Protein** | UniProt | Protein sequences and annotations |
| **CellType** | Cell Ontology (CL) | Standardized cell type classifications |
| **CellLine** | Cell Line Ontology (CLO) | Cell line annotations |
| **Tissue** | Uberon | Anatomical structures and tissues |
| **Disease** | Mondo, DOID | Disease classifications |
| **Phenotype** | Human Phenotype Ontology (HPO) | Phenotypic abnormalities |
| **Pathway** | Gene Ontology (GO) | Biological pathways and processes |
| **ExperimentalFactor** | Experimental Factor Ontology (EFO) | Experimental variables |
| **DevelopmentalStage** | Various | Developmental stages across organisms |
| **Ethnicity** | HANCESTRO | Human ancestry ontology |
| **Drug** | DrugBank | Drug compounds |
| **Organism** | NCBItaxon | Taxonomic classifications |
## Installation and Import
```python
# Install from a shell first (bionty is included with lamindb):
#   pip install lamindb
# Then import both packages
import lamindb as ln
import bionty as bt
```
## Importing Public Ontologies
Populate your registry with a public ontology source:
```python
# Import Cell Ontology
bt.CellType.import_source()
# Import organism-specific genes
bt.Gene.import_source(organism="human")
bt.Gene.import_source(organism="mouse")
# Import tissues
bt.Tissue.import_source()
# Import diseases
bt.Disease.import_source(source="mondo") # Mondo Disease Ontology
bt.Disease.import_source(source="doid") # Disease Ontology
```
## Searching and Accessing Records
### Keyword Search
```python
# Search cell types
bt.CellType.search("T cell").to_dataframe()
bt.CellType.search("gamma-delta").to_dataframe()
# Search genes
bt.Gene.search("CD8").to_dataframe()
bt.Gene.search("TP53").to_dataframe()
# Search diseases
bt.Disease.search("cancer").to_dataframe()
# Search tissues
bt.Tissue.search("brain").to_dataframe()
```
### Auto-Complete Lookup
For registries with fewer than 100k records:
```python
# Create lookup object
cell_types = bt.CellType.lookup()
# Access by name (auto-complete in IDEs)
t_cell = cell_types.t_cell
hsc = cell_types.hematopoietic_stem_cell
# Similarly for other registries
genes = bt.Gene.lookup()
cd8a = genes.cd8a
```
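Attribute names on a lookup object are identifier-friendly versions of record names (for example, `t_cell` for "T cell"). The sketch below illustrates that kind of normalization in plain Python. The `to_lookup_key` helper is hypothetical, and its exact rules are an assumption rather than Bionty's actual implementation.

```python
import re


def to_lookup_key(name: str) -> str:
    """Hypothetical helper: turn a display name into an identifier-like key."""
    # Lowercase, then collapse every run of non-alphanumeric characters to "_".
    key = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip().lower()).strip("_")
    # Python identifiers cannot start with a digit, so prefix those keys.
    return f"_{key}" if key[:1].isdigit() else key


for name in ["T cell", "hematopoietic stem cell", "gamma-delta T cell"]:
    print(name, "->", to_lookup_key(name))
```

If a name does not appear under the key you expect, inspecting the lookup object in an interactive session is the reliable way to find the real attribute.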
### Exact Field Matching
```python
# By ontology ID
cell_type = bt.CellType.get(ontology_id="CL:0000798")
disease = bt.Disease.get(ontology_id="MONDO:0004992")
# By name
cell_type = bt.CellType.get(name="T cell")
gene = bt.Gene.get(symbol="CD8A")
# By Ensembl ID
gene = bt.Gene.get(ensembl_gene_id="ENSG00000153563")
```
## Ontological Hierarchies
### Exploring Relationships
```python
# Get a cell type
gdt_cell = bt.CellType.get(ontology_id="CL:0000798") # gamma-delta T cell
# View direct parents
gdt_cell.parents.to_dataframe()
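# Walk up the hierarchy one level at a time.
# Note: this follows only the first parent at each level, so it yields a
# single lineage rather than every ancestor of a multi-parent term.
ancestors = []
current = gdt_cell
while current.parents.exists():
    parent = current.parents.first()
    ancestors.append(parent)
    current = parent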
# View direct children
gdt_cell.children.to_dataframe()
# View all descendants recursively
gdt_cell.query_children().to_dataframe()
```
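Ontology terms can have more than one parent, so collecting every ancestor means traversing a graph rather than a single chain. The following is a minimal, library-independent sketch of that traversal. The `PARENTS` map and its term relationships are illustrative assumptions, not data taken from the Cell Ontology.

```python
from collections import deque

# Hypothetical child -> parents map (illustrative, not real CL structure).
PARENTS = {
    "gamma-delta T cell": ["T cell"],
    "T cell": ["lymphocyte", "hematopoietic cell"],
    "lymphocyte": ["leukocyte"],
    "leukocyte": ["hematopoietic cell"],
    "hematopoietic cell": ["cell"],
}


def all_ancestors(term, parents):
    """Return every ancestor of `term` once, in breadth-first order."""
    seen = set()
    ordered = []
    queue = deque([term])
    while queue:
        current = queue.popleft()
        for parent in parents.get(current, []):
            if parent not in seen:  # skip terms reached via another path
                seen.add(parent)
                ordered.append(parent)
                queue.append(parent)
    return ordered


print(all_ancestors("gamma-delta T cell", PARENTS))
```

The `seen` set is what keeps a term that is reachable through two parents from being reported twice.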
### Visualizing Hierarchies
```python
# Visualize parent hierarchy
gdt_cell.view_parents()
# Include children in visualization
gdt_cell.view_parents(with_children=True)
# Get all related terms for visualization
t_cell = bt.CellType.get(name="T cell")
t_cell.view_parents(with_children=True) # Shows T cell subtypes
```
## Standardizing and Validating Data
### Validation
Check if terms exist in the ontology:
```python
# Validate cell types
bt.CellType.validate(["T cell", "B cell", "invalid_cell"])
# Returns: [True, True, False]
# Validate genes
bt.Gene.validate(["CD8A", "TP53", "FAKEGENE"], organism="human")
# Returns: [True, True, False]
# Check which terms are invalid
terms = ["T cell", "fat cell", "neuron", "invalid_term"]
invalid = [t for t, valid in zip(terms, bt.CellType.validate(terms)) if not valid]
print(f"Invalid terms: {invalid}")
```
### Standardization with Synonyms
Convert non-standard terms to validated names:
```python
# Standardize cell type names
bt.CellType.standardize(["fat cell", "blood forming stem cell"])
# Returns: ['adipocyte', 'hematopoietic stem cell']
# Standardize genes
bt.Gene.standardize(["BRCA-1", "p53"], organism="human")
# Returns: ['BRCA1', 'TP53']
# Handle mixed valid/invalid terms
terms = ["T cell", "T lymphocyte", "invalid"]
standardized = bt.CellType.standardize(terms)
# Returns standardized names where possible
```
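Conceptually, validation checks for an exact canonical name, while standardization first resolves synonyms to that canonical name. The sketch below models the distinction with a plain dictionary. The `SYNONYMS` table, the case-insensitive matching, and the choice to pass unknown terms through unchanged are all assumptions of this illustration, not a description of Bionty's internals.

```python
# Hypothetical synonym table: canonical name -> known synonyms.
SYNONYMS = {
    "adipocyte": ["fat cell"],
    "hematopoietic stem cell": ["blood forming stem cell"],
    "T cell": ["T lymphocyte"],
}


def build_index(synonyms):
    """Map every lowercased name or synonym to its canonical name."""
    index = {}
    for canonical, alternatives in synonyms.items():
        index[canonical.lower()] = canonical
        for alt in alternatives:
            index[alt.lower()] = canonical
    return index


def validate(terms, synonyms):
    """True only where a term is already an exact canonical name."""
    return [term in synonyms for term in terms]


def standardize(terms, index):
    """Replace known synonyms with canonical names; keep unknown terms as-is."""
    return [index.get(term.lower(), term) for term in terms]


INDEX = build_index(SYNONYMS)
terms = ["T cell", "T lymphocyte", "invalid"]
print(validate(terms, SYNONYMS))
print(standardize(terms, INDEX))
```

Running validation again after standardization shows which terms still need manual curation.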
### Loading Validated Records
```python
# Load records from values (including synonyms)
records = bt.CellType.from_values(["fat cell", "blood forming stem cell"])
# Returns list of CellType records
for record in records:
    print(record.name, record.ontology_id)
# Use with gene symbols
genes = bt.Gene.from_values(["CD8A", "CD8B"], organism="human")
```
## Annotating Datasets
### Annotating AnnData
```python
import anndata as ad
import lamindb as ln
# Load example data
adata = ad.read_h5ad("data.h5ad")
# Validate and retrieve matching records
cell_types = bt.CellType.from_values(adata.obs.cell_type)
# Create artifact with annotations
artifact = ln.Artifact.from_anndata(
    adata,
    key="scrna/annotated_data.h5ad",
    description="scRNA-seq data with validated cell type annotations"
).save()
# Link ontology records to artifact
artifact.feature_sets.add_ontology(cell_types)
```
### Annotating DataFrames
```python
import pandas as pd
# Create DataFrame with biological entities
df = pd.DataFrame({
    "cell_type": ["T cell", "B cell", "NK cell"],
    "tissue": ["blood", "spleen", "liver"],
    "disease": ["healthy", "lymphoma", "healthy"]
})
# Validate and standardize
df["cell_type"] = bt.CellType.standardize(df["cell_type"])
df["tissue"] = bt.Tissue.standardize(df["tissue"])
# Create artifact
artifact = ln.Artifact.from_dataframe(
    df,
    key="metadata/sample_info.parquet"
).save()
# Link ontology records
cell_type_records = bt.CellType.from_values(df["cell_type"])
tissue_records = bt.Tissue.from_values(df["tissue"])
artifact.feature_sets.add_ontology(cell_type_records)
artifact.feature_sets.add_ontology(tissue_records)
```
## Managing Custom Terms and Hierarchies
### Adding Custom Terms
```python
# Register new term not in public ontology
my_celltype = bt.CellType(name="my_novel_T_cell_subtype").save()
# Establish parent-child relationship
parent = bt.CellType.get(name="T cell")
my_celltype.parents.add(parent)
# Verify relationship
my_celltype.parents.to_dataframe()
parent.children.to_dataframe() # Should include my_celltype
```
### Adding Synonyms
```python
# Add synonyms for standardization
hsc = bt.CellType.get(name="hematopoietic stem cell")
hsc.add_synonym("HSC")
hsc.add_synonym("blood stem cell")
hsc.add_synonym("hematopoietic progenitor")
# Set abbreviation
hsc.set_abbr("HSC")
# Now standardization works with synonyms
bt.CellType.standardize(["HSC", "blood stem cell"])
# Returns: ['hematopoietic stem cell', 'hematopoietic stem cell']
```
### Creating Custom Hierarchies
```python
# Build custom cell type hierarchy
immune_cell = bt.CellType.get(name="immune cell")
# Add custom subtypes
my_subtype1 = bt.CellType(name="custom_immune_subtype_1").save()
my_subtype2 = bt.CellType(name="custom_immune_subtype_2").save()
# Link to parent
my_subtype1.parents.add(immune_cell)
my_subtype2.parents.add(immune_cell)
# Create sub-subtypes
my_subsubtype = bt.CellType(name="custom_sub_subtype").save()
my_subsubtype.parents.add(my_subtype1)
# Visualize custom hierarchy
immune_cell.view_parents(with_children=True)
```
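When linking custom terms, it is worth checking that a new parent link does not make a term its own ancestor. The sketch below shows one way to test for that on a plain child-to-parents map before adding the link. `would_create_cycle` is a hypothetical helper, and whether LaminDB performs a comparable check itself is not assumed here.

```python
def would_create_cycle(parents, child, new_parent):
    """True if adding child -> new_parent would make `child` its own ancestor."""
    stack = [new_parent]
    seen = set()
    while stack:
        node = stack.pop()
        if node == child:
            return True  # `child` is reachable upward from the proposed parent
        if node in seen:
            continue
        seen.add(node)
        stack.extend(parents.get(node, []))
    return False


# Child -> parents map mirroring the custom hierarchy built above.
HIERARCHY = {
    "custom_immune_subtype_1": ["immune cell"],
    "custom_immune_subtype_2": ["immune cell"],
    "custom_sub_subtype": ["custom_immune_subtype_1"],
}

# Making "immune cell" a child of its own descendant would close a loop.
print(would_create_cycle(HIERARCHY, "immune cell", "custom_sub_subtype"))
# Linking two sibling subtypes is safe.
print(would_create_cycle(HIERARCHY, "custom_immune_subtype_2", "custom_immune_subtype_1"))
```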
## Multi-Organism Support
For organism-aware registries like Gene:
```python
# Set global organism
bt.settings.organism = "human"
# Validate human genes
bt.Gene.validate(["TCF7", "CD8A"], organism="human")
# Load genes for specific organism
human_genes = bt.Gene.from_values(["CD8A", "TP53"], organism="human")
mouse_genes = bt.Gene.from_values(["Cd8a", "Trp53"], organism="mouse")
# Search organism-specific genes
bt.Gene.search("CD8", organism="human").to_dataframe()
bt.Gene.search("Cd8", organism="mouse").to_dataframe()
# Switch organism context
bt.settings.organism = "mouse"
genes = bt.Gene.from_source(symbol="Ap5b1")
```
## Public Ontology Lookup
Access terms from public ontologies without importing:
```python
# Interactive lookup in public sources
cell_types_public = bt.CellType.lookup(public=True)
# Access public terms
hepatocyte = cell_types_public.hepatocyte
# Import specific term
hepatocyte_local = bt.CellType.from_source(name="hepatocyte")
# Or import by ontology ID
specific_cell = bt.CellType.from_source(ontology_id="CL:0000182")
```
## Version Tracking
LaminDB automatically tracks ontology versions:
```python
# View current source versions
bt.Source.filter(currently_used=True).to_dataframe()
# Check which source a record derives from
cell_type = bt.CellType.get(name="hepatocyte")
cell_type.source # Returns Source metadata
# View source details
source = cell_type.source
print(source.name) # e.g., "cl"
print(source.version) # e.g., "2023-05-18"
print(source.url) # Ontology URL
```
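For reproducibility, it can help to snapshot the source versions in use and compare snapshots between analyses. The sketch below compares two such snapshots held as plain dictionaries. The `diff_source_versions` helper is hypothetical, and every version string except the `cl` example above is made up for illustration.

```python
def diff_source_versions(before, after):
    """Return {source: (old, new)} for sources that changed, appeared, or vanished."""
    changed = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changed[name] = (old, new)
    return changed


# Hypothetical snapshots, e.g. built from each source's name and version.
snapshot_old = {"cl": "2023-05-18", "uberon": "2023-04-19"}
snapshot_new = {"cl": "2024-01-04", "uberon": "2023-04-19", "mondo": "2023-08-02"}

for name, (old, new) in diff_source_versions(snapshot_old, snapshot_new).items():
    print(f"{name}: {old} -> {new}")
```

A `None` on either side of a pair means the source was absent from that snapshot.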
## Ontology Integration Workflows
### Workflow 1: Validate Existing Data
```python
import anndata as ad
import bionty as bt
# Load data with biological annotations
adata = ad.read_h5ad("uncurated_data.h5ad")
# Validate cell types
validation = bt.CellType.validate(adata.obs.cell_type)
# Identify invalid terms
invalid_idx = [i for i, v in enumerate(validation) if not v]
invalid_terms = adata.obs.cell_type.iloc[invalid_idx].unique()
print(f"Invalid cell types: {invalid_terms}")
# Fix invalid terms manually or with standardization
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs.cell_type)
# Re-validate
validation = bt.CellType.validate(adata.obs.cell_type)
assert all(validation), "Some cell types are still invalid after standardization"
```
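Before fixing invalid terms, it is often useful to see how frequently each one occurs so the most common problems are handled first. The sketch below summarizes invalid terms from a list of values and a matching list of validity flags. `summarize_invalid` is a hypothetical helper, and the flags are hard-coded stand-ins for what a validation call would return.

```python
from collections import Counter


def summarize_invalid(terms, is_valid):
    """Return (term, count) pairs for invalid terms, most frequent first."""
    if len(terms) != len(is_valid):
        raise ValueError("terms and is_valid must have the same length")
    counts = Counter(term for term, ok in zip(terms, is_valid) if not ok)
    return counts.most_common()


# Stand-in data: one label per cell and one validity flag per label.
labels = ["T cell", "t-cell", "neuron", "t-cell", "unknwn"]
flags = [True, False, True, False, False]

for term, count in summarize_invalid(labels, flags):
    print(f"{term!r} appears {count} time(s)")
```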
### Workflow 2: Curate and Annotate
```python
import bionty as bt
import lamindb as ln
import pandas as pd
ln.track() # Start tracking
# Load data
df = pd.read_csv("experimental_data.csv")
# Standardize using ontologies
df["cell_type"] = bt.CellType.standardize(df["cell_type"])
df["tissue"] = bt.Tissue.standardize(df["tissue"])
# Create curated artifact
artifact = ln.Artifact.from_dataframe(
    df,
    key="curated/experiment_2025_10.parquet",
    description="Curated experimental data with ontology-validated annotations"
).save()
# Link ontology records
artifact.feature_sets.add_ontology(bt.CellType.from_values(df["cell_type"]))
artifact.feature_sets.add_ontology(bt.Tissue.from_values(df["tissue"]))
ln.finish() # Complete tracking
```
### Workflow 3: Cross-Organism Gene Mapping
```python
# Get human genes
human_genes = ["CD8A", "CD8B", "TP53"]
human_records = bt.Gene.from_values(human_genes, organism="human")
# Find mouse orthologs (requires external mapping)
# LaminDB doesn't provide built-in ortholog mapping
# Use external tools like Ensembl BioMart or homologene
mouse_orthologs = ["Cd8a", "Cd8b1", "Trp53"]  # mouse Cd8b is officially named Cd8b1
mouse_records = bt.Gene.from_values(mouse_orthologs, organism="mouse")
```
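Because the ortholog mapping comes from an external resource, it helps to apply it explicitly and surface any symbols it does not cover. The sketch below does this with a plain dictionary. The `HUMAN_TO_MOUSE` table is a hypothetical stand-in for an exported mapping, and `map_orthologs` is an illustrative helper, not part of LaminDB or Bionty.

```python
# Hypothetical human -> mouse symbol table, e.g. exported from an external tool.
HUMAN_TO_MOUSE = {
    "CD8A": "Cd8a",
    "TP53": "Trp53",
}


def map_orthologs(symbols, mapping):
    """Split symbols into (mapped orthologs, symbols with no mapping)."""
    mapped, unmapped = [], []
    for symbol in symbols:
        if symbol in mapping:
            mapped.append(mapping[symbol])
        else:
            unmapped.append(symbol)
    return mapped, unmapped


mouse_symbols, missing = map_orthologs(["CD8A", "TP53", "FAKEGENE"], HUMAN_TO_MOUSE)
print("Mapped:", mouse_symbols)
print("No ortholog found:", missing)
```

The mapped symbols can then be passed to `bt.Gene.from_values(..., organism="mouse")` as in the workflow above.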
## Querying Ontology-Annotated Data
```python
# Find all datasets with specific cell type
t_cell = bt.CellType.get(name="T cell")
ln.Artifact.filter(feature_sets__cell_types=t_cell).to_dataframe()
# Find datasets measuring specific genes
cd8a = bt.Gene.get(symbol="CD8A", organism="human")
ln.Artifact.filter(feature_sets__genes=cd8a).to_dataframe()
# Query across ontology hierarchy
# Find all datasets annotated with any T cell subtype
# (query_children() returns descendants only; query t_cell itself separately)
t_cell_subtypes = t_cell.query_children()
ln.Artifact.filter(
    feature_sets__cell_types__in=t_cell_subtypes
).to_dataframe()
```
## Best Practices
1. **Import ontologies first**: Call `import_source()` before validation
2. **Use standardization**: Leverage synonym mapping to handle variations
3. **Validate early**: Check terms before creating artifacts
4. **Set organism context**: Specify organism for gene-related queries
5. **Add custom synonyms**: Register common variations in your domain
6. **Use public lookup**: Access `lookup(public=True)` for term discovery
7. **Track versions**: Monitor ontology source versions for reproducibility
8. **Build hierarchies**: Link custom terms to existing ontology structures
9. **Query hierarchically**: Use `query_children()` for comprehensive searches
10. **Document mappings**: Track custom term additions and relationships
## Common Ontology Operations
```python
# Check if term exists
exists = bt.CellType.filter(name="T cell").exists()
# Count terms in registry
n_cell_types = bt.CellType.filter().count()
# Get all terms with specific parent
immune_cells = bt.CellType.filter(parents__name="immune cell")
# Find orphan terms (no parents)
orphans = bt.CellType.filter(parents__isnull=True)
# Get recently added terms
from datetime import datetime, timedelta
recent = bt.CellType.filter(
    created_at__gte=datetime.now() - timedelta(days=7)
)
```