Initial commit
This commit is contained in:
497
skills/lamindb/references/ontologies.md
Normal file
497
skills/lamindb/references/ontologies.md
Normal file
@@ -0,0 +1,497 @@
|
||||
# LaminDB Ontology Management
|
||||
|
||||
This document covers biological ontology management in LaminDB through the Bionty plugin, including accessing, searching, and annotating data with standardized biological terms.
|
||||
|
||||
## Overview
|
||||
|
||||
LaminDB integrates the `bionty` plugin to manage standardized biological ontologies, enabling consistent metadata curation and data annotation across research projects. Bionty provides access to 20+ curated biological ontologies covering genes, proteins, cell types, tissues, diseases, and more.
|
||||
|
||||
## Available Ontologies
|
||||
|
||||
LaminDB provides access to multiple curated ontology sources:
|
||||
|
||||
| Registry | Ontology Source | Description |
|
||||
|----------|----------------|-------------|
|
||||
| **Gene** | Ensembl | Genes across organisms (human, mouse, etc.) |
|
||||
| **Protein** | UniProt | Protein sequences and annotations |
|
||||
| **CellType** | Cell Ontology (CL) | Standardized cell type classifications |
|
||||
| **CellLine** | Cell Line Ontology (CLO) | Cell line annotations |
|
||||
| **Tissue** | Uberon | Anatomical structures and tissues |
|
||||
| **Disease** | Mondo, DOID | Disease classifications |
|
||||
| **Phenotype** | Human Phenotype Ontology (HPO) | Phenotypic abnormalities |
|
||||
| **Pathway** | Gene Ontology (GO) | Biological pathways and processes |
|
||||
| **ExperimentalFactor** | Experimental Factor Ontology (EFO) | Experimental variables |
|
||||
| **DevelopmentalStage** | Various | Developmental stages across organisms |
|
||||
| **Ethnicity** | HANCESTRO | Human ancestry ontology |
|
||||
| **Drug** | DrugBank | Drug compounds |
|
||||
| **Organism** | NCBItaxon | Taxonomic classifications |
|
||||
|
||||
## Installation and Import
|
||||
|
||||
```python
|
||||
# Install bionty (included with lamindb)
|
||||
pip install lamindb
|
||||
|
||||
# Import
|
||||
import lamindb as ln
|
||||
import bionty as bt
|
||||
```
|
||||
|
||||
## Importing Public Ontologies
|
||||
|
||||
Populate your registry with a public ontology source:
|
||||
|
||||
```python
|
||||
# Import Cell Ontology
|
||||
bt.CellType.import_source()
|
||||
|
||||
# Import organism-specific genes
|
||||
bt.Gene.import_source(organism="human")
|
||||
bt.Gene.import_source(organism="mouse")
|
||||
|
||||
# Import tissues
|
||||
bt.Tissue.import_source()
|
||||
|
||||
# Import diseases
|
||||
bt.Disease.import_source(source="mondo") # Mondo Disease Ontology
|
||||
bt.Disease.import_source(source="doid") # Disease Ontology
|
||||
```
|
||||
|
||||
## Searching and Accessing Records
|
||||
|
||||
### Keyword Search
|
||||
|
||||
```python
|
||||
# Search cell types
|
||||
bt.CellType.search("T cell").to_dataframe()
|
||||
bt.CellType.search("gamma-delta").to_dataframe()
|
||||
|
||||
# Search genes
|
||||
bt.Gene.search("CD8").to_dataframe()
|
||||
bt.Gene.search("TP53").to_dataframe()
|
||||
|
||||
# Search diseases
|
||||
bt.Disease.search("cancer").to_dataframe()
|
||||
|
||||
# Search tissues
|
||||
bt.Tissue.search("brain").to_dataframe()
|
||||
```
|
||||
|
||||
### Auto-Complete Lookup
|
||||
|
||||
For registries with fewer than 100k records:
|
||||
|
||||
```python
|
||||
# Create lookup object
|
||||
cell_types = bt.CellType.lookup()
|
||||
|
||||
# Access by name (auto-complete in IDEs)
|
||||
t_cell = cell_types.t_cell
|
||||
hsc = cell_types.hematopoietic_stem_cell
|
||||
|
||||
# Similarly for other registries
|
||||
genes = bt.Gene.lookup()
|
||||
cd8a = genes.cd8a
|
||||
```
|
||||
|
||||
### Exact Field Matching
|
||||
|
||||
```python
|
||||
# By ontology ID
|
||||
cell_type = bt.CellType.get(ontology_id="CL:0000798")
|
||||
disease = bt.Disease.get(ontology_id="MONDO:0004992")
|
||||
|
||||
# By name
|
||||
cell_type = bt.CellType.get(name="T cell")
|
||||
gene = bt.Gene.get(symbol="CD8A")
|
||||
|
||||
# By Ensembl ID
|
||||
gene = bt.Gene.get(ensembl_gene_id="ENSG00000153563")
|
||||
```
|
||||
|
||||
## Ontological Hierarchies
|
||||
|
||||
### Exploring Relationships
|
||||
|
||||
```python
|
||||
# Get a cell type
|
||||
gdt_cell = bt.CellType.get(ontology_id="CL:0000798") # gamma-delta T cell
|
||||
|
||||
# View direct parents
|
||||
gdt_cell.parents.to_dataframe()
|
||||
|
||||
# View all ancestors recursively
|
||||
ancestors = []
|
||||
current = gdt_cell
|
||||
while current.parents.exists():
|
||||
parent = current.parents.first()
|
||||
ancestors.append(parent)
|
||||
current = parent
|
||||
|
||||
# View direct children
|
||||
gdt_cell.children.to_dataframe()
|
||||
|
||||
# View all descendants recursively
|
||||
gdt_cell.query_children().to_dataframe()
|
||||
```
|
||||
|
||||
### Visualizing Hierarchies
|
||||
|
||||
```python
|
||||
# Visualize parent hierarchy
|
||||
gdt_cell.view_parents()
|
||||
|
||||
# Include children in visualization
|
||||
gdt_cell.view_parents(with_children=True)
|
||||
|
||||
# Get all related terms for visualization
|
||||
t_cell = bt.CellType.get(name="T cell")
|
||||
t_cell.view_parents(with_children=True) # Shows T cell subtypes
|
||||
```
|
||||
|
||||
## Standardizing and Validating Data
|
||||
|
||||
### Validation
|
||||
|
||||
Check if terms exist in the ontology:
|
||||
|
||||
```python
|
||||
# Validate cell types
|
||||
bt.CellType.validate(["T cell", "B cell", "invalid_cell"])
|
||||
# Returns: [True, True, False]
|
||||
|
||||
# Validate genes
|
||||
bt.Gene.validate(["CD8A", "TP53", "FAKEGENE"], organism="human")
|
||||
# Returns: [True, True, False]
|
||||
|
||||
# Check which terms are invalid
|
||||
terms = ["T cell", "fat cell", "neuron", "invalid_term"]
|
||||
invalid = [t for t, valid in zip(terms, bt.CellType.validate(terms)) if not valid]
|
||||
print(f"Invalid terms: {invalid}")
|
||||
```
|
||||
|
||||
### Standardization with Synonyms
|
||||
|
||||
Convert non-standard terms to validated names:
|
||||
|
||||
```python
|
||||
# Standardize cell type names
|
||||
bt.CellType.standardize(["fat cell", "blood forming stem cell"])
|
||||
# Returns: ['adipocyte', 'hematopoietic stem cell']
|
||||
|
||||
# Standardize genes
|
||||
bt.Gene.standardize(["BRCA-1", "p53"], organism="human")
|
||||
# Returns: ['BRCA1', 'TP53']
|
||||
|
||||
# Handle mixed valid/invalid terms
|
||||
terms = ["T cell", "T lymphocyte", "invalid"]
|
||||
standardized = bt.CellType.standardize(terms)
|
||||
# Returns standardized names where possible
|
||||
```
|
||||
|
||||
### Loading Validated Records
|
||||
|
||||
```python
|
||||
# Load records from values (including synonyms)
|
||||
records = bt.CellType.from_values(["fat cell", "blood forming stem cell"])
|
||||
|
||||
# Returns list of CellType records
|
||||
for record in records:
|
||||
print(record.name, record.ontology_id)
|
||||
|
||||
# Use with gene symbols
|
||||
genes = bt.Gene.from_values(["CD8A", "CD8B"], organism="human")
|
||||
```
|
||||
|
||||
## Annotating Datasets
|
||||
|
||||
### Annotating AnnData
|
||||
|
||||
```python
|
||||
import anndata as ad
|
||||
import lamindb as ln
|
||||
|
||||
# Load example data
|
||||
adata = ad.read_h5ad("data.h5ad")
|
||||
|
||||
# Validate and retrieve matching records
|
||||
cell_types = bt.CellType.from_values(adata.obs.cell_type)
|
||||
|
||||
# Create artifact with annotations
|
||||
artifact = ln.Artifact.from_anndata(
|
||||
adata,
|
||||
key="scrna/annotated_data.h5ad",
|
||||
description="scRNA-seq data with validated cell type annotations"
|
||||
).save()
|
||||
|
||||
# Link ontology records to artifact
|
||||
artifact.feature_sets.add_ontology(cell_types)
|
||||
```
|
||||
|
||||
### Annotating DataFrames
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
# Create DataFrame with biological entities
|
||||
df = pd.DataFrame({
|
||||
"cell_type": ["T cell", "B cell", "NK cell"],
|
||||
"tissue": ["blood", "spleen", "liver"],
|
||||
"disease": ["healthy", "lymphoma", "healthy"]
|
||||
})
|
||||
|
||||
# Validate and standardize
|
||||
df["cell_type"] = bt.CellType.standardize(df["cell_type"])
|
||||
df["tissue"] = bt.Tissue.standardize(df["tissue"])
|
||||
|
||||
# Create artifact
|
||||
artifact = ln.Artifact.from_dataframe(
|
||||
df,
|
||||
key="metadata/sample_info.parquet"
|
||||
).save()
|
||||
|
||||
# Link ontology records
|
||||
cell_type_records = bt.CellType.from_values(df["cell_type"])
|
||||
tissue_records = bt.Tissue.from_values(df["tissue"])
|
||||
|
||||
artifact.feature_sets.add_ontology(cell_type_records)
|
||||
artifact.feature_sets.add_ontology(tissue_records)
|
||||
```
|
||||
|
||||
## Managing Custom Terms and Hierarchies
|
||||
|
||||
### Adding Custom Terms
|
||||
|
||||
```python
|
||||
# Register new term not in public ontology
|
||||
my_celltype = bt.CellType(name="my_novel_T_cell_subtype").save()
|
||||
|
||||
# Establish parent-child relationship
|
||||
parent = bt.CellType.get(name="T cell")
|
||||
my_celltype.parents.add(parent)
|
||||
|
||||
# Verify relationship
|
||||
my_celltype.parents.to_dataframe()
|
||||
parent.children.to_dataframe() # Should include my_celltype
|
||||
```
|
||||
|
||||
### Adding Synonyms
|
||||
|
||||
```python
|
||||
# Add synonyms for standardization
|
||||
hsc = bt.CellType.get(name="hematopoietic stem cell")
|
||||
hsc.add_synonym("HSC")
|
||||
hsc.add_synonym("blood stem cell")
|
||||
hsc.add_synonym("hematopoietic progenitor")
|
||||
|
||||
# Set abbreviation
|
||||
hsc.set_abbr("HSC")
|
||||
|
||||
# Now standardization works with synonyms
|
||||
bt.CellType.standardize(["HSC", "blood stem cell"])
|
||||
# Returns: ['hematopoietic stem cell', 'hematopoietic stem cell']
|
||||
```
|
||||
|
||||
### Creating Custom Hierarchies
|
||||
|
||||
```python
|
||||
# Build custom cell type hierarchy
|
||||
immune_cell = bt.CellType.get(name="immune cell")
|
||||
|
||||
# Add custom subtypes
|
||||
my_subtype1 = bt.CellType(name="custom_immune_subtype_1").save()
|
||||
my_subtype2 = bt.CellType(name="custom_immune_subtype_2").save()
|
||||
|
||||
# Link to parent
|
||||
my_subtype1.parents.add(immune_cell)
|
||||
my_subtype2.parents.add(immune_cell)
|
||||
|
||||
# Create sub-subtypes
|
||||
my_subsubtype = bt.CellType(name="custom_sub_subtype").save()
|
||||
my_subsubtype.parents.add(my_subtype1)
|
||||
|
||||
# Visualize custom hierarchy
|
||||
immune_cell.view_parents(with_children=True)
|
||||
```
|
||||
|
||||
## Multi-Organism Support
|
||||
|
||||
For organism-aware registries like Gene:
|
||||
|
||||
```python
|
||||
# Set global organism
|
||||
bt.settings.organism = "human"
|
||||
|
||||
# Validate human genes
|
||||
bt.Gene.validate(["TCF7", "CD8A"], organism="human")
|
||||
|
||||
# Load genes for specific organism
|
||||
human_genes = bt.Gene.from_values(["CD8A", "TP53"], organism="human")
|
||||
mouse_genes = bt.Gene.from_values(["Cd8a", "Trp53"], organism="mouse")
|
||||
|
||||
# Search organism-specific genes
|
||||
bt.Gene.search("CD8", organism="human").to_dataframe()
|
||||
bt.Gene.search("Cd8", organism="mouse").to_dataframe()
|
||||
|
||||
# Switch organism context
|
||||
bt.settings.organism = "mouse"
|
||||
genes = bt.Gene.from_source(symbol="Ap5b1")
|
||||
```
|
||||
|
||||
## Public Ontology Lookup
|
||||
|
||||
Access terms from public ontologies without importing:
|
||||
|
||||
```python
|
||||
# Interactive lookup in public sources
|
||||
cell_types_public = bt.CellType.lookup(public=True)
|
||||
|
||||
# Access public terms
|
||||
hepatocyte = cell_types_public.hepatocyte
|
||||
|
||||
# Import specific term
|
||||
hepatocyte_local = bt.CellType.from_source(name="hepatocyte")
|
||||
|
||||
# Or import by ontology ID
|
||||
specific_cell = bt.CellType.from_source(ontology_id="CL:0000182")
|
||||
```
|
||||
|
||||
## Version Tracking
|
||||
|
||||
LaminDB automatically tracks ontology versions:
|
||||
|
||||
```python
|
||||
# View current source versions
|
||||
bt.Source.filter(currently_used=True).to_dataframe()
|
||||
|
||||
# Check which source a record derives from
|
||||
cell_type = bt.CellType.get(name="hepatocyte")
|
||||
cell_type.source # Returns Source metadata
|
||||
|
||||
# View source details
|
||||
source = cell_type.source
|
||||
print(source.name) # e.g., "cl"
|
||||
print(source.version) # e.g., "2023-05-18"
|
||||
print(source.url) # Ontology URL
|
||||
```
|
||||
|
||||
## Ontology Integration Workflows
|
||||
|
||||
### Workflow 1: Validate Existing Data
|
||||
|
||||
```python
|
||||
# Load data with biological annotations
|
||||
adata = ad.read_h5ad("uncurated_data.h5ad")
|
||||
|
||||
# Validate cell types
|
||||
validation = bt.CellType.validate(adata.obs.cell_type)
|
||||
|
||||
# Identify invalid terms
|
||||
invalid_idx = [i for i, v in enumerate(validation) if not v]
|
||||
invalid_terms = adata.obs.cell_type.iloc[invalid_idx].unique()
|
||||
print(f"Invalid cell types: {invalid_terms}")
|
||||
|
||||
# Fix invalid terms manually or with standardization
|
||||
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs.cell_type)
|
||||
|
||||
# Re-validate
|
||||
validation = bt.CellType.validate(adata.obs.cell_type)
|
||||
assert all(validation), "All terms should now be valid"
|
||||
```
|
||||
|
||||
### Workflow 2: Curate and Annotate
|
||||
|
||||
```python
|
||||
import lamindb as ln
|
||||
|
||||
ln.track() # Start tracking
|
||||
|
||||
# Load data
|
||||
df = pd.read_csv("experimental_data.csv")
|
||||
|
||||
# Standardize using ontologies
|
||||
df["cell_type"] = bt.CellType.standardize(df["cell_type"])
|
||||
df["tissue"] = bt.Tissue.standardize(df["tissue"])
|
||||
|
||||
# Create curated artifact
|
||||
artifact = ln.Artifact.from_dataframe(
|
||||
df,
|
||||
key="curated/experiment_2025_10.parquet",
|
||||
description="Curated experimental data with ontology-validated annotations"
|
||||
).save()
|
||||
|
||||
# Link ontology records
|
||||
artifact.feature_sets.add_ontology(bt.CellType.from_values(df["cell_type"]))
|
||||
artifact.feature_sets.add_ontology(bt.Tissue.from_values(df["tissue"]))
|
||||
|
||||
ln.finish() # Complete tracking
|
||||
```
|
||||
|
||||
### Workflow 3: Cross-Organism Gene Mapping
|
||||
|
||||
```python
|
||||
# Get human genes
|
||||
human_genes = ["CD8A", "CD8B", "TP53"]
|
||||
human_records = bt.Gene.from_values(human_genes, organism="human")
|
||||
|
||||
# Find mouse orthologs (requires external mapping)
|
||||
# LaminDB doesn't provide built-in ortholog mapping
|
||||
# Use external tools like Ensembl BioMart or homologene
|
||||
|
||||
mouse_orthologs = ["Cd8a", "Cd8b", "Trp53"]
|
||||
mouse_records = bt.Gene.from_values(mouse_orthologs, organism="mouse")
|
||||
```
|
||||
|
||||
## Querying Ontology-Annotated Data
|
||||
|
||||
```python
|
||||
# Find all datasets with specific cell type
|
||||
t_cell = bt.CellType.get(name="T cell")
|
||||
ln.Artifact.filter(feature_sets__cell_types=t_cell).to_dataframe()
|
||||
|
||||
# Find datasets measuring specific genes
|
||||
cd8a = bt.Gene.get(symbol="CD8A", organism="human")
|
||||
ln.Artifact.filter(feature_sets__genes=cd8a).to_dataframe()
|
||||
|
||||
# Query across ontology hierarchy
|
||||
# Find all datasets with T cell or T cell subtypes
|
||||
t_cell_subtypes = t_cell.query_children()
|
||||
ln.Artifact.filter(
|
||||
feature_sets__cell_types__in=t_cell_subtypes
|
||||
).to_dataframe()
|
||||
```
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Import ontologies first**: Call `import_source()` before validation
|
||||
2. **Use standardization**: Leverage synonym mapping to handle variations
|
||||
3. **Validate early**: Check terms before creating artifacts
|
||||
4. **Set organism context**: Specify organism for gene-related queries
|
||||
5. **Add custom synonyms**: Register common variations in your domain
|
||||
6. **Use public lookup**: Access `lookup(public=True)` for term discovery
|
||||
7. **Track versions**: Monitor ontology source versions for reproducibility
|
||||
8. **Build hierarchies**: Link custom terms to existing ontology structures
|
||||
9. **Query hierarchically**: Use `query_children()` for comprehensive searches
|
||||
10. **Document mappings**: Track custom term additions and relationships
|
||||
|
||||
## Common Ontology Operations
|
||||
|
||||
```python
|
||||
# Check if term exists
|
||||
exists = bt.CellType.filter(name="T cell").exists()
|
||||
|
||||
# Count terms in registry
|
||||
n_cell_types = bt.CellType.filter().count()
|
||||
|
||||
# Get all terms with specific parent
|
||||
immune_cells = bt.CellType.filter(parents__name="immune cell")
|
||||
|
||||
# Find orphan terms (no parents)
|
||||
orphans = bt.CellType.filter(parents__isnull=True)
|
||||
|
||||
# Get recently added terms
|
||||
from datetime import datetime, timedelta
|
||||
recent = bt.CellType.filter(
|
||||
created_at__gte=datetime.now() - timedelta(days=7)
|
||||
)
|
||||
```
|
||||
Reference in New Issue
Block a user