Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/lamindb/references/ontologies.md
+++ b/skills/lamindb/references/ontologies.md
@@ -0,0 +1,497 @@
+# LaminDB Ontology Management
+
+This document covers biological ontology management in LaminDB through the Bionty plugin, including accessing, searching, and annotating data with standardized biological terms.
+
+## Overview
+
+LaminDB integrates the `bionty` plugin to manage standardized biological ontologies, enabling consistent metadata curation and data annotation across research projects. Bionty provides access to 20+ curated biological ontologies covering genes, proteins, cell types, tissues, diseases, and more.
+
+## Available Ontologies
+
+LaminDB provides access to multiple curated ontology sources:
+
+| Registry | Ontology Source | Description |
+|----------|----------------|-------------|
+| **Gene** | Ensembl | Genes across organisms (human, mouse, etc.) |
+| **Protein** | UniProt | Protein sequences and annotations |
+| **CellType** | Cell Ontology (CL) | Standardized cell type classifications |
+| **CellLine** | Cell Line Ontology (CLO) | Cell line annotations |
+| **Tissue** | Uberon | Anatomical structures and tissues |
+| **Disease** | Mondo, DOID | Disease classifications |
+| **Phenotype** | Human Phenotype Ontology (HPO) | Phenotypic abnormalities |
+| **Pathway** | Gene Ontology (GO) | Biological pathways and processes |
+| **ExperimentalFactor** | Experimental Factor Ontology (EFO) | Experimental variables |
+| **DevelopmentalStage** | Various | Developmental stages across organisms |
+| **Ethnicity** | HANCESTRO | Human ancestry ontology |
+| **Drug** | DrugBank | Drug compounds |
+| **Organism** | NCBItaxon | Taxonomic classifications |
+
+## Installation and Import
+
+```python
+# Install from a shell first (bionty is included with lamindb):
+#   pip install lamindb
+
+# Import
+import lamindb as ln
+import bionty as bt
+```
+
+## Importing Public Ontologies
+
+Populate your registry with a public ontology source:
+
+```python
+# Import Cell Ontology
+bt.CellType.import_source()
+
+# Import organism-specific genes
+bt.Gene.import_source(organism="human")
+bt.Gene.import_source(organism="mouse")
+
+# Import tissues
+bt.Tissue.import_source()
+
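+# Import diseases (pass a Source record to select a specific source;
+# assumes both sources are listed in the bt.Source registry)
+mondo = bt.Source.filter(entity="bionty.Disease", name="mondo").first()
+doid = bt.Source.filter(entity="bionty.Disease", name="doid").first()
+bt.Disease.import_source(source=mondo)  # Mondo Disease Ontology
+bt.Disease.import_source(source=doid)   # Disease Ontology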
+```
+
+## Searching and Accessing Records
+
+### Keyword Search
+
+```python
+# Search cell types
+bt.CellType.search("T cell").to_dataframe()
+bt.CellType.search("gamma-delta").to_dataframe()
+
+# Search genes
+bt.Gene.search("CD8").to_dataframe()
+bt.Gene.search("TP53").to_dataframe()
+
+# Search diseases
+bt.Disease.search("cancer").to_dataframe()
+
+# Search tissues
+bt.Tissue.search("brain").to_dataframe()
+```
+
+### Auto-Complete Lookup
+
+Lookup objects offer tab completion and are practical for registries with fewer than about 100k records:
+
+```python
+# Create lookup object
+cell_types = bt.CellType.lookup()
+
+# Access by name (auto-complete in IDEs)
+t_cell = cell_types.t_cell
+hsc = cell_types.hematopoietic_stem_cell
+
+# Similarly for other registries
+genes = bt.Gene.lookup()
+cd8a = genes.cd8a
+```
+
+### Exact Field Matching
+
+```python
+# By ontology ID
+cell_type = bt.CellType.get(ontology_id="CL:0000798")
+disease = bt.Disease.get(ontology_id="MONDO:0004992")
+
+# By name
+cell_type = bt.CellType.get(name="T cell")
+gene = bt.Gene.get(symbol="CD8A")
+
+# By Ensembl ID
+gene = bt.Gene.get(ensembl_gene_id="ENSG00000153563")
+```
+
+## Ontological Hierarchies
+
+### Exploring Relationships
+
+```python
+# Get a cell type
+gdt_cell = bt.CellType.get(ontology_id="CL:0000798")  # gamma-delta T cell
+
+# View direct parents
+gdt_cell.parents.to_dataframe()
+
+# View all ancestors recursively (follows every parent, not only the first)
+gdt_cell.query_parents().to_dataframe()
+
+# View direct children
+gdt_cell.children.to_dataframe()
+
+# View all descendants recursively
+gdt_cell.query_children().to_dataframe()
+```
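
Because an ontology term can have several parents, collecting every ancestor means walking all parent links rather than a single chain. The sketch below shows that traversal on a small hand-written parent map; the map and the `all_ancestors` helper are illustrative stand-ins and are not part of bionty.

```python
from collections import deque

# Toy parent map (term -> direct parents); hand-written for illustration,
# not real Cell Ontology content.
PARENTS = {
    "gamma-delta T cell": ["T cell"],
    "T cell": ["lymphocyte", "mature T-lineage cell"],
    "lymphocyte": ["leukocyte"],
    "mature T-lineage cell": ["leukocyte"],
    "leukocyte": [],
}


def all_ancestors(term, parents):
    """Return every ancestor of `term`, nearest first, without duplicates."""
    seen = []
    queue = deque(parents.get(term, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already reached through another branch
        seen.append(current)
        queue.extend(parents.get(current, []))
    return seen


print(all_ancestors("gamma-delta T cell", PARENTS))
```

A breadth-first walk keeps the nearest ancestors at the front of the result, and the `seen` check prevents shared ancestors from being listed twice.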
+
+### Visualizing Hierarchies
+
+```python
+# Visualize parent hierarchy
+gdt_cell.view_parents()
+
+# Include children in visualization
+gdt_cell.view_parents(with_children=True)
+
+# Get all related terms for visualization
+t_cell = bt.CellType.get(name="T cell")
+t_cell.view_parents(with_children=True)  # Shows T cell subtypes
+```
+
+## Standardizing and Validating Data
+
+### Validation
+
+Check if terms exist in the ontology:
+
+```python
+# Validate cell types
+bt.CellType.validate(["T cell", "B cell", "invalid_cell"])
+# Returns a boolean array: [True, True, False]
+
+# Validate genes
+bt.Gene.validate(["CD8A", "TP53", "FAKEGENE"], organism="human")
+# Returns a boolean array: [True, True, False]
+
+# Check which terms are invalid
+terms = ["T cell", "fat cell", "neuron", "invalid_term"]
+invalid = [t for t, valid in zip(terms, bt.CellType.validate(terms)) if not valid]
+print(f"Invalid terms: {invalid}")
+```
+
+### Standardization with Synonyms
+
+Convert non-standard terms to validated names:
+
+```python
+# Standardize cell type names
+bt.CellType.standardize(["fat cell", "blood forming stem cell"])
+# Returns: ['adipocyte', 'hematopoietic stem cell']
+
+# Standardize genes
+bt.Gene.standardize(["BRCA-1", "p53"], organism="human")
+# Returns: ['BRCA1', 'TP53']
+
+# Handle mixed valid/invalid terms
+terms = ["T cell", "T lymphocyte", "invalid"]
+standardized = bt.CellType.standardize(terms)
+# Returns standardized names where possible
+```
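
When a batch mixes valid names, synonyms, and unknown terms, it helps to see which inputs were rewritten and which still fail validation. The helper below is a sketch that takes the two operations as plain callables, so `bt.CellType.standardize` and `bt.CellType.validate` could be passed in. `standardize_with_report` and the toy stand-ins are hypothetical names, used only so the example runs without a database.

```python
def standardize_with_report(terms, standardize, validate):
    """Standardize `terms`, then report rewrites and remaining failures.

    `standardize` maps a list of terms to a list of the same length;
    `validate` maps a list of terms to a same-length sequence of booleans.
    """
    terms = list(terms)
    standardized = list(standardize(terms))
    mask = list(validate(standardized))
    rewritten = {old: new for old, new in zip(terms, standardized) if old != new}
    unresolved = [t for t, ok in zip(standardized, mask) if not ok]
    return standardized, rewritten, unresolved


# Toy stand-ins so the sketch is self-contained (not bionty behaviour).
_NAMES = {"T cell", "adipocyte"}
_SYNONYMS = {"T lymphocyte": "T cell", "fat cell": "adipocyte"}


def toy_standardize(values):
    return [_SYNONYMS.get(v, v) for v in values]


def toy_validate(values):
    return [v in _NAMES for v in values]


standardized, rewritten, unresolved = standardize_with_report(
    ["T cell", "T lymphocyte", "invalid"], toy_standardize, toy_validate
)
print(standardized)
print(rewritten)
print(unresolved)
```

Keeping the unresolved terms separate makes it clear which values still need manual curation before an artifact is created.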
+
+### Loading Validated Records
+
+```python
+# Load records from values (including synonyms)
+records = bt.CellType.from_values(["fat cell", "blood forming stem cell"])
+
+# Returns list of CellType records
+for record in records:
+    print(record.name, record.ontology_id)
+
+# Use with gene symbols
+genes = bt.Gene.from_values(["CD8A", "CD8B"], organism="human")
+```
+
+## Annotating Datasets
+
+### Annotating AnnData
+
+```python
+import anndata as ad
+import bionty as bt
+import lamindb as ln
+
+# Load example data
+adata = ad.read_h5ad("data.h5ad")
+
+# Validate and retrieve matching records
+cell_types = bt.CellType.from_values(adata.obs.cell_type)
+
+# Create artifact with annotations
+artifact = ln.Artifact.from_anndata(
+    adata,
+    key="scrna/annotated_data.h5ad",
+    description="scRNA-seq data with validated cell type annotations"
+).save()
+
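+# Link ontology records to the artifact (save any new records first).
+# Assumes the `cell_types` relation that bionty adds to Artifact.
+ln.save(cell_types)
+artifact.cell_types.add(*cell_types)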
+```
+
+### Annotating DataFrames
+
+```python
+import pandas as pd
+
+# Create DataFrame with biological entities
+df = pd.DataFrame({
+    "cell_type": ["T cell", "B cell", "NK cell"],
+    "tissue": ["blood", "spleen", "liver"],
+    "disease": ["healthy", "lymphoma", "healthy"]
+})
+
+# Validate and standardize
+df["cell_type"] = bt.CellType.standardize(df["cell_type"])
+df["tissue"] = bt.Tissue.standardize(df["tissue"])
+
+# Create artifact
+artifact = ln.Artifact.from_dataframe(
+    df,
+    key="metadata/sample_info.parquet"
+).save()
+
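+# Link ontology records (save any new records first).
+# Assumes the `cell_types` and `tissues` relations that bionty adds to Artifact.
+cell_type_records = bt.CellType.from_values(df["cell_type"])
+tissue_records = bt.Tissue.from_values(df["tissue"])
+ln.save(cell_type_records)
+ln.save(tissue_records)
+
+artifact.cell_types.add(*cell_type_records)
+artifact.tissues.add(*tissue_records)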
+```
+
+## Managing Custom Terms and Hierarchies
+
+### Adding Custom Terms
+
+```python
+# Register new term not in public ontology
+my_celltype = bt.CellType(name="my_novel_T_cell_subtype").save()
+
+# Establish parent-child relationship
+parent = bt.CellType.get(name="T cell")
+my_celltype.parents.add(parent)
+
+# Verify relationship
+my_celltype.parents.to_dataframe()
+parent.children.to_dataframe()  # Should include my_celltype
+```
+
+### Adding Synonyms
+
+```python
+# Add synonyms for standardization
+hsc = bt.CellType.get(name="hematopoietic stem cell")
+hsc.add_synonym("HSC")
+hsc.add_synonym("blood stem cell")
+hsc.add_synonym("hematopoietic progenitor")
+
+# Set abbreviation
+hsc.set_abbr("HSC")
+
+# Now standardization works with synonyms
+bt.CellType.standardize(["HSC", "blood stem cell"])
+# Returns: ['hematopoietic stem cell', 'hematopoietic stem cell']
+```
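
Conceptually, synonym-based standardization is a reverse lookup from each synonym to its canonical name. The sketch below builds such an index from records that keep their synonyms in one `|`-separated string. That storage layout, the sample records, the case-insensitive matching, and the `build_synonym_index` helper are all assumptions made for illustration, not a description of bionty internals.

```python
def build_synonym_index(records):
    """Map every name and synonym (case-insensitively) to the canonical name."""
    index = {}
    for name, synonyms in records:
        index[name.lower()] = name
        for synonym in filter(None, (s.strip() for s in synonyms.split("|"))):
            # setdefault keeps the first owner if two terms share a synonym
            index.setdefault(synonym.lower(), name)
    return index


# Sample (name, synonyms) pairs written by hand for this example.
RECORDS = [
    ("hematopoietic stem cell", "HSC|blood stem cell|hematopoietic progenitor"),
    ("adipocyte", "fat cell|adipose cell"),
]

INDEX = build_synonym_index(RECORDS)


def standardize(terms, index=INDEX):
    """Replace known synonyms with canonical names; leave unknown terms as-is."""
    return [index.get(term.lower(), term) for term in terms]


print(standardize(["HSC", "Fat Cell", "unknown cell"]))
```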
+
+### Creating Custom Hierarchies
+
+```python
+# Build custom cell type hierarchy
+immune_cell = bt.CellType.get(name="immune cell")
+
+# Add custom subtypes
+my_subtype1 = bt.CellType(name="custom_immune_subtype_1").save()
+my_subtype2 = bt.CellType(name="custom_immune_subtype_2").save()
+
+# Link to parent
+my_subtype1.parents.add(immune_cell)
+my_subtype2.parents.add(immune_cell)
+
+# Create sub-subtypes
+my_subsubtype = bt.CellType(name="custom_sub_subtype").save()
+my_subsubtype.parents.add(my_subtype1)
+
+# Visualize custom hierarchy
+immune_cell.view_parents(with_children=True)
+```
+
+## Multi-Organism Support
+
+For organism-aware registries like Gene:
+
+```python
+# Set global organism
+bt.settings.organism = "human"
+
+# Validate human genes
+bt.Gene.validate(["TCF7", "CD8A"], organism="human")
+
+# Load genes for specific organism
+human_genes = bt.Gene.from_values(["CD8A", "TP53"], organism="human")
+mouse_genes = bt.Gene.from_values(["Cd8a", "Trp53"], organism="mouse")
+
+# Search organism-specific genes
+bt.Gene.search("CD8", organism="human").to_dataframe()
+bt.Gene.search("Cd8", organism="mouse").to_dataframe()
+
+# Switch organism context
+bt.settings.organism = "mouse"
+genes = bt.Gene.from_source(symbol="Ap5b1")
+```
+
+## Public Ontology Lookup
+
+Access terms from public ontologies without importing:
+
+```python
+# Interactive lookup in public sources
+cell_types_public = bt.CellType.public().lookup()
+
+# Access public terms
+hepatocyte = cell_types_public.hepatocyte
+
+# Import specific term (save it to add it to your registry)
+hepatocyte_local = bt.CellType.from_source(name="hepatocyte").save()
+
+# Or import by ontology ID
+specific_cell = bt.CellType.from_source(ontology_id="CL:0000182").save()
+```
+
+## Version Tracking
+
+LaminDB automatically tracks ontology versions:
+
+```python
+# View current source versions
+bt.Source.filter(currently_used=True).to_dataframe()
+
+# Check which source a record derives from
+cell_type = bt.CellType.get(name="hepatocyte")
+cell_type.source  # Returns Source metadata
+
+# View source details
+source = cell_type.source
+print(source.name)        # e.g., "cl"
+print(source.version)     # e.g., "2023-05-18"
+print(source.url)         # Ontology URL
+```
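
For reproducibility it can be useful to store the ontology versions next to an analysis result. The sketch below turns source records into a plain, JSON-serializable snapshot. It only assumes that each record exposes `entity`, `name`, and `version` attributes, and it uses `SimpleNamespace` objects with made-up values as stand-ins for real `bt.Source` records. `snapshot_sources` is a hypothetical helper.

```python
import json
from types import SimpleNamespace


def snapshot_sources(sources):
    """Return one dict per source, sorted by entity and then name."""
    rows = [
        {"entity": s.entity, "name": s.name, "version": s.version}
        for s in sources
    ]
    return sorted(rows, key=lambda row: (row["entity"], row["name"]))


# Stand-ins for the records a query for the currently used sources
# might return; the values are made up for this example.
sources = [
    SimpleNamespace(entity="bionty.Tissue", name="uberon", version="2024-01-01"),
    SimpleNamespace(entity="bionty.CellType", name="cl", version="2023-05-18"),
]

snapshot = snapshot_sources(sources)
print(json.dumps(snapshot, indent=2))
```

A list of rows is used instead of a dict keyed by entity because one entity can have more than one source, for example two disease ontologies.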
+
+## Ontology Integration Workflows
+
+### Workflow 1: Validate Existing Data
+
+```python
+import anndata as ad
+import bionty as bt
+
+# Load data with biological annotations
+adata = ad.read_h5ad("uncurated_data.h5ad")
+
+# Validate cell types
+validation = bt.CellType.validate(adata.obs.cell_type)
+
+# Identify invalid terms
+invalid_idx = [i for i, v in enumerate(validation) if not v]
+invalid_terms = adata.obs.cell_type.iloc[invalid_idx].unique()
+print(f"Invalid cell types: {invalid_terms}")
+
+# Map known synonyms to standard names; terms without a synonym stay unchanged
+adata.obs["cell_type"] = bt.CellType.standardize(adata.obs.cell_type)
+
+# Re-validate; anything still invalid needs manual curation
+validation = bt.CellType.validate(adata.obs.cell_type)
+assert all(validation), "Some terms are still invalid after standardization"
+```
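
With many observations, the same invalid label usually appears many times, so ranking invalid labels by frequency shows where manual curation pays off first. The sketch below does this for any sequence of terms and a matching boolean mask, such as the result of a `validate` call. `rank_invalid_terms` and the sample data are hypothetical.

```python
from collections import Counter


def rank_invalid_terms(terms, mask):
    """Count terms whose mask entry is falsy, most frequent first."""
    counts = Counter(term for term, ok in zip(terms, mask) if not ok)
    return counts.most_common()


# Hand-written sample: per-cell labels and a validation mask for them.
labels = ["T cell", "Tcell", "B cell", "Tcell", "unknwn", "Tcell"]
mask = [True, False, True, False, False, False]

print(rank_invalid_terms(labels, mask))
```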
+
+### Workflow 2: Curate and Annotate
+
+```python
+import bionty as bt
+import lamindb as ln
+import pandas as pd
+
+ln.track()  # Start tracking
+
+# Load data
+df = pd.read_csv("experimental_data.csv")
+
+# Standardize using ontologies
+df["cell_type"] = bt.CellType.standardize(df["cell_type"])
+df["tissue"] = bt.Tissue.standardize(df["tissue"])
+
+# Create curated artifact
+artifact = ln.Artifact.from_dataframe(
+    df,
+    key="curated/experiment_2025_10.parquet",
+    description="Curated experimental data with ontology-validated annotations"
+).save()
+
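+# Link ontology records (save any new records first).
+# Assumes the `cell_types` and `tissues` relations that bionty adds to Artifact.
+cell_type_records = bt.CellType.from_values(df["cell_type"])
+tissue_records = bt.Tissue.from_values(df["tissue"])
+ln.save(cell_type_records)
+ln.save(tissue_records)
+artifact.cell_types.add(*cell_type_records)
+artifact.tissues.add(*tissue_records)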
+
+ln.finish()  # Complete tracking
+```
+
+### Workflow 3: Cross-Organism Gene Mapping
+
+```python
+# Get human genes
+human_genes = ["CD8A", "CD8B", "TP53"]
+human_records = bt.Gene.from_values(human_genes, organism="human")
+
+# Find mouse orthologs (requires external mapping)
+# LaminDB doesn't provide built-in ortholog mapping
+# Use external tools like Ensembl BioMart or homologene
+
+mouse_orthologs = ["Cd8a", "Cd8b1", "Trp53"]  # the mouse ortholog of CD8B is Cd8b1
+mouse_records = bt.Gene.from_values(mouse_orthologs, organism="mouse")
+```
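
Once an ortholog table has been obtained from an external resource, applying it is a dictionary lookup plus a check for genes without a match. The sketch below assumes the table is already loaded as a Python dict. The pairs shown are a small hand-written sample, and `map_orthologs` is a hypothetical helper, not a LaminDB or bionty function.

```python
def map_orthologs(symbols, table):
    """Split `symbols` into mapped orthologs and symbols missing from `table`."""
    mapped = {s: table[s] for s in symbols if s in table}
    missing = [s for s in symbols if s not in table]
    return mapped, missing


# Hand-written sample of human -> mouse symbol pairs; a real table would
# come from an external source such as an Ensembl BioMart export.
HUMAN_TO_MOUSE = {"CD8A": "Cd8a", "TP53": "Trp53", "CD4": "Cd4"}

mapped, missing = map_orthologs(["CD8A", "TP53", "FAKEGENE"], HUMAN_TO_MOUSE)
print(mapped)
print(missing)
```

The values of `mapped` could then be passed to `bt.Gene.from_values(..., organism="mouse")`, while `missing` lists the symbols that need a manual decision.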
+
+## Querying Ontology-Annotated Data
+
+```python
+# Find all datasets with specific cell type
+# (assumes the `cell_types` relation that bionty adds to Artifact)
+t_cell = bt.CellType.get(name="T cell")
+ln.Artifact.filter(cell_types=t_cell).to_dataframe()
+
+# Find datasets measuring specific genes
+cd8a = bt.Gene.get(symbol="CD8A", organism="human")
+ln.Artifact.filter(feature_sets__genes=cd8a).to_dataframe()
+
+# Query across ontology hierarchy
+# Find all datasets with T cell or T cell subtypes
+t_cell_and_subtypes = [t_cell, *t_cell.query_children()]
+ln.Artifact.filter(
+    cell_types__in=t_cell_and_subtypes
+).distinct().to_dataframe()
+```
+
+## Best Practices
+
+1. **Import ontologies first**: Call `import_source()` before validation
+2. **Use standardization**: Leverage synonym mapping to handle variations
+3. **Validate early**: Check terms before creating artifacts
+4. **Set organism context**: Specify organism for gene-related queries
+5. **Add custom synonyms**: Register common variations in your domain
+6. **Use public lookup**: Access `public().lookup()` for term discovery
+7. **Track versions**: Monitor ontology source versions for reproducibility
+8. **Build hierarchies**: Link custom terms to existing ontology structures
+9. **Query hierarchically**: Use `query_children()` for comprehensive searches
+10. **Document mappings**: Track custom term additions and relationships
+
+## Common Ontology Operations
+
+```python
+# Check if term exists
+exists = bt.CellType.filter(name="T cell").exists()
+
+# Count terms in registry
+n_cell_types = bt.CellType.filter().count()
+
+# Get all terms with specific parent
+immune_cells = bt.CellType.filter(parents__name="immune cell")
+
+# Find orphan terms (no parents)
+orphans = bt.CellType.filter(parents__isnull=True)
+
+# Get recently added terms (use a timezone-aware cutoff to match created_at)
+from datetime import datetime, timedelta, timezone
+recent = bt.CellType.filter(
+    created_at__gte=datetime.now(timezone.utc) - timedelta(days=7)
+)
+```