12 KiB
LaminDB Annotation & Validation
This document covers data curation, validation, schema management, and annotation best practices in LaminDB.
Overview
LaminDB's curation process ensures datasets are both validated and queryable through three essential steps:
- Validation: Confirming datasets match desired schemas
- Standardization: Fixing inconsistencies like typos and mapping synonyms
- Annotation: Linking datasets to metadata entities for queryability
Schema Design
Schemas define expected data structure, types, and validation rules. LaminDB supports three main schema approaches:
1. Flexible Schema
Validates only columns matching Feature registry names, allowing additional metadata:
import lamindb as ln
# Create flexible schema
schema = ln.Schema(
name="valid_features",
itype=ln.Feature # Validates against Feature registry
).save()
# Any column matching a Feature name will be validated
# Additional columns are permitted but not validated
2. Minimal Required Schema
Specifies essential columns while permitting extra metadata:
# Define required features
required_features = [
ln.Feature.get(name="cell_type"),
ln.Feature.get(name="tissue"),
ln.Feature.get(name="donor_id")
]
# Create schema with required features
schema = ln.Schema(
name="minimal_immune_schema",
features=required_features,
flexible=True # Allows additional columns
).save()
3. Strict Schema
Enforces complete control over data structure:
# Define all allowed features
all_features = [
ln.Feature.get(name="cell_type"),
ln.Feature.get(name="tissue"),
ln.Feature.get(name="donor_id"),
ln.Feature.get(name="disease")
]
# Create strict schema
schema = ln.Schema(
name="strict_immune_schema",
features=all_features,
flexible=False # No additional columns allowed
).save()
DataFrame Curation Workflow
The typical curation process involves six key steps:
Step 1-2: Load Data and Establish Registries
import pandas as pd
import lamindb as ln
# Load data
df = pd.read_csv("experiment.csv")
# Define and save features
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="tissue", dtype=str).save()
ln.Feature(name="gene_count", dtype=int).save()
ln.Feature(name="experiment_date", dtype="date").save()
# Populate valid values (if using controlled vocabulary)
import bionty as bt
bt.CellType.import_source()
bt.Tissue.import_source()
Step 3: Create Schema
# Link features to schema
features = [
ln.Feature.get(name="cell_type"),
ln.Feature.get(name="tissue"),
ln.Feature.get(name="gene_count"),
ln.Feature.get(name="experiment_date")
]
schema = ln.Schema(
name="experiment_schema",
features=features,
flexible=True
).save()
Step 4: Initialize Curator and Validate
# Initialize curator
curator = ln.curators.DataFrameCurator(df, schema)
# Validate dataset
validation = curator.validate()
# Check validation results
if validation:
print("✓ Validation passed")
else:
print("✗ Validation failed")
curator.non_validated # See problematic fields
Step 5: Fix Validation Issues
Standardize Values
# Fix typos and synonyms in categorical columns
curator.cat.standardize("cell_type")
curator.cat.standardize("tissue")
# View standardization mapping
curator.cat.inspect_standardize("cell_type")
Map to Ontologies
# Map values to ontology terms
curator.cat.add_ontology("cell_type", bt.CellType)
curator.cat.add_ontology("tissue", bt.Tissue)
# Look up public ontologies for unmapped terms
curator.cat.lookup(public=True).cell_type # Interactive lookup
Add New Terms
# Add new valid terms to registry
curator.cat.add_new_from("cell_type")
# Or manually create records
new_cell_type = bt.CellType(name="my_novel_cell_type").save()
Rename Columns
# Rename columns to match feature names
df = df.rename(columns={"celltype": "cell_type"})
# Re-initialize curator with fixed DataFrame
curator = ln.curators.DataFrameCurator(df, schema)
Step 6: Save Curated Artifact
# Save with schema linkage
artifact = curator.save_artifact(
key="experiments/curated_data.parquet",
description="Validated and annotated experimental data"
)
# Verify artifact has schema
artifact.schema # Returns the schema object
artifact.describe() # Shows validation status
AnnData Curation
For composite structures like AnnData, use "slots" to validate different components:
Defining AnnData Schemas
# Create schemas for different slots
obs_schema = ln.Schema(
name="cell_metadata",
features=[
ln.Feature.get(name="cell_type"),
ln.Feature.get(name="tissue"),
ln.Feature.get(name="donor_id")
]
).save()
var_schema = ln.Schema(
name="gene_ids",
features=[ln.Feature.get(name="ensembl_gene_id")]
).save()
# Create composite AnnData schema
anndata_schema = ln.Schema(
name="scrna_schema",
otype="AnnData",
slots={
"obs": obs_schema,
"var.T": var_schema # .T indicates transposition
}
).save()
Curating AnnData Objects
import anndata as ad
# Load AnnData
adata = ad.read_h5ad("data.h5ad")
# Initialize curator
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
# Validate all slots
validation = curator.validate()
# Fix issues by slot
curator.cat.standardize("obs", "cell_type")
curator.cat.add_ontology("obs", "cell_type", bt.CellType)
curator.cat.standardize("var.T", "ensembl_gene_id")
# Save curated artifact
artifact = curator.save_artifact(
key="scrna/validated_data.h5ad",
description="Curated single-cell RNA-seq data"
)
MuData Curation
MuData supports multi-modal data through modality-specific slots:
# Define schemas for each modality
rna_obs_schema = ln.Schema(name="rna_obs_schema", features=[...]).save()
protein_obs_schema = ln.Schema(name="protein_obs_schema", features=[...]).save()
# Create MuData schema
mudata_schema = ln.Schema(
name="multimodal_schema",
otype="MuData",
slots={
"rna:obs": rna_obs_schema,
"protein:obs": protein_obs_schema
}
).save()
# Curate
curator = ln.curators.MuDataCurator(mdata, mudata_schema)
curator.validate()
SpatialData Curation
For spatial transcriptomics data:
# Define spatial schema
spatial_schema = ln.Schema(
name="spatial_schema",
otype="SpatialData",
slots={
"tables:cell_metadata.obs": cell_schema,
"attrs:bio": bio_metadata_schema
}
).save()
# Curate
curator = ln.curators.SpatialDataCurator(sdata, spatial_schema)
curator.validate()
TileDB-SOMA Curation
For scalable array-backed data:
# Define SOMA schema
soma_schema = ln.Schema(
name="soma_schema",
otype="tiledbsoma",
slots={
"obs": obs_schema,
"ms:RNA.T": var_schema # measurement:modality.T
}
).save()
# Curate
curator = ln.curators.TileDBSOMACurator(soma_exp, soma_schema)
curator.validate()
Feature Validation
Data Type Validation
# Define typed features
ln.Feature(name="age", dtype=int).save()
ln.Feature(name="weight", dtype=float).save()
ln.Feature(name="is_treated", dtype=bool).save()
ln.Feature(name="collection_date", dtype="date").save()
# Coerce types during validation
ln.Feature(name="age_str", dtype=int, coerce_dtype=True).save() # Auto-convert strings to int
Value Validation
# Validate against allowed values
cell_type_feature = ln.Feature(name="cell_type", dtype=str).save()
# Link to registry for controlled vocabulary
cell_type_feature.link_to_registry(bt.CellType)
# Now validation checks against CellType registry
curator = ln.curators.DataFrameCurator(df, schema)
curator.validate() # Errors if cell_type values not in registry
Standardization Strategies
Using Public Ontologies
# Look up standardized terms from public sources
curator.cat.lookup(public=True).cell_type
# Returns auto-complete object with public ontology terms
# User can select correct term interactively
Synonym Mapping
# Add synonyms to records
t_cell = bt.CellType.get(name="T cell")
t_cell.add_synonym("T lymphocyte")
t_cell.add_synonym("T-cell")
# Now standardization maps synonyms automatically
curator.cat.standardize("cell_type")
# "T lymphocyte" → "T cell"
# "T-cell" → "T cell"
Custom Standardization
# Manual mapping
mapping = {
"TCell": "T cell",
"t cell": "T cell",
"T-cells": "T cell"
}
# Apply mapping
df["cell_type"] = df["cell_type"].map(lambda x: mapping.get(x, x))
Handling Validation Errors
Common Issues and Solutions
Issue: Column not in schema
# Solution 1: Rename column
df = df.rename(columns={"old_name": "feature_name"})
# Solution 2: Add feature to schema
new_feature = ln.Feature(name="new_column", dtype=str).save()
schema.features.add(new_feature)
Issue: Invalid values
# Solution 1: Standardize
curator.cat.standardize("column_name")
# Solution 2: Add new valid values
curator.cat.add_new_from("column_name")
# Solution 3: Map to ontology
curator.cat.add_ontology("column_name", bt.Registry)
Issue: Data type mismatch
# Solution 1: Convert data type
df["column"] = df["column"].astype(int)
# Solution 2: Enable coercion in feature
feature = ln.Feature.get(name="column")
feature.coerce_dtype = True
feature.save()
Schema Versioning
Schemas can be versioned like other records:
# Create initial schema
schema_v1 = ln.Schema(name="experiment_schema", features=[...]).save()
# Update schema with new features
schema_v2 = ln.Schema(
name="experiment_schema",
features=[...], # Updated list
version="2"
).save()
# Link artifacts to specific schema versions
artifact.schema = schema_v2
artifact.save()
Querying Validated Data
Once data is validated and annotated, it becomes queryable:
# Find all validated artifacts
ln.Artifact.filter(is_valid=True).to_dataframe()
# Find artifacts with specific schema
ln.Artifact.filter(schema=schema).to_dataframe()
# Query by annotated features
ln.Artifact.filter(cell_type="T cell", tissue="blood").to_dataframe()
# Include features in results
ln.Artifact.filter(is_valid=True).to_dataframe(include="features")
Best Practices
- Define features first: Create Feature registry before curation
- Use public ontologies: Leverage bt.lookup(public=True) for standardization
- Start flexible: Use flexible schemas initially, tighten as understanding grows
- Document slots: Clearly specify transposition (.T) in composite schemas
- Standardize early: Fix typos and synonyms before validation
- Validate incrementally: Check each slot separately for composite structures
- Version schemas: Track schema changes over time
- Add synonyms: Register common variations to simplify future curation
- Coerce types cautiously: Enable dtype coercion only when safe
- Test on samples: Validate small subsets before full dataset curation
Advanced: Custom Validators
Create custom validation logic:
def validate_gene_expression(df):
"""Custom validator for gene expression values."""
# Check non-negative
if (df < 0).any().any():
return False, "Negative expression values found"
# Check reasonable range
if (df > 1e6).any().any():
return False, "Unreasonably high expression values"
return True, "Valid"
# Apply during curation
is_valid, message = validate_gene_expression(df)
if not is_valid:
print(f"Validation failed: {message}")
Tracking Curation Provenance
# Curated artifacts track curation lineage
ln.track() # Start tracking
# Perform curation
curator = ln.curators.DataFrameCurator(df, schema)
curator.validate()
curator.cat.standardize("cell_type")
artifact = curator.save_artifact(key="curated.parquet")
ln.finish() # Complete tracking
# View curation lineage
artifact.run.describe() # Shows curation transform
artifact.view_lineage() # Visualizes curation process