Initial commit
skills/lamindb/references/annotation-validation.md
# LaminDB Annotation & Validation

This document covers data curation, validation, schema management, and annotation best practices in LaminDB.

## Overview

LaminDB's curation process ensures datasets are both validated and queryable through three essential steps:

1. **Validation**: Confirming datasets match desired schemas
2. **Standardization**: Fixing inconsistencies like typos and mapping synonyms
3. **Annotation**: Linking datasets to metadata entities for queryability

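To make the three steps concrete, here is a minimal plain-Python sketch. It is an illustration only, not how LaminDB implements curation: the vocabulary, the synonym table, and the helper names `validate`, `standardize`, and `annotate` are all hypothetical.

```python
# Hypothetical controlled vocabulary and synonym table (illustration only).
VALID_CELL_TYPES = {"T cell", "B cell"}
SYNONYMS = {"T lymphocyte": "T cell", "B lymphocyte": "B cell"}


def validate(values):
    """Return the distinct values that are not in the controlled vocabulary."""
    return sorted({v for v in values if v not in VALID_CELL_TYPES})


def standardize(values):
    """Replace known synonyms with their canonical name."""
    return [SYNONYMS.get(v, v) for v in values]


def annotate(values):
    """Collect the distinct validated terms a dataset can be queried by."""
    return sorted(set(values) & VALID_CELL_TYPES)


raw = ["T cell", "T lymphocyte", "B cell"]
print(validate(raw))    # ['T lymphocyte']
clean = standardize(raw)
print(validate(clean))  # []
print(annotate(clean))  # ['B cell', 'T cell']
```

The order matters in this sketch: standardizing first turns a validation failure into a pass, and only validated terms end up as annotations.
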
## Schema Design

Schemas define expected data structure, types, and validation rules. LaminDB supports three main schema approaches:

### 1. Flexible Schema

Validates only columns matching Feature registry names, allowing additional metadata:

```python
import lamindb as ln

# Create flexible schema
schema = ln.Schema(
    name="valid_features",
    itype=ln.Feature,  # Validates against Feature registry
).save()

# Any column matching a Feature name will be validated
# Additional columns are permitted but not validated
```

### 2. Minimal Required Schema

Specifies essential columns while permitting extra metadata:

```python
# Define required features
required_features = [
    ln.Feature.get(name="cell_type"),
    ln.Feature.get(name="tissue"),
    ln.Feature.get(name="donor_id"),
]

# Create schema with required features
schema = ln.Schema(
    name="minimal_immune_schema",
    features=required_features,
    flexible=True,  # Allows additional columns
).save()
```

### 3. Strict Schema

Enforces complete control over data structure:

```python
# Define all allowed features
all_features = [
    ln.Feature.get(name="cell_type"),
    ln.Feature.get(name="tissue"),
    ln.Feature.get(name="donor_id"),
    ln.Feature.get(name="disease"),
]

# Create strict schema
schema = ln.Schema(
    name="strict_immune_schema",
    features=all_features,
    flexible=False,  # No additional columns allowed
).save()
```

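The difference between a flexible and a strict schema comes down to how extra columns are treated. The sketch below shows that rule in plain Python. The helper `check_columns` is hypothetical, and treating every listed feature as required is an assumption made for the illustration, not a statement about LaminDB's internals.

```python
def check_columns(columns, features, flexible):
    """Return (missing, unexpected) column names for a schema.

    missing:    features the schema lists that the data lacks
    unexpected: extra columns, reported only when flexible is False
    """
    columns, features = set(columns), set(features)
    missing = sorted(features - columns)
    unexpected = [] if flexible else sorted(columns - features)
    return missing, unexpected


columns = ["cell_type", "tissue", "donor_id", "batch"]
features = ["cell_type", "tissue", "donor_id"]

print(check_columns(columns, features, flexible=True))   # ([], [])
print(check_columns(columns, features, flexible=False))  # ([], ['batch'])
```

The same dataset passes under the flexible rule and fails under the strict one, purely because of the extra `batch` column.
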
## DataFrame Curation Workflow

The typical curation process involves six key steps:

### Step 1-2: Load Data and Establish Registries

```python
import bionty as bt
import lamindb as ln
import pandas as pd

# Load data
df = pd.read_csv("experiment.csv")

# Define and save features
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="tissue", dtype=str).save()
ln.Feature(name="gene_count", dtype=int).save()
ln.Feature(name="experiment_date", dtype="date").save()

# Populate valid values (if using controlled vocabulary)
bt.CellType.import_source()
bt.Tissue.import_source()
```

### Step 3: Create Schema

```python
# Link features to schema
features = [
    ln.Feature.get(name="cell_type"),
    ln.Feature.get(name="tissue"),
    ln.Feature.get(name="gene_count"),
    ln.Feature.get(name="experiment_date"),
]

schema = ln.Schema(
    name="experiment_schema",
    features=features,
    flexible=True,
).save()
```

### Step 4: Initialize Curator and Validate

```python
# Initialize curator
curator = ln.curators.DataFrameCurator(df, schema)

# Validate dataset
validation = curator.validate()

# Check validation results
if validation:
    print("✓ Validation passed")
else:
    print("✗ Validation failed")
    # Print the problematic fields (a bare expression only displays in a notebook)
    print(curator.non_validated)
```

### Step 5: Fix Validation Issues

#### Standardize Values

```python
# Fix typos and synonyms in categorical columns
curator.cat.standardize("cell_type")
curator.cat.standardize("tissue")

# View standardization mapping
curator.cat.inspect_standardize("cell_type")
```

#### Map to Ontologies

```python
# Map values to ontology terms
curator.cat.add_ontology("cell_type", bt.CellType)
curator.cat.add_ontology("tissue", bt.Tissue)

# Look up public ontologies for unmapped terms
curator.cat.lookup(public=True).cell_type  # Interactive lookup
```

#### Add New Terms

```python
# Add new valid terms to registry
curator.cat.add_new_from("cell_type")

# Or manually create records
new_cell_type = bt.CellType(name="my_novel_cell_type").save()
```

#### Rename Columns

```python
# Rename columns to match feature names
df = df.rename(columns={"celltype": "cell_type"})

# Re-initialize curator with fixed DataFrame
curator = ln.curators.DataFrameCurator(df, schema)
```

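Column names that almost match a feature name, such as `celltype` for `cell_type`, can be spotted before writing the rename mapping by hand. The following sketch uses only the standard library's `difflib`. The helper `suggest_renames` and the 0.8 cutoff are hypothetical choices for illustration and are not part of LaminDB.

```python
import difflib


def suggest_renames(columns, feature_names, cutoff=0.8):
    """Map each unmatched column to its closest feature name, if any."""
    suggestions = {}
    for column in columns:
        if column in feature_names:
            continue  # already matches a feature
        matches = difflib.get_close_matches(column, feature_names, n=1, cutoff=cutoff)
        if matches:
            suggestions[column] = matches[0]
    return suggestions


columns = ["celltype", "tissue", "notes"]
feature_names = ["cell_type", "tissue", "donor_id"]
print(suggest_renames(columns, feature_names))  # {'celltype': 'cell_type'}
```

The resulting dictionary has the shape `df.rename(columns=...)` expects, but the suggestions should be reviewed before they are applied, since a close string match is not always the intended feature.
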
### Step 6: Save Curated Artifact

```python
# Save with schema linkage
artifact = curator.save_artifact(
    key="experiments/curated_data.parquet",
    description="Validated and annotated experimental data",
)

# Verify artifact has schema
artifact.schema  # Returns the schema object
artifact.describe()  # Shows validation status
```

## AnnData Curation

For composite structures like AnnData, use "slots" to validate different components:

### Defining AnnData Schemas

```python
# Create schemas for different slots
obs_schema = ln.Schema(
    name="cell_metadata",
    features=[
        ln.Feature.get(name="cell_type"),
        ln.Feature.get(name="tissue"),
        ln.Feature.get(name="donor_id"),
    ],
).save()

var_schema = ln.Schema(
    name="gene_ids",
    features=[ln.Feature.get(name="ensembl_gene_id")],
).save()

# Create composite AnnData schema
anndata_schema = ln.Schema(
    name="scrna_schema",
    otype="AnnData",
    slots={
        "obs": obs_schema,
        "var.T": var_schema,  # .T indicates transposition
    },
).save()
```

### Curating AnnData Objects

```python
import anndata as ad

# Load AnnData
adata = ad.read_h5ad("data.h5ad")

# Initialize curator
curator = ln.curators.AnnDataCurator(adata, anndata_schema)

# Validate all slots
validation = curator.validate()

# Fix issues by slot
curator.cat.standardize("obs", "cell_type")
curator.cat.add_ontology("obs", "cell_type", bt.CellType)
curator.cat.standardize("var.T", "ensembl_gene_id")

# Save curated artifact
artifact = curator.save_artifact(
    key="scrna/validated_data.h5ad",
    description="Curated single-cell RNA-seq data",
)
```

## MuData Curation

MuData supports multi-modal data through modality-specific slots:

```python
# Define schemas for each modality
rna_obs_schema = ln.Schema(name="rna_obs_schema", features=[...]).save()
protein_obs_schema = ln.Schema(name="protein_obs_schema", features=[...]).save()

# Create MuData schema
mudata_schema = ln.Schema(
    name="multimodal_schema",
    otype="MuData",
    slots={
        "rna:obs": rna_obs_schema,
        "protein:obs": protein_obs_schema,
    },
).save()

# Curate (mdata is a MuData object loaded beforehand)
curator = ln.curators.MuDataCurator(mdata, mudata_schema)
curator.validate()
```

## SpatialData Curation

For spatial transcriptomics data:

```python
# Define spatial schema
# (cell_schema and bio_metadata_schema are ln.Schema objects saved beforehand)
spatial_schema = ln.Schema(
    name="spatial_schema",
    otype="SpatialData",
    slots={
        "tables:cell_metadata.obs": cell_schema,
        "attrs:bio": bio_metadata_schema,
    },
).save()

# Curate (sdata is a SpatialData object loaded beforehand)
curator = ln.curators.SpatialDataCurator(sdata, spatial_schema)
curator.validate()
```

## TileDB-SOMA Curation

For scalable array-backed data:

```python
# Define SOMA schema (reuses obs_schema and var_schema from the AnnData section)
soma_schema = ln.Schema(
    name="soma_schema",
    otype="tiledbsoma",
    slots={
        "obs": obs_schema,
        "ms:RNA.T": var_schema,  # measurement:modality.T
    },
).save()

# Curate (soma_exp is a SOMA experiment opened beforehand)
curator = ln.curators.TileDBSOMACurator(soma_exp, soma_schema)
curator.validate()
```

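The slot keys used in the composite schemas above follow a small convention: `:` separates path components and a trailing `.T` marks transposition. The sketch below reads keys in that way. It reflects only how the keys appear in this document; the helper `parse_slot` is hypothetical and LaminDB's own parsing may differ.

```python
def parse_slot(key):
    """Split a slot key into (path, transposed).

    A trailing ".T" marks transposition; ":" separates path components.
    """
    transposed = key.endswith(".T")
    if transposed:
        key = key[: -len(".T")]
    return key.split(":"), transposed


for key in ["obs", "var.T", "rna:obs", "ms:RNA.T", "attrs:bio"]:
    print(key, "->", parse_slot(key))
# obs -> (['obs'], False)
# var.T -> (['var'], True)
# rna:obs -> (['rna', 'obs'], False)
# ms:RNA.T -> (['ms', 'RNA'], True)
# attrs:bio -> (['attrs', 'bio'], False)
```
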
## Feature Validation

### Data Type Validation

```python
# Define typed features
ln.Feature(name="age", dtype=int).save()
ln.Feature(name="weight", dtype=float).save()
ln.Feature(name="is_treated", dtype=bool).save()
ln.Feature(name="collection_date", dtype="date").save()

# Coerce types during validation
ln.Feature(name="age_str", dtype=int, coerce_dtype=True).save()  # Auto-convert strings to int
```

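Coercion means attempting the conversion and failing loudly when a value cannot be converted. The plain-Python sketch below shows that behaviour. The helper `coerce_column` is hypothetical and says nothing about how `coerce_dtype=True` is implemented in LaminDB.

```python
def coerce_column(values, dtype):
    """Convert every value to dtype, or raise listing what could not be converted."""
    converted, failures = [], []
    for value in values:
        try:
            converted.append(dtype(value))
        except (TypeError, ValueError):
            failures.append(value)
    if failures:
        raise ValueError(f"cannot coerce to {dtype.__name__}: {failures}")
    return converted


print(coerce_column(["34", "51", "27"], int))  # [34, 51, 27]

try:
    coerce_column(["34", "unknown"], int)
except ValueError as error:
    print(error)  # cannot coerce to int: ['unknown']
```

Reporting every failing value at once, rather than stopping at the first, makes it easier to decide whether the column needs cleaning or a different dtype.
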
### Value Validation

```python
# Validate against allowed values
cell_type_feature = ln.Feature(name="cell_type", dtype=str).save()

# Link to registry for controlled vocabulary
cell_type_feature.link_to_registry(bt.CellType)

# Now validation checks against CellType registry
curator = ln.curators.DataFrameCurator(df, schema)
curator.validate()  # Errors if cell_type values not in registry
```

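Conceptually, value validation is a set-membership check of a column against the registry's names. The sketch below shows the idea on plain dictionaries. The helper `non_validated` and the allowed set are hypothetical stand-ins for the curator and the CellType registry.

```python
def non_validated(rows, column, allowed):
    """Return the distinct values in a column that are not in the allowed set."""
    return sorted({row[column] for row in rows} - set(allowed))


rows = [
    {"cell_type": "T cell", "tissue": "blood"},
    {"cell_type": "Tcell", "tissue": "blood"},
    {"cell_type": "B cell", "tissue": "lung"},
]
allowed_cell_types = {"T cell", "B cell"}  # stand-in for registry names

print(non_validated(rows, "cell_type", allowed_cell_types))  # ['Tcell']
```
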
## Standardization Strategies

### Using Public Ontologies

```python
# Look up standardized terms from public sources
curator.cat.lookup(public=True).cell_type

# Returns auto-complete object with public ontology terms
# User can select correct term interactively
```

### Synonym Mapping

```python
# Add synonyms to records
t_cell = bt.CellType.get(name="T cell")
t_cell.add_synonym("T lymphocyte")
t_cell.add_synonym("T-cell")

# Now standardization maps synonyms automatically
curator.cat.standardize("cell_type")
# "T lymphocyte" → "T cell"
# "T-cell" → "T cell"
```

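Synonym-based standardization can be pictured as a lookup table from every known alias to its canonical name. The following sketch builds such a table in plain Python. The helpers are hypothetical, and the case-insensitive matching is an assumption of this illustration rather than documented LaminDB behaviour.

```python
def build_synonym_index(records):
    """Map every name and synonym (case-insensitively) to its canonical name."""
    index = {}
    for name, synonyms in records.items():
        for alias in [name, *synonyms]:
            index[alias.lower()] = name
    return index


def standardize(values, index):
    """Replace known aliases; leave unknown values unchanged."""
    return [index.get(value.lower(), value) for value in values]


records = {"T cell": ["T lymphocyte", "T-cell"]}
index = build_synonym_index(records)
print(standardize(["T lymphocyte", "t-cell", "NK cell"], index))
# ['T cell', 'T cell', 'NK cell']
```

Unknown values such as `NK cell` pass through unchanged, so they still surface as non-validated in a later validation step instead of being silently altered.
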
### Custom Standardization

```python
# Manual mapping
mapping = {
    "TCell": "T cell",
    "t cell": "T cell",
    "T-cells": "T cell",
}

# Apply mapping; values without an entry are left unchanged
df["cell_type"] = df["cell_type"].map(lambda x: mapping.get(x, x))
```

## Handling Validation Errors

### Common Issues and Solutions

**Issue: Column not in schema**

```python
# Solution 1: Rename column
df = df.rename(columns={"old_name": "feature_name"})

# Solution 2: Add feature to schema
new_feature = ln.Feature(name="new_column", dtype=str).save()
schema.features.add(new_feature)
```

**Issue: Invalid values**

```python
# Solution 1: Standardize
curator.cat.standardize("column_name")

# Solution 2: Add new valid values
curator.cat.add_new_from("column_name")

# Solution 3: Map to ontology
# (bt.Registry is a placeholder for the relevant registry, e.g. bt.CellType)
curator.cat.add_ontology("column_name", bt.Registry)
```

**Issue: Data type mismatch**

```python
# Solution 1: Convert data type
df["column"] = df["column"].astype(int)

# Solution 2: Enable coercion in feature
feature = ln.Feature.get(name="column")
feature.coerce_dtype = True
feature.save()
```

## Schema Versioning

Schemas can be versioned like other records:

```python
# Create initial schema
schema_v1 = ln.Schema(name="experiment_schema", features=[...]).save()

# Update schema with new features
schema_v2 = ln.Schema(
    name="experiment_schema",
    features=[...],  # Updated list
    version="2",
).save()

# Link artifacts to specific schema versions
artifact.schema = schema_v2
artifact.save()
```

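When a schema gains or loses features between versions, it helps to record exactly what changed. The sketch below compares two lists of feature names in plain Python. The helper `diff_features` and the example lists are hypothetical and not part of LaminDB's versioning API.

```python
def diff_features(old, new):
    """Return the feature names added and removed between two versions."""
    old, new = set(old), set(new)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


v1 = ["cell_type", "tissue", "gene_count"]
v2 = ["cell_type", "tissue", "gene_count", "experiment_date"]
print(diff_features(v1, v2))  # {'added': ['experiment_date'], 'removed': []}
```

A non-empty `removed` list is the case to watch: artifacts validated against the older version may carry columns the newer version no longer covers.
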
## Querying Validated Data

Once data is validated and annotated, it becomes queryable:

```python
# Find all validated artifacts
ln.Artifact.filter(is_valid=True).to_dataframe()

# Find artifacts with specific schema
ln.Artifact.filter(schema=schema).to_dataframe()

# Query by annotated features
ln.Artifact.filter(cell_type="T cell", tissue="blood").to_dataframe()

# Include features in results
ln.Artifact.filter(is_valid=True).to_dataframe(include="features")
```

## Best Practices

1. **Define features first**: Create records in the Feature registry before curation
2. **Use public ontologies**: Use `curator.cat.lookup(public=True)` to find standard terms
3. **Start flexible**: Use flexible schemas initially, tighten as understanding grows
4. **Document slots**: Clearly specify transposition (`.T`) in composite schemas
5. **Standardize early**: Fix typos and synonyms before validation
6. **Validate incrementally**: Check each slot separately for composite structures
7. **Version schemas**: Track schema changes over time
8. **Add synonyms**: Register common variations to simplify future curation
9. **Coerce types cautiously**: Enable dtype coercion only when safe
10. **Test on samples**: Validate small subsets before full dataset curation

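As an illustration of the last practice, the sketch below draws a reproducible sample and checks only that subset against a vocabulary. It uses plain Python: the helper `sample_rows`, the seed, and the allowed set are hypothetical, and in a real workflow the subset would be passed to a curator instead.

```python
import random


def sample_rows(rows, n, seed=0):
    """Return up to n rows chosen reproducibly with a fixed seed."""
    rng = random.Random(seed)
    return rng.sample(rows, min(n, len(rows)))


# Hypothetical dataset: 100 rows, a quarter of which contain the typo "Tcell".
rows = [{"cell_type": ct} for ct in ["T cell", "B cell", "Tcell", "NK cell"] * 25]
allowed = {"T cell", "B cell", "NK cell"}

subset = sample_rows(rows, 10)
problems = sorted({row["cell_type"] for row in subset} - allowed)
print(len(subset), problems)
```

A clean sample does not prove the full dataset is clean; it is a cheap first pass that catches systematic problems before a long curation run.
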
## Advanced: Custom Validators

Create custom validation logic:

```python
def validate_gene_expression(df):
    """Custom validator for gene expression values."""
    # Check non-negative
    if (df < 0).any().any():
        return False, "Negative expression values found"

    # Check reasonable range
    if (df > 1e6).any().any():
        return False, "Unreasonably high expression values"

    return True, "Valid"


# Apply during curation
is_valid, message = validate_gene_expression(df)
if not is_valid:
    print(f"Validation failed: {message}")
```

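Several validators that share the `(is_valid, message)` return convention can be run together so that all failures are reported at once. The following sketch uses plain lists instead of a DataFrame to stay self-contained. The runner `run_validators` and the two example validators are hypothetical.

```python
def run_validators(data, validators):
    """Run every validator and collect the failure messages."""
    failures = []
    for validator in validators:
        is_valid, message = validator(data)
        if not is_valid:
            failures.append(f"{validator.__name__}: {message}")
    return failures


def non_negative(values):
    """Fail if any value is below zero."""
    if min(values) < 0:
        return False, "Negative expression values found"
    return True, "Valid"


def below_ceiling(values):
    """Fail if any value exceeds the 1e6 ceiling."""
    if max(values) > 1e6:
        return False, "Unreasonably high expression values"
    return True, "Valid"


print(run_validators([0.0, 12.5, 3e7], [non_negative, below_ceiling]))
# ['below_ceiling: Unreasonably high expression values']
```

An empty list means every check passed, so the result can be used directly in a condition.
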
## Tracking Curation Provenance

```python
# Curated artifacts track curation lineage
ln.track()  # Start tracking

# Perform curation
curator = ln.curators.DataFrameCurator(df, schema)
curator.validate()
curator.cat.standardize("cell_type")
artifact = curator.save_artifact(key="curated.parquet")

ln.finish()  # Complete tracking

# View curation lineage
artifact.run.describe()  # Shows curation transform
artifact.view_lineage()  # Visualizes curation process
```