Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/lamindb/references/annotation-validation.md
+++ b/skills/lamindb/references/annotation-validation.md
@@ -0,0 +1,513 @@
+# LaminDB Annotation & Validation
+
+This document covers data curation, validation, schema management, and annotation best practices in LaminDB.
+
+## Overview
+
+LaminDB's curation process ensures datasets are both validated and queryable through three essential steps:
+
+1. **Validation**: Confirming datasets match desired schemas
+2. **Standardization**: Fixing inconsistencies like typos and mapping synonyms
+3. **Annotation**: Linking datasets to metadata entities for queryability
+
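The three steps above can be sketched in plain Python. This is only a conceptual illustration: `VALID_CELL_TYPES`, `SYNONYMS`, and the three helper functions are hypothetical names, not LaminDB API, and a real registry holds far more than a set of strings.

```python
# Conceptual sketch only: none of these names are part of LaminDB.
VALID_CELL_TYPES = {"T cell", "B cell"}  # hypothetical stand-in for a registry
SYNONYMS = {"T lymphocyte": "T cell", "T-cell": "T cell"}  # hypothetical synonym table


def validate(values, valid):
    """Step 1: return the distinct values that are not in the registry."""
    return sorted(set(values) - valid)


def standardize(values, synonyms):
    """Step 2: replace known synonyms with their canonical name."""
    return [synonyms.get(value, value) for value in values]


def annotate(values, valid):
    """Step 3: collect the registry terms the dataset can be queried by."""
    return sorted(set(values) & valid)


raw = ["T-cell", "B cell", "T lymphocyte", "Bcell"]
print(validate(raw, VALID_CELL_TYPES))    # terms failing before standardization
clean = standardize(raw, SYNONYMS)
print(validate(clean, VALID_CELL_TYPES))  # terms that still need manual attention
print(annotate(clean, VALID_CELL_TYPES))  # labels usable for queries
```

Standardization resolves the synonyms, while the remaining typo (`"Bcell"`) still has to be fixed by hand or registered as a new term.
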
+## Schema Design
+
+Schemas define expected data structure, types, and validation rules. LaminDB supports three main schema approaches:
+
+### 1. Flexible Schema
+
+Validates only columns matching Feature registry names, allowing additional metadata:
+
+```python
+import lamindb as ln
+
+# Create flexible schema
+schema = ln.Schema(
+    name="valid_features",
+    itype=ln.Feature  # Validates against Feature registry
+).save()
+
+# Any column matching a Feature name will be validated
+# Additional columns are permitted but not validated
+```
+
+### 2. Minimal Required Schema
+
+Specifies essential columns while permitting extra metadata:
+
+```python
+# Define required features
+required_features = [
+    ln.Feature.get(name="cell_type"),
+    ln.Feature.get(name="tissue"),
+    ln.Feature.get(name="donor_id")
+]
+
+# Create schema with required features
+schema = ln.Schema(
+    name="minimal_immune_schema",
+    features=required_features,
+    flexible=True  # Allows additional columns
+).save()
+```
+
+### 3. Strict Schema
+
+Enforces complete control over data structure:
+
+```python
+# Define all allowed features
+all_features = [
+    ln.Feature.get(name="cell_type"),
+    ln.Feature.get(name="tissue"),
+    ln.Feature.get(name="donor_id"),
+    ln.Feature.get(name="disease")
+]
+
+# Create strict schema
+schema = ln.Schema(
+    name="strict_immune_schema",
+    features=all_features,
+    flexible=False  # No additional columns allowed
+).save()
+```
+
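To make the difference between the three approaches concrete, here is a simplified plain-Python sketch of the column checks they imply. `check_columns` is a hypothetical helper, not LaminDB API. It assumes every listed feature is required, and it ignores dtypes and values.

```python
# Conceptual sketch only: `check_columns` is hypothetical, not part of LaminDB.
def check_columns(columns, required=(), allow_extra=True):
    """Return a list of human-readable problems for a set of column names."""
    columns, required = set(columns), set(required)
    problems = [
        f"missing required column: {name}" for name in sorted(required - columns)
    ]
    if not allow_extra:
        problems += [
            f"unexpected column: {name}" for name in sorted(columns - required)
        ]
    return problems


columns = ["cell_type", "tissue", "donor_id", "batch"]

# Flexible: nothing is required, extra columns are fine
flexible = check_columns(columns)

# Minimal required: three columns must exist, extras are fine
minimal = check_columns(columns, required=["cell_type", "tissue", "donor_id"])

# Strict: all listed columns must exist and nothing else is allowed
strict = check_columns(
    columns,
    required=["cell_type", "tissue", "donor_id", "disease"],
    allow_extra=False,
)
print(flexible, minimal, strict, sep="\n")
```

Only the strict check reports problems here: the table lacks `disease` and carries an unlisted `batch` column.
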
+## DataFrame Curation Workflow
+
+The typical curation process involves six key steps:
+
+### Step 1-2: Load Data and Establish Registries
+
+```python
+import pandas as pd
+import lamindb as ln
+
+# Load data
+df = pd.read_csv("experiment.csv")
+
+# Define and save features
+ln.Feature(name="cell_type", dtype=str).save()
+ln.Feature(name="tissue", dtype=str).save()
+ln.Feature(name="gene_count", dtype=int).save()
+ln.Feature(name="experiment_date", dtype="date").save()
+
+# Populate valid values (if using controlled vocabulary)
+import bionty as bt
+bt.CellType.import_source()
+bt.Tissue.import_source()
+```
+
+### Step 3: Create Schema
+
+```python
+# Link features to schema
+features = [
+    ln.Feature.get(name="cell_type"),
+    ln.Feature.get(name="tissue"),
+    ln.Feature.get(name="gene_count"),
+    ln.Feature.get(name="experiment_date")
+]
+
+schema = ln.Schema(
+    name="experiment_schema",
+    features=features,
+    flexible=True
+).save()
+```
+
+### Step 4: Initialize Curator and Validate
+
+```python
+# Initialize curator
+curator = ln.curators.DataFrameCurator(df, schema)
+
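+# Validate dataset. Assumption: in recent LaminDB releases, validate()
+# raises ln.errors.ValidationError on failure rather than returning a boolean
+try:
+    curator.validate()
+    print("✓ Validation passed")
+except ln.errors.ValidationError as error:
+    print(f"✗ Validation failed: {error}")
+    print(curator.cat.non_validated)  # See problematic fields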
+
+### Step 5: Fix Validation Issues
+
+#### Standardize Values
+
+```python
+# Fix typos and synonyms in categorical columns
+curator.cat.standardize("cell_type")
+curator.cat.standardize("tissue")
+
+# View standardization mapping
+curator.cat.inspect_standardize("cell_type")
+```
+
+#### Map to Ontologies
+
+```python
+# Map values to ontology terms
+curator.cat.add_ontology("cell_type", bt.CellType)
+curator.cat.add_ontology("tissue", bt.Tissue)
+
+# Look up public ontologies for unmapped terms
+curator.cat.lookup(public=True).cell_type  # Interactive lookup
+```
+
+#### Add New Terms
+
+```python
+# Add new valid terms to registry
+curator.cat.add_new_from("cell_type")
+
+# Or manually create records
+new_cell_type = bt.CellType(name="my_novel_cell_type").save()
+```
+
+#### Rename Columns
+
+```python
+# Rename columns to match feature names
+df = df.rename(columns={"celltype": "cell_type"})
+
+# Re-initialize curator with fixed DataFrame
+curator = ln.curators.DataFrameCurator(df, schema)
+```
+
+### Step 6: Save Curated Artifact
+
+```python
+# Save with schema linkage
+artifact = curator.save_artifact(
+    key="experiments/curated_data.parquet",
+    description="Validated and annotated experimental data"
+)
+
+# Verify artifact has schema
+artifact.schema  # Returns the schema object
+artifact.describe()  # Shows validation status
+```
+
+## AnnData Curation
+
+For composite structures like AnnData, use "slots" to validate different components:
+
+### Defining AnnData Schemas
+
+```python
+# Create schemas for different slots
+obs_schema = ln.Schema(
+    name="cell_metadata",
+    features=[
+        ln.Feature.get(name="cell_type"),
+        ln.Feature.get(name="tissue"),
+        ln.Feature.get(name="donor_id")
+    ]
+).save()
+
+var_schema = ln.Schema(
+    name="gene_ids",
+    features=[ln.Feature.get(name="ensembl_gene_id")]
+).save()
+
+# Create composite AnnData schema
+anndata_schema = ln.Schema(
+    name="scrna_schema",
+    otype="AnnData",
+    slots={
+        "obs": obs_schema,
+        "var.T": var_schema  # .T indicates transposition
+    }
+).save()
+```
+
+### Curating AnnData Objects
+
+```python
+import anndata as ad
+
+# Load AnnData
+adata = ad.read_h5ad("data.h5ad")
+
+# Initialize curator
+curator = ln.curators.AnnDataCurator(adata, anndata_schema)
+
+# Validate all slots (the return value is not needed)
+curator.validate()
+
+# Fix issues by slot
+curator.cat.standardize("obs", "cell_type")
+curator.cat.add_ontology("obs", "cell_type", bt.CellType)
+curator.cat.standardize("var.T", "ensembl_gene_id")
+
+# Save curated artifact
+artifact = curator.save_artifact(
+    key="scrna/validated_data.h5ad",
+    description="Curated single-cell RNA-seq data"
+)
+```
+
+## MuData Curation
+
+MuData supports multi-modal data through modality-specific slots:
+
+```python
+# Define schemas for each modality
+rna_obs_schema = ln.Schema(name="rna_obs_schema", features=[...]).save()
+protein_obs_schema = ln.Schema(name="protein_obs_schema", features=[...]).save()
+
+# Create MuData schema
+mudata_schema = ln.Schema(
+    name="multimodal_schema",
+    otype="MuData",
+    slots={
+        "rna:obs": rna_obs_schema,
+        "protein:obs": protein_obs_schema
+    }
+).save()
+
+# Curate (assumes `mdata` is an already-loaded MuData object)
+curator = ln.curators.MuDataCurator(mdata, mudata_schema)
+curator.validate()
+```
+
+## SpatialData Curation
+
+For spatial transcriptomics data:
+
+```python
+# Define spatial schema (assumes `cell_schema` and `bio_metadata_schema` were saved earlier)
+spatial_schema = ln.Schema(
+    name="spatial_schema",
+    otype="SpatialData",
+    slots={
+        "tables:cell_metadata.obs": cell_schema,
+        "attrs:bio": bio_metadata_schema
+    }
+).save()
+
+# Curate (assumes `sdata` is an already-loaded SpatialData object)
+curator = ln.curators.SpatialDataCurator(sdata, spatial_schema)
+curator.validate()
+```
+
+## TileDB-SOMA Curation
+
+For scalable array-backed data:
+
+```python
+# Define SOMA schema
+soma_schema = ln.Schema(
+    name="soma_schema",
+    otype="tiledbsoma",
+    slots={
+        "obs": obs_schema,
+        "ms:RNA.T": var_schema  # measurement:modality.T
+    }
+).save()
+
+# Curate (assumes `soma_exp` is an open tiledbsoma Experiment)
+curator = ln.curators.TiledbsomaExperimentCurator(soma_exp, soma_schema)
+curator.validate()
+```
+
+## Feature Validation
+
+### Data Type Validation
+
+```python
+# Define typed features
+ln.Feature(name="age", dtype=int).save()
+ln.Feature(name="weight", dtype=float).save()
+ln.Feature(name="is_treated", dtype=bool).save()
+ln.Feature(name="collection_date", dtype="date").save()
+
+# Coerce types during validation
+ln.Feature(name="age_str", dtype=int, coerce_dtype=True).save()  # Auto-convert strings to int
+```
+
+### Value Validation
+
+```python
+# Validate against allowed values by typing the feature with a registry.
+# Using a registry as dtype restricts the column to records in that registry
+# (controlled vocabulary)
+cell_type_feature = ln.Feature(name="cell_type", dtype=bt.CellType).save()
+
+# Now validation checks against CellType registry
+curator = ln.curators.DataFrameCurator(df, schema)
+curator.validate()  # Errors if cell_type values not in registry
+```
+
+## Standardization Strategies
+
+### Using Public Ontologies
+
+```python
+# Look up standardized terms from public sources
+curator.cat.lookup(public=True).cell_type
+
+# Returns auto-complete object with public ontology terms
+# User can select correct term interactively
+```
+
+### Synonym Mapping
+
+```python
+# Add synonyms to records
+t_cell = bt.CellType.get(name="T cell")
+t_cell.add_synonym("T lymphocyte")
+t_cell.add_synonym("T-cell")
+
+# Now standardization maps synonyms automatically
+curator.cat.standardize("cell_type")
+# "T lymphocyte" → "T cell"
+# "T-cell" → "T cell"
+```
+
+### Custom Standardization
+
+```python
+# Manual mapping
+mapping = {
+    "TCell": "T cell",
+    "t cell": "T cell",
+    "T-cells": "T cell"
+}
+
+# Apply mapping
+df["cell_type"] = df["cell_type"].map(lambda x: mapping.get(x, x))
+```
+
+## Handling Validation Errors
+
+### Common Issues and Solutions
+
+**Issue: Column not in schema**
+```python
+# Solution 1: Rename column
+df = df.rename(columns={"old_name": "feature_name"})
+
+# Solution 2: Add feature to schema
+new_feature = ln.Feature(name="new_column", dtype=str).save()
+schema.features.add(new_feature)
+```
+
+**Issue: Invalid values**
+```python
+# Solution 1: Standardize
+curator.cat.standardize("column_name")
+
+# Solution 2: Add new valid values
+curator.cat.add_new_from("column_name")
+
+# Solution 3: Map to ontology
+curator.cat.add_ontology("column_name", bt.Registry)
+```
+
+**Issue: Data type mismatch**
+```python
+# Solution 1: Convert data type
+df["column"] = df["column"].astype(int)
+
+# Solution 2: Enable coercion in feature
+feature = ln.Feature.get(name="column")
+feature.coerce_dtype = True
+feature.save()
+```
+
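Before enabling coercion, it can help to see which values would fail to convert. The following is a minimal stdlib sketch, independent of LaminDB. `find_uncoercible` is a hypothetical helper and the sample values are made up for illustration.

```python
# Conceptual sketch only: `find_uncoercible` is hypothetical, not part of LaminDB.
def find_uncoercible(values, target=int):
    """Return (position, value) pairs that cannot be converted to `target`."""
    failures = []
    for position, value in enumerate(values):
        try:
            target(value)
        except (TypeError, ValueError):
            failures.append((position, value))
    return failures


ages = ["34", "41", "n/a", "29.5", None, " 52 "]
failures = find_uncoercible(ages)
print(failures)  # entries to clean up before turning coercion on
```

Note that `int()` tolerates surrounding whitespace but rejects decimal strings, placeholders such as `"n/a"`, and missing values, so those entries need an explicit decision rather than silent coercion.
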
+## Schema Versioning
+
+Schemas can be versioned like other records:
+
+```python
+# Create initial schema
+schema_v1 = ln.Schema(name="experiment_schema", features=[...]).save()
+
+# Update schema with new features
+schema_v2 = ln.Schema(
+    name="experiment_schema",
+    features=[...],  # Updated list
+    version="2"
+).save()
+
+# Link artifacts to specific schema versions
+artifact.schema = schema_v2
+artifact.save()
+```
+
+## Querying Validated Data
+
+Once data is validated and annotated, it becomes queryable:
+
+```python
+# Find all validated artifacts
+ln.Artifact.filter(is_valid=True).to_dataframe()
+
+# Find artifacts with specific schema
+ln.Artifact.filter(schema=schema).to_dataframe()
+
+# Query by annotated features
+ln.Artifact.filter(cell_type="T cell", tissue="blood").to_dataframe()
+
+# Include features in results
+ln.Artifact.filter(is_valid=True).to_dataframe(include="features")
+```
+
+## Best Practices
+
+1. **Define features first**: Create Feature registry before curation
+2. **Use public ontologies**: Leverage `curator.cat.lookup(public=True)` for standardization
+3. **Start flexible**: Use flexible schemas initially, tighten as understanding grows
+4. **Document slots**: Clearly specify transposition (.T) in composite schemas
+5. **Standardize early**: Fix typos and synonyms before the final validation pass
+6. **Validate incrementally**: Check each slot separately for composite structures
+7. **Version schemas**: Track schema changes over time
+8. **Add synonyms**: Register common variations to simplify future curation
+9. **Coerce types cautiously**: Enable dtype coercion only when safe
+10. **Test on samples**: Validate small subsets before full dataset curation
+
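The last practice, testing on samples, can be sketched without any LaminDB calls. `validate_sample` and the row-level `check` callable are hypothetical names introduced for illustration. The idea is simply to run a cheap check on a reproducible random subset before curating the full dataset.

```python
# Conceptual sketch only: `validate_sample` is hypothetical, not part of LaminDB.
import random


def validate_sample(rows, check, sample_size=100, seed=0):
    """Run `check` on a reproducible random subset of `rows`.

    Returns the sampled rows for which `check` returned False.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    if len(rows) <= sample_size:
        sample = rows
    else:
        sample = rng.sample(rows, sample_size)
    return [row for row in sample if not check(row)]


rows = [{"cell_type": "T cell", "gene_count": n} for n in range(1000)]
bad_rows = validate_sample(rows, lambda row: row["gene_count"] >= 0, sample_size=50)
print(f"{len(bad_rows)} failing rows in the sample")
```

A clean sample does not prove the full dataset is clean, but a failing sample surfaces systematic problems (wrong column names, wrong units) cheaply.
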
+## Advanced: Custom Validators
+
+Create custom validation logic:
+
+```python
+def validate_gene_expression(df):
+    """Custom validator for gene expression values."""
+    # Compare numeric columns only; string columns would raise a TypeError
+    numeric = df.select_dtypes(include="number")
+
+    # Check non-negative
+    if (numeric < 0).any().any():
+        return False, "Negative expression values found"
+
+    # Check reasonable range
+    if (numeric > 1e6).any().any():
+        return False, "Unreasonably high expression values"
+
+    return True, "Valid"
+
+# Apply during curation
+is_valid, message = validate_gene_expression(df)
+if not is_valid:
+    print(f"Validation failed: {message}")
+```
+
+## Tracking Curation Provenance
+
+```python
+# Curated artifacts track curation lineage
+ln.track()  # Start tracking
+
+# Perform curation
+curator = ln.curators.DataFrameCurator(df, schema)
+curator.validate()
+curator.cat.standardize("cell_type")
+artifact = curator.save_artifact(key="curated.parquet")
+
+ln.finish()  # Complete tracking
+
+# View curation lineage
+artifact.run.describe()  # Shows curation transform
+artifact.view_lineage()  # Visualizes curation process
+```