LaminDB Annotation & Validation

This document covers data curation, validation, schema management, and annotation best practices in LaminDB.

Overview

LaminDB's curation process ensures datasets are both validated and queryable through three essential steps (sketched in code after the list below):

  1. Validation: Confirming datasets match desired schemas
  2. Standardization: Fixing inconsistencies like typos and mapping synonyms
  3. Annotation: Linking datasets to metadata entities for queryability
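
The following minimal sketch shows how these three steps map onto the curator API covered in the rest of this document. It assumes a pandas DataFrame df and a saved schema already exist; the column name "cell_type" and the key are placeholders.

import lamindb as ln

# df is a pandas DataFrame and schema a saved ln.Schema (both assumed to exist)
curator = ln.curators.DataFrameCurator(df, schema)

# 1. Validation: check the DataFrame against the schema
if not curator.validate():
    # 2. Standardization: fix typos and map synonyms in a categorical column
    curator.cat.standardize("cell_type")  # placeholder column name

# 3. Annotation: save the artifact with schema and feature links for querying
artifact = curator.save_artifact(key="examples/curated.parquet")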

Schema Design

Schemas define expected data structure, types, and validation rules. LaminDB supports three main schema approaches:

1. Flexible Schema

Validates only columns matching Feature registry names, allowing additional metadata:

import lamindb as ln

# Create flexible schema
schema = ln.Schema(
    name="valid_features",
    itype=ln.Feature  # Validates against Feature registry
).save()

# Any column matching a Feature name will be validated
# Additional columns are permitted but not validated
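
As a usage sketch (assuming Feature records named cell_type and tissue are already registered), a DataFrame with an extra column still validates under this schema; only the matching columns are checked:

import pandas as pd

df = pd.DataFrame({
    "cell_type": ["T cell", "B cell"],      # matches a Feature -> validated
    "tissue": ["blood", "blood"],           # matches a Feature -> validated
    "internal_note": ["batch1", "batch2"],  # no matching Feature -> permitted, not validated
})

curator = ln.curators.DataFrameCurator(df, schema)
curator.validate()  # the extra column does not cause a failure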

2. Minimal Required Schema

Specifies essential columns while permitting extra metadata:

# Define required features
required_features = [
    ln.Feature.get(name="cell_type"),
    ln.Feature.get(name="tissue"),
    ln.Feature.get(name="donor_id")
]

# Create schema with required features
schema = ln.Schema(
    name="minimal_immune_schema",
    features=required_features,
    flexible=True  # Allows additional columns
).save()

3. Strict Schema

Enforces complete control over data structure:

# Define all allowed features
all_features = [
    ln.Feature.get(name="cell_type"),
    ln.Feature.get(name="tissue"),
    ln.Feature.get(name="donor_id"),
    ln.Feature.get(name="disease")
]

# Create strict schema
schema = ln.Schema(
    name="strict_immune_schema",
    features=all_features,
    flexible=False  # No additional columns allowed
).save()
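
Conversely, under a strict schema an unexpected column should fail validation. A sketch, assuming the four features above are registered:

df = pd.DataFrame({
    "cell_type": ["T cell"],
    "tissue": ["blood"],
    "donor_id": ["D001"],
    "disease": ["healthy"],
    "sequencing_batch": ["B1"],  # not among the allowed features
})

curator = ln.curators.DataFrameCurator(df, schema)
curator.validate()  # expected to fail because sequencing_batch is not in the schema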

DataFrame Curation Workflow

The typical curation process involves six key steps:

Steps 1-2: Load Data and Establish Registries

import pandas as pd
import lamindb as ln

# Load data
df = pd.read_csv("experiment.csv")

# Define and save features
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="tissue", dtype=str).save()
ln.Feature(name="gene_count", dtype=int).save()
ln.Feature(name="experiment_date", dtype="date").save()

# Populate valid values (if using controlled vocabulary)
import bionty as bt
bt.CellType.import_source()
bt.Tissue.import_source()

Step 3: Create Schema

# Link features to schema
features = [
    ln.Feature.get(name="cell_type"),
    ln.Feature.get(name="tissue"),
    ln.Feature.get(name="gene_count"),
    ln.Feature.get(name="experiment_date")
]

schema = ln.Schema(
    name="experiment_schema",
    features=features,
    flexible=True
).save()

Step 4: Initialize Curator and Validate

# Initialize curator
curator = ln.curators.DataFrameCurator(df, schema)

# Validate dataset
validation = curator.validate()

# Check validation results
if validation:
    print("✓ Validation passed")
else:
    print("✗ Validation failed")
    curator.non_validated  # See problematic fields

Step 5: Fix Validation Issues

Standardize Values

# Fix typos and synonyms in categorical columns
curator.cat.standardize("cell_type")
curator.cat.standardize("tissue")

# View standardization mapping
curator.cat.inspect_standardize("cell_type")

Map to Ontologies

# Map values to ontology terms
curator.cat.add_ontology("cell_type", bt.CellType)
curator.cat.add_ontology("tissue", bt.Tissue)

# Look up public ontologies for unmapped terms
curator.cat.lookup(public=True).cell_type  # Interactive lookup

Add New Terms

# Add new valid terms to registry
curator.cat.add_new_from("cell_type")

# Or manually create records
new_cell_type = bt.CellType(name="my_novel_cell_type").save()

Rename Columns

# Rename columns to match feature names
df = df.rename(columns={"celltype": "cell_type"})

# Re-initialize curator with fixed DataFrame
curator = ln.curators.DataFrameCurator(df, schema)

Step 6: Save Curated Artifact

# Save with schema linkage
artifact = curator.save_artifact(
    key="experiments/curated_data.parquet",
    description="Validated and annotated experimental data"
)

# Verify artifact has schema
artifact.schema  # Returns the schema object
artifact.describe()  # Shows validation status

AnnData Curation

For composite structures like AnnData, use "slots" to validate different components:

Defining AnnData Schemas

# Create schemas for different slots
obs_schema = ln.Schema(
    name="cell_metadata",
    features=[
        ln.Feature.get(name="cell_type"),
        ln.Feature.get(name="tissue"),
        ln.Feature.get(name="donor_id")
    ]
).save()

var_schema = ln.Schema(
    name="gene_ids",
    features=[ln.Feature.get(name="ensembl_gene_id")]
).save()

# Create composite AnnData schema
anndata_schema = ln.Schema(
    name="scrna_schema",
    otype="AnnData",
    slots={
        "obs": obs_schema,
        "var.T": var_schema  # .T indicates transposition
    }
).save()

Curating AnnData Objects

import anndata as ad

# Load AnnData
adata = ad.read_h5ad("data.h5ad")

# Initialize curator
curator = ln.curators.AnnDataCurator(adata, anndata_schema)

# Validate all slots
validation = curator.validate()

# Fix issues by slot
curator.cat.standardize("obs", "cell_type")
curator.cat.add_ontology("obs", "cell_type", bt.CellType)
curator.cat.standardize("var.T", "ensembl_gene_id")

# Save curated artifact
artifact = curator.save_artifact(
    key="scrna/validated_data.h5ad",
    description="Curated single-cell RNA-seq data"
)

MuData Curation

MuData supports multi-modal data through modality-specific slots:

# Define schemas for each modality
rna_obs_schema = ln.Schema(name="rna_obs_schema", features=[...]).save()
protein_obs_schema = ln.Schema(name="protein_obs_schema", features=[...]).save()

# Create MuData schema
mudata_schema = ln.Schema(
    name="multimodal_schema",
    otype="MuData",
    slots={
        "rna:obs": rna_obs_schema,
        "protein:obs": protein_obs_schema
    }
).save()

# Load the MuData object and curate
import mudata as mu
mdata = mu.read_h5mu("data.h5mu")

curator = ln.curators.MuDataCurator(mdata, mudata_schema)
curator.validate()

SpatialData Curation

For spatial transcriptomics data:

# Define spatial schema (cell_schema and bio_metadata_schema are ln.Schema objects created beforehand)
spatial_schema = ln.Schema(
    name="spatial_schema",
    otype="SpatialData",
    slots={
        "tables:cell_metadata.obs": cell_schema,
        "attrs:bio": bio_metadata_schema
    }
).save()

# Load the SpatialData object and curate
import spatialdata as sd
sdata = sd.read_zarr("data.zarr")

curator = ln.curators.SpatialDataCurator(sdata, spatial_schema)
curator.validate()

TileDB-SOMA Curation

For scalable array-backed data:

# Define SOMA schema
soma_schema = ln.Schema(
    name="soma_schema",
    otype="tiledbsoma",
    slots={
        "obs": obs_schema,
        "ms:RNA.T": var_schema  # measurement:modality.T
    }
).save()

# Open the SOMA experiment and curate
import tiledbsoma
soma_exp = tiledbsoma.Experiment.open("path/to/experiment")

curator = ln.curators.TileDBSOMACurator(soma_exp, soma_schema)
curator.validate()

Feature Validation

Data Type Validation

# Define typed features
ln.Feature(name="age", dtype=int).save()
ln.Feature(name="weight", dtype=float).save()
ln.Feature(name="is_treated", dtype=bool).save()
ln.Feature(name="collection_date", dtype="date").save()

# Coerce types during validation
ln.Feature(name="age_str", dtype=int, coerce_dtype=True).save()  # Auto-convert strings to int

Value Validation

# Validate against allowed values by typing the feature with a registry,
# which makes that registry the controlled vocabulary for this feature
cell_type_feature = ln.Feature(name="cell_type", dtype=bt.CellType).save()

# Now validation checks against CellType registry
curator = ln.curators.DataFrameCurator(df, schema)
curator.validate()  # Errors if cell_type values not in registry

Standardization Strategies

Using Public Ontologies

# Look up standardized terms from public sources
curator.cat.lookup(public=True).cell_type

# Returns auto-complete object with public ontology terms
# User can select correct term interactively

Synonym Mapping

# Add synonyms to records
t_cell = bt.CellType.get(name="T cell")
t_cell.add_synonym("T lymphocyte")
t_cell.add_synonym("T-cell")

# Now standardization maps synonyms automatically
curator.cat.standardize("cell_type")
# "T lymphocyte" → "T cell"
# "T-cell" → "T cell"

Custom Standardization

# Manual mapping
mapping = {
    "TCell": "T cell",
    "t cell": "T cell",
    "T-cells": "T cell"
}

# Apply mapping
df["cell_type"] = df["cell_type"].map(lambda x: mapping.get(x, x))

Handling Validation Errors

Common Issues and Solutions

Issue: Column not in schema

# Solution 1: Rename column
df = df.rename(columns={"old_name": "feature_name"})

# Solution 2: Add feature to schema
new_feature = ln.Feature(name="new_column", dtype=str).save()
schema.features.add(new_feature)

Issue: Invalid values

# Solution 1: Standardize
curator.cat.standardize("column_name")

# Solution 2: Add new valid values
curator.cat.add_new_from("column_name")

# Solution 3: Map to ontology
curator.cat.add_ontology("column_name", bt.Registry)

Issue: Data type mismatch

# Solution 1: Convert data type
df["column"] = df["column"].astype(int)

# Solution 2: Enable coercion in feature
feature = ln.Feature.get(name="column")
feature.coerce_dtype = True
feature.save()

Schema Versioning

Schemas can be versioned like other records:

# Create initial schema
schema_v1 = ln.Schema(name="experiment_schema", features=[...]).save()

# Update schema with new features
schema_v2 = ln.Schema(
    name="experiment_schema",
    features=[...],  # Updated list
    version="2"
).save()

# Link artifacts to specific schema versions
artifact.schema = schema_v2
artifact.save()
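
To review the versions that share a name, a query sketch, assuming the filter and get API shown in the next section also applies to Schema records:

# List all schemas registered under this name, across versions
ln.Schema.filter(name="experiment_schema").to_dataframe()

# Retrieve a specific version
schema_v2 = ln.Schema.get(name="experiment_schema", version="2")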

Querying Validated Data

Once data is validated and annotated, it becomes queryable:

# Find all validated artifacts
ln.Artifact.filter(is_valid=True).to_dataframe()

# Find artifacts with specific schema
ln.Artifact.filter(schema=schema).to_dataframe()

# Query by annotated features
ln.Artifact.filter(cell_type="T cell", tissue="blood").to_dataframe()

# Include features in results
ln.Artifact.filter(is_valid=True).to_dataframe(include="features")

Best Practices

  1. Define features first: Create Feature registry before curation
  2. Use public ontologies: Leverage curator.cat.lookup(public=True) for standardization
  3. Start flexible: Use flexible schemas initially, tighten as understanding grows
  4. Document slots: Clearly specify transposition (.T) in composite schemas
  5. Standardize early: Fix typos and synonyms before validation
  6. Validate incrementally: Check each slot separately for composite structures
  7. Version schemas: Track schema changes over time
  8. Add synonyms: Register common variations to simplify future curation
  9. Coerce types cautiously: Enable dtype coercion only when safe
  10. Test on samples: Validate small subsets before full dataset curation (see the sketch after this list)
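
For point 10, a quick sketch of sample-first validation; the sample size is a placeholder:

# Validate a small random sample before curating the full dataset
sample_df = df.sample(n=min(1000, len(df)), random_state=0)

curator = ln.curators.DataFrameCurator(sample_df, schema)
if curator.validate():
    # Sample passed: proceed with the full DataFrame
    curator = ln.curators.DataFrameCurator(df, schema)
    curator.validate()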

Advanced: Custom Validators

Create custom validation logic:

def validate_gene_expression(df):
    """Custom validator for gene expression values."""
    # Check non-negative
    if (df < 0).any().any():
        return False, "Negative expression values found"

    # Check reasonable range
    if (df > 1e6).any().any():
        return False, "Unreasonably high expression values"

    return True, "Valid"

# Apply during curation
is_valid, message = validate_gene_expression(df)
if not is_valid:
    print(f"Validation failed: {message}")

Tracking Curation Provenance

# Curated artifacts track curation lineage
ln.track()  # Start tracking

# Perform curation
curator = ln.curators.DataFrameCurator(df, schema)
curator.validate()
curator.cat.standardize("cell_type")
artifact = curator.save_artifact(key="curated.parquet")

ln.finish()  # Complete tracking

# View curation lineage
artifact.run.describe()  # Shows curation transform
artifact.view_lineage()  # Visualizes curation process