---
name: lamindb
description: This skill should be used when working with LaminDB, an open-source data framework for biology that makes data queryable, traceable, reproducible, and FAIR. Use when managing biological datasets (scRNA-seq, spatial, flow cytometry, etc.), tracking computational workflows, curating and validating data with biological ontologies, building data lakehouses, or ensuring data lineage and reproducibility in biological research. Covers data management, annotation, ontologies (genes, cell types, diseases, tissues), schema validation, integrations with workflow managers (Nextflow, Snakemake) and MLOps platforms (W&B, MLflow), and deployment strategies.
---
# LaminDB
## Overview
LaminDB is an open-source data framework for biology designed to make data queryable, traceable, reproducible, and FAIR (Findable, Accessible, Interoperable, Reusable). It provides a unified platform that combines lakehouse architecture, lineage tracking, feature stores, biological ontologies, LIMS (Laboratory Information Management System), and ELN (Electronic Lab Notebook) capabilities through a single Python API.
**Core Value Proposition:**
- **Queryability**: Search and filter datasets by metadata, features, and ontology terms
- **Traceability**: Automatic lineage tracking from raw data through analysis to results
- **Reproducibility**: Version control for data, code, and environment
- **FAIR Compliance**: Standardized annotations using biological ontologies
## When to Use This Skill
Use this skill when:
- **Managing biological datasets**: scRNA-seq, bulk RNA-seq, spatial transcriptomics, flow cytometry, multi-modal data, EHR data
- **Tracking computational workflows**: Notebooks, scripts, pipeline execution (Nextflow, Snakemake, Redun)
- **Curating and validating data**: Schema validation, standardization, ontology-based annotation
- **Working with biological ontologies**: Genes, proteins, cell types, tissues, diseases, pathways (via Bionty)
- **Building data lakehouses**: Unified query interface across multiple datasets
- **Ensuring reproducibility**: Automatic versioning, lineage tracking, environment capture
- **Integrating ML pipelines**: Connecting with Weights & Biases, MLflow, HuggingFace, scVI-tools
- **Deploying data infrastructure**: Setting up local or cloud-based data management systems
- **Collaborating on datasets**: Sharing curated, annotated data with standardized metadata
## Core Capabilities
LaminDB provides six interconnected capability areas, each documented in detail in the references folder.
### 1. Core Concepts and Data Lineage
**Core entities:**
- **Artifacts**: Versioned datasets (DataFrame, AnnData, Parquet, Zarr, etc.)
- **Records**: Experimental entities (samples, perturbations, instruments)
- **Runs & Transforms**: Computational lineage tracking (what code produced what data)
- **Features**: Typed metadata fields for annotation and querying
**Key workflows:**
- Create and version artifacts from files or Python objects
- Track notebook/script execution with `ln.track()` and `ln.finish()`
- Annotate artifacts with typed features
- Visualize data lineage graphs with `artifact.view_lineage()`
- Query by provenance (find all outputs from specific code/inputs)
**Reference:** `references/core-concepts.md` - Read this for detailed information on artifacts, records, runs, transforms, features, versioning, and lineage tracking.
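A minimal lineage-tracking sketch using these calls (the file name, key, and feature value are placeholders; the feature is assumed to be registered beforehand):
```python
import lamindb as ln
# Start capturing code, environment, inputs, and outputs
ln.track()
# Register a locally produced file as a versioned artifact
artifact = ln.Artifact("qc_metrics.csv", key="example/qc_metrics.csv").save()
# Annotate with a typed feature (assumes ln.Feature(name="experiment_date", ...) was saved earlier)
artifact.features.add_values({"experiment_date": "2025-10-31"})
# Visualize how this artifact was produced, then close the run
artifact.view_lineage()
ln.finish()
```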
### 2. Data Management and Querying
**Query capabilities:**
- Registry exploration and lookup with auto-complete
- Single record retrieval with `get()`, `one()`, `one_or_none()`
- Filtering with comparison operators (`__gt`, `__lte`, `__contains`, `__startswith`)
- Feature-based queries (query by annotated metadata)
- Cross-registry traversal with double-underscore syntax
- Full-text search across registries
- Advanced logical queries with Q objects (AND, OR, NOT)
- Streaming large datasets without loading into memory
**Key workflows:**
- Browse artifacts with filters and ordering
- Query by features, creation date, creator, size, etc.
- Stream large files in chunks or with array slicing
- Organize data with hierarchical keys
- Group artifacts into collections
**Reference:** `references/data-management.md` - Read this for comprehensive query patterns, filtering examples, streaming strategies, and data organization best practices.
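A brief sketch of these query patterns (keys, suffixes, and size thresholds are illustrative):
```python
import lamindb as ln
from lamindb import Q
# Metadata-first: filter and order before touching file contents
recent_h5ads = ln.Artifact.filter(
    suffix=".h5ad",
    size__gt=1e8,
    key__startswith="scrna/"
).order_by("-created_at")
recent_h5ads.to_dataframe()  # inspect matches as a table
# Logical OR across conditions
ln.Artifact.filter(Q(suffix=".h5ad") | Q(suffix=".zarr")).to_dataframe()
# Stream a large artifact instead of loading it fully
artifact = recent_h5ads.first()
if artifact is not None:
    adata = artifact.backed()  # backed/lazy access
    subset = adata[:1000, :]   # slice only what is needed
```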
### 3. Annotation and Validation
**Curation process:**
1. **Validation**: Confirm datasets match desired schemas
2. **Standardization**: Fix typos, map synonyms to canonical terms
3. **Annotation**: Link datasets to metadata entities for queryability
**Schema types:**
- **Flexible schemas**: Validate only known columns, allow additional metadata
- **Minimal required schemas**: Specify essential columns, permit extras
- **Strict schemas**: Complete control over structure and values
**Supported data types:**
- DataFrames (Parquet, CSV)
- AnnData (single-cell genomics)
- MuData (multi-modal)
- SpatialData (spatial transcriptomics)
- TileDB-SOMA (scalable arrays)
**Key workflows:**
- Define features and schemas for data validation
- Use `DataFrameCurator` or `AnnDataCurator` for validation
- Standardize values with `.cat.standardize()`
- Map to ontologies with `.cat.add_ontology()`
- Save curated artifacts with schema linkage
- Query validated datasets by features
**Reference:** `references/annotation-validation.md` - Read this for detailed curation workflows, schema design patterns, handling validation errors, and best practices.
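A condensed sketch of this workflow (feature names and the key are illustrative; `df` is assumed to be an in-memory pandas DataFrame):
```python
import lamindb as ln
# 1. Register typed features (once per instance)
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="tissue", dtype=str).save()
# 2. Group them into a schema
schema = ln.Schema(
    name="minimal_immune_schema",
    features=[ln.Feature.get(name="cell_type"), ln.Feature.get(name="tissue")],
    flexible=True  # extra columns are allowed but not validated
).save()
# 3. Validate, standardize, and save against the schema
curator = ln.curators.DataFrameCurator(df, schema)
if not curator.validate():
    curator.cat.standardize("cell_type")  # map typos/synonyms to canonical terms
artifact = curator.save_artifact(key="experiments/curated.parquet")
```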
### 4. Biological Ontologies
**Available ontologies (via Bionty):**
- Genes (Ensembl), Proteins (UniProt)
- Cell types (CL), Cell lines (CLO)
- Tissues (Uberon), Diseases (Mondo, DOID)
- Phenotypes (HPO), Pathways (GO)
- Experimental factors (EFO), Developmental stages
- Organisms (NCBItaxon), Drugs (DrugBank)
**Key workflows:**
- Import public ontologies with `bt.CellType.import_source()`
- Search ontologies with keyword or exact matching
- Standardize terms using synonym mapping
- Explore hierarchical relationships (parents, children, ancestors)
- Validate data against ontology terms
- Annotate datasets with ontology records
- Create custom terms and hierarchies
- Handle multi-organism contexts (human, mouse, etc.)
**Reference:** `references/ontologies.md` - Read this for comprehensive ontology operations, standardization strategies, hierarchy navigation, and annotation workflows.
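A short sketch of the core ontology operations (the example labels are illustrative):
```python
import bionty as bt
# Populate the CellType registry from the public Cell Ontology
bt.CellType.import_source()
# Search and retrieve terms
bt.CellType.search("T cell").to_dataframe()
t_cell = bt.CellType.get(ontology_id="CL:0000084")
t_cell.parents.to_dataframe()  # navigate the hierarchy
# Standardize free-text labels against the ontology
standardized = bt.CellType.standardize(["T lymphocyte", "b cell"])
```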
### 5. Integrations
**Workflow managers:**
- Nextflow: Track pipeline processes and outputs
- Snakemake: Integrate into Snakemake rules
- Redun: Combine with Redun task tracking
**MLOps platforms:**
- Weights & Biases: Link experiments with data artifacts
- MLflow: Track models and experiments
- HuggingFace: Track model fine-tuning
- scVI-tools: Single-cell analysis workflows
**Storage systems:**
- Local filesystem, AWS S3, Google Cloud Storage
- S3-compatible (MinIO, Cloudflare R2)
- HTTP/HTTPS endpoints (read-only)
- HuggingFace datasets
**Array stores:**
- TileDB-SOMA (with cellxgene support)
- DuckDB for SQL queries on Parquet files
**Visualization:**
- Vitessce for interactive spatial/single-cell visualization
**Version control:**
- Git integration for source code tracking
**Reference:** `references/integrations.md` - Read this for integration patterns, code examples, and troubleshooting for third-party systems.
### 6. Setup and Deployment
**Installation:**
- Basic: `uv pip install lamindb`
- With extras: `uv pip install 'lamindb[gcp,zarr,fcs]'`
- Modules: bionty, wetlab, clinical
**Instance types:**
- Local SQLite (development)
- Cloud storage + SQLite (small teams)
- Cloud storage + PostgreSQL (production)
**Storage options:**
- Local filesystem
- AWS S3 with configurable regions and permissions
- Google Cloud Storage
- S3-compatible endpoints (MinIO, Cloudflare R2)
**Configuration:**
- Cache management for cloud files
- Multi-user system configurations
- Git repository sync
- Environment variables
**Deployment patterns:**
- Local dev → Cloud production migration
- Multi-region deployments
- Shared storage with personal instances
**Reference:** `references/setup-deployment.md` - Read this for detailed installation, configuration, storage setup, database management, security best practices, and troubleshooting.
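A minimal local-development setup sketch, assuming the `lamin` CLI is installed (the storage path is illustrative):
```python
# One-time setup from a shell (details in references/setup-deployment.md):
#   uv pip install lamindb
#   lamin login
#   lamin init --storage ./lamindata
import lamindb as ln
# The initialized instance is the default connection for this environment
ln.settings.cache_dir  # local cache location for cloud-backed files
ln.view()              # overview of registries and recent records
```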
## Common Use Case Workflows
### Use Case 1: Single-Cell RNA-seq Analysis with Ontology Validation
```python
import lamindb as ln
import bionty as bt
import anndata as ad
# Start tracking
ln.track(params={"analysis": "scRNA-seq QC and annotation"})
# Import cell type ontology
bt.CellType.import_source()
# Load data
adata = ad.read_h5ad("raw_counts.h5ad")
# Validate and standardize cell types
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])
# Curate with schema
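# `schema` is assumed to be a pre-defined AnnData schema
# (see references/annotation-validation.md for how to construct one)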
curator = ln.curators.AnnDataCurator(adata, schema)
curator.validate()
artifact = curator.save_artifact(key="scrna/validated.h5ad")
# Link ontology annotations
cell_types = bt.CellType.from_values(adata.obs.cell_type)
artifact.feature_sets.add_ontology(cell_types)
ln.finish()
```
### Use Case 2: Building a Queryable Data Lakehouse
```python
import lamindb as ln
# Register multiple experiments
for i, file in enumerate(data_files):
artifact = ln.Artifact.from_anndata(
ad.read_h5ad(file),
key=f"scrna/batch_{i}.h5ad",
description=f"scRNA-seq batch {i}"
).save()
# Annotate with features
artifact.features.add_values({
"batch": i,
"tissue": tissues[i],
"condition": conditions[i]
})
# Query across all experiments
immune_datasets = ln.Artifact.filter(
    key__startswith="scrna/",
    tissue="PBMC",
    condition="treated"
)
immune_datasets.to_dataframe()  # inspect matches as a table
# Load and analyze each matching dataset
for artifact in immune_datasets:
    adata = artifact.load()
    # ... analyze ...
```
### Use Case 3: ML Pipeline with W&B Integration
```python
import lamindb as ln
import wandb
# Initialize both systems
wandb.init(project="drug-response", name="exp-42")
ln.track(params={"model": "random_forest", "n_estimators": 100})
# Load training data from LaminDB
train_artifact = ln.Artifact.get(key="datasets/train.parquet")
train_data = train_artifact.load()
# Train model
model = train_model(train_data)
# Log to W&B
wandb.log({"accuracy": 0.95})
# Save model in LaminDB with W&B linkage
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact("model.pkl", key="models/exp-42.pkl").save()
model_artifact.features.add_values({"wandb_run_id": wandb.run.id})
ln.finish()
wandb.finish()
```
### Use Case 4: Nextflow Pipeline Integration
```python
# In Nextflow process script
import lamindb as ln
ln.track()
# Load input artifact
input_artifact = ln.Artifact.get(key="raw/batch_${batch_id}.fastq.gz")
input_path = input_artifact.cache()
# Process (alignment, quantification, etc.)
# ... Nextflow process logic ...
# Save output
output_artifact = ln.Artifact(
"counts.csv",
key="processed/batch_${batch_id}_counts.csv"
).save()
ln.finish()
```
## Getting Started Checklist
To start using LaminDB effectively:
1. **Installation & Setup** (`references/setup-deployment.md`)
- Install LaminDB and required extras
- Authenticate with `lamin login`
- Initialize instance with `lamin init --storage ...`
2. **Learn Core Concepts** (`references/core-concepts.md`)
- Understand Artifacts, Records, Runs, Transforms
- Practice creating and retrieving artifacts
- Implement `ln.track()` and `ln.finish()` in workflows
3. **Master Querying** (`references/data-management.md`)
- Practice filtering and searching registries
- Learn feature-based queries
- Experiment with streaming large files
4. **Set Up Validation** (`references/annotation-validation.md`)
- Define features relevant to research domain
- Create schemas for data types
- Practice curation workflows
5. **Integrate Ontologies** (`references/ontologies.md`)
- Import relevant biological ontologies (genes, cell types, etc.)
- Validate existing annotations
- Standardize metadata with ontology terms
6. **Connect Tools** (`references/integrations.md`)
- Integrate with existing workflow managers
- Link ML platforms for experiment tracking
- Configure cloud storage and compute
## Key Principles
Follow these principles when working with LaminDB:
1. **Track everything**: Use `ln.track()` at the start of every analysis for automatic lineage capture
2. **Validate early**: Define schemas and validate data before extensive analysis
3. **Use ontologies**: Leverage public biological ontologies for standardized annotations
4. **Organize with keys**: Structure artifact keys hierarchically (e.g., `project/experiment/batch/file.h5ad`)
5. **Query metadata first**: Filter and search before loading large files
6. **Version, don't duplicate**: Use built-in versioning instead of creating new keys for modifications (see the example after this list)
7. **Annotate with features**: Define typed features for queryable metadata
8. **Document thoroughly**: Add descriptions to artifacts, schemas, and transforms
9. **Leverage lineage**: Use `view_lineage()` to understand data provenance
10. **Start local, scale cloud**: Develop locally with SQLite, deploy to cloud with PostgreSQL
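For example, principles 4 and 6 combine as follows (file and key names are placeholders):
```python
import lamindb as ln
# Hierarchical key; saving again under the same key creates a new version
ln.Artifact("qc.h5ad", key="cancer-screen/scrna/batch1/qc.h5ad").save()  # version 1
# ... regenerate qc.h5ad with updated parameters ...
ln.Artifact("qc.h5ad", key="cancer-screen/scrna/batch1/qc.h5ad").save()  # new version, same key
```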
## Reference Files
This skill includes comprehensive reference documentation organized by capability:
- **`references/core-concepts.md`** - Artifacts, records, runs, transforms, features, versioning, lineage
- **`references/data-management.md`** - Querying, filtering, searching, streaming, organizing data
- **`references/annotation-validation.md`** - Schema design, curation workflows, validation strategies
- **`references/ontologies.md`** - Biological ontology management, standardization, hierarchies
- **`references/integrations.md`** - Workflow managers, MLOps platforms, storage systems, tools
- **`references/setup-deployment.md`** - Installation, configuration, deployment, troubleshooting
Read the relevant reference file(s) based on the specific LaminDB capability needed for the task at hand.
## Additional Resources
- **Official Documentation**: https://docs.lamin.ai
- **API Reference**: https://docs.lamin.ai/api
- **GitHub Repository**: https://github.com/laminlabs/lamindb
- **Tutorial**: https://docs.lamin.ai/tutorial
- **FAQ**: https://docs.lamin.ai/faq

# LaminDB Annotation & Validation
This document covers data curation, validation, schema management, and annotation best practices in LaminDB.
## Overview
LaminDB's curation process ensures datasets are both validated and queryable through three essential steps:
1. **Validation**: Confirming datasets match desired schemas
2. **Standardization**: Fixing inconsistencies like typos and mapping synonyms
3. **Annotation**: Linking datasets to metadata entities for queryability
## Schema Design
Schemas define expected data structure, types, and validation rules. LaminDB supports three main schema approaches:
### 1. Flexible Schema
Validates only columns matching Feature registry names, allowing additional metadata:
```python
import lamindb as ln
# Create flexible schema
schema = ln.Schema(
name="valid_features",
itype=ln.Feature # Validates against Feature registry
).save()
# Any column matching a Feature name will be validated
# Additional columns are permitted but not validated
```
### 2. Minimal Required Schema
Specifies essential columns while permitting extra metadata:
```python
# Define required features
required_features = [
ln.Feature.get(name="cell_type"),
ln.Feature.get(name="tissue"),
ln.Feature.get(name="donor_id")
]
# Create schema with required features
schema = ln.Schema(
name="minimal_immune_schema",
features=required_features,
flexible=True # Allows additional columns
).save()
```
### 3. Strict Schema
Enforces complete control over data structure:
```python
# Define all allowed features
all_features = [
ln.Feature.get(name="cell_type"),
ln.Feature.get(name="tissue"),
ln.Feature.get(name="donor_id"),
ln.Feature.get(name="disease")
]
# Create strict schema
schema = ln.Schema(
name="strict_immune_schema",
features=all_features,
flexible=False # No additional columns allowed
).save()
```
## DataFrame Curation Workflow
The typical curation process involves six key steps:
### Step 1-2: Load Data and Establish Registries
```python
import pandas as pd
import lamindb as ln
# Load data
df = pd.read_csv("experiment.csv")
# Define and save features
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="tissue", dtype=str).save()
ln.Feature(name="gene_count", dtype=int).save()
ln.Feature(name="experiment_date", dtype="date").save()
# Populate valid values (if using controlled vocabulary)
import bionty as bt
bt.CellType.import_source()
bt.Tissue.import_source()
```
### Step 3: Create Schema
```python
# Link features to schema
features = [
ln.Feature.get(name="cell_type"),
ln.Feature.get(name="tissue"),
ln.Feature.get(name="gene_count"),
ln.Feature.get(name="experiment_date")
]
schema = ln.Schema(
name="experiment_schema",
features=features,
flexible=True
).save()
```
### Step 4: Initialize Curator and Validate
```python
# Initialize curator
curator = ln.curators.DataFrameCurator(df, schema)
# Validate dataset
validation = curator.validate()
# Check validation results
if validation:
print("✓ Validation passed")
else:
print("✗ Validation failed")
curator.non_validated # See problematic fields
```
### Step 5: Fix Validation Issues
#### Standardize Values
```python
# Fix typos and synonyms in categorical columns
curator.cat.standardize("cell_type")
curator.cat.standardize("tissue")
# View standardization mapping
curator.cat.inspect_standardize("cell_type")
```
#### Map to Ontologies
```python
# Map values to ontology terms
curator.cat.add_ontology("cell_type", bt.CellType)
curator.cat.add_ontology("tissue", bt.Tissue)
# Look up public ontologies for unmapped terms
curator.cat.lookup(public=True).cell_type # Interactive lookup
```
#### Add New Terms
```python
# Add new valid terms to registry
curator.cat.add_new_from("cell_type")
# Or manually create records
new_cell_type = bt.CellType(name="my_novel_cell_type").save()
```
#### Rename Columns
```python
# Rename columns to match feature names
df = df.rename(columns={"celltype": "cell_type"})
# Re-initialize curator with fixed DataFrame
curator = ln.curators.DataFrameCurator(df, schema)
```
### Step 6: Save Curated Artifact
```python
# Save with schema linkage
artifact = curator.save_artifact(
key="experiments/curated_data.parquet",
description="Validated and annotated experimental data"
)
# Verify artifact has schema
artifact.schema # Returns the schema object
artifact.describe() # Shows validation status
```
## AnnData Curation
For composite structures like AnnData, use "slots" to validate different components:
### Defining AnnData Schemas
```python
# Create schemas for different slots
obs_schema = ln.Schema(
name="cell_metadata",
features=[
ln.Feature.get(name="cell_type"),
ln.Feature.get(name="tissue"),
ln.Feature.get(name="donor_id")
]
).save()
var_schema = ln.Schema(
name="gene_ids",
features=[ln.Feature.get(name="ensembl_gene_id")]
).save()
# Create composite AnnData schema
anndata_schema = ln.Schema(
name="scrna_schema",
otype="AnnData",
slots={
"obs": obs_schema,
"var.T": var_schema # .T indicates transposition
}
).save()
```
### Curating AnnData Objects
```python
import anndata as ad
# Load AnnData
adata = ad.read_h5ad("data.h5ad")
# Initialize curator
curator = ln.curators.AnnDataCurator(adata, anndata_schema)
# Validate all slots
validation = curator.validate()
# Fix issues by slot
curator.cat.standardize("obs", "cell_type")
curator.cat.add_ontology("obs", "cell_type", bt.CellType)
curator.cat.standardize("var.T", "ensembl_gene_id")
# Save curated artifact
artifact = curator.save_artifact(
key="scrna/validated_data.h5ad",
description="Curated single-cell RNA-seq data"
)
```
## MuData Curation
MuData supports multi-modal data through modality-specific slots:
```python
# Define schemas for each modality
rna_obs_schema = ln.Schema(name="rna_obs_schema", features=[...]).save()
protein_obs_schema = ln.Schema(name="protein_obs_schema", features=[...]).save()
# Create MuData schema
mudata_schema = ln.Schema(
name="multimodal_schema",
otype="MuData",
slots={
"rna:obs": rna_obs_schema,
"protein:obs": protein_obs_schema
}
).save()
# Curate
curator = ln.curators.MuDataCurator(mdata, mudata_schema)
curator.validate()
```
## SpatialData Curation
For spatial transcriptomics data:
```python
# Define spatial schema
spatial_schema = ln.Schema(
name="spatial_schema",
otype="SpatialData",
slots={
"tables:cell_metadata.obs": cell_schema,
"attrs:bio": bio_metadata_schema
}
).save()
# Curate
curator = ln.curators.SpatialDataCurator(sdata, spatial_schema)
curator.validate()
```
## TileDB-SOMA Curation
For scalable array-backed data:
```python
# Define SOMA schema
soma_schema = ln.Schema(
name="soma_schema",
otype="tiledbsoma",
slots={
"obs": obs_schema,
"ms:RNA.T": var_schema # measurement:modality.T
}
).save()
# Curate
curator = ln.curators.TileDBSOMACurator(soma_exp, soma_schema)
curator.validate()
```
## Feature Validation
### Data Type Validation
```python
# Define typed features
ln.Feature(name="age", dtype=int).save()
ln.Feature(name="weight", dtype=float).save()
ln.Feature(name="is_treated", dtype=bool).save()
ln.Feature(name="collection_date", dtype="date").save()
# Coerce types during validation
ln.Feature(name="age_str", dtype=int, coerce_dtype=True).save() # Auto-convert strings to int
```
### Value Validation
```python
# Validate against allowed values
cell_type_feature = ln.Feature(name="cell_type", dtype=str).save()
# Link to registry for controlled vocabulary
cell_type_feature.link_to_registry(bt.CellType)
# Now validation checks against CellType registry
curator = ln.curators.DataFrameCurator(df, schema)
curator.validate() # Errors if cell_type values not in registry
```
## Standardization Strategies
### Using Public Ontologies
```python
# Look up standardized terms from public sources
curator.cat.lookup(public=True).cell_type
# Returns auto-complete object with public ontology terms
# User can select correct term interactively
```
### Synonym Mapping
```python
# Add synonyms to records
t_cell = bt.CellType.get(name="T cell")
t_cell.add_synonym("T lymphocyte")
t_cell.add_synonym("T-cell")
# Now standardization maps synonyms automatically
curator.cat.standardize("cell_type")
# "T lymphocyte" → "T cell"
# "T-cell" → "T cell"
```
### Custom Standardization
```python
# Manual mapping
mapping = {
"TCell": "T cell",
"t cell": "T cell",
"T-cells": "T cell"
}
# Apply mapping
df["cell_type"] = df["cell_type"].map(lambda x: mapping.get(x, x))
```
## Handling Validation Errors
### Common Issues and Solutions
**Issue: Column not in schema**
```python
# Solution 1: Rename column
df = df.rename(columns={"old_name": "feature_name"})
# Solution 2: Add feature to schema
new_feature = ln.Feature(name="new_column", dtype=str).save()
schema.features.add(new_feature)
```
**Issue: Invalid values**
```python
# Solution 1: Standardize
curator.cat.standardize("column_name")
# Solution 2: Add new valid values
curator.cat.add_new_from("column_name")
# Solution 3: Map to ontology
curator.cat.add_ontology("column_name", bt.Registry)
```
**Issue: Data type mismatch**
```python
# Solution 1: Convert data type
df["column"] = df["column"].astype(int)
# Solution 2: Enable coercion in feature
feature = ln.Feature.get(name="column")
feature.coerce_dtype = True
feature.save()
```
## Schema Versioning
Schemas can be versioned like other records:
```python
# Create initial schema
schema_v1 = ln.Schema(name="experiment_schema", features=[...]).save()
# Update schema with new features
schema_v2 = ln.Schema(
name="experiment_schema",
features=[...], # Updated list
version="2"
).save()
# Link artifacts to specific schema versions
artifact.schema = schema_v2
artifact.save()
```
## Querying Validated Data
Once data is validated and annotated, it becomes queryable:
```python
# Find all validated artifacts
ln.Artifact.filter(is_valid=True).to_dataframe()
# Find artifacts with specific schema
ln.Artifact.filter(schema=schema).to_dataframe()
# Query by annotated features
ln.Artifact.filter(cell_type="T cell", tissue="blood").to_dataframe()
# Include features in results
ln.Artifact.filter(is_valid=True).to_dataframe(include="features")
```
## Best Practices
1. **Define features first**: Create Feature registry before curation
2. **Use public ontologies**: Leverage public ontology lookups (e.g., `curator.cat.lookup(public=True)`) for standardization
3. **Start flexible**: Use flexible schemas initially, tighten as understanding grows
4. **Document slots**: Clearly specify transposition (.T) in composite schemas
5. **Standardize early**: Fix typos and synonyms before validation
6. **Validate incrementally**: Check each slot separately for composite structures
7. **Version schemas**: Track schema changes over time
8. **Add synonyms**: Register common variations to simplify future curation
9. **Coerce types cautiously**: Enable dtype coercion only when safe
10. **Test on samples**: Validate small subsets before full dataset curation
## Advanced: Custom Validators
Create custom validation logic:
```python
def validate_gene_expression(df):
"""Custom validator for gene expression values."""
# Check non-negative
if (df < 0).any().any():
return False, "Negative expression values found"
# Check reasonable range
if (df > 1e6).any().any():
return False, "Unreasonably high expression values"
return True, "Valid"
# Apply during curation
is_valid, message = validate_gene_expression(df)
if not is_valid:
print(f"Validation failed: {message}")
```
## Tracking Curation Provenance
```python
# Curated artifacts track curation lineage
ln.track() # Start tracking
# Perform curation
curator = ln.curators.DataFrameCurator(df, schema)
curator.validate()
curator.cat.standardize("cell_type")
artifact = curator.save_artifact(key="curated.parquet")
ln.finish() # Complete tracking
# View curation lineage
artifact.run.describe() # Shows curation transform
artifact.view_lineage() # Visualizes curation process
```

# LaminDB Core Concepts
This document covers the fundamental concepts and building blocks of LaminDB: Artifacts, Records, Runs, Transforms, Features, and data lineage tracking.
## Artifacts
Artifacts represent datasets in various formats (DataFrames, AnnData, SpatialData, Parquet, Zarr, etc.). They serve as the primary data objects in LaminDB.
### Creating and Saving Artifacts
**From file:**
```python
import lamindb as ln
# Save a file as artifact
ln.Artifact("sample.fasta", key="sample.fasta").save()
# With description
artifact = ln.Artifact(
"data/analysis.h5ad",
key="experiments/scrna_batch1.h5ad",
description="Single-cell RNA-seq batch 1"
).save()
```
**From DataFrame:**
```python
import pandas as pd
df = pd.read_csv("data.csv")
artifact = ln.Artifact.from_dataframe(
df,
key="datasets/processed_data.parquet",
description="Processed experimental data"
).save()
```
**From AnnData:**
```python
import anndata as ad
adata = ad.read_h5ad("data.h5ad")
artifact = ln.Artifact.from_anndata(
adata,
key="scrna/experiment1.h5ad",
description="scRNA-seq data with QC"
).save()
```
### Retrieving Artifacts
```python
# By key
artifact = ln.Artifact.get(key="sample.fasta")
# By UID
artifact = ln.Artifact.get("aRt1Fact0uid000")
# By filter
artifact = ln.Artifact.filter(suffix=".h5ad").first()
```
### Accessing Artifact Content
```python
# Get cached local path
local_path = artifact.cache()
# Load into memory
data = artifact.load() # Returns DataFrame, AnnData, etc.
# Streaming access (for large files)
with artifact.open() as f:
# Read incrementally
chunk = f.read(1000)
```
### Artifact Metadata
```python
# View all metadata
artifact.describe()
# Access specific metadata
artifact.size # File size in bytes
artifact.suffix # File extension
artifact.created_at # Timestamp
artifact.created_by # User who created it
artifact.run # Associated run
artifact.transform # Associated transform
artifact.version # Version string
```
## Records
Records represent experimental entities: samples, perturbations, instruments, cell lines, and any other metadata entities. They support hierarchical relationships through type definitions.
### Creating Records
```python
# Define a type
sample_type = ln.Record(name="Sample", is_type=True).save()
# Create instances of that type
ln.Record(name="P53mutant1", type=sample_type).save()
ln.Record(name="P53mutant2", type=sample_type).save()
ln.Record(name="WT-control", type=sample_type).save()
```
### Searching Records
```python
# Text search
ln.Record.search("p53").to_dataframe()
# Filter by fields
ln.Record.filter(type=sample_type).to_dataframe()
# Get specific record
record = ln.Record.get(name="P53mutant1")
```
### Hierarchical Relationships
```python
# Establish parent-child relationships
parent_record = ln.Record.get(name="P53mutant1")
child_record = ln.Record(name="P53mutant1-replicate1", type=sample_type).save()
child_record.parents.add(parent_record)
# Query relationships
parent_record.children.to_dataframe()
child_record.parents.to_dataframe()
```
## Runs & Transforms
These capture computational lineage. A **Transform** represents a reusable analysis step (notebook, script, or function), while a **Run** documents a specific execution instance.
### Basic Tracking Workflow
```python
import lamindb as ln
# Start tracking (beginning of notebook/script)
ln.track()
# Your analysis code
data = ln.Artifact.get(key="input.csv").load()
# ... perform analysis ...
result.to_csv("output.csv")
artifact = ln.Artifact("output.csv", key="output.csv").save()
# Finish tracking (end of notebook/script)
ln.finish()
```
### Tracking with Parameters
```python
ln.track(params={
"learning_rate": 0.01,
"batch_size": 32,
"epochs": 100,
"downsample": True
})
# Query runs by parameters
ln.Run.filter(params__learning_rate=0.01).to_dataframe()
ln.Run.filter(params__downsample=True).to_dataframe()
```
### Tracking with Projects
```python
# Associate with project
ln.track(project="Cancer Drug Screen 2025")
# Query by project
project = ln.Project.get(name="Cancer Drug Screen 2025")
ln.Artifact.filter(projects=project).to_dataframe()
ln.Run.filter(project=project).to_dataframe()
```
### Function-Level Tracking
Use the `@ln.tracked()` decorator for fine-grained lineage:
```python
@ln.tracked()
def preprocess_data(input_key: str, output_key: str, normalize: bool = True) -> None:
"""Preprocess raw data and save result."""
# Load input (automatically tracked)
artifact = ln.Artifact.get(key=input_key)
data = artifact.load()
# Process
if normalize:
data = (data - data.mean()) / data.std()
# Save output (automatically tracked)
ln.Artifact.from_dataframe(data, key=output_key).save()
# Each call creates a separate Transform and Run
preprocess_data("raw/batch1.csv", "processed/batch1.csv", normalize=True)
preprocess_data("raw/batch2.csv", "processed/batch2.csv", normalize=False)
```
### Accessing Lineage Information
```python
# From artifact to run
artifact = ln.Artifact.get(key="output.csv")
run = artifact.run
transform = run.transform
# View details
run.describe() # Run metadata
transform.describe() # Transform metadata
# Access inputs
run.inputs.to_dataframe()
# Visualize lineage graph
artifact.view_lineage()
```
## Features
Features define typed metadata fields for validation and querying. They enable structured annotation and searching.
### Defining Features
```python
from datetime import date
# Numeric feature
ln.Feature(name="gc_content", dtype=float).save()
ln.Feature(name="read_count", dtype=int).save()
# Date feature
ln.Feature(name="experiment_date", dtype=date).save()
# Categorical feature
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="treatment", dtype=str).save()
```
### Annotating Artifacts with Features
```python
# Single values
artifact.features.add_values({
"gc_content": 0.55,
"experiment_date": "2025-10-31"
})
# Using feature registry records
gc_content_feature = ln.Feature.get(name="gc_content")
artifact.features.add(gc_content_feature)
```
### Querying by Features
```python
# Filter by feature value
ln.Artifact.filter(gc_content=0.55).to_dataframe()
ln.Artifact.filter(experiment_date="2025-10-31").to_dataframe()
# Comparison operators
ln.Artifact.filter(read_count__gt=1000000).to_dataframe()
ln.Artifact.filter(gc_content__gte=0.5, gc_content__lte=0.6).to_dataframe()
# Check for presence of annotation
ln.Artifact.filter(cell_type__isnull=False).to_dataframe()
# Include features in output
ln.Artifact.filter(treatment="DMSO").to_dataframe(include="features")
```
### Nested Dictionary Features
For complex metadata stored as dictionaries:
```python
# Access nested values
ln.Artifact.filter(study_metadata__detail1="123").to_dataframe()
ln.Artifact.filter(study_metadata__assay__type="RNA-seq").to_dataframe()
```
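To attach such nested metadata in the first place, dictionary values can be passed to `add_values` (a sketch; it assumes a `study_metadata` feature registered with a dict-compatible dtype):
```python
# Assumes something like ln.Feature(name="study_metadata", dtype=dict).save() was run earlier
artifact.features.add_values({
    "study_metadata": {"detail1": "123", "assay": {"type": "RNA-seq"}}
})
```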
## Data Lineage Tracking
LaminDB automatically captures execution context and relationships between data, code, and runs.
### What Gets Tracked
- **Source code**: Script/notebook content and git commit
- **Environment**: Python packages and versions
- **Input artifacts**: Data loaded during execution
- **Output artifacts**: Data created during execution
- **Execution metadata**: Timestamps, user, parameters
- **Computational dependencies**: Transform relationships
### Viewing Lineage
```python
# Visualize full lineage graph
artifact.view_lineage()
# View captured metadata
artifact.describe()
# Access related entities
artifact.run # The run that created it
artifact.run.transform # The transform (code) used
artifact.run.inputs # Input artifacts
artifact.run.report # Execution report
```
### Querying Lineage
```python
# Find all outputs from a transform
transform = ln.Transform.get(name="preprocessing.py")
ln.Artifact.filter(transform=transform).to_dataframe()
# Find all artifacts from a specific user
user = ln.User.get(handle="researcher123")
ln.Artifact.filter(created_by=user).to_dataframe()
# Find artifacts using specific inputs
input_artifact = ln.Artifact.get(key="raw/data.csv")
runs = ln.Run.filter(inputs=input_artifact)
ln.Artifact.filter(run__in=runs).to_dataframe()
```
## Versioning
LaminDB manages artifact versioning automatically when source data or code changes.
### Automatic Versioning
```python
# First version
artifact_v1 = ln.Artifact("data.csv", key="experiment/data.csv").save()
# Modify and save again - creates new version
# (modify data.csv)
artifact_v2 = ln.Artifact("data.csv", key="experiment/data.csv").save()
```
### Working with Versions
```python
# Get latest version (default)
artifact = ln.Artifact.get(key="experiment/data.csv")
# View all versions
artifact.versions.to_dataframe()
# Get specific version
artifact_v1 = artifact.versions.filter(version="1").first()
# Compare versions
v1_data = artifact_v1.load()
v2_data = artifact.load()
```
## Best Practices
1. **Use meaningful keys**: Structure keys hierarchically (e.g., `project/experiment/sample.h5ad`)
2. **Add descriptions**: Help future users understand artifact contents
3. **Track consistently**: Call `ln.track()` at the start of every analysis
4. **Define features upfront**: Create feature registry before annotation
5. **Use typed features**: Specify dtypes for better validation
6. **Leverage versioning**: Don't create new keys for minor changes
7. **Document transforms**: Add docstrings to tracked functions
8. **Set projects**: Group related work for easier organization and access control
9. **Query efficiently**: Use filters before loading large datasets
10. **Visualize lineage**: Use `view_lineage()` to understand data provenance

# LaminDB Data Management
This document covers querying, searching, filtering, and streaming data in LaminDB, as well as best practices for organizing and accessing datasets.
## Registry Overview
View available registries and their contents:
```python
import lamindb as ln
# View all registries across modules
ln.view()
# View latest 100 artifacts
ln.Artifact.to_dataframe()
# View other registries
ln.Transform.to_dataframe()
ln.Run.to_dataframe()
ln.User.to_dataframe()
```
## Lookup for Quick Access
For registries with fewer than 100k records, `Lookup` objects enable convenient auto-complete:
```python
# Create lookup
records = ln.Record.lookup()
# Access by name (auto-complete enabled in IDEs)
experiment_1 = records.experiment_1
sample_a = records.sample_a
# Works with biological ontologies too
import bionty as bt
cell_types = bt.CellType.lookup()
t_cell = cell_types.t_cell
```
## Retrieving Single Records
### Using get()
Retrieve exactly one record (errors if zero or multiple matches):
```python
# By UID
artifact = ln.Artifact.get("aRt1Fact0uid000")
# By field
artifact = ln.Artifact.get(key="data/experiment.h5ad")
user = ln.User.get(handle="researcher123")
# By ontology ID (for bionty)
cell_type = bt.CellType.get(ontology_id="CL:0000084")
```
### Using one() and one_or_none()
```python
# Get exactly one from QuerySet (errors if 0 or >1)
artifact = ln.Artifact.filter(key="data.csv").one()
# Get one or None (errors if >1)
artifact = ln.Artifact.filter(key="maybe_data.csv").one_or_none()
# Get first match
artifact = ln.Artifact.filter(suffix=".h5ad").first()
```
## Filtering Data
The `filter()` method returns a QuerySet for flexible retrieval:
```python
# Basic filtering
artifacts = ln.Artifact.filter(suffix=".h5ad")
artifacts.to_dataframe()
# Multiple conditions (AND logic)
artifacts = ln.Artifact.filter(
suffix=".h5ad",
created_by=user
)
# Comparison operators
ln.Artifact.filter(size__gt=1e6).to_dataframe() # Greater than
ln.Artifact.filter(size__gte=1e6).to_dataframe() # Greater than or equal
ln.Artifact.filter(size__lt=1e9).to_dataframe() # Less than
ln.Artifact.filter(size__lte=1e9).to_dataframe() # Less than or equal
# Range queries
ln.Artifact.filter(size__gte=1e6, size__lte=1e9).to_dataframe()
```
## Text and String Queries
```python
# Exact match
ln.Artifact.filter(description="Experiment 1").to_dataframe()
# Contains (case-sensitive)
ln.Artifact.filter(description__contains="RNA").to_dataframe()
# Case-insensitive contains
ln.Artifact.filter(description__icontains="rna").to_dataframe()
# Starts with
ln.Artifact.filter(key__startswith="experiments/").to_dataframe()
# Ends with
ln.Artifact.filter(key__endswith=".csv").to_dataframe()
# IN list
ln.Artifact.filter(suffix__in=[".h5ad", ".csv", ".parquet"]).to_dataframe()
```
## Feature-Based Queries
Query artifacts by their annotated features:
```python
# Filter by feature value
ln.Artifact.filter(cell_type="T cell").to_dataframe()
ln.Artifact.filter(treatment="DMSO").to_dataframe()
# Include features in output
ln.Artifact.filter(treatment="DMSO").to_dataframe(include="features")
# Nested dictionary access
ln.Artifact.filter(study_metadata__assay="RNA-seq").to_dataframe()
ln.Artifact.filter(study_metadata__detail1="123").to_dataframe()
# Check annotation status
ln.Artifact.filter(cell_type__isnull=False).to_dataframe() # Has annotation
ln.Artifact.filter(treatment__isnull=True).to_dataframe() # Missing annotation
```
## Traversing Related Registries
Django's double-underscore syntax enables queries across related tables:
```python
# Find artifacts by creator handle
ln.Artifact.filter(created_by__handle="researcher123").to_dataframe()
ln.Artifact.filter(created_by__handle__startswith="test").to_dataframe()
# Find artifacts by transform name
ln.Artifact.filter(transform__name="preprocess.py").to_dataframe()
# Find artifacts measuring specific genes
ln.Artifact.filter(feature_sets__genes__symbol="CD8A").to_dataframe()
ln.Artifact.filter(feature_sets__genes__ensembl_gene_id="ENSG00000153563").to_dataframe()
# Find runs with specific parameters
ln.Run.filter(params__learning_rate=0.01).to_dataframe()
ln.Run.filter(params__downsample=True).to_dataframe()
# Find artifacts from specific project
project = ln.Project.get(name="Cancer Study")
ln.Artifact.filter(projects=project).to_dataframe()
```
## Ordering Results
```python
# Order by field (ascending)
ln.Artifact.filter(suffix=".h5ad").order_by("created_at").to_dataframe()
# Order descending
ln.Artifact.filter(suffix=".h5ad").order_by("-created_at").to_dataframe()
# Multiple order fields
ln.Artifact.order_by("-created_at", "size").to_dataframe()
```
## Advanced Logical Queries
### OR Logic
```python
from lamindb import Q
# OR condition
artifacts = ln.Artifact.filter(
Q(suffix=".jpg") | Q(suffix=".png")
).to_dataframe()
# Complex OR with multiple conditions
artifacts = ln.Artifact.filter(
Q(suffix=".h5ad", size__gt=1e6) | Q(suffix=".csv", size__lt=1e3)
).to_dataframe()
```
### NOT Logic
```python
# Exclude condition
artifacts = ln.Artifact.filter(
~Q(suffix=".tmp")
).to_dataframe()
# Complex exclusion
artifacts = ln.Artifact.filter(
~Q(created_by__handle="testuser")
).to_dataframe()
```
### Combining AND, OR, NOT
```python
# Complex query
artifacts = ln.Artifact.filter(
(Q(suffix=".h5ad") | Q(suffix=".csv")) &
Q(size__gt=1e6) &
~Q(created_by__handle__startswith="test")
).to_dataframe()
```
## Search Functionality
Full-text search across registry fields:
```python
# Basic search
ln.Artifact.search("iris").to_dataframe()
ln.User.search("smith").to_dataframe()
# Search in specific registry
bt.CellType.search("T cell").to_dataframe()
bt.Gene.search("CD8").to_dataframe()
```
## Working with QuerySets
QuerySets are lazy - they don't hit the database until evaluated:
```python
# Create query (no database hit)
qs = ln.Artifact.filter(suffix=".h5ad")
# Evaluate in different ways
df = qs.to_dataframe() # As pandas DataFrame
list_records = list(qs) # As Python list
count = qs.count() # Count only
exists = qs.exists() # Boolean check
# Iteration
for artifact in qs:
print(artifact.key, artifact.size)
# Slicing
first_10 = qs[:10]
next_10 = qs[10:20]
```
## Chaining Filters
```python
# Build query incrementally
qs = ln.Artifact.filter(suffix=".h5ad")
qs = qs.filter(size__gt=1e6)
qs = qs.filter(created_at__year=2025)
qs = qs.order_by("-created_at")
# Execute
results = qs.to_dataframe()
```
## Streaming Large Datasets
For datasets too large to fit in memory, use streaming access:
### Streaming Files
```python
# Open file stream
artifact = ln.Artifact.get(key="large_file.csv")
with artifact.open() as f:
# Read in chunks
chunk = f.read(10000) # Read 10KB
# Process chunk
```
### Array Slicing
For array-based formats (Zarr, HDF5, AnnData):
```python
# Get backing file without loading
artifact = ln.Artifact.get(key="large_data.h5ad")
adata = artifact.backed() # Returns backed AnnData
# Slice specific portions
subset = adata[:1000, :] # First 1000 cells
genes_of_interest = adata[:, ["CD4", "CD8A", "CD8B"]]
# Stream batches
for i in range(0, adata.n_obs, 1000):
batch = adata[i:i+1000, :]
# Process batch
```
### Iterator Access
```python
# Process large collections incrementally
artifacts = ln.Artifact.filter(suffix=".fastq.gz")
for artifact in artifacts.iterator(chunk_size=10):
# Process 10 at a time
path = artifact.cache()
# Analyze file
```
## Aggregation and Statistics
```python
# Count records
ln.Artifact.filter(suffix=".h5ad").count()
# Distinct values
ln.Artifact.values_list("suffix", flat=True).distinct()
# Aggregation (requires Django ORM knowledge)
from django.db.models import Sum, Avg, Max, Min
# Total size of all artifacts
ln.Artifact.aggregate(Sum("size"))
# Average artifact size by suffix
ln.Artifact.values("suffix").annotate(avg_size=Avg("size"))
```
## Caching and Performance
```python
# Check cache location
ln.settings.cache_dir
# Configure cache location (CLI, run in a shell):
#   lamin cache set /path/to/cache
# Clear cache for specific artifact
artifact.delete_cache()
# Get cached path (downloads if needed)
path = artifact.cache()
# Check if cached
if artifact.is_cached():
path = artifact.cache()
```
## Organizing Data with Keys
Best practices for structuring keys:
```python
# Hierarchical organization
ln.Artifact("data.h5ad", key="project/experiment/batch1/data.h5ad").save()
ln.Artifact("data.h5ad", key="scrna/2025/oct/sample_001.h5ad").save()
# Browse by prefix
ln.Artifact.filter(key__startswith="scrna/2025/oct/").to_dataframe()
# Version in key (alternative to built-in versioning)
ln.Artifact("data.h5ad", key="data/processed/v1/final.h5ad").save()
ln.Artifact("data.h5ad", key="data/processed/v2/final.h5ad").save()
```
## Collections
Group related artifacts into collections:
```python
# Create collection
collection = ln.Collection(
[artifact1, artifact2, artifact3],
name="scRNA-seq batch 1-3",
description="Complete dataset across three batches"
).save()
# Access collection members
for artifact in collection.artifacts:
print(artifact.key)
# Query collections
ln.Collection.filter(name__contains="batch").to_dataframe()
```
## Best Practices
1. **Use filters before loading**: Query metadata before accessing file contents
2. **Leverage QuerySets**: Build queries incrementally for complex conditions
3. **Stream large files**: Don't load entire datasets into memory unnecessarily
4. **Structure keys hierarchically**: Makes browsing and filtering easier
5. **Use search for discovery**: When you don't know exact field values
6. **Cache strategically**: Configure cache location based on storage capacity
7. **Index features**: Define features upfront for efficient feature-based queries
8. **Use collections**: Group related artifacts for dataset-level operations
9. **Order results**: Sort by creation date or other fields for consistent retrieval
10. **Check existence**: Use `exists()` or `one_or_none()` to avoid errors
## Common Query Patterns
```python
# Recent artifacts
ln.Artifact.order_by("-created_at")[:10].to_dataframe()
# My artifacts
me = ln.setup.settings.user
ln.Artifact.filter(created_by=me).to_dataframe()
# Large files
ln.Artifact.filter(size__gt=1e9).order_by("-size").to_dataframe()
# This month's data
from datetime import datetime
ln.Artifact.filter(
created_at__year=2025,
created_at__month=10
).to_dataframe()
# Validated datasets with specific features
ln.Artifact.filter(
is_valid=True,
cell_type__isnull=False
).to_dataframe(include="features")
```

# LaminDB Integrations
This document covers LaminDB integrations with workflow managers, MLOps platforms, visualization tools, and other third-party systems.
## Overview
LaminDB supports extensive integrations across data storage, computational workflows, machine learning platforms, and visualization tools, enabling seamless incorporation into existing data science and bioinformatics pipelines.
## Data Storage Integrations
### Local Filesystem
```python
import lamindb as ln
# Initialize with local storage (CLI, run in a shell):
#   lamin init --storage ./mydata
# Save artifacts to local storage
artifact = ln.Artifact("data.csv", key="local/data.csv").save()
# Load from local storage
data = artifact.load()
```
### AWS S3
```python
# Initialize with S3 storage (CLI, run in a shell):
#   lamin init --storage s3://my-bucket/path \
#       --db postgresql://user:pwd@host:port/db
# Artifacts automatically sync to S3
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()
# Transparent S3 access
data = artifact.load() # Downloads from S3 if not cached
```
### S3-Compatible Services
Support for MinIO, Cloudflare R2, and other S3-compatible endpoints:
```bash
# Initialize with custom S3 endpoint
lamin init --storage 's3://bucket?endpoint_url=http://minio.example.com:9000'
# Configure credentials
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
```
### Google Cloud Storage
```python
# Install GCP extras and initialize with GCS (CLI, run in a shell):
#   pip install 'lamindb[gcp]'
#   lamin init --storage gs://my-bucket/path \
#       --db postgresql://user:pwd@host:port/db
# Artifacts sync to GCS
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()
```
### HTTP/HTTPS (Read-Only)
```python
# Access remote files without copying
artifact = ln.Artifact(
"https://example.com/data.csv",
key="remote/data.csv"
).save()
# Stream remote content
with artifact.open() as f:
data = f.read()
```
### HuggingFace Datasets
```python
# Access HuggingFace datasets
from datasets import load_dataset
dataset = load_dataset("squad", split="train")
# Register as LaminDB artifact
artifact = ln.Artifact.from_dataframe(
dataset.to_pandas(),
key="hf/squad_train.parquet",
description="SQuAD training data from HuggingFace"
).save()
```
## Workflow Manager Integrations
### Nextflow
Track Nextflow pipeline execution and outputs:
```python
# In your Nextflow process script
import lamindb as ln
# Initialize tracking
ln.track()
# Your Nextflow process logic
input_artifact = ln.Artifact.get(key="${input_key}")
data = input_artifact.load()
# Process data
result = process_data(data)
# Save output
output_artifact = ln.Artifact.from_dataframe(
result,
key="${output_key}"
).save()
ln.finish()
```
**Nextflow config example:**
```nextflow
process ANALYZE {
input:
val input_key
output:
path "result.csv"
script:
"""
#!/usr/bin/env python
import lamindb as ln
ln.track()
artifact = ln.Artifact.get(key="${input_key}")
# Process and save
ln.finish()
"""
}
```
### Snakemake
Integrate LaminDB into Snakemake workflows:
```python
# In Snakemake rule
rule process_data:
input:
"data/input.csv"
output:
"data/output.csv"
run:
import lamindb as ln
ln.track()
# Load input artifact
artifact = ln.Artifact.get(key="inputs/data.csv")
data = artifact.load()
# Process
result = analyze(data)
# Save output
result.to_csv(output[0])
ln.Artifact(output[0], key="outputs/result.csv").save()
ln.finish()
```
### Redun
Track Redun task execution:
```python
from redun import task
import lamindb as ln
@task()
@ln.tracked()
def process_dataset(input_key: str, output_key: str):
"""Redun task with LaminDB tracking."""
# Load input
artifact = ln.Artifact.get(key=input_key)
data = artifact.load()
# Process
result = transform(data)
# Save output
ln.Artifact.from_dataframe(result, key=output_key).save()
return output_key
# Redun automatically tracks lineage alongside LaminDB
```
## MLOps Platform Integrations
### Weights & Biases (W&B)
Combine W&B experiment tracking with LaminDB data management:
```python
import wandb
import lamindb as ln
# Initialize both
wandb.init(project="my-project", name="experiment-1")
ln.track(params={"learning_rate": 0.01, "batch_size": 32})
# Load training data
train_artifact = ln.Artifact.get(key="datasets/train.parquet")
train_data = train_artifact.load()
# Train model
model = train_model(train_data)
# Log to W&B
wandb.log({"accuracy": 0.95, "loss": 0.05})
# Save model in LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
"model.pkl",
key="models/experiment-1.pkl",
description=f"Model from W&B run {wandb.run.id}"
).save()
# Link W&B run ID
model_artifact.features.add_values({"wandb_run_id": wandb.run.id})
ln.finish()
wandb.finish()
```
### MLflow
Integrate MLflow model tracking with LaminDB:
```python
import mlflow
import lamindb as ln
# Start runs
mlflow.start_run()
ln.track()
# Log parameters to both
params = {"max_depth": 5, "n_estimators": 100}
mlflow.log_params(params)
ln.context.params = params
# Load data from LaminDB
data_artifact = ln.Artifact.get(key="datasets/features.parquet")
X = data_artifact.load()
# Train and log model
model = train_model(X)
mlflow.sklearn.log_model(model, "model")
# Save to LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
"model.pkl",
key=f"models/{mlflow.active_run().info.run_id}.pkl"
).save()
mlflow.end_run()
ln.finish()
```
### HuggingFace Transformers
Track model fine-tuning with LaminDB:
```python
from transformers import Trainer, TrainingArguments
import lamindb as ln
ln.track(params={"model": "bert-base", "epochs": 3})
# Load training data
train_artifact = ln.Artifact.get(key="datasets/train_tokenized.parquet")
train_dataset = train_artifact.load()
# Configure trainer
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
)
# Train
trainer.train()
# Save model to LaminDB
trainer.save_model("./model")
model_artifact = ln.Artifact(
"./model",
key="models/bert_finetuned",
description="BERT fine-tuned on custom dataset"
).save()
ln.finish()
```
### scVI-tools
Single-cell analysis with scVI and LaminDB:
```python
import scvi
import lamindb as ln
ln.track()
# Load data
adata_artifact = ln.Artifact.get(key="scrna/raw_counts.h5ad")
adata = adata_artifact.load()
# Setup scVI
scvi.model.SCVI.setup_anndata(adata, layer="counts")
# Train model
model = scvi.model.SCVI(adata)
model.train()
# Save latent representation
adata.obsm["X_scvi"] = model.get_latent_representation()
# Save results
result_artifact = ln.Artifact.from_anndata(
adata,
key="scrna/scvi_latent.h5ad",
description="scVI latent representation"
).save()
ln.finish()
```
## Array Store Integrations
### TileDB-SOMA
Scalable array storage with cellxgene support:
```python
import tiledbsoma as soma
import lamindb as ln
# Create SOMA experiment
uri = "tiledb://my-namespace/experiment"
with soma.Experiment.create(uri) as exp:
# Add measurements
exp.add_new_collection("RNA")
# Register in LaminDB
artifact = ln.Artifact(
uri,
key="cellxgene/experiment.soma",
description="TileDB-SOMA experiment"
).save()
# Query with SOMA
with soma.Experiment.open(uri) as exp:
obs = exp.obs.read().to_pandas()
```
### DuckDB
Query artifacts with DuckDB:
```python
import duckdb
import lamindb as ln
# Get artifact
artifact = ln.Artifact.get(key="datasets/large_data.parquet")
# Query with DuckDB (without loading full file)
path = artifact.cache()
result = duckdb.query(f"""
SELECT cell_type, COUNT(*) as count
FROM read_parquet('{path}')
GROUP BY cell_type
ORDER BY count DESC
""").to_df()
# Save query result
result_artifact = ln.Artifact.from_dataframe(
result,
key="analysis/cell_type_counts.parquet"
).save()
```
## Visualization Integrations
### Vitessce
Create interactive visualizations:
```python
from vitessce import VitessceConfig
import lamindb as ln
# Load spatial data
artifact = ln.Artifact.get(key="spatial/visium_slide.h5ad")
adata = artifact.load()
# Create Vitessce configuration
vc = VitessceConfig.from_object(adata)
# Save configuration
import json
config_file = "vitessce_config.json"
with open(config_file, "w") as f:
json.dump(vc.to_dict(), f)
# Register configuration
config_artifact = ln.Artifact(
config_file,
key="visualizations/spatial_config.json",
description="Vitessce visualization config"
).save()
```
## Schema Module Integrations
### Bionty (Biological Ontologies)
```python
import bionty as bt
# Import biological ontologies
bt.CellType.import_source()
bt.Gene.import_source(organism="human")
# Use in data curation
cell_types = bt.CellType.from_values(adata.obs.cell_type)
```
### WetLab
Track wet lab experiments:
```python
# Install the wetlab module (CLI, run in a shell):
#   pip install 'lamindb[wetlab]'
# Use wetlab registries
import lamindb_wetlab as wetlab
# Track experiments, samples, protocols
experiment = wetlab.Experiment(name="RNA-seq batch 1").save()
```
### Clinical Data (OMOP)
```python
# Install the clinical module (CLI, run in a shell):
#   pip install 'lamindb[clinical]'
# Use OMOP common data model
import lamindb_clinical as clinical
# Track clinical data
patient = clinical.Patient(patient_id="P001").save()
```
## Git Integration
### Sync with Git Repositories
```python
# Configure git sync via an environment variable (shell):
#   export LAMINDB_SYNC_GIT_REPO=https://github.com/user/repo.git
# Or programmatically
ln.settings.sync_git_repo = "https://github.com/user/repo.git"
# Set the development directory (CLI, run in a shell):
#   lamin settings set dev-dir .
# Scripts tracked with git commits
ln.track() # Automatically captures git commit hash
# ... your code ...
ln.finish()
# View git information
transform = ln.Transform.get(name="analysis.py")
transform.source_code # Shows code at git commit
transform.hash # Git commit hash
```
## Enterprise Integrations
### Benchling
Sync with Benchling registries (requires team/enterprise plan):
```python
# Configure Benchling connection (contact LaminDB team)
# Syncs schemas and data from Benchling registries
# Access synced Benchling data
# Details available through enterprise support
```
## Custom Integration Patterns
### REST API Integration
```python
import requests
import lamindb as ln
ln.track()
# Fetch from API
response = requests.get("https://api.example.com/data")
data = response.json()
# Convert to DataFrame
import pandas as pd
df = pd.DataFrame(data)
# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
df,
key="api/fetched_data.parquet",
description="Data fetched from external API"
).save()
artifact.features.add_values({"api_url": response.url})
ln.finish()
```
### Database Integration
```python
import sqlalchemy as sa
import pandas as pd
import lamindb as ln
ln.track()
# Connect to external database
engine = sa.create_engine("postgresql://user:pwd@host:port/db")
# Query data
query = "SELECT * FROM experiments WHERE date > '2025-01-01'"
df = pd.read_sql(query, engine)
# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
df,
key="external_db/experiments_2025.parquet",
description="Experiments from external database"
).save()
ln.finish()
```
## Croissant Metadata
Export datasets with Croissant metadata format:
```python
# Create artifact with rich metadata
artifact = ln.Artifact.from_dataframe(
df,
key="datasets/published_data.parquet",
description="Published dataset with Croissant metadata"
).save()
# Export Croissant metadata (requires additional configuration)
# Enables dataset discovery and interoperability
```
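Until that configuration is in place, a hand-rolled stand-in can still travel with the dataset; the sketch below assembles a simplified Croissant-style JSON-LD document (field names are an illustrative subset, not the full spec) and registers it as its own artifact:
```python
import json
import lamindb as ln

# Simplified Croissant-style metadata (illustrative field subset, not the full spec)
croissant = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "published_data",
    "description": "Published dataset with Croissant metadata",
    "distribution": [
        {"@type": "FileObject", "contentUrl": "datasets/published_data.parquet"}
    ],
}
with open("published_data_croissant.json", "w") as f:
    json.dump(croissant, f, indent=2)

ln.Artifact(
    "published_data_croissant.json",
    key="datasets/published_data_croissant.json",
    description="Croissant-style metadata for published_data.parquet",
).save()
```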
## Best Practices for Integrations
1. **Track consistently**: Use `ln.track()` in all integrated workflows
2. **Link IDs**: Store external system IDs (W&B run ID, MLflow experiment ID) as features (see the sketch after this list)
3. **Centralize data**: Use LaminDB as single source of truth for data artifacts
4. **Sync parameters**: Log parameters to both LaminDB and ML platforms
5. **Version together**: Keep code (git), data (LaminDB), and experiments (ML platform) in sync
6. **Cache strategically**: Configure appropriate cache locations for cloud storage
7. **Use feature sets**: Link ontology terms from Bionty to artifacts
8. **Document integrations**: Add descriptions explaining integration context
9. **Test incrementally**: Verify integration with small datasets first
10. **Monitor lineage**: Use `view_lineage()` to ensure integration tracking works
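As an example of practice 2, external run IDs can be attached as feature values (the IDs below are placeholders, and the corresponding features must already exist in your Feature registry):
```python
import lamindb as ln

# Attach external experiment-tracker IDs to an artifact (placeholder values)
artifact = ln.Artifact.get(key="models/model.pkl")
artifact.features.add_values({
    "wandb_run_id": "abc123",
    "mlflow_run_id": "run-42",
})
```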
## Troubleshooting Common Issues
**Issue: S3 credentials not found**
```bash
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
```
**Issue: GCS authentication failure**
```bash
gcloud auth application-default login
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```
**Issue: Git sync not working**
```bash
# Ensure git repo is set
lamin settings get sync-git-repo
# Ensure you're in git repo
git status
# Commit changes before tracking
git add .
git commit -m "Update analysis"
# then, back in Python, re-run ln.track()
```
**Issue: MLflow artifacts not syncing**
```python
# Save explicitly to both systems
mlflow.log_artifact("model.pkl")
ln.Artifact("model.pkl", key="models/model.pkl").save()
```
# LaminDB Ontology Management
This document covers biological ontology management in LaminDB through the Bionty plugin, including accessing, searching, and annotating data with standardized biological terms.
## Overview
LaminDB integrates the `bionty` plugin to manage standardized biological ontologies, enabling consistent metadata curation and data annotation across research projects. Bionty provides access to 20+ curated biological ontologies covering genes, proteins, cell types, tissues, diseases, and more.
## Available Ontologies
LaminDB provides access to multiple curated ontology sources:
| Registry | Ontology Source | Description |
|----------|----------------|-------------|
| **Gene** | Ensembl | Genes across organisms (human, mouse, etc.) |
| **Protein** | UniProt | Protein sequences and annotations |
| **CellType** | Cell Ontology (CL) | Standardized cell type classifications |
| **CellLine** | Cell Line Ontology (CLO) | Cell line annotations |
| **Tissue** | Uberon | Anatomical structures and tissues |
| **Disease** | Mondo, DOID | Disease classifications |
| **Phenotype** | Human Phenotype Ontology (HPO) | Phenotypic abnormalities |
| **Pathway** | Gene Ontology (GO) | Biological pathways and processes |
| **ExperimentalFactor** | Experimental Factor Ontology (EFO) | Experimental variables |
| **DevelopmentalStage** | Various | Developmental stages across organisms |
| **Ethnicity** | HANCESTRO | Human ancestry ontology |
| **Drug** | DrugBank | Drug compounds |
| **Organism** | NCBItaxon | Taxonomic classifications |
## Installation and Import
```python
# Install (bionty is included with lamindb): pip install lamindb
# Import
import lamindb as ln
import bionty as bt
```
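Before importing anything, you can preview what a registry's public source covers; a sketch assuming bionty's `public()` accessor:
```python
import bionty as bt

# Browse the public Cell Ontology source without importing it into your registry
public_cell_types = bt.CellType.public()
print(public_cell_types)  # source name, version, number of terms
print(public_cell_types.search("T cell").head())
```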
## Importing Public Ontologies
Populate your registry with a public ontology source:
```python
# Import Cell Ontology
bt.CellType.import_source()
# Import organism-specific genes
bt.Gene.import_source(organism="human")
bt.Gene.import_source(organism="mouse")
# Import tissues
bt.Tissue.import_source()
# Import diseases
bt.Disease.import_source(source="mondo") # Mondo Disease Ontology
bt.Disease.import_source(source="doid") # Disease Ontology
```
## Searching and Accessing Records
### Keyword Search
```python
# Search cell types
bt.CellType.search("T cell").to_dataframe()
bt.CellType.search("gamma-delta").to_dataframe()
# Search genes
bt.Gene.search("CD8").to_dataframe()
bt.Gene.search("TP53").to_dataframe()
# Search diseases
bt.Disease.search("cancer").to_dataframe()
# Search tissues
bt.Tissue.search("brain").to_dataframe()
```
### Auto-Complete Lookup
For registries with fewer than 100k records:
```python
# Create lookup object
cell_types = bt.CellType.lookup()
# Access by name (auto-complete in IDEs)
t_cell = cell_types.t_cell
hsc = cell_types.hematopoietic_stem_cell
# Similarly for other registries
genes = bt.Gene.lookup()
cd8a = genes.cd8a
```
### Exact Field Matching
```python
# By ontology ID
cell_type = bt.CellType.get(ontology_id="CL:0000798")
disease = bt.Disease.get(ontology_id="MONDO:0004992")
# By name
cell_type = bt.CellType.get(name="T cell")
gene = bt.Gene.get(symbol="CD8A")
# By Ensembl ID
gene = bt.Gene.get(ensembl_gene_id="ENSG00000153563")
```
## Ontological Hierarchies
### Exploring Relationships
```python
# Get a cell type
gdt_cell = bt.CellType.get(ontology_id="CL:0000798") # gamma-delta T cell
# View direct parents
gdt_cell.parents.to_dataframe()
# Walk up the hierarchy to collect ancestors (follows one parent path at each level)
ancestors = []
current = gdt_cell
while current.parents.exists():
    parent = current.parents.first()
    ancestors.append(parent)
    current = parent
# View direct children
gdt_cell.children.to_dataframe()
# View all descendants recursively
gdt_cell.query_children().to_dataframe()
```
### Visualizing Hierarchies
```python
# Visualize parent hierarchy
gdt_cell.view_parents()
# Include children in visualization
gdt_cell.view_parents(with_children=True)
# Get all related terms for visualization
t_cell = bt.CellType.get(name="T cell")
t_cell.view_parents(with_children=True) # Shows T cell subtypes
```
## Standardizing and Validating Data
### Validation
Check if terms exist in the ontology:
```python
# Validate cell types
bt.CellType.validate(["T cell", "B cell", "invalid_cell"])
# Returns: [True, True, False]
# Validate genes
bt.Gene.validate(["CD8A", "TP53", "FAKEGENE"], organism="human")
# Returns: [True, True, False]
# Check which terms are invalid
terms = ["T cell", "fat cell", "neuron", "invalid_term"]
invalid = [t for t, valid in zip(terms, bt.CellType.validate(terms)) if not valid]
print(f"Invalid terms: {invalid}")
```
### Standardization with Synonyms
Convert non-standard terms to validated names:
```python
# Standardize cell type names
bt.CellType.standardize(["fat cell", "blood forming stem cell"])
# Returns: ['adipocyte', 'hematopoietic stem cell']
# Standardize genes
bt.Gene.standardize(["BRCA-1", "p53"], organism="human")
# Returns: ['BRCA1', 'TP53']
# Handle mixed valid/invalid terms
terms = ["T cell", "T lymphocyte", "invalid"]
standardized = bt.CellType.standardize(terms)
# Returns standardized names where possible
```
### Loading Validated Records
```python
# Load records from values (including synonyms)
records = bt.CellType.from_values(["fat cell", "blood forming stem cell"])
# Returns list of CellType records
for record in records:
print(record.name, record.ontology_id)
# Use with gene symbols
genes = bt.Gene.from_values(["CD8A", "CD8B"], organism="human")
```
## Annotating Datasets
### Annotating AnnData
```python
import anndata as ad
import lamindb as ln
# Load example data
adata = ad.read_h5ad("data.h5ad")
# Validate and retrieve matching records
cell_types = bt.CellType.from_values(adata.obs.cell_type)
# Create artifact with annotations
artifact = ln.Artifact.from_anndata(
adata,
key="scrna/annotated_data.h5ad",
description="scRNA-seq data with validated cell type annotations"
).save()
# Link ontology records to artifact
artifact.feature_sets.add_ontology(cell_types)
```
### Annotating DataFrames
```python
import pandas as pd
# Create DataFrame with biological entities
df = pd.DataFrame({
"cell_type": ["T cell", "B cell", "NK cell"],
"tissue": ["blood", "spleen", "liver"],
"disease": ["healthy", "lymphoma", "healthy"]
})
# Validate and standardize
df["cell_type"] = bt.CellType.standardize(df["cell_type"])
df["tissue"] = bt.Tissue.standardize(df["tissue"])
# Create artifact
artifact = ln.Artifact.from_dataframe(
df,
key="metadata/sample_info.parquet"
).save()
# Link ontology records
cell_type_records = bt.CellType.from_values(df["cell_type"])
tissue_records = bt.Tissue.from_values(df["tissue"])
artifact.feature_sets.add_ontology(cell_type_records)
artifact.feature_sets.add_ontology(tissue_records)
```
## Managing Custom Terms and Hierarchies
### Adding Custom Terms
```python
# Register new term not in public ontology
my_celltype = bt.CellType(name="my_novel_T_cell_subtype").save()
# Establish parent-child relationship
parent = bt.CellType.get(name="T cell")
my_celltype.parents.add(parent)
# Verify relationship
my_celltype.parents.to_dataframe()
parent.children.to_dataframe() # Should include my_celltype
```
### Adding Synonyms
```python
# Add synonyms for standardization
hsc = bt.CellType.get(name="hematopoietic stem cell")
hsc.add_synonym("HSC")
hsc.add_synonym("blood stem cell")
hsc.add_synonym("hematopoietic progenitor")
# Set abbreviation
hsc.set_abbr("HSC")
# Now standardization works with synonyms
bt.CellType.standardize(["HSC", "blood stem cell"])
# Returns: ['hematopoietic stem cell', 'hematopoietic stem cell']
```
### Creating Custom Hierarchies
```python
# Build custom cell type hierarchy
immune_cell = bt.CellType.get(name="immune cell")
# Add custom subtypes
my_subtype1 = bt.CellType(name="custom_immune_subtype_1").save()
my_subtype2 = bt.CellType(name="custom_immune_subtype_2").save()
# Link to parent
my_subtype1.parents.add(immune_cell)
my_subtype2.parents.add(immune_cell)
# Create sub-subtypes
my_subsubtype = bt.CellType(name="custom_sub_subtype").save()
my_subsubtype.parents.add(my_subtype1)
# Visualize custom hierarchy
immune_cell.view_parents(with_children=True)
```
## Multi-Organism Support
For organism-aware registries like Gene:
```python
# Set global organism
bt.settings.organism = "human"
# Validate human genes
bt.Gene.validate(["TCF7", "CD8A"], organism="human")
# Load genes for specific organism
human_genes = bt.Gene.from_values(["CD8A", "TP53"], organism="human")
mouse_genes = bt.Gene.from_values(["Cd8a", "Trp53"], organism="mouse")
# Search organism-specific genes
bt.Gene.search("CD8", organism="human").to_dataframe()
bt.Gene.search("Cd8", organism="mouse").to_dataframe()
# Switch organism context
bt.settings.organism = "mouse"
genes = bt.Gene.from_source(symbol="Ap5b1")
```
## Public Ontology Lookup
Access terms from public ontologies without importing:
```python
# Interactive lookup in public sources
cell_types_public = bt.CellType.lookup(public=True)
# Access public terms
hepatocyte = cell_types_public.hepatocyte
# Import specific term
hepatocyte_local = bt.CellType.from_source(name="hepatocyte")
# Or import by ontology ID
specific_cell = bt.CellType.from_source(ontology_id="CL:0000182")
```
## Version Tracking
LaminDB automatically tracks ontology versions:
```python
# View current source versions
bt.Source.filter(currently_used=True).to_dataframe()
# Check which source a record derives from
cell_type = bt.CellType.get(name="hepatocyte")
cell_type.source # Returns Source metadata
# View source details
source = cell_type.source
print(source.name) # e.g., "cl"
print(source.version) # e.g., "2023-05-18"
print(source.url) # Ontology URL
```
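To list every registered version of a given source (here the Cell Ontology, whose source name is "cl" as shown above):
```python
import bionty as bt

# All registered versions of the Cell Ontology source
bt.Source.filter(name="cl").to_dataframe()
```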
## Ontology Integration Workflows
### Workflow 1: Validate Existing Data
```python
# Load data with biological annotations
adata = ad.read_h5ad("uncurated_data.h5ad")
# Validate cell types
validation = bt.CellType.validate(adata.obs.cell_type)
# Identify invalid terms
invalid_idx = [i for i, v in enumerate(validation) if not v]
invalid_terms = adata.obs.cell_type.iloc[invalid_idx].unique()
print(f"Invalid cell types: {invalid_terms}")
# Fix invalid terms manually or with standardization
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs.cell_type)
# Re-validate
validation = bt.CellType.validate(adata.obs.cell_type)
assert all(validation), "All terms should now be valid"
```
### Workflow 2: Curate and Annotate
```python
import lamindb as ln
ln.track() # Start tracking
# Load data
df = pd.read_csv("experimental_data.csv")
# Standardize using ontologies
df["cell_type"] = bt.CellType.standardize(df["cell_type"])
df["tissue"] = bt.Tissue.standardize(df["tissue"])
# Create curated artifact
artifact = ln.Artifact.from_dataframe(
df,
key="curated/experiment_2025_10.parquet",
description="Curated experimental data with ontology-validated annotations"
).save()
# Link ontology records
artifact.feature_sets.add_ontology(bt.CellType.from_values(df["cell_type"]))
artifact.feature_sets.add_ontology(bt.Tissue.from_values(df["tissue"]))
ln.finish() # Complete tracking
```
### Workflow 3: Cross-Organism Gene Mapping
```python
# Get human genes
human_genes = ["CD8A", "CD8B", "TP53"]
human_records = bt.Gene.from_values(human_genes, organism="human")
# Find mouse orthologs (requires external mapping)
# LaminDB doesn't provide built-in ortholog mapping
# Use external tools like Ensembl BioMart or homologene
mouse_orthologs = ["Cd8a", "Cd8b", "Trp53"]
mouse_records = bt.Gene.from_values(mouse_orthologs, organism="mouse")
```
## Querying Ontology-Annotated Data
```python
# Find all datasets with specific cell type
t_cell = bt.CellType.get(name="T cell")
ln.Artifact.filter(feature_sets__cell_types=t_cell).to_dataframe()
# Find datasets measuring specific genes
cd8a = bt.Gene.get(symbol="CD8A", organism="human")
ln.Artifact.filter(feature_sets__genes=cd8a).to_dataframe()
# Query across ontology hierarchy
# Find all datasets with T cell or T cell subtypes
t_cell_subtypes = t_cell.query_children()
ln.Artifact.filter(
feature_sets__cell_types__in=t_cell_subtypes
).to_dataframe()
```
## Best Practices
1. **Import ontologies first**: Call `import_source()` before validation
2. **Use standardization**: Leverage synonym mapping to handle variations
3. **Validate early**: Check terms before creating artifacts
4. **Set organism context**: Specify organism for gene-related queries
5. **Add custom synonyms**: Register common variations in your domain
6. **Use public lookup**: Access `lookup(public=True)` for term discovery
7. **Track versions**: Monitor ontology source versions for reproducibility
8. **Build hierarchies**: Link custom terms to existing ontology structures
9. **Query hierarchically**: Use `query_children()` for comprehensive searches
10. **Document mappings**: Track custom term additions and relationships
## Common Ontology Operations
```python
# Check if term exists
exists = bt.CellType.filter(name="T cell").exists()
# Count terms in registry
n_cell_types = bt.CellType.filter().count()
# Get all terms with specific parent
immune_cells = bt.CellType.filter(parents__name="immune cell")
# Find orphan terms (no parents)
orphans = bt.CellType.filter(parents__isnull=True)
# Get recently added terms
from datetime import datetime, timedelta
recent = bt.CellType.filter(
created_at__gte=datetime.now() - timedelta(days=7)
)
```
# LaminDB Setup & Deployment
This document covers installation, configuration, instance management, storage options, and deployment strategies for LaminDB.
## Installation
### Basic Installation
```bash
# Install LaminDB
pip install lamindb
# Or with pip3
pip3 install lamindb
```
### Installation with Extras
Install optional dependencies for specific functionality:
```bash
# Google Cloud Platform support
pip install 'lamindb[gcp]'
# Flow cytometry formats
pip install 'lamindb[fcs]'
# Array storage and streaming (Zarr support)
pip install 'lamindb[zarr]'
# AWS S3 support (usually included by default)
pip install 'lamindb[aws]'
# Multiple extras
pip install 'lamindb[gcp,zarr,fcs]'
```
### Module Plugins
```bash
# Biological ontologies (Bionty)
pip install bionty
# Wet lab functionality
pip install lamindb-wetlab
# Clinical data (OMOP CDM)
pip install lamindb-clinical
```
### Verify Installation
```python
import lamindb as ln
print(ln.__version__)
# Check available modules
import bionty as bt
print(bt.__version__)
```
## Authentication
### Creating an Account
1. Visit https://lamin.ai
2. Sign up for a free account
3. Navigate to account settings to generate an API key
### Logging In
```bash
# Login with API key
lamin login
# You'll be prompted to enter your API key
# API key is stored locally at ~/.lamin/
```
### Authentication Details
**Data Privacy:** LaminDB authentication only collects basic metadata (email, user information). Your actual data remains private and is not sent to LaminDB servers.
**Local vs Cloud:** Authentication is required even for local-only usage to enable collaboration features and instance management.
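After logging in, you can confirm the active account from Python via the same settings object used later in this document:
```python
import lamindb as ln

# Shows the handle, email, and name of the logged-in user
print(ln.setup.settings.user)
```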
## Instance Initialization
### Local SQLite Instance
For local development and small datasets:
```bash
# Initialize in current directory
lamin init --storage ./mydata
# Initialize in specific directory
lamin init --storage /path/to/data
# Initialize with specific modules
lamin init --storage ./mydata --modules bionty
# Initialize with multiple modules
lamin init --storage ./mydata --modules bionty,wetlab
```
### Cloud Storage with SQLite
Use cloud storage but local SQLite database:
```bash
# AWS S3
lamin init --storage s3://my-bucket/path
# Google Cloud Storage
lamin init --storage gs://my-bucket/path
# S3-compatible (MinIO, Cloudflare R2)
lamin init --storage 's3://bucket?endpoint_url=http://endpoint:9000'
```
### Cloud Storage with PostgreSQL
For production deployments:
```bash
# S3 + PostgreSQL
lamin init --storage s3://my-bucket/path \
--db postgresql://user:password@hostname:5432/dbname \
--modules bionty
# GCS + PostgreSQL
lamin init --storage gs://my-bucket/path \
--db postgresql://user:password@hostname:5432/dbname \
--modules bionty
```
### Instance Naming
```bash
# Specify instance name
lamin init --storage ./mydata --name my-project
# Default name uses directory name
lamin init --storage ./mydata # Instance name: "mydata"
```
## Connecting to Instances
### Connect to Your Own Instance
```bash
# By name
lamin connect my-project
# By full path
lamin connect account_handle/my-project
```
### Connect to Shared Instance
```bash
# Connect to someone else's instance
lamin connect other-user/their-project
# Requires appropriate permissions
```
### Switching Between Instances
```bash
# List available instances
lamin info
# Switch instance
lamin connect another-instance
# Close current instance
lamin close
```
## Storage Configuration
### Local Storage
**Advantages:**
- Fast access
- No internet required
- Simple setup
**Setup:**
```bash
lamin init --storage ./data
```
### AWS S3 Storage
**Advantages:**
- Scalable
- Collaborative
- Durable
**Setup:**
```bash
# Set credentials
export AWS_ACCESS_KEY_ID=your_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1
# Initialize
lamin init --storage s3://my-bucket/project-data \
--db postgresql://user:pwd@host:5432/db
```
**S3 Permissions Required:**
```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-bucket/*",
"arn:aws:s3:::my-bucket"
]
}
]
}
```
### Google Cloud Storage
**Setup:**
```bash
# Authenticate
gcloud auth application-default login
# Or use service account
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
# Initialize
lamin init --storage gs://my-bucket/project-data \
--db postgresql://user:pwd@host:5432/db
```
### S3-Compatible Storage
For MinIO, Cloudflare R2, or other S3-compatible services:
```bash
# MinIO example
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
lamin init --storage 's3://my-bucket?endpoint_url=http://minio.example.com:9000'
# Cloudflare R2 example
export AWS_ACCESS_KEY_ID=your_r2_access_key
export AWS_SECRET_ACCESS_KEY=your_r2_secret_key
lamin init --storage 's3://bucket?endpoint_url=https://account-id.r2.cloudflarestorage.com'
```
## Database Configuration
### SQLite (Default)
**Advantages:**
- No separate database server
- Simple setup
- Good for development
**Limitations:**
- Not suitable for concurrent writes
- Limited scalability
**Setup:**
```bash
# SQLite is default
lamin init --storage ./data
# Database stored at ./data/.lamindb/
```
### PostgreSQL
**Advantages:**
- Production-ready
- Concurrent access
- Better performance at scale
**Setup:**
```bash
# Full connection string
lamin init --storage s3://bucket/path \
--db postgresql://username:password@hostname:5432/database
# With SSL
lamin init --storage s3://bucket/path \
--db "postgresql://user:pwd@host:5432/db?sslmode=require"
```
**PostgreSQL Versions:** Compatible with PostgreSQL 12+
### Database Schema Management
```bash
# Check current schema version
lamin migrate check
# Upgrade schema
lamin migrate deploy
# View migration history
lamin migrate history
```
## Cache Configuration
### Cache Directory
LaminDB maintains a local cache for cloud files:
```python
import lamindb as ln
# View cache location
print(ln.settings.cache_dir)
```
### Configure Cache Location
```bash
# Set cache directory
lamin cache set /path/to/cache
# View current cache settings
lamin cache get
```
### System-Wide Cache (Multi-User)
For shared systems with multiple users:
```bash
# Create system settings file
sudo mkdir -p /system/settings
sudo nano /system/settings/system.env
```
Add to `system.env`:
```bash
lamindb_cache_path=/shared/cache/lamindb
```
Ensure permissions:
```bash
sudo chmod 755 /shared/cache/lamindb
sudo chown -R shared-user:shared-group /shared/cache/lamindb
```
### Cache Management
```python
import lamindb as ln
# Clear cache for specific artifact
artifact = ln.Artifact.get(key="data.h5ad")
artifact.delete_cache()
# Check if artifact is cached
if artifact.is_cached():
print("Already cached")
# Manually clear entire cache
import shutil
shutil.rmtree(ln.settings.cache_dir)
```
## Settings Management
### View Current Settings
```python
import lamindb as ln
# User settings
print(ln.setup.settings.user)
# User(handle='username', email='user@email.com', name='Full Name')
# Instance settings
print(ln.setup.settings.instance)
# Instance(name='my-project', storage='s3://bucket/path')
```
### Configure Settings
```bash
# Set development directory for relative keys
lamin settings set dev-dir /path/to/project
# Configure git sync
lamin settings set sync-git-repo https://github.com/user/repo.git
# View all settings
lamin settings
```
### Environment Variables
```bash
# Cache directory
export LAMIN_CACHE_DIR=/path/to/cache
# Settings directory
export LAMIN_SETTINGS_DIR=/path/to/settings
# Git sync
export LAMINDB_SYNC_GIT_REPO=https://github.com/user/repo.git
```
## Instance Management
### Viewing Instance Information
```bash
# Current instance info
lamin info
# List all instances
lamin ls
# View instance details
lamin instance details
```
### Instance Collaboration
```bash
# Set instance visibility (requires LaminHub)
lamin instance set-visibility public
lamin instance set-visibility private
# Invite collaborators (requires LaminHub)
lamin instance invite user@email.com
```
### Instance Migration
```bash
# Backup instance
lamin backup create
# Restore from backup
lamin backup restore backup_id
# Export instance metadata
lamin export instance-metadata.json
```
### Deleting Instances
```bash
# Delete instance (preserves data, removes metadata)
lamin delete --force instance-name
# This only removes the LaminDB metadata
# Actual data in storage location remains
```
## Production Deployment Patterns
### Pattern 1: Local Development → Cloud Production
**Development:**
```bash
# Local development
lamin init --storage ./dev-data --modules bionty
```
**Production:**
```bash
# Cloud production
lamin init --storage s3://prod-bucket/data \
--db postgresql://user:pwd@db-host:5432/prod-db \
--modules bionty \
--name production
```
**Migration:** Export artifacts from dev, import to prod
```python
from pathlib import Path
import shutil
import lamindb as ln

# Export from dev: copy cached artifact files into a staging directory
# (works for file artifacts; folder artifacts need a recursive copy)
export_dir = Path("/tmp/export")
export_dir.mkdir(parents=True, exist_ok=True)
for artifact in ln.Artifact.filter().all():
    cached = Path(artifact.cache())  # download (or locate) the file locally
    shutil.copy(cached, export_dir / cached.name)

# Switch to prod from the shell: lamin connect production

# Import to prod
for file in export_dir.glob("*"):
    ln.Artifact(str(file), key=file.name).save()
```
### Pattern 2: Multi-Region Deployment
Deploy instances in multiple regions for data sovereignty:
```bash
# US instance
lamin init --storage s3://us-bucket/data \
--db postgresql://user:pwd@us-db:5432/db \
--name us-production
# EU instance
lamin init --storage s3://eu-bucket/data \
--db postgresql://user:pwd@eu-db:5432/db \
--name eu-production
```
### Pattern 3: Shared Storage, Personal Instances
Multiple users, shared data:
```bash
# Shared storage with user-specific DB
lamin init --storage s3://shared-bucket/data \
--db postgresql://user1:pwd@host:5432/user1_db \
--name user1-workspace
lamin init --storage s3://shared-bucket/data \
--db postgresql://user2:pwd@host:5432/user2_db \
--name user2-workspace
```
## Performance Optimization
### Database Performance
- Use connection pooling for PostgreSQL (configure this on the database server or via a pooler such as PgBouncer).
- LaminDB creates indexes automatically for common queries, so manual index tuning is rarely needed.
### Storage Performance
```bash
# Use appropriate storage classes
# S3: STANDARD for frequent access, INTELLIGENT_TIERING for mixed access
# Configure multipart upload thresholds
export AWS_CLI_FILE_IO_BANDWIDTH=100MB
```
### Cache Optimization
```python
# Pre-cache frequently used artifacts
artifacts = ln.Artifact.filter(key__startswith="reference/")
for artifact in artifacts:
artifact.cache() # Download to cache
# Use backed mode for large arrays
adata = artifact.backed() # Don't load into memory
```
## Security Best Practices
1. **Credentials Management:**
   - Use environment variables, not hardcoded credentials (see the sketch after this list)
- Use IAM roles on AWS/GCP instead of access keys
- Rotate credentials regularly
2. **Access Control:**
- Use PostgreSQL for multi-user access control
- Configure storage bucket policies
- Enable audit logging
3. **Network Security:**
- Use SSL/TLS for database connections
- Use VPCs for cloud deployments
- Restrict IP addresses when possible
4. **Data Protection:**
- Enable encryption at rest (S3, GCS)
- Use encryption in transit (HTTPS, SSL)
- Implement backup strategies
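A sketch of practice 1, composing a PostgreSQL connection string from environment variables instead of hardcoding credentials (the variable names are illustrative):
```python
import os

# Compose the PostgreSQL URL from environment variables (names are illustrative)
db_url = (
    "postgresql://"
    f"{os.environ['DB_USER']}:{os.environ['DB_PASSWORD']}"
    f"@{os.environ['DB_HOST']}:5432/{os.environ['DB_NAME']}"
)
# Pass db_url to `lamin init --db ...` or your database client; never commit it to git
```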
## Monitoring and Maintenance
### Health Checks
```python
import lamindb as ln
# Check database connection
try:
ln.Artifact.filter().count()
print("✓ Database connected")
except Exception as e:
print(f"✗ Database error: {e}")
# Check storage access (writes and then removes a tiny test artifact)
try:
    from pathlib import Path
    Path("test.txt").write_text("healthcheck")
    test_artifact = ln.Artifact("test.txt", key="healthcheck.txt").save()
    test_artifact.delete(permanent=True)
    print("✓ Storage accessible")
except Exception as e:
    print(f"✗ Storage error: {e}")
```
### Logging
```python
# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)
# LaminDB operations will produce detailed logs
```
### Backup Strategy
```bash
# Regular database backups (PostgreSQL)
pg_dump -h hostname -U username -d database > backup_$(date +%Y%m%d).sql
# Storage backups (S3 versioning)
aws s3api put-bucket-versioning \
--bucket my-bucket \
--versioning-configuration Status=Enabled
# Metadata export
lamin export metadata_backup.json
```
## Troubleshooting
### Common Issues
**Issue: Cannot connect to instance**
```bash
# Check instance exists
lamin ls
# Verify authentication
lamin login
# Re-connect
lamin connect instance-name
```
**Issue: Storage permissions denied**
```bash
# Check AWS credentials
aws s3 ls s3://your-bucket/
# Check GCS credentials
gsutil ls gs://your-bucket/
# Verify IAM permissions
```
**Issue: Database connection error**
```bash
# Test PostgreSQL connection
psql postgresql://user:pwd@host:5432/db
# Check database version compatibility
lamin migrate check
```
**Issue: Cache full**
```python
# Clear cache
import lamindb as ln
import shutil
shutil.rmtree(ln.settings.cache_dir)
# Set larger cache location
lamin cache set /larger/disk/cache
```
## Upgrade and Migration
### Upgrading LaminDB
```bash
# Upgrade to latest version
pip install --upgrade lamindb
# Upgrade database schema
lamin migrate deploy
```
### Schema Compatibility
Check the compatibility matrix to ensure your database schema version is compatible with your installed LaminDB version.
### Breaking Changes
Major version upgrades may require migration:
```bash
# Check for breaking changes
lamin migrate check
# Review migration plan
lamin migrate plan
# Execute migration
lamin migrate deploy
```
## Best Practices
1. **Start local, scale cloud**: Develop locally, deploy to cloud for production
2. **Use PostgreSQL for production**: SQLite is only for development
3. **Configure appropriate cache**: Size cache based on working set
4. **Enable versioning**: Use S3/GCS versioning for data protection
5. **Monitor costs**: Track storage and compute costs in cloud deployments
6. **Document configuration**: Keep infrastructure-as-code for reproducibility
7. **Test backups**: Regularly verify backup and restore procedures
8. **Set up monitoring**: Implement health checks and alerting
9. **Use modules strategically**: Only install needed plugins to reduce complexity
10. **Plan for scale**: Consider concurrent users and data growth