Initial commit

skills/lamindb/SKILL.md
---
name: lamindb
description: This skill should be used when working with LaminDB, an open-source data framework for biology that makes data queryable, traceable, reproducible, and FAIR. Use when managing biological datasets (scRNA-seq, spatial, flow cytometry, etc.), tracking computational workflows, curating and validating data with biological ontologies, building data lakehouses, or ensuring data lineage and reproducibility in biological research. Covers data management, annotation, ontologies (genes, cell types, diseases, tissues), schema validation, integrations with workflow managers (Nextflow, Snakemake) and MLOps platforms (W&B, MLflow), and deployment strategies.
---

# LaminDB

## Overview

LaminDB is an open-source data framework for biology designed to make data queryable, traceable, reproducible, and FAIR (Findable, Accessible, Interoperable, Reusable). It provides a unified platform that combines lakehouse architecture, lineage tracking, feature stores, biological ontologies, LIMS (Laboratory Information Management System), and ELN (Electronic Lab Notebook) capabilities through a single Python API.

**Core Value Proposition:**
- **Queryability**: Search and filter datasets by metadata, features, and ontology terms
- **Traceability**: Automatic lineage tracking from raw data through analysis to results
- **Reproducibility**: Version control for data, code, and environment
- **FAIR Compliance**: Standardized annotations using biological ontologies

## When to Use This Skill

Use this skill when:

- **Managing biological datasets**: scRNA-seq, bulk RNA-seq, spatial transcriptomics, flow cytometry, multi-modal data, EHR data
- **Tracking computational workflows**: Notebooks, scripts, pipeline execution (Nextflow, Snakemake, Redun)
- **Curating and validating data**: Schema validation, standardization, ontology-based annotation
- **Working with biological ontologies**: Genes, proteins, cell types, tissues, diseases, pathways (via Bionty)
- **Building data lakehouses**: Unified query interface across multiple datasets
- **Ensuring reproducibility**: Automatic versioning, lineage tracking, environment capture
- **Integrating ML pipelines**: Connecting with Weights & Biases, MLflow, HuggingFace, scVI-tools
- **Deploying data infrastructure**: Setting up local or cloud-based data management systems
- **Collaborating on datasets**: Sharing curated, annotated data with standardized metadata

## Core Capabilities

LaminDB provides six interconnected capability areas, each documented in detail in the references folder.

### 1. Core Concepts and Data Lineage

**Core entities:**
- **Artifacts**: Versioned datasets (DataFrame, AnnData, Parquet, Zarr, etc.)
- **Records**: Experimental entities (samples, perturbations, instruments)
- **Runs & Transforms**: Computational lineage tracking (what code produced what data)
- **Features**: Typed metadata fields for annotation and querying

**Key workflows:**
- Create and version artifacts from files or Python objects
- Track notebook/script execution with `ln.track()` and `ln.finish()`
- Annotate artifacts with typed features
- Visualize data lineage graphs with `artifact.view_lineage()`
- Query by provenance (find all outputs from specific code/inputs)
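
For orientation, a minimal tracking sketch (the keys used here are illustrative, not part of the skill):

```python
import lamindb as ln

ln.track()  # begin lineage capture for this script or notebook

# artifacts loaded here are recorded as run inputs
df = ln.Artifact.get(key="raw/data.csv").load()

# ... analysis ...

# artifacts saved here are recorded as run outputs
ln.Artifact.from_dataframe(df, key="processed/data.parquet").save()

ln.finish()  # close the run and store the execution report
```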

**Reference:** `references/core-concepts.md` - Read this for detailed information on artifacts, records, runs, transforms, features, versioning, and lineage tracking.

### 2. Data Management and Querying

**Query capabilities:**
- Registry exploration and lookup with auto-complete
- Single record retrieval with `get()`, `one()`, `one_or_none()`
- Filtering with comparison operators (`__gt`, `__lte`, `__contains`, `__startswith`)
- Feature-based queries (query by annotated metadata)
- Cross-registry traversal with double-underscore syntax
- Full-text search across registries
- Advanced logical queries with Q objects (AND, OR, NOT)
- Streaming large datasets without loading into memory

**Key workflows:**
- Browse artifacts with filters and ordering
- Query by features, creation date, creator, size, etc.
- Stream large files in chunks or with array slicing
- Organize data with hierarchical keys
- Group artifacts into collections
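
A few of these patterns in one hedged sketch (the `cell_type` feature and the user handle are illustrative assumptions):

```python
import lamindb as ln

# comparison operators and ordering
recent_big = ln.Artifact.filter(size__gt=1e6).order_by("-created_at")

# cross-registry traversal with double-underscore syntax
by_user = ln.Artifact.filter(created_by__handle="researcher123")

# feature-based query (assumes a registered `cell_type` feature)
t_cells = ln.Artifact.filter(cell_type="T cell").to_dataframe()
```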

**Reference:** `references/data-management.md` - Read this for comprehensive query patterns, filtering examples, streaming strategies, and data organization best practices.

### 3. Annotation and Validation

**Curation process:**
1. **Validation**: Confirm datasets match desired schemas
2. **Standardization**: Fix typos, map synonyms to canonical terms
3. **Annotation**: Link datasets to metadata entities for queryability

**Schema types:**
- **Flexible schemas**: Validate only known columns, allow additional metadata
- **Minimal required schemas**: Specify essential columns, permit extras
- **Strict schemas**: Complete control over structure and values

**Supported data types:**
- DataFrames (Parquet, CSV)
- AnnData (single-cell genomics)
- MuData (multi-modal)
- SpatialData (spatial transcriptomics)
- TileDB-SOMA (scalable arrays)

**Key workflows:**
- Define features and schemas for data validation
- Use `DataFrameCurator` or `AnnDataCurator` for validation
- Standardize values with `.cat.standardize()`
- Map to ontologies with `.cat.add_ontology()`
- Save curated artifacts with schema linkage
- Query validated datasets by features
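
A short curation sketch, under the assumption that `df` is a pandas DataFrame and a `cell_type` feature already exists in the registry:

```python
import lamindb as ln

schema = ln.Schema(
    name="example_schema",
    features=[ln.Feature.get(name="cell_type")],
    flexible=True,
).save()

curator = ln.curators.DataFrameCurator(df, schema)
if not curator.validate():
    curator.cat.standardize("cell_type")  # map typos/synonyms
artifact = curator.save_artifact(key="curated/example.parquet")
```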

**Reference:** `references/annotation-validation.md` - Read this for detailed curation workflows, schema design patterns, handling validation errors, and best practices.

### 4. Biological Ontologies

**Available ontologies (via Bionty):**
- Genes (Ensembl), Proteins (UniProt)
- Cell types (CL), Cell lines (CLO)
- Tissues (Uberon), Diseases (Mondo, DOID)
- Phenotypes (HPO), Pathways (GO)
- Experimental factors (EFO), Developmental stages
- Organisms (NCBItaxon), Drugs (DrugBank)

**Key workflows:**
- Import public ontologies with `bt.CellType.import_source()`
- Search ontologies with keyword or exact matching
- Standardize terms using synonym mapping
- Explore hierarchical relationships (parents, children, ancestors)
- Validate data against ontology terms
- Annotate datasets with ontology records
- Create custom terms and hierarchies
- Handle multi-organism contexts (human, mouse, etc.)
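
A brief ontology sketch (term names are illustrative):

```python
import bionty as bt

bt.CellType.import_source()              # pull the public ontology
bt.CellType.search("T cell")             # keyword search
t_cell = bt.CellType.get(name="T cell")
t_cell.parents.to_dataframe()            # navigate the hierarchy
```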

**Reference:** `references/ontologies.md` - Read this for comprehensive ontology operations, standardization strategies, hierarchy navigation, and annotation workflows.

### 5. Integrations

**Workflow managers:**
- Nextflow: Track pipeline processes and outputs
- Snakemake: Integrate into Snakemake rules
- Redun: Combine with Redun task tracking

**MLOps platforms:**
- Weights & Biases: Link experiments with data artifacts
- MLflow: Track models and experiments
- HuggingFace: Track model fine-tuning
- scVI-tools: Single-cell analysis workflows

**Storage systems:**
- Local filesystem, AWS S3, Google Cloud Storage
- S3-compatible (MinIO, Cloudflare R2)
- HTTP/HTTPS endpoints (read-only)
- HuggingFace datasets

**Array stores:**
- TileDB-SOMA (with cellxgene support)
- DuckDB for SQL queries on Parquet files

**Visualization:**
- Vitessce for interactive spatial/single-cell visualization

**Version control:**
- Git integration for source code tracking

**Reference:** `references/integrations.md` - Read this for integration patterns, code examples, and troubleshooting for third-party systems.

### 6. Setup and Deployment

**Installation:**
- Basic: `uv pip install lamindb`
- With extras: `uv pip install 'lamindb[gcp,zarr,fcs]'`
- Modules: bionty, wetlab, clinical

**Instance types:**
- Local SQLite (development)
- Cloud storage + SQLite (small teams)
- Cloud storage + PostgreSQL (production)

**Storage options:**
- Local filesystem
- AWS S3 with configurable regions and permissions
- Google Cloud Storage
- S3-compatible endpoints (MinIO, Cloudflare R2)

**Configuration:**
- Cache management for cloud files
- Multi-user system configurations
- Git repository sync
- Environment variables

**Deployment patterns:**
- Local dev → Cloud production migration
- Multi-region deployments
- Shared storage with personal instances

**Reference:** `references/setup-deployment.md` - Read this for detailed installation, configuration, storage setup, database management, security best practices, and troubleshooting.

## Common Use Case Workflows

### Use Case 1: Single-Cell RNA-seq Analysis with Ontology Validation

```python
import lamindb as ln
import bionty as bt
import anndata as ad

# Start tracking
ln.track(params={"analysis": "scRNA-seq QC and annotation"})

# Import cell type ontology
bt.CellType.import_source()

# Load data
adata = ad.read_h5ad("raw_counts.h5ad")

# Validate and standardize cell types
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs["cell_type"])

# Curate with schema (assumes a `schema` defined as in references/annotation-validation.md)
curator = ln.curators.AnnDataCurator(adata, schema)
curator.validate()
artifact = curator.save_artifact(key="scrna/validated.h5ad")

# Link ontology annotations
cell_types = bt.CellType.from_values(adata.obs.cell_type)
artifact.feature_sets.add_ontology(cell_types)

ln.finish()
```

### Use Case 2: Building a Queryable Data Lakehouse

```python
import lamindb as ln
import anndata as ad

# Register multiple experiments
for i, file in enumerate(data_files):
    artifact = ln.Artifact.from_anndata(
        ad.read_h5ad(file),
        key=f"scrna/batch_{i}.h5ad",
        description=f"scRNA-seq batch {i}"
    ).save()

    # Annotate with features
    artifact.features.add_values({
        "batch": i,
        "tissue": tissues[i],
        "condition": conditions[i]
    })

# Query across all experiments
immune_datasets = ln.Artifact.filter(
    key__startswith="scrna/",
    tissue="PBMC",
    condition="treated"
)
immune_datasets.to_dataframe()  # inspect matching artifacts

# Load specific datasets
for artifact in immune_datasets:
    adata = artifact.load()
    # Analyze
```

### Use Case 3: ML Pipeline with W&B Integration

```python
import lamindb as ln
import wandb

# Initialize both systems
wandb.init(project="drug-response", name="exp-42")
ln.track(params={"model": "random_forest", "n_estimators": 100})

# Load training data from LaminDB
train_artifact = ln.Artifact.get(key="datasets/train.parquet")
train_data = train_artifact.load()

# Train model (train_model is a placeholder for your own training code)
model = train_model(train_data)

# Log to W&B
wandb.log({"accuracy": 0.95})

# Save model in LaminDB with W&B linkage
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact("model.pkl", key="models/exp-42.pkl").save()
model_artifact.features.add_values({"wandb_run_id": wandb.run.id})

ln.finish()
wandb.finish()
```

### Use Case 4: Nextflow Pipeline Integration

```python
# In a Nextflow process script; ${batch_id} is interpolated by Nextflow
import lamindb as ln

ln.track()

# Load input artifact
input_artifact = ln.Artifact.get(key="raw/batch_${batch_id}.fastq.gz")
input_path = input_artifact.cache()

# Process (alignment, quantification, etc.)
# ... Nextflow process logic ...

# Save output
output_artifact = ln.Artifact(
    "counts.csv",
    key="processed/batch_${batch_id}_counts.csv"
).save()

ln.finish()
```

## Getting Started Checklist

To start using LaminDB effectively:

1. **Installation & Setup** (`references/setup-deployment.md`)
   - Install LaminDB and required extras
   - Authenticate with `lamin login`
   - Initialize instance with `lamin init --storage ...`

2. **Learn Core Concepts** (`references/core-concepts.md`)
   - Understand Artifacts, Records, Runs, Transforms
   - Practice creating and retrieving artifacts
   - Implement `ln.track()` and `ln.finish()` in workflows

3. **Master Querying** (`references/data-management.md`)
   - Practice filtering and searching registries
   - Learn feature-based queries
   - Experiment with streaming large files

4. **Set Up Validation** (`references/annotation-validation.md`)
   - Define features relevant to research domain
   - Create schemas for data types
   - Practice curation workflows

5. **Integrate Ontologies** (`references/ontologies.md`)
   - Import relevant biological ontologies (genes, cell types, etc.)
   - Validate existing annotations
   - Standardize metadata with ontology terms

6. **Connect Tools** (`references/integrations.md`)
   - Integrate with existing workflow managers
   - Link ML platforms for experiment tracking
   - Configure cloud storage and compute

## Key Principles

Follow these principles when working with LaminDB:

1. **Track everything**: Use `ln.track()` at the start of every analysis for automatic lineage capture

2. **Validate early**: Define schemas and validate data before extensive analysis

3. **Use ontologies**: Leverage public biological ontologies for standardized annotations

4. **Organize with keys**: Structure artifact keys hierarchically (e.g., `project/experiment/batch/file.h5ad`)

5. **Query metadata first**: Filter and search before loading large files

6. **Version, don't duplicate**: Use built-in versioning instead of creating new keys for modifications

7. **Annotate with features**: Define typed features for queryable metadata

8. **Document thoroughly**: Add descriptions to artifacts, schemas, and transforms

9. **Leverage lineage**: Use `view_lineage()` to understand data provenance

10. **Start local, scale cloud**: Develop locally with SQLite, deploy to cloud with PostgreSQL
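
Several of these principles in one compact, hedged sketch (keys are illustrative):

```python
import lamindb as ln

ln.track()  # principle 1: capture lineage from the start

# principle 4: hierarchical keys; saving the same key again creates a
# new version rather than a duplicate (principle 6)
ln.Artifact("data.h5ad", key="project/exp1/batch1/data.h5ad").save()

# principle 5: query metadata before loading anything large
ln.Artifact.filter(key__startswith="project/exp1/").to_dataframe()

ln.finish()
```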

## Reference Files

This skill includes comprehensive reference documentation organized by capability:

- **`references/core-concepts.md`** - Artifacts, records, runs, transforms, features, versioning, lineage
- **`references/data-management.md`** - Querying, filtering, searching, streaming, organizing data
- **`references/annotation-validation.md`** - Schema design, curation workflows, validation strategies
- **`references/ontologies.md`** - Biological ontology management, standardization, hierarchies
- **`references/integrations.md`** - Workflow managers, MLOps platforms, storage systems, tools
- **`references/setup-deployment.md`** - Installation, configuration, deployment, troubleshooting

Read the relevant reference file(s) based on the specific LaminDB capability needed for the task at hand.

## Additional Resources

- **Official Documentation**: https://docs.lamin.ai
- **API Reference**: https://docs.lamin.ai/api
- **GitHub Repository**: https://github.com/laminlabs/lamindb
- **Tutorial**: https://docs.lamin.ai/tutorial
- **FAQ**: https://docs.lamin.ai/faq

skills/lamindb/references/annotation-validation.md

# LaminDB Annotation & Validation

This document covers data curation, validation, schema management, and annotation best practices in LaminDB.

## Overview

LaminDB's curation process ensures datasets are both validated and queryable through three essential steps:

1. **Validation**: Confirming datasets match desired schemas
2. **Standardization**: Fixing inconsistencies like typos and mapping synonyms
3. **Annotation**: Linking datasets to metadata entities for queryability
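
A compact sketch of the three steps together (assumes a `cell_type` feature is registered and `df` is a pandas DataFrame):

```python
import lamindb as ln
import bionty as bt

schema = ln.Schema(
    name="demo_schema",
    features=[ln.Feature.get(name="cell_type")],
).save()

curator = ln.curators.DataFrameCurator(df, schema)
curator.validate()                                  # 1. validation
curator.cat.standardize("cell_type")                # 2. standardization
curator.cat.add_ontology("cell_type", bt.CellType)
artifact = curator.save_artifact(key="demo.parquet")  # 3. annotation via schema linkage
```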

## Schema Design

Schemas define expected data structure, types, and validation rules. LaminDB supports three main schema approaches:

### 1. Flexible Schema

Validates only columns matching Feature registry names, allowing additional metadata:

```python
import lamindb as ln

# Create flexible schema
schema = ln.Schema(
    name="valid_features",
    itype=ln.Feature  # Validates against Feature registry
).save()

# Any column matching a Feature name will be validated
# Additional columns are permitted but not validated
```

### 2. Minimal Required Schema

Specifies essential columns while permitting extra metadata:

```python
# Define required features
required_features = [
    ln.Feature.get(name="cell_type"),
    ln.Feature.get(name="tissue"),
    ln.Feature.get(name="donor_id")
]

# Create schema with required features
schema = ln.Schema(
    name="minimal_immune_schema",
    features=required_features,
    flexible=True  # Allows additional columns
).save()
```

### 3. Strict Schema

Enforces complete control over data structure:

```python
# Define all allowed features
all_features = [
    ln.Feature.get(name="cell_type"),
    ln.Feature.get(name="tissue"),
    ln.Feature.get(name="donor_id"),
    ln.Feature.get(name="disease")
]

# Create strict schema
schema = ln.Schema(
    name="strict_immune_schema",
    features=all_features,
    flexible=False  # No additional columns allowed
).save()
```

## DataFrame Curation Workflow

The typical curation process involves six key steps:

### Step 1-2: Load Data and Establish Registries

```python
import pandas as pd
import lamindb as ln

# Load data
df = pd.read_csv("experiment.csv")

# Define and save features
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="tissue", dtype=str).save()
ln.Feature(name="gene_count", dtype=int).save()
ln.Feature(name="experiment_date", dtype="date").save()

# Populate valid values (if using controlled vocabulary)
import bionty as bt
bt.CellType.import_source()
bt.Tissue.import_source()
```

### Step 3: Create Schema

```python
# Link features to schema
features = [
    ln.Feature.get(name="cell_type"),
    ln.Feature.get(name="tissue"),
    ln.Feature.get(name="gene_count"),
    ln.Feature.get(name="experiment_date")
]

schema = ln.Schema(
    name="experiment_schema",
    features=features,
    flexible=True
).save()
```

### Step 4: Initialize Curator and Validate

```python
# Initialize curator
curator = ln.curators.DataFrameCurator(df, schema)

# Validate dataset
validation = curator.validate()

# Check validation results
if validation:
    print("✓ Validation passed")
else:
    print("✗ Validation failed")
    curator.non_validated  # See problematic fields
```

### Step 5: Fix Validation Issues

#### Standardize Values

```python
# Fix typos and synonyms in categorical columns
curator.cat.standardize("cell_type")
curator.cat.standardize("tissue")

# View standardization mapping
curator.cat.inspect_standardize("cell_type")
```

#### Map to Ontologies

```python
# Map values to ontology terms
curator.cat.add_ontology("cell_type", bt.CellType)
curator.cat.add_ontology("tissue", bt.Tissue)

# Look up public ontologies for unmapped terms
curator.cat.lookup(public=True).cell_type  # Interactive lookup
```

#### Add New Terms

```python
# Add new valid terms to registry
curator.cat.add_new_from("cell_type")

# Or manually create records
new_cell_type = bt.CellType(name="my_novel_cell_type").save()
```

#### Rename Columns

```python
# Rename columns to match feature names
df = df.rename(columns={"celltype": "cell_type"})

# Re-initialize curator with fixed DataFrame
curator = ln.curators.DataFrameCurator(df, schema)
```

### Step 6: Save Curated Artifact

```python
# Save with schema linkage
artifact = curator.save_artifact(
    key="experiments/curated_data.parquet",
    description="Validated and annotated experimental data"
)

# Verify artifact has schema
artifact.schema  # Returns the schema object
artifact.describe()  # Shows validation status
```

## AnnData Curation

For composite structures like AnnData, use "slots" to validate different components:

### Defining AnnData Schemas

```python
# Create schemas for different slots
obs_schema = ln.Schema(
    name="cell_metadata",
    features=[
        ln.Feature.get(name="cell_type"),
        ln.Feature.get(name="tissue"),
        ln.Feature.get(name="donor_id")
    ]
).save()

var_schema = ln.Schema(
    name="gene_ids",
    features=[ln.Feature.get(name="ensembl_gene_id")]
).save()

# Create composite AnnData schema
anndata_schema = ln.Schema(
    name="scrna_schema",
    otype="AnnData",
    slots={
        "obs": obs_schema,
        "var.T": var_schema  # .T indicates transposition
    }
).save()
```

### Curating AnnData Objects

```python
import anndata as ad

# Load AnnData
adata = ad.read_h5ad("data.h5ad")

# Initialize curator
curator = ln.curators.AnnDataCurator(adata, anndata_schema)

# Validate all slots
validation = curator.validate()

# Fix issues by slot
curator.cat.standardize("obs", "cell_type")
curator.cat.add_ontology("obs", "cell_type", bt.CellType)
curator.cat.standardize("var.T", "ensembl_gene_id")

# Save curated artifact
artifact = curator.save_artifact(
    key="scrna/validated_data.h5ad",
    description="Curated single-cell RNA-seq data"
)
```

## MuData Curation

MuData supports multi-modal data through modality-specific slots:

```python
# Define schemas for each modality
rna_obs_schema = ln.Schema(name="rna_obs_schema", features=[...]).save()
protein_obs_schema = ln.Schema(name="protein_obs_schema", features=[...]).save()

# Create MuData schema
mudata_schema = ln.Schema(
    name="multimodal_schema",
    otype="MuData",
    slots={
        "rna:obs": rna_obs_schema,
        "protein:obs": protein_obs_schema
    }
).save()

# Curate
curator = ln.curators.MuDataCurator(mdata, mudata_schema)
curator.validate()
```

## SpatialData Curation

For spatial transcriptomics data:

```python
# Define spatial schema
spatial_schema = ln.Schema(
    name="spatial_schema",
    otype="SpatialData",
    slots={
        "tables:cell_metadata.obs": cell_schema,
        "attrs:bio": bio_metadata_schema
    }
).save()

# Curate
curator = ln.curators.SpatialDataCurator(sdata, spatial_schema)
curator.validate()
```

## TileDB-SOMA Curation

For scalable array-backed data:

```python
# Define SOMA schema
soma_schema = ln.Schema(
    name="soma_schema",
    otype="tiledbsoma",
    slots={
        "obs": obs_schema,
        "ms:RNA.T": var_schema  # measurement:modality.T
    }
).save()

# Curate
curator = ln.curators.TileDBSOMACurator(soma_exp, soma_schema)
curator.validate()
```

## Feature Validation

### Data Type Validation

```python
# Define typed features
ln.Feature(name="age", dtype=int).save()
ln.Feature(name="weight", dtype=float).save()
ln.Feature(name="is_treated", dtype=bool).save()
ln.Feature(name="collection_date", dtype="date").save()

# Coerce types during validation
ln.Feature(name="age_str", dtype=int, coerce_dtype=True).save()  # Auto-convert strings to int
```

### Value Validation

```python
# Validate against allowed values
cell_type_feature = ln.Feature(name="cell_type", dtype=str).save()

# Link to registry for controlled vocabulary
cell_type_feature.link_to_registry(bt.CellType)

# Now validation checks against CellType registry
curator = ln.curators.DataFrameCurator(df, schema)
curator.validate()  # Errors if cell_type values not in registry
```

## Standardization Strategies

### Using Public Ontologies

```python
# Look up standardized terms from public sources
curator.cat.lookup(public=True).cell_type

# Returns auto-complete object with public ontology terms
# User can select correct term interactively
```

### Synonym Mapping

```python
# Add synonyms to records
t_cell = bt.CellType.get(name="T cell")
t_cell.add_synonym("T lymphocyte")
t_cell.add_synonym("T-cell")

# Now standardization maps synonyms automatically
curator.cat.standardize("cell_type")
# "T lymphocyte" → "T cell"
# "T-cell" → "T cell"
```

### Custom Standardization

```python
# Manual mapping
mapping = {
    "TCell": "T cell",
    "t cell": "T cell",
    "T-cells": "T cell"
}

# Apply mapping
df["cell_type"] = df["cell_type"].map(lambda x: mapping.get(x, x))
```

## Handling Validation Errors

### Common Issues and Solutions

**Issue: Column not in schema**
```python
# Solution 1: Rename column
df = df.rename(columns={"old_name": "feature_name"})

# Solution 2: Add feature to schema
new_feature = ln.Feature(name="new_column", dtype=str).save()
schema.features.add(new_feature)
```

**Issue: Invalid values**
```python
# Solution 1: Standardize
curator.cat.standardize("column_name")

# Solution 2: Add new valid values
curator.cat.add_new_from("column_name")

# Solution 3: Map to ontology
curator.cat.add_ontology("column_name", bt.Registry)
```

**Issue: Data type mismatch**
```python
# Solution 1: Convert data type
df["column"] = df["column"].astype(int)

# Solution 2: Enable coercion in feature
feature = ln.Feature.get(name="column")
feature.coerce_dtype = True
feature.save()
```

## Schema Versioning

Schemas can be versioned like other records:

```python
# Create initial schema
schema_v1 = ln.Schema(name="experiment_schema", features=[...]).save()

# Update schema with new features
schema_v2 = ln.Schema(
    name="experiment_schema",
    features=[...],  # Updated list
    version="2"
).save()

# Link artifacts to specific schema versions
artifact.schema = schema_v2
artifact.save()
```

## Querying Validated Data

Once data is validated and annotated, it becomes queryable:

```python
# Find all validated artifacts
ln.Artifact.filter(is_valid=True).to_dataframe()

# Find artifacts with specific schema
ln.Artifact.filter(schema=schema).to_dataframe()

# Query by annotated features
ln.Artifact.filter(cell_type="T cell", tissue="blood").to_dataframe()

# Include features in results
ln.Artifact.filter(is_valid=True).to_dataframe(include="features")
```

## Best Practices

1. **Define features first**: Create Feature registry before curation
2. **Use public ontologies**: Leverage `curator.cat.lookup(public=True)` for standardization
3. **Start flexible**: Use flexible schemas initially, tighten as understanding grows
4. **Document slots**: Clearly specify transposition (.T) in composite schemas
5. **Standardize early**: Fix typos and synonyms before validation
6. **Validate incrementally**: Check each slot separately for composite structures
7. **Version schemas**: Track schema changes over time
8. **Add synonyms**: Register common variations to simplify future curation
9. **Coerce types cautiously**: Enable dtype coercion only when safe
10. **Test on samples**: Validate small subsets before full dataset curation

## Advanced: Custom Validators

Create custom validation logic:

```python
def validate_gene_expression(df):
    """Custom validator for gene expression values."""
    # Check non-negative
    if (df < 0).any().any():
        return False, "Negative expression values found"

    # Check reasonable range
    if (df > 1e6).any().any():
        return False, "Unreasonably high expression values"

    return True, "Valid"

# Apply during curation
is_valid, message = validate_gene_expression(df)
if not is_valid:
    print(f"Validation failed: {message}")
```

## Tracking Curation Provenance

```python
# Curated artifacts track curation lineage
ln.track()  # Start tracking

# Perform curation
curator = ln.curators.DataFrameCurator(df, schema)
curator.validate()
curator.cat.standardize("cell_type")
artifact = curator.save_artifact(key="curated.parquet")

ln.finish()  # Complete tracking

# View curation lineage
artifact.run.describe()  # Shows curation transform
artifact.view_lineage()  # Visualizes curation process
```

skills/lamindb/references/core-concepts.md

# LaminDB Core Concepts

This document covers the fundamental concepts and building blocks of LaminDB: Artifacts, Records, Runs, Transforms, Features, and data lineage tracking.

## Artifacts

Artifacts represent datasets in various formats (DataFrames, AnnData, SpatialData, Parquet, Zarr, etc.). They serve as the primary data objects in LaminDB.

### Creating and Saving Artifacts

**From file:**
```python
import lamindb as ln

# Save a file as artifact
ln.Artifact("sample.fasta", key="sample.fasta").save()

# With description
artifact = ln.Artifact(
    "data/analysis.h5ad",
    key="experiments/scrna_batch1.h5ad",
    description="Single-cell RNA-seq batch 1"
).save()
```

**From DataFrame:**
```python
import pandas as pd

df = pd.read_csv("data.csv")
artifact = ln.Artifact.from_dataframe(
    df,
    key="datasets/processed_data.parquet",
    description="Processed experimental data"
).save()
```

**From AnnData:**
```python
import anndata as ad

adata = ad.read_h5ad("data.h5ad")
artifact = ln.Artifact.from_anndata(
    adata,
    key="scrna/experiment1.h5ad",
    description="scRNA-seq data with QC"
).save()
```

### Retrieving Artifacts

```python
# By key
artifact = ln.Artifact.get(key="sample.fasta")

# By UID
artifact = ln.Artifact.get("aRt1Fact0uid000")

# By filter
artifact = ln.Artifact.filter(suffix=".h5ad").first()
```

### Accessing Artifact Content

```python
# Get cached local path
local_path = artifact.cache()

# Load into memory
data = artifact.load()  # Returns DataFrame, AnnData, etc.

# Streaming access (for large files)
with artifact.open() as f:
    # Read incrementally
    chunk = f.read(1000)
```

### Artifact Metadata

```python
# View all metadata
artifact.describe()

# Access specific metadata
artifact.size  # File size in bytes
artifact.suffix  # File extension
artifact.created_at  # Timestamp
artifact.created_by  # User who created it
artifact.run  # Associated run
artifact.transform  # Associated transform
artifact.version  # Version string
```

## Records

Records represent experimental entities: samples, perturbations, instruments, cell lines, and any other metadata entities. They support hierarchical relationships through type definitions.

### Creating Records

```python
# Define a type
sample_type = ln.Record(name="Sample", is_type=True).save()

# Create instances of that type
ln.Record(name="P53mutant1", type=sample_type).save()
ln.Record(name="P53mutant2", type=sample_type).save()
ln.Record(name="WT-control", type=sample_type).save()
```

### Searching Records

```python
# Text search
ln.Record.search("p53").to_dataframe()

# Filter by fields
ln.Record.filter(type=sample_type).to_dataframe()

# Get specific record
record = ln.Record.get(name="P53mutant1")
```

### Hierarchical Relationships

```python
# Establish parent-child relationships
parent_record = ln.Record.get(name="P53mutant1")
child_record = ln.Record(name="P53mutant1-replicate1", type=sample_type).save()
child_record.parents.add(parent_record)

# Query relationships
parent_record.children.to_dataframe()
child_record.parents.to_dataframe()
```

## Runs & Transforms

These capture computational lineage. A **Transform** represents a reusable analysis step (notebook, script, or function), while a **Run** documents a specific execution instance.

### Basic Tracking Workflow

```python
import lamindb as ln

# Start tracking (beginning of notebook/script)
ln.track()

# Your analysis code
data = ln.Artifact.get(key="input.csv").load()
# ... perform analysis ...
result.to_csv("output.csv")
artifact = ln.Artifact("output.csv", key="output.csv").save()

# Finish tracking (end of notebook/script)
ln.finish()
```

### Tracking with Parameters

```python
ln.track(params={
    "learning_rate": 0.01,
    "batch_size": 32,
    "epochs": 100,
    "downsample": True
})

# Query runs by parameters
ln.Run.filter(params__learning_rate=0.01).to_dataframe()
ln.Run.filter(params__downsample=True).to_dataframe()
```

### Tracking with Projects

```python
# Associate with project
ln.track(project="Cancer Drug Screen 2025")

# Query by project
project = ln.Project.get(name="Cancer Drug Screen 2025")
ln.Artifact.filter(projects=project).to_dataframe()
ln.Run.filter(project=project).to_dataframe()
```

### Function-Level Tracking

Use the `@ln.tracked()` decorator for fine-grained lineage:

```python
@ln.tracked()
def preprocess_data(input_key: str, output_key: str, normalize: bool = True) -> None:
    """Preprocess raw data and save result."""
    # Load input (automatically tracked)
    artifact = ln.Artifact.get(key=input_key)
    data = artifact.load()

    # Process
    if normalize:
        data = (data - data.mean()) / data.std()

    # Save output (automatically tracked)
    ln.Artifact.from_dataframe(data, key=output_key).save()

# Each call creates a separate Transform and Run
preprocess_data("raw/batch1.csv", "processed/batch1.csv", normalize=True)
preprocess_data("raw/batch2.csv", "processed/batch2.csv", normalize=False)
```

### Accessing Lineage Information

```python
# From artifact to run
artifact = ln.Artifact.get(key="output.csv")
run = artifact.run
transform = run.transform

# View details
run.describe()  # Run metadata
transform.describe()  # Transform metadata

# Access inputs
run.inputs.to_dataframe()

# Visualize lineage graph
artifact.view_lineage()
```

## Features

Features define typed metadata fields for validation and querying. They enable structured annotation and searching.

### Defining Features

```python
from datetime import date

# Numeric feature
ln.Feature(name="gc_content", dtype=float).save()
ln.Feature(name="read_count", dtype=int).save()

# Date feature
ln.Feature(name="experiment_date", dtype=date).save()

# Categorical feature
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="treatment", dtype=str).save()
```

### Annotating Artifacts with Features

```python
# Single values
artifact.features.add_values({
    "gc_content": 0.55,
    "experiment_date": "2025-10-31"
})

# Using feature registry records
gc_content_feature = ln.Feature.get(name="gc_content")
artifact.features.add(gc_content_feature)
```

### Querying by Features

```python
# Filter by feature value
ln.Artifact.filter(gc_content=0.55).to_dataframe()
ln.Artifact.filter(experiment_date="2025-10-31").to_dataframe()

# Comparison operators
ln.Artifact.filter(read_count__gt=1000000).to_dataframe()
ln.Artifact.filter(gc_content__gte=0.5, gc_content__lte=0.6).to_dataframe()

# Check for presence of annotation
ln.Artifact.filter(cell_type__isnull=False).to_dataframe()

# Include features in output
ln.Artifact.filter(treatment="DMSO").to_dataframe(include="features")
```

### Nested Dictionary Features

For complex metadata stored as dictionaries:

```python
# Access nested values
ln.Artifact.filter(study_metadata__detail1="123").to_dataframe()
ln.Artifact.filter(study_metadata__assay__type="RNA-seq").to_dataframe()
```
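
For context, a sketch of how such nested values might be attached; the dict-typed `study_metadata` feature here is an assumption of this example, not part of the original text:

```python
# assumes a dict-typed feature was registered, e.g.:
#   ln.Feature(name="study_metadata", dtype=dict).save()
artifact.features.add_values({
    "study_metadata": {"detail1": "123", "assay": {"type": "RNA-seq"}}
})
```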

## Data Lineage Tracking

LaminDB automatically captures execution context and relationships between data, code, and runs.

### What Gets Tracked

- **Source code**: Script/notebook content and git commit
- **Environment**: Python packages and versions
- **Input artifacts**: Data loaded during execution
- **Output artifacts**: Data created during execution
- **Execution metadata**: Timestamps, user, parameters
- **Computational dependencies**: Transform relationships

### Viewing Lineage

```python
# Visualize full lineage graph
artifact.view_lineage()

# View captured metadata
artifact.describe()

# Access related entities
artifact.run  # The run that created it
artifact.run.transform  # The transform (code) used
artifact.run.inputs  # Input artifacts
artifact.run.report  # Execution report
```

### Querying Lineage

```python
# Find all outputs from a transform
transform = ln.Transform.get(name="preprocessing.py")
ln.Artifact.filter(transform=transform).to_dataframe()

# Find all artifacts from a specific user
user = ln.User.get(handle="researcher123")
ln.Artifact.filter(created_by=user).to_dataframe()

# Find artifacts using specific inputs
input_artifact = ln.Artifact.get(key="raw/data.csv")
runs = ln.Run.filter(inputs=input_artifact)
ln.Artifact.filter(run__in=runs).to_dataframe()
```

## Versioning

LaminDB manages artifact versioning automatically when source data or code changes.

### Automatic Versioning

```python
# First version
artifact_v1 = ln.Artifact("data.csv", key="experiment/data.csv").save()

# Modify and save again - creates new version
# (modify data.csv)
artifact_v2 = ln.Artifact("data.csv", key="experiment/data.csv").save()
```

### Working with Versions

```python
# Get latest version (default)
artifact = ln.Artifact.get(key="experiment/data.csv")

# View all versions
artifact.versions.to_dataframe()

# Get specific version
artifact_v1 = artifact.versions.filter(version="1").first()

# Compare versions
v1_data = artifact_v1.load()
v2_data = artifact.load()
```

## Best Practices

1. **Use meaningful keys**: Structure keys hierarchically (e.g., `project/experiment/sample.h5ad`)
2. **Add descriptions**: Help future users understand artifact contents
3. **Track consistently**: Call `ln.track()` at the start of every analysis
4. **Define features upfront**: Create feature registry before annotation
5. **Use typed features**: Specify dtypes for better validation
6. **Leverage versioning**: Don't create new keys for minor changes
7. **Document transforms**: Add docstrings to tracked functions
8. **Set projects**: Group related work for easier organization and access control
9. **Query efficiently**: Use filters before loading large datasets
10. **Visualize lineage**: Use `view_lineage()` to understand data provenance

skills/lamindb/references/data-management.md

# LaminDB Data Management

This document covers querying, searching, filtering, and streaming data in LaminDB, as well as best practices for organizing and accessing datasets.

## Registry Overview

View available registries and their contents:

```python
import lamindb as ln

# View all registries across modules
ln.view()

# View latest 100 artifacts
ln.Artifact.to_dataframe()

# View other registries
ln.Transform.to_dataframe()
ln.Run.to_dataframe()
ln.User.to_dataframe()
```

## Lookup for Quick Access

For registries with fewer than 100k records, `Lookup` objects enable convenient auto-complete:

```python
# Create lookup
records = ln.Record.lookup()

# Access by name (auto-complete enabled in IDEs)
experiment_1 = records.experiment_1
sample_a = records.sample_a

# Works with biological ontologies too
import bionty as bt
cell_types = bt.CellType.lookup()
t_cell = cell_types.t_cell
```

## Retrieving Single Records

### Using get()

Retrieve exactly one record (errors if zero or multiple matches):

```python
# By UID
artifact = ln.Artifact.get("aRt1Fact0uid000")

# By field
artifact = ln.Artifact.get(key="data/experiment.h5ad")
user = ln.User.get(handle="researcher123")

# By ontology ID (for bionty)
cell_type = bt.CellType.get(ontology_id="CL:0000084")
```

### Using one() and one_or_none()

```python
# Get exactly one from QuerySet (errors if 0 or >1)
artifact = ln.Artifact.filter(key="data.csv").one()

# Get one or None (errors if >1)
artifact = ln.Artifact.filter(key="maybe_data.csv").one_or_none()

# Get first match
artifact = ln.Artifact.filter(suffix=".h5ad").first()
```

## Filtering Data

The `filter()` method returns a QuerySet for flexible retrieval:

```python
# Basic filtering
artifacts = ln.Artifact.filter(suffix=".h5ad")
artifacts.to_dataframe()

# Multiple conditions (AND logic)
artifacts = ln.Artifact.filter(
    suffix=".h5ad",
    created_by=user
)

# Comparison operators
ln.Artifact.filter(size__gt=1e6).to_dataframe()  # Greater than
ln.Artifact.filter(size__gte=1e6).to_dataframe()  # Greater than or equal
ln.Artifact.filter(size__lt=1e9).to_dataframe()  # Less than
ln.Artifact.filter(size__lte=1e9).to_dataframe()  # Less than or equal

# Range queries
ln.Artifact.filter(size__gte=1e6, size__lte=1e9).to_dataframe()
```

## Text and String Queries

```python
# Exact match
ln.Artifact.filter(description="Experiment 1").to_dataframe()

# Contains (case-sensitive)
ln.Artifact.filter(description__contains="RNA").to_dataframe()

# Case-insensitive contains
ln.Artifact.filter(description__icontains="rna").to_dataframe()

# Starts with
ln.Artifact.filter(key__startswith="experiments/").to_dataframe()

# Ends with
ln.Artifact.filter(key__endswith=".csv").to_dataframe()

# IN list
ln.Artifact.filter(suffix__in=[".h5ad", ".csv", ".parquet"]).to_dataframe()
```

## Feature-Based Queries

Query artifacts by their annotated features:

```python
# Filter by feature value
ln.Artifact.filter(cell_type="T cell").to_dataframe()
ln.Artifact.filter(treatment="DMSO").to_dataframe()

# Include features in output
ln.Artifact.filter(treatment="DMSO").to_dataframe(include="features")

# Nested dictionary access
ln.Artifact.filter(study_metadata__assay="RNA-seq").to_dataframe()
ln.Artifact.filter(study_metadata__detail1="123").to_dataframe()

# Check annotation status
ln.Artifact.filter(cell_type__isnull=False).to_dataframe()  # Has annotation
ln.Artifact.filter(treatment__isnull=True).to_dataframe()  # Missing annotation
```

## Traversing Related Registries

Django's double-underscore syntax enables queries across related tables:

```python
# Find artifacts by creator handle
ln.Artifact.filter(created_by__handle="researcher123").to_dataframe()
ln.Artifact.filter(created_by__handle__startswith="test").to_dataframe()

# Find artifacts by transform name
ln.Artifact.filter(transform__name="preprocess.py").to_dataframe()

# Find artifacts measuring specific genes
ln.Artifact.filter(feature_sets__genes__symbol="CD8A").to_dataframe()
ln.Artifact.filter(feature_sets__genes__ensembl_gene_id="ENSG00000153563").to_dataframe()

# Find runs with specific parameters
ln.Run.filter(params__learning_rate=0.01).to_dataframe()
ln.Run.filter(params__downsample=True).to_dataframe()

# Find artifacts from specific project
project = ln.Project.get(name="Cancer Study")
ln.Artifact.filter(projects=project).to_dataframe()
```

## Ordering Results

```python
# Order by field (ascending)
ln.Artifact.filter(suffix=".h5ad").order_by("created_at").to_dataframe()

# Order descending
ln.Artifact.filter(suffix=".h5ad").order_by("-created_at").to_dataframe()

# Multiple order fields
ln.Artifact.order_by("-created_at", "size").to_dataframe()
```

## Advanced Logical Queries

### OR Logic

```python
from lamindb import Q

# OR condition
artifacts = ln.Artifact.filter(
    Q(suffix=".jpg") | Q(suffix=".png")
).to_dataframe()

# Complex OR with multiple conditions
artifacts = ln.Artifact.filter(
    Q(suffix=".h5ad", size__gt=1e6) | Q(suffix=".csv", size__lt=1e3)
).to_dataframe()
```

### NOT Logic

```python
# Exclude condition
artifacts = ln.Artifact.filter(
    ~Q(suffix=".tmp")
).to_dataframe()

# Complex exclusion
artifacts = ln.Artifact.filter(
    ~Q(created_by__handle="testuser")
).to_dataframe()
```

### Combining AND, OR, NOT

```python
# Complex query
artifacts = ln.Artifact.filter(
    (Q(suffix=".h5ad") | Q(suffix=".csv")) &
    Q(size__gt=1e6) &
    ~Q(created_by__handle__startswith="test")
).to_dataframe()
```

## Search Functionality

Full-text search across registry fields:

```python
# Basic search
ln.Artifact.search("iris").to_dataframe()
ln.User.search("smith").to_dataframe()

# Search in specific registry
bt.CellType.search("T cell").to_dataframe()
bt.Gene.search("CD8").to_dataframe()
```

## Working with QuerySets

QuerySets are lazy - they don't hit the database until evaluated:

```python
# Create query (no database hit)
qs = ln.Artifact.filter(suffix=".h5ad")

# Evaluate in different ways
df = qs.to_dataframe()  # As pandas DataFrame
list_records = list(qs)  # As Python list
count = qs.count()  # Count only
exists = qs.exists()  # Boolean check

# Iteration
for artifact in qs:
    print(artifact.key, artifact.size)

# Slicing
first_10 = qs[:10]
next_10 = qs[10:20]
```

## Chaining Filters

```python
# Build query incrementally
qs = ln.Artifact.filter(suffix=".h5ad")
qs = qs.filter(size__gt=1e6)
qs = qs.filter(created_at__year=2025)
qs = qs.order_by("-created_at")

# Execute
results = qs.to_dataframe()
```

## Streaming Large Datasets

For datasets too large to fit in memory, use streaming access:

### Streaming Files

```python
# Open file stream
artifact = ln.Artifact.get(key="large_file.csv")

with artifact.open() as f:
    # Read in chunks
    chunk = f.read(10000)  # Read 10KB
    # Process chunk
```

### Array Slicing

For array-based formats (Zarr, HDF5, AnnData):

```python
# Get backing file without loading
artifact = ln.Artifact.get(key="large_data.h5ad")
adata = artifact.backed()  # Returns backed AnnData

# Slice specific portions
subset = adata[:1000, :]  # First 1000 cells
genes_of_interest = adata[:, ["CD4", "CD8A", "CD8B"]]

# Stream batches
for i in range(0, adata.n_obs, 1000):
    batch = adata[i:i+1000, :]
    # Process batch
```

### Iterator Access

```python
# Process large collections incrementally
artifacts = ln.Artifact.filter(suffix=".fastq.gz")

for artifact in artifacts.iterator(chunk_size=10):
    # Process 10 at a time
    path = artifact.cache()
    # Analyze file
```

## Aggregation and Statistics

```python
# Count records
ln.Artifact.filter(suffix=".h5ad").count()

# Distinct values
ln.Artifact.values_list("suffix", flat=True).distinct()

# Aggregation (requires Django ORM knowledge)
from django.db.models import Sum, Avg, Max, Min

# Total size of all artifacts
ln.Artifact.aggregate(Sum("size"))

# Average artifact size by suffix
ln.Artifact.values("suffix").annotate(avg_size=Avg("size"))
```

## Caching and Performance

```python
# Check cache location
ln.settings.cache_dir

# Configure cache (CLI, run in a shell):
#   lamin cache set /path/to/cache

# Clear cache for specific artifact
artifact.delete_cache()

# Get cached path (downloads if needed)
path = artifact.cache()

# Check if cached
if artifact.is_cached():
    path = artifact.cache()
```
|
||||
|
||||
## Organizing Data with Keys
|
||||
|
||||
Best practices for structuring keys:
|
||||
|
||||
```python
|
||||
# Hierarchical organization
|
||||
ln.Artifact("data.h5ad", key="project/experiment/batch1/data.h5ad").save()
|
||||
ln.Artifact("data.h5ad", key="scrna/2025/oct/sample_001.h5ad").save()
|
||||
|
||||
# Browse by prefix
|
||||
ln.Artifact.filter(key__startswith="scrna/2025/oct/").to_dataframe()
|
||||
|
||||
# Version in key (alternative to built-in versioning)
|
||||
ln.Artifact("data.h5ad", key="data/processed/v1/final.h5ad").save()
|
||||
ln.Artifact("data.h5ad", key="data/processed/v2/final.h5ad").save()
|
||||
```
|
||||
|
||||
## Collections
|
||||
|
||||
Group related artifacts into collections:
|
||||
|
||||
```python
|
||||
# Create collection
|
||||
collection = ln.Collection(
|
||||
[artifact1, artifact2, artifact3],
|
||||
name="scRNA-seq batch 1-3",
|
||||
description="Complete dataset across three batches"
|
||||
).save()
|
||||
|
||||
# Access collection members
|
||||
for artifact in collection.artifacts:
|
||||
print(artifact.key)
|
||||
|
||||
# Query collections
|
||||
ln.Collection.filter(name__contains="batch").to_dataframe()
|
||||
```

## Best Practices

1. **Use filters before loading**: Query metadata before accessing file contents
2. **Leverage QuerySets**: Build queries incrementally for complex conditions (see the sketch after this list)
3. **Stream large files**: Don't load entire datasets into memory unnecessarily
4. **Structure keys hierarchically**: Makes browsing and filtering easier
5. **Use search for discovery**: When you don't know exact field values
6. **Cache strategically**: Configure cache location based on storage capacity
7. **Index features**: Define features upfront for efficient feature-based queries
8. **Use collections**: Group related artifacts for dataset-level operations
9. **Order results**: Sort by creation date or other fields for consistent retrieval
10. **Check existence**: Use `exists()` or `one_or_none()` to avoid errors
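
A short sketch of practices 2 and 10: build a QuerySet incrementally, then check existence before fetching:

```python
# Build the query incrementally; nothing hits the database until evaluation
qs = ln.Artifact.filter(suffix=".h5ad")
qs = qs.filter(size__gt=1e8)
qs = qs.order_by("-created_at")

# Check existence instead of letting a lookup raise
if qs.exists():
    latest = qs.first()

# Or fetch at most one match without an exception
maybe = ln.Artifact.filter(key="scrna/2025/oct/sample_001.h5ad").one_or_none()
```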

## Common Query Patterns

```python
# Recent artifacts
ln.Artifact.order_by("-created_at")[:10].to_dataframe()

# My artifacts
me = ln.setup.settings.user
ln.Artifact.filter(created_by=me).to_dataframe()

# Large files (> 1 GB)
ln.Artifact.filter(size__gt=1e9).order_by("-size").to_dataframe()

# This month's data
from datetime import datetime
ln.Artifact.filter(
    created_at__year=2025,
    created_at__month=10
).to_dataframe()

# Validated datasets with specific features
ln.Artifact.filter(
    is_valid=True,
    cell_type__isnull=False
).to_dataframe(include="features")
```
642
skills/lamindb/references/integrations.md
Normal file
@@ -0,0 +1,642 @@

# LaminDB Integrations

This document covers LaminDB integrations with workflow managers, MLOps platforms, visualization tools, and other third-party systems.

## Overview

LaminDB supports extensive integrations across data storage, computational workflows, machine learning platforms, and visualization tools, enabling seamless incorporation into existing data science and bioinformatics pipelines.

## Data Storage Integrations

### Local Filesystem

```python
import lamindb as ln

# Initialize with local storage (CLI, run in a shell):
#   lamin init --storage ./mydata

# Save artifacts to local storage
artifact = ln.Artifact("data.csv", key="local/data.csv").save()

# Load from local storage
data = artifact.load()
```

### AWS S3

```python
# Initialize with S3 storage (CLI, run in a shell):
#   lamin init --storage s3://my-bucket/path \
#     --db postgresql://user:pwd@host:port/db

# Artifacts automatically sync to S3
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()

# Transparent S3 access
data = artifact.load()  # Downloads from S3 if not cached
```

### S3-Compatible Services

Support for MinIO, Cloudflare R2, and other S3-compatible endpoints:

```bash
# Initialize with custom S3 endpoint
lamin init --storage 's3://bucket?endpoint_url=http://minio.example.com:9000'

# Configure credentials
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
```

### Google Cloud Storage

```python
# Install GCP extras (shell):
#   pip install 'lamindb[gcp]'

# Initialize with GCS (CLI, run in a shell):
#   lamin init --storage gs://my-bucket/path \
#     --db postgresql://user:pwd@host:port/db

# Artifacts sync to GCS
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()
```

### HTTP/HTTPS (Read-Only)

```python
# Access remote files without copying
artifact = ln.Artifact(
    "https://example.com/data.csv",
    key="remote/data.csv"
).save()

# Stream remote content
with artifact.open() as f:
    data = f.read()
```

### HuggingFace Datasets

```python
# Access HuggingFace datasets
from datasets import load_dataset

dataset = load_dataset("squad", split="train")

# Register as LaminDB artifact
artifact = ln.Artifact.from_dataframe(
    dataset.to_pandas(),
    key="hf/squad_train.parquet",
    description="SQuAD training data from HuggingFace"
).save()
```

## Workflow Manager Integrations

### Nextflow

Track Nextflow pipeline execution and outputs:

```python
# In your Nextflow process script
import lamindb as ln

# Initialize tracking
ln.track()

# Your Nextflow process logic
input_artifact = ln.Artifact.get(key="${input_key}")
data = input_artifact.load()

# Process data
result = process_data(data)

# Save output
output_artifact = ln.Artifact.from_dataframe(
    result,
    key="${output_key}"
).save()

ln.finish()
```

**Nextflow config example:**
```nextflow
process ANALYZE {
    input:
    val input_key

    output:
    path "result.csv"

    script:
    """
    #!/usr/bin/env python
    import lamindb as ln
    ln.track()
    artifact = ln.Artifact.get(key="${input_key}")
    # Process and save
    ln.finish()
    """
}
```

### Snakemake

Integrate LaminDB into Snakemake workflows:

```python
# In a Snakemake rule
rule process_data:
    input:
        "data/input.csv"
    output:
        "data/output.csv"
    run:
        import lamindb as ln

        ln.track()

        # Load input artifact
        artifact = ln.Artifact.get(key="inputs/data.csv")
        data = artifact.load()

        # Process
        result = analyze(data)

        # Save output
        result.to_csv(output[0])
        ln.Artifact(output[0], key="outputs/result.csv").save()

        ln.finish()
```

### Redun

Track Redun task execution:

```python
from redun import task
import lamindb as ln

@task()
@ln.tracked()
def process_dataset(input_key: str, output_key: str):
    """Redun task with LaminDB tracking."""
    # Load input
    artifact = ln.Artifact.get(key=input_key)
    data = artifact.load()

    # Process
    result = transform(data)

    # Save output
    ln.Artifact.from_dataframe(result, key=output_key).save()

    return output_key

# Redun automatically tracks lineage alongside LaminDB
```

## MLOps Platform Integrations

### Weights & Biases (W&B)

Combine W&B experiment tracking with LaminDB data management:

```python
import wandb
import lamindb as ln

# Initialize both
wandb.init(project="my-project", name="experiment-1")
ln.track(params={"learning_rate": 0.01, "batch_size": 32})

# Load training data
train_artifact = ln.Artifact.get(key="datasets/train.parquet")
train_data = train_artifact.load()

# Train model
model = train_model(train_data)

# Log to W&B
wandb.log({"accuracy": 0.95, "loss": 0.05})

# Save model in LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
    "model.pkl",
    key="models/experiment-1.pkl",
    description=f"Model from W&B run {wandb.run.id}"
).save()

# Link W&B run ID
model_artifact.features.add_values({"wandb_run_id": wandb.run.id})

ln.finish()
wandb.finish()
```

### MLflow

Integrate MLflow model tracking with LaminDB:

```python
import mlflow
import lamindb as ln

# Start runs
mlflow.start_run()
ln.track()

# Log parameters to both
params = {"max_depth": 5, "n_estimators": 100}
mlflow.log_params(params)
ln.context.params = params

# Load data from LaminDB
data_artifact = ln.Artifact.get(key="datasets/features.parquet")
X = data_artifact.load()

# Train and log model
model = train_model(X)
mlflow.sklearn.log_model(model, "model")

# Save to LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
    "model.pkl",
    key=f"models/{mlflow.active_run().info.run_id}.pkl"
).save()

mlflow.end_run()
ln.finish()
```
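
As in the W&B example, it helps to link the MLflow run ID back to the artifact; a minimal sketch (capture the ID before `mlflow.end_run()`):

```python
# Capture the run ID while the run is still active, then link it
run_id = mlflow.active_run().info.run_id
model_artifact.features.add_values({"mlflow_run_id": run_id})
```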

### HuggingFace Transformers

Track model fine-tuning with LaminDB:

```python
from transformers import Trainer, TrainingArguments
import lamindb as ln

ln.track(params={"model": "bert-base", "epochs": 3})

# Load training data
train_artifact = ln.Artifact.get(key="datasets/train_tokenized.parquet")
train_dataset = train_artifact.load()

# Configure trainer (assumes `model` was loaded earlier,
# e.g. via AutoModel.from_pretrained)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Train
trainer.train()

# Save model to LaminDB
trainer.save_model("./model")
model_artifact = ln.Artifact(
    "./model",
    key="models/bert_finetuned",
    description="BERT fine-tuned on custom dataset"
).save()

ln.finish()
```

### scVI-tools

Single-cell analysis with scVI and LaminDB:

```python
import scvi
import lamindb as ln

ln.track()

# Load data
adata_artifact = ln.Artifact.get(key="scrna/raw_counts.h5ad")
adata = adata_artifact.load()

# Setup scVI
scvi.model.SCVI.setup_anndata(adata, layer="counts")

# Train model
model = scvi.model.SCVI(adata)
model.train()

# Save latent representation
adata.obsm["X_scvi"] = model.get_latent_representation()

# Save results
result_artifact = ln.Artifact.from_anndata(
    adata,
    key="scrna/scvi_latent.h5ad",
    description="scVI latent representation"
).save()

ln.finish()
```
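
Beyond the latent embedding, the trained model itself can be versioned; a minimal sketch, assuming scvi-tools' directory-based `model.save()`:

```python
# Persist the trained scVI model and register the directory as an artifact
model.save("scvi_model/", overwrite=True)
ln.Artifact("scvi_model/", key="models/scvi_model").save()
```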

## Array Store Integrations

### TileDB-SOMA

Scalable array storage with cellxgene support:

```python
import tiledbsoma as soma
import lamindb as ln

# Create SOMA experiment
uri = "tiledb://my-namespace/experiment"

with soma.Experiment.create(uri) as exp:
    # Add measurements
    exp.add_new_collection("RNA")

# Register in LaminDB
artifact = ln.Artifact(
    uri,
    key="cellxgene/experiment.soma",
    description="TileDB-SOMA experiment"
).save()

# Query with SOMA
with soma.Experiment.open(uri) as exp:
    obs = exp.obs.read().to_pandas()
```

### DuckDB

Query artifacts with DuckDB:

```python
import duckdb
import lamindb as ln

# Get artifact
artifact = ln.Artifact.get(key="datasets/large_data.parquet")

# Query with DuckDB (without loading full file)
path = artifact.cache()
result = duckdb.query(f"""
    SELECT cell_type, COUNT(*) as count
    FROM read_parquet('{path}')
    GROUP BY cell_type
    ORDER BY count DESC
""").to_df()

# Save query result
result_artifact = ln.Artifact.from_dataframe(
    result,
    key="analysis/cell_type_counts.parquet"
).save()
```

## Visualization Integrations

### Vitessce

Create interactive visualizations:

```python
from vitessce import VitessceConfig
import lamindb as ln

# Load spatial data
artifact = ln.Artifact.get(key="spatial/visium_slide.h5ad")
adata = artifact.load()

# Create Vitessce configuration
vc = VitessceConfig.from_object(adata)

# Save configuration
import json
config_file = "vitessce_config.json"
with open(config_file, "w") as f:
    json.dump(vc.to_dict(), f)

# Register configuration
config_artifact = ln.Artifact(
    config_file,
    key="visualizations/spatial_config.json",
    description="Vitessce visualization config"
).save()
```

## Schema Module Integrations

### Bionty (Biological Ontologies)

```python
import bionty as bt

# Import biological ontologies
bt.CellType.import_source()
bt.Gene.import_source(organism="human")

# Use in data curation
cell_types = bt.CellType.from_values(adata.obs.cell_type)
```

### WetLab

Track wet lab experiments:

```python
# Install wetlab module (shell):
#   pip install 'lamindb[wetlab]'

# Use wetlab registries
import lamindb_wetlab as wetlab

# Track experiments, samples, protocols
experiment = wetlab.Experiment(name="RNA-seq batch 1").save()
```

### Clinical Data (OMOP)

```python
# Install clinical module (shell):
#   pip install 'lamindb[clinical]'

# Use OMOP common data model
import lamindb_clinical as clinical

# Track clinical data
patient = clinical.Patient(patient_id="P001").save()
```

## Git Integration

### Sync with Git Repositories

```python
# Configure git sync (shell):
#   export LAMINDB_SYNC_GIT_REPO=https://github.com/user/repo.git

# Or programmatically
ln.settings.sync_git_repo = "https://github.com/user/repo.git"

# Set development directory (CLI):
#   lamin settings set dev-dir .

# Scripts tracked with git commits
ln.track()  # Automatically captures git commit hash
# ... your code ...
ln.finish()

# View git information
transform = ln.Transform.get(name="analysis.py")
transform.source_code  # Shows code at git commit
transform.hash  # Git commit hash
```

## Enterprise Integrations

### Benchling

Sync with Benchling registries (requires a team/enterprise plan):

```python
# Configure Benchling connection (contact the LaminDB team)
# Syncs schemas and data from Benchling registries

# Access synced Benchling data
# Details available through enterprise support
```

## Custom Integration Patterns

### REST API Integration

```python
import requests
import lamindb as ln

ln.track()

# Fetch from API
response = requests.get("https://api.example.com/data")
data = response.json()

# Convert to DataFrame
import pandas as pd
df = pd.DataFrame(data)

# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
    df,
    key="api/fetched_data.parquet",
    description="Data fetched from external API"
).save()

artifact.features.add_values({"api_url": response.url})

ln.finish()
```

### Database Integration

```python
import pandas as pd
import sqlalchemy as sa
import lamindb as ln

ln.track()

# Connect to external database
engine = sa.create_engine("postgresql://user:pwd@host:port/db")

# Query data
query = "SELECT * FROM experiments WHERE date > '2025-01-01'"
df = pd.read_sql(query, engine)

# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
    df,
    key="external_db/experiments_2025.parquet",
    description="Experiments from external database"
).save()

ln.finish()
```

## Croissant Metadata

Export datasets with Croissant metadata format:

```python
# Create artifact with rich metadata
artifact = ln.Artifact.from_dataframe(
    df,
    key="datasets/published_data.parquet",
    description="Published dataset with Croissant metadata"
).save()

# Export Croissant metadata (requires additional configuration)
# Enables dataset discovery and interoperability
```

## Best Practices for Integrations

1. **Track consistently**: Use `ln.track()` in all integrated workflows
2. **Link IDs**: Store external system IDs (W&B run ID, MLflow experiment ID) as features
3. **Centralize data**: Use LaminDB as single source of truth for data artifacts
4. **Sync parameters**: Log parameters to both LaminDB and ML platforms
5. **Version together**: Keep code (git), data (LaminDB), and experiments (ML platform) in sync
6. **Cache strategically**: Configure appropriate cache locations for cloud storage
7. **Use feature sets**: Link ontology terms from Bionty to artifacts
8. **Document integrations**: Add descriptions explaining integration context
9. **Test incrementally**: Verify integration with small datasets first
10. **Monitor lineage**: Use `view_lineage()` to ensure integration tracking works (see the sketch below)
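
A minimal check that lineage capture worked after an integrated run (assuming the artifact was produced under `ln.track()`):

```python
# Fetch the artifact produced by the integrated workflow and render
# its lineage graph (inputs, transform, run) to verify tracking
artifact = ln.Artifact.get(key="models/experiment-1.pkl")
artifact.view_lineage()
```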

## Troubleshooting Common Issues

**Issue: S3 credentials not found**
```bash
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
```

**Issue: GCS authentication failure**
```bash
gcloud auth application-default login
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```

**Issue: Git sync not working**
```bash
# Ensure git repo is set
lamin settings get sync-git-repo

# Ensure you're in a git repo
git status

# Commit changes before tracking
git add .
git commit -m "Update analysis"
# then, in Python: ln.track()
```

**Issue: MLflow artifacts not syncing**
```python
# Save explicitly to both systems
mlflow.log_artifact("model.pkl")
ln.Artifact("model.pkl", key="models/model.pkl").save()
```
497
skills/lamindb/references/ontologies.md
Normal file
@@ -0,0 +1,497 @@

# LaminDB Ontology Management

This document covers biological ontology management in LaminDB through the Bionty plugin, including accessing, searching, and annotating data with standardized biological terms.

## Overview

LaminDB integrates the `bionty` plugin to manage standardized biological ontologies, enabling consistent metadata curation and data annotation across research projects. Bionty provides access to 20+ curated biological ontologies covering genes, proteins, cell types, tissues, diseases, and more.

## Available Ontologies

LaminDB provides access to multiple curated ontology sources:

| Registry | Ontology Source | Description |
|----------|----------------|-------------|
| **Gene** | Ensembl | Genes across organisms (human, mouse, etc.) |
| **Protein** | UniProt | Protein sequences and annotations |
| **CellType** | Cell Ontology (CL) | Standardized cell type classifications |
| **CellLine** | Cell Line Ontology (CLO) | Cell line annotations |
| **Tissue** | Uberon | Anatomical structures and tissues |
| **Disease** | Mondo, DOID | Disease classifications |
| **Phenotype** | Human Phenotype Ontology (HPO) | Phenotypic abnormalities |
| **Pathway** | Gene Ontology (GO) | Biological pathways and processes |
| **ExperimentalFactor** | Experimental Factor Ontology (EFO) | Experimental variables |
| **DevelopmentalStage** | Various | Developmental stages across organisms |
| **Ethnicity** | HANCESTRO | Human ancestry ontology |
| **Drug** | DrugBank | Drug compounds |
| **Organism** | NCBItaxon | Taxonomic classifications |

## Installation and Import

```python
# Install bionty (included with lamindb); in a shell:
#   pip install lamindb

# Import
import lamindb as ln
import bionty as bt
```

## Importing Public Ontologies

Populate your registry with a public ontology source:

```python
# Import Cell Ontology
bt.CellType.import_source()

# Import organism-specific genes
bt.Gene.import_source(organism="human")
bt.Gene.import_source(organism="mouse")

# Import tissues
bt.Tissue.import_source()

# Import diseases
bt.Disease.import_source(source="mondo")  # Mondo Disease Ontology
bt.Disease.import_source(source="doid")   # Disease Ontology
```

## Searching and Accessing Records

### Keyword Search

```python
# Search cell types
bt.CellType.search("T cell").to_dataframe()
bt.CellType.search("gamma-delta").to_dataframe()

# Search genes
bt.Gene.search("CD8").to_dataframe()
bt.Gene.search("TP53").to_dataframe()

# Search diseases
bt.Disease.search("cancer").to_dataframe()

# Search tissues
bt.Tissue.search("brain").to_dataframe()
```

### Auto-Complete Lookup

For registries with fewer than 100k records:

```python
# Create lookup object
cell_types = bt.CellType.lookup()

# Access by name (auto-complete in IDEs)
t_cell = cell_types.t_cell
hsc = cell_types.hematopoietic_stem_cell

# Similarly for other registries
genes = bt.Gene.lookup()
cd8a = genes.cd8a
```

### Exact Field Matching

```python
# By ontology ID
cell_type = bt.CellType.get(ontology_id="CL:0000798")
disease = bt.Disease.get(ontology_id="MONDO:0004992")

# By name
cell_type = bt.CellType.get(name="T cell")
gene = bt.Gene.get(symbol="CD8A")

# By Ensembl ID
gene = bt.Gene.get(ensembl_gene_id="ENSG00000153563")
```

## Ontological Hierarchies

### Exploring Relationships

```python
# Get a cell type
gdt_cell = bt.CellType.get(ontology_id="CL:0000798")  # gamma-delta T cell

# View direct parents
gdt_cell.parents.to_dataframe()

# Walk up one parent path (ontology terms can have multiple parents)
ancestors = []
current = gdt_cell
while current.parents.exists():
    parent = current.parents.first()
    ancestors.append(parent)
    current = parent

# View direct children
gdt_cell.children.to_dataframe()

# View all descendants recursively
gdt_cell.query_children().to_dataframe()
```
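
Because ontologies are DAGs, a term can have several parents; a minimal breadth-first sketch that collects every ancestor (the helper name `all_ancestors` is illustrative):

```python
def all_ancestors(term):
    """Collect all ancestors of an ontology term across the DAG."""
    seen, frontier = {}, [term]
    while frontier:
        node = frontier.pop()
        for parent in node.parents.all():
            if parent.uid not in seen:
                seen[parent.uid] = parent
                frontier.append(parent)
    return list(seen.values())

ancestors = all_ancestors(gdt_cell)
```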

### Visualizing Hierarchies

```python
# Visualize parent hierarchy
gdt_cell.view_parents()

# Include children in visualization
gdt_cell.view_parents(with_children=True)

# Get all related terms for visualization
t_cell = bt.CellType.get(name="T cell")
t_cell.view_parents(with_children=True)  # Shows T cell subtypes
```

## Standardizing and Validating Data

### Validation

Check if terms exist in the ontology:

```python
# Validate cell types
bt.CellType.validate(["T cell", "B cell", "invalid_cell"])
# Returns: [True, True, False]

# Validate genes
bt.Gene.validate(["CD8A", "TP53", "FAKEGENE"], organism="human")
# Returns: [True, True, False]

# Check which terms are invalid
terms = ["T cell", "fat cell", "neuron", "invalid_term"]
invalid = [t for t, valid in zip(terms, bt.CellType.validate(terms)) if not valid]
print(f"Invalid terms: {invalid}")
```

### Standardization with Synonyms

Convert non-standard terms to validated names:

```python
# Standardize cell type names
bt.CellType.standardize(["fat cell", "blood forming stem cell"])
# Returns: ['adipocyte', 'hematopoietic stem cell']

# Standardize genes
bt.Gene.standardize(["BRCA-1", "p53"], organism="human")
# Returns: ['BRCA1', 'TP53']

# Handle mixed valid/invalid terms
terms = ["T cell", "T lymphocyte", "invalid"]
standardized = bt.CellType.standardize(terms)
# Returns standardized names where possible
```

### Loading Validated Records

```python
# Load records from values (including synonyms)
records = bt.CellType.from_values(["fat cell", "blood forming stem cell"])

# Returns list of CellType records
for record in records:
    print(record.name, record.ontology_id)

# Use with gene symbols
genes = bt.Gene.from_values(["CD8A", "CD8B"], organism="human")
```

## Annotating Datasets

### Annotating AnnData

```python
import anndata as ad
import bionty as bt
import lamindb as ln

# Load example data
adata = ad.read_h5ad("data.h5ad")

# Validate and retrieve matching records
cell_types = bt.CellType.from_values(adata.obs.cell_type)

# Create artifact with annotations
artifact = ln.Artifact.from_anndata(
    adata,
    key="scrna/annotated_data.h5ad",
    description="scRNA-seq data with validated cell type annotations"
).save()

# Link ontology records to artifact
artifact.feature_sets.add_ontology(cell_types)
```

### Annotating DataFrames

```python
import pandas as pd

# Create DataFrame with biological entities
df = pd.DataFrame({
    "cell_type": ["T cell", "B cell", "NK cell"],
    "tissue": ["blood", "spleen", "liver"],
    "disease": ["healthy", "lymphoma", "healthy"]
})

# Validate and standardize
df["cell_type"] = bt.CellType.standardize(df["cell_type"])
df["tissue"] = bt.Tissue.standardize(df["tissue"])

# Create artifact
artifact = ln.Artifact.from_dataframe(
    df,
    key="metadata/sample_info.parquet"
).save()

# Link ontology records
cell_type_records = bt.CellType.from_values(df["cell_type"])
tissue_records = bt.Tissue.from_values(df["tissue"])

artifact.feature_sets.add_ontology(cell_type_records)
artifact.feature_sets.add_ontology(tissue_records)
```

## Managing Custom Terms and Hierarchies

### Adding Custom Terms

```python
# Register new term not in public ontology
my_celltype = bt.CellType(name="my_novel_T_cell_subtype").save()

# Establish parent-child relationship
parent = bt.CellType.get(name="T cell")
my_celltype.parents.add(parent)

# Verify relationship
my_celltype.parents.to_dataframe()
parent.children.to_dataframe()  # Should include my_celltype
```

### Adding Synonyms

```python
# Add synonyms for standardization
hsc = bt.CellType.get(name="hematopoietic stem cell")
hsc.add_synonym("HSC")
hsc.add_synonym("blood stem cell")
hsc.add_synonym("hematopoietic progenitor")

# Set abbreviation
hsc.set_abbr("HSC")

# Now standardization works with synonyms
bt.CellType.standardize(["HSC", "blood stem cell"])
# Returns: ['hematopoietic stem cell', 'hematopoietic stem cell']
```

### Creating Custom Hierarchies

```python
# Build custom cell type hierarchy
immune_cell = bt.CellType.get(name="immune cell")

# Add custom subtypes
my_subtype1 = bt.CellType(name="custom_immune_subtype_1").save()
my_subtype2 = bt.CellType(name="custom_immune_subtype_2").save()

# Link to parent
my_subtype1.parents.add(immune_cell)
my_subtype2.parents.add(immune_cell)

# Create sub-subtypes
my_subsubtype = bt.CellType(name="custom_sub_subtype").save()
my_subsubtype.parents.add(my_subtype1)

# Visualize custom hierarchy
immune_cell.view_parents(with_children=True)
```

## Multi-Organism Support

For organism-aware registries like Gene:

```python
# Set global organism
bt.settings.organism = "human"

# Validate human genes
bt.Gene.validate(["TCF7", "CD8A"], organism="human")

# Load genes for specific organism
human_genes = bt.Gene.from_values(["CD8A", "TP53"], organism="human")
mouse_genes = bt.Gene.from_values(["Cd8a", "Trp53"], organism="mouse")

# Search organism-specific genes
bt.Gene.search("CD8", organism="human").to_dataframe()
bt.Gene.search("Cd8", organism="mouse").to_dataframe()

# Switch organism context
bt.settings.organism = "mouse"
gene = bt.Gene.from_source(symbol="Ap5b1")
```

## Public Ontology Lookup

Access terms from public ontologies without importing:

```python
# Interactive lookup in public sources
cell_types_public = bt.CellType.lookup(public=True)

# Access public terms
hepatocyte = cell_types_public.hepatocyte

# Import specific term
hepatocyte_local = bt.CellType.from_source(name="hepatocyte")

# Or import by ontology ID
specific_cell = bt.CellType.from_source(ontology_id="CL:0000182")
```

## Version Tracking

LaminDB automatically tracks ontology versions:

```python
# View current source versions
bt.Source.filter(currently_used=True).to_dataframe()

# Check which source a record derives from
cell_type = bt.CellType.get(name="hepatocyte")
cell_type.source  # Returns Source metadata

# View source details
source = cell_type.source
print(source.name)     # e.g., "cl"
print(source.version)  # e.g., "2023-05-18"
print(source.url)      # Ontology URL
```

## Ontology Integration Workflows

### Workflow 1: Validate Existing Data

```python
import anndata as ad
import bionty as bt

# Load data with biological annotations
adata = ad.read_h5ad("uncurated_data.h5ad")

# Validate cell types
validation = bt.CellType.validate(adata.obs.cell_type)

# Identify invalid terms
invalid_idx = [i for i, v in enumerate(validation) if not v]
invalid_terms = adata.obs.cell_type.iloc[invalid_idx].unique()
print(f"Invalid cell types: {invalid_terms}")

# Fix invalid terms manually or with standardization
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs.cell_type)

# Re-validate
validation = bt.CellType.validate(adata.obs.cell_type)
assert all(validation), "All terms should now be valid"
```

### Workflow 2: Curate and Annotate

```python
import pandas as pd
import bionty as bt
import lamindb as ln

ln.track()  # Start tracking

# Load data
df = pd.read_csv("experimental_data.csv")

# Standardize using ontologies
df["cell_type"] = bt.CellType.standardize(df["cell_type"])
df["tissue"] = bt.Tissue.standardize(df["tissue"])

# Create curated artifact
artifact = ln.Artifact.from_dataframe(
    df,
    key="curated/experiment_2025_10.parquet",
    description="Curated experimental data with ontology-validated annotations"
).save()

# Link ontology records
artifact.feature_sets.add_ontology(bt.CellType.from_values(df["cell_type"]))
artifact.feature_sets.add_ontology(bt.Tissue.from_values(df["tissue"]))

ln.finish()  # Complete tracking
```

### Workflow 3: Cross-Organism Gene Mapping

```python
# Get human genes
human_genes = ["CD8A", "CD8B", "TP53"]
human_records = bt.Gene.from_values(human_genes, organism="human")

# Find mouse orthologs (requires external mapping);
# LaminDB doesn't provide built-in ortholog mapping,
# so use external tools like Ensembl BioMart or HomoloGene

mouse_orthologs = ["Cd8a", "Cd8b", "Trp53"]
mouse_records = bt.Gene.from_values(mouse_orthologs, organism="mouse")
```
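
A minimal sketch of maintaining that mapping explicitly (the entries mirror the example symbols and would in practice come from a BioMart export):

```python
# Explicit human-to-mouse ortholog map; fill from an Ensembl BioMart
# export rather than by hand for real analyses
ortholog_map = {"CD8A": "Cd8a", "CD8B": "Cd8b", "TP53": "Trp53"}
mouse_symbols = [ortholog_map[symbol] for symbol in human_genes]
mouse_records = bt.Gene.from_values(mouse_symbols, organism="mouse")
```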

## Querying Ontology-Annotated Data

```python
# Find all datasets with specific cell type
t_cell = bt.CellType.get(name="T cell")
ln.Artifact.filter(feature_sets__cell_types=t_cell).to_dataframe()

# Find datasets measuring specific genes
cd8a = bt.Gene.get(symbol="CD8A", organism="human")
ln.Artifact.filter(feature_sets__genes=cd8a).to_dataframe()

# Query across the ontology hierarchy:
# find all datasets with T cell or T cell subtypes
t_cell_subtypes = t_cell.query_children()
ln.Artifact.filter(
    feature_sets__cell_types__in=t_cell_subtypes
).to_dataframe()
```

## Best Practices

1. **Import ontologies first**: Call `import_source()` before validation
2. **Use standardization**: Leverage synonym mapping to handle variations
3. **Validate early**: Check terms before creating artifacts
4. **Set organism context**: Specify organism for gene-related queries
5. **Add custom synonyms**: Register common variations in your domain
6. **Use public lookup**: Access `lookup(public=True)` for term discovery
7. **Track versions**: Monitor ontology source versions for reproducibility
8. **Build hierarchies**: Link custom terms to existing ontology structures
9. **Query hierarchically**: Use `query_children()` for comprehensive searches
10. **Document mappings**: Track custom term additions and relationships

## Common Ontology Operations

```python
# Check if term exists
exists = bt.CellType.filter(name="T cell").exists()

# Count terms in registry
n_cell_types = bt.CellType.filter().count()

# Get all terms with specific parent
immune_cells = bt.CellType.filter(parents__name="immune cell")

# Find orphan terms (no parents)
orphans = bt.CellType.filter(parents__isnull=True)

# Get recently added terms
from datetime import datetime, timedelta
recent = bt.CellType.filter(
    created_at__gte=datetime.now() - timedelta(days=7)
)
```
733
skills/lamindb/references/setup-deployment.md
Normal file
@@ -0,0 +1,733 @@

# LaminDB Setup & Deployment

This document covers installation, configuration, instance management, storage options, and deployment strategies for LaminDB.

## Installation

### Basic Installation

```bash
# Install LaminDB
pip install lamindb

# Or with pip3
pip3 install lamindb
```

### Installation with Extras

Install optional dependencies for specific functionality:

```bash
# Google Cloud Platform support
pip install 'lamindb[gcp]'

# Flow cytometry formats
pip install 'lamindb[fcs]'

# Array storage and streaming (Zarr support)
pip install 'lamindb[zarr]'

# AWS S3 support (usually included by default)
pip install 'lamindb[aws]'

# Multiple extras
pip install 'lamindb[gcp,zarr,fcs]'
```

### Module Plugins

```bash
# Biological ontologies (Bionty)
pip install bionty

# Wet lab functionality
pip install lamindb-wetlab

# Clinical data (OMOP CDM)
pip install lamindb-clinical
```

### Verify Installation

```python
import lamindb as ln
print(ln.__version__)

# Check available modules
import bionty as bt
print(bt.__version__)
```

## Authentication

### Creating an Account

1. Visit https://lamin.ai
2. Sign up for a free account
3. Navigate to account settings to generate an API key

### Logging In

```bash
# Login with API key
lamin login

# You'll be prompted to enter your API key
# API key is stored locally at ~/.lamin/
```

### Authentication Details

**Data Privacy:** LaminDB authentication only collects basic metadata (email, user information). Your actual data remains private and is not sent to LaminDB servers.

**Local vs Cloud:** Authentication is required even for local-only usage to enable collaboration features and instance management.

## Instance Initialization

### Local SQLite Instance

For local development and small datasets:

```bash
# Initialize in current directory
lamin init --storage ./mydata

# Initialize in specific directory
lamin init --storage /path/to/data

# Initialize with specific modules
lamin init --storage ./mydata --modules bionty

# Initialize with multiple modules
lamin init --storage ./mydata --modules bionty,wetlab
```

### Cloud Storage with SQLite

Use cloud storage but a local SQLite database:

```bash
# AWS S3
lamin init --storage s3://my-bucket/path

# Google Cloud Storage
lamin init --storage gs://my-bucket/path

# S3-compatible (MinIO, Cloudflare R2)
lamin init --storage 's3://bucket?endpoint_url=http://endpoint:9000'
```

### Cloud Storage with PostgreSQL

For production deployments:

```bash
# S3 + PostgreSQL
lamin init --storage s3://my-bucket/path \
  --db postgresql://user:password@hostname:5432/dbname \
  --modules bionty

# GCS + PostgreSQL
lamin init --storage gs://my-bucket/path \
  --db postgresql://user:password@hostname:5432/dbname \
  --modules bionty
```

### Instance Naming

```bash
# Specify instance name
lamin init --storage ./mydata --name my-project

# Default name uses directory name
lamin init --storage ./mydata  # Instance name: "mydata"
```

## Connecting to Instances

### Connect to Your Own Instance

```bash
# By name
lamin connect my-project

# By full path
lamin connect account_handle/my-project
```

### Connect to Shared Instance

```bash
# Connect to someone else's instance
lamin connect other-user/their-project

# Requires appropriate permissions
```

### Switching Between Instances

```bash
# List available instances
lamin info

# Switch instance
lamin connect another-instance

# Close current instance
lamin close
```

## Storage Configuration

### Local Storage

**Advantages:**
- Fast access
- No internet required
- Simple setup

**Setup:**
```bash
lamin init --storage ./data
```

### AWS S3 Storage

**Advantages:**
- Scalable
- Collaborative
- Durable

**Setup:**
```bash
# Set credentials
export AWS_ACCESS_KEY_ID=your_key_id
export AWS_SECRET_ACCESS_KEY=your_secret_key
export AWS_DEFAULT_REGION=us-east-1

# Initialize
lamin init --storage s3://my-bucket/project-data \
  --db postgresql://user:pwd@host:5432/db
```

**S3 Permissions Required:**
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket/*",
        "arn:aws:s3:::my-bucket"
      ]
    }
  ]
}
```

### Google Cloud Storage

**Setup:**
```bash
# Authenticate
gcloud auth application-default login

# Or use a service account
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json

# Initialize
lamin init --storage gs://my-bucket/project-data \
  --db postgresql://user:pwd@host:5432/db
```

### S3-Compatible Storage

For MinIO, Cloudflare R2, or other S3-compatible services:

```bash
# MinIO example
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin

lamin init --storage 's3://my-bucket?endpoint_url=http://minio.example.com:9000'

# Cloudflare R2 example
export AWS_ACCESS_KEY_ID=your_r2_access_key
export AWS_SECRET_ACCESS_KEY=your_r2_secret_key

lamin init --storage 's3://bucket?endpoint_url=https://account-id.r2.cloudflarestorage.com'
```

## Database Configuration

### SQLite (Default)

**Advantages:**
- No separate database server
- Simple setup
- Good for development

**Limitations:**
- Not suitable for concurrent writes
- Limited scalability

**Setup:**
```bash
# SQLite is the default
lamin init --storage ./data
# Database stored at ./data/.lamindb/
```

### PostgreSQL

**Advantages:**
- Production-ready
- Concurrent access
- Better performance at scale

**Setup:**
```bash
# Full connection string
lamin init --storage s3://bucket/path \
  --db postgresql://username:password@hostname:5432/database

# With SSL
lamin init --storage s3://bucket/path \
  --db "postgresql://user:pwd@host:5432/db?sslmode=require"
```

**PostgreSQL Versions:** Compatible with PostgreSQL 12+

### Database Schema Management

```bash
# Check current schema version
lamin migrate check

# Upgrade schema
lamin migrate deploy

# View migration history
lamin migrate history
```

## Cache Configuration

### Cache Directory

LaminDB maintains a local cache for cloud files:

```python
import lamindb as ln

# View cache location
print(ln.settings.cache_dir)
```

### Configure Cache Location

```bash
# Set cache directory
lamin cache set /path/to/cache

# View current cache settings
lamin cache get
```

### System-Wide Cache (Multi-User)

For shared systems with multiple users:

```bash
# Create system settings file
sudo mkdir -p /system/settings
sudo nano /system/settings/system.env
```

Add to `system.env`:
```bash
lamindb_cache_path=/shared/cache/lamindb
```

Ensure permissions:
```bash
sudo chmod 755 /shared/cache/lamindb
sudo chown -R shared-user:shared-group /shared/cache/lamindb
```

### Cache Management

```python
import lamindb as ln

# Clear cache for specific artifact
artifact = ln.Artifact.get(key="data.h5ad")
artifact.delete_cache()

# Check if artifact is cached
if artifact.is_cached():
    print("Already cached")

# Manually clear the entire cache
import shutil
shutil.rmtree(ln.settings.cache_dir)
```

## Settings Management

### View Current Settings

```python
import lamindb as ln

# User settings
print(ln.setup.settings.user)
# User(handle='username', email='user@email.com', name='Full Name')

# Instance settings
print(ln.setup.settings.instance)
# Instance(name='my-project', storage='s3://bucket/path')
```

### Configure Settings

```bash
# Set development directory for relative keys
lamin settings set dev-dir /path/to/project

# Configure git sync
lamin settings set sync-git-repo https://github.com/user/repo.git

# View all settings
lamin settings
```

### Environment Variables

```bash
# Cache directory
export LAMIN_CACHE_DIR=/path/to/cache

# Settings directory
export LAMIN_SETTINGS_DIR=/path/to/settings

# Git sync
export LAMINDB_SYNC_GIT_REPO=https://github.com/user/repo.git
```

## Instance Management

### Viewing Instance Information

```bash
# Current instance info
lamin info

# List all instances
lamin ls

# View instance details
lamin instance details
```

### Instance Collaboration

```bash
# Set instance visibility (requires LaminHub)
lamin instance set-visibility public
lamin instance set-visibility private

# Invite collaborators (requires LaminHub)
lamin instance invite user@email.com
```

### Instance Migration

```bash
# Backup instance
lamin backup create

# Restore from backup
lamin backup restore backup_id

# Export instance metadata
lamin export instance-metadata.json
```

### Deleting Instances

```bash
# Delete instance (preserves data, removes metadata)
lamin delete --force instance-name

# This only removes the LaminDB metadata;
# actual data in the storage location remains
```

## Production Deployment Patterns

### Pattern 1: Local Development → Cloud Production

**Development:**
```bash
# Local development
lamin init --storage ./dev-data --modules bionty
```

**Production:**
```bash
# Cloud production
lamin init --storage s3://prod-bucket/data \
  --db postgresql://user:pwd@db-host:5432/prod-db \
  --modules bionty \
  --name production
```

**Migration:** Export artifacts from dev, import to prod
```python
from pathlib import Path
import lamindb as ln

# Export from dev
artifacts = ln.Artifact.filter().all()
for artifact in artifacts:
    artifact.export("/tmp/export/")

# Switch to prod (CLI, run in a shell):
#   lamin connect production

# Import to prod
for file in Path("/tmp/export/").glob("*"):
    ln.Artifact(str(file), key=file.name).save()
```

### Pattern 2: Multi-Region Deployment

Deploy instances in multiple regions for data sovereignty:

```bash
# US instance
lamin init --storage s3://us-bucket/data \
  --db postgresql://user:pwd@us-db:5432/db \
  --name us-production

# EU instance
lamin init --storage s3://eu-bucket/data \
  --db postgresql://user:pwd@eu-db:5432/db \
  --name eu-production
```

### Pattern 3: Shared Storage, Personal Instances

Multiple users, shared data:

```bash
# Shared storage with user-specific DB
lamin init --storage s3://shared-bucket/data \
  --db postgresql://user1:pwd@host:5432/user1_db \
  --name user1-workspace

lamin init --storage s3://shared-bucket/data \
  --db postgresql://user2:pwd@host:5432/user2_db \
  --name user2-workspace
```

## Performance Optimization

### Database Performance

```python
# Use connection pooling for PostgreSQL
# (configure in your database server settings)

# Optimize queries with indexes
# (LaminDB creates indexes automatically for common queries)
```
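
Since LaminDB QuerySets build on the Django ORM (as the aggregation examples show), Django-style prefetching is a reasonable optimization sketch; treating `select_related` as available here is an assumption carried over from Django:

```python
import lamindb as ln

# Fetch artifacts together with their creating run in one SQL query
# instead of issuing one extra query per artifact
artifacts = ln.Artifact.filter(suffix=".h5ad").select_related("run")
for artifact in artifacts:
    print(artifact.key, artifact.run)
```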

### Storage Performance

```bash
# Use appropriate storage classes:
# S3 STANDARD for frequent access, INTELLIGENT_TIERING for mixed access

# Configure multipart upload thresholds
export AWS_CLI_FILE_IO_BANDWIDTH=100MB
```

### Cache Optimization

```python
# Pre-cache frequently used artifacts
artifacts = ln.Artifact.filter(key__startswith="reference/")
for artifact in artifacts:
    artifact.cache()  # Download to cache

# Use backed mode for large arrays
adata = artifact.backed()  # Don't load into memory
```

## Security Best Practices

1. **Credentials Management:**
   - Use environment variables, not hardcoded credentials (see the sketch after this list)
   - Use IAM roles on AWS/GCP instead of access keys
   - Rotate credentials regularly

2. **Access Control:**
   - Use PostgreSQL for multi-user access control
   - Configure storage bucket policies
   - Enable audit logging

3. **Network Security:**
   - Use SSL/TLS for database connections
   - Use VPCs for cloud deployments
   - Restrict IP addresses when possible

4. **Data Protection:**
   - Enable encryption at rest (S3, GCS)
   - Use encryption in transit (HTTPS, SSL)
   - Implement backup strategies
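
A minimal sketch of the first point: build the PostgreSQL connection string from environment variables instead of hardcoding secrets (the variable names are illustrative):

```python
import os

# Read secrets from the environment; fail loudly if they are missing
db_user = os.environ["LAMIN_DB_USER"]
db_password = os.environ["LAMIN_DB_PASSWORD"]
db_host = os.environ.get("LAMIN_DB_HOST", "localhost")

db_url = f"postgresql://{db_user}:{db_password}@{db_host}:5432/lamin"
# Pass db_url to `lamin init --db ...` instead of typing credentials inline
```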

## Monitoring and Maintenance

### Health Checks

```python
import lamindb as ln

# Check database connection
try:
    ln.Artifact.filter().count()
    print("✓ Database connected")
except Exception as e:
    print(f"✗ Database error: {e}")

# Check storage access
try:
    test_artifact = ln.Artifact("test.txt", key="healthcheck.txt").save()
    test_artifact.delete(permanent=True)
    print("✓ Storage accessible")
except Exception as e:
    print(f"✗ Storage error: {e}")
```

### Logging

```python
# Enable debug logging
import logging
logging.basicConfig(level=logging.DEBUG)

# LaminDB operations will produce detailed logs
```

### Backup Strategy

```bash
# Regular database backups (PostgreSQL)
pg_dump -h hostname -U username -d database > backup_$(date +%Y%m%d).sql

# Storage backups (S3 versioning)
aws s3api put-bucket-versioning \
  --bucket my-bucket \
  --versioning-configuration Status=Enabled

# Metadata export
lamin export metadata_backup.json
```

## Troubleshooting

### Common Issues

**Issue: Cannot connect to instance**
```bash
# Check instance exists
lamin ls

# Verify authentication
lamin login

# Re-connect
lamin connect instance-name
```

**Issue: Storage permissions denied**
```bash
# Check AWS credentials
aws s3 ls s3://your-bucket/

# Check GCS credentials
gsutil ls gs://your-bucket/

# Verify IAM permissions
```

**Issue: Database connection error**
```bash
# Test PostgreSQL connection
psql postgresql://user:pwd@host:5432/db

# Check database version compatibility
lamin migrate check
```

**Issue: Cache full**
```python
# Clear cache
import lamindb as ln
import shutil
shutil.rmtree(ln.settings.cache_dir)

# Then set a larger cache location (CLI, run in a shell):
#   lamin cache set /larger/disk/cache
```

## Upgrade and Migration

### Upgrading LaminDB

```bash
# Upgrade to latest version
pip install --upgrade lamindb

# Upgrade database schema
lamin migrate deploy
```

### Schema Compatibility

Check the compatibility matrix to ensure your database schema version is compatible with your installed LaminDB version.

### Breaking Changes

Major version upgrades may require migration:

```bash
# Check for breaking changes
lamin migrate check

# Review migration plan
lamin migrate plan

# Execute migration
lamin migrate deploy
```

## Best Practices

1. **Start local, scale cloud**: Develop locally, deploy to cloud for production
2. **Use PostgreSQL for production**: SQLite is only for development
3. **Configure appropriate cache**: Size the cache based on your working set
4. **Enable versioning**: Use S3/GCS versioning for data protection
5. **Monitor costs**: Track storage and compute costs in cloud deployments
6. **Document configuration**: Keep infrastructure-as-code for reproducibility
7. **Test backups**: Regularly verify backup and restore procedures
8. **Set up monitoring**: Implement health checks and alerting
9. **Use modules strategically**: Only install needed plugins to reduce complexity
10. **Plan for scale**: Consider concurrent users and data growth