LaminDB Integrations
This document covers LaminDB integrations with workflow managers, MLOps platforms, visualization tools, and other third-party systems.
Overview
LaminDB supports extensive integrations across data storage, computational workflows, machine learning platforms, and visualization tools, enabling seamless incorporation into existing data science and bioinformatics pipelines.
Data Storage Integrations
Local Filesystem
# Initialize with local storage (CLI)
lamin init --storage ./mydata
# Save artifacts to local storage
import lamindb as ln
artifact = ln.Artifact("data.csv", key="local/data.csv").save()
# Load from local storage
data = artifact.load()
AWS S3
# Initialize with S3 storage
lamin init --storage s3://my-bucket/path \
--db postgresql://user:pwd@host:port/db
# Artifacts automatically sync to S3
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()
# Transparent S3 access
data = artifact.load() # Downloads from S3 if not cached
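Cloud artifacts are cached locally on first access, and the cache location can be managed from the CLI. A brief sketch, assuming the `lamin cache` command group is available in your installation; the scratch path is illustrative:
# Point the local cache at a large scratch volume (illustrative path)
lamin cache set /scratch/lamindb_cache
# Free up disk space by clearing cached copies
lamin cache clear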
S3-Compatible Services
Support for MinIO, Cloudflare R2, and other S3-compatible endpoints:
# Initialize with custom S3 endpoint
lamin init --storage 's3://bucket?endpoint_url=http://minio.example.com:9000'
# Configure credentials
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
Google Cloud Storage
# Install GCP extras
pip install 'lamindb[gcp]'
# Initialize with GCS
lamin init --storage gs://my-bucket/path \
--db postgresql://user:pwd@host:port/db
# Artifacts sync to GCS
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()
HTTP/HTTPS (Read-Only)
# Access remote files without copying
artifact = ln.Artifact(
"https://example.com/data.csv",
key="remote/data.csv"
).save()
# Stream remote content
with artifact.open() as f:
    data = f.read()
HuggingFace Datasets
# Access HuggingFace datasets
from datasets import load_dataset
dataset = load_dataset("squad", split="train")
# Register as LaminDB artifact
artifact = ln.Artifact.from_dataframe(
dataset.to_pandas(),
key="hf/squad_train.parquet",
description="SQuAD training data from HuggingFace"
).save()
Workflow Manager Integrations
Nextflow
Track Nextflow pipeline execution and outputs:
# In your Nextflow process script
import lamindb as ln
# Initialize tracking
ln.track()
# Your Nextflow process logic
input_artifact = ln.Artifact.get(key="${input_key}")
data = input_artifact.load()
# Process data
result = process_data(data)
# Save output
output_artifact = ln.Artifact.from_dataframe(
result,
key="${output_key}"
).save()
ln.finish()
Nextflow config example:
process ANALYZE {
    input:
    val input_key
    output:
    path "result.csv"
    script:
    """
    #!/usr/bin/env python
    import lamindb as ln
    ln.track()
    artifact = ln.Artifact.get(key="${input_key}")
    # Process and save
    ln.finish()
    """
}
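Instead of tracking inside every process, a small post-run script can register the pipeline's published outputs in one pass. A minimal sketch, assuming results were published to ./results; the directory and key prefix are illustrative:
from pathlib import Path
import lamindb as ln
# Track this registration script itself as a transform
ln.track()
# Register each published CSV as an artifact
for path in Path("./results").glob("*.csv"):
    ln.Artifact(path, key=f"nextflow/{path.name}").save()
ln.finish()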
Snakemake
Integrate LaminDB into Snakemake workflows:
# In Snakemake rule
rule process_data:
    input:
        "data/input.csv"
    output:
        "data/output.csv"
    run:
        import lamindb as ln
        ln.track()
        # Load input artifact
        artifact = ln.Artifact.get(key="inputs/data.csv")
        data = artifact.load()
        # Process
        result = analyze(data)
        # Save output
        result.to_csv(output[0])
        ln.Artifact(output[0], key="outputs/result.csv").save()
        ln.finish()
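Artifact keys can also be derived from Snakemake wildcards so that every expanded job registers its own output. A minimal sketch, with the sample wildcard and key prefix chosen for illustration:
rule register_sample:
    input:
        "data/{sample}.csv"
    output:
        "results/{sample}.parquet"
    run:
        import pandas as pd
        import lamindb as ln
        ln.track()
        df = pd.read_csv(input[0])
        df.to_parquet(output[0])
        # The wildcard keeps each sample's artifact key unique
        ln.Artifact(output[0], key=f"outputs/{wildcards.sample}.parquet").save()
        ln.finish()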
Redun
Track Redun task execution:
from redun import task
import lamindb as ln
@task()
@ln.tracked()
def process_dataset(input_key: str, output_key: str):
    """Redun task with LaminDB tracking."""
    # Load input
    artifact = ln.Artifact.get(key=input_key)
    data = artifact.load()
    # Process
    result = transform(data)
    # Save output
    ln.Artifact.from_dataframe(result, key=output_key).save()
    return output_key
# Redun automatically tracks lineage alongside LaminDB
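The task runs like any other redun task, via the redun CLI or a scheduler instance. A minimal sketch using the Python scheduler with its default configuration; the keys are illustrative:
from redun import Scheduler
scheduler = Scheduler()
result_key = scheduler.run(
    process_dataset("datasets/raw.parquet", "datasets/processed.parquet")
)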
MLOps Platform Integrations
Weights & Biases (W&B)
Combine W&B experiment tracking with LaminDB data management:
import wandb
import lamindb as ln
# Initialize both
wandb.init(project="my-project", name="experiment-1")
ln.track(params={"learning_rate": 0.01, "batch_size": 32})
# Load training data
train_artifact = ln.Artifact.get(key="datasets/train.parquet")
train_data = train_artifact.load()
# Train model
model = train_model(train_data)
# Log to W&B
wandb.log({"accuracy": 0.95, "loss": 0.05})
# Save model in LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
"model.pkl",
key="models/experiment-1.pkl",
description=f"Model from W&B run {wandb.run.id}"
).save()
# Link the W&B run ID (register the feature once, then attach the value)
ln.Feature(name="wandb_run_id", dtype=str).save()
model_artifact.features.add_values({"wandb_run_id": wandb.run.id})
ln.finish()
wandb.finish()
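Because the run ID is stored as a feature, the model artifact can later be looked up by it. A minimal sketch, assuming the wandb_run_id feature was registered as above; the run ID is illustrative:
import joblib
import lamindb as ln
# Retrieve the model that was linked to a given W&B run
model_artifact = ln.Artifact.filter(wandb_run_id="abc123").one()
model = joblib.load(model_artifact.cache())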
MLflow
Integrate MLflow model tracking with LaminDB:
import mlflow
import lamindb as ln
# Define shared parameters
params = {"max_depth": 5, "n_estimators": 100}
# Start runs and log parameters to both systems
mlflow.start_run()
mlflow.log_params(params)
ln.track(params=params)
# Load data from LaminDB
data_artifact = ln.Artifact.get(key="datasets/features.parquet")
X = data_artifact.load()
# Train and log model
model = train_model(X)
mlflow.sklearn.log_model(model, "model")
# Save to LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
"model.pkl",
key=f"models/{mlflow.active_run().info.run_id}.pkl"
).save()
mlflow.end_run()
ln.finish()
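The registered model can later be restored from LaminDB without going through the MLflow tracking server. A minimal sketch, with the run ID hard-coded for illustration:
import joblib
import lamindb as ln
run_id = "abc123"  # the MLflow run ID used in the artifact key above
model_artifact = ln.Artifact.get(key=f"models/{run_id}.pkl")
model = joblib.load(model_artifact.cache())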
HuggingFace Transformers
Track model fine-tuning with LaminDB:
from datasets import Dataset
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import lamindb as ln
ln.track(params={"model": "bert-base-uncased", "epochs": 3})
# Load the base model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
# Load training data and convert to a HuggingFace Dataset
train_artifact = ln.Artifact.get(key="datasets/train_tokenized.parquet")
train_dataset = Dataset.from_pandas(train_artifact.load())
# Configure trainer
training_args = TrainingArguments(
output_dir="./results",
num_train_epochs=3,
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
)
# Train
trainer.train()
# Save model to LaminDB
trainer.save_model("./model")
model_artifact = ln.Artifact(
"./model",
key="models/bert_finetuned",
description="BERT fine-tuned on custom dataset"
).save()
ln.finish()
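The fine-tuned weights can be reloaded straight from the LaminDB cache. A minimal sketch:
from transformers import AutoModelForSequenceClassification
import lamindb as ln
model_artifact = ln.Artifact.get(key="models/bert_finetuned")
local_dir = model_artifact.cache()  # downloads the folder artifact if not cached
model = AutoModelForSequenceClassification.from_pretrained(str(local_dir))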
scVI-tools
Single-cell analysis with scVI and LaminDB:
import scvi
import lamindb as ln
ln.track()
# Load data
adata_artifact = ln.Artifact.get(key="scrna/raw_counts.h5ad")
adata = adata_artifact.load()
# Setup scVI
scvi.model.SCVI.setup_anndata(adata, layer="counts")
# Train model
model = scvi.model.SCVI(adata)
model.train()
# Save latent representation
adata.obsm["X_scvi"] = model.get_latent_representation()
# Save results
result_artifact = ln.Artifact.from_anndata(
adata,
key="scrna/scvi_latent.h5ad",
description="scVI latent representation"
).save()
ln.finish()
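Beyond the latent representation, the trained scVI model itself can be registered. A minimal sketch, with the directory name chosen for illustration:
# Persist the trained model weights and register the folder as an artifact
model.save("scvi_model", overwrite=True)
model_artifact = ln.Artifact(
    "scvi_model",
    key="models/scvi_model",
    description="Trained scVI model",
).save()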
Array Store Integrations
TileDB-SOMA
Scalable array storage with cellxgene support:
import tiledbsoma as soma
import lamindb as ln
# Create SOMA experiment
uri = "tiledb://my-namespace/experiment"
with soma.Experiment.create(uri) as exp:
    # Add measurements
    exp.add_new_collection("RNA")
# Register in LaminDB
artifact = ln.Artifact(
uri,
key="cellxgene/experiment.soma",
description="TileDB-SOMA experiment"
).save()
# Query with SOMA
with soma.Experiment.open(uri) as exp:
    obs = exp.obs.read().concat().to_pandas()
DuckDB
Query artifacts with DuckDB:
import duckdb
import lamindb as ln
# Get artifact
artifact = ln.Artifact.get(key="datasets/large_data.parquet")
# Query with DuckDB (without loading the full file into memory)
path = artifact.cache()
result = duckdb.query(f"""
SELECT cell_type, COUNT(*) as count
FROM read_parquet('{path}')
GROUP BY cell_type
ORDER BY count DESC
""").to_df()
# Save query result
result_artifact = ln.Artifact.from_dataframe(
result,
key="analysis/cell_type_counts.parquet"
).save()
Visualization Integrations
Vitessce
Create interactive visualizations:
from vitessce import VitessceConfig
import lamindb as ln
# Load spatial data
artifact = ln.Artifact.get(key="spatial/visium_slide.h5ad")
adata = artifact.load()
# Create Vitessce configuration
vc = VitessceConfig.from_object(adata)
# Save configuration
import json
config_file = "vitessce_config.json"
with open(config_file, "w") as f:
    json.dump(vc.to_dict(), f)
# Register configuration
config_artifact = ln.Artifact(
config_file,
key="visualizations/spatial_config.json",
description="Vitessce visualization config"
).save()
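LaminDB also ships a helper that registers a VitessceConfig directly and links it to the datasets it references. A brief sketch, assuming save_vitessce_config is available in your lamindb version:
from lamindb.integrations import save_vitessce_config
# Registers the config as an artifact and links the referenced datasets
config_artifact = save_vitessce_config(vc, description="Vitessce visualization config")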
Schema Module Integrations
Bionty (Biological Ontologies)
import bionty as bt
# Import biological ontologies
bt.CellType.import_source()
bt.Gene.import_source(organism="human")
# Use in data curation
cell_types = bt.CellType.from_values(adata.obs.cell_type)
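Ontology registries are typically used to validate and standardize metadata before it is linked to artifacts. A minimal sketch, assuming adata holds a cell_type column:
import bionty as bt
# Flag values that are not found in the CellType ontology
validated = bt.CellType.validate(adata.obs.cell_type)
# Map synonyms to canonical ontology names
adata.obs["cell_type"] = bt.CellType.standardize(adata.obs.cell_type)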
WetLab
Track wet lab experiments:
# Install wetlab module
pip install 'lamindb[wetlab]'
# Use wetlab registries
import lamindb_wetlab as wetlab
# Track experiments, samples, protocols
experiment = wetlab.Experiment(name="RNA-seq batch 1").save()
Clinical Data (OMOP)
# Install clinical module
pip install 'lamindb[clinical]'
# Use OMOP common data model
import lamindb_clinical as clinical
# Track clinical data
patient = clinical.Patient(patient_id="P001").save()
Git Integration
Sync with Git Repositories
# Configure git sync
export LAMINDB_SYNC_GIT_REPO=https://github.com/user/repo.git
# Or programmatically
ln.settings.sync_git_repo = "https://github.com/user/repo.git"
# Set development directory
lamin settings set dev-dir .
# Scripts tracked with git commits
ln.track() # Automatically captures git commit hash
# ... your code ...
ln.finish()
# View tracked code information
transform = ln.Transform.get(key="analysis.py")
transform.source_code  # source code captured at tracking time
transform.reference    # link to the script at the synced git commit
Enterprise Integrations
Benchling
Sync with Benchling registries (requires team/enterprise plan):
# Configure Benchling connection (contact LaminDB team)
# Syncs schemas and data from Benchling registries
# Access synced Benchling data
# Details available through enterprise support
Custom Integration Patterns
REST API Integration
import requests
import lamindb as ln
ln.track()
# Fetch from API
response = requests.get("https://api.example.com/data")
data = response.json()
# Convert to DataFrame
import pandas as pd
df = pd.DataFrame(data)
# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
df,
key="api/fetched_data.parquet",
description="Data fetched from external API"
).save()
ln.Feature(name="api_url", dtype=str).save()  # register the feature once
artifact.features.add_values({"api_url": response.url})
ln.finish()
Database Integration
import pandas as pd
import sqlalchemy as sa
import lamindb as ln
ln.track()
# Connect to external database
engine = sa.create_engine("postgresql://user:pwd@host:port/db")
# Query data
query = "SELECT * FROM experiments WHERE date > '2025-01-01'"
df = pd.read_sql(query, engine)
# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
df,
key="external_db/experiments_2025.parquet",
description="Experiments from external database"
).save()
ln.finish()
Croissant Metadata
Export datasets with Croissant metadata format:
# Create artifact with rich metadata
artifact = ln.Artifact.from_dataframe(
df,
key="datasets/published_data.parquet",
description="Published dataset with Croissant metadata"
).save()
# Export Croissant metadata (requires additional configuration)
# Enables dataset discovery and interoperability
Best Practices for Integrations
- Track consistently: Use ln.track() in all integrated workflows
- Link IDs: Store external system IDs (W&B run ID, MLflow experiment ID) as features
- Centralize data: Use LaminDB as the single source of truth for data artifacts
- Sync parameters: Log parameters to both LaminDB and ML platforms
- Version together: Keep code (git), data (LaminDB), and experiments (ML platform) in sync
- Cache strategically: Configure appropriate cache locations for cloud storage
- Use feature sets: Link ontology terms from Bionty to artifacts
- Document integrations: Add descriptions explaining integration context
- Test incrementally: Verify integrations with small datasets first
- Monitor lineage: Use view_lineage() to ensure integration tracking works (see the sketch after this list)
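A quick way to confirm that an integration is recording lineage is to render it for a freshly saved artifact. A minimal sketch, with the key chosen for illustration:
import lamindb as ln
artifact = ln.Artifact.get(key="models/experiment-1.pkl")
artifact.view_lineage()  # renders the graph of transforms and inputs behind this artifact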
Troubleshooting Common Issues
Issue: S3 credentials not found
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
Issue: GCS authentication failure
gcloud auth application-default login
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
Issue: Git sync not working
# Ensure git repo is set
lamin settings get sync-git-repo
# Ensure you're in git repo
git status
# Commit changes before tracking
git add .
git commit -m "Update analysis"
ln.track()
Issue: MLflow artifacts not syncing
# Save explicitly to both systems
mlflow.log_artifact("model.pkl")
ln.Artifact("model.pkl", key="models/model.pkl").save()