# LaminDB Integrations

This document covers LaminDB integrations with workflow managers, MLOps platforms, visualization tools, and other third-party systems.

## Overview

LaminDB supports extensive integrations across data storage, computational workflows, machine learning platforms, and visualization tools, enabling seamless incorporation into existing data science and bioinformatics pipelines.

## Data Storage Integrations

### Local Filesystem

Initialize an instance backed by a local directory (shell):

```bash
lamin init --storage ./mydata
```

```python
import lamindb as ln

# Save artifacts to local storage
artifact = ln.Artifact("data.csv", key="local/data.csv").save()

# Load from local storage
data = artifact.load()
```

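
To confirm where artifacts are being written, you can inspect the instance settings and the resolved artifact path; a minimal check, assuming the instance and artifact from the block above:

```python
import lamindb as ln

# Storage root configured at `lamin init`
print(ln.settings.storage)

# Resolved location of a specific artifact under that root
artifact = ln.Artifact.get(key="local/data.csv")
print(artifact.path)
```
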
### AWS S3

Initialize an instance with S3 storage and a Postgres database (shell):

```bash
lamin init --storage s3://my-bucket/path \
  --db postgresql://user:pwd@host:port/db
```

```python
import lamindb as ln

# Artifacts automatically sync to S3
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()

# Transparent S3 access
data = artifact.load()  # downloads from S3 if not cached
```

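
When debugging cloud setups it helps to distinguish the remote location from the local cache; a small sketch, assuming the artifact saved above:

```python
import lamindb as ln

artifact = ln.Artifact.get(key="experiments/data.csv")

print(artifact.path)           # remote URI, e.g. s3://my-bucket/path/...
local_path = artifact.cache()  # downloads if needed and returns the cached local path
print(local_path)
```
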
### S3-Compatible Services

Support for MinIO, Cloudflare R2, and other S3-compatible endpoints:

```bash
# Initialize with a custom S3 endpoint
lamin init --storage 's3://bucket?endpoint_url=http://minio.example.com:9000'

# Configure credentials
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
```

### Google Cloud Storage

Install the GCP extras and initialize (shell):

```bash
pip install 'lamindb[gcp]'

lamin init --storage gs://my-bucket/path \
  --db postgresql://user:pwd@host:port/db
```

```python
import lamindb as ln

# Artifacts sync to GCS
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()
```

### HTTP/HTTPS (Read-Only)

```python
import lamindb as ln

# Access remote files without copying
artifact = ln.Artifact(
    "https://example.com/data.csv",
    key="remote/data.csv",
).save()

# Stream remote content
with artifact.open() as f:
    data = f.read()
```

### HuggingFace Datasets

```python
import lamindb as ln
from datasets import load_dataset

# Access a HuggingFace dataset
dataset = load_dataset("squad", split="train")

# Register it as a LaminDB artifact
artifact = ln.Artifact.from_dataframe(
    dataset.to_pandas(),
    key="hf/squad_train.parquet",
    description="SQuAD training data from HuggingFace",
).save()
```

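
Going the other way is just as simple; a minimal sketch of rebuilding a `datasets.Dataset` from the registered parquet artifact (assumes the artifact created above):

```python
import lamindb as ln
from datasets import Dataset

artifact = ln.Artifact.get(key="hf/squad_train.parquet")
df = artifact.load()  # pandas DataFrame

# Rebuild an in-memory HuggingFace Dataset from the tracked data
dataset = Dataset.from_pandas(df)
```
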
## Workflow Manager Integrations

### Nextflow

Track Nextflow pipeline execution and outputs:

```python
# In your Nextflow process script
import lamindb as ln

# Initialize tracking
ln.track()

# Your Nextflow process logic
input_artifact = ln.Artifact.get(key="${input_key}")
data = input_artifact.load()

# Process data
result = process_data(data)

# Save output
output_artifact = ln.Artifact.from_dataframe(
    result,
    key="${output_key}",
).save()

ln.finish()
```

**Nextflow config example:**

```nextflow
process ANALYZE {
    input:
    val input_key

    output:
    path "result.csv"

    script:
    """
    #!/usr/bin/env python
    import lamindb as ln
    ln.track()
    artifact = ln.Artifact.get(key="${input_key}")
    # Process and save
    ln.finish()
    """
}
```

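
If you prefer to keep LaminDB out of individual processes, you can also register the pipeline's published results once after the run completes; a sketch under the assumption that the pipeline writes to a hypothetical `results/` publishDir:

```python
import lamindb as ln

# Run once after `nextflow run` has finished
ln.track()

# Register the whole publishDir as a single versioned folder artifact
results = ln.Artifact(
    "results/",  # hypothetical Nextflow publishDir
    key="nextflow/analyze_run",
    description="Outputs of the ANALYZE Nextflow run",
).save()

ln.finish()
```
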
### Snakemake

Integrate LaminDB into Snakemake workflows:

```python
# In a Snakemake rule
rule process_data:
    input:
        "data/input.csv"
    output:
        "data/output.csv"
    run:
        import lamindb as ln

        ln.track()

        # Load input artifact
        artifact = ln.Artifact.get(key="inputs/data.csv")
        data = artifact.load()

        # Process
        result = analyze(data)

        # Save output
        result.to_csv(output[0])
        ln.Artifact(output[0], key="outputs/result.csv").save()

        ln.finish()
```

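
You can also let LaminDB resolve inputs for Snakemake instead of hard-coding paths; a sketch using a Snakemake input function that maps an artifact key to its cached local path (the key and the `analyze.py` script are illustrative):

```python
import lamindb as ln

def lamindb_input(wildcards):
    """Resolve a LaminDB artifact key to a local file path for Snakemake."""
    artifact = ln.Artifact.get(key="inputs/data.csv")
    return str(artifact.cache())  # downloads to the local cache if necessary

rule process_tracked_data:
    input:
        lamindb_input
    output:
        "data/output.csv"
    shell:
        "python analyze.py {input} {output}"
```
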
### Redun

Track Redun task execution:

```python
from redun import task

import lamindb as ln

@task()
@ln.tracked()
def process_dataset(input_key: str, output_key: str):
    """Redun task with LaminDB tracking."""
    # Load input
    artifact = ln.Artifact.get(key=input_key)
    data = artifact.load()

    # Process
    result = transform(data)

    # Save output
    ln.Artifact.from_dataframe(result, key=output_key).save()

    return output_key

# Redun automatically tracks lineage alongside LaminDB
```

## MLOps Platform Integrations

### Weights & Biases (W&B)

Combine W&B experiment tracking with LaminDB data management:

```python
import wandb
import lamindb as ln

# Initialize both
wandb.init(project="my-project", name="experiment-1")
ln.track(params={"learning_rate": 0.01, "batch_size": 32})

# Load training data
train_artifact = ln.Artifact.get(key="datasets/train.parquet")
train_data = train_artifact.load()

# Train model
model = train_model(train_data)

# Log metrics to W&B
wandb.log({"accuracy": 0.95, "loss": 0.05})

# Save the model in LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
    "model.pkl",
    key="models/experiment-1.pkl",
    description=f"Model from W&B run {wandb.run.id}",
).save()

# Link the W&B run ID
model_artifact.features.add_values({"wandb_run_id": wandb.run.id})

ln.finish()
wandb.finish()
```

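
The link can go in both directions: storing the LaminDB artifact uid on the W&B run lets either system resolve the other. A minimal sketch, continuing from the block above and called before `wandb.finish()`:

```python
# Record the LaminDB artifact uid on the W&B run as well
wandb.config.update({"lamindb_artifact_uid": model_artifact.uid})
```

With both IDs stored, you can start from a W&B dashboard or from a LaminDB query and reach the same model.
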
### MLflow

Integrate MLflow model tracking with LaminDB:

```python
import joblib

import mlflow
import lamindb as ln

# Start runs in both systems and log parameters to both
params = {"max_depth": 5, "n_estimators": 100}
mlflow.start_run()
mlflow.log_params(params)
ln.track(params=params)

# Load data from LaminDB
data_artifact = ln.Artifact.get(key="datasets/features.parquet")
X = data_artifact.load()

# Train and log the model
model = train_model(X)
mlflow.sklearn.log_model(model, "model")

# Save the model to LaminDB
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
    "model.pkl",
    key=f"models/{mlflow.active_run().info.run_id}.pkl",
).save()

mlflow.end_run()
ln.finish()
```

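
As with W&B, it helps to cross-reference the two systems; a small sketch that tags the MLflow run with the LaminDB artifact uid and stores the run ID on the artifact, called before `mlflow.end_run()`:

```python
# Cross-link MLflow and LaminDB (before mlflow.end_run())
run_id = mlflow.active_run().info.run_id
mlflow.set_tag("lamindb_artifact_uid", model_artifact.uid)
model_artifact.features.add_values({"mlflow_run_id": run_id})
```
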
### HuggingFace Transformers

Track model fine-tuning with LaminDB:

```python
from transformers import Trainer, TrainingArguments
import lamindb as ln

ln.track(params={"model": "bert-base", "epochs": 3})

# Load training data
train_artifact = ln.Artifact.get(key="datasets/train_tokenized.parquet")
train_dataset = train_artifact.load()

# Configure the trainer
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Train
trainer.train()

# Save the model to LaminDB
trainer.save_model("./model")
model_artifact = ln.Artifact(
    "./model",
    key="models/bert_finetuned",
    description="BERT fine-tuned on custom dataset",
).save()

ln.finish()
```

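
To reuse the fine-tuned weights elsewhere, pull the model folder back out of LaminDB and load it with `from_pretrained`; a minimal sketch, assuming the `models/bert_finetuned` artifact from above was saved with a sequence-classification head:

```python
from transformers import AutoModelForSequenceClassification
import lamindb as ln

# Fetch the saved model folder (downloads to the local cache if needed)
model_artifact = ln.Artifact.get(key="models/bert_finetuned")
model_dir = model_artifact.cache()

# Load the model from the cached directory
model = AutoModelForSequenceClassification.from_pretrained(str(model_dir))
```
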
### scVI-tools

Single-cell analysis with scVI and LaminDB:

```python
import scvi
import lamindb as ln

ln.track()

# Load data
adata_artifact = ln.Artifact.get(key="scrna/raw_counts.h5ad")
adata = adata_artifact.load()

# Set up scVI
scvi.model.SCVI.setup_anndata(adata, layer="counts")

# Train the model
model = scvi.model.SCVI(adata)
model.train()

# Store the latent representation
adata.obsm["X_scvi"] = model.get_latent_representation()

# Save results
result_artifact = ln.Artifact.from_anndata(
    adata,
    key="scrna/scvi_latent.h5ad",
    description="scVI latent representation",
).save()

ln.finish()
```

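
The trained scVI model itself can be versioned alongside the latent representation; a sketch using scvi-tools' `model.save()` and a folder artifact (key and folder name are illustrative):

```python
# Persist the trained scVI model and register the folder as an artifact
model.save("scvi_model/", overwrite=True)

model_artifact = ln.Artifact(
    "scvi_model/",
    key="models/scvi_raw_counts",
    description="Trained scVI model for scrna/raw_counts.h5ad",
).save()
```
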
## Array Store Integrations

### TileDB-SOMA

Scalable array storage with cellxgene support:

```python
import tiledbsoma as soma
import lamindb as ln

# Create a SOMA experiment
uri = "tiledb://my-namespace/experiment"

with soma.Experiment.create(uri) as exp:
    # Add measurements
    exp.add_new_collection("RNA")

# Register it in LaminDB
artifact = ln.Artifact(
    uri,
    key="cellxgene/experiment.soma",
    description="TileDB-SOMA experiment",
).save()

# Query with SOMA
with soma.Experiment.open(uri) as exp:
    obs = exp.obs.read().to_pandas()
```

### DuckDB

Query artifacts with DuckDB:

```python
import duckdb
import lamindb as ln

# Get the artifact
artifact = ln.Artifact.get(key="datasets/large_data.parquet")

# Query with DuckDB (without loading the full file into memory)
path = artifact.cache()
result = duckdb.query(f"""
    SELECT cell_type, COUNT(*) AS count
    FROM read_parquet('{path}')
    GROUP BY cell_type
    ORDER BY count DESC
""").to_df()

# Save the query result
result_artifact = ln.Artifact.from_dataframe(
    result,
    key="analysis/cell_type_counts.parquet",
).save()
```

## Visualization Integrations

### Vitessce

Create interactive visualizations:

```python
import json

from vitessce import VitessceConfig
import lamindb as ln

# Load spatial data
artifact = ln.Artifact.get(key="spatial/visium_slide.h5ad")
adata = artifact.load()

# Create a Vitessce configuration
vc = VitessceConfig.from_object(adata)

# Save the configuration
config_file = "vitessce_config.json"
with open(config_file, "w") as f:
    json.dump(vc.to_dict(), f)

# Register the configuration
config_artifact = ln.Artifact(
    config_file,
    key="visualizations/spatial_config.json",
    description="Vitessce visualization config",
).save()
```

## Schema Module Integrations

### Bionty (Biological Ontologies)

```python
import bionty as bt

# Import biological ontologies into your instance
bt.CellType.import_source()
bt.Gene.import_source(organism="human")

# Use them in data curation
cell_types = bt.CellType.from_values(adata.obs.cell_type)
```

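
Before creating records, it is often useful to check which values match the ontology and to map synonyms to standard terms; a small sketch using the registry's `validate` and `standardize` methods (the `adata.obs` column name is illustrative):

```python
import bionty as bt
import lamindb as ln

values = adata.obs["cell_type"]

# Which values are already valid ontology terms?
bt.CellType.validate(values)

# Map synonyms to standardized names before creating records
standardized = bt.CellType.standardize(values)
cell_types = bt.CellType.from_values(standardized)
ln.save(cell_types)
```
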
### WetLab

Track wet lab experiments:

```bash
# Install the wetlab module
pip install 'lamindb[wetlab]'
```

```python
# Use wetlab registries
import lamindb_wetlab as wetlab

# Track experiments, samples, protocols
experiment = wetlab.Experiment(name="RNA-seq batch 1").save()
```

### Clinical Data (OMOP)

```bash
# Install the clinical module
pip install 'lamindb[clinical]'
```

```python
# Use the OMOP common data model
import lamindb_clinical as clinical

# Track clinical data
patient = clinical.Patient(patient_id="P001").save()
```

## Git Integration

### Sync with Git Repositories

```bash
# Configure git sync via the environment
export LAMINDB_SYNC_GIT_REPO=https://github.com/user/repo.git

# Set the development directory
lamin settings set dev-dir .
```

```python
import lamindb as ln

# Or configure git sync programmatically
ln.settings.sync_git_repo = "https://github.com/user/repo.git"

# Scripts are tracked together with their git commit
ln.track()  # automatically captures the git commit hash
# ... your code ...
ln.finish()

# View git information
transform = ln.Transform.get(name="analysis.py")
transform.source_code  # the code at the tracked git commit
transform.hash         # hash identifying the tracked source
```

## Enterprise Integrations

### Benchling

Sync with Benchling registries (requires a team/enterprise plan):

```python
# Configure the Benchling connection (contact the LaminDB team);
# schemas and data from Benchling registries are synced into your instance

# Access synced Benchling data;
# details are available through enterprise support
```

## Custom Integration Patterns

### REST API Integration

```python
import pandas as pd
import requests

import lamindb as ln

ln.track()

# Fetch from an API
response = requests.get("https://api.example.com/data")
data = response.json()

# Convert to a DataFrame
df = pd.DataFrame(data)

# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
    df,
    key="api/fetched_data.parquet",
    description="Data fetched from external API",
).save()

artifact.features.add_values({"api_url": response.url})

ln.finish()
```

### Database Integration

```python
import pandas as pd
import sqlalchemy as sa

import lamindb as ln

ln.track()

# Connect to an external database
engine = sa.create_engine("postgresql://user:pwd@host:port/db")

# Query data
query = "SELECT * FROM experiments WHERE date > '2025-01-01'"
df = pd.read_sql(query, engine)

# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
    df,
    key="external_db/experiments_2025.parquet",
    description="Experiments from external database",
).save()

ln.finish()
```

## Croissant Metadata

Export datasets with the Croissant metadata format:

```python
# Create an artifact with rich metadata
artifact = ln.Artifact.from_dataframe(
    df,
    key="datasets/published_data.parquet",
    description="Published dataset with Croissant metadata",
).save()

# Export Croissant metadata (requires additional configuration);
# this enables dataset discovery and interoperability
```

## Best Practices for Integrations

1. **Track consistently**: Use `ln.track()` in all integrated workflows.
2. **Link IDs**: Store external system IDs (W&B run ID, MLflow run ID) as features.
3. **Centralize data**: Use LaminDB as the single source of truth for data artifacts.
4. **Sync parameters**: Log parameters to both LaminDB and the ML platform.
5. **Version together**: Keep code (git), data (LaminDB), and experiments (ML platform) in sync.
6. **Cache strategically**: Configure appropriate cache locations for cloud storage.
7. **Use feature sets**: Link ontology terms from Bionty to artifacts.
8. **Document integrations**: Add descriptions explaining the integration context.
9. **Test incrementally**: Verify integrations with small datasets first.
10. **Monitor lineage**: Use `view_lineage()` to ensure integration tracking works, as in the sketch after this list.

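
A minimal check for the last point, assuming a model artifact registered by one of the tracked runs above:

```python
import lamindb as ln

artifact = ln.Artifact.get(key="models/experiment-1.pkl")

# Renders the graph of transforms, runs, and input artifacts behind this artifact
artifact.view_lineage()
```
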
## Troubleshooting Common Issues

**Issue: S3 credentials not found**

```bash
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
```

**Issue: GCS authentication failure**

```bash
gcloud auth application-default login
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```

**Issue: Git sync not working**

```bash
# Ensure the git repo is set
lamin settings get sync-git-repo

# Ensure you are inside the git repo
git status

# Commit changes before tracking
git add .
git commit -m "Update analysis"
```

Then call `ln.track()` again from your script or notebook.

**Issue: MLflow artifacts not syncing**

```python
# Save explicitly to both systems
mlflow.log_artifact("model.pkl")
ln.Artifact("model.pkl", key="models/model.pkl").save()
```
