# LaminDB Integrations
This document covers LaminDB integrations with workflow managers, MLOps platforms, visualization tools, and other third-party systems.
## Overview
LaminDB integrates with data storage backends, workflow managers, machine learning platforms, and visualization tools, so it can slot into existing data science and bioinformatics pipelines.
## Data Storage Integrations
### Local Filesystem
```bash
# Initialize an instance with local storage
lamin init --storage ./mydata
```
```python
import lamindb as ln

# Save artifacts to local storage
artifact = ln.Artifact("data.csv", key="local/data.csv").save()

# Load from local storage
data = artifact.load()
```
### AWS S3
```bash
# Initialize with S3 storage
lamin init --storage s3://my-bucket/path \
  --db postgresql://user:pwd@host:port/db
```
```python
import lamindb as ln

# Artifacts automatically sync to S3
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()

# Transparent S3 access
data = artifact.load()  # Downloads from S3 if not cached
```
### S3-Compatible Services
Support for MinIO, Cloudflare R2, and other S3-compatible endpoints:
```bash
# Initialize with a custom S3 endpoint
lamin init --storage 's3://bucket?endpoint_url=http://minio.example.com:9000'

# Configure credentials
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
```
### Google Cloud Storage
```bash
# Install GCP extras
pip install 'lamindb[gcp]'

# Initialize with GCS
lamin init --storage gs://my-bucket/path \
  --db postgresql://user:pwd@host:port/db
```
```python
import lamindb as ln

# Artifacts sync to GCS
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()
```
### HTTP/HTTPS (Read-Only)
```python
import lamindb as ln

# Access remote files without copying
artifact = ln.Artifact(
    "https://example.com/data.csv",
    key="remote/data.csv",
).save()

# Stream remote content
with artifact.open() as f:
    data = f.read()
```
### HuggingFace Datasets
```python
import lamindb as ln
from datasets import load_dataset

# Access HuggingFace datasets
dataset = load_dataset("squad", split="train")

# Register as a LaminDB artifact
artifact = ln.Artifact.from_dataframe(
    dataset.to_pandas(),
    key="hf/squad_train.parquet",
    description="SQuAD training data from HuggingFace",
).save()
```
## Workflow Manager Integrations
### Nextflow
Track Nextflow pipeline execution and outputs:
```python
# In your Nextflow process script
import lamindb as ln
# Initialize tracking
ln.track()
# Your Nextflow process logic
input_artifact = ln.Artifact.get(key="${input_key}")
data = input_artifact.load()
# Process data
result = process_data(data)
# Save output
output_artifact = ln.Artifact.from_dataframe(
    result,
    key="${output_key}",
).save()
ln.finish()
```
**Nextflow config example:**
```nextflow
process ANALYZE {
    input:
    val input_key

    output:
    path "result.csv"

    script:
    """
    #!/usr/bin/env python
    import lamindb as ln

    ln.track()
    artifact = ln.Artifact.get(key="${input_key}")
    # Process and save
    ln.finish()
    """
}
```
### Snakemake
Integrate LaminDB into Snakemake workflows:
```python
# In Snakemake rule
rule process_data:
    input:
        "data/input.csv"
    output:
        "data/output.csv"
    run:
        import lamindb as ln

        ln.track()
        # Load input artifact
        artifact = ln.Artifact.get(key="inputs/data.csv")
        data = artifact.load()
        # Process
        result = analyze(data)
        # Save output
        result.to_csv(output[0])
        ln.Artifact(output[0], key="outputs/result.csv").save()
        ln.finish()
```
### Redun
Track Redun task execution:
```python
from redun import task
import lamindb as ln
@task()
@ln.tracked()
def process_dataset(input_key: str, output_key: str):
    """Redun task with LaminDB tracking."""
    # Load input
    artifact = ln.Artifact.get(key=input_key)
    data = artifact.load()
    # Process
    result = transform(data)
    # Save output
    ln.Artifact.from_dataframe(result, key=output_key).save()
    return output_key

# Redun automatically tracks lineage alongside LaminDB
```
## MLOps Platform Integrations
### Weights & Biases (W&B)
Combine W&B experiment tracking with LaminDB data management:
```python
import wandb
import lamindb as ln
# Initialize both
wandb.init(project="my-project", name="experiment-1")
ln.track(params={"learning_rate": 0.01, "batch_size": 32})
# Load training data
train_artifact = ln.Artifact.get(key="datasets/train.parquet")
train_data = train_artifact.load()
# Train model
model = train_model(train_data)
# Log to W&B
wandb.log({"accuracy": 0.95, "loss": 0.05})
# Save model in LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
    "model.pkl",
    key="models/experiment-1.pkl",
    description=f"Model from W&B run {wandb.run.id}",
).save()
# Link W&B run ID
model_artifact.features.add_values({"wandb_run_id": wandb.run.id})
ln.finish()
wandb.finish()
```
### MLflow
Integrate MLflow model tracking with LaminDB:
```python
import mlflow
import lamindb as ln
# Start runs and log parameters to both systems
params = {"max_depth": 5, "n_estimators": 100}
mlflow.start_run()
mlflow.log_params(params)
ln.track(params=params)
# Load data from LaminDB
data_artifact = ln.Artifact.get(key="datasets/features.parquet")
X = data_artifact.load()
# Train and log model
model = train_model(X)
mlflow.sklearn.log_model(model, "model")
# Save to LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
    "model.pkl",
    key=f"models/{mlflow.active_run().info.run_id}.pkl",
).save()
mlflow.end_run()
ln.finish()
```
### HuggingFace Transformers
Track model fine-tuning with LaminDB:
```python
from transformers import Trainer, TrainingArguments
import lamindb as ln
ln.track(params={"model": "bert-base", "epochs": 3})
# Load training data
train_artifact = ln.Artifact.get(key="datasets/train_tokenized.parquet")
train_dataset = train_artifact.load()
# Configure trainer (`model` is assumed to be a pretrained checkpoint loaded earlier)
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Train
trainer.train()

# Save model to LaminDB
trainer.save_model("./model")
model_artifact = ln.Artifact(
    "./model",
    key="models/bert_finetuned",
    description="BERT fine-tuned on custom dataset",
).save()
ln.finish()
```
### scVI-tools
Single-cell analysis with scVI and LaminDB:
```python
import scvi
import lamindb as ln
ln.track()
# Load data
adata_artifact = ln.Artifact.get(key="scrna/raw_counts.h5ad")
adata = adata_artifact.load()
# Setup scVI
scvi.model.SCVI.setup_anndata(adata, layer="counts")
# Train model
model = scvi.model.SCVI(adata)
model.train()
# Save latent representation
adata.obsm["X_scvi"] = model.get_latent_representation()
# Save results
result_artifact = ln.Artifact.from_anndata(
    adata,
    key="scrna/scvi_latent.h5ad",
    description="scVI latent representation",
).save()
ln.finish()
```
## Array Store Integrations
### TileDB-SOMA
Scalable array storage with cellxgene support:
```python
import tiledbsoma as soma
import lamindb as ln
# Create SOMA experiment
uri = "tiledb://my-namespace/experiment"
with soma.Experiment.create(uri) as exp:
    # Add measurements
    exp.add_new_collection("RNA")

# Register in LaminDB
artifact = ln.Artifact(
    uri,
    key="cellxgene/experiment.soma",
    description="TileDB-SOMA experiment",
).save()

# Query with SOMA
with soma.Experiment.open(uri) as exp:
    obs = exp.obs.read().to_pandas()
```
### DuckDB
Query artifacts with DuckDB:
```python
import duckdb
import lamindb as ln
# Get artifact
artifact = ln.Artifact.get(key="datasets/large_data.parquet")
# Query with DuckDB (without loading full file)
path = artifact.cache()
result = duckdb.query(f"""
SELECT cell_type, COUNT(*) as count
FROM read_parquet('{path}')
GROUP BY cell_type
ORDER BY count DESC
""").to_df()
# Save query result
result_artifact = ln.Artifact.from_dataframe(
    result,
    key="analysis/cell_type_counts.parquet",
).save()
```
## Visualization Integrations
### Vitessce
Create interactive visualizations:
```python
from vitessce import VitessceConfig
import lamindb as ln
# Load spatial data
artifact = ln.Artifact.get(key="spatial/visium_slide.h5ad")
adata = artifact.load()
# Create Vitessce configuration
vc = VitessceConfig.from_object(adata)
# Save configuration
import json
config_file = "vitessce_config.json"
with open(config_file, "w") as f:
    json.dump(vc.to_dict(), f)

# Register configuration
config_artifact = ln.Artifact(
    config_file,
    key="visualizations/spatial_config.json",
    description="Vitessce visualization config",
).save()
```
## Schema Module Integrations
### Bionty (Biological Ontologies)
```python
import bionty as bt
# Import biological ontologies
bt.CellType.import_source()
bt.Gene.import_source(organism="human")
# Use in data curation
cell_types = bt.CellType.from_values(adata.obs.cell_type)
```
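The same registries can also check incoming values before curation. Below is a minimal sketch, assuming the `CellType` ontology has been imported as above; the input labels are hypothetical, and `validate`/`standardize` are the registry methods used for the check:
```python
import bionty as bt

# Hypothetical incoming labels from a dataset
values = ["T cell", "B-cell", "unknown cell"]

# Which values match the ontology? (boolean array, one entry per value)
validated = bt.CellType.validate(values)

# Map synonyms onto standardized ontology names where possible
standardized = bt.CellType.standardize(values)

print(list(zip(values, validated, standardized)))
```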
### WetLab
Track wet lab experiments:
```bash
# Install the wetlab module
pip install 'lamindb[wetlab]'
```
```python
# Use wetlab registries
import lamindb_wetlab as wetlab

# Track experiments, samples, protocols
experiment = wetlab.Experiment(name="RNA-seq batch 1").save()
```
### Clinical Data (OMOP)
```bash
# Install the clinical module
pip install 'lamindb[clinical]'
```
```python
# Use the OMOP common data model
import lamindb_clinical as clinical

# Track clinical data
patient = clinical.Patient(patient_id="P001").save()
```
## Git Integration
### Sync with Git Repositories
```bash
# Configure git sync via environment variable
export LAMINDB_SYNC_GIT_REPO=https://github.com/user/repo.git

# Set development directory
lamin settings set dev-dir .
```
```python
import lamindb as ln

# Or configure git sync programmatically
ln.settings.sync_git_repo = "https://github.com/user/repo.git"

# Scripts are tracked together with git commits
ln.track()  # Automatically captures the current git commit
# ... your code ...
ln.finish()

# View tracked source information
transform = ln.Transform.get(name="analysis.py")
transform.source_code  # Source code as tracked at the git commit
transform.hash  # Hash of the tracked source code
```
## Enterprise Integrations
### Benchling
Sync with Benchling registries (requires team/enterprise plan):
Benchling connections are configured together with the Lamin team. Once set up, schemas and records from Benchling registries are synced into your LaminDB instance; details are available through enterprise support.
## Custom Integration Patterns
### REST API Integration
```python
import requests
import lamindb as ln
ln.track()
# Fetch from API
response = requests.get("https://api.example.com/data")
data = response.json()
# Convert to DataFrame
import pandas as pd
df = pd.DataFrame(data)
# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
    df,
    key="api/fetched_data.parquet",
    description="Data fetched from external API",
).save()
artifact.features.add_values({"api_url": response.url})
ln.finish()
```
### Database Integration
```python
import pandas as pd
import sqlalchemy as sa
import lamindb as ln

ln.track()

# Connect to external database
engine = sa.create_engine("postgresql://user:pwd@host:port/db")

# Query data
query = "SELECT * FROM experiments WHERE date > '2025-01-01'"
df = pd.read_sql(query, engine)

# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
    df,
    key="external_db/experiments_2025.parquet",
    description="Experiments from external database",
).save()
ln.finish()
```
## Croissant Metadata
Export datasets with Croissant metadata format:
```python
# Create artifact with rich metadata
artifact = ln.Artifact.from_dataframe(
    df,
    key="datasets/published_data.parquet",
    description="Published dataset with Croissant metadata",
).save()
# Export Croissant metadata (requires additional configuration)
# Enables dataset discovery and interoperability
```
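Until that configuration is in place, a hand-rolled sketch illustrates the idea: serialize artifact metadata as Croissant-style JSON-LD and register the metadata file next to the dataset. The field mapping below is illustrative rather than the output of the official exporter, and it reuses the `artifact` created above:
```python
import json
import lamindb as ln

# Illustrative Croissant-style JSON-LD built from artifact metadata
# (the full Croissant @context is omitted for brevity)
croissant = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": artifact.key,
    "description": artifact.description,
    "distribution": [
        {
            "@type": "FileObject",
            "name": artifact.key,
            "encodingFormat": "application/vnd.apache.parquet",
        }
    ],
}

with open("croissant.json", "w") as f:
    json.dump(croissant, f, indent=2)

# Register the metadata file alongside the dataset
ln.Artifact("croissant.json", key="datasets/published_data.croissant.json").save()
```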
## Best Practices for Integrations
1. **Track consistently**: Use `ln.track()` in all integrated workflows
2. **Link IDs**: Store external system IDs (W&B run ID, MLflow experiment ID) as features; see the sketch after this list
3. **Centralize data**: Use LaminDB as single source of truth for data artifacts
4. **Sync parameters**: Log parameters to both LaminDB and ML platforms
5. **Version together**: Keep code (git), data (LaminDB), and experiments (ML platform) in sync
6. **Cache strategically**: Configure appropriate cache locations for cloud storage
7. **Use feature sets**: Link ontology terms from Bionty to artifacts
8. **Document integrations**: Add descriptions explaining integration context
9. **Test incrementally**: Verify integration with small datasets first
10. **Monitor lineage**: Use `view_lineage()` to ensure integration tracking works
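For items 2 and 10, here is a minimal sketch of what this looks like in practice; the feature name `wandb_run_id`, the artifact key, and the run ID are placeholders:
```python
import lamindb as ln

# Define a feature once so external run IDs can be attached as values
ln.Feature(name="wandb_run_id", dtype="str").save()

# Link the external run ID to a model artifact (placeholder values)
model_artifact = ln.Artifact.get(key="models/experiment-1.pkl")
model_artifact.features.add_values({"wandb_run_id": "abc123"})

# Confirm the integration shows up in the lineage graph
model_artifact.view_lineage()
```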
## Troubleshooting Common Issues
**Issue: S3 credentials not found**
```bash
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
```
**Issue: GCS authentication failure**
```bash
gcloud auth application-default login
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```
**Issue: Git sync not working**
```bash
# Ensure git repo is set
lamin settings get sync-git-repo
# Ensure you're in git repo
git status
# Commit changes before calling ln.track() in your script
git add .
git commit -m "Update analysis"
```
**Issue: MLflow artifacts not syncing**
```python
# Save explicitly to both systems
mlflow.log_artifact("model.pkl")
ln.Artifact("model.pkl", key="models/model.pkl").save()
```