# LaminDB Integrations

This document covers LaminDB integrations with workflow managers, MLOps platforms, visualization tools, and other third-party systems.

## Overview

LaminDB integrates with data storage backends, computational workflows, machine learning platforms, and visualization tools, so it can be incorporated into existing data science and bioinformatics pipelines.

## Data Storage Integrations

### Local Filesystem

```python
import lamindb as ln

# Initialize the instance with local storage (run once in your shell):
#   lamin init --storage ./mydata

# Save artifacts to local storage
artifact = ln.Artifact("data.csv", key="local/data.csv").save()

# Load from local storage
data = artifact.load()
```

### AWS S3

```python
import lamindb as ln

# Initialize the instance with S3 storage (run once in your shell):
#   lamin init --storage s3://my-bucket/path \
#       --db postgresql://user:pwd@host:port/db

# Artifacts automatically sync to S3
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()

# Transparent S3 access
data = artifact.load()  # Downloads from S3 if not cached
```

### S3-Compatible Services

MinIO, Cloudflare R2, and other S3-compatible endpoints are supported:

```bash
# Initialize with a custom S3 endpoint
lamin init --storage 's3://bucket?endpoint_url=http://minio.example.com:9000'

# Configure credentials
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
```

### Google Cloud Storage

```python
import lamindb as ln

# Install the GCP extras and initialize with GCS (run once in your shell):
#   pip install 'lamindb[gcp]'
#   lamin init --storage gs://my-bucket/path \
#       --db postgresql://user:pwd@host:port/db

# Artifacts sync to GCS
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()
```

### HTTP/HTTPS (Read-Only)

```python
# Access remote files without copying
artifact = ln.Artifact(
    "https://example.com/data.csv",
    key="remote/data.csv"
).save()

# Stream remote content
with artifact.open() as f:
    data = f.read()
```

### HuggingFace Datasets

```python
# Access HuggingFace datasets
from datasets import load_dataset

dataset = load_dataset("squad", split="train")

# Register as LaminDB artifact
artifact = ln.Artifact.from_dataframe(
    dataset.to_pandas(),
    key="hf/squad_train.parquet",
    description="SQuAD training data from HuggingFace"
).save()
```
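
To get the table back into the `datasets` API later, reload the artifact and rebuild a `Dataset` from the returned DataFrame; a minimal sketch reusing the `hf/squad_train.parquet` key registered above:

```python
import lamindb as ln
from datasets import Dataset

# Reload the registered artifact as a pandas DataFrame
artifact = ln.Artifact.get(key="hf/squad_train.parquet")
df = artifact.load()

# Rebuild a HuggingFace Dataset for downstream use
dataset = Dataset.from_pandas(df)
```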

## Workflow Manager Integrations

### Nextflow

Track Nextflow pipeline execution and outputs:

```python
# In your Nextflow process script
import lamindb as ln

# Initialize tracking
ln.track()

# Your Nextflow process logic
input_artifact = ln.Artifact.get(key="${input_key}")
data = input_artifact.load()

# Process data
result = process_data(data)

# Save output
output_artifact = ln.Artifact.from_dataframe(
    result,
    key="${output_key}"
).save()

ln.finish()
```

**Nextflow process example:**

```nextflow
process ANALYZE {
    input:
    val input_key

    output:
    path "result.csv"

    script:
    """
    #!/usr/bin/env python
    import lamindb as ln
    ln.track()
    artifact = ln.Artifact.get(key="${input_key}")
    # Process and save
    ln.finish()
    """
}
```

### Snakemake

Integrate LaminDB into Snakemake workflows:

```python
# In a Snakemake rule
rule process_data:
    input:
        "data/input.csv"
    output:
        "data/output.csv"
    run:
        import lamindb as ln

        ln.track()

        # Load input artifact
        artifact = ln.Artifact.get(key="inputs/data.csv")
        data = artifact.load()

        # Process
        result = analyze(data)

        # Save output
        result.to_csv(output[0])
        ln.Artifact(output[0], key="outputs/result.csv").save()

        ln.finish()
```

### Redun

Track Redun task execution:

```python
from redun import task

import lamindb as ln


@task()
@ln.tracked()
def process_dataset(input_key: str, output_key: str):
    """Redun task with LaminDB tracking."""
    # Load input
    artifact = ln.Artifact.get(key=input_key)
    data = artifact.load()

    # Process
    result = transform(data)

    # Save output
    ln.Artifact.from_dataframe(result, key=output_key).save()

    return output_key


# Redun automatically tracks lineage alongside LaminDB
```
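
To execute the task, you can call it through Redun's scheduler (or the `redun run` CLI); a minimal sketch of the programmatic route, with placeholder dataset keys:

```python
from redun import Scheduler

# Build and run the task graph; Redun records its call graph while
# LaminDB records data lineage for the artifacts the task touches.
scheduler = Scheduler()
result_key = scheduler.run(
    process_dataset("datasets/input.parquet", "datasets/output.parquet")
)
print(result_key)
```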

## MLOps Platform Integrations

### Weights & Biases (W&B)

Combine W&B experiment tracking with LaminDB data management:

```python
import wandb
import lamindb as ln

# Initialize both
wandb.init(project="my-project", name="experiment-1")
ln.track(params={"learning_rate": 0.01, "batch_size": 32})

# Load training data
train_artifact = ln.Artifact.get(key="datasets/train.parquet")
train_data = train_artifact.load()

# Train model
model = train_model(train_data)

# Log to W&B
wandb.log({"accuracy": 0.95, "loss": 0.05})

# Save model in LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
    "model.pkl",
    key="models/experiment-1.pkl",
    description=f"Model from W&B run {wandb.run.id}"
).save()

# Link W&B run ID
model_artifact.features.add_values({"wandb_run_id": wandb.run.id})

ln.finish()
wandb.finish()
```

### MLflow

Integrate MLflow model tracking with LaminDB:

```python
import mlflow
import lamindb as ln

# Define parameters and log them to both systems
params = {"max_depth": 5, "n_estimators": 100}

mlflow.start_run()
mlflow.log_params(params)
ln.track(params=params)

# Load data from LaminDB
data_artifact = ln.Artifact.get(key="datasets/features.parquet")
X = data_artifact.load()

# Train and log model
model = train_model(X)
mlflow.sklearn.log_model(model, "model")

# Save to LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
    "model.pkl",
    key=f"models/{mlflow.active_run().info.run_id}.pkl"
).save()

mlflow.end_run()
ln.finish()
```

### HuggingFace Transformers

Track model fine-tuning with LaminDB:

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import lamindb as ln

ln.track(params={"model": "bert-base", "epochs": 3})

# Load training data (a tokenized dataset registered beforehand)
train_artifact = ln.Artifact.get(key="datasets/train_tokenized.parquet")
train_dataset = train_artifact.load()

# Load the base model to fine-tune (example checkpoint)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Configure trainer
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Train
trainer.train()

# Save model to LaminDB
trainer.save_model("./model")
model_artifact = ln.Artifact(
    "./model",
    key="models/bert_finetuned",
    description="BERT fine-tuned on custom dataset"
).save()

ln.finish()
```

### scVI-tools

Single-cell analysis with scVI and LaminDB:

```python
import scvi
import lamindb as ln

ln.track()

# Load data
adata_artifact = ln.Artifact.get(key="scrna/raw_counts.h5ad")
adata = adata_artifact.load()

# Setup scVI
scvi.model.SCVI.setup_anndata(adata, layer="counts")

# Train model
model = scvi.model.SCVI(adata)
model.train()

# Save latent representation
adata.obsm["X_scvi"] = model.get_latent_representation()

# Save results
result_artifact = ln.Artifact.from_anndata(
    adata,
    key="scrna/scvi_latent.h5ad",
    description="scVI latent representation"
).save()

ln.finish()
```
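
The trained model can be versioned alongside the embedding as well; a minimal sketch continuing the run above and using scvi-tools' `model.save()` directory format (the `scrna/scvi_model` key is illustrative):

```python
# Persist the trained scVI model as a directory
model.save("scvi_model", overwrite=True)

# Register the model directory as a versioned artifact
model_artifact = ln.Artifact(
    "scvi_model",
    key="scrna/scvi_model",
    description="Trained scVI model for scrna/raw_counts.h5ad"
).save()
```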

## Array Store Integrations

### TileDB-SOMA

Scalable array storage with cellxgene support:

```python
import tiledbsoma as soma
import lamindb as ln

# Create SOMA experiment
uri = "tiledb://my-namespace/experiment"

with soma.Experiment.create(uri) as exp:
    # Add measurements
    exp.add_new_collection("RNA")

# Register in LaminDB
artifact = ln.Artifact(
    uri,
    key="cellxgene/experiment.soma",
    description="TileDB-SOMA experiment"
).save()

# Query with SOMA
with soma.Experiment.open(uri) as exp:
    obs = exp.obs.read().concat().to_pandas()
```

### DuckDB

Query artifacts with DuckDB:

```python
import duckdb
import lamindb as ln

# Get artifact
artifact = ln.Artifact.get(key="datasets/large_data.parquet")

# Query with DuckDB (without loading full file)
path = artifact.cache()
result = duckdb.query(f"""
    SELECT cell_type, COUNT(*) as count
    FROM read_parquet('{path}')
    GROUP BY cell_type
    ORDER BY count DESC
""").to_df()

# Save query result
result_artifact = ln.Artifact.from_dataframe(
    result,
    key="analysis/cell_type_counts.parquet"
).save()
```

## Visualization Integrations

### Vitessce

Create interactive visualizations:

```python
from vitessce import VitessceConfig
import lamindb as ln

# Load spatial data
artifact = ln.Artifact.get(key="spatial/visium_slide.h5ad")
adata = artifact.load()

# Create Vitessce configuration
vc = VitessceConfig.from_object(adata)

# Save configuration
import json
config_file = "vitessce_config.json"
with open(config_file, "w") as f:
    json.dump(vc.to_dict(), f)

# Register configuration
config_artifact = ln.Artifact(
    config_file,
    key="visualizations/spatial_config.json",
    description="Vitessce visualization config"
).save()
```

## Schema Module Integrations

### Bionty (Biological Ontologies)

```python
import bionty as bt

# Import biological ontologies
bt.CellType.import_source()
bt.Gene.import_source(organism="human")

# Use in data curation
cell_types = bt.CellType.from_values(adata.obs.cell_type)
```
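
The records returned by `from_values()` can be saved so they become queryable labels in the instance, and unmapped values can be inspected; a minimal sketch, assuming `adata` from an earlier step:

```python
import bionty as bt
import lamindb as ln

# Map observed labels to ontology records and persist the matches
cell_types = bt.CellType.from_values(adata.obs.cell_type)
ln.save(cell_types)

# Report which values did and did not validate against the ontology
bt.CellType.inspect(adata.obs.cell_type)
```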

### WetLab

Track wet lab experiments:

```python
# Install the wetlab module (shell):
#   pip install 'lamindb[wetlab]'

# Use wetlab registries
import lamindb_wetlab as wetlab

# Track experiments, samples, protocols
experiment = wetlab.Experiment(name="RNA-seq batch 1").save()
```

### Clinical Data (OMOP)

```python
# Install the clinical module (shell):
#   pip install 'lamindb[clinical]'

# Use the OMOP common data model
import lamindb_clinical as clinical

# Track clinical data
patient = clinical.Patient(patient_id="P001").save()
```

## Git Integration

### Sync with Git Repositories

```python
import lamindb as ln

# Configure git sync via an environment variable (shell):
#   export LAMINDB_SYNC_GIT_REPO=https://github.com/user/repo.git
# Or programmatically
ln.settings.sync_git_repo = "https://github.com/user/repo.git"

# Set the development directory (shell):
#   lamin settings set dev-dir .

# Scripts are tracked with git commits
ln.track()  # Automatically captures the git commit hash
# ... your code ...
ln.finish()

# View git information
transform = ln.Transform.get(name="analysis.py")
transform.source_code  # Shows code at the git commit
transform.hash  # Git commit hash
```
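
To see which data a git-tracked script produced, you can filter artifacts by the runs of its transform; a minimal sketch, reusing the `Transform.get` call from the block above:

```python
import lamindb as ln

transform = ln.Transform.get(name="analysis.py")

# All artifacts written by runs of this git-tracked script
outputs = ln.Artifact.filter(run__transform=transform)
print(outputs.df())
```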

## Enterprise Integrations

### Benchling

Sync with Benchling registries (requires team/enterprise plan):

```python
# Configure Benchling connection (contact LaminDB team)
# Syncs schemas and data from Benchling registries

# Access synced Benchling data
# Details available through enterprise support
```

## Custom Integration Patterns

### REST API Integration

```python
import requests
import lamindb as ln

ln.track()

# Fetch from API
response = requests.get("https://api.example.com/data")
data = response.json()

# Convert to DataFrame
import pandas as pd
df = pd.DataFrame(data)

# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
    df,
    key="api/fetched_data.parquet",
    description="Data fetched from external API"
).save()

artifact.features.add_values({"api_url": response.url})

ln.finish()
```

### Database Integration

```python
import pandas as pd
import sqlalchemy as sa

import lamindb as ln

ln.track()

# Connect to an external database
engine = sa.create_engine("postgresql://user:pwd@host:port/db")

# Query data
query = "SELECT * FROM experiments WHERE date > '2025-01-01'"
df = pd.read_sql(query, engine)

# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
    df,
    key="external_db/experiments_2025.parquet",
    description="Experiments from external database"
).save()

ln.finish()
```

## Croissant Metadata

Export datasets with the Croissant metadata format:

```python
# Create artifact with rich metadata
artifact = ln.Artifact.from_dataframe(
    df,
    key="datasets/published_data.parquet",
    description="Published dataset with Croissant metadata"
).save()

# Export Croissant metadata (requires additional configuration)
# Enables dataset discovery and interoperability
```
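
Croissant is a JSON-LD vocabulary, so one low-tooling option is to write a metadata stub by hand and register it next to the dataset; a hand-rolled sketch with illustrative field values, not a complete Croissant record:

```python
import json

import lamindb as ln

# Minimal JSON-LD stub loosely following the Croissant (schema.org Dataset) vocabulary
croissant = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "published_data",
    "description": "Published dataset with Croissant metadata",
}

with open("published_data.croissant.json", "w") as f:
    json.dump(croissant, f, indent=2)

# Register the metadata file alongside the data artifact
ln.Artifact(
    "published_data.croissant.json",
    key="datasets/published_data.croissant.json",
    description="Croissant metadata for datasets/published_data.parquet",
).save()
```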

## Best Practices for Integrations

1. **Track consistently**: Use `ln.track()` in all integrated workflows
2. **Link IDs**: Store external system IDs (W&B run ID, MLflow experiment ID) as features
3. **Centralize data**: Use LaminDB as the single source of truth for data artifacts
4. **Sync parameters**: Log parameters to both LaminDB and ML platforms
5. **Version together**: Keep code (git), data (LaminDB), and experiments (ML platform) in sync
6. **Cache strategically**: Configure appropriate cache locations for cloud storage
7. **Use feature sets**: Link ontology terms from Bionty to artifacts
8. **Document integrations**: Add descriptions explaining integration context
9. **Test incrementally**: Verify integration with small datasets first
10. **Monitor lineage**: Use `view_lineage()` to ensure integration tracking works (see the sketch below)
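
A minimal sketch of points 2 and 10, assuming an artifact like the W&B model saved earlier and that an `mlflow_run_id` feature has been registered in the instance (the key and run ID are placeholders):

```python
import lamindb as ln

# Fetch a previously saved artifact (placeholder key)
artifact = ln.Artifact.get(key="models/experiment-1.pkl")

# Point 2: link the external system's ID as a feature value
# (assumes an `mlflow_run_id` feature exists in the instance)
artifact.features.add_values({"mlflow_run_id": "abc123"})

# Point 10: render the lineage graph to confirm tracking works end to end
artifact.view_lineage()
```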

## Troubleshooting Common Issues

**Issue: S3 credentials not found**

```bash
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
```

**Issue: GCS authentication failure**

```bash
gcloud auth application-default login
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```

**Issue: Git sync not working**

```bash
# Ensure the git repo is set
lamin settings get sync-git-repo

# Ensure you're in a git repo
git status

# Commit changes before tracking
git add .
git commit -m "Update analysis"

# Then re-run your script, which calls ln.track()
```

**Issue: MLflow artifacts not syncing**

```python
# Save explicitly to both systems
mlflow.log_artifact("model.pkl")
ln.Artifact("model.pkl", key="models/model.pkl").save()
```