# LaminDB Integrations

This document covers LaminDB integrations with workflow managers, MLOps platforms, visualization tools, and other third-party systems.

## Overview

LaminDB integrates with data storage backends, computational workflows, machine learning platforms, and visualization tools, so it can be incorporated into existing data science and bioinformatics pipelines.

## Data Storage Integrations

### Local Filesystem

```python
import lamindb as ln

# Initialize the instance with local storage (run once in your shell):
#   lamin init --storage ./mydata

# Save artifacts to local storage
artifact = ln.Artifact("data.csv", key="local/data.csv").save()

# Load from local storage
data = artifact.load()
```

### AWS S3

```python
import lamindb as ln

# Initialize the instance with S3 storage (run once in your shell):
#   lamin init --storage s3://my-bucket/path \
#       --db postgresql://user:pwd@host:port/db

# Artifacts automatically sync to S3
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()

# Transparent S3 access
data = artifact.load()  # Downloads from S3 if not cached
```

### S3-Compatible Services

MinIO, Cloudflare R2, and other S3-compatible endpoints are supported:

```bash
# Initialize with a custom S3 endpoint
lamin init --storage 's3://bucket?endpoint_url=http://minio.example.com:9000'

# Configure credentials
export AWS_ACCESS_KEY_ID=minioadmin
export AWS_SECRET_ACCESS_KEY=minioadmin
```

### Google Cloud Storage

```python
import lamindb as ln

# Install the GCP extras and initialize with GCS (run once in your shell):
#   pip install 'lamindb[gcp]'
#   lamin init --storage gs://my-bucket/path \
#       --db postgresql://user:pwd@host:port/db

# Artifacts sync to GCS
artifact = ln.Artifact("data.csv", key="experiments/data.csv").save()
```

### HTTP/HTTPS (Read-Only)

```python
# Access remote files without copying
artifact = ln.Artifact(
    "https://example.com/data.csv",
    key="remote/data.csv"
).save()

# Stream remote content
with artifact.open() as f:
    data = f.read()
```

### HuggingFace Datasets

```python
# Access HuggingFace datasets
from datasets import load_dataset

dataset = load_dataset("squad", split="train")

# Register as LaminDB artifact
artifact = ln.Artifact.from_dataframe(
    dataset.to_pandas(),
    key="hf/squad_train.parquet",
    description="SQuAD training data from HuggingFace"
).save()
```
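
To get the table back into the `datasets` API later, reload the artifact and rebuild a `Dataset` from the returned DataFrame; a minimal sketch reusing the `hf/squad_train.parquet` key registered above:

```python
import lamindb as ln
from datasets import Dataset

# Reload the registered artifact as a pandas DataFrame
artifact = ln.Artifact.get(key="hf/squad_train.parquet")
df = artifact.load()

# Rebuild a HuggingFace Dataset for downstream use
dataset = Dataset.from_pandas(df)
```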

## Workflow Manager Integrations

### Nextflow

Track Nextflow pipeline execution and outputs:

```python
# In your Nextflow process script
import lamindb as ln

# Initialize tracking
ln.track()

# Your Nextflow process logic
input_artifact = ln.Artifact.get(key="${input_key}")
data = input_artifact.load()

# Process data
result = process_data(data)

# Save output
output_artifact = ln.Artifact.from_dataframe(
    result,
    key="${output_key}"
).save()

ln.finish()
```

**Nextflow process example:**

```nextflow
process ANALYZE {
    input:
    val input_key

    output:
    path "result.csv"

    script:
    """
    #!/usr/bin/env python
    import lamindb as ln
    ln.track()
    artifact = ln.Artifact.get(key="${input_key}")
    # Process and save
    ln.finish()
    """
}
```

### Snakemake

Integrate LaminDB into Snakemake workflows:

```python
# In a Snakemake rule
rule process_data:
    input:
        "data/input.csv"
    output:
        "data/output.csv"
    run:
        import lamindb as ln

        ln.track()

        # Load input artifact
        artifact = ln.Artifact.get(key="inputs/data.csv")
        data = artifact.load()

        # Process
        result = analyze(data)

        # Save output
        result.to_csv(output[0])
        ln.Artifact(output[0], key="outputs/result.csv").save()

        ln.finish()
```

### Redun

Track Redun task execution:

```python
from redun import task

import lamindb as ln


@task()
@ln.tracked()
def process_dataset(input_key: str, output_key: str):
    """Redun task with LaminDB tracking."""
    # Load input
    artifact = ln.Artifact.get(key=input_key)
    data = artifact.load()

    # Process
    result = transform(data)

    # Save output
    ln.Artifact.from_dataframe(result, key=output_key).save()

    return output_key


# Redun automatically tracks lineage alongside LaminDB
```
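
To execute the task, you can call it through Redun's scheduler (or the `redun run` CLI); a minimal sketch of the programmatic route, with placeholder dataset keys:

```python
from redun import Scheduler

# Build and run the task graph; Redun records its call graph while
# LaminDB records data lineage for the artifacts the task touches.
scheduler = Scheduler()
result_key = scheduler.run(
    process_dataset("datasets/input.parquet", "datasets/output.parquet")
)
print(result_key)
```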

## MLOps Platform Integrations

### Weights & Biases (W&B)

Combine W&B experiment tracking with LaminDB data management:

```python
import wandb
import lamindb as ln

# Initialize both
wandb.init(project="my-project", name="experiment-1")
ln.track(params={"learning_rate": 0.01, "batch_size": 32})

# Load training data
train_artifact = ln.Artifact.get(key="datasets/train.parquet")
train_data = train_artifact.load()

# Train model
model = train_model(train_data)

# Log to W&B
wandb.log({"accuracy": 0.95, "loss": 0.05})

# Save model in LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
    "model.pkl",
    key="models/experiment-1.pkl",
    description=f"Model from W&B run {wandb.run.id}"
).save()

# Link W&B run ID
model_artifact.features.add_values({"wandb_run_id": wandb.run.id})

ln.finish()
wandb.finish()
```

### MLflow

Integrate MLflow model tracking with LaminDB:

```python
import mlflow
import lamindb as ln

# Define parameters and log them to both systems
params = {"max_depth": 5, "n_estimators": 100}

mlflow.start_run()
mlflow.log_params(params)
ln.track(params=params)

# Load data from LaminDB
data_artifact = ln.Artifact.get(key="datasets/features.parquet")
X = data_artifact.load()

# Train and log model
model = train_model(X)
mlflow.sklearn.log_model(model, "model")

# Save to LaminDB
import joblib
joblib.dump(model, "model.pkl")
model_artifact = ln.Artifact(
    "model.pkl",
    key=f"models/{mlflow.active_run().info.run_id}.pkl"
).save()

mlflow.end_run()
ln.finish()
```

### HuggingFace Transformers

Track model fine-tuning with LaminDB:

```python
from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments
import lamindb as ln

ln.track(params={"model": "bert-base", "epochs": 3})

# Load training data (a tokenized dataset registered beforehand)
train_artifact = ln.Artifact.get(key="datasets/train_tokenized.parquet")
train_dataset = train_artifact.load()

# Load the base model to fine-tune (example checkpoint)
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")

# Configure trainer
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

# Train
trainer.train()

# Save model to LaminDB
trainer.save_model("./model")
model_artifact = ln.Artifact(
    "./model",
    key="models/bert_finetuned",
    description="BERT fine-tuned on custom dataset"
).save()

ln.finish()
```

### scVI-tools

Single-cell analysis with scVI and LaminDB:

```python
import scvi
import lamindb as ln

ln.track()

# Load data
adata_artifact = ln.Artifact.get(key="scrna/raw_counts.h5ad")
adata = adata_artifact.load()

# Setup scVI
scvi.model.SCVI.setup_anndata(adata, layer="counts")

# Train model
model = scvi.model.SCVI(adata)
model.train()

# Save latent representation
adata.obsm["X_scvi"] = model.get_latent_representation()

# Save results
result_artifact = ln.Artifact.from_anndata(
    adata,
    key="scrna/scvi_latent.h5ad",
    description="scVI latent representation"
).save()

ln.finish()
```
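
The trained model can be versioned alongside the embedding as well; a minimal sketch continuing the run above and using scvi-tools' `model.save()` directory format (the `scrna/scvi_model` key is illustrative):

```python
# Persist the trained scVI model as a directory
model.save("scvi_model", overwrite=True)

# Register the model directory as a versioned artifact
model_artifact = ln.Artifact(
    "scvi_model",
    key="scrna/scvi_model",
    description="Trained scVI model for scrna/raw_counts.h5ad"
).save()
```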

## Array Store Integrations

### TileDB-SOMA

Scalable array storage with cellxgene support:

```python
import tiledbsoma as soma
import lamindb as ln

# Create SOMA experiment
uri = "tiledb://my-namespace/experiment"

with soma.Experiment.create(uri) as exp:
    # Add measurements
    exp.add_new_collection("RNA")

# Register in LaminDB
artifact = ln.Artifact(
    uri,
    key="cellxgene/experiment.soma",
    description="TileDB-SOMA experiment"
).save()

# Query with SOMA
with soma.Experiment.open(uri) as exp:
    obs = exp.obs.read().concat().to_pandas()
```

### DuckDB

Query artifacts with DuckDB:

```python
import duckdb
import lamindb as ln

# Get artifact
artifact = ln.Artifact.get(key="datasets/large_data.parquet")

# Query with DuckDB (without loading full file)
path = artifact.cache()
result = duckdb.query(f"""
    SELECT cell_type, COUNT(*) as count
    FROM read_parquet('{path}')
    GROUP BY cell_type
    ORDER BY count DESC
""").to_df()

# Save query result
result_artifact = ln.Artifact.from_dataframe(
    result,
    key="analysis/cell_type_counts.parquet"
).save()
```

## Visualization Integrations

### Vitessce

Create interactive visualizations:

```python
from vitessce import VitessceConfig
import lamindb as ln

# Load spatial data
artifact = ln.Artifact.get(key="spatial/visium_slide.h5ad")
adata = artifact.load()

# Create Vitessce configuration
vc = VitessceConfig.from_object(adata)

# Save configuration
import json
config_file = "vitessce_config.json"
with open(config_file, "w") as f:
    json.dump(vc.to_dict(), f)

# Register configuration
config_artifact = ln.Artifact(
    config_file,
    key="visualizations/spatial_config.json",
    description="Vitessce visualization config"
).save()
```

## Schema Module Integrations

### Bionty (Biological Ontologies)

```python
import bionty as bt

# Import biological ontologies
bt.CellType.import_source()
bt.Gene.import_source(organism="human")

# Use in data curation
cell_types = bt.CellType.from_values(adata.obs.cell_type)
```
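
The records returned by `from_values()` can be saved so they become queryable labels in the instance, and unmapped values can be inspected; a minimal sketch, assuming `adata` from an earlier step:

```python
import bionty as bt
import lamindb as ln

# Map observed labels to ontology records and persist the matches
cell_types = bt.CellType.from_values(adata.obs.cell_type)
ln.save(cell_types)

# Report which values did and did not validate against the ontology
bt.CellType.inspect(adata.obs.cell_type)
```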

### WetLab

Track wet lab experiments:

```python
# Install the wetlab module (shell):
#   pip install 'lamindb[wetlab]'

# Use wetlab registries
import lamindb_wetlab as wetlab

# Track experiments, samples, protocols
experiment = wetlab.Experiment(name="RNA-seq batch 1").save()
```

### Clinical Data (OMOP)

```python
# Install the clinical module (shell):
#   pip install 'lamindb[clinical]'

# Use the OMOP common data model
import lamindb_clinical as clinical

# Track clinical data
patient = clinical.Patient(patient_id="P001").save()
```

## Git Integration

### Sync with Git Repositories

```python
import lamindb as ln

# Configure git sync via an environment variable (shell):
#   export LAMINDB_SYNC_GIT_REPO=https://github.com/user/repo.git
# Or programmatically
ln.settings.sync_git_repo = "https://github.com/user/repo.git"

# Set the development directory (shell):
#   lamin settings set dev-dir .

# Scripts are tracked with git commits
ln.track()  # Automatically captures the git commit hash
# ... your code ...
ln.finish()

# View git information
transform = ln.Transform.get(name="analysis.py")
transform.source_code  # Shows code at the git commit
transform.hash  # Git commit hash
```
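
To see which data a git-tracked script produced, you can filter artifacts by the runs of its transform; a minimal sketch, reusing the `Transform.get` call from the block above:

```python
import lamindb as ln

transform = ln.Transform.get(name="analysis.py")

# All artifacts written by runs of this git-tracked script
outputs = ln.Artifact.filter(run__transform=transform)
print(outputs.df())
```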

## Enterprise Integrations

### Benchling

Sync with Benchling registries (requires team/enterprise plan):

```python
# Configure Benchling connection (contact LaminDB team)
# Syncs schemas and data from Benchling registries

# Access synced Benchling data
# Details available through enterprise support
```

## Custom Integration Patterns

### REST API Integration

```python
import requests
import lamindb as ln

ln.track()

# Fetch from API
response = requests.get("https://api.example.com/data")
data = response.json()

# Convert to DataFrame
import pandas as pd
df = pd.DataFrame(data)

# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
    df,
    key="api/fetched_data.parquet",
    description="Data fetched from external API"
).save()

artifact.features.add_values({"api_url": response.url})

ln.finish()
```

### Database Integration

```python
import pandas as pd
import sqlalchemy as sa

import lamindb as ln

ln.track()

# Connect to an external database
engine = sa.create_engine("postgresql://user:pwd@host:port/db")

# Query data
query = "SELECT * FROM experiments WHERE date > '2025-01-01'"
df = pd.read_sql(query, engine)

# Save to LaminDB
artifact = ln.Artifact.from_dataframe(
    df,
    key="external_db/experiments_2025.parquet",
    description="Experiments from external database"
).save()

ln.finish()
```

## Croissant Metadata

Export datasets with the Croissant metadata format:

```python
# Create artifact with rich metadata
artifact = ln.Artifact.from_dataframe(
    df,
    key="datasets/published_data.parquet",
    description="Published dataset with Croissant metadata"
).save()

# Export Croissant metadata (requires additional configuration)
# Enables dataset discovery and interoperability
```
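
Croissant is a JSON-LD vocabulary, so one low-tooling option is to write a metadata stub by hand and register it next to the dataset; a hand-rolled sketch with illustrative field values, not a complete Croissant record:

```python
import json

import lamindb as ln

# Minimal JSON-LD stub loosely following the Croissant (schema.org Dataset) vocabulary
croissant = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "published_data",
    "description": "Published dataset with Croissant metadata",
}

with open("published_data.croissant.json", "w") as f:
    json.dump(croissant, f, indent=2)

# Register the metadata file alongside the data artifact
ln.Artifact(
    "published_data.croissant.json",
    key="datasets/published_data.croissant.json",
    description="Croissant metadata for datasets/published_data.parquet",
).save()
```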

## Best Practices for Integrations

1. **Track consistently**: Use `ln.track()` in all integrated workflows
2. **Link IDs**: Store external system IDs (W&B run ID, MLflow experiment ID) as features
3. **Centralize data**: Use LaminDB as the single source of truth for data artifacts
4. **Sync parameters**: Log parameters to both LaminDB and ML platforms
5. **Version together**: Keep code (git), data (LaminDB), and experiments (ML platform) in sync
6. **Cache strategically**: Configure appropriate cache locations for cloud storage
7. **Use feature sets**: Link ontology terms from Bionty to artifacts
8. **Document integrations**: Add descriptions explaining integration context
9. **Test incrementally**: Verify integration with small datasets first
10. **Monitor lineage**: Use `view_lineage()` to ensure integration tracking works (see the sketch below)
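
A minimal sketch of points 2 and 10, assuming an artifact like the W&B model saved earlier and that an `mlflow_run_id` feature has been registered in the instance (the key and run ID are placeholders):

```python
import lamindb as ln

# Fetch a previously saved artifact (placeholder key)
artifact = ln.Artifact.get(key="models/experiment-1.pkl")

# Point 2: link the external system's ID as a feature value
# (assumes an `mlflow_run_id` feature exists in the instance)
artifact.features.add_values({"mlflow_run_id": "abc123"})

# Point 10: render the lineage graph to confirm tracking works end to end
artifact.view_lineage()
```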

## Troubleshooting Common Issues

**Issue: S3 credentials not found**

```bash
export AWS_ACCESS_KEY_ID=your_key
export AWS_SECRET_ACCESS_KEY=your_secret
export AWS_DEFAULT_REGION=us-east-1
```

**Issue: GCS authentication failure**

```bash
gcloud auth application-default login
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/credentials.json
```

**Issue: Git sync not working**

```bash
# Ensure the git repo is set
lamin settings get sync-git-repo

# Ensure you're in a git repo
git status

# Commit changes before tracking
git add .
git commit -m "Update analysis"

# Then re-run your script, which calls ln.track()
```

**Issue: MLflow artifacts not syncing**

```python
# Save explicitly to both systems
mlflow.log_artifact("model.pkl")
ln.Artifact("model.pkl", key="models/model.pkl").save()
```