# LaminDB Core Concepts

This document covers the fundamental concepts and building blocks of LaminDB: Artifacts, Records, Runs, Transforms, Features, and data lineage tracking.

## Artifacts

Artifacts represent datasets in various formats (DataFrames, AnnData, SpatialData, Parquet, Zarr, etc.). They serve as the primary data objects in LaminDB.

### Creating and Saving Artifacts

**From file:**

```python
import lamindb as ln

# Save a file as an artifact
ln.Artifact("sample.fasta", key="sample.fasta").save()

# With a description
artifact = ln.Artifact(
    "data/analysis.h5ad",
    key="experiments/scrna_batch1.h5ad",
    description="Single-cell RNA-seq batch 1",
).save()
```

**From DataFrame:**

```python
import pandas as pd

df = pd.read_csv("data.csv")
artifact = ln.Artifact.from_dataframe(
    df,
    key="datasets/processed_data.parquet",
    description="Processed experimental data",
).save()
```

**From AnnData:**

```python
import anndata as ad

adata = ad.read_h5ad("data.h5ad")
artifact = ln.Artifact.from_anndata(
    adata,
    key="scrna/experiment1.h5ad",
    description="scRNA-seq data with QC",
).save()
```

### Retrieving Artifacts

```python
# By key
artifact = ln.Artifact.get(key="sample.fasta")

# By UID
artifact = ln.Artifact.get("aRt1Fact0uid000")

# By filter
artifact = ln.Artifact.filter(suffix=".h5ad").first()
```

### Accessing Artifact Content

```python
# Get a cached local path
local_path = artifact.cache()

# Load into memory
data = artifact.load()  # Returns a DataFrame, AnnData, etc.

# Streaming access (for large files)
with artifact.open() as f:
    # Read incrementally
    chunk = f.read(1000)
```

### Artifact Metadata

```python
# View all metadata
artifact.describe()

# Access specific metadata
artifact.size        # File size in bytes
artifact.suffix      # File extension
artifact.created_at  # Timestamp
artifact.created_by  # User who created it
artifact.run         # Associated run
artifact.transform   # Associated transform
artifact.version     # Version string
```

## Records

Records represent experimental entities such as samples, perturbations, instruments, and cell lines, along with any other metadata entity. They support hierarchical relationships through type definitions.

### Creating Records

```python
# Define a type
sample_type = ln.Record(name="Sample", is_type=True).save()

# Create instances of that type
ln.Record(name="P53mutant1", type=sample_type).save()
ln.Record(name="P53mutant2", type=sample_type).save()
ln.Record(name="WT-control", type=sample_type).save()
```

### Searching Records

```python
# Text search
ln.Record.search("p53").to_dataframe()

# Filter by fields
ln.Record.filter(type=sample_type).to_dataframe()

# Get a specific record
record = ln.Record.get(name="P53mutant1")
```

### Hierarchical Relationships

```python
# Establish parent-child relationships
parent_record = ln.Record.get(name="P53mutant1")
child_record = ln.Record(name="P53mutant1-replicate1", type=sample_type).save()
child_record.parents.add(parent_record)

# Query relationships
parent_record.children.to_dataframe()
child_record.parents.to_dataframe()
```

## Runs & Transforms

These capture computational lineage. A **Transform** represents a reusable analysis step (notebook, script, or function), while a **Run** documents a specific execution instance.

### Basic Tracking Workflow

```python
import lamindb as ln

# Start tracking (beginning of the notebook/script)
ln.track()

# Your analysis code
data = ln.Artifact.get(key="input.csv").load()
# ... perform analysis, producing a DataFrame `result` ...
result.to_csv("output.csv")
artifact = ln.Artifact("output.csv", key="output.csv").save()

# Finish tracking (end of the notebook/script)
ln.finish()
```

### Tracking with Parameters

```python
ln.track(params={
    "learning_rate": 0.01,
    "batch_size": 32,
    "epochs": 100,
    "downsample": True,
})

# Query runs by parameters
ln.Run.filter(params__learning_rate=0.01).to_dataframe()
ln.Run.filter(params__downsample=True).to_dataframe()
```

### Tracking with Projects

```python
# Associate with a project
ln.track(project="Cancer Drug Screen 2025")

# Query by project
project = ln.Project.get(name="Cancer Drug Screen 2025")
ln.Artifact.filter(projects=project).to_dataframe()
ln.Run.filter(project=project).to_dataframe()
```

### Function-Level Tracking

Use the `@ln.tracked()` decorator for fine-grained lineage:

```python
@ln.tracked()
def preprocess_data(input_key: str, output_key: str, normalize: bool = True) -> None:
    """Preprocess raw data and save the result."""
    # Load input (automatically tracked)
    artifact = ln.Artifact.get(key=input_key)
    data = artifact.load()

    # Process
    if normalize:
        data = (data - data.mean()) / data.std()

    # Save output (automatically tracked)
    ln.Artifact.from_dataframe(data, key=output_key).save()

# Each call creates a separate run of the function's transform
preprocess_data("raw/batch1.csv", "processed/batch1.csv", normalize=True)
preprocess_data("raw/batch2.csv", "processed/batch2.csv", normalize=False)
```

### Accessing Lineage Information

```python
# From artifact to run
artifact = ln.Artifact.get(key="output.csv")
run = artifact.run
transform = run.transform

# View details
run.describe()        # Run metadata
transform.describe()  # Transform metadata

# Access inputs
run.inputs.to_dataframe()

# Visualize the lineage graph
artifact.view_lineage()
```

## Features

Features define typed metadata fields, enabling validated annotation and structured querying.

### Defining Features

```python
from datetime import date

# Numeric features
ln.Feature(name="gc_content", dtype=float).save()
ln.Feature(name="read_count", dtype=int).save()

# Date feature
ln.Feature(name="experiment_date", dtype=date).save()

# Categorical features
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="treatment", dtype=str).save()
```

### Annotating Artifacts with Features

```python
# Single values
artifact.features.add_values({
    "gc_content": 0.55,
    "experiment_date": "2025-10-31",
})

# Using feature registry records
gc_content_feature = ln.Feature.get(name="gc_content")
artifact.features.add(gc_content_feature)
```
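
Because each feature carries a dtype, annotation can be checked against the registry before it is stored. As a rough mental model (the `validate` helper below is hypothetical, not LaminDB's API), validation amounts to checking each proposed value against its feature's registered type:

```python
from datetime import date

# A toy feature registry: name -> expected type
registry = {"gc_content": float, "read_count": int, "experiment_date": date}

def validate(values: dict) -> list[str]:
    """Return a list of problems with the proposed annotations."""
    problems = []
    for name, value in values.items():
        if name not in registry:
            problems.append(f"unknown feature: {name}")
        elif not isinstance(value, registry[name]):
            problems.append(f"{name}: expected {registry[name].__name__}")
    return problems

validate({"gc_content": 0.55, "read_count": "many"})
# → ["read_count: expected int"]
```

This is why defining features upfront matters: a typo'd feature name or a mistyped value can be rejected at annotation time rather than discovered at query time.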

### Querying by Features

```python
# Filter by feature value
ln.Artifact.filter(gc_content=0.55).to_dataframe()
ln.Artifact.filter(experiment_date="2025-10-31").to_dataframe()

# Comparison operators
ln.Artifact.filter(read_count__gt=1000000).to_dataframe()
ln.Artifact.filter(gc_content__gte=0.5, gc_content__lte=0.6).to_dataframe()

# Check for presence of an annotation
ln.Artifact.filter(cell_type__isnull=False).to_dataframe()

# Include features in the output
ln.Artifact.filter(treatment="DMSO").to_dataframe(include="features")
```

### Nested Dictionary Features

For complex metadata stored as dictionaries:

```python
# Access nested values
ln.Artifact.filter(study_metadata__detail1="123").to_dataframe()
ln.Artifact.filter(study_metadata__assay__type="RNA-seq").to_dataframe()
```
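
The double-underscore syntax follows Django-style field lookups: each `__` segment steps one level deeper, so `study_metadata__assay__type` reads `study_metadata["assay"]["type"]`. A pure-Python sketch of the traversal, as a mental model rather than LaminDB internals:

```python
def resolve(data: dict, lookup: str):
    """Follow a '__'-separated path through nested dicts; None if absent."""
    value = data
    for part in lookup.split("__"):
        if not isinstance(value, dict) or part not in value:
            return None
        value = value[part]
    return value

study_metadata = {"detail1": "123", "assay": {"type": "RNA-seq"}}
resolve(study_metadata, "assay__type")  # → "RNA-seq"
resolve(study_metadata, "detail1")      # → "123"
```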

## Data Lineage Tracking

LaminDB automatically captures execution context and the relationships between data, code, and runs.

### What Gets Tracked

- **Source code**: Script/notebook content and git commit
- **Environment**: Python packages and versions
- **Input artifacts**: Data loaded during execution
- **Output artifacts**: Data created during execution
- **Execution metadata**: Timestamps, user, parameters
- **Computational dependencies**: Transform relationships

### Viewing Lineage

```python
# Visualize the full lineage graph
artifact.view_lineage()

# View captured metadata
artifact.describe()

# Access related entities
artifact.run            # The run that created it
artifact.run.transform  # The transform (code) used
artifact.run.inputs     # Input artifacts
artifact.run.report     # Execution report
```

### Querying Lineage

```python
# Find all outputs of a transform
transform = ln.Transform.get(name="preprocessing.py")
ln.Artifact.filter(transform=transform).to_dataframe()

# Find all artifacts created by a specific user
user = ln.User.get(handle="researcher123")
ln.Artifact.filter(created_by=user).to_dataframe()

# Find artifacts derived from specific inputs
input_artifact = ln.Artifact.get(key="raw/data.csv")
runs = ln.Run.filter(inputs=input_artifact)
ln.Artifact.filter(run__in=runs).to_dataframe()
```

## Versioning

LaminDB manages artifact versioning automatically when source data or code changes.

### Automatic Versioning

```python
# First version
artifact_v1 = ln.Artifact("data.csv", key="experiment/data.csv").save()

# Modify and save again under the same key - creates a new version
# (modify data.csv)
artifact_v2 = ln.Artifact("data.csv", key="experiment/data.csv").save()
```
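
Change detection is content-based: saving the same key with identical bytes should not produce a new version, while changed bytes should. The idea can be sketched with a content hash (illustrative only; LaminDB's actual hashing scheme may differ):

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Fingerprint file content; equal bytes yield equal hashes."""
    return hashlib.sha256(data).hexdigest()

v1 = content_hash(b"sample,value\nA,1\n")
v1_again = content_hash(b"sample,value\nA,1\n")
v2 = content_hash(b"sample,value\nA,2\n")

v1 == v1_again  # True: identical content, no new version needed
v1 == v2        # False: content changed, a new version is warranted
```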

### Working with Versions

```python
# Get the latest version (default)
artifact = ln.Artifact.get(key="experiment/data.csv")

# View all versions
artifact.versions.to_dataframe()

# Get a specific version
artifact_v1 = artifact.versions.filter(version="1").first()

# Compare versions
v1_data = artifact_v1.load()
v2_data = artifact.load()
```

## Best Practices

1. **Use meaningful keys**: Structure keys hierarchically (e.g., `project/experiment/sample.h5ad`)
2. **Add descriptions**: Help future users understand artifact contents
3. **Track consistently**: Call `ln.track()` at the start of every analysis
4. **Define features upfront**: Create the feature registry before annotating
5. **Use typed features**: Specify dtypes for better validation
6. **Leverage versioning**: Don't create new keys for minor changes
7. **Document transforms**: Add docstrings to tracked functions
8. **Set projects**: Group related work for easier organization and access control
9. **Query efficiently**: Use filters before loading large datasets
10. **Visualize lineage**: Use `view_lineage()` to understand data provenance
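
Practice 1 pays off at query time: hierarchical keys act like folder paths, so a prefix selects an entire subtree of artifacts. A small pure-Python illustration of the idea (in LaminDB itself this would be a filter on the `key` field; the helper below is purely illustrative):

```python
keys = [
    "drug-screen/batch1/plate1.h5ad",
    "drug-screen/batch1/plate2.h5ad",
    "drug-screen/batch2/plate1.h5ad",
    "rnaseq/pilot/sample1.fasta",
]

def subtree(keys: list[str], prefix: str) -> list[str]:
    """Select every key under a hierarchical prefix, folder-style."""
    return [k for k in keys if k.startswith(prefix + "/")]

subtree(keys, "drug-screen/batch1")
# → ["drug-screen/batch1/plate1.h5ad", "drug-screen/batch1/plate2.h5ad"]
```

Flat or ad-hoc keys give you nothing to prefix on, which is why a consistent `project/experiment/file` convention is worth establishing early.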