Files
2025-11-30 08:30:10 +08:00

381 lines
9.6 KiB
Markdown

# LaminDB Core Concepts
This document covers the fundamental concepts and building blocks of LaminDB: Artifacts, Records, Runs, Transforms, Features, and data lineage tracking.
## Artifacts
Artifacts represent datasets in various formats (DataFrames, AnnData, SpatialData, Parquet, Zarr, etc.). They serve as the primary data objects in LaminDB.
### Creating and Saving Artifacts
**From file:**
```python
import lamindb as ln
# Save a file as artifact
ln.Artifact("sample.fasta", key="sample.fasta").save()
# With description
artifact = ln.Artifact(
"data/analysis.h5ad",
key="experiments/scrna_batch1.h5ad",
description="Single-cell RNA-seq batch 1"
).save()
```
**From DataFrame:**
```python
import pandas as pd
df = pd.read_csv("data.csv")
artifact = ln.Artifact.from_dataframe(
df,
key="datasets/processed_data.parquet",
description="Processed experimental data"
).save()
```
**From AnnData:**
```python
import anndata as ad
adata = ad.read_h5ad("data.h5ad")
artifact = ln.Artifact.from_anndata(
adata,
key="scrna/experiment1.h5ad",
description="scRNA-seq data with QC"
).save()
```
### Retrieving Artifacts
```python
# By key
artifact = ln.Artifact.get(key="sample.fasta")
# By UID
artifact = ln.Artifact.get("aRt1Fact0uid000")
# By filter
artifact = ln.Artifact.filter(suffix=".h5ad").first()
```
### Accessing Artifact Content
```python
# Get cached local path
local_path = artifact.cache()
# Load into memory
data = artifact.load() # Returns DataFrame, AnnData, etc.
# Streaming access (for large files)
with artifact.open() as f:
# Read incrementally
chunk = f.read(1000)
```
### Artifact Metadata
```python
# View all metadata
artifact.describe()
# Access specific metadata
artifact.size # File size in bytes
artifact.suffix # File extension
artifact.created_at # Timestamp
artifact.created_by # User who created it
artifact.run # Associated run
artifact.transform # Associated transform
artifact.version # Version string
```
## Records
Records represent experimental entities: samples, perturbations, instruments, cell lines, and any other metadata entities. They support hierarchical relationships through type definitions.
### Creating Records
```python
# Define a type
sample_type = ln.Record(name="Sample", is_type=True).save()
# Create instances of that type
ln.Record(name="P53mutant1", type=sample_type).save()
ln.Record(name="P53mutant2", type=sample_type).save()
ln.Record(name="WT-control", type=sample_type).save()
```
### Searching Records
```python
# Text search
ln.Record.search("p53").to_dataframe()
# Filter by fields
ln.Record.filter(type=sample_type).to_dataframe()
# Get specific record
record = ln.Record.get(name="P53mutant1")
```
### Hierarchical Relationships
```python
# Establish parent-child relationships
parent_record = ln.Record.get(name="P53mutant1")
child_record = ln.Record(name="P53mutant1-replicate1", type=sample_type).save()
child_record.parents.add(parent_record)
# Query relationships
parent_record.children.to_dataframe()
child_record.parents.to_dataframe()
```
## Runs & Transforms
These capture computational lineage. A **Transform** represents a reusable analysis step (notebook, script, or function), while a **Run** documents a specific execution instance.
### Basic Tracking Workflow
```python
import lamindb as ln
# Start tracking (beginning of notebook/script)
ln.track()
# Your analysis code
data = ln.Artifact.get(key="input.csv").load()
# ... perform analysis ...
result.to_csv("output.csv")
artifact = ln.Artifact("output.csv", key="output.csv").save()
# Finish tracking (end of notebook/script)
ln.finish()
```
### Tracking with Parameters
```python
ln.track(params={
"learning_rate": 0.01,
"batch_size": 32,
"epochs": 100,
"downsample": True
})
# Query runs by parameters
ln.Run.filter(params__learning_rate=0.01).to_dataframe()
ln.Run.filter(params__downsample=True).to_dataframe()
```
### Tracking with Projects
```python
# Associate with project
ln.track(project="Cancer Drug Screen 2025")
# Query by project
project = ln.Project.get(name="Cancer Drug Screen 2025")
ln.Artifact.filter(projects=project).to_dataframe()
ln.Run.filter(project=project).to_dataframe()
```
### Function-Level Tracking
Use the `@ln.tracked()` decorator for fine-grained lineage:
```python
@ln.tracked()
def preprocess_data(input_key: str, output_key: str, normalize: bool = True) -> None:
"""Preprocess raw data and save result."""
# Load input (automatically tracked)
artifact = ln.Artifact.get(key=input_key)
data = artifact.load()
# Process
if normalize:
data = (data - data.mean()) / data.std()
# Save output (automatically tracked)
ln.Artifact.from_dataframe(data, key=output_key).save()
# Each call creates a separate Transform and Run
preprocess_data("raw/batch1.csv", "processed/batch1.csv", normalize=True)
preprocess_data("raw/batch2.csv", "processed/batch2.csv", normalize=False)
```
### Accessing Lineage Information
```python
# From artifact to run
artifact = ln.Artifact.get(key="output.csv")
run = artifact.run
transform = run.transform
# View details
run.describe() # Run metadata
transform.describe() # Transform metadata
# Access inputs
run.inputs.to_dataframe()
# Visualize lineage graph
artifact.view_lineage()
```
## Features
Features define typed metadata fields for validation and querying. They enable structured annotation and searching.
### Defining Features
```python
from datetime import date
# Numeric feature
ln.Feature(name="gc_content", dtype=float).save()
ln.Feature(name="read_count", dtype=int).save()
# Date feature
ln.Feature(name="experiment_date", dtype=date).save()
# Categorical feature
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="treatment", dtype=str).save()
```
### Annotating Artifacts with Features
```python
# Single values
artifact.features.add_values({
"gc_content": 0.55,
"experiment_date": "2025-10-31"
})
# Using feature registry records
gc_content_feature = ln.Feature.get(name="gc_content")
artifact.features.add(gc_content_feature)
```
### Querying by Features
```python
# Filter by feature value
ln.Artifact.filter(gc_content=0.55).to_dataframe()
ln.Artifact.filter(experiment_date="2025-10-31").to_dataframe()
# Comparison operators
ln.Artifact.filter(read_count__gt=1000000).to_dataframe()
ln.Artifact.filter(gc_content__gte=0.5, gc_content__lte=0.6).to_dataframe()
# Check for presence of annotation
ln.Artifact.filter(cell_type__isnull=False).to_dataframe()
# Include features in output
ln.Artifact.filter(treatment="DMSO").to_dataframe(include="features")
```
### Nested Dictionary Features
For complex metadata stored as dictionaries:
```python
# Access nested values
ln.Artifact.filter(study_metadata__detail1="123").to_dataframe()
ln.Artifact.filter(study_metadata__assay__type="RNA-seq").to_dataframe()
```
## Data Lineage Tracking
LaminDB automatically captures execution context and relationships between data, code, and runs.
### What Gets Tracked
- **Source code**: Script/notebook content and git commit
- **Environment**: Python packages and versions
- **Input artifacts**: Data loaded during execution
- **Output artifacts**: Data created during execution
- **Execution metadata**: Timestamps, user, parameters
- **Computational dependencies**: Transform relationships
### Viewing Lineage
```python
# Visualize full lineage graph
artifact.view_lineage()
# View captured metadata
artifact.describe()
# Access related entities
artifact.run # The run that created it
artifact.run.transform # The transform (code) used
artifact.run.inputs # Input artifacts
artifact.run.report # Execution report
```
### Querying Lineage
```python
# Find all outputs from a transform
transform = ln.Transform.get(name="preprocessing.py")
ln.Artifact.filter(transform=transform).to_dataframe()
# Find all artifacts from a specific user
user = ln.User.get(handle="researcher123")
ln.Artifact.filter(created_by=user).to_dataframe()
# Find artifacts using specific inputs
input_artifact = ln.Artifact.get(key="raw/data.csv")
runs = ln.Run.filter(inputs=input_artifact)
ln.Artifact.filter(run__in=runs).to_dataframe()
```
## Versioning
LaminDB manages artifact versioning automatically when source data or code changes.
### Automatic Versioning
```python
# First version
artifact_v1 = ln.Artifact("data.csv", key="experiment/data.csv").save()
# Modify and save again - creates new version
# (modify data.csv)
artifact_v2 = ln.Artifact("data.csv", key="experiment/data.csv").save()
```
### Working with Versions
```python
# Get latest version (default)
artifact = ln.Artifact.get(key="experiment/data.csv")
# View all versions
artifact.versions.to_dataframe()
# Get specific version
artifact_v1 = artifact.versions.filter(version="1").first()
# Compare versions
v1_data = artifact_v1.load()
v2_data = artifact.load()
```
## Best Practices
1. **Use meaningful keys**: Structure keys hierarchically (e.g., `project/experiment/sample.h5ad`)
2. **Add descriptions**: Help future users understand artifact contents
3. **Track consistently**: Call `ln.track()` at the start of every analysis
4. **Define features upfront**: Create feature registry before annotation
5. **Use typed features**: Specify dtypes for better validation
6. **Leverage versioning**: Don't create new keys for minor changes
7. **Document transforms**: Add docstrings to tracked functions
8. **Set projects**: Group related work for easier organization and access control
9. **Query efficiently**: Use filters before loading large datasets
10. **Visualize lineage**: Use `view_lineage()` to understand data provenance