# LaminDB Core Concepts

This document covers the fundamental concepts and building blocks of LaminDB: Artifacts, Records, Runs, Transforms, Features, and data lineage tracking.

## Artifacts

Artifacts represent datasets in various formats (DataFrames, AnnData, SpatialData, Parquet, Zarr, etc.). They serve as the primary data objects in LaminDB.

### Creating and Saving Artifacts

**From file:**
```python
import lamindb as ln

# Save a file as an artifact
ln.Artifact("sample.fasta", key="sample.fasta").save()

# With a description
artifact = ln.Artifact(
    "data/analysis.h5ad",
    key="experiments/scrna_batch1.h5ad",
    description="Single-cell RNA-seq batch 1"
).save()
```

**From DataFrame:**
```python
import pandas as pd

df = pd.read_csv("data.csv")
artifact = ln.Artifact.from_dataframe(
    df,
    key="datasets/processed_data.parquet",
    description="Processed experimental data"
).save()
```

**From AnnData:**
```python
import anndata as ad

adata = ad.read_h5ad("data.h5ad")
artifact = ln.Artifact.from_anndata(
    adata,
    key="scrna/experiment1.h5ad",
    description="scRNA-seq data with QC"
).save()
```

### Retrieving Artifacts

```python
# By key
artifact = ln.Artifact.get(key="sample.fasta")

# By UID
artifact = ln.Artifact.get("aRt1Fact0uid000")

# By filter
artifact = ln.Artifact.filter(suffix=".h5ad").first()
```
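
If nothing matches, `.get()` raises an error. For optional lookups, the pattern below can help; it is a sketch that assumes LaminDB's queryset exposes `one_or_none()`, so verify against your version's API:

```python
# Optional lookup: None instead of an exception when nothing matches
# (assumes the queryset method one_or_none(); verify for your LaminDB version)
maybe_artifact = ln.Artifact.filter(key="sample.fasta").one_or_none()
if maybe_artifact is None:
    print("artifact not registered yet")
```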

### Accessing Artifact Content

```python
# Get the cached local path
local_path = artifact.cache()

# Load into memory
data = artifact.load()  # Returns DataFrame, AnnData, etc.

# Streaming access (for large files)
with artifact.open() as f:
    # Read incrementally
    chunk = f.read(1000)
```

### Artifact Metadata

```python
# View all metadata
artifact.describe()

# Access specific metadata
artifact.size        # File size in bytes
artifact.suffix      # File extension
artifact.created_at  # Timestamp
artifact.created_by  # User who created it
artifact.run         # Associated run
artifact.transform   # Associated transform
artifact.version     # Version string
```

## Records

Records represent experimental entities: samples, perturbations, instruments, cell lines, and any other metadata entities. They support hierarchical relationships through type definitions.

### Creating Records

```python
# Define a type
sample_type = ln.Record(name="Sample", is_type=True).save()

# Create instances of that type
ln.Record(name="P53mutant1", type=sample_type).save()
ln.Record(name="P53mutant2", type=sample_type).save()
ln.Record(name="WT-control", type=sample_type).save()
```

### Searching Records

```python
# Text search
ln.Record.search("p53").to_dataframe()

# Filter by fields
ln.Record.filter(type=sample_type).to_dataframe()

# Get a specific record
record = ln.Record.get(name="P53mutant1")
```
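
Registries in LaminDB typically also offer a `.lookup()` helper for autocomplete-style access in notebooks. A minimal sketch, assuming `Record` supports it like other registries:

```python
# Autocomplete-style access to saved records
# (assumes Record.lookup() exists, as on other LaminDB registries)
records = ln.Record.lookup()
records.p53mutant1  # the record named "P53mutant1"
```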

### Hierarchical Relationships

```python
# Establish a parent-child relationship
parent_record = ln.Record.get(name="P53mutant1")
child_record = ln.Record(name="P53mutant1-replicate1", type=sample_type).save()
child_record.parents.add(parent_record)

# Query relationships
parent_record.children.to_dataframe()
child_record.parents.to_dataframe()
```

## Runs & Transforms

These capture computational lineage. A **Transform** represents a reusable analysis step (notebook, script, or function), while a **Run** documents a single execution of that step.

### Basic Tracking Workflow

```python
import lamindb as ln

# Start tracking (beginning of notebook/script)
ln.track()

# Your analysis code
data = ln.Artifact.get(key="input.csv").load()
# ... perform the analysis, producing `result` ...
result.to_csv("output.csv")
artifact = ln.Artifact("output.csv", key="output.csv").save()

# Finish tracking (end of notebook/script)
ln.finish()
```

### Tracking with Parameters

```python
ln.track(params={
    "learning_rate": 0.01,
    "batch_size": 32,
    "epochs": 100,
    "downsample": True
})

# Query runs by parameters
ln.Run.filter(params__learning_rate=0.01).to_dataframe()
ln.Run.filter(params__downsample=True).to_dataframe()
```

### Tracking with Projects

```python
# Associate with a project
ln.track(project="Cancer Drug Screen 2025")

# Query by project
project = ln.Project.get(name="Cancer Drug Screen 2025")
ln.Artifact.filter(projects=project).to_dataframe()
ln.Run.filter(project=project).to_dataframe()
```
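
`ln.track(project=...)` references a project by name, which presumably has to exist beforehand. A minimal sketch for creating it, using the same `ln.Project` registry queried above:

```python
# Register the project once before referencing it in ln.track()
ln.Project(name="Cancer Drug Screen 2025").save()
```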

### Function-Level Tracking

Use the `@ln.tracked()` decorator for fine-grained lineage:

```python
@ln.tracked()
def preprocess_data(input_key: str, output_key: str, normalize: bool = True) -> None:
    """Preprocess raw data and save the result."""
    # Load input (automatically tracked)
    artifact = ln.Artifact.get(key=input_key)
    data = artifact.load()

    # Process
    if normalize:
        data = (data - data.mean()) / data.std()

    # Save output (automatically tracked)
    ln.Artifact.from_dataframe(data, key=output_key).save()

# The function becomes a Transform; each call creates a separate Run
preprocess_data("raw/batch1.csv", "processed/batch1.csv", normalize=True)
preprocess_data("raw/batch2.csv", "processed/batch2.csv", normalize=False)
```
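
To inspect those executions afterwards, something like the following should work; it assumes the resulting Transform is retrievable via the function's name and exposes a `runs` relation (field names may differ across LaminDB versions):

```python
# List the runs created by the two calls above
transform = ln.Transform.get(name="preprocess_data")
transform.runs.to_dataframe()
```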

### Accessing Lineage Information

```python
# From artifact to run
artifact = ln.Artifact.get(key="output.csv")
run = artifact.run
transform = run.transform

# View details
run.describe()        # Run metadata
transform.describe()  # Transform metadata

# Access inputs
run.inputs.to_dataframe()

# Visualize the lineage graph
artifact.view_lineage()
```

## Features

Features define typed metadata fields for validation and querying. They enable structured annotation and searching.

### Defining Features

```python
from datetime import date

# Numeric features
ln.Feature(name="gc_content", dtype=float).save()
ln.Feature(name="read_count", dtype=int).save()

# Date feature
ln.Feature(name="experiment_date", dtype=date).save()

# String features
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="treatment", dtype=str).save()
```
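
For true categoricals with a controlled vocabulary, features can also point at a registry instead of plain `str`. The dtype string below follows the pattern in LaminDB's docs, but confirm the exact syntax for your version:

```python
# Categorical feature whose values are validated against the ULabel registry
# (the "cat[ULabel]" dtype string is an assumption; check your LaminDB version)
ln.Feature(name="perturbation", dtype="cat[ULabel]").save()
```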

### Annotating Artifacts with Features

```python
# Assign feature values
artifact.features.add_values({
    "gc_content": 0.55,
    "experiment_date": "2025-10-31"
})

# Using feature registry records
gc_content_feature = ln.Feature.get(name="gc_content")
artifact.features.add(gc_content_feature)
```

### Querying by Features

```python
# Filter by feature value
ln.Artifact.filter(gc_content=0.55).to_dataframe()
ln.Artifact.filter(experiment_date="2025-10-31").to_dataframe()

# Comparison operators
ln.Artifact.filter(read_count__gt=1000000).to_dataframe()
ln.Artifact.filter(gc_content__gte=0.5, gc_content__lte=0.6).to_dataframe()

# Check for the presence of an annotation
ln.Artifact.filter(cell_type__isnull=False).to_dataframe()

# Include features in the output
ln.Artifact.filter(treatment="DMSO").to_dataframe(include="features")
```

### Nested Dictionary Features

For complex metadata stored as dictionaries:

```python
# Access nested values
ln.Artifact.filter(study_metadata__detail1="123").to_dataframe()
ln.Artifact.filter(study_metadata__assay__type="RNA-seq").to_dataframe()
```
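
How such nested metadata gets attached is not shown above; a plausible sketch, assuming `dict` is an accepted `Feature` dtype and that `add_values` accepts nested dictionaries:

```python
# Hypothetical setup behind the study_metadata queries above
# (assumes dtype=dict is supported and add_values takes nested dicts)
ln.Feature(name="study_metadata", dtype=dict).save()
artifact.features.add_values({
    "study_metadata": {"detail1": "123", "assay": {"type": "RNA-seq"}},
})
```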

## Data Lineage Tracking

LaminDB automatically captures execution context and the relationships between data, code, and runs.

### What Gets Tracked

- **Source code**: Script/notebook content and git commit
- **Environment**: Python packages and versions
- **Input artifacts**: Data loaded during execution
- **Output artifacts**: Data created during execution
- **Execution metadata**: Timestamps, user, parameters
- **Computational dependencies**: Transform relationships

### Viewing Lineage

```python
# Visualize the full lineage graph
artifact.view_lineage()

# View captured metadata
artifact.describe()

# Access related entities
artifact.run            # The run that created it
artifact.run.transform  # The transform (code) used
artifact.run.inputs     # Input artifacts
artifact.run.report     # Execution report
```

### Querying Lineage

```python
# Find all outputs of a transform
transform = ln.Transform.get(name="preprocessing.py")
ln.Artifact.filter(transform=transform).to_dataframe()

# Find all artifacts created by a specific user
user = ln.User.get(handle="researcher123")
ln.Artifact.filter(created_by=user).to_dataframe()

# Find artifacts derived from specific inputs
input_artifact = ln.Artifact.get(key="raw/data.csv")
runs = ln.Run.filter(inputs=input_artifact)
ln.Artifact.filter(run__in=runs).to_dataframe()
```

## Versioning

LaminDB manages versioning automatically when source data or code changes: saving modified data under an existing key creates a new artifact version.

### Automatic Versioning

```python
# First version
artifact_v1 = ln.Artifact("data.csv", key="experiment/data.csv").save()

# Modify data.csv, then save under the same key - this creates a new version
artifact_v2 = ln.Artifact("data.csv", key="experiment/data.csv").save()
```
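
Saving under an existing key is the implicit route; the relationship can presumably also be declared explicitly. A sketch, assuming `ln.Artifact` accepts a `revises` argument (check your version's API reference):

```python
# Explicitly mark the new artifact as a revision of an earlier one
# (assumes the `revises` argument; verify for your LaminDB version)
artifact_v2 = ln.Artifact("data.csv", key="experiment/data.csv", revises=artifact_v1).save()
```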

### Working with Versions

```python
# Get the latest version (default)
artifact = ln.Artifact.get(key="experiment/data.csv")

# View all versions
artifact.versions.to_dataframe()

# Get a specific version
artifact_v1 = artifact.versions.filter(version="1").first()

# Compare versions
v1_data = artifact_v1.load()
v2_data = artifact.load()
```
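
When querying across many artifacts, it is often useful to restrict results to current versions. The sketch below assumes a boolean `is_latest` field flagging the newest version of each artifact:

```python
# Only the newest version of each artifact under this key prefix
# (assumes the is_latest field; confirm for your LaminDB version)
ln.Artifact.filter(key__startswith="experiment/", is_latest=True).to_dataframe()
```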

## Best Practices

1. **Use meaningful keys**: Structure keys hierarchically (e.g., `project/experiment/sample.h5ad`; see the sketch below)
2. **Add descriptions**: Help future users understand artifact contents
3. **Track consistently**: Call `ln.track()` at the start of every analysis
4. **Define features upfront**: Create the feature registry before annotating
5. **Use typed features**: Specify dtypes for better validation
6. **Leverage versioning**: Don't create new keys for minor changes
7. **Document transforms**: Add docstrings to tracked functions
8. **Set projects**: Group related work for easier organization and access control
9. **Query efficiently**: Use filters before loading large datasets
10. **Visualize lineage**: Use `view_lineage()` to understand data provenance
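
A short end-to-end sketch tying several of these practices together, reusing only calls shown above (key, project, and feature names are illustrative):

```python
import lamindb as ln

# 3. Track consistently; 8. set a project
ln.track(project="Cancer Drug Screen 2025")

# 4./5. Define a typed feature before annotating
ln.Feature(name="treatment", dtype=str).save()

# 1./2. Hierarchical key plus a description
artifact = ln.Artifact(
    "results/hits.parquet",
    key="drug-screen/2025/hits.parquet",
    description="Screen hits for batch 1",
).save()
artifact.features.add_values({"treatment": "DMSO"})

# 10. Inspect provenance
artifact.view_lineage()

ln.finish()
```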