# LaminDB Core Concepts

This document covers the fundamental concepts and building blocks of LaminDB: Artifacts, Records, Runs, Transforms, Features, and data lineage tracking.

## Artifacts

Artifacts represent datasets in various formats (DataFrames, AnnData, SpatialData, Parquet, Zarr, etc.). They serve as the primary data objects in LaminDB.

### Creating and Saving Artifacts

**From file:**

```python
import lamindb as ln

# Save a file as artifact
ln.Artifact("sample.fasta", key="sample.fasta").save()

# With description
artifact = ln.Artifact(
    "data/analysis.h5ad",
    key="experiments/scrna_batch1.h5ad",
    description="Single-cell RNA-seq batch 1"
).save()
```

**From DataFrame:**

```python
import pandas as pd

df = pd.read_csv("data.csv")
artifact = ln.Artifact.from_dataframe(
    df,
    key="datasets/processed_data.parquet",
    description="Processed experimental data"
).save()
```

**From AnnData:**

```python
import anndata as ad

adata = ad.read_h5ad("data.h5ad")
artifact = ln.Artifact.from_anndata(
    adata,
    key="scrna/experiment1.h5ad",
    description="scRNA-seq data with QC"
).save()
```

### Retrieving Artifacts

```python
# By key
artifact = ln.Artifact.get(key="sample.fasta")

# By UID
artifact = ln.Artifact.get("aRt1Fact0uid000")

# By filter
artifact = ln.Artifact.filter(suffix=".h5ad").first()
```

### Accessing Artifact Content

```python
# Get cached local path
local_path = artifact.cache()

# Load into memory
data = artifact.load()  # Returns DataFrame, AnnData, etc.

# Streaming access (for large files)
with artifact.open() as f:
    # Read incrementally
    chunk = f.read(1000)
```

### Artifact Metadata

```python
# View all metadata
artifact.describe()

# Access specific metadata
artifact.size        # File size in bytes
artifact.suffix      # File extension
artifact.created_at  # Timestamp
artifact.created_by  # User who created it
artifact.run         # Associated run
artifact.transform   # Associated transform
artifact.version     # Version string
```

## Records

Records represent experimental entities: samples, perturbations, instruments, cell lines, and any other metadata entities. They support hierarchical relationships through type definitions.

### Creating Records

```python
# Define a type
sample_type = ln.Record(name="Sample", is_type=True).save()

# Create instances of that type
ln.Record(name="P53mutant1", type=sample_type).save()
ln.Record(name="P53mutant2", type=sample_type).save()
ln.Record(name="WT-control", type=sample_type).save()
```

### Searching Records

```python
# Text search
ln.Record.search("p53").to_dataframe()

# Filter by fields
ln.Record.filter(type=sample_type).to_dataframe()

# Get specific record
record = ln.Record.get(name="P53mutant1")
```

### Hierarchical Relationships

```python
# Establish parent-child relationships
parent_record = ln.Record.get(name="P53mutant1")
child_record = ln.Record(name="P53mutant1-replicate1", type=sample_type).save()
child_record.parents.add(parent_record)

# Query relationships
parent_record.children.to_dataframe()
child_record.parents.to_dataframe()
```
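Types and parent-child links compose naturally into multi-level hierarchies. The sketch below builds a small two-level sample tree using only the calls shown above; the record names are hypothetical.

```python
import lamindb as ln

# Hypothetical hierarchy: a treated sample with two replicates
sample_type = ln.Record(name="Sample", is_type=True).save()
parent = ln.Record(name="DrugA-treated", type=sample_type).save()

for i in (1, 2):
    replicate = ln.Record(name=f"DrugA-treated-rep{i}", type=sample_type).save()
    # Link each replicate to its parent sample
    replicate.parents.add(parent)

# Traverse the hierarchy from either direction
parent.children.to_dataframe()  # Lists both replicates
```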
## Runs & Transforms

These capture computational lineage. A **Transform** represents a reusable analysis step (notebook, script, or function), while a **Run** documents a specific execution instance.

### Basic Tracking Workflow

```python
import lamindb as ln

# Start tracking (beginning of notebook/script)
ln.track()

# Your analysis code
data = ln.Artifact.get(key="input.csv").load()
# ... perform analysis ...
result.to_csv("output.csv")
artifact = ln.Artifact("output.csv", key="output.csv").save()

# Finish tracking (end of notebook/script)
ln.finish()
```

### Tracking with Parameters

```python
ln.track(params={
    "learning_rate": 0.01,
    "batch_size": 32,
    "epochs": 100,
    "downsample": True
})

# Query runs by parameters
ln.Run.filter(params__learning_rate=0.01).to_dataframe()
ln.Run.filter(params__downsample=True).to_dataframe()
```

### Tracking with Projects

```python
# Associate with project
ln.track(project="Cancer Drug Screen 2025")

# Query by project
project = ln.Project.get(name="Cancer Drug Screen 2025")
ln.Artifact.filter(projects=project).to_dataframe()
ln.Run.filter(project=project).to_dataframe()
```

### Function-Level Tracking

Use the `@ln.tracked()` decorator for fine-grained lineage:

```python
@ln.tracked()
def preprocess_data(input_key: str, output_key: str, normalize: bool = True) -> None:
    """Preprocess raw data and save result."""
    # Load input (automatically tracked)
    artifact = ln.Artifact.get(key=input_key)
    data = artifact.load()

    # Process
    if normalize:
        data = (data - data.mean()) / data.std()

    # Save output (automatically tracked)
    ln.Artifact.from_dataframe(data, key=output_key).save()

# The function maps to one Transform; each call creates a separate Run
preprocess_data("raw/batch1.csv", "processed/batch1.csv", normalize=True)
preprocess_data("raw/batch2.csv", "processed/batch2.csv", normalize=False)
```

### Accessing Lineage Information

```python
# From artifact to run
artifact = ln.Artifact.get(key="output.csv")
run = artifact.run
transform = run.transform

# View details
run.describe()        # Run metadata
transform.describe()  # Transform metadata

# Access inputs
run.inputs.to_dataframe()

# Visualize lineage graph
artifact.view_lineage()
```

## Features

Features define typed metadata fields for validation and querying. They enable structured annotation and searching.

### Defining Features

```python
from datetime import date

# Numeric features
ln.Feature(name="gc_content", dtype=float).save()
ln.Feature(name="read_count", dtype=int).save()

# Date feature
ln.Feature(name="experiment_date", dtype=date).save()

# Categorical features
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="treatment", dtype=str).save()
```

### Annotating Artifacts with Features

```python
# Single values
artifact.features.add_values({
    "gc_content": 0.55,
    "experiment_date": "2025-10-31"
})

# Using feature registry records
gc_content_feature = ln.Feature.get(name="gc_content")
artifact.features.add(gc_content_feature)
```

### Querying by Features

```python
# Filter by feature value
ln.Artifact.filter(gc_content=0.55).to_dataframe()
ln.Artifact.filter(experiment_date="2025-10-31").to_dataframe()

# Comparison operators
ln.Artifact.filter(read_count__gt=1000000).to_dataframe()
ln.Artifact.filter(gc_content__gte=0.5, gc_content__lte=0.6).to_dataframe()

# Check for presence of annotation
ln.Artifact.filter(cell_type__isnull=False).to_dataframe()

# Include features in output
ln.Artifact.filter(treatment="DMSO").to_dataframe(include="features")
```

### Nested Dictionary Features

For complex metadata stored as dictionaries:

```python
# Access nested values
ln.Artifact.filter(study_metadata__detail1="123").to_dataframe()
ln.Artifact.filter(study_metadata__assay__type="RNA-seq").to_dataframe()
```
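The queries above presuppose a dict-valued feature named `study_metadata`. Here is a minimal sketch of how such a feature might be defined and populated, assuming the Feature registry accepts `dict` as a dtype; the feature name and values are hypothetical and chosen to match the queries above.

```python
# Hypothetical dict-valued feature; assumes dtype=dict is supported
ln.Feature(name="study_metadata", dtype=dict).save()

# Annotate an artifact with nested metadata matching the queries above
artifact.features.add_values({
    "study_metadata": {
        "detail1": "123",
        "assay": {"type": "RNA-seq"},
    }
})
```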
## Data Lineage Tracking

LaminDB automatically captures execution context and relationships between data, code, and runs.

### What Gets Tracked

- **Source code**: Script/notebook content and git commit
- **Environment**: Python packages and versions
- **Input artifacts**: Data loaded during execution
- **Output artifacts**: Data created during execution
- **Execution metadata**: Timestamps, user, parameters
- **Computational dependencies**: Transform relationships

### Viewing Lineage

```python
# Visualize full lineage graph
artifact.view_lineage()

# View captured metadata
artifact.describe()

# Access related entities
artifact.run            # The run that created it
artifact.run.transform  # The transform (code) used
artifact.run.inputs     # Input artifacts
artifact.run.report     # Execution report
```

### Querying Lineage

```python
# Find all outputs of a transform
transform = ln.Transform.get(name="preprocessing.py")
ln.Artifact.filter(transform=transform).to_dataframe()

# Find all artifacts created by a specific user
user = ln.User.get(handle="researcher123")
ln.Artifact.filter(created_by=user).to_dataframe()

# Find artifacts derived from specific inputs
input_artifact = ln.Artifact.get(key="raw/data.csv")
runs = ln.Run.filter(inputs=input_artifact)
ln.Artifact.filter(run__in=runs).to_dataframe()
```

## Versioning

LaminDB manages artifact versioning automatically when source data or code changes.

### Automatic Versioning

```python
# First version
artifact_v1 = ln.Artifact("data.csv", key="experiment/data.csv").save()

# Modify and save again - creates a new version
# (modify data.csv)
artifact_v2 = ln.Artifact("data.csv", key="experiment/data.csv").save()
```

### Working with Versions

```python
# Get latest version (default)
artifact = ln.Artifact.get(key="experiment/data.csv")

# View all versions
artifact.versions.to_dataframe()

# Get specific version
artifact_v1 = artifact.versions.filter(version="1").first()

# Compare versions
v1_data = artifact_v1.load()
v2_data = artifact.load()
```

## Best Practices

1. **Use meaningful keys**: Structure keys hierarchically (e.g., `project/experiment/sample.h5ad`; see the sketch after this list)
2. **Add descriptions**: Help future users understand artifact contents
3. **Track consistently**: Call `ln.track()` at the start of every analysis
4. **Define features upfront**: Create the feature registry before annotating
5. **Use typed features**: Specify dtypes for better validation
6. **Leverage versioning**: Don't create new keys for minor changes
7. **Document transforms**: Add docstrings to tracked functions
8. **Set projects**: Group related work for easier organization and access control
9. **Query efficiently**: Use filters before loading large datasets
10. **Visualize lineage**: Use `view_lineage()` to understand data provenance
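A minimal sketch tying several of these practices together: consistent tracking, a hierarchical key, a description, and feature annotation. All calls are those shown earlier in this document; the project name, paths, and feature values are hypothetical.

```python
import lamindb as ln
import pandas as pd

# Track the analysis under a project, with run parameters (hypothetical values)
ln.track(project="Cancer Drug Screen 2025", params={"downsample": True})

df = pd.read_csv("raw/batch1.csv")
# ... perform analysis ...

# A hierarchical key plus a description makes the artifact easy to find later
artifact = ln.Artifact.from_dataframe(
    df,
    key="drug-screen/batch1/processed.parquet",
    description="Processed drug-screen measurements, batch 1",
).save()

# Annotate with a previously defined feature (assumes "treatment" is registered)
artifact.features.add_values({"treatment": "DMSO"})

ln.finish()
```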