# LaminDB Core Concepts

This document covers the fundamental concepts and building blocks of LaminDB: Artifacts, Records, Runs, Transforms, Features, and data lineage tracking.

## Artifacts

Artifacts represent datasets in various formats (DataFrames, AnnData, SpatialData, Parquet, Zarr, etc.). They serve as the primary data objects in LaminDB.

### Creating and Saving Artifacts

**From file:**
```python
import lamindb as ln

# Save a file as an artifact
ln.Artifact("sample.fasta", key="sample.fasta").save()

# With a description
artifact = ln.Artifact(
    "data/analysis.h5ad",
    key="experiments/scrna_batch1.h5ad",
    description="Single-cell RNA-seq batch 1"
).save()
```

**From DataFrame:**
```python
import pandas as pd

df = pd.read_csv("data.csv")
artifact = ln.Artifact.from_dataframe(
    df,
    key="datasets/processed_data.parquet",
    description="Processed experimental data"
).save()
```

**From AnnData:**
```python
import anndata as ad

adata = ad.read_h5ad("data.h5ad")
artifact = ln.Artifact.from_anndata(
    adata,
    key="scrna/experiment1.h5ad",
    description="scRNA-seq data with QC"
).save()
```

### Retrieving Artifacts

```python
# By key
artifact = ln.Artifact.get(key="sample.fasta")

# By UID
artifact = ln.Artifact.get("aRt1Fact0uid000")

# By filter
artifact = ln.Artifact.filter(suffix=".h5ad").first()
```
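
If nothing matches, `.get()` raises an error. For optional lookups, the pattern below can help; it is a sketch that assumes LaminDB's queryset exposes `one_or_none()`, so verify against your version's API:

```python
# Optional lookup: None instead of an exception when nothing matches
# (assumes the queryset method one_or_none(); verify for your LaminDB version)
maybe_artifact = ln.Artifact.filter(key="sample.fasta").one_or_none()
if maybe_artifact is None:
    print("artifact not registered yet")
```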

### Accessing Artifact Content

```python
# Get the cached local path
local_path = artifact.cache()

# Load into memory
data = artifact.load()  # Returns DataFrame, AnnData, etc.

# Streaming access (for large files)
with artifact.open() as f:
    # Read incrementally
    chunk = f.read(1000)
```

### Artifact Metadata

```python
# View all metadata
artifact.describe()

# Access specific metadata
artifact.size        # File size in bytes
artifact.suffix      # File extension
artifact.created_at  # Timestamp
artifact.created_by  # User who created it
artifact.run         # Associated run
artifact.transform   # Associated transform
artifact.version     # Version string
```

## Records

Records represent experimental entities: samples, perturbations, instruments, cell lines, and any other metadata entities. They support hierarchical relationships through type definitions.

### Creating Records

```python
# Define a type
sample_type = ln.Record(name="Sample", is_type=True).save()

# Create instances of that type
ln.Record(name="P53mutant1", type=sample_type).save()
ln.Record(name="P53mutant2", type=sample_type).save()
ln.Record(name="WT-control", type=sample_type).save()
```

### Searching Records

```python
# Text search
ln.Record.search("p53").to_dataframe()

# Filter by fields
ln.Record.filter(type=sample_type).to_dataframe()

# Get a specific record
record = ln.Record.get(name="P53mutant1")
```
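
Registries in LaminDB typically also offer a `.lookup()` helper for autocomplete-style access in notebooks. A minimal sketch, assuming `Record` supports it like other registries:

```python
# Autocomplete-style access to saved records
# (assumes Record.lookup() exists, as on other LaminDB registries)
records = ln.Record.lookup()
records.p53mutant1  # the record named "P53mutant1"
```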

### Hierarchical Relationships

```python
# Establish a parent-child relationship
parent_record = ln.Record.get(name="P53mutant1")
child_record = ln.Record(name="P53mutant1-replicate1", type=sample_type).save()
child_record.parents.add(parent_record)

# Query relationships
parent_record.children.to_dataframe()
child_record.parents.to_dataframe()
```

## Runs & Transforms

These capture computational lineage. A **Transform** represents a reusable analysis step (notebook, script, or function), while a **Run** documents a single execution of that step.

### Basic Tracking Workflow

```python
import lamindb as ln

# Start tracking (beginning of notebook/script)
ln.track()

# Your analysis code
data = ln.Artifact.get(key="input.csv").load()
# ... perform the analysis, producing `result` ...
result.to_csv("output.csv")
artifact = ln.Artifact("output.csv", key="output.csv").save()

# Finish tracking (end of notebook/script)
ln.finish()
```

### Tracking with Parameters

```python
ln.track(params={
    "learning_rate": 0.01,
    "batch_size": 32,
    "epochs": 100,
    "downsample": True
})

# Query runs by parameters
ln.Run.filter(params__learning_rate=0.01).to_dataframe()
ln.Run.filter(params__downsample=True).to_dataframe()
```

### Tracking with Projects

```python
# Associate with a project
ln.track(project="Cancer Drug Screen 2025")

# Query by project
project = ln.Project.get(name="Cancer Drug Screen 2025")
ln.Artifact.filter(projects=project).to_dataframe()
ln.Run.filter(project=project).to_dataframe()
```
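
`ln.track(project=...)` references a project by name, which presumably has to exist beforehand. A minimal sketch for creating it, using the same `ln.Project` registry queried above:

```python
# Register the project once before referencing it in ln.track()
ln.Project(name="Cancer Drug Screen 2025").save()
```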

### Function-Level Tracking

Use the `@ln.tracked()` decorator for fine-grained lineage:

```python
@ln.tracked()
def preprocess_data(input_key: str, output_key: str, normalize: bool = True) -> None:
    """Preprocess raw data and save the result."""
    # Load input (automatically tracked)
    artifact = ln.Artifact.get(key=input_key)
    data = artifact.load()

    # Process
    if normalize:
        data = (data - data.mean()) / data.std()

    # Save output (automatically tracked)
    ln.Artifact.from_dataframe(data, key=output_key).save()

# The function becomes a Transform; each call creates a separate Run
preprocess_data("raw/batch1.csv", "processed/batch1.csv", normalize=True)
preprocess_data("raw/batch2.csv", "processed/batch2.csv", normalize=False)
```
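
To inspect those executions afterwards, something like the following should work; it assumes the resulting Transform is retrievable via the function's name and exposes a `runs` relation (field names may differ across LaminDB versions):

```python
# List the runs created by the two calls above
transform = ln.Transform.get(name="preprocess_data")
transform.runs.to_dataframe()
```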

### Accessing Lineage Information

```python
# From artifact to run
artifact = ln.Artifact.get(key="output.csv")
run = artifact.run
transform = run.transform

# View details
run.describe()        # Run metadata
transform.describe()  # Transform metadata

# Access inputs
run.inputs.to_dataframe()

# Visualize the lineage graph
artifact.view_lineage()
```

## Features

Features define typed metadata fields for validation and querying. They enable structured annotation and searching.

### Defining Features

```python
from datetime import date

# Numeric features
ln.Feature(name="gc_content", dtype=float).save()
ln.Feature(name="read_count", dtype=int).save()

# Date feature
ln.Feature(name="experiment_date", dtype=date).save()

# String features
ln.Feature(name="cell_type", dtype=str).save()
ln.Feature(name="treatment", dtype=str).save()
```
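
For true categoricals with a controlled vocabulary, features can also point at a registry instead of plain `str`. The dtype string below follows the pattern in LaminDB's docs, but confirm the exact syntax for your version:

```python
# Categorical feature whose values are validated against the ULabel registry
# (the "cat[ULabel]" dtype string is an assumption; check your LaminDB version)
ln.Feature(name="perturbation", dtype="cat[ULabel]").save()
```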

### Annotating Artifacts with Features

```python
# Assign feature values
artifact.features.add_values({
    "gc_content": 0.55,
    "experiment_date": "2025-10-31"
})

# Using feature registry records
gc_content_feature = ln.Feature.get(name="gc_content")
artifact.features.add(gc_content_feature)
```

### Querying by Features

```python
# Filter by feature value
ln.Artifact.filter(gc_content=0.55).to_dataframe()
ln.Artifact.filter(experiment_date="2025-10-31").to_dataframe()

# Comparison operators
ln.Artifact.filter(read_count__gt=1000000).to_dataframe()
ln.Artifact.filter(gc_content__gte=0.5, gc_content__lte=0.6).to_dataframe()

# Check for the presence of an annotation
ln.Artifact.filter(cell_type__isnull=False).to_dataframe()

# Include features in the output
ln.Artifact.filter(treatment="DMSO").to_dataframe(include="features")
```

### Nested Dictionary Features

For complex metadata stored as dictionaries:

```python
# Access nested values
ln.Artifact.filter(study_metadata__detail1="123").to_dataframe()
ln.Artifact.filter(study_metadata__assay__type="RNA-seq").to_dataframe()
```
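
How such nested metadata gets attached is not shown above; a plausible sketch, assuming `dict` is an accepted `Feature` dtype and that `add_values` accepts nested dictionaries:

```python
# Hypothetical setup behind the study_metadata queries above
# (assumes dtype=dict is supported and add_values takes nested dicts)
ln.Feature(name="study_metadata", dtype=dict).save()
artifact.features.add_values({
    "study_metadata": {"detail1": "123", "assay": {"type": "RNA-seq"}},
})
```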

## Data Lineage Tracking

LaminDB automatically captures execution context and the relationships between data, code, and runs.

### What Gets Tracked

- **Source code**: Script/notebook content and git commit
- **Environment**: Python packages and versions
- **Input artifacts**: Data loaded during execution
- **Output artifacts**: Data created during execution
- **Execution metadata**: Timestamps, user, parameters
- **Computational dependencies**: Transform relationships

### Viewing Lineage

```python
# Visualize the full lineage graph
artifact.view_lineage()

# View captured metadata
artifact.describe()

# Access related entities
artifact.run            # The run that created it
artifact.run.transform  # The transform (code) used
artifact.run.inputs     # Input artifacts
artifact.run.report     # Execution report
```

### Querying Lineage

```python
# Find all outputs of a transform
transform = ln.Transform.get(name="preprocessing.py")
ln.Artifact.filter(transform=transform).to_dataframe()

# Find all artifacts created by a specific user
user = ln.User.get(handle="researcher123")
ln.Artifact.filter(created_by=user).to_dataframe()

# Find artifacts derived from specific inputs
input_artifact = ln.Artifact.get(key="raw/data.csv")
runs = ln.Run.filter(inputs=input_artifact)
ln.Artifact.filter(run__in=runs).to_dataframe()
```

## Versioning

LaminDB manages versioning automatically when source data or code changes: saving modified data under an existing key creates a new artifact version.

### Automatic Versioning

```python
# First version
artifact_v1 = ln.Artifact("data.csv", key="experiment/data.csv").save()

# Modify data.csv, then save under the same key - this creates a new version
artifact_v2 = ln.Artifact("data.csv", key="experiment/data.csv").save()
```
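
Saving under an existing key is the implicit route; the relationship can presumably also be declared explicitly. A sketch, assuming `ln.Artifact` accepts a `revises` argument (check your version's API reference):

```python
# Explicitly mark the new artifact as a revision of an earlier one
# (assumes the `revises` argument; verify for your LaminDB version)
artifact_v2 = ln.Artifact("data.csv", key="experiment/data.csv", revises=artifact_v1).save()
```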

### Working with Versions

```python
# Get the latest version (default)
artifact = ln.Artifact.get(key="experiment/data.csv")

# View all versions
artifact.versions.to_dataframe()

# Get a specific version
artifact_v1 = artifact.versions.filter(version="1").first()

# Compare versions
v1_data = artifact_v1.load()
v2_data = artifact.load()
```
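
When querying across many artifacts, it is often useful to restrict results to current versions. The sketch below assumes a boolean `is_latest` field flagging the newest version of each artifact:

```python
# Only the newest version of each artifact under this key prefix
# (assumes the is_latest field; confirm for your LaminDB version)
ln.Artifact.filter(key__startswith="experiment/", is_latest=True).to_dataframe()
```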

## Best Practices

1. **Use meaningful keys**: Structure keys hierarchically (e.g., `project/experiment/sample.h5ad`; see the sketch below)
2. **Add descriptions**: Help future users understand artifact contents
3. **Track consistently**: Call `ln.track()` at the start of every analysis
4. **Define features upfront**: Create the feature registry before annotating
5. **Use typed features**: Specify dtypes for better validation
6. **Leverage versioning**: Don't create new keys for minor changes
7. **Document transforms**: Add docstrings to tracked functions
8. **Set projects**: Group related work for easier organization and access control
9. **Query efficiently**: Use filters before loading large datasets
10. **Visualize lineage**: Use `view_lineage()` to understand data provenance
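
A short end-to-end sketch tying several of these practices together, reusing only calls shown above (key, project, and feature names are illustrative):

```python
import lamindb as ln

# 3. Track consistently; 8. set a project
ln.track(project="Cancer Drug Screen 2025")

# 4./5. Define a typed feature before annotating
ln.Feature(name="treatment", dtype=str).save()

# 1./2. Hierarchical key plus a description
artifact = ln.Artifact(
    "results/hits.parquet",
    key="drug-screen/2025/hits.parquet",
    description="Screen hits for batch 1",
).save()
artifact.features.add_values({"treatment": "DMSO"})

# 10. Inspect provenance
artifact.view_lineage()

ln.finish()
```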