# LaminDB Data Management This document covers querying, searching, filtering, and streaming data in LaminDB, as well as best practices for organizing and accessing datasets. ## Registry Overview View available registries and their contents: ```python import lamindb as ln # View all registries across modules ln.view() # View latest 100 artifacts ln.Artifact.to_dataframe() # View other registries ln.Transform.to_dataframe() ln.Run.to_dataframe() ln.User.to_dataframe() ``` ## Lookup for Quick Access For registries with fewer than 100k records, `Lookup` objects enable convenient auto-complete: ```python # Create lookup records = ln.Record.lookup() # Access by name (auto-complete enabled in IDEs) experiment_1 = records.experiment_1 sample_a = records.sample_a # Works with biological ontologies too import bionty as bt cell_types = bt.CellType.lookup() t_cell = cell_types.t_cell ``` ## Retrieving Single Records ### Using get() Retrieve exactly one record (errors if zero or multiple matches): ```python # By UID artifact = ln.Artifact.get("aRt1Fact0uid000") # By field artifact = ln.Artifact.get(key="data/experiment.h5ad") user = ln.User.get(handle="researcher123") # By ontology ID (for bionty) cell_type = bt.CellType.get(ontology_id="CL:0000084") ``` ### Using one() and one_or_none() ```python # Get exactly one from QuerySet (errors if 0 or >1) artifact = ln.Artifact.filter(key="data.csv").one() # Get one or None (errors if >1) artifact = ln.Artifact.filter(key="maybe_data.csv").one_or_none() # Get first match artifact = ln.Artifact.filter(suffix=".h5ad").first() ``` ## Filtering Data The `filter()` method returns a QuerySet for flexible retrieval: ```python # Basic filtering artifacts = ln.Artifact.filter(suffix=".h5ad") artifacts.to_dataframe() # Multiple conditions (AND logic) artifacts = ln.Artifact.filter( suffix=".h5ad", created_by=user ) # Comparison operators ln.Artifact.filter(size__gt=1e6).to_dataframe() # Greater than ln.Artifact.filter(size__gte=1e6).to_dataframe() # Greater than or equal ln.Artifact.filter(size__lt=1e9).to_dataframe() # Less than ln.Artifact.filter(size__lte=1e9).to_dataframe() # Less than or equal # Range queries ln.Artifact.filter(size__gte=1e6, size__lte=1e9).to_dataframe() ``` ## Text and String Queries ```python # Exact match ln.Artifact.filter(description="Experiment 1").to_dataframe() # Contains (case-sensitive) ln.Artifact.filter(description__contains="RNA").to_dataframe() # Case-insensitive contains ln.Artifact.filter(description__icontains="rna").to_dataframe() # Starts with ln.Artifact.filter(key__startswith="experiments/").to_dataframe() # Ends with ln.Artifact.filter(key__endswith=".csv").to_dataframe() # IN list ln.Artifact.filter(suffix__in=[".h5ad", ".csv", ".parquet"]).to_dataframe() ``` ## Feature-Based Queries Query artifacts by their annotated features: ```python # Filter by feature value ln.Artifact.filter(cell_type="T cell").to_dataframe() ln.Artifact.filter(treatment="DMSO").to_dataframe() # Include features in output ln.Artifact.filter(treatment="DMSO").to_dataframe(include="features") # Nested dictionary access ln.Artifact.filter(study_metadata__assay="RNA-seq").to_dataframe() ln.Artifact.filter(study_metadata__detail1="123").to_dataframe() # Check annotation status ln.Artifact.filter(cell_type__isnull=False).to_dataframe() # Has annotation ln.Artifact.filter(treatment__isnull=True).to_dataframe() # Missing annotation ``` ## Traversing Related Registries Django's double-underscore syntax enables queries across related tables: ```python # Find artifacts by creator handle ln.Artifact.filter(created_by__handle="researcher123").to_dataframe() ln.Artifact.filter(created_by__handle__startswith="test").to_dataframe() # Find artifacts by transform name ln.Artifact.filter(transform__name="preprocess.py").to_dataframe() # Find artifacts measuring specific genes ln.Artifact.filter(feature_sets__genes__symbol="CD8A").to_dataframe() ln.Artifact.filter(feature_sets__genes__ensembl_gene_id="ENSG00000153563").to_dataframe() # Find runs with specific parameters ln.Run.filter(params__learning_rate=0.01).to_dataframe() ln.Run.filter(params__downsample=True).to_dataframe() # Find artifacts from specific project project = ln.Project.get(name="Cancer Study") ln.Artifact.filter(projects=project).to_dataframe() ``` ## Ordering Results ```python # Order by field (ascending) ln.Artifact.filter(suffix=".h5ad").order_by("created_at").to_dataframe() # Order descending ln.Artifact.filter(suffix=".h5ad").order_by("-created_at").to_dataframe() # Multiple order fields ln.Artifact.order_by("-created_at", "size").to_dataframe() ``` ## Advanced Logical Queries ### OR Logic ```python from lamindb import Q # OR condition artifacts = ln.Artifact.filter( Q(suffix=".jpg") | Q(suffix=".png") ).to_dataframe() # Complex OR with multiple conditions artifacts = ln.Artifact.filter( Q(suffix=".h5ad", size__gt=1e6) | Q(suffix=".csv", size__lt=1e3) ).to_dataframe() ``` ### NOT Logic ```python # Exclude condition artifacts = ln.Artifact.filter( ~Q(suffix=".tmp") ).to_dataframe() # Complex exclusion artifacts = ln.Artifact.filter( ~Q(created_by__handle="testuser") ).to_dataframe() ``` ### Combining AND, OR, NOT ```python # Complex query artifacts = ln.Artifact.filter( (Q(suffix=".h5ad") | Q(suffix=".csv")) & Q(size__gt=1e6) & ~Q(created_by__handle__startswith="test") ).to_dataframe() ``` ## Search Functionality Full-text search across registry fields: ```python # Basic search ln.Artifact.search("iris").to_dataframe() ln.User.search("smith").to_dataframe() # Search in specific registry bt.CellType.search("T cell").to_dataframe() bt.Gene.search("CD8").to_dataframe() ``` ## Working with QuerySets QuerySets are lazy - they don't hit the database until evaluated: ```python # Create query (no database hit) qs = ln.Artifact.filter(suffix=".h5ad") # Evaluate in different ways df = qs.to_dataframe() # As pandas DataFrame list_records = list(qs) # As Python list count = qs.count() # Count only exists = qs.exists() # Boolean check # Iteration for artifact in qs: print(artifact.key, artifact.size) # Slicing first_10 = qs[:10] next_10 = qs[10:20] ``` ## Chaining Filters ```python # Build query incrementally qs = ln.Artifact.filter(suffix=".h5ad") qs = qs.filter(size__gt=1e6) qs = qs.filter(created_at__year=2025) qs = qs.order_by("-created_at") # Execute results = qs.to_dataframe() ``` ## Streaming Large Datasets For datasets too large to fit in memory, use streaming access: ### Streaming Files ```python # Open file stream artifact = ln.Artifact.get(key="large_file.csv") with artifact.open() as f: # Read in chunks chunk = f.read(10000) # Read 10KB # Process chunk ``` ### Array Slicing For array-based formats (Zarr, HDF5, AnnData): ```python # Get backing file without loading artifact = ln.Artifact.get(key="large_data.h5ad") adata = artifact.backed() # Returns backed AnnData # Slice specific portions subset = adata[:1000, :] # First 1000 cells genes_of_interest = adata[:, ["CD4", "CD8A", "CD8B"]] # Stream batches for i in range(0, adata.n_obs, 1000): batch = adata[i:i+1000, :] # Process batch ``` ### Iterator Access ```python # Process large collections incrementally artifacts = ln.Artifact.filter(suffix=".fastq.gz") for artifact in artifacts.iterator(chunk_size=10): # Process 10 at a time path = artifact.cache() # Analyze file ``` ## Aggregation and Statistics ```python # Count records ln.Artifact.filter(suffix=".h5ad").count() # Distinct values ln.Artifact.values_list("suffix", flat=True).distinct() # Aggregation (requires Django ORM knowledge) from django.db.models import Sum, Avg, Max, Min # Total size of all artifacts ln.Artifact.aggregate(Sum("size")) # Average artifact size by suffix ln.Artifact.values("suffix").annotate(avg_size=Avg("size")) ``` ## Caching and Performance ```python # Check cache location ln.settings.cache_dir # Configure cache lamin cache set /path/to/cache # Clear cache for specific artifact artifact.delete_cache() # Get cached path (downloads if needed) path = artifact.cache() # Check if cached if artifact.is_cached(): path = artifact.cache() ``` ## Organizing Data with Keys Best practices for structuring keys: ```python # Hierarchical organization ln.Artifact("data.h5ad", key="project/experiment/batch1/data.h5ad").save() ln.Artifact("data.h5ad", key="scrna/2025/oct/sample_001.h5ad").save() # Browse by prefix ln.Artifact.filter(key__startswith="scrna/2025/oct/").to_dataframe() # Version in key (alternative to built-in versioning) ln.Artifact("data.h5ad", key="data/processed/v1/final.h5ad").save() ln.Artifact("data.h5ad", key="data/processed/v2/final.h5ad").save() ``` ## Collections Group related artifacts into collections: ```python # Create collection collection = ln.Collection( [artifact1, artifact2, artifact3], name="scRNA-seq batch 1-3", description="Complete dataset across three batches" ).save() # Access collection members for artifact in collection.artifacts: print(artifact.key) # Query collections ln.Collection.filter(name__contains="batch").to_dataframe() ``` ## Best Practices 1. **Use filters before loading**: Query metadata before accessing file contents 2. **Leverage QuerySets**: Build queries incrementally for complex conditions 3. **Stream large files**: Don't load entire datasets into memory unnecessarily 4. **Structure keys hierarchically**: Makes browsing and filtering easier 5. **Use search for discovery**: When you don't know exact field values 6. **Cache strategically**: Configure cache location based on storage capacity 7. **Index features**: Define features upfront for efficient feature-based queries 8. **Use collections**: Group related artifacts for dataset-level operations 9. **Order results**: Sort by creation date or other fields for consistent retrieval 10. **Check existence**: Use `exists()` or `one_or_none()` to avoid errors ## Common Query Patterns ```python # Recent artifacts ln.Artifact.order_by("-created_at")[:10].to_dataframe() # My artifacts me = ln.setup.settings.user ln.Artifact.filter(created_by=me).to_dataframe() # Large files ln.Artifact.filter(size__gt=1e9).order_by("-size").to_dataframe() # This month's data from datetime import datetime ln.Artifact.filter( created_at__year=2025, created_at__month=10 ).to_dataframe() # Validated datasets with specific features ln.Artifact.filter( is_valid=True, cell_type__isnull=False ).to_dataframe(include="features") ```