# CZ CELLxGENE Census Data Schema Reference

## Overview

The CZ CELLxGENE Census is a versioned collection of single-cell data built on the TileDB-SOMA framework. This reference documents the data structure, available metadata fields, and query syntax.

## High-Level Structure

The Census is organized as a `SOMACollection` with two main components:

### 1. census_info

Summary information including:

- **summary**: Build date, cell counts, dataset statistics
- **datasets**: All datasets from CELLxGENE Discover with metadata
- **summary_cell_counts**: Cell counts stratified by metadata categories

### 2. census_data

Organism-specific `SOMAExperiment` objects:

- **"homo_sapiens"**: Human single-cell data
- **"mus_musculus"**: Mouse single-cell data

## Data Structure Per Organism

Each organism experiment contains:

### obs (Cell Metadata)

Cell-level annotations stored as a `SOMADataFrame`. Access via:

```python
census["census_data"]["homo_sapiens"].obs
```

### ms["RNA"] (Measurement)

RNA measurement data including:

- **X**: Data matrices with layers:
  - `raw`: Raw count data
  - `normalized`: Normalized counts (if available)
- **var**: Gene metadata
- **feature_dataset_presence_matrix**: Sparse boolean array showing which genes were measured in each dataset

## Cell Metadata Fields (obs)

### Required/Core Fields

**Identity & Dataset:**

- `soma_joinid`: Unique integer identifier for joins
- `dataset_id`: Source dataset identifier
- `is_primary_data`: Boolean flag (True = unique cell, False = duplicate across datasets)

**Cell Type:**

- `cell_type`: Human-readable cell type name
- `cell_type_ontology_term_id`: Standardized ontology term (e.g., "CL:0000236")

**Tissue:**

- `tissue`: Specific tissue name
- `tissue_general`: Broader tissue category (useful for grouping)
- `tissue_ontology_term_id`: Standardized ontology term

**Assay:**

- `assay`: Sequencing technology used
- `assay_ontology_term_id`: Standardized ontology term

**Disease:**

- `disease`: Disease status or condition
- `disease_ontology_term_id`: Standardized ontology term

**Donor:**

- `donor_id`: Unique donor identifier
- `sex`: Biological sex (male, female, unknown)
- `self_reported_ethnicity`: Ethnicity information
- `development_stage`: Life stage (adult, child, embryonic, etc.)
- `development_stage_ontology_term_id`: Standardized ontology term

**Organism:**

- `organism`: Scientific name (Homo sapiens, Mus musculus)
- `organism_ontology_term_id`: Standardized ontology term

**Technical:**

- `suspension_type`: Sample preparation type (cell, nucleus, na)

## Gene Metadata Fields (var)

Access via:

```python
census["census_data"]["homo_sapiens"].ms["RNA"].var
```

**Available Fields:**

- `soma_joinid`: Unique integer identifier for joins
- `feature_id`: Ensembl gene ID (e.g., "ENSG00000161798")
- `feature_name`: Gene symbol (e.g., "FOXP2")
- `feature_length`: Gene length in base pairs

## Value Filter Syntax

Queries use Python-like expressions for filtering. The syntax is processed by TileDB-SOMA.

### Comparison Operators

- `==`: Equal to
- `!=`: Not equal to
- `<`, `>`, `<=`, `>=`: Numeric comparisons
- `in`: Membership test (e.g., `feature_id in ['ENSG00000161798', 'ENSG00000188229']`)

### Logical Operators

- `and`, `&`: Logical AND
- `or`, `|`: Logical OR

### Examples

**Single condition:**

```python
value_filter="cell_type == 'B cell'"
```

**Multiple conditions with AND:**

```python
value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True"
```

**Using IN for multiple values:**

```python
value_filter="tissue in ['lung', 'liver', 'kidney']"
```

**Complex condition:**

```python
value_filter="(cell_type == 'neuron' or cell_type == 'astrocyte') and disease != 'normal'"
```

**Filtering genes:**

```python
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']"
```

## Data Inclusion Criteria

The Census includes all data from CZ CELLxGENE Discover that meets the following criteria:

1. **Species**: Human (*Homo sapiens*) or mouse (*Mus musculus*)
2. **Technology**: Approved sequencing technologies for RNA
3. **Count Type**: Raw counts only (no processed/normalized-only data)
4. **Metadata**: Standardized following CELLxGENE schema
5. **Both spatial and non-spatial data**: Includes traditional and spatial transcriptomics

## Important Data Characteristics

### Duplicate Cells

Cells may appear across multiple datasets. Use `is_primary_data == True` to filter for unique cells in most analyses.

### Count Types

The Census includes:

- **Molecule counts**: From UMI-based methods
- **Full-gene sequencing read counts**: From non-UMI methods

These may need different normalization approaches.

### Versioning

Census releases are identified by build date (e.g., "2023-07-25") or by an alias such as "stable". Because an alias can point to a different release over time, always specify a dated version for reproducible analysis:

```python
import cellxgene_census

# Pin a dated release rather than relying on an alias such as "stable"
census = cellxgene_census.open_soma(census_version="2023-07-25")
```

## Dataset Presence Matrix

Access which genes were measured in each dataset:

```python
presence_matrix = census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]
```

This sparse boolean matrix helps you understand:

- Gene coverage across datasets
- Which datasets to include for specific gene analyses
- Technical batch effects related to gene coverage

## SOMA Object Types

Core TileDB-SOMA objects used:

- **DataFrame**: Tabular data (obs, var)
- **SparseNDArray**: Sparse matrices (X layers, presence matrix)
- **DenseNDArray**: Dense arrays (less common)
- **Collection**: Container for related objects
- **Experiment**: Top-level container for measurements
- **SOMAScene**: Spatial transcriptomics scenes
- **obs_spatial_presence**: Spatial data availability
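
The value filter syntax and the `is_primary_data` guidance described earlier can also be combined programmatically. The following is a minimal sketch, not part of the Census API: `build_value_filter` and `read_obs` are hypothetical helpers introduced here for illustration. The first assembles a filter string from a dictionary of column constraints; the second shows how such a string might be passed to `obs.read()`, assuming the `cellxgene_census` package is installed, network access is available, and the pinned version exists. The string builder is plain Python and deliberately rejects values containing single quotes rather than trying to escape them.

```python
from typing import Mapping, Sequence, Union

Value = Union[str, bool, int, float, Sequence[str]]


def _literal(value) -> str:
    """Render a single Python value as a literal inside a value filter."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "True" if value else "False"
    if isinstance(value, (int, float)):
        return repr(value)
    if isinstance(value, str):
        if "'" in value:
            raise ValueError(f"quote characters are not supported: {value!r}")
        return f"'{value}'"
    raise TypeError(f"unsupported value type: {type(value).__name__}")


def build_value_filter(constraints: Mapping[str, Value]) -> str:
    """Hypothetical helper: join one clause per column with `and`.

    Scalars become `column == value`; lists and tuples become
    `column in [...]`.
    """
    clauses = []
    for column, value in constraints.items():
        if isinstance(value, (list, tuple)):
            if not value:
                raise ValueError(f"empty value list for column {column!r}")
            items = ", ".join(_literal(v) for v in value)
            clauses.append(f"{column} in [{items}]")
        else:
            clauses.append(f"{column} == {_literal(value)}")
    return " and ".join(clauses)


def read_obs(value_filter: str, census_version: str = "2023-07-25"):
    """Hypothetical helper: read matching cell metadata as a pandas DataFrame.

    Requires the cellxgene_census package and network access.
    """
    import cellxgene_census  # imported lazily so the builder works without it

    with cellxgene_census.open_soma(census_version=census_version) as census:
        obs = census["census_data"]["homo_sapiens"].obs
        return obs.read(
            value_filter=value_filter,
            column_names=["cell_type", "tissue_general", "disease"],
        ).concat().to_pandas()


obs_filter = build_value_filter(
    {
        "cell_type": "B cell",
        "tissue_general": ["lung", "liver"],
        "is_primary_data": True,
    }
)
print(obs_filter)
```

The printed string has the same form as the hand-written examples in the Value Filter Syntax section, and calling `read_obs(obs_filter)` would then fetch the matching rows. Building the string from a dictionary keeps the `is_primary_data` constraint visible next to the biological filters, which makes it harder to forget.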