CZ CELLxGENE Census Data Schema Reference
Overview
The CZ CELLxGENE Census is a versioned collection of single-cell data built on the TileDB-SOMA framework. This reference documents the data structure, available metadata fields, and query syntax.
High-Level Structure
The Census is organized as a SOMACollection with two main components:
1. census_info
Summary information including:
- summary: Build date, cell counts, dataset statistics
- datasets: All datasets from CELLxGENE Discover with metadata
- summary_cell_counts: Cell counts stratified by metadata categories
2. census_data
Organism-specific SOMAExperiment objects:
- "homo_sapiens": Human single-cell data
- "mus_musculus": Mouse single-cell data
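The census_info tables are SOMA DataFrames, so they can be loaded the same way as obs or var. The following is a minimal sketch, assuming `census` is a handle returned by cellxgene_census.open_soma(); the wrapper name read_info_table is hypothetical and only serves to show the access path.

```python
# Table names as listed under census_info in this reference.
INFO_TABLES = ("summary", "datasets", "summary_cell_counts")


def read_info_table(census, name):
    """Load one census_info table into a pandas DataFrame.

    `census` is an open Census handle, e.g. from cellxgene_census.open_soma().
    read_info_table is a hypothetical convenience wrapper, not a library call.
    """
    if name not in INFO_TABLES:
        raise KeyError(f"{name!r} is not one of {INFO_TABLES}")
    # read() iterates over Arrow tables; concat() merges them into one table
    # before the conversion to pandas.
    return census["census_info"][name].read().concat().to_pandas()
```

With an open handle, read_info_table(census, "datasets") would return the dataset listing described above.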
Data Structure Per Organism
Each organism experiment contains:
obs (Cell Metadata)
Cell-level annotations stored as a SOMADataFrame. Access via:
census["census_data"]["homo_sapiens"].obs
ms["RNA"] (Measurement)
RNA measurement data including:
- X: Data matrices with layers:
  - raw: Raw count data
  - normalized: (if available) Normalized counts
- var: Gene metadata
- feature_dataset_presence_matrix: Sparse boolean array showing which genes were measured in each dataset
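The pieces of the measurement (an X layer plus matching obs and var rows) can be pulled together with cellxgene_census.get_anndata. The sketch below assumes the cellxgene_census package is installed and the network is reachable; build_query and fetch_anndata are hypothetical helper names, and the import is deferred so the query can be assembled and inspected without the package.

```python
def build_query(organism, obs_filter=None, var_filter=None, layer="raw"):
    """Assemble keyword arguments for cellxgene_census.get_anndata.

    Hypothetical helper: it only packages arguments and does not query.
    """
    query = {"organism": organism, "X_name": layer}
    if obs_filter:
        query["obs_value_filter"] = obs_filter
    if var_filter:
        query["var_value_filter"] = var_filter
    return query


def fetch_anndata(query, census_version="2023-07-25"):
    """Open the Census and run the query (needs the package and network)."""
    import cellxgene_census  # deferred: only required when actually querying

    with cellxgene_census.open_soma(census_version=census_version) as census:
        return cellxgene_census.get_anndata(census, **query)


query = build_query(
    "Homo sapiens",
    obs_filter="cell_type == 'B cell' and is_primary_data == True",
    var_filter="feature_name in ['CD19', 'MS4A1']",
)
print(query)
```

Calling fetch_anndata(query) would then return an AnnData object whose X holds the requested layer. The filter strings here are illustrative only.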
Cell Metadata Fields (obs)
Required/Core Fields
Identity & Dataset:
- soma_joinid: Unique integer identifier for joins
- dataset_id: Source dataset identifier
- is_primary_data: Boolean flag (True = unique cell, False = duplicate across datasets)
Cell Type:
- cell_type: Human-readable cell type name
- cell_type_ontology_term_id: Standardized ontology term (e.g., "CL:0000236")
Tissue:
- tissue: Specific tissue name
- tissue_general: Broader tissue category (useful for grouping)
- tissue_ontology_term_id: Standardized ontology term
Assay:
- assay: Sequencing technology used
- assay_ontology_term_id: Standardized ontology term
Disease:
- disease: Disease status or condition
- disease_ontology_term_id: Standardized ontology term
Donor:
- donor_id: Unique donor identifier
- sex: Biological sex (male, female, unknown)
- self_reported_ethnicity: Ethnicity information
- development_stage: Life stage (adult, child, embryonic, etc.)
- development_stage_ontology_term_id: Standardized ontology term
Organism:
- organism: Scientific name (Homo sapiens, Mus musculus)
- organism_ontology_term_id: Standardized ontology term
Technical:
- suspension_type: Sample preparation type (cell, nucleus, na)
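Reading only the obs columns that an analysis needs keeps queries small. The sketch below checks requested column names against the fields listed in this reference before reading; check_columns and read_obs are hypothetical helpers, `census` is assumed to be an open handle from cellxgene_census.open_soma(), and a given release may carry columns beyond those documented here.

```python
# Field names copied from the obs listing in this reference; a given Census
# release may expose additional columns (assumption: this set is not exhaustive).
OBS_FIELDS = {
    "soma_joinid", "dataset_id", "is_primary_data",
    "cell_type", "cell_type_ontology_term_id",
    "tissue", "tissue_general", "tissue_ontology_term_id",
    "assay", "assay_ontology_term_id",
    "disease", "disease_ontology_term_id",
    "donor_id", "sex", "self_reported_ethnicity",
    "development_stage", "development_stage_ontology_term_id",
    "organism", "organism_ontology_term_id",
    "suspension_type",
}


def check_columns(columns):
    """Return the columns unchanged, or raise if any name is not documented."""
    unknown = sorted(set(columns) - OBS_FIELDS)
    if unknown:
        raise ValueError(f"Not in the documented obs fields: {unknown}")
    return list(columns)


def read_obs(census, organism, value_filter, columns):
    """Read selected obs columns for cells matching value_filter."""
    obs = census["census_data"][organism].obs
    table = obs.read(
        value_filter=value_filter,
        column_names=check_columns(columns),
    ).concat()
    return table.to_pandas()
```

For example, read_obs(census, "homo_sapiens", "tissue_general == 'lung'", ["cell_type", "sex"]) would return a two-column DataFrame of lung cells.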
Gene Metadata Fields (var)
Access via:
census["census_data"]["homo_sapiens"].ms["RNA"].var
Available Fields:
- soma_joinid: Unique integer identifier for joins
- feature_id: Ensembl gene ID (e.g., "ENSG00000161798")
- feature_name: Gene symbol (e.g., "FOXP2")
- feature_length: Gene length in base pairs
Value Filter Syntax
Queries use Python-like expressions for filtering. The syntax is processed by TileDB-SOMA.
Comparison Operators
- ==: Equal to
- !=: Not equal to
- <, >, <=, >=: Numeric comparisons
- in: Membership test (e.g., feature_id in ['ENSG00000161798', 'ENSG00000188229'])
Logical Operators
- and, &: Logical AND
- or, |: Logical OR
Examples
Single condition:
value_filter="cell_type == 'B cell'"
Multiple conditions with AND:
value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True"
Using IN for multiple values:
value_filter="tissue in ['lung', 'liver', 'kidney']"
Complex condition:
value_filter="(cell_type == 'neuron' or cell_type == 'astrocyte') and disease != 'normal'"
Filtering genes:
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']"
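Filter strings like the ones above are often assembled from user input. The following sketch builds an AND-combined filter from a dictionary, using == for scalars and in for lists; build_value_filter is a hypothetical helper written for this reference, not part of TileDB-SOMA or cellxgene_census.

```python
def format_literal(value):
    """Render a Python value the way it should appear inside a filter string."""
    if isinstance(value, (list, tuple)):
        return "[" + ", ".join(format_literal(v) for v in value) + "]"
    # repr() gives quoted strings, bare numbers, and True/False.
    return repr(value)


def build_value_filter(conditions):
    """Join column/value pairs with 'and'; lists become membership tests."""
    clauses = []
    for column, value in conditions.items():
        operator = "in" if isinstance(value, (list, tuple)) else "=="
        clauses.append(f"{column} {operator} {format_literal(value)}")
    return " and ".join(clauses)


obs_filter = build_value_filter({
    "cell_type": "B cell",
    "tissue_general": ["lung", "liver"],
    "is_primary_data": True,
})
print(obs_filter)
# cell_type == 'B cell' and tissue_general in ['lung', 'liver'] and is_primary_data == True
```

The resulting string can be passed wherever a value_filter is accepted. OR-combinations and != are left out of this sketch to keep it short.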
Data Inclusion Criteria
The Census includes all data from CZ CELLxGENE Discover meeting:
- Species: Human (Homo sapiens) or mouse (Mus musculus)
- Technology: Approved sequencing technologies for RNA
- Count Type: Raw counts only (no processed/normalized-only data)
- Metadata: Standardized following CELLxGENE schema
- Both spatial and non-spatial data: Includes traditional and spatial transcriptomics
Important Data Characteristics
Duplicate Cells
Cells may appear across multiple datasets. Use is_primary_data == True to filter for unique cells in most analyses.
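The effect of the flag is easy to see on a toy example. The rows below are made up for illustration (they are not Census data): the same B cell was submitted in two datasets, so a naive count reports it twice.

```python
from collections import Counter

# Toy obs rows with made-up values: the B cell in ds_B duplicates the one in ds_A.
rows = [
    {"dataset_id": "ds_A", "cell_type": "B cell", "is_primary_data": True},
    {"dataset_id": "ds_B", "cell_type": "B cell", "is_primary_data": False},
    {"dataset_id": "ds_B", "cell_type": "T cell", "is_primary_data": True},
]


def count_cell_types(rows, primary_only=True):
    """Count cells per cell type, optionally keeping only primary records."""
    kept = (r for r in rows if r["is_primary_data"] or not primary_only)
    return Counter(r["cell_type"] for r in kept)


all_counts = count_cell_types(rows, primary_only=False)
unique_counts = count_cell_types(rows)
print(all_counts["B cell"], unique_counts["B cell"])  # 2 1
```

In a real query the same effect is achieved by adding is_primary_data == True to the value filter rather than filtering after the fact.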
Count Types
The Census includes:
- Molecule counts: From UMI-based methods
- Full-gene sequencing read counts: From non-UMI methods
The two count types may need different normalization approaches.
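One common way to handle the difference (an illustration, not a method prescribed by the Census) is to scale UMI counts by library size only, while also dividing full-gene read counts by gene length, since longer genes attract more reads. The sketch below uses made-up counts and lengths; in practice the lengths would come from the feature_length column of var.

```python
def counts_per_million(counts):
    """Library-size scaling, a typical choice for UMI-based molecule counts."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c / total * 1e6 for c in counts]


def transcripts_per_million(counts, lengths):
    """Length-aware scaling, a typical choice for full-gene read counts."""
    rates = [c / length for c, length in zip(counts, lengths)]
    total = sum(rates)
    if total == 0:
        return [0.0 for _ in counts]
    return [r / total * 1e6 for r in rates]


# Made-up example: two genes with equal read counts but different lengths.
counts = [100, 100]
lengths = [1000, 4000]  # base pairs, as in var's feature_length
cpm = counts_per_million(counts)
tpm = transcripts_per_million(counts, lengths)
print(cpm, tpm)
```

Both genes get the same CPM, while TPM assigns the shorter gene a four-fold higher value, which is why mixing the two count types without length correction can distort comparisons.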
Versioning
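Census releases are identified by build date (e.g., "2023-07-25"). The alias "stable" points to the current long-term-supported release and therefore changes over time. Always specify a dated version for reproducible analysis: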
census = cellxgene_census.open_soma(census_version="2023-07-25")
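A small guard can catch moving aliases before they reach a pipeline. The sketch below treats "stable" and "latest" as aliases (the alias set is an assumption based on the cellxgene_census documentation) and anything that parses as an ISO date as a pinned build; is_pinned_version and open_pinned are hypothetical helpers, and the package import is deferred so the check runs without it.

```python
from datetime import date

# Assumption: these are the moving aliases accepted by open_soma().
ALIASES = {"stable", "latest"}


def is_pinned_version(tag):
    """True for a dated build such as '2023-07-25', False for a moving alias."""
    if tag in ALIASES:
        return False
    try:
        date.fromisoformat(tag)
    except ValueError:
        raise ValueError(f"Unrecognized Census version tag: {tag!r}") from None
    return True


def open_pinned(tag):
    """Open the Census only if the tag names a dated build."""
    if not is_pinned_version(tag):
        raise ValueError(f"{tag!r} is an alias; pin a dated build instead")
    import cellxgene_census  # deferred: only required when actually opening

    return cellxgene_census.open_soma(census_version=tag)


print(is_pinned_version("2023-07-25"), is_pinned_version("stable"))  # True False
```

The handle returned by open_soma can be used as a context manager, so `with open_pinned("2023-07-25") as census:` closes it automatically.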
Dataset Presence Matrix
Access which genes were measured in each dataset:
presence_matrix = census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]
This sparse boolean matrix helps understand:
- Gene coverage across datasets
- Which datasets to include for specific gene analyses
- Technical batch effects related to gene coverage
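The first two uses can be illustrated on a toy matrix. The data below is made up, with rows as datasets and columns as genes; in the real matrix the rows and columns are understood to be indexed by the soma_joinid of the dataset and of the gene, which should be verified against the release in use. Both function names are hypothetical.

```python
# Toy presence data (made up): one row per dataset, one column per gene.
datasets = ["ds_A", "ds_B", "ds_C"]
genes = ["CD4", "CD8A", "CD19"]
presence = [
    [True, True, True],    # ds_A measured all three genes
    [True, False, True],   # ds_B lacks CD8A
    [False, True, True],   # ds_C lacks CD4
]


def datasets_measuring_all(presence, datasets, genes, wanted):
    """Datasets in which every gene in `wanted` was measured."""
    columns = [genes.index(g) for g in wanted]
    return [
        name for name, row in zip(datasets, presence)
        if all(row[c] for c in columns)
    ]


def gene_coverage(presence, genes):
    """Fraction of datasets in which each gene was measured."""
    n_datasets = len(presence)
    return {
        gene: sum(row[i] for row in presence) / n_datasets
        for i, gene in enumerate(genes)
    }


usable = datasets_measuring_all(presence, datasets, genes, ["CD4", "CD19"])
coverage = gene_coverage(presence, genes)
print(usable)  # ['ds_A', 'ds_B']
```

On the real sparse matrix the same questions reduce to column slicing and column sums; cellxgene_census.get_presence_matrix returns it as a SciPy sparse matrix for that purpose.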
SOMA Object Types
Core TileDB-SOMA objects used:
- DataFrame: Tabular data (obs, var)
- SparseNDArray: Sparse matrices (X layers, presence matrix)
- DenseNDArray: Dense arrays (less common)
- Collection: Container for related objects
- Experiment: Top-level container for measurements
- SOMAScene: Spatial transcriptomics scenes
- obs_spatial_presence: Spatial data availability