# CZ CELLxGENE Census Data Schema Reference
## Overview
The CZ CELLxGENE Census is a versioned collection of single-cell data built on the TileDB-SOMA framework. This reference documents the data structure, available metadata fields, and query syntax.
## High-Level Structure
The Census is organized as a `SOMACollection` with two main components:
### 1. census_info
Summary information including:
- **summary**: Build date, cell counts, dataset statistics
- **datasets**: All datasets from CELLxGENE Discover with metadata
- **summary_cell_counts**: Cell counts stratified by metadata categories
### 2. census_data
Organism-specific `SOMAExperiment` objects:
- **"homo_sapiens"**: Human single-cell data
- **"mus_musculus"**: Mouse single-cell data
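As a rough illustration of the layout above, the sketch below lists the children of the two top-level components. The helper name `census_layout` is hypothetical. It assumes only that an opened Census behaves like a nested mapping with `keys()`, so a plain dict stands in for a real handle here.

```python
def census_layout(census):
    """Map each top-level Census component to the sorted names of its children."""
    return {
        name: sorted(census[name].keys())
        for name in ("census_info", "census_data")
    }


# Toy stand-in mirroring the structure described above. With the real package,
# the handle would instead come from cellxgene_census.open_soma().
toy_census = {
    "census_info": {"summary": None, "datasets": None, "summary_cell_counts": None},
    "census_data": {"homo_sapiens": None, "mus_musculus": None},
}

layout = census_layout(toy_census)
print(layout["census_data"])
```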
## Data Structure Per Organism
Each organism experiment contains:
### obs (Cell Metadata)
Cell-level annotations stored as a `SOMADataFrame`. Access via:
```python
census["census_data"]["homo_sapiens"].obs
```
### ms["RNA"] (Measurement)
RNA measurement data including:
- **X**: Data matrices with layers:
  - `raw`: Raw count data
  - `normalized`: Normalized counts (if available)
- **var**: Gene metadata
- **feature_dataset_presence_matrix**: Sparse boolean array showing which genes were measured in each dataset
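To show how these pieces fit together, here is a hedged sketch that collects the arguments for `cellxgene_census.get_anndata`, which reads one `X` layer together with the matching `obs` and `var` rows. `anndata_query` is a hypothetical helper, and the layer check covers only the two layers named in this reference. The keyword names follow the `get_anndata` signature at the time of writing, so verify them against the installed version.

```python
def anndata_query(obs_filter, var_filter=None, layer="raw", organism="Homo sapiens"):
    """Collect keyword arguments for cellxgene_census.get_anndata (hypothetical helper)."""
    # Only the layers named in this reference are accepted here.
    if layer not in ("raw", "normalized"):
        raise ValueError(f"unknown X layer: {layer!r}")
    return {
        "organism": organism,
        "measurement_name": "RNA",      # the ms["RNA"] measurement
        "X_name": layer,                # which X layer to read
        "obs_value_filter": obs_filter,
        "var_value_filter": var_filter,
    }


kwargs = anndata_query("cell_type == 'B cell' and is_primary_data == True")

# With the package installed and a Census opened as `census`:
#   adata = cellxgene_census.get_anndata(census, **kwargs)
```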
## Cell Metadata Fields (obs)
### Required/Core Fields
**Identity & Dataset:**
- `soma_joinid`: Unique integer identifier for joins
- `dataset_id`: Source dataset identifier
- `is_primary_data`: Boolean flag (True = unique cell, False = duplicate across datasets)

**Cell Type:**
- `cell_type`: Human-readable cell type name
- `cell_type_ontology_term_id`: Standardized ontology term (e.g., "CL:0000236")

**Tissue:**
- `tissue`: Specific tissue name
- `tissue_general`: Broader tissue category (useful for grouping)
- `tissue_ontology_term_id`: Standardized ontology term

**Assay:**
- `assay`: Sequencing technology used
- `assay_ontology_term_id`: Standardized ontology term

**Disease:**
- `disease`: Disease status or condition
- `disease_ontology_term_id`: Standardized ontology term

**Donor:**
- `donor_id`: Unique donor identifier
- `sex`: Biological sex (male, female, unknown)
- `self_reported_ethnicity`: Ethnicity information
- `development_stage`: Life stage (adult, child, embryonic, etc.)
- `development_stage_ontology_term_id`: Standardized ontology term

**Organism:**
- `organism`: Scientific name (Homo sapiens, Mus musculus)
- `organism_ontology_term_id`: Standardized ontology term

**Technical:**
- `suspension_type`: Sample preparation type (cell, nucleus, na)
## Gene Metadata Fields (var)
Access via:
```python
census["census_data"]["homo_sapiens"].ms["RNA"].var
```
**Available Fields:**
- `soma_joinid`: Unique integer identifier for joins
- `feature_id`: Ensembl gene ID (e.g., "ENSG00000161798")
- `feature_name`: Gene symbol (e.g., "FOXP2")
- `feature_length`: Gene length in base pairs
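A common use of these fields is translating gene symbols into Ensembl IDs. The sketch below assumes the `var` rows have already been fetched into plain dicts (for example from a pandas DataFrame via `to_dict("records")`). `symbols_to_ids` is a hypothetical helper, and the IDs and lengths in the toy rows are placeholders, not real annotations.

```python
from collections import defaultdict


def symbols_to_ids(var_rows):
    """Group Ensembl IDs by gene symbol; a symbol may map to more than one ID."""
    mapping = defaultdict(list)
    for row in var_rows:
        mapping[row["feature_name"]].append(row["feature_id"])
    return dict(mapping)


# Placeholder rows: every value below is made up for illustration only.
toy_var = [
    {"soma_joinid": 0, "feature_id": "ENSG_PLACEHOLDER_A", "feature_name": "GENE1", "feature_length": 1200},
    {"soma_joinid": 1, "feature_id": "ENSG_PLACEHOLDER_B", "feature_name": "GENE2", "feature_length": 800},
    {"soma_joinid": 2, "feature_id": "ENSG_PLACEHOLDER_C", "feature_name": "GENE1", "feature_length": 950},
]

lookup = symbols_to_ids(toy_var)
```

Returning a list per symbol avoids silently dropping an ID when two rows share the same `feature_name`.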
## Value Filter Syntax
Queries use Python-like expressions for filtering. The syntax is processed by TileDB-SOMA.
### Comparison Operators
- `==`: Equal to
- `!=`: Not equal to
- `<`, `>`, `<=`, `>=`: Numeric comparisons
- `in`: Membership test (e.g., `feature_id in ['ENSG00000161798', 'ENSG00000188229']`)
### Logical Operators
- `and`, `&`: Logical AND
- `or`, `|`: Logical OR
### Examples
**Single condition:**
```python
value_filter="cell_type == 'B cell'"
```
**Multiple conditions with AND:**
```python
value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True"
```
**Using IN for multiple values:**
```python
value_filter="tissue in ['lung', 'liver', 'kidney']"
```
**Complex condition:**
```python
value_filter="(cell_type == 'neuron' or cell_type == 'astrocyte') and disease != 'normal'"
```
**Filtering genes:**
```python
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']"
```
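Since a filter is just a string, it can be assembled from Python values instead of being typed by hand. The helpers below (`eq`, `isin`, `all_of`) are hypothetical. They rely on `repr()` producing literals that the filter parser accepts, which matches the strings and booleans used in the examples above but is an assumption in general.

```python
def eq(column, value):
    """Build an equality condition such as: cell_type == 'B cell'."""
    return f"{column} == {value!r}"


def isin(column, values):
    """Build a membership condition such as: tissue in ['lung', 'liver']."""
    return f"{column} in {list(values)!r}"


def all_of(*conditions):
    """Join conditions with `and`, parenthesizing each one to keep precedence explicit."""
    return " and ".join(f"({condition})" for condition in conditions)


obs_filter = all_of(
    eq("cell_type", "B cell"),
    isin("tissue_general", ["lung", "liver"]),
    eq("is_primary_data", True),
)
print(obs_filter)
```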
## Data Inclusion Criteria
The Census includes all data from CZ CELLxGENE Discover that meets the following criteria:
1. **Species**: Human (*Homo sapiens*) or mouse (*Mus musculus*)
2. **Technology**: Approved sequencing technologies for RNA
3. **Count Type**: Raw counts only (no processed/normalized-only data)
4. **Metadata**: Standardized following CELLxGENE schema
5. **Modality**: Both spatial and non-spatial (traditional) transcriptomics data are included
## Important Data Characteristics
### Duplicate Cells
Cells may appear across multiple datasets. Use `is_primary_data == True` to filter for unique cells in most analyses.
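The effect of the flag can be seen on a toy example. The sketch assumes `obs` rows held as plain dicts, and `cells_per_group` is a hypothetical helper. With pandas, the same tally would typically be a filter followed by `value_counts()`.

```python
from collections import Counter


def cells_per_group(obs_rows, key="tissue_general", primary_only=True):
    """Count cells per value of `key`, optionally skipping non-primary (duplicate) cells."""
    return Counter(
        row[key]
        for row in obs_rows
        if row["is_primary_data"] or not primary_only
    )


# Toy rows: the second lung cell is a duplicate contributed by another dataset.
toy_obs = [
    {"tissue_general": "lung", "is_primary_data": True},
    {"tissue_general": "lung", "is_primary_data": False},
    {"tissue_general": "liver", "is_primary_data": True},
]

unique_counts = cells_per_group(toy_obs)
all_counts = cells_per_group(toy_obs, primary_only=False)
```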
### Count Types
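The Census includes:
- **Molecule counts**: From UMI-based methods
- **Full-gene sequencing read counts**: From non-UMI methods

The two count types may need different normalization approaches: read counts from full-length protocols generally scale with gene length, whereas UMI counts do not.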
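The difference can be made concrete with a small sketch. It shows one plausible scheme, not necessarily what any particular pipeline does: UMI counts are scaled by the cell's total, while non-UMI read counts are first divided by gene length. `normalize_cell` and its arguments are hypothetical; `feature_lengths` would correspond to the `feature_length` column of `var`.

```python
def normalize_cell(counts, feature_lengths=None):
    """Scale one cell's counts so they sum to 1.

    If feature_lengths is given (non-UMI, full-length reads), each count is
    first divided by its gene length to reduce the length bias.
    """
    values = [float(count) for count in counts]
    if feature_lengths is not None:
        values = [value / length for value, length in zip(values, feature_lengths)]
    total = sum(values)
    if total == 0:
        return [0.0] * len(values)
    return [value / total for value in values]


umi_cell = normalize_cell([1, 3])
full_length_cell = normalize_cell([2, 2], feature_lengths=[1000, 2000])
```

In the second call both genes have the same raw count, but the shorter gene ends up with the larger normalized value because its reads are spread over fewer bases.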
### Versioning
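Census releases are identified by build date (e.g., "2023-07-25"); the alias "stable" points to the most recent long-term supported release. Always specify a dated version for reproducible analysis:
```python
import cellxgene_census
census = cellxgene_census.open_soma(census_version="2023-07-25")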
```
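A version string can be sanity-checked before it is used. The sketch below is a minimal illustration under stated assumptions: `check_census_version` and `open_pinned` are hypothetical helpers, and the check validates only the format (a `YYYY-MM-DD` date or one of the aliases "stable" and "latest"), not whether such a release actually exists.

```python
import datetime
import re

ALIASES = {"stable", "latest"}


def check_census_version(version):
    """Return `version` if it is an alias or a valid YYYY-MM-DD date, else raise ValueError."""
    if version in ALIASES:
        return version
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", version):
        raise ValueError(f"expected YYYY-MM-DD or an alias, got {version!r}")
    datetime.date.fromisoformat(version)  # raises ValueError for impossible dates
    return version


def open_pinned(version):
    """Open the Census at a checked version (requires the cellxgene_census package)."""
    import cellxgene_census  # imported lazily so the check above works without the package

    return cellxgene_census.open_soma(census_version=check_census_version(version))


# Usage, assuming the package is installed and the release exists:
#   with open_pinned("2023-07-25") as census:
#       ...
```

Opening the Census in a `with` block closes the handle once the analysis is done.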
## Dataset Presence Matrix
Look up which genes were measured in each dataset:
```python
presence_matrix = census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]
```
This sparse boolean matrix helps understand:
- Gene coverage across datasets
- Which datasets to include for specific gene analyses
- Technical batch effects related to gene coverage
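As an illustration of the second point, the sketch below selects the datasets in which every gene of interest was measured. It uses a nested list of booleans as a toy stand-in for the sparse matrix and assumes rows correspond to datasets and columns to genes. `datasets_measuring_all` is a hypothetical helper.

```python
def datasets_measuring_all(presence, gene_columns):
    """Return row indices (datasets) in which every requested gene column is True."""
    return [
        row_index
        for row_index, row in enumerate(presence)
        if all(row[column] for column in gene_columns)
    ]


# Toy matrix: rows are datasets, columns are genes.
toy_presence = [
    [True, True, False],   # dataset 0
    [True, False, True],   # dataset 1
    [True, True, True],    # dataset 2
]

complete = datasets_measuring_all(toy_presence, gene_columns=[0, 1])
print(complete)
```

Restricting an analysis to such datasets avoids treating an unmeasured gene as a true zero.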
## SOMA Object Types
Core TileDB-SOMA objects used:
- **DataFrame**: Tabular data (obs, var)
- **SparseNDArray**: Sparse matrices (X layers, presence matrix)
- **DenseNDArray**: Dense arrays (less common)
- **Collection**: Container for related objects
- **Experiment**: Top-level container for measurements
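- **Scene**: Spatial transcriptomics scenes (spatial data only)
- **obs_spatial_presence**: Not an object type of its own, but a DataFrame in spatial experiments that records which cells have spatial data available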