skills/cellxgene-census/references/census_schema.md
# CZ CELLxGENE Census Data Schema Reference

## Overview

The CZ CELLxGENE Census is a versioned collection of single-cell data built on the TileDB-SOMA framework. This reference documents the data structure, available metadata fields, and query syntax.

## High-Level Structure

The Census is organized as a `SOMACollection` with two main components:

### 1. census_info
Summary information including:
- **summary**: Build date, cell counts, dataset statistics
- **datasets**: All datasets from CELLxGENE Discover with metadata
- **summary_cell_counts**: Cell counts stratified by metadata categories

### 2. census_data
Organism-specific `SOMAExperiment` objects:
- **"homo_sapiens"**: Human single-cell data
- **"mus_musculus"**: Mouse single-cell data
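
As a minimal sketch of navigating this layout, the helper below reads one of the three `census_info` tables into pandas. The names `read_census_info` and `CENSUS_INFO_TABLES` are illustrative and not part of any library; the `read().concat().to_pandas()` chain assumes the TileDB-SOMA Python DataFrame API, and opening the Census requires the `cellxgene_census` package and network access.

```python
# Table names under census["census_info"], as listed above.
CENSUS_INFO_TABLES = ("summary", "datasets", "summary_cell_counts")


def read_census_info(census, table="summary"):
    """Return one census_info table as a pandas DataFrame (hypothetical helper)."""
    if table not in CENSUS_INFO_TABLES:
        raise ValueError(f"unknown census_info table: {table!r}")
    # SOMA DataFrames are read as Arrow tables, then converted to pandas.
    return census["census_info"][table].read().concat().to_pandas()


# Usage (requires the cellxgene_census package and network access):
#
#   import cellxgene_census
#   with cellxgene_census.open_soma() as census:
#       print(read_census_info(census, "summary"))
#       human = census["census_data"]["homo_sapiens"]
```

The table name is validated before the Census is touched, so a typo fails fast rather than after a remote lookup.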

## Data Structure Per Organism

Each organism experiment contains:

### obs (Cell Metadata)
Cell-level annotations stored as a `SOMADataFrame`. Access via:
```python
census["census_data"]["homo_sapiens"].obs
```

### ms["RNA"] (Measurement)
RNA measurement data including:
- **X**: Data matrices with layers:
  - `raw`: Raw count data
  - `normalized`: (if available) Normalized counts
- **var**: Gene metadata
- **feature_dataset_presence_matrix**: Sparse boolean array showing which genes were measured in each dataset
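
The pieces above (obs, var, and an X layer) are typically sliced together. The following is a sketch only, assuming the `tiledbsoma` Python package and its `Experiment.axis_query` / `AxisQuery` interface; `fetch_anndata` and `X_LAYERS` are hypothetical names introduced here for illustration.

```python
X_LAYERS = ("raw", "normalized")  # layer names listed above


def fetch_anndata(experiment, obs_filter, var_filter, layer="raw"):
    """Slice one X layer into an AnnData object (illustrative helper)."""
    if layer not in X_LAYERS:
        raise ValueError(f"unknown X layer: {layer!r}")
    import tiledbsoma as soma  # deferred: only needed when actually querying

    # The query selects cells (obs) and genes (var) before matrix data is read.
    with experiment.axis_query(
        measurement_name="RNA",
        obs_query=soma.AxisQuery(value_filter=obs_filter),
        var_query=soma.AxisQuery(value_filter=var_filter),
    ) as query:
        return query.to_anndata(X_name=layer)


# Usage, given an open census handle:
#
#   experiment = census["census_data"]["homo_sapiens"]
#   adata = fetch_anndata(
#       experiment,
#       "cell_type == 'B cell' and tissue_general == 'lung'",
#       "feature_name in ['CD19']",
#   )
```

Since `normalized` is only present in some releases, a caller may want to fall back to `raw` when the requested layer is missing.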

## Cell Metadata Fields (obs)

### Required/Core Fields

**Identity & Dataset:**
- `soma_joinid`: Unique integer identifier for joins
- `dataset_id`: Source dataset identifier
- `is_primary_data`: Boolean flag (True = unique cell, False = duplicate across datasets)

**Cell Type:**
- `cell_type`: Human-readable cell type name
- `cell_type_ontology_term_id`: Standardized ontology term (e.g., "CL:0000236")

**Tissue:**
- `tissue`: Specific tissue name
- `tissue_general`: Broader tissue category (useful for grouping)
- `tissue_ontology_term_id`: Standardized ontology term

**Assay:**
- `assay`: Sequencing technology used
- `assay_ontology_term_id`: Standardized ontology term

**Disease:**
- `disease`: Disease status or condition
- `disease_ontology_term_id`: Standardized ontology term

**Donor:**
- `donor_id`: Unique donor identifier
- `sex`: Biological sex (male, female, unknown)
- `self_reported_ethnicity`: Ethnicity information
- `development_stage`: Life stage (adult, child, embryonic, etc.)
- `development_stage_ontology_term_id`: Standardized ontology term

**Organism:**
- `organism`: Scientific name (Homo sapiens, Mus musculus)
- `organism_ontology_term_id`: Standardized ontology term

**Technical:**
- `suspension_type`: Sample preparation type (cell, nucleus, na)
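
The obs table is large, so it usually pays to request only the columns needed. A minimal sketch follows, assuming the TileDB-SOMA DataFrame `read(value_filter=..., column_names=...)` interface; `read_obs` is a hypothetical helper, and `organism_key` is one of the `census_data` keys named earlier.

```python
def read_obs(census, organism_key, value_filter, column_names):
    """Read selected obs columns for cells matching value_filter (illustrative)."""
    obs = census["census_data"][organism_key].obs
    # Requesting only the needed columns keeps the transfer small.
    table = obs.read(value_filter=value_filter, column_names=column_names).concat()
    return table.to_pandas()


# Usage, given an open census handle:
#
#   df = read_obs(
#       census,
#       "homo_sapiens",
#       "tissue_general == 'lung' and is_primary_data == True",
#       ["cell_type", "assay", "disease", "donor_id"],
#   )
#   print(df["cell_type"].value_counts())
```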

## Gene Metadata Fields (var)

Access via:
```python
census["census_data"]["homo_sapiens"].ms["RNA"].var
```

**Available Fields:**
- `soma_joinid`: Unique integer identifier for joins
- `feature_id`: Ensembl gene ID (e.g., "ENSG00000161798")
- `feature_name`: Gene symbol (e.g., "FOXP2")
- `feature_length`: Gene length in base pairs
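
Genes can be looked up by symbol, by Ensembl ID, or both. The sketch below builds the filter string and reads the matching var rows; `gene_filter` and `read_var` are hypothetical helpers, and the `read(...).concat().to_pandas()` chain again assumes the TileDB-SOMA DataFrame API.

```python
def gene_filter(symbols=(), ensembl_ids=()):
    """Build a var value filter from gene symbols and/or Ensembl IDs."""
    clauses = []
    if symbols:
        clauses.append(f"feature_name in {list(symbols)!r}")
    if ensembl_ids:
        clauses.append(f"feature_id in {list(ensembl_ids)!r}")
    if not clauses:
        raise ValueError("give at least one symbol or Ensembl ID")
    return " or ".join(clauses)


def read_var(census, organism_key, value_filter):
    """Return the var rows matching value_filter as a pandas DataFrame."""
    var = census["census_data"][organism_key].ms["RNA"].var
    return var.read(value_filter=value_filter).concat().to_pandas()


# Usage, given an open census handle:
#
#   genes = read_var(census, "homo_sapiens", gene_filter(symbols=["CD4", "CD8A"]))
#   print(genes[["soma_joinid", "feature_id", "feature_name", "feature_length"]])
```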

## Value Filter Syntax

Value filters are Python-like boolean expressions passed as strings; TileDB-SOMA parses and evaluates them.

### Comparison Operators
- `==`: Equal to
- `!=`: Not equal to
- `<`, `>`, `<=`, `>=`: Numeric comparisons
- `in`: Membership test (e.g., `feature_id in ['ENSG00000161798', 'ENSG00000188229']`)

### Logical Operators
- `and`, `&`: Logical AND
- `or`, `|`: Logical OR

### Examples

**Single condition:**
```python
value_filter="cell_type == 'B cell'"
```

**Multiple conditions with AND:**
```python
value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True"
```

**Using IN for multiple values:**
```python
value_filter="tissue in ['lung', 'liver', 'kidney']"
```

**Complex condition:**
```python
value_filter="(cell_type == 'neuron' or cell_type == 'astrocyte') and disease != 'normal'"
```

**Filtering genes:**
```python
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']"
```
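
Filter strings like those above are easy to mistype when assembled by hand. One possible approach, sketched here with a hypothetical `build_filter` helper, is to generate them from a dictionary. The `load` function assumes `cellxgene_census.get_anndata` with its `obs_value_filter` and `var_value_filter` parameters.

```python
def build_filter(conditions):
    """Join {column: value} pairs with `and`.

    A list or tuple becomes an `in` test; any other value becomes `==`.
    """
    clauses = []
    for column, value in conditions.items():
        if isinstance(value, (list, tuple)):
            clauses.append(f"{column} in {list(value)!r}")
        else:
            clauses.append(f"{column} == {value!r}")
    return " and ".join(clauses)


def load(census, obs_conditions, gene_symbols):
    """Fetch matching cells and genes as AnnData (illustrative wrapper)."""
    import cellxgene_census  # deferred so build_filter works without it

    return cellxgene_census.get_anndata(
        census,
        organism="Homo sapiens",
        obs_value_filter=build_filter(obs_conditions),
        var_value_filter=build_filter({"feature_name": list(gene_symbols)}),
    )


print(build_filter({"cell_type": "B cell", "tissue_general": ["lung", "liver"]}))
```

This helper only covers equality and membership joined by `and`; conditions using `or`, `!=`, or numeric comparisons still need to be written out as strings.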

## Data Inclusion Criteria

The Census includes all data from CZ CELLxGENE Discover that meets the following criteria:

1. **Species**: Human (*Homo sapiens*) or mouse (*Mus musculus*)
2. **Technology**: Approved sequencing technologies for RNA
3. **Count Type**: Raw counts only (no processed/normalized-only data)
4. **Metadata**: Standardized following CELLxGENE schema
5. **Both spatial and non-spatial data**: Includes traditional and spatial transcriptomics

## Important Data Characteristics

### Duplicate Cells
The same cell can appear in more than one dataset. Filter on `is_primary_data == True` so that each cell is counted once; most analyses need this.
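
When the primary-data condition is appended to an existing filter, operator precedence matters: without parentheses, an `or` in the original expression would bind more loosely than the added `and`. A small sketch, using a hypothetical `primary_only` helper:

```python
PRIMARY = "is_primary_data == True"


def primary_only(value_filter=None):
    """Restrict an obs value filter to primary (non-duplicate) cells."""
    if not value_filter:
        return PRIMARY
    # Parentheses keep an `or` in the original filter from absorbing the new clause.
    return f"({value_filter}) and {PRIMARY}"


print(primary_only("cell_type == 'neuron' or cell_type == 'astrocyte'"))
```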

### Count Types
The Census includes:
- **Molecule counts**: From UMI-based methods
- **Full-gene sequencing read counts**: From non-UMI methods

The two count types may need different normalization approaches.
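
To illustrate why the two count types can call for different handling, the sketch below contrasts plain library-size scaling with a variant that first divides read counts by gene length. This is one common convention, shown on plain Python lists for a single cell; it is an assumption for illustration and not necessarily how any Census layer is computed. Both function names are hypothetical, and the lengths would come from the `feature_length` field in var.

```python
def library_size_normalize(counts):
    """Scale one cell's counts so they sum to 1 (a common choice for UMI counts)."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [c / total for c in counts]


def length_then_library_normalize(read_counts, feature_lengths):
    """Divide read counts by gene length first, then scale to sum to 1.

    Full-gene read counts grow with gene length, so long genes would
    otherwise look more highly expressed than short ones.
    """
    per_base = [c / length for c, length in zip(read_counts, feature_lengths)]
    return library_size_normalize(per_base)


# Two genes with equal per-base coverage but different lengths.
print(library_size_normalize([10, 20]))
print(length_then_library_normalize([10, 20], [1000, 2000]))
```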

### Versioning
Census releases are identified by build date (e.g., "2023-07-25"). Aliases such as "stable" point to whichever release currently holds that role, so they can change over time. Always specify a dated version for reproducible analysis:
```python
census = cellxgene_census.open_soma(census_version="2023-07-25")
```
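
A script can guard against accidentally using a moving alias by checking the version tag before opening the Census. This is a minimal sketch: `is_dated_release` and `open_pinned` are hypothetical helpers, and the check only confirms that the tag is a valid `YYYY-MM-DD` date, not that such a release exists.

```python
import re
from datetime import date


def is_dated_release(census_version):
    """True if the tag is a valid YYYY-MM-DD date rather than an alias."""
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", census_version):
        return False
    try:
        date.fromisoformat(census_version)
    except ValueError:
        return False
    return True


def open_pinned(census_version):
    """Open the Census only when given a dated release tag."""
    if not is_dated_release(census_version):
        raise ValueError(f"not a dated release: {census_version!r}")
    import cellxgene_census  # deferred so the check above works without it

    return cellxgene_census.open_soma(census_version=census_version)


# Usage (requires the cellxgene_census package and network access):
#
#   with open_pinned("2023-07-25") as census:
#       ...
```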

## Dataset Presence Matrix

Access which genes were measured in each dataset:
```python
presence_matrix = census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]
```

This sparse boolean matrix helps understand:
- Gene coverage across datasets
- Which datasets to include for specific gene analyses
- Technical batch effects related to gene coverage
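
As a sketch of the second use, the function below finds the datasets that measured every gene in a panel, working from the coordinates of the matrix's nonzero entries. It assumes rows index datasets and columns index genes by `soma_joinid`, and that `cellxgene_census.get_presence_matrix` returns a SciPy sparse matrix; `datasets_measuring_all` and `presence_coordinates` are hypothetical names.

```python
def datasets_measuring_all(rows, cols, wanted_genes):
    """Return dataset joinids whose presence rows cover every wanted gene.

    rows/cols are the coordinates of the nonzero presence entries:
    rows index datasets, cols index genes (var soma_joinid).
    """
    wanted = set(wanted_genes)
    seen = {}
    for dataset, gene in zip(rows, cols):
        if gene in wanted:
            seen.setdefault(dataset, set()).add(gene)
    return sorted(d for d, genes in seen.items() if genes == wanted)


def presence_coordinates(census):
    """Fetch the human presence matrix and return its nonzero coordinates."""
    import cellxgene_census  # deferred: needs the package and network access

    presence = cellxgene_census.get_presence_matrix(
        census, organism="Homo sapiens", measurement_name="RNA"
    )
    coo = presence.tocoo()
    return coo.row.tolist(), coo.col.tolist()


# Toy coordinates: datasets 0 and 2 measured genes 5 and 7; dataset 1 only gene 5.
print(datasets_measuring_all([0, 0, 1, 2, 2], [5, 7, 5, 5, 7], [5, 7]))
```

The returned dataset joinids can be mapped back to `dataset_id` values through the `datasets` table in `census_info`.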

## SOMA Object Types

Core TileDB-SOMA objects used:
- **DataFrame**: Tabular data (obs, var)
- **SparseNDArray**: Sparse matrices (X layers, presence matrix)
- **DenseNDArray**: Dense arrays (less common)
- **Collection**: Container for related objects
- **Experiment**: Top-level container for measurements
- **SOMAScene**: Spatial transcriptomics scenes
- **obs_spatial_presence**: Spatial data availability (a data frame on the experiment rather than a separate object type)