Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/cellxgene-census/references/census_schema.md
+++ b/skills/cellxgene-census/references/census_schema.md
@@ -0,0 +1,182 @@
+# CZ CELLxGENE Census Data Schema Reference
+
+## Overview
+
+The CZ CELLxGENE Census is a versioned collection of single-cell data built on the TileDB-SOMA framework. This reference documents the data structure, available metadata fields, and query syntax.
+
+## High-Level Structure
+
+The Census is organized as a `SOMACollection` with two main components:
+
+### 1. census_info
+Summary information including:
+- **summary**: Build date, cell counts, dataset statistics
+- **datasets**: All datasets from CELLxGENE Discover with metadata
+- **summary_cell_counts**: Cell counts stratified by metadata categories
+
+### 2. census_data
+Organism-specific `SOMAExperiment` objects:
+- **"homo_sapiens"**: Human single-cell data
+- **"mus_musculus"**: Mouse single-cell data
+
+## Data Structure Per Organism
+
+Each organism experiment contains:
+
+### obs (Cell Metadata)
+Cell-level annotations stored as a `SOMADataFrame`. Access via:
+```python
+census["census_data"]["homo_sapiens"].obs
+```
+
+### ms["RNA"] (Measurement)
+RNA measurement data including:
+- **X**: Data matrices with layers:
+  - `raw`: Raw count data
+  - `normalized`: (if available) Normalized counts
+- **var**: Gene metadata
+- **feature_dataset_presence_matrix**: Sparse boolean array showing which genes were measured in each dataset
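+
+The layout above can be walked with plain indexing. The following is a minimal sketch rather than part of the Census API: `rna_components` is a hypothetical helper name, and it assumes `census` is an already opened Census handle.
+
+```python
+def rna_components(census, organism="homo_sapiens"):
+    """Collect the main SOMA objects for one organism (hypothetical helper)."""
+    experiment = census["census_data"][organism]
+    rna = experiment.ms["RNA"]
+    return {
+        "obs": experiment.obs,   # cell metadata
+        "var": rna.var,          # gene metadata
+        "raw": rna.X["raw"],     # raw count layer
+        "presence": rna["feature_dataset_presence_matrix"],
+    }
+```
+
+With a handle from `cellxgene_census.open_soma()`, `rna_components(census)["obs"]` refers to the same object as `census["census_data"]["homo_sapiens"].obs`.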
+
+## Cell Metadata Fields (obs)
+
+### Required/Core Fields
+
+**Identity & Dataset:**
+- `soma_joinid`: Unique integer identifier for joins
+- `dataset_id`: Source dataset identifier
+- `is_primary_data`: Boolean flag (True = unique cell, False = duplicate across datasets)
+
+**Cell Type:**
+- `cell_type`: Human-readable cell type name
+- `cell_type_ontology_term_id`: Standardized ontology term (e.g., "CL:0000236")
+
+**Tissue:**
+- `tissue`: Specific tissue name
+- `tissue_general`: Broader tissue category (useful for grouping)
+- `tissue_ontology_term_id`: Standardized ontology term
+
+**Assay:**
+- `assay`: Sequencing technology used
+- `assay_ontology_term_id`: Standardized ontology term
+
+**Disease:**
+- `disease`: Disease status or condition
+- `disease_ontology_term_id`: Standardized ontology term
+
+**Donor:**
+- `donor_id`: Unique donor identifier
+- `sex`: Biological sex (male, female, unknown)
+- `self_reported_ethnicity`: Ethnicity information
+- `development_stage`: Life stage (adult, child, embryonic, etc.)
+- `development_stage_ontology_term_id`: Standardized ontology term
+
+**Organism:**
+- `organism`: Scientific name (Homo sapiens, Mus musculus)
+- `organism_ontology_term_id`: Standardized ontology term
+
+**Technical:**
+- `suspension_type`: Sample preparation type (cell, nucleus, na)
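+
+Reading only the columns an analysis needs keeps queries small. The sketch below assumes an open `census` handle; `read_obs` is a hypothetical wrapper around the `read()` call on the `obs` DataFrame, which yields Arrow tables that are then concatenated and converted to pandas.
+
+```python
+def read_obs(census, value_filter, column_names, organism="homo_sapiens"):
+    """Read selected obs columns for matching cells (hypothetical helper)."""
+    obs = census["census_data"][organism].obs
+    # read() returns an iterator of Arrow tables; concat() merges them into one.
+    tables = obs.read(value_filter=value_filter, column_names=column_names)
+    return tables.concat().to_pandas()
+```
+
+For example, `read_obs(census, "cell_type == 'B cell'", ["cell_type", "tissue_general", "donor_id"])` would return one row per matching cell with those three columns.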
+
+## Gene Metadata Fields (var)
+
+Access via:
+```python
+census["census_data"]["homo_sapiens"].ms["RNA"].var
+```
+
+**Available Fields:**
+- `soma_joinid`: Unique integer identifier for joins
+- `feature_id`: Ensembl gene ID (e.g., "ENSG00000161798")
+- `feature_name`: Gene symbol (e.g., "FOXP2")
+- `feature_length`: Gene length in base pairs
+
+## Value Filter Syntax
+
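+Queries use Python-like boolean expressions for filtering, passed as strings and evaluated by TileDB-SOMA. `read()` on a DataFrame such as `obs` or `var` takes the expression as `value_filter`; `cellxgene_census.get_anndata()` takes `obs_value_filter` and `var_value_filter`.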
+
+### Comparison Operators
+- `==`: Equal to
+- `!=`: Not equal to
+- `<`, `>`, `<=`, `>=`: Numeric comparisons
+- `in`: Membership test (e.g., `feature_id in ['ENSG00000161798', 'ENSG00000188229']`)
+
+### Logical Operators
+- `and`, `&`: Logical AND
+- `or`, `|`: Logical OR
+
+### Examples
+
+**Single condition:**
+```python
+value_filter="cell_type == 'B cell'"
+```
+
+**Multiple conditions with AND:**
+```python
+value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True"
+```
+
+**Using IN for multiple values:**
+```python
+value_filter="tissue in ['lung', 'liver', 'kidney']"
+```
+
+**Complex condition:**
+```python
+value_filter="(cell_type == 'neuron' or cell_type == 'astrocyte') and disease != 'normal'"
+```
+
+**Filtering genes:**
+```python
+var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']"
+```
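+
+Filter strings like the ones above are often assembled from lists. The following is a small sketch of one way to do that; `in_clause` and `build_obs_filter` are hypothetical helper names, and the quoting assumes values contain no single quotes.
+
+```python
+def in_clause(column, values):
+    """Build `column in [...]` with single-quoted string literals."""
+    quoted = ", ".join(f"'{v}'" for v in values)
+    return f"{column} in [{quoted}]"
+
+
+def build_obs_filter(cell_types=None, tissues_general=None, primary_only=True):
+    """Combine optional conditions with `and`; return None if there are none."""
+    clauses = []
+    if cell_types:
+        clauses.append(in_clause("cell_type", cell_types))
+    if tissues_general:
+        clauses.append(in_clause("tissue_general", tissues_general))
+    if primary_only:
+        clauses.append("is_primary_data == True")
+    return " and ".join(clauses) or None
+
+
+print(build_obs_filter(["B cell"], ["lung"]))
+# cell_type in ['B cell'] and tissue_general in ['lung'] and is_primary_data == True
+```
+
+Returning `None` when no condition is set lets the result be passed straight through as a filter argument meaning "no filter".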
+
+## Data Inclusion Criteria
+
+The Census includes all data from CZ CELLxGENE Discover meeting:
+
+1. **Species**: Human (*Homo sapiens*) or mouse (*Mus musculus*)
+2. **Technology**: Approved sequencing technologies for RNA
+3. **Count Type**: Raw counts only (no processed/normalized-only data)
+4. **Metadata**: Standardized following CELLxGENE schema
+5. **Modality**: Both non-spatial and spatial transcriptomics data are included
+
+## Important Data Characteristics
+
+### Duplicate Cells
+Cells may appear across multiple datasets. Use `is_primary_data == True` to filter for unique cells in most analyses.
+
+### Count Types
+The Census includes:
+- **Molecule counts**: From UMI-based methods
+- **Full-gene sequencing read counts**: From non-UMI methods
+
+These may need different normalization approaches.
+
+### Versioning
+Census releases are identified by build date (e.g., "2023-07-25"). The aliases "stable" and "latest" point to different builds over time, so specify a dated version for reproducible analysis:
+```python
+import cellxgene_census
+census = cellxgene_census.open_soma(census_version="2023-07-25")
+```
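+
+To record which dated build an alias currently points to, the version directory can be inspected. This is a hedged sketch: `resolve_build` is a hypothetical helper, and it assumes the dictionary returned by `cellxgene_census.get_census_version_directory()` has one entry per version name or alias, each with a `release_build` key.
+
+```python
+def resolve_build(directory, alias="stable"):
+    """Return the dated build an alias points to (assumes a `release_build` key)."""
+    return directory[alias]["release_build"]
+
+
+# Intended use (requires network access, so left as comments):
+# directory = cellxgene_census.get_census_version_directory()
+# census = cellxgene_census.open_soma(census_version=resolve_build(directory))
+```
+
+Logging the resolved build alongside results makes it possible to reopen the same release later.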
+
+## Dataset Presence Matrix
+
+Access which genes were measured in each dataset:
+```python
+presence_matrix = census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]
+```
+
+This sparse boolean matrix helps understand:
+- Gene coverage across datasets
+- Which datasets to include for specific gene analyses
+- Technical batch effects related to gene coverage
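+
+One use of the matrix is selecting datasets that measured every gene of interest. The sketch below works on the coordinates of the nonzero entries and assumes rows index datasets and columns index genes; `datasets_measuring_all` is a hypothetical helper name.
+
+```python
+def datasets_measuring_all(nonzero_pairs, gene_columns):
+    """Return sorted dataset rows in which every requested gene column is present.
+
+    nonzero_pairs: iterable of (dataset_row, gene_column) for True entries.
+    """
+    wanted = set(gene_columns)
+    seen = {}
+    for row, col in nonzero_pairs:
+        if col in wanted:
+            seen.setdefault(row, set()).add(col)
+    return sorted(row for row, cols in seen.items() if cols == wanted)
+
+
+pairs = [(0, 1), (0, 2), (1, 1), (2, 2), (2, 1)]
+print(datasets_measuring_all(pairs, [1, 2]))  # [0, 2]
+```
+
+If the matrix is held as a SciPy sparse matrix `m`, the pairs can be produced with `zip(*m.nonzero())`.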
+
+## SOMA Object Types
+
+Core TileDB-SOMA objects used:
+- **DataFrame**: Tabular data (obs, var)
+- **SparseNDArray**: Sparse matrices (X layers, presence matrix)
+- **DenseNDArray**: Dense arrays (less common)
+- **Collection**: Container for related objects
+- **Experiment**: Top-level container for measurements
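+- **Scene**: Container for spatial transcriptomics data (spatial Census data only)
+- **obs_spatial_presence**: A DataFrame on spatial experiments recording which cells are present in which scene (a field, not an object type)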