Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions


@@ -0,0 +1,505 @@
---
name: cellxgene-census
description: "Query CZ CELLxGENE Census (61M+ cells). Filter by cell type/tissue/disease, retrieve expression data, integrate with scanpy/PyTorch, for population-scale single-cell analysis."
---
# CZ CELLxGENE Census
## Overview
The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets.
The Census includes:
- **61+ million cells** from human and mouse
- **Standardized metadata** (cell types, tissues, diseases, donors)
- **Raw gene expression** matrices
- **Pre-calculated embeddings** and statistics
- **Integration with PyTorch, scanpy, and other analysis tools**
## When to Use This Skill
This skill should be used when:
- Querying single-cell expression data by cell type, tissue, or disease
- Exploring available single-cell datasets and metadata
- Training machine learning models on single-cell data
- Performing large-scale cross-dataset analyses
- Integrating Census data with scanpy or other analysis frameworks
- Computing statistics across millions of cells
- Accessing pre-calculated embeddings or model predictions
## Installation and Setup
Install the Census API:
```bash
uv pip install cellxgene-census
```
For machine learning workflows, install additional dependencies:
```bash
uv pip install cellxgene-census[experimental]
```
## Core Workflow Patterns
### 1. Opening the Census
Always use the context manager to ensure proper resource cleanup:
```python
import cellxgene_census
# Open the latest stable version
with cellxgene_census.open_soma() as census:
    ...  # work with census data
# Open a specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    ...  # work with census data
```
**Key points:**
- Use context manager (`with` statement) for automatic cleanup
- Specify `census_version` for reproducible analyses (see the version listing sketch below)
- Default opens latest "stable" release
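To decide which version to pin, the available releases can be listed with `get_census_version_directory()`; a minimal sketch:
```python
import cellxgene_census
# List available Census releases; aliases like "stable" and "latest"
# resolve to dated builds
versions = cellxgene_census.get_census_version_directory()
for tag, description in versions.items():
    print(tag, description)
```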
### 2. Exploring Census Information
Before querying expression data, explore available datasets and metadata.
**Access summary information:**
```python
# Get summary statistics
summary = census["census_info"]["summary"].read().concat().to_pandas()
# The summary table is long-form: one row per label/value pair
total_cells = summary.loc[summary["label"] == "total_cell_count", "value"].iloc[0]
print(f"Total cells: {total_cells}")
# Get all datasets
datasets = census["census_info"]["datasets"].read().concat().to_pandas()
# Filter datasets by a keyword in the title (the datasets table carries
# dataset-level fields such as dataset_title, not per-cell metadata)
covid_datasets = datasets[datasets["dataset_title"].str.contains("COVID", case=False, na=False)]
```
**Query cell metadata to understand available data:**
```python
# Get unique cell types in a tissue
cell_metadata = cellxgene_census.get_obs(
census,
"homo_sapiens",
value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["cell_type", "tissue"]
)
unique_cell_types = cell_metadata["cell_type"].unique()
print(f"Found {len(unique_cell_types)} cell types in brain")
# Count cells by specific tissue within the brain
tissue_counts = cell_metadata.groupby("tissue").size()
```
**Important:** Always filter for `is_primary_data == True` to avoid counting duplicate cells unless specifically analyzing duplicates.
### 3. Querying Expression Data (Small to Medium Scale)
For queries returning < 100k cells that fit in memory, use `get_anndata()`:
```python
# Basic query with cell type and tissue filters
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens", # or "Mus musculus"
obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
obs_column_names=["assay", "disease", "sex", "donor_id"],
)
# Query specific genes with multiple filters
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
obs_column_names=["cell_type", "tissue_general", "donor_id"],
)
```
**Filter syntax:**
- Use `obs_value_filter` for cell filtering
- Use `var_value_filter` for gene filtering
- Combine conditions with `and`, `or`
- Use `in` for multiple values: `tissue in ['lung', 'liver']`
- Select only needed columns with `obs_column_names`
**Getting metadata separately:**
```python
# Query cell metadata
cell_metadata = cellxgene_census.get_obs(
census, "homo_sapiens",
value_filter="disease == 'COVID-19' and is_primary_data == True",
column_names=["cell_type", "tissue_general", "donor_id"]
)
# Query gene metadata
gene_metadata = cellxgene_census.get_var(
census, "homo_sapiens",
value_filter="feature_name in ['CD4', 'CD8A']",
column_names=["feature_id", "feature_name", "feature_length"]
)
```
### 4. Large-Scale Queries (Out-of-Core Processing)
For queries exceeding available RAM, use `axis_query()` with iterative processing:
```python
import tiledbsoma as soma
# Create axis query
query = census["census_data"]["homo_sapiens"].axis_query(
measurement_name="RNA",
obs_query=soma.AxisQuery(
value_filter="tissue_general == 'brain' and is_primary_data == True"
),
var_query=soma.AxisQuery(
value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
)
)
# Iterate through expression matrix in chunks
iterator = query.X("raw").tables()
for batch in iterator:
# batch is a pyarrow.Table with columns:
# - soma_data: expression value
# - soma_dim_0: cell (obs) coordinate
# - soma_dim_1: gene (var) coordinate
process_batch(batch)
```
**Computing incremental statistics:**
```python
# Example: mean of the stored (explicitly recorded) expression values;
# implicit zeros in the sparse matrix are not counted here
n_observations = 0
sum_values = 0.0
iterator = query.X("raw").tables()
for batch in iterator:
values = batch["soma_data"].to_numpy()
n_observations += len(values)
sum_values += values.sum()
mean_expression = sum_values / n_observations
```
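The same iterator can also feed per-gene statistics. A minimal sketch of a per-gene mean (implicit zeros count toward the denominator), assuming the `query` object from above is still open and using the indexer exposed by tiledbsoma's `ExperimentAxisQuery`:
```python
import numpy as np
# Gene order follows query.var_joinids()
n_obs = query.n_obs
var_joinids = query.var_joinids().to_numpy()
gene_sums = np.zeros(len(var_joinids), dtype=np.float64)
for batch in query.X("raw").tables():
    values = batch["soma_data"].to_numpy()
    # Map global var soma_joinids to 0-based positions within this query
    positions = query.indexer.by_var(batch["soma_dim_1"].to_numpy())
    np.add.at(gene_sums, positions, values)
gene_means = gene_sums / n_obs
```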
### 5. Machine Learning with PyTorch
For training models, use the experimental PyTorch integration:
```python
from cellxgene_census.experimental.ml import experiment_dataloader
with cellxgene_census.open_soma() as census:
# Create dataloader
dataloader = experiment_dataloader(
census["census_data"]["homo_sapiens"],
measurement_name="RNA",
X_name="raw",
obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
obs_column_names=["cell_type"],
batch_size=128,
shuffle=True,
)
# Training loop
for epoch in range(num_epochs):
for batch in dataloader:
X = batch["X"] # Gene expression tensor
labels = batch["obs"]["cell_type"] # Cell type labels
# Forward pass
outputs = model(X)
loss = criterion(outputs, labels)
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
```
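The loop above assumes `model`, `criterion`, `optimizer`, and `num_epochs` are already defined. A minimal sketch of those pieces (the sizes are placeholders, and the string cell-type labels returned by the dataloader would still need to be encoded as integer class indices):
```python
import torch
import torch.nn as nn
# Placeholders: set these from your query (number of genes returned and
# number of distinct cell types among the labels)
n_genes, n_cell_types = 20_000, 100
model = nn.Sequential(
    nn.Linear(n_genes, 128),
    nn.ReLU(),
    nn.Linear(128, n_cell_types),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
num_epochs = 5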
**Train/test splitting:**
```python
from cellxgene_census.experimental.ml import ExperimentDataset
# Create dataset from experiment
dataset = ExperimentDataset(
experiment_axis_query,
layer_name="raw",
obs_column_names=["cell_type"],
batch_size=128,
)
# Split into train and test
train_dataset, test_dataset = dataset.random_split(
split=[0.8, 0.2],
seed=42
)
```
### 6. Integration with Scanpy
Seamlessly integrate Census data with scanpy workflows:
```python
import scanpy as sc
# Load data from Census
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="cell_type == 'neuron' and tissue_general == 'cortex' and is_primary_data == True",
)
# Standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
# Visualization
sc.pl.umap(adata, color=["cell_type", "tissue", "disease"])
```
### 7. Multi-Dataset Integration
Query and integrate multiple datasets:
```python
# Strategy 1: Query multiple tissues separately
tissues = ["lung", "liver", "kidney"]
adatas = []
for tissue in tissues:
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
)
adata.obs["tissue"] = tissue
adatas.append(adata)
# Concatenate
combined = adatas[0].concatenate(adatas[1:])
# Strategy 2: Query multiple datasets directly
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
)
```
## Key Concepts and Best Practices
### Always Filter for Primary Data
Unless analyzing duplicates, always include `is_primary_data == True` in queries to avoid counting cells multiple times:
```python
obs_value_filter="cell_type == 'B cell' and is_primary_data == True"
```
### Specify Census Version for Reproducibility
Always specify the Census version in production analyses:
```python
census = cellxgene_census.open_soma(census_version="2023-07-25")
```
### Estimate Query Size Before Loading
For large queries, first check the number of cells to avoid memory issues:
```python
# Get cell count
metadata = cellxgene_census.get_obs(
census, "homo_sapiens",
value_filter="tissue_general == 'brain' and is_primary_data == True",
column_names=["soma_joinid"]
)
n_cells = len(metadata)
print(f"Query will return {n_cells:,} cells")
# If too large (>100k), use out-of-core processing
```
### Use tissue_general for Broader Groupings
The `tissue_general` field provides coarser categories than `tissue`, useful for cross-tissue analyses:
```python
# Broader grouping
obs_value_filter="tissue_general == 'immune system'"
# Specific tissue
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"
```
### Select Only Needed Columns
Minimize data transfer by specifying only required metadata columns:
```python
obs_column_names=["cell_type", "tissue_general", "disease"] # Not all columns
```
### Check Dataset Presence for Gene-Specific Queries
When analyzing specific genes, verify which datasets measured them:
```python
# get_presence_matrix returns a datasets x genes boolean matrix; columns are
# positionally aligned with the genes' var soma_joinid
presence = cellxgene_census.get_presence_matrix(
    census,
    "homo_sapiens",
    measurement_name="RNA",
)
# Subset columns to the genes of interest using their soma_joinid from get_var
```
### Two-Step Workflow: Explore Then Query
First explore metadata to understand available data, then query expression:
```python
# Step 1: Explore what's available
metadata = cellxgene_census.get_obs(
census, "homo_sapiens",
value_filter="disease == 'COVID-19' and is_primary_data == True",
column_names=["cell_type", "tissue_general"]
)
print(metadata.value_counts())
# Step 2: Query based on findings
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
)
```
## Available Metadata Fields
### Cell Metadata (obs)
Key fields for filtering:
- `cell_type`, `cell_type_ontology_term_id`
- `tissue`, `tissue_general`, `tissue_ontology_term_id`
- `disease`, `disease_ontology_term_id`
- `assay`, `assay_ontology_term_id`
- `donor_id`, `sex`, `self_reported_ethnicity`
- `development_stage`, `development_stage_ontology_term_id`
- `dataset_id`
- `is_primary_data` (Boolean: True = unique cell)
### Gene Metadata (var)
- `feature_id` (Ensembl gene ID, e.g., "ENSG00000161798")
- `feature_name` (Gene symbol, e.g., "FOXP2")
- `feature_length` (Gene length in base pairs)
## Reference Documentation
This skill includes detailed reference documentation:
### references/census_schema.md
Comprehensive documentation of:
- Census data structure and organization
- All available metadata fields
- Value filter syntax and operators
- SOMA object types
- Data inclusion criteria
**When to read:** When you need detailed schema information, full list of metadata fields, or complex filter syntax.
### references/common_patterns.md
Examples and patterns for:
- Exploratory queries (metadata only)
- Small-to-medium queries (AnnData)
- Large queries (out-of-core processing)
- PyTorch integration
- Scanpy integration workflows
- Multi-dataset integration
- Best practices and common pitfalls
**When to read:** When implementing specific query patterns, looking for code examples, or troubleshooting common issues.
## Common Use Cases
### Use Case 1: Explore Cell Types in a Tissue
```python
with cellxgene_census.open_soma() as census:
cells = cellxgene_census.get_obs(
census, "homo_sapiens",
value_filter="tissue_general == 'lung' and is_primary_data == True",
column_names=["cell_type"]
)
print(cells["cell_type"].value_counts())
```
### Use Case 2: Query Marker Gene Expression
```python
with cellxgene_census.open_soma() as census:
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']",
obs_value_filter="cell_type in ['T cell', 'B cell'] and is_primary_data == True",
)
```
### Use Case 3: Train Cell Type Classifier
```python
from cellxgene_census.experimental.ml import experiment_dataloader
with cellxgene_census.open_soma() as census:
dataloader = experiment_dataloader(
census["census_data"]["homo_sapiens"],
measurement_name="RNA",
X_name="raw",
obs_value_filter="is_primary_data == True",
obs_column_names=["cell_type"],
batch_size=128,
shuffle=True,
)
# Train model
for epoch in range(epochs):
for batch in dataloader:
# Training logic
pass
```
### Use Case 4: Cross-Tissue Analysis
```python
with cellxgene_census.open_soma() as census:
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="cell_type == 'macrophage' and tissue_general in ['lung', 'liver', 'brain'] and is_primary_data == True",
)
# Analyze macrophage differences across tissues
sc.tl.rank_genes_groups(adata, groupby="tissue_general")
```
## Troubleshooting
### Query Returns Too Many Cells
- Add more specific filters to reduce scope
- Use `tissue` instead of `tissue_general` for finer granularity
- Filter by specific `dataset_id` if known
- Switch to out-of-core processing for large queries
### Memory Errors
- Reduce query scope with more restrictive filters
- Select fewer genes with `var_value_filter`
- Use out-of-core processing with `axis_query()`
- Process data in batches
### Duplicate Cells in Results
- Always include `is_primary_data == True` in filters
- Check if intentionally querying across multiple datasets
### Gene Not Found
- Verify gene name spelling (case-sensitive)
- Try the Ensembl ID via `feature_id` instead of `feature_name` (see the sketch below)
- Check dataset presence matrix to see if gene was measured
- Some genes may have been filtered during Census construction
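A quick way to check whether a symbol resolves before building a larger query (a sketch, assuming an open `census` handle; the Ensembl ID shown is illustrative):
```python
# Look the symbol up in var; fall back to the Ensembl ID if no rows return
var = cellxgene_census.get_var(
    census, "homo_sapiens",
    value_filter="feature_name == 'FOXP2'",
    column_names=["feature_id", "feature_name"],
)
if var.empty:
    var = cellxgene_census.get_var(
        census, "homo_sapiens",
        value_filter="feature_id == 'ENSG00000128573'",  # FOXP2's Ensembl ID
    )
print(var)
```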
### Version Inconsistencies
- Always specify `census_version` explicitly
- Use same version across all analyses
- Check release notes for version-specific changes


@@ -0,0 +1,182 @@
# CZ CELLxGENE Census Data Schema Reference
## Overview
The CZ CELLxGENE Census is a versioned collection of single-cell data built on the TileDB-SOMA framework. This reference documents the data structure, available metadata fields, and query syntax.
## High-Level Structure
The Census is organized as a `SOMACollection` with two main components:
### 1. census_info
Summary information including:
- **summary**: Build date, cell counts, dataset statistics
- **datasets**: All datasets from CELLxGENE Discover with metadata
- **summary_cell_counts**: Cell counts stratified by metadata categories
### 2. census_data
Organism-specific `SOMAExperiment` objects:
- **"homo_sapiens"**: Human single-cell data
- **"mus_musculus"**: Mouse single-cell data
## Data Structure Per Organism
Each organism experiment contains:
### obs (Cell Metadata)
Cell-level annotations stored as a `SOMADataFrame`. Access via:
```python
census["census_data"]["homo_sapiens"].obs
```
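To pull metadata from it directly (rather than through `get_obs`), the SOMADataFrame can be read into pandas; a minimal sketch, assuming an open `census` handle:
```python
obs_df = (
    census["census_data"]["homo_sapiens"]
    .obs.read(
        value_filter="tissue_general == 'lung' and is_primary_data == True",
        column_names=["cell_type", "disease"],
    )
    .concat()
    .to_pandas()
)
```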
### ms["RNA"] (Measurement)
RNA measurement data including:
- **X**: Data matrices with layers:
- `raw`: Raw count data
- `normalized`: (if available) Normalized counts
- **var**: Gene metadata
- **feature_dataset_presence_matrix**: Sparse boolean array showing which genes were measured in each dataset
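These objects live at fixed paths inside the experiment; for example (a sketch, assuming an open `census` handle):
```python
exp = census["census_data"]["homo_sapiens"]
raw_X = exp.ms["RNA"].X["raw"]        # SparseNDArray of raw counts
var_df = exp.ms["RNA"].var            # gene metadata SOMADataFrame
presence = exp.ms["RNA"]["feature_dataset_presence_matrix"]  # dataset-by-gene presence
```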
## Cell Metadata Fields (obs)
### Required/Core Fields
**Identity & Dataset:**
- `soma_joinid`: Unique integer identifier for joins
- `dataset_id`: Source dataset identifier
- `is_primary_data`: Boolean flag (True = unique cell, False = duplicate across datasets)
**Cell Type:**
- `cell_type`: Human-readable cell type name
- `cell_type_ontology_term_id`: Standardized ontology term (e.g., "CL:0000236")
**Tissue:**
- `tissue`: Specific tissue name
- `tissue_general`: Broader tissue category (useful for grouping)
- `tissue_ontology_term_id`: Standardized ontology term
**Assay:**
- `assay`: Sequencing technology used
- `assay_ontology_term_id`: Standardized ontology term
**Disease:**
- `disease`: Disease status or condition
- `disease_ontology_term_id`: Standardized ontology term
**Donor:**
- `donor_id`: Unique donor identifier
- `sex`: Biological sex (male, female, unknown)
- `self_reported_ethnicity`: Ethnicity information
- `development_stage`: Life stage (adult, child, embryonic, etc.)
- `development_stage_ontology_term_id`: Standardized ontology term
**Organism:**
- `organism`: Scientific name (Homo sapiens, Mus musculus)
- `organism_ontology_term_id`: Standardized ontology term
**Technical:**
- `suspension_type`: Sample preparation type (cell, nucleus, na)
## Gene Metadata Fields (var)
Access via:
```python
census["census_data"]["homo_sapiens"].ms["RNA"].var
```
**Available Fields:**
- `soma_joinid`: Unique integer identifier for joins
- `feature_id`: Ensembl gene ID (e.g., "ENSG00000161798")
- `feature_name`: Gene symbol (e.g., "FOXP2")
- `feature_length`: Gene length in base pairs
## Value Filter Syntax
Queries use Python-like expressions for filtering. The syntax is processed by TileDB-SOMA.
### Comparison Operators
- `==`: Equal to
- `!=`: Not equal to
- `<`, `>`, `<=`, `>=`: Numeric comparisons
- `in`: Membership test (e.g., `feature_id in ['ENSG00000161798', 'ENSG00000188229']`)
### Logical Operators
- `and`, `&`: Logical AND
- `or`, `|`: Logical OR
### Examples
**Single condition:**
```python
value_filter="cell_type == 'B cell'"
```
**Multiple conditions with AND:**
```python
value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True"
```
**Using IN for multiple values:**
```python
value_filter="tissue in ['lung', 'liver', 'kidney']"
```
**Complex condition:**
```python
value_filter="(cell_type == 'neuron' or cell_type == 'astrocyte') and disease != 'normal'"
```
**Filtering genes:**
```python
var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']"
```
## Data Inclusion Criteria
The Census includes all data from CZ CELLxGENE Discover meeting:
1. **Species**: Human (*Homo sapiens*) or mouse (*Mus musculus*)
2. **Technology**: Approved sequencing technologies for RNA
3. **Count Type**: Raw counts only (no processed/normalized-only data)
4. **Metadata**: Standardized following CELLxGENE schema
5. **Modality**: Both non-spatial (dissociated) and spatial transcriptomics are included
## Important Data Characteristics
### Duplicate Cells
Cells may appear across multiple datasets. Use `is_primary_data == True` to filter for unique cells in most analyses.
### Count Types
The Census includes:
- **Molecule counts**: From UMI-based methods
- **Full-gene sequencing read counts**: From non-UMI methods
These may need different normalization approaches.
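A quick way to see which assays contributed to a query result before picking a normalization strategy (a sketch, assuming `adata` came from `get_anndata` with the default obs columns, which include `assay`):
```python
# Tally cells per assay; UMI-based assays (e.g., 10x) and full-length
# read-count assays (e.g., Smart-seq) generally call for different handling
print(adata.obs["assay"].value_counts())
```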
### Versioning
Census releases are versioned (e.g., "2023-07-25", "stable"). Always specify version for reproducible analysis:
```python
census = cellxgene_census.open_soma(census_version="2023-07-25")
```
## Dataset Presence Matrix
Access which genes were measured in each dataset:
```python
presence_matrix = census["census_data"]["homo_sapiens"].ms["RNA"]["feature_dataset_presence_matrix"]
```
This sparse boolean matrix helps understand:
- Gene coverage across datasets
- Which datasets to include for specific gene analyses
- Technical batch effects related to gene coverage
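The convenience wrapper `cellxgene_census.get_presence_matrix()` returns this matrix as a SciPy sparse array with datasets as rows and genes as columns (columns positionally aligned with var `soma_joinid`); a sketch of checking one gene's coverage, assuming an open `census` handle:
```python
presence = cellxgene_census.get_presence_matrix(
    census, "homo_sapiens", measurement_name="RNA"
)
var = cellxgene_census.get_var(
    census, "homo_sapiens",
    value_filter="feature_name == 'FOXP2'",
    column_names=["soma_joinid", "feature_name"],
)
gene_col = int(var["soma_joinid"].iloc[0])
print(f"FOXP2 measured in {presence[:, gene_col].sum()} datasets")
```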
## SOMA Object Types
Core TileDB-SOMA objects used:
- **DataFrame**: Tabular data (obs, var)
- **SparseNDArray**: Sparse matrices (X layers, presence matrix)
- **DenseNDArray**: Dense arrays (less common)
- **Collection**: Container for related objects
- **Experiment**: Top-level container for measurements
- **SOMAScene**: Spatial transcriptomics scenes
- **obs_spatial_presence**: Spatial data availability


@@ -0,0 +1,351 @@
# Common Query Patterns and Best Practices
## Query Pattern Categories
### 1. Exploratory Queries (Metadata Only)
Use when exploring available data without loading expression matrices.
**Pattern: Get unique cell types in a tissue**
```python
import cellxgene_census
with cellxgene_census.open_soma() as census:
cell_metadata = cellxgene_census.get_obs(
census,
"homo_sapiens",
value_filter="tissue_general == 'brain' and is_primary_data == True",
column_names=["cell_type"]
)
unique_cell_types = cell_metadata["cell_type"].unique()
print(f"Found {len(unique_cell_types)} unique cell types")
```
**Pattern: Count cells by condition**
```python
cell_metadata = cellxgene_census.get_obs(
census,
"homo_sapiens",
value_filter="disease != 'normal' and is_primary_data == True",
column_names=["disease", "tissue_general"]
)
counts = cell_metadata.groupby(["disease", "tissue_general"]).size()
```
**Pattern: Explore dataset information**
```python
# Access datasets table
datasets = census["census_info"]["datasets"].read().concat().to_pandas()
# Filter for specific criteria (e.g., a keyword in the dataset title)
covid_datasets = datasets[datasets["dataset_title"].str.contains("COVID", case=False, na=False)]
```
### 2. Small-to-Medium Queries (AnnData)
Use `get_anndata()` when results fit in memory (typically < 100k cells).
**Pattern: Tissue-specific cell type query**
```python
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
obs_column_names=["assay", "disease", "sex", "donor_id"],
)
```
**Pattern: Gene-specific query with multiple genes**
```python
marker_genes = ["CD4", "CD8A", "CD19", "FOXP3"]
# First get gene IDs
gene_metadata = cellxgene_census.get_var(
census, "homo_sapiens",
value_filter=f"feature_name in {marker_genes}",
column_names=["feature_id", "feature_name"]
)
gene_ids = gene_metadata["feature_id"].tolist()
# Query with gene filter
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
var_value_filter=f"feature_id in {gene_ids}",
obs_value_filter="cell_type == 'T cell' and is_primary_data == True",
)
```
**Pattern: Multi-tissue query**
```python
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
obs_column_names=["cell_type", "tissue_general", "dataset_id"],
)
```
**Pattern: Disease-specific query**
```python
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="disease == 'COVID-19' and tissue_general == 'lung' and is_primary_data == True",
)
```
### 3. Large Queries (Out-of-Core Processing)
Use `axis_query()` for queries that exceed available RAM.
**Pattern: Iterative processing**
```python
import pyarrow as pa
# Create query
query = census["census_data"]["homo_sapiens"].axis_query(
measurement_name="RNA",
obs_query=soma.AxisQuery(
value_filter="tissue_general == 'brain' and is_primary_data == True"
),
var_query=soma.AxisQuery(
value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
)
)
# Iterate through X matrix in chunks
iterator = query.X("raw").tables()
for batch in iterator:
# Process batch (a pyarrow.Table)
# batch has columns: soma_data, soma_dim_0, soma_dim_1
process_batch(batch)
```
**Pattern: Incremental statistics (mean/variance)**
```python
# Welford's online algorithm, streamed over the stored (explicitly recorded)
# values; implicit zeros in the sparse matrix are not included
n = 0
mean = 0
M2 = 0
iterator = query.X("raw").tables()
for batch in iterator:
values = batch["soma_data"].to_numpy()
for x in values:
n += 1
delta = x - mean
mean += delta / n
delta2 = x - mean
M2 += delta * delta2
variance = M2 / (n - 1) if n > 1 else 0
```
### 4. PyTorch Integration (Machine Learning)
Use `experiment_dataloader()` for training models.
**Pattern: Create training dataloader**
```python
from cellxgene_census.experimental.ml import experiment_dataloader
import torch
with cellxgene_census.open_soma() as census:
# Create dataloader
dataloader = experiment_dataloader(
census["census_data"]["homo_sapiens"],
measurement_name="RNA",
X_name="raw",
obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
obs_column_names=["cell_type"],
batch_size=128,
shuffle=True,
)
# Training loop
for epoch in range(num_epochs):
for batch in dataloader:
X = batch["X"] # Gene expression
labels = batch["obs"]["cell_type"] # Cell type labels
# Train model...
```
**Pattern: Train/test split**
```python
from cellxgene_census.experimental.ml import ExperimentDataset
# Create dataset from query
dataset = ExperimentDataset(
experiment_axis_query,
layer_name="raw",
obs_column_names=["cell_type"],
batch_size=128,
)
# Split data
train_dataset, test_dataset = dataset.random_split(
split=[0.8, 0.2],
seed=42
)
# Create loaders
train_loader = experiment_dataloader(train_dataset)
test_loader = experiment_dataloader(test_dataset)
```
### 5. Integration Workflows
**Pattern: Scanpy integration**
```python
import scanpy as sc
# Load data
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="cell_type == 'neuron' and is_primary_data == True",
)
# Standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata)
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color=["cell_type", "tissue_general"])
```
**Pattern: Multi-dataset integration**
```python
# Query multiple datasets separately
datasets_to_integrate = ["dataset_id_1", "dataset_id_2", "dataset_id_3"]
adatas = []
for dataset_id in datasets_to_integrate:
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter=f"dataset_id == '{dataset_id}' and is_primary_data == True",
)
adatas.append(adata)
# Integrate using scanorama, harmony, or other tools
import anndata as ad
import scanpy as sc
import scanpy.external as sce
combined = ad.concat(adatas, label="batch")
sc.pp.normalize_total(combined, target_sum=1e4)
sc.pp.log1p(combined)
sc.pp.pca(combined)
sce.pp.scanorama_integrate(combined, key="batch")  # writes X_scanorama to obsm
```
## Best Practices
### 1. Always Filter for Primary Data
Unless specifically analyzing duplicates, always include `is_primary_data == True`:
```python
obs_value_filter="cell_type == 'B cell' and is_primary_data == True"
```
### 2. Specify Census Version
For reproducible analysis, always specify the Census version:
```python
census = cellxgene_census.open_soma(census_version="2023-07-25")
```
### 3. Use Context Manager
Always use the context manager to ensure proper cleanup:
```python
with cellxgene_census.open_soma() as census:
# Your code here
```
### 4. Select Only Needed Columns
Minimize data transfer by selecting only required metadata columns:
```python
obs_column_names=["cell_type", "tissue_general", "disease"] # Not all columns
```
### 5. Check Dataset Presence for Gene Queries
When analyzing specific genes, check which datasets measured them:
```python
# get_presence_matrix returns a datasets x genes boolean matrix; columns are
# positionally aligned with the genes' var soma_joinid
presence = cellxgene_census.get_presence_matrix(
    census,
    "homo_sapiens",
    measurement_name="RNA",
)
# Subset columns to the genes of interest using their soma_joinid from get_var
```
### 6. Use tissue_general for Broader Queries
`tissue_general` provides coarser groupings than `tissue`, useful for cross-tissue analyses:
```python
# Better for broad queries
obs_value_filter="tissue_general == 'immune system'"
# Use specific tissue when needed
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"
```
### 7. Combine Metadata Exploration with Expression Queries
First explore metadata to understand available data, then query expression:
```python
# Step 1: Explore
metadata = cellxgene_census.get_obs(
census, "homo_sapiens",
value_filter="disease == 'COVID-19'",
column_names=["cell_type", "tissue_general"]
)
print(metadata.value_counts())
# Step 2: Query based on findings
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
)
```
### 8. Memory Management for Large Queries
For large queries, check estimated size before loading:
```python
# Get cell count first
metadata = cellxgene_census.get_obs(
census, "homo_sapiens",
value_filter="tissue_general == 'brain' and is_primary_data == True",
column_names=["soma_joinid"]
)
n_cells = len(metadata)
print(f"Query will return {n_cells} cells")
# If too large, use out-of-core processing or further filtering
```
### 9. Leverage Ontology Terms for Consistency
When possible, use ontology term IDs instead of free text:
```python
# More reliable than cell_type == 'B cell' across datasets
obs_value_filter="cell_type_ontology_term_id == 'CL:0000236'"
```
### 10. Batch Processing Pattern
For systematic analyses across multiple conditions:
```python
tissues = ["lung", "liver", "kidney", "heart"]
results = {}
for tissue in tissues:
adata = cellxgene_census.get_anndata(
census=census,
organism="Homo sapiens",
obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
)
# Perform analysis
results[tissue] = analyze(adata)
```
## Common Pitfalls to Avoid
1. **Not filtering for is_primary_data**: Leads to counting duplicate cells
2. **Loading too much data**: Use metadata queries to estimate size first
3. **Not using context manager**: Can cause resource leaks
4. **Inconsistent versioning**: Results not reproducible without specifying version
5. **Overly broad queries**: Start with focused queries, expand as needed
6. **Ignoring dataset presence**: Some genes not measured in all datasets
7. **Wrong count normalization**: Be aware of UMI vs read count differences