---
name: cellxgene-census
description: "Query CZ CELLxGENE Census (61M+ cells). Filter by cell type/tissue/disease, retrieve expression data, integrate with scanpy/PyTorch, for population-scale single-cell analysis."
---

# CZ CELLxGENE Census

## Overview

The CZ CELLxGENE Census provides programmatic access to a comprehensive, versioned collection of standardized single-cell genomics data from CZ CELLxGENE Discover. This skill enables efficient querying and analysis of millions of cells across thousands of datasets.

The Census includes:
- **61+ million cells** from human and mouse
- **Standardized metadata** (cell types, tissues, diseases, donors)
- **Raw gene expression** matrices
- **Pre-calculated embeddings** and statistics
- **Integration with PyTorch, scanpy, and other analysis tools**

## When to Use This Skill

This skill should be used when:
- Querying single-cell expression data by cell type, tissue, or disease
- Exploring available single-cell datasets and metadata
- Training machine learning models on single-cell data
- Performing large-scale cross-dataset analyses
- Integrating Census data with scanpy or other analysis frameworks
- Computing statistics across millions of cells
- Accessing pre-calculated embeddings or model predictions

## Installation and Setup

Install the Census API:
```bash
uv pip install cellxgene-census
```

For machine learning workflows, install the additional dependencies (the quotes keep shells such as zsh from expanding the brackets):
```bash
uv pip install "cellxgene-census[experimental]"
```

## Core Workflow Patterns

### 1. Opening the Census

Always use the context manager to ensure proper resource cleanup:

```python
import cellxgene_census

# Open latest stable version
with cellxgene_census.open_soma() as census:
    ...  # Work with census data

# Open specific version for reproducibility
with cellxgene_census.open_soma(census_version="2023-07-25") as census:
    ...  # Work with census data
```

**Key points:**
- Use context manager (`with` statement) for automatic cleanup
- Specify `census_version` for reproducible analyses
- Default opens latest "stable" release

### 2. Exploring Census Information

Before querying expression data, explore available datasets and metadata.

**Access summary information:**
```python
# Get summary statistics (a table of label/value rows)
summary = census["census_info"]["summary"].read().concat().to_pandas()
total_cells = summary.set_index("label")["value"]["total_cell_count"]
print(f"Total cells: {total_cells}")

# Get all datasets
datasets = census["census_info"]["datasets"].read().concat().to_pandas()

# Filter datasets by criteria (this table has no disease column, so match on the title)
covid_datasets = datasets[datasets["dataset_title"].str.contains("COVID", na=False)]
```

**Query cell metadata to understand available data:**
```python
# Get unique cell types in a tissue
cell_metadata = cellxgene_census.get_obs(
    census,
    "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["cell_type", "tissue"]
)
unique_cell_types = cell_metadata["cell_type"].unique()
print(f"Found {len(unique_cell_types)} cell types in brain")

# Count cells by specific tissue within the brain
tissue_counts = cell_metadata.groupby("tissue").size()
```

**Important:** Always filter for `is_primary_data == True` to avoid counting duplicate cells unless specifically analyzing duplicates.
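
To see why the flag matters, here is a toy sketch in plain Python. It does not use the Census API, and the records and field values are made up for illustration: one cell appears in two datasets, and only one copy is marked as primary.

```python
# Toy records (hypothetical data): cell "c1" appears in two datasets.
cells = [
    {"cell": "c1", "dataset_id": "d1", "is_primary_data": True},
    {"cell": "c2", "dataset_id": "d1", "is_primary_data": True},
    {"cell": "c1", "dataset_id": "d2", "is_primary_data": False},  # re-published copy of c1
]

# Counting every record counts c1 twice.
naive_count = len(cells)

# Keeping only primary records counts each cell once.
primary_count = sum(1 for c in cells if c["is_primary_data"])

print(naive_count, primary_count)
```

The naive count includes the re-published copy, while the primary-only count matches the number of distinct cells.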

### 3. Querying Expression Data (Small to Medium Scale)

For queries returning < 100k cells that fit in memory, use `get_anndata()`:

```python
# Basic query with cell type and tissue filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",  # or "Mus musculus"
    obs_value_filter="cell_type == 'B cell' and tissue_general == 'lung' and is_primary_data == True",
    obs_column_names=["assay", "disease", "sex", "donor_id"],
)

# Query specific genes with multiple filters
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19', 'FOXP3']",
    obs_value_filter="cell_type == 'T cell' and disease == 'COVID-19' and is_primary_data == True",
    obs_column_names=["cell_type", "tissue_general", "donor_id"],
)
```

**Filter syntax:**
- Use `obs_value_filter` for cell filtering
- Use `var_value_filter` for gene filtering
- Combine conditions with `and`, `or`
- Use `in` for multiple values: `tissue in ['lung', 'liver']`
- Select only needed columns with `obs_column_names`
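
When filters are assembled programmatically, a small helper can keep the quoting consistent. The following is a minimal sketch in plain Python; `build_value_filter` is a hypothetical helper, not part of `cellxgene_census`, and it only covers the `==`, `in`, and `and` forms listed above.

```python
def build_value_filter(criteria, primary_only=True):
    """Build a value-filter string from {column: value or list of values}.

    Hypothetical helper for illustration; not part of cellxgene_census.
    """
    clauses = []
    for column, value in criteria.items():
        if isinstance(value, (list, tuple)):
            # Multiple allowed values become an `in` clause.
            clauses.append(f"{column} in {list(value)!r}")
        else:
            # A single value becomes an equality clause.
            clauses.append(f"{column} == {value!r}")
    if primary_only:
        clauses.append("is_primary_data == True")
    return " and ".join(clauses)


example_filter = build_value_filter(
    {"cell_type": "B cell", "tissue_general": ["lung", "liver"]}
)
print(example_filter)
```

The resulting string can be passed as `obs_value_filter`. Values that themselves contain quote characters are not handled by this sketch.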

**Getting metadata separately:**
```python
# Query cell metadata
cell_metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general", "donor_id"]
)

# Query gene metadata
gene_metadata = cellxgene_census.get_var(
    census, "homo_sapiens",
    value_filter="feature_name in ['CD4', 'CD8A']",
    column_names=["feature_id", "feature_name", "feature_length"]
)
```

### 4. Large-Scale Queries (Out-of-Core Processing)

For queries exceeding available RAM, use `axis_query()` with iterative processing:

```python
import tiledbsoma as soma

# Create axis query
query = census["census_data"]["homo_sapiens"].axis_query(
    measurement_name="RNA",
    obs_query=soma.AxisQuery(
        value_filter="tissue_general == 'brain' and is_primary_data == True"
    ),
    var_query=soma.AxisQuery(
        value_filter="feature_name in ['FOXP2', 'TBR1', 'SATB2']"
    )
)

# Iterate through expression matrix in chunks
iterator = query.X("raw").tables()
for batch in iterator:
    # batch is a pyarrow.Table with columns:
    # - soma_data: expression value
    # - soma_dim_0: cell (obs) coordinate
    # - soma_dim_1: gene (var) coordinate
    process_batch(batch)  # placeholder for your own per-batch function
```

**Computing incremental statistics:**
```python
# Example: Calculate mean of the stored (non-zero) expression values
n_observations = 0
sum_values = 0.0

iterator = query.X("raw").tables()
for batch in iterator:
    values = batch["soma_data"].to_numpy()
    n_observations += len(values)
    sum_values += values.sum()

# The matrix is sparse, so zeros are not counted here.
# For the mean over all cell-gene pairs, divide by query.n_obs * query.n_vars instead.
mean_expression = sum_values / n_observations
```
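
The accumulation pattern can be tried without opening the Census. The sketch below uses plain Python lists as stand-in batches; `streaming_mean` is a hypothetical helper and the values are made up. It assumes, as is usual for sparse reads, that batches carry only stored non-zero values, so an optional `n_total` lets the implicit zeros count toward the denominator.

```python
def streaming_mean(batches, n_total=None):
    """Mean over an iterable of value batches, holding one batch at a time.

    If n_total is given, it is used as the denominator so that entries
    absent from the batches (implicit zeros in a sparse matrix) are counted.
    """
    count = 0
    total = 0.0
    for values in batches:
        count += len(values)
        total += sum(values)
    denominator = n_total if n_total is not None else count
    if denominator == 0:
        raise ValueError("no values to average")
    return total / denominator


batches = [[1.0, 3.0], [2.0], [4.0, 5.0]]  # made-up non-zero values
mean_nonzero = streaming_mean(batches)  # divides by the 5 stored values
mean_all = streaming_mean(batches, n_total=10)  # e.g. 2 cells x 5 genes
print(mean_nonzero, mean_all)
```

Only the running count and sum are kept in memory, which is what makes the approach suitable for queries larger than RAM.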

### 5. Machine Learning with PyTorch

For training models, use the experimental PyTorch integration:

```python
from cellxgene_census.experimental.ml import experiment_dataloader

with cellxgene_census.open_soma() as census:
    # Create dataloader
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="tissue_general == 'liver' and is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )

    # Training loop
    for epoch in range(num_epochs):
        for batch in dataloader:
            X = batch["X"]  # Gene expression tensor
            labels = batch["obs"]["cell_type"]  # Cell type labels

            # Forward pass
            outputs = model(X)
            loss = criterion(outputs, labels)

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```

**Train/test splitting:**
```python
from cellxgene_census.experimental.ml import ExperimentDataset

# Create dataset from experiment
dataset = ExperimentDataset(
    experiment_axis_query,
    layer_name="raw",
    obs_column_names=["cell_type"],
    batch_size=128,
)

# Split into train and test
train_dataset, test_dataset = dataset.random_split(
    split=[0.8, 0.2],
    seed=42
)
```
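
To illustrate what a seeded 80/20 split does, here is a minimal sketch in plain Python over integer indices. It is not the library's implementation; `split_indices` is a hypothetical helper that only shows why fixing the seed makes the partition reproducible.

```python
import random


def split_indices(n_items, fractions=(0.8, 0.2), seed=42):
    """Shuffle range(n_items) with a fixed seed and cut it by fractions."""
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # local RNG, global state untouched
    splits, start = [], 0
    for i, fraction in enumerate(fractions):
        # The last split takes the remainder so no item is dropped by rounding.
        is_last = i == len(fractions) - 1
        end = n_items if is_last else start + round(n_items * fraction)
        splits.append(indices[start:end])
        start = end
    return splits


train_idx, test_idx = split_indices(10)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same partition, and the two parts never share an index.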

### 6. Integration with Scanpy

Census data loads directly into scanpy workflows:

```python
import scanpy as sc

# Load data from Census
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    # 'cortex' is not a tissue_general category; use the coarse 'brain' grouping
    obs_value_filter="cell_type == 'neuron' and tissue_general == 'brain' and is_primary_data == True",
)

# Standard scanpy workflow
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata)
sc.tl.umap(adata)

# Visualization
sc.pl.umap(adata, color=["cell_type", "tissue", "disease"])
```

### 7. Multi-Dataset Integration

Query and integrate multiple datasets:

```python
import anndata as ad

# Strategy 1: Query multiple tissues separately
tissues = ["lung", "liver", "kidney"]
adatas = []

for tissue in tissues:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter=f"tissue_general == '{tissue}' and is_primary_data == True",
    )
    adata.obs["tissue"] = tissue
    adatas.append(adata)

# Concatenate (anndata.concat replaces the deprecated AnnData.concatenate)
combined = ad.concat(adatas)

# Strategy 2: Query multiple datasets directly
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="tissue_general in ['lung', 'liver', 'kidney'] and is_primary_data == True",
)
```

## Key Concepts and Best Practices

### Always Filter for Primary Data
Unless analyzing duplicates, always include `is_primary_data == True` in queries to avoid counting cells multiple times:
```python
obs_value_filter="cell_type == 'B cell' and is_primary_data == True"
```

### Specify Census Version for Reproducibility
Always specify the Census version in production analyses:
```python
census = cellxgene_census.open_soma(census_version="2023-07-25")
```

### Estimate Query Size Before Loading
For large queries, first check the number of cells to avoid memory issues:
```python
# Get cell count
metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="tissue_general == 'brain' and is_primary_data == True",
    column_names=["soma_joinid"]
)
n_cells = len(metadata)
print(f"Query will return {n_cells:,} cells")

# If too large (>100k), use out-of-core processing
```
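
The decision that follows the count can be written down as a small helper. This is a sketch under stated assumptions: `choose_strategy` is hypothetical, the 100k threshold is the rule of thumb used in this document, and `nonzero_per_cell` and `bytes_per_value` are illustrative guesses rather than Census-documented figures.

```python
def choose_strategy(n_cells, max_cells=100_000, nonzero_per_cell=2_000, bytes_per_value=4):
    """Pick a loading approach and give a rough size estimate for the values.

    nonzero_per_cell and bytes_per_value are illustrative guesses;
    adjust them to the assay and dtype at hand.
    """
    # Rough size of the stored expression values only (indices not included).
    estimated_bytes = n_cells * nonzero_per_cell * bytes_per_value
    strategy = "get_anndata" if n_cells < max_cells else "axis_query"
    return strategy, estimated_bytes


small = choose_strategy(50_000)
large = choose_strategy(2_000_000)
print(small, large)
```

The estimate is deliberately coarse. Its purpose is to flag queries that are clearly too large before any expression data is transferred.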

### Use tissue_general for Broader Groupings
The `tissue_general` field provides coarser categories than `tissue`, useful for cross-tissue analyses:
```python
# Broader grouping
obs_value_filter="tissue_general == 'immune system'"

# Specific tissue
obs_value_filter="tissue == 'peripheral blood mononuclear cell'"
```

### Select Only Needed Columns
Minimize data transfer by specifying only required metadata columns:
```python
obs_column_names=["cell_type", "tissue_general", "disease"]  # Not all columns
```
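
### Check Dataset Presence for Gene-Specific Queries
When analyzing specific genes, verify which datasets measured them. The presence matrix has one row per dataset and one column per gene, both indexed by `soma_joinid`:
```python
# Sparse matrix: rows are datasets, columns are genes
presence = cellxgene_census.get_presence_matrix(census, "homo_sapiens", "RNA")

# Look up the genes of interest and select their columns
genes = cellxgene_census.get_var(
    census, "homo_sapiens",
    value_filter="feature_name in ['CD4', 'CD8A']"
)
gene_presence = presence[:, genes["soma_joinid"].to_numpy()]
```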
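
The lookup logic can be illustrated with a toy matrix in plain Python. The dataset IDs, gene list, and values below are made up, and the helper functions are hypothetical. The sketch assumes the layout of one row per dataset and one column per gene.

```python
# Toy presence matrix (made-up values): rows are datasets, columns are genes.
dataset_ids = ["d1", "d2", "d3"]
gene_names = ["CD4", "CD8A"]
presence = [
    [1, 1],  # d1 measured both genes
    [1, 0],  # d2 did not measure CD8A
    [0, 0],  # d3 measured neither
]


def datasets_measuring(gene):
    """Return the datasets whose row has a 1 in the gene's column."""
    column = gene_names.index(gene)
    return [d for d, row in zip(dataset_ids, presence) if row[column]]


def datasets_measuring_all(genes):
    """Return the datasets that measured every gene in `genes`."""
    columns = [gene_names.index(g) for g in genes]
    return [d for d, row in zip(dataset_ids, presence) if all(row[c] for c in columns)]


print(datasets_measuring("CD4"))
print(datasets_measuring_all(["CD4", "CD8A"]))
```

Restricting an analysis to datasets that measured all genes of interest avoids treating unmeasured genes as zero expression.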

### Two-Step Workflow: Explore Then Query
First explore metadata to understand available data, then query expression:
```python
# Step 1: Explore what's available
metadata = cellxgene_census.get_obs(
    census, "homo_sapiens",
    value_filter="disease == 'COVID-19' and is_primary_data == True",
    column_names=["cell_type", "tissue_general"]
)
print(metadata.value_counts())

# Step 2: Query based on findings
adata = cellxgene_census.get_anndata(
    census=census,
    organism="Homo sapiens",
    obs_value_filter="disease == 'COVID-19' and cell_type == 'T cell' and is_primary_data == True",
)
```

## Available Metadata Fields

### Cell Metadata (obs)
Key fields for filtering:
- `cell_type`, `cell_type_ontology_term_id`
- `tissue`, `tissue_general`, `tissue_ontology_term_id`
- `disease`, `disease_ontology_term_id`
- `assay`, `assay_ontology_term_id`
- `donor_id`, `sex`, `self_reported_ethnicity`
- `development_stage`, `development_stage_ontology_term_id`
- `dataset_id`
- `is_primary_data` (Boolean: True = unique cell)

### Gene Metadata (var)
- `feature_id` (Ensembl gene ID, e.g., "ENSG00000161798")
- `feature_name` (Gene symbol, e.g., "FOXP2")
- `feature_length` (Gene length in base pairs)

## Reference Documentation

This skill includes detailed reference documentation:

### references/census_schema.md
Comprehensive documentation of:
- Census data structure and organization
- All available metadata fields
- Value filter syntax and operators
- SOMA object types
- Data inclusion criteria

**When to read:** When you need detailed schema information, full list of metadata fields, or complex filter syntax.

### references/common_patterns.md
Examples and patterns for:
- Exploratory queries (metadata only)
- Small-to-medium queries (AnnData)
- Large queries (out-of-core processing)
- PyTorch integration
- Scanpy integration workflows
- Multi-dataset integration
- Best practices and common pitfalls

**When to read:** When implementing specific query patterns, looking for code examples, or troubleshooting common issues.

## Common Use Cases

### Use Case 1: Explore Cell Types in a Tissue
```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    cells = cellxgene_census.get_obs(
        census, "homo_sapiens",
        value_filter="tissue_general == 'lung' and is_primary_data == True",
        column_names=["cell_type"]
    )
    print(cells["cell_type"].value_counts())
```

### Use Case 2: Query Marker Gene Expression
```python
import cellxgene_census

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        var_value_filter="feature_name in ['CD4', 'CD8A', 'CD19']",
        obs_value_filter="cell_type in ['T cell', 'B cell'] and is_primary_data == True",
    )
```

### Use Case 3: Train Cell Type Classifier
```python
from cellxgene_census.experimental.ml import experiment_dataloader

with cellxgene_census.open_soma() as census:
    dataloader = experiment_dataloader(
        census["census_data"]["homo_sapiens"],
        measurement_name="RNA",
        X_name="raw",
        obs_value_filter="is_primary_data == True",
        obs_column_names=["cell_type"],
        batch_size=128,
        shuffle=True,
    )

    # Train model
    for epoch in range(epochs):
        for batch in dataloader:
            # Training logic
            pass
```

### Use Case 4: Cross-Tissue Analysis
```python
import cellxgene_census
import scanpy as sc

with cellxgene_census.open_soma() as census:
    adata = cellxgene_census.get_anndata(
        census=census,
        organism="Homo sapiens",
        obs_value_filter="cell_type == 'macrophage' and tissue_general in ['lung', 'liver', 'brain'] and is_primary_data == True",
    )

# Normalize and log-transform the raw counts before differential expression
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Analyze macrophage differences across tissues
sc.tl.rank_genes_groups(adata, groupby="tissue_general")
```

## Troubleshooting

### Query Returns Too Many Cells
- Add more specific filters to reduce scope
- Use `tissue` instead of `tissue_general` for finer granularity
- Filter by specific `dataset_id` if known
- Switch to out-of-core processing for large queries

### Memory Errors
- Reduce query scope with more restrictive filters
- Select fewer genes with `var_value_filter`
- Use out-of-core processing with `axis_query()`
- Process data in batches

### Duplicate Cells in Results
- Always include `is_primary_data == True` in filters
- Check if intentionally querying across multiple datasets

### Gene Not Found
- Verify gene name spelling (case-sensitive)
- Try Ensembl ID with `feature_id` instead of `feature_name`
- Check dataset presence matrix to see if gene was measured
- Some genes may have been filtered during Census construction

### Version Inconsistencies
- Always specify `census_version` explicitly
- Use same version across all analyses
- Check release notes for version-specific changes