207 lines
5.4 KiB
Markdown
207 lines
5.4 KiB
Markdown
# Standard Scanpy Workflow for Single-Cell Analysis
|
|
|
|
This document outlines the standard workflow for analyzing single-cell RNA-seq data using scanpy.
|
|
|
|
## Complete Analysis Pipeline
|
|
|
|
### 1. Data Loading and Initial Setup
|
|
|
|
```python
|
|
import scanpy as sc
|
|
import pandas as pd
|
|
import numpy as np
|
|
|
|
# Configure scanpy settings
|
|
sc.settings.verbosity = 3 # verbosity: errors (0), warnings (1), info (2), hints (3)
|
|
sc.settings.set_figure_params(dpi=80, facecolor='white')
|
|
|
|
# Load data (various formats)
|
|
adata = sc.read_10x_mtx('path/to/data/') # For 10X data
|
|
# adata = sc.read_h5ad('path/to/data.h5ad') # For h5ad format
|
|
# adata = sc.read_csv('path/to/data.csv') # For CSV format
|
|
```
|
|
|
|
### 2. Quality Control (QC)
|
|
|
|
```python
|
|
# Calculate QC metrics
|
|
sc.pp.calculate_qc_metrics(adata, qc_vars=['mt'], percent_top=None, log1p=False, inplace=True)
|
|
|
|
# Common filtering thresholds (adjust based on dataset)
|
|
sc.pp.filter_cells(adata, min_genes=200)
|
|
sc.pp.filter_genes(adata, min_cells=3)
|
|
|
|
# Remove cells with high mitochondrial content
|
|
adata = adata[adata.obs.pct_counts_mt < 5, :]
|
|
|
|
# Visualize QC metrics
|
|
sc.pl.violin(adata, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
|
|
jitter=0.4, multi_panel=True)
|
|
sc.pl.scatter(adata, x='total_counts', y='pct_counts_mt')
|
|
sc.pl.scatter(adata, x='total_counts', y='n_genes_by_counts')
|
|
```
|
|
|
|
### 3. Normalization
|
|
|
|
```python
|
|
# Normalize to 10,000 counts per cell
|
|
sc.pp.normalize_total(adata, target_sum=1e4)
|
|
|
|
# Log-transform the data
|
|
sc.pp.log1p(adata)
|
|
|
|
# Store normalized data in raw for later use
|
|
adata.raw = adata
|
|
```
|
|
|
|
### 4. Feature Selection
|
|
|
|
```python
|
|
# Identify highly variable genes
|
|
sc.pp.highly_variable_genes(adata, min_mean=0.0125, max_mean=3, min_disp=0.5)
|
|
|
|
# Visualize highly variable genes
|
|
sc.pl.highly_variable_genes(adata)
|
|
|
|
# Subset to highly variable genes
|
|
adata = adata[:, adata.var.highly_variable]
|
|
```
|
|
|
|
### 5. Scaling and Regression
|
|
|
|
```python
|
|
# Regress out effects of total counts per cell and percent mitochondrial genes
|
|
sc.pp.regress_out(adata, ['total_counts', 'pct_counts_mt'])
|
|
|
|
# Scale data to unit variance and zero mean
|
|
sc.pp.scale(adata, max_value=10)
|
|
```
|
|
|
|
### 6. Dimensionality Reduction
|
|
|
|
```python
|
|
# Principal Component Analysis (PCA)
|
|
sc.tl.pca(adata, svd_solver='arpack')
|
|
|
|
# Visualize PCA results
|
|
sc.pl.pca(adata, color='CST3')
|
|
sc.pl.pca_variance_ratio(adata, log=True)
|
|
|
|
# Computing neighborhood graph
|
|
sc.pp.neighbors(adata, n_neighbors=10, n_pcs=40)
|
|
|
|
# UMAP for visualization
|
|
sc.tl.umap(adata)
|
|
|
|
# t-SNE (alternative to UMAP)
|
|
# sc.tl.tsne(adata)
|
|
```
|
|
|
|
### 7. Clustering
|
|
|
|
```python
|
|
# Leiden clustering (recommended)
|
|
sc.tl.leiden(adata, resolution=0.5)
|
|
|
|
# Alternative: Louvain clustering
|
|
# sc.tl.louvain(adata, resolution=0.5)
|
|
|
|
# Visualize clustering results
|
|
sc.pl.umap(adata, color=['leiden'], legend_loc='on data')
|
|
```
|
|
|
|
### 8. Marker Gene Identification
|
|
|
|
```python
|
|
# Find marker genes for each cluster
|
|
sc.tl.rank_genes_groups(adata, 'leiden', method='wilcoxon')
|
|
|
|
# Visualize top marker genes
|
|
sc.pl.rank_genes_groups(adata, n_genes=25, sharey=False)
|
|
|
|
# Get marker gene dataframe
|
|
marker_genes = sc.get.rank_genes_groups_df(adata, group='0')
|
|
|
|
# Visualize specific markers
|
|
sc.pl.umap(adata, color=['leiden', 'CST3', 'NKG7'])
|
|
```
|
|
|
|
### 9. Cell Type Annotation
|
|
|
|
```python
|
|
# Manual annotation based on marker genes
|
|
cluster_annotations = {
|
|
'0': 'CD4 T cells',
|
|
'1': 'CD14+ Monocytes',
|
|
'2': 'B cells',
|
|
'3': 'CD8 T cells',
|
|
# ... add more annotations
|
|
}
|
|
adata.obs['cell_type'] = adata.obs['leiden'].map(cluster_annotations)
|
|
|
|
# Visualize annotated cell types
|
|
sc.pl.umap(adata, color='cell_type', legend_loc='on data')
|
|
```
|
|
|
|
### 10. Saving Results
|
|
|
|
```python
|
|
# Save the processed AnnData object
|
|
adata.write('results/processed_data.h5ad')
|
|
|
|
# Export results to CSV
|
|
adata.obs.to_csv('results/cell_metadata.csv')
|
|
adata.var.to_csv('results/gene_metadata.csv')
|
|
```
|
|
|
|
## Additional Analysis Options
|
|
|
|
### Trajectory Inference
|
|
|
|
```python
|
|
# PAGA (Partition-based graph abstraction)
|
|
sc.tl.paga(adata, groups='leiden')
|
|
sc.pl.paga(adata, color=['leiden'])
|
|
|
|
# Diffusion pseudotime (DPT)
|
|
adata.uns['iroot'] = np.flatnonzero(adata.obs['leiden'] == '0')[0]
|
|
sc.tl.dpt(adata)
|
|
sc.pl.umap(adata, color=['dpt_pseudotime'])
|
|
```
|
|
|
|
### Differential Expression Between Conditions
|
|
|
|
```python
|
|
# Compare conditions within a cell type
|
|
sc.tl.rank_genes_groups(adata, groupby='condition', groups=['treated'],
|
|
reference='control', method='wilcoxon')
|
|
sc.pl.rank_genes_groups(adata, groups=['treated'])
|
|
```
|
|
|
|
### Gene Set Scoring
|
|
|
|
```python
|
|
# Score cells for gene set expression
|
|
gene_set = ['CD3D', 'CD3E', 'CD3G']
|
|
sc.tl.score_genes(adata, gene_set, score_name='T_cell_score')
|
|
sc.pl.umap(adata, color='T_cell_score')
|
|
```
|
|
|
|
## Common Parameters to Adjust
|
|
|
|
- **QC thresholds**: `min_genes`, `min_cells`, `pct_counts_mt` - depends on dataset quality
|
|
- **Normalization target**: Usually 1e4, but can be adjusted
|
|
- **HVG parameters**: Affects feature selection stringency
|
|
- **PCA components**: Check variance ratio plot to determine optimal number
|
|
- **Clustering resolution**: Higher values give more clusters (typically 0.4-1.2)
|
|
- **n_neighbors**: Affects granularity of UMAP and clustering (typically 10-30)
|
|
|
|
## Best Practices
|
|
|
|
1. Always visualize QC metrics before filtering
|
|
2. Save raw counts before normalization (`adata.raw = adata`)
|
|
3. Use Leiden instead of Louvain for clustering (more efficient)
|
|
4. Try multiple clustering resolutions to find optimal granularity
|
|
5. Validate cell type annotations with known marker genes
|
|
6. Save intermediate results at key steps
|