---
name: anndata
description: This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
---

# AnnData

## Overview

AnnData is a Python package for handling annotated data matrices. It stores experimental measurements (`X`) alongside observation metadata (`obs`), variable metadata (`var`), and multi-dimensional annotations (`obsm`, `varm`, `obsp`, `varp`, `uns`). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data that requires efficient storage, manipulation, and analysis.
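
A minimal sketch of this anatomy (the slot names are real AnnData attributes; the example values are invented):

```python
import anndata as ad
import numpy as np

adata = ad.AnnData(np.random.rand(50, 10))  # 50 observations × 10 variables

# Each slot holds a different shape of annotation
adata.layers['counts'] = np.random.poisson(1.0, adata.shape)  # same shape as X
adata.obsm['X_pca'] = np.random.rand(adata.n_obs, 2)          # per-observation matrix
adata.varm['loadings'] = np.random.rand(adata.n_vars, 2)      # per-variable matrix
adata.uns['params'] = {'normalized': False}                   # unstructured metadata

print(adata)  # summary lists X, layers, obsm, varm, and uns
```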

## When to Use This Skill

Use this skill when:
- Creating, reading, or writing AnnData objects
- Working with h5ad, zarr, or other genomics data formats
- Performing single-cell RNA-seq analysis
- Managing large datasets with sparse matrices or backed mode
- Concatenating multiple datasets or experimental batches
- Subsetting, filtering, or transforming annotated data
- Integrating with scanpy, scvi-tools, or other scverse ecosystem tools

## Installation

```bash
uv pip install anndata

# With optional dependencies (quoted so shells like zsh don't expand the brackets)
uv pip install "anndata[dev,test,doc]"
```

## Quick Start

### Creating an AnnData object

```python
import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({
    'cell_type': ['T cell', 'B cell'] * 50,
    'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])

var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])

adata = ad.AnnData(X=X, obs=obs, var=var)
```

### Reading data

```python
# Read h5ad file
adata = ad.read_h5ad('data.h5ad')

# Read in backed mode (for large files)
adata = ad.read_h5ad('large_data.h5ad', backed='r')

# Read other formats
adata = ad.read_csv('data.csv')
adata = ad.read_loom('data.loom')

# 10x Genomics HDF5 output is read through scanpy, not anndata itself
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
```

### Writing data

```python
# Write h5ad file
adata.write_h5ad('output.h5ad')

# Write with compression
adata.write_h5ad('output.h5ad', compression='gzip')

# Write other formats
adata.write_zarr('output.zarr')
adata.write_csvs('output_dir/')
```

### Basic operations

```python
# Subset by conditions
t_cells = adata[adata.obs['cell_type'] == 'T cell']

# Subset by indices
subset = adata[0:50, 0:100]

# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")
```

## Core Capabilities

### 1. Data Structure

Understand the AnnData object structure, including the X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components.

**See**: `references/data_structure.md` for comprehensive information on:
- Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
- Creating AnnData objects from various sources
- Accessing and manipulating data components
- Memory-efficient practices
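
Since this subsection has no snippet of its own, here is a short sketch of component access (it assumes the `adata` built in the Quick Start above):

```python
# Core components are plain numpy/pandas objects
print(adata.X.shape)        # main data matrix
print(adata.obs.head())     # per-observation metadata (pandas DataFrame)
print(adata.var_names[:5])  # variable index (pandas Index)

# Convert X, or a named layer, into a labeled DataFrame
# (note: this densifies a sparse X)
df = adata.to_df()
# df = adata.to_df(layer='counts')
```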

### 2. Input/Output Operations

Read and write data in various formats, with support for compression, backed mode, and cloud storage.

**See**: `references/io_operations.md` for details on:
- Native formats (h5ad, zarr)
- Alternative formats (CSV, MTX, Loom, 10X, Excel)
- Backed mode for large datasets
- Remote data access
- Format conversion
- Performance optimization

Common commands:

```python
# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')

# Read 10X data (via scanpy)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read MTX format (transposed, since MTX files are usually genes × cells)
adata = ad.read_mtx('matrix.mtx').T
```
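
Zarr and remote access, listed above, are absent from the snippet; a hedged sketch (the local calls are standard anndata I/O; the remote lines assume an fsspec-compatible store such as s3fs):

```python
# Zarr round-trip
adata.write_zarr('output.zarr')
adata = ad.read_zarr('output.zarr')

# Remote zarr store via fsspec (assumes s3fs is installed and the bucket exists)
import fsspec
store = fsspec.get_mapper('s3://bucket/dataset.zarr')
adata = ad.read_zarr(store)
```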

### 3. Concatenation

Combine multiple AnnData objects along observations or variables with flexible join strategies.

**See**: `references/concatenation.md` for comprehensive coverage of:
- Basic concatenation (axis=0 for observations, axis=1 for variables)
- Join types (inner, outer)
- Merge strategies (same, unique, first, only)
- Tracking data sources with labels
- Lazy concatenation (AnnCollection)
- On-disk concatenation for large datasets

Common commands:

```python
# Concatenate observations (combine samples)
adata = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    join='inner',
    label='batch',
    keys=['batch1', 'batch2', 'batch3']
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation (AnnCollection expects AnnData objects, so open the files first)
from anndata.experimental import AnnCollection
adatas = [ad.read_h5ad(f, backed='r') for f in ['data1.h5ad', 'data2.h5ad']]
collection = AnnCollection(adatas, join_obs='outer', label='dataset')
```
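
The merge strategies and on-disk concatenation listed above are not shown in the snippet; a brief sketch (the `merge` values are the ones `ad.concat` accepts, and `concat_on_disk` lives in `anndata.experimental` in recent releases):

```python
# `merge` controls annotations aligned to the non-concatenated axis (e.g. var columns)
merged = ad.concat([adata1, adata2], merge='same')    # keep only values identical everywhere
merged = ad.concat([adata1, adata2], merge='unique')  # keep keys with a single unique value
merged = ad.concat([adata1, adata2], merge='first')   # take the first occurrence

# On-disk concatenation: inputs and output stay on disk
from anndata.experimental import concat_on_disk
concat_on_disk(['data1.h5ad', 'data2.h5ad'], 'combined.h5ad')
```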

### 4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently.

**See**: `references/manipulation.md` for detailed guidance on:
- Subsetting (by indices, names, boolean masks, metadata conditions)
- Transposition
- Copying (full copies vs views)
- Renaming (observations, variables, categories)
- Type conversions (strings to categoricals, sparse/dense)
- Adding/removing data components
- Reordering
- Quality control filtering

Common commands:

```python
# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]

# Transpose
adata_T = adata.T

# Copy vs view
view = adata[0:100, :]         # View (lightweight reference)
copy = adata[0:100, :].copy()  # Independent copy

# Convert strings to categoricals
adata.strings_to_categoricals()
```
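
Renaming, listed above, is not covered by the snippet; a hedged sketch using the index setters and the standard pandas categorical API:

```python
# Rename observations or variables by assigning new indices
adata.obs_names = [f'cell_{i}' for i in range(adata.n_obs)]

# Rename categories of a categorical obs column
adata.obs['cell_type'] = adata.obs['cell_type'].cat.rename_categories(
    {'T cell': 'T', 'B cell': 'B'}
)
```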

### 5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.

**See**: `references/best_practices.md` for guidelines on:
- Memory management (sparse matrices, categoricals, backed mode)
- Views vs copies
- Data storage optimization
- Performance optimization
- Working with raw data
- Metadata management
- Reproducibility
- Error handling
- Integration with other tools
- Common pitfalls and solutions

Key recommendations:

```python
# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

# Convert strings to categoricals
adata.strings_to_categoricals()

# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')

# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]
```
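
One of the listed pitfalls deserves a concrete sketch: writing to a view. AnnData emits an `ImplicitModificationWarning` and silently copies, so take an explicit copy when you intend to modify a subset:

```python
subset = adata[adata.obs['cell_type'] == 'T cell']  # a view
print(subset.is_view)  # True

# subset.obs['flag'] = True  # would warn and materialize a copy implicitly

subset = adata[adata.obs['cell_type'] == 'T cell'].copy()
subset.obs['flag'] = True  # explicit copy: no warning, parent untouched
```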

## Integration with Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:

### Scanpy (Single-cell analysis)

```python
import scanpy as sc

# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])
```

### Muon (Multimodal data)

```python
import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})
```
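
Each modality remains an ordinary AnnData object; a short usage sketch (assuming the `mdata` built above and muon's documented accessors):

```python
rna = mdata['rna']  # access a modality by name; returns the AnnData
print(mdata.n_obs)  # observations tracked across modalities
mdata.update()      # sync global obs/var after editing a modality
```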

### PyTorch integration

```python
from anndata.experimental import AnnLoader

# Create a DataLoader-style iterator for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

for batch in dataloader:
    X = batch.X  # converted to a torch tensor by AnnLoader's default converter
    # ... train model on X
```

## Common Workflows

### Single-cell RNA-seq analysis

```python
import anndata as ad
import numpy as np
import scanpy as sc

# 1. Load data (10x HDF5 is read via scanpy)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control (flatten the sparse-matrix sums into 1-D arrays)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['n_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata = adata[adata.obs['n_genes'] > 200]
adata = adata[adata.obs['n_counts'] < 50000].copy()  # materialize before in-place ops

# 3. Store raw
adata.raw = adata.copy()

# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]

# 5. Save processed data
adata.write_h5ad('processed.h5ad')
```

### Batch integration

```python
import anndata as ad
import scanpy as sc

# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner'
)

# Apply batch correction
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```

### Working with large datasets

```python
# Open in backed mode (X stays on disk)
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter on metadata without loading X
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load only the filtered subset into memory
adata_subset = high_quality.to_memory()

# Process the subset (process() is a placeholder for your own analysis)
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i + chunk_size, :].to_memory()
    process(chunk)
```

## Troubleshooting

### Out of memory errors

Use backed mode or convert to sparse matrices:

```python
# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
```

### Slow file reading

Use compression and appropriate formats:

```python
# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))
```

### Index alignment issues

Always align external data on the index; assigning a Series with a mismatched index silently produces NaNs:

```python
# Wrong: relies on positional order or a mismatched index
adata.obs['new_col'] = external_data['values']

# Correct: align explicitly on obs_names
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```

## Additional Resources

- **Official documentation**: https://anndata.readthedocs.io/
- **Scanpy tutorials**: https://scanpy.readthedocs.io/
- **Scverse ecosystem**: https://scverse.org/
- **GitHub repository**: https://github.com/scverse/anndata