# Concatenating AnnData Objects

Combine multiple AnnData objects along either the observations or the variables axis.

## Basic Concatenation

### Concatenate along observations (stack cells/samples)

```python
import anndata as ad
import numpy as np

# Create multiple AnnData objects
adata1 = ad.AnnData(X=np.random.rand(100, 50))
adata2 = ad.AnnData(X=np.random.rand(150, 50))
adata3 = ad.AnnData(X=np.random.rand(200, 50))

# Concatenate along observations (axis=0, the default)
adata_combined = ad.concat([adata1, adata2, adata3], axis=0)

print(adata_combined.shape)  # (450, 50)
```

### Concatenate along variables (stack genes/features)

```python
# Create objects with same observations, different variables
adata1 = ad.AnnData(X=np.random.rand(100, 50))
adata2 = ad.AnnData(X=np.random.rand(100, 30))
adata3 = ad.AnnData(X=np.random.rand(100, 70))

# Concatenate along variables (axis=1)
adata_combined = ad.concat([adata1, adata2, adata3], axis=1)

print(adata_combined.shape)  # (100, 150)
```

## Join Types

### Inner join (intersection)

Keep only variables/observations present in all objects.

```python
import pandas as pd

# Create objects with different variables
adata1 = ad.AnnData(
    X=np.random.rand(100, 50),
    var=pd.DataFrame(index=[f'Gene_{i}' for i in range(50)])
)
adata2 = ad.AnnData(
    X=np.random.rand(150, 60),
    var=pd.DataFrame(index=[f'Gene_{i}' for i in range(10, 70)])
)

# Inner join: only the overlapping genes Gene_10 to Gene_49 are kept
adata_inner = ad.concat([adata1, adata2], join='inner')
print(adata_inner.n_vars)  # 40 genes (overlap)
```

### Outer join (union)

Keep all variables/observations, filling missing values.

```python
# Outer join: all genes are kept
adata_outer = ad.concat([adata1, adata2], join='outer')
print(adata_outer.n_vars)  # 70 genes (union)

# Missing values are filled with appropriate defaults:
# - 0 for sparse matrices
# - NaN for dense matrices
```

### Fill values for outer joins

```python
# Specify fill value for missing data
adata_filled = ad.concat([adata1, adata2], join='outer', fill_value=0)
```

## Tracking Data Sources

### Add batch labels

```python
# Label which object each observation came from
adata_combined = ad.concat(
    [adata1, adata2, adata3],
    label='batch',                       # Column name for labels
    keys=['batch1', 'batch2', 'batch3']  # Labels for each object
)

print(adata_combined.obs['batch'].value_counts())
# batch3    200
# batch2    150
# batch1    100
```

### Automatic batch labels

```python
# If keys are not provided, integer indices are used
adata_combined = ad.concat(
    [adata1, adata2, adata3],
    label='dataset'
)
# dataset column contains: 0, 1, 2
```

## Merge Strategies

Control how metadata from different objects is combined using the `merge` parameter.

### merge=None (default)

Exclude metadata on the non-concatenation axis.

```python
# When concatenating observations, merge controls what happens to var metadata
adata1.var['gene_type'] = 'protein_coding'
adata2.var['gene_type'] = 'protein_coding'

# With merge=None, var columns are dropped entirely;
# only the var index (the variable names) is kept
adata_combined = ad.concat([adata1, adata2], merge=None)
```

### merge='same'

Keep metadata that is identical across all objects.

```python
# Assumes adata1 and adata2 share the same 50 variables
adata1.var['chromosome'] = ['chr1'] * 25 + ['chr2'] * 25
adata2.var['chromosome'] = ['chr1'] * 25 + ['chr2'] * 25
adata1.var['type'] = 'protein_coding'
adata2.var['type'] = 'lncRNA'  # Different

# 'chromosome' is kept (same), 'type' is excluded (different)
adata_combined = ad.concat([adata1, adata2], merge='same')
```

### merge='unique'

Keep metadata columns where each key has exactly one unique value.

```python
adata1.var['gene_id'] = [f'ENSG{i:05d}' for i in range(50)]
adata2.var['gene_id'] = [f'ENSG{i:05d}' for i in range(50)]

# gene_id is kept (a single unique value for each variable)
adata_combined = ad.concat([adata1, adata2], merge='unique')
```

### merge='first'

Take values from the first object containing each key.

```python
adata1.var['description'] = ['Desc1'] * 50
adata2.var['description'] = ['Desc2'] * 50

# Uses descriptions from adata1
adata_combined = ad.concat([adata1, adata2], merge='first')
```

### merge='only'

Keep metadata that appears in only one object.

```python
adata1.var['adata1_specific'] = [1] * 50
adata2.var['adata2_specific'] = [2] * 50

# Both columns are kept, since each appears in exactly one object
adata_combined = ad.concat([adata1, adata2], merge='only')
```

## Handling Index Conflicts

### Make indices unique

```python
import pandas as pd

# Create objects with overlapping observation names
adata1 = ad.AnnData(
    X=np.random.rand(3, 10),
    obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3'])
)
adata2 = ad.AnnData(
    X=np.random.rand(3, 10),
    obs=pd.DataFrame(index=['cell_1', 'cell_2', 'cell_3'])
)

# Make indices unique by appending batch keys
adata_combined = ad.concat(
    [adata1, adata2],
    label='batch',
    keys=['batch1', 'batch2'],
    index_unique='_'  # Separator between index and key
)

print(adata_combined.obs_names)
# ['cell_1_batch1', 'cell_2_batch1', 'cell_3_batch1',
#  'cell_1_batch2', 'cell_2_batch2', 'cell_3_batch2']
```

## Concatenating Layers

```python
# Objects with layers
adata1 = ad.AnnData(X=np.random.rand(100, 50))
adata1.layers['normalized'] = np.random.rand(100, 50)
adata1.layers['scaled'] = np.random.rand(100, 50)

adata2 = ad.AnnData(X=np.random.rand(150, 50))
adata2.layers['normalized'] = np.random.rand(150, 50)
adata2.layers['scaled'] = np.random.rand(150, 50)

# Layers are concatenated automatically if present in all objects
adata_combined = ad.concat([adata1, adata2])

print(list(adata_combined.layers.keys()))
# ['normalized', 'scaled']
```

## Concatenating Multi-dimensional Annotations

### obsm/varm

```python
# Objects with embeddings
adata1.obsm['X_pca'] = np.random.rand(100, 50)
adata2.obsm['X_pca'] = np.random.rand(150, 50)

# obsm is concatenated along the observation axis
adata_combined = ad.concat([adata1, adata2])
print(adata_combined.obsm['X_pca'].shape)  # (250, 50)
```

### obsp/varp (pairwise annotations)

```python
from scipy.sparse import csr_matrix

# Pairwise matrices
adata1.obsp['connectivities'] = csr_matrix((100, 100))
adata2.obsp['connectivities'] = csr_matrix((150, 150))

# By default, obsp is NOT concatenated (set pairwise=True to include it)
adata_combined = ad.concat([adata1, adata2])
# adata_combined.obsp is empty

# Include pairwise data (creates a block-diagonal matrix)
adata_combined = ad.concat([adata1, adata2], pairwise=True)
print(adata_combined.obsp['connectivities'].shape)  # (250, 250)
```

## Concatenating uns (unstructured)

Unstructured metadata is merged recursively:

```python
adata1.uns['experiment'] = {'date': '2025-01-01', 'batch': 'A'}
adata2.uns['experiment'] = {'date': '2025-01-01', 'batch': 'B'}

# Apply a merge strategy to uns via uns_merge
adata_combined = ad.concat([adata1, adata2], uns_merge='unique')
# 'date' is kept (same value), 'batch' is excluded (values differ)
```

## Lazy Concatenation (AnnCollection)

For very large datasets, use lazy concatenation that doesn't load all data:

```python
from anndata.experimental import AnnCollection

# AnnCollection takes AnnData objects, not file paths,
# so open the files in backed mode first (X stays on disk)
files = ['data1.h5ad', 'data2.h5ad', 'data3.h5ad']
adatas = [ad.read_h5ad(f, backed='r') for f in files]

collection = AnnCollection(
    adatas,
    join_obs='outer',
    join_vars='inner',
    label='dataset',
    keys=['dataset1', 'dataset2', 'dataset3']
)

# Access data lazily
print(collection.n_obs)       # Total observations
print(collection.obs.head())  # Metadata loaded, not X

# Convert to a regular AnnData when needed (loads all data)
adata = collection.to_adata()
```

### Working with AnnCollection

```python
# Subset lazily (returns a view without loading X)
subset = collection[collection.obs['cell_type'] == 'T cell']

# The underlying AnnData objects stay accessible via .adatas
for adata in collection.adatas:
    print(adata.shape)

# Access a specific dataset
first_dataset = collection.adatas[0]
```

## Concatenation on Disk

For datasets too large for memory, concatenate directly on disk:

```python
from anndata.experimental import concat_on_disk

# Concatenate without loading into memory
concat_on_disk(
    ['data1.h5ad', 'data2.h5ad', 'data3.h5ad'],
    'combined.h5ad',
    join='outer'
)

# Load the result in backed mode
adata = ad.read_h5ad('combined.h5ad', backed='r')
```

## Common Concatenation Patterns

### Combine technical replicates

```python
# Multiple runs of the same samples
replicates = [adata_run1, adata_run2, adata_run3]
adata_combined = ad.concat(
    replicates,
    label='technical_replicate',
    keys=['rep1', 'rep2', 'rep3'],
    join='inner'  # Keep only genes measured in all runs
)
```

### Combine batches from an experiment

```python
# Different experimental batches
batches = [adata_batch1, adata_batch2, adata_batch3]
adata_combined = ad.concat(
    batches,
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='outer'  # Keep all genes
)

# Later: apply batch correction
```

### Merge multi-modal data

```python
# Different measurement modalities (e.g., RNA + protein)
# measured on the same observations
adata_rna = ad.AnnData(X=np.random.rand(100, 2000))
adata_protein = ad.AnnData(X=np.random.rand(100, 50))

# Concatenate along variables to combine modalities
adata_multimodal = ad.concat([adata_rna, adata_protein], axis=1)

# Add labels to distinguish modalities
adata_multimodal.var['modality'] = ['RNA'] * 2000 + ['protein'] * 50
```

## Best Practices

1. **Check compatibility before concatenating**

```python
# Verify shapes are compatible
print([adata.n_vars for adata in [adata1, adata2, adata3]])

# Check variable names match
print([set(adata.var_names) for adata in [adata1, adata2, adata3]])
```

2. **Use appropriate join type**

- `inner`: When you need the same features across all samples (most stringent)
- `outer`: When you want to preserve all features (most inclusive)

3. **Track data sources**

Always use `label` and `keys` to track which observations came from which dataset.

4. **Consider memory usage**

- For large datasets, use `AnnCollection` or `concat_on_disk`
- Consider backed mode for the result

5. **Handle batch effects**

Concatenation combines data but doesn't correct for batch effects. Apply batch correction after concatenation:

```python
# After concatenation, apply batch correction
import scanpy as sc
sc.pp.combat(adata_combined, key='batch')
```

6. **Validate results**

```python
# Check dimensions
print(adata_combined.shape)

# Check batch distribution
print(adata_combined.obs['batch'].value_counts())

# Verify metadata integrity
print(adata_combined.var.head())
print(adata_combined.obs.head())
```