# AnnData Object Structure

The AnnData object stores a data matrix together with its annotations, providing a flexible framework for managing experimental data and metadata.

## Core Components

### X (Data Matrix)
The primary data matrix, with shape `(n_obs, n_vars)`, storing the experimental measurements.

```python
import anndata as ad
import numpy as np

# Create with dense array
adata = ad.AnnData(X=np.random.rand(100, 2000))

# Create with sparse matrix (recommended for large, sparse data)
from scipy.sparse import csr_matrix
counts = np.random.poisson(0.1, size=(100, 2000))  # mostly zeros, so sparse storage pays off
sparse_data = csr_matrix(counts, dtype=np.float32)
adata = ad.AnnData(X=sparse_data)
```

Access data:
```python
# Full matrix (caution with large datasets)
full_data = adata.X

# Single observation
obs_data = adata.X[0, :]

# Single variable across all observations
var_data = adata.X[:, 0]
```
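
Whether sparse storage actually saves memory depends on how many entries are zero. The following sketch compares the two representations on simulated count data (the Poisson rate and matrix size are arbitrary assumptions) and shows that indexing a sparse matrix returns a sparse slice rather than a flat NumPy array:

```python
import numpy as np
from scipy.sparse import csr_matrix

rng = np.random.default_rng(0)

# Simulated counts: with a Poisson rate of 0.1, roughly 90% of entries are zero
dense = rng.poisson(0.1, size=(100, 2000)).astype(np.float32)
sparse = csr_matrix(dense)

# Memory used by the dense array vs. the three arrays backing the CSR matrix
dense_bytes = dense.nbytes
sparse_bytes = sparse.data.nbytes + sparse.indices.nbytes + sparse.indptr.nbytes
print(f"dense: {dense_bytes} bytes, sparse: {sparse_bytes} bytes")

# Row indexing on a CSR matrix yields a 1 x n_vars sparse matrix;
# convert explicitly when a flat NumPy vector is needed
row = sparse[0, :]
row_vector = row.toarray().ravel()
print(row.shape, row_vector.shape)
```

For data that is mostly nonzero, the index arrays can make the sparse form larger than the dense one, so it is worth checking before converting.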

### obs (Observation Annotations)
DataFrame storing metadata about observations (the rows of X). Each row of `obs` corresponds to one row of X, and its index holds the observation names.

```python
import pandas as pd

# Create AnnData with observation metadata
obs_df = pd.DataFrame({
    'cell_type': ['T cell', 'B cell', 'Monocyte'],
    'treatment': ['control', 'treated', 'control'],
    'timepoint': [0, 24, 24]
}, index=['cell_1', 'cell_2', 'cell_3'])

adata = ad.AnnData(X=np.random.rand(3, 100), obs=obs_df)

# Access observation metadata
print(adata.obs['cell_type'])
print(adata.obs.loc['cell_1'])
```

### var (Variable Annotations)
DataFrame storing metadata about variables (the columns of X). Each row of `var` corresponds to one column of X, and its index holds the variable names.

```python
# Create AnnData with variable metadata
var_df = pd.DataFrame({
    'gene_name': ['ACTB', 'GAPDH', 'TP53'],
    'chromosome': ['7', '12', '17'],
    'highly_variable': [True, False, True]
}, index=['ENSG00001', 'ENSG00002', 'ENSG00003'])

adata = ad.AnnData(X=np.random.rand(100, 3), var=var_df)

# Access variable metadata
print(adata.var['gene_name'])
print(adata.var.loc['ENSG00001'])
```

### layers (Alternative Data Representations)
Dictionary storing alternative matrices with the same dimensions as X.

```python
# Store raw counts, normalized data, and scaled data
adata = ad.AnnData(X=np.random.rand(100, 2000))
adata.layers['raw_counts'] = np.random.randint(0, 100, (100, 2000))
# Scale each observation so its values sum to 1
adata.layers['normalized'] = adata.X / np.sum(adata.X, axis=1, keepdims=True)
# Z-score each variable (column) across observations
adata.layers['scaled'] = (adata.X - adata.X.mean(axis=0)) / adata.X.std(axis=0)

# Access layers
raw_data = adata.layers['raw_counts']
normalized_data = adata.layers['normalized']
```

Common layer uses:
- `raw_counts`: Original count data before normalization
- `normalized`: Log-normalized or TPM values
- `scaled`: Z-scored values for analysis
- `imputed`: Data after imputation

### obsm (Multi-dimensional Observation Annotations)
Dictionary storing multi-dimensional arrays aligned to observations. Each array has `n_obs` rows.

```python
# Store PCA coordinates and UMAP embeddings
adata.obsm['X_pca'] = np.random.rand(100, 50)  # 50 principal components
adata.obsm['X_umap'] = np.random.rand(100, 2)  # 2D UMAP coordinates
adata.obsm['X_tsne'] = np.random.rand(100, 2)  # 2D t-SNE coordinates

# Access embeddings
pca_coords = adata.obsm['X_pca']
umap_coords = adata.obsm['X_umap']
```

Common obsm uses:
- `X_pca`: Principal component coordinates
- `X_umap`: UMAP embedding coordinates
- `X_tsne`: t-SNE embedding coordinates
- `X_diffmap`: Diffusion map coordinates
- `protein_expression`: Protein abundance measurements (CITE-seq)

### varm (Multi-dimensional Variable Annotations)
Dictionary storing multi-dimensional arrays aligned to variables. Each array has `n_vars` rows.

```python
# Store PCA loadings
adata.varm['PCs'] = np.random.rand(2000, 50)  # Loadings for 50 components
adata.varm['gene_modules'] = np.random.rand(2000, 10)  # Gene module membership scores

# Access loadings
pc_loadings = adata.varm['PCs']
```

Common varm uses:
- `PCs`: Principal component loadings
- `gene_modules`: Gene co-expression module membership scores

### obsp (Pairwise Observation Relationships)
Dictionary storing square matrices of shape `(n_obs, n_obs)`, usually sparse, representing pairwise relationships between observations.

```python
from scipy.sparse import csr_matrix

# Store k-nearest neighbor graph
n_obs = 100
knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95)
adata.obsp['connectivities'] = knn_graph
adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs))

# Access graphs
knn_connections = adata.obsp['connectivities']
distances = adata.obsp['distances']
```

Common obsp uses:
- `connectivities`: Cell-cell neighborhood graph
- `distances`: Pairwise distances between cells

### varp (Pairwise Variable Relationships)
Dictionary storing square matrices of shape `(n_vars, n_vars)`, usually sparse, representing pairwise relationships between variables.

```python
# Store gene-gene correlation matrix
n_vars = 2000
gene_corr = csr_matrix(np.random.rand(n_vars, n_vars) > 0.99)
adata.varp['correlations'] = gene_corr

# Access correlations
gene_correlations = adata.varp['correlations']
```

### uns (Unstructured Annotations)
Dictionary storing arbitrary unstructured metadata.

```python
# Store analysis parameters and results
adata.uns['experiment_date'] = '2025-11-03'
adata.uns['pca'] = {
    'variance_ratio': [0.15, 0.10, 0.08],
    'params': {'n_comps': 50}
}
adata.uns['neighbors'] = {
    'params': {'n_neighbors': 15, 'method': 'umap'},
    'connectivities_key': 'connectivities'
}

# Access unstructured data
exp_date = adata.uns['experiment_date']
pca_params = adata.uns['pca']['params']
```

Common uns uses:
- Analysis parameters and settings
- Color palettes for plotting
- Cluster information
- Tool-specific metadata

### raw (Original Data Snapshot)
Optional attribute preserving the original data matrix and variable annotations before filtering.

```python
# Create AnnData and store raw state
adata = ad.AnnData(X=np.random.rand(100, 5000))
adata.var['gene_name'] = [f'Gene_{i}' for i in range(5000)]

# Store raw state before filtering
adata.raw = adata.copy()

# Filter to highly variable genes
highly_variable_mask = np.random.rand(5000) > 0.5
adata = adata[:, highly_variable_mask]

# Access original data
original_matrix = adata.raw.X
original_var = adata.raw.var
```

## Object Properties

```python
# Dimensions
n_observations = adata.n_obs
n_variables = adata.n_vars
shape = adata.shape  # (n_obs, n_vars)

# Index information
obs_names = adata.obs_names  # Observation identifiers
var_names = adata.var_names  # Variable identifiers

# Storage mode
is_view = adata.is_view  # True if this is a view of another object
is_backed = adata.isbacked  # True if backed by on-disk storage
filename = adata.filename  # Path to backing file (if backed)
```

## Creating AnnData Objects

### From arrays and DataFrames
```python
import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({'cell_type': ['A', 'B'] * 50}, index=[f'cell_{i}' for i in range(100)])
var = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]}, index=[f'ENSG{i:05d}' for i in range(2000)])
adata = ad.AnnData(X=X, obs=obs, var=var)

# With all components
adata = ad.AnnData(
    X=X,
    obs=obs,
    var=var,
    layers={'raw': np.random.randint(0, 100, (100, 2000))},
    obsm={'X_pca': np.random.rand(100, 50)},
    uns={'experiment': 'test'}
)
```

### From DataFrame
```python
# Create from pandas DataFrame (genes as columns, cells as rows)
df = pd.DataFrame(
    np.random.rand(100, 50),
    columns=[f'Gene_{i}' for i in range(50)],
    index=[f'Cell_{i}' for i in range(100)]
)
adata = ad.AnnData(df)
```

## Data Access Patterns

### Vector extraction
```python
# Get observation annotation as array
cell_types = adata.obs_vector('cell_type')

# Get variable values across observations
gene_expression = adata.obs_vector('ACTB')  # If ACTB is in var_names

# Get variable annotation as array
gene_names = adata.var_vector('gene_name')
```

### Subsetting
```python
# By index
subset = adata[0:10, 0:100]  # First 10 obs, first 100 vars

# By name
subset = adata[['cell_1', 'cell_2'], ['ACTB', 'GAPDH']]

# By boolean mask
high_count_cells = adata.obs['total_counts'] > 1000
subset = adata[high_count_cells, :]

# By observation metadata
t_cells = adata[adata.obs['cell_type'] == 'T cell']
```

## Memory Considerations

The AnnData structure is designed for memory efficiency:
- Sparse matrices reduce memory for sparse data
- Views avoid copying data when possible
- Backed mode enables working with data larger than RAM
- Categorical annotations reduce memory for discrete values

```python
# Convert a single string column to categorical (more memory efficient)
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')
# Or convert all string columns in obs and var at once
adata.strings_to_categoricals()

# Check if object is a view (doesn't own data)
if adata.is_view:
    adata = adata.copy()  # Create independent copy
```
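
The saving from categorical annotations can be checked with pandas alone. The sketch below uses simulated labels (the label set and row count are arbitrary) and compares the memory footprint of the same column stored as Python strings and as a categorical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
labels = rng.choice(['T cell', 'B cell', 'Monocyte'], size=100_000)

as_strings = pd.Series(labels, dtype=object)
as_category = as_strings.astype('category')

# deep=True counts the string objects themselves, not just the pointers
string_bytes = as_strings.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(f"strings: {string_bytes} bytes, categorical: {category_bytes} bytes")
```

The gain is largest when a column has many rows but only a handful of distinct values, which is typical of annotations such as cell type or batch.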