Files
gh-k-dense-ai-claude-scient…/skills/anndata/references/data_structure.md
2025-11-30 08:30:10 +08:00

8.5 KiB

AnnData Object Structure

The AnnData object stores a data matrix with associated annotations, providing a flexible framework for managing experimental data and metadata.

Core Components

X (Data Matrix)

The primary data matrix with shape (n_obs, n_vars) storing experimental measurements.

import anndata as ad
import numpy as np

# Create with dense array
adata = ad.AnnData(X=np.random.rand(100, 2000))

# Create with sparse matrix (recommended for large, sparse data)
from scipy.sparse import csr_matrix
sparse_data = csr_matrix(np.random.rand(100, 2000))
adata = ad.AnnData(X=sparse_data)

Access data:

# Full matrix (caution with large datasets)
full_data = adata.X

# Single observation
obs_data = adata.X[0, :]

# Single variable across all observations
var_data = adata.X[:, 0]

obs (Observation Annotations)

DataFrame storing metadata about observations (rows). Each row corresponds to one observation in X.

import pandas as pd

# Create AnnData with observation metadata
obs_df = pd.DataFrame({
    'cell_type': ['T cell', 'B cell', 'Monocyte'],
    'treatment': ['control', 'treated', 'control'],
    'timepoint': [0, 24, 24]
}, index=['cell_1', 'cell_2', 'cell_3'])

adata = ad.AnnData(X=np.random.rand(3, 100), obs=obs_df)

# Access observation metadata
print(adata.obs['cell_type'])
print(adata.obs.loc['cell_1'])

var (Variable Annotations)

DataFrame storing metadata about variables (columns). Each row corresponds to one variable in X.

# Create AnnData with variable metadata
var_df = pd.DataFrame({
    'gene_name': ['ACTB', 'GAPDH', 'TP53'],
    'chromosome': ['7', '12', '17'],
    'highly_variable': [True, False, True]
}, index=['ENSG00001', 'ENSG00002', 'ENSG00003'])

adata = ad.AnnData(X=np.random.rand(100, 3), var=var_df)

# Access variable metadata
print(adata.var['gene_name'])
print(adata.var.loc['ENSG00001'])

layers (Alternative Data Representations)

Dictionary storing alternative matrices with the same dimensions as X.

# Store raw counts, normalized data, and scaled data
adata = ad.AnnData(X=np.random.rand(100, 2000))
adata.layers['raw_counts'] = np.random.randint(0, 100, (100, 2000))
adata.layers['normalized'] = adata.X / np.sum(adata.X, axis=1, keepdims=True)
adata.layers['scaled'] = (adata.X - adata.X.mean()) / adata.X.std()

# Access layers
raw_data = adata.layers['raw_counts']
normalized_data = adata.layers['normalized']

Common layer uses:

  • raw_counts: Original count data before normalization
  • normalized: Log-normalized or TPM values
  • scaled: Z-scored values for analysis
  • imputed: Data after imputation

obsm (Multi-dimensional Observation Annotations)

Dictionary storing multi-dimensional arrays aligned to observations.

# Store PCA coordinates and UMAP embeddings
adata.obsm['X_pca'] = np.random.rand(100, 50)  # 50 principal components
adata.obsm['X_umap'] = np.random.rand(100, 2)  # 2D UMAP coordinates
adata.obsm['X_tsne'] = np.random.rand(100, 2)  # 2D t-SNE coordinates

# Access embeddings
pca_coords = adata.obsm['X_pca']
umap_coords = adata.obsm['X_umap']

Common obsm uses:

  • X_pca: Principal component coordinates
  • X_umap: UMAP embedding coordinates
  • X_tsne: t-SNE embedding coordinates
  • X_diffmap: Diffusion map coordinates
  • protein_expression: Protein abundance measurements (CITE-seq)

varm (Multi-dimensional Variable Annotations)

Dictionary storing multi-dimensional arrays aligned to variables.

# Store PCA loadings
adata.varm['PCs'] = np.random.rand(2000, 50)  # Loadings for 50 components
adata.varm['gene_modules'] = np.random.rand(2000, 10)  # Gene module scores

# Access loadings
pc_loadings = adata.varm['PCs']

Common varm uses:

  • PCs: Principal component loadings
  • gene_modules: Gene co-expression module assignments

obsp (Pairwise Observation Relationships)

Dictionary storing sparse matrices representing relationships between observations.

from scipy.sparse import csr_matrix

# Store k-nearest neighbor graph
n_obs = 100
knn_graph = csr_matrix(np.random.rand(n_obs, n_obs) > 0.95)
adata.obsp['connectivities'] = knn_graph
adata.obsp['distances'] = csr_matrix(np.random.rand(n_obs, n_obs))

# Access graphs
knn_connections = adata.obsp['connectivities']
distances = adata.obsp['distances']

Common obsp uses:

  • connectivities: Cell-cell neighborhood graph
  • distances: Pairwise distances between cells

varp (Pairwise Variable Relationships)

Dictionary storing sparse matrices representing relationships between variables.

# Store gene-gene correlation matrix
n_vars = 2000
gene_corr = csr_matrix(np.random.rand(n_vars, n_vars) > 0.99)
adata.varp['correlations'] = gene_corr

# Access correlations
gene_correlations = adata.varp['correlations']

uns (Unstructured Annotations)

Dictionary storing arbitrary unstructured metadata.

# Store analysis parameters and results
adata.uns['experiment_date'] = '2025-11-03'
adata.uns['pca'] = {
    'variance_ratio': [0.15, 0.10, 0.08],
    'params': {'n_comps': 50}
}
adata.uns['neighbors'] = {
    'params': {'n_neighbors': 15, 'method': 'umap'},
    'connectivities_key': 'connectivities'
}

# Access unstructured data
exp_date = adata.uns['experiment_date']
pca_params = adata.uns['pca']['params']

Common uns uses:

  • Analysis parameters and settings
  • Color palettes for plotting
  • Cluster information
  • Tool-specific metadata

raw (Original Data Snapshot)

Optional attribute preserving the original data matrix and variable annotations before filtering.

# Create AnnData and store raw state
adata = ad.AnnData(X=np.random.rand(100, 5000))
adata.var['gene_name'] = [f'Gene_{i}' for i in range(5000)]

# Store raw state before filtering
adata.raw = adata.copy()

# Filter to highly variable genes
highly_variable_mask = np.random.rand(5000) > 0.5
adata = adata[:, highly_variable_mask]

# Access original data
original_matrix = adata.raw.X
original_var = adata.raw.var

Object Properties

# Dimensions
n_observations = adata.n_obs
n_variables = adata.n_vars
shape = adata.shape  # (n_obs, n_vars)

# Index information
obs_names = adata.obs_names  # Observation identifiers
var_names = adata.var_names  # Variable identifiers

# Storage mode
is_view = adata.is_view  # True if this is a view of another object
is_backed = adata.isbacked  # True if backed by on-disk storage
filename = adata.filename  # Path to backing file (if backed)

Creating AnnData Objects

From arrays and DataFrames

import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({'cell_type': ['A', 'B'] * 50}, index=[f'cell_{i}' for i in range(100)])
var = pd.DataFrame({'gene_name': [f'Gene_{i}' for i in range(2000)]}, index=[f'ENSG{i:05d}' for i in range(2000)])
adata = ad.AnnData(X=X, obs=obs, var=var)

# With all components
adata = ad.AnnData(
    X=X,
    obs=obs,
    var=var,
    layers={'raw': np.random.randint(0, 100, (100, 2000))},
    obsm={'X_pca': np.random.rand(100, 50)},
    uns={'experiment': 'test'}
)

From DataFrame

# Create from pandas DataFrame (genes as columns, cells as rows)
df = pd.DataFrame(
    np.random.rand(100, 50),
    columns=[f'Gene_{i}' for i in range(50)],
    index=[f'Cell_{i}' for i in range(100)]
)
adata = ad.AnnData(df)

Data Access Patterns

Vector extraction

# Get observation annotation as array
cell_types = adata.obs_vector('cell_type')

# Get variable values across observations
gene_expression = adata.obs_vector('ACTB')  # If ACTB is in var_names

# Get variable annotation as array
gene_names = adata.var_vector('gene_name')

Subsetting

# By index
subset = adata[0:10, 0:100]  # First 10 obs, first 100 vars

# By name
subset = adata[['cell_1', 'cell_2'], ['ACTB', 'GAPDH']]

# By boolean mask
high_count_cells = adata.obs['total_counts'] > 1000
subset = adata[high_count_cells, :]

# By observation metadata
t_cells = adata[adata.obs['cell_type'] == 'T cell']

Memory Considerations

The AnnData structure is designed for memory efficiency:

  • Sparse matrices reduce memory for sparse data
  • Views avoid copying data when possible
  • Backed mode enables working with data larger than RAM
  • Categorical annotations reduce memory for discrete values
# Convert strings to categoricals (more memory efficient)
adata.obs['cell_type'] = adata.obs['cell_type'].astype('category')
adata.strings_to_categoricals()

# Check if object is a view (doesn't own data)
if adata.is_view:
    adata = adata.copy()  # Create independent copy