---
name: anndata
description: This skill should be used when working with annotated data matrices in Python, particularly for single-cell genomics analysis, managing experimental measurements with metadata, or handling large-scale biological datasets. Use when tasks involve AnnData objects, h5ad files, single-cell RNA-seq data, or integration with scanpy/scverse tools.
---

# AnnData

## Overview

AnnData is a Python package for handling annotated data matrices. It stores experimental measurements (X) alongside observation metadata (obs), variable metadata (var), and multi-dimensional annotations (obsm, varm, obsp, varp, uns). Originally designed for single-cell genomics through Scanpy, it now serves as a general-purpose framework for any annotated data requiring efficient storage, manipulation, and analysis.

## When to Use This Skill

Use this skill when:
- Creating, reading, or writing AnnData objects
- Working with h5ad, zarr, or other genomics data formats
- Performing single-cell RNA-seq analysis
- Managing large datasets with sparse matrices or backed mode
- Concatenating multiple datasets or experimental batches
- Subsetting, filtering, or transforming annotated data
- Integrating with scanpy, scvi-tools, or other scverse ecosystem tools

## Installation

```bash
uv pip install anndata

# With optional dependencies (quote the extras so the shell does not expand the brackets)
uv pip install "anndata[dev,test,doc]"
```

## Quick Start

### Creating an AnnData object
```python
import anndata as ad
import numpy as np
import pandas as pd

# Minimal creation
X = np.random.rand(100, 2000)  # 100 cells × 2000 genes
adata = ad.AnnData(X)

# With metadata
obs = pd.DataFrame({
    'cell_type': ['T cell', 'B cell'] * 50,
    'sample': ['A', 'B'] * 50
}, index=[f'cell_{i}' for i in range(100)])

var = pd.DataFrame({
    'gene_name': [f'Gene_{i}' for i in range(2000)]
}, index=[f'ENSG{i:05d}' for i in range(2000)])

adata = ad.AnnData(X=X, obs=obs, var=var)
```

### Reading data
```python
# Read h5ad file
adata = ad.read_h5ad('data.h5ad')

# Read with backed mode (for large files)
adata = ad.read_h5ad('large_data.h5ad', backed='r')

# Read other formats
adata = ad.read_csv('data.csv')
adata = ad.read_loom('data.loom')

# 10x HDF5 files are read via scanpy, not anndata
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')
```

### Writing data
```python
# Write h5ad file
adata.write_h5ad('output.h5ad')

# Write with compression
adata.write_h5ad('output.h5ad', compression='gzip')

# Write other formats
adata.write_zarr('output.zarr')
adata.write_csvs('output_dir/')
```

### Basic operations
```python
# Subset by conditions
t_cells = adata[adata.obs['cell_type'] == 'T cell']

# Subset by indices
subset = adata[0:50, 0:100]

# Add metadata
adata.obs['quality_score'] = np.random.rand(adata.n_obs)
adata.var['highly_variable'] = np.random.rand(adata.n_vars) > 0.8

# Access dimensions
print(f"{adata.n_obs} observations × {adata.n_vars} variables")
```

## Core Capabilities

### 1. Data Structure

Understand the AnnData object structure including X, obs, var, layers, obsm, varm, obsp, varp, uns, and raw components; a short access sketch follows the list below.

**See**: `references/data_structure.md` for comprehensive information on:
- Core components (X, obs, var, layers, obsm, varm, obsp, varp, uns, raw)
- Creating AnnData objects from various sources
- Accessing and manipulating data components
- Memory-efficient practices

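For orientation, here is a minimal sketch of how these components are accessed on a freshly created object. The keys (`'counts'`, `'X_pca'`, `'connectivities'`, `'params'`) are illustrative, not required names:

```python
import anndata as ad
import numpy as np
from scipy.sparse import identity

adata = ad.AnnData(np.random.rand(100, 2000))

# Per-observation and per-variable tables (pandas DataFrames)
adata.obs['batch'] = 'A'
adata.var['chromosome'] = '1'

# Alternative matrices with the same shape as X
adata.layers['counts'] = adata.X.copy()

# Multi-dimensional annotations keyed by axis
adata.obsm['X_pca'] = np.zeros((adata.n_obs, 50))       # one row per observation
adata.varm['loadings'] = np.zeros((adata.n_vars, 50))   # one row per variable

# Pairwise graphs (n_obs × n_obs) and unstructured metadata
adata.obsp['connectivities'] = identity(adata.n_obs, format='csr')
adata.uns['params'] = {'n_pcs': 50}
```
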
### 2. Input/Output Operations

Read and write data in various formats with support for compression, backed mode, and cloud storage.

**See**: `references/io_operations.md` for details on:
- Native formats (h5ad, zarr)
- Alternative formats (CSV, MTX, Loom, 10X, Excel)
- Backed mode for large datasets
- Remote data access
- Format conversion (round-trip sketch after the commands below)
- Performance optimization

Common commands:
```python
# Read/write h5ad
adata = ad.read_h5ad('data.h5ad', backed='r')
adata.write_h5ad('output.h5ad', compression='gzip')

# Read 10X data (via scanpy)
import scanpy as sc
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read MTX format (transpose: MTX files are typically genes × cells)
adata = ad.read_mtx('matrix.mtx').T
```

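As a format-conversion example, h5ad and Zarr round-trip through the same in-memory object; a minimal sketch, assuming a file exists at the path shown:

```python
import anndata as ad

# Read from h5ad, persist as Zarr, and read it back
adata = ad.read_h5ad('data.h5ad')
adata.write_zarr('data.zarr')
adata_back = ad.read_zarr('data.zarr')
```
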
### 3. Concatenation

Combine multiple AnnData objects along observations or variables with flexible join strategies.

**See**: `references/concatenation.md` for comprehensive coverage of:
- Basic concatenation (axis=0 for observations, axis=1 for variables)
- Join types (inner, outer)
- Merge strategies (same, unique, first, only), sketched after the commands below
- Tracking data sources with labels
- Lazy concatenation (AnnCollection)
- On-disk concatenation for large datasets

Common commands:
```python
# Concatenate observations (combine samples)
adata = ad.concat(
    [adata1, adata2, adata3],
    axis=0,
    join='inner',
    label='batch',
    keys=['batch1', 'batch2', 'batch3']
)

# Concatenate variables (combine modalities)
adata = ad.concat([adata_rna, adata_protein], axis=1)

# Lazy concatenation (AnnCollection takes AnnData objects,
# e.g. opened in backed mode, rather than file paths)
from anndata.experimental import AnnCollection
adatas = [ad.read_h5ad(p, backed='r') for p in ['data1.h5ad', 'data2.h5ad']]
collection = AnnCollection(
    adatas,
    join_obs='outer',
    label='dataset'
)
```

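The `merge` argument controls how element-wise metadata (such as `var` columns when concatenating along observations) is combined. A minimal sketch, assuming two toy objects that share a `var` column:

```python
import anndata as ad
import numpy as np
import pandas as pd

var = pd.DataFrame({'gene_name': ['G1', 'G2']}, index=['g1', 'g2'])
a = ad.AnnData(np.ones((2, 2)), var=var.copy())
b = ad.AnnData(np.zeros((3, 2)), var=var.copy())

# 'same' keeps var columns whose values agree across all objects
merged = ad.concat([a, b], axis=0, merge='same')
print(merged.var['gene_name'])

# index_unique appends the batch key so obs names stay unique
merged = ad.concat([a, b], label='batch', keys=['a', 'b'], index_unique='-')
print(merged.obs_names)
```
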
### 4. Data Manipulation

Transform, subset, filter, and reorganize data efficiently.

**See**: `references/manipulation.md` for detailed guidance on:
- Subsetting (by indices, names, boolean masks, metadata conditions)
- Transposition
- Copying (full copies vs views)
- Renaming (observations, variables, categories), sketched after the commands below
- Type conversions (strings to categoricals, sparse/dense)
- Adding/removing data components
- Reordering
- Quality control filtering

Common commands:
```python
# Subset by metadata
filtered = adata[adata.obs['quality_score'] > 0.8]
hv_genes = adata[:, adata.var['highly_variable']]

# Transpose
adata_T = adata.T

# Copy vs view
view = adata[0:100, :]         # View (lightweight reference)
copy = adata[0:100, :].copy()  # Independent copy

# Convert strings to categoricals
adata.strings_to_categoricals()
```

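A minimal renaming sketch; it assumes `cell_type` is already categorical (for example after `strings_to_categoricals()`), and the specific names are illustrative:

```python
# Prefix observation names (e.g. to mark the sample of origin)
adata.obs_names = ['sample1_' + name for name in adata.obs_names]

# Deduplicate names, e.g. after concatenation
adata.obs_names_make_unique()
adata.var_names_make_unique()

# Rename categories of a categorical obs column
adata.obs['cell_type'] = adata.obs['cell_type'].cat.rename_categories(
    {'T cell': 'T_cell', 'B cell': 'B_cell'}
)
```
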
### 5. Best Practices

Follow recommended patterns for memory efficiency, performance, and reproducibility.

**See**: `references/best_practices.md` for guidelines on:
- Memory management (sparse matrices, categoricals, backed mode)
- Views vs copies
- Data storage optimization
- Performance optimization
- Working with raw data
- Metadata management (provenance sketch after the recommendations below)
- Reproducibility
- Error handling
- Integration with other tools
- Common pitfalls and solutions

Key recommendations:
```python
# Use sparse matrices for sparse data
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)

# Convert strings to categoricals
adata.strings_to_categoricals()

# Use backed mode for large files
adata = ad.read_h5ad('large.h5ad', backed='r')

# Store raw before filtering
adata.raw = adata.copy()
adata = adata[:, adata.var['highly_variable']]
```

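Because `uns` is persisted in h5ad and zarr files, it is a convenient place to record processing parameters for reproducibility. A minimal sketch (the key names are illustrative):

```python
import anndata

# Record processing parameters and versions alongside the data
adata.uns['preprocessing'] = {
    'normalize_target_sum': 1e4,
    'n_top_genes': 2000,
    'anndata_version': anndata.__version__,
}
adata.write_h5ad('processed.h5ad')
```
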
## Integration with Scverse Ecosystem

AnnData serves as the foundational data structure for the scverse ecosystem:

### Scanpy (Single-cell analysis)
```python
import scanpy as sc

# Preprocessing
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)

# Dimensionality reduction
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)
sc.tl.umap(adata)
sc.tl.leiden(adata)

# Visualization
sc.pl.umap(adata, color=['cell_type', 'leiden'])
```

### Muon (Multimodal data)
```python
import muon as mu

# Combine RNA and protein data
mdata = mu.MuData({'rna': adata_rna, 'protein': adata_protein})
```

### PyTorch integration
```python
from anndata.experimental import AnnLoader

# Create DataLoader for deep learning
dataloader = AnnLoader(adata, batch_size=128, shuffle=True)

for batch in dataloader:
    X = batch.X
    # Train model
```

## Common Workflows

### Single-cell RNA-seq analysis
```python
import anndata as ad
import numpy as np
import scanpy as sc

# 1. Load data (10x HDF5 is read via scanpy)
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# 2. Quality control (flatten the sums so this works for sparse and dense X)
adata.obs['n_genes'] = np.asarray((adata.X > 0).sum(axis=1)).ravel()
adata.obs['n_counts'] = np.asarray(adata.X.sum(axis=1)).ravel()
adata = adata[adata.obs['n_genes'] > 200]
adata = adata[adata.obs['n_counts'] < 50000]

# 3. Store raw
adata.raw = adata.copy()

# 4. Normalize and filter
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var['highly_variable']]

# 5. Save processed data
adata.write_h5ad('processed.h5ad')
```

### Batch integration
```python
import anndata as ad
import scanpy as sc

# Load multiple batches
adata1 = ad.read_h5ad('batch1.h5ad')
adata2 = ad.read_h5ad('batch2.h5ad')
adata3 = ad.read_h5ad('batch3.h5ad')

# Concatenate with batch labels
adata = ad.concat(
    [adata1, adata2, adata3],
    label='batch',
    keys=['batch1', 'batch2', 'batch3'],
    join='inner'
)

# Apply batch correction
sc.pp.combat(adata, key='batch')

# Continue analysis
sc.pp.pca(adata)
sc.pp.neighbors(adata)
sc.tl.umap(adata)
```

### Working with large datasets
```python
# Open in backed mode
adata = ad.read_h5ad('100GB_dataset.h5ad', backed='r')

# Filter based on metadata (creates a view; X stays on disk)
high_quality = adata[adata.obs['quality_score'] > 0.8]

# Load the filtered subset into memory
adata_subset = high_quality.to_memory()

# Process subset (process() stands in for your own analysis function)
process(adata_subset)

# Or process in chunks
chunk_size = 1000
for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i+chunk_size, :].to_memory()
    process(chunk)
```

## Troubleshooting

### Out of memory errors
Use backed mode or convert to sparse matrices:
```python
# Backed mode
adata = ad.read_h5ad('file.h5ad', backed='r')

# Sparse matrices
from scipy.sparse import csr_matrix
adata.X = csr_matrix(adata.X)
```

### Slow file reading
Optimize the storage layout: convert strings to categoricals and pick a format suited to the access pattern (gzip shrinks files at some cost to read speed; chunked Zarr suits cloud storage):
```python
# Optimize for storage
adata.strings_to_categoricals()
adata.write_h5ad('file.h5ad', compression='gzip')

# Use Zarr for cloud storage
adata.write_zarr('file.zarr', chunks=(1000, 1000))
```

### Index alignment issues
Always align external data on the index:
```python
# Wrong: relies on positional order matching adata.obs
adata.obs['new_col'] = external_data['values']

# Correct: align on the cell identifiers
adata.obs['new_col'] = external_data.set_index('cell_id').loc[adata.obs_names, 'values']
```

## Additional Resources

- **Official documentation**: https://anndata.readthedocs.io/
- **Scanpy tutorials**: https://scanpy.readthedocs.io/
- **Scverse ecosystem**: https://scverse.org/
- **GitHub repository**: https://github.com/scverse/anndata