# Input/Output Operations

AnnData provides comprehensive I/O functionality for reading and writing data in various formats.

## Native Formats

### H5AD (HDF5-based)

The recommended native format for AnnData objects, providing efficient storage and fast access.

#### Writing H5AD files

```python
import anndata as ad

# Write to file
adata.write_h5ad('data.h5ad')

# Write with compression
adata.write_h5ad('data.h5ad', compression='gzip')

# Write with a specific compression level (0-9, higher = more compression)
adata.write_h5ad('data.h5ad', compression='gzip', compression_opts=9)
```

#### Reading H5AD files

```python
# Read entire file into memory
adata = ad.read_h5ad('data.h5ad')

# Read in backed mode (lazy loading for large files)
adata = ad.read_h5ad('data.h5ad', backed='r')   # Read-only
adata = ad.read_h5ad('data.h5ad', backed='r+')  # Read-write

# Backed mode enables working with datasets larger than RAM:
# only accessed data is loaded into memory
```

#### Backed mode operations

```python
# Open in backed mode
adata = ad.read_h5ad('large_dataset.h5ad', backed='r')

# Access metadata without loading X into memory
print(adata.obs.head())
print(adata.var.head())

# Subset operations create views
subset = adata[:100, :500]  # View, no data loaded yet

# Accessing .X loads just this subset into memory
X_subset = subset.X

# Convert the entire backed object to memory
adata_memory = adata.to_memory()
```

### Zarr

Hierarchical array storage format, optimized for cloud storage and parallel I/O.

#### Writing Zarr

```python
# Write to a Zarr store
adata.write_zarr('data.zarr')

# Write with specific chunks (important for performance)
adata.write_zarr('data.zarr', chunks=(100, 100))
```

#### Reading Zarr

```python
# Read a Zarr store
adata = ad.read_zarr('data.zarr')
```

#### Remote Zarr access

```python
import fsspec

# Access Zarr on S3 (the s3:// protocol requires the s3fs package)
store = fsspec.get_mapper('s3://bucket-name/data.zarr')
adata = ad.read_zarr(store)

# Access Zarr from a URL
store = fsspec.get_mapper('https://example.com/data.zarr')
adata = ad.read_zarr(store)
```

## Alternative Input Formats

### CSV/TSV

```python
# Read CSV (genes as columns, cells as rows)
adata = ad.read_csv('data.csv')

# Read with a custom delimiter
adata = ad.read_csv('data.tsv', delimiter='\t')

# Treat the first column as row names
adata = ad.read_csv('data.csv', first_column_names=True)
```

### Excel

```python
# Read a sheet from an Excel file (the sheet name or index is required)
adata = ad.read_excel('data.xlsx', sheet='Sheet1')
```

### Matrix Market (MTX)

Common format for sparse matrices in genomics, typically shipped as `matrix.mtx` alongside `genes.tsv` and `barcodes.tsv`.

```python
import pandas as pd

# read_mtx loads only the matrix
adata = ad.read_mtx('matrix.mtx')

# Transpose if needed (MTX often has genes as rows)
adata = adata.T

# Attach gene and barcode names from the companion files
adata.var_names = pd.read_csv('genes.tsv', header=None, sep='\t')[0].values
adata.obs_names = pd.read_csv('barcodes.tsv', header=None, sep='\t')[0].values
```

### 10X Genomics formats

The 10X readers live in scanpy rather than in anndata itself:

```python
import scanpy as sc

# Read 10X h5 format
adata = sc.read_10x_h5('filtered_feature_bc_matrix.h5')

# Read a 10X MTX directory
adata = sc.read_10x_mtx('filtered_feature_bc_matrix/')

# Specify the genome if multiple are present
adata = sc.read_10x_h5('data.h5', genome='GRCh38')
```

### Loom

```python
# Read a Loom file
adata = ad.read_loom('data.loom')

# Read with specific observation and variable annotations
adata = ad.read_loom(
    'data.loom',
    obs_names='CellID',
    var_names='Gene'
)
```

### Text files

```python
# Read a generic text file
adata = ad.read_text('data.txt', delimiter='\t')

# Read with custom parameters
adata = ad.read_text(
    'data.txt',
    delimiter=',',
    first_column_names=True,
    dtype='float32'
)
```

### UMI tools

```python
# Read UMI-tools count matrix
adata = ad.read_umi_tools('counts.tsv')
```

### HDF5 (generic)

```python
# Read from a generic HDF5 file (not the h5ad layout)
adata = ad.read_hdf('data.h5', key='dataset')
```

## Alternative Output Formats

### CSV

```python
# Write to a directory of CSV files (one file per component)
adata.write_csvs('output_dir/')

# This creates:
# - output_dir/obs.csv (observation annotations)
# - output_dir/var.csv (variable annotations)
# - plus files for obsm, varm and, where possible, uns entries

# The expression matrix is skipped by default (skip_data=True);
# pass skip_data=False to also write X
adata.write_csvs('output_dir/', skip_data=False)
```

### Loom

```python
# Write to Loom format
adata.write_loom('output.loom')
```

## Reading Specific Elements

For fine-grained control, read specific elements from storage. `read_elem` operates on an open group or dataset, not on a path string:

```python
import h5py
from anndata import read_elem

with h5py.File('data.h5ad', 'r') as f:
    # Read just the observation annotations
    obs = read_elem(f['obs'])

    # Read a specific layer
    layer = read_elem(f['layers/normalized'])

    # Read an unstructured data element
    params = read_elem(f['uns/pca_params'])
```

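The same pattern works for Zarr stores. A minimal sketch, assuming the `zarr` package is installed and the store was written with `write_zarr` (the `layers/normalized` element is illustrative):

```python
import zarr
from anndata import read_elem

# Open the Zarr store read-only and pull out individual elements
g = zarr.open('data.zarr', mode='r')
obs = read_elem(g['obs'])
layer = read_elem(g['layers/normalized'])
```
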
## Writing Specific Elements

```python
import h5py
from anndata import write_elem

# Write an element into an existing file
with h5py.File('data.h5ad', 'a') as f:
    write_elem(f, 'new_layer', adata.X.copy())
```

## Lazy Operations

For very large datasets, use lazy reading to avoid loading entire elements into memory:

```python
import h5py
from anndata.experimental import read_elem_lazy

# Lazy read of X (returns a dask array backed by the on-disk data);
# keep the file open while the result is in use
f = h5py.File('large_data.h5ad', 'r')
X_lazy = read_elem_lazy(f['X'])

# Compute only when needed
subset = X_lazy[:100, :100].compute()
```

## Common I/O Patterns

### Convert between formats

```python
# MTX to H5AD
adata = ad.read_mtx('matrix.mtx').T
adata.write_h5ad('data.h5ad')

# CSV to H5AD
adata = ad.read_csv('data.csv')
adata.write_h5ad('data.h5ad')

# H5AD to Zarr
adata = ad.read_h5ad('data.h5ad')
adata.write_zarr('data.zarr')
```

### Load metadata without data

```python
# Backed mode allows inspecting metadata without loading X
adata = ad.read_h5ad('large_file.h5ad', backed='r')

print(f"Dataset contains {adata.n_obs} observations and {adata.n_vars} variables")
print(adata.obs.columns)
print(adata.var.columns)

# X is not loaded into memory
```

### Append to existing file

```python
# Open in read-write mode (X stays on disk; annotations live in memory)
adata = ad.read_h5ad('data.h5ad', backed='r+')

# Modify metadata
adata.obs['new_column'] = values  # 'values' is a placeholder sequence of length n_obs

# Persist the changes back to the backing file
adata.write()
```

### Download from URL

```python
import urllib.request

import anndata as ad

# read_h5ad expects a local path, so download remote .h5ad files first
url = 'https://example.com/data.h5ad'
urllib.request.urlretrieve(url, 'local_file.h5ad')
adata = ad.read_h5ad('local_file.h5ad')

# For streaming access to remote data, prefer Zarr
# (see "Remote Zarr access" above)
```

## Performance Tips

### Reading

- Use `backed='r'` for large files you only need to query (see the sketch below)
- Use `backed='r+'` if you need to modify metadata without loading all data
- H5AD format is generally fastest for random access
- Zarr is better for cloud storage and parallel access
- Consider compression for storage, but note it may slow down reading

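A minimal sketch of choosing a read mode by file size; the 4 GiB threshold and file name are arbitrary illustrations, not anndata recommendations:

```python
import os

import anndata as ad

path = 'dataset.h5ad'  # hypothetical file

# Open very large files in backed mode, smaller ones fully in memory
if os.path.getsize(path) > 4 * 1024**3:
    adata = ad.read_h5ad(path, backed='r')
else:
    adata = ad.read_h5ad(path)
```
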
### Writing

- Use compression for long-term storage: `compression='gzip'` or `compression='lzf'`
- LZF compression is faster but compresses less than GZIP
- For Zarr, tune chunk sizes based on access patterns (see the sketch below):
  - Larger chunks for sequential reads
  - Smaller chunks for random access
- Convert string columns to categorical before writing (smaller files)

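A sketch applying these tips together; the output file names and chunk shapes are illustrative assumptions, not defaults:

```python
import anndata as ad

# 'adata' is an existing AnnData object

# Categoricals shrink both the file and the in-memory representation
adata.strings_to_categoricals()

# gzip: smaller files, slower I/O; lzf: faster I/O, larger files
adata.write_h5ad('archive.h5ad', compression='gzip')
adata.write_h5ad('working_copy.h5ad', compression='lzf')

# Zarr chunking: row-block chunks suit sequential per-cell access,
# smaller square-ish chunks suit random access
adata.write_zarr('sequential.zarr', chunks=(10_000, adata.n_vars))
adata.write_zarr('random_access.zarr', chunks=(1_000, 1_000))
```
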
### Memory management

```python
import numpy as np
from scipy.sparse import csr_matrix

# Convert strings to categoricals (reduces file size and memory use)
adata.strings_to_categoricals()
adata.write_h5ad('data.h5ad')

# Use sparse matrices for sparse data
if isinstance(adata.X, np.ndarray):
    density = np.count_nonzero(adata.X) / adata.X.size
    if density < 0.5:  # more than 50% zeros
        adata.X = csr_matrix(adata.X)
```

## Handling Large Datasets

### Strategy 1: Backed mode

```python
# Work with a dataset larger than RAM
adata = ad.read_h5ad('100GB_file.h5ad', backed='r')

# Filter based on metadata (fast, no data loading)
filtered = adata[adata.obs['quality_score'] > 0.8]

# Load the filtered subset into memory
adata_memory = filtered.to_memory()
```

### Strategy 2: Chunked processing

```python
# Process data in chunks
adata = ad.read_h5ad('large_file.h5ad', backed='r')

chunk_size = 1000
results = []

for i in range(0, adata.n_obs, chunk_size):
    chunk = adata[i:i + chunk_size, :].to_memory()
    # process() stands in for whatever analysis runs on each chunk
    result = process(chunk)
    results.append(result)
```

### Strategy 3: Use AnnCollection

```python
import anndata as ad
from anndata.experimental import AnnCollection

# Open each dataset in backed mode so nothing is loaded up front
adatas = [ad.read_h5ad(f'dataset_{i}.h5ad', backed='r') for i in range(10)]

collection = AnnCollection(
    adatas,
    join_obs='inner',
    join_vars='inner'
)

# Process the collection lazily:
# data is loaded only when accessed
```

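A brief usage sketch; it assumes the collection's slice-then-convert workflow, where a lightweight view is materialized with `to_adata()`:

```python
# Take the first 1,000 joint observations as a view,
# then materialize the view as a regular in-memory AnnData
view = collection[:1000]
batch = view.to_adata()
print(batch)
```
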
## Common Issues and Solutions

### Issue: Out of memory when reading

**Solution**: Use backed mode or read in chunks

```python
adata = ad.read_h5ad('file.h5ad', backed='r')
```

### Issue: Slow reading from cloud storage

**Solution**: Use the Zarr format with appropriate chunking

```python
adata.write_zarr('data.zarr', chunks=(1000, 1000))
```

### Issue: Large file sizes

**Solution**: Use compression and convert to sparse/categorical

```python
from scipy.sparse import csr_matrix

adata.strings_to_categoricals()
adata.X = csr_matrix(adata.X)
adata.write_h5ad('compressed.h5ad', compression='gzip')
```

### Issue: Cannot modify backed object

**Solution**: Either load the object into memory or open the file in `'r+'` mode

```python
# Option 1: Load into memory
adata = adata.to_memory()

# Option 2: Open in read-write mode
adata = ad.read_h5ad('file.h5ad', backed='r+')
```