---
name: zarr-python
description: "Chunked N-D arrays for cloud storage. Compressed arrays, parallel I/O, S3/GCS integration, NumPy/Dask/Xarray compatible, for large-scale scientific computing pipelines."
---

# Zarr Python

## Overview

Zarr is a Python library for storing large N-dimensional arrays with chunking and compression. Apply this skill for efficient parallel I/O, cloud-native workflows, and seamless integration with NumPy, Dask, and Xarray.

## Quick Start

### Installation

```bash
uv pip install zarr
```

Requires Python 3.11+. For cloud storage support, install the additional packages:

```bash
uv pip install s3fs   # for S3
uv pip install gcsfs  # for Google Cloud Storage
```

### Basic Array Creation

```python
import zarr
import numpy as np

# Create a 2D array with chunking and compression
z = zarr.create_array(
    store="data/my_array.zarr",
    shape=(10000, 10000),
    chunks=(1000, 1000),
    dtype="f4"
)

# Write data using NumPy-style indexing
z[:, :] = np.random.random((10000, 10000))

# Read data
data = z[0:100, 0:100]  # Returns a NumPy array
```

## Core Operations

### Creating Arrays

Zarr provides several convenience functions for array creation:

```python
# Create a zero-filled array
z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000), dtype='f4',
               store='data.zarr')

# Create filled arrays
z = zarr.ones((5000, 5000), chunks=(500, 500))
z = zarr.full((1000, 1000), fill_value=42, chunks=(100, 100))

# Create from existing data
data = np.arange(10000).reshape(100, 100)
z = zarr.array(data, chunks=(10, 10), store='data.zarr')

# Create like another array
z2 = zarr.zeros_like(z)  # Matches the shape, chunks, and dtype of z
```

### Opening Existing Arrays

```python
# Open an array for reading and writing
z = zarr.open_array('data.zarr', mode='r+')

# Read-only mode
z = zarr.open_array('data.zarr', mode='r')

# The open() function auto-detects arrays vs. groups
z = zarr.open('data.zarr')  # Returns Array or Group
```

### Reading and Writing Data

Zarr arrays support NumPy-like indexing:

```python
# Write the entire array
z[:] = 42

# Write slices
z[0, :] = np.arange(100)
z[10:20, 50:60] = np.random.random((10, 10))

# Read data (returns NumPy arrays)
data = z[0:100, 0:100]
row = z[5, :]

# Advanced indexing
z.vindex[[0, 5, 10], [2, 8, 15]]  # Coordinate (point) indexing
z.oindex[0:10, [5, 10, 15]]       # Orthogonal (outer) indexing
z.blocks[0, 0]                    # Block/chunk indexing
```

### Resizing and Appending

```python
# Resize the array (pass the new shape as a tuple)
z.resize((15000, 15000))  # Expands or shrinks dimensions

# Append data along an axis
z.append(np.random.random((1000, 10000)), axis=0)  # Adds rows
```

## Chunking Strategies

Chunking is critical for performance. Choose chunk sizes and shapes based on access patterns.

### Chunk Size Guidelines

- **Minimum chunk size**: at least ~1 MB per chunk is recommended for good performance
- **Balance**: larger chunks mean fewer metadata operations; smaller chunks allow finer-grained parallel access
- **Memory**: entire chunks must fit in memory during compression and decompression

```python
# Configure chunk size (aim for ~1 MB per chunk)
# For float32 data: 1 MB = 262,144 elements = a 512×512 block
z = zarr.zeros(
    shape=(10000, 10000),
    chunks=(512, 512),  # ~1 MB chunks
    dtype='f4'
)
```

### Aligning Chunks with Access Patterns

**Critical**: Chunk shape dramatically affects performance depending on how data is accessed.

```python
# If accessing rows frequently (along the first dimension)
z = zarr.zeros((10000, 10000), chunks=(10, 10000))  # Chunks span columns

# If accessing columns frequently (along the second dimension)
z = zarr.zeros((10000, 10000), chunks=(10000, 10))  # Chunks span rows

# For mixed access patterns (balanced approach)
z = zarr.zeros((10000, 10000), chunks=(1000, 1000))  # Square chunks
```

**Performance example**: For a (200, 200, 200) array, reading a line along the first dimension:

- Using chunks (1, 200, 200): ~107 ms (the read touches 200 chunks)
- Using chunks (200, 200, 1): ~1.65 ms (one chunk; ~65× faster)

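The gap comes from how many chunks a single read touches, and a quick micro-benchmark makes it visible. A minimal sketch, assuming zarr-python 3.x with `zarr.create_array` and an in-memory store (exact timings vary by machine):

```python
# Micro-benchmark sketch: read one line along the first axis under two chunk layouts.
import timeit

import numpy as np
import zarr
from zarr.storage import MemoryStore

data = np.zeros((200, 200, 200), dtype='f4')
for chunks in [(1, 200, 200), (200, 200, 1)]:
    z = zarr.create_array(store=MemoryStore(), shape=data.shape,
                          chunks=chunks, dtype='f4')
    z[:] = data
    t = timeit.timeit(lambda: z[:, 0, 0], number=10)  # touches 200 chunks vs. 1
    print(chunks, f"{t / 10 * 1e3:.2f} ms per read")
```
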
### Sharding for Large-Scale Storage

When arrays have millions of small chunks, use sharding to group many chunks into larger storage objects:

```python
# Create an array with sharding
z = zarr.create_array(
    store='data.zarr',
    shape=(100000, 100000),
    chunks=(100, 100),    # Small chunks for fine-grained access
    shards=(1000, 1000),  # Groups 10×10 = 100 chunks per shard
    dtype='f4'
)
```

**Benefits**:

- Reduces file system overhead from millions of small files
- Improves cloud storage performance (fewer object requests)
- Avoids wasting filesystem blocks on tiny files

**Important**: Entire shards must fit in memory before writing.

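Since whole shards are buffered, it is worth checking the arithmetic before picking a shard shape. A small sketch (plain NumPy, using the shard shape from the example above):

```python
# Memory needed per shard = number of elements in the shard × itemsize.
import numpy as np

shards, dtype = (1000, 1000), np.dtype('f4')
shard_bytes = int(np.prod(shards)) * dtype.itemsize
print(f"{shard_bytes / 2**20:.1f} MiB per shard")  # ~3.8 MiB for this example
```
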
## Compression

Zarr applies compression per chunk, reducing storage while keeping access fast.

### Configuring Compression

```python
from zarr.codecs import BloscCodec, GzipCodec

# Default: Blosc with Zstandard
z = zarr.zeros((1000, 1000), chunks=(100, 100))  # Uses the default compressor

# Configure the Blosc codec
z = zarr.create_array(
    store='data.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    compressors=[BloscCodec(cname='zstd', clevel=5, shuffle='shuffle')]
)

# Available Blosc compressors: 'blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd'

# Use Gzip compression
z = zarr.create_array(
    store='data.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    compressors=[GzipCodec(level=6)]
)

# Disable compression
z = zarr.create_array(
    store='data.zarr',
    shape=(1000, 1000),
    chunks=(100, 100),
    dtype='f4',
    compressors=None  # No compression
)
```

### Compression Performance Tips

- **Blosc** (default): fast compression/decompression, good for interactive workloads
- **Zstandard**: better compression ratios, slightly slower than LZ4
- **Gzip**: maximum compression, slowest
- **LZ4**: fastest compression, lower ratios
- **Shuffle**: enable the shuffle filter for better compression of numeric data

```python
# Good default for numeric scientific data
compressors=[BloscCodec(cname='zstd', clevel=5, shuffle='shuffle')]

# Optimized for speed
compressors=[BloscCodec(cname='lz4', clevel=1)]

# Optimized for compression ratio
compressors=[GzipCodec(level=9)]
```

## Storage Backends

Zarr supports multiple storage backends through a flexible storage interface.

### Local Filesystem (Default)

```python
from zarr.storage import LocalStore

# Explicit store creation
store = LocalStore('data/my_array.zarr')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))

# Or use a string path (creates a LocalStore automatically)
z = zarr.open_array('data/my_array.zarr', mode='w', shape=(1000, 1000),
                    chunks=(100, 100))
```

### In-Memory Storage

```python
from zarr.storage import MemoryStore

# Create an in-memory store
store = MemoryStore()
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))

# Data exists only in memory and is not persisted
```

### ZIP File Storage

```python
from zarr.storage import ZipStore

# Write to a ZIP file
store = ZipStore('data.zip', mode='w')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
z[:] = np.random.random((1000, 1000))
store.close()  # IMPORTANT: a ZipStore must be closed

# Read from a ZIP file
store = ZipStore('data.zip', mode='r')
z = zarr.open_array(store=store)
data = z[:]
store.close()
```

### Cloud Storage (S3, GCS)

```python
import s3fs
import zarr

# S3 storage
s3 = s3fs.S3FileSystem(anon=False)  # Uses your credentials
store = s3fs.S3Map(root='my-bucket/path/to/array.zarr', s3=s3)
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
z[:] = data

# Google Cloud Storage
import gcsfs
gcs = gcsfs.GCSFileSystem(project='my-project')
store = gcsfs.GCSMap(root='my-bucket/path/to/array.zarr', gcs=gcs)
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
```

**Cloud Storage Best Practices**:

- Use consolidated metadata to reduce latency: `zarr.consolidate_metadata(store)`
- Align chunk sizes with cloud object sizing (typically 5-100 MB works best; see the sizing sketch below)
- Enable parallel writes with Dask for large-scale data
- Consider sharding to reduce the number of objects

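To translate the 5-100 MB guidance into chunk shapes, a tiny helper can do the arithmetic. A hypothetical sketch (the `square_chunk` function is illustrative, not part of Zarr):

```python
# Pick a square 2-D chunk whose uncompressed size is close to a target in MB.
import math

import numpy as np

def square_chunk(target_mb: float, dtype: str) -> tuple[int, int]:
    n_elements = target_mb * 2**20 / np.dtype(dtype).itemsize
    n = math.isqrt(int(n_elements))
    return (n, n)

print(square_chunk(32, 'f4'))  # ~(2896, 2896): a ~32 MB float32 chunk
```
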
## Groups and Hierarchies

Groups organize multiple arrays hierarchically, similar to directories or HDF5 groups.

### Creating and Using Groups

```python
# Create the root group
root = zarr.group(store='data/hierarchy.zarr')

# Create sub-groups
temperature = root.create_group('temperature')
precipitation = root.create_group('precipitation')

# Create arrays within groups
temp_array = temperature.create_array(
    name='t2m',
    shape=(365, 720, 1440),
    chunks=(1, 720, 1440),
    dtype='f4'
)

precip_array = precipitation.create_array(
    name='prcp',
    shape=(365, 720, 1440),
    chunks=(1, 720, 1440),
    dtype='f4'
)

# Access using paths
array = root['temperature/t2m']

# Visualize the hierarchy
print(root.tree())
# Output:
# /
# ├── temperature
# │   └── t2m (365, 720, 1440) f4
# └── precipitation
#     └── prcp (365, 720, 1440) f4
```

### H5py-Compatible API

Zarr provides an h5py-compatible interface familiar to HDF5 users:

```python
# Create a group with h5py-style methods
root = zarr.group('data.zarr')
dataset = root.create_dataset('my_data', shape=(1000, 1000), chunks=(100, 100),
                              dtype='f4')

# Access like h5py
grp = root.require_group('subgroup')
arr = grp.require_dataset('array', shape=(500, 500), chunks=(50, 50), dtype='i4')
```

## Attributes and Metadata

Attach custom metadata to arrays and groups using attributes:

```python
# Add attributes to an array (created on disk so the attributes persist)
z = zarr.open_array('data.zarr', mode='w', shape=(1000, 1000), chunks=(100, 100))
z.attrs['description'] = 'Temperature data in Kelvin'
z.attrs['units'] = 'K'
z.attrs['created'] = '2024-01-15'
z.attrs['processing_version'] = 2.1

# Attributes are stored as JSON
print(z.attrs['units'])  # Output: K

# Add attributes to groups
root = zarr.group('hierarchy.zarr')
root.attrs['project'] = 'Climate Analysis'
root.attrs['institution'] = 'Research Institute'

# Attributes persist with the array/group
z2 = zarr.open('data.zarr')
print(z2.attrs['description'])
```

**Important**: Attributes must be JSON-serializable (strings, numbers, lists, dicts, booleans, null).

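NumPy scalars and arrays are a common stumbling block here; converting them to plain Python types before assignment keeps attributes portable. A small sketch, reusing `z` from above:

```python
# Convert NumPy values to plain Python types for JSON-safe attributes.
import numpy as np

z.attrs['mean_value'] = float(np.float32(2.5))  # np.float32 -> float
z.attrs['histogram'] = np.arange(5).tolist()    # ndarray -> list of ints
```
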
## Integration with NumPy, Dask, and Xarray

### NumPy Integration

Zarr arrays implement the NumPy array interface:

```python
import numpy as np
import zarr

z = zarr.zeros((1000, 1000), chunks=(100, 100))

# Use NumPy functions directly (note: these load the data they touch into memory)
result = np.sum(z, axis=0)
mean = np.mean(z[:100, :100])

# Convert to a NumPy array
numpy_array = z[:]  # Loads the entire array into memory
```

### Dask Integration

Dask provides lazy, parallel computation on Zarr arrays:

```python
import dask.array as da
import zarr

# Create a large Zarr array
z = zarr.open('data.zarr', mode='w', shape=(100000, 100000),
              chunks=(1000, 1000), dtype='f4')

# Load as a Dask array (lazy; no data is read yet)
dask_array = da.from_zarr('data.zarr')

# Perform computations (parallel, out-of-core)
result = dask_array.mean(axis=0).compute()

# Write a Dask array to Zarr
large_array = da.random.random((100000, 100000), chunks=(1000, 1000))
da.to_zarr(large_array, 'output.zarr')
```

**Benefits**:

- Process datasets larger than memory
- Automatic parallel computation across chunks
- Efficient I/O with chunked storage

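If the resulting Dask tasks are too small, you can re-chunk at load time: `da.from_zarr` accepts a `chunks` argument, so each task can span several Zarr chunks. A brief sketch:

```python
# Re-chunk on load: each Dask task now covers a 4×4 block of (1000, 1000) Zarr chunks.
import dask.array as da

dask_array = da.from_zarr('data.zarr', chunks=(4000, 4000))
print(dask_array.chunksize)  # (4000, 4000)
```
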
### Xarray Integration

Xarray provides labeled, multidimensional arrays with a Zarr backend:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Open a Zarr store as an Xarray Dataset (lazy loading)
ds = xr.open_zarr('data.zarr')

# The Dataset includes coordinates and metadata
print(ds)

# Access variables
temperature = ds['temperature']

# Perform labeled operations
subset = ds.sel(time='2024-01', lat=slice(30, 60))

# Write an Xarray Dataset to Zarr
ds.to_zarr('output.zarr')

# Create from scratch with coordinates
ds = xr.Dataset(
    {
        'temperature': (['time', 'lat', 'lon'], data),
        'precipitation': (['time', 'lat', 'lon'], data2)
    },
    coords={
        'time': pd.date_range('2024-01-01', periods=365),
        'lat': np.arange(-90, 91, 1),
        'lon': np.arange(-180, 180, 1)
    }
)
ds.to_zarr('climate_data.zarr')
```

**Benefits**:

- Named dimensions and coordinates
- Label-based indexing and selection
- Integration with pandas for time series
- A NetCDF-like interface familiar to climate/geospatial scientists

## Parallel Computing and Synchronization

### Thread-Safe Operations

Synchronizers are part of the zarr-python 2.x API (zarr-python 3.x does not ship them):

```python
from zarr import ThreadSynchronizer
import zarr

# For multi-threaded writes
synchronizer = ThreadSynchronizer()
z = zarr.open_array('data.zarr', mode='r+', shape=(10000, 10000),
                    chunks=(1000, 1000), synchronizer=synchronizer)

# Safe for concurrent writes from multiple threads
# (when writes don't span chunk boundaries)
```

### Process-Safe Operations

```python
from zarr import ProcessSynchronizer
import zarr

# For multi-process writes (zarr-python 2.x)
synchronizer = ProcessSynchronizer('sync_data.sync')
z = zarr.open_array('data.zarr', mode='r+', shape=(10000, 10000),
                    chunks=(1000, 1000), synchronizer=synchronizer)

# Safe for concurrent writes from multiple processes
```

**Note**:

- Concurrent reads require no synchronization
- Synchronization is only needed for writes that may span chunk boundaries
- If each process or thread writes to separate chunks, no synchronizer is needed (see the sketch below)

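The last point is the usual way to parallelize safely on any zarr-python version: align each writer's region with whole chunks. A minimal sketch (the thread pool and band sizes are illustrative):

```python
# Chunk-aligned parallel writes: each worker owns a distinct 1000-row band,
# which maps exactly onto whole (1000, 1000) chunks, so no synchronizer is needed.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import zarr

z = zarr.open_array('data.zarr', mode='w', shape=(10000, 10000),
                    chunks=(1000, 1000), dtype='f4')

def write_band(i: int) -> None:
    z[i * 1000:(i + 1) * 1000, :] = np.random.random((1000, 10000))

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(write_band, range(10)))
```
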
## Consolidated Metadata

For hierarchical stores with many arrays, consolidate the metadata into a single file to reduce I/O operations:

```python
import zarr

# After creating arrays/groups
root = zarr.group('data.zarr')
# ... create multiple arrays/groups ...

# Consolidate metadata
zarr.consolidate_metadata('data.zarr')

# Open with consolidated metadata (faster, especially on cloud storage)
root = zarr.open_consolidated('data.zarr')
```

**Benefits**:

- Reduces metadata reads from N (one per array) to 1
- Critical for cloud storage, where each read is a high-latency request
- Speeds up `tree()` operations and group traversal

**Cautions**:

- Consolidated metadata becomes stale if arrays change without re-consolidation (see the sketch below)
- Not suitable for frequently updated datasets
- Multi-writer scenarios may see inconsistent reads

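A re-consolidation sketch: whenever the hierarchy changes, run `consolidate_metadata` again so readers of the consolidated view stay in sync (the array name and shape below are illustrative):

```python
# Modify the hierarchy, then refresh the consolidated metadata.
root = zarr.open_group('data.zarr', mode='a')
root.create_array(name='new_var', shape=(100,), chunks=(100,), dtype='f4')

zarr.consolidate_metadata('data.zarr')  # readers via open_consolidated() now see new_var
```
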
## Performance Optimization

### Checklist for Optimal Performance

1. **Chunk size**: Aim for 1-10 MB per chunk
   ```python
   # For float32: 1 MB = 262,144 elements
   chunks = (512, 512)  # 512 × 512 × 4 bytes ≈ 1 MB
   ```

2. **Chunk shape**: Align with access patterns
   ```python
   # Row-wise access    -> chunks span columns: (small, large)
   # Column-wise access -> chunks span rows:    (large, small)
   # Random access      -> balanced:            (medium, medium)
   ```

3. **Compression**: Choose based on workload
   ```python
   # Interactive/fast:    BloscCodec(cname='lz4')
   # Balanced:            BloscCodec(cname='zstd', clevel=5)
   # Maximum compression: GzipCodec(level=9)
   ```

4. **Storage backend**: Match to environment
   ```python
   # Local: LocalStore (default)
   # Cloud: S3Map/GCSMap with consolidated metadata
   # Temporary: MemoryStore
   ```

5. **Sharding**: Use for large-scale datasets
   ```python
   # When you have millions of small chunks, e.g. 10×10 chunks per shard:
   shards = (10 * chunk_rows, 10 * chunk_cols)
   ```

6. **Parallel I/O**: Use Dask for large operations
   ```python
   import dask.array as da
   dask_array = da.from_zarr('data.zarr')
   result = dask_array.compute(scheduler='threads', num_workers=8)
   ```

### Profiling and Debugging

```python
# Print detailed array information
print(z.info)
# The output includes:
# - type, shape, chunks, dtype
# - compression codec and level
# - storage size (compressed vs. uncompressed)
# - storage location

# Check storage size
print(f"Compressed size: {z.nbytes_stored / 1e6:.2f} MB")
print(f"Uncompressed size: {z.nbytes / 1e6:.2f} MB")
print(f"Compression ratio: {z.nbytes / z.nbytes_stored:.2f}x")
```

## Common Patterns and Best Practices

### Pattern: Time Series Data

```python
# Store time series with time as the first dimension;
# this allows efficient appending of new time steps.
z = zarr.open('timeseries.zarr', mode='a',
              shape=(0, 720, 1440),   # Start with 0 time steps
              chunks=(1, 720, 1440),  # One time step per chunk
              dtype='f4')

# Append new time steps
new_data = np.random.random((1, 720, 1440))
z.append(new_data, axis=0)
```

### Pattern: Large Matrix Operations

```python
import dask.array as da

# Create a large matrix in Zarr
z = zarr.open('matrix.zarr', mode='w',
              shape=(100000, 100000),
              chunks=(1000, 1000),
              dtype='f8')

# Use Dask for parallel computation; stream the (equally huge) product
# back to Zarr instead of calling .compute(), which would need ~80 GB of RAM
dask_z = da.from_zarr('matrix.zarr')
da.to_zarr(dask_z @ dask_z.T, 'result.zarr')  # Parallel matrix multiply
```

### Pattern: Cloud-Native Workflow

```python
import s3fs
import zarr

# Write to S3
s3 = s3fs.S3FileSystem()
store = s3fs.S3Map(root='s3://my-bucket/data.zarr', s3=s3)

# Create an array with chunking appropriate for cloud storage
z = zarr.open_array(store=store, mode='w',
                    shape=(10000, 10000),
                    chunks=(500, 500),  # ~1 MB chunks
                    dtype='f4')
z[:] = data

# Consolidate metadata for faster reads
zarr.consolidate_metadata(store)

# Read from S3 (anywhere, anytime)
store_read = s3fs.S3Map(root='s3://my-bucket/data.zarr', s3=s3)
z_read = zarr.open_consolidated(store_read)
subset = z_read[0:100, 0:100]
```

### Pattern: Format Conversion

```python
# HDF5 to Zarr
import h5py
import zarr

with h5py.File('data.h5', 'r') as h5:
    dataset = h5['dataset_name']
    z = zarr.array(dataset[:],
                   chunks=(1000, 1000),
                   store='data.zarr')

# NumPy to Zarr
import numpy as np
data = np.load('data.npy')
z = zarr.array(data, chunks='auto', store='data.zarr')

# Zarr to NetCDF (via Xarray)
import xarray as xr
ds = xr.open_zarr('data.zarr')
ds.to_netcdf('data.nc')
```

## Common Issues and Solutions

### Issue: Slow Performance

**Diagnosis**: Check chunk size and alignment:

```python
print(z.chunks)  # Are the chunks an appropriate size?
print(z.info)    # Check the compression ratio
```

**Solutions**:

- Increase chunk size to 1-10 MB
- Align chunks with the access pattern
- Try different compression codecs
- Use Dask for parallel operations

### Issue: High Memory Usage

**Cause**: Loading the entire array, or very large chunks, into memory.

**Solutions**:

```python
# Don't load the entire array
# Bad:  data = z[:]
# Good: process in chunks
for i in range(0, z.shape[0], 1000):
    chunk = z[i:i+1000, :]
    process(chunk)

# Or use Dask for automatic chunking
import dask.array as da
dask_z = da.from_zarr('data.zarr')
result = dask_z.mean().compute()  # Processes chunk by chunk
```

### Issue: Cloud Storage Latency

**Solutions**:

```python
# 1. Consolidate metadata
zarr.consolidate_metadata(store)
z = zarr.open_consolidated(store)

# 2. Use appropriate chunk sizes (5-100 MB for cloud)
chunks = (2000, 2000)  # ~16 MB float32 chunks

# 3. Enable sharding
shards = (10000, 10000)  # Groups many chunks per object
```

### Issue: Concurrent Write Conflicts

**Solution**: Use synchronizers (zarr-python 2.x) or ensure non-overlapping, chunk-aligned writes:

```python
from zarr import ProcessSynchronizer

sync = ProcessSynchronizer('sync.sync')
z = zarr.open_array('data.zarr', mode='r+', synchronizer=sync)

# Or design the workflow so each process writes to separate chunks
```

## Additional Resources

For detailed API documentation, advanced usage, and the latest updates:

- **Official Documentation**: https://zarr.readthedocs.io/
- **Zarr Specifications**: https://zarr-specs.readthedocs.io/
- **GitHub Repository**: https://github.com/zarr-developers/zarr-python
- **Community Chat**: https://gitter.im/zarr-developers/community

**Related Libraries**:

- **Xarray**: https://docs.xarray.dev/ (labeled arrays)
- **Dask**: https://docs.dask.org/ (parallel computing)
- **NumCodecs**: https://numcodecs.readthedocs.io/ (compression codecs)

# Zarr Python Quick Reference

This reference provides a concise overview of commonly used Zarr functions, parameters, and patterns for quick lookup during development.

## Array Creation Functions

### `zarr.zeros()` / `zarr.ones()` / `zarr.empty()`

```python
zarr.zeros(shape, chunks=None, dtype='f8', store=None, compressor='default',
           fill_value=0, order='C', filters=None)
```

Create arrays filled with zeros, filled with ones, or left uninitialized.

**Key parameters:**

- `shape`: tuple defining array dimensions (e.g., `(1000, 1000)`)
- `chunks`: tuple defining chunk dimensions (e.g., `(100, 100)`), or `None` for automatic chunking
- `dtype`: NumPy data type (e.g., `'f4'`, `'i8'`, `'bool'`)
- `store`: storage location (string path, Store object, or `None` for in-memory)
- `compressor`: compression codec, or `None` for no compression

### `zarr.create_array()` / `zarr.create()`

```python
zarr.create_array(store, *, shape, dtype, chunks='auto', shards=None,
                  filters='auto', compressors='auto', overwrite=False)
```

Create a new array with explicit control over all parameters (zarr-python 3.x API; `zarr.create()` is the older equivalent).

### `zarr.array()`

```python
zarr.array(data, chunks=None, dtype=None, compressor='default', store=None)
```

Create an array from existing data (NumPy array, list, etc.).

**Example:**

```python
import numpy as np
data = np.random.random((1000, 1000))
z = zarr.array(data, chunks=(100, 100), store='data.zarr')
```

### `zarr.open_array()` / `zarr.open()`

```python
zarr.open_array(store, mode='a', shape=None, chunks=None, dtype=None,
                compressor='default', fill_value=0)
```

Open an existing array or create a new one.

**Mode options:**

- `'r'`: read-only
- `'r+'`: read-write; the array must already exist
- `'a'`: read-write; create if it doesn't exist (default)
- `'w'`: create new; overwrite if it exists
- `'w-'`: create new; fail if it exists

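A quick sketch of the difference between `'w'` and `'w-'` (the path name is illustrative):

```python
# 'w' silently replaces an existing array; 'w-' refuses to clobber it.
z = zarr.open_array('fresh.zarr', mode='w-', shape=(10,), chunks=(10,), dtype='i4')

# A second open with mode='w-' on the same path raises an error because
# the array already exists; mode='w' would overwrite it instead.
```
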
## Storage Classes

### LocalStore (Default)

```python
from zarr.storage import LocalStore

store = LocalStore('path/to/data.zarr')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
```

### MemoryStore

```python
from zarr.storage import MemoryStore

store = MemoryStore()  # Data lives only in memory
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
```

### ZipStore

```python
from zarr.storage import ZipStore

# Write
store = ZipStore('data.zip', mode='w')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
z[:] = data
store.close()  # MUST close

# Read
store = ZipStore('data.zip', mode='r')
z = zarr.open_array(store=store)
data = z[:]
store.close()
```

### Cloud Storage (S3/GCS)

```python
# S3
import s3fs
s3 = s3fs.S3FileSystem(anon=False)
store = s3fs.S3Map(root='bucket/path/data.zarr', s3=s3)

# GCS
import gcsfs
gcs = gcsfs.GCSFileSystem(project='my-project')
store = gcsfs.GCSMap(root='bucket/path/data.zarr', gcs=gcs)
```

## Compression Codecs

### Blosc Codec (Default)

```python
from zarr.codecs import BloscCodec

codec = BloscCodec(
    cname='zstd',      # Compressor: 'blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd'
    clevel=5,          # Compression level: 0-9
    shuffle='shuffle'  # Shuffle filter: 'noshuffle', 'shuffle', 'bitshuffle'
)

z = zarr.create_array(store='data.zarr', shape=(1000, 1000), chunks=(100, 100),
                      dtype='f4', compressors=[codec])
```

**Blosc compressor characteristics:**

- `'lz4'`: fastest compression, lower ratio
- `'zstd'`: balanced (default), good ratio and speed
- `'zlib'`: good compatibility, moderate performance
- `'lz4hc'`: better ratio than lz4, slower
- `'snappy'`: fast, moderate ratio
- `'blosclz'`: Blosc's own default

### Other Codecs

```python
from zarr.codecs import BytesCodec, GzipCodec, ZstdCodec

# Gzip compression (maximum ratio, slower)
GzipCodec(level=6)   # Levels 0-9

# Zstandard compression
ZstdCodec(level=3)   # Levels 1-22

# BytesCodec is the default array-to-bytes serializer;
# to disable compression, pass compressors=None to zarr.create_array()
BytesCodec()
```

## Array Indexing and Selection

### Basic Indexing (NumPy-style)

```python
z = zarr.zeros((1000, 1000), chunks=(100, 100))

# Read
row = z[0, :]            # Single row
col = z[:, 0]            # Single column
block = z[10:20, 50:60]  # Slice
element = z[5, 10]       # Single element

# Write
z[0, :] = 42
z[10:20, 50:60] = np.random.random((10, 10))
```

### Advanced Indexing

```python
# Coordinate indexing (point selection)
z.vindex[[0, 5, 10], [2, 8, 15]]  # Specific coordinates

# Orthogonal indexing (outer product)
z.oindex[0:10, [5, 10, 15]]  # Rows 0-9; columns 5, 10, and 15

# Block/chunk indexing
z.blocks[0, 0]      # First chunk
z.blocks[0:2, 0:2]  # First four chunks
```

## Groups and Hierarchies

### Creating Groups

```python
# Create the root group
root = zarr.group(store='data.zarr')

# Create nested groups
grp1 = root.create_group('group1')
grp2 = grp1.create_group('subgroup')

# Create arrays in groups
arr = grp1.create_array(name='data', shape=(1000, 1000),
                        chunks=(100, 100), dtype='f4')

# Access by path
arr2 = root['group1/data']
```

### Group Methods

```python
root = zarr.group('data.zarr')

# h5py-compatible methods
dataset = root.create_dataset('data', shape=(1000, 1000), chunks=(100, 100))
subgrp = root.require_group('subgroup')  # Create if it doesn't exist

# Visualize the structure
print(root.tree())

# List contents
print(list(root.keys()))
print(list(root.groups()))
print(list(root.arrays()))
```

## Array Attributes and Metadata

### Working with Attributes

```python
z = zarr.zeros((1000, 1000), chunks=(100, 100))

# Set attributes
z.attrs['units'] = 'meters'
z.attrs['description'] = 'Temperature data'
z.attrs['created'] = '2024-01-15'
z.attrs['version'] = 1.2
z.attrs['tags'] = ['climate', 'temperature']

# Read attributes
print(z.attrs['units'])
print(dict(z.attrs))  # All attributes as a dict

# Update/delete
z.attrs['version'] = 2.0
del z.attrs['tags']
```

**Note:** Attributes must be JSON-serializable.

## Array Properties and Methods

### Properties

```python
z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='f4')

z.shape          # (1000, 1000)
z.chunks         # (100, 100)
z.dtype          # dtype('float32')
z.size           # 1000000
z.nbytes         # 4000000 (uncompressed size in bytes)
z.nbytes_stored  # Actual compressed size on disk
z.nchunks        # 100 (number of chunks)
z.cdata_shape    # Shape in terms of chunks: (10, 10)
```

### Methods

```python
# Information
print(z.info)          # Detailed information about the array
print(z.info_items())  # The same information as a list of tuples

# Resizing
z.resize((1500, 1500))  # Change dimensions

# Appending
z.append(new_data, axis=0)  # Add data along an axis

# Copying (zarr-python 2.x): copy an array into a destination group
dest = zarr.open_group('new_location.zarr', mode='w')
zarr.copy(z, dest, name='data')
```

## Chunking Guidelines

### Chunk Size Calculation

```python
# For float32 (4 bytes per element):
#  1 MB = 262,144 elements
# 10 MB = 2,621,440 elements

# Example chunk shapes and their uncompressed sizes:
(512, 512)       # 2D: 512 × 512 × 4 bytes = 1,048,576 bytes ≈ 1 MB
(128, 128, 128)  # 3D: 128 × 128 × 128 × 4 bytes = 8,388,608 bytes ≈ 8 MB
(64, 256, 256)   # 3D: 64 × 256 × 256 × 4 bytes = 16,777,216 bytes ≈ 16 MB
```

### Chunking Strategies by Access Pattern

**Time series (sequential access along the first dimension):**

```python
chunks=(1, 720, 1440)  # One time step per chunk
```

**Row-wise access:**

```python
chunks=(10, 10000)  # Few rows, spanning all columns
```

**Column-wise access:**

```python
chunks=(10000, 10)  # All rows, few columns
```

**Random access:**

```python
chunks=(500, 500)  # Balanced square chunks
```

**3D volumetric data:**

```python
chunks=(64, 64, 64)  # Cubic chunks for isotropic access
```

## Integration APIs

### NumPy Integration

```python
import numpy as np

z = zarr.zeros((1000, 1000), chunks=(100, 100))

# Use NumPy functions (these load the data they touch into memory)
result = np.sum(z, axis=0)
mean = np.mean(z)
std = np.std(z)

# Convert to NumPy
arr = z[:]  # Loads the entire array into memory
```

### Dask Integration

```python
import dask.array as da

# Load a Zarr array as a Dask array
dask_array = da.from_zarr('data.zarr')

# Compute operations in parallel
result = dask_array.mean(axis=0).compute()

# Write a Dask array to Zarr
large_array = da.random.random((100000, 100000), chunks=(1000, 1000))
da.to_zarr(large_array, 'output.zarr')
```

### Xarray Integration

```python
import numpy as np
import pandas as pd
import xarray as xr

# Open Zarr as an Xarray Dataset
ds = xr.open_zarr('data.zarr')

# Write Xarray to Zarr
ds.to_zarr('output.zarr')

# Create with coordinates
ds = xr.Dataset(
    {'temperature': (['time', 'lat', 'lon'], data)},
    coords={
        'time': pd.date_range('2024-01-01', periods=365),
        'lat': np.arange(-90, 91, 1),
        'lon': np.arange(-180, 180, 1)
    }
)
ds.to_zarr('climate.zarr')
```

## Parallel Computing

### Synchronizers

Synchronizers come from the zarr-python 2.x API (zarr-python 3.x does not ship them):

```python
from zarr import ThreadSynchronizer, ProcessSynchronizer

# Multi-threaded writes
sync = ThreadSynchronizer()
z = zarr.open_array('data.zarr', mode='r+', synchronizer=sync)

# Multi-process writes
sync = ProcessSynchronizer('sync.sync')
z = zarr.open_array('data.zarr', mode='r+', synchronizer=sync)
```

**Note:** Synchronization is only needed for concurrent writes that may span chunk boundaries:

- Reads are always safe without synchronization
- No synchronizer is needed if each process writes to separate chunks

## Metadata Consolidation

```python
# Consolidate metadata (after creating all arrays/groups)
zarr.consolidate_metadata('data.zarr')

# Open with consolidated metadata (faster, especially on cloud storage)
root = zarr.open_consolidated('data.zarr')
```

**Benefits:**

- Reduces metadata I/O from N operations to 1
- Critical for cloud storage (reduces latency)
- Speeds up hierarchy traversal

**Cautions:**

- Can become stale if data is updated
- Re-consolidate after modifications
- Not for frequently updated datasets

## Common Patterns

### Time Series with Growing Data

```python
# Start with an empty first dimension
z = zarr.open('timeseries.zarr', mode='a',
              shape=(0, 720, 1440),
              chunks=(1, 720, 1440),
              dtype='f4')

# Append new time steps
for new_timestep in data_stream:
    z.append(new_timestep, axis=0)
```

### Processing Large Arrays in Chunks

```python
z = zarr.open('large_data.zarr', mode='r')

# Process without loading the entire array
for i in range(0, z.shape[0], 1000):
    chunk = z[i:i+1000, :]
    result = process(chunk)
    save(result)
```

### Format Conversion Pipeline

```python
# HDF5 → Zarr
import h5py
with h5py.File('data.h5', 'r') as h5:
    z = zarr.array(h5['dataset'][:], chunks=(1000, 1000), store='data.zarr')

# Zarr → NumPy file
import numpy as np
z = zarr.open('data.zarr', mode='r')
np.save('data.npy', z[:])

# Zarr → NetCDF (via Xarray)
import xarray as xr
ds = xr.open_zarr('data.zarr')
ds.to_netcdf('data.nc')
```

## Performance Optimization Quick Checklist

1. **Chunk size**: 1-10 MB per chunk
2. **Chunk shape**: align with the access pattern
3. **Compression**:
   - Fast: `BloscCodec(cname='lz4', clevel=1)`
   - Balanced: `BloscCodec(cname='zstd', clevel=5)`
   - Maximum: `GzipCodec(level=9)`
4. **Cloud storage**:
   - Larger chunks (5-100 MB)
   - Consolidate metadata
   - Consider sharding
5. **Parallel I/O**: use Dask for large operations
6. **Memory**: process in chunks; don't load entire arrays

## Debugging and Profiling

```python
z = zarr.open('data.zarr', mode='r')

# Detailed information
print(z.info)

# Size statistics
print(f"Uncompressed: {z.nbytes / 1e6:.2f} MB")
print(f"Compressed: {z.nbytes_stored / 1e6:.2f} MB")
print(f"Ratio: {z.nbytes / z.nbytes_stored:.1f}x")

# Chunk information
print(f"Chunks: {z.chunks}")
print(f"Number of chunks: {z.nchunks}")
print(f"Chunk grid: {z.cdata_shape}")
```

## Common Data Types

```python
# Integers
'i1', 'i2', 'i4', 'i8'  # Signed: 8, 16, 32, 64-bit
'u1', 'u2', 'u4', 'u8'  # Unsigned: 8, 16, 32, 64-bit

# Floats
'f2', 'f4', 'f8'  # 16, 32, 64-bit (half, single, double precision)

# Others
'bool'       # Boolean
'c8', 'c16'  # Complex: 64, 128-bit
'S10'        # Fixed-length bytes string (10 bytes)
'U10'        # Unicode string (10 characters)
```

## Version Compatibility

Zarr-Python version 3.x supports both:

- **Zarr v2 format**: legacy format, widely compatible
- **Zarr v3 format**: newer format with sharding and improved metadata

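To pin the on-disk format when creating data, zarr-python 3.x exposes a `zarr_format` argument; a short sketch (the path and shape are illustrative):

```python
# Write in the legacy v2 format for compatibility with older readers.
import zarr

z = zarr.create_array(store='v2_data.zarr', shape=(100,), chunks=(50,),
                      dtype='f4', zarr_format=2)
```
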
Check the format version:

```python
# Zarr automatically detects the format version on open;
# in zarr-python 3.x it is exposed on the array's metadata object
z = zarr.open('data.zarr', mode='r')
print(z.metadata.zarr_format)  # 2 or 3
```

## Error Handling

```python
try:
    z = zarr.open_array('data.zarr', mode='r')
except zarr.errors.PathNotFoundError:
    print("Array does not exist")
except zarr.errors.ReadOnlyError:
    print("Cannot write to a read-only array")
except Exception as e:
    print(f"Unexpected error: {e}")
```