# Zarr Python Quick Reference

This reference provides a concise overview of commonly used Zarr functions, parameters, and patterns for quick lookup during development.

## Array Creation Functions

### `zarr.zeros()` / `zarr.ones()` / `zarr.empty()`

```python
zarr.zeros(shape, chunks=None, dtype='f8', store=None, compressor='default',
           fill_value=0, order='C', filters=None)
```

Create arrays filled with zeros, ones, or empty (uninitialized) values.

**Key parameters:**

- `shape`: Tuple defining array dimensions (e.g., `(1000, 1000)`)
- `chunks`: Tuple defining chunk dimensions (e.g., `(100, 100)`), or `None` for no chunking
- `dtype`: NumPy data type (e.g., `'f4'`, `'i8'`, `'bool'`)
- `store`: Storage location (string path, Store object, or `None` for memory)
- `compressor`: Compression codec or `None` for no compression

### `zarr.create_array()` / `zarr.create()`

```python
zarr.create_array(store, shape, chunks, dtype='f8', compressor='default',
                  fill_value=0, order='C', filters=None, overwrite=False)
```

Create a new array with explicit control over all parameters.

### `zarr.array()`

```python
zarr.array(data, chunks=None, dtype=None, compressor='default', store=None)
```

Create an array from existing data (NumPy array, list, etc.).

**Example:**

```python
import numpy as np

data = np.random.random((1000, 1000))
z = zarr.array(data, chunks=(100, 100), store='data.zarr')
```

### `zarr.open_array()` / `zarr.open()`

```python
zarr.open_array(store, mode='a', shape=None, chunks=None, dtype=None,
                compressor='default', fill_value=0)
```

Open an existing array or create a new one.

**Mode options:**

- `'r'`: Read-only
- `'r+'`: Read-write, file must exist
- `'a'`: Read-write, create if it doesn't exist (default)
- `'w'`: Create new, overwrite if exists
- `'w-'`: Create new, fail if exists

## Storage Classes

### LocalStore (Default)

```python
from zarr.storage import LocalStore

store = LocalStore('path/to/data.zarr')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
```

### MemoryStore

```python
from zarr.storage import MemoryStore

store = MemoryStore()  # Data only in memory
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
```

### ZipStore

```python
from zarr.storage import ZipStore

# Write
store = ZipStore('data.zip', mode='w')
z = zarr.open_array(store=store, mode='w', shape=(1000, 1000), chunks=(100, 100))
z[:] = data
store.close()  # MUST close

# Read
store = ZipStore('data.zip', mode='r')
z = zarr.open_array(store=store, mode='r')
data = z[:]
store.close()
```

### Cloud Storage (S3/GCS)

```python
# S3
import s3fs
s3 = s3fs.S3FileSystem(anon=False)
store = s3fs.S3Map(root='bucket/path/data.zarr', s3=s3)

# GCS
import gcsfs
gcs = gcsfs.GCSFileSystem(project='my-project')
store = gcsfs.GCSMap(root='bucket/path/data.zarr', gcs=gcs)
```

## Compression Codecs

### Blosc Codec (Default)

```python
from zarr.codecs.blosc import BloscCodec

codec = BloscCodec(
    cname='zstd',       # Compressor: 'blosclz', 'lz4', 'lz4hc', 'snappy', 'zlib', 'zstd'
    clevel=5,           # Compression level: 0-9
    shuffle='shuffle'   # Shuffle filter: 'noshuffle', 'shuffle', 'bitshuffle'
)

z = zarr.create_array(store='data.zarr', shape=(1000, 1000), chunks=(100, 100),
                      dtype='f4', codecs=[codec])
```

**Blosc compressor characteristics:**

- `'lz4'`: Fastest compression, lower ratio
- `'zstd'`: Balanced (default), good ratio and speed
- `'zlib'`: Good compatibility, moderate performance
- `'lz4hc'`: Better ratio than lz4, slower
- `'snappy'`: Fast, moderate ratio
- `'blosclz'`: Blosc's default compressor
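To see these trade-offs on your own data, you can write the same array with two Blosc configurations and compare stored sizes. The sketch below reuses the `zarr.create_array(..., codecs=[...])` call from the example above; the exact keyword for attaching codecs varies between zarr-python 3.x releases (some use `compressors=`), and `nbytes_stored` is read as a property here, matching its use later in this reference, so adjust for your installed version.

```python
import numpy as np
import zarr
from zarr.codecs.blosc import BloscCodec

# Structured (compressible) sample data
data = np.arange(2000 * 2000, dtype='f4').reshape(2000, 2000)

for name, codec in [('lz4', BloscCodec(cname='lz4', clevel=1)),
                    ('zstd', BloscCodec(cname='zstd', clevel=5))]:
    # Write the same data with each codec and report its stored size
    z = zarr.create_array(store=f'compare_{name}.zarr', shape=data.shape,
                          chunks=(500, 500), dtype='f4', codecs=[codec],
                          overwrite=True)
    z[:] = data
    print(f"{name}: {z.nbytes / 1e6:.1f} MB uncompressed -> "
          f"{z.nbytes_stored / 1e6:.1f} MB stored")
```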
### Other Codecs

```python
from zarr.codecs import GzipCodec, ZstdCodec, BytesCodec

# Gzip compression (maximum ratio, slower)
GzipCodec(level=6)   # Level 0-9

# Zstandard compression
ZstdCodec(level=3)   # Level 1-22

# No compression
BytesCodec()
```

## Array Indexing and Selection

### Basic Indexing (NumPy-style)

```python
z = zarr.zeros((1000, 1000), chunks=(100, 100))

# Read
row = z[0, :]            # Single row
col = z[:, 0]            # Single column
block = z[10:20, 50:60]  # Slice
element = z[5, 10]       # Single element

# Write
z[0, :] = 42
z[10:20, 50:60] = np.random.random((10, 10))
```

### Advanced Indexing

```python
# Coordinate indexing (point selection)
z.vindex[[0, 5, 10], [2, 8, 15]]  # Specific coordinates

# Orthogonal indexing (outer product)
z.oindex[0:10, [5, 10, 15]]       # Rows 0-9, columns 5, 10, 15

# Block/chunk indexing
z.blocks[0, 0]      # First chunk
z.blocks[0:2, 0:2]  # First four chunks
```

## Groups and Hierarchies

### Creating Groups

```python
# Create root group
root = zarr.group(store='data.zarr')

# Create nested groups
grp1 = root.create_group('group1')
grp2 = grp1.create_group('subgroup')

# Create arrays in groups
arr = grp1.create_array(name='data', shape=(1000, 1000), chunks=(100, 100), dtype='f4')

# Access by path
arr2 = root['group1/data']
```

### Group Methods

```python
root = zarr.group('data.zarr')

# h5py-compatible methods
dataset = root.create_dataset('data', shape=(1000, 1000), chunks=(100, 100))
subgrp = root.require_group('subgroup')  # Create if it doesn't exist

# Visualize structure
print(root.tree())

# List contents
print(list(root.keys()))
print(list(root.groups()))
print(list(root.arrays()))
```

## Array Attributes and Metadata

### Working with Attributes

```python
z = zarr.zeros((1000, 1000), chunks=(100, 100))

# Set attributes
z.attrs['units'] = 'meters'
z.attrs['description'] = 'Temperature data'
z.attrs['created'] = '2024-01-15'
z.attrs['version'] = 1.2
z.attrs['tags'] = ['climate', 'temperature']

# Read attributes
print(z.attrs['units'])
print(dict(z.attrs))  # All attributes as dict

# Update/delete
z.attrs['version'] = 2.0
del z.attrs['tags']
```

**Note:** Attributes must be JSON-serializable.
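Because attribute values are serialized to JSON, NumPy scalars and arrays should be converted to plain Python types before assignment; depending on the zarr version, storing them directly may raise a `TypeError`. A small illustration:

```python
import numpy as np
import zarr

z = zarr.zeros((10, 10), chunks=(5, 5))

stats = {'mean': np.float64(3.2), 'histogram': np.arange(4)}

# Convert NumPy types to JSON-friendly Python types before storing as attributes
z.attrs['mean'] = float(stats['mean'])               # np.float64 -> float
z.attrs['histogram'] = stats['histogram'].tolist()   # ndarray -> list of ints
```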
## Array Properties and Methods

### Properties

```python
z = zarr.zeros((1000, 1000), chunks=(100, 100), dtype='f4')

z.shape          # (1000, 1000)
z.chunks         # (100, 100)
z.dtype          # dtype('float32')
z.size           # 1000000
z.nbytes         # 4000000 (uncompressed size in bytes)
z.nbytes_stored  # Actual compressed size on disk
z.nchunks        # 100 (number of chunks)
z.cdata_shape    # Shape in terms of chunks: (10, 10)
```

### Methods

```python
# Information
print(z.info)          # Detailed information about the array
print(z.info_items())  # Info as a list of tuples

# Resizing
z.resize((1500, 1500))  # Change dimensions

# Appending
z.append(new_data, axis=0)  # Add data along an axis

# Copying (simple approach: read and re-write to a new store)
z2 = zarr.array(z[:], chunks=z.chunks, store='new_location.zarr')
```

## Chunking Guidelines

### Chunk Size Calculation

```python
# For float32 (4 bytes per element):
# 1 MB = 262,144 elements
# 10 MB = 2,621,440 elements

# Example chunk shapes and their sizes (float32):
(512, 512)       # 2D: 512 × 512 × 4 = 1,048,576 bytes ≈ 1 MB
(128, 128, 128)  # 3D: 128 × 128 × 128 × 4 = 8,388,608 bytes ≈ 8 MB
(64, 256, 256)   # 3D: 64 × 256 × 256 × 4 = 16,777,216 bytes ≈ 16 MB
```

### Chunking Strategies by Access Pattern

**Time series (sequential access along first dimension):**

```python
chunks=(1, 720, 1440)  # One time step per chunk
```

**Row-wise access:**

```python
chunks=(10, 10000)  # Small rows, span columns
```

**Column-wise access:**

```python
chunks=(10000, 10)  # Span rows, small columns
```

**Random access:**

```python
chunks=(500, 500)  # Balanced square chunks
```

**3D volumetric data:**

```python
chunks=(64, 64, 64)  # Cubic chunks for isotropic access
```

## Integration APIs

### NumPy Integration

```python
import numpy as np

z = zarr.zeros((1000, 1000), chunks=(100, 100))

# Use NumPy functions
result = np.sum(z, axis=0)
mean = np.mean(z)
std = np.std(z)

# Convert to NumPy
arr = z[:]  # Loads the entire array into memory
```

### Dask Integration

```python
import dask.array as da

# Load Zarr as a Dask array
dask_array = da.from_zarr('data.zarr')

# Compute operations in parallel
result = dask_array.mean(axis=0).compute()

# Write a Dask array to Zarr
large_array = da.random.random((100000, 100000), chunks=(1000, 1000))
da.to_zarr(large_array, 'output.zarr')
```

### Xarray Integration

```python
import numpy as np
import pandas as pd
import xarray as xr

# Open Zarr as an Xarray Dataset
ds = xr.open_zarr('data.zarr')

# Write Xarray to Zarr
ds.to_zarr('output.zarr')

# Create with coordinates
ds = xr.Dataset(
    {'temperature': (['time', 'lat', 'lon'], data)},
    coords={
        'time': pd.date_range('2024-01-01', periods=365),
        'lat': np.arange(-90, 91, 1),
        'lon': np.arange(-180, 180, 1)
    }
)
ds.to_zarr('climate.zarr')
```

## Parallel Computing

### Synchronizers

```python
from zarr import ThreadSynchronizer, ProcessSynchronizer

# Multi-threaded writes
sync = ThreadSynchronizer()
z = zarr.open_array('data.zarr', mode='r+', synchronizer=sync)

# Multi-process writes
sync = ProcessSynchronizer('sync.sync')
z = zarr.open_array('data.zarr', mode='r+', synchronizer=sync)
```

**Note:** Synchronization is only needed for concurrent writes that may touch the same chunk (for example, writes that span chunk boundaries). It is not needed for reads, which are always safe, nor when each process writes to separate chunks.

## Metadata Consolidation

```python
# Consolidate metadata (after creating all arrays/groups)
zarr.consolidate_metadata('data.zarr')

# Open with consolidated metadata (faster, especially on cloud storage)
root = zarr.open_consolidated('data.zarr')
```

**Benefits:**

- Reduces metadata I/O from N operations to 1
- Critical for cloud storage (reduces latency)
- Speeds up hierarchy traversal

**Cautions:**

- Consolidated metadata can become stale if the data is updated
- Re-consolidate after modifications
- Not suited to frequently-updated datasets
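Because consolidated metadata goes stale when the hierarchy changes, a common workflow is to re-run consolidation right after any modification. A minimal sketch of that pattern (the array name and shape are illustrative):

```python
import zarr

# Add a new array to an existing hierarchy...
root = zarr.open_group('data.zarr', mode='r+')
root.create_array(name='new_data', shape=(100, 100), chunks=(50, 50), dtype='f4')

# ...then refresh the consolidated metadata so open_consolidated() readers
# see the updated hierarchy
zarr.consolidate_metadata('data.zarr')
```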
## Common Patterns

### Time Series with Growing Data

```python
# Start with an empty first dimension
z = zarr.open('timeseries.zarr', mode='a', shape=(0, 720, 1440),
              chunks=(1, 720, 1440), dtype='f4')

# Append new time steps
for new_timestep in data_stream:
    z.append(new_timestep, axis=0)
```

### Processing Large Arrays in Chunks

```python
z = zarr.open('large_data.zarr', mode='r')

# Process without loading the entire array
for i in range(0, z.shape[0], 1000):
    chunk = z[i:i+1000, :]
    result = process(chunk)
    save(result)
```

### Format Conversion Pipeline

```python
import h5py
import numpy as np
import xarray as xr
import zarr

# HDF5 → Zarr
with h5py.File('data.h5', 'r') as h5:
    z = zarr.array(h5['dataset'][:], chunks=(1000, 1000), store='data.zarr')

# Zarr → NumPy file
z = zarr.open('data.zarr', mode='r')
np.save('data.npy', z[:])

# Zarr → NetCDF (via Xarray)
ds = xr.open_zarr('data.zarr')
ds.to_netcdf('data.nc')
```

## Performance Optimization Quick Checklist

1. **Chunk size**: 1-10 MB per chunk
2. **Chunk shape**: Align with the access pattern
3. **Compression**:
   - Fast: `BloscCodec(cname='lz4', clevel=1)`
   - Balanced: `BloscCodec(cname='zstd', clevel=5)`
   - Maximum: `GzipCodec(level=9)`
4. **Cloud storage**:
   - Larger chunks (5-100 MB)
   - Consolidate metadata
   - Consider sharding
5. **Parallel I/O**: Use Dask for large operations
6. **Memory**: Process in chunks; don't load entire arrays

## Debugging and Profiling

```python
z = zarr.open('data.zarr', mode='r')

# Detailed information
print(z.info)

# Size statistics
print(f"Uncompressed: {z.nbytes / 1e6:.2f} MB")
print(f"Compressed: {z.nbytes_stored / 1e6:.2f} MB")
print(f"Ratio: {z.nbytes / z.nbytes_stored:.1f}x")

# Chunk information
print(f"Chunks: {z.chunks}")
print(f"Number of chunks: {z.nchunks}")
print(f"Chunk grid: {z.cdata_shape}")
```

A group-level version of these statistics is sketched at the end of this reference.

## Common Data Types

```python
# Integers
'i1', 'i2', 'i4', 'i8'  # Signed: 8, 16, 32, 64-bit
'u1', 'u2', 'u4', 'u8'  # Unsigned: 8, 16, 32, 64-bit

# Floats
'f2', 'f4', 'f8'  # 16, 32, 64-bit (half, single, double precision)

# Others
'bool'       # Boolean
'c8', 'c16'  # Complex: 64, 128-bit
'S10'        # Fixed-length string (10 bytes)
'U10'        # Unicode string (10 characters)
```

## Version Compatibility

Zarr-Python version 3.x supports both:

- **Zarr v2 format**: Legacy format, widely compatible
- **Zarr v3 format**: New format with sharding and improved metadata

Check the format version:

```python
# Zarr automatically detects the format version on open
z = zarr.open('data.zarr', mode='r')
# Format info is available in the array's metadata
```

## Error Handling

```python
try:
    z = zarr.open_array('data.zarr', mode='r')
except zarr.errors.PathNotFoundError:
    print("Array does not exist")
except zarr.errors.ReadOnlyError:
    print("Cannot write to read-only array")
except Exception as e:
    print(f"Unexpected error: {e}")
```
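The per-array statistics above can be wrapped in a small helper that reports compression for every array at the top level of a group. A minimal sketch (the helper name is illustrative; it relies on the `group.arrays()` iterator shown earlier in this reference):

```python
import zarr

def report_compression(path):
    """Print uncompressed vs. stored size for each array in a group."""
    root = zarr.open_group(path, mode='r')
    for name, arr in root.arrays():
        ratio = arr.nbytes / max(arr.nbytes_stored, 1)  # guard against empty arrays
        print(f"{name}: {arr.nbytes / 1e6:.2f} MB -> "
              f"{arr.nbytes_stored / 1e6:.2f} MB ({ratio:.1f}x)")

report_compression('data.zarr')
```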