Image Loading & Formats
Overview
PathML loads whole-slide images (WSI) from more than 160 proprietary medical imaging formats. It hides vendor-specific details behind unified slide classes and interfaces, giving consistent access to image pyramids, metadata, and regions of interest across file formats.
Supported Formats
PathML supports the following slide formats:
Brightfield Microscopy Formats
- Aperio SVS (.svs) - Leica Biosystems
- Hamamatsu NDPI (.ndpi) - Hamamatsu Photonics
- Leica SCN (.scn) - Leica Biosystems
- Zeiss ZVI (.zvi) - Carl Zeiss
- 3DHISTECH (.mrxs) - 3DHISTECH Ltd.
- Ventana BIF (.bif) - Roche Ventana
- Generic tiled TIFF (.tif, .tiff)
Medical Imaging Standards
- DICOM (.dcm) - Digital Imaging and Communications in Medicine
- OME-TIFF (.ome.tif, .ome.tiff) - Open Microscopy Environment
Multiparametric Imaging
- CODEX - Spatial proteomics imaging
- Vectra (.qptiff) - Multiplex immunofluorescence
- MERFISH - Multiplexed error-robust FISH
PathML leverages OpenSlide and other specialized libraries to handle format-specific nuances automatically.
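As a rough illustration of how a file extension might map to a backend, the sketch below guesses a backend string from a file name. The helper `guess_backend` and its lookup table are hypothetical and not part of PathML. The `"openslide"` and `"bioformats"` strings come from the examples in this guide, while the `"dicom"` string and the assignment of each extension to a backend are assumptions to verify against your installation.

```python
from pathlib import Path

# Hypothetical extension-to-backend table (an assumption, not PathML's own logic).
_BACKEND_BY_SUFFIX = {
    ".svs": "openslide",
    ".ndpi": "openslide",
    ".scn": "openslide",
    ".mrxs": "openslide",
    ".bif": "openslide",
    ".tif": "openslide",
    ".tiff": "openslide",
    ".zvi": "bioformats",
    ".qptiff": "bioformats",
    ".dcm": "dicom",
}


def guess_backend(path):
    """Return a backend name for `path`, or raise ValueError if unknown."""
    name = Path(path).name.lower()
    # OME-TIFF uses a double suffix, so test it before the generic TIFF entries.
    if name.endswith((".ome.tif", ".ome.tiff")):
        return "bioformats"
    suffix = Path(name).suffix
    try:
        return _BACKEND_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"no backend known for {path!r}") from None


if __name__ == "__main__":
    for example in ("slide.svs", "slide.ome.tiff", "scan.dcm"):
        print(example, "->", guess_backend(example))
```

If loading fails with the guessed backend, trying the other one is still the first thing to check.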
Core Classes for Loading Images
SlideData
SlideData is the fundamental class for representing whole-slide images in PathML.
Loading from file:
from pathml.core import SlideData
# Load a whole-slide image
wsi = SlideData.from_slide("path/to/slide.svs")
# Load with specific backend
wsi = SlideData.from_slide("path/to/slide.svs", backend="openslide")
# Load from OME-TIFF
wsi = SlideData.from_slide("path/to/slide.ome.tiff", backend="bioformats")
Key attributes:
- wsi.slide - Backend slide object (OpenSlide, BioFormats, etc.)
- wsi.tiles - Collection of image tiles
- wsi.metadata - Slide metadata dictionary
- wsi.level_dimensions - Image pyramid level dimensions
- wsi.level_downsamples - Downsample factors for each pyramid level
Methods:
- wsi.generate_tiles() - Generate tiles from the slide
- wsi.read_region() - Read a specific region at a given level
- wsi.get_thumbnail() - Get a thumbnail image
SlideType
SlideType is an enumeration defining supported slide backends:
from pathml.core import SlideType
# Available backends
SlideType.OPENSLIDE # For most WSI formats (SVS, NDPI, etc.)
SlideType.BIOFORMATS # For OME-TIFF and other formats
SlideType.DICOM # For DICOM WSI
SlideType.VectraQPTIFF # For Vectra multiplex IF
Specialized Slide Classes
PathML provides specialized slide classes for specific imaging modalities:
CODEXSlide:
from pathml.core import CODEXSlide
# Load CODEX spatial proteomics data
codex_slide = CODEXSlide(
    path="path/to/codex_dir",
    stain="IF",  # Immunofluorescence
    backend="bioformats"
)
VectraSlide:
from pathml.core import SlideData, SlideType
# Load Vectra multiplex IF data
vectra_slide = SlideData.from_slide(
    "path/to/vectra.qptiff",
    backend=SlideType.VectraQPTIFF
)
MultiparametricSlide:
from pathml.core import MultiparametricSlide
# Generic multiparametric imaging
mp_slide = MultiparametricSlide(path="path/to/multiparametric_data")
Loading Strategies
Tile-Based Loading
For large WSI files, tile-based loading enables memory-efficient processing:
from pathml.core import SlideData
# Load slide
wsi = SlideData.from_slide("path/to/slide.svs")
# Generate tiles at specific magnification level
wsi.generate_tiles(
    level=0,        # Pyramid level (0 = highest resolution)
    tile_size=256,  # Tile dimensions in pixels
    stride=256,     # Spacing between tiles (256 = no overlap)
    pad=False       # Whether to pad edge tiles
)
# Iterate over tiles
for tile in wsi.tiles:
    image = tile.image    # numpy array
    coords = tile.coords  # (x, y) coordinates
    # Process tile...
Overlapping tiles:
# Generate tiles with 50% overlap
wsi.generate_tiles(
    level=0,
    tile_size=256,
    stride=128  # 50% overlap
)
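To make the relationship between `tile_size`, `stride`, and overlap concrete, here is a minimal pure-Python sketch that lists tile origins on a regular grid. The helpers `tile_origins` and `overlap_fraction` are hypothetical and only illustrate the arithmetic. In particular, the treatment of `pad` (keeping edge tiles that overhang the image) is an assumption about what padding means, not a description of PathML's internals.

```python
def tile_origins(width, height, tile_size, stride, pad=False):
    """Return (x, y) top-left corners of tiles on a regular grid.

    With pad=False only tiles that fit entirely inside the image are kept.
    With pad=True every origin inside the image is kept, so edge tiles may
    overhang and would need padding (assumed behaviour, for illustration).
    """
    if tile_size <= 0 or stride <= 0:
        raise ValueError("tile_size and stride must be positive")

    def axis(extent):
        if pad:
            return range(0, extent, stride)
        return range(0, extent - tile_size + 1, stride)

    return [(x, y) for y in axis(height) for x in axis(width)]


def overlap_fraction(tile_size, stride):
    """Fraction of a tile shared with its neighbour along one axis."""
    return max(0.0, 1.0 - stride / tile_size)


if __name__ == "__main__":
    print(len(tile_origins(1000, 1000, 256, 256)))            # tiles without padding
    print(len(tile_origins(1000, 1000, 256, 256, pad=True)))  # tiles with padding
    print(overlap_fraction(256, 128))                         # 50% overlap case
```

Halving the stride roughly quadruples the number of tiles, which is worth budgeting for before enabling overlap on level 0.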
Region-Based Loading
Extract specific regions of interest directly:
# Read region at specific location and level
region = wsi.read_region(
    location=(10000, 15000),  # (x, y) in level 0 coordinates
    level=1,                  # Pyramid level
    size=(512, 512)           # Width, height in pixels
)
# Returns numpy array
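Because the location is given in level 0 coordinates while the size is in pixels at the requested level, it is easy to mix the two up. The sketch below shows the conversion, assuming the convention stated in the comments above and a known downsample factor for the level. The helper names are hypothetical.

```python
def level_to_level0(xy, downsample):
    """Convert an (x, y) position at some pyramid level to level 0 coordinates."""
    x, y = xy
    return (int(round(x * downsample)), int(round(y * downsample)))


def level0_footprint(size, downsample):
    """Area of level 0, in pixels, covered by a (width, height) region at a level."""
    width, height = size
    return (int(round(width * downsample)), int(round(height * downsample)))


if __name__ == "__main__":
    downsample = 4.0  # e.g. level_downsamples[1] in the example pyramid
    print(level_to_level0((2500, 3750), downsample))
    print(level0_footprint((512, 512), downsample))
```

With a 4x downsample, a 512 x 512 region at level 1 therefore covers a 2048 x 2048 area of the full-resolution image.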
Pyramid Level Selection
Whole-slide images are stored in multi-resolution pyramids. Select the appropriate level based on desired magnification:
# Inspect available levels
print(wsi.level_dimensions) # [(width0, height0), (width1, height1), ...]
print(wsi.level_downsamples) # [1.0, 4.0, 16.0, ...]
# Load at lower resolution for faster processing
wsi.generate_tiles(level=2, tile_size=256) # Use level 2 (16x downsampled)
Common pyramid levels:
- Level 0: Full resolution (e.g., 40x magnification)
- Level 1: 4x downsampled (e.g., 10x magnification)
- Level 2: 16x downsampled (e.g., 2.5x magnification)
- Level 3: 64x downsampled (thumbnail)
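The level numbers above are only typical; actual downsample factors vary by scanner. A small sketch of picking a level from a target magnification is shown below. It assumes the base magnification is known (for example from the objective-power metadata field) and that `level_downsamples` is sorted in ascending order. The function names are hypothetical.

```python
def effective_magnification(base_magnification, downsample):
    """Magnification seen at a pyramid level with the given downsample."""
    return base_magnification / downsample


def level_for_magnification(base_magnification, target_magnification, level_downsamples):
    """Pick the lowest-resolution level that still reaches the target magnification.

    Assumes level_downsamples is ascending, e.g. [1.0, 4.0, 16.0, 64.0].
    """
    if target_magnification <= 0 or target_magnification > base_magnification:
        raise ValueError("target magnification must be in (0, base_magnification]")
    wanted = base_magnification / target_magnification
    best = 0
    for level, downsample in enumerate(level_downsamples):
        # Small tolerance so that values like 3.9999999 still count as 4x.
        if downsample <= wanted * (1 + 1e-9):
            best = level
    return best


if __name__ == "__main__":
    downsamples = [1.0, 4.0, 16.0, 64.0]
    for target in (40, 20, 10, 2.5):
        level = level_for_magnification(40, target, downsamples)
        print(target, "->", level, effective_magnification(40, downsamples[level]))
```

When the target falls between two levels (20x in this pyramid), the sketch returns the higher-resolution level, so the tiles would still need to be resized afterwards.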
Thumbnail Loading
Generate low-resolution thumbnails for visualization and quality control:
# Get thumbnail
thumbnail = wsi.get_thumbnail(size=(1024, 1024))
# Display with matplotlib
import matplotlib.pyplot as plt
plt.imshow(thumbnail)
plt.axis('off')
plt.show()
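If the requested size is treated as a bounding box, as OpenSlide's own thumbnail function does, the returned image keeps the slide's aspect ratio and may be smaller than the box along one axis. The sketch below computes that size with integer arithmetic. The helper `fit_within` is hypothetical, and whether the wrapper used in this guide rounds the same way is an assumption.

```python
def fit_within(dimensions, box):
    """Largest (width, height) with the same aspect ratio that fits inside box.

    Never upscales: an image already smaller than the box is returned as-is.
    """
    width, height = dimensions
    box_w, box_h = box
    if width <= 0 or height <= 0 or box_w <= 0 or box_h <= 0:
        raise ValueError("all dimensions must be positive")
    if width <= box_w and height <= box_h:
        return (width, height)
    if width * box_h >= height * box_w:
        # Width is the limiting side.
        return (box_w, max(1, height * box_w // width))
    # Height is the limiting side.
    return (max(1, width * box_h // height), box_h)


if __name__ == "__main__":
    print(fit_within((100000, 50000), (1024, 1024)))
    print(fit_within((50000, 100000), (1024, 1024)))
```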
Batch Loading with SlideDataset
Process multiple slides efficiently using SlideDataset:
from pathml.core import SlideDataset
import glob
# Create dataset from multiple slides
slide_paths = glob.glob("data/*.svs")
dataset = SlideDataset(
    slide_paths,
    tile_size=256,
    stride=256,
    level=0
)
# Iterate over all tiles from all slides
for tile in dataset:
    image = tile.image
    slide_id = tile.slide_id
    # Process tile...
With preprocessing pipeline:
from pathml.preprocessing import Pipeline, StainNormalizationHE
# Create pipeline
pipeline = Pipeline([
    StainNormalizationHE(target='normalize')
])
# Apply to entire dataset
dataset = SlideDataset(slide_paths)
dataset.run(pipeline, distributed=True, n_workers=8)
Metadata Access
Extract slide metadata including acquisition parameters, magnification, and vendor-specific information:
# Access metadata
metadata = wsi.metadata
# Common metadata fields
print(metadata.get('openslide.objective-power')) # Magnification
print(metadata.get('openslide.mpp-x')) # Microns per pixel X
print(metadata.get('openslide.mpp-y')) # Microns per pixel Y
print(metadata.get('openslide.vendor')) # Scanner vendor
# Slide dimensions
print(wsi.level_dimensions[0]) # (width, height) at level 0
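OpenSlide reports property values as strings, and fields such as the microns-per-pixel entries can be absent for some vendors. The sketch below parses them defensively and derives the physical slide size. It assumes the metadata behaves like a plain dictionary with the `openslide.*` keys shown above; the helper names are hypothetical.

```python
def _to_float(value):
    """Parse a metadata value, returning None when it is missing or malformed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def physical_size_mm(metadata, level0_dimensions):
    """Return (width_mm, height_mm) of the slide, or None if mpp is unavailable."""
    mpp_x = _to_float(metadata.get("openslide.mpp-x"))
    mpp_y = _to_float(metadata.get("openslide.mpp-y"))
    if mpp_x is None or mpp_y is None:
        return None
    width, height = level0_dimensions
    # Microns per pixel times pixels gives microns; divide by 1000 for millimetres.
    return (width * mpp_x / 1000.0, height * mpp_y / 1000.0)


if __name__ == "__main__":
    example = {"openslide.mpp-x": "0.25", "openslide.mpp-y": "0.25"}
    print(physical_size_mm(example, (100000, 80000)))
    print(physical_size_mm({}, (100000, 80000)))
```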
Working with DICOM Slides
PathML supports DICOM WSI through specialized handling:
from pathml.core import SlideData, SlideType
# Load DICOM WSI
dicom_slide = SlideData.from_slide(
    "path/to/slide.dcm",
    backend=SlideType.DICOM
)
# DICOM-specific metadata
print(dicom_slide.metadata.get('PatientID'))
print(dicom_slide.metadata.get('StudyDate'))
Working with OME-TIFF
OME-TIFF provides an open standard for multi-dimensional imaging:
from pathml.core import SlideData
# Load OME-TIFF
ome_slide = SlideData.from_slide(
    "path/to/slide.ome.tiff",
    backend="bioformats"
)
# Access channel information for multi-channel images
n_channels = ome_slide.shape[2] # Number of channels
Performance Considerations
Memory Management
For large WSI files (often >1GB), use tile-based loading to avoid memory exhaustion:
# Efficient: tile-based processing
wsi.generate_tiles(level=1, tile_size=256)
for tile in wsi.tiles:
    process_tile(tile)  # process_tile is a placeholder; one tile in memory at a time
# Inefficient: loading the entire slide into memory
full_image = wsi.read_region(
    location=(0, 0),
    level=0,
    size=wsi.level_dimensions[0]
)  # May exhaust memory
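A quick back-of-the-envelope estimate shows why the full-slide read is risky. The sketch below computes the uncompressed size of an image, assuming 8-bit RGB pixels (three channels, one byte each). The slide dimensions used are made up for illustration.

```python
def uncompressed_bytes(dimensions, channels=3, bytes_per_sample=1):
    """Bytes needed to hold a (width, height) image fully decoded in memory."""
    width, height = dimensions
    return width * height * channels * bytes_per_sample


def as_gib(n_bytes):
    """Convert a byte count to gibibytes."""
    return n_bytes / 2**30


if __name__ == "__main__":
    full_slide = uncompressed_bytes((100000, 80000))  # hypothetical level 0 size
    one_tile = uncompressed_bytes((256, 256))
    print(f"full slide: {as_gib(full_slide):.1f} GiB")
    print(f"one tile:   {one_tile} bytes")
```

A file that occupies around 1 GB on disk can thus expand to tens of gigabytes once decompressed, whereas a single 256 x 256 tile needs under 200 KB.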
Distributed Processing
Use Dask for parallel processing across multiple workers:
from pathml.core import SlideDataset
from dask.distributed import Client
# Start Dask client
client = Client(n_workers=8, threads_per_worker=2)
# Process dataset in parallel
dataset = SlideDataset(slide_paths)
dataset.run(pipeline, distributed=True, client=client)
Level Selection
Balance resolution and performance by selecting appropriate pyramid levels:
- Level 0: Use for final analysis requiring maximum detail
- Level 1-2: Use for most preprocessing and model training
- Level 3+: Use for thumbnails, quality control, and rapid exploration
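The cost difference between levels can be estimated before any pixels are read by counting tiles per level. This is a minimal sketch with a hypothetical helper; it counts partial edge tiles as full tiles, and the pyramid dimensions are invented for the example.

```python
import math


def tiles_per_level(level_dimensions, tile_size):
    """Number of tile_size x tile_size tiles needed to cover each pyramid level."""
    if tile_size <= 0:
        raise ValueError("tile_size must be positive")
    return [
        math.ceil(width / tile_size) * math.ceil(height / tile_size)
        for width, height in level_dimensions
    ]


if __name__ == "__main__":
    # Hypothetical pyramid with 4x steps between levels.
    dims = [(102400, 81920), (25600, 20480), (6400, 5120)]
    for level, count in enumerate(tiles_per_level(dims, 256)):
        print(f"level {level}: {count} tiles")
```

Each 4x step between levels cuts the tile count by about 16, which is why levels 1 and 2 are usually enough for preprocessing and exploration.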
Common Issues and Solutions
Issue: Slide fails to load
- Verify file format is supported
- Check file permissions and path
- Try a different backend: backend="bioformats" or backend="openslide"
Issue: Out of memory errors
- Use tile-based loading instead of full-slide loading
- Process at lower pyramid level (e.g., level=1 or level=2)
- Reduce tile_size parameter
- Enable distributed processing with Dask
Issue: Color inconsistencies across slides
- Apply stain normalization preprocessing (see preprocessing.md)
- Check scanner metadata for calibration information
- Use the StainNormalizationHE transform in the preprocessing pipeline
Issue: Metadata missing or incorrect
- Different vendors store metadata in different locations
- Use wsi.metadata to inspect available fields
- Some formats may have limited metadata support
Best Practices
- Always inspect pyramid structure before processing: Check level_dimensions and level_downsamples to understand available resolutions
- Use appropriate pyramid levels: Process at level 1-2 for most tasks; reserve level 0 for final high-resolution analysis
- Tile with overlap for segmentation tasks: Use stride < tile_size to avoid edge artifacts
- Verify magnification consistency: Check openslide.objective-power metadata when combining slides from different sources
- Handle vendor-specific formats: Use specialized slide classes (CODEXSlide, VectraSlide) for multiparametric data
- Implement quality control: Generate thumbnails and inspect for artifacts before processing
- Use distributed processing for large datasets: Leverage Dask for parallel processing across multiple workers
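The magnification check in particular is easy to automate. The sketch below groups slides by their reported objective power so that mixed cohorts stand out. It assumes each slide's metadata is available as a dictionary keyed by slide name; the helper name and the example slide names are hypothetical.

```python
from collections import defaultdict


def group_by_objective_power(metadata_by_slide):
    """Group slide names by objective power; None collects missing or bad values."""
    groups = defaultdict(list)
    for name, metadata in metadata_by_slide.items():
        raw = metadata.get("openslide.objective-power")
        try:
            power = float(raw)
        except (TypeError, ValueError):
            power = None
        groups[power].append(name)
    return dict(groups)


if __name__ == "__main__":
    cohort = {
        "slide_a": {"openslide.objective-power": "40"},
        "slide_b": {"openslide.objective-power": "20"},
        "slide_c": {},
    }
    groups = group_by_objective_power(cohort)
    if len(groups) > 1:
        print("Mixed or missing magnifications:", groups)
```

Slides that land in the `None` group need their magnification confirmed from another source before they are combined with the rest.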
Example Workflows
Loading and Inspecting a New Slide
from pathml.core import SlideData
import matplotlib.pyplot as plt
# Load slide
wsi = SlideData.from_slide("path/to/slide.svs")
# Inspect properties
print(f"Dimensions: {wsi.level_dimensions}")
print(f"Downsamples: {wsi.level_downsamples}")
print(f"Magnification: {wsi.metadata.get('openslide.objective-power')}")
# Generate thumbnail for QC
thumbnail = wsi.get_thumbnail(size=(1024, 1024))
plt.imshow(thumbnail)
plt.title(f"Slide: {wsi.name}")
plt.axis('off')
plt.show()
Processing Multiple Slides
from pathml.core import SlideDataset
from pathml.preprocessing import Pipeline, TissueDetectionHE
import glob
# Find all slides
slide_paths = glob.glob("data/slides/*.svs")
# Create pipeline
pipeline = Pipeline([TissueDetectionHE()])
# Process all slides
dataset = SlideDataset(
    slide_paths,
    tile_size=512,
    stride=512,
    level=1
)
# Run pipeline with distributed processing
dataset.run(pipeline, distributed=True, n_workers=8)
# Save processed data
dataset.to_hdf5("processed_dataset.h5")
Loading CODEX Multiparametric Data
from pathml.core import CODEXSlide
from pathml.preprocessing import Pipeline, CollapseRunsCODEX, SegmentMIF
# Load CODEX slide
codex = CODEXSlide("path/to/codex_dir", stain="IF")
# Create CODEX-specific pipeline
pipeline = Pipeline([
    CollapseRunsCODEX(z_slice=2),  # Select z-slice
    SegmentMIF(
        nuclear_channel='DAPI',
        cytoplasm_channel='CD45',
        model='mesmer'
    )
])
# Process
pipeline.run(codex)
Additional Resources
- PathML Documentation: https://pathml.readthedocs.io/
- OpenSlide: https://openslide.org/ (underlying library for WSI formats)
- Bio-Formats: https://www.openmicroscopy.org/bio-formats/ (alternative backend)
- DICOM Standard: https://www.dicomstandard.org/