# Image Loading & Formats ## Overview PathML provides comprehensive support for loading whole-slide images (WSI) from 160+ proprietary medical imaging formats. The framework abstracts vendor-specific complexities through unified slide classes and interfaces, enabling seamless access to image pyramids, metadata, and regions of interest across different file formats. ## Supported Formats PathML supports the following slide formats: ### Brightfield Microscopy Formats - **Aperio SVS** (`.svs`) - Leica Biosystems - **Hamamatsu NDPI** (`.ndpi`) - Hamamatsu Photonics - **Leica SCN** (`.scn`) - Leica Biosystems - **Zeiss ZVI** (`.zvi`) - Carl Zeiss - **3DHISTECH** (`.mrxs`) - 3DHISTECH Ltd. - **Ventana BIF** (`.bif`) - Roche Ventana - **Generic tiled TIFF** (`.tif`, `.tiff`) ### Medical Imaging Standards - **DICOM** (`.dcm`) - Digital Imaging and Communications in Medicine - **OME-TIFF** (`.ome.tif`, `.ome.tiff`) - Open Microscopy Environment ### Multiparametric Imaging - **CODEX** - Spatial proteomics imaging - **Vectra** (`.qptiff`) - Multiplex immunofluorescence - **MERFISH** - Multiplexed error-robust FISH PathML leverages OpenSlide and other specialized libraries to handle format-specific nuances automatically. ## Core Classes for Loading Images ### SlideData `SlideData` is the fundamental class for representing whole-slide images in PathML. **Loading from file:** ```python from pathml.core import SlideData # Load a whole-slide image wsi = SlideData.from_slide("path/to/slide.svs") # Load with specific backend wsi = SlideData.from_slide("path/to/slide.svs", backend="openslide") # Load from OME-TIFF wsi = SlideData.from_slide("path/to/slide.ome.tiff", backend="bioformats") ``` **Key attributes:** - `wsi.slide` - Backend slide object (OpenSlide, BioFormats, etc.) - `wsi.tiles` - Collection of image tiles - `wsi.metadata` - Slide metadata dictionary - `wsi.level_dimensions` - Image pyramid level dimensions - `wsi.level_downsamples` - Downsample factors for each pyramid level **Methods:** - `wsi.generate_tiles()` - Generate tiles from the slide - `wsi.read_region()` - Read a specific region at a given level - `wsi.get_thumbnail()` - Get a thumbnail image ### SlideType `SlideType` is an enumeration defining supported slide backends: ```python from pathml.core import SlideType # Available backends SlideType.OPENSLIDE # For most WSI formats (SVS, NDPI, etc.) SlideType.BIOFORMATS # For OME-TIFF and other formats SlideType.DICOM # For DICOM WSI SlideType.VectraQPTIFF # For Vectra multiplex IF ``` ### Specialized Slide Classes PathML provides specialized slide classes for specific imaging modalities: **CODEXSlide:** ```python from pathml.core import CODEXSlide # Load CODEX spatial proteomics data codex_slide = CODEXSlide( path="path/to/codex_dir", stain="IF", # Immunofluorescence backend="bioformats" ) ``` **VectraSlide:** ```python from pathml.core import types # Load Vectra multiplex IF data vectra_slide = SlideData.from_slide( "path/to/vectra.qptiff", backend=SlideType.VectraQPTIFF ) ``` **MultiparametricSlide:** ```python from pathml.core import MultiparametricSlide # Generic multiparametric imaging mp_slide = MultiparametricSlide(path="path/to/multiparametric_data") ``` ## Loading Strategies ### Tile-Based Loading For large WSI files, tile-based loading enables memory-efficient processing: ```python from pathml.core import SlideData # Load slide wsi = SlideData.from_slide("path/to/slide.svs") # Generate tiles at specific magnification level wsi.generate_tiles( level=0, # Pyramid level (0 = highest resolution) tile_size=256, # Tile dimensions in pixels stride=256, # Spacing between tiles (256 = no overlap) pad=False # Whether to pad edge tiles ) # Iterate over tiles for tile in wsi.tiles: image = tile.image # numpy array coords = tile.coords # (x, y) coordinates # Process tile... ``` **Overlapping tiles:** ```python # Generate tiles with 50% overlap wsi.generate_tiles( level=0, tile_size=256, stride=128 # 50% overlap ) ``` ### Region-Based Loading Extract specific regions of interest directly: ```python # Read region at specific location and level region = wsi.read_region( location=(10000, 15000), # (x, y) in level 0 coordinates level=1, # Pyramid level size=(512, 512) # Width, height in pixels ) # Returns numpy array ``` ### Pyramid Level Selection Whole-slide images are stored in multi-resolution pyramids. Select the appropriate level based on desired magnification: ```python # Inspect available levels print(wsi.level_dimensions) # [(width0, height0), (width1, height1), ...] print(wsi.level_downsamples) # [1.0, 4.0, 16.0, ...] # Load at lower resolution for faster processing wsi.generate_tiles(level=2, tile_size=256) # Use level 2 (16x downsampled) ``` **Common pyramid levels:** - Level 0: Full resolution (e.g., 40x magnification) - Level 1: 4x downsampled (e.g., 10x magnification) - Level 2: 16x downsampled (e.g., 2.5x magnification) - Level 3: 64x downsampled (thumbnail) ### Thumbnail Loading Generate low-resolution thumbnails for visualization and quality control: ```python # Get thumbnail thumbnail = wsi.get_thumbnail(size=(1024, 1024)) # Display with matplotlib import matplotlib.pyplot as plt plt.imshow(thumbnail) plt.axis('off') plt.show() ``` ## Batch Loading with SlideDataset Process multiple slides efficiently using `SlideDataset`: ```python from pathml.core import SlideDataset import glob # Create dataset from multiple slides slide_paths = glob.glob("data/*.svs") dataset = SlideDataset( slide_paths, tile_size=256, stride=256, level=0 ) # Iterate over all tiles from all slides for tile in dataset: image = tile.image slide_id = tile.slide_id # Process tile... ``` **With preprocessing pipeline:** ```python from pathml.preprocessing import Pipeline, StainNormalizationHE # Create pipeline pipeline = Pipeline([ StainNormalizationHE(target='normalize') ]) # Apply to entire dataset dataset = SlideDataset(slide_paths) dataset.run(pipeline, distributed=True, n_workers=8) ``` ## Metadata Access Extract slide metadata including acquisition parameters, magnification, and vendor-specific information: ```python # Access metadata metadata = wsi.metadata # Common metadata fields print(metadata.get('openslide.objective-power')) # Magnification print(metadata.get('openslide.mpp-x')) # Microns per pixel X print(metadata.get('openslide.mpp-y')) # Microns per pixel Y print(metadata.get('openslide.vendor')) # Scanner vendor # Slide dimensions print(wsi.level_dimensions[0]) # (width, height) at level 0 ``` ## Working with DICOM Slides PathML supports DICOM WSI through specialized handling: ```python from pathml.core import SlideData, SlideType # Load DICOM WSI dicom_slide = SlideData.from_slide( "path/to/slide.dcm", backend=SlideType.DICOM ) # DICOM-specific metadata print(dicom_slide.metadata.get('PatientID')) print(dicom_slide.metadata.get('StudyDate')) ``` ## Working with OME-TIFF OME-TIFF provides an open standard for multi-dimensional imaging: ```python from pathml.core import SlideData # Load OME-TIFF ome_slide = SlideData.from_slide( "path/to/slide.ome.tiff", backend="bioformats" ) # Access channel information for multi-channel images n_channels = ome_slide.shape[2] # Number of channels ``` ## Performance Considerations ### Memory Management For large WSI files (often >1GB), use tile-based loading to avoid memory exhaustion: ```python # Efficient: Tile-based processing wsi.generate_tiles(level=1, tile_size=256) for tile in wsi.tiles: process_tile(tile) # Process one tile at a time # Inefficient: Loading entire slide into memory full_image = wsi.read_region((0, 0), level=0, wsi.level_dimensions[0]) # May crash ``` ### Distributed Processing Use Dask for parallel processing across multiple workers: ```python from pathml.core import SlideDataset from dask.distributed import Client # Start Dask client client = Client(n_workers=8, threads_per_worker=2) # Process dataset in parallel dataset = SlideDataset(slide_paths) dataset.run(pipeline, distributed=True, client=client) ``` ### Level Selection Balance resolution and performance by selecting appropriate pyramid levels: - **Level 0:** Use for final analysis requiring maximum detail - **Level 1-2:** Use for most preprocessing and model training - **Level 3+:** Use for thumbnails, quality control, and rapid exploration ## Common Issues and Solutions **Issue: Slide fails to load** - Verify file format is supported - Check file permissions and path - Try different backend: `backend="bioformats"` or `backend="openslide"` **Issue: Out of memory errors** - Use tile-based loading instead of full-slide loading - Process at lower pyramid level (e.g., level=1 or level=2) - Reduce tile_size parameter - Enable distributed processing with Dask **Issue: Color inconsistencies across slides** - Apply stain normalization preprocessing (see `preprocessing.md`) - Check scanner metadata for calibration information - Use `StainNormalizationHE` transform in preprocessing pipeline **Issue: Metadata missing or incorrect** - Different vendors store metadata in different locations - Use `wsi.metadata` to inspect available fields - Some formats may have limited metadata support ## Best Practices 1. **Always inspect pyramid structure** before processing: Check `level_dimensions` and `level_downsamples` to understand available resolutions 2. **Use appropriate pyramid levels**: Process at level 1-2 for most tasks; reserve level 0 for final high-resolution analysis 3. **Tile with overlap** for segmentation tasks: Use stride < tile_size to avoid edge artifacts 4. **Verify magnification consistency**: Check `openslide.objective-power` metadata when combining slides from different sources 5. **Handle vendor-specific formats**: Use specialized slide classes (CODEXSlide, VectraSlide) for multiparametric data 6. **Implement quality control**: Generate thumbnails and inspect for artifacts before processing 7. **Use distributed processing** for large datasets: Leverage Dask for parallel processing across multiple workers ## Example Workflows ### Loading and Inspecting a New Slide ```python from pathml.core import SlideData import matplotlib.pyplot as plt # Load slide wsi = SlideData.from_slide("path/to/slide.svs") # Inspect properties print(f"Dimensions: {wsi.level_dimensions}") print(f"Downsamples: {wsi.level_downsamples}") print(f"Magnification: {wsi.metadata.get('openslide.objective-power')}") # Generate thumbnail for QC thumbnail = wsi.get_thumbnail(size=(1024, 1024)) plt.imshow(thumbnail) plt.title(f"Slide: {wsi.name}") plt.axis('off') plt.show() ``` ### Processing Multiple Slides ```python from pathml.core import SlideDataset from pathml.preprocessing import Pipeline, TissueDetectionHE import glob # Find all slides slide_paths = glob.glob("data/slides/*.svs") # Create pipeline pipeline = Pipeline([TissueDetectionHE()]) # Process all slides dataset = SlideDataset( slide_paths, tile_size=512, stride=512, level=1 ) # Run pipeline with distributed processing dataset.run(pipeline, distributed=True, n_workers=8) # Save processed data dataset.to_hdf5("processed_dataset.h5") ``` ### Loading CODEX Multiparametric Data ```python from pathml.core import CODEXSlide from pathml.preprocessing import Pipeline, CollapseRunsCODEX, SegmentMIF # Load CODEX slide codex = CODEXSlide("path/to/codex_dir", stain="IF") # Create CODEX-specific pipeline pipeline = Pipeline([ CollapseRunsCODEX(z_slice=2), # Select z-slice SegmentMIF( nuclear_channel='DAPI', cytoplasm_channel='CD45', model='mesmer' ) ]) # Process pipeline.run(codex) ``` ## Additional Resources - **PathML Documentation:** https://pathml.readthedocs.io/ - **OpenSlide:** https://openslide.org/ (underlying library for WSI formats) - **Bio-Formats:** https://www.openmicroscopy.org/bio-formats/ (alternative backend) - **DICOM Standard:** https://www.dicomstandard.org/