# Image Loading & Formats
## Overview
PathML loads whole-slide images (WSI) from more than 160 imaging formats, both proprietary and open. Unified slide classes and interfaces hide vendor-specific details and give consistent access to image pyramids, metadata, and regions of interest across file formats.
## Supported Formats
Commonly used slide formats include:
### Brightfield Microscopy Formats
- **Aperio SVS** (`.svs`) - Leica Biosystems
- **Hamamatsu NDPI** (`.ndpi`) - Hamamatsu Photonics
- **Leica SCN** (`.scn`) - Leica Biosystems
- **Zeiss ZVI** (`.zvi`) - Carl Zeiss
- **3DHISTECH** (`.mrxs`) - 3DHISTECH Ltd.
- **Ventana BIF** (`.bif`) - Roche Ventana
- **Generic tiled TIFF** (`.tif`, `.tiff`)
### Medical Imaging Standards
- **DICOM** (`.dcm`) - Digital Imaging and Communications in Medicine
- **OME-TIFF** (`.ome.tif`, `.ome.tiff`) - Open Microscopy Environment
### Multiparametric Imaging
- **CODEX** - Spatial proteomics imaging
- **Vectra** (`.qptiff`) - Multiplex immunofluorescence
- **MERFISH** - Multiplexed error-robust FISH
PathML relies on OpenSlide, Bio-Formats, and other specialized libraries to handle format-specific details automatically.
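To illustrate how a file name could map to a backend, the sketch below uses a hypothetical helper, `guess_backend`, which is not part of PathML. The extension table and the `"dicom"` backend string are assumptions drawn from the format list above, and the fallback to Bio-Formats is an illustrative choice.
```python
from pathlib import Path

# Hypothetical extension table (an assumption based on the format list above,
# not something PathML exposes).
OPENSLIDE_EXTENSIONS = {".svs", ".ndpi", ".scn", ".mrxs", ".bif", ".tif", ".tiff"}


def guess_backend(path):
    """Return a backend name for `path`, judged only by its file name."""
    name = Path(path).name.lower()
    # Check compound OME-TIFF suffixes first: they also end in .tif/.tiff.
    if name.endswith((".ome.tif", ".ome.tiff")):
        return "bioformats"
    suffix = Path(name).suffix
    if suffix == ".dcm":
        return "dicom"
    if suffix in OPENSLIDE_EXTENSIONS:
        return "openslide"
    # Fall back to Bio-Formats for anything else (an assumption).
    return "bioformats"


for example in ["slide.svs", "scan.ome.tiff", "study/wsi.dcm", "panel.qptiff"]:
    print(example, "->", guess_backend(example))
```
A file extension is only a hint: a `.tif` file may or may not be a tiled pyramid, so a failed load with the guessed backend is still possible.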
## Core Classes for Loading Images
### SlideData
`SlideData` is the fundamental class for representing whole-slide images in PathML.
**Loading from file:**
```python
from pathml.core import SlideData
# Load a whole-slide image
wsi = SlideData.from_slide("path/to/slide.svs")
# Load with specific backend
wsi = SlideData.from_slide("path/to/slide.svs", backend="openslide")
# Load from OME-TIFF
wsi = SlideData.from_slide("path/to/slide.ome.tiff", backend="bioformats")
```
**Key attributes:**
- `wsi.slide` - Backend slide object (OpenSlide, BioFormats, etc.)
- `wsi.tiles` - Collection of image tiles
- `wsi.metadata` - Slide metadata dictionary
- `wsi.level_dimensions` - Image pyramid level dimensions
- `wsi.level_downsamples` - Downsample factors for each pyramid level
**Methods:**
- `wsi.generate_tiles()` - Generate tiles from the slide
- `wsi.read_region()` - Read a specific region at a given level
- `wsi.get_thumbnail()` - Get a thumbnail image
### SlideType
`SlideType` is an enumeration defining supported slide backends:
```python
from pathml.core import SlideType
# Available backends
SlideType.OPENSLIDE # For most WSI formats (SVS, NDPI, etc.)
SlideType.BIOFORMATS # For OME-TIFF and other formats
SlideType.DICOM # For DICOM WSI
SlideType.VectraQPTIFF # For Vectra multiplex IF
```
### Specialized Slide Classes
PathML provides specialized slide classes for specific imaging modalities:
**CODEXSlide:**
```python
from pathml.core import CODEXSlide
# Load CODEX spatial proteomics data
codex_slide = CODEXSlide(
    path="path/to/codex_dir",
    stain="IF",  # Immunofluorescence
    backend="bioformats"
)
```
**VectraSlide:**
```python
from pathml.core import SlideData, SlideType
# Load Vectra multiplex IF data
vectra_slide = SlideData.from_slide(
    "path/to/vectra.qptiff",
    backend=SlideType.VectraQPTIFF
)
```
**MultiparametricSlide:**
```python
from pathml.core import MultiparametricSlide
# Generic multiparametric imaging
mp_slide = MultiparametricSlide(path="path/to/multiparametric_data")
```
## Loading Strategies
### Tile-Based Loading
For large WSI files, tile-based loading enables memory-efficient processing:
```python
from pathml.core import SlideData
# Load slide
wsi = SlideData.from_slide("path/to/slide.svs")
# Generate tiles at specific magnification level
wsi.generate_tiles(
    level=0,        # Pyramid level (0 = highest resolution)
    tile_size=256,  # Tile dimensions in pixels
    stride=256,     # Spacing between tiles (256 = no overlap)
    pad=False       # Whether to pad edge tiles
)
# Iterate over tiles
for tile in wsi.tiles:
    image = tile.image    # numpy array
    coords = tile.coords  # (x, y) coordinates
    # Process tile...
```
**Overlapping tiles:**
```python
# Generate tiles with 50% overlap
wsi.generate_tiles(
    level=0,
    tile_size=256,
    stride=128  # 50% overlap
)
```
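The relationship between `tile_size`, `stride`, and the number of tiles can be worked out without opening a slide. The following is a minimal sketch in plain Python. The helpers `tile_origins` and `tile_grid` are hypothetical, and the sketch assumes that `pad=False` drops tiles that would extend past the image edge while `pad=True` keeps them. PathML's own edge handling may differ.
```python
def tile_origins(length, tile_size, stride, pad=False):
    """Return the start offsets of tiles along one axis of `length` pixels."""
    if tile_size <= 0 or stride <= 0:
        raise ValueError("tile_size and stride must be positive")
    if pad:
        # With padding, every stride step that starts inside the image yields a tile.
        return list(range(0, length, stride))
    # Without padding, keep only tiles that fit entirely inside the image.
    return list(range(0, length - tile_size + 1, stride))


def tile_grid(width, height, tile_size, stride, pad=False):
    """Return (x, y) top-left corners for all tiles, row by row."""
    xs = tile_origins(width, tile_size, stride, pad)
    ys = tile_origins(height, tile_size, stride, pad)
    return [(x, y) for y in ys for x in xs]


no_overlap = tile_grid(1024, 768, tile_size=256, stride=256)
half_overlap = tile_grid(1024, 768, tile_size=256, stride=128)
print(len(no_overlap), len(half_overlap))
```
Halving the stride roughly quadruples the tile count in two dimensions, which is worth keeping in mind when estimating processing time.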
### Region-Based Loading
Extract specific regions of interest directly:
```python
# Read region at specific location and level
region = wsi.read_region(
    location=(10000, 15000),  # (x, y) in level 0 coordinates
    level=1,                  # Pyramid level
    size=(512, 512)           # Width, height in pixels
)
# Returns numpy array
```
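Mixing level 0 coordinates with a size given at another level is a common source of off-by-a-factor mistakes. The sketch below assumes the OpenSlide convention, in which `location` is expressed in level 0 pixels and `size` in pixels of the requested level. Both helper names are hypothetical.
```python
def level0_footprint(location, size, downsample):
    """Return (x0, y0, x1, y1) in level 0 pixels covered by a region read.

    `location` is in level 0 pixels; `size` is in pixels of the requested
    level, whose downsample factor relative to level 0 is `downsample`.
    """
    x, y = location
    width, height = size
    return (x, y, x + round(width * downsample), y + round(height * downsample))


def to_level_coords(point_level0, downsample):
    """Convert a level 0 point to pixel coordinates at a downsampled level."""
    x, y = point_level0
    return (int(x // downsample), int(y // downsample))


# A 512 x 512 read at a level with downsample 4.0 spans 2048 x 2048 level 0 pixels.
print(level0_footprint((10000, 15000), (512, 512), 4.0))
print(to_level_coords((10000, 15000), 4.0))
```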
### Pyramid Level Selection
Whole-slide images are stored in multi-resolution pyramids. Select the appropriate level based on desired magnification:
```python
# Inspect available levels
print(wsi.level_dimensions) # [(width0, height0), (width1, height1), ...]
print(wsi.level_downsamples) # [1.0, 4.0, 16.0, ...]
# Load at lower resolution for faster processing
wsi.generate_tiles(level=2, tile_size=256) # Use level 2 (16x downsampled)
```
**Typical pyramid levels (actual factors vary by scanner and format):**
- Level 0: Full resolution (e.g., 40x magnification)
- Level 1: 4x downsampled (e.g., 10x magnification)
- Level 2: 16x downsampled (e.g., 2.5x magnification)
- Level 3: 64x downsampled (thumbnail)
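Because downsample factors differ between scanners, it is safer to choose a level from `level_downsamples` than to hard-code an index. The sketch below picks the deepest level whose downsample does not exceed a target, similar in spirit to OpenSlide's `get_best_level_for_downsample`. The helper names are hypothetical, and the downsample list is assumed to be sorted in ascending order.
```python
def best_level_for_downsample(level_downsamples, target):
    """Return the index of the largest downsample that does not exceed `target`."""
    best = 0
    for index, downsample in enumerate(level_downsamples):
        if downsample <= target:
            best = index
    return best


def level_magnification(objective_power, downsample):
    """Effective magnification at a level, given the level 0 objective power."""
    return objective_power / downsample


downsamples = [1.0, 4.0, 16.0, 64.0]
level = best_level_for_downsample(downsamples, target=8.0)
print(level, level_magnification(40, downsamples[level]))
```
Choosing the level at or below the target downsample means any remaining resizing shrinks the image rather than enlarging it.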
### Thumbnail Loading
Generate low-resolution thumbnails for visualization and quality control:
```python
# Get thumbnail
thumbnail = wsi.get_thumbnail(size=(1024, 1024))
# Display with matplotlib
import matplotlib.pyplot as plt
plt.imshow(thumbnail)
plt.axis('off')
plt.show()
```
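The returned thumbnail is usually not exactly the requested size. Assuming `size` is treated as a maximum bounding box and the aspect ratio is preserved (as OpenSlide's thumbnail call does), the resulting dimensions can be predicted with a small hypothetical helper:
```python
def fit_within(dimensions, max_size):
    """Scale `dimensions` down to fit inside `max_size`, preserving aspect ratio."""
    width, height = dimensions
    max_width, max_height = max_size
    # Never upscale: cap the scale factor at 1.0.
    scale = min(max_width / width, max_height / height, 1.0)
    return (max(1, round(width * scale)), max(1, round(height * scale)))


print(fit_within((80000, 60000), (1024, 1024)))
```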
## Batch Loading with SlideDataset
Process multiple slides efficiently using `SlideDataset`:
```python
from pathml.core import SlideDataset
import glob
# Create dataset from multiple slides
slide_paths = glob.glob("data/*.svs")
dataset = SlideDataset(
    slide_paths,
    tile_size=256,
    stride=256,
    level=0
)
# Iterate over all tiles from all slides
for tile in dataset:
    image = tile.image
    slide_id = tile.slide_id
    # Process tile...
```
**With preprocessing pipeline:**
```python
from pathml.preprocessing import Pipeline, StainNormalizationHE
# Create pipeline
pipeline = Pipeline([
    StainNormalizationHE(target='normalize')
])
# Apply to entire dataset
dataset = SlideDataset(slide_paths)
dataset.run(pipeline, distributed=True, n_workers=8)
```
## Metadata Access
Extract slide metadata including acquisition parameters, magnification, and vendor-specific information:
```python
# Access metadata
metadata = wsi.metadata
# Common metadata fields
print(metadata.get('openslide.objective-power')) # Magnification
print(metadata.get('openslide.mpp-x')) # Microns per pixel X
print(metadata.get('openslide.mpp-y')) # Microns per pixel Y
print(metadata.get('openslide.vendor')) # Scanner vendor
# Slide dimensions
print(wsi.level_dimensions[0]) # (width, height) at level 0
```
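OpenSlide reports property values as strings, and some slides omit fields entirely, so numeric metadata is best parsed defensively. The sketch below works on a plain dictionary. The helpers `parse_float` and `physical_size_mm` are hypothetical, and the example values are made up for illustration.
```python
def parse_float(metadata, key):
    """Return metadata[key] as a float, or None if missing or not numeric."""
    value = metadata.get(key)
    if value is None:
        return None
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def physical_size_mm(metadata, level0_dimensions):
    """Return (width_mm, height_mm) of the slide, or None if resolution is unknown."""
    mpp_x = parse_float(metadata, "openslide.mpp-x")
    mpp_y = parse_float(metadata, "openslide.mpp-y")
    if mpp_x is None or mpp_y is None:
        return None
    width, height = level0_dimensions
    # Microns per pixel times pixels gives microns; divide by 1000 for millimetres.
    return (width * mpp_x / 1000.0, height * mpp_y / 1000.0)


# Illustrative values only, not taken from a real slide.
example = {"openslide.mpp-x": "0.25", "openslide.mpp-y": "0.25"}
print(physical_size_mm(example, (80000, 60000)))
```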
## Working with DICOM Slides
PathML supports DICOM WSI through specialized handling:
```python
from pathml.core import SlideData, SlideType
# Load DICOM WSI
dicom_slide = SlideData.from_slide(
    "path/to/slide.dcm",
    backend=SlideType.DICOM
)
# DICOM-specific metadata
print(dicom_slide.metadata.get('PatientID'))
print(dicom_slide.metadata.get('StudyDate'))
```
## Working with OME-TIFF
OME-TIFF provides an open standard for multi-dimensional imaging:
```python
from pathml.core import SlideData
# Load OME-TIFF
ome_slide = SlideData.from_slide(
    "path/to/slide.ome.tiff",
    backend="bioformats"
)
# Access channel information for multi-channel images
n_channels = ome_slide.shape[2] # Number of channels
```
## Performance Considerations
### Memory Management
For large WSI files (often larger than 1 GB on disk, and far larger once decompressed), use tile-based loading to avoid memory exhaustion:
```python
# Efficient: Tile-based processing
wsi.generate_tiles(level=1, tile_size=256)
for tile in wsi.tiles:
    process_tile(tile)  # Process one tile at a time
# Inefficient: Loading entire slide into memory
full_image = wsi.read_region(location=(0, 0), level=0, size=wsi.level_dimensions[0])  # May exhaust memory
```
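A quick estimate shows why reading a full level 0 image is risky. The sketch below assumes uncompressed 8-bit RGB pixels. The slide dimensions are illustrative, and both helper names are hypothetical.
```python
def region_bytes(width, height, channels=3, bytes_per_sample=1):
    """Uncompressed size in bytes of a width x height region held in memory."""
    return width * height * channels * bytes_per_sample


def format_gib(n_bytes):
    """Format a byte count in gibibytes with two decimals."""
    return f"{n_bytes / 2**30:.2f} GiB"


full_slide = region_bytes(100_000, 80_000)  # illustrative level 0 dimensions
single_tile = region_bytes(256, 256)
print(format_gib(full_slide), single_tile)
```
Under these assumptions the full image needs more than 20 GiB of memory, while a single 256 x 256 tile needs under 200 KiB.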
### Distributed Processing
Use Dask for parallel processing across multiple workers:
```python
from pathml.core import SlideDataset
from dask.distributed import Client
# Start Dask client
client = Client(n_workers=8, threads_per_worker=2)
# Process dataset in parallel
dataset = SlideDataset(slide_paths)
dataset.run(pipeline, distributed=True, client=client)
```
### Level Selection
Balance resolution and performance by selecting appropriate pyramid levels:
- **Level 0:** Use for final analysis requiring maximum detail
- **Level 1-2:** Use for most preprocessing and model training
- **Level 3+:** Use for thumbnails, quality control, and rapid exploration
## Common Issues and Solutions
**Issue: Slide fails to load**
- Verify file format is supported
- Check file permissions and path
- Try different backend: `backend="bioformats"` or `backend="openslide"`
**Issue: Out of memory errors**
- Use tile-based loading instead of full-slide loading
- Process at lower pyramid level (e.g., level=1 or level=2)
- Reduce the `tile_size` parameter
- Enable distributed processing with Dask
**Issue: Color inconsistencies across slides**
- Apply stain normalization preprocessing (see `preprocessing.md`)
- Check scanner metadata for calibration information
- Use `StainNormalizationHE` transform in preprocessing pipeline
**Issue: Metadata missing or incorrect**
- Different vendors store metadata in different locations
- Use `wsi.metadata` to inspect available fields
- Some formats may have limited metadata support
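For the "slide fails to load" case, trying backends in order can be wrapped in a small helper. This is a sketch under assumptions: `load_with_fallback` is hypothetical, the `loader` argument stands in for whatever loading call is in use, and `fake_loader` exists only to make the example runnable without a slide file.
```python
def load_with_fallback(path, loader, backends=("openslide", "bioformats")):
    """Try each backend in order; return (slide, backend) for the first that works."""
    errors = {}
    for backend in backends:
        try:
            return loader(path, backend=backend), backend
        except Exception as exc:  # broad on purpose: backends raise different error types
            errors[backend] = exc
    details = "; ".join(f"{name}: {err}" for name, err in errors.items())
    raise RuntimeError(f"Could not load {path!r} with any backend ({details})")


def fake_loader(path, backend):
    """Stand-in loader for demonstration: only 'bioformats' succeeds."""
    if backend == "openslide":
        raise OSError("unsupported format")
    return {"path": path, "backend": backend}


slide, used_backend = load_with_fallback("slide.ome.tiff", fake_loader)
print(used_backend)
```
Collecting the per-backend errors in the final exception keeps the original failure reasons visible instead of hiding them behind the last attempt.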
## Best Practices
1. **Always inspect pyramid structure** before processing: Check `level_dimensions` and `level_downsamples` to understand available resolutions
2. **Use appropriate pyramid levels**: Process at level 1-2 for most tasks; reserve level 0 for final high-resolution analysis
3. **Tile with overlap** for segmentation tasks: Use `stride < tile_size` to avoid edge artifacts
4. **Verify magnification consistency**: Check `openslide.objective-power` metadata when combining slides from different sources
5. **Handle vendor-specific formats**: Use specialized slide classes (`CODEXSlide`, `VectraSlide`) for multiparametric data
6. **Implement quality control**: Generate thumbnails and inspect for artifacts before processing
7. **Use distributed processing** for large datasets: Leverage Dask for parallel processing across multiple workers
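The magnification check in item 4 can be automated by grouping slides by their reported objective power. The sketch below operates on plain dictionaries keyed by slide identifier. The helper name and the sample metadata are hypothetical.
```python
from collections import defaultdict


def group_by_magnification(metadata_by_slide, key="openslide.objective-power"):
    """Group slide identifiers by reported objective power (None if missing)."""
    groups = defaultdict(list)
    for slide_id, metadata in metadata_by_slide.items():
        groups[metadata.get(key)].append(slide_id)
    return dict(groups)


# Illustrative metadata only, not taken from real slides.
cohort = {
    "slide_a": {"openslide.objective-power": "40"},
    "slide_b": {"openslide.objective-power": "20"},
    "slide_c": {"openslide.objective-power": "40"},
    "slide_d": {},
}
groups = group_by_magnification(cohort)
if len(groups) > 1:
    print("Mixed or missing magnifications:", groups)
```
More than one group, or a `None` group, signals that tiles from the same pyramid level will not be comparable across the cohort without rescaling.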
## Example Workflows
### Loading and Inspecting a New Slide
```python
from pathml.core import SlideData
import matplotlib.pyplot as plt
# Load slide
wsi = SlideData.from_slide("path/to/slide.svs")
# Inspect properties
print(f"Dimensions: {wsi.level_dimensions}")
print(f"Downsamples: {wsi.level_downsamples}")
print(f"Magnification: {wsi.metadata.get('openslide.objective-power')}")
# Generate thumbnail for QC
thumbnail = wsi.get_thumbnail(size=(1024, 1024))
plt.imshow(thumbnail)
plt.title(f"Slide: {wsi.name}")
plt.axis('off')
plt.show()
```
### Processing Multiple Slides
```python
from pathml.core import SlideDataset
from pathml.preprocessing import Pipeline, TissueDetectionHE
import glob
# Find all slides
slide_paths = glob.glob("data/slides/*.svs")
# Create pipeline
pipeline = Pipeline([TissueDetectionHE()])
# Process all slides
dataset = SlideDataset(
    slide_paths,
    tile_size=512,
    stride=512,
    level=1
)
# Run pipeline with distributed processing
dataset.run(pipeline, distributed=True, n_workers=8)
# Save processed data
dataset.to_hdf5("processed_dataset.h5")
```
### Loading CODEX Multiparametric Data
```python
from pathml.core import CODEXSlide
from pathml.preprocessing import Pipeline, CollapseRunsCODEX, SegmentMIF
# Load CODEX slide
codex = CODEXSlide("path/to/codex_dir", stain="IF")
# Create CODEX-specific pipeline
pipeline = Pipeline([
    CollapseRunsCODEX(z_slice=2),  # Select z-slice
    SegmentMIF(
        nuclear_channel='DAPI',
        cytoplasm_channel='CD45',
        model='mesmer'
    )
])
# Process
codex.run(pipeline)
```
## Additional Resources
- **PathML Documentation:** https://pathml.readthedocs.io/
- **OpenSlide:** https://openslide.org/ (underlying library for WSI formats)
- **Bio-Formats:** https://www.openmicroscopy.org/bio-formats/ (alternative backend)
- **DICOM Standard:** https://www.dicomstandard.org/