Initial commit
This commit is contained in:
448
skills/pathml/references/image_loading.md
Normal file
448
skills/pathml/references/image_loading.md
Normal file
@@ -0,0 +1,448 @@
|
||||
# Image Loading & Formats
|
||||
|
||||
## Overview
|
||||
|
||||
PathML provides comprehensive support for loading whole-slide images (WSI) from 160+ proprietary medical imaging formats. The framework abstracts vendor-specific complexities through unified slide classes and interfaces, enabling seamless access to image pyramids, metadata, and regions of interest across different file formats.
|
||||
|
||||
## Supported Formats
|
||||
|
||||
PathML supports the following slide formats:
|
||||
|
||||
### Brightfield Microscopy Formats
|
||||
- **Aperio SVS** (`.svs`) - Leica Biosystems
|
||||
- **Hamamatsu NDPI** (`.ndpi`) - Hamamatsu Photonics
|
||||
- **Leica SCN** (`.scn`) - Leica Biosystems
|
||||
- **Zeiss ZVI** (`.zvi`) - Carl Zeiss
|
||||
- **3DHISTECH** (`.mrxs`) - 3DHISTECH Ltd.
|
||||
- **Ventana BIF** (`.bif`) - Roche Ventana
|
||||
- **Generic tiled TIFF** (`.tif`, `.tiff`)
|
||||
|
||||
### Medical Imaging Standards
|
||||
- **DICOM** (`.dcm`) - Digital Imaging and Communications in Medicine
|
||||
- **OME-TIFF** (`.ome.tif`, `.ome.tiff`) - Open Microscopy Environment
|
||||
|
||||
### Multiparametric Imaging
|
||||
- **CODEX** - Spatial proteomics imaging
|
||||
- **Vectra** (`.qptiff`) - Multiplex immunofluorescence
|
||||
- **MERFISH** - Multiplexed error-robust FISH
|
||||
|
||||
PathML leverages OpenSlide and other specialized libraries to handle format-specific nuances automatically.
|
||||
|
||||
## Core Classes for Loading Images
|
||||
|
||||
### SlideData
|
||||
|
||||
`SlideData` is the fundamental class for representing whole-slide images in PathML.
|
||||
|
||||
**Loading from file:**
|
||||
```python
|
||||
from pathml.core import SlideData
|
||||
|
||||
# Load a whole-slide image
|
||||
wsi = SlideData.from_slide("path/to/slide.svs")
|
||||
|
||||
# Load with specific backend
|
||||
wsi = SlideData.from_slide("path/to/slide.svs", backend="openslide")
|
||||
|
||||
# Load from OME-TIFF
|
||||
wsi = SlideData.from_slide("path/to/slide.ome.tiff", backend="bioformats")
|
||||
```
|
||||
|
||||
**Key attributes:**
|
||||
- `wsi.slide` - Backend slide object (OpenSlide, BioFormats, etc.)
|
||||
- `wsi.tiles` - Collection of image tiles
|
||||
- `wsi.metadata` - Slide metadata dictionary
|
||||
- `wsi.level_dimensions` - Image pyramid level dimensions
|
||||
- `wsi.level_downsamples` - Downsample factors for each pyramid level
|
||||
|
||||
**Methods:**
|
||||
- `wsi.generate_tiles()` - Generate tiles from the slide
|
||||
- `wsi.read_region()` - Read a specific region at a given level
|
||||
- `wsi.get_thumbnail()` - Get a thumbnail image
|
||||
|
||||
### SlideType
|
||||
|
||||
`SlideType` is an enumeration defining supported slide backends:
|
||||
|
||||
```python
|
||||
from pathml.core import SlideType
|
||||
|
||||
# Available backends
|
||||
SlideType.OPENSLIDE # For most WSI formats (SVS, NDPI, etc.)
|
||||
SlideType.BIOFORMATS # For OME-TIFF and other formats
|
||||
SlideType.DICOM # For DICOM WSI
|
||||
SlideType.VectraQPTIFF # For Vectra multiplex IF
|
||||
```
|
||||
|
||||
### Specialized Slide Classes
|
||||
|
||||
PathML provides specialized slide classes for specific imaging modalities:
|
||||
|
||||
**CODEXSlide:**
|
||||
```python
|
||||
from pathml.core import CODEXSlide
|
||||
|
||||
# Load CODEX spatial proteomics data
|
||||
codex_slide = CODEXSlide(
|
||||
path="path/to/codex_dir",
|
||||
stain="IF", # Immunofluorescence
|
||||
backend="bioformats"
|
||||
)
|
||||
```
|
||||
|
||||
**VectraSlide:**
|
||||
```python
|
||||
from pathml.core import types
|
||||
|
||||
# Load Vectra multiplex IF data
|
||||
vectra_slide = SlideData.from_slide(
|
||||
"path/to/vectra.qptiff",
|
||||
backend=SlideType.VectraQPTIFF
|
||||
)
|
||||
```
|
||||
|
||||
**MultiparametricSlide:**
|
||||
```python
|
||||
from pathml.core import MultiparametricSlide
|
||||
|
||||
# Generic multiparametric imaging
|
||||
mp_slide = MultiparametricSlide(path="path/to/multiparametric_data")
|
||||
```
|
||||
|
||||
## Loading Strategies
|
||||
|
||||
### Tile-Based Loading
|
||||
|
||||
For large WSI files, tile-based loading enables memory-efficient processing:
|
||||
|
||||
```python
|
||||
from pathml.core import SlideData
|
||||
|
||||
# Load slide
|
||||
wsi = SlideData.from_slide("path/to/slide.svs")
|
||||
|
||||
# Generate tiles at specific magnification level
|
||||
wsi.generate_tiles(
|
||||
level=0, # Pyramid level (0 = highest resolution)
|
||||
tile_size=256, # Tile dimensions in pixels
|
||||
stride=256, # Spacing between tiles (256 = no overlap)
|
||||
pad=False # Whether to pad edge tiles
|
||||
)
|
||||
|
||||
# Iterate over tiles
|
||||
for tile in wsi.tiles:
|
||||
image = tile.image # numpy array
|
||||
coords = tile.coords # (x, y) coordinates
|
||||
# Process tile...
|
||||
```
|
||||
|
||||
**Overlapping tiles:**
|
||||
```python
|
||||
# Generate tiles with 50% overlap
|
||||
wsi.generate_tiles(
|
||||
level=0,
|
||||
tile_size=256,
|
||||
stride=128 # 50% overlap
|
||||
)
|
||||
```
|
||||
|
||||
### Region-Based Loading
|
||||
|
||||
Extract specific regions of interest directly:
|
||||
|
||||
```python
|
||||
# Read region at specific location and level
|
||||
region = wsi.read_region(
|
||||
location=(10000, 15000), # (x, y) in level 0 coordinates
|
||||
level=1, # Pyramid level
|
||||
size=(512, 512) # Width, height in pixels
|
||||
)
|
||||
|
||||
# Returns numpy array
|
||||
```
|
||||
|
||||
### Pyramid Level Selection
|
||||
|
||||
Whole-slide images are stored in multi-resolution pyramids. Select the appropriate level based on desired magnification:
|
||||
|
||||
```python
|
||||
# Inspect available levels
|
||||
print(wsi.level_dimensions) # [(width0, height0), (width1, height1), ...]
|
||||
print(wsi.level_downsamples) # [1.0, 4.0, 16.0, ...]
|
||||
|
||||
# Load at lower resolution for faster processing
|
||||
wsi.generate_tiles(level=2, tile_size=256) # Use level 2 (16x downsampled)
|
||||
```
|
||||
|
||||
**Common pyramid levels:**
|
||||
- Level 0: Full resolution (e.g., 40x magnification)
|
||||
- Level 1: 4x downsampled (e.g., 10x magnification)
|
||||
- Level 2: 16x downsampled (e.g., 2.5x magnification)
|
||||
- Level 3: 64x downsampled (thumbnail)
|
||||
|
||||
### Thumbnail Loading
|
||||
|
||||
Generate low-resolution thumbnails for visualization and quality control:
|
||||
|
||||
```python
|
||||
# Get thumbnail
|
||||
thumbnail = wsi.get_thumbnail(size=(1024, 1024))
|
||||
|
||||
# Display with matplotlib
|
||||
import matplotlib.pyplot as plt
|
||||
plt.imshow(thumbnail)
|
||||
plt.axis('off')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
## Batch Loading with SlideDataset
|
||||
|
||||
Process multiple slides efficiently using `SlideDataset`:
|
||||
|
||||
```python
|
||||
from pathml.core import SlideDataset
|
||||
import glob
|
||||
|
||||
# Create dataset from multiple slides
|
||||
slide_paths = glob.glob("data/*.svs")
|
||||
dataset = SlideDataset(
|
||||
slide_paths,
|
||||
tile_size=256,
|
||||
stride=256,
|
||||
level=0
|
||||
)
|
||||
|
||||
# Iterate over all tiles from all slides
|
||||
for tile in dataset:
|
||||
image = tile.image
|
||||
slide_id = tile.slide_id
|
||||
# Process tile...
|
||||
```
|
||||
|
||||
**With preprocessing pipeline:**
|
||||
```python
|
||||
from pathml.preprocessing import Pipeline, StainNormalizationHE
|
||||
|
||||
# Create pipeline
|
||||
pipeline = Pipeline([
|
||||
StainNormalizationHE(target='normalize')
|
||||
])
|
||||
|
||||
# Apply to entire dataset
|
||||
dataset = SlideDataset(slide_paths)
|
||||
dataset.run(pipeline, distributed=True, n_workers=8)
|
||||
```
|
||||
|
||||
## Metadata Access
|
||||
|
||||
Extract slide metadata including acquisition parameters, magnification, and vendor-specific information:
|
||||
|
||||
```python
|
||||
# Access metadata
|
||||
metadata = wsi.metadata
|
||||
|
||||
# Common metadata fields
|
||||
print(metadata.get('openslide.objective-power')) # Magnification
|
||||
print(metadata.get('openslide.mpp-x')) # Microns per pixel X
|
||||
print(metadata.get('openslide.mpp-y')) # Microns per pixel Y
|
||||
print(metadata.get('openslide.vendor')) # Scanner vendor
|
||||
|
||||
# Slide dimensions
|
||||
print(wsi.level_dimensions[0]) # (width, height) at level 0
|
||||
```
|
||||
|
||||
## Working with DICOM Slides
|
||||
|
||||
PathML supports DICOM WSI through specialized handling:
|
||||
|
||||
```python
|
||||
from pathml.core import SlideData, SlideType
|
||||
|
||||
# Load DICOM WSI
|
||||
dicom_slide = SlideData.from_slide(
|
||||
"path/to/slide.dcm",
|
||||
backend=SlideType.DICOM
|
||||
)
|
||||
|
||||
# DICOM-specific metadata
|
||||
print(dicom_slide.metadata.get('PatientID'))
|
||||
print(dicom_slide.metadata.get('StudyDate'))
|
||||
```
|
||||
|
||||
## Working with OME-TIFF
|
||||
|
||||
OME-TIFF provides an open standard for multi-dimensional imaging:
|
||||
|
||||
```python
|
||||
from pathml.core import SlideData
|
||||
|
||||
# Load OME-TIFF
|
||||
ome_slide = SlideData.from_slide(
|
||||
"path/to/slide.ome.tiff",
|
||||
backend="bioformats"
|
||||
)
|
||||
|
||||
# Access channel information for multi-channel images
|
||||
n_channels = ome_slide.shape[2] # Number of channels
|
||||
```
|
||||
|
||||
## Performance Considerations
|
||||
|
||||
### Memory Management
|
||||
|
||||
For large WSI files (often >1GB), use tile-based loading to avoid memory exhaustion:
|
||||
|
||||
```python
|
||||
# Efficient: Tile-based processing
|
||||
wsi.generate_tiles(level=1, tile_size=256)
|
||||
for tile in wsi.tiles:
|
||||
process_tile(tile) # Process one tile at a time
|
||||
|
||||
# Inefficient: Loading entire slide into memory
|
||||
full_image = wsi.read_region((0, 0), level=0, wsi.level_dimensions[0]) # May crash
|
||||
```
|
||||
|
||||
### Distributed Processing
|
||||
|
||||
Use Dask for parallel processing across multiple workers:
|
||||
|
||||
```python
|
||||
from pathml.core import SlideDataset
|
||||
from dask.distributed import Client
|
||||
|
||||
# Start Dask client
|
||||
client = Client(n_workers=8, threads_per_worker=2)
|
||||
|
||||
# Process dataset in parallel
|
||||
dataset = SlideDataset(slide_paths)
|
||||
dataset.run(pipeline, distributed=True, client=client)
|
||||
```
|
||||
|
||||
### Level Selection
|
||||
|
||||
Balance resolution and performance by selecting appropriate pyramid levels:
|
||||
|
||||
- **Level 0:** Use for final analysis requiring maximum detail
|
||||
- **Level 1-2:** Use for most preprocessing and model training
|
||||
- **Level 3+:** Use for thumbnails, quality control, and rapid exploration
|
||||
|
||||
## Common Issues and Solutions
|
||||
|
||||
**Issue: Slide fails to load**
|
||||
- Verify file format is supported
|
||||
- Check file permissions and path
|
||||
- Try different backend: `backend="bioformats"` or `backend="openslide"`
|
||||
|
||||
**Issue: Out of memory errors**
|
||||
- Use tile-based loading instead of full-slide loading
|
||||
- Process at lower pyramid level (e.g., level=1 or level=2)
|
||||
- Reduce tile_size parameter
|
||||
- Enable distributed processing with Dask
|
||||
|
||||
**Issue: Color inconsistencies across slides**
|
||||
- Apply stain normalization preprocessing (see `preprocessing.md`)
|
||||
- Check scanner metadata for calibration information
|
||||
- Use `StainNormalizationHE` transform in preprocessing pipeline
|
||||
|
||||
**Issue: Metadata missing or incorrect**
|
||||
- Different vendors store metadata in different locations
|
||||
- Use `wsi.metadata` to inspect available fields
|
||||
- Some formats may have limited metadata support
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always inspect pyramid structure** before processing: Check `level_dimensions` and `level_downsamples` to understand available resolutions
|
||||
|
||||
2. **Use appropriate pyramid levels**: Process at level 1-2 for most tasks; reserve level 0 for final high-resolution analysis
|
||||
|
||||
3. **Tile with overlap** for segmentation tasks: Use stride < tile_size to avoid edge artifacts
|
||||
|
||||
4. **Verify magnification consistency**: Check `openslide.objective-power` metadata when combining slides from different sources
|
||||
|
||||
5. **Handle vendor-specific formats**: Use specialized slide classes (CODEXSlide, VectraSlide) for multiparametric data
|
||||
|
||||
6. **Implement quality control**: Generate thumbnails and inspect for artifacts before processing
|
||||
|
||||
7. **Use distributed processing** for large datasets: Leverage Dask for parallel processing across multiple workers
|
||||
|
||||
## Example Workflows
|
||||
|
||||
### Loading and Inspecting a New Slide
|
||||
|
||||
```python
|
||||
from pathml.core import SlideData
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
# Load slide
|
||||
wsi = SlideData.from_slide("path/to/slide.svs")
|
||||
|
||||
# Inspect properties
|
||||
print(f"Dimensions: {wsi.level_dimensions}")
|
||||
print(f"Downsamples: {wsi.level_downsamples}")
|
||||
print(f"Magnification: {wsi.metadata.get('openslide.objective-power')}")
|
||||
|
||||
# Generate thumbnail for QC
|
||||
thumbnail = wsi.get_thumbnail(size=(1024, 1024))
|
||||
plt.imshow(thumbnail)
|
||||
plt.title(f"Slide: {wsi.name}")
|
||||
plt.axis('off')
|
||||
plt.show()
|
||||
```
|
||||
|
||||
### Processing Multiple Slides
|
||||
|
||||
```python
|
||||
from pathml.core import SlideDataset
|
||||
from pathml.preprocessing import Pipeline, TissueDetectionHE
|
||||
import glob
|
||||
|
||||
# Find all slides
|
||||
slide_paths = glob.glob("data/slides/*.svs")
|
||||
|
||||
# Create pipeline
|
||||
pipeline = Pipeline([TissueDetectionHE()])
|
||||
|
||||
# Process all slides
|
||||
dataset = SlideDataset(
|
||||
slide_paths,
|
||||
tile_size=512,
|
||||
stride=512,
|
||||
level=1
|
||||
)
|
||||
|
||||
# Run pipeline with distributed processing
|
||||
dataset.run(pipeline, distributed=True, n_workers=8)
|
||||
|
||||
# Save processed data
|
||||
dataset.to_hdf5("processed_dataset.h5")
|
||||
```
|
||||
|
||||
### Loading CODEX Multiparametric Data
|
||||
|
||||
```python
|
||||
from pathml.core import CODEXSlide
|
||||
from pathml.preprocessing import Pipeline, CollapseRunsCODEX, SegmentMIF
|
||||
|
||||
# Load CODEX slide
|
||||
codex = CODEXSlide("path/to/codex_dir", stain="IF")
|
||||
|
||||
# Create CODEX-specific pipeline
|
||||
pipeline = Pipeline([
|
||||
CollapseRunsCODEX(z_slice=2), # Select z-slice
|
||||
SegmentMIF(
|
||||
nuclear_channel='DAPI',
|
||||
cytoplasm_channel='CD45',
|
||||
model='mesmer'
|
||||
)
|
||||
])
|
||||
|
||||
# Process
|
||||
pipeline.run(codex)
|
||||
```
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **PathML Documentation:** https://pathml.readthedocs.io/
|
||||
- **OpenSlide:** https://openslide.org/ (underlying library for WSI formats)
|
||||
- **Bio-Formats:** https://www.openmicroscopy.org/bio-formats/ (alternative backend)
|
||||
- **DICOM Standard:** https://www.dicomstandard.org/
|
||||
Reference in New Issue
Block a user