Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/pathml/references/image_loading.md
+++ b/skills/pathml/references/image_loading.md
@@ -0,0 +1,448 @@
+# Image Loading & Formats
+
+## Overview
+
+PathML loads whole-slide images (WSI) from more than 160 imaging formats, many of them proprietary. Unified slide classes hide vendor-specific details and give consistent access to image pyramids, metadata, and regions of interest across file formats.
+
+## Supported Formats
+
+PathML supports the following slide formats:
+
+### Brightfield Microscopy Formats
+- **Aperio SVS** (`.svs`) - Leica Biosystems
+- **Hamamatsu NDPI** (`.ndpi`) - Hamamatsu Photonics
+- **Leica SCN** (`.scn`) - Leica Biosystems
+- **Zeiss ZVI** (`.zvi`) - Carl Zeiss
+- **3DHISTECH** (`.mrxs`) - 3DHISTECH Ltd.
+- **Ventana BIF** (`.bif`) - Roche Ventana
+- **Generic tiled TIFF** (`.tif`, `.tiff`)
+
+### Medical Imaging Standards
+- **DICOM** (`.dcm`) - Digital Imaging and Communications in Medicine
+- **OME-TIFF** (`.ome.tif`, `.ome.tiff`) - Open Microscopy Environment
+
+### Multiparametric Imaging
+- **CODEX** - Spatial proteomics imaging
+- **Vectra** (`.qptiff`) - Multiplex immunofluorescence
+- **MERFISH** - Multiplexed error-robust FISH
+
+PathML relies on OpenSlide, Bio-Formats, and DICOM readers to handle format-specific details automatically.
+
+## Core Classes for Loading Images
+
+### SlideData
+
+`SlideData` is the fundamental class for representing whole-slide images in PathML.
+
+**Loading from file:**
+```python
+from pathml.core import SlideData
+
+# Load a whole-slide image
+wsi = SlideData.from_slide("path/to/slide.svs")
+
+# Load with specific backend
+wsi = SlideData.from_slide("path/to/slide.svs", backend="openslide")
+
+# Load from OME-TIFF
+wsi = SlideData.from_slide("path/to/slide.ome.tiff", backend="bioformats")
+```
+
+**Key attributes:**
+- `wsi.slide` - Backend slide object (OpenSlide, BioFormats, etc.)
+- `wsi.tiles` - Collection of image tiles
+- `wsi.metadata` - Slide metadata dictionary
+- `wsi.level_dimensions` - Image pyramid level dimensions
+- `wsi.level_downsamples` - Downsample factors for each pyramid level
+
+**Methods:**
+- `wsi.generate_tiles()` - Generate tiles from the slide
+- `wsi.read_region()` - Read a specific region at a given level
+- `wsi.get_thumbnail()` - Get a thumbnail image
+
+### SlideType
+
+`SlideType` is an enumeration defining supported slide backends:
+
+```python
+from pathml.core import SlideType
+
+# Available backends
+SlideType.OPENSLIDE  # For most WSI formats (SVS, NDPI, etc.)
+SlideType.BIOFORMATS  # For OME-TIFF and other formats
+SlideType.DICOM  # For DICOM WSI
+SlideType.VectraQPTIFF  # For Vectra multiplex IF
+```
+
+### Specialized Slide Classes
+
+PathML provides specialized slide classes for specific imaging modalities:
+
+**CODEXSlide:**
+```python
+from pathml.core import CODEXSlide
+
+# Load CODEX spatial proteomics data
+codex_slide = CODEXSlide(
+    path="path/to/codex_dir",
+    stain="IF",  # Immunofluorescence
+    backend="bioformats"
+)
+```
+
+**VectraSlide:**
+```python
+from pathml.core import SlideData, SlideType
+
+# Load Vectra multiplex IF data
+vectra_slide = SlideData.from_slide(
+    "path/to/vectra.qptiff",
+    backend=SlideType.VectraQPTIFF
+)
+```
+
+**MultiparametricSlide:**
+```python
+from pathml.core import MultiparametricSlide
+
+# Generic multiparametric imaging
+mp_slide = MultiparametricSlide(path="path/to/multiparametric_data")
+```
+
+## Loading Strategies
+
+### Tile-Based Loading
+
+For large WSI files, tile-based loading enables memory-efficient processing:
+
+```python
+from pathml.core import SlideData
+
+# Load slide
+wsi = SlideData.from_slide("path/to/slide.svs")
+
+# Generate tiles at specific magnification level
+wsi.generate_tiles(
+    level=0,  # Pyramid level (0 = highest resolution)
+    tile_size=256,  # Tile dimensions in pixels
+    stride=256,  # Spacing between tiles (256 = no overlap)
+    pad=False  # Whether to pad edge tiles
+)
+
+# Iterate over tiles
+for tile in wsi.tiles:
+    image = tile.image  # numpy array
+    coords = tile.coords  # (x, y) coordinates
+    # Process tile...
+```
+
+**Overlapping tiles:**
+```python
+# Generate tiles with 50% overlap
+wsi.generate_tiles(
+    level=0,
+    tile_size=256,
+    stride=128  # 50% overlap
+)
+```
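+
+The relationship between `tile_size`, `stride`, and overlap can be sketched in plain Python, independent of PathML. The helper `tile_origins` below is hypothetical (it is not part of the PathML API) and assumes that edge tiles which do not fit are dropped, as with `pad=False`:
+
+```python
+def tile_origins(width, height, tile_size=256, stride=256):
+    """Top-left (x, y) corners of all full tiles that fit inside the image."""
+    if tile_size <= 0 or stride <= 0:
+        raise ValueError("tile_size and stride must be positive")
+    # The last valid origin is the one where the tile still ends inside the image.
+    xs = range(0, width - tile_size + 1, stride)
+    ys = range(0, height - tile_size + 1, stride)
+    return [(x, y) for y in ys for x in xs]
+
+
+no_overlap = tile_origins(1024, 1024, tile_size=256, stride=256)
+half_overlap = tile_origins(1024, 1024, tile_size=256, stride=128)
+print(len(no_overlap), len(half_overlap))  # 16 49
+```
+
+Halving the stride roughly quadruples the number of tiles, which is worth keeping in mind when budgeting processing time.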
+
+### Region-Based Loading
+
+Extract specific regions of interest directly:
+
+```python
+# Read region at specific location and level
+region = wsi.read_region(
+    location=(10000, 15000),  # (x, y) in level 0 coordinates
+    level=1,  # Pyramid level
+    size=(512, 512)  # (width, height) in pixels at the requested level
+)
+
+# Returns numpy array
+```
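+
+Because `location` is given in level 0 coordinates while `size` is usually measured at the requested level (the OpenSlide convention, assumed here), a region read at a lower-resolution level covers a larger level 0 area. A minimal sketch with a hypothetical helper, `level0_footprint`:
+
+```python
+def level0_footprint(location, size, downsample):
+    """Return (x0, y0, x1, y1) in level 0 pixels covered by a region.
+
+    Assumes `location` is in level 0 coordinates and `size` is in pixels
+    at the level whose downsample factor is `downsample`.
+    """
+    x, y = location
+    w, h = size
+    return (x, y, x + round(w * downsample), y + round(h * downsample))
+
+
+# A 512 x 512 region at a level with 4x downsampling spans 2048 x 2048 level 0 pixels.
+footprint = level0_footprint((10000, 15000), (512, 512), 4.0)
+print(footprint)  # (10000, 15000, 12048, 17048)
+```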
+
+### Pyramid Level Selection
+
+Whole-slide images are stored in multi-resolution pyramids. Select the appropriate level based on desired magnification:
+
+```python
+# Inspect available levels
+print(wsi.level_dimensions)  # [(width0, height0), (width1, height1), ...]
+print(wsi.level_downsamples)  # [1.0, 4.0, 16.0, ...]
+
+# Load at lower resolution for faster processing
+wsi.generate_tiles(level=2, tile_size=256)  # Use level 2 (16x downsampled)
+```
+
+**Typical pyramid levels** (actual downsample factors vary by scanner and file, so check `level_downsamples`):
+- Level 0: Full resolution (e.g., 40x magnification)
+- Level 1: 4x downsampled (e.g., 10x magnification)
+- Level 2: 16x downsampled (e.g., 2.5x magnification)
+- Level 3: 64x downsampled (thumbnail)
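+
+The mapping from downsample factor to effective magnification is a simple division. The sketch below uses a hypothetical helper, `pick_level`, which assumes the base (level 0) magnification is known, for example from the `openslide.objective-power` metadata field:
+
+```python
+def pick_level(base_magnification, target_magnification, level_downsamples):
+    """Index of the level whose effective magnification is closest to the target."""
+    effective = [base_magnification / d for d in level_downsamples]
+    return min(
+        range(len(effective)),
+        key=lambda i: abs(effective[i] - target_magnification),
+    )
+
+
+# A 40x slide with downsamples [1, 4, 16] offers 40x, 10x, and 2.5x.
+level_for_10x = pick_level(40, 10, [1.0, 4.0, 16.0])
+print(level_for_10x)  # 1
+```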
+
+### Thumbnail Loading
+
+Generate low-resolution thumbnails for visualization and quality control:
+
+```python
+# Get thumbnail
+thumbnail = wsi.get_thumbnail(size=(1024, 1024))
+
+# Display with matplotlib
+import matplotlib.pyplot as plt
+plt.imshow(thumbnail)
+plt.axis('off')
+plt.show()
+```
+
+## Batch Loading with SlideDataset
+
+Process multiple slides efficiently using `SlideDataset`:
+
+```python
+from pathml.core import SlideDataset
+import glob
+
+# Create dataset from multiple slides
+slide_paths = glob.glob("data/*.svs")
+dataset = SlideDataset(
+    slide_paths,
+    tile_size=256,
+    stride=256,
+    level=0
+)
+
+# Iterate over all tiles from all slides
+for tile in dataset:
+    image = tile.image
+    slide_id = tile.slide_id
+    # Process tile...
+```
+
+**With preprocessing pipeline:**
+```python
+from pathml.preprocessing import Pipeline, StainNormalizationHE
+
+# Create pipeline
+pipeline = Pipeline([
+    StainNormalizationHE(target='normalize')
+])
+
+# Apply to entire dataset
+dataset = SlideDataset(slide_paths)
+dataset.run(pipeline, distributed=True, n_workers=8)
+```
+
+## Metadata Access
+
+Extract slide metadata including acquisition parameters, magnification, and vendor-specific information:
+
+```python
+# Access metadata
+metadata = wsi.metadata
+
+# Common metadata fields
+print(metadata.get('openslide.objective-power'))  # Magnification
+print(metadata.get('openslide.mpp-x'))  # Microns per pixel X
+print(metadata.get('openslide.mpp-y'))  # Microns per pixel Y
+print(metadata.get('openslide.vendor'))  # Scanner vendor
+
+# Slide dimensions
+print(wsi.level_dimensions[0])  # (width, height) at level 0
+```
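+
+The microns-per-pixel fields make it possible to convert pixel dimensions into physical size. A minimal sketch, assuming the metadata values arrive as strings (as OpenSlide properties do) and using a hypothetical helper, `physical_size_mm`:
+
+```python
+def physical_size_mm(dimensions, mpp_x, mpp_y):
+    """Convert (width, height) in pixels to millimetres using microns per pixel."""
+    if mpp_x is None or mpp_y is None:
+        raise ValueError("microns-per-pixel metadata is missing")
+    width_px, height_px = dimensions
+    # 1 mm = 1000 microns
+    return (width_px * float(mpp_x) / 1000.0, height_px * float(mpp_y) / 1000.0)
+
+
+# Illustrative values only: a 100000 x 80000 pixel slide scanned at 0.25 microns per pixel.
+size_mm = physical_size_mm((100000, 80000), "0.25", "0.25")
+print(size_mm)  # (25.0, 20.0)
+```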
+
+## Working with DICOM Slides
+
+PathML reads DICOM WSI files through a dedicated DICOM backend:
+
+```python
+from pathml.core import SlideData, SlideType
+
+# Load DICOM WSI
+dicom_slide = SlideData.from_slide(
+    "path/to/slide.dcm",
+    backend=SlideType.DICOM
+)
+
+# DICOM-specific metadata
+print(dicom_slide.metadata.get('PatientID'))
+print(dicom_slide.metadata.get('StudyDate'))
+```
+
+## Working with OME-TIFF
+
+OME-TIFF provides an open standard for multi-dimensional imaging:
+
+```python
+from pathml.core import SlideData
+
+# Load OME-TIFF
+ome_slide = SlideData.from_slide(
+    "path/to/slide.ome.tiff",
+    backend="bioformats"
+)
+
+# Access channel information for multi-channel images
+n_channels = ome_slide.shape[2]  # Number of channels
+```
+
+## Performance Considerations
+
+### Memory Management
+
+For large WSI files (often larger than 1 GB), use tile-based loading to avoid exhausting memory:
+
+```python
+# Efficient: Tile-based processing
+wsi.generate_tiles(level=1, tile_size=256)
+for tile in wsi.tiles:
+    process_tile(tile)  # process_tile is a user-defined placeholder; handles one tile at a time
+
+# Inefficient: Loading entire slide into memory
+full_image = wsi.read_region((0, 0), level=0, size=wsi.level_dimensions[0])  # May exhaust memory
+```
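+
+A rough back-of-the-envelope estimate shows why full-slide loading is risky. The sketch below assumes uncompressed 8-bit RGB pixels and illustrative slide dimensions; `uncompressed_bytes` is a hypothetical helper:
+
+```python
+def uncompressed_bytes(width, height, channels=3, bytes_per_sample=1):
+    """Size in bytes of an uncompressed image array."""
+    return width * height * channels * bytes_per_sample
+
+
+full_slide = uncompressed_bytes(100_000, 80_000)  # illustrative level 0 dimensions
+single_tile = uncompressed_bytes(256, 256)
+print(f"{full_slide / 1e9:.1f} GB vs {single_tile / 1e3:.1f} kB")  # 24.0 GB vs 196.6 kB
+```
+
+The file on disk is much smaller than this because slide formats store compressed tiles; the in-memory array is not compressed.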
+
+### Distributed Processing
+
+Use Dask for parallel processing across multiple workers:
+
+```python
+from pathml.core import SlideDataset
+from dask.distributed import Client
+
+# Start Dask client
+client = Client(n_workers=8, threads_per_worker=2)
+
+# Process dataset in parallel (slide_paths and pipeline as defined in "Batch Loading with SlideDataset")
+dataset = SlideDataset(slide_paths)
+dataset.run(pipeline, distributed=True, client=client)
+```
+
+### Level Selection
+
+Balance resolution and performance by selecting appropriate pyramid levels:
+
+- **Level 0:** Use for final analysis requiring maximum detail
+- **Level 1-2:** Use for most preprocessing and model training
+- **Level 3+:** Use for thumbnails, quality control, and rapid exploration
+
+## Common Issues and Solutions
+
+**Issue: Slide fails to load**
+- Verify file format is supported
+- Check file permissions and path
+- Try different backend: `backend="bioformats"` or `backend="openslide"`
+
+**Issue: Out of memory errors**
+- Use tile-based loading instead of full-slide loading
+- Process at lower pyramid level (e.g., level=1 or level=2)
+- Reduce tile_size parameter
+- Enable distributed processing with Dask
+
+**Issue: Color inconsistencies across slides**
+- Apply stain normalization preprocessing (see `preprocessing.md`)
+- Check scanner metadata for calibration information
+- Use `StainNormalizationHE` transform in preprocessing pipeline
+
+**Issue: Metadata missing or incorrect**
+- Different vendors store metadata in different locations
+- Use `wsi.metadata` to inspect available fields
+- Some formats may have limited metadata support
+
+## Best Practices
+
+1. **Always inspect pyramid structure** before processing: Check `level_dimensions` and `level_downsamples` to understand available resolutions
+
+2. **Use appropriate pyramid levels**: Process at level 1-2 for most tasks; reserve level 0 for final high-resolution analysis
+
+3. **Tile with overlap** for segmentation tasks: Use stride < tile_size to avoid edge artifacts
+
+4. **Verify magnification consistency**: Check `openslide.objective-power` metadata when combining slides from different sources
+
+5. **Handle vendor-specific formats**: Use specialized slide classes (`CODEXSlide`, `VectraSlide`) for multiparametric data
+
+6. **Implement quality control**: Generate thumbnails and inspect for artifacts before processing
+
+7. **Use distributed processing** for large datasets: Leverage Dask for parallel processing across multiple workers
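+
+The magnification check in item 4 can be sketched as a grouping step over per-slide metadata dictionaries. The helper `group_by_magnification` and the example metadata are hypothetical; only the `openslide.objective-power` key comes from the sections above:
+
+```python
+def group_by_magnification(metadata_by_slide, key="openslide.objective-power"):
+    """Group slide names by objective power; slides without the field go under None."""
+    groups = {}
+    for name, metadata in metadata_by_slide.items():
+        value = metadata.get(key)
+        label = float(value) if value is not None else None
+        groups.setdefault(label, []).append(name)
+    return groups
+
+
+# Hypothetical metadata for three slides, one of which lacks the field.
+example = {
+    "a.svs": {"openslide.objective-power": "40"},
+    "b.svs": {"openslide.objective-power": "20"},
+    "c.svs": {},
+}
+groups = group_by_magnification(example)
+if len(groups) > 1:
+    print("Mixed or missing magnifications:", groups)
+```
+
+More than one group signals that slides should be resampled to a common magnification, or processed at different pyramid levels, before being combined.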
+
+## Example Workflows
+
+### Loading and Inspecting a New Slide
+
+```python
+from pathml.core import SlideData
+import matplotlib.pyplot as plt
+
+# Load slide
+wsi = SlideData.from_slide("path/to/slide.svs")
+
+# Inspect properties
+print(f"Dimensions: {wsi.level_dimensions}")
+print(f"Downsamples: {wsi.level_downsamples}")
+print(f"Magnification: {wsi.metadata.get('openslide.objective-power')}")
+
+# Generate thumbnail for QC
+thumbnail = wsi.get_thumbnail(size=(1024, 1024))
+plt.imshow(thumbnail)
+plt.title(f"Slide: {wsi.name}")
+plt.axis('off')
+plt.show()
+```
+
+### Processing Multiple Slides
+
+```python
+from pathml.core import SlideDataset
+from pathml.preprocessing import Pipeline, TissueDetectionHE
+import glob
+
+# Find all slides
+slide_paths = glob.glob("data/slides/*.svs")
+
+# Create pipeline
+pipeline = Pipeline([TissueDetectionHE()])
+
+# Process all slides
+dataset = SlideDataset(
+    slide_paths,
+    tile_size=512,
+    stride=512,
+    level=1
+)
+
+# Run pipeline with distributed processing
+dataset.run(pipeline, distributed=True, n_workers=8)
+
+# Save processed data
+dataset.to_hdf5("processed_dataset.h5")
+```
+
+### Loading CODEX Multiparametric Data
+
+```python
+from pathml.core import CODEXSlide
+from pathml.preprocessing import Pipeline, CollapseRunsCODEX, SegmentMIF
+
+# Load CODEX slide
+codex = CODEXSlide("path/to/codex_dir", stain="IF")
+
+# Create CODEX-specific pipeline
+pipeline = Pipeline([
+    CollapseRunsCODEX(z_slice=2),  # Select z-slice
+    SegmentMIF(
+        nuclear_channel='DAPI',
+        cytoplasm_channel='CD45',
+        model='mesmer'
+    )
+])
+
+# Process
+pipeline.run(codex)
+```
+
+## Additional Resources
+
+- **PathML Documentation:** https://pathml.readthedocs.io/
+- **OpenSlide:** https://openslide.org/ (underlying library for WSI formats)
+- **Bio-Formats:** https://www.openmicroscopy.org/bio-formats/ (alternative backend)
+- **DICOM Standard:** https://www.dicomstandard.org/