zhongwei/gh-k-dense-ai-claude-scientific-skills-scientific-skills

Files

Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

12 KiB

Raw Blame History

Image Loading & Formats

Overview

PathML provides comprehensive support for loading whole-slide images (WSI) from 160+ proprietary medical imaging formats. The framework abstracts vendor-specific complexities through unified slide classes and interfaces, enabling seamless access to image pyramids, metadata, and regions of interest across different file formats.

Supported Formats

PathML supports the following slide formats:

Brightfield Microscopy Formats

Aperio SVS (.svs) - Leica Biosystems
Hamamatsu NDPI (.ndpi) - Hamamatsu Photonics
Leica SCN (.scn) - Leica Biosystems
Zeiss ZVI (.zvi) - Carl Zeiss
3DHISTECH (.mrxs) - 3DHISTECH Ltd.
Ventana BIF (.bif) - Roche Ventana
Generic tiled TIFF (.tif, .tiff)

Medical Imaging Standards

DICOM (.dcm) - Digital Imaging and Communications in Medicine
OME-TIFF (.ome.tif, .ome.tiff) - Open Microscopy Environment

Multiparametric Imaging

CODEX - Spatial proteomics imaging
Vectra (.qptiff) - Multiplex immunofluorescence
MERFISH - Multiplexed error-robust FISH

PathML leverages OpenSlide and other specialized libraries to handle format-specific nuances automatically.

Core Classes for Loading Images

SlideData

SlideData is the fundamental class for representing whole-slide images in PathML.

Loading from file:

from pathml.core import SlideData

# Load a whole-slide image
wsi = SlideData.from_slide("path/to/slide.svs")

# Load with specific backend
wsi = SlideData.from_slide("path/to/slide.svs", backend="openslide")

# Load from OME-TIFF
wsi = SlideData.from_slide("path/to/slide.ome.tiff", backend="bioformats")

Key attributes:

wsi.slide - Backend slide object (OpenSlide, BioFormats, etc.)
wsi.tiles - Collection of image tiles
wsi.metadata - Slide metadata dictionary
wsi.level_dimensions - Image pyramid level dimensions
wsi.level_downsamples - Downsample factors for each pyramid level

Methods:

wsi.generate_tiles() - Generate tiles from the slide
wsi.read_region() - Read a specific region at a given level
wsi.get_thumbnail() - Get a thumbnail image

SlideType

SlideType is an enumeration defining supported slide backends:

from pathml.core import SlideType

# Available backends
SlideType.OPENSLIDE  # For most WSI formats (SVS, NDPI, etc.)
SlideType.BIOFORMATS  # For OME-TIFF and other formats
SlideType.DICOM  # For DICOM WSI
SlideType.VectraQPTIFF  # For Vectra multiplex IF

Specialized Slide Classes

PathML provides specialized slide classes for specific imaging modalities:

CODEXSlide:

from pathml.core import CODEXSlide

# Load CODEX spatial proteomics data
codex_slide = CODEXSlide(
    path="path/to/codex_dir",
    stain="IF",  # Immunofluorescence
    backend="bioformats"
)

VectraSlide:

from pathml.core import types

# Load Vectra multiplex IF data
vectra_slide = SlideData.from_slide(
    "path/to/vectra.qptiff",
    backend=SlideType.VectraQPTIFF
)

MultiparametricSlide:

from pathml.core import MultiparametricSlide

# Generic multiparametric imaging
mp_slide = MultiparametricSlide(path="path/to/multiparametric_data")

Loading Strategies

Tile-Based Loading

For large WSI files, tile-based loading enables memory-efficient processing:

from pathml.core import SlideData

# Load slide
wsi = SlideData.from_slide("path/to/slide.svs")

# Generate tiles at specific magnification level
wsi.generate_tiles(
    level=0,  # Pyramid level (0 = highest resolution)
    tile_size=256,  # Tile dimensions in pixels
    stride=256,  # Spacing between tiles (256 = no overlap)
    pad=False  # Whether to pad edge tiles
)

# Iterate over tiles
for tile in wsi.tiles:
    image = tile.image  # numpy array
    coords = tile.coords  # (x, y) coordinates
    # Process tile...

Overlapping tiles:

# Generate tiles with 50% overlap
wsi.generate_tiles(
    level=0,
    tile_size=256,
    stride=128  # 50% overlap
)

Region-Based Loading

Extract specific regions of interest directly:

# Read region at specific location and level
region = wsi.read_region(
    location=(10000, 15000),  # (x, y) in level 0 coordinates
    level=1,  # Pyramid level
    size=(512, 512)  # Width, height in pixels
)

# Returns numpy array

Pyramid Level Selection

Whole-slide images are stored in multi-resolution pyramids. Select the appropriate level based on desired magnification:

# Inspect available levels
print(wsi.level_dimensions)  # [(width0, height0), (width1, height1), ...]
print(wsi.level_downsamples)  # [1.0, 4.0, 16.0, ...]

# Load at lower resolution for faster processing
wsi.generate_tiles(level=2, tile_size=256)  # Use level 2 (16x downsampled)

Common pyramid levels:

Level 0: Full resolution (e.g., 40x magnification)
Level 1: 4x downsampled (e.g., 10x magnification)
Level 2: 16x downsampled (e.g., 2.5x magnification)
Level 3: 64x downsampled (thumbnail)

Thumbnail Loading

Generate low-resolution thumbnails for visualization and quality control:

# Get thumbnail
thumbnail = wsi.get_thumbnail(size=(1024, 1024))

# Display with matplotlib
import matplotlib.pyplot as plt
plt.imshow(thumbnail)
plt.axis('off')
plt.show()

Batch Loading with SlideDataset

Process multiple slides efficiently using SlideDataset:

from pathml.core import SlideDataset
import glob

# Create dataset from multiple slides
slide_paths = glob.glob("data/*.svs")
dataset = SlideDataset(
    slide_paths,
    tile_size=256,
    stride=256,
    level=0
)

# Iterate over all tiles from all slides
for tile in dataset:
    image = tile.image
    slide_id = tile.slide_id
    # Process tile...

With preprocessing pipeline:

from pathml.preprocessing import Pipeline, StainNormalizationHE

# Create pipeline
pipeline = Pipeline([
    StainNormalizationHE(target='normalize')
])

# Apply to entire dataset
dataset = SlideDataset(slide_paths)
dataset.run(pipeline, distributed=True, n_workers=8)

Metadata Access

Extract slide metadata including acquisition parameters, magnification, and vendor-specific information:

# Access metadata
metadata = wsi.metadata

# Common metadata fields
print(metadata.get('openslide.objective-power'))  # Magnification
print(metadata.get('openslide.mpp-x'))  # Microns per pixel X
print(metadata.get('openslide.mpp-y'))  # Microns per pixel Y
print(metadata.get('openslide.vendor'))  # Scanner vendor

# Slide dimensions
print(wsi.level_dimensions[0])  # (width, height) at level 0

Working with DICOM Slides

PathML supports DICOM WSI through specialized handling:

from pathml.core import SlideData, SlideType

# Load DICOM WSI
dicom_slide = SlideData.from_slide(
    "path/to/slide.dcm",
    backend=SlideType.DICOM
)

# DICOM-specific metadata
print(dicom_slide.metadata.get('PatientID'))
print(dicom_slide.metadata.get('StudyDate'))

Working with OME-TIFF

OME-TIFF provides an open standard for multi-dimensional imaging:

from pathml.core import SlideData

# Load OME-TIFF
ome_slide = SlideData.from_slide(
    "path/to/slide.ome.tiff",
    backend="bioformats"
)

# Access channel information for multi-channel images
n_channels = ome_slide.shape[2]  # Number of channels

Performance Considerations

Memory Management

For large WSI files (often >1GB), use tile-based loading to avoid memory exhaustion:

# Efficient: Tile-based processing
wsi.generate_tiles(level=1, tile_size=256)
for tile in wsi.tiles:
    process_tile(tile)  # Process one tile at a time

# Inefficient: Loading entire slide into memory
full_image = wsi.read_region((0, 0), level=0, wsi.level_dimensions[0])  # May crash

Distributed Processing

Use Dask for parallel processing across multiple workers:

from pathml.core import SlideDataset
from dask.distributed import Client

# Start Dask client
client = Client(n_workers=8, threads_per_worker=2)

# Process dataset in parallel
dataset = SlideDataset(slide_paths)
dataset.run(pipeline, distributed=True, client=client)

Level Selection

Balance resolution and performance by selecting appropriate pyramid levels:

Level 0: Use for final analysis requiring maximum detail
Level 1-2: Use for most preprocessing and model training
Level 3+: Use for thumbnails, quality control, and rapid exploration

Common Issues and Solutions

Issue: Slide fails to load

Verify file format is supported
Check file permissions and path
Try different backend: backend="bioformats" or backend="openslide"

Issue: Out of memory errors

Use tile-based loading instead of full-slide loading
Process at lower pyramid level (e.g., level=1 or level=2)
Reduce tile_size parameter
Enable distributed processing with Dask

Issue: Color inconsistencies across slides

Apply stain normalization preprocessing (see preprocessing.md)
Check scanner metadata for calibration information
Use StainNormalizationHE transform in preprocessing pipeline

Issue: Metadata missing or incorrect

Different vendors store metadata in different locations
Use wsi.metadata to inspect available fields
Some formats may have limited metadata support

Best Practices

Always inspect pyramid structure before processing: Check level_dimensions and level_downsamples to understand available resolutions
Use appropriate pyramid levels: Process at level 1-2 for most tasks; reserve level 0 for final high-resolution analysis
Tile with overlap for segmentation tasks: Use stride < tile_size to avoid edge artifacts
Verify magnification consistency: Check openslide.objective-power metadata when combining slides from different sources
Handle vendor-specific formats: Use specialized slide classes (CODEXSlide, VectraSlide) for multiparametric data
Implement quality control: Generate thumbnails and inspect for artifacts before processing
Use distributed processing for large datasets: Leverage Dask for parallel processing across multiple workers

Example Workflows

Loading and Inspecting a New Slide

from pathml.core import SlideData
import matplotlib.pyplot as plt

# Load slide
wsi = SlideData.from_slide("path/to/slide.svs")

# Inspect properties
print(f"Dimensions: {wsi.level_dimensions}")
print(f"Downsamples: {wsi.level_downsamples}")
print(f"Magnification: {wsi.metadata.get('openslide.objective-power')}")

# Generate thumbnail for QC
thumbnail = wsi.get_thumbnail(size=(1024, 1024))
plt.imshow(thumbnail)
plt.title(f"Slide: {wsi.name}")
plt.axis('off')
plt.show()

Processing Multiple Slides

from pathml.core import SlideDataset
from pathml.preprocessing import Pipeline, TissueDetectionHE
import glob

# Find all slides
slide_paths = glob.glob("data/slides/*.svs")

# Create pipeline
pipeline = Pipeline([TissueDetectionHE()])

# Process all slides
dataset = SlideDataset(
    slide_paths,
    tile_size=512,
    stride=512,
    level=1
)

# Run pipeline with distributed processing
dataset.run(pipeline, distributed=True, n_workers=8)

# Save processed data
dataset.to_hdf5("processed_dataset.h5")

Loading CODEX Multiparametric Data

from pathml.core import CODEXSlide
from pathml.preprocessing import Pipeline, CollapseRunsCODEX, SegmentMIF

# Load CODEX slide
codex = CODEXSlide("path/to/codex_dir", stain="IF")

# Create CODEX-specific pipeline
pipeline = Pipeline([
    CollapseRunsCODEX(z_slice=2),  # Select z-slice
    SegmentMIF(
        nuclear_channel='DAPI',
        cytoplasm_channel='CD45',
        model='mesmer'
    )
])

# Process
pipeline.run(codex)

Additional Resources

PathML Documentation: https://pathml.readthedocs.io/
OpenSlide: https://openslide.org/ (underlying library for WSI formats)
Bio-Formats: https://www.openmicroscopy.org/bio-formats/ (alternative backend)
DICOM Standard: https://www.dicomstandard.org/

12 KiB Raw Blame History