PyHealth Data Preprocessing and Processors

Overview

PyHealth provides data processing utilities that transform raw healthcare data into model-ready formats. Processors handle feature extraction, sequence processing, signal transformation, and label preparation.

Processor Base Class

All processors inherit from Processor and share a standard interface.

Key Methods:

  • __call__(): Transform input data
  • get_input_info(): Return processed input schema
  • get_output_info(): Return processed output schema

Core Processor Types

Feature Processors

FeatureProcessor

  • Base class for feature extraction
  • Handles vocabulary building
  • Embedding preparation
  • Feature encoding

Common Operations:

  • Medical code tokenization
  • Categorical encoding
  • Feature normalization
  • Missing value handling

Usage:

from pyhealth.data import FeatureProcessor

processor = FeatureProcessor(
    vocabulary="diagnoses",
    min_freq=5,  # Minimum code frequency
    max_vocab_size=10000
)

processed_features = processor(raw_features)
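
The effect of min_freq and max_vocab_size on vocabulary building can be illustrated with a small standalone sketch. It is not PyHealth's implementation: `build_vocab`, `encode`, and the `<pad>`/`<unk>` tokens are hypothetical names, and the sketch assumes the special tokens count toward the size cap.

```python
from collections import Counter

def build_vocab(code_sequences, min_freq=5, max_vocab_size=10000):
    """Map medical codes to integer ids, keeping only frequent codes."""
    counts = Counter(code for seq in code_sequences for code in seq)
    # Ids 0 and 1 are reserved for padding and for rare or unseen codes.
    vocab = {"<pad>": 0, "<unk>": 1}
    # most_common() yields codes in descending order of frequency.
    for code, freq in counts.most_common():
        if freq < min_freq or len(vocab) >= max_vocab_size:
            break
        vocab[code] = len(vocab)
    return vocab

def encode(seq, vocab):
    """Replace each code by its id; unknown codes map to <unk>."""
    return [vocab.get(code, vocab["<unk>"]) for code in seq]

visits = [["I10", "E11"], ["I10"], ["I10", "E11", "J45"]]
vocab = build_vocab(visits, min_freq=2)
print(vocab)                           # J45 appears once, so it is left out
print(encode(["I10", "J45"], vocab))   # J45 falls back to the <unk> id
```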

Sequence Processors

SequenceProcessor

  • Processes sequential clinical events
  • Temporal ordering preservation
  • Sequence padding/truncation
  • Time gap encoding

Key Features:

  • Variable-length sequence handling
  • Temporal feature extraction
  • Sequence statistics computation

Parameters:

  • max_seq_length: Maximum sequence length (truncate if longer)
  • padding: Padding strategy ("pre" or "post")
  • truncating: Truncation strategy ("pre" or "post")

Usage:

from pyhealth.data import SequenceProcessor

processor = SequenceProcessor(
    max_seq_length=100,
    padding="post",
    truncating="post"
)

# Process diagnosis sequences
processed_seq = processor(diagnosis_sequences)
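
What the padding and truncating options do to a single sequence can be sketched in plain Python. This assumes the Keras-style convention, in which the option value names the end of the sequence that is padded or cut; whether PyHealth follows that convention is not confirmed here. `pad_or_truncate` and `pad_value` are hypothetical names.

```python
def pad_or_truncate(seq, max_len, padding="post", truncating="post", pad_value=0):
    """Return a list of exactly max_len items (max_len must be >= 1)."""
    seq = list(seq)
    if len(seq) > max_len:
        # "pre" cuts items from the start, "post" cuts items from the end.
        seq = seq[-max_len:] if truncating == "pre" else seq[:max_len]
    pad = [pad_value] * (max_len - len(seq))
    # "pre" puts padding before the data, "post" puts it after.
    return pad + seq if padding == "pre" else seq + pad

print(pad_or_truncate([1, 2, 3], 5))                          # padded at the end
print(pad_or_truncate([1, 2, 3], 5, padding="pre"))           # padded at the start
print(pad_or_truncate([1, 2, 3, 4, 5], 3, truncating="pre"))  # keeps the last 3
print(pad_or_truncate([1, 2, 3, 4, 5], 3))                    # keeps the first 3
```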

NestedSequenceProcessor

  • Handles hierarchical sequences (e.g., visits containing events)
  • Two-level processing (visit-level and event-level)
  • Preserves nested structure

Use Cases:

  • EHR with visits containing multiple events
  • Multi-level temporal modeling
  • Hierarchical attention models

Structure:

# Input: [[visit1_events], [visit2_events], ...]
# Output: Processed nested sequences with proper padding
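
As a rough illustration of two-level padding, the sketch below pads a patient's visits to a fixed number of visits and each visit to a fixed number of events, and returns a mask marking real visits. The function name, its parameters, and the mask are assumptions made for this example, not PyHealth's API.

```python
def pad_nested(visits, max_visits, max_events, pad_value=0):
    """Pad or truncate to shape (max_visits, max_events) and build a visit mask."""
    padded, mask = [], []
    for events in visits[:max_visits]:
        events = list(events[:max_events])
        padded.append(events + [pad_value] * (max_events - len(events)))
        mask.append(1)  # real visit
    while len(padded) < max_visits:
        padded.append([pad_value] * max_events)
        mask.append(0)  # padding visit
    return padded, mask

# Two visits: the first has three coded events, the second has one.
padded, mask = pad_nested([[5, 6, 7], [8]], max_visits=3, max_events=2)
print(padded)
print(mask)
```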

Numeric Data Processors

NestedFloatsProcessor

  • Processes nested numeric arrays
  • Lab values, vital signs, measurements
  • Multi-level numeric features

Operations:

  • Normalization
  • Standardization
  • Missing value imputation
  • Outlier handling

Usage:

from pyhealth.data import NestedFloatsProcessor

processor = NestedFloatsProcessor(
    normalization="z-score",  # or "min-max"
    fill_missing="mean"  # imputation strategy
)

processed_labs = processor(lab_values)
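
The combination of z-score normalization and mean imputation can be sketched without any library. The example treats each row as one visit's lab vector and uses None for a missing value; `fit_stats` and `transform` are hypothetical helpers and only approximate what a real processor might do.

```python
import math

def fit_stats(rows):
    """Per-column mean and standard deviation over observed (non-None) values."""
    stats = []
    for j in range(len(rows[0])):
        vals = [r[j] for r in rows if r[j] is not None]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats.append((mean, math.sqrt(var) or 1.0))  # avoid dividing by zero
    return stats

def transform(rows, stats):
    """Z-score each value; a missing value imputed with the mean becomes 0.0."""
    return [
        [0.0 if v is None else (v - m) / s for v, (m, s) in zip(r, stats)]
        for r in rows
    ]

labs = [[1.0, 10.0], [3.0, None], [5.0, 30.0]]
stats = fit_stats(labs)
z = transform(labs, stats)
print(z)
```

Imputing the column mean before z-scoring is equivalent to writing 0.0 after z-scoring, which is why the missing entry in the second row comes out as zero.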

TensorProcessor

  • Converts data to PyTorch tensors
  • Type handling (long, float, etc.)
  • Device placement (CPU/GPU)

Parameters:

  • dtype: Tensor data type
  • device: Computation device

Time-Series Processors

TimeseriesProcessor

  • Handles temporal data with timestamps
  • Time gap computation
  • Temporal feature engineering
  • Irregular sampling handling

Extracted Features:

  • Time since previous event
  • Time to next event
  • Event frequency
  • Temporal patterns

Usage:

from pyhealth.data import TimeseriesProcessor

processor = TimeseriesProcessor(
    time_unit="hour",  # "day", "hour", "minute"
    compute_gaps=True,
    compute_frequency=True
)

processed_ts = processor(timestamps, events)
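
The "time since previous event" feature reduces to differences between consecutive timestamps, expressed in the chosen unit. The following is a minimal sketch with a hypothetical `time_gaps` helper; the first event is given a gap of 0.0 by assumption.

```python
from datetime import datetime

def time_gaps(timestamps, unit="hour"):
    """Gap between each event and the previous one, in the given unit."""
    seconds_per = {"minute": 60, "hour": 3600, "day": 86400}[unit]
    ordered = sorted(timestamps)
    if not ordered:
        return []
    gaps = [0.0]  # the first event has no predecessor
    for prev, cur in zip(ordered, ordered[1:]):
        gaps.append((cur - prev).total_seconds() / seconds_per)
    return gaps

events = [
    datetime(2025, 1, 1, 8, 0),
    datetime(2025, 1, 1, 9, 30),
    datetime(2025, 1, 2, 9, 30),
]
gaps = time_gaps(events, unit="hour")
print(gaps)
```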

SignalProcessor

  • Physiological signal processing
  • EEG, ECG, PPG signals
  • Filtering and preprocessing

Operations:

  • Bandpass filtering
  • Artifact removal
  • Segmentation
  • Feature extraction (frequency, amplitude)

Usage:

from pyhealth.data import SignalProcessor

processor = SignalProcessor(
    sampling_rate=256,  # Hz
    bandpass_filter=(0.5, 50),  # Hz range
    segment_length=30  # seconds
)

processed_signal = processor(raw_eeg_signal)
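
Segmentation is the easiest of these operations to show in isolation: a segment holds sampling_rate × segment_length samples, so 256 Hz and 30 seconds give 7680 samples per segment. The sketch below uses a hypothetical `segment` function and drops any incomplete trailing segment, which is one common choice rather than a documented PyHealth behavior.

```python
def segment(signal, sampling_rate, segment_length):
    """Split a 1-D signal into non-overlapping fixed-length segments."""
    samples = int(sampling_rate * segment_length)
    n_segments = len(signal) // samples  # incomplete tail is dropped
    return [signal[i * samples:(i + 1) * samples] for i in range(n_segments)]

# Toy signal: 10 samples at 2 Hz, cut into 2-second segments of 4 samples each.
toy_signal = list(range(10))
segments = segment(toy_signal, sampling_rate=2, segment_length=2)
print(segments)
```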

Image Processors

ImageProcessor

  • Medical image preprocessing
  • Normalization and resizing
  • Augmentation support
  • Format standardization

Operations:

  • Resize to standard dimensions
  • Normalization (mean/std)
  • Windowing (for CT/MRI)
  • Data augmentation

Usage:

from pyhealth.data import ImageProcessor

processor = ImageProcessor(
    image_size=(224, 224),
    normalization="imagenet",  # or custom mean/std
    augmentation=True
)

processed_image = processor(raw_image)
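
Windowing for CT maps a chosen intensity range onto [0, 1] and clips everything outside it. The sketch below operates on a flat list of intensity values; `apply_window` is a hypothetical helper, and the center of 40 and width of 400 are used only as an illustrative soft-tissue setting.

```python
def apply_window(pixels, center, width):
    """Clip intensities to [center - width/2, center + width/2], then scale to [0, 1]."""
    lo, hi = center - width / 2, center + width / 2
    return [(min(max(p, lo), hi) - lo) / (hi - lo) for p in pixels]

# Values below -160 map to 0.0 and values above 240 map to 1.0.
windowed = apply_window([-1000, 40, 240, 1000], center=40, width=400)
print(windowed)
```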

Label Processors

Binary Classification

BinaryLabelProcessor

  • Binary classification labels (0/1)
  • Handles positive/negative classes
  • Class weighting for imbalance

Usage:

from pyhealth.data import BinaryLabelProcessor

processor = BinaryLabelProcessor(
    positive_class=1,
    class_weight="balanced"
)

processed_labels = processor(raw_labels)
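
One widely used definition of "balanced" class weights is the scikit-learn formula n_samples / (n_classes × class_count). Whether class_weight="balanced" means exactly this in PyHealth is an assumption; the sketch only shows the formula, using a hypothetical `balanced_class_weights` helper.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# 9 negatives and 1 positive: the rare positive class gets the larger weight.
weights = balanced_class_weights([0] * 9 + [1])
print(weights)
```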

Multi-Class Classification

MultiClassLabelProcessor

  • Multi-class classification (mutually exclusive classes)
  • Label encoding
  • Class balancing

Parameters:

  • num_classes: Number of classes
  • class_weight: Weighting strategy

Usage:

from pyhealth.data import MultiClassLabelProcessor

processor = MultiClassLabelProcessor(
    num_classes=5,  # e.g., sleep stages: W, N1, N2, N3, REM
    class_weight="balanced"
)

processed_labels = processor(raw_labels)

Multi-Label Classification

MultiLabelProcessor

  • Multi-label classification (multiple labels per sample)
  • Binary encoding for each label
  • Label co-occurrence handling

Use Cases:

  • Drug recommendation (multiple drugs)
  • ICD coding (multiple diagnoses)
  • Comorbidity prediction

Usage:

from pyhealth.data import MultiLabelProcessor

processor = MultiLabelProcessor(
    num_labels=100,  # total possible labels
    threshold=0.5  # prediction threshold
)

processed_labels = processor(raw_label_sets)
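
"Binary encoding for each label" usually means a multi-hot vector with one position per possible label. A minimal sketch, with a hypothetical `multi_hot` helper and a made-up three-drug label space:

```python
def multi_hot(label_set, label_to_index):
    """Encode a set of labels as a 0/1 vector with one slot per known label."""
    vec = [0] * len(label_to_index)
    for label in label_set:
        vec[label_to_index[label]] = 1
    return vec

label_to_index = {"aspirin": 0, "metformin": 1, "lisinopril": 2}
encoded = multi_hot({"aspirin", "lisinopril"}, label_to_index)
print(encoded)
```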

Regression

RegressionLabelProcessor

  • Continuous value prediction
  • Target scaling and normalization
  • Outlier handling

Use Cases:

  • Length of stay prediction
  • Lab value prediction
  • Risk score estimation

Usage:

from pyhealth.data import RegressionLabelProcessor

processor = RegressionLabelProcessor(
    normalization="z-score",  # or "min-max"
    clip_outliers=True,
    outlier_std=3  # clip at 3 standard deviations
)

processed_targets = processor(raw_values)
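
Clipping at 3 standard deviations followed by z-scoring can be sketched as below. The helper name is hypothetical, and computing the mean and standard deviation before clipping (rather than after) is a design choice made for this example only.

```python
from statistics import fmean, pstdev

def clip_and_standardize(values, outlier_std=3.0):
    """Clip values to mean +/- outlier_std * std, then z-score them."""
    mean, std = fmean(values), pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    lo, hi = mean - outlier_std * std, mean + outlier_std * std
    clipped = [min(max(v, lo), hi) for v in values]
    # Standardize with the statistics computed before clipping.
    return [(v - mean) / std for v in clipped]

# 19 short stays and one extreme stay of 100 days.
los_days = [2.0] * 19 + [100.0]
scaled = clip_and_standardize(los_days)
print(min(scaled), max(scaled))
```

With these statistics the extreme stay lies more than 3 standard deviations above the mean, so after clipping its z-score is capped at the outlier_std value.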

Specialized Processors

Text Processing

TextProcessor

  • Clinical text preprocessing
  • Tokenization
  • Vocabulary building
  • Sequence encoding

Operations:

  • Lowercasing
  • Punctuation removal
  • Medical abbreviation handling
  • Token frequency filtering

Usage:

from pyhealth.data import TextProcessor

processor = TextProcessor(
    tokenizer="word",  # or "sentencepiece", "bpe"
    lowercase=True,
    max_vocab_size=50000,
    min_freq=5
)

processed_text = processor(clinical_notes)
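
Lowercasing, punctuation removal, and word tokenization can be approximated with the standard library. This sketch uses a hypothetical `tokenize` function and also shows why medical abbreviation handling matters: naive punctuation removal splits "c/o" into two tokens.

```python
import re

def tokenize(note, lowercase=True):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    if lowercase:
        note = note.lower()
    # Anything that is neither a word character nor whitespace becomes a space.
    note = re.sub(r"[^\w\s]", " ", note)
    return note.split()

tokens = tokenize("Pt c/o SOB, denies CP.")
print(tokens)
```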

Model-Specific Processors

StageNetProcessor

  • Specialized preprocessing for StageNet model
  • Chunk-based sequence processing
  • Stage-aware feature extraction

Usage:

from pyhealth.data import StageNetProcessor

processor = StageNetProcessor(
    chunk_size=128,
    num_stages=3
)

processed_data = processor(sequential_data)

StageNetTensorProcessor

  • Tensor conversion for StageNet
  • Proper batching and padding
  • Stage mask generation

Raw Data Processing

RawProcessor

  • Minimal preprocessing
  • Pass-through for pre-processed data
  • Custom preprocessing scenarios

Usage:

from pyhealth.data import RawProcessor

processor = RawProcessor()
processed_data = processor(data)  # Minimal transformation

Sample-Level Processing

SampleProcessor

  • Processes complete samples (input + output)
  • Coordinates multiple processors
  • End-to-end preprocessing pipeline

Workflow:

  1. Apply input processors to features
  2. Apply output processors to labels
  3. Combine into model-ready samples

Usage:

from pyhealth.data import (
    SequenceProcessor,
    NestedFloatsProcessor,
    BinaryLabelProcessor,
    SampleProcessor
)

processor = SampleProcessor(
    input_processors={
        "diagnoses": SequenceProcessor(max_seq_length=50),
        "medications": SequenceProcessor(max_seq_length=30),
        "labs": NestedFloatsProcessor(normalization="z-score")
    },
    output_processor=BinaryLabelProcessor()
)

processed_sample = processor(raw_sample)
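
The three-step workflow above amounts to applying one callable per input field and one to the label. The sketch below reproduces that pattern with plain functions standing in for processors; `process_sample`, the `label` key, and the toy lambdas are all hypothetical.

```python
def process_sample(raw_sample, input_processors, output_processor, label_key="label"):
    """Apply per-field input processors and the output processor to one sample."""
    processed = {
        name: proc(raw_sample[name]) for name, proc in input_processors.items()
    }
    processed[label_key] = output_processor(raw_sample[label_key])
    return processed

# Toy stand-ins: keep at most 2 codes, and map a yes/no label to 1/0.
input_processors = {"diagnoses": lambda codes: codes[:2]}
output_processor = lambda label: 1 if label == "yes" else 0

raw_sample = {"diagnoses": ["I10", "E11", "J45"], "label": "yes"}
processed = process_sample(raw_sample, input_processors, output_processor)
print(processed)
```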

Dataset-Level Processing

DatasetProcessor

  • Processes entire datasets
  • Batch processing
  • Parallel processing support
  • Caching for efficiency

Operations:

  • Apply processors to all samples
  • Generate vocabulary from dataset
  • Compute dataset statistics
  • Save processed data

Usage:

from pyhealth.data import DatasetProcessor

# sample_processor is a SampleProcessor configured as in the previous section
processor = DatasetProcessor(
    sample_processor=sample_processor,
    num_workers=4,  # parallel processing
    cache_dir="/path/to/cache"
)

processed_dataset = processor(raw_dataset)

Common Preprocessing Workflows

Workflow 1: EHR Mortality Prediction

from pyhealth.data import (
    SequenceProcessor,
    BinaryLabelProcessor,
    SampleProcessor
)

# Define processors
input_processors = {
    "diagnoses": SequenceProcessor(max_seq_length=50),
    "medications": SequenceProcessor(max_seq_length=30),
    "procedures": SequenceProcessor(max_seq_length=20)
}

output_processor = BinaryLabelProcessor(class_weight="balanced")

# Combine into sample processor
sample_processor = SampleProcessor(
    input_processors=input_processors,
    output_processor=output_processor
)

# Process dataset
processed_samples = [sample_processor(s) for s in raw_samples]

Workflow 2: Sleep Staging from EEG

from pyhealth.data import (
    SignalProcessor,
    MultiClassLabelProcessor,
    SampleProcessor
)

# Signal preprocessing
signal_processor = SignalProcessor(
    sampling_rate=100,
    bandpass_filter=(0.3, 35),  # EEG frequency range
    segment_length=30  # 30-second epochs
)

# Label processing
label_processor = MultiClassLabelProcessor(
    num_classes=5,  # W, N1, N2, N3, REM
    class_weight="balanced"
)

# Combine
sample_processor = SampleProcessor(
    input_processors={"signal": signal_processor},
    output_processor=label_processor
)

Workflow 3: Drug Recommendation

from pyhealth.data import (
    SequenceProcessor,
    MultiLabelProcessor,
    SampleProcessor
)

# Input processing
input_processors = {
    "diagnoses": SequenceProcessor(max_seq_length=50),
    "previous_medications": SequenceProcessor(max_seq_length=40)
}

# Multi-label output (multiple drugs)
output_processor = MultiLabelProcessor(
    num_labels=150,  # number of possible drugs
    threshold=0.5
)

sample_processor = SampleProcessor(
    input_processors=input_processors,
    output_processor=output_processor
)

Workflow 4: Length of Stay Prediction

from pyhealth.data import (
    SequenceProcessor,
    NestedFloatsProcessor,
    RegressionLabelProcessor,
    SampleProcessor
)

# Process different feature types
input_processors = {
    "diagnoses": SequenceProcessor(max_seq_length=30),
    "procedures": SequenceProcessor(max_seq_length=20),
    "labs": NestedFloatsProcessor(
        normalization="z-score",
        fill_missing="mean"
    )
}

# Regression target
output_processor = RegressionLabelProcessor(
    normalization="log",  # log-transform LOS
    clip_outliers=True
)

sample_processor = SampleProcessor(
    input_processors=input_processors,
    output_processor=output_processor
)

Best Practices

Sequence Processing

  1. Choose appropriate max_seq_length: Balance between context and computation

    • Short sequences (20-50): Fast, less context
    • Medium sequences (50-100): Good balance
    • Long sequences (100+): More context, slower
  2. Truncation strategy:

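    • "pre": Cut from the start of the sequence, keeping the most recent events (usually preferable for clinical prediction); this assumes the Keras-style convention in which the value names the end that is cut
    • "post": Cut from the end of the sequence, keeping the earliest events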
  3. Padding strategy:

    • "post": Pad at end (standard)
    • "pre": Pad at beginning

Feature Encoding

  1. Vocabulary size: Limit to frequent codes

    • min_freq=5: Include codes appearing ≥5 times
    • max_vocab_size=10000: Cap total vocabulary size
  2. Handle rare codes: Group into "unknown" category

  3. Missing values:

    • Imputation (mean, median, forward-fill)
    • Indicator variables
    • Special tokens

Normalization

  1. Numeric features: Normalize before modeling so that features on different scales are comparable

    • Z-score: Standard scaling (mean=0, std=1)
    • Min-max: Rescaling to the range [0, 1]
  2. Compute statistics on training set only: Prevent data leakage

  3. Apply same normalization to val/test sets
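
Points 2 and 3 can be made concrete with min-max scaling: the minimum and maximum come from the training split only and are then reused unchanged on the other splits. The helper names below are hypothetical.

```python
def fit_min_max(train_values):
    """Compute scaling parameters from the training split only."""
    return min(train_values), max(train_values)

def apply_min_max(values, lo, hi):
    """Scale values with previously fitted parameters."""
    span = (hi - lo) or 1.0  # avoid dividing by zero for constant features
    return [(v - lo) / span for v in values]

train = [10.0, 20.0, 30.0]
test = [15.0, 40.0]

lo, hi = fit_min_max(train)                 # fitted on train only
train_scaled = apply_min_max(train, lo, hi)
test_scaled = apply_min_max(test, lo, hi)   # same parameters reused
print(train_scaled, test_scaled)
```

Test values can fall outside [0, 1] when they exceed the training range, as the second test value does here. That is expected and is preferable to refitting on the test split.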

Class Imbalance

  1. Use class weighting: class_weight="balanced"

  2. Consider oversampling: For very rare positive cases

  3. Evaluate with appropriate metrics: AUROC, AUPRC, F1

Performance Optimization

  1. Cache processed data: Save preprocessing results

  2. Parallel processing: Set num_workers on DatasetProcessor during preprocessing, and on the DataLoader during training

  3. Batch processing: Process multiple samples at once

  4. Feature selection: Remove low-information features

Validation

  1. Check processed shapes: Ensure correct dimensions

  2. Verify value ranges: After normalization

  3. Inspect samples: Manually review processed data

  4. Monitor memory usage: Especially for large datasets

Troubleshooting

Common Issues

Memory Error:

  • Reduce max_seq_length
  • Use smaller batches
  • Process data in chunks
  • Enable caching to disk

Slow Processing:

  • Enable parallel processing (num_workers)
  • Cache preprocessed data
  • Reduce feature dimensionality
  • Use more efficient data types

Shape Mismatch:

  • Check sequence lengths
  • Verify padding configuration
  • Ensure consistent processor settings

NaN Values:

  • Handle missing data explicitly
  • Check normalization parameters
  • Verify imputation strategy

Class Imbalance:

  • Use class weighting
  • Consider oversampling
  • Adjust decision threshold
  • Use appropriate evaluation metrics