14 KiB
PyHealth Data Preprocessing and Processors
Overview
PyHealth provides comprehensive data processing utilities to transform raw healthcare data into model-ready formats. Processors handle feature extraction, sequence processing, signal transformation, and label preparation.
Processor Base Class
All processors inherit from Processor with standard interface:
Key Methods:
__call__(): Transform input dataget_input_info(): Return processed input schemaget_output_info(): Return processed output schema
Core Processor Types
Feature Processors
FeatureProcessor (FeatureProcessor)
- Base class for feature extraction
- Handles vocabulary building
- Embedding preparation
- Feature encoding
Common Operations:
- Medical code tokenization
- Categorical encoding
- Feature normalization
- Missing value handling
Usage:
from pyhealth.data import FeatureProcessor
processor = FeatureProcessor(
vocabulary="diagnoses",
min_freq=5, # Minimum code frequency
max_vocab_size=10000
)
processed_features = processor(raw_features)
Sequence Processors
SequenceProcessor (SequenceProcessor)
- Processes sequential clinical events
- Temporal ordering preservation
- Sequence padding/truncation
- Time gap encoding
Key Features:
- Variable-length sequence handling
- Temporal feature extraction
- Sequence statistics computation
Parameters:
max_seq_length: Maximum sequence length (truncate if longer)padding: Padding strategy ("pre" or "post")truncating: Truncation strategy ("pre" or "post")
Usage:
from pyhealth.data import SequenceProcessor
processor = SequenceProcessor(
max_seq_length=100,
padding="post",
truncating="post"
)
# Process diagnosis sequences
processed_seq = processor(diagnosis_sequences)
NestedSequenceProcessor (NestedSequenceProcessor)
- Handles hierarchical sequences (e.g., visits containing events)
- Two-level processing (visit-level and event-level)
- Preserves nested structure
Use Cases:
- EHR with visits containing multiple events
- Multi-level temporal modeling
- Hierarchical attention models
Structure:
# Input: [[visit1_events], [visit2_events], ...]
# Output: Processed nested sequences with proper padding
Numeric Data Processors
NestedFloatsProcessor (NestedFloatsProcessor)
- Processes nested numeric arrays
- Lab values, vital signs, measurements
- Multi-level numeric features
Operations:
- Normalization
- Standardization
- Missing value imputation
- Outlier handling
Usage:
from pyhealth.data import NestedFloatsProcessor
processor = NestedFloatsProcessor(
normalization="z-score", # or "min-max"
fill_missing="mean" # imputation strategy
)
processed_labs = processor(lab_values)
TensorProcessor (TensorProcessor)
- Converts data to PyTorch tensors
- Type handling (long, float, etc.)
- Device placement (CPU/GPU)
Parameters:
dtype: Tensor data typedevice: Computation device
Time-Series Processors
TimeseriesProcessor (TimeseriesProcessor)
- Handles temporal data with timestamps
- Time gap computation
- Temporal feature engineering
- Irregular sampling handling
Extracted Features:
- Time since previous event
- Time to next event
- Event frequency
- Temporal patterns
Usage:
from pyhealth.data import TimeseriesProcessor
processor = TimeseriesProcessor(
time_unit="hour", # "day", "hour", "minute"
compute_gaps=True,
compute_frequency=True
)
processed_ts = processor(timestamps, events)
SignalProcessor (SignalProcessor)
- Physiological signal processing
- EEG, ECG, PPG signals
- Filtering and preprocessing
Operations:
- Bandpass filtering
- Artifact removal
- Segmentation
- Feature extraction (frequency, amplitude)
Usage:
from pyhealth.data import SignalProcessor
processor = SignalProcessor(
sampling_rate=256, # Hz
bandpass_filter=(0.5, 50), # Hz range
segment_length=30 # seconds
)
processed_signal = processor(raw_eeg_signal)
Image Processors
ImageProcessor (ImageProcessor)
- Medical image preprocessing
- Normalization and resizing
- Augmentation support
- Format standardization
Operations:
- Resize to standard dimensions
- Normalization (mean/std)
- Windowing (for CT/MRI)
- Data augmentation
Usage:
from pyhealth.data import ImageProcessor
processor = ImageProcessor(
image_size=(224, 224),
normalization="imagenet", # or custom mean/std
augmentation=True
)
processed_image = processor(raw_image)
Label Processors
Binary Classification
BinaryLabelProcessor (BinaryLabelProcessor)
- Binary classification labels (0/1)
- Handles positive/negative classes
- Class weighting for imbalance
Usage:
from pyhealth.data import BinaryLabelProcessor
processor = BinaryLabelProcessor(
positive_class=1,
class_weight="balanced"
)
processed_labels = processor(raw_labels)
Multi-Class Classification
MultiClassLabelProcessor (MultiClassLabelProcessor)
- Multi-class classification (mutually exclusive classes)
- Label encoding
- Class balancing
Parameters:
num_classes: Number of classesclass_weight: Weighting strategy
Usage:
from pyhealth.data import MultiClassLabelProcessor
processor = MultiClassLabelProcessor(
num_classes=5, # e.g., sleep stages: W, N1, N2, N3, REM
class_weight="balanced"
)
processed_labels = processor(raw_labels)
Multi-Label Classification
MultiLabelProcessor (MultiLabelProcessor)
- Multi-label classification (multiple labels per sample)
- Binary encoding for each label
- Label co-occurrence handling
Use Cases:
- Drug recommendation (multiple drugs)
- ICD coding (multiple diagnoses)
- Comorbidity prediction
Usage:
from pyhealth.data import MultiLabelProcessor
processor = MultiLabelProcessor(
num_labels=100, # total possible labels
threshold=0.5 # prediction threshold
)
processed_labels = processor(raw_label_sets)
Regression
RegressionLabelProcessor (RegressionLabelProcessor)
- Continuous value prediction
- Target scaling and normalization
- Outlier handling
Use Cases:
- Length of stay prediction
- Lab value prediction
- Risk score estimation
Usage:
from pyhealth.data import RegressionLabelProcessor
processor = RegressionLabelProcessor(
normalization="z-score", # or "min-max"
clip_outliers=True,
outlier_std=3 # clip at 3 standard deviations
)
processed_targets = processor(raw_values)
Specialized Processors
Text Processing
TextProcessor (TextProcessor)
- Clinical text preprocessing
- Tokenization
- Vocabulary building
- Sequence encoding
Operations:
- Lowercasing
- Punctuation removal
- Medical abbreviation handling
- Token frequency filtering
Usage:
from pyhealth.data import TextProcessor
processor = TextProcessor(
tokenizer="word", # or "sentencepiece", "bpe"
lowercase=True,
max_vocab_size=50000,
min_freq=5
)
processed_text = processor(clinical_notes)
Model-Specific Processors
StageNetProcessor (StageNetProcessor)
- Specialized preprocessing for StageNet model
- Chunk-based sequence processing
- Stage-aware feature extraction
Usage:
from pyhealth.data import StageNetProcessor
processor = StageNetProcessor(
chunk_size=128,
num_stages=3
)
processed_data = processor(sequential_data)
StageNetTensorProcessor (StageNetTensorProcessor)
- Tensor conversion for StageNet
- Proper batching and padding
- Stage mask generation
Raw Data Processing
RawProcessor (RawProcessor)
- Minimal preprocessing
- Pass-through for pre-processed data
- Custom preprocessing scenarios
Usage:
from pyhealth.data import RawProcessor
processor = RawProcessor()
processed_data = processor(data) # Minimal transformation
Sample-Level Processing
SampleProcessor (SampleProcessor)
- Processes complete samples (input + output)
- Coordinates multiple processors
- End-to-end preprocessing pipeline
Workflow:
- Apply input processors to features
- Apply output processors to labels
- Combine into model-ready samples
Usage:
from pyhealth.data import SampleProcessor
processor = SampleProcessor(
input_processors={
"diagnoses": SequenceProcessor(max_seq_length=50),
"medications": SequenceProcessor(max_seq_length=30),
"labs": NestedFloatsProcessor(normalization="z-score")
},
output_processor=BinaryLabelProcessor()
)
processed_sample = processor(raw_sample)
Dataset-Level Processing
DatasetProcessor (DatasetProcessor)
- Processes entire datasets
- Batch processing
- Parallel processing support
- Caching for efficiency
Operations:
- Apply processors to all samples
- Generate vocabulary from dataset
- Compute dataset statistics
- Save processed data
Usage:
from pyhealth.data import DatasetProcessor
processor = DatasetProcessor(
sample_processor=sample_processor,
num_workers=4, # parallel processing
cache_dir="/path/to/cache"
)
processed_dataset = processor(raw_dataset)
Common Preprocessing Workflows
Workflow 1: EHR Mortality Prediction
from pyhealth.data import (
SequenceProcessor,
BinaryLabelProcessor,
SampleProcessor
)
# Define processors
input_processors = {
"diagnoses": SequenceProcessor(max_seq_length=50),
"medications": SequenceProcessor(max_seq_length=30),
"procedures": SequenceProcessor(max_seq_length=20)
}
output_processor = BinaryLabelProcessor(class_weight="balanced")
# Combine into sample processor
sample_processor = SampleProcessor(
input_processors=input_processors,
output_processor=output_processor
)
# Process dataset
processed_samples = [sample_processor(s) for s in raw_samples]
Workflow 2: Sleep Staging from EEG
from pyhealth.data import (
SignalProcessor,
MultiClassLabelProcessor,
SampleProcessor
)
# Signal preprocessing
signal_processor = SignalProcessor(
sampling_rate=100,
bandpass_filter=(0.3, 35), # EEG frequency range
segment_length=30 # 30-second epochs
)
# Label processing
label_processor = MultiClassLabelProcessor(
num_classes=5, # W, N1, N2, N3, REM
class_weight="balanced"
)
# Combine
sample_processor = SampleProcessor(
input_processors={"signal": signal_processor},
output_processor=label_processor
)
Workflow 3: Drug Recommendation
from pyhealth.data import (
SequenceProcessor,
MultiLabelProcessor,
SampleProcessor
)
# Input processing
input_processors = {
"diagnoses": SequenceProcessor(max_seq_length=50),
"previous_medications": SequenceProcessor(max_seq_length=40)
}
# Multi-label output (multiple drugs)
output_processor = MultiLabelProcessor(
num_labels=150, # number of possible drugs
threshold=0.5
)
sample_processor = SampleProcessor(
input_processors=input_processors,
output_processor=output_processor
)
Workflow 4: Length of Stay Prediction
from pyhealth.data import (
SequenceProcessor,
NestedFloatsProcessor,
RegressionLabelProcessor,
SampleProcessor
)
# Process different feature types
input_processors = {
"diagnoses": SequenceProcessor(max_seq_length=30),
"procedures": SequenceProcessor(max_seq_length=20),
"labs": NestedFloatsProcessor(
normalization="z-score",
fill_missing="mean"
)
}
# Regression target
output_processor = RegressionLabelProcessor(
normalization="log", # log-transform LOS
clip_outliers=True
)
sample_processor = SampleProcessor(
input_processors=input_processors,
output_processor=output_processor
)
Best Practices
Sequence Processing
-
Choose appropriate max_seq_length: Balance between context and computation
- Short sequences (20-50): Fast, less context
- Medium sequences (50-100): Good balance
- Long sequences (100+): More context, slower
-
Truncation strategy:
- "post": Keep most recent events (recommended for clinical prediction)
- "pre": Keep earliest events
-
Padding strategy:
- "post": Pad at end (standard)
- "pre": Pad at beginning
Feature Encoding
-
Vocabulary size: Limit to frequent codes
min_freq=5: Include codes appearing ≥5 timesmax_vocab_size=10000: Cap total vocabulary size
-
Handle rare codes: Group into "unknown" category
-
Missing values:
- Imputation (mean, median, forward-fill)
- Indicator variables
- Special tokens
Normalization
-
Numeric features: Always normalize
- Z-score: Standard scaling (mean=0, std=1)
- Min-max: Range scaling [0, 1]
-
Compute statistics on training set only: Prevent data leakage
-
Apply same normalization to val/test sets
Class Imbalance
-
Use class weighting:
class_weight="balanced" -
Consider oversampling: For very rare positive cases
-
Evaluate with appropriate metrics: AUROC, AUPRC, F1
Performance Optimization
-
Cache processed data: Save preprocessing results
-
Parallel processing: Use
num_workersfor DataLoader -
Batch processing: Process multiple samples at once
-
Feature selection: Remove low-information features
Validation
-
Check processed shapes: Ensure correct dimensions
-
Verify value ranges: After normalization
-
Inspect samples: Manually review processed data
-
Monitor memory usage: Especially for large datasets
Troubleshooting
Common Issues
Memory Error:
- Reduce
max_seq_length - Use smaller batches
- Process data in chunks
- Enable caching to disk
Slow Processing:
- Enable parallel processing (
num_workers) - Cache preprocessed data
- Reduce feature dimensionality
- Use more efficient data types
Shape Mismatch:
- Check sequence lengths
- Verify padding configuration
- Ensure consistent processor settings
NaN Values:
- Handle missing data explicitly
- Check normalization parameters
- Verify imputation strategy
Class Imbalance:
- Use class weighting
- Consider oversampling
- Adjust decision threshold
- Use appropriate evaluation metrics