Initial commit
638
skills/pyhealth/references/preprocessing.md
Normal file
@@ -0,0 +1,638 @@
# PyHealth Data Preprocessing and Processors

## Overview

PyHealth provides data processing utilities that transform raw healthcare data into model-ready formats. Processors handle feature extraction, sequence processing, signal transformation, and label preparation.

## Processor Base Class

All processors inherit from `Processor` and share a standard interface:

**Key Methods:**
- `__call__()`: Transform input data
- `get_input_info()`: Return the processed input schema
- `get_output_info()`: Return the processed output schema
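
The three methods listed above can be pictured with a small stand-in class. This is only a sketch of the interface shape: `ToyProcessor`, `LowercaseCodes`, and the schema dictionaries are hypothetical names invented for illustration and are not PyHealth's own classes or return types.

```python
from abc import ABC, abstractmethod


class ToyProcessor(ABC):
    """Illustrative stand-in for the interface above (not PyHealth's class)."""

    @abstractmethod
    def __call__(self, data):
        """Transform input data."""

    def get_input_info(self):
        # Hypothetical schema format: a plain dict describing the input.
        return {"type": "any"}

    def get_output_info(self):
        # Hypothetical schema format: a plain dict describing the output.
        return {"type": "any"}


class LowercaseCodes(ToyProcessor):
    """Toy subclass that lowercases a list of medical codes."""

    def __call__(self, data):
        return [code.lower() for code in data]

    def get_input_info(self):
        return {"type": "list[str]"}

    def get_output_info(self):
        return {"type": "list[str]"}


proc = LowercaseCodes()
result = proc(["I10", "E11.9"])
```

Keeping the transformation in `__call__` lets a processor be passed anywhere a plain function is expected.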

## Core Processor Types

### Feature Processors

**FeatureProcessor** (`FeatureProcessor`)
- Base class for feature extraction
- Builds vocabularies
- Prepares inputs for embedding layers
- Encodes features

**Common Operations:**
- Medical code tokenization
- Categorical encoding
- Feature normalization
- Missing value handling

**Usage:**
```python
from pyhealth.data import FeatureProcessor

processor = FeatureProcessor(
    vocabulary="diagnoses",
    min_freq=5,  # Minimum code frequency
    max_vocab_size=10000
)

processed_features = processor(raw_features)
```
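To make the `min_freq` and `max_vocab_size` options concrete, here is a minimal pure-Python sketch of frequency-filtered vocabulary building. The helper names `build_vocab` and `encode`, the `<pad>`/`<unk>` tokens, and the choice to count those special tokens toward `max_vocab_size` are all assumptions made for illustration, not the library's implementation.

```python
from collections import Counter


def build_vocab(sequences, min_freq=2, max_vocab_size=10000):
    """Map codes to integer ids, keeping only sufficiently frequent codes."""
    counts = Counter(code for seq in sequences for code in seq)
    vocab = {"<pad>": 0, "<unk>": 1}  # assumed special tokens
    # most_common() is sorted by descending count, so we can stop early.
    for code, freq in counts.most_common():
        if freq < min_freq or len(vocab) >= max_vocab_size:
            break
        vocab[code] = len(vocab)
    return vocab


def encode(seq, vocab):
    """Replace each code with its id; unseen or rare codes map to <unk>."""
    return [vocab.get(code, vocab["<unk>"]) for code in seq]


visits = [["I10", "E11.9"], ["I10", "N18.3"], ["I10", "E11.9", "Z79.4"]]
vocab = build_vocab(visits, min_freq=2)
encoded = encode(["I10", "Z79.4"], vocab)  # Z79.4 is rare, so it becomes <unk>
```

Mapping rare codes to a shared `<unk>` id keeps the embedding table small while still giving the model a signal that some code was present.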

### Sequence Processors

**SequenceProcessor** (`SequenceProcessor`)
- Processes sequential clinical events
- Preserves temporal ordering
- Pads and truncates sequences
- Encodes time gaps

**Key Features:**
- Variable-length sequence handling
- Temporal feature extraction
- Sequence statistics computation

**Parameters:**
- `max_seq_length`: Maximum sequence length (longer sequences are truncated)
- `padding`: Padding strategy (`"pre"` or `"post"`)
- `truncating`: Truncation strategy (`"pre"` or `"post"`)

**Usage:**
```python
from pyhealth.data import SequenceProcessor

processor = SequenceProcessor(
    max_seq_length=100,
    padding="post",
    truncating="post"
)

# Process diagnosis sequences
processed_seq = processor(diagnosis_sequences)
```
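The effect of padding and truncation on a single sequence can be sketched in a few lines. The function `pad_or_truncate` and its `keep` argument are hypothetical; `keep` is used here instead of a `"pre"`/`"post"` truncation flag so that the sketch states explicitly which end of the sequence survives. It assumes `max_len >= 1`.

```python
def pad_or_truncate(seq, max_len, pad_value=0, padding="post", keep="last"):
    """Force a sequence to exactly max_len items (assumes max_len >= 1)."""
    if len(seq) > max_len:
        # keep="last" retains the most recent events, keep="first" the earliest.
        seq = seq[-max_len:] if keep == "last" else seq[:max_len]
    pad = [pad_value] * (max_len - len(seq))
    # padding="post" appends the pad values, padding="pre" prepends them.
    return seq + pad if padding == "post" else pad + seq


truncated = pad_or_truncate([1, 2, 3, 4, 5, 6], max_len=4)
post_padded = pad_or_truncate([7, 8], max_len=4)
pre_padded = pad_or_truncate([7, 8], max_len=4, padding="pre")
```

Here `truncated` keeps the four most recent items, while the two padded results differ only in where the zeros are placed.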

**NestedSequenceProcessor** (`NestedSequenceProcessor`)
- Handles hierarchical sequences (e.g., visits containing events)
- Two-level processing (visit level and event level)
- Preserves nested structure

**Use Cases:**
- EHR with visits containing multiple events
- Multi-level temporal modeling
- Hierarchical attention models

**Structure:**
```python
# Input: [[visit1_events], [visit2_events], ...]
# Output: Processed nested sequences with proper padding
```
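One way to read "proper padding" for nested input is a fixed grid of `max_visits` rows by `max_events` columns. The sketch below assumes that reading; `pad_nested` and its arguments are hypothetical names, and the choices to keep the most recent visits and the first events of each visit are illustrative assumptions.

```python
def pad_nested(visits, max_visits, max_events, pad_value=0):
    """Pad a list of visits (each a list of event ids) to a fixed 2-D shape."""
    visits = visits[-max_visits:]  # assumption: keep the most recent visits
    padded = []
    for events in visits:
        events = events[:max_events]  # assumption: keep the first events
        padded.append(events + [pad_value] * (max_events - len(events)))
    # Append all-padding visits until the visit dimension is full.
    while len(padded) < max_visits:
        padded.append([pad_value] * max_events)
    return padded


patient = [[5, 8, 2], [7], [4, 9]]
grid = pad_nested(patient, max_visits=4, max_events=2)
```

Every patient then yields a grid of the same shape, which is what batching into a single tensor requires.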

### Numeric Data Processors

**NestedFloatsProcessor** (`NestedFloatsProcessor`)
- Processes nested numeric arrays
- Covers lab values, vital signs, and other measurements
- Supports multi-level numeric features

**Operations:**
- Normalization
- Standardization
- Missing value imputation
- Outlier handling

**Usage:**
```python
from pyhealth.data import NestedFloatsProcessor

processor = NestedFloatsProcessor(
    normalization="z-score",  # or "min-max"
    fill_missing="mean"       # imputation strategy
)

processed_labs = processor(lab_values)
```
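As a rough illustration of combining mean imputation with z-score normalization, the sketch below treats `None` as a missing value in a nested list holding a single lab feature. The function name `zscore_nested` is hypothetical, and a real pipeline would normally compute statistics per feature rather than over one pooled list.

```python
import statistics


def zscore_nested(visits):
    """Mean-impute None values, then z-score every value in a nested list."""
    observed = [v for visit in visits for v in visit if v is not None]
    mean = statistics.fmean(observed)
    std = statistics.pstdev(observed) or 1.0  # avoid division by zero
    return [
        [((mean if v is None else v) - mean) / std for v in visit]
        for visit in visits
    ]


labs = [[1.0, None], [3.0, 5.0]]
scaled = zscore_nested(labs)
```

Because missing entries are filled with the mean before scaling, they come out as exactly 0.0, which is the neutral value after z-scoring.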

**TensorProcessor** (`TensorProcessor`)
- Converts data to PyTorch tensors
- Handles data types (long, float, etc.)
- Places tensors on a device (CPU/GPU)

**Parameters:**
- `dtype`: Tensor data type
- `device`: Computation device

### Time-Series Processors

**TimeseriesProcessor** (`TimeseriesProcessor`)
- Handles temporal data with timestamps
- Computes time gaps
- Supports temporal feature engineering
- Handles irregular sampling

**Extracted Features:**
- Time since previous event
- Time to next event
- Event frequency
- Temporal patterns

**Usage:**
```python
from pyhealth.data import TimeseriesProcessor

processor = TimeseriesProcessor(
    time_unit="hour",  # "day", "hour", "minute"
    compute_gaps=True,
    compute_frequency=True
)

processed_ts = processor(timestamps, events)
```
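The "time since previous event" feature can be computed directly from timestamps. The sketch below uses only the standard library; `time_gaps` and `unit_seconds` are hypothetical names, and assigning a gap of 0.0 to the first event is an assumed convention.

```python
from datetime import datetime


def time_gaps(timestamps, unit_seconds=3600):
    """Return the gap from each event to the previous one, in the given unit."""
    ordered = sorted(timestamps)
    gaps = [0.0]  # assumed convention: the first event has no predecessor
    for prev, curr in zip(ordered, ordered[1:]):
        gaps.append((curr - prev).total_seconds() / unit_seconds)
    return gaps


events = [
    datetime(2024, 1, 1, 8, 0),
    datetime(2024, 1, 1, 9, 30),
    datetime(2024, 1, 1, 14, 0),
]
gaps_in_hours = time_gaps(events)
```

Passing `unit_seconds=60` or `unit_seconds=86400` would express the same gaps in minutes or days.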

**SignalProcessor** (`SignalProcessor`)
- Processes physiological signals
- Supports EEG, ECG, and PPG signals
- Applies filtering and preprocessing

**Operations:**
- Bandpass filtering
- Artifact removal
- Segmentation
- Feature extraction (frequency, amplitude)

**Usage:**
```python
from pyhealth.data import SignalProcessor

processor = SignalProcessor(
    sampling_rate=256,          # Hz
    bandpass_filter=(0.5, 50),  # Hz range
    segment_length=30           # seconds
)

processed_signal = processor(raw_eeg_signal)
```
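Of the operations above, segmentation is the simplest to show without extra dependencies. The sketch below splits a one-channel signal into fixed-length epochs; `segment_signal` is a hypothetical helper, and dropping the trailing partial segment is an assumed policy.

```python
def segment_signal(samples, sampling_rate, segment_seconds):
    """Split a 1-D signal into non-overlapping fixed-length segments."""
    size = int(sampling_rate * segment_seconds)  # samples per segment
    n_full = len(samples) // size
    # Assumption: a trailing partial segment is discarded, not padded.
    return [samples[i * size:(i + 1) * size] for i in range(n_full)]


signal = list(range(10))  # toy signal: 10 samples at 2 Hz
epochs = segment_signal(signal, sampling_rate=2, segment_seconds=2)
```

With the settings from the usage example (256 Hz, 30-second segments), each epoch would hold 7,680 samples.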

### Image Processors

**ImageProcessor** (`ImageProcessor`)
- Medical image preprocessing
- Normalization and resizing
- Augmentation support
- Format standardization

**Operations:**
- Resize to standard dimensions
- Normalization (mean/std)
- Windowing (for CT/MRI)
- Data augmentation

**Usage:**
```python
from pyhealth.data import ImageProcessor

processor = ImageProcessor(
    image_size=(224, 224),
    normalization="imagenet",  # or custom mean/std
    augmentation=True
)

processed_image = processor(raw_image)
```

## Label Processors

### Binary Classification

**BinaryLabelProcessor** (`BinaryLabelProcessor`)
- Binary classification labels (0/1)
- Handles positive/negative classes
- Class weighting for imbalance

**Usage:**
```python
from pyhealth.data import BinaryLabelProcessor

processor = BinaryLabelProcessor(
    positive_class=1,
    class_weight="balanced"
)

processed_labels = processor(raw_labels)
```
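A common interpretation of "balanced" class weights is the inverse-frequency heuristic `n_samples / (n_classes * class_count)`, the same formula scikit-learn uses for `class_weight="balanced"`. Whether the processor above uses exactly this formula is an assumption; the helper `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Toy label set: 8 negatives and 2 positives.
weights = balanced_class_weights([0] * 8 + [1] * 2)
```

The rare positive class receives a weight four times larger than the negative class, so each class contributes equally to a weighted loss.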

### Multi-Class Classification

**MultiClassLabelProcessor** (`MultiClassLabelProcessor`)
- Multi-class classification (mutually exclusive classes)
- Label encoding
- Class balancing

**Parameters:**
- `num_classes`: Number of classes
- `class_weight`: Weighting strategy

**Usage:**
```python
from pyhealth.data import MultiClassLabelProcessor

processor = MultiClassLabelProcessor(
    num_classes=5,  # e.g., sleep stages: W, N1, N2, N3, REM
    class_weight="balanced"
)

processed_labels = processor(raw_labels)
```

### Multi-Label Classification

**MultiLabelProcessor** (`MultiLabelProcessor`)
- Multi-label classification (multiple labels per sample)
- Binary encoding for each label
- Label co-occurrence handling

**Use Cases:**
- Drug recommendation (multiple drugs)
- ICD coding (multiple diagnoses)
- Comorbidity prediction

**Usage:**
```python
from pyhealth.data import MultiLabelProcessor

processor = MultiLabelProcessor(
    num_labels=100,  # total possible labels
    threshold=0.5    # prediction threshold
)

processed_labels = processor(raw_label_sets)
```
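"Binary encoding for each label" usually means a multi-hot vector with one position per possible label. A minimal sketch, assuming a fixed, ordered label vocabulary; `multi_hot` is a hypothetical helper and silently ignoring unknown labels is an assumed policy.

```python
def multi_hot(label_sets, label_vocab):
    """Encode each set of labels as a 0/1 vector over label_vocab."""
    index = {label: i for i, label in enumerate(label_vocab)}
    vectors = []
    for labels in label_sets:
        vec = [0] * len(label_vocab)
        for label in labels:
            if label in index:  # assumption: unknown labels are skipped
                vec[index[label]] = 1
        vectors.append(vec)
    return vectors


drugs = ["aspirin", "insulin", "metformin"]
encoded = multi_hot([{"aspirin", "metformin"}, set()], drugs)
```

A sample with no labels becomes an all-zero vector, which is valid in the multi-label setting.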

### Regression

**RegressionLabelProcessor** (`RegressionLabelProcessor`)
- Continuous value prediction
- Target scaling and normalization
- Outlier handling

**Use Cases:**
- Length of stay prediction
- Lab value prediction
- Risk score estimation

**Usage:**
```python
from pyhealth.data import RegressionLabelProcessor

processor = RegressionLabelProcessor(
    normalization="z-score",  # or "min-max"
    clip_outliers=True,
    outlier_std=3             # clip at 3 standard deviations
)

processed_targets = processor(raw_values)
```
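Clipping at a number of standard deviations and then standardizing can be sketched as below. The helper `clip_and_standardize` is hypothetical, and it assumes the mean and standard deviation are computed once on the unclipped values; other orderings are possible.

```python
import statistics


def clip_and_standardize(values, outlier_std=3.0):
    """Clip values to mean +/- outlier_std * std, then z-score them."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values) or 1.0  # avoid division by zero
    low, high = mean - outlier_std * std, mean + outlier_std * std
    clipped = [min(max(v, low), high) for v in values]
    # Assumption: scaling reuses the statistics of the unclipped values.
    return [(v - mean) / std for v in clipped]


# Toy length-of-stay targets in days, with one extreme stay.
los_days = [1.0] * 9 + [100.0]
scaled = clip_and_standardize(los_days, outlier_std=2.0)
```

In this toy example the 100-day stay sits three standard deviations above the mean, so with `outlier_std=2.0` it is pulled back to a scaled value of 2.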

## Specialized Processors

### Text Processing

**TextProcessor** (`TextProcessor`)
- Clinical text preprocessing
- Tokenization
- Vocabulary building
- Sequence encoding

**Operations:**
- Lowercasing
- Punctuation removal
- Medical abbreviation handling
- Token frequency filtering

**Usage:**
```python
from pyhealth.data import TextProcessor

processor = TextProcessor(
    tokenizer="word",  # or "sentencepiece", "bpe"
    lowercase=True,
    max_vocab_size=50000,
    min_freq=5
)

processed_text = processor(clinical_notes)
```
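Lowercasing, punctuation removal, and word-level tokenization can be illustrated with the standard library alone. The function `tokenize` below is a hypothetical sketch, not the tokenizer the library uses.

```python
import re


def tokenize(note, lowercase=True):
    """Lowercase, replace punctuation with spaces, and split on whitespace."""
    if lowercase:
        note = note.lower()
    note = re.sub(r"[^\w\s]", " ", note)  # anything that is not a word char or space
    return note.split()


tokens = tokenize("Pt c/o SOB; hx of CHF.")
```

Note that naive punctuation removal splits the abbreviation "c/o" into two tokens, which is why medical abbreviation handling appears as a separate operation in the list above.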

### Model-Specific Processors

**StageNetProcessor** (`StageNetProcessor`)
- Specialized preprocessing for StageNet model
- Chunk-based sequence processing
- Stage-aware feature extraction

**Usage:**
```python
from pyhealth.data import StageNetProcessor

processor = StageNetProcessor(
    chunk_size=128,
    num_stages=3
)

processed_data = processor(sequential_data)
```

**StageNetTensorProcessor** (`StageNetTensorProcessor`)
- Tensor conversion for StageNet
- Proper batching and padding
- Stage mask generation

### Raw Data Processing

**RawProcessor** (`RawProcessor`)
- Minimal preprocessing
- Pass-through for pre-processed data
- Custom preprocessing scenarios

**Usage:**
```python
from pyhealth.data import RawProcessor

processor = RawProcessor()
processed_data = processor(data)  # Minimal transformation
```

## Sample-Level Processing

**SampleProcessor** (`SampleProcessor`)
- Processes complete samples (input + output)
- Coordinates multiple processors
- End-to-end preprocessing pipeline

**Workflow:**
1. Apply input processors to features
2. Apply output processors to labels
3. Combine into model-ready samples

**Usage:**
```python
from pyhealth.data import (
    BinaryLabelProcessor,
    NestedFloatsProcessor,
    SampleProcessor,
    SequenceProcessor
)

processor = SampleProcessor(
    input_processors={
        "diagnoses": SequenceProcessor(max_seq_length=50),
        "medications": SequenceProcessor(max_seq_length=30),
        "labs": NestedFloatsProcessor(normalization="z-score")
    },
    output_processor=BinaryLabelProcessor()
)

processed_sample = processor(raw_sample)
```
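The three workflow steps above amount to applying one callable per field and collecting the results. The sketch below uses plain functions in place of real processors; `process_sample`, the `label_key` argument, and the toy lambdas are hypothetical and only illustrate the coordination pattern.

```python
def process_sample(sample, input_processors, output_processor, label_key="label"):
    """Apply per-field input processors, then the label processor."""
    # Step 1: transform each input feature with its own processor.
    processed = {name: proc(sample[name]) for name, proc in input_processors.items()}
    # Steps 2 and 3: transform the label and combine into one sample.
    processed[label_key] = output_processor(sample[label_key])
    return processed


raw_sample = {"diagnoses": ["I10", "E11.9"], "labs": [7.0, 9.0], "label": "died"}
result = process_sample(
    raw_sample,
    input_processors={
        "diagnoses": lambda codes: [c.lower() for c in codes],  # toy processor
        "labs": lambda xs: [x / 10 for x in xs],                # toy scaling
    },
    output_processor=lambda y: int(y == "died"),                # toy binary label
)
```

Keeping processors in a dictionary keyed by field name makes it easy to swap a single feature's preprocessing without touching the rest of the pipeline.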

## Dataset-Level Processing

**DatasetProcessor** (`DatasetProcessor`)
- Processes entire datasets
- Batch processing
- Parallel processing support
- Caching for efficiency

**Operations:**
- Apply processors to all samples
- Generate vocabulary from dataset
- Compute dataset statistics
- Save processed data

**Usage:**
```python
from pyhealth.data import DatasetProcessor

# sample_processor is a SampleProcessor configured as in the previous section
processor = DatasetProcessor(
    sample_processor=sample_processor,
    num_workers=4,  # parallel processing
    cache_dir="/path/to/cache"
)

processed_dataset = processor(raw_dataset)
```

## Common Preprocessing Workflows

### Workflow 1: EHR Mortality Prediction

```python
from pyhealth.data import (
    SequenceProcessor,
    BinaryLabelProcessor,
    SampleProcessor
)

# Define processors
input_processors = {
    "diagnoses": SequenceProcessor(max_seq_length=50),
    "medications": SequenceProcessor(max_seq_length=30),
    "procedures": SequenceProcessor(max_seq_length=20)
}

output_processor = BinaryLabelProcessor(class_weight="balanced")

# Combine into sample processor
sample_processor = SampleProcessor(
    input_processors=input_processors,
    output_processor=output_processor
)

# Process dataset
processed_samples = [sample_processor(s) for s in raw_samples]
```

### Workflow 2: Sleep Staging from EEG

```python
from pyhealth.data import (
    SignalProcessor,
    MultiClassLabelProcessor,
    SampleProcessor
)

# Signal preprocessing
signal_processor = SignalProcessor(
    sampling_rate=100,
    bandpass_filter=(0.3, 35),  # EEG frequency range
    segment_length=30           # 30-second epochs
)

# Label processing
label_processor = MultiClassLabelProcessor(
    num_classes=5,  # W, N1, N2, N3, REM
    class_weight="balanced"
)

# Combine
sample_processor = SampleProcessor(
    input_processors={"signal": signal_processor},
    output_processor=label_processor
)
```

### Workflow 3: Drug Recommendation

```python
from pyhealth.data import (
    SequenceProcessor,
    MultiLabelProcessor,
    SampleProcessor
)

# Input processing
input_processors = {
    "diagnoses": SequenceProcessor(max_seq_length=50),
    "previous_medications": SequenceProcessor(max_seq_length=40)
}

# Multi-label output (multiple drugs)
output_processor = MultiLabelProcessor(
    num_labels=150,  # number of possible drugs
    threshold=0.5
)

sample_processor = SampleProcessor(
    input_processors=input_processors,
    output_processor=output_processor
)
```

### Workflow 4: Length of Stay Prediction

```python
from pyhealth.data import (
    SequenceProcessor,
    NestedFloatsProcessor,
    RegressionLabelProcessor,
    SampleProcessor
)

# Process different feature types
input_processors = {
    "diagnoses": SequenceProcessor(max_seq_length=30),
    "procedures": SequenceProcessor(max_seq_length=20),
    "labs": NestedFloatsProcessor(
        normalization="z-score",
        fill_missing="mean"
    )
}

# Regression target
output_processor = RegressionLabelProcessor(
    normalization="log",  # log-transform LOS
    clip_outliers=True
)

sample_processor = SampleProcessor(
    input_processors=input_processors,
    output_processor=output_processor
)
```

## Best Practices

### Sequence Processing

1. **Choose an appropriate `max_seq_length`**: Balance context against computation
   - Short sequences (20-50): Fast, less context
   - Medium sequences (50-100): Good balance
   - Long sequences (100+): More context, slower

2. **Truncation strategy**:
   - "post": Keep most recent events (recommended for clinical prediction)
   - "pre": Keep earliest events

3. **Padding strategy**:
   - "post": Pad at end (standard)
   - "pre": Pad at beginning

### Feature Encoding

1. **Vocabulary size**: Limit to frequent codes
   - `min_freq=5`: Include codes appearing ≥5 times
   - `max_vocab_size=10000`: Cap total vocabulary size

2. **Handle rare codes**: Group into "unknown" category

3. **Missing values**:
   - Imputation (mean, median, forward-fill)
   - Indicator variables
   - Special tokens

### Normalization

1. **Numeric features**: Always normalize
   - Z-score: Standard scaling (mean=0, std=1)
   - Min-max: Range scaling [0, 1]

2. **Compute statistics on training set only**: Prevent data leakage

3. **Apply the same normalization to val/test sets**: Reuse the training-set statistics rather than recomputing them
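
Points 2 and 3 under Normalization can be made concrete with a fit/transform split: statistics are learned from the training set once and then reused unchanged. The `MinMaxScaler` class below is a hypothetical minimal sketch written for this example (it is not the scikit-learn class of the same name).

```python
class MinMaxScaler:
    """Min-max scaling whose range is learned from training data only."""

    def fit(self, values):
        self.low, self.high = min(values), max(values)
        return self

    def transform(self, values):
        span = (self.high - self.low) or 1.0  # avoid division by zero
        return [(v - self.low) / span for v in values]


train, test = [2.0, 4.0, 6.0], [8.0]

scaler = MinMaxScaler().fit(train)     # statistics come from train only
train_scaled = scaler.transform(train)
test_scaled = scaler.transform(test)   # same statistics reused for test
```

The test value lands outside [0, 1] because it exceeds the training maximum. That is expected: refitting on the test set to force it into range would leak test information into preprocessing.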

### Class Imbalance

1. **Use class weighting**: `class_weight="balanced"`

2. **Consider oversampling**: For very rare positive cases

3. **Evaluate with appropriate metrics**: AUROC, AUPRC, F1

### Performance Optimization

1. **Cache processed data**: Save preprocessing results

2. **Parallel processing**: Use `num_workers` for DataLoader

3. **Batch processing**: Process multiple samples at once

4. **Feature selection**: Remove low-information features

### Validation

1. **Check processed shapes**: Ensure correct dimensions

2. **Verify value ranges**: After normalization

3. **Inspect samples**: Manually review processed data

4. **Monitor memory usage**: Especially for large datasets

## Troubleshooting

### Common Issues

**Memory Error:**
- Reduce `max_seq_length`
- Use smaller batches
- Process data in chunks
- Enable caching to disk

**Slow Processing:**
- Enable parallel processing (`num_workers`)
- Cache preprocessed data
- Reduce feature dimensionality
- Use more efficient data types

**Shape Mismatch:**
- Check sequence lengths
- Verify padding configuration
- Ensure consistent processor settings

**NaN Values:**
- Handle missing data explicitly
- Check normalization parameters
- Verify imputation strategy

**Class Imbalance:**
- Use class weighting
- Consider oversampling
- Adjust decision threshold
- Use appropriate evaluation metrics