# PyHealth Data Preprocessing and Processors
## Overview
PyHealth provides comprehensive data processing utilities to transform raw healthcare data into model-ready formats. Processors handle feature extraction, sequence processing, signal transformation, and label preparation.
## Processor Base Class
All processors inherit from `Processor` and share a standard interface (sketched after the list below):
**Key Methods:**
- `__call__()`: Transform input data
- `get_input_info()`: Return processed input schema
- `get_output_info()`: Return processed output schema
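A minimal sketch of a custom processor, assuming subclasses only need to implement the three methods above; the actual base-class signatures in your PyHealth version may differ:
```python
from pyhealth.data import Processor  # import path assumed to match the examples below

class LowercaseCodeProcessor(Processor):
    """Toy processor that lowercases medical code strings (illustrative only)."""

    def __call__(self, value):
        # Transform the raw input into its processed form.
        return [code.lower() for code in value]

    def get_input_info(self):
        # Describe the expected input schema (here: a list of code strings).
        return {"type": list, "dtype": str}

    def get_output_info(self):
        # Describe the produced output schema.
        return {"type": list, "dtype": str}

processor = LowercaseCodeProcessor()
processor(["E11.9", "I10"])  # -> ["e11.9", "i10"]
```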
## Core Processor Types
### Feature Processors
**FeatureProcessor** (`FeatureProcessor`)
- Base class for feature extraction
- Vocabulary building
- Embedding preparation
- Feature encoding
**Common Operations:**
- Medical code tokenization
- Categorical encoding
- Feature normalization
- Missing value handling
**Usage:**
```python
from pyhealth.data import FeatureProcessor
processor = FeatureProcessor(
    vocabulary="diagnoses",
    min_freq=5,            # minimum code frequency
    max_vocab_size=10000
)
processed_features = processor(raw_features)
```
### Sequence Processors
**SequenceProcessor** (`SequenceProcessor`)
- Processes sequential clinical events
- Temporal ordering preservation
- Sequence padding/truncation
- Time gap encoding
**Key Features:**
- Variable-length sequence handling
- Temporal feature extraction
- Sequence statistics computation
**Parameters:**
- `max_seq_length`: Maximum sequence length (truncate if longer)
- `padding`: Padding strategy ("pre" or "post")
- `truncating`: Truncation strategy ("pre" or "post")
**Usage:**
```python
from pyhealth.data import SequenceProcessor
processor = SequenceProcessor(
    max_seq_length=100,
    padding="post",
    truncating="post"
)
# Process diagnosis sequences
processed_seq = processor(diagnosis_sequences)
```
**NestedSequenceProcessor** (`NestedSequenceProcessor`)
- Handles hierarchical sequences (e.g., visits containing events)
- Two-level processing (visit-level and event-level)
- Preserves nested structure
**Use Cases:**
- EHR with visits containing multiple events
- Multi-level temporal modeling
- Hierarchical attention models
**Structure:**
```python
# Input: [[visit1_events], [visit2_events], ...]
# Output: Processed nested sequences with proper padding
```
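A hedged usage sketch: the constructor arguments below are hypothetical (named here to show the visit-level and event-level limits) and may not match the actual `NestedSequenceProcessor` signature.
```python
from pyhealth.data import NestedSequenceProcessor

# Hypothetical argument names; check the class signature in your PyHealth version.
processor = NestedSequenceProcessor(
    max_nested_length=10,  # assumed: maximum number of visits kept per patient
    max_seq_length=20      # assumed: maximum number of events kept per visit
)

# Input: one inner list per visit, each holding that visit's event codes.
visits = [
    ["250.00", "401.9"],           # visit 1 diagnoses
    ["414.01"],                    # visit 2 diagnoses
    ["250.00", "272.4", "585.9"],  # visit 3 diagnoses
]
processed_visits = processor(visits)
```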
### Numeric Data Processors
**NestedFloatsProcessor** (`NestedFloatsProcessor`)
- Processes nested numeric arrays
- Lab values, vital signs, measurements
- Multi-level numeric features
**Operations:**
- Normalization
- Standardization
- Missing value imputation
- Outlier handling
**Usage:**
```python
from pyhealth.data import NestedFloatsProcessor
processor = NestedFloatsProcessor(
    normalization="z-score",  # or "min-max"
    fill_missing="mean"       # imputation strategy
)
processed_labs = processor(lab_values)
```
**TensorProcessor** (`TensorProcessor`)
- Converts data to PyTorch tensors
- Type handling (long, float, etc.)
- Device placement (CPU/GPU)
**Parameters:**
- `dtype`: Tensor data type
- `device`: Computation device
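A hedged usage sketch for `TensorProcessor`, assuming the constructor accepts the `dtype` and `device` parameters listed above:
```python
import torch
from pyhealth.data import TensorProcessor

# Parameter names follow the list above; exact behavior may differ by version.
processor = TensorProcessor(
    dtype=torch.float32,
    device="cuda" if torch.cuda.is_available() else "cpu"
)
processed_tensor = processor([[0.1, 0.2], [0.3, 0.4]])
```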
### Time-Series Processors
**TimeseriesProcessor** (`TimeseriesProcessor`)
- Handles temporal data with timestamps
- Time gap computation
- Temporal feature engineering
- Irregular sampling handling
**Extracted Features:**
- Time since previous event
- Time to next event
- Event frequency
- Temporal patterns
**Usage:**
```python
from pyhealth.data import TimeseriesProcessor
processor = TimeseriesProcessor(
    time_unit="hour",       # "day", "hour", "minute"
    compute_gaps=True,
    compute_frequency=True
)
processed_ts = processor(timestamps, events)
```
**SignalProcessor** (`SignalProcessor`)
- Physiological signal processing
- EEG, ECG, PPG signals
- Filtering and preprocessing
**Operations:**
- Bandpass filtering
- Artifact removal
- Segmentation
- Feature extraction (frequency, amplitude)
**Usage:**
```python
from pyhealth.data import SignalProcessor
processor = SignalProcessor(
    sampling_rate=256,          # Hz
    bandpass_filter=(0.5, 50),  # Hz range
    segment_length=30           # seconds
)
processed_signal = processor(raw_eeg_signal)
```
### Image Processors
**ImageProcessor** (`ImageProcessor`)
- Medical image preprocessing
- Normalization and resizing
- Augmentation support
- Format standardization
**Operations:**
- Resize to standard dimensions
- Normalization (mean/std)
- Windowing (for CT/MRI)
- Data augmentation
**Usage:**
```python
from pyhealth.data import ImageProcessor
processor = ImageProcessor(
    image_size=(224, 224),
    normalization="imagenet",  # or custom mean/std
    augmentation=True
)
processed_image = processor(raw_image)
```
## Label Processors
### Binary Classification
**BinaryLabelProcessor** (`BinaryLabelProcessor`)
- Binary classification labels (0/1)
- Handles positive/negative classes
- Class weighting for imbalance
**Usage:**
```python
from pyhealth.data import BinaryLabelProcessor
processor = BinaryLabelProcessor(
    positive_class=1,
    class_weight="balanced"
)
processed_labels = processor(raw_labels)
```
### Multi-Class Classification
**MultiClassLabelProcessor** (`MultiClassLabelProcessor`)
- Multi-class classification (mutually exclusive classes)
- Label encoding
- Class balancing
**Parameters:**
- `num_classes`: Number of classes
- `class_weight`: Weighting strategy
**Usage:**
```python
from pyhealth.data import MultiClassLabelProcessor
processor = MultiClassLabelProcessor(
    num_classes=5,           # e.g., sleep stages: W, N1, N2, N3, REM
    class_weight="balanced"
)
processed_labels = processor(raw_labels)
```
### Multi-Label Classification
**MultiLabelProcessor** (`MultiLabelProcessor`)
- Multi-label classification (multiple labels per sample)
- Binary encoding for each label
- Label co-occurrence handling
**Use Cases:**
- Drug recommendation (multiple drugs)
- ICD coding (multiple diagnoses)
- Comorbidity prediction
**Usage:**
```python
from pyhealth.data import MultiLabelProcessor
processor = MultiLabelProcessor(
    num_labels=100,  # total possible labels
    threshold=0.5    # prediction threshold
)
processed_labels = processor(raw_label_sets)
```
### Regression
**RegressionLabelProcessor** (`RegressionLabelProcessor`)
- Continuous value prediction
- Target scaling and normalization
- Outlier handling
**Use Cases:**
- Length of stay prediction
- Lab value prediction
- Risk score estimation
**Usage:**
```python
from pyhealth.data import RegressionLabelProcessor
processor = RegressionLabelProcessor(
    normalization="z-score",  # or "min-max"
    clip_outliers=True,
    outlier_std=3             # clip at 3 standard deviations
)
processed_targets = processor(raw_values)
```
## Specialized Processors
### Text Processing
**TextProcessor** (`TextProcessor`)
- Clinical text preprocessing
- Tokenization
- Vocabulary building
- Sequence encoding
**Operations:**
- Lowercasing
- Punctuation removal
- Medical abbreviation handling
- Token frequency filtering
**Usage:**
```python
from pyhealth.data import TextProcessor
processor = TextProcessor(
    tokenizer="word",      # or "sentencepiece", "bpe"
    lowercase=True,
    max_vocab_size=50000,
    min_freq=5
)
processed_text = processor(clinical_notes)
```
### Model-Specific Processors
**StageNetProcessor** (`StageNetProcessor`)
- Specialized preprocessing for StageNet model
- Chunk-based sequence processing
- Stage-aware feature extraction
**Usage:**
```python
from pyhealth.data import StageNetProcessor
processor = StageNetProcessor(
    chunk_size=128,
    num_stages=3
)
processed_data = processor(sequential_data)
```
**StageNetTensorProcessor** (`StageNetTensorProcessor`)
- Tensor conversion for StageNet
- Proper batching and padding
- Stage mask generation
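A hedged usage sketch, assuming `StageNetTensorProcessor` is applied to the output of `StageNetProcessor` above and takes no required constructor arguments (both assumptions):
```python
from pyhealth.data import StageNetTensorProcessor

# Assumed usage: convert chunked StageNet features into padded, batched tensors
# along with the accompanying stage masks.
tensor_processor = StageNetTensorProcessor()
batched_tensors = tensor_processor(processed_data)  # `processed_data` from the StageNetProcessor example
```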
### Raw Data Processing
**RawProcessor** (`RawProcessor`)
- Minimal preprocessing
- Pass-through for pre-processed data
- Custom preprocessing scenarios
**Usage:**
```python
from pyhealth.data import RawProcessor
processor = RawProcessor()
processed_data = processor(data) # Minimal transformation
```
## Sample-Level Processing
**SampleProcessor** (`SampleProcessor`)
- Processes complete samples (input + output)
- Coordinates multiple processors
- End-to-end preprocessing pipeline
**Workflow:**
1. Apply input processors to features
2. Apply output processors to labels
3. Combine into model-ready samples
**Usage:**
```python
from pyhealth.data import SampleProcessor
processor = SampleProcessor(
    input_processors={
        "diagnoses": SequenceProcessor(max_seq_length=50),
        "medications": SequenceProcessor(max_seq_length=30),
        "labs": NestedFloatsProcessor(normalization="z-score")
    },
    output_processor=BinaryLabelProcessor()
)
processed_sample = processor(raw_sample)
```
## Dataset-Level Processing
**DatasetProcessor** (`DatasetProcessor`)
- Processes entire datasets
- Batch processing
- Parallel processing support
- Caching for efficiency
**Operations:**
- Apply processors to all samples
- Generate vocabulary from dataset
- Compute dataset statistics
- Save processed data
**Usage:**
```python
from pyhealth.data import DatasetProcessor
processor = DatasetProcessor(
    sample_processor=sample_processor,
    num_workers=4,              # parallel processing
    cache_dir="/path/to/cache"
)
processed_dataset = processor(raw_dataset)
```
## Common Preprocessing Workflows
### Workflow 1: EHR Mortality Prediction
```python
from pyhealth.data import (
    SequenceProcessor,
    BinaryLabelProcessor,
    SampleProcessor
)
# Define processors
input_processors = {
    "diagnoses": SequenceProcessor(max_seq_length=50),
    "medications": SequenceProcessor(max_seq_length=30),
    "procedures": SequenceProcessor(max_seq_length=20)
}
output_processor = BinaryLabelProcessor(class_weight="balanced")
# Combine into sample processor
sample_processor = SampleProcessor(
    input_processors=input_processors,
    output_processor=output_processor
)
# Process dataset
processed_samples = [sample_processor(s) for s in raw_samples]
```
### Workflow 2: Sleep Staging from EEG
```python
from pyhealth.data import (
    SignalProcessor,
    MultiClassLabelProcessor,
    SampleProcessor
)
# Signal preprocessing
signal_processor = SignalProcessor(
    sampling_rate=100,
    bandpass_filter=(0.3, 35),  # EEG frequency range
    segment_length=30           # 30-second epochs
)
# Label processing
label_processor = MultiClassLabelProcessor(
    num_classes=5,              # W, N1, N2, N3, REM
    class_weight="balanced"
)
# Combine
sample_processor = SampleProcessor(
    input_processors={"signal": signal_processor},
    output_processor=label_processor
)
```
### Workflow 3: Drug Recommendation
```python
from pyhealth.data import (
    SequenceProcessor,
    MultiLabelProcessor,
    SampleProcessor
)
# Input processing
input_processors = {
    "diagnoses": SequenceProcessor(max_seq_length=50),
    "previous_medications": SequenceProcessor(max_seq_length=40)
}
# Multi-label output (multiple drugs)
output_processor = MultiLabelProcessor(
    num_labels=150,  # number of possible drugs
    threshold=0.5
)
sample_processor = SampleProcessor(
    input_processors=input_processors,
    output_processor=output_processor
)
```
### Workflow 4: Length of Stay Prediction
```python
from pyhealth.data import (
    SequenceProcessor,
    NestedFloatsProcessor,
    RegressionLabelProcessor,
    SampleProcessor
)
# Process different feature types
input_processors = {
    "diagnoses": SequenceProcessor(max_seq_length=30),
    "procedures": SequenceProcessor(max_seq_length=20),
    "labs": NestedFloatsProcessor(
        normalization="z-score",
        fill_missing="mean"
    )
}
# Regression target
output_processor = RegressionLabelProcessor(
    normalization="log",  # log-transform LOS
    clip_outliers=True
)
sample_processor = SampleProcessor(
    input_processors=input_processors,
    output_processor=output_processor
)
```
## Best Practices
### Sequence Processing
1. **Choose an appropriate `max_seq_length`**: Balance context against computation
   - Short sequences (20-50): Fast, less context
   - Medium sequences (50-100): Good balance
   - Long sequences (100+): More context, slower
2. **Truncation strategy** (see the sketch after this list):
   - "post": Keep most recent events (recommended for clinical prediction)
   - "pre": Keep earliest events
3. **Padding strategy**:
   - "post": Pad at end (standard)
   - "pre": Pad at beginning
### Feature Encoding
1. **Vocabulary size**: Limit the vocabulary to frequent codes
   - `min_freq=5`: Include codes appearing ≥5 times
   - `max_vocab_size=10000`: Cap total vocabulary size
2. **Handle rare codes**: Group them into an "unknown" category (see the sketch after this list)
3. **Missing values**:
   - Imputation (mean, median, forward-fill)
   - Indicator variables
   - Special tokens
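A plain-Python sketch of frequency-based vocabulary filtering with an "unknown" fallback for rare codes; `FeatureProcessor` performs an equivalent step internally (exact behavior may differ):
```python
from collections import Counter

def build_vocab(code_lists, min_freq=5, max_vocab_size=10000):
    # Count code occurrences across the training set.
    counts = Counter(code for codes in code_lists for code in codes)
    frequent = [c for c, n in counts.most_common(max_vocab_size) if n >= min_freq]
    vocab = {"<pad>": 0, "<unk>": 1}
    vocab.update({code: i + 2 for i, code in enumerate(frequent)})
    return vocab

def encode(codes, vocab):
    # Rare or unseen codes fall back to the "<unk>" index.
    return [vocab.get(code, vocab["<unk>"]) for code in codes]
```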
### Normalization
1. **Numeric features**: Always normalize
   - Z-score: Standard scaling (mean=0, std=1)
   - Min-max: Range scaling to [0, 1]
2. **Compute statistics on the training set only**: Prevents data leakage (see the sketch after this list)
3. **Apply the same normalization to the validation and test sets**
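A short NumPy sketch of leakage-free z-score normalization: statistics are fit on the training split only and reused for validation and test:
```python
import numpy as np

train_labs = np.array([[4.1, 140.0], [5.2, 138.0], [4.8, 142.0]])
val_labs = np.array([[4.5, 139.0]])

mean = train_labs.mean(axis=0)
std = train_labs.std(axis=0) + 1e-8   # avoid division by zero

train_scaled = (train_labs - mean) / std
val_scaled = (val_labs - mean) / std  # same statistics, no refitting
```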
### Class Imbalance
1. **Use class weighting**: `class_weight="balanced"` (see the weight computation sketched after this list)
2. **Consider oversampling**: For very rare positive cases
3. **Evaluate with appropriate metrics**: AUROC, AUPRC, F1
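A quick sketch of inverse-frequency ("balanced") class weights, the same formula scikit-learn uses for `class_weight="balanced"`; the resulting weights can be passed to a weighted loss such as `torch.nn.CrossEntropyLoss(weight=...)`:
```python
import numpy as np

labels = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 80/20 imbalance
counts = np.bincount(labels)
weights = len(labels) / (len(counts) * counts)      # array([0.625, 2.5])
```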
### Performance Optimization
1. **Cache processed data**: Save preprocessing results
2. **Parallel processing**: Use `num_workers` for the DataLoader (see the sketch after this list)
3. **Batch processing**: Process multiple samples at once
4. **Feature selection**: Remove low-information features
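A standard PyTorch sketch of parallel batch loading, assuming the processed dataset is torch-compatible (e.g., a list or `Dataset` of tensors):
```python
import torch
from torch.utils.data import DataLoader

loader = DataLoader(
    processed_dataset,                     # assumed torch-compatible dataset of processed samples
    batch_size=64,
    shuffle=True,
    num_workers=4,                         # parallel preprocessing/collation workers
    pin_memory=torch.cuda.is_available(),  # faster host-to-GPU transfer when training on GPU
)
```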
### Validation
1. **Check processed shapes**: Ensure correct dimensions (see the sanity-check sketch after this list)
2. **Verify value ranges**: After normalization
3. **Inspect samples**: Manually review processed data
4. **Monitor memory usage**: Especially for large datasets
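A small sanity-check helper for processed tensors (assumes floating-point tensors; adapt for other dtypes):
```python
import torch

def sanity_check(name, tensor):
    # Report shape, value range, and NaN presence for a processed feature.
    print(f"{name}: shape={tuple(tensor.shape)}, "
          f"min={tensor.min().item():.3f}, max={tensor.max().item():.3f}, "
          f"NaNs={torch.isnan(tensor).any().item()}")

sanity_check("labs", torch.randn(32, 20, 5))  # batch x time x features
```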
## Troubleshooting
### Common Issues
**Memory Error:**
- Reduce `max_seq_length`
- Use smaller batches
- Process data in chunks
- Enable caching to disk
**Slow Processing:**
- Enable parallel processing (`num_workers`)
- Cache preprocessed data
- Reduce feature dimensionality
- Use more efficient data types
**Shape Mismatch:**
- Check sequence lengths
- Verify padding configuration
- Ensure consistent processor settings
**NaN Values:**
- Handle missing data explicitly
- Check normalization parameters
- Verify imputation strategy
**Class Imbalance:**
- Use class weighting
- Consider oversampling
- Adjust decision threshold
- Use appropriate evaluation metrics