PyHealth Datasets and Data Structures
Core Data Structures
Event
Individual medical occurrences with attributes including:
- code: Medical code (diagnosis, medication, procedure, lab test)
- vocabulary: Coding system (ICD-9-CM, NDC, LOINC, etc.)
- timestamp: Event occurrence time
- value: Numeric value (for labs, vital signs)
- unit: Measurement unit
Patient
Collection of events organized chronologically across visits. Each patient contains:
- patient_id: Unique identifier
- birth_datetime: Date of birth
- gender: Patient gender
- ethnicity: Patient ethnicity
- visits: List of visit objects
Visit
Healthcare encounter containing:
- visit_id: Unique identifier
- encounter_time: Visit timestamp
- discharge_time: Discharge timestamp
- visit_type: Type of encounter (inpatient, outpatient, emergency)
- events: List of events during this visit
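The three structures above nest as Patient, then Visit, then Event. A minimal sketch of that hierarchy follows, using plain dataclasses. These classes are illustrative stand-ins written for this section and are not PyHealth's own implementations; the field names follow the lists above, while the `events_in_order` helper and the default values are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Event:
    code: str                      # medical code, e.g. a diagnosis or lab code
    vocabulary: str                # coding system the code belongs to
    timestamp: datetime            # when the event occurred
    value: Optional[float] = None  # numeric result (labs, vital signs)
    unit: Optional[str] = None     # measurement unit for `value`


@dataclass
class Visit:
    visit_id: str
    encounter_time: datetime
    discharge_time: Optional[datetime] = None
    visit_type: str = "inpatient"
    events: List[Event] = field(default_factory=list)


@dataclass
class Patient:
    patient_id: str
    birth_datetime: Optional[datetime] = None
    gender: Optional[str] = None
    ethnicity: Optional[str] = None
    visits: List[Visit] = field(default_factory=list)

    def events_in_order(self) -> List[Event]:
        """Hypothetical helper: flatten all visits into one time-sorted list."""
        all_events = [e for v in self.visits for e in v.events]
        return sorted(all_events, key=lambda e: e.timestamp)


visit = Visit(
    visit_id="v1",
    encounter_time=datetime(2020, 1, 1, 8, 0),
    discharge_time=datetime(2020, 1, 3, 12, 0),
    events=[
        Event("2160-0", "LOINC", datetime(2020, 1, 2, 6, 0), value=1.4, unit="mg/dL"),
        Event("428.0", "ICD-9-CM", datetime(2020, 1, 1, 9, 0)),
    ],
)
patient = Patient(patient_id="p1", gender="F", visits=[visit])

ordered_codes = [e.code for e in patient.events_in_order()]
print(ordered_codes)  # ['428.0', '2160-0']
```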
BaseDataset Class
Key Methods:
- get_patient(patient_id): Retrieve single patient record
- iter_patients(): Iterate through all patients
- stats(): Get dataset statistics (patients, visits, events)
- set_task(task_fn): Define prediction task
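To make the role of each method concrete, here is a toy in-memory class exposing the same four method names. It is a hypothetical stand-in, not PyHealth's BaseDataset: the dictionary layout, the return types, and the `count_events_fn` task are all assumptions chosen to keep the sketch self-contained.

```python
from typing import Callable, Dict, Iterator, List


class ToyDataset:
    """Hypothetical in-memory stand-in; not PyHealth's BaseDataset."""

    def __init__(self, patients: Dict[str, dict]):
        # Maps patient_id -> {"patient_id": ..., "visits": [{"visit_id", "events"}]}
        self._patients = patients

    def get_patient(self, patient_id: str) -> dict:
        return self._patients[patient_id]

    def iter_patients(self) -> Iterator[dict]:
        # Yield one patient at a time instead of building a new list
        yield from self._patients.values()

    def stats(self) -> Dict[str, int]:
        n_visits = sum(len(p["visits"]) for p in self.iter_patients())
        n_events = sum(
            len(v["events"]) for p in self.iter_patients() for v in p["visits"]
        )
        return {"patients": len(self._patients), "visits": n_visits, "events": n_events}

    def set_task(self, task_fn: Callable[[dict], List[dict]]) -> List[dict]:
        # Apply the task function to every patient and collect the samples
        samples: List[dict] = []
        for patient in self.iter_patients():
            samples.extend(task_fn(patient))
        return samples


def count_events_fn(patient: dict) -> List[dict]:
    """Hypothetical task: one sample per visit, recording its event count."""
    return [
        {
            "patient_id": patient["patient_id"],
            "visit_id": v["visit_id"],
            "n_events": len(v["events"]),
        }
        for v in patient["visits"]
    ]


toy = ToyDataset({
    "p1": {"patient_id": "p1", "visits": [
        {"visit_id": "v1", "events": ["428.0", "401.9"]},
        {"visit_id": "v2", "events": ["428.0"]},
    ]},
    "p2": {"patient_id": "p2", "visits": [
        {"visit_id": "v3", "events": ["250.00"]},
    ]},
})

summary = toy.stats()
samples = toy.set_task(count_events_fn)
print(summary)
print(len(samples))
```

The real classes read from files on disk and may return richer objects; the sketch only shows how lookup, iteration, summary statistics, and task application relate to one another.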
Available Datasets
Electronic Health Record (EHR) Datasets
MIMIC-III Dataset (MIMIC3Dataset)
- Intensive care unit data from Beth Israel Deaconess Medical Center
- 40,000+ critical care patients
- Diagnoses, procedures, medications, lab results
- Usage:
from pyhealth.datasets import MIMIC3Dataset
MIMIC-IV Dataset (MIMIC4Dataset)
- Updated version with 70,000+ patients
- Improved data quality and coverage
- Enhanced demographic and clinical detail
- Usage:
from pyhealth.datasets import MIMIC4Dataset
eICU Dataset (eICUDataset)
- Multi-center critical care database
- 200,000+ admissions from 200+ hospitals
- Standardized ICU data across facilities
- Usage:
from pyhealth.datasets import eICUDataset
OMOP Dataset (OMOPDataset)
- Observational Medical Outcomes Partnership format
- Standardized common data model
- Interoperability across healthcare systems
- Usage:
from pyhealth.datasets import OMOPDataset
EHRShot Dataset (EHRShotDataset)
- Benchmark dataset for few-shot learning
- Specialized for testing model generalization
- Usage:
from pyhealth.datasets import EHRShotDataset
Physiological Signal Datasets
Sleep EEG Datasets:
- SleepEDFDataset: Sleep-EDF database for sleep staging
- SHHSDataset: Sleep Heart Health Study data
- ISRUCDataset: ISRUC-Sleep database
Temple University EEG Datasets:
- TUEVDataset: EEG event detection
- TUABDataset: Abnormal/normal EEG classification
- TUSZDataset: Seizure detection
All signal datasets support:
- Multi-channel EEG signals
- Standardized sampling rates
- Expert annotations
- Sleep stage or abnormality labels
Medical Imaging Datasets
COVID-19 CXR Dataset (COVID19CXRDataset)
- Chest X-ray images for COVID-19 classification
- Multi-class labels (COVID-19, pneumonia, normal)
- Usage:
from pyhealth.datasets import COVID19CXRDataset
Text-Based Datasets
Medical Transcriptions Dataset (MedicalTranscriptionsDataset)
- Clinical notes and transcriptions
- Medical specialty classification
- Text-based prediction tasks
- Usage:
from pyhealth.datasets import MedicalTranscriptionsDataset
Cardiology Dataset (CardiologyDataset)
- Cardiac patient records
- Cardiovascular disease prediction
- Usage:
from pyhealth.datasets import CardiologyDataset
Preprocessed Datasets
MIMIC Extract Dataset (MIMICExtractDataset)
- Pre-extracted MIMIC features
- Ready-to-use benchmarking data
- Reduced preprocessing requirements
- Usage:
from pyhealth.datasets import MIMICExtractDataset
SampleDataset Class
Converts raw datasets into task-specific formatted samples.
Purpose: Transform patient-level data into model-ready input/output pairs
Key Attributes:
- input_schema: Defines input data structure
- output_schema: Defines target labels/predictions
- samples: List of processed samples
Usage Pattern:
# After setting task on BaseDataset
sample_dataset = dataset.set_task(task_fn)
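As a rough illustration of what a task function does, the sketch below turns one patient record into input/output pairs. The dictionary layout, the schema strings, and the `next_visit_mortality_fn` name are assumptions made for this example; PyHealth's bundled task functions and schema formats may differ.

```python
from typing import Dict, List

# Hypothetical schema descriptions: which keys are inputs and which are labels
input_schema = {"conditions": "sequence"}
output_schema = {"label": "binary"}


def next_visit_mortality_fn(patient: Dict) -> List[Dict]:
    """Build one sample per visit, labeled with the outcome of the next visit."""
    samples = []
    visits = patient["visits"]
    # Pair each visit with the visit that follows it; the last visit has no
    # successor, so it never becomes an input.
    for current, following in zip(visits, visits[1:]):
        if not current["conditions"]:
            continue  # skip visits with no usable input codes
        samples.append({
            "patient_id": patient["patient_id"],
            "visit_id": current["visit_id"],
            "conditions": current["conditions"],
            "label": int(following["died"]),
        })
    return samples


patient = {
    "patient_id": "p1",
    "visits": [
        {"visit_id": "v1", "conditions": ["428.0", "401.9"], "died": False},
        {"visit_id": "v2", "conditions": ["428.0"], "died": False},
        {"visit_id": "v3", "conditions": ["584.9"], "died": True},
    ],
}

samples = next_visit_mortality_fn(patient)
for s in samples:
    print(s["visit_id"], s["conditions"], s["label"])
```

Each returned dictionary carries the identifiers plus every key named in the two schemas, which is the shape a model-ready sample needs.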
Data Splitting Functions
Patient-Level Split (split_by_patient)
- Ensures no patient appears in multiple splits
- Prevents data leakage
- Recommended for clinical prediction tasks
Visit-Level Split (split_by_visit)
- Splits by individual visits
- Allows same patient across splits (use cautiously)
Sample-Level Split (split_by_sample)
- Random sample splitting
- Most flexible but may cause leakage
Parameters:
- dataset: SampleDataset to split
- ratios: Sequence of train/validation/test split ratios (e.g., [0.7, 0.1, 0.2])
- seed: Random seed for reproducibility
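The idea behind a patient-level split can be sketched in a few lines of plain Python: shuffle the unique patient IDs, cut that list by the ratios, then assign every sample to the split that owns its patient. The function below is an illustrative re-implementation under those assumptions, not PyHealth's split_by_patient, and the `split_by_patient_sketch` name is hypothetical.

```python
import random
from typing import Dict, List, Optional, Sequence, Tuple


def split_by_patient_sketch(
    samples: List[Dict],
    ratios: Sequence[float],
    seed: Optional[int] = None,
) -> Tuple[List[Dict], List[Dict], List[Dict]]:
    """Split samples into train/val/test so that no patient spans two splits."""
    # Sort first so the shuffle depends only on the seed, not on set ordering
    patient_ids = sorted({s["patient_id"] for s in samples})
    random.Random(seed).shuffle(patient_ids)

    n = len(patient_ids)
    # round() avoids losing a patient to floating-point error (0.7 + 0.1 < 0.8)
    train_end = round(n * ratios[0])
    val_end = round(n * (ratios[0] + ratios[1]))

    train_ids = set(patient_ids[:train_end])
    val_ids = set(patient_ids[train_end:val_end])
    test_ids = set(patient_ids[val_end:])

    train = [s for s in samples if s["patient_id"] in train_ids]
    val = [s for s in samples if s["patient_id"] in val_ids]
    test = [s for s in samples if s["patient_id"] in test_ids]
    return train, val, test


# Ten hypothetical patients with two samples each
samples = [
    {"patient_id": f"p{i}", "visit_id": f"p{i}-v{j}"}
    for i in range(10)
    for j in range(2)
]

train, val, test = split_by_patient_sketch(samples, [0.7, 0.1, 0.2], seed=42)
print(len(train), len(val), len(test))  # 14 2 4
```

Because the ratios apply to patients rather than samples, the sample counts per split match the ratios only when patients contribute similar numbers of samples.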
Common Workflow
from pyhealth.datasets import MIMIC4Dataset
from pyhealth.tasks import mortality_prediction_mimic4_fn
from pyhealth.datasets import split_by_patient
# 1. Load dataset
dataset = MIMIC4Dataset(root="/path/to/data")
# 2. Set prediction task
sample_dataset = dataset.set_task(mortality_prediction_mimic4_fn)
# 3. Split data
train, val, test = split_by_patient(sample_dataset, [0.7, 0.1, 0.2])
# 4. Get statistics
print(dataset.stats())
Performance Notes
- PyHealth is reported to be about 3x faster than pandas for healthcare data processing; the actual speedup depends on the dataset and workload
- Optimized for large-scale EHR datasets
- Memory-efficient patient iteration
- Vectorized operations for feature extraction