# PyHealth Datasets and Data Structures ## Core Data Structures ### Event Individual medical occurrences with attributes including: - **code**: Medical code (diagnosis, medication, procedure, lab test) - **vocabulary**: Coding system (ICD-9-CM, NDC, LOINC, etc.) - **timestamp**: Event occurrence time - **value**: Numeric value (for labs, vital signs) - **unit**: Measurement unit ### Patient Collection of events organized chronologically across visits. Each patient contains: - **patient_id**: Unique identifier - **birth_datetime**: Date of birth - **gender**: Patient gender - **ethnicity**: Patient ethnicity - **visits**: List of visit objects ### Visit Healthcare encounter containing: - **visit_id**: Unique identifier - **encounter_time**: Visit timestamp - **discharge_time**: Discharge timestamp - **visit_type**: Type of encounter (inpatient, outpatient, emergency) - **events**: List of events during this visit ## BaseDataset Class **Key Methods:** - `get_patient(patient_id)`: Retrieve single patient record - `iter_patients()`: Iterate through all patients - `stats()`: Get dataset statistics (patients, visits, events) - `set_task(task_fn)`: Define prediction task ## Available Datasets ### Electronic Health Record (EHR) Datasets **MIMIC-III Dataset** (`MIMIC3Dataset`) - Intensive care unit data from Beth Israel Deaconess Medical Center - 40,000+ critical care patients - Diagnoses, procedures, medications, lab results - Usage: `from pyhealth.datasets import MIMIC3Dataset` **MIMIC-IV Dataset** (`MIMIC4Dataset`) - Updated version with 70,000+ patients - Improved data quality and coverage - Enhanced demographic and clinical detail - Usage: `from pyhealth.datasets import MIMIC4Dataset` **eICU Dataset** (`eICUDataset`) - Multi-center critical care database - 200,000+ admissions from 200+ hospitals - Standardized ICU data across facilities - Usage: `from pyhealth.datasets import eICUDataset` **OMOP Dataset** (`OMOPDataset`) - Observational Medical Outcomes Partnership format - Standardized common data model - Interoperability across healthcare systems - Usage: `from pyhealth.datasets import OMOPDataset` **EHRShot Dataset** (`EHRShotDataset`) - Benchmark dataset for few-shot learning - Specialized for testing model generalization - Usage: `from pyhealth.datasets import EHRShotDataset` ### Physiological Signal Datasets **Sleep EEG Datasets:** - `SleepEDFDataset`: Sleep-EDF database for sleep staging - `SHHSDataset`: Sleep Heart Health Study data - `ISRUCDataset`: ISRUC-Sleep database **Temple University EEG Datasets:** - `TUEVDataset`: Abnormal EEG events detection - `TUABDataset`: Abnormal/normal EEG classification - `TUSZDataset`: Seizure detection **All signal datasets support:** - Multi-channel EEG signals - Standardized sampling rates - Expert annotations - Sleep stage or abnormality labels ### Medical Imaging Datasets **COVID-19 CXR Dataset** (`COVID19CXRDataset`) - Chest X-ray images for COVID-19 classification - Multi-class labels (COVID-19, pneumonia, normal) - Usage: `from pyhealth.datasets import COVID19CXRDataset` ### Text-Based Datasets **Medical Transcriptions Dataset** (`MedicalTranscriptionsDataset`) - Clinical notes and transcriptions - Medical specialty classification - Text-based prediction tasks - Usage: `from pyhealth.datasets import MedicalTranscriptionsDataset` **Cardiology Dataset** (`CardiologyDataset`) - Cardiac patient records - Cardiovascular disease prediction - Usage: `from pyhealth.datasets import CardiologyDataset` ### Preprocessed Datasets **MIMIC Extract Dataset** (`MIMICExtractDataset`) - Pre-extracted MIMIC features - Ready-to-use benchmarking data - Reduced preprocessing requirements - Usage: `from pyhealth.datasets import MIMICExtractDataset` ## SampleDataset Class Converts raw datasets into task-specific formatted samples. **Purpose:** Transform patient-level data into model-ready input/output pairs **Key Attributes:** - `input_schema`: Defines input data structure - `output_schema`: Defines target labels/predictions - `samples`: List of processed samples **Usage Pattern:** ```python # After setting task on BaseDataset sample_dataset = dataset.set_task(task_fn) ``` ## Data Splitting Functions **Patient-Level Split** (`split_by_patient`) - Ensures no patient appears in multiple splits - Prevents data leakage - Recommended for clinical prediction tasks **Visit-Level Split** (`split_by_visit`) - Splits by individual visits - Allows same patient across splits (use cautiously) **Sample-Level Split** (`split_by_sample`) - Random sample splitting - Most flexible but may cause leakage **Parameters:** - `dataset`: SampleDataset to split - `ratios`: Tuple of split ratios (e.g., [0.7, 0.1, 0.2]) - `seed`: Random seed for reproducibility ## Common Workflow ```python from pyhealth.datasets import MIMIC4Dataset from pyhealth.tasks import mortality_prediction_mimic4_fn from pyhealth.datasets import split_by_patient # 1. Load dataset dataset = MIMIC4Dataset(root="/path/to/data") # 2. Set prediction task sample_dataset = dataset.set_task(mortality_prediction_mimic4_fn) # 3. Split data train, val, test = split_by_patient(sample_dataset, [0.7, 0.1, 0.2]) # 4. Get statistics print(dataset.stats()) ``` ## Performance Notes - PyHealth is **3x faster than pandas** for healthcare data processing - Optimized for large-scale EHR datasets - Memory-efficient patient iteration - Vectorized operations for feature extraction