Files
gh-k-dense-ai-claude-scient…/skills/pyhealth/references/datasets.md
2025-11-30 08:30:10 +08:00

179 lines
5.3 KiB
Markdown

# PyHealth Datasets and Data Structures
## Core Data Structures
### Event
Individual medical occurrences with attributes including:
- **code**: Medical code (diagnosis, medication, procedure, lab test)
- **vocabulary**: Coding system (ICD-9-CM, NDC, LOINC, etc.)
- **timestamp**: Event occurrence time
- **value**: Numeric value (for labs, vital signs)
- **unit**: Measurement unit
### Patient
Collection of events organized chronologically across visits. Each patient contains:
- **patient_id**: Unique identifier
- **birth_datetime**: Date of birth
- **gender**: Patient gender
- **ethnicity**: Patient ethnicity
- **visits**: List of visit objects
### Visit
Healthcare encounter containing:
- **visit_id**: Unique identifier
- **encounter_time**: Visit timestamp
- **discharge_time**: Discharge timestamp
- **visit_type**: Type of encounter (inpatient, outpatient, emergency)
- **events**: List of events during this visit
## BaseDataset Class
**Key Methods:**
- `get_patient(patient_id)`: Retrieve single patient record
- `iter_patients()`: Iterate through all patients
- `stats()`: Get dataset statistics (patients, visits, events)
- `set_task(task_fn)`: Define prediction task
## Available Datasets
### Electronic Health Record (EHR) Datasets
**MIMIC-III Dataset** (`MIMIC3Dataset`)
- Intensive care unit data from Beth Israel Deaconess Medical Center
- 40,000+ critical care patients
- Diagnoses, procedures, medications, lab results
- Usage: `from pyhealth.datasets import MIMIC3Dataset`
**MIMIC-IV Dataset** (`MIMIC4Dataset`)
- Updated version with 70,000+ patients
- Improved data quality and coverage
- Enhanced demographic and clinical detail
- Usage: `from pyhealth.datasets import MIMIC4Dataset`
**eICU Dataset** (`eICUDataset`)
- Multi-center critical care database
- 200,000+ admissions from 200+ hospitals
- Standardized ICU data across facilities
- Usage: `from pyhealth.datasets import eICUDataset`
**OMOP Dataset** (`OMOPDataset`)
- Observational Medical Outcomes Partnership format
- Standardized common data model
- Interoperability across healthcare systems
- Usage: `from pyhealth.datasets import OMOPDataset`
**EHRShot Dataset** (`EHRShotDataset`)
- Benchmark dataset for few-shot learning
- Specialized for testing model generalization
- Usage: `from pyhealth.datasets import EHRShotDataset`
### Physiological Signal Datasets
**Sleep EEG Datasets:**
- `SleepEDFDataset`: Sleep-EDF database for sleep staging
- `SHHSDataset`: Sleep Heart Health Study data
- `ISRUCDataset`: ISRUC-Sleep database
**Temple University EEG Datasets:**
- `TUEVDataset`: Abnormal EEG events detection
- `TUABDataset`: Abnormal/normal EEG classification
- `TUSZDataset`: Seizure detection
**All signal datasets support:**
- Multi-channel EEG signals
- Standardized sampling rates
- Expert annotations
- Sleep stage or abnormality labels
### Medical Imaging Datasets
**COVID-19 CXR Dataset** (`COVID19CXRDataset`)
- Chest X-ray images for COVID-19 classification
- Multi-class labels (COVID-19, pneumonia, normal)
- Usage: `from pyhealth.datasets import COVID19CXRDataset`
### Text-Based Datasets
**Medical Transcriptions Dataset** (`MedicalTranscriptionsDataset`)
- Clinical notes and transcriptions
- Medical specialty classification
- Text-based prediction tasks
- Usage: `from pyhealth.datasets import MedicalTranscriptionsDataset`
**Cardiology Dataset** (`CardiologyDataset`)
- Cardiac patient records
- Cardiovascular disease prediction
- Usage: `from pyhealth.datasets import CardiologyDataset`
### Preprocessed Datasets
**MIMIC Extract Dataset** (`MIMICExtractDataset`)
- Pre-extracted MIMIC features
- Ready-to-use benchmarking data
- Reduced preprocessing requirements
- Usage: `from pyhealth.datasets import MIMICExtractDataset`
## SampleDataset Class
Converts raw datasets into task-specific formatted samples.
**Purpose:** Transform patient-level data into model-ready input/output pairs
**Key Attributes:**
- `input_schema`: Defines input data structure
- `output_schema`: Defines target labels/predictions
- `samples`: List of processed samples
**Usage Pattern:**
```python
# After setting task on BaseDataset
sample_dataset = dataset.set_task(task_fn)
```
## Data Splitting Functions
**Patient-Level Split** (`split_by_patient`)
- Ensures no patient appears in multiple splits
- Prevents data leakage
- Recommended for clinical prediction tasks
**Visit-Level Split** (`split_by_visit`)
- Splits by individual visits
- Allows same patient across splits (use cautiously)
**Sample-Level Split** (`split_by_sample`)
- Random sample splitting
- Most flexible but may cause leakage
**Parameters:**
- `dataset`: SampleDataset to split
- `ratios`: Tuple of split ratios (e.g., [0.7, 0.1, 0.2])
- `seed`: Random seed for reproducibility
## Common Workflow
```python
from pyhealth.datasets import MIMIC4Dataset
from pyhealth.tasks import mortality_prediction_mimic4_fn
from pyhealth.datasets import split_by_patient
# 1. Load dataset
dataset = MIMIC4Dataset(root="/path/to/data")
# 2. Set prediction task
sample_dataset = dataset.set_task(mortality_prediction_mimic4_fn)
# 3. Split data
train, val, test = split_by_patient(sample_dataset, [0.7, 0.1, 0.2])
# 4. Get statistics
print(dataset.stats())
```
## Performance Notes
- PyHealth is **3x faster than pandas** for healthcare data processing
- Optimized for large-scale EHR datasets
- Memory-efficient patient iteration
- Vectorized operations for feature extraction