Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/pyhealth/references/datasets.md
+++ b/skills/pyhealth/references/datasets.md
@@ -0,0 +1,178 @@
+# PyHealth Datasets and Data Structures
+
+## Core Data Structures
+
+### Event
+Individual medical occurrences with attributes including:
+- **code**: Medical code (diagnosis, medication, procedure, lab test)
+- **vocabulary**: Coding system (ICD-9-CM, NDC, LOINC, etc.)
+- **timestamp**: Event occurrence time
+- **value**: Numeric value (for labs, vital signs)
+- **unit**: Measurement unit
+
+### Patient
+Collection of events organized chronologically across visits. Each patient contains:
+- **patient_id**: Unique identifier
+- **birth_datetime**: Date of birth
+- **gender**: Patient gender
+- **ethnicity**: Patient ethnicity
+- **visits**: List of visit objects
+
+### Visit
+Healthcare encounter containing:
+- **visit_id**: Unique identifier
+- **encounter_time**: Visit timestamp
+- **discharge_time**: Discharge timestamp
+- **visit_type**: Type of encounter (inpatient, outpatient, emergency)
+- **events**: List of events during this visit
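+
+The three structures above nest as Patient → Visit → Event. The following is a minimal sketch of that nesting using plain Python dataclasses. It is illustrative only and is not PyHealth's own class definitions; the field names simply mirror the attribute lists above, and the defaults and example values are assumptions.
+
+```python
+from dataclasses import dataclass, field
+from datetime import datetime
+from typing import List, Optional
+
+
+@dataclass
+class Event:
+    code: str                      # e.g. a diagnosis or lab code
+    vocabulary: str                # coding system the code belongs to
+    timestamp: datetime
+    value: Optional[float] = None  # only set for labs / vital signs
+    unit: Optional[str] = None
+
+
+@dataclass
+class Visit:
+    visit_id: str
+    encounter_time: datetime
+    discharge_time: Optional[datetime] = None
+    visit_type: str = "inpatient"  # assumed default for this sketch
+    events: List[Event] = field(default_factory=list)
+
+
+@dataclass
+class Patient:
+    patient_id: str
+    birth_datetime: Optional[datetime] = None
+    gender: Optional[str] = None
+    ethnicity: Optional[str] = None
+    visits: List[Visit] = field(default_factory=list)
+
+
+# One patient with a single visit holding a diagnosis and a lab result.
+visit = Visit(
+    visit_id="v1",
+    encounter_time=datetime(2020, 1, 1, 8, 0),
+    events=[
+        Event("428.0", "ICD9CM", datetime(2020, 1, 1, 9, 0)),
+        Event("2160-0", "LOINC", datetime(2020, 1, 1, 10, 0), value=1.2, unit="mg/dL"),
+    ],
+)
+patient = Patient(patient_id="p1", gender="F", visits=[visit])
+print(len(patient.visits[0].events))
+```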
+
+## BaseDataset Class
+
+**Key Methods:**
+- `get_patient(patient_id)`: Retrieve single patient record
+- `iter_patients()`: Iterate through all patients
+- `stats()`: Get dataset statistics (patients, visits, events)
+- `set_task(task_fn)`: Define prediction task
+
+## Available Datasets
+
+### Electronic Health Record (EHR) Datasets
+
+**MIMIC-III Dataset** (`MIMIC3Dataset`)
+- Intensive care unit data from Beth Israel Deaconess Medical Center
+- 40,000+ critical care patients
+- Diagnoses, procedures, medications, lab results
+- Usage: `from pyhealth.datasets import MIMIC3Dataset`
+
+**MIMIC-IV Dataset** (`MIMIC4Dataset`)
+- Updated version with 70,000+ patients
+- Improved data quality and coverage
+- Enhanced demographic and clinical detail
+- Usage: `from pyhealth.datasets import MIMIC4Dataset`
+
+**eICU Dataset** (`eICUDataset`)
+- Multi-center critical care database
+- 200,000+ admissions from 200+ hospitals
+- Standardized ICU data across facilities
+- Usage: `from pyhealth.datasets import eICUDataset`
+
+**OMOP Dataset** (`OMOPDataset`)
+- Observational Medical Outcomes Partnership format
+- Standardized common data model
+- Interoperability across healthcare systems
+- Usage: `from pyhealth.datasets import OMOPDataset`
+
+**EHRShot Dataset** (`EHRShotDataset`)
+- Benchmark dataset for few-shot learning
+- Specialized for testing model generalization
+- Usage: `from pyhealth.datasets import EHRShotDataset`
+
+### Physiological Signal Datasets
+
+**Sleep EEG Datasets:**
+- `SleepEDFDataset`: Sleep-EDF database for sleep staging
+- `SHHSDataset`: Sleep Heart Health Study data
+- `ISRUCDataset`: ISRUC-Sleep database
+
+**Temple University EEG Datasets:**
+- `TUEVDataset`: EEG event classification (TUH EEG Events corpus)
+- `TUABDataset`: Abnormal/normal EEG classification
+- `TUSZDataset`: Seizure detection
+
+**All signal datasets support:**
+- Multi-channel EEG signals
+- Standardized sampling rates
+- Expert annotations
+- Sleep stage or abnormality labels
+
+### Medical Imaging Datasets
+
+**COVID-19 CXR Dataset** (`COVID19CXRDataset`)
+- Chest X-ray images for COVID-19 classification
+- Multi-class labels (COVID-19, pneumonia, normal)
+- Usage: `from pyhealth.datasets import COVID19CXRDataset`
+
+### Text-Based Datasets
+
+**Medical Transcriptions Dataset** (`MedicalTranscriptionsDataset`)
+- Clinical notes and transcriptions
+- Medical specialty classification
+- Text-based prediction tasks
+- Usage: `from pyhealth.datasets import MedicalTranscriptionsDataset`
+
+**Cardiology Dataset** (`CardiologyDataset`)
+- 12-lead ECG recordings from the PhysioNet/CinC Challenge 2020 (signal data rather than text)
+- Cardiac abnormality classification from ECG
+- Usage: `from pyhealth.datasets import CardiologyDataset`
+
+### Preprocessed Datasets
+
+**MIMIC Extract Dataset** (`MIMICExtractDataset`)
+- Pre-extracted MIMIC features
+- Ready-to-use benchmarking data
+- Reduced preprocessing requirements
+- Usage: `from pyhealth.datasets import MIMICExtractDataset`
+
+## SampleDataset Class
+
+Converts raw datasets into task-specific formatted samples.
+
+**Purpose:** Transform patient-level data into model-ready input/output pairs
+
+**Key Attributes:**
+- `input_schema`: Defines input data structure
+- `output_schema`: Defines target labels/predictions
+- `samples`: List of processed samples
+
+**Usage Pattern:**
+```python
+# Calling set_task() on a BaseDataset returns a SampleDataset
+sample_dataset = dataset.set_task(task_fn)
+```
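+
+To make the patient-to-sample transformation concrete, here is a minimal sketch of what a task function might do. The name `toy_mortality_task_fn`, the dict-based patient layout, and the `died` flag are all hypothetical and chosen for illustration; real PyHealth task functions operate on PyHealth's own patient objects and may differ in detail.
+
+```python
+from typing import Dict, List
+
+
+def toy_mortality_task_fn(patient: Dict) -> List[Dict]:
+    """Hypothetical task: predict the outcome of the next visit from the current one."""
+    samples = []
+    visits = patient["visits"]
+    # Pair each visit with the visit that follows it.
+    for current, following in zip(visits, visits[1:]):
+        if not current["codes"]:
+            continue  # skip visits that have no input features
+        samples.append({
+            "patient_id": patient["patient_id"],
+            "visit_id": current["visit_id"],
+            "conditions": current["codes"],   # model input
+            "label": int(following["died"]),  # model target
+        })
+    return samples
+
+
+toy_patient = {
+    "patient_id": "p1",
+    "visits": [
+        {"visit_id": "v1", "codes": ["428.0"], "died": False},
+        {"visit_id": "v2", "codes": ["401.9"], "died": False},
+        {"visit_id": "v3", "codes": ["584.9"], "died": True},
+    ],
+}
+samples = toy_mortality_task_fn(toy_patient)
+print([(s["visit_id"], s["label"]) for s in samples])
+```
+
+Each sample is a flat dict carrying its inputs and its label, which is the shape a model-ready dataset needs. The last visit yields no sample because it has no successor to supply a label.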
+
+## Data Splitting Functions
+
+**Patient-Level Split** (`split_by_patient`)
+- Ensures no patient appears in multiple splits
+- Prevents data leakage
+- Recommended for clinical prediction tasks
+
+**Visit-Level Split** (`split_by_visit`)
+- Splits by individual visits
+- Allows same patient across splits (use cautiously)
+
+**Sample-Level Split** (`split_by_sample`)
+- Random sample splitting
+- Most flexible but may cause leakage
+
+**Parameters:**
+- `dataset`: SampleDataset to split
+- `ratios`: List or tuple of split ratios that sum to 1 (e.g., [0.7, 0.1, 0.2])
+- `seed`: Random seed for reproducibility
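+
+The idea behind a patient-level split can be sketched in plain Python: shuffle the patient IDs, cut the ID list according to the ratios, then assign every sample to the split that owns its patient. `toy_split_by_patient` is a hypothetical helper written for illustration and is not PyHealth's implementation; it assumes each sample is a dict with a `patient_id` key.
+
+```python
+import random
+from typing import Dict, List, Optional, Sequence
+
+
+def toy_split_by_patient(
+    samples: List[Dict], ratios: Sequence[float], seed: Optional[int] = None
+) -> List[List[Dict]]:
+    """Split samples into train/val/test so that no patient spans two splits."""
+    patient_ids = sorted({s["patient_id"] for s in samples})
+    random.Random(seed).shuffle(patient_ids)
+
+    n = len(patient_ids)
+    # round() avoids losing a patient to float error (0.7 + 0.1 is not exactly 0.8).
+    train_end = round(n * ratios[0])
+    val_end = round(n * (ratios[0] + ratios[1]))
+
+    groups = [
+        set(patient_ids[:train_end]),
+        set(patient_ids[train_end:val_end]),
+        set(patient_ids[val_end:]),
+    ]
+    return [[s for s in samples if s["patient_id"] in g] for g in groups]
+
+
+# Ten patients with two samples each.
+all_samples = [
+    {"patient_id": f"p{i}", "visit_id": f"p{i}-v{j}"}
+    for i in range(10)
+    for j in range(2)
+]
+train, val, test = toy_split_by_patient(all_samples, [0.7, 0.1, 0.2], seed=0)
+print(len(train), len(val), len(test))
+```
+
+Because the cut is made on patient IDs rather than on samples, all samples of a patient land in the same split, which is what prevents leakage.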
+
+## Common Workflow
+
+```python
+from pyhealth.datasets import MIMIC4Dataset, split_by_patient
+from pyhealth.tasks import mortality_prediction_mimic4_fn
+
+# 1. Load dataset (PyHealth 1.x releases also require a `tables` list)
+dataset = MIMIC4Dataset(
+    root="/path/to/data",
+    tables=["diagnoses_icd", "procedures_icd", "prescriptions"],
+)
+
+# 2. Set prediction task; this returns a SampleDataset
+sample_dataset = dataset.set_task(mortality_prediction_mimic4_fn)
+
+# 3. Split by patient so no patient appears in more than one split
+train, val, test = split_by_patient(sample_dataset, [0.7, 0.1, 0.2])
+
+# 4. Get statistics (this method is named `stat()` in PyHealth 1.x)
+print(dataset.stats())
+```
+
+## Performance Notes
+
+- PyHealth is reported to be about **3x faster than pandas** for healthcare data processing; the actual speedup depends on the dataset and hardware
+- Optimized for large-scale EHR datasets
+- Memory-efficient patient iteration
+- Vectorized operations for feature extraction