TDC Utilities and Data Functions

This document describes TDC's data processing, evaluation, and utility functions.

Overview

TDC provides utilities organized into four main categories:

  1. Dataset Splits - Train/validation/test partitioning strategies
  2. Model Evaluation - Standardized performance metrics
  3. Data Processing - Molecule conversion, filtering, label transformation, and entity retrieval
  4. Advanced Utilities - Dataset name lookup, output formats, and fold creation

1. Dataset Splits

Dataset splitting is crucial for evaluating model generalization. TDC provides multiple splitting strategies designed for therapeutic ML.

Basic Split Usage

from tdc.single_pred import ADME

data = ADME(name='Caco2_Wang')

# Get split with default parameters
split = data.get_split()
# Returns: {'train': DataFrame, 'valid': DataFrame, 'test': DataFrame}

# Customize split parameters
split = data.get_split(
    method='scaffold',
    seed=42,
    frac=[0.7, 0.1, 0.2]
)

Split Methods

Random Split

Random shuffling of data - suitable for general ML tasks.

split = data.get_split(method='random', seed=1)

When to use:

  • Baseline model evaluation
  • When chemical/temporal structure is not important
  • Quick prototyping

Not recommended for:

  • Realistic drug discovery scenarios
  • Evaluating generalization to new chemical matter

Scaffold Split

Splits molecules by their Bemis-Murcko scaffolds, so that every test molecule has a scaffold not seen in training.

split = data.get_split(method='scaffold', seed=1)

When to use:

  • Default for most single prediction tasks
  • Evaluating generalization to new chemical series
  • Realistic drug discovery scenarios

How it works:

  1. Extract Bemis-Murcko scaffold from each molecule
  2. Group molecules by scaffold
  3. Assign scaffolds to train/valid/test sets
  4. Ensures test molecules have unseen scaffolds
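
The grouping and assignment steps can be sketched in plain Python. This is an illustration, not TDC's implementation: the scaffold labels are written by hand (in practice a cheminformatics toolkit would compute them), and the helper name `scaffold_split_sketch` and the largest-group-first ordering are assumptions.

```python
from collections import defaultdict


def scaffold_split_sketch(scaffold_of, frac=(0.7, 0.1, 0.2)):
    """Assign whole scaffold groups to train/valid/test (hypothetical helper)."""
    # Step 2: group molecules by scaffold.
    groups = defaultdict(list)
    for molecule, scaffold in scaffold_of.items():
        groups[scaffold].append(molecule)

    n = len(scaffold_of)
    train_cap = int(round(frac[0] * n))
    valid_cap = int(round(frac[1] * n))
    split = {"train": [], "valid": [], "test": []}

    # Steps 3-4: place each group in exactly one set, largest groups first.
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(split["train"]) + len(members) <= train_cap:
            split["train"].extend(members)
        elif len(split["valid"]) + len(members) <= valid_cap:
            split["valid"].extend(members)
        else:
            split["test"].extend(members)
    return split


# Hand-written scaffold labels (step 1 is assumed to be done already).
scaffold_of = {
    "m1": "A", "m2": "A", "m3": "A", "m4": "A",
    "m5": "B", "m6": "B", "m7": "B",
    "m8": "C", "m9": "D", "m10": "E",
}
split = scaffold_split_sketch(scaffold_of)
```

Because a scaffold group is never divided, the realized fractions only approximate `frac` when groups are large.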

Cold Splits (DTI/DDI Tasks)

For multi-instance prediction, cold splits ensure the test set contains unseen drugs, targets, or both.

Cold Drug Split:

from tdc.multi_pred import DTI

data = DTI(name='BindingDB_Kd')
split = data.get_split(method='cold_drug', seed=1)

  • Test set contains drugs not seen during training
  • Evaluates generalization to new compounds

Cold Target Split:

split = data.get_split(method='cold_target', seed=1)

  • Test set contains targets not seen during training
  • Evaluates generalization to new proteins

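Cold Drug-Target Split:

# Split on both entity columns at once
split = data.get_split(method='cold_split', column_name=['Drug', 'Target'], seed=1)

  • Test set contains pairs in which neither the drug nor the target was seen during training
  • Most challenging evaluation scenario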
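
The idea behind a cold split can be sketched without TDC: partition the unique entity IDs first, then assign each row according to its entity. The helper name `cold_split_sketch` and the toy rows are assumptions made for illustration.

```python
import random


def cold_split_sketch(rows, key, test_frac=0.2, seed=1):
    """Hold out a fraction of unique entities, not of rows (hypothetical helper)."""
    entities = sorted({row[key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(entities)

    n_test = max(1, int(len(entities) * test_frac))
    held_out = set(entities[:n_test])

    train = [row for row in rows if row[key] not in held_out]
    test = [row for row in rows if row[key] in held_out]
    return train, test


# Toy drug-target pairs, invented for this example.
rows = [
    {"Drug_ID": "D1", "Target_ID": "T1"},
    {"Drug_ID": "D1", "Target_ID": "T2"},
    {"Drug_ID": "D2", "Target_ID": "T1"},
    {"Drug_ID": "D3", "Target_ID": "T3"},
    {"Drug_ID": "D4", "Target_ID": "T2"},
    {"Drug_ID": "D5", "Target_ID": "T1"},
    {"Drug_ID": "D5", "Target_ID": "T3"},
]
train, test = cold_split_sketch(rows, key="Drug_ID")
```

Passing `key="Target_ID"` gives the cold-target variant of the same sketch.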

Temporal Split

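For datasets with temporal information - ensures test data comes from later time points than the training data.

# time_column names the dataset's timestamp column; 'Year' is an example and varies by dataset
split = data.get_split(method='time', time_column='Year')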

When to use:

  • Datasets with time stamps
  • Simulating prospective prediction
  • Clinical trial outcome prediction
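
A temporal split amounts to sorting by time and cutting the ordered records. The sketch below assumes a hypothetical `Year` field and a helper named `temporal_split_sketch`; neither comes from TDC.

```python
def temporal_split_sketch(rows, time_key, frac=(0.7, 0.1, 0.2)):
    """Oldest records go to train, newest to test (hypothetical helper)."""
    ordered = sorted(rows, key=lambda row: row[time_key])
    n = len(ordered)
    n_train = int(round(frac[0] * n))
    n_valid = int(round(frac[1] * n))
    return {
        "train": ordered[:n_train],
        "valid": ordered[n_train:n_train + n_valid],
        "test": ordered[n_train + n_valid:],
    }


# Invented records with a year stamp.
years = [2019, 2012, 2021, 2015, 2010, 2018, 2020, 2013, 2016, 2022]
rows = [{"Drug": f"mol{i}", "Year": year} for i, year in enumerate(years)]
split = temporal_split_sketch(rows, time_key="Year")
```

No seed is involved: the ordering alone determines the split.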

Custom Split Fractions

# 80% train, 10% valid, 10% test
split = data.get_split(method='scaffold', frac=[0.8, 0.1, 0.1])

# 70% train, 15% valid, 15% test
split = data.get_split(method='scaffold', frac=[0.7, 0.15, 0.15])

Stratified Splits

For classification tasks with imbalanced labels:

split = data.get_split(method='scaffold', stratified=True)

Maintains label distribution across train/valid/test sets.

2. Model Evaluation

TDC provides standardized evaluation metrics for different task types.

Basic Evaluator Usage

from tdc import Evaluator

# Initialize evaluator
evaluator = Evaluator(name='ROC-AUC')

# Evaluate predictions
score = evaluator(y_true, y_pred)

Classification Metrics

ROC-AUC

Receiver Operating Characteristic - Area Under Curve

evaluator = Evaluator(name='ROC-AUC')
score = evaluator(y_true, y_pred_proba)

Best for:

  • Binary classification
  • Imbalanced datasets
  • Overall discriminative ability

Range: 0-1 (higher is better, 0.5 is random)
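
ROC-AUC can be read as the probability that a randomly chosen positive is scored above a randomly chosen negative. The function below is a from-scratch sketch of that pairwise definition, not the routine TDC calls; the name `roc_auc_sketch` is an assumption.

```python
def roc_auc_sketch(y_true, y_score):
    """Fraction of positive/negative pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_sketch(y_true, y_score)  # 3 of 4 pairs are ordered correctly
```

The pairwise loop is quadratic in the number of samples, so it is suitable for explanation rather than for large datasets.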

PR-AUC

Precision-Recall Area Under Curve

evaluator = Evaluator(name='PR-AUC')
score = evaluator(y_true, y_pred_proba)

Best for:

  • Highly imbalanced datasets
  • When positive class is rare
  • Complements ROC-AUC

Range: 0-1 (higher is better)

F1 Score

Harmonic mean of precision and recall

evaluator = Evaluator(name='F1')
score = evaluator(y_true, y_pred_binary)

Best for:

  • Balance between precision and recall
  • Multi-class classification

Range: 0-1 (higher is better)

Accuracy

Fraction of correct predictions

evaluator = Evaluator(name='Accuracy')
score = evaluator(y_true, y_pred_binary)

Best for:

  • Balanced datasets
  • Simple baseline metric

Not recommended for: Imbalanced datasets

Cohen's Kappa

Agreement between predictions and ground truth, accounting for chance

evaluator = Evaluator(name='Kappa')
score = evaluator(y_true, y_pred_binary)

Range: -1 to 1 (higher is better, 0 is random)
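
Kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e the agreement expected by chance from the label frequencies. A from-scratch sketch, with `cohen_kappa_sketch` as an assumed name:

```python
from collections import Counter


def cohen_kappa_sketch(y_true, y_pred):
    """Cohen's kappa from observed and chance agreement."""
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n

    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(y_true) | set(y_pred)
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)

    return (p_o - p_e) / (1 - p_e)


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
kappa = cohen_kappa_sketch(y_true, y_pred)

# A constant predictor reaches 60% accuracy here but zero kappa.
kappa_constant = cohen_kappa_sketch(y_true, [0] * len(y_true))
```

The constant-predictor case shows why kappa is more informative than accuracy when one class dominates.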

Regression Metrics

RMSE - Root Mean Squared Error

evaluator = Evaluator(name='RMSE')
score = evaluator(y_true, y_pred)

Best for:

  • Continuous predictions
  • Penalizes large errors heavily

Range: 0-∞ (lower is better)

MAE - Mean Absolute Error

evaluator = Evaluator(name='MAE')
score = evaluator(y_true, y_pred)

Best for:

  • Continuous predictions
  • More robust to outliers than RMSE

Range: 0-∞ (lower is better)
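
The difference in outlier sensitivity between RMSE and MAE shows up in a small example. The two prediction sets below are invented so that they share the same MAE; the function names are assumptions, not TDC's.

```python
import math


def rmse_sketch(y_true, y_pred):
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae_sketch(y_true, y_pred):
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [1.0, 2.0, 3.0, 4.0]
spread_errors = [1.5, 2.5, 3.5, 4.5]  # every prediction is off by 0.5
one_outlier = [1.0, 2.0, 3.0, 6.0]    # a single prediction is off by 2.0

mae_spread = mae_sketch(y_true, spread_errors)
mae_outlier = mae_sketch(y_true, one_outlier)
rmse_spread = rmse_sketch(y_true, spread_errors)
rmse_outlier = rmse_sketch(y_true, one_outlier)
```

Both sets have an MAE of 0.5, while the RMSE doubles from 0.5 to 1.0 when the error is concentrated in one prediction.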

R² - Coefficient of Determination

evaluator = Evaluator(name='R2')
score = evaluator(y_true, y_pred)

Best for:

  • Variance explained by model
  • Comparing different models

Range: -∞ to 1 (higher is better, 1 is perfect)
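
The lower end of the range is easiest to see numerically. This sketch uses the usual definition R² = 1 - SS_res / SS_tot; the function name is an assumption.

```python
def r2_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
r2_perfect = r2_sketch(y_true, [1.0, 2.0, 3.0, 4.0])   # exact predictions
r2_mean = r2_sketch(y_true, [2.5, 2.5, 2.5, 2.5])      # always predict the mean
r2_reversed = r2_sketch(y_true, [4.0, 3.0, 2.0, 1.0])  # worse than the mean
```

A model that always predicts the mean scores 0, and anything worse than that baseline goes negative.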

MSE - Mean Squared Error

evaluator = Evaluator(name='MSE')
score = evaluator(y_true, y_pred)

Range: 0-∞ (lower is better)

Ranking Metrics

Spearman Correlation

Rank correlation coefficient

evaluator = Evaluator(name='Spearman')
score = evaluator(y_true, y_pred)

Best for:

  • Ranking tasks
  • Non-linear relationships
  • Ordinal data

Range: -1 to 1 (higher is better)

Pearson Correlation

Linear correlation coefficient

evaluator = Evaluator(name='PCC')  # TDC's name for the Pearson correlation coefficient
score = evaluator(y_true, y_pred)

Best for:

  • Linear relationships
  • Continuous data

Range: -1 to 1 (higher is better)
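
The two coefficients diverge when predictions are monotonic but not linear in the labels. In this from-scratch sketch, Spearman is computed as Pearson on ranks; the helper names are assumptions, and `ranks` ignores ties for brevity.

```python
import math


def pearson_sketch(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def ranks(values):
    """1-based ranks; assumes no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman_sketch(x, y):
    return pearson_sketch(ranks(x), ranks(y))


y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [math.exp(v) for v in y_true]  # correct ordering, non-linear scale

rho = spearman_sketch(y_true, y_pred)
r = pearson_sketch(y_true, y_pred)
```

Here the ordering is perfect, so Spearman is 1 (up to rounding), while Pearson falls below 1 because the relationship is exponential.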

Multi-Label Classification

evaluator = Evaluator(name='Micro-F1')
score = evaluator(y_true_multilabel, y_pred_multilabel)

Available: Micro-F1, Macro-F1, Micro-AUPR, Macro-AUPR

Benchmark Group Evaluation

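For benchmark groups, evaluation is run over multiple seeds and the scores are aggregated:

from tdc.benchmark_group import admet_group

group = admet_group(path='data/')
predictions_list = []

for seed in [1, 2, 3, 4, 5]:
    benchmark = group.get('Caco2_Wang')
    name = benchmark['name']
    train_val, test = benchmark['train_val'], benchmark['test']
    train, valid = group.get_train_valid_split(benchmark=name, split_type='default', seed=seed)

    # Train model on train/valid and predict on test (user implements)
    predictions_list.append({name: model_predictions})

# Each list entry maps benchmark name -> test predictions for one seed
results = group.evaluate_many(predictions_list)
print(results)  # {benchmark_name: [mean_score, std_score]}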
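
The reported pair is a mean and a standard deviation over the per-seed scores. The aggregation can be sketched as follows; the scores are made up, and the use of the population standard deviation and rounding to three decimals are assumptions, not confirmed library behaviour.

```python
import statistics

# Made-up test scores for one benchmark, one per seed.
per_seed_scores = {1: 0.41, 2: 0.39, 3: 0.43, 4: 0.40, 5: 0.42}


def aggregate(scores):
    """Return [mean, std] over seeds (population std is an assumption)."""
    values = list(scores.values())
    return [
        round(statistics.mean(values), 3),
        round(statistics.pstdev(values), 3),
    ]


results = {"Caco2_Wang": aggregate(per_seed_scores)}
print(results)
```

Reporting the spread alongside the mean makes it possible to tell a real improvement from seed-to-seed noise.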

3. Data Processing

TDC provides 11 data processing utilities.

Molecule Format Conversion

Convert between ~15 molecular representations.

from tdc.chem_utils import MolConvert

# SMILES to PyTorch Geometric
converter = MolConvert(src='SMILES', dst='PyG')
pyg_graph = converter('CC(C)Cc1ccc(cc1)C(C)C(O)=O')

# SMILES to DGL
converter = MolConvert(src='SMILES', dst='DGL')
dgl_graph = converter('CC(C)Cc1ccc(cc1)C(C)C(O)=O')

# SMILES to Morgan Fingerprint (ECFP)
converter = MolConvert(src='SMILES', dst='ECFP')
fingerprint = converter('CC(C)Cc1ccc(cc1)C(C)C(O)=O')

Available formats:

  • Text: SMILES, SELFIES, InChI
  • Fingerprints: ECFP (Morgan), MACCS, RDKit, AtomPair, TopologicalTorsion
  • Graphs: PyG (PyTorch Geometric), DGL (Deep Graph Library)
  • 3D: Graph3D, Coulomb Matrix, Distance Matrix

Batch conversion:

converter = MolConvert(src='SMILES', dst='PyG')
graphs = converter(['SMILES1', 'SMILES2', 'SMILES3'])

Molecule Filters

Remove non-drug-like molecules using curated chemical rules.

from tdc.chem_utils import MolFilter

# Initialize filter with structural-alert rule sets and property ranges
mol_filter = MolFilter(
    filters=['PAINS', 'BMS'],  # Chemical filter rules
    MW=[150, 500],             # Molecular weight range
    LogP=[-0.4, 5.6],          # Lipophilicity range
    HBD=[0, 5],                # H-bond donors
    HBA=[0, 10]                # H-bond acceptors
)

# Filter molecules
filtered_smiles = mol_filter(smiles_list)

Available filter rules:

  • PAINS - Pan-Assay Interference Compounds
  • BMS - Bristol-Myers Squibb HTS deck filters
  • Glaxo - GlaxoSmithKline filters
  • Dundee - University of Dundee filters
  • Inpharmatica - Inpharmatica filters
  • LINT - Pfizer LINT filters

Label Distribution Visualization

# Visualize label distribution
data.label_distribution()

# Print statistics
data.print_stats()

Displays histogram and computes mean, median, std for continuous labels.

Label Binarization

Convert continuous labels to binary labels using a threshold.

from tdc.utils import binarize

# Binarize with threshold
binary_labels = binarize(y_continuous, threshold=5.0, order='ascending')
# order='ascending': values >= threshold become 1
# order='descending': values <= threshold become 1

Label Units Conversion

Transform between measurement units.

from tdc.chem_utils import label_transform

# Convert nM to pKd
y_pkd = label_transform(y_nM, from_unit='nM', to_unit='p')

# Convert μM to nM
y_nM = label_transform(y_uM, from_unit='uM', to_unit='nM')

Available conversions:

  • Binding affinity: nM, μM, pKd, pKi, pIC50
  • Log transformations
  • Natural log conversions
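
The arithmetic behind these conversions is short. A "p" value is the negative base-10 logarithm of the molar concentration, so for a value in nM it equals 9 - log10(value). The sketch below works on plain floats; the function names are assumptions and are not TDC functions.

```python
import math


def nm_to_p(value_nm):
    """nM -> p scale (pKd, pKi, pIC50): -log10(value in mol/L)."""
    return 9.0 - math.log10(value_nm)


def p_to_nm(p_value):
    """p scale -> nM, the inverse of nm_to_p."""
    return 10 ** (9.0 - p_value)


def um_to_nm(value_um):
    """Micromolar -> nanomolar."""
    return value_um * 1000.0


p_strong = nm_to_p(1.0)       # 1 nM binder
p_weak = nm_to_p(1000.0)      # 1 uM binder
half_um_in_nm = um_to_nm(0.5)
```

On the p scale a larger number means tighter binding, which reverses the direction of "better" relative to the raw nM values.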

Label Meaning

Get interpretable descriptions for labels.

from tdc.utils import get_label_map

# Get label mapping for a dataset; task is the task name, e.g. 'DDI'
label_map = get_label_map(name='DrugBank', task='DDI')
print(label_map)
# Maps each integer label to a text description of the interaction type

Data Balancing

Handle class imbalance via over/under-sampling.

# data is a loaded TDC dataset with binary labels

# Oversample minority class
balanced_df = data.balanced(oversample=True, seed=42)

# Undersample majority class
balanced_df = data.balanced(oversample=False, seed=42)

Graph Transformation for Pair Data

Convert paired data to graph representations.

from tdc.multi_pred import DTI

data = DTI(name='DAVIS')

# Binarize the pair labels at the threshold, then build a graph from the pairs
graph = data.to_graph(
    threshold=30,
    format='edge_list',  # or 'dgl', 'pyg'
    split=True,
    frac=[0.7, 0.1, 0.2],
    seed=42,
    order='descending'
)

Negative Sampling

Generate negative samples for binary tasks.

from tdc.multi_pred import PPI

data = PPI(name='HuRI')

# Add sampled negative pairs; frac is the negative:positive ratio
data = data.neg_sample(frac=1)

Use cases:

  • Drug-target interaction prediction
  • Drug-drug interaction tasks
  • Creating balanced datasets
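
The usual approach is to draw pairs that do not appear among the known positives. The sketch below is not TDC's implementation; the helper name and toy identifiers are assumptions. It also treats every unobserved pair as a negative, which may not hold for real interaction data.

```python
import random


def negative_sample_sketch(positive_pairs, left_ids, right_ids, ratio=1.0, seed=0):
    """Sample unobserved (left, right) pairs as presumed negatives."""
    known = set(positive_pairs)
    candidates = [
        (left, right)
        for left in left_ids
        for right in right_ids
        if (left, right) not in known
    ]
    n_negative = min(len(candidates), int(round(ratio * len(positive_pairs))))
    return random.Random(seed).sample(candidates, n_negative)


# Toy drug-target data, invented for this example.
positives = [("D1", "T1"), ("D2", "T2"), ("D3", "T1")]
negatives = negative_sample_sketch(
    positives,
    left_ids=["D1", "D2", "D3"],
    right_ids=["T1", "T2", "T3"],
)
```

Enumerating all candidate pairs is fine for small sets; for large ones, rejection sampling avoids building the full list.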

Entity Retrieval

Convert between database identifiers.

PubChem CID to SMILES

from tdc.utils import cid2smiles

smiles = cid2smiles(2244)  # Aspirin
# Returns the SMILES string recorded for that CID

UniProt ID to Amino Acid Sequence

from tdc.utils import uniprot2seq

sequence = uniprot2seq('P12345')
# Returns the amino acid sequence as a one-letter-code string

Batch Retrieval

# Multiple CIDs
smiles_list = [cid2smiles(cid) for cid in [2244, 5090, 6323]]

# Multiple UniProt IDs
sequences = [uniprot2seq(uid) for uid in ['P12345', 'Q9Y5S9']]

4. Advanced Utilities

Retrieve Dataset Names

from tdc.utils import retrieve_dataset_names

# Get all datasets for a task
adme_datasets = retrieve_dataset_names('ADME')
dti_datasets = retrieve_dataset_names('DTI')
tox_datasets = retrieve_dataset_names('Tox')

print(f"ADME datasets: {adme_datasets}")

TDC supports fuzzy matching for dataset names:

from tdc.single_pred import ADME

# These all work (typo-tolerant)
data = ADME(name='Caco2_Wang')
data = ADME(name='caco2_wang')
data = ADME(name='Caco2')  # Partial match

Data Format Options

# Pandas DataFrame (default)
df = data.get_data(format='df')

# Dictionary
data_dict = data.get_data(format='dict')

# DeepPurpose format (for DeepPurpose library)
dp_format = data.get_data(format='DeepPurpose')

# Graph objects (PyG/DGL) are not a get_data format;
# build them from the SMILES column with MolConvert instead

Data Loader Utilities

from tdc.utils import create_fold

# create_fold works on a DataFrame, not on the dataset object
df = data.get_data()

# Create one random train/valid/test fold
fold = create_fold(df, fold_seed=42, frac=[0.7, 0.1, 0.2])
train_data, valid_data, test_data = fold['train'], fold['valid'], fold['test']

# Repeat with different fold_seed values for repeated random splits
# Train and evaluate
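
Index generation for k-fold cross-validation can be sketched in plain Python. The generator below is an illustration under assumed names, not a TDC function; its indices can be passed to `DataFrame.iloc` on the table returned by `get_data()`.

```python
import random


def kfold_indices(n_samples, n_folds=5, seed=42):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into n_folds nearly equal groups.
    groups = [indices[k::n_folds] for k in range(n_folds)]

    for k in range(n_folds):
        test_idx = groups[k]
        train_idx = [i for j, group in enumerate(groups) if j != k for i in group]
        yield train_idx, test_idx


folds = list(kfold_indices(10, n_folds=5, seed=42))
```

Random folds share the limitation of random splits: molecules with the same scaffold can land in both the training and test portions.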

Common Workflows

Workflow 1: Complete Data Pipeline

from tdc.single_pred import ADME
from tdc import Evaluator
from tdc.chem_utils import MolConvert, MolFilter

# 1. Load data
data = ADME(name='Caco2_Wang')

# 2. Identify molecules that pass the filter (the filter returns the passing SMILES)
mol_filter = MolFilter(filters=['PAINS'])
passed = set(mol_filter(data.get_data()['Drug'].tolist()))

# 3. Split data, then drop molecules that failed the filter from each subset
split = data.get_split(method='scaffold', seed=42)
train, valid, test = (
    split[name][split[name]['Drug'].isin(passed)]
    for name in ('train', 'valid', 'test')
)

# 4. Convert to graph representations
converter = MolConvert(src='SMILES', dst='PyG')
train_graphs = converter(train['Drug'].tolist())

# 5. Train model (user implements)
# model.fit(train_graphs, train['Y'])

# 6. Evaluate
evaluator = Evaluator(name='MAE')
# score = evaluator(test['Y'], predictions)

Workflow 2: Multi-Task Learning Preparation

from tdc.benchmark_group import admet_group
from tdc.chem_utils import MolConvert

# Load benchmark group
group = admet_group(path='data/')

# Get multiple datasets
datasets = ['Caco2_Wang', 'HIA_Hou', 'Bioavailability_Ma']
all_data = {}

for dataset_name in datasets:
    benchmark = group.get(dataset_name)
    all_data[dataset_name] = benchmark

# Prepare for multi-task learning
converter = MolConvert(src='SMILES', dst='ECFP')
# Process each dataset...

Workflow 3: DTI Cold Split Evaluation

from tdc.multi_pred import DTI
from tdc import Evaluator

# Load DTI data
data = DTI(name='BindingDB_Kd')

# Cold drug split
split = data.get_split(method='cold_drug', seed=42)
train, test = split['train'], split['test']

# Verify no drug overlap
train_drugs = set(train['Drug_ID'])
test_drugs = set(test['Drug_ID'])
assert len(train_drugs & test_drugs) == 0, "Drug leakage detected!"

# Train and evaluate
# model.fit(train)
evaluator = Evaluator(name='RMSE')
# score = evaluator(test['Y'], predictions)

Best Practices

  1. Always use meaningful splits - Use scaffold or cold splits for realistic evaluation
  2. Multiple seeds - Run experiments with multiple seeds for robust results
  3. Appropriate metrics - Choose metrics that match your task and dataset characteristics
  4. Data filtering - Remove PAINS and non-drug-like molecules before training
  5. Format conversion - Convert molecules to appropriate format for your model
  6. Batch processing - Use batch operations for efficiency with large datasets

Performance Tips

  • Convert molecules in batch mode for faster processing
  • Cache converted representations to avoid recomputation
  • Use appropriate data formats for your framework (PyG, DGL, etc.)
  • Filter data early in the pipeline to reduce computation

References