TDC Molecule Generation Oracles
Oracles are functions that evaluate the quality of generated molecules across specific dimensions. TDC provides 17+ oracle functions for molecular optimization tasks in de novo drug design.
Overview
Oracles measure molecular properties and serve two main purposes:
- Goal-Directed Generation: Optimize molecules to maximize/minimize specific properties
- Distribution Learning: Evaluate whether generated molecules match desired property distributions
Using Oracles
Basic Usage
from tdc import Oracle
# Initialize oracle
oracle = Oracle(name='GSK3B')
# Evaluate single molecule (SMILES string)
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
# Evaluate multiple molecules
scores = oracle(['SMILES1', 'SMILES2', 'SMILES3'])
Oracle Categories
TDC oracles are organized into several categories based on the molecular property being evaluated.
Biochemical Oracles
Predict binding affinity or activity against biological targets.
Target-Specific Oracles
DRD2 - Dopamine Receptor D2
oracle = Oracle(name='DRD2')
score = oracle(smiles)
- Predicts binding activity at the dopamine D2 receptor
- Important for neurological and psychiatric drug development
- Higher scores indicate stronger binding
GSK3B - Glycogen Synthase Kinase-3 Beta
oracle = Oracle(name='GSK3B')
score = oracle(smiles)
- Predicts GSK3β inhibition
- Relevant for Alzheimer's, diabetes, and cancer research
- Higher scores indicate better inhibition
JNK3 - c-Jun N-terminal Kinase 3
oracle = Oracle(name='JNK3')
score = oracle(smiles)
- Measures JNK3 kinase inhibition
- Target for neurodegenerative diseases
- Higher scores indicate stronger inhibition
5HT2A - Serotonin 2A Receptor
oracle = Oracle(name='5HT2A')
score = oracle(smiles)
- Predicts serotonin receptor binding
- Important for psychiatric medications
- Higher scores indicate stronger binding
ACE - Angiotensin-Converting Enzyme
oracle = Oracle(name='ACE')
score = oracle(smiles)
- Measures ACE inhibition
- Target for hypertension treatment
- Higher scores indicate better inhibition
MAPK - Mitogen-Activated Protein Kinase
oracle = Oracle(name='MAPK')
score = oracle(smiles)
- Predicts MAPK inhibition
- Target for cancer and inflammatory diseases
CDK - Cyclin-Dependent Kinase
oracle = Oracle(name='CDK')
score = oracle(smiles)
- Measures CDK inhibition
- Important for cancer drug development
P38 - p38 MAP Kinase
oracle = Oracle(name='P38')
score = oracle(smiles)
- Predicts p38 MAPK inhibition
- Target for inflammatory diseases
PARP1 - Poly (ADP-ribose) Polymerase 1
oracle = Oracle(name='PARP1')
score = oracle(smiles)
- Measures PARP1 inhibition
- Target for cancer treatment (DNA repair mechanism)
PIK3CA - Phosphatidylinositol-4,5-Bisphosphate 3-Kinase
oracle = Oracle(name='PIK3CA')
score = oracle(smiles)
- Predicts PIK3CA inhibition
- Important target in oncology
Physicochemical Oracles
Evaluate drug-like properties and ADME characteristics.
Drug-Likeness Oracles
QED - Quantitative Estimate of Drug-likeness
oracle = Oracle(name='QED')
score = oracle(smiles)
- Combines multiple physicochemical properties
- Score ranges from 0 (non-drug-like) to 1 (drug-like)
- Based on Bickerton et al. criteria
Lipinski - Rule of Five
oracle = Oracle(name='Lipinski')
score = oracle(smiles)
- Number of Lipinski rule violations
- Rules: MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10
- Score of 0 means fully compliant
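The violation count described above can be sketched in plain Python. This is only an illustration of the four thresholds, not TDC's implementation: the `Descriptors` container and `lipinski_violations` function are hypothetical names, and the descriptor values are assumed to have been computed elsewhere (for example with a cheminformatics toolkit).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptor values for one molecule (hypothetical container)."""
    mol_weight: float        # molecular weight in Da
    logp: float              # octanol-water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> int:
    """Count how many of the four Rule-of-Five thresholds are exceeded."""
    checks = (
        d.mol_weight > 500,
        d.logp > 5,
        d.h_bond_donors > 5,
        d.h_bond_acceptors > 10,
    )
    return sum(checks)


# Illustrative values only; they are not computed from real structures.
compliant = Descriptors(mol_weight=206.3, logp=3.5, h_bond_donors=1, h_bond_acceptors=2)
oversized = Descriptors(mol_weight=812.0, logp=6.2, h_bond_donors=3, h_bond_acceptors=12)

print(lipinski_violations(compliant))  # 0
print(lipinski_violations(oversized))  # 3
```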
Molecular Properties
SA - Synthetic Accessibility
oracle = Oracle(name='SA')
score = oracle(smiles)
- Estimates ease of synthesis
- Score ranges from 1 (easy) to 10 (difficult)
- Lower scores indicate easier synthesis
LogP - Octanol-Water Partition Coefficient
oracle = Oracle(name='LogP')
score = oracle(smiles)
- Measures lipophilicity
- Important for membrane permeability
- Typical drug-like range: 0-5
MW - Molecular Weight
oracle = Oracle(name='MW')
score = oracle(smiles)
- Returns molecular weight in Daltons
- Drug-like range typically 150-500 Da
Composite Oracles
Combine multiple properties for multi-objective optimization.
Isomer Meta
oracle = Oracle(name='Isomer_Meta')
score = oracle(smiles)
- Scores how closely a molecule's composition matches a target molecular formula
- Used to benchmark generation of constitutional isomers of that formula
Median Molecules
oracle = Oracle(name='Median1')  # or name='Median2'
score = oracle(smiles)
- Scores simultaneous similarity to two reference molecules, rewarding structures that sit between them
- Part of the GuacaMol goal-directed benchmark suite
Rediscovery
oracle = Oracle(name='Rediscovery')
score = oracle(smiles)
- Measures similarity to known reference molecules
- Tests ability to regenerate existing drugs
Similarity
oracle = Oracle(name='Similarity')
score = oracle(smiles)
- Computes structural similarity to target molecules
- Based on molecular fingerprints (typically Tanimoto similarity)
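For reference, Tanimoto similarity between two fingerprints is the number of shared "on" bits divided by the number of bits set in either. The sketch below assumes each fingerprint is already available as a set of bit indices; the `tanimoto` helper and the example bit sets are hypothetical, and returning 0.0 for two empty fingerprints is a convention chosen here.

```python
from typing import AbstractSet


def tanimoto(bits_a: AbstractSet[int], bits_b: AbstractSet[int]) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit indices."""
    if not bits_a and not bits_b:
        return 0.0  # convention chosen for this sketch
    shared = len(bits_a & bits_b)
    return shared / (len(bits_a) + len(bits_b) - shared)


# Made-up bit indices standing in for real fingerprints.
fingerprint_a = {1, 2, 3, 4}
fingerprint_b = {3, 4, 5, 6}

print(tanimoto(fingerprint_a, fingerprint_b))  # 2 shared bits out of 6 distinct bits
```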
Uniqueness
from tdc import Evaluator
evaluator = Evaluator(name='Uniqueness')
score = evaluator(smiles_list)
- Set-level metric, exposed through TDC's Evaluator class rather than Oracle
- Returns the fraction of unique molecules in the generated set
Novelty
evaluator = Evaluator(name='Novelty')
score = evaluator(smiles_list, training_set)
- Returns the fraction of generated molecules that do not appear in the training set
- Higher scores indicate more novel structures
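To make the two set-level metrics concrete, here is a minimal sketch of how uniqueness and novelty are commonly defined. It assumes the SMILES strings are already canonicalized, so string equality means structural equality; the function names are hypothetical, and TDC's own implementation may differ in details such as validity filtering.

```python
from typing import Sequence


def uniqueness(generated: Sequence[str]) -> float:
    """Fraction of generated molecules that are distinct."""
    if not generated:
        return 0.0
    return len(set(generated)) / len(generated)


def novelty(generated: Sequence[str], training: Sequence[str]) -> float:
    """Fraction of distinct generated molecules absent from the training set."""
    distinct = set(generated)
    if not distinct:
        return 0.0
    return len(distinct - set(training)) / len(distinct)


generated = ['CCO', 'CCO', 'CCN', 'c1ccccc1']
training = ['CCO', 'CCC']

print(uniqueness(generated))         # 3 distinct out of 4
print(novelty(generated, training))  # 2 of the 3 distinct molecules are new
```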
Specialized Oracles
ASKCOS - Retrosynthesis Scoring
oracle = Oracle(name='ASKCOS')
score = oracle(smiles)
- Evaluates synthetic feasibility using retrosynthesis
- Requires access to a running ASKCOS server
- Scores based on retrosynthetic route availability
Docking Score
oracle = Oracle(name='Docking')
score = oracle(smiles)
- Molecular docking score against target protein
- Requires protein structure and docking software
- Lower scores typically indicate better binding
Vina - AutoDock Vina Score
oracle = Oracle(name='Vina')
score = oracle(smiles)
- Uses AutoDock Vina for protein-ligand docking
- Predicts binding affinity in kcal/mol
- More negative scores indicate stronger binding
Multi-Objective Optimization
Combine multiple oracles for multi-property optimization:
from tdc import Oracle
# Initialize multiple oracles
qed_oracle = Oracle(name='QED')
sa_oracle = Oracle(name='SA')
drd2_oracle = Oracle(name='DRD2')
# Define custom scoring function
def multi_objective_score(smiles):
    qed = qed_oracle(smiles)
    sa = 1 / (1 + sa_oracle(smiles))  # Invert SA (lower is better)
    drd2 = drd2_oracle(smiles)
    # Weighted combination
    return 0.3 * qed + 0.3 * sa + 0.4 * drd2
# Evaluate molecule
score = multi_objective_score('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
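A weighted sum like the one above is easiest to reason about when every term lies on a comparable scale and points in the same direction. The sketch below shows one possible way to package that: each term carries its own transform, and SA is mapped linearly from its 1 to 10 range onto 0 to 1. All names here (`make_weighted_objective`, `sa_to_unit_interval`, the `fake_*` scorers) are hypothetical, and the fixed-value scorers merely stand in for real oracles so the example runs without TDC.

```python
from typing import Callable, Sequence, Tuple

ScoreFn = Callable[[str], float]
Transform = Callable[[float], float]


def identity(value: float) -> float:
    return value


def sa_to_unit_interval(sa_score: float) -> float:
    """Map SA (1 = easy, 10 = difficult) onto [0, 1], where 1 is easiest."""
    return (10.0 - sa_score) / 9.0


def make_weighted_objective(terms: Sequence[Tuple[ScoreFn, float, Transform]]) -> ScoreFn:
    """Build a scorer that returns the weight-normalized sum of transformed terms."""
    total_weight = sum(weight for _, weight, _ in terms)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    def objective(smiles: str) -> float:
        weighted = sum(
            weight * transform(score_fn(smiles))
            for score_fn, weight, transform in terms
        )
        return weighted / total_weight

    return objective


# Hypothetical stand-ins with fixed outputs; swap in real oracle objects.
def fake_qed(smiles: str) -> float:
    return 0.8


def fake_sa(smiles: str) -> float:
    return 2.8


def fake_drd2(smiles: str) -> float:
    return 0.1


objective = make_weighted_objective([
    (fake_qed, 0.3, identity),
    (fake_sa, 0.3, sa_to_unit_interval),
    (fake_drd2, 0.4, identity),
])
example_score = objective('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
print(round(example_score, 3))  # 0.52
```

Because the weights are normalized inside the objective, they do not need to sum to 1 when terms are added or removed.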
Oracle Performance Considerations
Speed
- Fast: QED, SA, LogP, MW, Lipinski (rule-based calculations)
- Medium: Target-specific ML models (DRD2, GSK3B, etc.)
- Slow: Docking and retrosynthesis oracles (Vina, ASKCOS)
Reliability
- Target-specific oracles are ML models trained on specific datasets
- May not generalize to all chemical spaces
- Use multiple oracles to validate results
Batch Processing
# Efficient batch evaluation
oracle = Oracle(name='GSK3B')
smiles_list = ['SMILES1', 'SMILES2', ..., 'SMILES1000']
scores = oracle(smiles_list) # Faster than individual calls
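Generative models often propose the same molecule many times, so for medium and slow oracles it can help to score each distinct SMILES string only once. The wrapper below is a sketch of that idea: `CachedOracle` is a hypothetical helper, it assumes the wrapped callable accepts a list and returns one score per entry, and `length_oracle` is a toy stand-in for a real oracle.

```python
from typing import Callable, Dict, List, Sequence


class CachedOracle:
    """Wrap a batch scorer so each distinct SMILES string is scored at most once."""

    def __init__(self, oracle: Callable[[List[str]], Sequence[float]]):
        self._oracle = oracle
        self._cache: Dict[str, float] = {}

    def __call__(self, smiles_list: Sequence[str]) -> List[float]:
        # dict.fromkeys removes duplicates while preserving first-seen order.
        missing = [s for s in dict.fromkeys(smiles_list) if s not in self._cache]
        if missing:
            self._cache.update(zip(missing, self._oracle(missing)))
        return [self._cache[s] for s in smiles_list]


scored = []  # records every string the underlying scorer actually sees


def length_oracle(smiles_list: List[str]) -> List[float]:
    """Toy stand-in for a real oracle: scores a molecule by its string length."""
    scored.extend(smiles_list)
    return [float(len(s)) for s in smiles_list]


cached = CachedOracle(length_oracle)
first = cached(['CCO', 'CCO', 'CCCN'])
second = cached(['CCCN', 'C'])
print(first, second)
print(scored)  # each distinct string appears once
```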
Common Workflows
Goal-Directed Generation
from tdc import Oracle
from tdc.generation import MolGen
# Load training data
data = MolGen(name='ChEMBL_V29')
train_smiles = data.get_data()['smiles'].tolist()  # MolGen data frames store SMILES in the 'smiles' column
# Initialize oracle
oracle = Oracle(name='GSK3B')
# Generate molecules (user implements generative model)
# generated_smiles = generator.generate(n=1000)
# Evaluate generated molecules
scores = oracle(generated_smiles)
best_molecules = list(zip(generated_smiles, scores))
best_molecules.sort(key=lambda x: x[1], reverse=True)
print("Top 10 molecules:")
for smiles, score in best_molecules[:10]:
    print(f"{smiles}: {score:.3f}")
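When the generated set is large or contains repeats, the ranking step can be done without sorting everything. The sketch below keeps the best score per distinct SMILES string and selects the top k with `heapq.nlargest`; `top_k` is a hypothetical helper, and the short strings and scores are placeholders for real oracle output.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def top_k(smiles_list: Sequence[str], scores: Sequence[float], k: int = 10) -> List[Tuple[str, float]]:
    """Return the k highest-scoring distinct molecules as (smiles, score) pairs."""
    best: Dict[str, float] = {}
    for smiles, score in zip(smiles_list, scores):
        # Keep the highest score seen for each distinct string.
        if smiles not in best or score > best[smiles]:
            best[smiles] = score
    return heapq.nlargest(k, best.items(), key=lambda item: item[1])


# Placeholder inputs; real scores would come from an oracle call.
ranked = top_k(['A', 'B', 'A', 'C'], [0.1, 0.9, 0.5, 0.3], k=2)
print(ranked)  # [('B', 0.9), ('A', 0.5)]
```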
Distribution Learning
from tdc import Oracle
import numpy as np
# Initialize oracle
oracle = Oracle(name='QED')
# Evaluate training set
train_scores = oracle(train_smiles)
train_mean = np.mean(train_scores)
train_std = np.std(train_scores)
# Evaluate generated set
gen_scores = oracle(generated_smiles)
gen_mean = np.mean(gen_scores)
gen_std = np.std(gen_scores)
# Compare distributions
print(f"Training: μ={train_mean:.3f}, σ={train_std:.3f}")
print(f"Generated: μ={gen_mean:.3f}, σ={gen_std:.3f}")
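Two score distributions can share a mean and standard deviation while differing in shape, so a distance between the full distributions is a useful complement. One option is the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs. The pure-Python sketch below is an illustration with made-up scores; `ks_statistic` is a hypothetical helper, not a TDC function.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Largest absolute gap between the empirical CDFs of two samples."""
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    max_gap = 0.0
    # The supremum is attained at one of the observed values.
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Made-up oracle scores for illustration.
train_example = [0.2, 0.4, 0.6, 0.8]
gen_example = [0.6, 0.7, 0.8, 0.9]

print(ks_statistic(train_example, gen_example))    # 0.5
print(ks_statistic(train_example, train_example))  # 0.0
```

A value of 0 means the empirical distributions coincide, and 1 means they do not overlap at all.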
Integration with TDC Benchmarks
from tdc.generation import MolGen
# Use with GuacaMol benchmark
data = MolGen(name='GuacaMol')
# Oracles are automatically integrated
# Each GuacaMol task has an associated oracle
# Method and argument names depend on the installed PyTDC version; check its API before use
benchmark_results = data.evaluate_guacamol(
    generated_molecules=your_molecules,
    oracle_name='GSK3B'
)
Notes
- Oracle scores are predictions, not experimental measurements
- Always validate top candidates experimentally
- Different oracles may have different score ranges and interpretations
- Some oracles require additional dependencies or API access
- Check oracle documentation for specific details: https://tdcommons.ai/functions/oracles/
Adding Custom Oracles
To create custom oracle functions:
class CustomOracle:
    def __init__(self):
        # Initialize your model/method
        pass

    def __call__(self, smiles):
        # Implement your scoring logic
        # Return score or list of scores
        pass
# Use like built-in oracles
custom_oracle = CustomOracle()
score = custom_oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
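As a filled-in illustration of the skeleton above, the sketch below mirrors the calling convention of the built-in oracles: a single SMILES string yields one score, and a list yields a list of scores. The `CarbonFractionOracle` class and its scoring rule are hypothetical toys with no chemical meaning (for instance, the "C" in "Cl" is counted as carbon); a real oracle would call a model or a cheminformatics toolkit instead.

```python
from typing import List, Sequence, Union


class CarbonFractionOracle:
    """Toy scorer: fraction of SMILES characters that are 'C' or 'c'."""

    def _score_one(self, smiles: str) -> float:
        if not smiles:
            return 0.0
        return sum(ch in 'Cc' for ch in smiles) / len(smiles)

    def __call__(self, smiles: Union[str, Sequence[str]]) -> Union[float, List[float]]:
        # A single string returns one score; any other sequence returns a list.
        if isinstance(smiles, str):
            return self._score_one(smiles)
        return [self._score_one(s) for s in smiles]


toy_oracle = CarbonFractionOracle()
single_score = toy_oracle('CCO')
batch_scores = toy_oracle(['CCO', 'N'])
print(single_score, batch_scores)
```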
References
- TDC Oracles Documentation: https://tdcommons.ai/functions/oracles/
- GuacaMol Paper: "GuacaMol: Benchmarking Models for de Novo Molecular Design"
- MOSES Paper: "Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models"