# TDC Molecule Generation Oracles

Oracles are functions that evaluate the quality of generated molecules across specific dimensions. TDC provides 17+ oracle functions for molecular optimization tasks in de novo drug design.

## Overview

Oracles measure molecular properties and serve two main purposes:

1. **Goal-Directed Generation**: Optimize molecules to maximize/minimize specific properties
2. **Distribution Learning**: Evaluate whether generated molecules match desired property distributions

## Using Oracles

### Basic Usage

```python
from tdc import Oracle

# Initialize oracle
oracle = Oracle(name='GSK3B')

# Evaluate single molecule (SMILES string)
score = oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')

# Evaluate multiple molecules (returns a list of scores)
scores = oracle(['CCO', 'c1ccccc1', 'CC(=O)O'])
```

### Oracle Categories

TDC oracles are organized into several categories based on the molecular property being evaluated.

## Biochemical Oracles

Predict binding affinity or activity against biological targets.

### Target-Specific Oracles

**DRD2 - Dopamine Receptor D2**
```python
oracle = Oracle(name='DRD2')
score = oracle(smiles)
```
- Measures binding affinity to DRD2 receptor
- Important for neurological and psychiatric drug development
- Higher scores indicate stronger binding

**GSK3B - Glycogen Synthase Kinase-3 Beta**
```python
oracle = Oracle(name='GSK3B')
score = oracle(smiles)
```
- Predicts GSK3β inhibition
- Relevant for Alzheimer's, diabetes, and cancer research
- Higher scores indicate better inhibition

**JNK3 - c-Jun N-terminal Kinase 3**
```python
oracle = Oracle(name='JNK3')
score = oracle(smiles)
```
- Measures JNK3 kinase inhibition
- Target for neurodegenerative diseases
- Higher scores indicate stronger inhibition

**5HT2A - Serotonin 2A Receptor**
```python
oracle = Oracle(name='5HT2A')
score = oracle(smiles)
```
- Predicts serotonin receptor binding
- Important for psychiatric medications
- Higher scores indicate stronger binding

**ACE - Angiotensin-Converting Enzyme**
```python
oracle = Oracle(name='ACE')
score = oracle(smiles)
```
- Measures ACE inhibition
- Target for hypertension treatment
- Higher scores indicate better inhibition

**MAPK - Mitogen-Activated Protein Kinase**
```python
oracle = Oracle(name='MAPK')
score = oracle(smiles)
```
- Predicts MAPK inhibition
- Target for cancer and inflammatory diseases

**CDK - Cyclin-Dependent Kinase**
```python
oracle = Oracle(name='CDK')
score = oracle(smiles)
```
- Measures CDK inhibition
- Important for cancer drug development

**P38 - p38 MAP Kinase**
```python
oracle = Oracle(name='P38')
score = oracle(smiles)
```
- Predicts p38 MAPK inhibition
- Target for inflammatory diseases

**PARP1 - Poly (ADP-ribose) Polymerase 1**
```python
oracle = Oracle(name='PARP1')
score = oracle(smiles)
```
- Measures PARP1 inhibition
- Target for cancer treatment (DNA repair mechanism)

**PIK3CA - Phosphatidylinositol-4,5-Bisphosphate 3-Kinase**
```python
oracle = Oracle(name='PIK3CA')
score = oracle(smiles)
```
- Predicts PIK3CA inhibition
- Important target in oncology

## Physicochemical Oracles

Evaluate drug-like properties and ADME characteristics.

### Drug-Likeness Oracles

**QED - Quantitative Estimate of Drug-likeness**
```python
oracle = Oracle(name='QED')
score = oracle(smiles)
```
- Combines multiple physicochemical properties
- Score ranges from 0 (non-drug-like) to 1 (drug-like)
- Based on Bickerton et al. criteria

**Lipinski - Rule of Five**
```python
oracle = Oracle(name='Lipinski')
score = oracle(smiles)
```
- Number of Lipinski rule violations
- Rules: MW ≤ 500, logP ≤ 5, HBD ≤ 5, HBA ≤ 10
- Score of 0 means fully compliant

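The violation count described above can be sketched in plain Python. This is a minimal illustration, not TDC's implementation: the function name `lipinski_violations` is hypothetical, and it assumes the four descriptor values have already been computed elsewhere (for example with a cheminformatics toolkit).

```python
def lipinski_violations(mol_weight, logp, h_donors, h_acceptors):
    """Count Rule-of-Five violations from precomputed descriptor values."""
    checks = [
        mol_weight <= 500,   # MW <= 500 Da
        logp <= 5,           # logP <= 5
        h_donors <= 5,       # hydrogen-bond donors <= 5
        h_acceptors <= 10,   # hydrogen-bond acceptors <= 10
    ]
    return sum(not ok for ok in checks)

# Approximate descriptor values for a small drug-like molecule (illustrative only)
print(lipinski_violations(206.3, 3.5, 1, 2))   # 0
# A large, lipophilic molecule that breaks every rule
print(lipinski_violations(720.0, 6.2, 6, 12))  # 4
```

A score of 0 corresponds to the "fully compliant" case in the list above.
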
### Molecular Properties

**SA - Synthetic Accessibility**
```python
oracle = Oracle(name='SA')
score = oracle(smiles)
```
- Estimates ease of synthesis
- Score ranges from 1 (easy) to 10 (difficult)
- Lower scores indicate easier synthesis

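Because lower SA scores are better, an optimizer that maximizes its objective needs the score flipped. One possible conversion, shown here as a sketch, rescales the 1 to 10 range linearly onto a reward between 0 and 1. The helper `sa_to_reward` is a hypothetical name, and the linear mapping is only one of several reasonable choices.

```python
def sa_to_reward(sa_score, lo=1.0, hi=10.0):
    """Map an SA score in [lo, hi] to a reward in [0, 1], where 1 is easiest."""
    clipped = min(max(sa_score, lo), hi)   # guard against out-of-range inputs
    return (hi - clipped) / (hi - lo)

print(sa_to_reward(1.0))   # 1.0 (easiest to synthesize)
print(sa_to_reward(10.0))  # 0.0 (hardest to synthesize)
print(sa_to_reward(5.5))   # 0.5
```
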
**LogP - Octanol-Water Partition Coefficient**
```python
oracle = Oracle(name='LogP')
score = oracle(smiles)
```
- Measures lipophilicity
- Important for membrane permeability
- Typical drug-like range: 0-5

**MW - Molecular Weight**
```python
oracle = Oracle(name='MW')
score = oracle(smiles)
```
- Returns molecular weight in Daltons
- Drug-like range typically 150-500 Da

## Composite Oracles

Combine multiple properties for multi-objective optimization.

**Isomer Meta**
```python
oracle = Oracle(name='Isomer_Meta')
score = oracle(smiles)
```
- Evaluates specific isomeric properties
- Used for stereochemistry optimization

**Median Molecules**
```python
oracle = Oracle(name='Median 1')  # or 'Median 2'
score = oracle(smiles)
```
- Tests ability to generate molecules with median properties
- Useful for distribution learning benchmarks

**Rediscovery**
```python
oracle = Oracle(name='Rediscovery')
score = oracle(smiles)
```
- Measures similarity to known reference molecules
- Tests ability to regenerate existing drugs

**Similarity**
```python
oracle = Oracle(name='Similarity')
score = oracle(smiles)
```
- Computes structural similarity to target molecules
- Based on molecular fingerprints (typically Tanimoto similarity)

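For reference, Tanimoto similarity between two fingerprints is the size of their intersection divided by the size of their union. The sketch below represents each fingerprint as a set of on-bit indices. The bit indices are made up for illustration, and how TDC builds its fingerprints is not shown here.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto similarity |A & B| / |A | B| between two sets of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty fingerprints (an assumption)
    return len(a & b) / len(union)

# Hypothetical on-bit indices for a generated molecule and a target molecule
fp_generated = {1, 4, 7, 9}
fp_target = {1, 4, 8, 9, 12}
print(tanimoto(fp_generated, fp_target))  # 0.5 (3 shared bits, 6 bits in the union)
```
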
**Uniqueness**
```python
oracle = Oracle(name='Uniqueness')
scores = oracle(smiles_list)
```
- Measures diversity in generated molecule set
- Returns fraction of unique molecules

**Novelty**
```python
oracle = Oracle(name='Novelty')
scores = oracle(smiles_list, training_set)
```
- Measures how different generated molecules are from training set
- Higher scores indicate more novel structures

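Both set-level metrics reduce to simple set arithmetic. The following sketch compares raw SMILES strings, which is a simplifying assumption: two different strings can encode the same molecule, so a real implementation would usually canonicalize first, and TDC's own functions may differ in detail. The function names are illustrative.

```python
def uniqueness(smiles_list):
    """Fraction of distinct strings in the generated set."""
    if not smiles_list:
        return 0.0
    return len(set(smiles_list)) / len(smiles_list)

def novelty(generated, training):
    """Fraction of distinct generated strings that are absent from the training set."""
    unique_generated = set(generated)
    if not unique_generated:
        return 0.0
    return len(unique_generated - set(training)) / len(unique_generated)

generated = ['CCO', 'CCO', 'c1ccccc1', 'CC(=O)O']
training = ['CCO', 'CCN']
print(uniqueness(generated))         # 0.75 (3 distinct strings out of 4)
print(novelty(generated, training))  # 2 of the 3 distinct strings are not in training
```
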
## Specialized Oracles

**ASKCOS - Retrosynthesis Scoring**
```python
oracle = Oracle(name='ASKCOS')
score = oracle(smiles)
```
- Evaluates synthetic feasibility using retrosynthesis
- Requires access to a running ASKCOS backend; IBM RXN is a separate retrosynthesis service
- Scores based on retrosynthetic route availability

**Docking Score**
```python
oracle = Oracle(name='Docking')
score = oracle(smiles)
```
- Molecular docking score against target protein
- Requires protein structure and docking software
- Lower scores typically indicate better binding

**Vina - AutoDock Vina Score**
```python
oracle = Oracle(name='Vina')
score = oracle(smiles)
```
- Uses AutoDock Vina for protein-ligand docking
- Predicts binding affinity in kcal/mol
- More negative scores indicate stronger binding

## Multi-Objective Optimization

Combine multiple oracles for multi-property optimization:

```python
from tdc import Oracle

# Initialize multiple oracles
qed_oracle = Oracle(name='QED')
sa_oracle = Oracle(name='SA')
drd2_oracle = Oracle(name='DRD2')

# Define custom scoring function
def multi_objective_score(smiles):
    qed = qed_oracle(smiles)
    sa = 1 / (1 + sa_oracle(smiles))  # Invert SA (lower is better)
    drd2 = drd2_oracle(smiles)

    # Weighted combination
    return 0.3 * qed + 0.3 * sa + 0.4 * drd2

# Evaluate molecule
score = multi_objective_score('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
```

## Oracle Performance Considerations

### Speed
- **Fast**: QED, SA, LogP, MW, Lipinski (rule-based calculations)
- **Medium**: Target-specific ML models (DRD2, GSK3B, etc.)
- **Slow**: Docking and retrosynthesis oracles (Vina, ASKCOS)

### Reliability
- Target-specific oracles are ML models trained on specific datasets
- They may not generalize to all chemical spaces
- Use multiple oracles to validate results

### Batch Processing
```python
from tdc import Oracle

# Efficient batch evaluation
oracle = Oracle(name='GSK3B')
smiles_list = ['SMILES1', 'SMILES2', ..., 'SMILES1000']  # placeholder: your own SMILES strings
scores = oracle(smiles_list)  # Faster than individual calls
```

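For slow oracles, it can also help to avoid scoring the same molecule twice. The wrapper below is a sketch of that idea and is not part of TDC: `make_cached_oracle` is a hypothetical helper that assumes the wrapped callable accepts a list of SMILES and returns one score per entry, and `toy_oracle` is a stand-in so the example runs without any dependencies.

```python
def make_cached_oracle(oracle):
    """Wrap a batch-capable scoring callable so each SMILES is scored at most once."""
    cache = {}

    def cached(smiles_list):
        # Collect unseen molecules, preserving order and dropping duplicates
        missing = list(dict.fromkeys(s for s in smiles_list if s not in cache))
        if missing:
            # One batch call for everything not yet cached
            cache.update(zip(missing, oracle(missing)))
        return [cache[s] for s in smiles_list]

    return cached

# Stand-in for a slow oracle: scores a molecule by its string length
calls = []
def toy_oracle(batch):
    calls.append(list(batch))
    return [float(len(s)) for s in batch]

cached_oracle = make_cached_oracle(toy_oracle)
print(cached_oracle(['CCO', 'CCO', 'CCN']))  # [3.0, 3.0, 3.0]
print(cached_oracle(['CCO', 'CCCC']))        # [3.0, 4.0]
print(calls)                                 # [['CCO', 'CCN'], ['CCCC']]
```

The second call only sends `'CCCC'` to the underlying oracle, since `'CCO'` was already scored.
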
## Common Workflows

### Goal-Directed Generation
```python
from tdc import Oracle
from tdc.generation import MolGen

# Load training data (MolGen returns a DataFrame with a 'smiles' column)
data = MolGen(name='ChEMBL_V29')
train_smiles = data.get_data()['smiles'].tolist()

# Initialize oracle
oracle = Oracle(name='GSK3B')

# Generate molecules (user implements generative model)
# generated_smiles = generator.generate(n=1000)
generated_smiles = train_smiles[:1000]  # placeholder so the script runs end to end

# Evaluate generated molecules and rank them by score, best first
scores = oracle(generated_smiles)
best_molecules = sorted(zip(generated_smiles, scores), key=lambda x: x[1], reverse=True)

print("Top 10 molecules:")
for smiles, score in best_molecules[:10]:
    print(f"{smiles}: {score:.3f}")
```

### Distribution Learning
```python
from tdc import Oracle
import numpy as np

# train_smiles and generated_smiles are lists of SMILES strings,
# e.g. as prepared in the Goal-Directed Generation workflow.

# Initialize oracle
oracle = Oracle(name='QED')

# Evaluate training set
train_scores = oracle(train_smiles)
train_mean = np.mean(train_scores)
train_std = np.std(train_scores)

# Evaluate generated set
gen_scores = oracle(generated_smiles)
gen_mean = np.mean(gen_scores)
gen_std = np.std(gen_scores)

# Compare distributions
print(f"Training: μ={train_mean:.3f}, σ={train_std:.3f}")
print(f"Generated: μ={gen_mean:.3f}, σ={gen_std:.3f}")
```

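Mean and standard deviation can agree even when two distributions have different shapes. As one possible complement, the sketch below computes the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the two empirical cumulative distributions. This is a swapped-in illustration rather than something TDC prescribes. The function name is hypothetical, and the score lists are made-up numbers standing in for oracle outputs.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The largest gap between two step functions occurs at one of the sample points
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

# Made-up QED-like scores for a training set and a generated set
train_scores = [0.1, 0.2, 0.3, 0.4]
gen_scores = [0.3, 0.4, 0.5, 0.6]
print(ks_statistic(train_scores, gen_scores))    # 0.5
print(ks_statistic(train_scores, train_scores))  # 0.0
```

A value near 0 means the two score distributions are close; a value of 1 means they do not overlap at all.
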
## Integration with TDC Benchmarks

```python
from tdc.generation import MolGen

# Use with GuacaMol benchmark
data = MolGen(name='GuacaMol')

# Oracles are automatically integrated
# Each GuacaMol task has associated oracle
benchmark_results = data.evaluate_guacamol(
    generated_molecules=your_molecules,
    oracle_name='GSK3B'
)
```

## Notes

- Oracle scores are predictions, not experimental measurements
- Always validate top candidates experimentally
- Different oracles may have different score ranges and interpretations
- Some oracles require additional dependencies or API access
- Check oracle documentation for specific details: https://tdcommons.ai/functions/oracles/

## Adding Custom Oracles

To create custom oracle functions:

```python
class CustomOracle:
    def __init__(self):
        # Initialize your model/method
        pass

    def __call__(self, smiles):
        # Implement your scoring logic
        # Return score or list of scores
        pass

# Use like built-in oracles
custom_oracle = CustomOracle()
score = custom_oracle('CC(C)Cc1ccc(cc1)C(C)C(O)=O')
```

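As a concrete instance of the skeleton above, here is a toy oracle that accepts either a single SMILES string or a list, mirroring how the built-in oracles are called. The class name and the scoring rule are purely illustrative: counting `C`/`c` characters is not chemically meaningful (it would also count the `C` in `Cl`), but it keeps the example free of dependencies.

```python
class CarbonFractionOracle:
    """Toy oracle: fraction of characters in a SMILES string that are 'C' or 'c'."""

    def _score_one(self, smiles):
        if not smiles:
            return 0.0
        return sum(ch in 'Cc' for ch in smiles) / len(smiles)

    def __call__(self, smiles):
        # A single string gives one score; a list gives a list of scores
        if isinstance(smiles, str):
            return self._score_one(smiles)
        return [self._score_one(s) for s in smiles]

toy = CarbonFractionOracle()
print(toy('CCCC'))           # 1.0
print(toy(['CCCC', 'OO']))   # [1.0, 0.0]
```
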
## References

- TDC Oracles Documentation: https://tdcommons.ai/functions/oracles/
- GuacaMol Paper: "GuacaMol: Benchmarking Models for de Novo Molecular Design"
- MOSES Paper: "Molecular Sets (MOSES): A Benchmarking Platform for Molecular Generation Models"