Initial commit
This commit is contained in:
436
skills/torchdrug/references/retrosynthesis.md
Normal file
436
skills/torchdrug/references/retrosynthesis.md
Normal file
@@ -0,0 +1,436 @@
|
||||
# Retrosynthesis
|
||||
|
||||
## Overview
|
||||
|
||||
Retrosynthesis is the process of planning synthetic routes from target molecules back to commercially available starting materials. TorchDrug provides tools for learning-based retrosynthesis prediction, breaking down the complex task into manageable subtasks.
|
||||
|
||||
## Available Datasets
|
||||
|
||||
### USPTO-50K
|
||||
|
||||
The standard benchmark dataset for retrosynthesis derived from US patent literature.
|
||||
|
||||
**Statistics:**
|
||||
- 50,017 reaction examples
|
||||
- Single-step reactions
|
||||
- Filtered for quality and canonicalization
|
||||
- Contains atom mapping for reaction center identification
|
||||
|
||||
**Reaction Types:**
|
||||
- Diverse organic reactions
|
||||
- Drug-like transformations
|
||||
- Well-balanced across common reaction classes
|
||||
|
||||
**Data Splits:**
|
||||
- Training: ~40k reactions
|
||||
- Validation: ~5k reactions
|
||||
- Test: ~5k reactions
|
||||
|
||||
**Format:**
|
||||
- Product → Reactants
|
||||
- SMILES representation
|
||||
- Atom-mapped reactions for training
|
||||
|
||||
## Task Types
|
||||
|
||||
TorchDrug decomposes retrosynthesis into a multi-step pipeline:
|
||||
|
||||
### 1. CenterIdentification
|
||||
|
||||
Identifies the reaction center - which bonds were formed/broken in the forward reaction.
|
||||
|
||||
**Input:** Product molecule
|
||||
**Output:** Probability for each bond of being part of reaction center
|
||||
|
||||
**Purpose:**
|
||||
- Locate where chemistry happened
|
||||
- Guide subsequent synthon generation
|
||||
- Reduce search space dramatically
|
||||
|
||||
**Model Architecture:**
|
||||
- Graph neural network on product molecule
|
||||
- Edge-level classification
|
||||
- Attention mechanisms to highlight reactive regions
|
||||
|
||||
**Evaluation Metrics:**
|
||||
- **Top-K Accuracy**: Correct reaction center in top K predictions
|
||||
- **Bond-level F1**: Precision and recall for bond predictions
|
||||
|
||||
### 2. SynthonCompletion
|
||||
|
||||
Given the product and identified reaction center, predict the reactant structures (synthons).
|
||||
|
||||
**Input:**
|
||||
- Product molecule
|
||||
- Identified reaction center (broken/formed bonds)
|
||||
|
||||
**Output:**
|
||||
- Predicted reactant molecules (synthons)
|
||||
|
||||
**Process:**
|
||||
1. Break bonds at reaction center
|
||||
2. Modify atom environments (valence, charges)
|
||||
3. Determine leaving groups and protecting groups
|
||||
4. Generate complete reactant structures
|
||||
|
||||
**Challenges:**
|
||||
- Multiple valid reactant sets
|
||||
- Stereospecificity
|
||||
- Atom environment changes (hybridization, charge)
|
||||
- Leaving group selection
|
||||
|
||||
**Evaluation:**
|
||||
- **Exact Match**: Generated reactants exactly match ground truth
|
||||
- **Top-K Accuracy**: Correct reactants in top K predictions
|
||||
- **Chemical Validity**: Generated molecules are valid
|
||||
|
||||
### 3. Retrosynthesis (End-to-End)
|
||||
|
||||
Combines center identification and synthon completion into a unified pipeline.
|
||||
|
||||
**Input:** Target product molecule
|
||||
**Output:** Ranked list of reactant sets (synthesis pathways)
|
||||
|
||||
**Workflow:**
|
||||
1. Identify top-K reaction centers
|
||||
2. For each center, generate reactant candidates
|
||||
3. Rank combinations by model confidence
|
||||
4. Filter for commercial availability and feasibility
|
||||
|
||||
**Advantages:**
|
||||
- Single model to train and deploy
|
||||
- Joint optimization of subtasks
|
||||
- Error propagation from center identification accounted for
|
||||
|
||||
## Training Workflows
|
||||
|
||||
### Basic Pipeline
|
||||
|
||||
```python
|
||||
from torchdrug import datasets, models, tasks
|
||||
|
||||
# Load dataset
|
||||
dataset = datasets.USPTO50k("~/retro-datasets/")
|
||||
|
||||
# For center identification
|
||||
model_center = models.RGCN(
|
||||
input_dim=dataset.node_feature_dim,
|
||||
num_relation=dataset.num_bond_type,
|
||||
hidden_dims=[256, 256, 256]
|
||||
)
|
||||
|
||||
task_center = tasks.CenterIdentification(
|
||||
model_center,
|
||||
top_k=3 # Consider top 3 reaction centers
|
||||
)
|
||||
|
||||
# For synthon completion
|
||||
model_synthon = models.GIN(
|
||||
input_dim=dataset.node_feature_dim,
|
||||
hidden_dims=[256, 256, 256]
|
||||
)
|
||||
|
||||
task_synthon = tasks.SynthonCompletion(
|
||||
model_synthon,
|
||||
center_topk=3, # Use top 3 from center identification
|
||||
num_synthon_beam=5 # Beam search for synthon generation
|
||||
)
|
||||
|
||||
# End-to-end
|
||||
task_retro = tasks.Retrosynthesis(
|
||||
model=model_center,
|
||||
synthon_model=model_synthon,
|
||||
center_topk=5,
|
||||
num_synthon_beam=10
|
||||
)
|
||||
```
|
||||
|
||||
### Transfer Learning
|
||||
|
||||
Pre-train on large reaction datasets (e.g., USPTO-full with 1M+ reactions), then fine-tune on specific reaction classes.
|
||||
|
||||
**Benefits:**
|
||||
- Better generalization to rare reaction types
|
||||
- Improved performance on small datasets
|
||||
- Learn general reaction patterns
|
||||
|
||||
### Multi-Task Learning
|
||||
|
||||
Train jointly on:
|
||||
- Forward reaction prediction
|
||||
- Retrosynthesis
|
||||
- Reaction type classification
|
||||
- Yield prediction
|
||||
|
||||
**Advantages:**
|
||||
- Shared representations of chemistry
|
||||
- Better sample efficiency
|
||||
- Improved robustness
|
||||
|
||||
## Model Architectures
|
||||
|
||||
### Graph Neural Networks
|
||||
|
||||
**RGCN (Relational Graph Convolutional Network):**
|
||||
- Handles multiple bond types (single, double, triple, aromatic)
|
||||
- Edge-type-specific transformations
|
||||
- Good for reaction center identification
|
||||
|
||||
**GIN (Graph Isomorphism Network):**
|
||||
- Powerful message passing
|
||||
- Captures structural patterns
|
||||
- Works well for synthon completion
|
||||
|
||||
**GAT (Graph Attention Network):**
|
||||
- Attention weights highlight important atoms/bonds
|
||||
- Interpretable reaction center predictions
|
||||
- Flexible for various reaction types
|
||||
|
||||
### Sequence-Based Models
|
||||
|
||||
**Transformer Models:**
|
||||
- SMILES-to-SMILES translation
|
||||
- Can capture long-range dependencies
|
||||
- Require large datasets
|
||||
|
||||
**LSTM/GRU:**
|
||||
- Sequence generation for reactants
|
||||
- Autoregressive decoding
|
||||
- Good for small molecules
|
||||
|
||||
### Hybrid Approaches
|
||||
|
||||
Combine graph and sequence representations:
|
||||
- Graph encoder for products
|
||||
- Sequence decoder for reactants
|
||||
- Best of both representations
|
||||
|
||||
## Reaction Chemistry Considerations
|
||||
|
||||
### Reaction Classes
|
||||
|
||||
**Common Transformations:**
|
||||
- C-C bond formation (coupling, addition)
|
||||
- Functional group interconversions (oxidation, reduction)
|
||||
- Heterocycle synthesis (cyclizations)
|
||||
- Protection/deprotection
|
||||
- Aromatic substitutions
|
||||
|
||||
**Rare Reactions:**
|
||||
- Novel coupling methods
|
||||
- Complex rearrangements
|
||||
- Multi-component reactions
|
||||
|
||||
### Selectivity Issues
|
||||
|
||||
**Regioselectivity:**
|
||||
- Which position reacts on molecule
|
||||
- Requires understanding of electronics and sterics
|
||||
|
||||
**Stereoselectivity:**
|
||||
- Control of stereochemistry
|
||||
- Diastereoselectivity and enantioselectivity
|
||||
- Critical for drug synthesis
|
||||
|
||||
**Chemoselectivity:**
|
||||
- Which functional group reacts
|
||||
- Requires protecting group strategies
|
||||
|
||||
### Reaction Conditions
|
||||
|
||||
While TorchDrug focuses on reaction connectivity, consider:
|
||||
- Temperature and pressure
|
||||
- Catalysts and reagents
|
||||
- Solvents
|
||||
- Reaction time
|
||||
- Work-up and purification
|
||||
|
||||
## Multi-Step Synthesis Planning
|
||||
|
||||
### Single-Step Retrosynthesis
|
||||
|
||||
Predict immediate precursors for target molecule.
|
||||
|
||||
**Use Case:**
|
||||
- Late-stage transformations
|
||||
- Simple molecules (1-2 steps from commercial)
|
||||
- Initial route scouting
|
||||
|
||||
### Multi-Step Planning
|
||||
|
||||
Recursively apply retrosynthesis to each predicted reactant until reaching commercial building blocks.
|
||||
|
||||
**Tree Search Strategies:**
|
||||
|
||||
**Breadth-First Search:**
|
||||
- Explore all routes to same depth
|
||||
- Find shortest routes
|
||||
- Memory intensive
|
||||
|
||||
**Depth-First Search:**
|
||||
- Follow each route to completion
|
||||
- Memory efficient
|
||||
- May miss optimal routes
|
||||
|
||||
**Monte Carlo Tree Search (MCTS):**
|
||||
- Balance exploration and exploitation
|
||||
- Guided by model confidence
|
||||
- State-of-the-art for multi-step planning
|
||||
|
||||
**A\* Search:**
|
||||
- Heuristic-guided search
|
||||
- Optimizes for cost, complexity, or feasibility
|
||||
- Efficient for finding best routes
|
||||
|
||||
### Route Scoring
|
||||
|
||||
Rank synthetic routes by:
|
||||
1. **Number of Steps**: Fewer is better (efficiency)
|
||||
2. **Convergent vs Linear**: Convergent routes preferred
|
||||
3. **Commercial Availability**: How many steps to buyable compounds
|
||||
4. **Reaction Feasibility**: Likelihood each step works
|
||||
5. **Overall Yield**: Estimated end-to-end yield
|
||||
6. **Cost**: Reagents, labor, equipment
|
||||
7. **Green Chemistry**: Environmental impact, safety
|
||||
|
||||
### Stopping Criteria
|
||||
|
||||
Stop retrosynthesis when reaching:
|
||||
- **Commercial Compounds**: Available from vendors (e.g., Sigma-Aldrich, Enamine)
|
||||
- **Building Blocks**: Standard synthetic intermediates
|
||||
- **Max Depth**: e.g., 6-10 steps
|
||||
- **Low Confidence**: Model uncertainty too high
|
||||
|
||||
## Validation and Filtering
|
||||
|
||||
### Chemical Validity
|
||||
|
||||
Check each predicted reaction:
|
||||
- Reactants are valid molecules
|
||||
- Reaction is chemically reasonable
|
||||
- Atom mapping is consistent
|
||||
- Stoichiometry balances
|
||||
|
||||
### Synthetic Feasibility
|
||||
|
||||
**Filters:**
|
||||
- Reaction precedent (literature examples)
|
||||
- Functional group compatibility
|
||||
- Typical reaction conditions
|
||||
- Expected yield ranges
|
||||
|
||||
**Expert Systems:**
|
||||
- Rule-based validation (e.g., ARChem Route Designer)
|
||||
- Check for incompatible functional groups
|
||||
- Identify protection/deprotection needs
|
||||
|
||||
### Commercial Availability
|
||||
|
||||
**Databases:**
|
||||
- eMolecules: 10M+ commercial compounds
|
||||
- ZINC: Annotated with vendor info
|
||||
- Reaxys: Commercially available building blocks
|
||||
|
||||
**Considerations:**
|
||||
- Cost per gram
|
||||
- Purity and quality
|
||||
- Lead time for delivery
|
||||
- Minimum order quantities
|
||||
|
||||
## Integration with Other Tools
|
||||
|
||||
### Reaction Prediction (Forward)
|
||||
|
||||
Train forward reaction prediction models to validate retrosynthetic proposals:
|
||||
- Predict products from proposed reactants
|
||||
- Validate reaction feasibility
|
||||
- Estimate yields
|
||||
|
||||
### Retrosynthesis Software
|
||||
|
||||
**Integration with:**
|
||||
- SciFinder (CAS)
|
||||
- Reaxys (Elsevier)
|
||||
- ARChem Route Designer
|
||||
- IBM RXN for Chemistry
|
||||
|
||||
**TorchDrug as Component:**
|
||||
- Use TorchDrug models within larger planning systems
|
||||
- Combine ML predictions with rule-based systems
|
||||
- Hybrid AI + expert system approaches
|
||||
|
||||
### Experimental Validation
|
||||
|
||||
**High-Throughput Screening:**
|
||||
- Rapid testing of predicted reactions
|
||||
- Automated synthesis platforms
|
||||
- Feedback loop to improve models
|
||||
|
||||
**Robotic Synthesis:**
|
||||
- Automated execution of planned routes
|
||||
- Real-time optimization
|
||||
- Data generation for model improvement
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Ensemble Predictions**: Use multiple models for robustness
|
||||
2. **Reaction Validation**: Always validate with chemistry rules
|
||||
3. **Commercial Check**: Verify building block availability early
|
||||
4. **Diversity**: Generate multiple diverse routes, not just top-1
|
||||
5. **Expert Review**: Have chemists evaluate proposed routes
|
||||
6. **Literature Search**: Check for precedents of key steps
|
||||
7. **Iterative Refinement**: Update models with experimental feedback
|
||||
8. **Interpretability**: Understand why model suggests each step
|
||||
9. **Edge Cases**: Handle unusual functional groups and scaffolds
|
||||
10. **Benchmarking**: Compare against known synthesis routes
|
||||
|
||||
## Common Applications
|
||||
|
||||
### Drug Synthesis Planning
|
||||
|
||||
- Small molecule drugs
|
||||
- Natural product total synthesis
|
||||
- Late-stage functionalization strategies
|
||||
|
||||
### Library Enumeration
|
||||
|
||||
- Virtual library design
|
||||
- Retrosynthetic filtering of generated molecules
|
||||
- Prioritize synthesizable compounds
|
||||
|
||||
### Process Chemistry
|
||||
|
||||
- Route scouting for large-scale synthesis
|
||||
- Cost optimization
|
||||
- Green chemistry alternatives
|
||||
|
||||
### Synthetic Method Development
|
||||
|
||||
- Identify gaps in synthetic methodology
|
||||
- Guide development of new reactions
|
||||
- Expand retrosynthesis model capabilities
|
||||
|
||||
## Challenges and Future Directions
|
||||
|
||||
### Current Limitations
|
||||
|
||||
- Limited to single-step predictions (most models)
|
||||
- Doesn't consider reaction conditions explicitly
|
||||
- Stereochemistry handling is challenging
|
||||
- Rare reaction types underrepresented
|
||||
|
||||
### Active Research Areas
|
||||
|
||||
- End-to-end multi-step planning
|
||||
- Incorporation of reaction conditions
|
||||
- Stereoselective retrosynthesis
|
||||
- Integration with robotics for closed-loop optimization
|
||||
- Semi-template methods (balance templates and templates-free)
|
||||
- Uncertainty quantification for predictions
|
||||
|
||||
### Emerging Techniques
|
||||
|
||||
- Large language models for chemistry (ChemGPT, MolT5)
|
||||
- Reinforcement learning for route optimization
|
||||
- Graph transformers for long-range interactions
|
||||
- Self-supervised pre-training on reaction databases
|
||||
Reference in New Issue
Block a user