Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/torchdrug/references/retrosynthesis.md
+++ b/skills/torchdrug/references/retrosynthesis.md
@@ -0,0 +1,436 @@
+# Retrosynthesis
+
+## Overview
+
+Retrosynthesis is the process of planning synthetic routes from target molecules back to commercially available starting materials. TorchDrug provides tools for learning-based retrosynthesis prediction, breaking down the complex task into manageable subtasks.
+
+## Available Datasets
+
+### USPTO-50K
+
+The standard benchmark dataset for retrosynthesis derived from US patent literature.
+
+**Statistics:**
+- 50,017 reaction examples
+- Single-step reactions
+- Filtered for quality and canonicalization
+- Contains atom mapping for reaction center identification
+
+**Reaction Types:**
+- Diverse organic reactions
+- Drug-like transformations
+- Well-balanced across common reaction classes
+
+**Data Splits:**
+- Training: ~40k reactions
+- Validation: ~5k reactions
+- Test: ~5k reactions
+
+**Format:**
+- Product → Reactants
+- SMILES representation
+- Atom-mapped reactions for training
+
+## Task Types
+
+TorchDrug decomposes retrosynthesis into a multi-step pipeline:
+
+### 1. CenterIdentification
+
+Identifies the reaction center - which bonds were formed/broken in the forward reaction.
+
+**Input:** Product molecule
+**Output:** Probability for each bond of being part of reaction center
+
+**Purpose:**
+- Locate where chemistry happened
+- Guide subsequent synthon generation
+- Reduce search space dramatically
+
+**Model Architecture:**
+- Graph neural network on product molecule
+- Edge-level classification
+- Attention mechanisms to highlight reactive regions
+
+**Evaluation Metrics:**
+- **Top-K Accuracy**: Correct reaction center in top K predictions
+- **Bond-level F1**: Precision and recall for bond predictions
+
+### 2. SynthonCompletion
+
+Given the product and identified reaction center, predict the reactant structures (synthons).
+
+**Input:**
+- Product molecule
+- Identified reaction center (broken/formed bonds)
+
+**Output:**
+- Predicted reactant molecules (synthons)
+
+**Process:**
+1. Break bonds at reaction center
+2. Modify atom environments (valence, charges)
+3. Determine leaving groups and protecting groups
+4. Generate complete reactant structures
+
+**Challenges:**
+- Multiple valid reactant sets
+- Stereospecificity
+- Atom environment changes (hybridization, charge)
+- Leaving group selection
+
+**Evaluation:**
+- **Exact Match**: Generated reactants exactly match ground truth
+- **Top-K Accuracy**: Correct reactants in top K predictions
+- **Chemical Validity**: Generated molecules are valid
+
+### 3. Retrosynthesis (End-to-End)
+
+Combines center identification and synthon completion into a unified pipeline.
+
+**Input:** Target product molecule
+**Output:** Ranked list of reactant sets (synthesis pathways)
+
+**Workflow:**
+1. Identify top-K reaction centers
+2. For each center, generate reactant candidates
+3. Rank combinations by model confidence
+4. Filter for commercial availability and feasibility
+
+**Advantages:**
+- Single model to train and deploy
+- Joint optimization of subtasks
+- Error propagation from center identification accounted for
+
+## Training Workflows
+
+### Basic Pipeline
+
+```python
+from torchdrug import datasets, models, tasks
+
+# Load dataset
+dataset = datasets.USPTO50k("~/retro-datasets/")
+
+# For center identification
+model_center = models.RGCN(
+    input_dim=dataset.node_feature_dim,
+    num_relation=dataset.num_bond_type,
+    hidden_dims=[256, 256, 256]
+)
+
+task_center = tasks.CenterIdentification(
+    model_center,
+    top_k=3  # Consider top 3 reaction centers
+)
+
+# For synthon completion
+model_synthon = models.GIN(
+    input_dim=dataset.node_feature_dim,
+    hidden_dims=[256, 256, 256]
+)
+
+task_synthon = tasks.SynthonCompletion(
+    model_synthon,
+    center_topk=3,  # Use top 3 from center identification
+    num_synthon_beam=5  # Beam search for synthon generation
+)
+
+# End-to-end
+task_retro = tasks.Retrosynthesis(
+    model=model_center,
+    synthon_model=model_synthon,
+    center_topk=5,
+    num_synthon_beam=10
+)
+```
+
+### Transfer Learning
+
+Pre-train on large reaction datasets (e.g., USPTO-full with 1M+ reactions), then fine-tune on specific reaction classes.
+
+**Benefits:**
+- Better generalization to rare reaction types
+- Improved performance on small datasets
+- Learn general reaction patterns
+
+### Multi-Task Learning
+
+Train jointly on:
+- Forward reaction prediction
+- Retrosynthesis
+- Reaction type classification
+- Yield prediction
+
+**Advantages:**
+- Shared representations of chemistry
+- Better sample efficiency
+- Improved robustness
+
+## Model Architectures
+
+### Graph Neural Networks
+
+**RGCN (Relational Graph Convolutional Network):**
+- Handles multiple bond types (single, double, triple, aromatic)
+- Edge-type-specific transformations
+- Good for reaction center identification
+
+**GIN (Graph Isomorphism Network):**
+- Powerful message passing
+- Captures structural patterns
+- Works well for synthon completion
+
+**GAT (Graph Attention Network):**
+- Attention weights highlight important atoms/bonds
+- Interpretable reaction center predictions
+- Flexible for various reaction types
+
+### Sequence-Based Models
+
+**Transformer Models:**
+- SMILES-to-SMILES translation
+- Can capture long-range dependencies
+- Require large datasets
+
+**LSTM/GRU:**
+- Sequence generation for reactants
+- Autoregressive decoding
+- Good for small molecules
+
+### Hybrid Approaches
+
+Combine graph and sequence representations:
+- Graph encoder for products
+- Sequence decoder for reactants
+- Best of both representations
+
+## Reaction Chemistry Considerations
+
+### Reaction Classes
+
+**Common Transformations:**
+- C-C bond formation (coupling, addition)
+- Functional group interconversions (oxidation, reduction)
+- Heterocycle synthesis (cyclizations)
+- Protection/deprotection
+- Aromatic substitutions
+
+**Rare Reactions:**
+- Novel coupling methods
+- Complex rearrangements
+- Multi-component reactions
+
+### Selectivity Issues
+
+**Regioselectivity:**
+- Which position reacts on molecule
+- Requires understanding of electronics and sterics
+
+**Stereoselectivity:**
+- Control of stereochemistry
+- Diastereoselectivity and enantioselectivity
+- Critical for drug synthesis
+
+**Chemoselectivity:**
+- Which functional group reacts
+- Requires protecting group strategies
+
+### Reaction Conditions
+
+While TorchDrug focuses on reaction connectivity, consider:
+- Temperature and pressure
+- Catalysts and reagents
+- Solvents
+- Reaction time
+- Work-up and purification
+
+## Multi-Step Synthesis Planning
+
+### Single-Step Retrosynthesis
+
+Predict immediate precursors for target molecule.
+
+**Use Case:**
+- Late-stage transformations
+- Simple molecules (1-2 steps from commercial)
+- Initial route scouting
+
+### Multi-Step Planning
+
+Recursively apply retrosynthesis to each predicted reactant until reaching commercial building blocks.
+
+**Tree Search Strategies:**
+
+**Breadth-First Search:**
+- Explore all routes to same depth
+- Find shortest routes
+- Memory intensive
+
+**Depth-First Search:**
+- Follow each route to completion
+- Memory efficient
+- May miss optimal routes
+
+**Monte Carlo Tree Search (MCTS):**
+- Balance exploration and exploitation
+- Guided by model confidence
+- State-of-the-art for multi-step planning
+
+**A\* Search:**
+- Heuristic-guided search
+- Optimizes for cost, complexity, or feasibility
+- Efficient for finding best routes
+
+### Route Scoring
+
+Rank synthetic routes by:
+1. **Number of Steps**: Fewer is better (efficiency)
+2. **Convergent vs Linear**: Convergent routes preferred
+3. **Commercial Availability**: How many steps to buyable compounds
+4. **Reaction Feasibility**: Likelihood each step works
+5. **Overall Yield**: Estimated end-to-end yield
+6. **Cost**: Reagents, labor, equipment
+7. **Green Chemistry**: Environmental impact, safety
+
+### Stopping Criteria
+
+Stop retrosynthesis when reaching:
+- **Commercial Compounds**: Available from vendors (e.g., Sigma-Aldrich, Enamine)
+- **Building Blocks**: Standard synthetic intermediates
+- **Max Depth**: e.g., 6-10 steps
+- **Low Confidence**: Model uncertainty too high
+
+## Validation and Filtering
+
+### Chemical Validity
+
+Check each predicted reaction:
+- Reactants are valid molecules
+- Reaction is chemically reasonable
+- Atom mapping is consistent
+- Stoichiometry balances
+
+### Synthetic Feasibility
+
+**Filters:**
+- Reaction precedent (literature examples)
+- Functional group compatibility
+- Typical reaction conditions
+- Expected yield ranges
+
+**Expert Systems:**
+- Rule-based validation (e.g., ARChem Route Designer)
+- Check for incompatible functional groups
+- Identify protection/deprotection needs
+
+### Commercial Availability
+
+**Databases:**
+- eMolecules: 10M+ commercial compounds
+- ZINC: Annotated with vendor info
+- Reaxys: Commercially available building blocks
+
+**Considerations:**
+- Cost per gram
+- Purity and quality
+- Lead time for delivery
+- Minimum order quantities
+
+## Integration with Other Tools
+
+### Reaction Prediction (Forward)
+
+Train forward reaction prediction models to validate retrosynthetic proposals:
+- Predict products from proposed reactants
+- Validate reaction feasibility
+- Estimate yields
+
+### Retrosynthesis Software
+
+**Integration with:**
+- SciFinder (CAS)
+- Reaxys (Elsevier)
+- ARChem Route Designer
+- IBM RXN for Chemistry
+
+**TorchDrug as Component:**
+- Use TorchDrug models within larger planning systems
+- Combine ML predictions with rule-based systems
+- Hybrid AI + expert system approaches
+
+### Experimental Validation
+
+**High-Throughput Screening:**
+- Rapid testing of predicted reactions
+- Automated synthesis platforms
+- Feedback loop to improve models
+
+**Robotic Synthesis:**
+- Automated execution of planned routes
+- Real-time optimization
+- Data generation for model improvement
+
+## Best Practices
+
+1. **Ensemble Predictions**: Use multiple models for robustness
+2. **Reaction Validation**: Always validate with chemistry rules
+3. **Commercial Check**: Verify building block availability early
+4. **Diversity**: Generate multiple diverse routes, not just top-1
+5. **Expert Review**: Have chemists evaluate proposed routes
+6. **Literature Search**: Check for precedents of key steps
+7. **Iterative Refinement**: Update models with experimental feedback
+8. **Interpretability**: Understand why model suggests each step
+9. **Edge Cases**: Handle unusual functional groups and scaffolds
+10. **Benchmarking**: Compare against known synthesis routes
+
+## Common Applications
+
+### Drug Synthesis Planning
+
+- Small molecule drugs
+- Natural product total synthesis
+- Late-stage functionalization strategies
+
+### Library Enumeration
+
+- Virtual library design
+- Retrosynthetic filtering of generated molecules
+- Prioritize synthesizable compounds
+
+### Process Chemistry
+
+- Route scouting for large-scale synthesis
+- Cost optimization
+- Green chemistry alternatives
+
+### Synthetic Method Development
+
+- Identify gaps in synthetic methodology
+- Guide development of new reactions
+- Expand retrosynthesis model capabilities
+
+## Challenges and Future Directions
+
+### Current Limitations
+
+- Limited to single-step predictions (most models)
+- Doesn't consider reaction conditions explicitly
+- Stereochemistry handling is challenging
+- Rare reaction types underrepresented
+
+### Active Research Areas
+
+- End-to-end multi-step planning
+- Incorporation of reaction conditions
+- Stereoselective retrosynthesis
+- Integration with robotics for closed-loop optimization
+- Semi-template methods (balance templates and templates-free)
+- Uncertainty quantification for predictions
+
+### Emerging Techniques
+
+- Large language models for chemistry (ChemGPT, MolT5)
+- Reinforcement learning for route optimization
+- Graph transformers for long-range interactions
+- Self-supervised pre-training on reaction databases