Initial commit
skills/torchdrug/SKILL.md
---
name: torchdrug
description: "Graph-based drug discovery toolkit. Molecular property prediction (ADMET), protein modeling, knowledge graph reasoning, molecular generation, retrosynthesis, GNNs (GIN, GAT, SchNet), 40+ datasets, for PyTorch-based ML on molecules, proteins, and biomedical graphs."
---

# TorchDrug

## Overview

TorchDrug is a PyTorch-based machine learning toolbox for drug discovery and molecular science. Use it to apply graph neural networks, pre-trained models, and task definitions to molecules, proteins, and biological knowledge graphs. Supported tasks include molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, and retrosynthesis planning, backed by 40+ curated datasets and 20+ model architectures.

## When to Use This Skill

This skill should be used when working with:

**Data Types:**
- SMILES strings or molecular structures
- Protein sequences or 3D structures (PDB files)
- Chemical reactions and retrosynthesis
- Biomedical knowledge graphs
- Drug discovery datasets

**Tasks:**
- Predicting molecular properties (solubility, toxicity, activity)
- Protein function or structure prediction
- Drug-target binding prediction
- Generating new molecular structures
- Planning chemical synthesis routes
- Link prediction in biomedical knowledge bases
- Training graph neural networks on scientific data

**Libraries and Integration:**
- TorchDrug is the primary library
- Often used with RDKit for cheminformatics
- Compatible with PyTorch and PyTorch Lightning
- Integrates with AlphaFold and ESM for proteins

## Getting Started

### Installation

```bash
uv pip install torchdrug
# Or with optional dependencies (quoted so the shell does not expand the brackets)
uv pip install "torchdrug[full]"
```

### Quick Example

```python
import torch
from torchdrug import data, datasets, models, tasks

# Load molecular dataset (downloaded to this path on first use)
dataset = datasets.BBBP("~/molecule-datasets/")

# Random 80/10/10 split; the three lengths must sum to len(dataset)
lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
lengths += [len(dataset) - sum(lengths)]
train_set, valid_set, test_set = torch.utils.data.random_split(dataset, lengths)

# Define GNN model
model = models.GIN(
    input_dim=dataset.node_feature_dim,
    hidden_dims=[256, 256, 256],
    edge_input_dim=dataset.edge_feature_dim,
    batch_norm=True,
    readout="mean"
)

# Create property prediction task
task = tasks.PropertyPrediction(
    model,
    task=dataset.tasks,
    criterion="bce",
    metric=["auroc", "auprc"]
)
# Build the prediction head and label statistics before training
task.preprocess(train_set, valid_set, test_set)

# Train with PyTorch; TorchDrug's DataLoader collates molecules into batched graphs
optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
train_loader = data.DataLoader(train_set, batch_size=32, shuffle=True)

for epoch in range(100):
    for batch in train_loader:
        loss, metric = task(batch)  # the task returns (loss, metric dict)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

## Core Capabilities

### 1. Molecular Property Prediction

Predict chemical, physical, and biological properties of molecules from structure.

**Use Cases:**
- Drug-likeness and ADMET properties
- Toxicity screening
- Quantum chemistry properties
- Binding affinity prediction

**Key Components:**
- 20+ molecular datasets (BBBP, HIV, Tox21, QM9, etc.)
- GNN models (GIN, GAT, SchNet)
- PropertyPrediction and MultipleBinaryClassification tasks

**Reference:** See `references/molecular_property_prediction.md` for:
- Complete dataset catalog
- Model selection guide
- Training workflows and best practices
- Feature engineering details

### 2. Protein Modeling

Work with protein sequences, structures, and properties.

**Use Cases:**
- Enzyme function prediction
- Protein stability and solubility
- Subcellular localization
- Protein-protein interactions
- Structure prediction

**Key Components:**
- 15+ protein datasets (EnzymeCommission, GeneOntology, PDBBind, etc.)
- Sequence models (ESM, ProteinBERT, ProteinLSTM)
- Structure models (GearNet, SchNet)
- Multiple task types for different prediction levels

**Reference:** See `references/protein_modeling.md` for:
- Protein-specific datasets
- Sequence vs structure models
- Pre-training strategies
- Integration with AlphaFold and ESM

### 3. Knowledge Graph Reasoning

Predict missing links and relationships in biological knowledge graphs.

**Use Cases:**
- Drug repurposing
- Disease mechanism discovery
- Gene-disease associations
- Multi-hop biomedical reasoning

**Key Components:**
- General KGs (FB15k, WN18) and biomedical (Hetionet)
- Embedding models (TransE, RotatE, ComplEx)
- KnowledgeGraphCompletion task

**Reference:** See `references/knowledge_graphs.md` for:
- Knowledge graph datasets (including Hetionet with 45k biomedical entities)
- Embedding model comparison
- Evaluation metrics and protocols
- Biomedical applications
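Link prediction is usually reported with ranking metrics. The following is a minimal, library-independent sketch of mean reciprocal rank (MRR) and Hits@K under the common "filtered" protocol, where other known-true tails are ignored when ranking. The function names and the toy scores are illustrative assumptions, not TorchDrug API.

```python
def filtered_rank(scores, true_tail, known_tails):
    """Rank of the true tail, skipping other tails already known to be correct."""
    true_score = scores[true_tail]
    better = sum(
        1
        for tail, score in scores.items()
        if score > true_score and tail not in known_tails
    )
    return better + 1


def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test triples."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test triples whose true tail is ranked in the top k."""
    return sum(1 for rank in ranks if rank <= k) / len(ranks)


# Toy scores for one query (head, relation, ?); values are made up.
scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
rank_c = filtered_rank(scores, true_tail="c", known_tails={"a", "c"})

ranks = [1, rank_c, 4]  # ranks collected over three hypothetical test triples
mrr = mean_reciprocal_rank(ranks)
hits3 = hits_at_k(ranks, k=3)
```

Here `"a"` outranks `"c"` but is itself a known true answer, so the filtered rank of `"c"` is 2 rather than 3.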

### 4. Molecular Generation

Generate novel molecular structures with desired properties.

**Use Cases:**
- De novo drug design
- Lead optimization
- Chemical space exploration
- Property-guided generation

**Key Components:**
- Autoregressive generation
- GCPN (policy-based generation)
- GraphAutoregressiveFlow
- Property optimization workflows

**Reference:** See `references/molecular_generation.md` for:
- Generation strategies (unconditional, conditional, scaffold-based)
- Multi-objective optimization
- Validation and filtering
- Integration with property prediction

### 5. Retrosynthesis

Predict synthetic routes from target molecules to starting materials.

**Use Cases:**
- Synthesis planning
- Route optimization
- Synthetic accessibility assessment
- Multi-step planning

**Key Components:**
- USPTO-50k reaction dataset
- CenterIdentification (reaction center prediction)
- SynthonCompletion (reactant prediction)
- End-to-end Retrosynthesis pipeline

**Reference:** See `references/retrosynthesis.md` for:
- Task decomposition (center ID → synthon completion)
- Multi-step synthesis planning
- Commercial availability checking
- Integration with other retrosynthesis tools

### 6. Graph Neural Network Models

Comprehensive catalog of GNN architectures for different data types and tasks.

**Available Models:**
- General GNNs: GCN, GAT, GIN, RGCN, MPNN
- 3D-aware: SchNet, GearNet
- Protein-specific: ESM, ProteinBERT, GearNet
- Knowledge graph: TransE, RotatE, ComplEx, SimplE
- Generative: GraphAutoregressiveFlow

**Reference:** See `references/models_architectures.md` for:
- Detailed model descriptions
- Model selection guide by task and dataset
- Architecture comparisons
- Implementation tips

### 7. Datasets

40+ curated datasets spanning chemistry, biology, and knowledge graphs.

**Categories:**
- Molecular properties (drug discovery, quantum chemistry)
- Protein properties (function, structure, interactions)
- Knowledge graphs (general and biomedical)
- Retrosynthesis reactions

**Reference:** See `references/datasets.md` for:
- Complete dataset catalog with sizes and tasks
- Dataset selection guide
- Loading and preprocessing
- Splitting strategies (random, scaffold)
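To make the difference between random and scaffold splitting concrete, here is a minimal pure-Python sketch of a deterministic scaffold split. It assumes each molecule already has a scaffold key (for example a Bemis-Murcko scaffold SMILES computed with RDKit); the function name and the placeholder scaffold labels are illustrative, not TorchDrug API.

```python
from collections import defaultdict


def scaffold_split(scaffolds, fractions=(0.8, 0.1, 0.1)):
    """Assign whole scaffold groups to train/valid/test, largest groups first."""
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda group: (-len(group), group[0]))

    total = len(scaffolds)
    train_cap = round(fractions[0] * total)
    valid_cap = train_cap + round(fractions[1] * total)

    train, valid, test = [], [], []
    for group in ordered:
        if len(train) + len(group) <= train_cap:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cap:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


# Ten molecules described only by placeholder scaffold labels.
scaffolds = ["benzene"] * 5 + ["pyridine"] * 2 + ["indole"] * 2 + ["furan"]
train, valid, test = scaffold_split(scaffolds, fractions=(0.6, 0.2, 0.2))
```

Because every scaffold lands in exactly one split, the test set contains only chemotypes the model has not seen during training, which is why scaffold splits tend to give lower but more realistic scores than random splits.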

## Common Workflows

### Workflow 1: Molecular Property Prediction

**Scenario:** Predict blood-brain barrier penetration for drug candidates.

**Steps:**
1. Load dataset: `datasets.BBBP()`
2. Choose model: GIN for molecular graphs
3. Define task: `PropertyPrediction` with binary classification
4. Train with scaffold split for realistic evaluation
5. Evaluate using AUROC and AUPRC

**Navigation:** `references/molecular_property_prediction.md` → Dataset selection → Model selection → Training
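As a reminder of what the AUROC in step 5 measures, the sketch below computes it directly from its definition: the probability that a randomly chosen positive is scored above a randomly chosen negative. It is a quadratic-time illustration with made-up labels and scores, not the implementation TorchDrug uses.

```python
def auroc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count as 0.5."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical BBB-penetration labels and predicted probabilities.
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.8]
value = auroc(labels, scores)
```

Five of the six positive/negative pairs are ordered correctly here, so `value` is 5/6.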

### Workflow 2: Protein Function Prediction

**Scenario:** Predict enzyme function from sequence.

**Steps:**
1. Load dataset: `datasets.EnzymeCommission()`
2. Choose model: ESM (pre-trained) or GearNet (with structure)
3. Define task: `MultipleBinaryClassification`, since a protein can carry several EC labels (multi-label classification)
4. Fine-tune pre-trained model or train from scratch
5. Evaluate using multi-label metrics such as F1-max and AUPRC

**Navigation:** `references/protein_modeling.md` → Model selection (sequence vs structure) → Pre-training strategies

### Workflow 3: Drug Repurposing via Knowledge Graphs

**Scenario:** Find new disease treatments in Hetionet.

**Steps:**
1. Load dataset: `datasets.Hetionet()`
2. Choose model: RotatE or ComplEx
3. Define task: `KnowledgeGraphCompletion`
4. Train with negative sampling
5. Query for "Compound-treats-Disease" predictions
6. Filter by plausibility and mechanism

**Navigation:** `references/knowledge_graphs.md` → Hetionet dataset → Model selection → Biomedical applications
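The query in step 5 amounts to scoring every candidate disease against a fixed compound and relation, then sorting. The sketch below illustrates this with a RotatE-style score (the relation rotates the head embedding in the complex plane) using only the standard library. The embeddings, entity names, and function name are hypothetical, and the margin term used during training is omitted.

```python
import cmath


def rotate_score(head, relation_phases, tail):
    """RotatE-style score: negative distance between the rotated head and the tail."""
    distance = 0.0
    for h, phase, t in zip(head, relation_phases, tail):
        rotated = h * cmath.exp(1j * phase)  # unit-modulus rotation by `phase`
        distance += abs(rotated - t)
    return -distance


# Hypothetical 2-dimensional complex embeddings.
compound = [1 + 0j, 0 + 1j]
treats = [cmath.pi / 2, cmath.pi]  # relation stored as rotation angles
diseases = {
    "disease_a": [0 + 1j, 0 - 1j],  # exactly the compound rotated by "treats"
    "disease_b": [1 + 0j, 1 + 0j],
    "disease_c": [-1 + 0j, 0 + 1j],
}

ranked = sorted(
    diseases,
    key=lambda name: rotate_score(compound, treats, diseases[name]),
    reverse=True,
)
```

`disease_a` ranks first because rotating the compound embedding by the relation lands exactly on it; in a real run the scores would come from the trained model rather than hand-written vectors.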

### Workflow 4: De Novo Molecule Generation

**Scenario:** Generate drug-like molecules optimized for target binding.

**Steps:**
1. Train property predictor on activity data
2. Choose generation approach: GCPN for RL-based optimization
3. Define reward function combining affinity, drug-likeness, synthesizability
4. Generate candidates with property constraints
5. Validate chemistry and filter by drug-likeness
6. Rank by multi-objective scoring

**Navigation:** `references/molecular_generation.md` → Conditional generation → Multi-objective optimization
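One simple way to realize the reward in step 3 and the ranking in step 6 is a weighted sum of normalized property scores. The sketch below assumes every property has already been scaled to [0, 1] with higher meaning better; the weights, property names, and numeric values are invented for illustration and would normally come from trained predictors and RDKit descriptors.

```python
def weighted_score(properties, weights):
    """Weighted sum of property scores, each assumed to lie in [0, 1]."""
    return sum(weight * properties[name] for name, weight in weights.items())


def rank_candidates(candidates, weights):
    """Sort candidate molecules from best to worst combined score."""
    return sorted(
        candidates,
        key=lambda candidate: weighted_score(candidate["properties"], weights),
        reverse=True,
    )


# Hypothetical weights and property values (not real predictions).
weights = {"affinity": 0.5, "drug_likeness": 0.3, "synthesizability": 0.2}
candidates = [
    {"smiles": "CCO",
     "properties": {"affinity": 0.2, "drug_likeness": 0.4, "synthesizability": 0.9}},
    {"smiles": "c1ccccc1O",
     "properties": {"affinity": 0.6, "drug_likeness": 0.5, "synthesizability": 0.8}},
    {"smiles": "CC(=O)Nc1ccc(O)cc1",
     "properties": {"affinity": 0.7, "drug_likeness": 0.6, "synthesizability": 0.7}},
]

ranked = rank_candidates(candidates, weights)
best_smiles = [candidate["smiles"] for candidate in ranked]
```

A weighted sum is easy to tune but lets a strong score in one objective mask a weak one in another; Pareto ranking or hard thresholds are common alternatives when that matters.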

### Workflow 5: Retrosynthesis Planning

**Scenario:** Plan synthesis route for target molecule.

**Steps:**
1. Load dataset: `datasets.USPTO50k()`
2. Train center identification model (RGCN)
3. Train synthon completion model (GIN)
4. Combine into end-to-end retrosynthesis pipeline
5. Apply recursively for multi-step planning
6. Check commercial availability of building blocks

**Navigation:** `references/retrosynthesis.md` → Task types → Multi-step planning
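Steps 5 and 6 can be sketched as a depth-limited recursive search: apply a one-step retrosynthesis predictor to the target, then recurse on every reactant until all leaves are purchasable. In the sketch below, `toy_one_step` is a hypothetical lookup table standing in for a trained model, and the single-letter molecule names are placeholders rather than real chemistry.

```python
def plan_route(target, one_step, purchasable, max_depth=3):
    """Return a list of (product, reactants) steps, or None if no route is found."""
    if target in purchasable:
        return []  # nothing to make: the building block can be bought
    if max_depth == 0:
        return None

    for reactants in one_step(target):
        steps = [(target, reactants)]
        for reactant in reactants:
            sub_route = plan_route(reactant, one_step, purchasable, max_depth - 1)
            if sub_route is None:
                break  # this disconnection fails; try the next proposal
            steps.extend(sub_route)
        else:
            return steps  # every reactant was resolved
    return None


# Hypothetical one-step predictions, ordered from most to least preferred.
toy_disconnections = {
    "D": [("X", "Y"), ("B", "C")],
    "C": [("A", "B")],
}


def toy_one_step(molecule):
    return toy_disconnections.get(molecule, [])


purchasable = {"A", "B"}
route = plan_route("D", toy_one_step, purchasable)
```

The first proposal for `D` is rejected because `X` can be neither bought nor made, so the search falls back to `B + C` and then resolves `C` from `A + B`. This depth-first strategy returns the first feasible route, not necessarily the shortest or cheapest one.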

## Integration Patterns

### With RDKit

Convert between TorchDrug molecules and RDKit:
```python
from torchdrug import data
from rdkit import Chem

# SMILES → TorchDrug molecule
smiles = "CCO"
mol = data.Molecule.from_smiles(smiles)

# TorchDrug → RDKit
rdkit_mol = mol.to_molecule()

# RDKit → TorchDrug
rdkit_mol = Chem.MolFromSmiles(smiles)
mol = data.Molecule.from_molecule(rdkit_mol)
```

### With AlphaFold/ESM

Use predicted structures:
```python
from torchdrug import data, layers
from torchdrug.layers import geometry

# Load AlphaFold predicted structure
protein = data.Protein.from_pdb("AF-P12345-F1-model_v4.pdb")

# Build a residue-level graph (alpha-carbon nodes) with sequential and
# spatial edges; graph construction is done by a layer applied to a batch
graph_construction = layers.GraphConstruction(
    node_layers=[geometry.AlphaCarbonNode()],
    edge_layers=[
        geometry.SequentialEdge(),
        geometry.SpatialEdge(radius=10.0),
    ],
)
graph = graph_construction(data.Protein.pack([protein]))
```

### With PyTorch Lightning

Wrap tasks for Lightning training:
```python
import torch
import pytorch_lightning as pl

class LightningTask(pl.LightningModule):
    def __init__(self, torchdrug_task):
        super().__init__()
        self.task = torchdrug_task

    def training_step(self, batch, batch_idx):
        # TorchDrug tasks return (loss, metric dict); Lightning expects the loss
        loss, metric = self.task(batch)
        return loss

    def validation_step(self, batch, batch_idx):
        pred = self.task.predict(batch)
        target = self.task.target(batch)
        return {"pred": pred, "target": target}

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=1e-3)
```

## Technical Details

For deep dives into TorchDrug's architecture:

**Core Concepts:** See `references/core_concepts.md` for:
- Architecture philosophy (modular, configurable)
- Data structures (Graph, Molecule, Protein, PackedGraph)
- Model interface and forward function signature
- Task interface (predict, target, forward, evaluate)
- Training workflows and best practices
- Loss functions and metrics
- Common pitfalls and debugging

## Quick Reference Cheat Sheet

**Choose Dataset:**
- Molecular property → `references/datasets.md` → Molecular section
- Protein task → `references/datasets.md` → Protein section
- Knowledge graph → `references/datasets.md` → Knowledge graph section

**Choose Model:**
- Molecules → `references/models_architectures.md` → GNN section → GIN/GAT/SchNet
- Proteins (sequence) → `references/models_architectures.md` → Protein section → ESM
- Proteins (structure) → `references/models_architectures.md` → Protein section → GearNet
- Knowledge graph → `references/models_architectures.md` → KG section → RotatE/ComplEx

**Common Tasks:**
- Property prediction → `references/molecular_property_prediction.md` or `references/protein_modeling.md`
- Generation → `references/molecular_generation.md`
- Retrosynthesis → `references/retrosynthesis.md`
- KG reasoning → `references/knowledge_graphs.md`

**Understand Architecture:**
- Data structures → `references/core_concepts.md` → Data Structures
- Model design → `references/core_concepts.md` → Model Interface
- Task design → `references/core_concepts.md` → Task Interface

## Troubleshooting Common Issues

**Issue: Dimension mismatch errors**
→ Check `model.input_dim` matches `dataset.node_feature_dim`
→ See `references/core_concepts.md` → Essential Attributes

**Issue: Poor performance on molecular tasks**
→ Use scaffold splitting, not random
→ Try GIN instead of GCN
→ See `references/molecular_property_prediction.md` → Best Practices

**Issue: Protein model not learning**
→ Use pre-trained ESM for sequence tasks
→ Check edge construction for structure models
→ See `references/protein_modeling.md` → Training Workflows

**Issue: Memory errors with large graphs**
→ Reduce batch size
→ Use gradient accumulation
→ See `references/core_concepts.md` → Memory Efficiency
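Gradient accumulation trades time for memory: several small batches are run back to back and their gradients summed before a single optimizer step. The sketch below uses plain PyTorch, with an `nn.Linear` standing in for a TorchDrug task as an assumption, and checks that four micro-batches of two samples produce the same gradient as one batch of eight.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(4, 1)  # stand-in for a TorchDrug task
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()
inputs, targets = torch.randn(8, 4), torch.randn(8, 1)

# Reference: gradient from one full batch of 8 samples.
optimizer.zero_grad()
loss_fn(model(inputs), targets).backward()
full_batch_grad = model.weight.grad.clone()

# Same gradient from 4 micro-batches of 2 samples each.
accumulation_steps = 4
optimizer.zero_grad()
micro_batches = zip(inputs.chunk(accumulation_steps), targets.chunk(accumulation_steps))
for x, y in micro_batches:
    # Divide by the number of steps so the summed gradients match the full-batch mean.
    loss = loss_fn(model(x), y) / accumulation_steps
    loss.backward()  # gradients add up across calls until zero_grad()
accumulated_grad = model.weight.grad.clone()

optimizer.step()  # one parameter update for the whole effective batch
```

The equivalence is exact only when the micro-batches are the same size and the model has no batch-dependent layers; with batch normalization the statistics are computed per micro-batch, so results can differ slightly.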

**Issue: Generated molecules are invalid**
→ Add validity constraints
→ Post-process with RDKit validation
→ See `references/molecular_generation.md` → Validation and Filtering

## Resources

**Official Documentation:** https://torchdrug.ai/docs/
**GitHub:** https://github.com/DeepGraphLearning/torchdrug
**Paper:** TorchDrug: A Powerful and Flexible Machine Learning Platform for Drug Discovery

## Summary

Navigate to the appropriate reference file based on your task:

1. **Molecular property prediction** → `molecular_property_prediction.md`
2. **Protein modeling** → `protein_modeling.md`
3. **Knowledge graphs** → `knowledge_graphs.md`
4. **Molecular generation** → `molecular_generation.md`
5. **Retrosynthesis** → `retrosynthesis.md`
6. **Model selection** → `models_architectures.md`
7. **Dataset selection** → `datasets.md`
8. **Technical details** → `core_concepts.md`

Each reference provides comprehensive coverage of its domain with examples, best practices, and common use cases.