Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/torchdrug/SKILL.md
+++ b/skills/torchdrug/SKILL.md
@@ -0,0 +1,444 @@
+---
+name: torchdrug
+description: "Graph-based drug discovery toolkit. Molecular property prediction (ADMET), protein modeling, knowledge graph reasoning, molecular generation, retrosynthesis, GNNs (GIN, GAT, SchNet), 40+ datasets, for PyTorch-based ML on molecules, proteins, and biomedical graphs."
+---
+
+# TorchDrug
+
+## Overview
+
+TorchDrug is a PyTorch-based machine learning toolbox for drug discovery and molecular science. It applies graph neural networks, pre-trained models, and task definitions to molecules, proteins, and biological knowledge graphs. Supported tasks include molecular property prediction, protein modeling, knowledge graph reasoning, molecular generation, and retrosynthesis planning, backed by 40+ curated datasets and 20+ model architectures.
+
+## When to Use This Skill
+
+This skill should be used when working with:
+
+**Data Types:**
+- SMILES strings or molecular structures
+- Protein sequences or 3D structures (PDB files)
+- Chemical reactions and retrosynthesis
+- Biomedical knowledge graphs
+- Drug discovery datasets
+
+**Tasks:**
+- Predicting molecular properties (solubility, toxicity, activity)
+- Protein function or structure prediction
+- Drug-target binding prediction
+- Generating new molecular structures
+- Planning chemical synthesis routes
+- Link prediction in biomedical knowledge bases
+- Training graph neural networks on scientific data
+
+**Libraries and Integration:**
+- TorchDrug is the primary library
+- Often used with RDKit for cheminformatics
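+- Tasks and models are standard PyTorch modules, so they can be trained in plain PyTorch loops or wrapped for PyTorch Lightning
+- Loads AlphaFold-predicted structures (PDB files) and provides pre-trained ESM models for proteins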
+
+## Getting Started
+
+### Installation
+
+```bash
+uv pip install torchdrug
+# Or with optional dependencies (quoted so shells such as zsh do not expand the brackets)
+uv pip install "torchdrug[full]"
+```
+
+### Quick Example
+
+```python
+import torch
+from torchdrug import data, datasets, models, tasks
+
+# Load molecular dataset (downloaded to this directory on first use)
+dataset = datasets.BBBP("~/molecule-datasets/")
+
+# Random 80/10/10 split
+lengths = [int(0.8 * len(dataset)), int(0.1 * len(dataset))]
+lengths += [len(dataset) - sum(lengths)]
+train_set, valid_set, test_set = torch.utils.data.random_split(dataset, lengths)
+
+# Define GNN model
+model = models.GIN(
+    input_dim=dataset.node_feature_dim,
+    hidden_dims=[256, 256, 256],
+    edge_input_dim=dataset.edge_feature_dim,
+    batch_norm=True,
+    readout="mean"
+)
+
+# Create property prediction task
+task = tasks.PropertyPrediction(
+    model,
+    task=dataset.tasks,
+    criterion="bce",
+    metric=["auroc", "auprc"]
+)
+# Build the prediction head and label statistics from the training set
+task.preprocess(train_set, valid_set, test_set)
+
+# Train with a plain PyTorch loop; data.DataLoader collates graphs into batches
+optimizer = torch.optim.Adam(task.parameters(), lr=1e-3)
+train_loader = data.DataLoader(train_set, batch_size=32, shuffle=True)
+
+for epoch in range(100):
+    for batch in train_loader:
+        loss, metric = task(batch)  # forward() returns (loss, metrics dict)
+        optimizer.zero_grad()
+        loss.backward()
+        optimizer.step()
+```
+
+## Core Capabilities
+
+### 1. Molecular Property Prediction
+
+Predict chemical, physical, and biological properties of molecules from structure.
+
+**Use Cases:**
+- Drug-likeness and ADMET properties
+- Toxicity screening
+- Quantum chemistry properties
+- Binding affinity prediction
+
+**Key Components:**
+- 20+ molecular datasets (BBBP, HIV, Tox21, QM9, etc.)
+- GNN models (GIN, GAT, SchNet)
+- PropertyPrediction and MultipleBinaryClassification tasks
+
+**Reference:** See `references/molecular_property_prediction.md` for:
+- Complete dataset catalog
+- Model selection guide
+- Training workflows and best practices
+- Feature engineering details
+
+### 2. Protein Modeling
+
+Work with protein sequences, structures, and properties.
+
+**Use Cases:**
+- Enzyme function prediction
+- Protein stability and solubility
+- Subcellular localization
+- Protein-protein interactions
+- Structure prediction
+
+**Key Components:**
+- 15+ protein datasets (EnzymeCommission, GeneOntology, PDBBind, etc.)
+- Sequence models (ESM, ProteinBERT, ProteinLSTM)
+- Structure models (GearNet, SchNet)
+- Multiple task types for different prediction levels
+
+**Reference:** See `references/protein_modeling.md` for:
+- Protein-specific datasets
+- Sequence vs structure models
+- Pre-training strategies
+- Integration with AlphaFold and ESM
+
+### 3. Knowledge Graph Reasoning
+
+Predict missing links and relationships in biological knowledge graphs.
+
+**Use Cases:**
+- Drug repurposing
+- Disease mechanism discovery
+- Gene-disease associations
+- Multi-hop biomedical reasoning
+
+**Key Components:**
+- General KGs (FB15k, WN18) and biomedical (Hetionet)
+- Embedding models (TransE, RotatE, ComplEx)
+- KnowledgeGraphCompletion task
+
+**Reference:** See `references/knowledge_graphs.md` for:
+- Knowledge graph datasets (including Hetionet with 45k biomedical entities)
+- Embedding model comparison
+- Evaluation metrics and protocols
+- Biomedical applications
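+
+Link prediction is usually scored with ranking metrics. As a rough sketch of two common ones, the snippet below computes mean reciprocal rank and Hits@K from a list of ranks. The helper names and the example ranks are illustrative assumptions, not part of TorchDrug's API.
+
+```python
+def mean_reciprocal_rank(ranks):
+    """Average of 1/rank, where rank 1 means the true entity was scored highest."""
+    return sum(1.0 / rank for rank in ranks) / len(ranks)
+
+
+def hits_at_k(ranks, k):
+    """Fraction of test triplets whose true entity is ranked within the top k."""
+    return sum(1 for rank in ranks if rank <= k) / len(ranks)
+
+
+# Made-up ranks of the true tail entity for four test triplets
+ranks = [1, 2, 4, 10]
+mrr = mean_reciprocal_rank(ranks)
+hits3 = hits_at_k(ranks, 3)
+print(f"MRR={mrr:.4f}  Hits@3={hits3:.2f}")
+```
+
+Both metrics lie in (0, 1], and higher is better. MRR rewards placing the true entity near the top, while Hits@K only checks whether it falls inside the cutoff.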
+
+### 4. Molecular Generation
+
+Generate novel molecular structures with desired properties.
+
+**Use Cases:**
+- De novo drug design
+- Lead optimization
+- Chemical space exploration
+- Property-guided generation
+
+**Key Components:**
+- Autoregressive generation
+- GCPN (policy-based generation)
+- GraphAutoregressiveFlow
+- Property optimization workflows
+
+**Reference:** See `references/molecular_generation.md` for:
+- Generation strategies (unconditional, conditional, scaffold-based)
+- Multi-objective optimization
+- Validation and filtering
+- Integration with property prediction
+
+### 5. Retrosynthesis
+
+Predict synthetic routes from target molecules to starting materials.
+
+**Use Cases:**
+- Synthesis planning
+- Route optimization
+- Synthetic accessibility assessment
+- Multi-step planning
+
+**Key Components:**
+- USPTO-50k reaction dataset
+- CenterIdentification (reaction center prediction)
+- SynthonCompletion (reactant prediction)
+- End-to-end Retrosynthesis pipeline
+
+**Reference:** See `references/retrosynthesis.md` for:
+- Task decomposition (center ID → synthon completion)
+- Multi-step synthesis planning
+- Commercial availability checking
+- Integration with other retrosynthesis tools
+
+### 6. Graph Neural Network Models
+
+Comprehensive catalog of GNN architectures for different data types and tasks.
+
+**Available Models:**
+- General GNNs: GCN, GAT, GIN, RGCN, MPNN
+- 3D-aware: SchNet, GearNet
+- Protein-specific: ESM, ProteinBERT, GearNet
+- Knowledge graph: TransE, RotatE, ComplEx, SimplE
+- Generative: GraphAutoregressiveFlow
+
+**Reference:** See `references/models_architectures.md` for:
+- Detailed model descriptions
+- Model selection guide by task and dataset
+- Architecture comparisons
+- Implementation tips
+
+### 7. Datasets
+
+40+ curated datasets spanning chemistry, biology, and knowledge graphs.
+
+**Categories:**
+- Molecular properties (drug discovery, quantum chemistry)
+- Protein properties (function, structure, interactions)
+- Knowledge graphs (general and biomedical)
+- Retrosynthesis reactions
+
+**Reference:** See `references/datasets.md` for:
+- Complete dataset catalog with sizes and tasks
+- Dataset selection guide
+- Loading and preprocessing
+- Splitting strategies (random, scaffold)
+
+## Common Workflows
+
+### Workflow 1: Molecular Property Prediction
+
+**Scenario:** Predict blood-brain barrier penetration for drug candidates.
+
+**Steps:**
+1. Load dataset: `datasets.BBBP(path)`, where `path` is the local download directory
+2. Choose model: GIN for molecular graphs
+3. Define task: `PropertyPrediction` with binary classification
+4. Train with scaffold split for realistic evaluation
+5. Evaluate using AUROC and AUPRC
+
+**Navigation:** `references/molecular_property_prediction.md` → Dataset selection → Model selection → Training
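+
+The idea behind the scaffold split in step 4 can be sketched with RDKit alone: group molecules by Bemis-Murcko scaffold, then assign whole groups to train, validation, or test so that no scaffold is shared across splits. The helper `scaffold_split` and its fractions are hypothetical names for illustration, not TorchDrug's own implementation.
+
+```python
+from collections import defaultdict
+
+from rdkit.Chem.Scaffolds import MurckoScaffold
+
+
+def scaffold_split(smiles_list, frac_train=0.8, frac_valid=0.1):
+    """Hypothetical helper: split indices so each scaffold lands in one split."""
+    groups = defaultdict(list)
+    for idx, smiles in enumerate(smiles_list):
+        # Acyclic molecules all share the empty scaffold ""
+        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles)
+        groups[scaffold].append(idx)
+
+    train, valid, test = [], [], []
+    total = len(smiles_list)
+    # Place the largest scaffold groups first; overflow goes to valid, then test
+    for members in sorted(groups.values(), key=len, reverse=True):
+        if len(train) + len(members) <= frac_train * total:
+            train.extend(members)
+        elif len(valid) + len(members) <= frac_valid * total:
+            valid.extend(members)
+        else:
+            test.extend(members)
+    return train, valid, test
+
+
+smiles = ["Cc1ccccc1", "CCc1ccccc1", "Oc1ccccc1", "C1CCCCC1", "CCO"]
+train_idx, valid_idx, test_idx = scaffold_split(smiles)
+```
+
+The returned index lists could then be wrapped with `torch.utils.data.Subset` to build the three datasets.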
+
+### Workflow 2: Protein Function Prediction
+
+**Scenario:** Predict enzyme function from sequence.
+
+**Steps:**
+1. Load dataset: `datasets.EnzymeCommission(path)`
+2. Choose model: ESM (pre-trained) or GearNet (with structure)
+3. Define task: `MultipleBinaryClassification`, since a protein can carry several EC labels
+4. Fine-tune pre-trained model or train from scratch
+5. Evaluate using multi-label metrics such as F-max and AUPRC
+
+**Navigation:** `references/protein_modeling.md` → Model selection (sequence vs structure) → Pre-training strategies
+
+### Workflow 3: Drug Repurposing via Knowledge Graphs
+
+**Scenario:** Find new disease treatments in Hetionet.
+
+**Steps:**
+1. Load dataset: `datasets.Hetionet(path)`
+2. Choose model: RotatE or ComplEx
+3. Define task: `KnowledgeGraphCompletion`
+4. Train with negative sampling
+5. Query for "Compound-treats-Disease" predictions
+6. Filter by plausibility and mechanism
+
+**Navigation:** `references/knowledge_graphs.md` → Hetionet dataset → Model selection → Biomedical applications
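+
+To make step 5 concrete, here is a minimal pure-Python sketch of how a translational model such as TransE ranks candidate tails for a query like (compound, treats, ?). The embeddings and entity names are invented toy values. A trained model would supply learned embeddings, and the L2 distance used here is one common choice (L1 is also widely used).
+
+```python
+import math
+
+
+def transe_score(head, relation, tail):
+    """TransE plausibility: negative L2 distance between head + relation and tail."""
+    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))
+
+
+def rank_tails(head, relation, candidates):
+    """Return candidate names sorted from most to least plausible tail."""
+    return sorted(
+        candidates,
+        key=lambda name: transe_score(head, relation, candidates[name]),
+        reverse=True,
+    )
+
+
+# Toy 2-d embeddings, invented for illustration only
+compound = [0.1, 0.2]
+treats = [0.5, 0.0]
+diseases = {
+    "disease_a": [0.6, 0.2],
+    "disease_b": [-0.4, 0.9],
+    "disease_c": [0.7, 0.3],
+}
+ranking = rank_tails(compound, treats, diseases)
+print(ranking)
+```
+
+The top of the ranking is only a list of hypotheses. Step 6 still applies, because a high score says nothing about mechanism or clinical plausibility.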
+
+### Workflow 4: De Novo Molecule Generation
+
+**Scenario:** Generate drug-like molecules optimized for target binding.
+
+**Steps:**
+1. Train property predictor on activity data
+2. Choose generation approach: GCPN for RL-based optimization
+3. Define reward function combining affinity, drug-likeness, synthesizability
+4. Generate candidates with property constraints
+5. Validate chemistry and filter by drug-likeness
+6. Rank by multi-objective scoring
+
+**Navigation:** `references/molecular_generation.md` → Conditional generation → Multi-objective optimization
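+
+One simple way to realize the reward in step 3 is a weighted sum of normalized objectives. The sketch below assumes a predicted affinity already scaled to [0, 1], a QED drug-likeness score in [0, 1], and a synthetic accessibility (SA) score from 1 (easy) to 10 (hard). The function name, weights, and candidate values are hypothetical.
+
+```python
+def combined_reward(affinity, qed, sa_score, weights=(0.5, 0.3, 0.2)):
+    """Hypothetical scalar reward in [0, 1]; higher is better.
+
+    affinity: predicted activity, assumed pre-scaled to [0, 1]
+    qed: drug-likeness in [0, 1]
+    sa_score: synthetic accessibility, 1 (easy) to 10 (hard)
+    """
+    synthesizability = (10.0 - sa_score) / 9.0  # map 1..10 onto 1..0
+    w_affinity, w_qed, w_synth = weights
+    return w_affinity * affinity + w_qed * qed + w_synth * synthesizability
+
+
+# Made-up candidates: (affinity, qed, sa_score)
+candidates = {
+    "mol_a": (0.9, 0.4, 6.0),
+    "mol_b": (0.7, 0.8, 2.5),
+}
+ranked = sorted(candidates, key=lambda name: combined_reward(*candidates[name]), reverse=True)
+print(ranked)
+```
+
+A weighted sum lets a strong objective compensate for a weak one. If every objective must clear a minimum, a product of the terms or hard thresholds before scoring may be a better fit.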
+
+### Workflow 5: Retrosynthesis Planning
+
+**Scenario:** Plan synthesis route for target molecule.
+
+**Steps:**
+1. Load dataset: `datasets.USPTO50k(path)`
+2. Train center identification model (RGCN)
+3. Train synthon completion model (GIN)
+4. Combine into end-to-end retrosynthesis pipeline
+5. Apply recursively for multi-step planning
+6. Check commercial availability of building blocks
+
+**Navigation:** `references/retrosynthesis.md` → Task types → Multi-step planning
+
+## Integration Patterns
+
+### With RDKit
+
+Convert between TorchDrug molecules and RDKit:
+```python
+from torchdrug import data
+from rdkit import Chem
+
+# SMILES → TorchDrug molecule
+smiles = "CCO"
+mol = data.Molecule.from_smiles(smiles)
+
+# TorchDrug → RDKit
+rdkit_mol = mol.to_molecule()
+
+# RDKit → TorchDrug
+rdkit_mol = Chem.MolFromSmiles(smiles)
+mol = data.Molecule.from_molecule(rdkit_mol)
+```
+
+### With AlphaFold/ESM
+
+Use predicted structures:
+```python
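+from torchdrug import data, layers
+from torchdrug.layers import geometry
+
+# Load AlphaFold predicted structure
+protein = data.Protein.from_pdb("AF-P12345-F1-model_v4.pdb")
+
+# Build a residue-level graph with sequential and radius edges
+# (GraphConstruction operates on packed batches of proteins)
+graph_construction = layers.GraphConstruction(
+    node_layers=[geometry.AlphaCarbonNode()],
+    edge_layers=[
+        geometry.SequentialEdge(max_distance=2),
+        geometry.SpatialEdge(radius=10.0, min_distance=5),
+    ],
+)
+graph = graph_construction(data.Protein.pack([protein]))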
+```
+
+### With PyTorch Lightning
+
+Wrap tasks for Lightning training:
+```python
+import pytorch_lightning as pl
+import torch
+
+class LightningTask(pl.LightningModule):
+    def __init__(self, torchdrug_task):
+        super().__init__()
+        self.task = torchdrug_task
+
+    def training_step(self, batch, batch_idx):
+        # TorchDrug tasks return (loss, metrics) from forward(); Lightning needs the loss
+        loss, metric = self.task(batch)
+        return loss
+
+    def validation_step(self, batch, batch_idx):
+        pred = self.task.predict(batch)
+        target = self.task.target(batch)
+        return {"pred": pred, "target": target}
+
+    def configure_optimizers(self):
+        return torch.optim.Adam(self.parameters(), lr=1e-3)
+```
+
+## Technical Details
+
+For deep dives into TorchDrug's architecture:
+
+**Core Concepts:** See `references/core_concepts.md` for:
+- Architecture philosophy (modular, configurable)
+- Data structures (Graph, Molecule, Protein, PackedGraph)
+- Model interface and forward function signature
+- Task interface (predict, target, forward, evaluate)
+- Training workflows and best practices
+- Loss functions and metrics
+- Common pitfalls and debugging
+
+## Quick Reference Cheat Sheet
+
+**Choose Dataset:**
+- Molecular property → `references/datasets.md` → Molecular section
+- Protein task → `references/datasets.md` → Protein section
+- Knowledge graph → `references/datasets.md` → Knowledge graph section
+
+**Choose Model:**
+- Molecules → `references/models_architectures.md` → GNN section → GIN/GAT/SchNet
+- Proteins (sequence) → `references/models_architectures.md` → Protein section → ESM
+- Proteins (structure) → `references/models_architectures.md` → Protein section → GearNet
+- Knowledge graph → `references/models_architectures.md` → KG section → RotatE/ComplEx
+
+**Common Tasks:**
+- Property prediction → `references/molecular_property_prediction.md` or `references/protein_modeling.md`
+- Generation → `references/molecular_generation.md`
+- Retrosynthesis → `references/retrosynthesis.md`
+- KG reasoning → `references/knowledge_graphs.md`
+
+**Understand Architecture:**
+- Data structures → `references/core_concepts.md` → Data Structures
+- Model design → `references/core_concepts.md` → Model Interface
+- Task design → `references/core_concepts.md` → Task Interface
+
+## Troubleshooting Common Issues
+
+**Issue: Dimension mismatch errors**
+→ Check `model.input_dim` matches `dataset.node_feature_dim`
+→ See `references/core_concepts.md` → Essential Attributes
+
+**Issue: Poor performance on molecular tasks**
+→ Check the split first: scaffold splits give lower but more realistic scores than random splits, so compare models on the same split
+→ Try GIN instead of GCN
+→ See `references/molecular_property_prediction.md` → Best Practices
+
+**Issue: Protein model not learning**
+→ Use pre-trained ESM for sequence tasks
+→ Check edge construction for structure models
+→ See `references/protein_modeling.md` → Training Workflows
+
+**Issue: Memory errors with large graphs**
+→ Reduce batch size
+→ Use gradient accumulation
+→ See `references/core_concepts.md` → Memory Efficiency
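+
+Gradient accumulation can be sketched in plain PyTorch as follows. A tiny linear model and random tensors stand in for a TorchDrug task and its batches (an assumption made to keep the snippet self-contained). With a task, the loss would come from `task(batch)` instead.
+
+```python
+import torch
+
+# Stand-ins for a TorchDrug task and data loader (illustrative only)
+model = torch.nn.Linear(4, 1)
+optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
+batches = [(torch.randn(2, 4), torch.randn(2, 1)) for _ in range(8)]
+
+accumulation_steps = 4  # effective batch size = 2 * 4 = 8
+optimizer_steps = 0
+
+optimizer.zero_grad()
+for step, (inputs, targets) in enumerate(batches, start=1):
+    loss = torch.nn.functional.mse_loss(model(inputs), targets)
+    # Scale so the accumulated gradient matches the mean over the large batch
+    (loss / accumulation_steps).backward()
+    if step % accumulation_steps == 0:
+        optimizer.step()
+        optimizer.zero_grad()
+        optimizer_steps += 1
+```
+
+Memory use is set by the small per-step batch, while the optimizer sees gradients averaged over `accumulation_steps` batches.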
+
+**Issue: Generated molecules are invalid**
+→ Add validity constraints
+→ Post-process with RDKit validation
+→ See `references/molecular_generation.md` → Validation and Filtering
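+
+A minimal sketch of the RDKit post-processing step, assuming the generator emits SMILES strings: `Chem.MolFromSmiles` returns `None` when parsing or sanitization fails, so those entries can be dropped. The helper name `keep_valid` is hypothetical.
+
+```python
+from rdkit import Chem, RDLogger
+
+RDLogger.DisableLog("rdApp.*")  # silence RDKit parse warnings for invalid inputs
+
+
+def keep_valid(smiles_list):
+    """Hypothetical helper: return canonical SMILES for entries RDKit can parse."""
+    valid = []
+    for smiles in smiles_list:
+        mol = Chem.MolFromSmiles(smiles)  # None if the SMILES is invalid
+        if mol is not None:
+            valid.append(Chem.MolToSmiles(mol))
+    return valid
+
+
+generated = ["CCO", "C1CC", "c1ccccc1"]  # the middle entry has an unclosed ring
+valid_smiles = keep_valid(generated)
+```
+
+Canonicalizing the survivors also makes it easy to deduplicate them with a `set` before ranking.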
+
+## Resources
+
+**Official Documentation:** https://torchdrug.ai/docs/
+**GitHub:** https://github.com/DeepGraphLearning/torchdrug
+**Paper:** TorchDrug: A Powerful and Flexible Machine Learning Platform for Drug Discovery
+
+## Summary
+
+Navigate to the appropriate reference file based on your task:
+
+1. **Molecular property prediction** → `references/molecular_property_prediction.md`
+2. **Protein modeling** → `references/protein_modeling.md`
+3. **Knowledge graphs** → `references/knowledge_graphs.md`
+4. **Molecular generation** → `references/molecular_generation.md`
+5. **Retrosynthesis** → `references/retrosynthesis.md`
+6. **Model selection** → `references/models_architectures.md`
+7. **Dataset selection** → `references/datasets.md`
+8. **Technical details** → `references/core_concepts.md`
+
+Each reference provides comprehensive coverage of its domain with examples, best practices, and common use cases.