Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/diffdock/references/confidence_and_limitations.md
+++ b/skills/diffdock/references/confidence_and_limitations.md
@@ -0,0 +1,182 @@
+# DiffDock Confidence Scores and Limitations
+
+This document provides detailed guidance on interpreting DiffDock confidence scores and understanding the tool's limitations.
+
+## Confidence Score Interpretation
+
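+DiffDock generates a confidence score for each predicted binding pose. The score comes from a separate confidence model trained to judge whether a pose lies within 2 Å RMSD of the true pose, so it ranks poses by how likely they are to be correct. It is an unbounded value, not a probability.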
+
+### Score Ranges
+
+| Score Range | Confidence Level | Interpretation |
+|------------|------------------|----------------|
+| **> 0** | High confidence | Strong prediction, likely accurate binding pose |
+| **-1.5 to 0** | Moderate confidence | Reasonable prediction, may need validation |
+| **< -1.5** | Low confidence | Uncertain prediction, requires careful validation |
+
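The bands in the table can be applied programmatically when triaging many poses. The sketch below is a minimal illustration; the function name `confidence_level` is hypothetical, and the handling of the exact boundary values (0 and -1.5 both counted as moderate) is an assumption, since the table does not say which band they belong to.

```python
def confidence_level(score: float) -> str:
    """Map a DiffDock confidence score to the qualitative bands in the table.

    Assumption: the boundary values 0 and -1.5 are treated as "moderate".
    """
    if score > 0:
        return "high"
    if score >= -1.5:
        return "moderate"
    return "low"


# Example scores, one from each band.
for s in (0.8, -0.7, -2.3):
    print(f"{s:+.1f} -> {confidence_level(s)}")
```

These labels are only a first filter; the notes that follow explain why the same score can mean different things for different systems.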
+### Important Notes on Confidence Scores
+
+1. **Not Binding Affinity**: Confidence scores reflect how certain the model is about the pose, NOT how strongly the ligand binds
+   - High confidence = model is confident about the predicted structure
+   - A high score does NOT imply tight binding, and a low score does NOT imply weak binding
+
+2. **Context-Dependent**: Interpret confidence scores in light of system complexity:
+   - **Lower expectations** for:
+     - Large ligands (>500 Da)
+     - Protein complexes with many chains
+     - Unbound protein conformations (may require conformational changes)
+     - Novel protein families not well-represented in training data
+
+   - **Higher expectations** for:
+     - Drug-like small molecules (150-500 Da)
+     - Single-chain proteins or well-defined binding sites
+     - Proteins similar to those in training data (PDBBind, BindingMOAD)
+
+3. **Multiple Predictions**: DiffDock generates multiple samples per complex (default: 10)
+   - Review top-ranked predictions (by confidence)
+   - Consider clustering similar poses
+   - High-confidence consensus across multiple samples strengthens prediction
+
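One way to look for consensus among samples is to cluster poses by pairwise RMSD, seeding each cluster with its most confident member. The sketch below is illustrative only: `rmsd` and `cluster_poses` are hypothetical helpers, the 2.0 Å cutoff is an assumed value, and the RMSD ignores ligand symmetry. It assumes every pose lists the same atoms in the same order and shares the protein's coordinate frame, so no alignment is performed.

```python
import math


def rmsd(a, b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(a) != len(b) or not a:
        raise ValueError("poses must have the same, non-zero number of atoms")
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(total / len(a))


def cluster_poses(poses, cutoff=2.0):
    """Greedy clustering of (confidence, coords) tuples.

    Poses are visited from highest to lowest confidence. Each pose joins the
    first cluster whose representative (its most confident member) is within
    `cutoff` angstroms, otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `poses`.
    """
    order = sorted(range(len(poses)), key=lambda i: poses[i][0], reverse=True)
    clusters = []
    for i in order:
        for cluster in clusters:
            if rmsd(poses[i][1], poses[cluster[0]][1]) <= cutoff:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters


# Toy two-atom "ligand" with made-up coordinates and scores.
poses = [
    (0.4, [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]),
    (-0.2, [(0.5, 0.0, 0.0), (1.5, 0.0, 0.0)]),
    (-1.8, [(10.0, 0.0, 0.0), (11.0, 0.0, 0.0)]),
]
clusters = cluster_poses(poses)
print(clusters)
```

A large cluster led by a high-confidence pose is the kind of consensus described above; a lone high-confidence pose deserves closer inspection.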
+## What DiffDock Predicts
+
+### ✅ DiffDock DOES Predict
+- **Binding poses**: 3D spatial orientation of ligand in protein binding site
+- **Confidence scores**: Model's certainty about predictions
+- **Multiple conformations**: Various possible binding modes
+
+### ❌ DiffDock DOES NOT Predict
+- **Binding affinity**: Strength of protein-ligand interaction (ΔG, Kd, Ki)
+- **Binding kinetics**: On/off rates, residence time
+- **ADMET properties**: Absorption, distribution, metabolism, excretion, toxicity
+- **Selectivity**: Relative binding to different targets
+
+## Scope and Limitations
+
+### Designed For
+- **Small molecule docking**: Organic compounds typically 100-1000 Da
+- **Protein targets**: Single or multi-chain proteins
+- **Small peptides**: Short peptide ligands (< ~20 residues)
+- **Small nucleic acids**: Short oligonucleotides
+
+### NOT Designed For
+- **Large biomolecules**: Full protein-protein or protein-nucleic acid interactions
+  - Use DiffDock-PP or AlphaFold-Multimer (protein-protein) or RoseTTAFold2NA (protein-nucleic acid) instead
+- **Large peptides/proteins**: >20 residues as ligands
+- **Covalent docking**: Irreversible covalent bond formation
+- **Metalloproteins**: May not model metal coordination accurately
+- **Membrane proteins**: Not specifically trained on membrane-embedded proteins
+
+### Training Data Considerations
+
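+DiffDock models were trained on:
+- **PDBBind**: Experimentally determined protein-ligand complexes with measured binding data (original DiffDock and DiffDock-L)
+- **BindingMOAD**: Curated protein-ligand crystal structures, added to the training set for DiffDock-L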
+
+**Implications**:
+- Best performance on proteins/ligands similar to training data
+- May underperform on:
+  - Novel protein families
+  - Unusual ligand chemotypes
+  - Allosteric sites not well-represented in training data
+
+## Validation and Complementary Tools
+
+### Recommended Workflow
+
+1. **Generate poses with DiffDock**
+   - Use confidence scores for initial ranking
+   - Consider multiple high-confidence predictions
+
+2. **Visual Inspection**
+   - Examine protein-ligand interactions in molecular viewer
+   - Check for reasonable:
+     - Hydrogen bonds
+     - Hydrophobic interactions
+     - Steric complementarity
+     - Electrostatic interactions
+
+3. **Scoring and Refinement** (choose one or more):
+   - **GNINA**: Deep learning-based scoring function
+   - **Molecular mechanics**: Energy minimization and refinement
+   - **MM/GBSA or MM/PBSA**: Binding free energy estimation
+   - **Free energy calculations**: FEP or TI for accurate affinity prediction
+
+4. **Experimental Validation**
+   - Biochemical assays (IC50, Kd measurements)
+   - Structural validation (X-ray crystallography, cryo-EM)
+
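Step 1 of the workflow can be scripted by reading the rank and confidence from the output file names and shortlisting poses for rescoring. This is a sketch under a stated assumption: it expects names of the form `rank1_confidence-0.45.sdf`, which should be checked against the output of your DiffDock version. The helpers `parse_pose_name` and `shortlist`, and the default thresholds, are hypothetical.

```python
import re

# Assumed naming scheme: rank<N>_confidence<score>.sdf (verify for your version).
_NAME = re.compile(r"rank(\d+)_confidence(-?\d+(?:\.\d+)?)\.sdf$")


def parse_pose_name(filename):
    """Return (rank, confidence) parsed from a pose file name, or None."""
    match = _NAME.search(filename)
    if match is None:
        return None
    return int(match.group(1)), float(match.group(2))


def shortlist(filenames, min_confidence=-1.5, top_n=3):
    """Keep up to `top_n` poses with confidence >= `min_confidence`, best rank first."""
    parsed = []
    for name in filenames:
        hit = parse_pose_name(name)
        if hit is not None and hit[1] >= min_confidence:
            parsed.append((hit[0], hit[1], name))
    parsed.sort()  # tuples sort by rank first
    return [name for _, _, name in parsed[:top_n]]


# Made-up file names for illustration.
names = [
    "rank3_confidence-2.10.sdf",
    "rank1_confidence0.31.sdf",
    "rank2_confidence-0.85.sdf",
    "rank1.sdf",  # no confidence in the name, so it is skipped
]
selected = shortlist(names)
print(selected)
```

The shortlisted files can then be passed to visual inspection and to whichever scoring or refinement tool is chosen in step 3.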
+### Tools for Binding Affinity Assessment
+
+DiffDock does not estimate affinity itself; pair it with one or more of these tools:
+
+- **GNINA**: Docking and rescoring with CNN-based scoring functions
+  - GitHub: github.com/gnina/gnina
+
+- **AutoDock Vina**: Classical docking and scoring
+  - Website: vina.scripps.edu
+
+- **Free Energy Calculations**:
+  - OpenMM + OpenFE
+  - GROMACS + ABFE/RBFE protocols
+
+- **MM/GBSA Tools**:
+  - MMPBSA.py (AmberTools)
+  - gmx_MMPBSA
+
+## Performance Optimization
+
+### For Best Results
+
+1. **Protein Preparation**:
+   - Remove water molecules far from binding site
+   - Resolve missing residues if possible
+   - Consider protonation states at physiological pH
+
+2. **Ligand Input**:
+   - Provide reasonable 3D conformers when using structure files
+   - Use canonical SMILES for consistent results
+   - Pre-process with RDKit if needed
+
+3. **Computational Resources**:
+   - GPU strongly recommended (roughly 10-100x speedup over CPU, depending on hardware)
+   - First run pre-computes lookup tables (takes a few minutes)
+   - Batch processing more efficient than single predictions
+
+4. **Parameter Tuning**:
+   - Increase `samples_per_complex` for difficult cases (20-40)
+   - Adjust temperature parameters for diversity/accuracy trade-off
+   - Use pre-computed ESM embeddings for repeated predictions
+
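Batch runs are typically driven by a CSV file listing one complex per row. The sketch below writes such a file with Python's standard `csv` module. The column names (`complex_name`, `protein_path`, `ligand_description`, `protein_sequence`) follow the batch format described in the DiffDock README as best recalled here, so treat them as an assumption and confirm them for your version. The helper `write_batch_csv`, the job names, and the protein paths are hypothetical.

```python
import csv
import io

# Assumed column names for DiffDock's batch CSV input (verify against the README).
FIELDS = ["complex_name", "protein_path", "ligand_description", "protein_sequence"]


def write_batch_csv(handle, jobs):
    """Write one row per job; a job needs 'name' and 'ligand', plus a protein
    given either as 'protein_path' or as 'protein_sequence'."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for job in jobs:
        writer.writerow({
            "complex_name": job["name"],
            "protein_path": job.get("protein_path", ""),
            "ligand_description": job["ligand"],  # SMILES string or ligand file path
            "protein_sequence": job.get("protein_sequence", ""),
        })


# Hypothetical jobs: the same protein docked against two ligands.
jobs = [
    {"name": "target_a_aspirin", "protein_path": "proteins/target_a.pdb",
     "ligand": "CC(=O)Oc1ccccc1C(=O)O"},
    {"name": "target_a_caffeine", "protein_path": "proteins/target_a.pdb",
     "ligand": "CN1C=NC2=C1C(=O)N(C(=O)N2C)C"},
]
buffer = io.StringIO()
write_batch_csv(buffer, jobs)
print(buffer.getvalue())
```

In practice the same function can write to a file opened with `open(path, "w", newline="")`, and that file is then passed to the inference script (via its `--protein_ligand_csv` option, if your version matches the README) together with the chosen `samples_per_complex`.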
+## Common Issues and Troubleshooting
+
+### Low Confidence Scores
+- **Large/flexible ligands**: Consider splitting into fragments or using alternative methods
+- **Multiple binding sites**: May predict multiple locations with distributed confidence
+- **Protein flexibility**: Consider using ensemble of protein conformations
+
+### Unrealistic Predictions
+- **Clashes**: May indicate need for protein preparation or refinement
+- **Surface binding**: Check if true binding site is blocked or unclear
+- **Unusual poses**: Consider increasing samples to explore more conformations
+
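A quick steric check can flag poses with clashes before any refinement. The sketch below counts ligand-protein atom pairs closer than a distance cutoff. The helper `count_clashes` is hypothetical, the 2.0 Å cutoff is an illustrative assumption rather than a DiffDock setting, and the coordinates are expected as plain (x, y, z) tuples of heavy atoms already extracted from the structure files.

```python
import math


def count_clashes(ligand_atoms, protein_atoms, cutoff=2.0):
    """Count ligand/protein atom pairs closer than `cutoff` angstroms.

    Brute-force O(N*M) loop; fine for one pose, slow for large screens.
    """
    clashes = 0
    for lig in ligand_atoms:
        for prot in protein_atoms:
            if math.dist(lig, prot) < cutoff:
                clashes += 1
    return clashes


# Made-up coordinates for illustration.
ligand = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
protein = [(0.5, 0.0, 0.0), (5.0, 0.0, 0.0)]
n_clashes = count_clashes(ligand, protein)
print(n_clashes)
```

A non-zero count does not by itself rule a pose out, but several close contacts suggest the pose needs energy minimization or that the protein structure needs further preparation.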
+### Slow Performance
+- **Use GPU**: Essential for reasonable runtime
+- **Pre-compute embeddings**: Reuse ESM embeddings for same protein
+- **Batch processing**: More efficient than sequential individual predictions
+- **Reduce samples**: Lower `samples_per_complex` for quick screening
+
+## Citation and Further Reading
+
+For methodology details and benchmarking results, see:
+
+1. **Original DiffDock Paper** (ICLR 2023):
+   - "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
+   - Corso et al., arXiv:2210.01776
+
+2. **DiffDock-L Paper** (ICLR 2024):
+   - "Deep Confident Steps to New Pockets: Strategies for Docking Generalization"
+   - Enhanced model with improved generalization
+   - Corso et al., arXiv:2402.18396
+
+3. **PoseBusters Benchmark**:
+   - Rigorous docking evaluation framework
+   - Used for DiffDock validation