Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

View File

@@ -0,0 +1,182 @@
# DiffDock Confidence Scores and Limitations
This document provides detailed guidance on interpreting DiffDock confidence scores and understanding the tool's limitations.
## Confidence Score Interpretation
DiffDock generates a confidence score for each predicted binding pose. This score indicates the model's certainty about the prediction.
### Score Ranges
| Score Range | Confidence Level | Interpretation |
|------------|------------------|----------------|
| **> 0** | High confidence | Strong prediction, likely accurate binding pose |
| **-1.5 to 0** | Moderate confidence | Reasonable prediction, may need validation |
| **< -1.5** | Low confidence | Uncertain prediction, requires careful validation |
### Important Notes on Confidence Scores
1. **Not Binding Affinity**: Confidence scores reflect prediction certainty, NOT binding affinity strength
- High confidence = model is confident about the structure
- Does NOT indicate strong/weak binding affinity
2. **Context-Dependent**: Confidence scores should be adjusted based on system complexity:
- **Lower expectations** for:
- Large ligands (>500 Da)
- Protein complexes with many chains
- Unbound protein conformations (may require conformational changes)
- Novel protein families not well-represented in training data
- **Higher expectations** for:
- Drug-like small molecules (150-500 Da)
- Single-chain proteins or well-defined binding sites
- Proteins similar to those in training data (PDBBind, BindingMOAD)
3. **Multiple Predictions**: DiffDock generates multiple samples per complex (default: 10)
- Review top-ranked predictions (by confidence)
- Consider clustering similar poses
- High-confidence consensus across multiple samples strengthens prediction
## What DiffDock Predicts
### ✅ DiffDock DOES Predict
- **Binding poses**: 3D spatial orientation of ligand in protein binding site
- **Confidence scores**: Model's certainty about predictions
- **Multiple conformations**: Various possible binding modes
### ❌ DiffDock DOES NOT Predict
- **Binding affinity**: Strength of protein-ligand interaction (ΔG, Kd, Ki)
- **Binding kinetics**: On/off rates, residence time
- **ADMET properties**: Absorption, distribution, metabolism, excretion, toxicity
- **Selectivity**: Relative binding to different targets
## Scope and Limitations
### Designed For
- **Small molecule docking**: Organic compounds typically 100-1000 Da
- **Protein targets**: Single or multi-chain proteins
- **Small peptides**: Short peptide ligands (< ~20 residues)
- **Small nucleic acids**: Short oligonucleotides
### NOT Designed For
- **Large biomolecules**: Full protein-protein interactions
- Use DiffDock-PP, AlphaFold-Multimer, or RoseTTAFold2NA instead
- **Large peptides/proteins**: >20 residues as ligands
- **Covalent docking**: Irreversible covalent bond formation
- **Metalloprotein specifics**: May not accurately handle metal coordination
- **Membrane proteins**: Not specifically trained on membrane-embedded proteins
### Training Data Considerations
DiffDock was trained on:
- **PDBBind**: Diverse protein-ligand complexes
- **BindingMOAD**: Multi-domain protein structures
**Implications**:
- Best performance on proteins/ligands similar to training data
- May underperform on:
- Novel protein families
- Unusual ligand chemotypes
- Allosteric sites not well-represented in training data
## Validation and Complementary Tools
### Recommended Workflow
1. **Generate poses with DiffDock**
- Use confidence scores for initial ranking
- Consider multiple high-confidence predictions
2. **Visual Inspection**
- Examine protein-ligand interactions in molecular viewer
- Check for reasonable:
- Hydrogen bonds
- Hydrophobic interactions
- Steric complementarity
- Electrostatic interactions
3. **Scoring and Refinement** (choose one or more):
- **GNINA**: Deep learning-based scoring function
- **Molecular mechanics**: Energy minimization and refinement
- **MM/GBSA or MM/PBSA**: Binding free energy estimation
- **Free energy calculations**: FEP or TI for accurate affinity prediction
4. **Experimental Validation**
- Biochemical assays (IC50, Kd measurements)
- Structural validation (X-ray crystallography, cryo-EM)
### Tools for Binding Affinity Assessment
DiffDock should be combined with these tools for affinity prediction:
- **GNINA**: Fast, accurate scoring function
- Github: github.com/gnina/gnina
- **AutoDock Vina**: Classical docking and scoring
- Website: vina.scripps.edu
- **Free Energy Calculations**:
- OpenMM + OpenFE
- GROMACS + ABFE/RBFE protocols
- **MM/GBSA Tools**:
- MMPBSA.py (AmberTools)
- gmx_MMPBSA
## Performance Optimization
### For Best Results
1. **Protein Preparation**:
- Remove water molecules far from binding site
- Resolve missing residues if possible
- Consider protonation states at physiological pH
2. **Ligand Input**:
- Provide reasonable 3D conformers when using structure files
- Use canonical SMILES for consistent results
- Pre-process with RDKit if needed
3. **Computational Resources**:
- GPU strongly recommended (10-100x speedup)
- First run pre-computes lookup tables (takes a few minutes)
- Batch processing more efficient than single predictions
4. **Parameter Tuning**:
- Increase `samples_per_complex` for difficult cases (20-40)
- Adjust temperature parameters for diversity/accuracy trade-off
- Use pre-computed ESM embeddings for repeated predictions
## Common Issues and Troubleshooting
### Low Confidence Scores
- **Large/flexible ligands**: Consider splitting into fragments or use alternative methods
- **Multiple binding sites**: May predict multiple locations with distributed confidence
- **Protein flexibility**: Consider using ensemble of protein conformations
### Unrealistic Predictions
- **Clashes**: May indicate need for protein preparation or refinement
- **Surface binding**: Check if true binding site is blocked or unclear
- **Unusual poses**: Consider increasing samples to explore more conformations
### Slow Performance
- **Use GPU**: Essential for reasonable runtime
- **Pre-compute embeddings**: Reuse ESM embeddings for same protein
- **Batch processing**: More efficient than sequential individual predictions
- **Reduce samples**: Lower `samples_per_complex` for quick screening
## Citation and Further Reading
For methodology details and benchmarking results, see:
1. **Original DiffDock Paper** (ICLR 2023):
- "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
- Corso et al., arXiv:2210.01776
2. **DiffDock-L Paper** (2024):
- Enhanced model with improved generalization
- Stärk et al., arXiv:2402.18396
3. **PoseBusters Benchmark**:
- Rigorous docking evaluation framework
- Used for DiffDock validation