Initial commit
This commit is contained in:
182
skills/diffdock/references/confidence_and_limitations.md
Normal file
182
skills/diffdock/references/confidence_and_limitations.md
Normal file
@@ -0,0 +1,182 @@
|
||||
# DiffDock Confidence Scores and Limitations
|
||||
|
||||
This document provides detailed guidance on interpreting DiffDock confidence scores and understanding the tool's limitations.
|
||||
|
||||
## Confidence Score Interpretation
|
||||
|
||||
DiffDock generates a confidence score for each predicted binding pose. This score indicates the model's certainty about the prediction.
|
||||
|
||||
### Score Ranges
|
||||
|
||||
| Score Range | Confidence Level | Interpretation |
|
||||
|------------|------------------|----------------|
|
||||
| **> 0** | High confidence | Strong prediction, likely accurate binding pose |
|
||||
| **-1.5 to 0** | Moderate confidence | Reasonable prediction, may need validation |
|
||||
| **< -1.5** | Low confidence | Uncertain prediction, requires careful validation |
|
||||
|
||||
### Important Notes on Confidence Scores
|
||||
|
||||
1. **Not Binding Affinity**: Confidence scores reflect prediction certainty, NOT binding affinity strength
|
||||
- High confidence = model is confident about the structure
|
||||
- Does NOT indicate strong/weak binding affinity
|
||||
|
||||
2. **Context-Dependent**: Confidence scores should be adjusted based on system complexity:
|
||||
- **Lower expectations** for:
|
||||
- Large ligands (>500 Da)
|
||||
- Protein complexes with many chains
|
||||
- Unbound protein conformations (may require conformational changes)
|
||||
- Novel protein families not well-represented in training data
|
||||
|
||||
- **Higher expectations** for:
|
||||
- Drug-like small molecules (150-500 Da)
|
||||
- Single-chain proteins or well-defined binding sites
|
||||
- Proteins similar to those in training data (PDBBind, BindingMOAD)
|
||||
|
||||
3. **Multiple Predictions**: DiffDock generates multiple samples per complex (default: 10)
|
||||
- Review top-ranked predictions (by confidence)
|
||||
- Consider clustering similar poses
|
||||
- High-confidence consensus across multiple samples strengthens prediction
|
||||
|
||||
## What DiffDock Predicts
|
||||
|
||||
### ✅ DiffDock DOES Predict
|
||||
- **Binding poses**: 3D spatial orientation of ligand in protein binding site
|
||||
- **Confidence scores**: Model's certainty about predictions
|
||||
- **Multiple conformations**: Various possible binding modes
|
||||
|
||||
### ❌ DiffDock DOES NOT Predict
|
||||
- **Binding affinity**: Strength of protein-ligand interaction (ΔG, Kd, Ki)
|
||||
- **Binding kinetics**: On/off rates, residence time
|
||||
- **ADMET properties**: Absorption, distribution, metabolism, excretion, toxicity
|
||||
- **Selectivity**: Relative binding to different targets
|
||||
|
||||
## Scope and Limitations
|
||||
|
||||
### Designed For
|
||||
- **Small molecule docking**: Organic compounds typically 100-1000 Da
|
||||
- **Protein targets**: Single or multi-chain proteins
|
||||
- **Small peptides**: Short peptide ligands (< ~20 residues)
|
||||
- **Small nucleic acids**: Short oligonucleotides
|
||||
|
||||
### NOT Designed For
|
||||
- **Large biomolecules**: Full protein-protein interactions
|
||||
- Use DiffDock-PP, AlphaFold-Multimer, or RoseTTAFold2NA instead
|
||||
- **Large peptides/proteins**: >20 residues as ligands
|
||||
- **Covalent docking**: Irreversible covalent bond formation
|
||||
- **Metalloprotein specifics**: May not accurately handle metal coordination
|
||||
- **Membrane proteins**: Not specifically trained on membrane-embedded proteins
|
||||
|
||||
### Training Data Considerations
|
||||
|
||||
DiffDock was trained on:
|
||||
- **PDBBind**: Diverse protein-ligand complexes
|
||||
- **BindingMOAD**: Multi-domain protein structures
|
||||
|
||||
**Implications**:
|
||||
- Best performance on proteins/ligands similar to training data
|
||||
- May underperform on:
|
||||
- Novel protein families
|
||||
- Unusual ligand chemotypes
|
||||
- Allosteric sites not well-represented in training data
|
||||
|
||||
## Validation and Complementary Tools
|
||||
|
||||
### Recommended Workflow
|
||||
|
||||
1. **Generate poses with DiffDock**
|
||||
- Use confidence scores for initial ranking
|
||||
- Consider multiple high-confidence predictions
|
||||
|
||||
2. **Visual Inspection**
|
||||
- Examine protein-ligand interactions in molecular viewer
|
||||
- Check for reasonable:
|
||||
- Hydrogen bonds
|
||||
- Hydrophobic interactions
|
||||
- Steric complementarity
|
||||
- Electrostatic interactions
|
||||
|
||||
3. **Scoring and Refinement** (choose one or more):
|
||||
- **GNINA**: Deep learning-based scoring function
|
||||
- **Molecular mechanics**: Energy minimization and refinement
|
||||
- **MM/GBSA or MM/PBSA**: Binding free energy estimation
|
||||
- **Free energy calculations**: FEP or TI for accurate affinity prediction
|
||||
|
||||
4. **Experimental Validation**
|
||||
- Biochemical assays (IC50, Kd measurements)
|
||||
- Structural validation (X-ray crystallography, cryo-EM)
|
||||
|
||||
### Tools for Binding Affinity Assessment
|
||||
|
||||
DiffDock should be combined with these tools for affinity prediction:
|
||||
|
||||
- **GNINA**: Fast, accurate scoring function
|
||||
- Github: github.com/gnina/gnina
|
||||
|
||||
- **AutoDock Vina**: Classical docking and scoring
|
||||
- Website: vina.scripps.edu
|
||||
|
||||
- **Free Energy Calculations**:
|
||||
- OpenMM + OpenFE
|
||||
- GROMACS + ABFE/RBFE protocols
|
||||
|
||||
- **MM/GBSA Tools**:
|
||||
- MMPBSA.py (AmberTools)
|
||||
- gmx_MMPBSA
|
||||
|
||||
## Performance Optimization
|
||||
|
||||
### For Best Results
|
||||
|
||||
1. **Protein Preparation**:
|
||||
- Remove water molecules far from binding site
|
||||
- Resolve missing residues if possible
|
||||
- Consider protonation states at physiological pH
|
||||
|
||||
2. **Ligand Input**:
|
||||
- Provide reasonable 3D conformers when using structure files
|
||||
- Use canonical SMILES for consistent results
|
||||
- Pre-process with RDKit if needed
|
||||
|
||||
3. **Computational Resources**:
|
||||
- GPU strongly recommended (10-100x speedup)
|
||||
- First run pre-computes lookup tables (takes a few minutes)
|
||||
- Batch processing more efficient than single predictions
|
||||
|
||||
4. **Parameter Tuning**:
|
||||
- Increase `samples_per_complex` for difficult cases (20-40)
|
||||
- Adjust temperature parameters for diversity/accuracy trade-off
|
||||
- Use pre-computed ESM embeddings for repeated predictions
|
||||
|
||||
## Common Issues and Troubleshooting
|
||||
|
||||
### Low Confidence Scores
|
||||
- **Large/flexible ligands**: Consider splitting into fragments or use alternative methods
|
||||
- **Multiple binding sites**: May predict multiple locations with distributed confidence
|
||||
- **Protein flexibility**: Consider using ensemble of protein conformations
|
||||
|
||||
### Unrealistic Predictions
|
||||
- **Clashes**: May indicate need for protein preparation or refinement
|
||||
- **Surface binding**: Check if true binding site is blocked or unclear
|
||||
- **Unusual poses**: Consider increasing samples to explore more conformations
|
||||
|
||||
### Slow Performance
|
||||
- **Use GPU**: Essential for reasonable runtime
|
||||
- **Pre-compute embeddings**: Reuse ESM embeddings for same protein
|
||||
- **Batch processing**: More efficient than sequential individual predictions
|
||||
- **Reduce samples**: Lower `samples_per_complex` for quick screening
|
||||
|
||||
## Citation and Further Reading
|
||||
|
||||
For methodology details and benchmarking results, see:
|
||||
|
||||
1. **Original DiffDock Paper** (ICLR 2023):
|
||||
- "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
|
||||
- Corso et al., arXiv:2210.01776
|
||||
|
||||
2. **DiffDock-L Paper** (2024):
|
||||
- Enhanced model with improved generalization
|
||||
- Stärk et al., arXiv:2402.18396
|
||||
|
||||
3. **PoseBusters Benchmark**:
|
||||
- Rigorous docking evaluation framework
|
||||
- Used for DiffDock validation
|
||||
163
skills/diffdock/references/parameters_reference.md
Normal file
163
skills/diffdock/references/parameters_reference.md
Normal file
@@ -0,0 +1,163 @@
|
||||
# DiffDock Configuration Parameters Reference
|
||||
|
||||
This document provides comprehensive details on all DiffDock configuration parameters and command-line options.
|
||||
|
||||
## Model & Checkpoint Settings
|
||||
|
||||
### Model Paths
|
||||
- **`--model_dir`**: Directory containing the score model checkpoint
|
||||
- Default: `./workdir/v1.1/score_model`
|
||||
- DiffDock-L model (current default)
|
||||
|
||||
- **`--confidence_model_dir`**: Directory containing the confidence model checkpoint
|
||||
- Default: `./workdir/v1.1/confidence_model`
|
||||
|
||||
- **`--ckpt`**: Name of the score model checkpoint file
|
||||
- Default: `best_ema_inference_epoch_model.pt`
|
||||
|
||||
- **`--confidence_ckpt`**: Name of the confidence model checkpoint file
|
||||
- Default: `best_model_epoch75.pt`
|
||||
|
||||
### Model Version Flags
|
||||
- **`--old_score_model`**: Use original DiffDock model instead of DiffDock-L
|
||||
- Default: `false` (uses DiffDock-L)
|
||||
|
||||
- **`--old_filtering_model`**: Use legacy confidence filtering approach
|
||||
- Default: `true`
|
||||
|
||||
## Input/Output Options
|
||||
|
||||
### Input Specification
|
||||
- **`--protein_path`**: Path to protein PDB file
|
||||
- Example: `--protein_path protein.pdb`
|
||||
- Alternative to `--protein_sequence`
|
||||
|
||||
- **`--protein_sequence`**: Amino acid sequence for ESMFold folding
|
||||
- Automatically generates protein structure from sequence
|
||||
- Alternative to `--protein_path`
|
||||
|
||||
- **`--ligand`**: Ligand specification (SMILES string or file path)
|
||||
- SMILES string: `--ligand "COc(cc1)ccc1C#N"`
|
||||
- File path: `--ligand ligand.sdf` or `.mol2`
|
||||
|
||||
- **`--protein_ligand_csv`**: CSV file for batch processing
|
||||
- Required columns: `complex_name`, `protein_path`, `ligand_description`, `protein_sequence`
|
||||
- Example: `--protein_ligand_csv data/protein_ligand_example.csv`
|
||||
|
||||
### Output Control
|
||||
- **`--out_dir`**: Output directory for predictions
|
||||
- Example: `--out_dir results/user_predictions/`
|
||||
|
||||
- **`--save_visualisation`**: Export predicted molecules as SDF files
|
||||
- Enables visualization of results
|
||||
|
||||
## Inference Parameters
|
||||
|
||||
### Diffusion Steps
|
||||
- **`--inference_steps`**: Number of planned inference iterations
|
||||
- Default: `20`
|
||||
- Higher values may improve accuracy but increase runtime
|
||||
|
||||
- **`--actual_steps`**: Actual diffusion steps executed
|
||||
- Default: `19`
|
||||
|
||||
- **`--no_final_step_noise`**: Omit noise at the final diffusion step
|
||||
- Default: `true`
|
||||
|
||||
### Sampling Settings
|
||||
- **`--samples_per_complex`**: Number of samples to generate per complex
|
||||
- Default: `10`
|
||||
- More samples provide better coverage but increase computation
|
||||
|
||||
- **`--sigma_schedule`**: Noise schedule type
|
||||
- Default: `expbeta` (exponential-beta)
|
||||
|
||||
- **`--initial_noise_std_proportion`**: Initial noise standard deviation scaling
|
||||
- Default: `1.46`
|
||||
|
||||
### Temperature Parameters
|
||||
|
||||
#### Sampling Temperatures (Controls diversity of predictions)
|
||||
- **`--temp_sampling_tr`**: Translation sampling temperature
|
||||
- Default: `1.17`
|
||||
|
||||
- **`--temp_sampling_rot`**: Rotation sampling temperature
|
||||
- Default: `2.06`
|
||||
|
||||
- **`--temp_sampling_tor`**: Torsion sampling temperature
|
||||
- Default: `7.04`
|
||||
|
||||
#### Psi Angle Temperatures
|
||||
- **`--temp_psi_tr`**: Translation psi temperature
|
||||
- Default: `0.73`
|
||||
|
||||
- **`--temp_psi_rot`**: Rotation psi temperature
|
||||
- Default: `0.90`
|
||||
|
||||
- **`--temp_psi_tor`**: Torsion psi temperature
|
||||
- Default: `0.59`
|
||||
|
||||
#### Sigma Data Temperatures
|
||||
- **`--temp_sigma_data_tr`**: Translation data distribution scaling
|
||||
- Default: `0.93`
|
||||
|
||||
- **`--temp_sigma_data_rot`**: Rotation data distribution scaling
|
||||
- Default: `0.75`
|
||||
|
||||
- **`--temp_sigma_data_tor`**: Torsion data distribution scaling
|
||||
- Default: `0.69`
|
||||
|
||||
## Processing Options
|
||||
|
||||
### Performance
|
||||
- **`--batch_size`**: Processing batch size
|
||||
- Default: `10`
|
||||
- Larger values increase throughput but require more memory
|
||||
|
||||
- **`--tqdm`**: Enable progress bar visualization
|
||||
- Useful for monitoring long-running jobs
|
||||
|
||||
### Protein Structure
|
||||
- **`--chain_cutoff`**: Maximum number of protein chains to process
|
||||
- Example: `--chain_cutoff 10`
|
||||
- Useful for large multi-chain complexes
|
||||
|
||||
- **`--esm_embeddings_path`**: Path to pre-computed ESM2 protein embeddings
|
||||
- Speeds up inference by reusing embeddings
|
||||
- Optional optimization
|
||||
|
||||
### Dataset Options
|
||||
- **`--split`**: Dataset split to use (train/test/val)
|
||||
- Used for evaluation on standard benchmarks
|
||||
|
||||
## Advanced Flags
|
||||
|
||||
### Debugging & Testing
|
||||
- **`--no_model`**: Disable model inference (debugging)
|
||||
- Default: `false`
|
||||
|
||||
- **`--no_random`**: Disable randomization
|
||||
- Default: `false`
|
||||
- Useful for reproducibility testing
|
||||
|
||||
### Alternative Sampling
|
||||
- **`--ode`**: Use ODE solver instead of SDE
|
||||
- Default: `false`
|
||||
- Alternative sampling approach
|
||||
|
||||
- **`--different_schedules`**: Use different noise schedules per component
|
||||
- Default: `false`
|
||||
|
||||
### Error Handling
|
||||
- **`--limit_failures`**: Maximum allowed failures before stopping
|
||||
- Default: `5`
|
||||
|
||||
## Configuration File
|
||||
|
||||
All parameters can be specified in a YAML configuration file (typically `default_inference_args.yaml`) or overridden via command line:
|
||||
|
||||
```bash
|
||||
python -m inference --config default_inference_args.yaml --samples_per_complex 20
|
||||
```
|
||||
|
||||
Command-line arguments take precedence over configuration file values.
|
||||
392
skills/diffdock/references/workflows_examples.md
Normal file
392
skills/diffdock/references/workflows_examples.md
Normal file
@@ -0,0 +1,392 @@
|
||||
# DiffDock Workflows and Examples
|
||||
|
||||
This document provides practical workflows and usage examples for common DiffDock tasks.
|
||||
|
||||
## Installation and Setup
|
||||
|
||||
### Conda Installation (Recommended)
|
||||
|
||||
```bash
|
||||
# Clone repository
|
||||
git clone https://github.com/gcorso/DiffDock.git
|
||||
cd DiffDock
|
||||
|
||||
# Create conda environment
|
||||
conda env create --file environment.yml
|
||||
conda activate diffdock
|
||||
```
|
||||
|
||||
### Docker Installation
|
||||
|
||||
```bash
|
||||
# Pull Docker image
|
||||
docker pull rbgcsail/diffdock
|
||||
|
||||
# Run container with GPU support
|
||||
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
|
||||
|
||||
# Inside container, activate environment
|
||||
micromamba activate diffdock
|
||||
```
|
||||
|
||||
### First Run
|
||||
The first execution pre-computes SO(2) and SO(3) lookup tables, taking a few minutes. Subsequent runs start immediately.
|
||||
|
||||
## Workflow 1: Single Protein-Ligand Docking
|
||||
|
||||
### Using PDB File and SMILES String
|
||||
|
||||
```bash
|
||||
python -m inference \
|
||||
--config default_inference_args.yaml \
|
||||
--protein_path examples/protein.pdb \
|
||||
--ligand "COc1ccc(C(=O)Nc2ccccc2)cc1" \
|
||||
--out_dir results/single_docking/
|
||||
```
|
||||
|
||||
**Output Structure**:
|
||||
```
|
||||
results/single_docking/
|
||||
├── index_0_rank_1.sdf # Top-ranked prediction
|
||||
├── index_0_rank_2.sdf # Second-ranked prediction
|
||||
├── ...
|
||||
├── index_0_rank_10.sdf # 10th prediction (if samples_per_complex=10)
|
||||
└── confidence_scores.txt # Scores for all predictions
|
||||
```
|
||||
|
||||
### Using Ligand Structure File
|
||||
|
||||
```bash
|
||||
python -m inference \
|
||||
--config default_inference_args.yaml \
|
||||
--protein_path protein.pdb \
|
||||
--ligand ligand.sdf \
|
||||
--out_dir results/ligand_file/
|
||||
```
|
||||
|
||||
**Supported ligand formats**: SDF, MOL2, or any format readable by RDKit
|
||||
|
||||
## Workflow 2: Protein Sequence to Structure Docking
|
||||
|
||||
### Using ESMFold for Protein Folding
|
||||
|
||||
```bash
|
||||
python -m inference \
|
||||
--config default_inference_args.yaml \
|
||||
--protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK" \
|
||||
--ligand "CC(C)Cc1ccc(cc1)C(C)C(=O)O" \
|
||||
--out_dir results/sequence_docking/
|
||||
```
|
||||
|
||||
**Use Cases**:
|
||||
- Protein structure not available in PDB
|
||||
- Modeling mutations or variants
|
||||
- De novo protein design validation
|
||||
|
||||
**Note**: ESMFold folding adds computation time (30s-5min depending on sequence length)
|
||||
|
||||
## Workflow 3: Batch Processing Multiple Complexes
|
||||
|
||||
### Prepare CSV File
|
||||
|
||||
Create `complexes.csv` with required columns:
|
||||
|
||||
```csv
|
||||
complex_name,protein_path,ligand_description,protein_sequence
|
||||
complex1,proteins/protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
|
||||
complex2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKF...
|
||||
complex3,proteins/protein3.pdb,ligands/ligand3.sdf,
|
||||
```
|
||||
|
||||
**Column Descriptions**:
|
||||
- `complex_name`: Unique identifier for the complex
|
||||
- `protein_path`: Path to PDB file (leave empty if using sequence)
|
||||
- `ligand_description`: SMILES string or path to ligand file
|
||||
- `protein_sequence`: Amino acid sequence (leave empty if using PDB)
|
||||
|
||||
### Run Batch Docking
|
||||
|
||||
```bash
|
||||
python -m inference \
|
||||
--config default_inference_args.yaml \
|
||||
--protein_ligand_csv complexes.csv \
|
||||
--out_dir results/batch_predictions/ \
|
||||
--batch_size 10
|
||||
```
|
||||
|
||||
**Output Structure**:
|
||||
```
|
||||
results/batch_predictions/
|
||||
├── complex1/
|
||||
│ ├── rank_1.sdf
|
||||
│ ├── rank_2.sdf
|
||||
│ └── ...
|
||||
├── complex2/
|
||||
│ ├── rank_1.sdf
|
||||
│ └── ...
|
||||
└── complex3/
|
||||
└── ...
|
||||
```
|
||||
|
||||
## Workflow 4: High-Throughput Virtual Screening
|
||||
|
||||
### Setup for Screening Large Ligand Libraries
|
||||
|
||||
```python
|
||||
# generate_screening_csv.py
|
||||
import pandas as pd
|
||||
|
||||
# Load ligand library
|
||||
ligands = pd.read_csv("ligand_library.csv") # Contains SMILES
|
||||
|
||||
# Create DiffDock input
|
||||
screening_data = {
|
||||
"complex_name": [f"screen_{i}" for i in range(len(ligands))],
|
||||
"protein_path": ["target_protein.pdb"] * len(ligands),
|
||||
"ligand_description": ligands["smiles"].tolist(),
|
||||
"protein_sequence": [""] * len(ligands)
|
||||
}
|
||||
|
||||
df = pd.DataFrame(screening_data)
|
||||
df.to_csv("screening_input.csv", index=False)
|
||||
```
|
||||
|
||||
### Run Screening
|
||||
|
||||
```bash
|
||||
# Pre-compute ESM embeddings for faster screening
|
||||
python datasets/esm_embedding_preparation.py \
|
||||
--protein_ligand_csv screening_input.csv \
|
||||
--out_file protein_embeddings.pt
|
||||
|
||||
# Run docking with pre-computed embeddings
|
||||
python -m inference \
|
||||
--config default_inference_args.yaml \
|
||||
--protein_ligand_csv screening_input.csv \
|
||||
--esm_embeddings_path protein_embeddings.pt \
|
||||
--out_dir results/virtual_screening/ \
|
||||
--batch_size 32
|
||||
```
|
||||
|
||||
### Post-Processing: Extract Top Hits
|
||||
|
||||
```python
|
||||
# analyze_screening_results.py
|
||||
import os
|
||||
import pandas as pd
|
||||
|
||||
results = []
|
||||
results_dir = "results/virtual_screening/"
|
||||
|
||||
for complex_dir in os.listdir(results_dir):
|
||||
confidence_file = os.path.join(results_dir, complex_dir, "confidence_scores.txt")
|
||||
if os.path.exists(confidence_file):
|
||||
with open(confidence_file) as f:
|
||||
scores = [float(line.strip()) for line in f]
|
||||
top_score = max(scores)
|
||||
results.append({"complex": complex_dir, "top_confidence": top_score})
|
||||
|
||||
# Sort by confidence
|
||||
df = pd.DataFrame(results)
|
||||
df_sorted = df.sort_values("top_confidence", ascending=False)
|
||||
|
||||
# Get top 100 hits
|
||||
top_hits = df_sorted.head(100)
|
||||
top_hits.to_csv("top_hits.csv", index=False)
|
||||
```
|
||||
|
||||
## Workflow 5: Ensemble Docking with Protein Flexibility
|
||||
|
||||
### Prepare Protein Ensemble
|
||||
|
||||
```python
|
||||
# For proteins with known flexibility, use multiple conformations
|
||||
# Example: Using MD snapshots or crystal structures
|
||||
|
||||
# create_ensemble_csv.py
|
||||
import pandas as pd
|
||||
|
||||
conformations = [
|
||||
"protein_conf1.pdb",
|
||||
"protein_conf2.pdb",
|
||||
"protein_conf3.pdb",
|
||||
"protein_conf4.pdb"
|
||||
]
|
||||
|
||||
ligand = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"
|
||||
|
||||
data = {
|
||||
"complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
|
||||
"protein_path": conformations,
|
||||
"ligand_description": [ligand] * len(conformations),
|
||||
"protein_sequence": [""] * len(conformations)
|
||||
}
|
||||
|
||||
pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
|
||||
```
|
||||
|
||||
### Run Ensemble Docking
|
||||
|
||||
```bash
|
||||
python -m inference \
|
||||
--config default_inference_args.yaml \
|
||||
--protein_ligand_csv ensemble_input.csv \
|
||||
--out_dir results/ensemble_docking/ \
|
||||
--samples_per_complex 20 # More samples per conformation
|
||||
```
|
||||
|
||||
## Workflow 6: Integration with Downstream Analysis
|
||||
|
||||
### Example: DiffDock + GNINA Rescoring
|
||||
|
||||
```bash
|
||||
# 1. Run DiffDock
|
||||
python -m inference \
|
||||
--config default_inference_args.yaml \
|
||||
--protein_path protein.pdb \
|
||||
--ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
|
||||
--out_dir results/diffdock_poses/ \
|
||||
--save_visualisation
|
||||
|
||||
# 2. Rescore with GNINA
|
||||
for pose in results/diffdock_poses/*.sdf; do
|
||||
gnina -r protein.pdb -l "$pose" --score_only -o "${pose%.sdf}_gnina.sdf"
|
||||
done
|
||||
```
|
||||
|
||||
### Example: DiffDock + OpenMM Energy Minimization
|
||||
|
||||
```python
|
||||
# minimize_poses.py
|
||||
from openmm import app, LangevinIntegrator, Platform
|
||||
from openmm.app import ForceField, Modeller, PDBFile
|
||||
from rdkit import Chem
|
||||
import os
|
||||
|
||||
# Load protein
|
||||
protein = PDBFile('protein.pdb')
|
||||
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')
|
||||
|
||||
# Process each DiffDock pose
|
||||
pose_dir = 'results/diffdock_poses/'
|
||||
for pose_file in os.listdir(pose_dir):
|
||||
if pose_file.endswith('.sdf'):
|
||||
# Load ligand
|
||||
mol = Chem.SDMolSupplier(os.path.join(pose_dir, pose_file))[0]
|
||||
|
||||
# Combine protein + ligand
|
||||
modeller = Modeller(protein.topology, protein.positions)
|
||||
# ... add ligand to modeller ...
|
||||
|
||||
# Create system and minimize
|
||||
system = forcefield.createSystem(modeller.topology)
|
||||
integrator = LangevinIntegrator(300, 1.0, 0.002)
|
||||
simulation = app.Simulation(modeller.topology, system, integrator)
|
||||
simulation.minimizeEnergy(maxIterations=1000)
|
||||
|
||||
# Save minimized structure
|
||||
positions = simulation.context.getState(getPositions=True).getPositions()
|
||||
PDBFile.writeFile(simulation.topology, positions,
|
||||
open(f"minimized_{pose_file}.pdb", 'w'))
|
||||
```
|
||||
|
||||
## Workflow 7: Using the Graphical Interface
|
||||
|
||||
### Launch Web Interface
|
||||
|
||||
```bash
|
||||
python app/main.py
|
||||
```
|
||||
|
||||
### Access Interface
|
||||
Navigate to `http://localhost:7860` in web browser
|
||||
|
||||
### Features
|
||||
- Upload protein PDB or enter sequence
|
||||
- Input ligand SMILES or upload structure
|
||||
- Adjust inference parameters via GUI
|
||||
- Visualize results interactively
|
||||
- Download predictions directly
|
||||
|
||||
### Online Alternative
|
||||
Use the Hugging Face Spaces demo without local installation:
|
||||
- URL: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
|
||||
|
||||
## Advanced Configuration
|
||||
|
||||
### Custom Inference Settings
|
||||
|
||||
Create custom YAML configuration:
|
||||
|
||||
```yaml
|
||||
# custom_inference.yaml
|
||||
# Model settings
|
||||
model_dir: ./workdir/v1.1/score_model
|
||||
confidence_model_dir: ./workdir/v1.1/confidence_model
|
||||
|
||||
# Sampling parameters
|
||||
samples_per_complex: 20 # More samples for better coverage
|
||||
inference_steps: 25 # More steps for accuracy
|
||||
|
||||
# Temperature adjustments (increase for more diversity)
|
||||
temp_sampling_tr: 1.3
|
||||
temp_sampling_rot: 2.2
|
||||
temp_sampling_tor: 7.5
|
||||
|
||||
# Output
|
||||
save_visualisation: true
|
||||
```
|
||||
|
||||
Use custom configuration:
|
||||
|
||||
```bash
|
||||
python -m inference \
|
||||
--config custom_inference.yaml \
|
||||
--protein_path protein.pdb \
|
||||
--ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
|
||||
--out_dir results/custom_config/
|
||||
```
|
||||
|
||||
## Troubleshooting Common Issues
|
||||
|
||||
### Issue: Out of Memory Errors
|
||||
|
||||
**Solution**: Reduce batch size
|
||||
```bash
|
||||
python -m inference ... --batch_size 2
|
||||
```
|
||||
|
||||
### Issue: Slow Performance
|
||||
|
||||
**Solution**: Ensure GPU usage
|
||||
```python
|
||||
import torch
|
||||
print(torch.cuda.is_available()) # Should return True
|
||||
```
|
||||
|
||||
### Issue: Poor Predictions for Large Ligands
|
||||
|
||||
**Solution**: Increase sampling diversity
|
||||
```bash
|
||||
python -m inference ... --samples_per_complex 40 --temp_sampling_tor 9.0
|
||||
```
|
||||
|
||||
### Issue: Protein with Many Chains
|
||||
|
||||
**Solution**: Limit chains or isolate binding site
|
||||
```bash
|
||||
python -m inference ... --chain_cutoff 4
|
||||
```
|
||||
|
||||
Or pre-process PDB to include only relevant chains.
|
||||
|
||||
## Best Practices Summary
|
||||
|
||||
1. **Start Simple**: Test with single complex before batch processing
|
||||
2. **GPU Essential**: Use GPU for reasonable performance
|
||||
3. **Multiple Samples**: Generate 10-40 samples for robust predictions
|
||||
4. **Validate Results**: Use molecular visualization and complementary scoring
|
||||
5. **Consider Confidence**: Use confidence scores for initial ranking, not final decisions
|
||||
6. **Iterate Parameters**: Adjust temperature/steps for specific systems
|
||||
7. **Pre-compute Embeddings**: For repeated use of same protein
|
||||
8. **Combine Tools**: Integrate with scoring functions and energy minimization
|
||||
Reference in New Issue
Block a user