# DiffDock Workflows and Examples

This document provides practical workflows and usage examples for common DiffDock tasks.

## Installation and Setup

### Conda Installation (Recommended)

```bash
# Clone repository
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock

# Create conda environment
conda env create --file environment.yml
conda activate diffdock
```

### Docker Installation

```bash
# Pull Docker image
docker pull rbgcsail/diffdock

# Run container with GPU support
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock

# Inside container, activate environment
micromamba activate diffdock
```

### First Run
The first execution pre-computes and caches the SO(2) and SO(3) lookup tables, which takes a few minutes. Subsequent runs reuse the cache and skip this step.

## Workflow 1: Single Protein-Ligand Docking

### Using PDB File and SMILES String

```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_path examples/protein.pdb \
  --ligand "COc1ccc(C(=O)Nc2ccccc2)cc1" \
  --out_dir results/single_docking/
```

**Output Structure**:
```
results/single_docking/
├── index_0_rank_1.sdf      # Top-ranked prediction
├── index_0_rank_2.sdf      # Second-ranked prediction
├── ...
├── index_0_rank_10.sdf     # 10th prediction (if samples_per_complex=10)
└── confidence_scores.txt   # Scores for all predictions
```

File names can differ between DiffDock versions (some releases embed the confidence score in each SDF file name), so check them against your own output directory.

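Sorting the pose files alphabetically puts `rank_10` ahead of `rank_2`, so it can help to sort on the numeric rank instead. The following is a minimal sketch that assumes the `index_<i>_rank_<n>.sdf` naming shown in the listing above; `ranked_poses` is a hypothetical helper, not part of DiffDock, and the pattern will need adjusting if your version names files differently.

```python
import re
from pathlib import Path

# Assumed naming pattern, taken from the listing above: index_<i>_rank_<n>.sdf
RANK_PATTERN = re.compile(r"index_(\d+)_rank_(\d+)\.sdf$")


def ranked_poses(out_dir):
    """Return (rank, path) pairs for one output directory, best rank first."""
    found = []
    for path in Path(out_dir).glob("*.sdf"):
        match = RANK_PATTERN.search(path.name)
        if match:
            found.append((int(match.group(2)), path))
    # Sort on the integer rank so rank 10 follows rank 9, not rank 1
    return sorted(found, key=lambda item: item[0])
```

Calling `ranked_poses("results/single_docking/")` would then give the top-ranked pose as the first element of the returned list.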
### Using Ligand Structure File

```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_path protein.pdb \
  --ligand ligand.sdf \
  --out_dir results/ligand_file/
```

**Supported ligand formats**: SDF, MOL2, or any format readable by RDKit

## Workflow 2: Protein Sequence to Structure Docking

### Using ESMFold for Protein Folding

```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK" \
  --ligand "CC(C)Cc1ccc(cc1)C(C)C(=O)O" \
  --out_dir results/sequence_docking/
```

**Use Cases**:
- Protein structure not available in PDB
- Modeling mutations or variants
- De novo protein design validation

**Note**: ESMFold folding adds computation time (roughly 30 s to 5 min, depending on sequence length)

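Sequences pasted from FASTA files or spreadsheets often carry line breaks, spaces, or lower-case letters. A small pre-check such as the sketch below can catch these before a folding run is started. `clean_sequence` is a hypothetical helper, and rejecting every letter outside the 20 standard amino acids is an assumption made here for safety; it does not describe what ESMFold or DiffDock themselves accept.

```python
# The 20 standard amino acids in one-letter code
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def clean_sequence(raw):
    """Strip whitespace, upper-case, and reject non-standard residue letters."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("empty protein sequence")
    invalid = sorted(set(sequence) - STANDARD_RESIDUES)
    if invalid:
        raise ValueError(f"unexpected residue letters: {''.join(invalid)}")
    return sequence
```

The cleaned string can then be passed to `--protein_sequence` or written into the `protein_sequence` column of a batch CSV.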
## Workflow 3: Batch Processing Multiple Complexes

### Prepare CSV File

Create `complexes.csv` with required columns:

```csv
complex_name,protein_path,ligand_description,protein_sequence
complex1,proteins/protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKF...
complex3,proteins/protein3.pdb,ligands/ligand3.sdf,
```

**Column Descriptions**:
- `complex_name`: Unique identifier for the complex
- `protein_path`: Path to PDB file (leave empty if using sequence)
- `ligand_description`: SMILES string or path to ligand file
- `protein_sequence`: Amino acid sequence (leave empty if using PDB)

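The column rules above can be checked before a long batch run. The sketch below uses only the standard library; `validate_complexes_csv` is a hypothetical helper, and treating a row that fills in both `protein_path` and `protein_sequence` as an error is a stricter reading of the column descriptions, not documented DiffDock behavior.

```python
import csv

REQUIRED_COLUMNS = ["complex_name", "protein_path", "ligand_description", "protein_sequence"]


def validate_complexes_csv(csv_path):
    """Return a list of human-readable problems found in the input CSV."""
    problems = []
    seen_names = set()
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
        if missing:
            return [f"missing columns: {', '.join(missing)}"]
        for line_no, row in enumerate(reader, start=2):  # line 1 is the header
            name = (row["complex_name"] or "").strip()
            has_path = bool((row["protein_path"] or "").strip())
            has_seq = bool((row["protein_sequence"] or "").strip())
            if not name:
                problems.append(f"line {line_no}: empty complex_name")
            elif name in seen_names:
                problems.append(f"line {line_no}: duplicate complex_name '{name}'")
            seen_names.add(name)
            # Assumption: exactly one protein source per row
            if has_path == has_seq:
                problems.append(f"line {line_no}: give exactly one of protein_path or protein_sequence")
            if not (row["ligand_description"] or "").strip():
                problems.append(f"line {line_no}: empty ligand_description")
    return problems
```

An empty list means the file passed these checks; the function does not verify that the referenced PDB or ligand files exist.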
### Run Batch Docking

```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_ligand_csv complexes.csv \
  --out_dir results/batch_predictions/ \
  --batch_size 10
```

**Output Structure**:
```
results/batch_predictions/
├── complex1/
│   ├── rank_1.sdf
│   ├── rank_2.sdf
│   └── ...
├── complex2/
│   ├── rank_1.sdf
│   └── ...
└── complex3/
    └── ...
```

## Workflow 4: High-Throughput Virtual Screening

### Setup for Screening Large Ligand Libraries

```python
# generate_screening_csv.py
import pandas as pd

# Load ligand library (expects a "smiles" column)
ligands = pd.read_csv("ligand_library.csv")

# Create DiffDock input: one row per ligand, all against the same target
screening_data = {
    "complex_name": [f"screen_{i}" for i in range(len(ligands))],
    "protein_path": ["target_protein.pdb"] * len(ligands),
    "ligand_description": ligands["smiles"].tolist(),
    "protein_sequence": [""] * len(ligands),
}

df = pd.DataFrame(screening_data)
df.to_csv("screening_input.csv", index=False)
```

### Run Screening

```bash
# Pre-compute ESM embeddings for faster screening
# (verify the expected --out_file format and the --esm_embeddings_path flag for your DiffDock version)
python datasets/esm_embedding_preparation.py \
  --protein_ligand_csv screening_input.csv \
  --out_file protein_embeddings.pt

# Run docking with pre-computed embeddings
python -m inference \
  --config default_inference_args.yaml \
  --protein_ligand_csv screening_input.csv \
  --esm_embeddings_path protein_embeddings.pt \
  --out_dir results/virtual_screening/ \
  --batch_size 32
```

### Post-Processing: Extract Top Hits

```python
# analyze_screening_results.py
import os
import pandas as pd

results = []
results_dir = "results/virtual_screening/"

for complex_dir in os.listdir(results_dir):
    confidence_file = os.path.join(results_dir, complex_dir, "confidence_scores.txt")
    if os.path.exists(confidence_file):
        with open(confidence_file) as f:
            # Skip blank lines so a trailing newline does not break float()
            scores = [float(line) for line in f if line.strip()]
        if scores:  # Ignore complexes with no recorded scores
            results.append({"complex": complex_dir, "top_confidence": max(scores)})

# Sort by confidence (higher is better)
df = pd.DataFrame(results, columns=["complex", "top_confidence"])
df_sorted = df.sort_values("top_confidence", ascending=False)

# Get top 100 hits
top_hits = df_sorted.head(100)
top_hits.to_csv("top_hits.csv", index=False)
```

## Workflow 5: Ensemble Docking with Protein Flexibility

### Prepare Protein Ensemble

```python
# For proteins with known flexibility, use multiple conformations
# Example: Using MD snapshots or crystal structures

# create_ensemble_csv.py
import pandas as pd

conformations = [
    "protein_conf1.pdb",
    "protein_conf2.pdb",
    "protein_conf3.pdb",
    "protein_conf4.pdb",
]

ligand = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"

data = {
    "complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
    "protein_path": conformations,
    "ligand_description": [ligand] * len(conformations),
    "protein_sequence": [""] * len(conformations),
}

pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
```

### Run Ensemble Docking

```bash
python -m inference \
  --config default_inference_args.yaml \
  --protein_ligand_csv ensemble_input.csv \
  --out_dir results/ensemble_docking/ \
  --samples_per_complex 20  # More samples per conformation
```

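After an ensemble run, a natural next step is to find which conformation produced the most confident pose. The sketch below assumes one subdirectory per `complex_name`, each holding a `confidence_scores.txt` with one score per line, which is the layout the post-processing script in Workflow 4 relies on. `best_conformation` is a hypothetical helper, and comparing confidence scores across different receptor conformations is a heuristic rather than a validated ranking.

```python
import os


def best_conformation(results_dir, score_file="confidence_scores.txt"):
    """Return (complex_name, best_score) for the highest-confidence conformation."""
    best = None
    for name in sorted(os.listdir(results_dir)):
        path = os.path.join(results_dir, name, score_file)
        if not os.path.isfile(path):
            continue  # Not a complex directory, or no scores were written
        with open(path) as handle:
            scores = [float(line) for line in handle if line.strip()]
        if scores and (best is None or max(scores) > best[1]):
            best = (name, max(scores))
    return best
```

For the run above, `best_conformation("results/ensemble_docking/")` returns `None` when no score files are found, so callers should handle that case.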
## Workflow 6: Integration with Downstream Analysis

### Example: DiffDock + GNINA Rescoring

```bash
# 1. Run DiffDock
python -m inference \
  --config default_inference_args.yaml \
  --protein_path protein.pdb \
  --ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
  --out_dir results/diffdock_poses/ \
  --save_visualisation

# 2. Rescore with GNINA
for pose in results/diffdock_poses/*.sdf; do
  # Skip outputs left over from a previous rescoring run
  case "$pose" in *_gnina.sdf) continue ;; esac
  gnina -r protein.pdb -l "$pose" --score_only -o "${pose%.sdf}_gnina.sdf"
done
```

### Example: DiffDock + OpenMM Energy Minimization

```python
# minimize_poses.py
import os

from openmm import LangevinIntegrator, unit
from openmm.app import ForceField, Modeller, PDBFile, Simulation
from rdkit import Chem

# Load protein
protein = PDBFile('protein.pdb')
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')

# Process each DiffDock pose
pose_dir = 'results/diffdock_poses/'
for pose_file in os.listdir(pose_dir):
    if pose_file.endswith('.sdf'):
        # Load ligand (RDKit returns None for records it cannot parse)
        mol = Chem.SDMolSupplier(os.path.join(pose_dir, pose_file))[0]
        if mol is None:
            continue

        # Combine protein + ligand
        modeller = Modeller(protein.topology, protein.positions)
        # ... add ligand to modeller ...
        # Note: the amber14 files carry no ligand parameters, so the ligand
        # needs its own small-molecule force-field template before createSystem.

        # Create system and minimize
        system = forcefield.createSystem(modeller.topology)
        integrator = LangevinIntegrator(
            300 * unit.kelvin, 1.0 / unit.picosecond, 0.002 * unit.picoseconds
        )
        simulation = Simulation(modeller.topology, system, integrator)
        simulation.context.setPositions(modeller.positions)  # Required before minimizing
        simulation.minimizeEnergy(maxIterations=1000)

        # Save minimized structure
        positions = simulation.context.getState(getPositions=True).getPositions()
        out_name = f"minimized_{os.path.splitext(pose_file)[0]}.pdb"
        with open(out_name, 'w') as out_file:
            PDBFile.writeFile(simulation.topology, positions, out_file)
```

## Workflow 7: Using the Graphical Interface

### Launch Web Interface

```bash
python app/main.py
```

### Access Interface
Navigate to `http://localhost:7860` in a web browser.

### Features
- Upload protein PDB or enter sequence
- Input ligand SMILES or upload structure
- Adjust inference parameters via GUI
- Visualize results interactively
- Download predictions directly

### Online Alternative
Use the Hugging Face Spaces demo without local installation:
- URL: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web

## Advanced Configuration

### Custom Inference Settings

Create a custom YAML configuration:

```yaml
# custom_inference.yaml
# Model settings
model_dir: ./workdir/v1.1/score_model
confidence_model_dir: ./workdir/v1.1/confidence_model

# Sampling parameters
samples_per_complex: 20  # More samples for better coverage
inference_steps: 25      # More denoising steps (slower)

# Temperature adjustments (increase for more diversity)
temp_sampling_tr: 1.3
temp_sampling_rot: 2.2
temp_sampling_tor: 7.5

# Output
save_visualisation: true
```

Use the custom configuration:

```bash
python -m inference \
  --config custom_inference.yaml \
  --protein_path protein.pdb \
  --ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
  --out_dir results/custom_config/
```

If a command-line flag seems to have no effect, check whether the same key is also set in the YAML file passed to `--config`.

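When several parameter settings need to be compared, it can be convenient to generate one configuration file per setting rather than editing the YAML by hand. The sketch below writes flat `key: value` files using only the standard library. `write_flat_yaml` is a hypothetical helper, the torsion temperatures are illustrative values rather than recommendations, and only scalar settings are handled; a YAML library would be the better choice for anything nested.

```python
def write_flat_yaml(path, settings):
    """Write a flat mapping of scalar settings as `key: value` YAML lines."""
    with open(path, "w") as handle:
        for key, value in settings.items():
            if isinstance(value, bool):
                value = "true" if value else "false"  # YAML booleans are lower-case
            handle.write(f"{key}: {value}\n")


# Base settings copied from custom_inference.yaml above
base = {
    "model_dir": "./workdir/v1.1/score_model",
    "confidence_model_dir": "./workdir/v1.1/confidence_model",
    "samples_per_complex": 20,
    "inference_steps": 25,
    "save_visualisation": True,
}

# One config file per torsion temperature (illustrative values only)
for temp in (5.0, 7.5, 9.0):
    settings = dict(base, temp_sampling_tor=temp)
    write_flat_yaml(f"custom_inference_tor_{temp}.yaml", settings)
```

Each generated file can then be passed to `--config` in a separate run, with a distinct `--out_dir` per setting so the results stay comparable.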
## Troubleshooting Common Issues

### Issue: Out of Memory Errors

**Solution**: Reduce batch size
```bash
python -m inference ... --batch_size 2
```

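If the right batch size for a given GPU is not known in advance, one option is to retry with progressively smaller values. The sketch below only produces the sequence of sizes to try; `batch_size_schedule` is a hypothetical helper, and launching inference and detecting the out-of-memory failure are left to the caller.

```python
def batch_size_schedule(start, minimum=1):
    """Yield batch sizes from `start` down to `minimum`, halving each time."""
    size = start
    while size > minimum:
        yield size
        size //= 2
    yield minimum  # Always finish with the smallest allowed size
```

A wrapper script could loop over `batch_size_schedule(32)`, pass each value to `--batch_size`, and stop at the first run that completes.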
### Issue: Slow Performance

**Solution**: Confirm that PyTorch can see the GPU
```python
import torch
print(torch.cuda.is_available())  # Should print True
```

### Issue: Poor Predictions for Large Ligands

**Solution**: Increase sampling diversity
```bash
python -m inference ... --samples_per_complex 40 --temp_sampling_tor 9.0
```

### Issue: Protein with Many Chains

**Solution**: Limit chains or isolate binding site
```bash
python -m inference ... --chain_cutoff 4
```

Or pre-process the PDB file to include only the relevant chains.

## Best Practices Summary

1. **Start Simple**: Test with a single complex before batch processing
2. **GPU Essential**: Use a GPU for reasonable performance
3. **Multiple Samples**: Generate 10-40 samples for robust predictions
4. **Validate Results**: Use molecular visualization and complementary scoring
5. **Consider Confidence**: Use confidence scores for initial ranking, not final decisions
6. **Iterate Parameters**: Adjust temperature/steps for specific systems
7. **Pre-compute Embeddings**: Do this when the same protein is reused across runs
8. **Combine Tools**: Integrate with scoring functions and energy minimization