Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

skills/diffdock/SKILL.md

@@ -0,0 +1,477 @@
---
name: diffdock
description: "Diffusion-based molecular docking. Predict protein-ligand binding poses from PDB/SMILES, confidence scores, virtual screening, for structure-based drug design. Not for affinity prediction."
---
# DiffDock: Molecular Docking with Diffusion Models
## Overview
DiffDock is a diffusion-based deep learning tool for molecular docking that predicts the 3D binding poses of small-molecule ligands on protein targets. It was introduced for blind docking (no binding site supplied) and is used in structure-based drug discovery and chemical biology.
**Core Capabilities:**
- Predict ligand binding poses with high accuracy using deep learning
- Support protein structures (PDB files) or sequences (via ESMFold)
- Process single complexes or batch virtual screening campaigns
- Generate confidence scores to assess prediction reliability
- Handle diverse ligand inputs (SMILES, SDF, MOL2)
**Key Distinction:** DiffDock predicts **binding poses** (3D structure) and **confidence** (prediction certainty), NOT binding affinity (ΔG, Kd). Always combine with scoring functions (GNINA, MM/GBSA) for affinity assessment.
## When to Use This Skill
Use this skill when the user's request or task matches any of the following:
- "Dock this ligand to a protein" or "predict binding pose"
- "Run molecular docking" or "perform protein-ligand docking"
- "Virtual screening" or "screen compound library"
- "Where does this molecule bind?" or "predict binding site"
- Structure-based drug design or lead optimization tasks
- Tasks involving PDB files + SMILES strings or ligand structures
- Batch docking of multiple protein-ligand pairs
## Installation and Environment Setup
### Check Environment Status
Before proceeding with DiffDock tasks, verify the environment setup:
```bash
# Use the provided setup checker
python scripts/setup_check.py
```
This script validates Python version, PyTorch with CUDA, PyTorch Geometric, RDKit, ESM, and other dependencies.
### Installation Options
**Option 1: Conda (Recommended)**
```bash
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock
conda env create --file environment.yml
conda activate diffdock
```
**Option 2: Docker**
```bash
docker pull rbgcsail/diffdock
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
micromamba activate diffdock
```
**Important Notes:**
- GPU strongly recommended (10-100x speedup vs CPU)
- First run pre-computes SO(2)/SO(3) lookup tables (~2-5 minutes)
- Model checkpoints (~500MB) download automatically if not present
## Core Workflows
### Workflow 1: Single Protein-Ligand Docking
**Use Case:** Dock one ligand to one protein target
**Input Requirements:**
- Protein: PDB file OR amino acid sequence
- Ligand: SMILES string OR structure file (SDF/MOL2)
**Command:**
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_path protein.pdb \
--ligand "CC(=O)Oc1ccccc1C(=O)O" \
--out_dir results/single_docking/
```
**Alternative (protein sequence):**
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKF..." \
--ligand ligand.sdf \
--out_dir results/sequence_docking/
```
**Output Structure:**
```
results/single_docking/
├── rank_1.sdf # Top-ranked pose
├── rank_2.sdf # Second-ranked pose
├── ...
├── rank_10.sdf # 10th pose (default: 10 samples)
└── confidence_scores.txt
```
### Workflow 2: Batch Processing Multiple Complexes
**Use Case:** Dock multiple ligands to proteins, virtual screening campaigns
**Step 1: Prepare Batch CSV**
Use the provided script to create or validate batch input:
```bash
# Create template
python scripts/prepare_batch_csv.py --create --output batch_input.csv
# Validate existing CSV
python scripts/prepare_batch_csv.py my_input.csv --validate
```
**CSV Format:**
```csv
complex_name,protein_path,ligand_description,protein_sequence
complex1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFT...
complex3,protein3.pdb,ligand3.sdf,
```
**Required Columns:**
- `complex_name`: Unique identifier
- `protein_path`: PDB file path (leave empty if using sequence)
- `ligand_description`: SMILES string or ligand file path
- `protein_sequence`: Amino acid sequence (leave empty if using PDB)
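The column rules above can be checked before a run. The following is a minimal sketch, not the logic of `scripts/prepare_batch_csv.py`; the function name `validate_batch_csv` is hypothetical, and the "exactly one of `protein_path` or `protein_sequence`" rule is an assumption drawn from the "leave empty" guidance above.

```python
import csv
import io

REQUIRED_COLUMNS = ["complex_name", "protein_path", "ligand_description", "protein_sequence"]


def validate_batch_csv(text):
    """Return a list of human-readable problems found in a batch CSV string."""
    reader = csv.DictReader(io.StringIO(text))
    header = reader.fieldnames or []
    missing = [c for c in REQUIRED_COLUMNS if c not in header]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]

    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        name = (row["complex_name"] or "").strip()
        has_pdb = bool((row["protein_path"] or "").strip())
        has_seq = bool((row["protein_sequence"] or "").strip())
        if not name:
            problems.append(f"line {line_no}: empty complex_name")
        elif name in seen:
            problems.append(f"line {line_no}: duplicate complex_name '{name}'")
        seen.add(name)
        # Assumption: a row supplies a PDB path or a sequence, never both or neither.
        if has_pdb == has_seq:
            problems.append(f"line {line_no}: set exactly one of protein_path or protein_sequence")
        if not (row["ligand_description"] or "").strip():
            problems.append(f"line {line_no}: empty ligand_description")
    return problems
```

Calling `validate_batch_csv(open("batch_input.csv").read())` returns an empty list when every row passes these checks. It does not verify that files exist or that SMILES strings parse.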
**Step 2: Run Batch Docking**
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv batch_input.csv \
--out_dir results/batch/ \
--batch_size 10
```
**For Large Virtual Screening (>100 compounds):**
Pre-compute protein embeddings for faster processing:
```bash
# Pre-compute embeddings
python datasets/esm_embedding_preparation.py \
--protein_ligand_csv screening_input.csv \
--out_file protein_embeddings.pt
# Run with pre-computed embeddings
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv screening_input.csv \
--esm_embeddings_path protein_embeddings.pt \
--out_dir results/screening/
```
### Workflow 3: Analyzing Results
After docking completes, analyze confidence scores and rank predictions:
```bash
# Analyze all results
python scripts/analyze_results.py results/batch/
# Show top 5 per complex
python scripts/analyze_results.py results/batch/ --top 5
# Filter by confidence threshold
python scripts/analyze_results.py results/batch/ --threshold 0.0
# Export to CSV
python scripts/analyze_results.py results/batch/ --export summary.csv
# Show top 20 predictions across all complexes
python scripts/analyze_results.py results/batch/ --best 20
```
The analysis script:
- Parses confidence scores from all predictions
- Classifies as High (>0), Moderate (-1.5 to 0), or Low (<-1.5)
- Ranks predictions within and across complexes
- Generates statistical summaries
- Exports results to CSV for downstream analysis
## Confidence Score Interpretation
**Understanding Scores:**
| Score Range | Confidence Level | Interpretation |
|------------|------------------|----------------|
| **> 0** | High | Strong prediction, likely accurate |
| **-1.5 to 0** | Moderate | Reasonable prediction, validate carefully |
| **< -1.5** | Low | Uncertain prediction, requires validation |
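
The thresholds in the table translate directly into code. This is a small sketch with hypothetical function names, not the implementation in `scripts/analyze_results.py`; scores of exactly 0 or -1.5 fall into the Moderate band, as the table's ranges imply.

```python
from collections import Counter


def classify_confidence(score):
    """Map a DiffDock confidence score to the level used in the table."""
    if score > 0:
        return "High"
    if score >= -1.5:
        return "Moderate"
    return "Low"


def summarize_levels(scores):
    """Count how many poses fall into each confidence level."""
    return Counter(classify_confidence(s) for s in scores)
```

For example, `summarize_levels([0.4, -0.2, -3.0])` counts one pose in each level.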
**Critical Notes:**
1. **Confidence ≠ Affinity**: High confidence means model certainty about structure, NOT strong binding
2. **Context Matters**: Adjust expectations for:
- Large ligands (>500 Da): Lower confidence expected
- Multiple protein chains: May decrease confidence
- Novel protein families: May underperform
3. **Multiple Samples**: Review top 3-5 predictions, look for consensus
**For detailed guidance:** Read `references/confidence_and_limitations.md` using the Read tool
## Parameter Customization
### Using Custom Configuration
Create custom configuration for specific use cases:
```bash
# Copy template
cp assets/custom_inference_config.yaml my_config.yaml
# Edit parameters (see template for presets)
# Then run with custom config
python -m inference \
--config my_config.yaml \
--protein_ligand_csv input.csv \
--out_dir results/
```
### Key Parameters to Adjust
**Sampling Density:**
- `samples_per_complex: 10` → Increase to 20-40 for difficult cases
- More samples = better coverage but longer runtime
**Inference Steps:**
- `inference_steps: 20` → Increase to 25-30 for higher accuracy
- More steps = potentially better quality but slower
**Temperature Parameters (control diversity):**
- `temp_sampling_tor: 7.04` → Increase (8-10) for flexible ligands, decrease (5-6) for rigid ligands
- Higher temperature = more diverse poses
**Presets Available in Template:**
1. High Accuracy: More samples + steps, lower temperature
2. Fast Screening: Fewer samples, faster
3. Flexible Ligands: Increased torsion temperature
4. Rigid Ligands: Decreased torsion temperature
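
Because command-line arguments override the YAML configuration, a preset can also be applied as extra flags without editing a file. The sketch below copies the preset values from the commented presets in `assets/custom_inference_config.yaml`; the dictionary keys and the helper name `preset_cli_args` are hypothetical.

```python
# Preset values copied from the commented presets in the config template.
PRESETS = {
    "high_accuracy": {"samples_per_complex": 30, "inference_steps": 25,
                      "temp_sampling_tr": 1.0, "temp_sampling_rot": 1.8, "temp_sampling_tor": 6.5},
    "fast_screening": {"samples_per_complex": 5, "inference_steps": 15,
                       "temp_sampling_tr": 1.3, "temp_sampling_rot": 2.2, "temp_sampling_tor": 7.5},
    "flexible_ligands": {"samples_per_complex": 20, "inference_steps": 20,
                         "temp_sampling_tr": 1.2, "temp_sampling_rot": 2.1, "temp_sampling_tor": 8.5},
    "rigid_ligands": {"samples_per_complex": 10, "inference_steps": 20,
                      "temp_sampling_tr": 1.1, "temp_sampling_rot": 2.0, "temp_sampling_tor": 6.0},
}


def preset_cli_args(name):
    """Turn a preset into a flat list of `--flag value` tokens for `python -m inference`."""
    return [token
            for key, value in PRESETS[name].items()
            for token in (f"--{key}", str(value))]
```

The returned list can be appended to the base command, for example `["python", "-m", "inference", "--config", "default_inference_args.yaml", *preset_cli_args("flexible_ligands")]`.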
**For complete parameter reference:** Read `references/parameters_reference.md` using the Read tool
## Advanced Techniques
### Ensemble Docking (Protein Flexibility)
For proteins with known flexibility, dock to multiple conformations:
```python
# Create ensemble CSV
import pandas as pd
conformations = ["conf1.pdb", "conf2.pdb", "conf3.pdb"]
ligand = "CC(=O)Oc1ccccc1C(=O)O"
data = {
"complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
"protein_path": conformations,
"ligand_description": [ligand] * len(conformations),
"protein_sequence": [""] * len(conformations)
}
pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
```
Run docking with increased sampling:
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv ensemble_input.csv \
--samples_per_complex 20 \
--out_dir results/ensemble/
```
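Once the ensemble run finishes, the conformations can be compared by their best confidence score. This is a minimal sketch that assumes the scores have already been collected into a dictionary keyed by `complex_name`; reading them from disk depends on the output layout and is not shown. The helper name `best_conformation` is hypothetical.

```python
def best_conformation(scores_by_complex):
    """Return (complex_name, top_score) for the conformation whose best pose scores highest."""
    # Ignore conformations that produced no poses.
    tops = {name: max(scores) for name, scores in scores_by_complex.items() if scores}
    if not tops:
        raise ValueError("no confidence scores supplied")
    name = max(tops, key=tops.get)
    return name, tops[name]
```

The selected conformation is the one the model is most certain about, not necessarily the one with the strongest binding, so rescoring is still needed.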
### Integration with Scoring Functions
DiffDock generates poses; combine with other tools for affinity:
**GNINA (Fast neural network scoring):**
```bash
for pose in results/*.sdf; do
gnina -r protein.pdb -l "$pose" --score_only
done
```
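To rank poses programmatically, the GNINA output can be captured and parsed. The sketch below reuses the flags from the loop above. The output line format (`Affinity: ... (kcal/mol)`, `CNNscore: ...`, `CNNaffinity: ...`) is an assumption that should be checked against the installed GNINA version, and both function names are hypothetical.

```python
import re
import subprocess

# Assumed line format of `gnina --score_only` output, e.g. "Affinity: -7.25 (kcal/mol)".
_SCORE_LINE = re.compile(
    r"^\s*(Affinity|CNNscore|CNNaffinity):\s*(-?\d+(?:\.\d+)?)", re.MULTILINE
)


def parse_gnina_scores(text):
    """Extract the first Affinity / CNNscore / CNNaffinity values from GNINA output."""
    scores = {}
    for key, value in _SCORE_LINE.findall(text):
        scores.setdefault(key, float(value))  # keep the first occurrence only
    return scores


def score_pose(protein_pdb, pose_sdf):
    """Run GNINA in score-only mode on one pose (requires `gnina` on PATH)."""
    result = subprocess.run(
        ["gnina", "-r", protein_pdb, "-l", pose_sdf, "--score_only"],
        capture_output=True, text=True, check=True,
    )
    return parse_gnina_scores(result.stdout)
```

If the parser returns an empty dictionary, the output format differs from the assumption and the regular expression needs adjusting.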
**MM/GBSA (More accurate, slower):**
Use AmberTools MMPBSA.py or gmx_MMPBSA after energy minimization
**Free Energy Calculations (Most accurate):**
Use OpenMM + OpenFE or GROMACS for FEP/TI calculations
**Recommended Workflow:**
1. DiffDock → Generate poses with confidence scores
2. Visual inspection → Check structural plausibility
3. GNINA or MM/GBSA → Rescore and rank by affinity
4. Experimental validation → Biochemical assays
## Limitations and Scope
**DiffDock IS Designed For:**
- Small molecule ligands (typically 100-1000 Da)
- Drug-like organic compounds
- Small peptides (<20 residues)
- Single or multi-chain proteins
**DiffDock IS NOT Designed For:**
- Large biomolecules (protein-protein docking) → Use DiffDock-PP or AlphaFold-Multimer
- Large peptides (>20 residues) → Use alternative methods
- Covalent docking → Use specialized covalent docking tools
- Binding affinity prediction → Combine with scoring functions
- Membrane proteins → Not specifically trained, use with caution
**For complete limitations:** Read `references/confidence_and_limitations.md` using the Read tool
## Troubleshooting
### Common Issues
**Issue: Low confidence scores across all predictions**
- Cause: Large/unusual ligands, unclear binding site, protein flexibility
- Solution: Increase `samples_per_complex` (20-40), try ensemble docking, validate protein structure
**Issue: Out of memory errors**
- Cause: GPU memory insufficient for batch size
- Solution: Reduce the batch size (e.g., `--batch_size 2`) or process fewer complexes at once
**Issue: Slow performance**
- Cause: Running on CPU instead of GPU
- Solution: Verify CUDA with `python -c "import torch; print(torch.cuda.is_available())"`, use GPU
**Issue: Unrealistic binding poses**
- Cause: Poor protein preparation, ligand too large, wrong binding site
- Solution: Check the protein for missing residues, remove waters far from the binding site, and confirm the expected site is accessible (DiffDock performs blind docking over the whole protein)
**Issue: "Module not found" errors**
- Cause: Missing dependencies or wrong environment
- Solution: Run `python scripts/setup_check.py` to diagnose
### Performance Optimization
**For Best Results:**
1. Use GPU (essential for practical use)
2. Pre-compute ESM embeddings for repeated protein use
3. Batch process multiple complexes together
4. Start with default parameters, then tune if needed
5. Validate protein structures (resolve missing residues)
6. Use canonical SMILES for ligands
## Graphical User Interface
For interactive use, launch the web interface:
```bash
python app/main.py
# Navigate to http://localhost:7860
```
Or use the online demo without installation:
- https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
## Resources
### Helper Scripts (`scripts/`)
**`prepare_batch_csv.py`**: Create and validate batch input CSV files
- Create templates with example entries
- Validate file paths and SMILES strings
- Check for required columns and format issues
**`analyze_results.py`**: Analyze confidence scores and rank predictions
- Parse results from single or batch runs
- Generate statistical summaries
- Export to CSV for downstream analysis
- Identify top predictions across complexes
**`setup_check.py`**: Verify DiffDock environment setup
- Check Python version and dependencies
- Verify PyTorch and CUDA availability
- Test RDKit and PyTorch Geometric installation
- Provide installation instructions if needed
### Reference Documentation (`references/`)
**`parameters_reference.md`**: Complete parameter documentation
- All command-line options and configuration parameters
- Default values and acceptable ranges
- Temperature parameters for controlling diversity
- Model checkpoint locations and version flags
Read this file when users need:
- Detailed parameter explanations
- Fine-tuning guidance for specific systems
- Alternative sampling strategies
**`confidence_and_limitations.md`**: Confidence score interpretation and tool limitations
- Detailed confidence score interpretation
- When to trust predictions
- Scope and limitations of DiffDock
- Integration with complementary tools
- Troubleshooting prediction quality
Read this file when users need:
- Help interpreting confidence scores
- Understanding when NOT to use DiffDock
- Guidance on combining with other tools
- Validation strategies
**`workflows_examples.md`**: Comprehensive workflow examples
- Detailed installation instructions
- Step-by-step examples for all workflows
- Advanced integration patterns
- Troubleshooting common issues
- Best practices and optimization tips
Read this file when users need:
- Complete workflow examples with code
- Integration with GNINA, OpenMM, or other tools
- Virtual screening workflows
- Ensemble docking procedures
### Assets (`assets/`)
**`batch_template.csv`**: Template for batch processing
- Pre-formatted CSV with required columns
- Example entries showing different input types
- Ready to customize with actual data
**`custom_inference_config.yaml`**: Configuration template
- Annotated YAML with all parameters
- Four preset configurations for common use cases
- Detailed comments explaining each parameter
- Ready to customize and use
## Best Practices
1. **Always verify environment** with `setup_check.py` before starting large jobs
2. **Validate batch CSVs** with `prepare_batch_csv.py` to catch errors early
3. **Start with defaults** then tune parameters based on system-specific needs
4. **Generate multiple samples** (10-40) for robust predictions
5. **Visual inspection** of top poses before downstream analysis
6. **Combine with scoring** functions for affinity assessment
7. **Use confidence scores** for initial ranking, not final decisions
8. **Pre-compute embeddings** for virtual screening campaigns
9. **Document parameters** used for reproducibility
10. **Validate results** experimentally when possible
## Citations
When using DiffDock, cite the appropriate papers:
**DiffDock-L (current default model):**
```
Corso et al. (2024) "Deep Confident Steps to New Pockets: Strategies for Docking Generalization"
ICLR 2024, arXiv:2402.18396
```
**Original DiffDock:**
```
Corso et al. (2023) "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
ICLR 2023, arXiv:2210.01776
```
## Additional Resources
- **GitHub Repository**: https://github.com/gcorso/DiffDock
- **Online Demo**: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
- **DiffDock-L Paper**: https://arxiv.org/abs/2402.18396
- **Original Paper**: https://arxiv.org/abs/2210.01776


@@ -0,0 +1,4 @@
complex_name,protein_path,ligand_description,protein_sequence
example_1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
example_2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
example_3,protein3.pdb,ligand3.sdf,


@@ -0,0 +1,90 @@
# DiffDock Custom Inference Configuration Template
# Copy and modify this file to customize inference parameters
# Model paths (usually don't need to change these)
model_dir: ./workdir/v1.1/score_model
confidence_model_dir: ./workdir/v1.1/confidence_model
ckpt: best_ema_inference_epoch_model.pt
confidence_ckpt: best_model_epoch75.pt
# Model version flags
old_score_model: false # Set to true to use original DiffDock instead of DiffDock-L
old_filtering_model: true
# Inference steps
inference_steps: 20 # Increase for potentially better accuracy (e.g., 25-30)
actual_steps: 19
no_final_step_noise: true
# Sampling parameters
samples_per_complex: 10 # Increase for difficult cases (e.g., 20-40)
sigma_schedule: expbeta
initial_noise_std_proportion: 1.46
# Temperature controls - Adjust these to balance exploration vs accuracy
# Higher values = more diverse predictions, lower values = more focused predictions
# Sampling temperatures
temp_sampling_tr: 1.17 # Translation sampling temperature
temp_sampling_rot: 2.06 # Rotation sampling temperature
temp_sampling_tor: 7.04 # Torsion sampling temperature (increase for flexible ligands)
# Psi angle temperatures
temp_psi_tr: 0.73
temp_psi_rot: 0.90
temp_psi_tor: 0.59
# Sigma data temperatures
temp_sigma_data_tr: 0.93
temp_sigma_data_rot: 0.75
temp_sigma_data_tor: 0.69
# Feature flags
no_model: false
no_random: false
ode: false # Set to true to use ODE solver instead of SDE
different_schedules: false
limit_failures: 5
# Output settings
# save_visualisation: true # Uncomment to save SDF files
# ============================================================================
# Configuration Presets for Common Use Cases
# ============================================================================
# PRESET 1: High Accuracy (slower, more thorough)
# samples_per_complex: 30
# inference_steps: 25
# temp_sampling_tr: 1.0
# temp_sampling_rot: 1.8
# temp_sampling_tor: 6.5
# PRESET 2: Fast Screening (faster, less thorough)
# samples_per_complex: 5
# inference_steps: 15
# temp_sampling_tr: 1.3
# temp_sampling_rot: 2.2
# temp_sampling_tor: 7.5
# PRESET 3: Flexible Ligands (more conformational diversity)
# samples_per_complex: 20
# inference_steps: 20
# temp_sampling_tr: 1.2
# temp_sampling_rot: 2.1
# temp_sampling_tor: 8.5 # Increased torsion temperature
# PRESET 4: Rigid Ligands (more focused predictions)
# samples_per_complex: 10
# inference_steps: 20
# temp_sampling_tr: 1.1
# temp_sampling_rot: 2.0
# temp_sampling_tor: 6.0 # Decreased torsion temperature
# ============================================================================
# Usage Example
# ============================================================================
# python -m inference \
# --config custom_inference_config.yaml \
# --protein_ligand_csv input.csv \
# --out_dir results/


@@ -0,0 +1,182 @@
# DiffDock Confidence Scores and Limitations
This document provides detailed guidance on interpreting DiffDock confidence scores and understanding the tool's limitations.
## Confidence Score Interpretation
DiffDock generates a confidence score for each predicted binding pose. This score indicates the model's certainty about the prediction.
### Score Ranges
| Score Range | Confidence Level | Interpretation |
|------------|------------------|----------------|
| **> 0** | High confidence | Strong prediction, likely accurate binding pose |
| **-1.5 to 0** | Moderate confidence | Reasonable prediction, may need validation |
| **< -1.5** | Low confidence | Uncertain prediction, requires careful validation |
### Important Notes on Confidence Scores
1. **Not Binding Affinity**: Confidence scores reflect prediction certainty, NOT binding affinity strength
- High confidence = model is confident about the structure
- Does NOT indicate strong/weak binding affinity
2. **Context-Dependent**: Confidence scores should be adjusted based on system complexity:
- **Lower expectations** for:
- Large ligands (>500 Da)
- Protein complexes with many chains
- Unbound protein conformations (may require conformational changes)
- Novel protein families not well-represented in training data
- **Higher expectations** for:
- Drug-like small molecules (150-500 Da)
- Single-chain proteins or well-defined binding sites
- Proteins similar to those in training data (PDBBind, BindingMOAD)
3. **Multiple Predictions**: DiffDock generates multiple samples per complex (default: 10)
- Review top-ranked predictions (by confidence)
- Consider clustering similar poses
- High-confidence consensus across multiple samples strengthens prediction
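
Consensus among samples can be estimated by clustering poses on pairwise RMSD. The following is a minimal sketch, assuming each pose is a list of `(x, y, z)` heavy-atom coordinates with identical atom ordering, sorted best confidence first. The 2.0 Å cutoff is an assumed, commonly used docking threshold. No superposition or symmetry correction is applied, since all poses share the protein's coordinate frame.

```python
import math


def rmsd(a, b):
    """In-place RMSD between two poses given as equal-length lists of (x, y, z)."""
    if len(a) != len(b) or not a:
        raise ValueError("poses must have the same, non-zero number of atoms")
    total = sum((p - q) ** 2 for u, v in zip(a, b) for p, q in zip(u, v))
    return math.sqrt(total / len(a))


def cluster_poses(poses, cutoff=2.0):
    """Greedy clustering of poses ordered best-confidence first.

    Each pose joins the first cluster whose representative (its first member)
    lies within `cutoff` angstroms; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of pose indices.
    """
    clusters = []
    for i, pose in enumerate(poses):
        for members in clusters:
            if rmsd(poses[members[0]], pose) <= cutoff:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

A large first cluster that contains the top-ranked pose suggests that independent samples converged on the same binding mode.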
## What DiffDock Predicts
### ✅ DiffDock DOES Predict
- **Binding poses**: 3D spatial orientation of ligand in protein binding site
- **Confidence scores**: Model's certainty about predictions
- **Multiple conformations**: Various possible binding modes
### ❌ DiffDock DOES NOT Predict
- **Binding affinity**: Strength of protein-ligand interaction (ΔG, Kd, Ki)
- **Binding kinetics**: On/off rates, residence time
- **ADMET properties**: Absorption, distribution, metabolism, excretion, toxicity
- **Selectivity**: Relative binding to different targets
## Scope and Limitations
### Designed For
- **Small molecule docking**: Organic compounds typically 100-1000 Da
- **Protein targets**: Single or multi-chain proteins
- **Small peptides**: Short peptide ligands (< ~20 residues)
- **Small nucleic acids**: Short oligonucleotides
### NOT Designed For
- **Large biomolecules**: Full protein-protein interactions
- Use DiffDock-PP, AlphaFold-Multimer, or RoseTTAFold2NA instead
- **Large peptides/proteins**: >20 residues as ligands
- **Covalent docking**: Irreversible covalent bond formation
- **Metalloprotein specifics**: May not accurately handle metal coordination
- **Membrane proteins**: Not specifically trained on membrane-embedded proteins
### Training Data Considerations
DiffDock was trained on:
- **PDBBind**: Diverse protein-ligand complexes
- **BindingMOAD**: Curated database of protein-ligand crystal structures
**Implications**:
- Best performance on proteins/ligands similar to training data
- May underperform on:
- Novel protein families
- Unusual ligand chemotypes
- Allosteric sites not well-represented in training data
## Validation and Complementary Tools
### Recommended Workflow
1. **Generate poses with DiffDock**
- Use confidence scores for initial ranking
- Consider multiple high-confidence predictions
2. **Visual Inspection**
- Examine protein-ligand interactions in molecular viewer
- Check for reasonable:
- Hydrogen bonds
- Hydrophobic interactions
- Steric complementarity
- Electrostatic interactions
3. **Scoring and Refinement** (choose one or more):
- **GNINA**: Deep learning-based scoring function
- **Molecular mechanics**: Energy minimization and refinement
- **MM/GBSA or MM/PBSA**: Binding free energy estimation
- **Free energy calculations**: FEP or TI for accurate affinity prediction
4. **Experimental Validation**
- Biochemical assays (IC50, Kd measurements)
- Structural validation (X-ray crystallography, cryo-EM)
### Tools for Binding Affinity Assessment
DiffDock should be combined with these tools for affinity prediction:
- **GNINA**: Fast CNN-based scoring function
- GitHub: github.com/gnina/gnina
- **AutoDock Vina**: Classical docking and scoring
- Website: vina.scripps.edu
- **Free Energy Calculations**:
- OpenMM + OpenFE
- GROMACS + ABFE/RBFE protocols
- **MM/GBSA Tools**:
- MMPBSA.py (AmberTools)
- gmx_MMPBSA
## Performance Optimization
### For Best Results
1. **Protein Preparation**:
- Remove water molecules far from binding site
- Resolve missing residues if possible
- Consider protonation states at physiological pH
2. **Ligand Input**:
- Provide reasonable 3D conformers when using structure files
- Use canonical SMILES for consistent results
- Pre-process with RDKit if needed
3. **Computational Resources**:
- GPU strongly recommended (10-100x speedup)
- First run pre-computes lookup tables (takes a few minutes)
- Batch processing more efficient than single predictions
4. **Parameter Tuning**:
- Increase `samples_per_complex` for difficult cases (20-40)
- Adjust temperature parameters for diversity/accuracy trade-off
- Use pre-computed ESM embeddings for repeated predictions
## Common Issues and Troubleshooting
### Low Confidence Scores
- **Large/flexible ligands**: Consider splitting into fragments or using alternative methods
- **Multiple binding sites**: May predict multiple locations with distributed confidence
- **Protein flexibility**: Consider using ensemble of protein conformations
### Unrealistic Predictions
- **Clashes**: May indicate need for protein preparation or refinement
- **Surface binding**: Check if true binding site is blocked or unclear
- **Unusual poses**: Consider increasing samples to explore more conformations
### Slow Performance
- **Use GPU**: Essential for reasonable runtime
- **Pre-compute embeddings**: Reuse ESM embeddings for same protein
- **Batch processing**: More efficient than sequential individual predictions
- **Reduce samples**: Lower `samples_per_complex` for quick screening
## Citation and Further Reading
For methodology details and benchmarking results, see:
1. **Original DiffDock Paper** (ICLR 2023):
- "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
- Corso et al., arXiv:2210.01776
2. **DiffDock-L Paper** (2024):
- Enhanced model with improved generalization
- Corso et al., "Deep Confident Steps to New Pockets: Strategies for Docking Generalization", arXiv:2402.18396
3. **PoseBusters Benchmark**:
- Rigorous docking evaluation framework
- Used for DiffDock validation


@@ -0,0 +1,163 @@
# DiffDock Configuration Parameters Reference
This document provides comprehensive details on all DiffDock configuration parameters and command-line options.
## Model & Checkpoint Settings
### Model Paths
- **`--model_dir`**: Directory containing the score model checkpoint
- Default: `./workdir/v1.1/score_model`
- DiffDock-L model (current default)
- **`--confidence_model_dir`**: Directory containing the confidence model checkpoint
- Default: `./workdir/v1.1/confidence_model`
- **`--ckpt`**: Name of the score model checkpoint file
- Default: `best_ema_inference_epoch_model.pt`
- **`--confidence_ckpt`**: Name of the confidence model checkpoint file
- Default: `best_model_epoch75.pt`
### Model Version Flags
- **`--old_score_model`**: Use original DiffDock model instead of DiffDock-L
- Default: `false` (uses DiffDock-L)
- **`--old_filtering_model`**: Use legacy confidence filtering approach
- Default: `true`
## Input/Output Options
### Input Specification
- **`--protein_path`**: Path to protein PDB file
- Example: `--protein_path protein.pdb`
- Alternative to `--protein_sequence`
- **`--protein_sequence`**: Amino acid sequence for ESMFold folding
- Automatically generates protein structure from sequence
- Alternative to `--protein_path`
- **`--ligand`**: Ligand specification (SMILES string or file path)
- SMILES string: `--ligand "COc(cc1)ccc1C#N"`
- File path: `--ligand ligand.sdf` or `.mol2`
- **`--protein_ligand_csv`**: CSV file for batch processing
- Required columns: `complex_name`, `protein_path`, `ligand_description`, `protein_sequence`
- Example: `--protein_ligand_csv data/protein_ligand_example.csv`
### Output Control
- **`--out_dir`**: Output directory for predictions
- Example: `--out_dir results/user_predictions/`
- **`--save_visualisation`**: Export predicted molecules as SDF files
- Enables visualization of results
## Inference Parameters
### Diffusion Steps
- **`--inference_steps`**: Number of planned inference iterations
- Default: `20`
- Higher values may improve accuracy but increase runtime
- **`--actual_steps`**: Actual diffusion steps executed
- Default: `19`
- **`--no_final_step_noise`**: Omit noise at the final diffusion step
- Default: `true`
### Sampling Settings
- **`--samples_per_complex`**: Number of samples to generate per complex
- Default: `10`
- More samples provide better coverage but increase computation
- **`--sigma_schedule`**: Noise schedule type
- Default: `expbeta` (exponential-beta)
- **`--initial_noise_std_proportion`**: Initial noise standard deviation scaling
- Default: `1.46`
### Temperature Parameters
#### Sampling Temperatures (Controls diversity of predictions)
- **`--temp_sampling_tr`**: Translation sampling temperature
- Default: `1.17`
- **`--temp_sampling_rot`**: Rotation sampling temperature
- Default: `2.06`
- **`--temp_sampling_tor`**: Torsion sampling temperature
- Default: `7.04`
#### Psi Angle Temperatures
- **`--temp_psi_tr`**: Translation psi temperature
- Default: `0.73`
- **`--temp_psi_rot`**: Rotation psi temperature
- Default: `0.90`
- **`--temp_psi_tor`**: Torsion psi temperature
- Default: `0.59`
#### Sigma Data Temperatures
- **`--temp_sigma_data_tr`**: Translation data distribution scaling
- Default: `0.93`
- **`--temp_sigma_data_rot`**: Rotation data distribution scaling
- Default: `0.75`
- **`--temp_sigma_data_tor`**: Torsion data distribution scaling
- Default: `0.69`
## Processing Options
### Performance
- **`--batch_size`**: Processing batch size
- Default: `10`
- Larger values increase throughput but require more memory
- **`--tqdm`**: Enable progress bar visualization
- Useful for monitoring long-running jobs
### Protein Structure
- **`--chain_cutoff`**: Maximum number of protein chains to process
- Example: `--chain_cutoff 10`
- Useful for large multi-chain complexes
- **`--esm_embeddings_path`**: Path to pre-computed ESM2 protein embeddings
- Speeds up inference by reusing embeddings
- Optional optimization
### Dataset Options
- **`--split`**: Dataset split to use (train/test/val)
- Used for evaluation on standard benchmarks
## Advanced Flags
### Debugging & Testing
- **`--no_model`**: Disable model inference (debugging)
- Default: `false`
- **`--no_random`**: Disable randomization
- Default: `false`
- Useful for reproducibility testing
### Alternative Sampling
- **`--ode`**: Use ODE solver instead of SDE
- Default: `false`
- Alternative sampling approach
- **`--different_schedules`**: Use different noise schedules per component
- Default: `false`
### Error Handling
- **`--limit_failures`**: Maximum allowed failures before stopping
- Default: `5`
## Configuration File
All parameters can be specified in a YAML configuration file (typically `default_inference_args.yaml`) or overridden via command line:
```bash
python -m inference --config default_inference_args.yaml --samples_per_complex 20
```
Command-line arguments take precedence over configuration file values.
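
The precedence rule can be pictured as a dictionary merge. This sketch is illustrative only and is not DiffDock's argument-handling code; the function name `resolve_params` is hypothetical, and `None` stands for a flag that was not passed on the command line.

```python
def resolve_params(config, cli_overrides):
    """Merge YAML-derived values with command-line values; the command line wins."""
    resolved = dict(config)  # start from the configuration file
    resolved.update({k: v for k, v in cli_overrides.items() if v is not None})
    return resolved
```

With `config = {"samples_per_complex": 10, "inference_steps": 20}` and `cli_overrides = {"samples_per_complex": 20, "inference_steps": None}`, only `samples_per_complex` changes.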


@@ -0,0 +1,392 @@
# DiffDock Workflows and Examples
This document provides practical workflows and usage examples for common DiffDock tasks.
## Installation and Setup
### Conda Installation (Recommended)
```bash
# Clone repository
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock
# Create conda environment
conda env create --file environment.yml
conda activate diffdock
```
### Docker Installation
```bash
# Pull Docker image
docker pull rbgcsail/diffdock
# Run container with GPU support
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
# Inside container, activate environment
micromamba activate diffdock
```
### First Run
The first execution pre-computes SO(2) and SO(3) lookup tables, taking a few minutes. Subsequent runs start immediately.
## Workflow 1: Single Protein-Ligand Docking
### Using PDB File and SMILES String
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_path examples/protein.pdb \
--ligand "COc1ccc(C(=O)Nc2ccccc2)cc1" \
--out_dir results/single_docking/
```
**Output Structure**:
```
results/single_docking/
├── index_0_rank_1.sdf # Top-ranked prediction
├── index_0_rank_2.sdf # Second-ranked prediction
├── ...
├── index_0_rank_10.sdf # 10th prediction (if samples_per_complex=10)
└── confidence_scores.txt # Scores for all predictions
```
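When reading these files back, sort by the numeric rank rather than by filename, since a plain string sort places `rank_10` before `rank_2`. A minimal sketch, assuming the `rank_N` naming shown above (`ranked_poses` is a hypothetical helper):

```python
import re
from pathlib import Path


def ranked_poses(out_dir):
    """Return (rank, path) pairs for pose files in out_dir, best rank first."""
    poses = []
    for sdf in Path(out_dir).glob("*.sdf"):
        match = re.search(r"rank_(\d+)", sdf.name)
        if match:
            poses.append((int(match.group(1)), sdf))
    # Sort on the integer rank, not on the filename string.
    return sorted(poses, key=lambda pair: pair[0])
```

For example, `ranked_poses("results/single_docking/")[0]` would give the top-ranked pose, if any pose files are present.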
### Using Ligand Structure File
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_path protein.pdb \
--ligand ligand.sdf \
--out_dir results/ligand_file/
```
**Supported ligand formats**: SDF, MOL2, or any format readable by RDKit
## Workflow 2: Protein Sequence to Structure Docking
### Using ESMFold for Protein Folding
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK" \
--ligand "CC(C)Cc1ccc(cc1)C(C)C(=O)O" \
--out_dir results/sequence_docking/
```
**Use Cases**:
- Protein structure not available in PDB
- Modeling mutations or variants
- De novo protein design validation
**Note**: ESMFold folding adds computation time (30s-5min depending on sequence length)
## Workflow 3: Batch Processing Multiple Complexes
### Prepare CSV File
Create `complexes.csv` with required columns:
```csv
complex_name,protein_path,ligand_description,protein_sequence
complex1,proteins/protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKF...
complex3,proteins/protein3.pdb,ligands/ligand3.sdf,
```
**Column Descriptions**:
- `complex_name`: Unique identifier for the complex
- `protein_path`: Path to PDB file (leave empty if using sequence)
- `ligand_description`: SMILES string or path to ligand file
- `protein_sequence`: Amino acid sequence (leave empty if using PDB)
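The column rules above can be checked before launching a long batch run. The following is a minimal sketch using only the standard library; `check_rows` and `REQUIRED_COLUMNS` are hypothetical names for illustration, and the check covers only the rules listed here (it does not verify that files exist or that SMILES strings parse).

```python
import csv
import io

REQUIRED_COLUMNS = ["complex_name", "protein_path", "ligand_description", "protein_sequence"]


def check_rows(csv_text):
    """Return a list of problems found in DiffDock-style batch CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        return [f"missing columns: {', '.join(missing)}"]
    problems = []
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        # DictReader yields None for fields missing at the end of a short row.
        has_path = bool((row["protein_path"] or "").strip())
        has_seq = bool((row["protein_sequence"] or "").strip())
        if not (has_path or has_seq):
            problems.append(f"line {line_no}: no protein_path or protein_sequence")
        if not (row["ligand_description"] or "").strip():
            problems.append(f"line {line_no}: empty ligand_description")
    return problems
```

An empty list means every row names a protein source and a ligand.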
### Run Batch Docking
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv complexes.csv \
--out_dir results/batch_predictions/ \
--batch_size 10
```
**Output Structure**:
```
results/batch_predictions/
├── complex1/
│   ├── rank_1.sdf
│   ├── rank_2.sdf
│   └── ...
├── complex2/
│   ├── rank_1.sdf
│   └── ...
└── complex3/
    └── ...
```
## Workflow 4: High-Throughput Virtual Screening
### Setup for Screening Large Ligand Libraries
```python
# generate_screening_csv.py
import pandas as pd

# Load ligand library
ligands = pd.read_csv("ligand_library.csv")  # Contains SMILES

# Create DiffDock input
screening_data = {
    "complex_name": [f"screen_{i}" for i in range(len(ligands))],
    "protein_path": ["target_protein.pdb"] * len(ligands),
    "ligand_description": ligands["smiles"].tolist(),
    "protein_sequence": [""] * len(ligands)
}
df = pd.DataFrame(screening_data)
df.to_csv("screening_input.csv", index=False)
```
### Run Screening
```bash
# Pre-compute ESM embeddings for faster screening
python datasets/esm_embedding_preparation.py \
--protein_ligand_csv screening_input.csv \
--out_file protein_embeddings.pt
# Run docking with pre-computed embeddings
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv screening_input.csv \
--esm_embeddings_path protein_embeddings.pt \
--out_dir results/virtual_screening/ \
--batch_size 32
```
### Post-Processing: Extract Top Hits
```python
# analyze_screening_results.py
import os
import pandas as pd

results = []
results_dir = "results/virtual_screening/"
for complex_dir in os.listdir(results_dir):
    confidence_file = os.path.join(results_dir, complex_dir, "confidence_scores.txt")
    if os.path.exists(confidence_file):
        with open(confidence_file) as f:
            # Skip blank lines so float() does not fail on them
            scores = [float(line) for line in f if line.strip()]
        if scores:  # max() raises on an empty list
            results.append({"complex": complex_dir, "top_confidence": max(scores)})

# Sort by confidence (explicit columns keep this working when no results were found)
df = pd.DataFrame(results, columns=["complex", "top_confidence"])
df_sorted = df.sort_values("top_confidence", ascending=False)

# Get top 100 hits
top_hits = df_sorted.head(100)
top_hits.to_csv("top_hits.csv", index=False)
```
## Workflow 5: Ensemble Docking with Protein Flexibility
### Prepare Protein Ensemble
```python
# For proteins with known flexibility, use multiple conformations
# Example: Using MD snapshots or crystal structures
# create_ensemble_csv.py
import pandas as pd

conformations = [
    "protein_conf1.pdb",
    "protein_conf2.pdb",
    "protein_conf3.pdb",
    "protein_conf4.pdb"
]
ligand = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"
data = {
    "complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
    "protein_path": conformations,
    "ligand_description": [ligand] * len(conformations),
    "protein_sequence": [""] * len(conformations)
}
pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
```
### Run Ensemble Docking
```bash
python -m inference \
--config default_inference_args.yaml \
--protein_ligand_csv ensemble_input.csv \
--out_dir results/ensemble_docking/ \
--samples_per_complex 20 # More samples per conformation
```
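After the ensemble run, one simple way to compare conformations is by the best confidence score each one produced. The sketch below assumes the scores have already been read into a dict keyed by complex name; `best_conformation` is a hypothetical helper and the numbers are placeholders, not real DiffDock output.

```python
def best_conformation(scores_by_conformation):
    """Return (name, score) for the conformation whose best pose scores highest."""
    best = {
        name: max(scores)
        for name, scores in scores_by_conformation.items()
        if scores  # skip conformations with no successful poses
    }
    if not best:
        raise ValueError("no confidence scores available")
    name = max(best, key=best.get)
    return name, best[name]


# Placeholder confidence values for illustration only.
example = {
    "ensemble_0": [-0.8, -1.9],
    "ensemble_1": [0.3, -0.2],
    "ensemble_2": [],
}
print(best_conformation(example))
```

Confidence reflects pose reliability, not affinity, so this selects the conformation with the most trustworthy pose rather than the strongest binder.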
## Workflow 6: Integration with Downstream Analysis
### Example: DiffDock + GNINA Rescoring
```bash
# 1. Run DiffDock
python -m inference \
--config default_inference_args.yaml \
--protein_path protein.pdb \
--ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
--out_dir results/diffdock_poses/ \
--save_visualisation
# 2. Rescore with GNINA
for pose in results/diffdock_poses/*.sdf; do
  gnina -r protein.pdb -l "$pose" --score_only -o "${pose%.sdf}_gnina.sdf"
done
```
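Rescoring tools typically attach their scores to the output SDF as data items (`> <name>` followed by the value). The source does not state which property names GNINA writes, so this sketch makes no assumption about them and simply collects whatever data items the first record contains. `sdf_properties` is a hypothetical helper written with the standard library only.

```python
import re


def sdf_properties(sdf_text):
    """Parse the data items of the first record in SDF text into a dict."""
    props = {}
    name = None
    for line in sdf_text.splitlines():
        if line.startswith("$$$$"):  # end of the first record
            break
        header = re.match(r">\s*<([^>]+)>", line)
        if header:
            name = header.group(1)
            props[name] = []
        elif name is not None:
            if line.strip():
                props[name].append(line.strip())
            else:
                name = None  # a blank line terminates the data item
    return {key: "\n".join(values) for key, values in props.items()}
```

Calling `sdf_properties(open(path).read())` on each rescored file gives a dict that can be collected into a table alongside the DiffDock rank.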
### Example: DiffDock + OpenMM Energy Minimization
```python
# minimize_poses.py
import os

from openmm import LangevinIntegrator
from openmm.app import ForceField, Modeller, PDBFile, Simulation
from rdkit import Chem

# Load protein
protein = PDBFile('protein.pdb')
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')

# Process each DiffDock pose
pose_dir = 'results/diffdock_poses/'
for pose_file in os.listdir(pose_dir):
    if not pose_file.endswith('.sdf'):
        continue
    # Load ligand
    mol = Chem.SDMolSupplier(os.path.join(pose_dir, pose_file))[0]

    # Combine protein + ligand
    modeller = Modeller(protein.topology, protein.positions)
    # ... add ligand to modeller ...
    # Note: the Amber force field has no parameters for arbitrary ligands, so
    # createSystem() fails on the ligand unless ligand templates are supplied.

    # Create system and minimize
    system = forcefield.createSystem(modeller.topology)
    integrator = LangevinIntegrator(300, 1.0, 0.002)  # kelvin, 1/ps, ps
    simulation = Simulation(modeller.topology, system, integrator)
    simulation.context.setPositions(modeller.positions)  # required before minimizing
    simulation.minimizeEnergy(maxIterations=1000)

    # Save minimized structure (e.g. rank_1.sdf -> minimized_rank_1.pdb)
    positions = simulation.context.getState(getPositions=True).getPositions()
    out_name = f"minimized_{os.path.splitext(pose_file)[0]}.pdb"
    with open(out_name, 'w') as out:
        PDBFile.writeFile(simulation.topology, positions, out)
```
## Workflow 7: Using the Graphical Interface
### Launch Web Interface
```bash
python app/main.py
```
### Access Interface
Navigate to `http://localhost:7860` in a web browser.
### Features
- Upload protein PDB or enter sequence
- Input ligand SMILES or upload structure
- Adjust inference parameters via GUI
- Visualize results interactively
- Download predictions directly
### Online Alternative
Use the Hugging Face Spaces demo without local installation:
- URL: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
## Advanced Configuration
### Custom Inference Settings
Create custom YAML configuration:
```yaml
# custom_inference.yaml
# Model settings
model_dir: ./workdir/v1.1/score_model
confidence_model_dir: ./workdir/v1.1/confidence_model
# Sampling parameters
samples_per_complex: 20 # More samples for better coverage
inference_steps: 25 # More steps for accuracy
# Temperature adjustments (increase for more diversity)
temp_sampling_tr: 1.3
temp_sampling_rot: 2.2
temp_sampling_tor: 7.5
# Output
save_visualisation: true
```
Use custom configuration:
```bash
python -m inference \
--config custom_inference.yaml \
--protein_path protein.pdb \
--ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
--out_dir results/custom_config/
```
## Troubleshooting Common Issues
### Issue: Out of Memory Errors
**Solution**: Reduce batch size
```bash
python -m inference ... --batch_size 2
```
### Issue: Slow Performance
**Solution**: Ensure GPU usage
```python
import torch
print(torch.cuda.is_available()) # Should return True
```
### Issue: Poor Predictions for Large Ligands
**Solution**: Increase sampling diversity
```bash
python -m inference ... --samples_per_complex 40 --temp_sampling_tor 9.0
```
### Issue: Protein with Many Chains
**Solution**: Limit chains or isolate binding site
```bash
python -m inference ... --chain_cutoff 4
```
Or pre-process PDB to include only relevant chains.
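Pre-processing can be done with a few lines of plain Python, because the PDB format stores the chain ID in column 22 of each `ATOM`/`HETATM` record. This is a minimal sketch: `keep_chains` is a hypothetical helper, it keeps only coordinate records for the chosen chains (so waters and hetero groups on those chains are retained), and it drops all other record types.

```python
def keep_chains(pdb_text, chains):
    """Keep ATOM/HETATM records whose chain ID (column 22) is in chains."""
    wanted = set(chains)
    kept = []
    for line in pdb_text.splitlines():
        record = line[:6].strip()
        # Column 22 is index 21 when counting from zero.
        if record in {"ATOM", "HETATM"} and len(line) > 21 and line[21] in wanted:
            kept.append(line)
    kept.append("END")
    return "\n".join(kept) + "\n"
```

For example, writing `keep_chains(open("protein.pdb").read(), ["A"])` to a new file gives a single-chain input for `--protein_path`.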
## Best Practices Summary
1. **Start Simple**: Test with single complex before batch processing
2. **GPU Essential**: Use GPU for reasonable performance
3. **Multiple Samples**: Generate 10-40 samples for robust predictions
4. **Validate Results**: Use molecular visualization and complementary scoring
5. **Consider Confidence**: Use confidence scores for initial ranking, not final decisions
6. **Iterate Parameters**: Adjust temperature/steps for specific systems
7. **Pre-compute Embeddings**: For repeated use of same protein
8. **Combine Tools**: Integrate with scoring functions and energy minimization

@@ -0,0 +1,334 @@
#!/usr/bin/env python3
"""
DiffDock Results Analysis Script
This script analyzes DiffDock prediction results, extracting confidence scores,
ranking predictions, and generating summary reports.
Usage:
python analyze_results.py results/output_dir/
python analyze_results.py results/ --top 50 --threshold 0.0
python analyze_results.py results/ --export summary.csv
"""
import argparse
import os
import sys
import json
from pathlib import Path
from collections import defaultdict
import re
def parse_confidence_scores(results_dir):
    """
    Parse confidence scores from DiffDock output directory.

    Args:
        results_dir: Path to DiffDock results directory

    Returns:
        dict: Dictionary mapping complex names to their predictions and scores
    """
    results = {}
    results_path = Path(results_dir)
    # Check if this is a single complex or batch results
    sdf_files = list(results_path.glob("*.sdf"))
    if sdf_files:
        # Single complex output; skip it if no ranked poses were found,
        # since parse_single_complex() returns None in that case
        complex_results = parse_single_complex(results_path)
        if complex_results:
            results['single_complex'] = complex_results
    else:
        # Batch output - multiple subdirectories
        for subdir in results_path.iterdir():
            if subdir.is_dir():
                complex_results = parse_single_complex(subdir)
                if complex_results:
                    results[subdir.name] = complex_results
    return results
def parse_single_complex(complex_dir):
    """Parse results for a single complex."""
    predictions = []
    # Look for SDF files with rank information
    for sdf_file in complex_dir.glob("*.sdf"):
        filename = sdf_file.name
        # Extract rank from filename (e.g., "rank_1.sdf" or "index_0_rank_1.sdf")
        rank_match = re.search(r'rank_(\d+)', filename)
        if rank_match:
            rank = int(rank_match.group(1))
            # Try to extract confidence score from filename or separate file
            confidence = extract_confidence_score(sdf_file, complex_dir)
            predictions.append({
                'rank': rank,
                'file': sdf_file.name,
                'path': str(sdf_file),
                'confidence': confidence
            })
    # Sort by rank
    predictions.sort(key=lambda x: x['rank'])
    return {'predictions': predictions} if predictions else None
def extract_confidence_score(sdf_file, complex_dir):
    """
    Extract confidence score for a prediction.

    Tries multiple methods:
    1. Read from confidence_scores.txt file
    2. Parse from SDF file properties
    3. Extract from filename if present
    """
    # Method 1: confidence_scores.txt
    confidence_file = complex_dir / "confidence_scores.txt"
    if confidence_file.exists():
        try:
            with open(confidence_file) as f:
                lines = f.readlines()
            # Extract rank from filename
            rank_match = re.search(r'rank_(\d+)', sdf_file.name)
            if rank_match:
                rank = int(rank_match.group(1))
                if rank <= len(lines):
                    return float(lines[rank - 1].strip())
        except Exception:
            pass
    # Method 2: Parse from SDF file
    try:
        with open(sdf_file) as f:
            content = f.read()
        # Look for confidence score in SDF properties
        conf_match = re.search(r'confidence[:\s]+(-?\d+\.?\d*)', content, re.IGNORECASE)
        if conf_match:
            return float(conf_match.group(1))
    except Exception:
        pass
    # Method 3: Filename (e.g., "rank_1_conf_0.95.sdf")
    conf_match = re.search(r'conf_(-?\d+\.?\d*)', sdf_file.name)
    if conf_match:
        return float(conf_match.group(1))
    return None
def classify_confidence(score):
    """Classify confidence score into categories."""
    if score is None:
        return "Unknown"
    elif score > 0:
        return "High"
    elif score > -1.5:
        return "Moderate"
    else:
        return "Low"
def print_summary(results, top_n=None, min_confidence=None):
    """Print a formatted summary of results."""
    print("\n" + "="*80)
    print("DiffDock Results Summary")
    print("="*80)
    all_predictions = []
    for complex_name, data in results.items():
        predictions = data.get('predictions', [])
        print(f"\n{complex_name}")
        print("-" * 80)
        if not predictions:
            print("  No predictions found")
            continue
        # Filter by confidence if specified
        filtered_predictions = predictions
        if min_confidence is not None:
            filtered_predictions = [p for p in predictions if p['confidence'] is not None and p['confidence'] >= min_confidence]
        # Limit to top N if specified
        if top_n is not None:
            filtered_predictions = filtered_predictions[:top_n]
        for pred in filtered_predictions:
            confidence = pred['confidence']
            confidence_class = classify_confidence(confidence)
            conf_str = f"{confidence:>7.3f}" if confidence is not None else "    N/A"
            print(f"  Rank {pred['rank']:2d}: Confidence = {conf_str} ({confidence_class:8s}) | {pred['file']}")
            # Add to all predictions for overall statistics
            if confidence is not None:
                all_predictions.append((complex_name, pred['rank'], confidence))
        # Show statistics for this complex
        if filtered_predictions and any(p['confidence'] is not None for p in filtered_predictions):
            confidences = [p['confidence'] for p in filtered_predictions if p['confidence'] is not None]
            print(f"\n  Statistics: {len(filtered_predictions)} predictions")
            print(f"    Mean confidence: {sum(confidences)/len(confidences):.3f}")
            print(f"    Max confidence:  {max(confidences):.3f}")
            print(f"    Min confidence:  {min(confidences):.3f}")
    # Overall statistics
    if all_predictions:
        print("\n" + "="*80)
        print("Overall Statistics")
        print("="*80)
        confidences = [conf for _, _, conf in all_predictions]
        print(f"  Total predictions: {len(all_predictions)}")
        print(f"  Total complexes:   {len(results)}")
        print(f"  Mean confidence:   {sum(confidences)/len(confidences):.3f}")
        print(f"  Max confidence:    {max(confidences):.3f}")
        print(f"  Min confidence:    {min(confidences):.3f}")
        # Confidence distribution (same boundaries as classify_confidence)
        high = sum(1 for c in confidences if c > 0)
        moderate = sum(1 for c in confidences if -1.5 < c <= 0)
        low = sum(1 for c in confidences if c <= -1.5)
        print(f"\n  Confidence distribution:")
        print(f"    High     (> 0):       {high:4d} ({100*high/len(confidences):5.1f}%)")
        print(f"    Moderate (-1.5 to 0): {moderate:4d} ({100*moderate/len(confidences):5.1f}%)")
        print(f"    Low      (<= -1.5):   {low:4d} ({100*low/len(confidences):5.1f}%)")
    print("\n" + "="*80)
def export_to_csv(results, output_path):
    """Export results to CSV file."""
    import csv
    with open(output_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['complex_name', 'rank', 'confidence', 'confidence_class', 'file_path'])
        for complex_name, data in results.items():
            predictions = data.get('predictions', [])
            for pred in predictions:
                confidence = pred['confidence']
                confidence_class = classify_confidence(confidence)
                conf_value = confidence if confidence is not None else ''
                writer.writerow([
                    complex_name,
                    pred['rank'],
                    conf_value,
                    confidence_class,
                    pred['path']
                ])
    print(f"✓ Exported results to: {output_path}")
def get_top_predictions(results, n=10, sort_by='confidence'):
    """Get top N predictions across all complexes."""
    all_predictions = []
    for complex_name, data in results.items():
        predictions = data.get('predictions', [])
        for pred in predictions:
            if pred['confidence'] is not None:
                all_predictions.append({
                    'complex': complex_name,
                    **pred
                })
    # Sort by confidence (descending)
    all_predictions.sort(key=lambda x: x['confidence'], reverse=True)
    return all_predictions[:n]
def print_top_predictions(results, n=10):
    """Print top N predictions across all complexes."""
    top_preds = get_top_predictions(results, n)
    print("\n" + "="*80)
    print(f"Top {n} Predictions Across All Complexes")
    print("="*80)
    for i, pred in enumerate(top_preds, 1):
        confidence_class = classify_confidence(pred['confidence'])
        print(f"{i:2d}. {pred['complex']:30s} | Rank {pred['rank']:2d} | "
              f"Confidence: {pred['confidence']:7.3f} ({confidence_class})")
    print("="*80)
def main():
    parser = argparse.ArgumentParser(
        description='Analyze DiffDock prediction results',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Analyze all results in directory
  python analyze_results.py results/output_dir/

  # Show only top 5 predictions per complex
  python analyze_results.py results/ --top 5

  # Filter by confidence threshold
  python analyze_results.py results/ --threshold 0.0

  # Export to CSV
  python analyze_results.py results/ --export summary.csv

  # Show top 20 predictions across all complexes
  python analyze_results.py results/ --best 20
"""
    )
    parser.add_argument('results_dir', help='Path to DiffDock results directory')
    parser.add_argument('--top', '-t', type=int,
                        help='Show only top N predictions per complex')
    parser.add_argument('--threshold', type=float,
                        help='Minimum confidence threshold')
    parser.add_argument('--export', '-e', metavar='FILE',
                        help='Export results to CSV file')
    parser.add_argument('--best', '-b', type=int, metavar='N',
                        help='Show top N predictions across all complexes')
    args = parser.parse_args()
    # Validate results directory
    if not os.path.exists(args.results_dir):
        print(f"Error: Results directory not found: {args.results_dir}")
        return 1
    # Parse results
    print(f"Analyzing results in: {args.results_dir}")
    results = parse_confidence_scores(args.results_dir)
    if not results:
        print("No DiffDock results found in directory")
        return 1
    # Print summary
    print_summary(results, top_n=args.top, min_confidence=args.threshold)
    # Print top predictions across all complexes
    if args.best:
        print_top_predictions(results, args.best)
    # Export to CSV if requested
    if args.export:
        export_to_csv(results, args.export)
    return 0


if __name__ == '__main__':
    sys.exit(main())

@@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""
DiffDock Batch CSV Preparation and Validation Script
This script helps prepare and validate CSV files for DiffDock batch processing.
It checks for required columns, validates file paths, and ensures SMILES strings
are properly formatted.
Usage:
python prepare_batch_csv.py input.csv --validate
python prepare_batch_csv.py --create --output batch_input.csv
"""
import argparse
import os
import sys
import pandas as pd
from pathlib import Path
try:
    from rdkit import Chem
    from rdkit import RDLogger
    RDLogger.DisableLog('rdApp.*')
    RDKIT_AVAILABLE = True
except ImportError:
    RDKIT_AVAILABLE = False
    print("Warning: RDKit not available. SMILES validation will be skipped.")
def validate_smiles(smiles_string):
    """Validate a SMILES string using RDKit."""
    if not RDKIT_AVAILABLE:
        return True, "RDKit not available for validation"
    try:
        mol = Chem.MolFromSmiles(smiles_string)
        if mol is None:
            return False, "Invalid SMILES structure"
        return True, "Valid SMILES"
    except Exception as e:
        return False, str(e)
def validate_file_path(file_path, base_dir=None):
    """Validate that a file path exists."""
    if pd.isna(file_path) or file_path == "":
        return True, "Empty (will use protein_sequence)"
    # Handle relative paths
    if base_dir:
        full_path = Path(base_dir) / file_path
    else:
        full_path = Path(file_path)
    if full_path.exists():
        return True, f"File exists: {full_path}"
    else:
        return False, f"File not found: {full_path}"
def validate_csv(csv_path, base_dir=None):
    """
    Validate a DiffDock batch input CSV file.

    Args:
        csv_path: Path to CSV file
        base_dir: Base directory for relative paths (default: CSV directory)

    Returns:
        bool: True if validation passes
        list: List of validation messages
    """
    messages = []
    valid = True
    # Read CSV
    try:
        df = pd.read_csv(csv_path)
        messages.append(f"✓ Successfully read CSV with {len(df)} rows")
    except Exception as e:
        messages.append(f"✗ Error reading CSV: {e}")
        return False, messages
    # Check required columns
    required_cols = ['complex_name', 'protein_path', 'ligand_description', 'protein_sequence']
    missing_cols = [col for col in required_cols if col not in df.columns]
    if missing_cols:
        messages.append(f"✗ Missing required columns: {', '.join(missing_cols)}")
        # The row checks below index these columns, so stop here rather
        # than raise a KeyError.
        return False, messages
    messages.append("✓ All required columns present")
    # Set base directory
    if base_dir is None:
        base_dir = Path(csv_path).parent
    # Extensions treated as ligand files (.mol and .pdb are assumptions
    # beyond the SDF/MOL2 formats named in the documentation)
    ligand_exts = ('.sdf', '.mol2', '.mol', '.pdb')
    # Validate each row
    for idx, row in df.iterrows():
        row_msgs = []
        # Check complex name
        if pd.isna(row['complex_name']) or row['complex_name'] == "":
            row_msgs.append("Missing complex_name")
            valid = False
        # Check that either protein_path or protein_sequence is provided
        has_protein_path = not pd.isna(row['protein_path']) and row['protein_path'] != ""
        has_protein_seq = not pd.isna(row['protein_sequence']) and row['protein_sequence'] != ""
        if not has_protein_path and not has_protein_seq:
            row_msgs.append("Must provide either protein_path or protein_sequence")
            valid = False
        elif has_protein_path and has_protein_seq:
            row_msgs.append("Warning: Both protein_path and protein_sequence provided, will use protein_path")
        # Validate protein path if provided
        if has_protein_path:
            file_valid, msg = validate_file_path(row['protein_path'], base_dir)
            if not file_valid:
                row_msgs.append(f"Protein file issue: {msg}")
                valid = False
        # Validate ligand description
        if pd.isna(row['ligand_description']) or row['ligand_description'] == "":
            row_msgs.append("Missing ligand_description")
            valid = False
        else:
            ligand_desc = str(row['ligand_description'])
            # Treat the value as a file only if it exists or has a structure-file
            # extension. "/" and "\" are valid SMILES bond symbols (cis/trans
            # stereochemistry), so their presence does not indicate a path.
            looks_like_file = (
                os.path.exists(os.path.join(base_dir, ligand_desc))
                or ligand_desc.lower().endswith(ligand_exts)
            )
            if looks_like_file:
                file_valid, msg = validate_file_path(ligand_desc, base_dir)
                if not file_valid:
                    row_msgs.append(f"Ligand file issue: {msg}")
                    valid = False
            else:
                # Likely a SMILES string
                smiles_valid, msg = validate_smiles(ligand_desc)
                if not smiles_valid:
                    row_msgs.append(f"SMILES issue: {msg}")
                    valid = False
        if row_msgs:
            messages.append(f"\nRow {idx + 1} ({row.get('complex_name', 'unnamed')}):")
            for msg in row_msgs:
                messages.append(f"  - {msg}")
    # Summary
    messages.append(f"\n{'='*60}")
    if valid:
        messages.append("✓ CSV validation PASSED - ready for DiffDock")
    else:
        messages.append("✗ CSV validation FAILED - please fix issues above")
    return valid, messages
def create_template_csv(output_path, num_examples=3):
    """Create a template CSV file with example entries."""
    gfp_sequence = 'MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK'
    # (protein_path, ligand_description, protein_sequence)
    base_rows = [
        ('protein1.pdb', 'CC(=O)Oc1ccccc1C(=O)O', ''),  # PDB file + aspirin SMILES
        ('', 'COc1ccc(C#N)cc1', gfp_sequence),          # GFP sequence + example SMILES
        ('protein3.pdb', 'ligand.sdf', ''),             # PDB file + example ligand file
    ]
    # Cycle through the base rows so that --num-examples above 3 still
    # yields the requested number of rows.
    rows = [base_rows[i % len(base_rows)] for i in range(num_examples)]
    examples = {
        'complex_name': [f'example{i + 1}' for i in range(num_examples)],
        'protein_path': [row[0] for row in rows],
        'ligand_description': [row[1] for row in rows],
        'protein_sequence': [row[2] for row in rows],
    }
    df = pd.DataFrame(examples)
    df.to_csv(output_path, index=False)
    return df
def main():
    parser = argparse.ArgumentParser(
        description='Prepare and validate DiffDock batch CSV files',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Validate existing CSV
  python prepare_batch_csv.py input.csv --validate

  # Create template CSV
  python prepare_batch_csv.py --create --output batch_template.csv

  # Create template with 5 example rows
  python prepare_batch_csv.py --create --output template.csv --num-examples 5

  # Validate with custom base directory for relative paths
  python prepare_batch_csv.py input.csv --validate --base-dir /path/to/data/
"""
    )
    parser.add_argument('csv_file', nargs='?', help='CSV file to validate')
    parser.add_argument('--validate', action='store_true',
                        help='Validate the CSV file')
    parser.add_argument('--create', action='store_true',
                        help='Create a template CSV file')
    parser.add_argument('--output', '-o', help='Output path for template CSV')
    parser.add_argument('--num-examples', type=int, default=3,
                        help='Number of example rows in template (default: 3)')
    parser.add_argument('--base-dir', help='Base directory for relative file paths')
    args = parser.parse_args()
    # Create template
    if args.create:
        output_path = args.output or 'diffdock_batch_template.csv'
        df = create_template_csv(output_path, args.num_examples)
        print(f"✓ Created template CSV: {output_path}")
        print(f"\nTemplate contents:")
        print(df.to_string(index=False))
        print(f"\nEdit this file with your protein-ligand pairs and run with:")
        print(f"  python -m inference --config default_inference_args.yaml \\")
        print(f"    --protein_ligand_csv {output_path} --out_dir results/")
        return 0
    # Validate CSV
    if args.validate or args.csv_file:
        if not args.csv_file:
            print("Error: CSV file required for validation")
            parser.print_help()
            return 1
        if not os.path.exists(args.csv_file):
            print(f"Error: CSV file not found: {args.csv_file}")
            return 1
        print(f"Validating: {args.csv_file}")
        print("="*60)
        valid, messages = validate_csv(args.csv_file, args.base_dir)
        for msg in messages:
            print(msg)
        return 0 if valid else 1
    # No action specified
    parser.print_help()
    return 1


if __name__ == '__main__':
    sys.exit(main())

@@ -0,0 +1,278 @@
#!/usr/bin/env python3
"""
DiffDock Environment Setup Checker
This script verifies that the DiffDock environment is properly configured
and all dependencies are available.
Usage:
python setup_check.py
python setup_check.py --verbose
"""
import argparse
import sys
import os
from pathlib import Path
def check_python_version():
    """Check Python version."""
    version = sys.version_info
    print("Checking Python version...")
    if version.major == 3 and version.minor >= 8:
        print(f"  ✓ Python {version.major}.{version.minor}.{version.micro}")
        return True
    else:
        print(f"  ✗ Python {version.major}.{version.minor}.{version.micro} "
              f"(requires Python 3.8 or higher)")
        return False
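def check_package(package_name, import_name=None, version_attr='__version__'):
    """Check if a Python package is installed."""
    if import_name is None:
        import_name = package_name
    try:
        module = __import__(import_name)
    except ImportError:
        print(f"  ✗ {package_name:20s} (not installed)")
        return False
    # Resolve dotted attributes such as 'rdBase.__version__' one level at a
    # time; a single getattr() call cannot follow the dot.
    version = module
    for attr in version_attr.split('.'):
        version = getattr(version, attr, 'unknown')
        if version == 'unknown':
            break
    print(f"  ✓ {package_name:20s} (version: {version})")
    return True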
def check_pytorch():
    """Check PyTorch installation and CUDA availability."""
    print("\nChecking PyTorch...")
    try:
        import torch
        print(f"  ✓ PyTorch version: {torch.__version__}")
        # Check CUDA
        if torch.cuda.is_available():
            print(f"  ✓ CUDA available: {torch.cuda.get_device_name(0)}")
            print(f"    - CUDA version: {torch.version.cuda}")
            print(f"    - Number of GPUs: {torch.cuda.device_count()}")
            return True, True
        else:
            print(f"  ⚠ CUDA not available (will run on CPU)")
            return True, False
    except ImportError:
        print(f"  ✗ PyTorch not installed")
        return False, False
def check_pytorch_geometric():
    """Check PyTorch Geometric installation."""
    print("\nChecking PyTorch Geometric...")
    packages = [
        ('torch-geometric', 'torch_geometric'),
        ('torch-scatter', 'torch_scatter'),
        ('torch-sparse', 'torch_sparse'),
        ('torch-cluster', 'torch_cluster'),
    ]
    all_ok = True
    for pkg_name, import_name in packages:
        if not check_package(pkg_name, import_name):
            all_ok = False
    return all_ok
def check_core_dependencies():
    """Check core DiffDock dependencies."""
    print("\nChecking core dependencies...")
    dependencies = [
        ('numpy', 'numpy'),
        ('scipy', 'scipy'),
        ('pandas', 'pandas'),
        ('rdkit', 'rdkit', 'rdBase.__version__'),
        ('biopython', 'Bio', '__version__'),
        ('pytorch-lightning', 'pytorch_lightning'),
        ('PyYAML', 'yaml'),
    ]
    all_ok = True
    for dep in dependencies:
        pkg_name = dep[0]
        import_name = dep[1]
        version_attr = dep[2] if len(dep) > 2 else '__version__'
        if not check_package(pkg_name, import_name, version_attr):
            all_ok = False
    return all_ok
def check_esm():
    """Check ESM (protein language model) installation."""
    print("\nChecking ESM (for protein sequence folding)...")
    try:
        import esm
        print(f"  ✓ ESM installed (version: {esm.__version__ if hasattr(esm, '__version__') else 'unknown'})")
        return True
    except ImportError:
        print(f"  ⚠ ESM not installed (needed for protein sequence folding)")
        print(f"    Install with: pip install fair-esm")
        return False
def check_diffdock_installation():
    """Check if DiffDock is properly installed/cloned."""
    print("\nChecking DiffDock installation...")
    # Look for key files
    key_files = [
        'inference.py',
        'default_inference_args.yaml',
        'environment.yml',
    ]
    found_files = []
    missing_files = []
    for filename in key_files:
        if os.path.exists(filename):
            found_files.append(filename)
        else:
            missing_files.append(filename)
    if found_files:
        print(f"  ✓ Found DiffDock files in current directory:")
        for f in found_files:
            print(f"    - {f}")
    else:
        print(f"  ⚠ DiffDock files not found in current directory")
        print(f"    Current directory: {os.getcwd()}")
        print(f"    Make sure you're in the DiffDock repository root")
    # Check for model checkpoints
    model_dir = Path('./workdir/v1.1/score_model')
    confidence_dir = Path('./workdir/v1.1/confidence_model')
    if model_dir.exists() and confidence_dir.exists():
        print(f"  ✓ Model checkpoints found")
    else:
        print(f"  ⚠ Model checkpoints not found in ./workdir/v1.1/")
        print(f"    Models will be downloaded on first run")
    return len(found_files) > 0
def print_installation_instructions():
    """Print installation instructions if setup is incomplete."""
    print("\n" + "="*80)
    print("Installation Instructions")
    print("="*80)
    print("""
If DiffDock is not installed, follow these steps:

1. Clone the repository:
   git clone https://github.com/gcorso/DiffDock.git
   cd DiffDock

2. Create conda environment:
   conda env create --file environment.yml
   conda activate diffdock

3. Verify installation:
   python setup_check.py

For Docker installation:
   docker pull rbgcsail/diffdock
   docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
   micromamba activate diffdock

For more information, visit: https://github.com/gcorso/DiffDock
""")
def print_performance_notes(has_cuda):
    """Print performance notes based on available hardware."""
    print("\n" + "="*80)
    print("Performance Notes")
    print("="*80)
    if has_cuda:
        print("""
✓ GPU detected - DiffDock will run efficiently

Expected performance:
  - First run: ~2-5 minutes (pre-computing SO(2)/SO(3) tables)
  - Subsequent runs: ~10-60 seconds per complex (depending on settings)
  - Batch processing: Highly efficient with GPU
""")
    else:
        print("""
⚠ No GPU detected - DiffDock will run on CPU

Expected performance:
  - CPU inference is SIGNIFICANTLY slower than GPU
  - Single complex: Several minutes to hours
  - Batch processing: Not recommended on CPU

Recommendation: Use GPU for practical applications
  - Cloud options: Google Colab, AWS, or other cloud GPU services
  - Local: Install CUDA-capable GPU
""")
def main():
    parser = argparse.ArgumentParser(
        description='Check DiffDock environment setup',
        formatter_class=argparse.RawDescriptionHelpFormatter
    )
    parser.add_argument('--verbose', '-v', action='store_true',
                        help='Show detailed version information')
    args = parser.parse_args()
    print("="*80)
    print("DiffDock Environment Setup Checker")
    print("="*80)
    checks = []
    # Run all checks
    checks.append(("Python version", check_python_version()))
    pytorch_ok, has_cuda = check_pytorch()
    checks.append(("PyTorch", pytorch_ok))
    checks.append(("PyTorch Geometric", check_pytorch_geometric()))
    checks.append(("Core dependencies", check_core_dependencies()))
    checks.append(("ESM", check_esm()))
    checks.append(("DiffDock files", check_diffdock_installation()))
    # Summary
    print("\n" + "="*80)
    print("Summary")
    print("="*80)
    all_passed = all(result for _, result in checks)
    for check_name, result in checks:
        status = "✓ PASS" if result else "✗ FAIL"
        print(f"  {status:8s} - {check_name}")
    if all_passed:
        print("\n✓ All checks passed! DiffDock is ready to use.")
        print_performance_notes(has_cuda)
        return 0
    else:
        print("\n✗ Some checks failed. Please install missing dependencies.")
        print_installation_instructions()
        return 1


if __name__ == '__main__':
    sys.exit(main())