Initial commit

skills/diffdock/SKILL.md (new file, 477 lines)
@@ -0,0 +1,477 @@
---
name: diffdock
description: "Diffusion-based molecular docking. Predict protein-ligand binding poses from PDB/SMILES, confidence scores, virtual screening, for structure-based drug design. Not for affinity prediction."
---

# DiffDock: Molecular Docking with Diffusion Models

## Overview

DiffDock is a diffusion-based deep learning tool for molecular docking that predicts the 3D binding poses of small-molecule ligands in protein targets. It represents the state of the art in computational docking, a task central to structure-based drug discovery and chemical biology.

**Core Capabilities:**
- Predict ligand binding poses with high accuracy using deep learning
- Support protein structures (PDB files) or sequences (via ESMFold)
- Process single complexes or batch virtual screening campaigns
- Generate confidence scores to assess prediction reliability
- Handle diverse ligand inputs (SMILES, SDF, MOL2)

**Key Distinction:** DiffDock predicts **binding poses** (3D structure) and **confidence** (prediction certainty), NOT binding affinity (ΔG, Kd). Always combine it with scoring functions (GNINA, MM/GBSA) for affinity assessment.

## When to Use This Skill

This skill should be used for requests such as:

- "Dock this ligand to a protein" or "predict binding pose"
- "Run molecular docking" or "perform protein-ligand docking"
- "Virtual screening" or "screen compound library"
- "Where does this molecule bind?" or "predict binding site"
- Structure-based drug design or lead optimization tasks
- Tasks involving PDB files + SMILES strings or ligand structures
- Batch docking of multiple protein-ligand pairs

## Installation and Environment Setup

### Check Environment Status

Before proceeding with DiffDock tasks, verify the environment setup:

```bash
# Use the provided setup checker
python scripts/setup_check.py
```

This script validates the Python version, PyTorch with CUDA, PyTorch Geometric, RDKit, ESM, and other dependencies.

### Installation Options

**Option 1: Conda (Recommended)**
```bash
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock
conda env create --file environment.yml
conda activate diffdock
```

**Option 2: Docker**
```bash
docker pull rbgcsail/diffdock
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
micromamba activate diffdock
```

**Important Notes:**
- GPU strongly recommended (10-100x speedup vs. CPU)
- The first run pre-computes SO(2)/SO(3) lookup tables (~2-5 minutes)
- Model checkpoints (~500MB) download automatically if not present

## Core Workflows

### Workflow 1: Single Protein-Ligand Docking

**Use Case:** Dock one ligand to one protein target

**Input Requirements:**
- Protein: PDB file OR amino acid sequence
- Ligand: SMILES string OR structure file (SDF/MOL2)

**Command:**
```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_path protein.pdb \
    --ligand "CC(=O)Oc1ccccc1C(=O)O" \
    --out_dir results/single_docking/
```

**Alternative (protein sequence):**
```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKF..." \
    --ligand ligand.sdf \
    --out_dir results/sequence_docking/
```

**Output Structure:**
```
results/single_docking/
├── rank_1.sdf               # Top-ranked pose
├── rank_2.sdf               # Second-ranked pose
├── ...
├── rank_10.sdf              # 10th pose (default: 10 samples)
└── confidence_scores.txt
```
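
To sanity-check a pose programmatically before downstream use, the SDF output can be loaded with RDKit (already a DiffDock dependency). A minimal sketch; the path follows the layout above:

```python
from rdkit import Chem

# Load the top-ranked pose; DiffDock writes standard SDF files
mol = Chem.SDMolSupplier("results/single_docking/rank_1.sdf", removeHs=False)[0]
if mol is None:
    raise ValueError("Pose failed RDKit parsing/sanitization - inspect the SDF manually")
print(f"Atoms: {mol.GetNumAtoms()}, conformers: {mol.GetNumConformers()}")
```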

### Workflow 2: Batch Processing Multiple Complexes

**Use Case:** Dock multiple ligands to proteins, e.g., virtual screening campaigns

**Step 1: Prepare Batch CSV**

Use the provided script to create or validate batch input:

```bash
# Create template
python scripts/prepare_batch_csv.py --create --output batch_input.csv

# Validate existing CSV
python scripts/prepare_batch_csv.py my_input.csv --validate
```

**CSV Format:**
```csv
complex_name,protein_path,ligand_description,protein_sequence
complex1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFT...
complex3,protein3.pdb,ligand3.sdf,
```

**Required Columns:**
- `complex_name`: Unique identifier
- `protein_path`: PDB file path (leave empty if using sequence)
- `ligand_description`: SMILES string or ligand file path
- `protein_sequence`: Amino acid sequence (leave empty if using PDB)

**Step 2: Run Batch Docking**

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_ligand_csv batch_input.csv \
    --out_dir results/batch/ \
    --batch_size 10
```

**For Large Virtual Screening (>100 compounds):**

Pre-compute protein embeddings for faster processing:
```bash
# Pre-compute embeddings
python datasets/esm_embedding_preparation.py \
    --protein_ligand_csv screening_input.csv \
    --out_file protein_embeddings.pt

# Run with pre-computed embeddings
python -m inference \
    --config default_inference_args.yaml \
    --protein_ligand_csv screening_input.csv \
    --esm_embeddings_path protein_embeddings.pt \
    --out_dir results/screening/
```

### Workflow 3: Analyzing Results

After docking completes, analyze confidence scores and rank predictions:

```bash
# Analyze all results
python scripts/analyze_results.py results/batch/

# Show top 5 per complex
python scripts/analyze_results.py results/batch/ --top 5

# Filter by confidence threshold
python scripts/analyze_results.py results/batch/ --threshold 0.0

# Export to CSV
python scripts/analyze_results.py results/batch/ --export summary.csv

# Show top 20 predictions across all complexes
python scripts/analyze_results.py results/batch/ --best 20
```

The analysis script:
- Parses confidence scores from all predictions
- Classifies them as High (> 0), Moderate (-1.5 to 0), or Low (< -1.5)
- Ranks predictions within and across complexes
- Generates statistical summaries
- Exports results to CSV for downstream analysis

## Confidence Score Interpretation

**Understanding Scores:**

| Score Range | Confidence Level | Interpretation |
|------------|------------------|----------------|
| **> 0** | High | Strong prediction, likely accurate |
| **-1.5 to 0** | Moderate | Reasonable prediction, validate carefully |
| **< -1.5** | Low | Uncertain prediction, requires validation |

**Critical Notes:**
1. **Confidence ≠ Affinity**: High confidence means the model is certain about the structure, NOT that the ligand binds strongly
2. **Context Matters**: Adjust expectations for:
   - Large ligands (>500 Da): Lower confidence expected
   - Multiple protein chains: May decrease confidence
   - Novel protein families: May underperform
3. **Multiple Samples**: Review the top 3-5 predictions and look for consensus (see the triage sketch below)
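
As a quick triage, the thresholds above can be applied directly to `confidence_scores.txt`. A minimal sketch, assuming the file holds one score per line ordered by rank (the layout the helper script `scripts/analyze_results.py` also assumes):

```python
from pathlib import Path

def triage(score: float) -> str:
    """Map a confidence score onto the High/Moderate/Low bands above."""
    if score > 0:
        return "High"
    if score > -1.5:
        return "Moderate"
    return "Low"

scores_file = Path("results/single_docking/confidence_scores.txt")
scores = [float(line) for line in scores_file.read_text().split()]
for rank, score in enumerate(scores, start=1):
    print(f"rank_{rank}: {score:6.2f} ({triage(score)})")
```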

**For detailed guidance:** Read `references/confidence_and_limitations.md` using the Read tool

## Parameter Customization

### Using Custom Configuration

Create a custom configuration for specific use cases:

```bash
# Copy template
cp assets/custom_inference_config.yaml my_config.yaml

# Edit parameters (see template for presets)
# Then run with custom config
python -m inference \
    --config my_config.yaml \
    --protein_ligand_csv input.csv \
    --out_dir results/
```

### Key Parameters to Adjust

**Sampling Density:**
- `samples_per_complex: 10` → Increase to 20-40 for difficult cases
- More samples = better coverage but longer runtime

**Inference Steps:**
- `inference_steps: 20` → Increase to 25-30 for higher accuracy
- More steps = potentially better quality but slower

**Temperature Parameters (control diversity):**
- `temp_sampling_tor: 7.04` → Increase for flexible ligands (8-10)
- `temp_sampling_tor: 7.04` → Decrease for rigid ligands (5-6)
- Higher temperature = more diverse poses

**Presets Available in Template:**
1. High Accuracy: More samples + steps, lower temperature
2. Fast Screening: Fewer samples, faster
3. Flexible Ligands: Increased torsion temperature
4. Rigid Ligands: Decreased torsion temperature
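
Individual parameters can also be overridden on the command line without editing a config file; command-line values take precedence over the config. The numbers below are illustrative, following the Flexible Ligands preset:

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_path protein.pdb \
    --ligand ligand.sdf \
    --samples_per_complex 20 \
    --temp_sampling_tor 8.5 \
    --out_dir results/flexible_ligand/
```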

**For complete parameter reference:** Read `references/parameters_reference.md` using the Read tool

## Advanced Techniques

### Ensemble Docking (Protein Flexibility)

For proteins with known flexibility, dock to multiple conformations:

```python
# Create ensemble CSV
import pandas as pd

conformations = ["conf1.pdb", "conf2.pdb", "conf3.pdb"]
ligand = "CC(=O)Oc1ccccc1C(=O)O"

data = {
    "complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
    "protein_path": conformations,
    "ligand_description": [ligand] * len(conformations),
    "protein_sequence": [""] * len(conformations)
}

pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
```

Run docking with increased sampling:
```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_ligand_csv ensemble_input.csv \
    --samples_per_complex 20 \
    --out_dir results/ensemble/
```

### Integration with Scoring Functions

DiffDock generates poses; combine with other tools for affinity:

**GNINA (Fast neural network scoring):**
```bash
for pose in results/*.sdf; do
    gnina -r protein.pdb -l "$pose" --score_only
done
```

**MM/GBSA (More accurate, slower):**
Use AmberTools MMPBSA.py or gmx_MMPBSA after energy minimization

**Free Energy Calculations (Most accurate):**
Use OpenMM + OpenFE or GROMACS for FEP/TI calculations

**Recommended Workflow:**
1. DiffDock → Generate poses with confidence scores
2. Visual inspection → Check structural plausibility
3. GNINA or MM/GBSA → Rescore and rank by affinity
4. Experimental validation → Biochemical assays

## Limitations and Scope

**DiffDock IS Designed For:**
- Small-molecule ligands (typically 100-1000 Da)
- Drug-like organic compounds
- Small peptides (<20 residues)
- Single- or multi-chain proteins

**DiffDock IS NOT Designed For:**
- Large biomolecules (protein-protein docking) → Use DiffDock-PP or AlphaFold-Multimer
- Large peptides (>20 residues) → Use alternative methods
- Covalent docking → Use specialized covalent docking tools
- Binding affinity prediction → Combine with scoring functions
- Membrane proteins → Not specifically trained on them, use with caution

**For complete limitations:** Read `references/confidence_and_limitations.md` using the Read tool

## Troubleshooting

### Common Issues

**Issue: Low confidence scores across all predictions**
- Cause: Large/unusual ligands, unclear binding site, protein flexibility
- Solution: Increase `samples_per_complex` (20-40), try ensemble docking, validate the protein structure

**Issue: Out-of-memory errors**
- Cause: GPU memory insufficient for the batch size
- Solution: Reduce the batch size (e.g., `--batch_size 2`) or process fewer complexes at once

**Issue: Slow performance**
- Cause: Running on CPU instead of GPU
- Solution: Verify CUDA with `python -c "import torch; print(torch.cuda.is_available())"` and use a GPU

**Issue: Unrealistic binding poses**
- Cause: Poor protein preparation, ligand too large, wrong binding site
- Solution: Check the protein for missing residues, remove waters far from the binding site, consider specifying the binding site

**Issue: "Module not found" errors**
- Cause: Missing dependencies or wrong environment
- Solution: Run `python scripts/setup_check.py` to diagnose

### Performance Optimization

**For Best Results:**
1. Use a GPU (essential for practical use)
2. Pre-compute ESM embeddings for repeated protein use
3. Batch process multiple complexes together
4. Start with default parameters, then tune if needed
5. Validate protein structures (resolve missing residues)
6. Use canonical SMILES for ligands

## Graphical User Interface

For interactive use, launch the web interface:

```bash
python app/main.py
# Navigate to http://localhost:7860
```

Or use the online demo without installation:
- https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web

## Resources

### Helper Scripts (`scripts/`)

**`prepare_batch_csv.py`**: Create and validate batch input CSV files
- Create templates with example entries
- Validate file paths and SMILES strings
- Check for required columns and format issues

**`analyze_results.py`**: Analyze confidence scores and rank predictions
- Parse results from single or batch runs
- Generate statistical summaries
- Export to CSV for downstream analysis
- Identify top predictions across complexes

**`setup_check.py`**: Verify DiffDock environment setup
- Check Python version and dependencies
- Verify PyTorch and CUDA availability
- Test RDKit and PyTorch Geometric installation
- Provide installation instructions if needed

### Reference Documentation (`references/`)

**`parameters_reference.md`**: Complete parameter documentation
- All command-line options and configuration parameters
- Default values and acceptable ranges
- Temperature parameters for controlling diversity
- Model checkpoint locations and version flags

Read this file when users need:
- Detailed parameter explanations
- Fine-tuning guidance for specific systems
- Alternative sampling strategies

**`confidence_and_limitations.md`**: Confidence score interpretation and tool limitations
- Detailed confidence score interpretation
- When to trust predictions
- Scope and limitations of DiffDock
- Integration with complementary tools
- Troubleshooting prediction quality

Read this file when users need:
- Help interpreting confidence scores
- Understanding when NOT to use DiffDock
- Guidance on combining with other tools
- Validation strategies

**`workflows_examples.md`**: Comprehensive workflow examples
- Detailed installation instructions
- Step-by-step examples for all workflows
- Advanced integration patterns
- Troubleshooting common issues
- Best practices and optimization tips

Read this file when users need:
- Complete workflow examples with code
- Integration with GNINA, OpenMM, or other tools
- Virtual screening workflows
- Ensemble docking procedures

### Assets (`assets/`)

**`batch_template.csv`**: Template for batch processing
- Pre-formatted CSV with required columns
- Example entries showing different input types
- Ready to customize with actual data

**`custom_inference_config.yaml`**: Configuration template
- Annotated YAML with all parameters
- Four preset configurations for common use cases
- Detailed comments explaining each parameter
- Ready to customize and use

## Best Practices

1. **Always verify the environment** with `setup_check.py` before starting large jobs
2. **Validate batch CSVs** with `prepare_batch_csv.py` to catch errors early
3. **Start with defaults**, then tune parameters based on system-specific needs
4. **Generate multiple samples** (10-40) for robust predictions
5. **Visually inspect** top poses before downstream analysis
6. **Combine with scoring functions** for affinity assessment
7. **Use confidence scores** for initial ranking, not final decisions
8. **Pre-compute embeddings** for virtual screening campaigns
9. **Document parameters** used, for reproducibility
10. **Validate results** experimentally when possible

## Citations

When using DiffDock, cite the appropriate papers:

**DiffDock-L (current default model):**
```
Corso et al. (2024) "Deep Confident Steps to New Pockets: Strategies for Docking Generalization"
ICLR 2024, arXiv:2402.18396
```

**Original DiffDock:**
```
Corso et al. (2023) "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
ICLR 2023, arXiv:2210.01776
```

## Additional Resources

- **GitHub Repository**: https://github.com/gcorso/DiffDock
- **Online Demo**: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
- **DiffDock-L Paper**: https://arxiv.org/abs/2402.18396
- **Original Paper**: https://arxiv.org/abs/2210.01776

skills/diffdock/assets/batch_template.csv (new file, 4 lines)
@@ -0,0 +1,4 @@
complex_name,protein_path,ligand_description,protein_sequence
example_1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
example_2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK
example_3,protein3.pdb,ligand3.sdf,

skills/diffdock/assets/custom_inference_config.yaml (new file, 90 lines)
@@ -0,0 +1,90 @@
# DiffDock Custom Inference Configuration Template
# Copy and modify this file to customize inference parameters

# Model paths (usually don't need to change these)
model_dir: ./workdir/v1.1/score_model
confidence_model_dir: ./workdir/v1.1/confidence_model
ckpt: best_ema_inference_epoch_model.pt
confidence_ckpt: best_model_epoch75.pt

# Model version flags
old_score_model: false  # Set to true to use original DiffDock instead of DiffDock-L
old_filtering_model: true

# Inference steps
inference_steps: 20  # Increase for potentially better accuracy (e.g., 25-30)
actual_steps: 19
no_final_step_noise: true

# Sampling parameters
samples_per_complex: 10  # Increase for difficult cases (e.g., 20-40)
sigma_schedule: expbeta
initial_noise_std_proportion: 1.46

# Temperature controls - Adjust these to balance exploration vs accuracy
# Higher values = more diverse predictions, lower values = more focused predictions

# Sampling temperatures
temp_sampling_tr: 1.17   # Translation sampling temperature
temp_sampling_rot: 2.06  # Rotation sampling temperature
temp_sampling_tor: 7.04  # Torsion sampling temperature (increase for flexible ligands)

# Psi angle temperatures
temp_psi_tr: 0.73
temp_psi_rot: 0.90
temp_psi_tor: 0.59

# Sigma data temperatures
temp_sigma_data_tr: 0.93
temp_sigma_data_rot: 0.75
temp_sigma_data_tor: 0.69

# Feature flags
no_model: false
no_random: false
ode: false  # Set to true to use ODE solver instead of SDE
different_schedules: false
limit_failures: 5

# Output settings
# save_visualisation: true  # Uncomment to save SDF files

# ============================================================================
# Configuration Presets for Common Use Cases
# ============================================================================

# PRESET 1: High Accuracy (slower, more thorough)
# samples_per_complex: 30
# inference_steps: 25
# temp_sampling_tr: 1.0
# temp_sampling_rot: 1.8
# temp_sampling_tor: 6.5

# PRESET 2: Fast Screening (faster, less thorough)
# samples_per_complex: 5
# inference_steps: 15
# temp_sampling_tr: 1.3
# temp_sampling_rot: 2.2
# temp_sampling_tor: 7.5

# PRESET 3: Flexible Ligands (more conformational diversity)
# samples_per_complex: 20
# inference_steps: 20
# temp_sampling_tr: 1.2
# temp_sampling_rot: 2.1
# temp_sampling_tor: 8.5  # Increased torsion temperature

# PRESET 4: Rigid Ligands (more focused predictions)
# samples_per_complex: 10
# inference_steps: 20
# temp_sampling_tr: 1.1
# temp_sampling_rot: 2.0
# temp_sampling_tor: 6.0  # Decreased torsion temperature

# ============================================================================
# Usage Example
# ============================================================================
# python -m inference \
#     --config custom_inference_config.yaml \
#     --protein_ligand_csv input.csv \
#     --out_dir results/

skills/diffdock/references/confidence_and_limitations.md (new file, 182 lines)
@@ -0,0 +1,182 @@
# DiffDock Confidence Scores and Limitations

This document provides detailed guidance on interpreting DiffDock confidence scores and understanding the tool's limitations.

## Confidence Score Interpretation

DiffDock generates a confidence score for each predicted binding pose. This score indicates the model's certainty about the prediction.

### Score Ranges

| Score Range | Confidence Level | Interpretation |
|------------|------------------|----------------|
| **> 0** | High confidence | Strong prediction, likely an accurate binding pose |
| **-1.5 to 0** | Moderate confidence | Reasonable prediction, may need validation |
| **< -1.5** | Low confidence | Uncertain prediction, requires careful validation |

### Important Notes on Confidence Scores

1. **Not Binding Affinity**: Confidence scores reflect prediction certainty, NOT binding affinity strength
   - High confidence = the model is confident about the structure
   - Does NOT indicate strong/weak binding affinity

2. **Context-Dependent**: Confidence expectations should be adjusted based on system complexity:
   - **Lower expectations** for:
     - Large ligands (>500 Da)
     - Protein complexes with many chains
     - Unbound protein conformations (may require conformational changes)
     - Novel protein families not well represented in the training data

   - **Higher expectations** for:
     - Drug-like small molecules (150-500 Da)
     - Single-chain proteins or well-defined binding sites
     - Proteins similar to those in the training data (PDBBind, Binding MOAD)

3. **Multiple Predictions**: DiffDock generates multiple samples per complex (default: 10)
   - Review the top-ranked predictions (by confidence)
   - Consider clustering similar poses (see the sketch after this list)
   - High-confidence consensus across multiple samples strengthens a prediction
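
A minimal consensus sketch with RDKit (paths are illustrative; `CalcRMS` is symmetry-aware and compares poses in place, i.e., without re-aligning them, which is what pose comparison in a fixed protein frame requires):

```python
from itertools import combinations
from pathlib import Path

from rdkit import Chem
from rdkit.Chem import rdMolAlign

# Load ranked poses from one DiffDock run (illustrative path); assumes all SDFs parse
pose_files = sorted(Path("results/single_docking").glob("rank_*.sdf"),
                    key=lambda p: int(p.stem.split("_")[-1]))
poses = [(p.stem, Chem.RemoveHs(Chem.SDMolSupplier(str(p))[0])) for p in pose_files]

# Small pairwise RMSDs among the top-ranked poses indicate a consensus binding mode
for (name_a, mol_a), (name_b, mol_b) in combinations(poses, 2):
    print(f"{name_a} vs {name_b}: RMSD = {rdMolAlign.CalcRMS(mol_a, mol_b):.2f} Å")
```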

## What DiffDock Predicts

### ✅ DiffDock DOES Predict
- **Binding poses**: 3D spatial orientation of the ligand in the protein binding site
- **Confidence scores**: the model's certainty about its predictions
- **Multiple conformations**: various possible binding modes

### ❌ DiffDock DOES NOT Predict
- **Binding affinity**: strength of the protein-ligand interaction (ΔG, Kd, Ki)
- **Binding kinetics**: on/off rates, residence time
- **ADMET properties**: absorption, distribution, metabolism, excretion, toxicity
- **Selectivity**: relative binding to different targets

## Scope and Limitations

### Designed For
- **Small-molecule docking**: organic compounds, typically 100-1000 Da
- **Protein targets**: single- or multi-chain proteins
- **Small peptides**: short peptide ligands (< ~20 residues)
- **Small nucleic acids**: short oligonucleotides

### NOT Designed For
- **Large biomolecules**: full protein-protein interactions
  - Use DiffDock-PP, AlphaFold-Multimer, or RoseTTAFold2NA instead
- **Large peptides/proteins**: >20 residues as ligands
- **Covalent docking**: irreversible covalent bond formation
- **Metalloprotein specifics**: may not accurately handle metal coordination
- **Membrane proteins**: not specifically trained on membrane-embedded proteins

### Training Data Considerations

DiffDock was trained on:
- **PDBBind**: diverse protein-ligand complexes
- **Binding MOAD**: multi-domain protein structures

**Implications**:
- Best performance on proteins/ligands similar to the training data
- May underperform on:
  - Novel protein families
  - Unusual ligand chemotypes
  - Allosteric sites not well represented in the training data

## Validation and Complementary Tools

### Recommended Workflow

1. **Generate poses with DiffDock**
   - Use confidence scores for initial ranking
   - Consider multiple high-confidence predictions

2. **Visual Inspection**
   - Examine protein-ligand interactions in a molecular viewer
   - Check for reasonable:
     - Hydrogen bonds
     - Hydrophobic interactions
     - Steric complementarity
     - Electrostatic interactions

3. **Scoring and Refinement** (choose one or more):
   - **GNINA**: deep learning-based scoring function
   - **Molecular mechanics**: energy minimization and refinement
   - **MM/GBSA or MM/PBSA**: binding free energy estimation
   - **Free energy calculations**: FEP or TI for accurate affinity prediction

4. **Experimental Validation**
   - Biochemical assays (IC50, Kd measurements)
   - Structural validation (X-ray crystallography, cryo-EM)

### Tools for Binding Affinity Assessment

DiffDock should be combined with these tools for affinity prediction:

- **GNINA**: fast, accurate scoring function
  - GitHub: github.com/gnina/gnina

- **AutoDock Vina**: classical docking and scoring
  - Website: vina.scripps.edu

- **Free Energy Calculations**:
  - OpenMM + OpenFE
  - GROMACS + ABFE/RBFE protocols

- **MM/GBSA Tools**:
  - MMPBSA.py (AmberTools)
  - gmx_MMPBSA

## Performance Optimization

### For Best Results

1. **Protein Preparation**:
   - Remove water molecules far from the binding site
   - Resolve missing residues if possible
   - Consider protonation states at physiological pH

2. **Ligand Input**:
   - Provide reasonable 3D conformers when using structure files
   - Use canonical SMILES for consistent results (see the sketch after this list)
   - Pre-process with RDKit if needed

3. **Computational Resources**:
   - GPU strongly recommended (10-100x speedup)
   - The first run pre-computes lookup tables (takes a few minutes)
   - Batch processing is more efficient than single predictions

4. **Parameter Tuning**:
   - Increase `samples_per_complex` for difficult cases (20-40)
   - Adjust temperature parameters for the diversity/accuracy trade-off
   - Use pre-computed ESM embeddings for repeated predictions
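
A minimal canonicalization sketch with RDKit (the input SMILES is an illustrative Kekulé form of aspirin):

```python
from rdkit import Chem

# RDKit returns None for unparseable SMILES, which doubles as a cheap validity check
raw = "CC(=O)OC1=CC=CC=C1C(=O)O"  # illustrative non-canonical input
mol = Chem.MolFromSmiles(raw)
if mol is None:
    raise ValueError(f"Invalid SMILES: {raw}")
print(Chem.MolToSmiles(mol))  # canonical form, e.g., for consistent batch CSVs
```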

## Common Issues and Troubleshooting

### Low Confidence Scores
- **Large/flexible ligands**: consider splitting into fragments or use alternative methods
- **Multiple binding sites**: the model may predict multiple locations with distributed confidence
- **Protein flexibility**: consider using an ensemble of protein conformations

### Unrealistic Predictions
- **Clashes**: may indicate a need for protein preparation or refinement
- **Surface binding**: check whether the true binding site is blocked or unclear
- **Unusual poses**: consider increasing samples to explore more conformations

### Slow Performance
- **Use GPU**: essential for reasonable runtime
- **Pre-compute embeddings**: reuse ESM embeddings for the same protein
- **Batch processing**: more efficient than sequential individual predictions
- **Reduce samples**: lower `samples_per_complex` for quick screening

## Citation and Further Reading

For methodology details and benchmarking results, see:

1. **Original DiffDock Paper** (ICLR 2023):
   - "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
   - Corso et al., arXiv:2210.01776

2. **DiffDock-L Paper** (ICLR 2024):
   - Enhanced model with improved generalization
   - Corso et al., arXiv:2402.18396

3. **PoseBusters Benchmark**:
   - Rigorous docking evaluation framework
   - Used for DiffDock validation

skills/diffdock/references/parameters_reference.md (new file, 163 lines)
@@ -0,0 +1,163 @@
# DiffDock Configuration Parameters Reference

This document provides comprehensive details on all DiffDock configuration parameters and command-line options.

## Model & Checkpoint Settings

### Model Paths
- **`--model_dir`**: Directory containing the score model checkpoint
  - Default: `./workdir/v1.1/score_model`
  - DiffDock-L model (current default)

- **`--confidence_model_dir`**: Directory containing the confidence model checkpoint
  - Default: `./workdir/v1.1/confidence_model`

- **`--ckpt`**: Name of the score model checkpoint file
  - Default: `best_ema_inference_epoch_model.pt`

- **`--confidence_ckpt`**: Name of the confidence model checkpoint file
  - Default: `best_model_epoch75.pt`

### Model Version Flags
- **`--old_score_model`**: Use the original DiffDock model instead of DiffDock-L
  - Default: `false` (uses DiffDock-L)

- **`--old_filtering_model`**: Use the legacy confidence filtering approach
  - Default: `true`

## Input/Output Options

### Input Specification
- **`--protein_path`**: Path to the protein PDB file
  - Example: `--protein_path protein.pdb`
  - Alternative to `--protein_sequence`

- **`--protein_sequence`**: Amino acid sequence for ESMFold folding
  - Automatically generates the protein structure from the sequence
  - Alternative to `--protein_path`

- **`--ligand`**: Ligand specification (SMILES string or file path)
  - SMILES string: `--ligand "COc(cc1)ccc1C#N"`
  - File path: `--ligand ligand.sdf` or `.mol2`

- **`--protein_ligand_csv`**: CSV file for batch processing
  - Required columns: `complex_name`, `protein_path`, `ligand_description`, `protein_sequence`
  - Example: `--protein_ligand_csv data/protein_ligand_example.csv`

### Output Control
- **`--out_dir`**: Output directory for predictions
  - Example: `--out_dir results/user_predictions/`

- **`--save_visualisation`**: Export predicted molecules as SDF files
  - Enables visualization of results

## Inference Parameters

### Diffusion Steps
- **`--inference_steps`**: Number of planned inference iterations
  - Default: `20`
  - Higher values may improve accuracy but increase runtime

- **`--actual_steps`**: Actual diffusion steps executed
  - Default: `19`

- **`--no_final_step_noise`**: Omit noise at the final diffusion step
  - Default: `true`

### Sampling Settings
- **`--samples_per_complex`**: Number of samples to generate per complex
  - Default: `10`
  - More samples provide better coverage but increase computation

- **`--sigma_schedule`**: Noise schedule type
  - Default: `expbeta` (exponential-beta)

- **`--initial_noise_std_proportion`**: Initial noise standard deviation scaling
  - Default: `1.46`

### Temperature Parameters

#### Sampling Temperatures (control the diversity of predictions)
- **`--temp_sampling_tr`**: Translation sampling temperature
  - Default: `1.17`

- **`--temp_sampling_rot`**: Rotation sampling temperature
  - Default: `2.06`

- **`--temp_sampling_tor`**: Torsion sampling temperature
  - Default: `7.04`

#### Psi Angle Temperatures
- **`--temp_psi_tr`**: Translation psi temperature
  - Default: `0.73`

- **`--temp_psi_rot`**: Rotation psi temperature
  - Default: `0.90`

- **`--temp_psi_tor`**: Torsion psi temperature
  - Default: `0.59`

#### Sigma Data Temperatures
- **`--temp_sigma_data_tr`**: Translation data distribution scaling
  - Default: `0.93`

- **`--temp_sigma_data_rot`**: Rotation data distribution scaling
  - Default: `0.75`

- **`--temp_sigma_data_tor`**: Torsion data distribution scaling
  - Default: `0.69`

## Processing Options

### Performance
- **`--batch_size`**: Processing batch size
  - Default: `10`
  - Larger values increase throughput but require more memory

- **`--tqdm`**: Enable progress bar visualization
  - Useful for monitoring long-running jobs

### Protein Structure
- **`--chain_cutoff`**: Maximum number of protein chains to process
  - Example: `--chain_cutoff 10`
  - Useful for large multi-chain complexes

- **`--esm_embeddings_path`**: Path to pre-computed ESM2 protein embeddings
  - Speeds up inference by reusing embeddings
  - Optional optimization

### Dataset Options
- **`--split`**: Dataset split to use (train/test/val)
  - Used for evaluation on standard benchmarks

## Advanced Flags

### Debugging & Testing
- **`--no_model`**: Disable model inference (debugging)
  - Default: `false`

- **`--no_random`**: Disable randomization
  - Default: `false`
  - Useful for reproducibility testing

### Alternative Sampling
- **`--ode`**: Use an ODE solver instead of the SDE
  - Default: `false`
  - Alternative sampling approach

- **`--different_schedules`**: Use different noise schedules per component
  - Default: `false`

### Error Handling
- **`--limit_failures`**: Maximum allowed failures before stopping
  - Default: `5`

## Configuration File

All parameters can be specified in a YAML configuration file (typically `default_inference_args.yaml`) or overridden via the command line:

```bash
python -m inference --config default_inference_args.yaml --samples_per_complex 20
```

Command-line arguments take precedence over configuration file values.

skills/diffdock/references/workflows_examples.md (new file, 392 lines)
@@ -0,0 +1,392 @@
# DiffDock Workflows and Examples

This document provides practical workflows and usage examples for common DiffDock tasks.

## Installation and Setup

### Conda Installation (Recommended)

```bash
# Clone repository
git clone https://github.com/gcorso/DiffDock.git
cd DiffDock

# Create conda environment
conda env create --file environment.yml
conda activate diffdock
```

### Docker Installation

```bash
# Pull Docker image
docker pull rbgcsail/diffdock

# Run container with GPU support
docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock

# Inside container, activate environment
micromamba activate diffdock
```

### First Run
The first execution pre-computes SO(2) and SO(3) lookup tables, taking a few minutes. Subsequent runs start immediately.

## Workflow 1: Single Protein-Ligand Docking

### Using PDB File and SMILES String

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_path examples/protein.pdb \
    --ligand "COc1ccc(C(=O)Nc2ccccc2)cc1" \
    --out_dir results/single_docking/
```

**Output Structure**:
```
results/single_docking/
├── index_0_rank_1.sdf       # Top-ranked prediction
├── index_0_rank_2.sdf       # Second-ranked prediction
├── ...
├── index_0_rank_10.sdf      # 10th prediction (if samples_per_complex=10)
└── confidence_scores.txt    # Scores for all predictions
```

### Using Ligand Structure File

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_path protein.pdb \
    --ligand ligand.sdf \
    --out_dir results/ligand_file/
```

**Supported ligand formats**: SDF, MOL2, or any format readable by RDKit

## Workflow 2: Protein Sequence to Structure Docking

### Using ESMFold for Protein Folding

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK" \
    --ligand "CC(C)Cc1ccc(cc1)C(C)C(=O)O" \
    --out_dir results/sequence_docking/
```

**Use Cases**:
- Protein structure not available in the PDB
- Modeling mutations or variants
- De novo protein design validation

**Note**: ESMFold folding adds computation time (30s-5min depending on sequence length)

## Workflow 3: Batch Processing Multiple Complexes

### Prepare CSV File

Create `complexes.csv` with the required columns:

```csv
complex_name,protein_path,ligand_description,protein_sequence
complex1,proteins/protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
complex2,,COc1ccc(C#N)cc1,MSKGEELFTGVVPILVELDGDVNGHKF...
complex3,proteins/protein3.pdb,ligands/ligand3.sdf,
```

**Column Descriptions**:
- `complex_name`: Unique identifier for the complex
- `protein_path`: Path to the PDB file (leave empty if using a sequence)
- `ligand_description`: SMILES string or path to a ligand file
- `protein_sequence`: Amino acid sequence (leave empty if using a PDB)

### Run Batch Docking

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_ligand_csv complexes.csv \
    --out_dir results/batch_predictions/ \
    --batch_size 10
```

**Output Structure**:
```
results/batch_predictions/
├── complex1/
│   ├── rank_1.sdf
│   ├── rank_2.sdf
│   └── ...
├── complex2/
│   ├── rank_1.sdf
│   └── ...
└── complex3/
    └── ...
```

## Workflow 4: High-Throughput Virtual Screening

### Setup for Screening Large Ligand Libraries

```python
# generate_screening_csv.py
import pandas as pd

# Load ligand library
ligands = pd.read_csv("ligand_library.csv")  # Contains SMILES

# Create DiffDock input
screening_data = {
    "complex_name": [f"screen_{i}" for i in range(len(ligands))],
    "protein_path": ["target_protein.pdb"] * len(ligands),
    "ligand_description": ligands["smiles"].tolist(),
    "protein_sequence": [""] * len(ligands)
}

df = pd.DataFrame(screening_data)
df.to_csv("screening_input.csv", index=False)
```

### Run Screening

```bash
# Pre-compute ESM embeddings for faster screening
python datasets/esm_embedding_preparation.py \
    --protein_ligand_csv screening_input.csv \
    --out_file protein_embeddings.pt

# Run docking with pre-computed embeddings
python -m inference \
    --config default_inference_args.yaml \
    --protein_ligand_csv screening_input.csv \
    --esm_embeddings_path protein_embeddings.pt \
    --out_dir results/virtual_screening/ \
    --batch_size 32
```

### Post-Processing: Extract Top Hits

```python
# analyze_screening_results.py
import os
import pandas as pd

results = []
results_dir = "results/virtual_screening/"

for complex_dir in os.listdir(results_dir):
    confidence_file = os.path.join(results_dir, complex_dir, "confidence_scores.txt")
    if os.path.exists(confidence_file):
        with open(confidence_file) as f:
            scores = [float(line.strip()) for line in f]
        top_score = max(scores)
        results.append({"complex": complex_dir, "top_confidence": top_score})

# Sort by confidence
df = pd.DataFrame(results)
df_sorted = df.sort_values("top_confidence", ascending=False)

# Get top 100 hits
top_hits = df_sorted.head(100)
top_hits.to_csv("top_hits.csv", index=False)
```

## Workflow 5: Ensemble Docking with Protein Flexibility

### Prepare Protein Ensemble

```python
# For proteins with known flexibility, use multiple conformations
# Example: using MD snapshots or crystal structures

# create_ensemble_csv.py
import pandas as pd

conformations = [
    "protein_conf1.pdb",
    "protein_conf2.pdb",
    "protein_conf3.pdb",
    "protein_conf4.pdb"
]

ligand = "CC(C)Cc1ccc(cc1)C(C)C(=O)O"

data = {
    "complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
    "protein_path": conformations,
    "ligand_description": [ligand] * len(conformations),
    "protein_sequence": [""] * len(conformations)
}

pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
```

### Run Ensemble Docking

```bash
python -m inference \
    --config default_inference_args.yaml \
    --protein_ligand_csv ensemble_input.csv \
    --out_dir results/ensemble_docking/ \
    --samples_per_complex 20  # More samples per conformation
```
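
To see which conformation docks best, aggregate the top confidence per ensemble member. A minimal sketch, assuming the same per-complex `confidence_scores.txt` layout used in the screening post-processing above:

```python
# rank_ensemble.py
from pathlib import Path

best = []
for complex_dir in sorted(Path("results/ensemble_docking").iterdir()):
    score_file = complex_dir / "confidence_scores.txt"
    if score_file.exists():
        scores = [float(s) for s in score_file.read_text().split()]
        best.append((complex_dir.name, max(scores)))

# Conformation with the highest top confidence first
for name, score in sorted(best, key=lambda item: item[1], reverse=True):
    print(f"{name}: top confidence = {score:.3f}")
```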

## Workflow 6: Integration with Downstream Analysis

### Example: DiffDock + GNINA Rescoring

```bash
# 1. Run DiffDock
python -m inference \
    --config default_inference_args.yaml \
    --protein_path protein.pdb \
    --ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
    --out_dir results/diffdock_poses/ \
    --save_visualisation

# 2. Rescore with GNINA
for pose in results/diffdock_poses/*.sdf; do
    gnina -r protein.pdb -l "$pose" --score_only -o "${pose%.sdf}_gnina.sdf"
done
```

### Example: DiffDock + OpenMM Energy Minimization

```python
# minimize_poses.py
# A sketch: ligand parameterization is elided below and must be supplied
# (e.g., with a small-molecule force field) before this runs end to end.
import os

from openmm import app, LangevinIntegrator
from openmm.app import ForceField, Modeller, PDBFile
from rdkit import Chem

# Load protein
protein = PDBFile('protein.pdb')
forcefield = ForceField('amber14-all.xml', 'amber14/tip3pfb.xml')

# Process each DiffDock pose
pose_dir = 'results/diffdock_poses/'
for pose_file in os.listdir(pose_dir):
    if pose_file.endswith('.sdf'):
        # Load ligand
        mol = Chem.SDMolSupplier(os.path.join(pose_dir, pose_file))[0]

        # Combine protein + ligand
        modeller = Modeller(protein.topology, protein.positions)
        # ... add ligand to modeller (requires ligand parameters) ...

        # Create system and minimize
        system = forcefield.createSystem(modeller.topology)
        integrator = LangevinIntegrator(300, 1.0, 0.002)  # K, 1/ps, ps
        simulation = app.Simulation(modeller.topology, system, integrator)
        simulation.context.setPositions(modeller.positions)  # required before minimizing
        simulation.minimizeEnergy(maxIterations=1000)

        # Save minimized structure
        positions = simulation.context.getState(getPositions=True).getPositions()
        out_name = f"minimized_{os.path.splitext(pose_file)[0]}.pdb"
        with open(out_name, 'w') as out:
            PDBFile.writeFile(simulation.topology, positions, out)
```

## Workflow 7: Using the Graphical Interface

### Launch Web Interface

```bash
python app/main.py
```

### Access Interface
Navigate to `http://localhost:7860` in a web browser

### Features
- Upload a protein PDB or enter a sequence
- Input a ligand SMILES or upload a structure
- Adjust inference parameters via the GUI
- Visualize results interactively
- Download predictions directly

### Online Alternative
Use the Hugging Face Spaces demo without local installation:
- URL: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web

## Advanced Configuration

### Custom Inference Settings

Create a custom YAML configuration:

```yaml
# custom_inference.yaml
# Model settings
model_dir: ./workdir/v1.1/score_model
confidence_model_dir: ./workdir/v1.1/confidence_model

# Sampling parameters
samples_per_complex: 20  # More samples for better coverage
inference_steps: 25      # More steps for accuracy

# Temperature adjustments (increase for more diversity)
temp_sampling_tr: 1.3
temp_sampling_rot: 2.2
temp_sampling_tor: 7.5

# Output
save_visualisation: true
```

Use the custom configuration:

```bash
python -m inference \
    --config custom_inference.yaml \
    --protein_path protein.pdb \
    --ligand "CC(=O)OC1=CC=CC=C1C(=O)O" \
    --out_dir results/custom_config/
```

## Troubleshooting Common Issues

### Issue: Out of Memory Errors

**Solution**: Reduce the batch size
```bash
python -m inference ... --batch_size 2
```

### Issue: Slow Performance

**Solution**: Ensure GPU usage
```python
import torch
print(torch.cuda.is_available())  # Should return True
```

### Issue: Poor Predictions for Large Ligands

**Solution**: Increase sampling diversity
```bash
python -m inference ... --samples_per_complex 40 --temp_sampling_tor 9.0
```

### Issue: Protein with Many Chains

**Solution**: Limit chains or isolate the binding site
```bash
python -m inference ... --chain_cutoff 4
```

Or pre-process the PDB to include only the relevant chains.

## Best Practices Summary

1. **Start Simple**: Test with a single complex before batch processing
2. **GPU Essential**: Use a GPU for reasonable performance
3. **Multiple Samples**: Generate 10-40 samples for robust predictions
4. **Validate Results**: Use molecular visualization and complementary scoring
5. **Consider Confidence**: Use confidence scores for initial ranking, not final decisions
6. **Iterate Parameters**: Adjust temperature/steps for specific systems
7. **Pre-compute Embeddings**: For repeated use of the same protein
8. **Combine Tools**: Integrate with scoring functions and energy minimization

skills/diffdock/scripts/analyze_results.py (new executable file, 334 lines)
@@ -0,0 +1,334 @@
|
||||
#!/usr/bin/env python3
|
||||
"""
|
||||
DiffDock Results Analysis Script
|
||||
|
||||
This script analyzes DiffDock prediction results, extracting confidence scores,
|
||||
ranking predictions, and generating summary reports.
|
||||
|
||||
Usage:
|
||||
python analyze_results.py results/output_dir/
|
||||
python analyze_results.py results/ --top 50 --threshold 0.0
|
||||
python analyze_results.py results/ --export summary.csv
|
||||
"""
|
||||
|
||||
import argparse
|
||||
import os
|
||||
import sys
|
||||
import json
|
||||
from pathlib import Path
|
||||
from collections import defaultdict
|
||||
import re
|
||||
|
||||
|
||||
def parse_confidence_scores(results_dir):
|
||||
"""
|
||||
Parse confidence scores from DiffDock output directory.
|
||||
|
||||
Args:
|
||||
results_dir: Path to DiffDock results directory
|
||||
|
||||
Returns:
|
||||
dict: Dictionary mapping complex names to their predictions and scores
|
||||
"""
|
||||
results = {}
|
||||
results_path = Path(results_dir)
|
||||
|
||||
# Check if this is a single complex or batch results
|
||||
sdf_files = list(results_path.glob("*.sdf"))
|
||||
|
||||
if sdf_files:
|
||||
# Single complex output
|
||||
results['single_complex'] = parse_single_complex(results_path)
|
||||
else:
|
||||
# Batch output - multiple subdirectories
|
||||
for subdir in results_path.iterdir():
|
||||
if subdir.is_dir():
|
||||
complex_results = parse_single_complex(subdir)
|
||||
if complex_results:
|
||||
results[subdir.name] = complex_results
|
||||
|
||||
return results
|
||||
|
||||
|
||||
def parse_single_complex(complex_dir):
|
||||
"""Parse results for a single complex."""
|
||||
predictions = []
|
||||
|
||||
# Look for SDF files with rank information
|
||||
for sdf_file in complex_dir.glob("*.sdf"):
|
||||
filename = sdf_file.name
|
||||
|
||||
# Extract rank from filename (e.g., "rank_1.sdf" or "index_0_rank_1.sdf")
|
||||
rank_match = re.search(r'rank_(\d+)', filename)
|
||||
if rank_match:
|
||||
rank = int(rank_match.group(1))
|
||||
|
||||
# Try to extract confidence score from filename or separate file
|
||||
confidence = extract_confidence_score(sdf_file, complex_dir)
|
||||
|
||||
predictions.append({
|
||||
'rank': rank,
|
||||
'file': sdf_file.name,
|
||||
'path': str(sdf_file),
|
||||
'confidence': confidence
|
||||
})
|
||||
|
||||
# Sort by rank
|
||||
predictions.sort(key=lambda x: x['rank'])
|
||||
|
||||
return {'predictions': predictions} if predictions else None
|
||||
|
||||
|
||||
def extract_confidence_score(sdf_file, complex_dir):
|
||||
"""
|
||||
Extract confidence score for a prediction.
|
||||
|
||||
Tries multiple methods:
|
||||
1. Read from confidence_scores.txt file
|
||||
2. Parse from SDF file properties
|
||||
3. Extract from filename if present
|
||||
"""
|
||||
# Method 1: confidence_scores.txt
|
||||
confidence_file = complex_dir / "confidence_scores.txt"
|
||||
if confidence_file.exists():
|
||||
try:
|
||||
with open(confidence_file) as f:
|
||||
lines = f.readlines()
|
||||
# Extract rank from filename
|
||||
rank_match = re.search(r'rank_(\d+)', sdf_file.name)
|
||||
if rank_match:
|
||||
rank = int(rank_match.group(1))
|
||||
if rank <= len(lines):
|
||||
return float(lines[rank - 1].strip())
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Method 2: Parse from SDF file
|
||||
try:
|
||||
with open(sdf_file) as f:
|
||||
content = f.read()
|
||||
# Look for confidence score in SDF properties
|
||||
conf_match = re.search(r'confidence[:\s]+(-?\d+\.?\d*)', content, re.IGNORECASE)
|
||||
if conf_match:
|
||||
return float(conf_match.group(1))
|
||||
except Exception:
|
||||
pass
|
||||
|
||||
# Method 3: Filename (e.g., "rank_1_conf_0.95.sdf")
|
||||
conf_match = re.search(r'conf_(-?\d+\.?\d*)', sdf_file.name)
|
||||
if conf_match:
|
||||
return float(conf_match.group(1))
|
||||
|
||||
return None
|
||||
|
||||
|
||||
def classify_confidence(score):
|
||||
"""Classify confidence score into categories."""
|
||||
if score is None:
|
||||
return "Unknown"
|
||||
elif score > 0:
|
||||
return "High"
|
||||
elif score > -1.5:
|
||||
return "Moderate"
|
||||
else:
|
||||
return "Low"
|
||||
|
||||
|
||||
def print_summary(results, top_n=None, min_confidence=None):
    """Print a formatted summary of results."""

    print("\n" + "="*80)
    print("DiffDock Results Summary")
    print("="*80)

    all_predictions = []

    for complex_name, data in results.items():
        predictions = data.get('predictions', [])

        print(f"\n{complex_name}")
        print("-" * 80)

        if not predictions:
            print("  No predictions found")
            continue

        # Filter by confidence if specified
        filtered_predictions = predictions
        if min_confidence is not None:
            filtered_predictions = [p for p in predictions
                                    if p['confidence'] is not None and p['confidence'] >= min_confidence]

        # Limit to top N if specified
        if top_n is not None:
            filtered_predictions = filtered_predictions[:top_n]

        for pred in filtered_predictions:
            confidence = pred['confidence']
            confidence_class = classify_confidence(confidence)

            conf_str = f"{confidence:>7.3f}" if confidence is not None else "    N/A"
            print(f"  Rank {pred['rank']:2d}: Confidence = {conf_str} ({confidence_class:8s}) | {pred['file']}")

            # Add to all predictions for overall statistics
            if confidence is not None:
                all_predictions.append((complex_name, pred['rank'], confidence))

        # Show statistics for this complex
        if filtered_predictions and any(p['confidence'] is not None for p in filtered_predictions):
            confidences = [p['confidence'] for p in filtered_predictions if p['confidence'] is not None]
            print(f"\n  Statistics: {len(filtered_predictions)} predictions")
            print(f"    Mean confidence: {sum(confidences)/len(confidences):.3f}")
            print(f"    Max confidence:  {max(confidences):.3f}")
            print(f"    Min confidence:  {min(confidences):.3f}")

    # Overall statistics
    if all_predictions:
        print("\n" + "="*80)
        print("Overall Statistics")
        print("="*80)

        confidences = [conf for _, _, conf in all_predictions]
        print(f"  Total predictions: {len(all_predictions)}")
        print(f"  Total complexes:   {len(results)}")
        print(f"  Mean confidence:   {sum(confidences)/len(confidences):.3f}")
        print(f"  Max confidence:    {max(confidences):.3f}")
        print(f"  Min confidence:    {min(confidences):.3f}")

        # Confidence distribution
        high = sum(1 for c in confidences if c > 0)
        moderate = sum(1 for c in confidences if -1.5 < c <= 0)
        low = sum(1 for c in confidences if c <= -1.5)

        print("\n  Confidence distribution:")
        print(f"    High     (> 0):       {high:4d} ({100*high/len(confidences):5.1f}%)")
        print(f"    Moderate (-1.5 to 0): {moderate:4d} ({100*moderate/len(confidences):5.1f}%)")
        print(f"    Low      (< -1.5):    {low:4d} ({100*low/len(confidences):5.1f}%)")

    print("\n" + "="*80)

def export_to_csv(results, output_path):
    """Export results to CSV file."""
    import csv

    with open(output_path, 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(['complex_name', 'rank', 'confidence', 'confidence_class', 'file_path'])

        for complex_name, data in results.items():
            predictions = data.get('predictions', [])
            for pred in predictions:
                confidence = pred['confidence']
                confidence_class = classify_confidence(confidence)
                conf_value = confidence if confidence is not None else ''

                writer.writerow([
                    complex_name,
                    pred['rank'],
                    conf_value,
                    confidence_class,
                    pred['path']
                ])

    print(f"✓ Exported results to: {output_path}")

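# Illustrative downstream use of the exported CSV (column names as written
# above): load it with pandas and keep only high-confidence poses for
# rescoring with an affinity method such as GNINA or MM/GBSA:
#   import pandas as pd
#   df = pd.read_csv('summary.csv')
#   hits = df[df['confidence_class'] == 'High'].sort_values('confidence', ascending=False)
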
def get_top_predictions(results, n=10, sort_by='confidence'):
    """Get top N predictions across all complexes."""
    # Note: only confidence-based ranking is implemented; the sort_by
    # parameter is accepted but currently ignored.
    all_predictions = []

    for complex_name, data in results.items():
        predictions = data.get('predictions', [])
        for pred in predictions:
            if pred['confidence'] is not None:
                all_predictions.append({
                    'complex': complex_name,
                    **pred
                })

    # Sort by confidence (descending)
    all_predictions.sort(key=lambda x: x['confidence'], reverse=True)

    return all_predictions[:n]


def print_top_predictions(results, n=10):
    """Print top N predictions across all complexes."""
    top_preds = get_top_predictions(results, n)

    print("\n" + "="*80)
    print(f"Top {n} Predictions Across All Complexes")
    print("="*80)

    for i, pred in enumerate(top_preds, 1):
        confidence_class = classify_confidence(pred['confidence'])
        print(f"{i:2d}. {pred['complex']:30s} | Rank {pred['rank']:2d} | "
              f"Confidence: {pred['confidence']:7.3f} ({confidence_class})")

    print("="*80)


def main():
    parser = argparse.ArgumentParser(
        description='Analyze DiffDock prediction results',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Analyze all results in directory
  python analyze_results.py results/output_dir/

  # Show only top 5 predictions per complex
  python analyze_results.py results/ --top 5

  # Filter by confidence threshold
  python analyze_results.py results/ --threshold 0.0

  # Export to CSV
  python analyze_results.py results/ --export summary.csv

  # Show top 20 predictions across all complexes
  python analyze_results.py results/ --best 20
        """
    )

    parser.add_argument('results_dir', help='Path to DiffDock results directory')
    parser.add_argument('--top', '-t', type=int,
                        help='Show only top N predictions per complex')
    parser.add_argument('--threshold', type=float,
                        help='Minimum confidence threshold')
    parser.add_argument('--export', '-e', metavar='FILE',
                        help='Export results to CSV file')
    parser.add_argument('--best', '-b', type=int, metavar='N',
                        help='Show top N predictions across all complexes')

    args = parser.parse_args()

    # Validate results directory
    if not os.path.exists(args.results_dir):
        print(f"Error: Results directory not found: {args.results_dir}")
        return 1

    # Parse results
    print(f"Analyzing results in: {args.results_dir}")
    results = parse_confidence_scores(args.results_dir)

    if not results:
        print("No DiffDock results found in directory")
        return 1

    # Print summary
    print_summary(results, top_n=args.top, min_confidence=args.threshold)

    # Print top predictions across all complexes
    if args.best:
        print_top_predictions(results, args.best)

    # Export to CSV if requested
    if args.export:
        export_to_csv(results, args.export)

    return 0


if __name__ == '__main__':
    sys.exit(main())
254
skills/diffdock/scripts/prepare_batch_csv.py
Executable file
@@ -0,0 +1,254 @@
#!/usr/bin/env python3
"""
DiffDock Batch CSV Preparation and Validation Script

This script helps prepare and validate CSV files for DiffDock batch processing.
It checks for required columns, validates file paths, and ensures SMILES strings
are properly formatted.

Usage:
    python prepare_batch_csv.py input.csv --validate
    python prepare_batch_csv.py --create --output batch_input.csv
"""
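
# Expected CSV layout (the four columns checked by validate_csv below; the
# sample row is illustrative):
#
#   complex_name,protein_path,ligand_description,protein_sequence
#   example1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
#
# Each row must fill either protein_path or protein_sequence, and
# ligand_description may be a SMILES string or a path to a ligand structure file.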
import argparse
import os
import sys
from pathlib import Path

import pandas as pd

try:
    from rdkit import Chem
    from rdkit import RDLogger
    RDLogger.DisableLog('rdApp.*')
    RDKIT_AVAILABLE = True
except ImportError:
    RDKIT_AVAILABLE = False
    print("Warning: RDKit not available. SMILES validation will be skipped.")


def validate_smiles(smiles_string):
    """Validate a SMILES string using RDKit."""
    if not RDKIT_AVAILABLE:
        return True, "RDKit not available for validation"

    try:
        mol = Chem.MolFromSmiles(smiles_string)
        if mol is None:
            return False, "Invalid SMILES structure"
        return True, "Valid SMILES"
    except Exception as e:
        return False, str(e)

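# Quick sanity check (illustrative, assuming RDKit is installed):
#   validate_smiles('CC(=O)Oc1ccccc1C(=O)O')  -> (True, "Valid SMILES")
#   validate_smiles('not-a-smiles')           -> (False, "Invalid SMILES structure")
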
def validate_file_path(file_path, base_dir=None):
    """Validate that a file path exists."""
    if pd.isna(file_path) or file_path == "":
        return True, "Empty (will use protein_sequence)"

    # Handle relative paths
    if base_dir:
        full_path = Path(base_dir) / file_path
    else:
        full_path = Path(file_path)

    if full_path.exists():
        return True, f"File exists: {full_path}"
    else:
        return False, f"File not found: {full_path}"


def validate_csv(csv_path, base_dir=None):
    """
    Validate a DiffDock batch input CSV file.

    Args:
        csv_path: Path to CSV file
        base_dir: Base directory for relative paths (default: CSV directory)

    Returns:
        bool: True if validation passes
        list: List of validation messages
    """
    messages = []
    valid = True

    # Read CSV
    try:
        df = pd.read_csv(csv_path)
        messages.append(f"✓ Successfully read CSV with {len(df)} rows")
    except Exception as e:
        messages.append(f"✗ Error reading CSV: {e}")
        return False, messages

    # Check required columns
    required_cols = ['complex_name', 'protein_path', 'ligand_description', 'protein_sequence']
    missing_cols = [col for col in required_cols if col not in df.columns]

    if missing_cols:
        messages.append(f"✗ Missing required columns: {', '.join(missing_cols)}")
        # Return early: the per-row checks below index these columns directly
        # and would raise KeyError if any were absent.
        return False, messages
    messages.append("✓ All required columns present")

    # Set base directory
    if base_dir is None:
        base_dir = Path(csv_path).parent

    # Validate each row
    for idx, row in df.iterrows():
        row_msgs = []

        # Check complex name
        if pd.isna(row['complex_name']) or row['complex_name'] == "":
            row_msgs.append("Missing complex_name")
            valid = False

        # Check that either protein_path or protein_sequence is provided
        has_protein_path = not pd.isna(row['protein_path']) and row['protein_path'] != ""
        has_protein_seq = not pd.isna(row['protein_sequence']) and row['protein_sequence'] != ""

        if not has_protein_path and not has_protein_seq:
            row_msgs.append("Must provide either protein_path or protein_sequence")
            valid = False
        elif has_protein_path and has_protein_seq:
            row_msgs.append("Warning: Both protein_path and protein_sequence provided, will use protein_path")

        # Validate protein path if provided
        if has_protein_path:
            file_valid, msg = validate_file_path(row['protein_path'], base_dir)
            if not file_valid:
                row_msgs.append(f"Protein file issue: {msg}")
                valid = False

        # Validate ligand description
        if pd.isna(row['ligand_description']) or row['ligand_description'] == "":
            row_msgs.append("Missing ligand_description")
            valid = False
        else:
            ligand_desc = row['ligand_description']
            # Check if it's a file path or SMILES
            if os.path.exists(ligand_desc) or "/" in ligand_desc or "\\" in ligand_desc:
                # Likely a file path
                file_valid, msg = validate_file_path(ligand_desc, base_dir)
                if not file_valid:
                    row_msgs.append(f"Ligand file issue: {msg}")
                    valid = False
            else:
                # Likely a SMILES string
                smiles_valid, msg = validate_smiles(ligand_desc)
                if not smiles_valid:
                    row_msgs.append(f"SMILES issue: {msg}")
                    valid = False

        if row_msgs:
            messages.append(f"\nRow {idx + 1} ({row.get('complex_name', 'unnamed')}):")
            for msg in row_msgs:
                messages.append(f"  - {msg}")

    # Summary
    messages.append(f"\n{'='*60}")
    if valid:
        messages.append("✓ CSV validation PASSED - ready for DiffDock")
    else:
        messages.append("✗ CSV validation FAILED - please fix issues above")

    return valid, messages


def create_template_csv(output_path, num_examples=3):
    """Create a template CSV file with example entries."""

    examples = {
        'complex_name': ['example1', 'example2', 'example3'][:num_examples],
        'protein_path': ['protein1.pdb', '', 'protein3.pdb'][:num_examples],
        'ligand_description': [
            'CC(=O)Oc1ccccc1C(=O)O',  # Aspirin SMILES
            'COc1ccc(C#N)cc1',        # Example SMILES
            'ligand.sdf'              # Example file path
        ][:num_examples],
        'protein_sequence': [
            '',  # Empty - using PDB file
            'MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTTGKLPVPWPTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEGDTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGSVQLADHYQQNTPIGDGPVLLPDNHYLSTQSALSKDPNEKRDHMVLLEFVTAAGITHGMDELYK',  # GFP sequence
            ''   # Empty - using PDB file
        ][:num_examples]
    }

    df = pd.DataFrame(examples)
    df.to_csv(output_path, index=False)

    return df


def main():
    parser = argparse.ArgumentParser(
        description='Prepare and validate DiffDock batch CSV files',
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  # Validate existing CSV
  python prepare_batch_csv.py input.csv --validate

  # Create template CSV
  python prepare_batch_csv.py --create --output batch_template.csv

  # Create template with 5 example rows
  python prepare_batch_csv.py --create --output template.csv --num-examples 5

  # Validate with custom base directory for relative paths
  python prepare_batch_csv.py input.csv --validate --base-dir /path/to/data/
        """
    )

    parser.add_argument('csv_file', nargs='?', help='CSV file to validate')
    parser.add_argument('--validate', action='store_true',
                        help='Validate the CSV file')
    parser.add_argument('--create', action='store_true',
                        help='Create a template CSV file')
    parser.add_argument('--output', '-o', help='Output path for template CSV')
    parser.add_argument('--num-examples', type=int, default=3,
                        help='Number of example rows in template (default: 3)')
    parser.add_argument('--base-dir', help='Base directory for relative file paths')

    args = parser.parse_args()

    # Create template
    if args.create:
        output_path = args.output or 'diffdock_batch_template.csv'
        df = create_template_csv(output_path, args.num_examples)
        print(f"✓ Created template CSV: {output_path}")
        print("\nTemplate contents:")
        print(df.to_string(index=False))
        print("\nEdit this file with your protein-ligand pairs and run with:")
        print("  python -m inference --config default_inference_args.yaml \\")
        print(f"    --protein_ligand_csv {output_path} --out_dir results/")
        return 0

    # Validate CSV
    if args.validate or args.csv_file:
        if not args.csv_file:
            print("Error: CSV file required for validation")
            parser.print_help()
            return 1

        if not os.path.exists(args.csv_file):
            print(f"Error: CSV file not found: {args.csv_file}")
            return 1

        print(f"Validating: {args.csv_file}")
        print("="*60)

        valid, messages = validate_csv(args.csv_file, args.base_dir)

        for msg in messages:
            print(msg)

        return 0 if valid else 1

    # No action specified
    parser.print_help()
    return 1


if __name__ == '__main__':
    sys.exit(main())
278
skills/diffdock/scripts/setup_check.py
Executable file
@@ -0,0 +1,278 @@
#!/usr/bin/env python3
"""
DiffDock Environment Setup Checker

This script verifies that the DiffDock environment is properly configured
and all dependencies are available.

Usage:
    python setup_check.py
    python setup_check.py --verbose
"""

import argparse
import os
import sys
from pathlib import Path


def check_python_version():
    """Check Python version."""
    version = sys.version_info

    print("Checking Python version...")
    if version.major == 3 and version.minor >= 8:
        print(f"  ✓ Python {version.major}.{version.minor}.{version.micro}")
        return True
    else:
        print(f"  ✗ Python {version.major}.{version.minor}.{version.micro} "
              f"(requires Python 3.8 or higher)")
        return False


def check_package(package_name, import_name=None, version_attr='__version__'):
    """Check if a Python package is installed."""
    if import_name is None:
        import_name = package_name

    try:
        module = __import__(import_name)
        # version_attr may be dotted (e.g. 'rdBase.__version__'); walk it one
        # attribute at a time, since a single getattr() cannot follow dots.
        version = module
        for attr in version_attr.split('.'):
            version = getattr(version, attr, None)
            if version is None:
                version = 'unknown'
                break
        print(f"  ✓ {package_name:20s} (version: {version})")
        return True
    except ImportError:
        print(f"  ✗ {package_name:20s} (not installed)")
        return False


def check_pytorch():
    """Check PyTorch installation and CUDA availability."""
    print("\nChecking PyTorch...")
    try:
        import torch
        print(f"  ✓ PyTorch version: {torch.__version__}")

        # Check CUDA
        if torch.cuda.is_available():
            print(f"  ✓ CUDA available: {torch.cuda.get_device_name(0)}")
            print(f"    - CUDA version: {torch.version.cuda}")
            print(f"    - Number of GPUs: {torch.cuda.device_count()}")
            return True, True
        else:
            print("  ⚠ CUDA not available (will run on CPU)")
            return True, False
    except ImportError:
        print("  ✗ PyTorch not installed")
        return False, False

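# Equivalent one-liner (illustrative) for a quick manual check from a shell:
#   python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
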
def check_pytorch_geometric():
    """Check PyTorch Geometric installation."""
    print("\nChecking PyTorch Geometric...")
    packages = [
        ('torch-geometric', 'torch_geometric'),
        ('torch-scatter', 'torch_scatter'),
        ('torch-sparse', 'torch_sparse'),
        ('torch-cluster', 'torch_cluster'),
    ]

    all_ok = True
    for pkg_name, import_name in packages:
        if not check_package(pkg_name, import_name):
            all_ok = False

    return all_ok


def check_core_dependencies():
    """Check core DiffDock dependencies."""
    print("\nChecking core dependencies...")

    dependencies = [
        ('numpy', 'numpy'),
        ('scipy', 'scipy'),
        ('pandas', 'pandas'),
        ('rdkit', 'rdkit', 'rdBase.__version__'),
        ('biopython', 'Bio', '__version__'),
        ('pytorch-lightning', 'pytorch_lightning'),
        ('PyYAML', 'yaml'),
    ]

    all_ok = True
    for dep in dependencies:
        pkg_name = dep[0]
        import_name = dep[1]
        version_attr = dep[2] if len(dep) > 2 else '__version__'

        if not check_package(pkg_name, import_name, version_attr):
            all_ok = False

    return all_ok


def check_esm():
    """Check ESM (protein language model) installation."""
    print("\nChecking ESM (for protein sequence folding)...")
    try:
        import esm
        version = getattr(esm, '__version__', 'unknown')
        print(f"  ✓ ESM installed (version: {version})")
        return True
    except ImportError:
        print("  ⚠ ESM not installed (needed for protein sequence folding)")
        print("    Install with: pip install fair-esm")
        return False


def check_diffdock_installation():
    """Check if DiffDock is properly installed/cloned."""
    print("\nChecking DiffDock installation...")

    # Look for key files
    key_files = [
        'inference.py',
        'default_inference_args.yaml',
        'environment.yml',
    ]

    found_files = []
    missing_files = []

    for filename in key_files:
        if os.path.exists(filename):
            found_files.append(filename)
        else:
            missing_files.append(filename)

    if found_files:
        print("  ✓ Found DiffDock files in current directory:")
        for f in found_files:
            print(f"    - {f}")
    else:
        print("  ⚠ DiffDock files not found in current directory")
        print(f"    Current directory: {os.getcwd()}")
        print("    Make sure you're in the DiffDock repository root")

    # Check for model checkpoints
    model_dir = Path('./workdir/v1.1/score_model')
    confidence_dir = Path('./workdir/v1.1/confidence_model')

    if model_dir.exists() and confidence_dir.exists():
        print("  ✓ Model checkpoints found")
    else:
        print("  ⚠ Model checkpoints not found in ./workdir/v1.1/")
        print("    Models will be downloaded on first run")

    return len(found_files) > 0


def print_installation_instructions():
    """Print installation instructions if setup is incomplete."""
    print("\n" + "="*80)
    print("Installation Instructions")
    print("="*80)

    print("""
If DiffDock is not installed, follow these steps:

1. Clone the repository:
   git clone https://github.com/gcorso/DiffDock.git
   cd DiffDock

2. Create conda environment:
   conda env create --file environment.yml
   conda activate diffdock

3. Verify installation:
   python setup_check.py

For Docker installation:
   docker pull rbgcsail/diffdock
   docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
   micromamba activate diffdock

For more information, visit: https://github.com/gcorso/DiffDock
""")


def print_performance_notes(has_cuda):
    """Print performance notes based on available hardware."""
    print("\n" + "="*80)
    print("Performance Notes")
    print("="*80)

    if has_cuda:
        print("""
✓ GPU detected - DiffDock will run efficiently

Expected performance:
  - First run: ~2-5 minutes (pre-computing SO(2)/SO(3) tables)
  - Subsequent runs: ~10-60 seconds per complex (depending on settings)
  - Batch processing: Highly efficient with GPU
""")
    else:
        print("""
⚠ No GPU detected - DiffDock will run on CPU

Expected performance:
  - CPU inference is SIGNIFICANTLY slower than GPU
  - Single complex: Several minutes to hours
  - Batch processing: Not recommended on CPU

Recommendation: Use GPU for practical applications
  - Cloud options: Google Colab, AWS, or other cloud GPU services
  - Local: Install CUDA-capable GPU
""")


def main():
    parser = argparse.ArgumentParser(
        description='Check DiffDock environment setup',
        formatter_class=argparse.RawDescriptionHelpFormatter
    )

    # Note: --verbose is accepted for forward compatibility; the checks below
    # already print their details unconditionally.
    parser.add_argument('--verbose', '-v', action='store_true',
                        help='Show detailed version information')

    args = parser.parse_args()

    print("="*80)
    print("DiffDock Environment Setup Checker")
    print("="*80)

    checks = []

    # Run all checks
    checks.append(("Python version", check_python_version()))

    pytorch_ok, has_cuda = check_pytorch()
    checks.append(("PyTorch", pytorch_ok))

    checks.append(("PyTorch Geometric", check_pytorch_geometric()))
    checks.append(("Core dependencies", check_core_dependencies()))
    checks.append(("ESM", check_esm()))
    checks.append(("DiffDock files", check_diffdock_installation()))

    # Summary
    print("\n" + "="*80)
    print("Summary")
    print("="*80)

    all_passed = all(result for _, result in checks)

    for check_name, result in checks:
        status = "✓ PASS" if result else "✗ FAIL"
        print(f"  {status:8s} - {check_name}")

    if all_passed:
        print("\n✓ All checks passed! DiffDock is ready to use.")
        print_performance_notes(has_cuda)
        return 0
    else:
        print("\n✗ Some checks failed. Please install missing dependencies.")
        print_installation_instructions()
        return 1


if __name__ == '__main__':
    sys.exit(main())