Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/diffdock/SKILL.md
+++ b/skills/diffdock/SKILL.md
@@ -0,0 +1,477 @@
+---
+name: diffdock
+description: "Diffusion-based molecular docking. Predict protein-ligand binding poses from PDB/SMILES, confidence scores, virtual screening, for structure-based drug design. Not for affinity prediction."
+---
+
+# DiffDock: Molecular Docking with Diffusion Models
+
+## Overview
+
+DiffDock is a diffusion-based deep learning tool for molecular docking that predicts the 3D binding poses of small-molecule ligands on protein targets. It is a leading deep learning approach to computational docking, with applications in structure-based drug discovery and chemical biology.
+
+**Core Capabilities:**
+- Predict ligand binding poses with high accuracy using deep learning
+- Support protein structures (PDB files) or sequences (via ESMFold)
+- Process single complexes or batch virtual screening campaigns
+- Generate confidence scores to assess prediction reliability
+- Handle diverse ligand inputs (SMILES, SDF, MOL2)
+
+**Key Distinction:** DiffDock predicts **binding poses** (3D structure) and **confidence** (prediction certainty), NOT binding affinity (ΔG, Kd). Always combine with scoring functions (GNINA, MM/GBSA) for affinity assessment.
+
+## When to Use This Skill
+
+Use this skill when a request matches any of the following:
+
+- "Dock this ligand to a protein" or "predict binding pose"
+- "Run molecular docking" or "perform protein-ligand docking"
+- "Virtual screening" or "screen compound library"
+- "Where does this molecule bind?" or "predict binding site"
+- Structure-based drug design or lead optimization tasks
+- Tasks involving PDB files + SMILES strings or ligand structures
+- Batch docking of multiple protein-ligand pairs
+
+## Installation and Environment Setup
+
+### Check Environment Status
+
+Before proceeding with DiffDock tasks, verify the environment setup:
+
+```bash
+# Use the provided setup checker
+python scripts/setup_check.py
+```
+
+This script validates Python version, PyTorch with CUDA, PyTorch Geometric, RDKit, ESM, and other dependencies.
+
+### Installation Options
+
+**Option 1: Conda (Recommended)**
+```bash
+git clone https://github.com/gcorso/DiffDock.git
+cd DiffDock
+conda env create --file environment.yml
+conda activate diffdock
+```
+
+**Option 2: Docker**
+```bash
+docker pull rbgcsail/diffdock
+docker run -it --gpus all --entrypoint /bin/bash rbgcsail/diffdock
+micromamba activate diffdock
+```
+
+**Important Notes:**
+- GPU strongly recommended (10-100x speedup vs CPU)
+- First run pre-computes SO(2)/SO(3) lookup tables (~2-5 minutes)
+- Model checkpoints (~500MB) download automatically if not present
+
+## Core Workflows
+
+### Workflow 1: Single Protein-Ligand Docking
+
+**Use Case:** Dock one ligand to one protein target
+
+**Input Requirements:**
+- Protein: PDB file OR amino acid sequence
+- Ligand: SMILES string OR structure file (SDF/MOL2)
+
+**Command:**
+```bash
+python -m inference \
+  --config default_inference_args.yaml \
+  --protein_path protein.pdb \
+  --ligand "CC(=O)Oc1ccccc1C(=O)O" \
+  --out_dir results/single_docking/
+```
+
+**Alternative (protein sequence):**
+```bash
+python -m inference \
+  --config default_inference_args.yaml \
+  --protein_sequence "MSKGEELFTGVVPILVELDGDVNGHKF..." \
+  --ligand ligand.sdf \
+  --out_dir results/sequence_docking/
+```
+
+**Output Structure:**
+```
+results/single_docking/
+├── rank_1.sdf          # Top-ranked pose
+├── rank_2.sdf          # Second-ranked pose
+├── ...
+├── rank_10.sdf         # 10th pose (default: 10 samples)
+└── confidence_scores.txt
+```
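+
+When post-processing this directory, note that a plain string sort places `rank_10.sdf` before `rank_2.sdf`. The following is a minimal sketch of a numeric sort, assuming the files follow the `rank_N.sdf` naming shown above; `list_ranked_poses` is a hypothetical helper, not part of DiffDock.
+
+```python
+import re
+from pathlib import Path
+
+# Assumption: pose files are named rank_<N>.sdf, as in the output tree above.
+RANK_PATTERN = re.compile(r"^rank_(\d+)\.sdf$")
+
+
+def list_ranked_poses(out_dir):
+    """Return (rank, path) pairs for rank_N.sdf files, sorted numerically by rank."""
+    poses = []
+    for path in Path(out_dir).iterdir():
+        match = RANK_PATTERN.match(path.name)
+        if match:
+            poses.append((int(match.group(1)), path))
+    # Sort on the integer rank so that rank_10.sdf follows rank_2.sdf.
+    return sorted(poses, key=lambda item: item[0])
+```
+
+Files that do not match the pattern, such as `confidence_scores.txt`, are skipped.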
+
+### Workflow 2: Batch Processing Multiple Complexes
+
+**Use Case:** Dock multiple ligands to one or more proteins, for example in virtual screening campaigns
+
+**Step 1: Prepare Batch CSV**
+
+Use the provided script to create or validate batch input:
+
+```bash
+# Create template
+python scripts/prepare_batch_csv.py --create --output batch_input.csv
+
+# Validate existing CSV
+python scripts/prepare_batch_csv.py my_input.csv --validate
+```
+
+**CSV Format:**
+```csv
+complex_name,protein_path,ligand_description,protein_sequence
+complex1,protein1.pdb,CC(=O)Oc1ccccc1C(=O)O,
+complex2,,COc1ccc(C#N)cc1,MSKGEELFT...
+complex3,protein3.pdb,ligand3.sdf,
+```
+
+**Required Columns:**
+- `complex_name`: Unique identifier
+- `protein_path`: PDB file path (leave empty if using sequence)
+- `ligand_description`: SMILES string or ligand file path
+- `protein_sequence`: Amino acid sequence (leave empty if using PDB)
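+
+The column rules above can be checked before launching a long run. The following is a rough sketch using only the standard library, not the logic of `prepare_batch_csv.py`. The `check_batch_csv` name is hypothetical, and the "exactly one of `protein_path` or `protein_sequence`" rule is an assumption drawn from the "leave empty" notes above.
+
+```python
+import csv
+import io
+
+REQUIRED_COLUMNS = ["complex_name", "protein_path", "ligand_description", "protein_sequence"]
+
+
+def check_batch_csv(text):
+    """Return a list of human-readable problems found in batch CSV text."""
+    reader = csv.DictReader(io.StringIO(text))
+    missing = [c for c in REQUIRED_COLUMNS if c not in (reader.fieldnames or [])]
+    if missing:
+        return [f"missing columns: {', '.join(missing)}"]
+    problems, seen = [], set()
+    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
+        name = (row["complex_name"] or "").strip()
+        if not name:
+            problems.append(f"line {line_no}: empty complex_name")
+        elif name in seen:
+            problems.append(f"line {line_no}: duplicate complex_name '{name}'")
+        seen.add(name)
+        has_path = bool((row["protein_path"] or "").strip())
+        has_seq = bool((row["protein_sequence"] or "").strip())
+        # Assumed rule: a row supplies either a PDB path or a sequence, not both or neither.
+        if has_path == has_seq:
+            problems.append(f"line {line_no}: give exactly one of protein_path or protein_sequence")
+        if not (row["ligand_description"] or "").strip():
+            problems.append(f"line {line_no}: empty ligand_description")
+    return problems
+```
+
+An empty list means no problems were found. The sketch does not verify that files exist or that SMILES strings parse.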
+
+**Step 2: Run Batch Docking**
+
+```bash
+python -m inference \
+  --config default_inference_args.yaml \
+  --protein_ligand_csv batch_input.csv \
+  --out_dir results/batch/ \
+  --batch_size 10
+```
+
+**For Large Virtual Screening (>100 compounds):**
+
+Pre-compute protein embeddings for faster processing:
+```bash
+# Pre-compute embeddings
+python datasets/esm_embedding_preparation.py \
+  --protein_ligand_csv screening_input.csv \
+  --out_file protein_embeddings.pt
+
+# Run with pre-computed embeddings
+python -m inference \
+  --config default_inference_args.yaml \
+  --protein_ligand_csv screening_input.csv \
+  --esm_embeddings_path protein_embeddings.pt \
+  --out_dir results/screening/
+```
+
+### Workflow 3: Analyzing Results
+
+After docking completes, analyze confidence scores and rank predictions:
+
+```bash
+# Analyze all results
+python scripts/analyze_results.py results/batch/
+
+# Show top 5 per complex
+python scripts/analyze_results.py results/batch/ --top 5
+
+# Filter by confidence threshold
+python scripts/analyze_results.py results/batch/ --threshold 0.0
+
+# Export to CSV
+python scripts/analyze_results.py results/batch/ --export summary.csv
+
+# Show top 20 predictions across all complexes
+python scripts/analyze_results.py results/batch/ --best 20
+```
+
+The analysis script:
+- Parses confidence scores from all predictions
+- Classifies as High (>0), Moderate (-1.5 to 0), or Low (<-1.5)
+- Ranks predictions within and across complexes
+- Generates statistical summaries
+- Exports results to CSV for downstream analysis
+
+## Confidence Score Interpretation
+
+**Understanding Scores:**
+
+| Score Range | Confidence Level | Interpretation |
+|------------|------------------|----------------|
+| **> 0** | High | Strong prediction, likely accurate |
+| **-1.5 to 0** | Moderate | Reasonable prediction, validate carefully |
+| **< -1.5** | Low | Uncertain prediction, requires validation |
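+
+The bands in the table can be written as a small function. This is a sketch of the thresholds only, not the code in `analyze_results.py`. The name `confidence_level` is hypothetical, and assigning the boundary values 0 and -1.5 to "Moderate" is an assumption based on the table's strict inequalities.
+
+```python
+def confidence_level(score):
+    """Map a confidence score to the High / Moderate / Low bands in the table above."""
+    if score > 0:
+        return "High"
+    if score >= -1.5:
+        # Covers -1.5 to 0 inclusive (the boundary handling is an assumption).
+        return "Moderate"
+    return "Low"
+
+
+# Placeholder scores for illustration, not real DiffDock output.
+example_scores = [0.4, -0.8, -2.3]
+example_levels = [confidence_level(s) for s in example_scores]
+```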
+
+**Critical Notes:**
+1. **Confidence ≠ Affinity**: High confidence means model certainty about structure, NOT strong binding
+2. **Context Matters**: Adjust expectations for:
+   - Large ligands (>500 Da): Lower confidence expected
+   - Multiple protein chains: May decrease confidence
+   - Novel protein families: May underperform
+3. **Multiple Samples**: Review top 3-5 predictions, look for consensus
+
+**For detailed guidance:** Read `references/confidence_and_limitations.md` using the Read tool
+
+## Parameter Customization
+
+### Using Custom Configuration
+
+Create custom configuration for specific use cases:
+
+```bash
+# Copy template
+cp assets/custom_inference_config.yaml my_config.yaml
+
+# Edit parameters (see template for presets)
+# Then run with custom config
+python -m inference \
+  --config my_config.yaml \
+  --protein_ligand_csv input.csv \
+  --out_dir results/
+```
+
+### Key Parameters to Adjust
+
+**Sampling Density:**
+- `samples_per_complex: 10` → Increase to 20-40 for difficult cases
+- More samples = better coverage but longer runtime
+
+**Inference Steps:**
+- `inference_steps: 20` → Increase to 25-30 for higher accuracy
+- More steps = potentially better quality but slower
+
+**Temperature Parameters (control diversity):**
+- `temp_sampling_tor: 7.04` → Increase to 8-10 for flexible ligands, decrease to 5-6 for rigid ligands
+- Higher temperature = more diverse poses
+
+**Presets Available in Template:**
+1. High Accuracy: More samples + steps, lower temperature
+2. Fast Screening: Fewer samples, faster
+3. Flexible Ligands: Increased torsion temperature
+4. Rigid Ligands: Decreased torsion temperature
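+
+A preset can be expressed as a set of overrides on top of the defaults. The sketch below uses the three parameters and the ranges quoted in this section. The specific preset values are illustrative picks from those ranges, not the contents of `assets/custom_inference_config.yaml`, and `build_overrides` and `to_yaml` are hypothetical helpers.
+
+```python
+# Defaults as quoted in this section.
+DEFAULTS = {"samples_per_complex": 10, "inference_steps": 20, "temp_sampling_tor": 7.04}
+
+# Illustrative values chosen from the ranges above; not the template's actual presets.
+PRESETS = {
+    "high_accuracy": {"samples_per_complex": 40, "inference_steps": 30, "temp_sampling_tor": 6.0},
+    "flexible_ligands": {"temp_sampling_tor": 9.0},
+    "rigid_ligands": {"temp_sampling_tor": 5.5},
+}
+
+
+def build_overrides(preset):
+    """Merge a named preset on top of the defaults and return the resulting settings."""
+    if preset not in PRESETS:
+        raise KeyError(f"unknown preset: {preset}")
+    return {**DEFAULTS, **PRESETS[preset]}
+
+
+def to_yaml(settings):
+    """Render flat key/value settings as simple 'key: value' YAML lines."""
+    return "\n".join(f"{key}: {value}" for key, value in settings.items())
+```
+
+For example, `to_yaml(build_overrides("rigid_ligands"))` yields three lines that could be merged by hand into a copy of the configuration template.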
+
+**For complete parameter reference:** Read `references/parameters_reference.md` using the Read tool
+
+## Advanced Techniques
+
+### Ensemble Docking (Protein Flexibility)
+
+For proteins with known flexibility, dock to multiple conformations:
+
+```python
+# Create ensemble CSV
+import pandas as pd
+
+conformations = ["conf1.pdb", "conf2.pdb", "conf3.pdb"]
+ligand = "CC(=O)Oc1ccccc1C(=O)O"
+
+data = {
+    "complex_name": [f"ensemble_{i}" for i in range(len(conformations))],
+    "protein_path": conformations,
+    "ligand_description": [ligand] * len(conformations),
+    "protein_sequence": [""] * len(conformations)
+}
+
+pd.DataFrame(data).to_csv("ensemble_input.csv", index=False)
+```
+
+Run docking with increased sampling:
+```bash
+python -m inference \
+  --config default_inference_args.yaml \
+  --protein_ligand_csv ensemble_input.csv \
+  --samples_per_complex 20 \
+  --out_dir results/ensemble/
+```
+
+### Integration with Scoring Functions
+
+DiffDock generates poses; combine with other tools for affinity:
+
+**GNINA (Fast neural network scoring):**
+```bash
+for pose in results/*.sdf; do
+    gnina -r protein.pdb -l "$pose" --score_only
+done
+```
+
+**MM/GBSA (More accurate, slower):**
+Use AmberTools MMPBSA.py or gmx_MMPBSA after energy minimization
+
+**Free Energy Calculations (Most accurate):**
+Use OpenMM + OpenFE or GROMACS for FEP/TI calculations
+
+**Recommended Workflow:**
+1. DiffDock → Generate poses with confidence scores
+2. Visual inspection → Check structural plausibility
+3. GNINA or MM/GBSA → Rescore and rank by affinity
+4. Experimental validation → Biochemical assays
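+
+Steps 1 and 3 can be combined in a simple triage: drop poses the model is unsure about, then order the remainder by the rescoring result. This is a minimal sketch under stated assumptions. The dictionary keys, the `shortlist` helper, and the "lower score means stronger predicted binding" default are all hypothetical, so adjust the direction to whichever scoring function is used.
+
+```python
+def shortlist(poses, min_confidence=-1.5, lower_is_better=True):
+    """Filter poses by confidence, then sort the survivors by an external affinity score.
+
+    Each pose is a dict with the hypothetical keys 'name', 'confidence'
+    and 'affinity_score' (the latter produced by a separate rescoring tool).
+    """
+    kept = [p for p in poses if p["confidence"] >= min_confidence]
+    # Energy-like scores rank ascending; set lower_is_better=False for "higher is better" scores.
+    return sorted(kept, key=lambda p: p["affinity_score"], reverse=not lower_is_better)
+
+
+# Placeholder values for illustration only, not real docking results.
+example = [
+    {"name": "pose_a", "confidence": 0.2, "affinity_score": -7.1},
+    {"name": "pose_b", "confidence": -2.0, "affinity_score": -9.5},
+    {"name": "pose_c", "confidence": -0.5, "affinity_score": -8.0},
+]
+ranked_names = [p["name"] for p in shortlist(example)]
+```
+
+Here `pose_b` is excluded despite its favorable score, because its confidence falls in the low band. Visual inspection and experimental validation still apply to whatever remains.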
+
+## Limitations and Scope
+
+**DiffDock IS Designed For:**
+- Small molecule ligands (typically 100-1000 Da)
+- Drug-like organic compounds
+- Small peptides (<20 residues)
+- Single or multi-chain proteins
+
+**DiffDock IS NOT Designed For:**
+- Large biomolecules (protein-protein docking) → Use DiffDock-PP or AlphaFold-Multimer
+- Large peptides (>20 residues) → Use alternative methods
+- Covalent docking → Use specialized covalent docking tools
+- Binding affinity prediction → Combine with scoring functions
+- Membrane proteins → Not specifically trained, use with caution
+
+**For complete limitations:** Read `references/confidence_and_limitations.md` using the Read tool
+
+## Troubleshooting
+
+### Common Issues
+
+**Issue: Low confidence scores across all predictions**
+- Cause: Large/unusual ligands, unclear binding site, protein flexibility
+- Solution: Increase `samples_per_complex` (20-40), try ensemble docking, validate protein structure
+
+**Issue: Out of memory errors**
+- Cause: GPU memory insufficient for batch size
+- Solution: Reduce the batch size (e.g. `--batch_size 2`) or process fewer complexes at once
+
+**Issue: Slow performance**
+- Cause: Running on CPU instead of GPU
+- Solution: Verify CUDA with `python -c "import torch; print(torch.cuda.is_available())"`, use GPU
+
+**Issue: Unrealistic binding poses**
+- Cause: Poor protein preparation, ligand too large, wrong binding site
+- Solution: Check the protein for missing residues, remove water molecules far from the expected binding site, and consider specifying the binding site
+
+**Issue: "Module not found" errors**
+- Cause: Missing dependencies or wrong environment
+- Solution: Run `python scripts/setup_check.py` to diagnose
+
+### Performance Optimization
+
+**For Best Results:**
+1. Use GPU (essential for practical use)
+2. Pre-compute ESM embeddings for repeated protein use
+3. Batch process multiple complexes together
+4. Start with default parameters, then tune if needed
+5. Validate protein structures (resolve missing residues)
+6. Use canonical SMILES for ligands
+
+## Graphical User Interface
+
+For interactive use, launch the web interface:
+
+```bash
+python app/main.py
+# Navigate to http://localhost:7860
+```
+
+Or use the online demo without installation:
+- https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
+
+## Resources
+
+### Helper Scripts (`scripts/`)
+
+**`prepare_batch_csv.py`**: Create and validate batch input CSV files
+- Create templates with example entries
+- Validate file paths and SMILES strings
+- Check for required columns and format issues
+
+**`analyze_results.py`**: Analyze confidence scores and rank predictions
+- Parse results from single or batch runs
+- Generate statistical summaries
+- Export to CSV for downstream analysis
+- Identify top predictions across complexes
+
+**`setup_check.py`**: Verify DiffDock environment setup
+- Check Python version and dependencies
+- Verify PyTorch and CUDA availability
+- Test RDKit and PyTorch Geometric installation
+- Provide installation instructions if needed
+
+### Reference Documentation (`references/`)
+
+**`parameters_reference.md`**: Complete parameter documentation
+- All command-line options and configuration parameters
+- Default values and acceptable ranges
+- Temperature parameters for controlling diversity
+- Model checkpoint locations and version flags
+
+Read this file when users need:
+- Detailed parameter explanations
+- Fine-tuning guidance for specific systems
+- Alternative sampling strategies
+
+**`confidence_and_limitations.md`**: Confidence score interpretation and tool limitations
+- Detailed confidence score interpretation
+- When to trust predictions
+- Scope and limitations of DiffDock
+- Integration with complementary tools
+- Troubleshooting prediction quality
+
+Read this file when users need:
+- Help interpreting confidence scores
+- Understanding when NOT to use DiffDock
+- Guidance on combining with other tools
+- Validation strategies
+
+**`workflows_examples.md`**: Comprehensive workflow examples
+- Detailed installation instructions
+- Step-by-step examples for all workflows
+- Advanced integration patterns
+- Troubleshooting common issues
+- Best practices and optimization tips
+
+Read this file when users need:
+- Complete workflow examples with code
+- Integration with GNINA, OpenMM, or other tools
+- Virtual screening workflows
+- Ensemble docking procedures
+
+### Assets (`assets/`)
+
+**`batch_template.csv`**: Template for batch processing
+- Pre-formatted CSV with required columns
+- Example entries showing different input types
+- Ready to customize with actual data
+
+**`custom_inference_config.yaml`**: Configuration template
+- Annotated YAML with all parameters
+- Four preset configurations for common use cases
+- Detailed comments explaining each parameter
+- Ready to customize and use
+
+## Best Practices
+
+1. **Always verify environment** with `setup_check.py` before starting large jobs
+2. **Validate batch CSVs** with `prepare_batch_csv.py` to catch errors early
+3. **Start with defaults** then tune parameters based on system-specific needs
+4. **Generate multiple samples** (10-40) for robust predictions
+5. **Visually inspect** top poses before downstream analysis
+6. **Combine with scoring** functions for affinity assessment
+7. **Use confidence scores** for initial ranking, not final decisions
+8. **Pre-compute embeddings** for virtual screening campaigns
+9. **Document parameters** used for reproducibility
+10. **Validate results** experimentally when possible
+
+## Citations
+
+When using DiffDock, cite the appropriate papers:
+
+**DiffDock-L (current default model):**
+```
+Corso et al. (2024) "Deep Confident Steps to New Pockets: Strategies for Docking Generalization"
+ICLR 2024, arXiv:2402.18396
+```
+
+**Original DiffDock:**
+```
+Corso et al. (2023) "DiffDock: Diffusion Steps, Twists, and Turns for Molecular Docking"
+ICLR 2023, arXiv:2210.01776
+```
+
+## Additional Resources
+
+- **GitHub Repository**: https://github.com/gcorso/DiffDock
+- **Online Demo**: https://huggingface.co/spaces/reginabarzilaygroup/DiffDock-Web
+- **DiffDock-L Paper**: https://arxiv.org/abs/2402.18396
+- **Original Paper**: https://arxiv.org/abs/2210.01776