Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/torchdrug/references/protein_modeling.md
+++ b/skills/torchdrug/references/protein_modeling.md
@@ -0,0 +1,272 @@
+# Protein Modeling
+
+## Overview
+
+TorchDrug provides extensive support for protein-related tasks including sequence analysis, structure prediction, property prediction, and protein-protein interactions. Proteins are represented as graphs where nodes are amino acid residues and edges represent spatial or sequential relationships.
+
+## Available Datasets
+
+### Protein Function Prediction
+
+**Enzyme Function:**
+- **EnzymeCommission** (17,562 proteins): EC number classification (EC numbers form a four-level hierarchy with 7 top-level classes)
+- **BetaLactamase** (5,864 sequences): Enzyme activity prediction
+
+**Protein Characteristics:**
+- **Fluorescence** (54,025 sequences): GFP fluorescence intensity
+- **Stability** (53,614 sequences): Thermostability prediction
+- **Solubility** (62,478 sequences): Protein solubility classification
+- **BinaryLocalization** (22,168 proteins): Subcellular localization (membrane vs. soluble)
+- **SubcellularLocalization** (8,943 proteins): 10-class localization prediction
+
+**Gene Ontology:**
+- **GeneOntology** (46,796 proteins): GO term prediction across biological process, molecular function, and cellular component
+
+### Protein Structure Prediction
+
+- **Fold** (16,712 proteins): Structural fold classification (1,195 classes)
+- **SecondaryStructure** (8,678 proteins): 3-state or 8-state secondary structure prediction
+- **ContactPrediction** via ProteinNet: Residue-residue contact maps
+
+### Protein Interaction
+
+**Protein-Protein Interactions:**
+- **HumanPPI** (1,412 proteins, 6,584 interactions): Human protein interaction network
+- **YeastPPI** (2,018 proteins, 6,451 interactions): Yeast protein interaction network
+- **PPIAffinity** (2,156 protein pairs): Binding affinity measurements
+
+**Protein-Ligand Binding:**
+- **BindingDB** (~1.5M entries): Comprehensive binding affinity database
+- **PDBBind** (20,000+ complexes): 3D structure-based binding data
+  - Refined set: High-quality crystal structures
+  - Core set: Diverse benchmark set
+
+### Large-Scale Protein Databases
+
+- **AlphaFoldDB**: Access to 200M+ predicted protein structures
+- **ProteinNet**: Standardized dataset for structure prediction
+
+## Task Types
+
+### NodePropertyPrediction
+
+Predict properties at the residue (node) level, such as secondary structure or contact maps.
+
+**Use Cases:**
+- Secondary structure prediction (helix, sheet, coil)
+- Residue-level disorder prediction
+- Post-translational modification sites
+- Binding site prediction
+
+### PropertyPrediction
+
+Predict protein-level properties like function, stability, or localization.
+
+**Use Cases:**
+- Enzyme function classification
+- Subcellular localization
+- Protein stability prediction
+- Gene ontology term prediction
+
+### InteractionPrediction
+
+Predict interactions between protein pairs or protein-ligand pairs.
+
+**Key Features:**
+- Handles both sequence and structure inputs
+- Supports symmetric (PPI) and asymmetric (protein-ligand) interactions
+- Multiple negative sampling strategies
+
+### ContactPrediction
+
+Specialized task for predicting spatial proximity between residues in folded structures.
+
+**Applications:**
+- Structure prediction from sequence
+- Protein folding pathway analysis
+- Validation of predicted structures
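+
+As a minimal sketch of what a contact map encodes (plain Python, not TorchDrug's `ContactPrediction` implementation): two residues count as in contact when their Cα atoms lie within a distance threshold and they are far enough apart in sequence. The function name, the 8 Å threshold, and the minimum sequence separation of 6 are illustrative assumptions.
+
+```python
+import math
+
+def contact_map(ca_coords, threshold=8.0, min_separation=6):
+    """Return a symmetric 0/1 matrix; entry (i, j) is 1 when residues i and j are in contact.
+
+    ca_coords: list of (x, y, z) C-alpha positions in angstroms, in sequence order.
+    threshold and min_separation are illustrative defaults, not library constants.
+    """
+    n = len(ca_coords)
+    contacts = [[0] * n for _ in range(n)]
+    for i in range(n):
+        # Skip pairs that are close in sequence; they are trivially close in space
+        for j in range(i + min_separation, n):
+            if math.dist(ca_coords[i], ca_coords[j]) < threshold:
+                contacts[i][j] = contacts[j][i] = 1
+    return contacts
+```
+
+A contact prediction model is trained to reproduce a matrix like this from sequence alone.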
+
+## Protein Representation Models
+
+### Sequence-Based Models
+
+**ESM (Evolutionary Scale Modeling):**
+- Pre-trained transformer model on 250M sequences
+- Strong performance on sequence-only tasks
+- Available in multiple sizes (ESM-1b, ESM-2)
+- Captures evolutionary and structural information
+
+**ProteinBERT:**
+- BERT-style masked language model
+- Pre-trained on UniProt sequences
+- Good for transfer learning
+
+**ProteinLSTM:**
+- Bidirectional LSTM for sequence encoding
+- Lightweight and fast
+- Good baseline for sequence tasks
+
+**ProteinCNN / ProteinResNet:**
+- Convolutional architectures
+- Capture local sequence patterns
+- Faster than transformer models
+
+### Structure-Based Models
+
+**GearNet (Geometry-Aware Relational Graph Network):**
+- Incorporates 3D geometric information
+- Edge types built from sequential neighbors, a radius cutoff, and K-nearest neighbors
+- Strong performance on structure-based tasks
+- Supports both backbone and full-atom representations
+
+**GCN/GAT/GIN on Protein Graphs:**
+- Standard GNN architectures adapted for proteins
+- Flexible edge definitions (sequence, spatial, contact)
+
+**SchNet:**
+- Continuous-filter convolutions
+- Handles 3D coordinates directly
+- Good for structure prediction and protein-ligand binding
+
+### Feature-Based Models
+
+**Statistic Features:**
+- Amino acid composition
+- Sequence length statistics
+- Motif counts
+
+**Physicochemical Features:**
+- Hydrophobicity scales
+- Charge properties
+- Secondary structure propensity
+- Molecular weight, pI
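+
+The feature families above can be illustrated with a small plain-Python sketch that computes amino acid composition and mean hydropathy on the Kyte-Doolittle scale. The function name and the choice of this particular scale are assumptions for illustration; this is not a TorchDrug API.
+
+```python
+from collections import Counter
+
+# Kyte-Doolittle hydropathy index for the 20 standard amino acids
+KYTE_DOOLITTLE = {
+    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
+    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
+    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
+    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
+}
+
+def sequence_features(sequence):
+    """Return (composition, mean_hydropathy) for a one-letter protein sequence.
+
+    Assumes the sequence is non-empty and contains only standard residues.
+    """
+    counts = Counter(sequence)
+    # Fraction of each standard residue, in alphabetical order of the one-letter codes
+    composition = [counts[aa] / len(sequence) for aa in sorted(KYTE_DOOLITTLE)]
+    mean_hydropathy = sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)
+    return composition, mean_hydropathy
+```
+
+Fixed-length vectors like these can feed a simple classifier or regressor without any graph or sequence model.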
+
+## Protein Graph Construction
+
+### Edge Types
+
+**Sequential Edges:**
+- Connect adjacent residues in sequence
+- Captures primary structure
+
+**Spatial Edges:**
+- K-nearest neighbors in 3D space
+- Radius cutoff (e.g., Cα atoms within 10Å)
+- Captures tertiary structure
+
+**Contact Edges:**
+- Based on heavy atom distances
+- Typically < 8Å threshold
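+
+The edge types above can be sketched in plain Python over Cα coordinates. This is an illustration of the idea rather than TorchDrug's own graph construction; the function name, the default radius, and the default `k` are assumptions.
+
+```python
+import math
+
+def build_edges(ca_coords, radius=10.0, k=3):
+    """Return a set of (source, target, edge_type) triples over residue indices."""
+    n = len(ca_coords)
+    edges = set()
+    for i in range(n):
+        # Sequential edges: link each residue to its successor, in both directions
+        if i + 1 < n:
+            edges.add((i, i + 1, "sequential"))
+            edges.add((i + 1, i, "sequential"))
+        # Distances from residue i to every other residue, nearest first
+        neighbors = sorted(
+            (math.dist(ca_coords[i], ca_coords[j]), j) for j in range(n) if j != i
+        )
+        # Radius edges: every residue closer than the cutoff
+        for distance, j in neighbors:
+            if distance < radius:
+                edges.add((i, j, "radius"))
+        # KNN edges: the k closest residues, regardless of distance
+        for _, j in neighbors[:k]:
+            edges.add((i, j, "knn"))
+    return edges
+```
+
+Keeping the edge type in each triple lets a relational model apply a different transformation per edge type.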
+
+### Node Features
+
+**Residue Identity:**
+- One-hot encoding of 20 amino acids
+- Learned embeddings
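+
+A minimal sketch of the one-hot option, in plain Python. The residue ordering and the names `AMINO_ACIDS` and `one_hot_encode` are arbitrary choices for illustration, not TorchDrug's internal vocabulary.
+
+```python
+# The 20 standard amino acids in one-letter code; this ordering is an arbitrary choice
+AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
+INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
+
+def one_hot_encode(sequence):
+    """Return one 20-dimensional 0/1 vector per residue."""
+    encoded = []
+    for residue in sequence:
+        vector = [0] * len(AMINO_ACIDS)
+        vector[INDEX[residue]] = 1  # raises KeyError for non-standard residues
+        encoded.append(vector)
+    return encoded
+```
+
+A learned embedding replaces each of these fixed vectors with a trainable one looked up by the same index.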
+
+**Position Information:**
+- 3D coordinates (Cα, N, C, O)
+- Backbone angles (phi, psi, omega)
+- Relative spatial position encodings
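+
+Backbone angles are dihedral angles over four consecutive backbone atoms: phi uses C(i-1), N(i), Cα(i), C(i); psi uses N(i), Cα(i), C(i), N(i+1); omega uses Cα(i), C(i), N(i+1), Cα(i+1). The sketch below computes a signed dihedral from four points in plain Python; the function name is hypothetical and the code is not taken from TorchDrug.
+
+```python
+import math
+
+def dihedral(p0, p1, p2, p3):
+    """Signed dihedral angle in degrees defined by four 3D points."""
+    def sub(a, b):
+        return [x - y for x, y in zip(a, b)]
+
+    def dot(a, b):
+        return sum(x * y for x, y in zip(a, b))
+
+    def cross(a, b):
+        return [a[1] * b[2] - a[2] * b[1],
+                a[2] * b[0] - a[0] * b[2],
+                a[0] * b[1] - a[1] * b[0]]
+
+    b0, b1, b2 = sub(p0, p1), sub(p2, p1), sub(p3, p2)
+    length = math.sqrt(dot(b1, b1))
+    b1 = [x / length for x in b1]  # unit vector along the central bond
+    # Components of the outer bonds perpendicular to the central bond
+    v = [x - dot(b0, b1) * u for x, u in zip(b0, b1)]
+    w = [x - dot(b2, b1) * u for x, u in zip(b2, b1)]
+    return math.degrees(math.atan2(dot(cross(b1, v), w), dot(v, w)))
+```
+
+Because angles wrap around at ±180°, models often consume the sine and cosine of each angle instead of the raw value.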
+
+**Physicochemical Properties:**
+- Hydrophobicity
+- Charge
+- Size
+- Secondary structure
+
+## Training Workflows
+
+### Pre-training Strategies
+
+**Self-Supervised Pre-training:**
+- Masked residue prediction (like BERT)
+- Distance prediction between residues
+- Angle prediction (angle between two adjacent edges)
+- Dihedral angle prediction (torsions such as backbone phi, psi, omega)
+- Contact map prediction
+
+**Pre-trained Model Usage:**
+```python
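+from torchdrug import models, tasks
+
+# Load pre-trained ESM-1b; `path` is the directory where the weights are stored
+model = models.ESM(path="~/esm-model-weights/", model="ESM-1b")
+
+# Fine-tune on a downstream task; names must match the dataset's target fields
+task = tasks.PropertyPrediction(
+    model, task=["stability_score"],
+    criterion="mse", metric=["mae", "rmse"]
+)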
+```
+
+### Multi-Task Learning
+
+Train on multiple related tasks simultaneously:
+- Joint prediction of function, localization, and stability
+- Improves generalization and data efficiency
+- Shares representations across tasks
+
+### Best Practices
+
+**For Sequence-Only Tasks:**
+1. Start with pre-trained ESM or ProteinBERT
+2. Fine-tune with small learning rate (1e-5 to 1e-4)
+3. Use frozen embeddings for small datasets
+4. Apply dropout for regularization
+
+**For Structure-Based Tasks:**
+1. Use GearNet with multiple edge types
+2. Include geometric features (angles, dihedrals)
+3. Pre-train on large structure databases
+4. Use data augmentation (rotations, crops)
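+
+Rotation augmentation from the list above can be sketched in plain Python. This version draws a random axis and angle and applies Rodrigues' rotation formula; it is an illustration under those assumptions, not a TorchDrug transform, and the sampled rotations are not uniformly distributed over all 3D rotations.
+
+```python
+import math
+import random
+
+def random_rotation(coords, rng=random):
+    """Rotate all (x, y, z) points about the origin by one random rotation."""
+    # Random unit axis (normalized Gaussian sample) and random angle
+    axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
+    norm = math.sqrt(sum(a * a for a in axis))
+    ux, uy, uz = (a / norm for a in axis)
+    angle = rng.uniform(0.0, 2.0 * math.pi)
+    c, s = math.cos(angle), math.sin(angle)
+
+    rotated = []
+    for x, y, z in coords:
+        # Rodrigues' formula: v*cos + (u x v)*sin + u*(u . v)*(1 - cos)
+        d = ux * x + uy * y + uz * z
+        cx, cy, cz = uy * z - uz * y, uz * x - ux * z, ux * y - uy * x
+        rotated.append((
+            x * c + cx * s + ux * d * (1.0 - c),
+            y * c + cy * s + uy * d * (1.0 - c),
+            z * c + cz * s + uz * d * (1.0 - c),
+        ))
+    return rotated
+```
+
+Pairwise distances are unchanged by the rotation, so distance-based edges and labels stay valid while raw coordinates vary between epochs.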
+
+**For Small Datasets:**
+1. Transfer learning from pre-trained models
+2. Multi-task learning with related tasks
+3. Data augmentation (sequence mutations, structure perturbations)
+4. Strong regularization (dropout, weight decay)
+
+## Common Use Cases
+
+### Enzyme Engineering
+- Predict enzyme activity from sequence
+- Design mutations to improve stability
+- Screen for desired catalytic properties
+
+### Antibody Design
+- Predict binding affinity
+- Optimize antibody sequences
+- Predict immunogenicity
+
+### Drug Target Identification
+- Predict protein function
+- Identify druggable sites
+- Analyze protein-ligand interactions
+
+### Protein Structure Prediction
+- Predict secondary structure from sequence
+- Generate contact maps for tertiary structure
+- Refine AlphaFold predictions
+
+## Integration with Other Tools
+
+### AlphaFold Integration
+
+Load AlphaFold-predicted structures:
+```python
+from torchdrug import data
+
+# Load AlphaFold structure
+protein = data.Protein.from_pdb("alphafold_structure.pdb")
+
+# Use in TorchDrug workflows
+```
+
+### ESMFold Integration
+
+Use ESMFold for structure prediction, then analyze with TorchDrug models.
+
+### Rosetta/PyRosetta
+
+Generate structures with Rosetta, then import them into TorchDrug for analysis.