# Protein Modeling

## Overview

TorchDrug supports a range of protein-related tasks, including sequence analysis, structure prediction, property prediction, and protein-protein interactions. Proteins are represented as graphs where nodes are amino acid residues and edges represent spatial or sequential relationships.

## Available Datasets

### Protein Function Prediction

**Enzyme Function:**
- **EnzymeCommission** (17,562 proteins): EC number classification (EC numbers are four-level hierarchical codes grouped into 7 top-level classes)
- **BetaLactamase** (5,864 sequences): Enzyme activity prediction

**Protein Characteristics:**
- **Fluorescence** (54,025 sequences): GFP fluorescence intensity
- **Stability** (53,614 sequences): Thermostability prediction
- **Solubility** (62,478 sequences): Protein solubility classification
- **BinaryLocalization** (22,168 proteins): Subcellular localization (membrane vs. soluble)
- **SubcellularLocalization** (8,943 proteins): 10-class localization prediction

**Gene Ontology:**
- **GeneOntology** (46,796 proteins): GO term prediction across biological process, molecular function, and cellular component

### Protein Structure Prediction

- **Fold** (16,712 proteins): Structural fold classification (1,195 classes)
- **SecondaryStructure** (8,678 proteins): 3-state or 8-state secondary structure prediction
- **ContactPrediction** via ProteinNet: Residue-residue contact maps

### Protein Interaction

**Protein-Protein Interactions:**
- **HumanPPI** (1,412 proteins, 6,584 interactions): Human protein interaction network
- **YeastPPI** (2,018 proteins, 6,451 interactions): Yeast protein interaction network
- **PPIAffinity** (2,156 protein pairs): Binding affinity measurements

**Protein-Ligand Binding:**
- **BindingDB** (~1.5M entries): Comprehensive binding affinity database
- **PDBBind** (20,000+ complexes): 3D structure-based binding data
  - Refined set: High-quality crystal structures
  - Core set: Diverse benchmark set

### Large-Scale Protein Databases

- **AlphaFoldDB**: Access to 200M+ predicted protein structures
- **ProteinNet**: Standardized dataset for structure prediction

## Task Types

### NodePropertyPrediction

Predict properties at the residue (node) level, such as secondary structure or binding sites. Each residue receives its own label.

**Use Cases:**
- Secondary structure prediction (helix, sheet, coil)
- Residue-level disorder prediction
- Post-translational modification sites
- Binding site prediction
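
As an illustration of a residue-level target, the sketch below reduces 8-state secondary structure labels to the 3 states (helix, sheet, coil) and scores a prediction with per-residue accuracy. It is plain Python, not TorchDrug code; the function names are hypothetical, and the 8-to-3 mapping shown is one common convention that is assumed here rather than taken from the dataset.

```python
# Assumed reduction of DSSP 8-state labels to 3 states:
# H, G, I -> helix (H); E, B -> strand (E); everything else -> coil (C).
EIGHT_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_to_three_states(labels: str) -> str:
    """Map a string of 8-state labels to 3-state labels, one per residue."""
    return "".join(EIGHT_TO_THREE.get(state, "C") for state in labels)


def per_residue_accuracy(predicted: str, target: str) -> float:
    """Fraction of residues whose predicted label matches the target label."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    if not target:
        return 0.0
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)
```

Because every residue contributes one label, long proteins dominate an accuracy computed over a pooled set of residues; averaging per protein first is an alternative design choice.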

### PropertyPrediction

Predict protein-level properties like function, stability, or localization.

**Use Cases:**
- Enzyme function classification
- Subcellular localization
- Protein stability prediction
- Gene ontology term prediction

### InteractionPrediction

Predict interactions between protein pairs or protein-ligand pairs.

**Key Features:**
- Handles both sequence and structure inputs
- Supports symmetric (PPI) and asymmetric (protein-ligand) interactions
- Multiple negative sampling strategies
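
Negative sampling is needed because interaction datasets usually list only observed (positive) pairs. The following is a minimal sketch of uniform random negative sampling for symmetric PPI data, written in plain Python. The function name and arguments are hypothetical and do not correspond to a TorchDrug API; it also assumes that any unlisted pair can be treated as a negative, which is only an approximation for real interaction networks.

```python
import random
from itertools import combinations


def sample_negative_pairs(proteins, positive_pairs, num_negatives, seed=0):
    """Sample unordered protein pairs that are not listed as positives.

    Pairs are treated as symmetric, so (A, B) and (B, A) are the same pair.
    All candidate pairs are enumerated, which is only practical for small
    protein sets (the number of pairs grows quadratically).
    """
    positives = {frozenset(pair) for pair in positive_pairs}
    candidates = [
        pair
        for pair in combinations(sorted(set(proteins)), 2)
        if frozenset(pair) not in positives
    ]
    if num_negatives > len(candidates):
        raise ValueError("not enough non-interacting pairs to sample from")
    return random.Random(seed).sample(candidates, num_negatives)
```

For asymmetric protein-ligand data the same idea applies, except that candidates would be drawn from the product of proteins and ligands rather than from unordered pairs.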

### ContactPrediction

Specialized task for predicting spatial proximity between residues in folded structures.

**Applications:**
- Structure prediction from sequence
- Protein folding pathway analysis
- Validation of predicted structures
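
To make the prediction target concrete, here is a small sketch that derives a binary contact map from one 3D coordinate per residue. The 8 Å threshold and the minimum sequence separation of 6 residues are assumed defaults chosen for illustration, and `contact_map` is a hypothetical helper, not part of TorchDrug.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=6):
    """Return an N x N boolean matrix of residue-residue contacts.

    coords: one (x, y, z) tuple per residue, in angstroms.
    Two residues are in contact if they are at least `min_separation`
    positions apart in sequence and closer than `threshold` in space.
    """
    n = len(coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts[i][j] = contacts[j][i] = True
    return contacts
```

Excluding near-diagonal pairs keeps trivially close sequence neighbors from dominating the positive class.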

## Protein Representation Models

### Sequence-Based Models

**ESM (Evolutionary Scale Modeling):**
- Transformer language model pre-trained on 250M sequences
- Strong performance on sequence-only tasks
- Available in multiple variants (e.g., ESM-1b, ESM-2)
- Captures evolutionary and structural information

**ProteinBERT:**
- BERT-style masked language model
- Pre-trained on UniProt sequences
- Good for transfer learning

**ProteinLSTM:**
- Bidirectional LSTM for sequence encoding
- Lightweight and fast
- Good baseline for sequence tasks

**ProteinCNN / ProteinResNet:**
- Convolutional architectures
- Capture local sequence patterns
- Faster than transformer models

### Structure-Based Models

**GearNet (Geometry-Aware Relational Graph Neural Network):**
- Incorporates 3D geometric information
- Edge types based on sequential, radius, and K-nearest neighbors
- Strong performance on structure-based tasks
- Supports both backbone and full-atom representations

**GCN/GAT/GIN on Protein Graphs:**
- Standard GNN architectures adapted for proteins
- Flexible edge definitions (sequence, spatial, contact)

**SchNet:**
- Continuous-filter convolutions
- Handles 3D coordinates directly
- Good for structure prediction and protein-ligand binding

### Feature-Based Models

**Statistic Features:**
- Amino acid composition
- Sequence length statistics
- Motif counts

**Physicochemical Features:**
- Hydrophobicity scales
- Charge properties
- Secondary structure propensity
- Molecular weight, pI
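
A few of the features listed above can be computed directly from a sequence string. The sketch below is plain Python with hypothetical helper names. It uses the Kyte-Doolittle hydropathy scale as one example of a hydrophobicity scale, and the motif pattern is supplied by the caller as a regular expression.

```python
import re
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (one of several hydrophobicity scales).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(sequence):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return {aa: (counts[aa] / total if total else 0.0) for aa in AMINO_ACIDS}


def mean_hydropathy(sequence):
    """Average Kyte-Doolittle value over standard residues (others skipped)."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence if aa in KYTE_DOOLITTLE]
    return sum(values) / len(values) if values else 0.0


def count_motif(sequence, pattern):
    """Count possibly overlapping matches of a regex motif."""
    return len(re.findall(f"(?={pattern})", sequence))
```

For example, `count_motif(seq, "N[^P][ST]")` counts occurrences of the N-glycosylation sequon. Fixed-length vectors like these can be fed to any standard classifier or regressor.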

## Protein Graph Construction

### Edge Types

**Sequential Edges:**
- Connect adjacent residues in sequence
- Captures primary structure

**Spatial Edges:**
- K-nearest neighbors in 3D space
- Radius cutoff (e.g., Cα atoms within 10Å)
- Captures tertiary structure

**Contact Edges:**
- Based on heavy atom distances
- Typically < 8Å threshold
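
The edge types above can be sketched from a list of per-residue Cα coordinates. This is an illustrative plain-Python version with hypothetical function names; it is not how TorchDrug builds graphs internally, and the default cutoff and `k` values are assumptions.

```python
import math


def sequential_edges(num_residues, max_offset=1):
    """Edges (i, i + d) between residues up to `max_offset` apart in sequence."""
    return [
        (i, i + d)
        for d in range(1, max_offset + 1)
        for i in range(num_residues - d)
    ]


def radius_edges(coords, cutoff=10.0):
    """Undirected edges (i < j) between residues closer than `cutoff` angstroms."""
    n = len(coords)
    return [
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if math.dist(coords[i], coords[j]) < cutoff
    ]


def knn_edges(coords, k=10):
    """Directed edges from each residue to its k nearest neighbors in space."""
    edges = []
    for i, center in enumerate(coords):
        others = [j for j in range(len(coords)) if j != i]
        others.sort(key=lambda j: math.dist(center, coords[j]))
        edges.extend((i, j) for j in others[:k])
    return edges
```

Note that k-nearest-neighbor edges are directed (being among the nearest neighbors of a residue is not a symmetric relation), while the sequential and radius edges here are listed once per unordered pair. Keeping each edge type as a separate relation is what lets a relational model treat them differently.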

### Node Features

**Residue Identity:**
- One-hot encoding of 20 amino acids
- Learned embeddings

**Position Information:**
- 3D coordinates (Cα, N, C, O)
- Backbone angles (phi, psi, omega)
- Relative spatial position encodings

**Physicochemical Properties:**
- Hydrophobicity
- Charge
- Size
- Secondary structure
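
As a minimal sketch of how residue identity and one physicochemical property might be combined into a node feature vector, the code below concatenates a 20-dimensional one-hot encoding with a coarse charge value. The helper name and the charge assignment (Lys/Arg positive, Asp/Glu negative, everything else neutral) are simplifying assumptions for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Coarse side-chain charge near neutral pH (assumption: His treated as neutral).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}


def residue_features(sequence):
    """Return one 21-dimensional feature vector per residue.

    The first 20 entries are a one-hot encoding of the amino acid type
    (all zeros for non-standard residues); the last entry is the charge.
    """
    features = []
    for aa in sequence:
        one_hot = [0.0] * len(AMINO_ACIDS)
        if aa in AA_INDEX:
            one_hot[AA_INDEX[aa]] = 1.0
        features.append(one_hot + [CHARGE.get(aa, 0.0)])
    return features
```

A learned embedding would replace the one-hot part with a trainable lookup table indexed by the same residue index.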

## Training Workflows

### Pre-training Strategies

**Self-Supervised Pre-training:**
- Masked residue prediction (like BERT)
- Distance prediction between residues
- Angle prediction (phi, psi, omega)
- Dihedral angle prediction
- Contact map prediction
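
The first objective in the list, masked residue prediction, can be sketched as follows: hide a random subset of residues and keep the originals as prediction targets. This is a simplified plain-Python illustration. The 15% mask rate and the `<mask>` token are assumptions borrowed from common BERT-style setups, the function name is hypothetical, and BERT's additional random-replacement and keep-unchanged cases are omitted.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder token; real vocabularies define their own


def mask_residues(sequence, mask_rate=0.15, seed=None):
    """Replace a random subset of residues with MASK_TOKEN.

    Returns (tokens, targets) where `tokens` is the masked sequence as a
    list and `targets` maps each masked position to its original residue.
    """
    if not sequence:
        return [], {}
    rng = random.Random(seed)
    num_masked = max(1, round(len(sequence) * mask_rate))
    positions = rng.sample(range(len(sequence)), num_masked)
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK_TOKEN
    return tokens, targets
```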

**Pre-trained Model Usage:**
```python
from torchdrug import models, tasks

# Load pre-trained ESM-1b. `path` is the directory where the weights are
# stored (the directory name here is only an example).
model = models.ESM(path="~/protein-model-weights/", model="ESM-1b")

# Fine-tune on a downstream regression task
task = tasks.PropertyPrediction(
    model, task=["stability"],
    criterion="mse", metric=["mae", "rmse"]
)
```
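
The `metric=["mae", "rmse"]` argument above names two standard regression metrics. As a plain-Python illustration of what they measure (this is not TorchDrug's implementation, and the function names are hypothetical):

```python
import math


def mean_absolute_error(predictions, targets):
    """Average absolute difference between predictions and targets."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)


def root_mean_squared_error(predictions, targets):
    """Square root of the average squared difference."""
    if len(predictions) != len(targets) or not targets:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    return math.sqrt(squared / len(targets))
```

RMSE penalizes large errors more heavily than MAE, so reporting both gives a rough sense of whether errors are uniform or driven by a few outliers.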

### Multi-Task Learning

Train on multiple related tasks simultaneously:
- Joint prediction of function, localization, and stability
- Improves generalization and data efficiency
- Shares representations across tasks

### Best Practices

**For Sequence-Only Tasks:**
1. Start with pre-trained ESM or ProteinBERT
2. Fine-tune with small learning rate (1e-5 to 1e-4)
3. Use frozen embeddings for small datasets
4. Apply dropout for regularization

**For Structure-Based Tasks:**
1. Use GearNet with multiple edge types
2. Include geometric features (angles, dihedrals)
3. Pre-train on large structure databases
4. Use data augmentation (rotations, crops)

**For Small Datasets:**
1. Transfer learning from pre-trained models
2. Multi-task learning with related tasks
3. Data augmentation (sequence mutations, structure perturbations)
4. Strong regularization (dropout, weight decay)
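
Rotation-based augmentation, mentioned above for structure-based tasks, relies on the fact that rigid rotations leave all inter-residue distances unchanged. The sketch below rotates coordinates about the z-axis only, to keep the example short; a full augmentation would sample a general 3D rotation. Function names are hypothetical.

```python
import math
import random


def rotate_about_z(coords, angle):
    """Rotate (x, y, z) points by `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


def random_rotation_z(coords, seed=None):
    """Apply a rotation about z by an angle drawn uniformly from [0, 2*pi)."""
    angle = random.Random(seed).uniform(0.0, 2.0 * math.pi)
    return rotate_about_z(coords, angle)
```

Models whose inputs are built only from distances and angles are already invariant to such rotations, so this kind of augmentation matters mainly for models that consume raw coordinates.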

## Common Use Cases

### Enzyme Engineering
- Predict enzyme activity from sequence
- Design mutations to improve stability
- Screen for desired catalytic properties

### Antibody Design
- Predict binding affinity
- Optimize antibody sequences
- Predict immunogenicity

### Drug Target Identification
- Predict protein function
- Identify druggable sites
- Analyze protein-ligand interactions

### Protein Structure Prediction
- Predict secondary structure from sequence
- Generate contact maps for tertiary structure
- Refine AlphaFold predictions

## Integration with Other Tools

### AlphaFold Integration

Load AlphaFold-predicted structures:
```python
from torchdrug import data

# Load AlphaFold structure
protein = data.Protein.from_pdb("alphafold_structure.pdb")

# Use in TorchDrug workflows
```
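
AlphaFold PDB files store the per-residue confidence score (pLDDT) in the B-factor column, so low-confidence regions can be identified before a structure is used downstream. The sketch below parses that column from fixed-width `ATOM` records in plain Python. The helper names are hypothetical, and the threshold of 70 is an assumed cutoff for "confident" residues, not a TorchDrug setting.

```python
def ca_plddt(pdb_lines):
    """Return (residue_number, pLDDT) for each C-alpha ATOM record.

    Relies on the fixed-width PDB layout: atom name in columns 13-16,
    residue number in columns 23-26, B-factor (pLDDT) in columns 61-66.
    """
    scores = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append((int(line[22:26]), float(line[60:66])))
    return scores


def confident_residues(pdb_lines, min_plddt=70.0):
    """Residue numbers whose C-alpha pLDDT is at least `min_plddt`."""
    return [resid for resid, score in ca_plddt(pdb_lines) if score >= min_plddt]
```

A typical use would be `confident_residues(open("alphafold_structure.pdb"))`, followed by restricting analysis to the returned residues.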

### ESMFold Integration

Use ESMFold to predict a structure from sequence, then analyze the result with TorchDrug models.

### Rosetta/PyRosetta

Generate structures with Rosetta, then import them into TorchDrug for analysis.