# Protein Modeling

## Overview

TorchDrug provides extensive support for protein-related tasks including sequence analysis, structure prediction, property prediction, and protein-protein interactions. Proteins are represented as graphs where nodes are amino acid residues and edges represent spatial or sequential relationships.

## Available Datasets

### Protein Function Prediction

**Enzyme Function:**
- **EnzymeCommission** (17,562 proteins): EC number classification (EC numbers are four-level codes grouped under 7 top-level enzyme classes)
- **BetaLactamase** (5,864 sequences): Enzyme activity prediction

**Protein Characteristics:**
- **Fluorescence** (54,025 sequences): GFP fluorescence intensity
- **Stability** (53,614 sequences): Thermostability prediction
- **Solubility** (62,478 sequences): Protein solubility classification
- **BinaryLocalization** (22,168 proteins): Subcellular localization (membrane vs. soluble)
- **SubcellularLocalization** (8,943 proteins): 10-class localization prediction

**Gene Ontology:**
- **GeneOntology** (46,796 proteins): GO term prediction across biological process, molecular function, and cellular component

### Protein Structure Prediction

- **Fold** (16,712 proteins): Structural fold classification (1,195 classes)
- **SecondaryStructure** (8,678 proteins): 3-state or 8-state secondary structure prediction
- **ContactPrediction** via ProteinNet: Residue-residue contact maps

### Protein Interaction

**Protein-Protein Interactions:**
- **HumanPPI** (1,412 proteins, 6,584 interactions): Human protein interaction network
- **YeastPPI** (2,018 proteins, 6,451 interactions): Yeast protein interaction network
- **PPIAffinity** (2,156 protein pairs): Binding affinity measurements

**Protein-Ligand Binding:**
- **BindingDB** (~1.5M entries): Comprehensive binding affinity database
- **PDBBind** (20,000+ complexes): 3D structure-based binding data
  - Refined set: High-quality crystal structures
  - Core set: Diverse benchmark set

### Large-Scale Protein Databases

- **AlphaFoldDB**: Access to 200M+ predicted protein structures
- **ProteinNet**: Standardized dataset for structure prediction

## Task Types

### NodePropertyPrediction

Predict properties at the residue (node) level, such as secondary structure or contact maps.

**Use Cases:**
- Secondary structure prediction (helix, sheet, coil)
- Residue-level disorder prediction
- Post-translational modification sites
- Binding site prediction

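Residue-level labels for secondary structure are often stored in the 8-state DSSP alphabet and reduced to the 3 states (helix, sheet, coil) listed above. The sketch below shows one common reduction; the mapping and the helper name `reduce_to_3_state` are assumptions for illustration and are not taken from TorchDrug.

```python
# One common reduction of the 8 DSSP states to 3 states.
# Other groupings exist; this mapping is an assumption, not TorchDrug's own.
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # alpha, 3-10, and pi helices -> helix
    "E": "E", "B": "E",             # extended strand, isolated bridge -> sheet
    "T": "C", "S": "C", "-": "C",   # turn, bend, unassigned -> coil
}


def reduce_to_3_state(dssp8: str) -> str:
    """Map a per-residue 8-state string to a 3-state string of equal length."""
    # Unknown symbols fall back to coil.
    return "".join(DSSP8_TO_3.get(state, "C") for state in dssp8)


labels8 = "--HHHHGGTTEEEEB-S"
labels3 = reduce_to_3_state(labels8)
print(labels3)
```

Each residue keeps exactly one label, so the output can be used directly as a node-level classification target.
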
### PropertyPrediction

Predict protein-level properties like function, stability, or localization.

**Use Cases:**
- Enzyme function classification
- Subcellular localization
- Protein stability prediction
- Gene ontology term prediction

### InteractionPrediction

Predict interactions between protein pairs or protein-ligand pairs.

**Key Features:**
- Handles both sequence and structure inputs
- Supports symmetric (PPI) and asymmetric (protein-ligand) interactions
- Multiple negative sampling strategies

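As an illustration of negative sampling for the symmetric (PPI) case, the sketch below draws protein pairs uniformly at random and keeps those not listed as known interactions. It assumes that unobserved pairs can be treated as negatives, which is a simplification; the function `sample_negative_pairs` is hypothetical and not part of TorchDrug.

```python
import random


def sample_negative_pairs(proteins, positive_pairs, num_negatives, seed=0):
    """Draw unordered protein pairs that are not among the known interactions."""
    rng = random.Random(seed)
    # Pairs are symmetric, so store them as unordered sets.
    known = {frozenset(pair) for pair in positive_pairs}
    max_pairs = len(proteins) * (len(proteins) - 1) // 2
    if num_negatives > max_pairs - len(known):
        raise ValueError("not enough non-interacting pairs to sample")
    negatives = set()
    while len(negatives) < num_negatives:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in known:
            negatives.add(pair)
    # Sort for a reproducible output order.
    return sorted(tuple(sorted(pair)) for pair in negatives)


proteins = ["P1", "P2", "P3", "P4", "P5"]
positives = [("P1", "P2"), ("P2", "P3"), ("P4", "P5")]
negatives = sample_negative_pairs(proteins, positives, num_negatives=3)
print(negatives)
```

For asymmetric protein-ligand pairs, the same idea applies with ordered `(protein, ligand)` tuples instead of unordered sets.
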
### ContactPrediction

Specialized task for predicting spatial proximity between residues in folded structures.

**Applications:**
- Structure prediction from sequence
- Protein folding pathway analysis
- Validation of predicted structures

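To make the prediction target concrete, the sketch below derives a binary contact map from one 3D coordinate per residue. The 8 Å threshold and the minimum sequence separation of 6 are assumed defaults chosen for illustration, and `contact_map` is a hypothetical helper rather than a TorchDrug function.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=6):
    """Binary residue-residue contacts from one 3D coordinate per residue."""
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        # Skip pairs that are close in sequence; they are trivially close in space.
        for j in range(i + min_separation, n):
            if math.dist(coords[i], coords[j]) < threshold:
                contacts[i][j] = contacts[j][i] = 1
    return contacts


# Toy hairpin: residues 0-3 run along one strand, residues 4-7 run back 5 Å away.
coords = [(4.0 * i, 0.0, 0.0) for i in range(4)]
coords += [(4.0 * (7 - i), 5.0, 0.0) for i in range(4, 8)]
contacts = contact_map(coords)
num_contacts = sum(map(sum, contacts)) // 2
print(num_contacts)
```

A contact prediction model is trained to recover this matrix from sequence alone.
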
## Protein Representation Models

### Sequence-Based Models

**ESM (Evolutionary Scale Modeling):**
- Pre-trained transformer model on 250M sequences
- State-of-the-art for sequence-only tasks
- Available in multiple sizes (ESM-1b, ESM-2)
- Captures evolutionary and structural information

**ProteinBERT:**
- BERT-style masked language model
- Pre-trained on UniProt sequences
- Good for transfer learning

**ProteinLSTM:**
- Bidirectional LSTM for sequence encoding
- Lightweight and fast
- Good baseline for sequence tasks

**ProteinCNN / ProteinResNet:**
- Convolutional architectures
- Capture local sequence patterns
- Faster than transformer models

### Structure-Based Models

**GearNet (Geometry-Aware Relational Graph Network):**
- Incorporates 3D geometric information
- Edge types based on sequential, radius, and K-nearest neighbors
- State-of-the-art for structure-based tasks
- Supports both backbone and full-atom representations

**GCN/GAT/GIN on Protein Graphs:**
- Standard GNN architectures adapted for proteins
- Flexible edge definitions (sequence, spatial, contact)

**SchNet:**
- Continuous-filter convolutions
- Handles 3D coordinates directly
- Good for structure prediction and protein-ligand binding

### Feature-Based Models

**Statistic Features:**
- Amino acid composition
- Sequence length statistics
- Motif counts

**Physicochemical Features:**
- Hydrophobicity scales
- Charge properties
- Secondary structure propensity
- Molecular weight, pI

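A few of the features listed above can be computed with the standard library alone. The sketch below is a minimal example, assuming a sequence of the 20 standard amino acids; the net charge is a rough count of K/R minus D/E that ignores histidine and pH, the motif is the N-glycosylation sequon N-X-[S/T] with X not proline, and `sequence_features` is a hypothetical helper.

```python
import re
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_features(seq):
    """Composition, length, a motif count, and a crude net charge for one sequence."""
    counts = Counter(seq)
    length = len(seq)
    # Fraction of each standard amino acid, in a fixed order.
    composition = [counts[aa] / length for aa in AMINO_ACIDS]
    # The lookahead lets overlapping sequons be counted.
    n_glyc_motifs = len(re.findall(r"N(?=[^P][ST])", seq))
    # Approximation: +1 per K/R, -1 per D/E.
    net_charge = counts["K"] + counts["R"] - counts["D"] - counts["E"]
    return {
        "composition": composition,
        "length": length,
        "n_glyc_motifs": n_glyc_motifs,
        "net_charge": net_charge,
    }


features = sequence_features("MKNVSDERLNGTK")
print(features["length"], features["n_glyc_motifs"], features["net_charge"])
```

The resulting fixed-length vector can be fed to any conventional classifier or regressor as a baseline.
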
## Protein Graph Construction

### Edge Types

**Sequential Edges:**
- Connect adjacent residues in sequence
- Captures primary structure

**Spatial Edges:**
- K-nearest neighbors in 3D space
- Radius cutoff (e.g., Cα atoms within 10Å)
- Captures tertiary structure

**Contact Edges:**
- Based on heavy atom distances
- Typically < 8Å threshold

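The sequential and spatial edge definitions above can be sketched without any library. The example below builds typed, directed edges from one coordinate per residue (for example Cα); it is a library-independent illustration with a hypothetical `build_edges` helper and assumed defaults, not TorchDrug's own graph construction code.

```python
import math
from collections import Counter


def build_edges(coords, k=2, radius=10.0):
    """Return a set of (source, target, edge_type) tuples for one protein."""
    n = len(coords)
    edges = set()
    # Sequential edges: adjacent residues in the chain, in both directions.
    for i in range(n - 1):
        edges.add((i, i + 1, "sequential"))
        edges.add((i + 1, i, "sequential"))
    for i in range(n):
        dists = sorted(
            (math.dist(coords[i], coords[j]), j) for j in range(n) if j != i
        )
        # K-nearest-neighbor edges: the k closest residues to residue i.
        for _, j in dists[:k]:
            edges.add((i, j, "knn"))
        # Radius edges: every residue closer than the cutoff.
        for d, j in dists:
            if d < radius:
                edges.add((i, j, "radius"))
    return edges


# Toy chain: five residues on a straight line, 4 Å apart.
coords = [(4.0 * i, 0.0, 0.0) for i in range(5)]
edges = build_edges(coords, k=2, radius=10.0)
counts = Counter(edge_type for _, _, edge_type in edges)
print(dict(counts))
```

Keeping the edge type in each tuple is what allows a relational model to use a separate set of weights per edge type.
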
### Node Features

**Residue Identity:**
- One-hot encoding of 20 amino acids
- Learned embeddings

**Position Information:**
- 3D coordinates (Cα, N, C, O)
- Backbone angles (phi, psi, omega)
- Relative spatial position encodings

**Physicochemical Properties:**
- Hydrophobicity
- Charge
- Size
- Secondary structure

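As a small illustration of combining residue identity with a physicochemical property, the sketch below concatenates a 20-dimensional one-hot vector with a Kyte-Doolittle hydropathy value for each residue. The choice of scale and the helper `residue_features` are assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

# Kyte-Doolittle hydropathy scale (higher means more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def residue_features(seq):
    """One feature vector per residue: 20-dim one-hot plus one hydropathy value."""
    features = []
    for aa in seq:
        one_hot = [0.0] * len(AMINO_ACIDS)
        one_hot[AA_INDEX[aa]] = 1.0
        features.append(one_hot + [KYTE_DOOLITTLE[aa]])
    return features


feats = residue_features("MKV")
print(len(feats), len(feats[0]))
```

Further per-residue properties such as charge or size can be appended to each vector in the same way.
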
## Training Workflows

### Pre-training Strategies

**Self-Supervised Pre-training:**
- Masked residue prediction (like BERT)
- Distance prediction between residues
- Angle prediction (phi, psi, omega)
- Dihedral angle prediction
- Contact map prediction

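The data side of masked residue prediction can be sketched as follows: hide a fraction of residues and keep the originals as targets. The 15% mask rate follows the BERT convention and is an assumption here, as are the `<mask>` token and the helper `mask_residues`; BERT's additional random-replacement scheme is omitted for brevity.

```python
import random

MASK = "<mask>"


def mask_residues(seq, mask_rate=0.15, seed=0):
    """Return (masked_tokens, targets) for masked residue prediction.

    targets maps each masked position to the residue the model must recover.
    """
    rng = random.Random(seed)
    num_masked = max(1, round(len(seq) * mask_rate))
    positions = sorted(rng.sample(range(len(seq)), num_masked))
    tokens = list(seq)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK
    return tokens, targets


seq = "MKTAYIAKQRQISFVKSHFS"
tokens, targets = mask_residues(seq)
# Putting the targets back reproduces the original sequence.
restored = "".join(targets.get(i, tok) for i, tok in enumerate(tokens))
print(len(targets), restored == seq)
```

During pre-training, the loss is computed only at the masked positions.
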
**Pre-trained Model Usage:**
```python
from torchdrug import models, tasks

# Load pre-trained ESM (check the TorchDrug docs for whether `path`
# should be a weights file or a directory for downloaded weights)
model = models.ESM(path="esm1b_t33_650M_UR50S.pt")

# Fine-tune on a downstream regression task
task = tasks.PropertyPrediction(
    model, task=["stability"],
    criterion="mse", metric=["mae", "rmse"]
)
```

### Multi-Task Learning

Train on multiple related tasks simultaneously:
- Joint prediction of function, localization, and stability
- Improves generalization and data efficiency
- Shares representations across tasks

### Best Practices

**For Sequence-Only Tasks:**
1. Start with pre-trained ESM or ProteinBERT
2. Fine-tune with small learning rate (1e-5 to 1e-4)
3. Use frozen embeddings for small datasets
4. Apply dropout for regularization

**For Structure-Based Tasks:**
1. Use GearNet with multiple edge types
2. Include geometric features (angles, dihedrals)
3. Pre-train on large structure databases
4. Use data augmentation (rotations, crops)

**For Small Datasets:**
1. Transfer learning from pre-trained models
2. Multi-task learning with related tasks
3. Data augmentation (sequence mutations, structure perturbations)
4. Strong regularization (dropout, weight decay)

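The rotation and crop augmentations mentioned for structure-based tasks can be sketched in a few lines. This is a simplified illustration: it rotates only about the z-axis, whereas a full augmentation would sample arbitrary 3D rotations, and both helper functions are hypothetical.

```python
import math
import random


def random_rotation_z(coords, rng):
    """Rotate all coordinates by one random angle about the z-axis."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    return [(c * x - s * y, s * x + c * y, z) for x, y, z in coords]


def random_crop(seq, coords, max_len, rng):
    """Keep a random contiguous window of at most max_len residues."""
    if len(seq) <= max_len:
        return seq, coords
    start = rng.randrange(len(seq) - max_len + 1)
    return seq[start:start + max_len], coords[start:start + max_len]


rng = random.Random(0)
seq = "MKTAYIAKQR"
coords = [(1.5 * i, 0.5 * i * i, 2.0 * i) for i in range(len(seq))]
rotated = random_rotation_z(coords, rng)
crop_seq, crop_coords = random_crop(seq, rotated, max_len=6, rng=rng)
print(crop_seq, len(crop_coords))
```

A rigid rotation leaves all pairwise distances unchanged, so distance-based edges and contact labels stay valid after augmentation.
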
## Common Use Cases

### Enzyme Engineering
- Predict enzyme activity from sequence
- Design mutations to improve stability
- Screen for desired catalytic properties

### Antibody Design
- Predict binding affinity
- Optimize antibody sequences
- Predict immunogenicity

### Drug Target Identification
- Predict protein function
- Identify druggable sites
- Analyze protein-ligand interactions

### Protein Structure Prediction
- Predict secondary structure from sequence
- Generate contact maps for tertiary structure
- Refine AlphaFold predictions

## Integration with Other Tools

### AlphaFold Integration

Load AlphaFold-predicted structures:
```python
from torchdrug import data

# Load AlphaFold structure
protein = data.Protein.from_pdb("alphafold_structure.pdb")

# Use in TorchDrug workflows
```

### ESMFold Integration

Use ESMFold for structure prediction, then analyze the predicted structures with TorchDrug models.

### Rosetta/PyRosetta

Generate structures with Rosetta, then import them into TorchDrug for analysis.