Protein Modeling
Overview
TorchDrug provides extensive support for protein-related tasks including sequence analysis, structure prediction, property prediction, and protein-protein interactions. Proteins are represented as graphs where nodes are amino acid residues and edges represent spatial or sequential relationships.
Available Datasets
Protein Function Prediction
Enzyme Function:
- EnzymeCommission (17,562 proteins): EC number prediction (EC numbers are four-level codes under seven top-level enzyme classes)
- BetaLactamase (5,864 sequences): Enzyme activity prediction
Protein Characteristics:
- Fluorescence (54,025 sequences): GFP fluorescence intensity
- Stability (53,614 sequences): Thermostability prediction
- Solubility (62,478 sequences): Protein solubility classification
- BinaryLocalization (22,168 proteins): Subcellular localization (membrane vs. soluble)
- SubcellularLocalization (8,943 proteins): 10-class localization prediction
Gene Ontology:
- GeneOntology (46,796 proteins): GO term prediction across biological process, molecular function, and cellular component
Protein Structure Prediction
- Fold (16,712 proteins): Structural fold classification (1,195 classes)
- SecondaryStructure (8,678 proteins): 3-state or 8-state secondary structure prediction
- ContactPrediction via ProteinNet: Residue-residue contact maps
Protein Interaction
Protein-Protein Interactions:
- HumanPPI (1,412 proteins, 6,584 interactions): Human protein interaction network
- YeastPPI (2,018 proteins, 6,451 interactions): Yeast protein interaction network
- PPIAffinity (2,156 protein pairs): Binding affinity measurements
Protein-Ligand Binding:
- BindingDB (~1.5M entries): Comprehensive binding affinity database
- PDBBind (20,000+ complexes): 3D structure-based binding data
  - Refined set: High-quality crystal structures
  - Core set: Diverse benchmark set
Large-Scale Protein Databases
- AlphaFoldDB: Access to 200M+ predicted protein structures
- ProteinNet: Standardized dataset for structure prediction
Task Types
NodePropertyPrediction
Predict properties at the residue (node) level, such as secondary structure or contact maps.
Use Cases:
- Secondary structure prediction (helix, sheet, coil)
- Residue-level disorder prediction
- Post-translational modification sites
- Binding site prediction
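As an illustration of what a residue-level target looks like, the sketch below reduces 8-state secondary structure labels to the 3-state scheme (helix, sheet, coil) and scores a prediction per residue. It is plain Python, not TorchDrug API; the function names are hypothetical, and the 8-to-3 grouping is one commonly used convention that individual benchmarks may define differently.

```python
# One common reduction of DSSP 8-state labels to 3 states.
# Assumption: benchmarks may group some states (e.g. G, B) differently.
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # alpha, 3-10 and pi helices -> helix
    "E": "E", "B": "E",             # extended strand and bridge -> sheet
    "T": "C", "S": "C", "-": "C",   # turn, bend and other -> coil
}


def reduce_to_three_state(labels8):
    """Map a string of 8-state labels to 3-state labels, one per residue."""
    return "".join(DSSP8_TO_3[label] for label in labels8)


def per_residue_accuracy(predicted, target):
    """Fraction of residues whose predicted label matches the target."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target need one label per residue")
    if not target:
        raise ValueError("empty label sequence")
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)
```

The key point is that the target has one label per residue (node), so the loss and the metric are averaged over residues rather than over proteins.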
PropertyPrediction
Predict protein-level properties like function, stability, or localization.
Use Cases:
- Enzyme function classification
- Subcellular localization
- Protein stability prediction
- Gene ontology term prediction
InteractionPrediction
Predict interactions between protein pairs or protein-ligand pairs.
Key Features:
- Handles both sequence and structure inputs
- Supports symmetric (PPI) and asymmetric (protein-ligand) interactions
- Multiple negative sampling strategies
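One simple negative sampling strategy for symmetric (PPI) data is to draw random protein pairs that are not among the known interactions. The sketch below shows that idea in plain Python. It is not TorchDrug's implementation; the function name and arguments are hypothetical, and it assumes `proteins` is a list of unique identifiers.

```python
import random


def sample_negative_pairs(proteins, positive_pairs, num_negatives, seed=0):
    """Draw random protein pairs that are not in the known positive set.

    Pairs are treated as unordered (symmetric interaction), so (a, b) and
    (b, a) count as the same pair.
    """
    rng = random.Random(seed)
    positives = {frozenset(pair) for pair in positive_pairs}
    max_pairs = len(proteins) * (len(proteins) - 1) // 2
    # Conservative check so the sampling loop below always terminates.
    if num_negatives > max_pairs - len(positives):
        raise ValueError("not enough non-interacting pairs to sample from")
    negatives = set()
    while len(negatives) < num_negatives:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in positives:
            negatives.add(pair)
    # Sort for a deterministic return order.
    return sorted(tuple(sorted(pair)) for pair in negatives)
```

Unobserved pairs are not guaranteed to be true non-interactions, so negatives sampled this way are best treated as noisy labels.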
ContactPrediction
Specialized task for predicting spatial proximity between residues in folded structures.
Applications:
- Structure prediction from sequence
- Protein folding pathway analysis
- Validation of predicted structures
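The target of contact prediction can be sketched as a binary matrix derived from a known structure. The example below is a minimal illustration in plain Python, not TorchDrug code. The use of Cα coordinates, the 8 Å cutoff, and the minimum sequence separation of 6 are assumptions chosen for illustration; benchmarks differ in which atoms and thresholds they use.

```python
import math


def contact_map(ca_coords, threshold=8.0, min_separation=6):
    """Return an L x L boolean matrix of residue-residue contacts.

    ca_coords: list of (x, y, z) tuples, one per residue, in angstroms.
    Pairs closer than `min_separation` in sequence are skipped because
    they are trivially near each other in space.
    """
    n = len(ca_coords)
    contacts = [[False] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + min_separation, n):
            if math.dist(ca_coords[i], ca_coords[j]) < threshold:
                contacts[i][j] = contacts[j][i] = True  # keep the map symmetric
    return contacts
```

A model for this task outputs a score for every residue pair, and training compares those scores against a map like this one.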
Protein Representation Models
Sequence-Based Models
ESM (Evolutionary Scale Modeling):
- Transformer language model pre-trained on roughly 250M protein sequences
- Strong performance on sequence-only tasks
- Available in several variants and sizes (e.g., ESM-1b, ESM-2)
- Captures evolutionary and structural information
ProteinBERT:
- BERT-style masked language model
- Pre-trained on UniProt sequences
- Good for transfer learning
ProteinLSTM:
- Bidirectional LSTM for sequence encoding
- Lightweight and fast
- Good baseline for sequence tasks
ProteinCNN / ProteinResNet:
- Convolutional architectures
- Capture local sequence patterns
- Faster than transformer models
Structure-Based Models
GearNet (Geometry-Aware Relational Graph Network):
- Incorporates 3D geometric information
- Defines edge types from sequential neighbors, radius cutoffs, and K-nearest neighbors
- Strong performance on structure-based tasks
- Supports both backbone and full-atom representations
GCN/GAT/GIN on Protein Graphs:
- Standard GNN architectures adapted for proteins
- Flexible edge definitions (sequence, spatial, contact)
SchNet:
- Continuous-filter convolutions
- Handles 3D coordinates directly
- Good for structure prediction and protein-ligand binding
Feature-Based Models
Statistic Features:
- Amino acid composition
- Sequence length statistics
- Motif counts
Physicochemical Features:
- Hydrophobicity scales
- Charge properties
- Secondary structure propensity
- Molecular weight, pI
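The statistic and physicochemical features listed above can be computed directly from a sequence. The sketch below builds a small feature vector from amino acid composition, mean Kyte-Doolittle hydropathy, and sequence length. It is plain Python rather than a TorchDrug feature function; the function names are hypothetical, and it assumes sequences contain only the 20 standard one-letter codes.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy index for the 20 standard residues.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(sequence):
    """Fraction of each standard amino acid, in AMINO_ACIDS order."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


def mean_hydropathy(sequence):
    """Average Kyte-Doolittle hydropathy (raises KeyError on non-standard codes)."""
    return sum(KYTE_DOOLITTLE[aa] for aa in sequence) / len(sequence)


def featurize(sequence):
    """Concatenate composition (20), mean hydropathy (1) and length (1)."""
    if not sequence:
        raise ValueError("empty sequence")
    return (
        amino_acid_composition(sequence)
        + [mean_hydropathy(sequence), float(len(sequence))]
    )
```

A fixed-length vector like this can be fed to any standard classifier or regressor, which makes it a cheap baseline next to learned representations.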
Protein Graph Construction
Edge Types
Sequential Edges:
- Connect adjacent residues in sequence
- Captures primary structure
Spatial Edges:
- K-nearest neighbors in 3D space
- Radius cutoff (e.g., Cα atoms within 10Å)
- Captures tertiary structure
Contact Edges:
- Based on heavy atom distances
- Typically use a distance threshold of about 8 Å
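The edge types above can be sketched as a function that turns residue coordinates into a typed edge list. This is a plain Python illustration of the idea, not TorchDrug's graph construction layers. The function name, the default radius of 10 Å, and k=3 are assumptions, and Cα coordinates stand in for whatever atoms a real pipeline would use.

```python
import math


def build_edges(ca_coords, radius=10.0, k=3):
    """Return a sorted list of (source, target, edge_type) tuples.

    Edge types: "sequential" (adjacent in sequence), "radius" (closer
    than `radius` angstroms) and "knn" (k nearest residues in space).
    """
    n = len(ca_coords)
    edges = set()
    for i in range(n):
        if i + 1 < n:
            # Sequential edges in both directions.
            edges.add((i, i + 1, "sequential"))
            edges.add((i + 1, i, "sequential"))
        # Distances from residue i to every other residue, nearest first.
        dists = sorted(
            (math.dist(ca_coords[i], ca_coords[j]), j)
            for j in range(n) if j != i
        )
        for d, j in dists:
            if d < radius:
                edges.add((i, j, "radius"))
        for _, j in dists[:k]:
            edges.add((i, j, "knn"))  # directed: i points to its neighbors
    return sorted(edges)
```

Keeping the edge type in each tuple is what lets a relational model apply different weights to sequential, radius, and nearest-neighbor edges.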
Node Features
Residue Identity:
- One-hot encoding of 20 amino acids
- Learned embeddings
Position Information:
- 3D coordinates (Cα, N, C, O)
- Backbone angles (phi, psi, omega)
- Relative spatial position encodings
Physicochemical Properties:
- Hydrophobicity
- Charge
- Size
- Secondary structure
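Two of the node features above, one-hot residue identity and backbone dihedral angles, can be computed with a few lines of geometry. The sketch below is plain Python with hypothetical helper names and is not part of TorchDrug. It assumes standard one-letter residue codes and 3D points given as (x, y, z) tuples.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """One-hot encode a one-letter code over the 20 standard amino acids."""
    vector = [0.0] * len(AMINO_ACIDS)
    vector[AMINO_ACIDS.index(residue)] = 1.0  # ValueError for non-standard codes
    return vector


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle in degrees defined by four points.

    For phi, the points are C(i-1), N(i), CA(i), C(i);
    for psi, they are N(i), CA(i), C(i), N(i+1).
    """
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond b1.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))
```

Angles are often encoded as sine and cosine pairs before being used as features, so that -180 and 180 degrees map to the same value.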
Training Workflows
Pre-training Strategies
Self-Supervised Pre-training:
- Masked residue prediction (like BERT)
- Distance prediction between residues
- Angle prediction (angles between bonds or edges)
- Dihedral angle prediction (e.g., backbone phi, psi, omega)
- Contact map prediction
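The masked residue objective can be illustrated by the input corruption step alone: hide a fraction of residues and keep the originals as prediction targets. The sketch below is a simplified plain Python version with hypothetical names. The 15% mask rate and the `<mask>` token are assumptions borrowed from BERT-style training, and the random-replacement variants that BERT also uses are omitted.

```python
import random

MASK_TOKEN = "<mask>"  # placeholder token; real tokenizers define their own


def mask_residues(sequence, mask_rate=0.15, seed=0):
    """Replace a random subset of residues with MASK_TOKEN.

    Returns (tokens, targets), where targets maps each masked position
    to the original residue the model should recover.
    """
    if not sequence:
        raise ValueError("empty sequence")
    rng = random.Random(seed)
    num_masked = max(1, round(len(sequence) * mask_rate))
    positions = sorted(rng.sample(range(len(sequence)), num_masked))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = tokens[pos]
        tokens[pos] = MASK_TOKEN
    return tokens, targets
```

The loss is computed only at the masked positions, which is what forces the encoder to use the surrounding sequence context.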
Pre-trained Model Usage:
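from torchdrug import models, tasks
# Load pre-trained ESM-1b (path is the directory where the weights are stored or downloaded)
model = models.ESM(path="~/esm-model-weights/", model="ESM-1b")
# Fine-tune on a downstream regression task
# (task names should match the target fields of the dataset in use)
task = tasks.PropertyPrediction(
    model, task=["stability"],
    criterion="mse", metric=["mae", "rmse"]
)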
Multi-Task Learning
Train on multiple related tasks simultaneously:
- Joint prediction of function, localization, and stability
- Improves generalization and data efficiency
- Shares representations across tasks
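A common way to train several tasks on a shared encoder is to combine the per-task losses into one weighted objective. The sketch below shows only that combination step in plain Python. The function name and the task weights are hypothetical, and this is not a description of how TorchDrug combines losses internally.

```python
def multi_task_loss(task_losses, task_weights=None):
    """Weighted average of per-task losses.

    task_losses: dict mapping task name to its scalar loss.
    task_weights: optional dict of non-negative weights (default: all 1.0).
    """
    if not task_losses:
        raise ValueError("no task losses given")
    if task_weights is None:
        task_weights = {name: 1.0 for name in task_losses}
    total_weight = sum(task_weights[name] for name in task_losses)
    if total_weight <= 0:
        raise ValueError("task weights must sum to a positive value")
    weighted = sum(task_weights[name] * loss for name, loss in task_losses.items())
    return weighted / total_weight
```

For example, `multi_task_loss({"function": 0.6, "localization": 0.3, "stability": 0.9})` averages the three losses equally; raising one weight shifts the shared representation toward that task.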
Best Practices
For Sequence-Only Tasks:
- Start with pre-trained ESM or ProteinBERT
- Fine-tune with a small learning rate (1e-5 to 1e-4)
- Use frozen embeddings for small datasets
- Apply dropout for regularization
For Structure-Based Tasks:
- Use GearNet with multiple edge types
- Include geometric features (angles, dihedrals)
- Pre-train on large structure databases
- Use data augmentation (rotations, crops)
For Small Datasets:
- Transfer learning from pre-trained models
- Multi-task learning with related tasks
- Data augmentation (sequence mutations, structure perturbations)
- Strong regularization (dropout, weight decay)
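Sequence mutation augmentation can be sketched as random point substitutions. The example below is plain Python with a hypothetical function name, and it assumes sequences use the 20 standard one-letter codes. Random substitutions may change the true label (for example fluorescence or stability), so augmented samples are best treated as noisy.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate_sequence(sequence, num_mutations=1, seed=None):
    """Return a copy of `sequence` with `num_mutations` point substitutions.

    Each chosen position is replaced by a different standard amino acid,
    so the result differs from the input at exactly `num_mutations` sites.
    """
    if num_mutations > len(sequence):
        raise ValueError("more mutations requested than residues available")
    rng = random.Random(seed)
    residues = list(sequence)
    for pos in rng.sample(range(len(residues)), num_mutations):
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)
```

Keeping the number of mutations small relative to the sequence length limits how far an augmented sample drifts from the original.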
Common Use Cases
Enzyme Engineering
- Predict enzyme activity from sequence
- Design mutations to improve stability
- Screen for desired catalytic properties
Antibody Design
- Predict binding affinity
- Optimize antibody sequences
- Predict immunogenicity
Drug Target Identification
- Predict protein function
- Identify druggable sites
- Analyze protein-ligand interactions
Protein Structure Prediction
- Predict secondary structure from sequence
- Generate contact maps for tertiary structure
- Refine AlphaFold predictions
Integration with Other Tools
AlphaFold Integration
Load AlphaFold-predicted structures:
from torchdrug import data
# Load AlphaFold structure
protein = data.Protein.from_pdb("alphafold_structure.pdb")
# Use in TorchDrug workflows
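AlphaFold PDB files store the per-residue confidence score (pLDDT) in the B-factor column, which makes it possible to filter low-confidence regions before analysis. The sketch below parses that column with plain Python, independent of TorchDrug. The function names are hypothetical, and the default cutoff of 70 is an assumption that should be tuned per task.

```python
def ca_plddt_from_pdb_text(pdb_text):
    """Extract (residue_number, pLDDT) for each C-alpha atom.

    Relies on fixed PDB columns: atom name in 13-16, residue number in
    23-26, and B-factor (pLDDT in AlphaFold files) in 61-66.
    """
    scores = []
    for line in pdb_text.splitlines():
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            scores.append((int(line[22:26]), float(line[60:66])))
    return scores


def confident_residues(scores, min_plddt=70.0):
    """Residue numbers whose pLDDT is at least `min_plddt`."""
    return [resseq for resseq, plddt in scores if plddt >= min_plddt]
```

Given the contents of a file such as `alphafold_structure.pdb` read as text, `confident_residues(ca_plddt_from_pdb_text(text))` lists the residues that could be kept for downstream modeling.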
ESMFold Integration
Use ESMFold for structure prediction, then analyze with TorchDrug models.
Rosetta/PyRosetta
Generate structures with Rosetta, then import them into TorchDrug for analysis.