Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

View File

@@ -0,0 +1,272 @@
# Protein Modeling
## Overview
TorchDrug provides extensive support for protein-related tasks including sequence analysis, structure prediction, property prediction, and protein-protein interactions. Proteins are represented as graphs where nodes are amino acid residues and edges represent spatial or sequential relationships.
## Available Datasets
### Protein Function Prediction
**Enzyme Function:**
- **EnzymeCommission** (17,562 proteins): EC number classification (7 levels)
- **BetaLactamase** (5,864 sequences): Enzyme activity prediction
**Protein Characteristics:**
- **Fluorescence** (54,025 sequences): GFP fluorescence intensity
- **Stability** (53,614 sequences): Thermostability prediction
- **Solubility** (62,478 sequences): Protein solubility classification
- **BinaryLocalization** (22,168 proteins): Subcellular localization (membrane vs. soluble)
- **SubcellularLocalization** (8,943 proteins): 10-class localization prediction
**Gene Ontology:**
- **GeneOntology** (46,796 proteins): GO term prediction across biological process, molecular function, and cellular component
### Protein Structure Prediction
- **Fold** (16,712 proteins): Structural fold classification (1,195 classes)
- **SecondaryStructure** (8,678 proteins): 3-state or 8-state secondary structure prediction
- **ContactPrediction** via ProteinNet: Residue-residue contact maps
### Protein Interaction
**Protein-Protein Interactions:**
- **HumanPPI** (1,412 proteins, 6,584 interactions): Human protein interaction network
- **YeastPPI** (2,018 proteins, 6,451 interactions): Yeast protein interaction network
- **PPIAffinity** (2,156 protein pairs): Binding affinity measurements
**Protein-Ligand Binding:**
- **BindingDB** (~1.5M entries): Comprehensive binding affinity database
- **PDBBind** (20,000+ complexes): 3D structure-based binding data
- Refined set: High-quality crystal structures
- Core set: Diverse benchmark set
### Large-Scale Protein Databases
- **AlphaFoldDB**: Access to 200M+ predicted protein structures
- **ProteinNet**: Standardized dataset for structure prediction
## Task Types
### NodePropertyPrediction
Predict properties at the residue (node) level, such as secondary structure or contact maps.
**Use Cases:**
- Secondary structure prediction (helix, sheet, coil)
- Residue-level disorder prediction
- Post-translational modification sites
- Binding site prediction
### PropertyPrediction
Predict protein-level properties like function, stability, or localization.
**Use Cases:**
- Enzyme function classification
- Subcellular localization
- Protein stability prediction
- Gene ontology term prediction
### InteractionPrediction
Predict interactions between protein pairs or protein-ligand pairs.
**Key Features:**
- Handles both sequence and structure inputs
- Supports symmetric (PPI) and asymmetric (protein-ligand) interactions
- Multiple negative sampling strategies
### ContactPrediction
Specialized task for predicting spatial proximity between residues in folded structures.
**Applications:**
- Structure prediction from sequence
- Protein folding pathway analysis
- Validation of predicted structures
## Protein Representation Models
### Sequence-Based Models
**ESM (Evolutionary Scale Modeling):**
- Pre-trained transformer model on 250M sequences
- State-of-the-art for sequence-only tasks
- Available in multiple sizes (ESM-1b, ESM-2)
- Captures evolutionary and structural information
**ProteinBERT:**
- BERT-style masked language model
- Pre-trained on UniProt sequences
- Good for transfer learning
**ProteinLSTM:**
- Bidirectional LSTM for sequence encoding
- Lightweight and fast
- Good baseline for sequence tasks
**ProteinCNN / ProteinResNet:**
- Convolutional architectures
- Capture local sequence patterns
- Faster than transformer models
### Structure-Based Models
**GearNet (Geometry-Aware Relational Graph Network):**
- Incorporates 3D geometric information
- Edge types based on sequential, radius, and K-nearest neighbors
- State-of-the-art for structure-based tasks
- Supports both backbone and full-atom representations
**GCN/GAT/GIN on Protein Graphs:**
- Standard GNN architectures adapted for proteins
- Flexible edge definitions (sequence, spatial, contact)
**SchNet:**
- Continuous-filter convolutions
- Handles 3D coordinates directly
- Good for structure prediction and protein-ligand binding
### Feature-Based Models
**Statistic Features:**
- Amino acid composition
- Sequence length statistics
- Motif counts
**Physicochemical Features:**
- Hydrophobicity scales
- Charge properties
- Secondary structure propensity
- Molecular weight, pI
## Protein Graph Construction
### Edge Types
**Sequential Edges:**
- Connect adjacent residues in sequence
- Captures primary structure
**Spatial Edges:**
- K-nearest neighbors in 3D space
- Radius cutoff (e.g., Cα atoms within 10Å)
- Captures tertiary structure
**Contact Edges:**
- Based on heavy atom distances
- Typically < 8Å threshold
### Node Features
**Residue Identity:**
- One-hot encoding of 20 amino acids
- Learned embeddings
**Position Information:**
- 3D coordinates (Cα, N, C, O)
- Backbone angles (phi, psi, omega)
- Relative spatial position encodings
**Physicochemical Properties:**
- Hydrophobicity
- Charge
- Size
- Secondary structure
## Training Workflows
### Pre-training Strategies
**Self-Supervised Pre-training:**
- Masked residue prediction (like BERT)
- Distance prediction between residues
- Angle prediction (phi, psi, omega)
- Dihedral angle prediction
- Contact map prediction
**Pre-trained Model Usage:**
```python
from torchdrug import models
# Load pre-trained ESM
model = models.ESM(path="esm1b_t33_650M_UR50S.pt")
# Fine-tune on downstream task
task = tasks.PropertyPrediction(
model, task=["stability"],
criterion="mse", metric=["mae", "rmse"]
)
```
### Multi-Task Learning
Train on multiple related tasks simultaneously:
- Joint prediction of function, localization, and stability
- Improves generalization and data efficiency
- Shares representations across tasks
### Best Practices
**For Sequence-Only Tasks:**
1. Start with pre-trained ESM or ProteinBERT
2. Fine-tune with small learning rate (1e-5 to 1e-4)
3. Use frozen embeddings for small datasets
4. Apply dropout for regularization
**For Structure-Based Tasks:**
1. Use GearNet with multiple edge types
2. Include geometric features (angles, dihedrals)
3. Pre-train on large structure databases
4. Use data augmentation (rotations, crops)
**For Small Datasets:**
1. Transfer learning from pre-trained models
2. Multi-task learning with related tasks
3. Data augmentation (sequence mutations, structure perturbations)
4. Strong regularization (dropout, weight decay)
## Common Use Cases
### Enzyme Engineering
- Predict enzyme activity from sequence
- Design mutations to improve stability
- Screen for desired catalytic properties
### Antibody Design
- Predict binding affinity
- Optimize antibody sequences
- Predict immunogenicity
### Drug Target Identification
- Predict protein function
- Identify druggable sites
- Analyze protein-ligand interactions
### Protein Structure Prediction
- Predict secondary structure from sequence
- Generate contact maps for tertiary structure
- Refine AlphaFold predictions
## Integration with Other Tools
### AlphaFold Integration
Load AlphaFold-predicted structures:
```python
from torchdrug import data
# Load AlphaFold structure
protein = data.Protein.from_pdb("alphafold_structure.pdb")
# Use in TorchDrug workflows
```
### ESMFold Integration
Use ESMFold for structure prediction, then analyze with TorchDrug models.
### Rosetta/PyRosetta
Generate structures with Rosetta, import to TorchDrug for analysis.