gh-k-dense-ai-claude-scient…/skills/torchdrug/references/protein_modeling.md

# Protein Modeling

## Overview

TorchDrug provides extensive support for protein-related tasks including sequence analysis, structure prediction, property prediction, and protein-protein interactions. Proteins are represented as graphs where nodes are amino acid residues and edges represent spatial or sequential relationships.

## Available Datasets

### Protein Function Prediction

**Enzyme Function:**
- **EnzymeCommission** (17,562 proteins): EC number classification (7 levels)
- **BetaLactamase** (5,864 sequences): Enzyme activity prediction

**Protein Characteristics:**
- **Fluorescence** (54,025 sequences): GFP fluorescence intensity
- **Stability** (53,614 sequences): Thermostability prediction
- **Solubility** (62,478 sequences): Protein solubility classification
- **BinaryLocalization** (22,168 proteins): Subcellular localization (membrane vs. soluble)
- **SubcellularLocalization** (8,943 proteins): 10-class localization prediction

**Gene Ontology:**
- **GeneOntology** (46,796 proteins): GO term prediction across biological process, molecular function, and cellular component

### Protein Structure Prediction

- **Fold** (16,712 proteins): Structural fold classification (1,195 classes)
- **SecondaryStructure** (8,678 proteins): 3-state or 8-state secondary structure prediction
- **ContactPrediction** via ProteinNet: Residue-residue contact maps

### Protein Interaction

**Protein-Protein Interactions:**
- **HumanPPI** (1,412 proteins, 6,584 interactions): Human protein interaction network
- **YeastPPI** (2,018 proteins, 6,451 interactions): Yeast protein interaction network
- **PPIAffinity** (2,156 protein pairs): Binding affinity measurements

**Protein-Ligand Binding:**
- **BindingDB** (~1.5M entries): Comprehensive binding affinity database
- **PDBBind** (20,000+ complexes): 3D structure-based binding data
  - Refined set: High-quality crystal structures
  - Core set: Diverse benchmark set

### Large-Scale Protein Databases

- **AlphaFoldDB**: Access to 200M+ predicted protein structures
- **ProteinNet**: Standardized dataset for structure prediction

## Task Types

### NodePropertyPrediction

Predict properties at the residue (node) level, such as secondary structure or contact maps.

**Use Cases:**
- Secondary structure prediction (helix, sheet, coil)
- Residue-level disorder prediction
- Post-translational modification sites
- Binding site prediction

### PropertyPrediction

Predict protein-level properties like function, stability, or localization.

**Use Cases:**
- Enzyme function classification
- Subcellular localization
- Protein stability prediction
- Gene ontology term prediction

### InteractionPrediction

Predict interactions between protein pairs or protein-ligand pairs.

**Key Features:**
- Handles both sequence and structure inputs
- Supports symmetric (PPI) and asymmetric (protein-ligand) interactions
- Multiple negative sampling strategies

### ContactPrediction

Specialized task for predicting spatial proximity between residues in folded structures.

**Applications:**
- Structure prediction from sequence
- Protein folding pathway analysis
- Validation of predicted structures

## Protein Representation Models

### Sequence-Based Models

**ESM (Evolutionary Scale Modeling):**
- Pre-trained transformer model on 250M sequences
- State-of-the-art for sequence-only tasks
- Available in multiple sizes (ESM-1b, ESM-2)
- Captures evolutionary and structural information

**ProteinBERT:**
- BERT-style masked language model
- Pre-trained on UniProt sequences
- Good for transfer learning

**ProteinLSTM:**
- Bidirectional LSTM for sequence encoding
- Lightweight and fast
- Good baseline for sequence tasks

**ProteinCNN / ProteinResNet:**
- Convolutional architectures
- Capture local sequence patterns
- Faster than transformer models

### Structure-Based Models

**GearNet (Geometry-Aware Relational Graph Network):**
- Incorporates 3D geometric information
- Edge types based on sequential, radius, and K-nearest neighbors
- State-of-the-art for structure-based tasks
- Supports both backbone and full-atom representations

**GCN/GAT/GIN on Protein Graphs:**
- Standard GNN architectures adapted for proteins
- Flexible edge definitions (sequence, spatial, contact)

**SchNet:**
- Continuous-filter convolutions
- Handles 3D coordinates directly
- Good for structure prediction and protein-ligand binding

### Feature-Based Models

**Statistic Features:**
- Amino acid composition
- Sequence length statistics
- Motif counts

**Physicochemical Features:**
- Hydrophobicity scales
- Charge properties
- Secondary structure propensity
- Molecular weight, pI

## Protein Graph Construction

### Edge Types

**Sequential Edges:**
- Connect adjacent residues in sequence
- Captures primary structure

**Spatial Edges:**
- K-nearest neighbors in 3D space
- Radius cutoff (e.g., Cα atoms within 10Å)
- Captures tertiary structure

**Contact Edges:**
- Based on heavy atom distances
- Typically < 8Å threshold

### Node Features

**Residue Identity:**
- One-hot encoding of 20 amino acids
- Learned embeddings

**Position Information:**
- 3D coordinates (Cα, N, C, O)
- Backbone angles (phi, psi, omega)
- Relative spatial position encodings

**Physicochemical Properties:**
- Hydrophobicity
- Charge
- Size
- Secondary structure

## Training Workflows

### Pre-training Strategies

**Self-Supervised Pre-training:**
- Masked residue prediction (like BERT)
- Distance prediction between residues
- Angle prediction (phi, psi, omega)
- Dihedral angle prediction
- Contact map prediction

**Pre-trained Model Usage:**
```python
from torchdrug import models

# Load pre-trained ESM
model = models.ESM(path="esm1b_t33_650M_UR50S.pt")

# Fine-tune on downstream task
task = tasks.PropertyPrediction(
    model, task=["stability"],
    criterion="mse", metric=["mae", "rmse"]
)
```

### Multi-Task Learning

Train on multiple related tasks simultaneously:
- Joint prediction of function, localization, and stability
- Improves generalization and data efficiency
- Shares representations across tasks

### Best Practices

**For Sequence-Only Tasks:**
1. Start with pre-trained ESM or ProteinBERT
2. Fine-tune with small learning rate (1e-5 to 1e-4)
3. Use frozen embeddings for small datasets
4. Apply dropout for regularization

**For Structure-Based Tasks:**
1. Use GearNet with multiple edge types
2. Include geometric features (angles, dihedrals)
3. Pre-train on large structure databases
4. Use data augmentation (rotations, crops)

**For Small Datasets:**
1. Transfer learning from pre-trained models
2. Multi-task learning with related tasks
3. Data augmentation (sequence mutations, structure perturbations)
4. Strong regularization (dropout, weight decay)

## Common Use Cases

### Enzyme Engineering
- Predict enzyme activity from sequence
- Design mutations to improve stability
- Screen for desired catalytic properties

### Antibody Design
- Predict binding affinity
- Optimize antibody sequences
- Predict immunogenicity

### Drug Target Identification
- Predict protein function
- Identify druggable sites
- Analyze protein-ligand interactions

### Protein Structure Prediction
- Predict secondary structure from sequence
- Generate contact maps for tertiary structure
- Refine AlphaFold predictions

## Integration with Other Tools

### AlphaFold Integration

Load AlphaFold-predicted structures:
```python
from torchdrug import data

# Load AlphaFold structure
protein = data.Protein.from_pdb("alphafold_structure.pdb")

# Use in TorchDrug workflows
```

### ESMFold Integration

Use ESMFold for structure prediction, then analyze with TorchDrug models.

### Rosetta/PyRosetta

Generate structures with Rosetta, import to TorchDrug for analysis.