Zhongwei Li f0bd18fb4e Initial commit

2025-11-30 08:30:10 +08:00

Protein Modeling

Overview

TorchDrug provides extensive support for protein-related tasks including sequence analysis, structure prediction, property prediction, and protein-protein interactions. Proteins are represented as graphs where nodes are amino acid residues and edges represent spatial or sequential relationships.

Available Datasets

Protein Function Prediction

Enzyme Function:

EnzymeCommission (17,562 proteins): EC number prediction (EC numbers form a four-level hierarchy under seven top-level enzyme classes)
BetaLactamase (5,864 sequences): Enzyme activity prediction

Protein Characteristics:

Fluorescence (54,025 sequences): GFP fluorescence intensity
Stability (53,614 sequences): Thermostability prediction
Solubility (62,478 sequences): Protein solubility classification
BinaryLocalization (22,168 proteins): Subcellular localization (membrane vs. soluble)
SubcellularLocalization (8,943 proteins): 10-class localization prediction

Gene Ontology:

GeneOntology (46,796 proteins): GO term prediction across biological process, molecular function, and cellular component

Protein Structure Prediction

Fold (16,712 proteins): Structural fold classification (1,195 classes)
SecondaryStructure (8,678 proteins): 3-state or 8-state secondary structure prediction
ContactPrediction via ProteinNet: Residue-residue contact maps

Protein Interaction

Protein-Protein Interactions:

HumanPPI (1,412 proteins, 6,584 interactions): Human protein interaction network
YeastPPI (2,018 proteins, 6,451 interactions): Yeast protein interaction network
PPIAffinity (2,156 protein pairs): Binding affinity measurements

Protein-Ligand Binding:

BindingDB (~1.5M entries): Comprehensive binding affinity database
PDBBind (20,000+ complexes): 3D structure-based binding data
- Refined set: High-quality crystal structures
- Core set: Diverse benchmark set

Large-Scale Protein Databases

AlphaFoldDB: Access to 200M+ predicted protein structures
ProteinNet: Standardized dataset for structure prediction

Task Types

NodePropertyPrediction

Predict properties at the residue (node) level, such as secondary structure or binding-site membership. Pairwise outputs such as contact maps are handled by ContactPrediction.

Use Cases:

Secondary structure prediction (helix, sheet, coil)
Residue-level disorder prediction
Post-translational modification sites
Binding site prediction
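As a small illustration of residue-level labels, the sketch below reduces 8-state DSSP annotations to the 3-state helix/strand/coil alphabet and scores a prediction per residue. The mapping shown is one common reduction, not necessarily the one used by any particular dataset, and the function names are hypothetical.

```python
# One common reduction of the eight DSSP states to three classes.
# Other conventions exist (for example, treating G or B as coil).
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",  # helices
    "E": "E", "B": "E",            # strands and bridges
    "T": "C", "S": "C", "-": "C",  # turns, bends, and coil
}


def reduce_to_3_state(dssp8):
    """Map an 8-state DSSP string to a 3-state string of the same length."""
    return "".join(DSSP8_TO_3[state] for state in dssp8)


def per_residue_accuracy(predicted, target):
    """Fraction of residues whose predicted state matches the target."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    matches = sum(p == t for p, t in zip(predicted, target))
    return matches / len(target)


if __name__ == "__main__":
    target = reduce_to_3_state("HHGE-TB")
    print(target, per_residue_accuracy("HHHECCC", target))
```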

PropertyPrediction

Predict protein-level properties like function, stability, or localization.

Use Cases:

Enzyme function classification
Subcellular localization
Protein stability prediction
Gene ontology term prediction

InteractionPrediction

Predict interactions between protein pairs or protein-ligand pairs.

Key Features:

Handles both sequence and structure inputs
Supports symmetric (PPI) and asymmetric (protein-ligand) interactions
Multiple negative sampling strategies
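One simple negative sampling strategy for symmetric PPI data is to draw random protein pairs that are absent from the known-positive set. The sketch below is a generic, library-independent illustration of that idea; `sample_negative_pairs` is a hypothetical helper, and it assumes all positive pairs are between distinct proteins from the given list.

```python
import random


def sample_negative_pairs(positive_pairs, proteins, num_negatives, seed=0):
    """Draw random unordered protein pairs that are not known positives."""
    rng = random.Random(seed)
    # PPIs are symmetric, so (A, B) and (B, A) are stored as the same pair.
    known = {frozenset(pair) for pair in positive_pairs}
    # Number of distinct unordered non-self pairs that are not positives
    # (assumes every positive pair uses two distinct listed proteins).
    max_negatives = len(proteins) * (len(proteins) - 1) // 2 - len(known)
    if num_negatives > max_negatives:
        raise ValueError("not enough candidate pairs to sample from")
    negatives = set()
    while len(negatives) < num_negatives:
        a, b = rng.sample(proteins, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair not in known:
            negatives.add(pair)
    return sorted(tuple(sorted(pair)) for pair in negatives)


if __name__ == "__main__":
    proteins = ["P1", "P2", "P3", "P4"]
    positives = [("P1", "P2"), ("P3", "P4")]
    print(sample_negative_pairs(positives, proteins, num_negatives=3))
```

Random negatives can include true but unobserved interactions, so treating them as hard negatives is itself an assumption.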

ContactPrediction

Specialized task for predicting spatial proximity between residues in folded structures.

Applications:

Structure prediction from sequence
Protein folding pathway analysis
Validation of predicted structures
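To make the prediction target concrete, the following sketch derives a binary contact map from one representative coordinate per residue (for example Cα or Cβ, which is an assumption here) using a distance threshold. The 8 Å default and the `min_separation` filter are illustrative choices, and `contact_map` is a hypothetical helper.

```python
import math


def contact_map(coords, threshold=8.0, min_separation=1):
    """Return a symmetric 0/1 matrix marking residue pairs closer than threshold.

    coords: list of (x, y, z) tuples, one representative atom per residue.
    min_separation: minimum |i - j| in sequence for a pair to be considered.
    """
    n = len(coords)
    contacts = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if j - i < min_separation:
                continue  # skip pairs that are too close in sequence
            if math.dist(coords[i], coords[j]) < threshold:
                contacts[i][j] = contacts[j][i] = 1
    return contacts


if __name__ == "__main__":
    # Four residues on a straight line, 3.8 angstroms apart.
    coords = [(3.8 * i, 0.0, 0.0) for i in range(4)]
    for row in contact_map(coords, min_separation=2):
        print(row)
```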

Protein Representation Models

Sequence-Based Models

ESM (Evolutionary Scale Modeling):

Pre-trained transformer model on 250M sequences
State-of-the-art for sequence-only tasks
Available in multiple versions and sizes (e.g., ESM-1b, ESM-2)
Captures evolutionary and structural information

ProteinBERT:

BERT-style masked language model
Pre-trained on UniProt sequences
Good for transfer learning

ProteinLSTM:

Bidirectional LSTM for sequence encoding
Lightweight and fast
Good baseline for sequence tasks

ProteinCNN / ProteinResNet:

Convolutional architectures
Capture local sequence patterns
Faster than transformer models

Structure-Based Models

GearNet (Geometry-Aware Relational Graph Network):

Incorporates 3D geometric information
Edge types based on sequential, radius, and K-nearest neighbors
State-of-the-art for structure-based tasks
Supports both backbone and full-atom representations

GCN/GAT/GIN on Protein Graphs:

Standard GNN architectures adapted for proteins
Flexible edge definitions (sequence, spatial, contact)

SchNet:

Continuous-filter convolutions
Handles 3D coordinates directly
Good for structure prediction and protein-ligand binding

Feature-Based Models

Statistical Features:

Amino acid composition
Sequence length statistics
Motif counts

Physicochemical Features:

Hydrophobicity scales
Charge properties
Secondary structure propensity
Molecular weight, pI
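A minimal sketch of a hand-crafted sequence feature vector, combining amino acid composition, sequence length, and one motif count. The motif used here (N-x-[S/T] with x not proline, a commonly used N-glycosylation sequon pattern) and the function names are illustrative assumptions rather than features defined by any specific library.

```python
import re
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequence):
    """Fraction of each of the 20 standard amino acids, in AMINO_ACIDS order."""
    counts = Counter(sequence)
    return [counts[aa] / len(sequence) for aa in AMINO_ACIDS]


def count_sequons(sequence):
    """Count (possibly overlapping) N-x-[S/T] motifs where x is not proline."""
    return len(re.findall(r"(?=N[^P][ST])", sequence))


def sequence_features(sequence):
    """Concatenate composition (20 values), length, and motif count."""
    return composition(sequence) + [float(len(sequence)),
                                    float(count_sequons(sequence))]


if __name__ == "__main__":
    print(sequence_features("NGSAA"))
```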

Protein Graph Construction

Edge Types

Sequential Edges:

Connect adjacent residues in sequence
Captures primary structure

Spatial Edges:

K-nearest neighbors in 3D space
Radius cutoff (e.g., Cα atoms within 10Å)
Captures tertiary structure

Contact Edges:

Based on heavy atom distances
Typically < 8Å threshold
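The edge types above can be sketched in a few lines of plain Python, given one Cα coordinate per residue. This is a library-independent illustration with hypothetical function names; the cutoff and `k` defaults are example values only.

```python
import math


def sequential_edges(num_residues, max_offset=1):
    """Edges (i, i + k) between residues at most max_offset apart in sequence."""
    return [(i, i + k)
            for k in range(1, max_offset + 1)
            for i in range(num_residues - k)]


def radius_edges(coords, cutoff=10.0):
    """Undirected edges between residues whose distance is below cutoff."""
    n = len(coords)
    return [(i, j)
            for i in range(n)
            for j in range(i + 1, n)
            if math.dist(coords[i], coords[j]) < cutoff]


def knn_edges(coords, k=2):
    """Directed edges from each residue to its k nearest neighbors in 3D."""
    edges = []
    for i, center in enumerate(coords):
        others = sorted((j for j in range(len(coords)) if j != i),
                        key=lambda j: math.dist(center, coords[j]))
        edges.extend((i, j) for j in others[:k])
    return edges


if __name__ == "__main__":
    coords = [(3.8 * i, 0.0, 0.0) for i in range(5)]
    print(sequential_edges(len(coords)))
    print(radius_edges(coords))
    print(knn_edges(coords, k=1))
```

Keeping each edge type in a separate list makes it easy to treat them as distinct relations in a relational graph model.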

Node Features

Residue Identity:

One-hot encoding of 20 amino acids
Learned embeddings

Position Information:

3D coordinates (Cα, N, C, O)
Backbone angles (phi, psi, omega)
Relative spatial position encodings
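Backbone angles are dihedral angles over four consecutive backbone atoms: phi uses C(i-1), N(i), Cα(i), C(i); psi uses N(i), Cα(i), C(i), N(i+1); omega uses Cα(i), C(i), N(i+1), Cα(i+1). The sketch below computes a signed dihedral from four points with plain Python; `dihedral_degrees` is a hypothetical helper, not a library function.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) around the p1-p2 bond."""
    b0 = _sub(p0, p1)
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(x / norm for x in b1)  # unit vector along the central bond
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * x for x in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * x for x in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # A planar cis arrangement gives 0 degrees; trans gives 180 degrees.
    print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
```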

Physicochemical Properties:

Hydrophobicity
Charge
Size
Secondary structure
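As an illustration of per-residue node features, this sketch concatenates a 20-dimensional one-hot residue identity with two physicochemical values: Kyte-Doolittle hydropathy and a crude side-chain charge. The feature layout and the `residue_features` name are assumptions for the example, and the charge ignores histidine and pH effects.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Crude side-chain charge near neutral pH (histidine treated as neutral).
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def residue_features(sequence):
    """Return one 22-dimensional feature vector per residue."""
    features = []
    for aa in sequence:
        one_hot = [1.0 if aa == ref else 0.0 for ref in AMINO_ACIDS]
        features.append(one_hot + [KYTE_DOOLITTLE[aa],
                                   float(CHARGE.get(aa, 0))])
    return features


if __name__ == "__main__":
    for row in residue_features("AK"):
        print(row)
```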

Training Workflows

Pre-training Strategies

Self-Supervised Pre-training:

Masked residue prediction (like BERT)
Distance prediction between residues
Angle prediction (angles between neighboring edges)
Dihedral angle prediction (e.g., backbone phi, psi, omega)
Contact map prediction
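Masked residue prediction starts by hiding a fraction of residues and keeping the originals as targets. The sketch below shows only that corruption step, using a hypothetical `mask_residues` helper and a placeholder mask token. It assumes a non-empty sequence and does not implement BERT's additional random-replacement and keep-unchanged variants.

```python
import random

MASK = "<mask>"  # placeholder token; real models define their own vocabulary


def mask_residues(sequence, mask_rate=0.15, seed=0):
    """Replace a random subset of residues with MASK.

    Returns (tokens, targets), where targets maps each masked position
    to the original residue the model should recover.
    """
    rng = random.Random(seed)
    tokens = list(sequence)
    num_to_mask = max(1, round(mask_rate * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), num_to_mask))
    targets = {pos: tokens[pos] for pos in positions}
    for pos in positions:
        tokens[pos] = MASK
    return tokens, targets


if __name__ == "__main__":
    tokens, targets = mask_residues("MKTAYIAKQRQISFVKSHFS")
    print(tokens)
    print(targets)
```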

Pre-trained Model Usage:

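from torchdrug import models, tasks

# Load pre-trained ESM-1b. `path` is the directory where the weights are
# stored (downloaded there if missing); the directory name is illustrative.
model = models.ESM(path="~/esm-model-weights/", model="ESM-1b")

# Fine-tune on a downstream regression task. The names in `task` must match
# the dataset's target fields (for example, pass `dataset.tasks`).
task = tasks.PropertyPrediction(
    model, task=["stability"],
    criterion="mse", metric=["mae", "rmse"]
)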

Multi-Task Learning

Train on multiple related tasks simultaneously:

Joint prediction of function, localization, and stability
Improves generalization and data efficiency
Shares representations across tasks
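One simple way to combine tasks is a weighted average of per-task losses, skipping tasks that have no label in the current batch. The sketch below illustrates that combination rule only; `multitask_loss`, the task names, and the weights are hypothetical, and a real training loop would apply the same idea to tensors.

```python
def multitask_loss(task_losses, task_weights=None):
    """Weighted average of per-task losses.

    task_losses: dict mapping task name to a loss value, or None when the
        batch has no label for that task.
    task_weights: optional dict of per-task weights (default weight 1.0).
    """
    task_weights = task_weights or {}
    total = 0.0
    weight_sum = 0.0
    for name, loss in task_losses.items():
        if loss is None:
            continue  # no supervision for this task in the batch
        weight = task_weights.get(name, 1.0)
        total += weight * loss
        weight_sum += weight
    if weight_sum == 0:
        raise ValueError("no task has a usable loss")
    return total / weight_sum


if __name__ == "__main__":
    losses = {"function": 0.6, "localization": 0.2, "stability": None}
    print(multitask_loss(losses))
    print(multitask_loss(losses, {"function": 3.0}))
```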

Best Practices

For Sequence-Only Tasks:

Start with pre-trained ESM or ProteinBERT
Fine-tune with small learning rate (1e-5 to 1e-4)
Use frozen embeddings for small datasets
Apply dropout for regularization

For Structure-Based Tasks:

Use GearNet with multiple edge types
Include geometric features (angles, dihedrals)
Pre-train on large structure databases
Use data augmentation (rotations, crops)
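Rotation augmentation relies on the fact that rigid rotations leave all pairwise distances unchanged. The sketch below rotates coordinates about a random axis with Rodrigues' formula; the helper names are hypothetical, the rotation is about the origin, and sampling a random axis with a uniform angle is a simplification that is not uniform over all 3D rotations.

```python
import math
import random


def rotate(coords, axis, angle):
    """Rotate points about the given axis (through the origin) by angle radians."""
    norm = math.sqrt(sum(a * a for a in axis))
    kx, ky, kz = (a / norm for a in axis)
    c, s = math.cos(angle), math.sin(angle)
    rotated = []
    for x, y, z in coords:
        d = kx * x + ky * y + kz * z                               # k . v
        cx, cy, cz = ky * z - kz * y, kz * x - kx * z, kx * y - ky * x  # k x v
        rotated.append((x * c + cx * s + kx * d * (1 - c),
                        y * c + cy * s + ky * d * (1 - c),
                        z * c + cz * s + kz * d * (1 - c)))
    return rotated


def random_rotation(coords, seed=None):
    """Apply a rotation with a random axis and a random angle."""
    rng = random.Random(seed)
    axis = [rng.gauss(0.0, 1.0) for _ in range(3)]
    angle = rng.uniform(0.0, 2.0 * math.pi)
    return rotate(coords, axis, angle)


if __name__ == "__main__":
    coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (5.0, 2.0, 1.0)]
    rotated = random_rotation(coords, seed=1)
    print(math.dist(coords[1], coords[2]), math.dist(rotated[1], rotated[2]))
```

Models that use only invariant inputs such as distances and angles gain nothing from this augmentation; it matters when raw coordinates are fed in directly.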

For Small Datasets:

Transfer learning from pre-trained models
Multi-task learning with related tasks
Data augmentation (sequence mutations, structure perturbations)
Strong regularization (dropout, weight decay)
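Sequence-mutation augmentation can be sketched as random point substitutions. The helper below is hypothetical and assumes the label is unchanged by a few substitutions, which does not hold for mutation-sensitive targets such as fluorescence or stability.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_point_mutations(sequence, num_mutations=1, seed=0):
    """Substitute num_mutations distinct positions with a different residue."""
    rng = random.Random(seed)
    residues = list(sequence)
    for pos in rng.sample(range(len(residues)), num_mutations):
        # Exclude the current residue so every chosen position really changes.
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


if __name__ == "__main__":
    print(random_point_mutations("MKTAYIAKQR", num_mutations=2))
```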

Common Use Cases

Enzyme Engineering

Predict enzyme activity from sequence
Design mutations to improve stability
Screen for desired catalytic properties

Antibody Design

Predict binding affinity
Optimize antibody sequences
Predict immunogenicity

Drug Target Identification

Predict protein function
Identify druggable sites
Analyze protein-ligand interactions

Protein Structure Prediction

Predict secondary structure from sequence
Generate contact maps for tertiary structure
Refine AlphaFold predictions

Integration with Other Tools

AlphaFold Integration

Load AlphaFold-predicted structures:

from torchdrug import data

# Load AlphaFold structure
protein = data.Protein.from_pdb("alphafold_structure.pdb")

# Use in TorchDrug workflows

ESMFold Integration

Use ESMFold for structure prediction, then analyze with TorchDrug models.

Rosetta/PyRosetta

Generate structures with Rosetta, import to TorchDrug for analysis.
