# Available Featurizers in Molfeat

This document provides a comprehensive catalog of all featurizers available in molfeat, organized by category.

## Transformer-Based Language Models

Pre-trained transformer models for molecular embeddings using SMILES/SELFIES representations.

### RoBERTa-style Models

- **Roberta-Zinc480M-102M** - RoBERTa masked language model trained on ~480M SMILES strings from the ZINC database
- **ChemBERTa-77M-MLM** - Masked language model based on RoBERTa, trained on 77M PubChem compounds
- **ChemBERTa-77M-MTR** - Multitask regression version trained on PubChem compounds

### GPT-style Autoregressive Models

- **GPT2-Zinc480M-87M** - GPT-2 autoregressive language model trained on ~480M SMILES from ZINC
- **ChemGPT-1.2B** - Large transformer (1.2B parameters) pretrained on PubChem10M
- **ChemGPT-19M** - Medium transformer (19M parameters) pretrained on PubChem10M
- **ChemGPT-4.7M** - Small transformer (4.7M parameters) pretrained on PubChem10M

### Specialized Transformer Models

- **MolT5** - Self-supervised framework for molecule captioning and text-based molecule generation

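As a usage sketch (assuming the optional transformers dependency is installed and that the model name matches its entry in the molfeat model store), these language models are loaded through molfeat's pretrained HuggingFace wrapper:

```python
from molfeat.trans.pretrained import PretrainedHFTransformer

# ChemBERTa embeddings from SMILES; weights are downloaded and cached on first use
featurizer = PretrainedHFTransformer(kind="ChemBERTa-77M-MLM", notation="smiles", dtype=float)

smiles = ["CCO", "c1ccccc1", "CC(=O)Nc1ccc(O)cc1"]
embeddings = featurizer(smiles)  # one fixed-length embedding vector per molecule
print(len(embeddings), len(embeddings[0]))
```
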
## Graph Neural Networks (GNNs)

Pre-trained graph neural network models operating on molecular graph structures.

### GIN (Graph Isomorphism Network) Variants

All pre-trained on ChEMBL molecules with different objectives:

- **gin-supervised-masking** - Supervised with node masking objective
- **gin-supervised-infomax** - Supervised with graph-level mutual information maximization
- **gin-supervised-edgepred** - Supervised with edge prediction objective
- **gin-supervised-contextpred** - Supervised with context prediction objective

### Other Graph-Based Models

- **JTVAE_zinc_no_kl** - Junction-tree VAE for molecule generation (trained on ZINC)
- **Graphormer-pcqm4mv2** - Graph transformer pretrained on the PCQM4Mv2 quantum chemistry dataset for HOMO-LUMO gap prediction

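A minimal sketch of loading one of these GIN variants through molfeat's pretrained DGL wrapper (assuming the optional DGL dependency is installed; the exact `kind` string should be checked against the model store, which may list it with underscores rather than hyphens):

```python
from molfeat.trans.pretrained import PretrainedDGLTransformer

# GIN pre-trained with the node-masking objective; weights download on first use
featurizer = PretrainedDGLTransformer(kind="gin_supervised_masking", dtype=float)

smiles = ["CCO", "c1ccccc1", "CC(=O)Nc1ccc(O)cc1"]
embeddings = featurizer(smiles)  # one graph-level embedding per molecule
print(len(embeddings), len(embeddings[0]))
```
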
## Molecular Descriptors

Calculators for physico-chemical properties and molecular characteristics.

### 2D Descriptors

- **desc2D** / **rdkit2D** - 200+ RDKit 2D molecular descriptors including:
  - Molecular weight, logP, TPSA
  - H-bond donors/acceptors
  - Rotatable bonds
  - Ring counts and aromaticity
  - Molecular complexity metrics

### 3D Descriptors

- **desc3D** / **rdkit3D** - RDKit 3D molecular descriptors (requires conformer generation)
  - Inertial moments
  - PMI (Principal Moments of Inertia) ratios
  - Asphericity, eccentricity
  - Radius of gyration

### Comprehensive Descriptor Sets

- **mordred** - Over 1800 molecular descriptors covering:
  - Constitutional descriptors
  - Topological indices
  - Connectivity indices
  - Information content
  - 2D/3D autocorrelations
  - WHIM descriptors
  - GETAWAY descriptors
  - And many more

### Electrotopological Descriptors

- **estate** - Electrotopological state (E-State) indices encoding:
  - Atomic environment information
  - Electronic and topological properties
  - Heteroatom contributions

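A minimal sketch of computing a descriptor set with molfeat's generic `MoleculeTransformer` (the `featurizer` strings are assumed to match the names listed above; mordred and the 3D descriptors additionally need their optional dependencies or pre-computed conformers):

```python
import numpy as np
from molfeat.trans import MoleculeTransformer

# 2D RDKit descriptors as a dense float matrix, one row per molecule
transformer = MoleculeTransformer(featurizer="desc2D", dtype=np.float64)

smiles = ["CCO", "c1ccccc1", "CC(=O)Nc1ccc(O)cc1"]
features = np.asarray(transformer(smiles))
print(features.shape)  # roughly (3, 200+)
```
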
## Molecular Fingerprints

Binary or count-based fixed-length vectors representing molecular substructures.

### Circular Fingerprints (ECFP-style)

- **ecfp** / **ecfp:2** / **ecfp:4** / **ecfp:6** - Extended-connectivity fingerprints
  - The numeric suffix is the diameter, i.e. twice the radius (ecfp:4 uses radius 2)
  - Default: radius=3, 2048 bits
  - Most popular for similarity searching
- **ecfp-count** - Count version of ECFP (non-binary)
- **fcfp** / **fcfp-count** - Functional-class circular fingerprints
  - Similar to ECFP but uses functional groups
  - Better for pharmacophore-based similarity

### Path-Based Fingerprints

- **rdkit** - RDKit topological fingerprints based on linear paths
- **pattern** - Pattern fingerprints (similar to MACCS but automated)
- **layered** - Layered fingerprints with multiple substructure layers

### Key-Based Fingerprints

- **maccs** - MACCS keys (166 structural keys; 167 bits in the RDKit implementation)
  - Fixed set of predefined substructures
  - Good for scaffold hopping
  - Fast computation
- **avalon** - Avalon fingerprints
  - Similar to MACCS but more features
  - Optimized for similarity searching

### Atom-Pair Fingerprints

- **atompair** - Atom pair fingerprints
  - Encodes pairs of atoms and the distance between them
  - Good for 3D similarity
- **atompair-count** - Count version of atom pairs

### Topological Torsion Fingerprints

- **topological** - Topological torsion fingerprints
  - Encodes sequences of 4 connected atoms
  - Captures local topology
- **topological-count** - Count version of topological torsions

### MinHashed Fingerprints

- **map4** - MinHashed Atom-Pair fingerprint up to 4 bonds
  - Combines atom-pair and ECFP concepts
  - Default: 1024 dimensions
  - Fast and efficient for large datasets
- **secfp** - SMILES Extended Connectivity Fingerprint
  - Operates directly on SMILES strings
  - Captures both substructure and atom-pair information

### Extended Reduced Graph

- **erg** - Extended Reduced Graph fingerprints
  - Uses pharmacophoric points instead of atoms
  - Reduces graph complexity while preserving key features

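A hedged sketch of computing fingerprints, either one molecule at a time with the low-level `FPCalculator` or in batches with `MoleculeTransformer` (both are part of molfeat's documented API; default sizes may vary by version):

```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

# Single-molecule calculation with the low-level calculator
calc = FPCalculator("ecfp")
fp = calc("CC(=O)Nc1ccc(O)cc1")  # fixed-length bit vector
print(len(fp))

# Batch calculation with the high-level transformer
transformer = MoleculeTransformer(featurizer="maccs", dtype=float)
fps = transformer(["CCO", "c1ccccc1"])
print(len(fps), len(fps[0]))
```
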
## Pharmacophore Descriptors

Features based on pharmacologically relevant functional groups and their spatial relationships.

### CATS (Chemically Advanced Template Search)

- **cats2D** - 2D CATS descriptors
  - Pharmacophore point pair distributions
  - Distances based on shortest paths
  - 21 descriptors by default
- **cats3D** - 3D CATS descriptors
  - Euclidean distance based
  - Requires conformer generation
- **cats2D_pharm** / **cats3D_pharm** - Pharmacophore variants

### Gobbi Pharmacophores

- **gobbi2D** - 2D pharmacophore fingerprints
  - Built on 8 pharmacophore feature types, including:
    - Hydrophobic
    - Aromatic
    - H-bond acceptor
    - H-bond donor
    - Positive ionizable
    - Negative ionizable
    - Lumped hydrophobe
  - Good for virtual screening

### Pmapper Pharmacophores

- **pmapper2D** - 2D pharmacophore signatures
- **pmapper3D** - 3D pharmacophore signatures
  - High-dimensional pharmacophore descriptors
  - Useful for QSAR and similarity searching

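An illustrative sketch, assuming the names above (e.g. `cats2D`) are accepted as `featurizer` strings by `MoleculeTransformer`, which is how molfeat exposes its built-in calculators:

```python
from molfeat.trans import MoleculeTransformer

# 2D CATS pharmacophore-pair descriptors: topological distances, no conformers required
cats = MoleculeTransformer(featurizer="cats2D", dtype=float)
features = cats(["CC(=O)Nc1ccc(O)cc1", "Oc1ccccc1"])
print(len(features), len(features[0]))
```
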
## Shape Descriptors

Descriptors capturing 3D molecular shape and electrostatic properties.

### USR (Ultrafast Shape Recognition)

- **usr** - Basic USR descriptors
  - 12 dimensions encoding the distribution of atomic distances from four reference points
  - Extremely fast computation
- **usrcat** - USR with pharmacophoric constraints
  - 60 dimensions (the 12 USR moments computed for all atoms and for four pharmacophoric atom subsets)
  - Combines shape and pharmacophore information

### Electrostatic Shape

- **electroshape** - ElectroShape descriptors
  - Combines molecular shape, chirality, and electrostatics
  - Useful for protein-ligand docking predictions

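Because these are 3D descriptors, a conformer must be generated first. A small sketch using datamol for conformer embedding (the `usr` featurizer string is taken from the list above; `dm.conformers.generate` is the datamol helper assumed here):

```python
import datamol as dm
from molfeat.trans import MoleculeTransformer

# USR needs atomic coordinates, so embed a 3D conformer first
mol = dm.to_mol("CC(=O)Nc1ccc(O)cc1")
mol = dm.conformers.generate(mol)

usr = MoleculeTransformer(featurizer="usr", dtype=float)
shape = usr([mol])  # 12-dimensional shape signature
print(len(shape[0]))
```
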
## Scaffold-Based Descriptors

Descriptors based on molecular scaffolds and core structures.

### Scaffold Keys

- **scaffoldkeys** - Scaffold key calculator
  - 40+ scaffold-based properties
  - Bioisosteric scaffold representation
  - Captures core structural features

## Graph Featurizers for GNN Input

Atom- and bond-level features for constructing graph representations for Graph Neural Networks.

### Atom-Level Features

- **atom-onehot** - One-hot encoded atom features
- **atom-default** - Default atom featurization including:
  - Atomic number
  - Degree, formal charge
  - Hybridization
  - Aromaticity
  - Number of hydrogen atoms

### Bond-Level Features

- **bond-onehot** - One-hot encoded bond features
- **bond-default** - Default bond featurization including:
  - Bond type (single, double, triple, aromatic)
  - Conjugation
  - Ring membership
  - Stereochemistry

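A hedged sketch of the corresponding calculators (assuming `AtomCalculator` and `BondCalculator` are exposed in `molfeat.calc`, as in recent molfeat releases; the exact feature keys in the output are version-dependent):

```python
import datamol as dm
from molfeat.calc import AtomCalculator, BondCalculator

mol = dm.to_mol("CC(=O)Nc1ccc(O)cc1")

# Per-atom and per-bond feature arrays, typically fed into a GNN data pipeline
atom_features = AtomCalculator()(mol)
bond_features = BondCalculator()(mol)
print(list(atom_features.keys()), list(bond_features.keys()))
```
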
## Integrated Pretrained Model Collections

Molfeat integrates models from various sources:

### HuggingFace Models

Access to transformer models through the HuggingFace hub:

- ChemBERTa variants
- ChemGPT variants
- MolT5
- Custom uploaded models

### DGL-LifeSci Models

Pre-trained GNN models from DGL-LifeSci:

- GIN variants with different pre-training tasks
- AttentiveFP models
- MPNN models

### FCD (Fréchet ChemNet Distance)

- **fcd** - Pre-trained CNN for molecular generation evaluation

### Graphormer Models

- Graph transformers from Microsoft Research
- Pre-trained on quantum chemistry datasets

## Usage Notes

### Choosing a Featurizer

**For traditional ML (Random Forest, SVM, etc.):**

- Start with **ecfp** or **maccs** fingerprints
- Try **desc2D** for interpretable models
- Use **FeatConcat** to combine multiple fingerprints (see the sketch below)

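A hedged sketch of combining fingerprints with `FeatConcat` (assuming it is importable from `molfeat.trans.concat` and accepts a list of featurizer names, as in the molfeat documentation):

```python
import numpy as np
from molfeat.trans.concat import FeatConcat

# Concatenate MACCS keys and ECFP bits into one feature vector per molecule
featurizer = FeatConcat(["maccs", "ecfp"], dtype=np.float64)

smiles = ["CCO", "c1ccccc1", "CC(=O)Nc1ccc(O)cc1"]
features = np.asarray(featurizer(smiles))
print(features.shape)  # (3, 167 + 2048)
```
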
**For deep learning:**

- Use **ChemBERTa** or **ChemGPT** for transformer embeddings
- Use **gin-supervised-\*** for graph neural network embeddings
- Consider **Graphormer** for quantum property predictions

**For similarity searching:**

- **ecfp** - General purpose, most popular
- **maccs** - Fast, good for scaffold hopping
- **map4** - Efficient for large-scale searches
- **usr** / **usrcat** - 3D shape similarity

**For pharmacophore-based approaches:**

- **fcfp** - Functional group based
- **cats2D** / **cats3D** - Pharmacophore pair distributions
- **gobbi2D** - Explicit pharmacophore features

**For interpretability:**

- **desc2D** / **mordred** - Named descriptors
- **maccs** - Interpretable substructure keys
- **scaffoldkeys** - Scaffold-based features

### Model Dependencies

Some featurizers require optional dependencies:

- **DGL models** (gin-*, jtvae): `pip install "molfeat[dgl]"`
- **Graphormer**: `pip install "molfeat[graphormer]"`
- **Transformers** (ChemBERTa, ChemGPT, MolT5): `pip install "molfeat[transformer]"`
- **FCD**: `pip install "molfeat[fcd]"`
- **MAP4**: `pip install "molfeat[map4]"`
- **All dependencies**: `pip install "molfeat[all]"`

### Accessing All Available Models

```python
from molfeat.store.modelstore import ModelStore

store = ModelStore()
all_models = store.available_models

# Print all available featurizers
for model in all_models:
    print(f"{model.name}: {model.description}")

# Search for specific types
transformers = [m for m in all_models if "transformer" in m.tags]
gnn_models = [m for m in all_models if "gnn" in m.tags]
fingerprints = [m for m in all_models if "fingerprint" in m.tags]
```

## Performance Characteristics

### Computational Speed (relative)

**Fastest:**

- maccs
- ecfp
- rdkit fingerprints
- usr

**Medium:**

- desc2D
- cats2D
- Most fingerprints

**Slower:**

- mordred (1800+ descriptors)
- desc3D (requires conformer generation)
- 3D descriptors in general

**Slowest (first run):**

- Pretrained models (ChemBERTa, ChemGPT, GIN)
- Note: Subsequent runs benefit from caching

### Dimensionality

**Low (< 200 dims):**

- maccs (167)
- usr (12)
- usrcat (60)

**Medium (200-2000 dims):**

- desc2D (~200)
- ecfp (2048 default, configurable)
- map4 (1024 default)

**High (> 2000 dims):**

- mordred (1800+)
- Concatenated fingerprints
- Some transformer embeddings

**Variable:**

- Transformer models (typically 768-1024)
- GNN models (depends on architecture)

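To check a featurizer's output width directly, a small sketch (molfeat's `FPCalculator` supports `len()`, and any transformer's width can be read off a featurized batch):

```python
import numpy as np
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

# Fingerprint length without featurizing anything
print(len(FPCalculator("maccs")))  # 167
print(len(FPCalculator("ecfp")))   # 2048 by default

# For any transformer, inspect the shape of a featurized batch
transformer = MoleculeTransformer(featurizer="desc2D", dtype=np.float64)
print(np.asarray(transformer(["CCO"])).shape)
```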