Available Featurizers in Molfeat
This document provides a comprehensive catalog of all featurizers available in molfeat, organized by category.
Transformer-Based Language Models
Pre-trained transformer models for molecular embeddings using SMILES/SELFIES representations.
RoBERTa-style Models
- Roberta-Zinc480M-102M - RoBERTa masked language model trained on ~480M SMILES strings from the ZINC database
- ChemBERTa-77M-MLM - Masked language model based on RoBERTa trained on 77M PubChem compounds
- ChemBERTa-77M-MTR - Multitask regression version trained on PubChem compounds
GPT-style Autoregressive Models
- GPT2-Zinc480M-87M - GPT-2 autoregressive language model trained on ~480M SMILES from ZINC
- ChemGPT-1.2B - Large transformer (1.2B parameters) pretrained on PubChem10M
- ChemGPT-19M - Medium transformer (19M parameters) pretrained on PubChem10M
- ChemGPT-4.7M - Small transformer (4.7M parameters) pretrained on PubChem10M
Specialized Transformer Models
- MolT5 - Self-supervised framework for molecule captioning and text-based generation
Graph Neural Networks (GNNs)
Pre-trained graph neural network models operating on molecular graph structures.
GIN (Graph Isomorphism Network) Variants
All pre-trained on ChEMBL molecules with different objectives:
- gin-supervised-masking - Supervised with node masking objective
- gin-supervised-infomax - Supervised with graph-level mutual information maximization
- gin-supervised-edgepred - Supervised with edge prediction objective
- gin-supervised-contextpred - Supervised with context prediction objective
Other Graph-Based Models
- JTVAE_zinc_no_kl - Junction-tree VAE for molecule generation (trained on ZINC)
- Graphormer-pcqm4mv2 - Graph transformer pretrained on PCQM4Mv2 quantum chemistry dataset for HOMO-LUMO gap prediction
Molecular Descriptors
Calculators for physico-chemical properties and molecular characteristics.
2D Descriptors
- desc2D / rdkit2D - 200+ RDKit 2D molecular descriptors including:
- Molecular weight, logP, TPSA
- H-bond donors/acceptors
- Rotatable bonds
- Ring counts and aromaticity
- Molecular complexity metrics
3D Descriptors
- desc3D / rdkit3D - RDKit 3D molecular descriptors (requires conformer generation)
- Inertial moments
- PMI (Principal Moments of Inertia) ratios
- Asphericity, eccentricity
- Radius of gyration
Comprehensive Descriptor Sets
- mordred - Over 1800 molecular descriptors covering:
- Constitutional descriptors
- Topological indices
- Connectivity indices
- Information content
- 2D/3D autocorrelations
- WHIM descriptors
- GETAWAY descriptors
- And many more
Electrotopological Descriptors
- estate - Electrotopological state (E-State) indices encoding:
- Atomic environment information
- Electronic and topological properties
- Heteroatom contributions
Molecular Fingerprints
Binary or count-based fixed-length vectors representing molecular substructures.
Circular Fingerprints (ECFP-style)
- ecfp / ecfp:2 / ecfp:4 / ecfp:6 - Extended-connectivity fingerprints
- Numeric variants (2, 4, 6) denote diameter, following ECFP naming conventions (e.g., ecfp:4 corresponds to radius 2)
- Default: radius=3, 2048 bits
- Most popular for similarity searching
- ecfp-count - Count version of ECFP (non-binary)
- fcfp / fcfp-count - Functional-class circular fingerprints
- Similar to ECFP but uses functional groups
- Better for pharmacophore-based similarity
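To make the folding step concrete, here is a minimal, dependency-free sketch of how circular fingerprints collapse hashed substructure identifiers into a fixed-length vector. The integer "environment hashes" below are made up; in practice a library like RDKit computes them from each atom's circular neighborhood.

```python
# Toy sketch of ECFP-style folding: hashed substructure identifiers
# are mapped into a fixed-length vector via modular hashing.

def fold_to_bits(env_hashes, n_bits=2048):
    """Binary fingerprint: set the bit for each environment hash."""
    fp = [0] * n_bits
    for h in env_hashes:
        fp[h % n_bits] = 1
    return fp

def fold_to_counts(env_hashes, n_bits=2048):
    """Count variant (as in ecfp-count): increment instead of set."""
    fp = [0] * n_bits
    for h in env_hashes:
        fp[h % n_bits] += 1
    return fp

# Hypothetical environment hashes for a small molecule; note the
# duplicate, which the count variant preserves but the binary one loses.
hashes = [982374, 1, 55123, 982374, 2048]
bits = fold_to_bits(hashes, n_bits=16)
counts = fold_to_counts(hashes, n_bits=16)
```

This also illustrates why count fingerprints can be more informative than binary ones: repeated substructures survive folding as counts greater than one.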
Path-Based Fingerprints
- rdkit - RDKit topological fingerprints based on linear paths
- pattern - Pattern fingerprints optimized for substructure screening
- layered - Layered fingerprints with multiple substructure layers
Key-Based Fingerprints
- maccs - MACCS keys (166 predefined structural keys; RDKit returns a 167-bit vector with bit 0 unused)
- Fixed set of predefined substructures
- Good for scaffold hopping
- Fast computation
- avalon - Avalon fingerprints
- Hashed fingerprints covering a broader feature set than MACCS
- Optimized for similarity searching
Atom-Pair Fingerprints
- atompair - Atom pair fingerprints
- Encodes pairs of atoms and the topological (shortest-path) distance between them
- Captures longer-range relationships than circular fingerprints
- atompair-count - Count version of atom pairs
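The atom-pair idea can be sketched in a few lines of plain Python. The 4-atom "molecule" below is hypothetical, given as an adjacency list with element symbols as atom types; a real implementation would derive richer atom invariants and hash the resulting triples into a fixed-length vector.

```python
# Sketch of atom-pair features: each feature is a triple
# (type_i, type_j, shortest-path distance between atoms i and j).
from collections import deque
from itertools import combinations

def shortest_path_lengths(adj, start):
    """BFS distances from `start` over an adjacency-list graph."""
    dist = {start: 0}
    q = deque([start])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def atom_pair_features(adj, types):
    feats = set()
    for i, j in combinations(range(len(types)), 2):
        d = shortest_path_lengths(adj, i)[j]
        a, b = sorted((types[i], types[j]))  # order-independent pair
        feats.add((a, b, d))
    return feats

# Hypothetical fragment: C-C-O chain with an N branch on the middle carbon
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
types = ["C", "C", "O", "N"]
feats = atom_pair_features(adj, types)
```

Because the distance is a graph shortest path rather than a Euclidean distance, these are 2D descriptors despite encoding "spatial" relationships.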
Topological Torsion Fingerprints
- topological - Topological torsion fingerprints
- Encodes sequences of 4 connected atoms
- Captures local topology
- topological-count - Count version of topological torsions
MinHashed Fingerprints
- map4 - MinHashed Atom-Pair fingerprint up to 4 bonds
- Combines atom-pair and ECFP concepts
- Default: 1024 dimensions
- Fast and efficient for large datasets
- secfp - SMILES Extended Connectivity Fingerprint
- Operates directly on SMILES strings
- Captures both substructure and atom-pair information
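The MinHashing step shared by map4 and secfp can be illustrated without any chemistry: a variable-size set of substructure hashes is compressed into a fixed-length signature whose component-wise agreement estimates Jaccard similarity. The hash-family parameters below are toy values, not those of any real implementation.

```python
# Minimal MinHash sketch: compress a set of substructure hashes into a
# fixed-length signature. Similar sets yield similar signatures.

def minhash_signature(feature_hashes, n_perm=8, prime=2147483647):
    sig = []
    for i in range(n_perm):
        a, b = 2 * i + 1, 3 * i + 7  # toy hash-family parameters
        sig.append(min((a * h + b) % prime for h in feature_hashes))
    return sig

# Two sets sharing 3 of their 4 elements
s1 = {101, 202, 303, 404}
s2 = {101, 202, 303, 505}
sig1 = minhash_signature(s1)
sig2 = minhash_signature(s2)

# Fraction of matching signature slots approximates Jaccard similarity
overlap = sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)
```

This is why MinHashed fingerprints scale well to very large libraries: the signature length is fixed regardless of how many substructures a molecule contains.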
Extended Reduced Graph
- erg - Extended Reduced Graph
- Uses pharmacophoric points instead of atoms
- Reduces graph complexity while preserving key features
Pharmacophore Descriptors
Features based on pharmacologically relevant functional groups and their spatial relationships.
CATS (Chemically Advanced Template Search)
- cats2D - 2D CATS descriptors
- Pharmacophore point pair distributions
- Distance based on shortest path
- 21 descriptors by default
- cats3D - 3D CATS descriptors
- Euclidean distance based
- Requires conformer generation
- cats2D_pharm / cats3D_pharm - Pharmacophore variants
Gobbi Pharmacophores
- gobbi2D - 2D pharmacophore fingerprints
- Pharmacophore feature types include:
- Hydrophobic
- Aromatic
- H-bond acceptor
- H-bond donor
- Positive ionizable
- Negative ionizable
- Lumped hydrophobe
- Good for virtual screening
Pmapper Pharmacophores
- pmapper2D - 2D pharmacophore signatures
- pmapper3D - 3D pharmacophore signatures
- High-dimensional pharmacophore descriptors
- Useful for QSAR and similarity searching
Shape Descriptors
Descriptors capturing 3D molecular shape and electrostatic properties.
USR (Ultrafast Shape Recognition)
- usr - Basic USR descriptors
- 12 dimensions encoding shape distribution
- Extremely fast computation
- usrcat - USR with pharmacophoric constraints
- 60 dimensions (12 per feature type)
- Combines shape and pharmacophore information
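The 12 USR descriptors can be computed from scratch with only the standard library, which helps explain why the method is so fast: for each of four reference points (the centroid, the atom closest to it, the atom farthest from it, and the atom farthest from that one), take the mean, standard deviation, and cube-root skew of the distances to all atoms. The coordinates below are hypothetical.

```python
# Sketch of the 12 USR shape descriptors: 4 reference points x 3 moments.
import math

def _dist(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def _moments(dists):
    """Mean, standard deviation, and signed cube root of the third moment."""
    n = len(dists)
    mu = sum(dists) / n
    var = sum((d - mu) ** 2 for d in dists) / n
    sk = sum((d - mu) ** 3 for d in dists) / n
    return [mu, math.sqrt(var), math.copysign(abs(sk) ** (1 / 3), sk)]

def usr_descriptors(coords):
    n = len(coords)
    ctd = [sum(c[k] for c in coords) / n for k in range(3)]  # centroid
    cst = min(coords, key=lambda c: _dist(c, ctd))  # closest to centroid
    fct = max(coords, key=lambda c: _dist(c, ctd))  # farthest from centroid
    ftf = max(coords, key=lambda c: _dist(c, fct))  # farthest from fct
    desc = []
    for ref in (ctd, cst, fct, ftf):
        desc.extend(_moments([_dist(c, ref) for c in coords]))
    return desc

# Hypothetical 3D coordinates for a 4-atom fragment
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.2, 1.1, 0.0), (3.0, 1.1, 0.9)]
d = usr_descriptors(coords)
```

usrcat extends the same recipe by computing these 12 values once for all atoms and once per pharmacophore feature subset, giving 60 dimensions.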
Electrostatic Shape
- electroshape - ElectroShape descriptors
- Combines molecular shape, chirality, and electrostatics
- Useful for protein-ligand docking predictions
Scaffold-Based Descriptors
Descriptors based on molecular scaffolds and core structures.
Scaffold Keys
- scaffoldkeys - Scaffold key calculator
- 40+ scaffold-based properties
- Bioisosteric scaffold representation
- Captures core structural features
Graph Featurizers for GNN Input
Atom and bond-level features for constructing graph representations for Graph Neural Networks.
Atom-Level Features
- atom-onehot - One-hot encoded atom features
- atom-default - Default atom featurization including:
- Atomic number
- Degree, formal charge
- Hybridization
- Aromaticity
- Number of hydrogen atoms
Bond-Level Features
- bond-onehot - One-hot encoded bond features
- bond-default - Default bond featurization including:
- Bond type (single, double, triple, aromatic)
- Conjugation
- Ring membership
- Stereochemistry
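A minimal sketch of what a one-hot atom featurizer produces, using illustrative vocabularies (real featurizers cover many more elements and properties, and bond features follow the same pattern):

```python
# Toy one-hot atom featurization for GNN input. Vocabularies are
# illustrative only; unknown values map to the final "other" bucket.

ATOM_TYPES = ["C", "N", "O", "F", "other"]
HYBRIDIZATIONS = ["sp", "sp2", "sp3", "other"]

def one_hot(value, choices):
    """One-hot encode `value`, mapping unknowns to the last bucket."""
    vec = [0] * len(choices)
    idx = choices.index(value) if value in choices else len(choices) - 1
    vec[idx] = 1
    return vec

def atom_features(symbol, degree, formal_charge, hybridization, aromatic):
    """Concatenate one-hot and scalar features into a single vector."""
    return (
        one_hot(symbol, ATOM_TYPES)
        + one_hot(hybridization, HYBRIDIZATIONS)
        + [degree, formal_charge, int(aromatic)]
    )

# Aromatic carbon with two heavy-atom neighbors and no formal charge
feat = atom_features("C", degree=2, formal_charge=0,
                     hybridization="sp2", aromatic=True)
```

Each atom's vector becomes a node feature in the molecular graph; stacking them row-wise gives the node feature matrix a GNN consumes.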
Integrated Pretrained Model Collections
Molfeat integrates models from various sources:
HuggingFace Models
Access to transformer models through HuggingFace hub:
- ChemBERTa variants
- ChemGPT variants
- MolT5
- Custom uploaded models
DGL-LifeSci Models
Pre-trained GNN models from DGL-Life:
- GIN variants with different pre-training tasks
- AttentiveFP models
- MPNN models
FCD (Fréchet ChemNet Distance)
- fcd - Pre-trained CNN for molecular generation evaluation
Graphormer Models
- Graph transformers from Microsoft Research
- Pre-trained on quantum chemistry datasets
Usage Notes
Choosing a Featurizer
For traditional ML (Random Forest, SVM, etc.):
- Start with ecfp or maccs fingerprints
- Try desc2D for interpretable models
- Use FeatConcat to combine multiple fingerprints
For deep learning:
- Use ChemBERTa or ChemGPT for transformer embeddings
- Use gin-supervised-* for graph neural network embeddings
- Consider Graphormer for quantum property predictions
For similarity searching:
- ecfp - General purpose, most popular
- maccs - Fast, good for scaffold hopping
- map4 - Efficient for large-scale searches
- usr / usrcat - 3D shape similarity
For pharmacophore-based approaches:
- fcfp - Functional group based
- cats2D/3D - Pharmacophore pair distributions
- gobbi2D - Explicit pharmacophore features
For interpretability:
- desc2D / mordred - Named descriptors
- maccs - Interpretable substructure keys
- scaffoldkeys - Scaffold-based features
Model Dependencies
Some featurizers require optional dependencies:
- DGL models (gin-*, jtvae): pip install "molfeat[dgl]"
- Graphormer: pip install "molfeat[graphormer]"
- Transformers (ChemBERTa, ChemGPT, MolT5): pip install "molfeat[transformer]"
- FCD: pip install "molfeat[fcd]"
- MAP4: pip install "molfeat[map4]"
- All dependencies: pip install "molfeat[all]"
Accessing All Available Models
from molfeat.store.modelstore import ModelStore

store = ModelStore()
all_models = store.available_models

# Print all available featurizers
for model in all_models:
    print(f"{model.name}: {model.description}")

# Search for specific types
transformers = [m for m in all_models if "transformer" in m.tags]
gnn_models = [m for m in all_models if "gnn" in m.tags]
fingerprints = [m for m in all_models if "fingerprint" in m.tags]
Performance Characteristics
Computational Speed (relative)
Fastest:
- maccs
- ecfp
- rdkit fingerprints
- usr
Medium:
- desc2D
- cats2D
- Most fingerprints
Slower:
- mordred (1800+ descriptors)
- desc3D (requires conformer generation)
- 3D descriptors in general
Slowest (first run):
- Pretrained models (ChemBERTa, ChemGPT, GIN)
- Note: Subsequent runs benefit from caching
Dimensionality
Low (< 200 dims):
- maccs (167)
- usr (12)
- usrcat (60)
Medium (200-2000 dims):
- desc2D (~200)
- ecfp (2048 default, configurable)
- map4 (1024 default)
High (> 2000 dims):
- mordred (1800+)
- Concatenated fingerprints
- Some transformer embeddings
Variable:
- Transformer models (typically 768-1024)
- GNN models (depends on architecture)