Initial commit
This commit is contained in:
333
skills/molfeat/references/available_featurizers.md
Normal file
333
skills/molfeat/references/available_featurizers.md
Normal file
@@ -0,0 +1,333 @@
|
||||
# Available Featurizers in Molfeat
|
||||
|
||||
This document provides a comprehensive catalog of all featurizers available in molfeat, organized by category.
|
||||
|
||||
## Transformer-Based Language Models
|
||||
|
||||
Pre-trained transformer models for molecular embeddings using SMILES/SELFIES representations.
|
||||
|
||||
### RoBERTa-style Models
|
||||
- **Roberta-Zinc480M-102M** - RoBERTa masked language model trained on ~480M SMILES strings from ZINC database
|
||||
- **ChemBERTa-77M-MLM** - Masked language model based on RoBERTa trained on 77M PubChem compounds
|
||||
- **ChemBERTa-77M-MTR** - Multitask regression version trained on PubChem compounds
|
||||
|
||||
### GPT-style Autoregressive Models
|
||||
- **GPT2-Zinc480M-87M** - GPT-2 autoregressive language model trained on ~480M SMILES from ZINC
|
||||
- **ChemGPT-1.2B** - Large transformer (1.2B parameters) pretrained on PubChem10M
|
||||
- **ChemGPT-19M** - Medium transformer (19M parameters) pretrained on PubChem10M
|
||||
- **ChemGPT-4.7M** - Small transformer (4.7M parameters) pretrained on PubChem10M
|
||||
|
||||
### Specialized Transformer Models
|
||||
- **MolT5** - Self-supervised framework for molecule captioning and text-based generation
|
||||
|
||||
## Graph Neural Networks (GNNs)
|
||||
|
||||
Pre-trained graph neural network models operating on molecular graph structures.
|
||||
|
||||
### GIN (Graph Isomorphism Network) Variants
|
||||
All pre-trained on ChEMBL molecules with different objectives:
|
||||
- **gin-supervised-masking** - Supervised with node masking objective
|
||||
- **gin-supervised-infomax** - Supervised with graph-level mutual information maximization
|
||||
- **gin-supervised-edgepred** - Supervised with edge prediction objective
|
||||
- **gin-supervised-contextpred** - Supervised with context prediction objective
|
||||
|
||||
### Other Graph-Based Models
|
||||
- **JTVAE_zinc_no_kl** - Junction-tree VAE for molecule generation (trained on ZINC)
|
||||
- **Graphormer-pcqm4mv2** - Graph transformer pretrained on PCQM4Mv2 quantum chemistry dataset for HOMO-LUMO gap prediction
|
||||
|
||||
## Molecular Descriptors
|
||||
|
||||
Calculators for physico-chemical properties and molecular characteristics.
|
||||
|
||||
### 2D Descriptors
|
||||
- **desc2D** / **rdkit2D** - 200+ RDKit 2D molecular descriptors including:
|
||||
- Molecular weight, logP, TPSA
|
||||
- H-bond donors/acceptors
|
||||
- Rotatable bonds
|
||||
- Ring counts and aromaticity
|
||||
- Molecular complexity metrics
|
||||
|
||||
### 3D Descriptors
|
||||
- **desc3D** / **rdkit3D** - RDKit 3D molecular descriptors (requires conformer generation)
|
||||
- Inertial moments
|
||||
- PMI (Principal Moments of Inertia) ratios
|
||||
- Asphericity, eccentricity
|
||||
- Radius of gyration
|
||||
|
||||
### Comprehensive Descriptor Sets
|
||||
- **mordred** - Over 1800 molecular descriptors covering:
|
||||
- Constitutional descriptors
|
||||
- Topological indices
|
||||
- Connectivity indices
|
||||
- Information content
|
||||
- 2D/3D autocorrelations
|
||||
- WHIM descriptors
|
||||
- GETAWAY descriptors
|
||||
- And many more
|
||||
|
||||
### Electrotopological Descriptors
|
||||
- **estate** - Electrotopological state (E-State) indices encoding:
|
||||
- Atomic environment information
|
||||
- Electronic and topological properties
|
||||
- Heteroatom contributions
|
||||
|
||||
## Molecular Fingerprints
|
||||
|
||||
Binary or count-based fixed-length vectors representing molecular substructures.
|
||||
|
||||
### Circular Fingerprints (ECFP-style)
|
||||
- **ecfp** / **ecfp:2** / **ecfp:4** / **ecfp:6** - Extended-connectivity fingerprints
|
||||
- Radius variants (2, 4, 6 correspond to diameter)
|
||||
- Default: radius=3, 2048 bits
|
||||
- Most popular for similarity searching
|
||||
- **ecfp-count** - Count version of ECFP (non-binary)
|
||||
- **fcfp** / **fcfp-count** - Functional-class circular fingerprints
|
||||
- Similar to ECFP but uses functional groups
|
||||
- Better for pharmacophore-based similarity
|
||||
|
||||
### Path-Based Fingerprints
|
||||
- **rdkit** - RDKit topological fingerprints based on linear paths
|
||||
- **pattern** - Pattern fingerprints (similar to MACCS but automated)
|
||||
- **layered** - Layered fingerprints with multiple substructure layers
|
||||
|
||||
### Key-Based Fingerprints
|
||||
- **maccs** - MACCS keys (166-bit structural keys)
|
||||
- Fixed set of predefined substructures
|
||||
- Good for scaffold hopping
|
||||
- Fast computation
|
||||
- **avalon** - Avalon fingerprints
|
||||
- Similar to MACCS but more features
|
||||
- Optimized for similarity searching
|
||||
|
||||
### Atom-Pair Fingerprints
|
||||
- **atompair** - Atom pair fingerprints
|
||||
- Encodes pairs of atoms and distance between them
|
||||
- Good for 3D similarity
|
||||
- **atompair-count** - Count version of atom pairs
|
||||
|
||||
### Topological Torsion Fingerprints
|
||||
- **topological** - Topological torsion fingerprints
|
||||
- Encodes sequences of 4 connected atoms
|
||||
- Captures local topology
|
||||
- **topological-count** - Count version of topological torsions
|
||||
|
||||
### MinHashed Fingerprints
|
||||
- **map4** - MinHashed Atom-Pair fingerprint up to 4 bonds
|
||||
- Combines atom-pair and ECFP concepts
|
||||
- Default: 1024 dimensions
|
||||
- Fast and efficient for large datasets
|
||||
- **secfp** - SMILES Extended Connectivity Fingerprint
|
||||
- Operates directly on SMILES strings
|
||||
- Captures both substructure and atom-pair information
|
||||
|
||||
### Extended Reduced Graph
|
||||
- **erg** - Extended Reduced Graph
|
||||
- Uses pharmacophoric points instead of atoms
|
||||
- Reduces graph complexity while preserving key features
|
||||
|
||||
## Pharmacophore Descriptors
|
||||
|
||||
Features based on pharmacologically relevant functional groups and their spatial relationships.
|
||||
|
||||
### CATS (Chemically Advanced Template Search)
|
||||
- **cats2D** - 2D CATS descriptors
|
||||
- Pharmacophore point pair distributions
|
||||
- Distance based on shortest path
|
||||
- 21 descriptors by default
|
||||
- **cats3D** - 3D CATS descriptors
|
||||
- Euclidean distance based
|
||||
- Requires conformer generation
|
||||
- **cats2D_pharm** / **cats3D_pharm** - Pharmacophore variants
|
||||
|
||||
### Gobbi Pharmacophores
|
||||
- **gobbi2D** - 2D pharmacophore fingerprints
|
||||
- 8 pharmacophore feature types:
|
||||
- Hydrophobic
|
||||
- Aromatic
|
||||
- H-bond acceptor
|
||||
- H-bond donor
|
||||
- Positive ionizable
|
||||
- Negative ionizable
|
||||
- Lumped hydrophobe
|
||||
- Good for virtual screening
|
||||
|
||||
### Pmapper Pharmacophores
|
||||
- **pmapper2D** - 2D pharmacophore signatures
|
||||
- **pmapper3D** - 3D pharmacophore signatures
|
||||
- High-dimensional pharmacophore descriptors
|
||||
- Useful for QSAR and similarity searching
|
||||
|
||||
## Shape Descriptors
|
||||
|
||||
Descriptors capturing 3D molecular shape and electrostatic properties.
|
||||
|
||||
### USR (Ultrafast Shape Recognition)
|
||||
- **usr** - Basic USR descriptors
|
||||
- 12 dimensions encoding shape distribution
|
||||
- Extremely fast computation
|
||||
- **usrcat** - USR with pharmacophoric constraints
|
||||
- 60 dimensions (12 per feature type)
|
||||
- Combines shape and pharmacophore information
|
||||
|
||||
### Electrostatic Shape
|
||||
- **electroshape** - ElectroShape descriptors
|
||||
- Combines molecular shape, chirality, and electrostatics
|
||||
- Useful for protein-ligand docking predictions
|
||||
|
||||
## Scaffold-Based Descriptors
|
||||
|
||||
Descriptors based on molecular scaffolds and core structures.
|
||||
|
||||
### Scaffold Keys
|
||||
- **scaffoldkeys** - Scaffold key calculator
|
||||
- 40+ scaffold-based properties
|
||||
- Bioisosteric scaffold representation
|
||||
- Captures core structural features
|
||||
|
||||
## Graph Featurizers for GNN Input
|
||||
|
||||
Atom and bond-level features for constructing graph representations for Graph Neural Networks.
|
||||
|
||||
### Atom-Level Features
|
||||
- **atom-onehot** - One-hot encoded atom features
|
||||
- **atom-default** - Default atom featurization including:
|
||||
- Atomic number
|
||||
- Degree, formal charge
|
||||
- Hybridization
|
||||
- Aromaticity
|
||||
- Number of hydrogen atoms
|
||||
|
||||
### Bond-Level Features
|
||||
- **bond-onehot** - One-hot encoded bond features
|
||||
- **bond-default** - Default bond featurization including:
|
||||
- Bond type (single, double, triple, aromatic)
|
||||
- Conjugation
|
||||
- Ring membership
|
||||
- Stereochemistry
|
||||
|
||||
## Integrated Pretrained Model Collections
|
||||
|
||||
Molfeat integrates models from various sources:
|
||||
|
||||
### HuggingFace Models
|
||||
Access to transformer models through HuggingFace hub:
|
||||
- ChemBERTa variants
|
||||
- ChemGPT variants
|
||||
- MolT5
|
||||
- Custom uploaded models
|
||||
|
||||
### DGL-LifeSci Models
|
||||
Pre-trained GNN models from DGL-Life:
|
||||
- GIN variants with different pre-training tasks
|
||||
- AttentiveFP models
|
||||
- MPNN models
|
||||
|
||||
### FCD (Fréchet ChemNet Distance)
|
||||
- **fcd** - Pre-trained CNN for molecular generation evaluation
|
||||
|
||||
### Graphormer Models
|
||||
- Graph transformers from Microsoft Research
|
||||
- Pre-trained on quantum chemistry datasets
|
||||
|
||||
## Usage Notes
|
||||
|
||||
### Choosing a Featurizer
|
||||
|
||||
**For traditional ML (Random Forest, SVM, etc.):**
|
||||
- Start with **ecfp** or **maccs** fingerprints
|
||||
- Try **desc2D** for interpretable models
|
||||
- Use **FeatConcat** to combine multiple fingerprints
|
||||
|
||||
**For deep learning:**
|
||||
- Use **ChemBERTa** or **ChemGPT** for transformer embeddings
|
||||
- Use **gin-supervised-*** for graph neural network embeddings
|
||||
- Consider **Graphormer** for quantum property predictions
|
||||
|
||||
**For similarity searching:**
|
||||
- **ecfp** - General purpose, most popular
|
||||
- **maccs** - Fast, good for scaffold hopping
|
||||
- **map4** - Efficient for large-scale searches
|
||||
- **usr** / **usrcat** - 3D shape similarity
|
||||
|
||||
**For pharmacophore-based approaches:**
|
||||
- **fcfp** - Functional group based
|
||||
- **cats2D/3D** - Pharmacophore pair distributions
|
||||
- **gobbi2D** - Explicit pharmacophore features
|
||||
|
||||
**For interpretability:**
|
||||
- **desc2D** / **mordred** - Named descriptors
|
||||
- **maccs** - Interpretable substructure keys
|
||||
- **scaffoldkeys** - Scaffold-based features
|
||||
|
||||
### Model Dependencies
|
||||
|
||||
Some featurizers require optional dependencies:
|
||||
|
||||
- **DGL models** (gin-*, jtvae): `pip install "molfeat[dgl]"`
|
||||
- **Graphormer**: `pip install "molfeat[graphormer]"`
|
||||
- **Transformers** (ChemBERTa, ChemGPT, MolT5): `pip install "molfeat[transformer]"`
|
||||
- **FCD**: `pip install "molfeat[fcd]"`
|
||||
- **MAP4**: `pip install "molfeat[map4]"`
|
||||
- **All dependencies**: `pip install "molfeat[all]"`
|
||||
|
||||
### Accessing All Available Models
|
||||
|
||||
```python
|
||||
from molfeat.store.modelstore import ModelStore
|
||||
|
||||
store = ModelStore()
|
||||
all_models = store.available_models
|
||||
|
||||
# Print all available featurizers
|
||||
for model in all_models:
|
||||
print(f"{model.name}: {model.description}")
|
||||
|
||||
# Search for specific types
|
||||
transformers = [m for m in all_models if "transformer" in m.tags]
|
||||
gnn_models = [m for m in all_models if "gnn" in m.tags]
|
||||
fingerprints = [m for m in all_models if "fingerprint" in m.tags]
|
||||
```
|
||||
|
||||
## Performance Characteristics
|
||||
|
||||
### Computational Speed (relative)
|
||||
**Fastest:**
|
||||
- maccs
|
||||
- ecfp
|
||||
- rdkit fingerprints
|
||||
- usr
|
||||
|
||||
**Medium:**
|
||||
- desc2D
|
||||
- cats2D
|
||||
- Most fingerprints
|
||||
|
||||
**Slower:**
|
||||
- mordred (1800+ descriptors)
|
||||
- desc3D (requires conformer generation)
|
||||
- 3D descriptors in general
|
||||
|
||||
**Slowest (first run):**
|
||||
- Pretrained models (ChemBERTa, ChemGPT, GIN)
|
||||
- Note: Subsequent runs benefit from caching
|
||||
|
||||
### Dimensionality
|
||||
|
||||
**Low (< 200 dims):**
|
||||
- maccs (167)
|
||||
- usr (12)
|
||||
- usrcat (60)
|
||||
|
||||
**Medium (200-2000 dims):**
|
||||
- desc2D (~200)
|
||||
- ecfp (2048 default, configurable)
|
||||
- map4 (1024 default)
|
||||
|
||||
**High (> 2000 dims):**
|
||||
- mordred (1800+)
|
||||
- Concatenated fingerprints
|
||||
- Some transformer embeddings
|
||||
|
||||
**Variable:**
|
||||
- Transformer models (typically 768-1024)
|
||||
- GNN models (depends on architecture)
|
||||
Reference in New Issue
Block a user