# Available Featurizers in Molfeat
This document catalogs the featurizers available in molfeat, organized by category.
## Transformer-Based Language Models
Pre-trained transformer models for molecular embeddings using SMILES/SELFIES representations.
### RoBERTa-style Models
- **Roberta-Zinc480M-102M** - RoBERTa masked language model trained on ~480M SMILES strings from the ZINC database
- **ChemBERTa-77M-MLM** - Masked language model based on RoBERTa trained on 77M PubChem compounds
- **ChemBERTa-77M-MTR** - Multitask regression version trained on PubChem compounds
### GPT-style Autoregressive Models
- **GPT2-Zinc480M-87M** - GPT-2 autoregressive language model trained on ~480M SMILES from ZINC
- **ChemGPT-1.2B** - Large transformer (1.2B parameters) pretrained on PubChem10M
- **ChemGPT-19M** - Medium transformer (19M parameters) pretrained on PubChem10M
- **ChemGPT-4.7M** - Small transformer (4.7M parameters) pretrained on PubChem10M
### Specialized Transformer Models
- **MolT5** - Self-supervised framework for molecule captioning and text-based generation
## Graph Neural Networks (GNNs)
Pre-trained graph neural network models operating on molecular graph structures.
### GIN (Graph Isomorphism Network) Variants
All pre-trained on ChEMBL molecules, each combining supervised pre-training with a different self-supervised strategy:
- **gin-supervised-masking** - Attribute (node) masking objective
- **gin-supervised-infomax** - Graph-level mutual information maximization
- **gin-supervised-edgepred** - Edge prediction objective
- **gin-supervised-contextpred** - Context prediction objective
### Other Graph-Based Models
- **JTVAE_zinc_no_kl** - Junction-tree VAE for molecule generation (trained on ZINC)
- **Graphormer-pcqm4mv2** - Graph transformer pretrained on PCQM4Mv2 quantum chemistry dataset for HOMO-LUMO gap prediction
## Molecular Descriptors
Calculators for physico-chemical properties and molecular characteristics.
### 2D Descriptors
- **desc2D** / **rdkit2D** - 200+ RDKit 2D molecular descriptors including:
- Molecular weight, logP, TPSA
- H-bond donors/acceptors
- Rotatable bonds
- Ring counts and aromaticity
- Molecular complexity metrics
### 3D Descriptors
- **desc3D** / **rdkit3D** - RDKit 3D molecular descriptors (requires conformer generation)
- Inertial moments
- PMI (Principal Moments of Inertia) ratios
- Asphericity, eccentricity
- Radius of gyration
### Comprehensive Descriptor Sets
- **mordred** - Over 1800 molecular descriptors covering:
- Constitutional descriptors
- Topological indices
- Connectivity indices
- Information content
- 2D/3D autocorrelations
- WHIM descriptors
- GETAWAY descriptors
- And many more
### Electrotopological Descriptors
- **estate** - Electrotopological state (E-State) indices encoding:
- Atomic environment information
- Electronic and topological properties
- Heteroatom contributions
## Molecular Fingerprints
Binary or count-based fixed-length vectors representing molecular substructures.
### Circular Fingerprints (ECFP-style)
- **ecfp** / **ecfp:2** / **ecfp:4** / **ecfp:6** - Extended-connectivity fingerprints
- The numeric suffix denotes the diameter, so `ecfp:4` corresponds to radius 2 (classic ECFP4)
- Default: radius=3, 2048 bits
- Most popular for similarity searching
- **ecfp-count** - Count version of ECFP (non-binary)
- **fcfp** / **fcfp-count** - Functional-class circular fingerprints
- Similar to ECFP but uses functional groups
- Better for pharmacophore-based similarity
### Path-Based Fingerprints
- **rdkit** - RDKit topological fingerprints based on linear paths
- **pattern** - Pattern fingerprints (similar to MACCS but automated)
- **layered** - Layered fingerprints with multiple substructure layers
### Key-Based Fingerprints
- **maccs** - MACCS keys (166 structural keys; RDKit emits a 167-bit vector with bit 0 unused)
- Fixed set of predefined substructures
- Good for scaffold hopping
- Fast computation
- **avalon** - Avalon fingerprints
- Similar to MACCS but more features
- Optimized for similarity searching
### Atom-Pair Fingerprints
- **atompair** - Atom pair fingerprints
- Encodes pairs of atoms and the topological (bond-path) distance between them
- Sensitive to overall molecular size and shape
- **atompair-count** - Count version of atom pairs
### Topological Torsion Fingerprints
- **topological** - Topological torsion fingerprints
- Encodes sequences of 4 connected atoms
- Captures local topology
- **topological-count** - Count version of topological torsions
### MinHashed Fingerprints
- **map4** - MinHashed Atom-Pair fingerprint up to a diameter of 4 bonds
- Combines atom-pair and ECFP concepts
- Default: 1024 dimensions
- Fast and efficient for large datasets
- **secfp** - SMILES Extended Connectivity Fingerprint
- Operates directly on SMILES strings
- Captures both substructure and atom-pair information
### Extended Reduced Graph
- **erg** - Extended Reduced Graph
- Uses pharmacophoric points instead of atoms
- Reduces graph complexity while preserving key features
## Pharmacophore Descriptors
Features based on pharmacologically relevant functional groups and their spatial relationships.
### CATS (Chemically Advanced Template Search)
- **cats2D** - 2D CATS descriptors
- Pharmacophore point pair distributions
- Distance based on shortest path
- 21 pharmacophore point-pair types, binned by topological distance
- **cats3D** - 3D CATS descriptors
- Euclidean distance based
- Requires conformer generation
- **cats2D_pharm** / **cats3D_pharm** - Pharmacophore variants
### Gobbi Pharmacophores
- **gobbi2D** - 2D pharmacophore fingerprints
- Pharmacophore feature types include:
- Hydrophobic
- Aromatic
- H-bond acceptor
- H-bond donor
- Positive ionizable
- Negative ionizable
- Lumped hydrophobe
- Good for virtual screening
### Pmapper Pharmacophores
- **pmapper2D** - 2D pharmacophore signatures
- **pmapper3D** - 3D pharmacophore signatures
- High-dimensional pharmacophore descriptors
- Useful for QSAR and similarity searching
## Shape Descriptors
Descriptors capturing 3D molecular shape and electrostatic properties.
### USR (Ultrafast Shape Recognition)
- **usr** - Basic USR descriptors
- 12 dimensions encoding shape distribution
- Extremely fast computation
- **usrcat** - USR with pharmacophoric constraints
- 60 dimensions (12 per feature type)
- Combines shape and pharmacophore information
### Electrostatic Shape
- **electroshape** - ElectroShape descriptors
- Combines molecular shape, chirality, and electrostatics
- Useful for protein-ligand docking predictions
## Scaffold-Based Descriptors
Descriptors based on molecular scaffolds and core structures.
### Scaffold Keys
- **scaffoldkeys** - Scaffold key calculator
- 40+ scaffold-based properties
- Bioisosteric scaffold representation
- Captures core structural features
## Graph Featurizers for GNN Input
Atom and bond-level features for constructing graph representations for Graph Neural Networks.
### Atom-Level Features
- **atom-onehot** - One-hot encoded atom features
- **atom-default** - Default atom featurization including:
- Atomic number
- Degree, formal charge
- Hybridization
- Aromaticity
- Number of hydrogen atoms
### Bond-Level Features
- **bond-onehot** - One-hot encoded bond features
- **bond-default** - Default bond featurization including:
- Bond type (single, double, triple, aromatic)
- Conjugation
- Ring membership
- Stereochemistry
## Integrated Pretrained Model Collections
Molfeat integrates models from various sources:
### HuggingFace Models
Access to transformer models through HuggingFace hub:
- ChemBERTa variants
- ChemGPT variants
- MolT5
- Custom uploaded models
### DGL-LifeSci Models
Pre-trained GNN models from DGL-Life:
- GIN variants with different pre-training tasks
- AttentiveFP models
- MPNN models
### FCD (Fréchet ChemNet Distance)
- **fcd** - Pre-trained CNN for molecular generation evaluation
### Graphormer Models
- Graph transformers from Microsoft Research
- Pre-trained on quantum chemistry datasets
## Usage Notes
### Choosing a Featurizer
**For traditional ML (Random Forest, SVM, etc.):**
- Start with **ecfp** or **maccs** fingerprints
- Try **desc2D** for interpretable models
- Use **FeatConcat** to combine multiple fingerprints
**For deep learning:**
- Use **ChemBERTa** or **ChemGPT** for transformer embeddings
- Use **gin-supervised-*** for graph neural network embeddings
- Consider **Graphormer** for quantum property predictions
**For similarity searching:**
- **ecfp** - General purpose, most popular
- **maccs** - Fast, good for scaffold hopping
- **map4** - Efficient for large-scale searches
- **usr** / **usrcat** - 3D shape similarity
**For pharmacophore-based approaches:**
- **fcfp** - Functional group based
- **cats2D/3D** - Pharmacophore pair distributions
- **gobbi2D** - Explicit pharmacophore features
**For interpretability:**
- **desc2D** / **mordred** - Named descriptors
- **maccs** - Interpretable substructure keys
- **scaffoldkeys** - Scaffold-based features
### Model Dependencies
Some featurizers require optional dependencies:
- **DGL models** (gin-*, jtvae): `pip install "molfeat[dgl]"`
- **Graphormer**: `pip install "molfeat[graphormer]"`
- **Transformers** (ChemBERTa, ChemGPT, MolT5): `pip install "molfeat[transformer]"`
- **FCD**: `pip install "molfeat[fcd]"`
- **MAP4**: `pip install "molfeat[map4]"`
- **All dependencies**: `pip install "molfeat[all]"`
### Accessing All Available Models
```python
from molfeat.store.modelstore import ModelStore

store = ModelStore()
all_models = store.available_models

# Print all available featurizers
for model in all_models:
    print(f"{model.name}: {model.description}")

# Search for specific types
transformers = [m for m in all_models if "transformer" in m.tags]
gnn_models = [m for m in all_models if "gnn" in m.tags]
fingerprints = [m for m in all_models if "fingerprint" in m.tags]
```
## Performance Characteristics
### Computational Speed (relative)
**Fastest:**
- maccs
- ecfp
- rdkit fingerprints
- usr
**Medium:**
- desc2D
- cats2D
- Most fingerprints
**Slower:**
- mordred (1800+ descriptors)
- desc3D (requires conformer generation)
- 3D descriptors in general
**Slowest (first run):**
- Pretrained models (ChemBERTa, ChemGPT, GIN)
- Note: Subsequent runs benefit from caching
### Dimensionality
**Low (< 200 dims):**
- maccs (167)
- usr (12)
- usrcat (60)
**Medium (200-2000 dims):**
- desc2D (~200)
- ecfp (2048 default, configurable)
- map4 (1024 default)
**High (> 2000 dims):**
- mordred (1800+)
- Concatenated fingerprints
- Some transformer embeddings
**Variable:**
- Transformer models (typically 768-1024)
- GNN models (depends on architecture)