# Available Featurizers in Molfeat
This document catalogs the featurizers available in molfeat, organized by category.
## Transformer-Based Language Models
Pre-trained transformer models for molecular embeddings using SMILES/SELFIES representations.
### RoBERTa-style Models
- **Roberta-Zinc480M-102M** - RoBERTa masked language model trained on ~480M SMILES strings from the ZINC database
- **ChemBERTa-77M-MLM** - Masked language model based on RoBERTa trained on 77M PubChem compounds
- **ChemBERTa-77M-MTR** - Multitask regression version trained on PubChem compounds
### GPT-style Autoregressive Models
- **GPT2-Zinc480M-87M** - GPT-2 autoregressive language model trained on ~480M SMILES from ZINC
- **ChemGPT-1.2B** - Large transformer (1.2B parameters) pretrained on PubChem10M
- **ChemGPT-19M** - Medium transformer (19M parameters) pretrained on PubChem10M
- **ChemGPT-4.7M** - Small transformer (4.7M parameters) pretrained on PubChem10M
### Specialized Transformer Models
- **MolT5** - Self-supervised framework for molecule captioning and text-based generation
## Graph Neural Networks (GNNs)
Pre-trained graph neural network models operating on molecular graph structures.
### GIN (Graph Isomorphism Network) Variants
All pre-trained on ChEMBL molecules, each combining supervised pre-training with a different self-supervised strategy:
- **gin-supervised-masking** - Attribute (node) masking objective
- **gin-supervised-infomax** - Graph-level mutual information maximization
- **gin-supervised-edgepred** - Edge prediction objective
- **gin-supervised-contextpred** - Context prediction objective
### Other Graph-Based Models
- **JTVAE_zinc_no_kl** - Junction-tree VAE for molecule generation (trained on ZINC)
- **Graphormer-pcqm4mv2** - Graph transformer pretrained on PCQM4Mv2 quantum chemistry dataset for HOMO-LUMO gap prediction
## Molecular Descriptors
Calculators for physico-chemical properties and molecular characteristics.
### 2D Descriptors
- **desc2D** / **rdkit2D** - 200+ RDKit 2D molecular descriptors including:
- Molecular weight, logP, TPSA
- H-bond donors/acceptors
- Rotatable bonds
- Ring counts and aromaticity
- Molecular complexity metrics
### 3D Descriptors
- **desc3D** / **rdkit3D** - RDKit 3D molecular descriptors (requires conformer generation)
- Inertial moments
- PMI (Principal Moments of Inertia) ratios
- Asphericity, eccentricity
- Radius of gyration
### Comprehensive Descriptor Sets
- **mordred** - Over 1800 molecular descriptors covering:
- Constitutional descriptors
- Topological indices
- Connectivity indices
- Information content
- 2D/3D autocorrelations
- WHIM descriptors
- GETAWAY descriptors
- And many more
### Electrotopological Descriptors
- **estate** - Electrotopological state (E-State) indices encoding:
- Atomic environment information
- Electronic and topological properties
- Heteroatom contributions
## Molecular Fingerprints
Binary or count-based fixed-length vectors representing molecular substructures.
### Circular Fingerprints (ECFP-style)
- **ecfp** / **ecfp:2** / **ecfp:4** / **ecfp:6** - Extended-connectivity fingerprints
- The numeric suffix denotes the diameter, so `ecfp:4` corresponds to radius 2 (classic ECFP4)
- Default: radius=3, 2048 bits
- Most popular for similarity searching
- **ecfp-count** - Count version of ECFP (non-binary)
- **fcfp** / **fcfp-count** - Functional-class circular fingerprints
- Similar to ECFP but uses functional groups
- Better for pharmacophore-based similarity
### Path-Based Fingerprints
- **rdkit** - RDKit topological fingerprints based on linear paths
- **pattern** - Pattern fingerprints (similar to MACCS but automated)
- **layered** - Layered fingerprints with multiple substructure layers
### Key-Based Fingerprints
- **maccs** - MACCS keys (166 structural keys; RDKit emits a 167-bit vector with bit 0 unused)
- Fixed set of predefined substructures
- Good for scaffold hopping
- Fast computation
- **avalon** - Avalon fingerprints
- Similar to MACCS but more features
- Optimized for similarity searching
### Atom-Pair Fingerprints
- **atompair** - Atom pair fingerprints
- Encodes pairs of atoms and the topological (bond-path) distance between them
- Sensitive to overall molecular size and shape
- **atompair-count** - Count version of atom pairs
### Topological Torsion Fingerprints
- **topological** - Topological torsion fingerprints
- Encodes sequences of 4 connected atoms
- Captures local topology
- **topological-count** - Count version of topological torsions
### MinHashed Fingerprints
- **map4** - MinHashed Atom-Pair fingerprint up to a diameter of 4 bonds
- Combines atom-pair and ECFP concepts
- Default: 1024 dimensions
- Fast and efficient for large datasets
- **secfp** - SMILES Extended Connectivity Fingerprint
- Operates directly on SMILES strings
- Captures both substructure and atom-pair information
### Extended Reduced Graph
- **erg** - Extended Reduced Graph
- Uses pharmacophoric points instead of atoms
- Reduces graph complexity while preserving key features
## Pharmacophore Descriptors
Features based on pharmacologically relevant functional groups and their spatial relationships.
### CATS (Chemically Advanced Template Search)
- **cats2D** - 2D CATS descriptors
- Pharmacophore point pair distributions
- Distance based on shortest path
- 21 pharmacophore point-pair types, binned by topological distance
- **cats3D** - 3D CATS descriptors
- Euclidean distance based
- Requires conformer generation
- **cats2D_pharm** / **cats3D_pharm** - Pharmacophore variants
### Gobbi Pharmacophores
- **gobbi2D** - 2D pharmacophore fingerprints
- Pharmacophore feature types include:
- Hydrophobic
- Aromatic
- H-bond acceptor
- H-bond donor
- Positive ionizable
- Negative ionizable
- Lumped hydrophobe
- Good for virtual screening
### Pmapper Pharmacophores
- **pmapper2D** - 2D pharmacophore signatures
- **pmapper3D** - 3D pharmacophore signatures
- High-dimensional pharmacophore descriptors
- Useful for QSAR and similarity searching
## Shape Descriptors
Descriptors capturing 3D molecular shape and electrostatic properties.
### USR (Ultrafast Shape Recognition)
- **usr** - Basic USR descriptors
- 12 dimensions encoding shape distribution
- Extremely fast computation
- **usrcat** - USR with pharmacophoric constraints
- 60 dimensions (12 per feature type)
- Combines shape and pharmacophore information
### Electrostatic Shape
- **electroshape** - ElectroShape descriptors
- Combines molecular shape, chirality, and electrostatics
- Useful for protein-ligand docking predictions
## Scaffold-Based Descriptors
Descriptors based on molecular scaffolds and core structures.
### Scaffold Keys
- **scaffoldkeys** - Scaffold key calculator
- 40+ scaffold-based properties
- Bioisosteric scaffold representation
- Captures core structural features
## Graph Featurizers for GNN Input
Atom and bond-level features for constructing graph representations for Graph Neural Networks.
### Atom-Level Features
- **atom-onehot** - One-hot encoded atom features
- **atom-default** - Default atom featurization including:
- Atomic number
- Degree, formal charge
- Hybridization
- Aromaticity
- Number of hydrogen atoms
### Bond-Level Features
- **bond-onehot** - One-hot encoded bond features
- **bond-default** - Default bond featurization including:
- Bond type (single, double, triple, aromatic)
- Conjugation
- Ring membership
- Stereochemistry
## Integrated Pretrained Model Collections
Molfeat integrates models from various sources:
### HuggingFace Models
Access to transformer models through HuggingFace hub:
- ChemBERTa variants
- ChemGPT variants
- MolT5
- Custom uploaded models
### DGL-LifeSci Models
Pre-trained GNN models from DGL-Life:
- GIN variants with different pre-training tasks
- AttentiveFP models
- MPNN models
### FCD (Fréchet ChemNet Distance)
- **fcd** - Pre-trained CNN for molecular generation evaluation
### Graphormer Models
- Graph transformers from Microsoft Research
- Pre-trained on quantum chemistry datasets
## Usage Notes
### Choosing a Featurizer
**For traditional ML (Random Forest, SVM, etc.):**
- Start with **ecfp** or **maccs** fingerprints
- Try **desc2D** for interpretable models
- Use **FeatConcat** to combine multiple fingerprints
**For deep learning:**
- Use **ChemBERTa** or **ChemGPT** for transformer embeddings
- Use **gin-supervised-*** for graph neural network embeddings
- Consider **Graphormer** for quantum property predictions
**For similarity searching:**
- **ecfp** - General purpose, most popular
- **maccs** - Fast, good for scaffold hopping
- **map4** - Efficient for large-scale searches
- **usr** / **usrcat** - 3D shape similarity
**For pharmacophore-based approaches:**
- **fcfp** - Functional group based
- **cats2D/3D** - Pharmacophore pair distributions
- **gobbi2D** - Explicit pharmacophore features
**For interpretability:**
- **desc2D** / **mordred** - Named descriptors
- **maccs** - Interpretable substructure keys
- **scaffoldkeys** - Scaffold-based features
### Model Dependencies
Some featurizers require optional dependencies:
- **DGL models** (gin-*, jtvae): `pip install "molfeat[dgl]"`
- **Graphormer**: `pip install "molfeat[graphormer]"`
- **Transformers** (ChemBERTa, ChemGPT, MolT5): `pip install "molfeat[transformer]"`
- **FCD**: `pip install "molfeat[fcd]"`
- **MAP4**: `pip install "molfeat[map4]"`
- **All dependencies**: `pip install "molfeat[all]"`
### Accessing All Available Models
```python
from molfeat.store.modelstore import ModelStore

store = ModelStore()
all_models = store.available_models

# Print all available featurizers
for model in all_models:
    print(f"{model.name}: {model.description}")

# Search for specific types
transformers = [m for m in all_models if "transformer" in m.tags]
gnn_models = [m for m in all_models if "gnn" in m.tags]
fingerprints = [m for m in all_models if "fingerprint" in m.tags]
```
## Performance Characteristics
### Computational Speed (relative)
**Fastest:**
- maccs
- ecfp
- rdkit fingerprints
- usr
**Medium:**
- desc2D
- cats2D
- Most fingerprints
**Slower:**
- mordred (1800+ descriptors)
- desc3D (requires conformer generation)
- 3D descriptors in general
**Slowest (first run):**
- Pretrained models (ChemBERTa, ChemGPT, GIN)
- Note: Subsequent runs benefit from caching
### Dimensionality
**Low (< 200 dims):**
- maccs (167)
- usr (12)
- usrcat (60)
**Medium (200-2000 dims):**
- desc2D (~200)
- ecfp (2048 default, configurable)
- map4 (1024 default)
**High (> 2000 dims):**
- mordred (1800+)
- Concatenated fingerprints
- Some transformer embeddings
**Variable:**
- Transformer models (typically 768-1024)
- GNN models (depends on architecture)