
Molfeat API Reference

Core Modules

Molfeat is organized into several key modules that provide different aspects of molecular featurization:

  • molfeat.store - Manages model loading, listing, and registration
  • molfeat.calc - Provides calculators for single-molecule featurization
  • molfeat.trans - Offers scikit-learn compatible transformers for batch processing
  • molfeat.utils - Utility functions for data handling
  • molfeat.viz - Visualization tools for molecular features

molfeat.calc - Calculators

Calculators are callable objects that convert individual molecules into feature vectors. They accept either RDKit Chem.Mol objects or SMILES strings as input.

SerializableCalculator (Base Class)

Abstract base class for all calculators. Subclasses can implement the following (only __call__() is mandatory):

  • __call__() - Required method for featurization
  • __len__() - Optional, returns output length
  • columns - Optional property, returns feature names
  • batch_compute() - Optional, for efficient batch processing

State Management Methods:

  • to_state_json() - Save calculator state as JSON
  • to_state_yaml() - Save calculator state as YAML
  • from_state_dict() - Load calculator from state dictionary
  • to_state_dict() - Export calculator state as dictionary
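
For example, a calculator can be serialized and rebuilt from its state (a minimal sketch using FPCalculator, described below):

from molfeat.calc import FPCalculator

calc = FPCalculator("maccs")
state = calc.to_state_dict()                      # plain dict, easy to persist as JSON/YAML
restored = FPCalculator.from_state_dict(state)    # rebuild an equivalent calculator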

FPCalculator

Computes molecular fingerprints. Supports 15+ fingerprint methods.

Supported Fingerprint Types:

Structural Fingerprints:

  • ecfp - Extended-connectivity fingerprints (circular)
  • fcfp - Functional-class fingerprints
  • rdkit - RDKit topological fingerprints
  • maccs - MACCS keys (166 structural keys; RDKit returns a 167-bit vector)
  • avalon - Avalon fingerprints
  • pattern - Pattern fingerprints
  • layered - Layered fingerprints

Atom-based Fingerprints:

  • atompair - Atom pair fingerprints
  • atompair-count - Counted atom pairs
  • topological - Topological torsion fingerprints
  • topological-count - Counted topological torsions

Specialized Fingerprints:

  • map4 - MinHashed atom-pair fingerprint up to 4 bonds
  • secfp - SMILES extended connectivity fingerprint
  • erg - Extended reduced graphs
  • estate - Electrotopological state indices

Parameters:

  • method (str) - Fingerprint type name
  • radius (int) - Radius for circular fingerprints (default: 3)
  • fpSize (int) - Fingerprint size (default: 2048)
  • includeChirality (bool) - Include chirality information
  • counting (bool) - Use count vectors instead of binary

Usage:

from molfeat.calc import FPCalculator

# Create fingerprint calculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)

# Compute fingerprint for single molecule
fp = calc("CCO")  # Returns numpy array

# Get fingerprint length
length = len(calc)  # 2048

# Get feature names
names = calc.columns

Common Fingerprint Dimensions:

  • MACCS: 167 dimensions
  • ECFP (default): 2048 dimensions
  • MAP4 (default): 1024 dimensions
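
These dimensions can be checked programmatically with len() (a quick check reusing FPCalculator):

from molfeat.calc import FPCalculator

for method in ["maccs", "ecfp", "map4"]:
    calc = FPCalculator(method)
    print(method, len(calc))  # 167, 2048, 1024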

Descriptor Calculators

RDKitDescriptors2D

Computes 2D molecular descriptors using RDKit.

from molfeat.calc import RDKitDescriptors2D

calc = RDKitDescriptors2D()
descriptors = calc("CCO")  # Returns 200+ descriptors

RDKitDescriptors3D

Computes 3D molecular descriptors (requires conformer generation).
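
A minimal sketch, assuming conformers are generated with datamol:

import datamol as dm
from molfeat.calc import RDKitDescriptors3D

# 3D descriptors require at least one conformer
mol = dm.conformers.generate(dm.to_mol("CCO"))
calc = RDKitDescriptors3D()
descriptors = calc(mol)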

MordredDescriptors

Calculates over 1800 molecular descriptors using Mordred.

from molfeat.calc import MordredDescriptors

calc = MordredDescriptors()
descriptors = calc("CCO")

Pharmacophore Calculators

Pharmacophore2D

RDKit's 2D pharmacophore fingerprint generation.

Pharmacophore3D

Consensus pharmacophore fingerprints from multiple conformers.

CATSCalculator

Computes Chemically Advanced Template Search (CATS) descriptors - pharmacophore point pair distributions.

Parameters:

  • mode - "2D" or "3D" distance calculations
  • dist_bins - Distance bins for pair distributions
  • scale - Scaling mode: "raw", "num", or "count"

Usage:

from molfeat.calc import CATSCalculator

calc = CATSCalculator(mode="2D", scale="raw")
cats = calc("CCO")  # Returns 21 descriptors by default

Shape Descriptors

USRDescriptors

Ultrafast shape recognition descriptors (multiple variants).

ElectroShapeDescriptors

Shape descriptors that combine molecular shape, chirality, and electrostatics.
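
Both operate on 3D conformers. A minimal sketch, assuming conformer generation with datamol and the default USR variant:

import datamol as dm
from molfeat.calc import USRDescriptors

# Shape descriptors need 3D coordinates
mol = dm.conformers.generate(dm.to_mol("c1ccccc1O"))
calc = USRDescriptors()  # default variant (assumed)
shape = calc(mol)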

Graph-Based Calculators

ScaffoldKeyCalculator

Computes 40+ scaffold-based molecular properties.

AtomCalculator

Atom-level featurization for graph neural networks.

BondCalculator

Bond-level featurization for graph neural networks.
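
A minimal sketch of atom-level featurization (the exact structure of the output depends on the configured atom featurizers; treat the output layout as an assumption):

import datamol as dm
from molfeat.calc import AtomCalculator

calc = AtomCalculator()
mol = dm.to_mol("CCO")
atom_features = calc(mol)  # per-atom features, keyed by feature name (assumed)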

Utility Function

get_calculator()

Factory function to instantiate calculators by name.

from molfeat.calc import get_calculator

# Instantiate any calculator by name
calc = get_calculator("ecfp", radius=3)
calc = get_calculator("maccs")
calc = get_calculator("desc2D")

Raises ValueError for unsupported featurizers.
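
Because unsupported names raise ValueError, lookups can be guarded (the name below is deliberately invalid):

from molfeat.calc import get_calculator

try:
    calc = get_calculator("not-a-real-featurizer")  # hypothetical invalid name
except ValueError as err:
    print(f"Unsupported featurizer: {err}")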


molfeat.trans - Transformers

Transformers wrap calculators into complete featurization pipelines for batch processing.

MoleculeTransformer

Scikit-learn compatible transformer for batch molecular featurization.

Key Parameters:

  • featurizer - Calculator or featurizer to use
  • n_jobs (int) - Number of parallel jobs (-1 for all cores)
  • dtype - Output data type (numpy float32/64, torch tensors)
  • verbose (bool) - Enable verbose logging
  • ignore_errors (bool) - Continue on failures (returns None for failed molecules)

Essential Methods:

  • transform(mols) - Processes batches and returns representations
  • _transform(mol) - Handles individual molecule featurization
  • __call__(mols) - Convenience wrapper around transform()
  • preprocess(mol) - Prepares input molecules (not automatically applied)
  • to_state_yaml_file(path) - Save transformer configuration
  • from_state_yaml_file(path) - Load transformer configuration

Usage:

from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
import datamol as dm

# Load molecules
smiles = dm.data.freesolv().sample(100).smiles.values

# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Featurize batch
features = transformer(smiles)  # Returns numpy array (100, 2048)

# Save configuration
transformer.to_state_yaml_file("ecfp_config.yml")

# Reload
transformer = MoleculeTransformer.from_state_yaml_file("ecfp_config.yml")

Performance: Testing on 642 molecules showed 3.4x speedup using 4 parallel jobs versus single-threaded processing.
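
The comparison can be reproduced with a simple timing harness (a sketch; absolute timings depend on hardware, dataset, and featurizer):

import time

from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

calc = FPCalculator("ecfp")
for n_jobs in (1, 4):
    transformer = MoleculeTransformer(calc, n_jobs=n_jobs)
    start = time.perf_counter()
    transformer(smiles)  # `smiles` from the example above
    print(f"n_jobs={n_jobs}: {time.perf_counter() - start:.2f}s")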

FeatConcat

Concatenates multiple featurizers into unified representations.

from molfeat.trans import FeatConcat, MoleculeTransformer
from molfeat.calc import FPCalculator

# Combine multiple fingerprints
concat = FeatConcat([
    FPCalculator("maccs"),      # 167 dimensions
    FPCalculator("ecfp")         # 2048 dimensions
])

# Result: 2215-dimensional features (167 + 2048)
transformer = MoleculeTransformer(concat, n_jobs=-1)
features = transformer(smiles)

PretrainedMolTransformer

Subclass of MoleculeTransformer for pre-trained deep learning models.

Unique Features:

  • _embed() - Batched inference for neural networks
  • _convert() - Transforms SMILES/molecules into model-compatible formats
    • SELFIES strings for language models
    • DGL graphs for graph neural networks
  • Integrated caching system for efficient storage

Usage:

from molfeat.trans.pretrained import PretrainedHFTransformer

# Load a pretrained model (PretrainedHFTransformer is the concrete
# subclass for HuggingFace language models such as ChemBERTa)
transformer = PretrainedHFTransformer(kind="ChemBERTa-77M-MLM", n_jobs=-1)

# Generate embeddings
embeddings = transformer(smiles)

PrecomputedMolTransformer

Transformer for cached/precomputed features.


molfeat.store - Model Store

Manages featurizer discovery, loading, and registration.

ModelStore

Central hub for accessing available featurizers.

Key Methods:

  • available_models - Property listing all available featurizers
  • search(name=None, **kwargs) - Search for specific featurizers
  • load(name, **kwargs) - Load a featurizer by name
  • register(name, card) - Register custom featurizer

Usage:

from molfeat.store.modelstore import ModelStore

# Initialize store
store = ModelStore()

# List all available models
all_models = store.available_models
print(f"Found {len(all_models)} featurizers")

# Search for specific model
results = store.search(name="ChemBERTa-77M-MLM")
if results:
    model_card = results[0]

    # View usage information
    model_card.usage()

    # Load the model
    transformer = model_card.load()

# Direct loading
transformer = store.load("ChemBERTa-77M-MLM")

ModelCard Attributes:

  • name - Model identifier
  • description - Model description
  • version - Model version
  • authors - Model authors
  • tags - Categorization tags
  • usage() - Display usage examples
  • load(**kwargs) - Load the model
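
These attributes make it straightforward to browse the registry (a small sketch building on the ModelStore example above):

from molfeat.store.modelstore import ModelStore

store = ModelStore()
for card in store.available_models[:5]:
    print(card.name, card.version, card.tags)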

Common Patterns

Error Handling

# Enable error tolerance
featurizer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    verbose=True,
    ignore_errors=True
)

# Failed molecules return None
features = featurizer(smiles_with_errors)

Data Type Control

# NumPy float32 (default)
features = transformer(smiles, enforce_dtype=True)

# PyTorch tensors
import torch
transformer = MoleculeTransformer(calc, dtype=torch.float32)
features = transformer(smiles)

Persistence and Reproducibility

# Save transformer state
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")

# Load from saved state
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
transformer = MoleculeTransformer.from_state_json_file("config.json")

Preprocessing

# Manual preprocessing
mol = transformer.preprocess("CCO")

# Transform with preprocessing
features = transformer.transform(smiles_list)

Integration Examples

Scikit-learn Pipeline

from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Create pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"))),
    ('classifier', RandomForestClassifier())
])

# Fit and predict
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)

PyTorch Integration

import torch
from torch.utils.data import Dataset, DataLoader
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.smiles = smiles
        self.labels = labels
        self.transformer = transformer

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        # Transformers expect a batch; wrap the single SMILES in a list
        features = self.transformer([self.smiles[idx]])[0]
        return torch.tensor(features), torch.tensor(self.labels[idx])

# Create dataset and dataloader
transformer = MoleculeTransformer(FPCalculator("ecfp"))
dataset = MoleculeDataset(smiles, labels, transformer)
loader = DataLoader(dataset, batch_size=32)

Performance Tips

  1. Parallelization: Use n_jobs=-1 to utilize all CPU cores
  2. Batch Processing: Process multiple molecules at once instead of loops
  3. Caching: Leverage built-in caching for pretrained models
  4. Data Types: Use float32 instead of float64 when precision allows
  5. Error Handling: Set ignore_errors=True for large datasets with potential invalid molecules