Molfeat API Reference
Core Modules
Molfeat is organized into several key modules that provide different aspects of molecular featurization:
- molfeat.store - Manages model loading, listing, and registration
- molfeat.calc - Provides calculators for single-molecule featurization
- molfeat.trans - Offers scikit-learn compatible transformers for batch processing
- molfeat.utils - Utility functions for data handling
- molfeat.viz - Visualization tools for molecular features
molfeat.calc - Calculators
Calculators are callable objects that convert individual molecules into feature vectors. They accept either RDKit Chem.Mol objects or SMILES strings as input.
SerializableCalculator (Base Class)
Abstract base class for all calculators. Subclasses implement the following interface:
- __call__() - Required method for featurization
- __len__() - Optional, returns output length
- columns - Optional property, returns feature names
- batch_compute() - Optional, for efficient batch processing
State Management Methods:
- to_state_json() - Save calculator state as JSON
- to_state_yaml() - Save calculator state as YAML
- from_state_dict() - Load calculator from state dictionary
- to_state_dict() - Export calculator state as dictionary
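The calculator protocol can be illustrated with a toy class in plain Python. This is a sketch of the interface shape only — a real implementation would subclass molfeat.calc.SerializableCalculator and compute features with RDKit; the class name and its string-counting "features" here are invented for illustration:

```python
class LengthCalculator:
    """Toy calculator mirroring the SerializableCalculator interface.

    Illustrative only: featurizes a SMILES string by simple counts
    instead of real chemistry.
    """

    @property
    def columns(self):
        # Optional: names for each output feature
        return ["n_chars", "n_ring_closures"]

    def __len__(self):
        # Optional: length of the output vector
        return len(self.columns)

    def __call__(self, smiles):
        # Required: map one molecule (here, a SMILES string) to features
        return [len(smiles), sum(c.isdigit() for c in smiles)]


calc = LengthCalculator()
len(calc)         # 2
calc("c1ccccc1")  # [8, 2]
```

Because the interface is just `__call__`, `__len__`, and `columns`, any such object can be dropped into a MoleculeTransformer for batch processing.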
FPCalculator
Computes molecular fingerprints. Supports 15+ fingerprint methods.
Supported Fingerprint Types:
Structural Fingerprints:
- ecfp - Extended-connectivity fingerprints (circular)
- fcfp - Functional-class fingerprints
- rdkit - RDKit topological fingerprints
- maccs - MACCS keys (166-bit structural keys)
- avalon - Avalon fingerprints
- pattern - Pattern fingerprints
- layered - Layered fingerprints
Atom-based Fingerprints:
- atompair - Atom pair fingerprints
- atompair-count - Counted atom pairs
- topological - Topological torsion fingerprints
- topological-count - Counted topological torsions
Specialized Fingerprints:
- map4 - MinHashed atom-pair fingerprint up to 4 bonds
- secfp - SMILES extended connectivity fingerprint
- erg - Extended reduced graphs
- estate - Electrotopological state indices
Parameters:
- method (str) - Fingerprint type name
- radius (int) - Radius for circular fingerprints (default: 3)
- fpSize (int) - Fingerprint size (default: 2048)
- includeChirality (bool) - Include chirality information
- counting (bool) - Use count vectors instead of binary
Usage:
from molfeat.calc import FPCalculator
# Create fingerprint calculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
# Compute fingerprint for single molecule
fp = calc("CCO") # Returns numpy array
# Get fingerprint length
length = len(calc) # 2048
# Get feature names
names = calc.columns
Common Fingerprint Dimensions:
- MACCS: 167 dimensions
- ECFP (default): 2048 dimensions
- MAP4 (default): 1024 dimensions
Descriptor Calculators
RDKitDescriptors2D Computes 2D molecular descriptors using RDKit.
from molfeat.calc import RDKitDescriptors2D
calc = RDKitDescriptors2D()
descriptors = calc("CCO") # Returns 200+ descriptors
RDKitDescriptors3D Computes 3D molecular descriptors (requires conformer generation).
MordredDescriptors Calculates over 1800 molecular descriptors using Mordred.
from molfeat.calc import MordredDescriptors
calc = MordredDescriptors()
descriptors = calc("CCO")
Pharmacophore Calculators
Pharmacophore2D RDKit's 2D pharmacophore fingerprint generation.
Pharmacophore3D Consensus pharmacophore fingerprints from multiple conformers.
CATSCalculator Computes Chemically Advanced Template Search (CATS) descriptors - pharmacophore point pair distributions.
Parameters:
- mode - "2D" or "3D" distance calculations
- dist_bins - Distance bins for pair distributions
- scale - Scaling mode: "raw", "num", or "count"
from molfeat.calc import CATSCalculator
calc = CATSCalculator(mode="2D", scale="raw")
cats = calc("CCO") # Returns 21 descriptors by default
Shape Descriptors
USRDescriptors Ultrafast shape recognition descriptors (multiple variants).
ElectroShapeDescriptors Electrostatic shape descriptors combining shape, chirality, and electrostatics.
Graph-Based Calculators
ScaffoldKeyCalculator Computes 40+ scaffold-based molecular properties.
AtomCalculator Atom-level featurization for graph neural networks.
BondCalculator Bond-level featurization for graph neural networks.
Utility Function
get_calculator() Factory function to instantiate calculators by name.
from molfeat.calc import get_calculator
# Instantiate any calculator by name
calc = get_calculator("ecfp", radius=3)
calc = get_calculator("maccs")
calc = get_calculator("desc2D")
Raises ValueError for unsupported featurizers.
molfeat.trans - Transformers
Transformers wrap calculators into complete featurization pipelines for batch processing.
MoleculeTransformer
Scikit-learn compatible transformer for batch molecular featurization.
Key Parameters:
- featurizer - Calculator or featurizer to use
- n_jobs (int) - Number of parallel jobs (-1 for all cores)
- dtype - Output data type (numpy float32/64, torch tensors)
- verbose (bool) - Enable verbose logging
- ignore_errors (bool) - Continue on failures (returns None for failed molecules)
Essential Methods:
- transform(mols) - Processes batches and returns representations
- _transform(mol) - Handles individual molecule featurization
- __call__(mols) - Convenience wrapper around transform()
- preprocess(mol) - Prepares input molecules (not automatically applied)
- to_state_yaml_file(path) - Save transformer configuration
- from_state_yaml_file(path) - Load transformer configuration
Usage:
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
import datamol as dm
# Load molecules
smiles = dm.data.freesolv().sample(100).smiles.values
# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Featurize batch
features = transformer(smiles) # Returns numpy array (100, 2048)
# Save configuration
transformer.to_state_yaml_file("ecfp_config.yml")
# Reload
transformer = MoleculeTransformer.from_state_yaml_file("ecfp_config.yml")
Performance: In a test on 642 molecules, 4 parallel jobs gave a 3.4x speedup over single-threaded processing.
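The fan-out pattern behind that speedup can be mimicked with a plain thread pool over any per-molecule function. This is a generic sketch, not molfeat's internal implementation; the featurize stand-in is invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def featurize(smiles):
    # Stand-in for an expensive per-molecule computation
    return [len(smiles), smiles.count("C")]

smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]

# Map the per-molecule function over the batch in parallel
with ThreadPoolExecutor(max_workers=4) as pool:
    features = list(pool.map(featurize, smiles_list))

features[0]  # [3, 2]
```

The speedup depends on how expensive each call is relative to scheduling overhead, which is why gains grow with heavier featurizers.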
FeatConcat
Concatenates multiple featurizers into unified representations.
from molfeat.trans import FeatConcat
from molfeat.calc import FPCalculator
# Combine multiple fingerprints
concat = FeatConcat([
    FPCalculator("maccs"),  # 167 dimensions
    FPCalculator("ecfp"),   # 2048 dimensions
])
# Result: 2215-dimensional features (167 + 2048)
transformer = MoleculeTransformer(concat, n_jobs=-1)
features = transformer(smiles)
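Dimensionally, concatenating featurizers is just stacking along the feature axis; the arithmetic can be checked with NumPy stand-ins for the two fingerprint blocks:

```python
import numpy as np

# Stand-ins for featurizer outputs on a batch of 100 molecules
maccs = np.zeros((100, 167))   # MACCS keys
ecfp = np.zeros((100, 2048))   # ECFP at the default size

# Concatenation along axis=1 yields 167 + 2048 = 2215 features
combined = np.concatenate([maccs, ecfp], axis=1)
combined.shape  # (100, 2215)
```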
PretrainedMolTransformer
Subclass of MoleculeTransformer for pre-trained deep learning models.
Unique Features:
- _embed() - Batched inference for neural networks
- _convert() - Transforms SMILES/molecules into model-compatible formats:
  - SELFIES strings for language models
  - DGL graphs for graph neural networks
- Integrated caching system for efficient storage
Usage (shown with PretrainedHFTransformer, the concrete subclass for Hugging Face models; the class to use depends on the model family):
from molfeat.trans.pretrained import PretrainedHFTransformer
# Load pretrained model
transformer = PretrainedHFTransformer(kind="ChemBERTa-77M-MLM")
# Generate embeddings
embeddings = transformer(smiles)
PrecomputedMolTransformer
Transformer for cached/precomputed features.
molfeat.store - Model Store
Manages featurizer discovery, loading, and registration.
ModelStore
Central hub for accessing available featurizers.
Key Methods:
- available_models - Property listing all available featurizers
- search(name=None, **kwargs) - Search for specific featurizers
- load(name, **kwargs) - Load a featurizer by name
- register(name, card) - Register custom featurizer
Usage:
from molfeat.store.modelstore import ModelStore
# Initialize store
store = ModelStore()
# List all available models
all_models = store.available_models
print(f"Found {len(all_models)} featurizers")
# Search for specific model
results = store.search(name="ChemBERTa-77M-MLM")
if results:
    model_card = results[0]
    # View usage information
    model_card.usage()
    # Load the model
    transformer = model_card.load()
# Direct loading
transformer = store.load("ChemBERTa-77M-MLM")
ModelCard Attributes:
- name - Model identifier
- description - Model description
- version - Model version
- authors - Model authors
- tags - Categorization tags
- usage() - Display usage examples
- load(**kwargs) - Load the model
Common Patterns
Error Handling
# Enable error tolerance
featurizer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    verbose=True,
    ignore_errors=True,
)
# Failed molecules return None
features = featurizer(smiles_with_errors)
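Since failed molecules come back as None, results usually need filtering together with their labels before modeling. A sketch with placeholder featurizer output (the feature values here are invented):

```python
# Placeholder output: the second molecule failed to featurize
features = [[0.1, 0.2], None, [0.3, 0.4]]
labels = [1, 0, 1]

# Keep only successfully featurized molecules, with their labels
pairs = [(f, y) for f, y in zip(features, labels) if f is not None]
X = [f for f, _ in pairs]
y = [lab for _, lab in pairs]

len(X), y  # (2, [1, 1])
```

Filtering labels in the same pass keeps X and y aligned, which is the usual failure mode when invalid molecules are dropped naively.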
Data Type Control
# NumPy float32 (default)
features = transformer(smiles, enforce_dtype=True)
# PyTorch tensors
import torch
transformer = MoleculeTransformer(calc, dtype=torch.float32)
features = transformer(smiles)
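Choosing float32 over float64 halves the memory of a dense feature matrix, which matters at fingerprint scale. The arithmetic for a large ECFP batch:

```python
import numpy as np

# Memory for 100,000 ECFP vectors of length 2048
n_mols, n_bits = 100_000, 2048
bytes_f64 = n_mols * n_bits * np.dtype(np.float64).itemsize
bytes_f32 = n_mols * n_bits * np.dtype(np.float32).itemsize

bytes_f64 / 2**20, bytes_f32 / 2**20  # (1562.5, 781.25) MiB
```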
Persistence and Reproducibility
# Save transformer state
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")
# Load from saved state
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
transformer = MoleculeTransformer.from_state_json_file("config.json")
Preprocessing
# Manual preprocessing
mol = transformer.preprocess("CCO")
# Transform with preprocessing
features = transformer.transform(smiles_list)
Integration Examples
Scikit-learn Pipeline
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Create pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"))),
    ('classifier', RandomForestClassifier()),
])
# Fit and predict
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
PyTorch Integration
import torch
from torch.utils.data import Dataset, DataLoader
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.smiles = smiles
        self.labels = labels
        self.transformer = transformer

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        features = self.transformer(self.smiles[idx])
        return torch.tensor(features), torch.tensor(self.labels[idx])
# Create dataset and dataloader
transformer = MoleculeTransformer(FPCalculator("ecfp"))
dataset = MoleculeDataset(smiles, labels, transformer)
loader = DataLoader(dataset, batch_size=32)
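Calling the transformer inside __getitem__ refeaturizes each molecule on every epoch. When the feature matrix fits in memory, featurizing the whole dataset once up front is usually faster. A torch-free sketch of the precomputed variant, with placeholder vectors standing in for transformer output:

```python
class PrecomputedDataset:
    """Map-style dataset over features computed once, up front."""

    def __init__(self, features, labels):
        # `features` would be transformer(smiles), run a single time
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]


ds = PrecomputedDataset([[0.0, 1.0], [1.0, 0.0]], [0, 1])
len(ds)  # 2
ds[1]    # ([1.0, 0.0], 1)
```

The same idea is what PrecomputedMolTransformer formalizes for cached features.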
Performance Tips
- Parallelization: Use n_jobs=-1 to utilize all CPU cores
- Batch Processing: Process multiple molecules at once instead of looping
- Caching: Leverage built-in caching for pretrained models
- Data Types: Use float32 instead of float64 when precision allows
- Error Handling: Set ignore_errors=True for large datasets that may contain invalid molecules