# Molfeat API Reference

## Core Modules

Molfeat is organized into several key modules that provide different aspects of molecular featurization:

- **`molfeat.store`** - Manages model loading, listing, and registration
- **`molfeat.calc`** - Provides calculators for single-molecule featurization
- **`molfeat.trans`** - Offers scikit-learn compatible transformers for batch processing
- **`molfeat.utils`** - Utility functions for data handling
- **`molfeat.viz`** - Visualization tools for molecular features

---
## molfeat.calc - Calculators

Calculators are callable objects that convert individual molecules into feature vectors. They accept either RDKit `Chem.Mol` objects or SMILES strings as input.

### SerializableCalculator (Base Class)

Abstract base class for all calculators. When subclassing, you must implement `__call__()`; the remaining members are optional:

- `__call__()` - Required method for featurization
- `__len__()` - Optional, returns output length
- `columns` - Optional property, returns feature names
- `batch_compute()` - Optional, for efficient batch processing

**State Management Methods:**

- `to_state_json()` - Save calculator state as JSON
- `to_state_yaml()` - Save calculator state as YAML
- `from_state_dict()` - Load calculator from state dictionary
- `to_state_dict()` - Export calculator state as dictionary
### FPCalculator

Computes molecular fingerprints. Supports 15+ fingerprint methods.

**Supported Fingerprint Types:**

**Structural Fingerprints:**

- `ecfp` - Extended-connectivity fingerprints (circular)
- `fcfp` - Functional-class fingerprints
- `rdkit` - RDKit topological fingerprints
- `maccs` - MACCS keys (166-bit structural keys)
- `avalon` - Avalon fingerprints
- `pattern` - Pattern fingerprints
- `layered` - Layered fingerprints

**Atom-based Fingerprints:**

- `atompair` - Atom pair fingerprints
- `atompair-count` - Counted atom pairs
- `topological` - Topological torsion fingerprints
- `topological-count` - Counted topological torsions

**Specialized Fingerprints:**

- `map4` - MinHashed atom-pair fingerprint up to 4 bonds
- `secfp` - SMILES extended connectivity fingerprint
- `erg` - Extended reduced graphs
- `estate` - Electrotopological state indices

**Parameters:**

- `method` (str) - Fingerprint type name
- `radius` (int) - Radius for circular fingerprints (default: 3)
- `fpSize` (int) - Fingerprint size (default: 2048)
- `includeChirality` (bool) - Include chirality information
- `counting` (bool) - Use count vectors instead of binary

**Usage:**

```python
from molfeat.calc import FPCalculator

# Create a fingerprint calculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)

# Compute the fingerprint of a single molecule
fp = calc("CCO")  # Returns a numpy array

# Get the fingerprint length
length = len(calc)  # 2048

# Get feature names
names = calc.columns
```
**Common Fingerprint Dimensions:**

- MACCS: 167 dimensions (RDKit pads the 166 structural keys with an unused bit 0)
- ECFP (default): 2048 dimensions
- MAP4 (default): 1024 dimensions
### Descriptor Calculators

**RDKitDescriptors2D**

Computes 2D molecular descriptors using RDKit.

```python
from molfeat.calc import RDKitDescriptors2D

calc = RDKitDescriptors2D()
descriptors = calc("CCO")  # Returns 200+ descriptors
```

**RDKitDescriptors3D**

Computes 3D molecular descriptors (requires conformer generation).
**MordredDescriptors**

Calculates over 1800 molecular descriptors using Mordred.

```python
from molfeat.calc import MordredDescriptors

calc = MordredDescriptors()
descriptors = calc("CCO")
```
### Pharmacophore Calculators

**Pharmacophore2D**

RDKit's 2D pharmacophore fingerprint generation.

**Pharmacophore3D**

Consensus pharmacophore fingerprints from multiple conformers.

**CATSCalculator**

Computes Chemically Advanced Template Search (CATS) descriptors - pharmacophore point pair distributions.

**Parameters:**

- `mode` - "2D" or "3D" distance calculations
- `dist_bins` - Distance bins for pair distributions
- `scale` - Scaling mode: "raw", "num", or "count"

```python
from molfeat.calc import CATSCalculator

calc = CATSCalculator(mode="2D", scale="raw")
cats = calc("CCO")  # Returns 21 descriptors by default
```
### Shape Descriptors

**USRDescriptors**

Ultrafast shape recognition descriptors (multiple variants).

**ElectroShapeDescriptors**

Electrostatic shape descriptors combining shape, chirality, and electrostatics.

### Graph-Based Calculators

**ScaffoldKeyCalculator**

Computes 40+ scaffold-based molecular properties.

**AtomCalculator**

Atom-level featurization for graph neural networks.

**BondCalculator**

Bond-level featurization for graph neural networks.

### Utility Function

**get_calculator()**

Factory function to instantiate calculators by name.
```python
from molfeat.calc import get_calculator

# Instantiate any calculator by name
calc = get_calculator("ecfp", radius=3)
calc = get_calculator("maccs")
calc = get_calculator("desc2D")
```

Raises `ValueError` for unsupported featurizers.
---
## molfeat.trans - Transformers

Transformers wrap calculators into complete featurization pipelines for batch processing.

### MoleculeTransformer

Scikit-learn compatible transformer for batch molecular featurization.

**Key Parameters:**

- `featurizer` - Calculator or featurizer to use
- `n_jobs` (int) - Number of parallel jobs (-1 for all cores)
- `dtype` - Output data type (numpy float32/64, torch tensors)
- `verbose` (bool) - Enable verbose logging
- `ignore_errors` (bool) - Continue on failures (returns None for failed molecules)

**Essential Methods:**

- `transform(mols)` - Processes batches and returns representations
- `_transform(mol)` - Handles individual molecule featurization
- `__call__(mols)` - Convenience wrapper around transform()
- `preprocess(mol)` - Prepares input molecules (not automatically applied)
- `to_state_yaml_file(path)` - Save transformer configuration
- `from_state_yaml_file(path)` - Load transformer configuration
**Usage:**

```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
import datamol as dm

# Load molecules
smiles = dm.data.freesolv().sample(100).smiles.values

# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Featurize batch
features = transformer(smiles)  # Returns a numpy array of shape (100, 2048)

# Save configuration
transformer.to_state_yaml_file("ecfp_config.yml")

# Reload
transformer = MoleculeTransformer.from_state_yaml_file("ecfp_config.yml")
```

**Performance:** Testing on 642 molecules showed a 3.4x speedup using 4 parallel jobs versus single-threaded processing.
### FeatConcat

Concatenates multiple featurizers into unified representations. Featurizers can be specified by name.

```python
import numpy as np
from molfeat.trans import FeatConcat

# Combine multiple fingerprints by name
concat = FeatConcat([
    "maccs",  # 167 dimensions
    "ecfp",   # 2048 dimensions
], dtype=np.float32)

# Result: 2215-dimensional features (167 + 2048)
features = concat(smiles)
```
### PretrainedMolTransformer

Subclass of `MoleculeTransformer` for pre-trained deep learning models.

**Unique Features:**

- `_embed()` - Batched inference for neural networks
- `_convert()` - Transforms SMILES/molecules into model-compatible formats:
  - SELFIES strings for language models
  - DGL graphs for graph neural networks
- Integrated caching system for efficient storage

**Usage:**

```python
from molfeat.store.modelstore import ModelStore

# Pretrained transformers are typically obtained through the model store
transformer = ModelStore().load("ChemBERTa-77M-MLM")

# Generate embeddings
embeddings = transformer(smiles)
```
### PrecomputedMolTransformer

Transformer for cached/precomputed features.

---
## molfeat.store - Model Store

Manages featurizer discovery, loading, and registration.

### ModelStore

Central hub for accessing available featurizers.

**Key Methods:**

- `available_models` - Property listing all available featurizers
- `search(name=None, **kwargs)` - Search for specific featurizers
- `load(name, **kwargs)` - Load a featurizer by name
- `register(name, card)` - Register a custom featurizer
**Usage:**

```python
from molfeat.store.modelstore import ModelStore

# Initialize store
store = ModelStore()

# List all available models
all_models = store.available_models
print(f"Found {len(all_models)} featurizers")

# Search for a specific model
results = store.search(name="ChemBERTa-77M-MLM")
if results:
    model_card = results[0]

    # View usage information
    model_card.usage()

    # Load the model
    transformer = model_card.load()

# Direct loading
transformer = store.load("ChemBERTa-77M-MLM")
```
**ModelCard Attributes:**

- `name` - Model identifier
- `description` - Model description
- `version` - Model version
- `authors` - Model authors
- `tags` - Categorization tags
- `usage()` - Display usage examples
- `load(**kwargs)` - Load the model
---

## Common Patterns

### Error Handling

```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

calc = FPCalculator("ecfp")

# Enable error tolerance
featurizer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    verbose=True,
    ignore_errors=True,
)

# Failed molecules return None
features = featurizer(smiles_with_errors)
```
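Because failed molecules come back as `None`, downstream code usually filters them out while keeping labels and features aligned. A small sketch with made-up values (the feature vectors are placeholders, not real fingerprints):

```python
# Hypothetical batch: the second SMILES failed to featurize (None)
smiles = ["CCO", "not_a_smiles", "c1ccccc1"]
features = [[0.1, 0.2], None, [0.3, 0.4]]

# Keep only molecules that featurized successfully
valid = [(s, f) for s, f in zip(smiles, features) if f is not None]
kept_smiles = [s for s, _ in valid]
# kept_smiles == ["CCO", "c1ccccc1"]
```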
### Data Type Control

```python
# NumPy float32 (default)
features = transformer(smiles, enforce_dtype=True)

# PyTorch tensors
import torch

transformer = MoleculeTransformer(calc, dtype=torch.float32)
features = transformer(smiles)
```
### Persistence and Reproducibility

```python
# Save transformer state
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")

# Load from saved state
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
transformer = MoleculeTransformer.from_state_json_file("config.json")
```
### Preprocessing

```python
# Manual preprocessing (preprocess is not applied automatically)
mol = transformer.preprocess("CCO")

# Batch transform
features = transformer.transform(smiles_list)
```
---

## Integration Examples

### Scikit-learn Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Create pipeline
pipeline = Pipeline([
    ("featurizer", MoleculeTransformer(FPCalculator("ecfp"))),
    ("classifier", RandomForestClassifier()),
])

# Fit and predict
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```
### PyTorch Integration

```python
import torch
from torch.utils.data import Dataset, DataLoader
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.smiles = smiles
        self.labels = labels
        self.transformer = transformer

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        # Wrap in a list: the transformer expects a batch of molecules
        features = self.transformer([self.smiles[idx]])[0]
        return torch.tensor(features), torch.tensor(self.labels[idx])

# Create dataset and dataloader
transformer = MoleculeTransformer(FPCalculator("ecfp"))
dataset = MoleculeDataset(smiles, labels, transformer)
loader = DataLoader(dataset, batch_size=32)
```
---

## Performance Tips

1. **Parallelization**: Use `n_jobs=-1` to utilize all CPU cores
2. **Batch Processing**: Process multiple molecules at once instead of looping
3. **Caching**: Leverage the built-in caching for pretrained models
4. **Data Types**: Use float32 instead of float64 when precision allows
5. **Error Handling**: Set `ignore_errors=True` for large datasets with potentially invalid molecules