# Molfeat API Reference
## Core Modules
Molfeat is organized into several key modules that provide different aspects of molecular featurization:
- **`molfeat.store`** - Manages model loading, listing, and registration
- **`molfeat.calc`** - Provides calculators for single-molecule featurization
- **`molfeat.trans`** - Offers scikit-learn compatible transformers for batch processing
- **`molfeat.utils`** - Utility functions for data handling
- **`molfeat.viz`** - Visualization tools for molecular features
---
## molfeat.calc - Calculators
Calculators are callable objects that convert individual molecules into feature vectors. They accept either RDKit `Chem.Mol` objects or SMILES strings as input.
### SerializableCalculator (Base Class)
Abstract base class for all calculators. Subclasses must implement `__call__()`; the other members are optional:
- `__call__()` - Required method for featurization
- `__len__()` - Optional, returns output length
- `columns` - Optional property, returns feature names
- `batch_compute()` - Optional, for efficient batch processing
**State Management Methods:**
- `to_state_json()` - Save calculator state as JSON
- `to_state_yaml()` - Save calculator state as YAML
- `from_state_dict()` - Load calculator from state dictionary
- `to_state_dict()` - Export calculator state as dictionary
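The state methods follow a common serialize/restore pattern. The sketch below illustrates that pattern with a hypothetical stand-in class: `ToyCalculator` is not part of molfeat, and the layout of its state dictionary is an assumption made for illustration. It only shows how a `to_state_dict()` / `from_state_dict()` pair can round-trip a calculator's configuration through JSON.

```python
import json


class ToyCalculator:
    """Hypothetical stand-in for a calculator; not molfeat's implementation."""

    def __init__(self, method="ecfp", length=2048):
        self.method = method
        self.length = length

    def __len__(self):
        # Length of the feature vector this calculator would produce
        return self.length

    def to_state_dict(self):
        # Capture everything needed to rebuild the object (assumed layout)
        return {
            "name": type(self).__name__,
            "args": {"method": self.method, "length": self.length},
        }

    def to_state_json(self):
        # Serialize the state dictionary as a JSON string
        return json.dumps(self.to_state_dict())

    @classmethod
    def from_state_dict(cls, state):
        # Rebuild the object from the saved constructor arguments
        return cls(**state["args"])


calc = ToyCalculator("maccs", length=167)
restored = ToyCalculator.from_state_dict(json.loads(calc.to_state_json()))
```

Storing constructor arguments rather than the object itself keeps the saved state small and human-readable, which is what makes JSON and YAML practical formats here.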
### FPCalculator
Computes molecular fingerprints. Supports 15+ fingerprint methods.
**Supported Fingerprint Types:**
**Structural Fingerprints:**
- `ecfp` - Extended-connectivity fingerprints (circular)
- `fcfp` - Functional-class fingerprints
- `rdkit` - RDKit topological fingerprints
- `maccs` - MACCS keys (166 structural keys; RDKit returns a 167-bit vector because bit 0 is unused)
- `avalon` - Avalon fingerprints
- `pattern` - Pattern fingerprints
- `layered` - Layered fingerprints
**Atom-based Fingerprints:**
- `atompair` - Atom pair fingerprints
- `atompair-count` - Counted atom pairs
- `topological` - Topological torsion fingerprints
- `topological-count` - Counted topological torsions
**Specialized Fingerprints:**
- `map4` - MinHashed atom-pair fingerprint up to 4 bonds
- `secfp` - SMILES extended connectivity fingerprint
- `erg` - Extended reduced graphs
- `estate` - Electrotopological state indices
**Parameters:**
- `method` (str) - Fingerprint type name
- `radius` (int) - Radius for circular fingerprints (default: 3)
- `fpSize` (int) - Fingerprint size (default: 2048)
- `includeChirality` (bool) - Include chirality information
- `counting` (bool) - Use count vectors instead of binary
**Usage:**
```python
from molfeat.calc import FPCalculator
# Create fingerprint calculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
# Compute fingerprint for single molecule
fp = calc("CCO") # Returns numpy array
# Get fingerprint length
length = len(calc) # 2048
# Get feature names
names = calc.columns
```
**Common Fingerprint Dimensions:**
- MACCS: 167 dimensions
- ECFP (default): 2048 dimensions
- MAP4 (default): 1024 dimensions
### Descriptor Calculators
**RDKitDescriptors2D**
Computes 2D molecular descriptors using RDKit.
```python
from molfeat.calc import RDKitDescriptors2D
calc = RDKitDescriptors2D()
descriptors = calc("CCO") # Returns 200+ descriptors
```
**RDKitDescriptors3D**
Computes 3D molecular descriptors (requires conformer generation).
**MordredDescriptors**
Calculates over 1800 molecular descriptors using Mordred.
```python
from molfeat.calc import MordredDescriptors
calc = MordredDescriptors()
descriptors = calc("CCO")
```
### Pharmacophore Calculators
**Pharmacophore2D**
RDKit's 2D pharmacophore fingerprint generation.
**Pharmacophore3D**
Consensus pharmacophore fingerprints from multiple conformers.
**CATSCalculator**
Computes Chemically Advanced Template Search (CATS) descriptors - pharmacophore point pair distributions.
**Parameters:**
- `mode` - "2D" or "3D" distance calculations
- `dist_bins` - Distance bins for pair distributions
- `scale` - Scaling mode: "raw", "num", or "count"
```python
from molfeat.calc import CATSCalculator
calc = CATSCalculator(mode="2D", scale="raw")
cats = calc("CCO") # Returns 21 descriptors by default
```
### Shape Descriptors
**USRDescriptors**
Ultrafast shape recognition descriptors (multiple variants).
**ElectroShapeDescriptors**
Electrostatic shape descriptors combining shape, chirality, and electrostatics.
### Graph-Based Calculators
**ScaffoldKeyCalculator**
Computes 40+ scaffold-based molecular properties.
**AtomCalculator**
Atom-level featurization for graph neural networks.
**BondCalculator**
Bond-level featurization for graph neural networks.
### Utility Function
**get_calculator()**
Factory function to instantiate calculators by name.
```python
from molfeat.calc import get_calculator
# Instantiate any calculator by name
calc = get_calculator("ecfp", radius=3)
calc = get_calculator("maccs")
calc = get_calculator("desc2D")
```
Raises `ValueError` for unsupported featurizers.
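A name-based factory of this kind usually amounts to a lookup table from names to constructors. The following is a minimal sketch of that idea, not molfeat's code: `_REGISTRY`, `make_calculator`, and the placeholder constructors are hypothetical names chosen for illustration.

```python
# Hypothetical registry mapping names to constructors (placeholders only)
_REGISTRY = {
    "ecfp": lambda **kw: ("ecfp", kw),
    "maccs": lambda **kw: ("maccs", kw),
}


def make_calculator(name, **params):
    """Return the object registered under `name`, or raise ValueError."""
    try:
        factory = _REGISTRY[name.lower()]
    except KeyError:
        raise ValueError(f"Unsupported featurizer: {name!r}") from None
    return factory(**params)


# Known name: keyword arguments are forwarded to the constructor
calc = make_calculator("ecfp", radius=3)

# Unknown name: the lookup fails with ValueError
try:
    make_calculator("not-a-fingerprint")
except ValueError as err:
    message = str(err)
```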
---
## molfeat.trans - Transformers
Transformers wrap calculators into complete featurization pipelines for batch processing.
### MoleculeTransformer
Scikit-learn compatible transformer for batch molecular featurization.
**Key Parameters:**
- `featurizer` - Calculator or featurizer to use
- `n_jobs` (int) - Number of parallel jobs (-1 for all cores)
- `dtype` - Output data type (numpy float32/64, torch tensors)
- `verbose` (bool) - Enable verbose logging
- `ignore_errors` (bool) - Continue on failures (returns None for failed molecules)
**Essential Methods:**
- `transform(mols)` - Processes batches and returns representations
- `_transform(mol)` - Handles individual molecule featurization
- `__call__(mols)` - Convenience wrapper around transform()
- `preprocess(mol)` - Prepares input molecules (not automatically applied)
- `to_state_yaml_file(path)` - Save transformer configuration
- `from_state_yaml_file(path)` - Load transformer configuration
**Usage:**
```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
import datamol as dm
# Load molecules
smiles = dm.data.freesolv().sample(100).smiles.values
# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Featurize batch
features = transformer(smiles) # Returns numpy array (100, 2048)
# Save configuration
transformer.to_state_yaml_file("ecfp_config.yml")
# Reload
transformer = MoleculeTransformer.from_state_yaml_file("ecfp_config.yml")
```
**Performance:** Testing on 642 molecules showed 3.4x speedup using 4 parallel jobs versus single-threaded processing.
### FeatConcat
Concatenates multiple featurizers into unified representations.
```python
from molfeat.trans import FeatConcat, MoleculeTransformer
from molfeat.calc import FPCalculator
# Combine multiple fingerprints
concat = FeatConcat([
    FPCalculator("maccs"),  # 167 dimensions
    FPCalculator("ecfp"),   # 2048 dimensions
])
# Result: 167 + 2048 = 2215-dimensional features
transformer = MoleculeTransformer(concat, n_jobs=-1)
features = transformer(smiles)  # `smiles` is a list of SMILES strings
```
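Concatenation places the feature vectors side by side, so the output width is the sum of the individual widths. A minimal sketch with plain Python lists standing in for the two fingerprints (the zero-filled vectors are placeholders, not real fingerprints):

```python
# Placeholder vectors with the MACCS and default ECFP widths
maccs_fp = [0] * 167
ecfp_fp = [0] * 2048

# Concatenation appends one vector to the other
combined = maccs_fp + ecfp_fp
total_width = len(combined)
print(total_width)  # 2215
```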
### PretrainedMolTransformer
Subclass of `MoleculeTransformer` for pre-trained deep learning models.
**Unique Features:**
- `_embed()` - Batched inference for neural networks
- `_convert()` - Transforms SMILES/molecules into model-compatible formats:
  - SELFIES strings for language models
  - DGL graphs for graph neural networks
- Integrated caching system for efficient storage
**Usage:**
```python
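from molfeat.trans.pretrained.hf_transformers import PretrainedHFTransformer
# PretrainedMolTransformer is a base class; load a concrete subclass instead.
# PretrainedHFTransformer wraps Hugging Face models such as ChemBERTa.
transformer = PretrainedHFTransformer(kind="ChemBERTa-77M-MLM", notation="smiles")
# Generate embeddings
embeddings = transformer(smiles)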
```
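Caching pays off because embedding a molecule with a neural network is expensive and the same molecule often recurs across datasets. The sketch below shows the general idea with a plain dictionary; `CachedEmbedder` and `fake_embed` are hypothetical and do not reflect how molfeat's cache is implemented.

```python
class CachedEmbedder:
    """Hypothetical wrapper that memoizes embeddings by input string."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.calls = 0  # number of times the expensive function ran

    def __call__(self, smiles_list):
        out = []
        for smi in smiles_list:
            if smi not in self.cache:
                # Cache miss: compute once and remember the result
                self.cache[smi] = self.embed_fn(smi)
                self.calls += 1
            out.append(self.cache[smi])
        return out


def fake_embed(smi):
    # Placeholder for model inference: a two-number "embedding"
    return [len(smi), smi.count("C")]


embedder = CachedEmbedder(fake_embed)
# "CCO" appears twice but is embedded only once
embeddings = embedder(["CCO", "CO", "CCO"])
```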
### PrecomputedMolTransformer
Transformer for cached/precomputed features.
---
## molfeat.store - Model Store
Manages featurizer discovery, loading, and registration.
### ModelStore
Central hub for accessing available featurizers.
**Key Methods:**
- `available_models` - Property listing all available featurizers
- `search(name=None, **kwargs)` - Search for specific featurizers
- `load(name, **kwargs)` - Load a featurizer by name
- `register(name, card)` - Register custom featurizer
**Usage:**
```python
from molfeat.store.modelstore import ModelStore
# Initialize store
store = ModelStore()
# List all available models
all_models = store.available_models
print(f"Found {len(all_models)} featurizers")
# Search for specific model
results = store.search(name="ChemBERTa-77M-MLM")
if results:
    model_card = results[0]
    # View usage information
    model_card.usage()
    # Load the model
    transformer = model_card.load()
# Direct loading
transformer = store.load("ChemBERTa-77M-MLM")
```
**ModelCard Attributes:**
- `name` - Model identifier
- `description` - Model description
- `version` - Model version
- `authors` - Model authors
- `tags` - Categorization tags
- `usage()` - Display usage examples
- `load(**kwargs)` - Load the model
---
## Common Patterns
### Error Handling
```python
# Enable error tolerance
featurizer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    verbose=True,
    ignore_errors=True,
)
# Failed molecules return None
features = featurizer(smiles_with_errors)
```
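Because failed molecules come back as `None`, downstream code usually needs to drop those rows while keeping labels aligned. A minimal sketch, assuming the output is a sequence with one entry per input; the `features` and `labels` values below are made-up placeholders, and `drop_failed` is a hypothetical helper:

```python
def drop_failed(features, labels):
    """Keep only (feature, label) pairs whose feature is not None."""
    kept = [(f, y) for f, y in zip(features, labels) if f is not None]
    clean_features = [f for f, _ in kept]
    clean_labels = [y for _, y in kept]
    return clean_features, clean_labels


# Placeholder output: the second molecule failed to featurize
features = [[1, 0, 1], None, [0, 1, 1]]
labels = [0.5, 1.2, -0.3]

clean_features, clean_labels = drop_failed(features, labels)
```

Filtering features and labels together in one pass avoids the misalignment that creeps in when the two lists are cleaned separately.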
### Data Type Control
```python
# Enforce the transformer's configured dtype on the output
features = transformer(smiles, enforce_dtype=True)
# PyTorch tensors
import torch
transformer = MoleculeTransformer(calc, dtype=torch.float32)
features = transformer(smiles)
```
### Persistence and Reproducibility
```python
# Save transformer state
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")
# Load from saved state
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
transformer = MoleculeTransformer.from_state_json_file("config.json")
```
### Preprocessing
```python
# preprocess() is not applied automatically; call it explicitly when needed
mol = transformer.preprocess("CCO")
# transform() featurizes the molecules as given
features = transformer.transform(smiles_list)
```
---
## Integration Examples
### Scikit-learn Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Create pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"))),
    ('classifier', RandomForestClassifier()),
])
# Fit and predict
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```
### PyTorch Integration
```python
import torch
from torch.utils.data import Dataset, DataLoader
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.smiles = smiles
        self.labels = labels
        self.transformer = transformer
    def __len__(self):
        return len(self.smiles)
    def __getitem__(self, idx):
        # Featurize one molecule: pass a one-item list and take its single row
        features = self.transformer([self.smiles[idx]])[0]
        return torch.tensor(features), torch.tensor(self.labels[idx])
# Create dataset and dataloader
transformer = MoleculeTransformer(FPCalculator("ecfp"))
dataset = MoleculeDataset(smiles, labels, transformer)
loader = DataLoader(dataset, batch_size=32)
```
---
## Performance Tips
1. **Parallelization**: Use `n_jobs=-1` to utilize all CPU cores
2. **Batch Processing**: Pass a whole list of molecules to the transformer instead of looping over them one at a time
3. **Caching**: Leverage built-in caching for pretrained models
4. **Data Types**: Use float32 instead of float64 when precision allows
5. **Error Handling**: Set `ignore_errors=True` for large datasets with potential invalid molecules