skills/molfeat/references/api_reference.md
# Molfeat API Reference

## Core Modules

Molfeat is organized into several key modules, each covering a different aspect of molecular featurization:

- **`molfeat.store`** - Manages model loading, listing, and registration
- **`molfeat.calc`** - Provides calculators for single-molecule featurization
- **`molfeat.trans`** - Offers scikit-learn compatible transformers for batch processing
- **`molfeat.utils`** - Utility functions for data handling
- **`molfeat.viz`** - Visualization tools for molecular features

---

## molfeat.calc - Calculators

Calculators are callable objects that convert individual molecules into feature vectors. They accept either RDKit `Chem.Mol` objects or SMILES strings as input.
### SerializableCalculator (Base Class)

Abstract base class for all calculators. When subclassing, implement:
- `__call__()` - Required method for featurization
- `__len__()` - Optional, returns output length
- `columns` - Optional property, returns feature names
- `batch_compute()` - Optional, for efficient batch processing

**State Management Methods:**
- `to_state_json()` - Save calculator state as JSON
- `to_state_yaml()` - Save calculator state as YAML
- `from_state_dict()` - Load calculator from a state dictionary
- `to_state_dict()` - Export calculator state as a dictionary
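A minimal subclass sketch of the contract above, assuming `SerializableCalculator` is importable from `molfeat.calc`; the two features computed here (`n_heavy_atoms`, `mol_wt`) are illustrative, not part of molfeat.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors
from molfeat.calc import SerializableCalculator

class ToyCalculator(SerializableCalculator):
    """Toy calculator returning heavy-atom count and molecular weight."""

    def __call__(self, mol):
        # Accept SMILES or RDKit Mol, like the built-in calculators
        if isinstance(mol, str):
            mol = Chem.MolFromSmiles(mol)
        return np.array([mol.GetNumHeavyAtoms(), Descriptors.MolWt(mol)])

    def __len__(self):
        return 2

    @property
    def columns(self):
        return ["n_heavy_atoms", "mol_wt"]

calc = ToyCalculator()
print(calc("CCO"))  # e.g. [3., 46.07]
```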
### FPCalculator

Computes molecular fingerprints. Supports 15+ fingerprint methods.

**Supported Fingerprint Types:**

**Structural Fingerprints:**
- `ecfp` - Extended-connectivity fingerprints (circular)
- `fcfp` - Functional-class fingerprints
- `rdkit` - RDKit topological fingerprints
- `maccs` - MACCS keys (166 structural keys; RDKit returns a 167-bit vector)
- `avalon` - Avalon fingerprints
- `pattern` - Pattern fingerprints
- `layered` - Layered fingerprints

**Atom-based Fingerprints:**
- `atompair` - Atom pair fingerprints
- `atompair-count` - Counted atom pairs
- `topological` - Topological torsion fingerprints
- `topological-count` - Counted topological torsions

**Specialized Fingerprints:**
- `map4` - MinHashed atom-pair fingerprint up to 4 bonds
- `secfp` - SMILES extended connectivity fingerprint
- `erg` - Extended reduced graphs
- `estate` - Electrotopological state indices

**Parameters:**
- `method` (str) - Fingerprint type name
- `radius` (int) - Radius for circular fingerprints (default: 3)
- `fpSize` (int) - Fingerprint size (default: 2048)
- `includeChirality` (bool) - Include chirality information
- `counting` (bool) - Use count vectors instead of binary

**Usage:**
```python
from molfeat.calc import FPCalculator

# Create fingerprint calculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)

# Compute fingerprint for a single molecule
fp = calc("CCO")  # Returns numpy array

# Get fingerprint length
length = len(calc)  # 2048

# Get feature names
names = calc.columns
```

**Common Fingerprint Dimensions:**
- MACCS: 167 dimensions
- ECFP (default): 2048 dimensions
- MAP4 (default): 1024 dimensions
### Descriptor Calculators

**RDKitDescriptors2D**
Computes 2D molecular descriptors using RDKit.

```python
from molfeat.calc import RDKitDescriptors2D

calc = RDKitDescriptors2D()
descriptors = calc("CCO")  # Returns 200+ descriptors
```

**RDKitDescriptors3D**
Computes 3D molecular descriptors (requires conformer generation).

**MordredDescriptors**
Computes over 1800 molecular descriptors using Mordred.

```python
from molfeat.calc import MordredDescriptors

calc = MordredDescriptors()
descriptors = calc("CCO")
```
### Pharmacophore Calculators

**Pharmacophore2D**
RDKit's 2D pharmacophore fingerprint generation (see the sketch below).

**Pharmacophore3D**
Consensus pharmacophore fingerprints from multiple conformers.
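A short sketch of 2D pharmacophore featurization. It assumes `Pharmacophore2D` is importable from `molfeat.calc` and accepts a `factory` argument selecting the feature definitions (e.g., `"gobbi"`); verify both against your molfeat version.

```python
from molfeat.calc import Pharmacophore2D

# Gobbi-style 2D pharmacophore fingerprint (factory choice is an assumption)
calc = Pharmacophore2D(factory="gobbi")
fp = calc("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin
print(len(fp))
```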
**CATSCalculator**
Computes Chemically Advanced Template Search (CATS) descriptors - pharmacophore point pair distributions.

**Parameters:**
- `mode` - "2D" or "3D" distance calculations
- `dist_bins` - Distance bins for pair distributions
- `scale` - Scaling mode: "raw", "num", or "count"

```python
from molfeat.calc import CATSCalculator

calc = CATSCalculator(mode="2D", scale="raw")
cats = calc("CCO")  # Returns 21 descriptors by default
```
### Shape Descriptors

**USRDescriptors**
Ultrafast shape recognition descriptors (multiple variants).

**ElectroShapeDescriptors**
Electrostatic shape descriptors combining shape, chirality, and electrostatics.
### Graph-Based Calculators

**ScaffoldKeyCalculator**
Computes 40+ scaffold-based molecular properties.

**AtomCalculator**
Atom-level featurization for graph neural networks.

**BondCalculator**
Bond-level featurization for graph neural networks.
### Utility Function

**get_calculator()**
Factory function to instantiate calculators by name.

```python
from molfeat.calc import get_calculator

# Instantiate any calculator by name
calc = get_calculator("ecfp", radius=3)
calc = get_calculator("maccs")
calc = get_calculator("desc2D")
```

Raises `ValueError` for unsupported featurizers.
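Because unsupported names raise `ValueError`, name-based lookup can be guarded; a minimal sketch:

```python
from molfeat.calc import get_calculator

try:
    calc = get_calculator("not-a-real-featurizer")
except ValueError as err:
    print(f"Unsupported featurizer: {err}")
```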
---

## molfeat.trans - Transformers

Transformers wrap calculators into complete featurization pipelines for batch processing.

### MoleculeTransformer

Scikit-learn compatible transformer for batch molecular featurization.

**Key Parameters:**
- `featurizer` - Calculator or featurizer to use
- `n_jobs` (int) - Number of parallel jobs (-1 for all cores)
- `dtype` - Output data type (numpy float32/64, torch tensors)
- `verbose` (bool) - Enable verbose logging
- `ignore_errors` (bool) - Continue on failures (returns None for failed molecules)

**Essential Methods:**
- `transform(mols)` - Processes batches and returns representations
- `_transform(mol)` - Handles individual molecule featurization
- `__call__(mols)` - Convenience wrapper around `transform()`
- `preprocess(mol)` - Prepares input molecules (not applied automatically)
- `to_state_yaml_file(path)` - Save transformer configuration
- `from_state_yaml_file(path)` - Load transformer configuration

**Usage:**
```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
import datamol as dm

# Load molecules
smiles = dm.data.freesolv().sample(100).smiles.values

# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Featurize batch
features = transformer(smiles)  # Returns numpy array (100, 2048)

# Save configuration
transformer.to_state_yaml_file("ecfp_config.yml")

# Reload
transformer = MoleculeTransformer.from_state_yaml_file("ecfp_config.yml")
```

**Performance:** A test on 642 molecules showed a 3.4x speedup with 4 parallel jobs versus single-threaded processing.
### FeatConcat

Concatenates multiple featurizers into unified representations.

```python
from molfeat.trans import FeatConcat, MoleculeTransformer
from molfeat.calc import FPCalculator

# Combine multiple fingerprints
concat = FeatConcat([
    FPCalculator("maccs"),  # 167 dimensions
    FPCalculator("ecfp")    # 2048 dimensions
])

# Result: 2215-dimensional features
transformer = MoleculeTransformer(concat, n_jobs=-1)
features = transformer(smiles)
```
### PretrainedMolTransformer

Subclass of `MoleculeTransformer` for pre-trained deep learning models.

**Unique Features:**
- `_embed()` - Batched inference for neural networks
- `_convert()` - Transforms SMILES/molecules into model-compatible formats:
  - SELFIES strings for language models
  - DGL graphs for graph neural networks
- Integrated caching system for efficient storage

**Usage:**
```python
from molfeat.trans.pretrained import PretrainedMolTransformer

# Load pretrained model
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)

# Generate embeddings
embeddings = transformer(smiles)
```
### PrecomputedMolTransformer

Transformer for cached/precomputed features.

---
## molfeat.store - Model Store

Manages featurizer discovery, loading, and registration.

### ModelStore

Central hub for accessing available featurizers.

**Key Methods:**
- `available_models` - Property listing all available featurizers
- `search(name=None, **kwargs)` - Search for specific featurizers
- `load(name, **kwargs)` - Load a featurizer by name
- `register(name, card)` - Register a custom featurizer

**Usage:**
```python
from molfeat.store.modelstore import ModelStore

# Initialize store
store = ModelStore()

# List all available models
all_models = store.available_models
print(f"Found {len(all_models)} featurizers")

# Search for a specific model
results = store.search(name="ChemBERTa-77M-MLM")
if results:
    model_card = results[0]

    # View usage information
    model_card.usage()

    # Load the model
    transformer = model_card.load()

# Direct loading
transformer = store.load("ChemBERTa-77M-MLM")
```

**ModelCard Attributes:**
- `name` - Model identifier
- `description` - Model description
- `version` - Model version
- `authors` - Model authors
- `tags` - Categorization tags
- `usage()` - Display usage examples
- `load(**kwargs)` - Load the model

---
## Common Patterns

### Error Handling

```python
# Enable error tolerance
featurizer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    verbose=True,
    ignore_errors=True
)

# Failed molecules return None
features = featurizer(smiles_with_errors)
```
### Data Type Control

```python
# NumPy float32 (default)
features = transformer(smiles, enforce_dtype=True)

# PyTorch tensors
import torch
transformer = MoleculeTransformer(calc, dtype=torch.float32)
features = transformer(smiles)
```
### Persistence and Reproducibility

```python
# Save transformer state
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")

# Load from saved state
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
transformer = MoleculeTransformer.from_state_json_file("config.json")
```
### Preprocessing

```python
# Manual preprocessing (preprocess() is not applied automatically by transform())
mol = transformer.preprocess("CCO")

# Transform a batch (featurization only; preprocess molecules yourself first if needed)
features = transformer.transform(smiles_list)
```

---
## Integration Examples

### Scikit-learn Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Create pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"))),
    ('classifier', RandomForestClassifier())
])

# Fit and predict
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```
### PyTorch Integration

```python
import torch
from torch.utils.data import Dataset, DataLoader
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.smiles = smiles
        self.labels = labels
        self.transformer = transformer

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        # transform() expects a batch, so wrap the single SMILES in a list
        features = self.transformer([self.smiles[idx]])[0]
        return torch.tensor(features), torch.tensor(self.labels[idx])

# Create dataset and dataloader
transformer = MoleculeTransformer(FPCalculator("ecfp"))
dataset = MoleculeDataset(smiles, labels, transformer)
loader = DataLoader(dataset, batch_size=32)
```
---

## Performance Tips

1. **Parallelization**: Use `n_jobs=-1` to utilize all CPU cores
2. **Batch Processing**: Process multiple molecules at once instead of looping
3. **Caching**: Leverage built-in caching for pretrained models
4. **Data Types**: Use float32 instead of float64 when precision allows
5. **Error Handling**: Set `ignore_errors=True` for large datasets with potentially invalid molecules

A combined configuration sketch follows.
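The sketch below pulls tips 1, 2, 4, and 5 into one transformer configuration, using only the parameters documented above (values are illustrative):

```python
import torch
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    n_jobs=-1,            # 1. parallelize across all CPU cores
    dtype=torch.float32,  # 4. float32 output
    ignore_errors=True,   # 5. invalid molecules come back as None
)

# 2. batch processing: pass the whole list at once instead of looping
features = transformer(["CCO", "CC(=O)O", "c1ccccc1"])
```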
skills/molfeat/references/available_featurizers.md
# Available Featurizers in Molfeat

This document provides a comprehensive catalog of all featurizers available in molfeat, organized by category.

## Transformer-Based Language Models

Pre-trained transformer models for molecular embeddings using SMILES/SELFIES representations.

### RoBERTa-style Models
- **Roberta-Zinc480M-102M** - RoBERTa masked language model trained on ~480M SMILES strings from the ZINC database
- **ChemBERTa-77M-MLM** - Masked language model based on RoBERTa, trained on 77M PubChem compounds
- **ChemBERTa-77M-MTR** - Multitask regression version trained on PubChem compounds

### GPT-style Autoregressive Models
- **GPT2-Zinc480M-87M** - GPT-2 autoregressive language model trained on ~480M SMILES from ZINC
- **ChemGPT-1.2B** - Large transformer (1.2B parameters) pretrained on PubChem10M
- **ChemGPT-19M** - Medium transformer (19M parameters) pretrained on PubChem10M
- **ChemGPT-4.7M** - Small transformer (4.7M parameters) pretrained on PubChem10M

### Specialized Transformer Models
- **MolT5** - Self-supervised framework for molecule captioning and text-based generation
## Graph Neural Networks (GNNs)

Pre-trained graph neural network models operating on molecular graph structures.

### GIN (Graph Isomorphism Network) Variants
All pre-trained on ChEMBL molecules with different objectives:
- **gin-supervised-masking** - Supervised, with node-masking objective
- **gin-supervised-infomax** - Supervised, with graph-level mutual information maximization
- **gin-supervised-edgepred** - Supervised, with edge prediction objective
- **gin-supervised-contextpred** - Supervised, with context prediction objective

### Other Graph-Based Models
- **JTVAE_zinc_no_kl** - Junction-tree VAE for molecule generation (trained on ZINC)
- **Graphormer-pcqm4mv2** - Graph transformer pretrained on the PCQM4Mv2 quantum chemistry dataset for HOMO-LUMO gap prediction
## Molecular Descriptors

Calculators for physico-chemical properties and molecular characteristics.

### 2D Descriptors
- **desc2D** / **rdkit2D** - 200+ RDKit 2D molecular descriptors including:
  - Molecular weight, logP, TPSA
  - H-bond donors/acceptors
  - Rotatable bonds
  - Ring counts and aromaticity
  - Molecular complexity metrics

### 3D Descriptors
- **desc3D** / **rdkit3D** - RDKit 3D molecular descriptors (requires conformer generation):
  - Inertial moments
  - PMI (Principal Moments of Inertia) ratios
  - Asphericity, eccentricity
  - Radius of gyration

### Comprehensive Descriptor Sets
- **mordred** - Over 1800 molecular descriptors covering:
  - Constitutional descriptors
  - Topological indices
  - Connectivity indices
  - Information content
  - 2D/3D autocorrelations
  - WHIM descriptors
  - GETAWAY descriptors
  - And many more

### Electrotopological Descriptors
- **estate** - Electrotopological state (E-State) indices encoding:
  - Atomic environment information
  - Electronic and topological properties
  - Heteroatom contributions
## Molecular Fingerprints

Binary or count-based fixed-length vectors representing molecular substructures.

### Circular Fingerprints (ECFP-style)
- **ecfp** / **ecfp:2** / **ecfp:4** / **ecfp:6** - Extended-connectivity fingerprints
  - The numeric suffix denotes the diameter, so `ecfp:4` corresponds to radius 2 (see the sketch after this list)
  - Default: radius=3, 2048 bits
  - Most popular for similarity searching
- **ecfp-count** - Count version of ECFP (non-binary)
- **fcfp** / **fcfp-count** - Functional-class circular fingerprints
  - Similar to ECFP but uses functional groups
  - Better for pharmacophore-based similarity
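A quick sketch of instantiating the diameter-named variants through `get_calculator`, assuming the `ecfp:4`-style names listed above are accepted by the factory function (verify against your molfeat version):

```python
from molfeat.calc import get_calculator

# "ecfp:4" denotes diameter 4, i.e. radius 2 (name support is an assumption)
ecfp4 = get_calculator("ecfp:4")
ecfp6 = get_calculator("ecfp:6")
print(len(ecfp4), len(ecfp6))
```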
### Path-Based Fingerprints
- **rdkit** - RDKit topological fingerprints based on linear paths
- **pattern** - Pattern fingerprints (similar to MACCS but automated)
- **layered** - Layered fingerprints with multiple substructure layers

### Key-Based Fingerprints
- **maccs** - MACCS keys (166 structural keys)
  - Fixed set of predefined substructures
  - Good for scaffold hopping
  - Fast computation
- **avalon** - Avalon fingerprints
  - Similar to MACCS but with more features
  - Optimized for similarity searching
### Atom-Pair Fingerprints
- **atompair** - Atom pair fingerprints
  - Encodes pairs of atoms and the distance between them
  - Good for 3D similarity
- **atompair-count** - Count version of atom pairs

### Topological Torsion Fingerprints
- **topological** - Topological torsion fingerprints
  - Encodes sequences of 4 connected atoms
  - Captures local topology
- **topological-count** - Count version of topological torsions

### MinHashed Fingerprints
- **map4** - MinHashed Atom-Pair fingerprint up to 4 bonds
  - Combines atom-pair and ECFP concepts
  - Default: 1024 dimensions
  - Fast and efficient for large datasets
- **secfp** - SMILES Extended Connectivity Fingerprint
  - Operates directly on SMILES strings
  - Captures both substructure and atom-pair information

### Extended Reduced Graph
- **erg** - Extended Reduced Graph
  - Uses pharmacophoric points instead of atoms
  - Reduces graph complexity while preserving key features
## Pharmacophore Descriptors

Features based on pharmacologically relevant functional groups and their spatial relationships.

### CATS (Chemically Advanced Template Search)
- **cats2D** - 2D CATS descriptors
  - Pharmacophore point pair distributions
  - Distance based on shortest path
  - 21 descriptors by default
- **cats3D** - 3D CATS descriptors
  - Euclidean distance based
  - Requires conformer generation
- **cats2D_pharm** / **cats3D_pharm** - Pharmacophore variants

### Gobbi Pharmacophores
- **gobbi2D** - 2D pharmacophore fingerprints
  - 8 pharmacophore feature types, including:
    - Hydrophobic
    - Aromatic
    - H-bond acceptor
    - H-bond donor
    - Positive ionizable
    - Negative ionizable
    - Lumped hydrophobe
  - Good for virtual screening

### Pmapper Pharmacophores
- **pmapper2D** - 2D pharmacophore signatures
- **pmapper3D** - 3D pharmacophore signatures
  - High-dimensional pharmacophore descriptors
  - Useful for QSAR and similarity searching
## Shape Descriptors

Descriptors capturing 3D molecular shape and electrostatic properties.

### USR (Ultrafast Shape Recognition)
- **usr** - Basic USR descriptors
  - 12 dimensions encoding shape distribution
  - Extremely fast computation
- **usrcat** - USR with pharmacophoric constraints
  - 60 dimensions (12 per feature type)
  - Combines shape and pharmacophore information

### Electrostatic Shape
- **electroshape** - ElectroShape descriptors
  - Combines molecular shape, chirality, and electrostatics
  - Useful for protein-ligand docking predictions
## Scaffold-Based Descriptors

Descriptors based on molecular scaffolds and core structures.

### Scaffold Keys
- **scaffoldkeys** - Scaffold key calculator
  - 40+ scaffold-based properties
  - Bioisosteric scaffold representation
  - Captures core structural features
## Graph Featurizers for GNN Input

Atom- and bond-level features for constructing graph representations for Graph Neural Networks (a usage sketch follows the lists below).

### Atom-Level Features
- **atom-onehot** - One-hot encoded atom features
- **atom-default** - Default atom featurization including:
  - Atomic number
  - Degree, formal charge
  - Hybridization
  - Aromaticity
  - Number of hydrogen atoms

### Bond-Level Features
- **bond-onehot** - One-hot encoded bond features
- **bond-default** - Default bond featurization including:
  - Bond type (single, double, triple, aromatic)
  - Conjugation
  - Ring membership
  - Stereochemistry
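A brief sketch of instantiating an atom-level featurizer for a GNN input pipeline. It assumes the `atom-default` name above is accepted by `get_calculator` and says nothing about the exact return structure; check both against your molfeat version before wiring it into a model.

```python
import datamol as dm
from molfeat.calc import get_calculator

# Atom-level featurizer by name ("atom-default" per the list above; assumption)
atom_featurizer = get_calculator("atom-default")

mol = dm.to_mol("c1ccccc1O")  # phenol
feats = atom_featurizer(mol)
print(type(feats))  # inspect the returned structure before building graphs
```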
## Integrated Pretrained Model Collections

Molfeat integrates models from various sources:

### HuggingFace Models
Access to transformer models through the HuggingFace hub:
- ChemBERTa variants
- ChemGPT variants
- MolT5
- Custom uploaded models

### DGL-LifeSci Models
Pre-trained GNN models from DGL-LifeSci:
- GIN variants with different pre-training tasks
- AttentiveFP models
- MPNN models

### FCD (Fréchet ChemNet Distance)
- **fcd** - Pre-trained CNN for molecular generation evaluation

### Graphormer Models
- Graph transformers from Microsoft Research
- Pre-trained on quantum chemistry datasets
## Usage Notes

### Choosing a Featurizer

**For traditional ML (Random Forest, SVM, etc.):**
- Start with **ecfp** or **maccs** fingerprints
- Try **desc2D** for interpretable models
- Use **FeatConcat** to combine multiple fingerprints

**For deep learning:**
- Use **ChemBERTa** or **ChemGPT** for transformer embeddings
- Use **gin-supervised-*** for graph neural network embeddings
- Consider **Graphormer** for quantum property predictions

**For similarity searching:**
- **ecfp** - General purpose, most popular
- **maccs** - Fast, good for scaffold hopping
- **map4** - Efficient for large-scale searches
- **usr** / **usrcat** - 3D shape similarity

**For pharmacophore-based approaches:**
- **fcfp** - Functional group based
- **cats2D/3D** - Pharmacophore pair distributions
- **gobbi2D** - Explicit pharmacophore features

**For interpretability:**
- **desc2D** / **mordred** - Named descriptors
- **maccs** - Interpretable substructure keys
- **scaffoldkeys** - Scaffold-based features

### Model Dependencies

Some featurizers require optional dependencies:

- **DGL models** (gin-*, jtvae): `pip install "molfeat[dgl]"`
- **Graphormer**: `pip install "molfeat[graphormer]"`
- **Transformers** (ChemBERTa, ChemGPT, MolT5): `pip install "molfeat[transformer]"`
- **FCD**: `pip install "molfeat[fcd]"`
- **MAP4**: `pip install "molfeat[map4]"`
- **All dependencies**: `pip install "molfeat[all]"`
### Accessing All Available Models

```python
from molfeat.store.modelstore import ModelStore

store = ModelStore()
all_models = store.available_models

# Print all available featurizers
for model in all_models:
    print(f"{model.name}: {model.description}")

# Search for specific types
transformers = [m for m in all_models if "transformer" in m.tags]
gnn_models = [m for m in all_models if "gnn" in m.tags]
fingerprints = [m for m in all_models if "fingerprint" in m.tags]
```
## Performance Characteristics

### Computational Speed (relative)

**Fastest:**
- maccs
- ecfp
- rdkit fingerprints
- usr

**Medium:**
- desc2D
- cats2D
- Most fingerprints

**Slower:**
- mordred (1800+ descriptors)
- desc3D (requires conformer generation)
- 3D descriptors in general

**Slowest (first run):**
- Pretrained models (ChemBERTa, ChemGPT, GIN)
- Note: subsequent runs benefit from caching

### Dimensionality

**Low (< 200 dims):**
- maccs (167)
- usr (12)
- usrcat (60)

**Medium (200-2048 dims):**
- desc2D (~200)
- ecfp (2048 default, configurable)
- map4 (1024 default)

**High (roughly 2000+ dims):**
- mordred (1800+)
- Concatenated fingerprints
- Some transformer embeddings

**Variable:**
- Transformer models (typically 768-1024)
- GNN models (depends on architecture)
skills/molfeat/references/examples.md
# Molfeat Usage Examples

This document provides practical examples for common molfeat use cases.

## Installation

```bash
# Recommended: using conda/mamba
mamba install -c conda-forge molfeat

# Alternative: using pip
pip install molfeat

# With all optional dependencies
pip install "molfeat[all]"

# With specific dependencies
pip install "molfeat[dgl]"          # For GNN models
pip install "molfeat[graphormer]"   # For Graphormer
pip install "molfeat[transformer]"  # For ChemBERTa, ChemGPT
```

---
## Quick Start

### Basic Featurization Workflow

```python
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

# Load sample data
data = dm.data.freesolv().sample(100).smiles.values

# Single-molecule featurization
calc = FPCalculator("ecfp")
features_single = calc(data[0])
print(f"Single molecule features shape: {features_single.shape}")
# Output: (2048,)

# Batch featurization with parallelization
transformer = MoleculeTransformer(calc, n_jobs=-1)
features_batch = transformer(data)
print(f"Batch features shape: {features_batch.shape}")
# Output: (100, 2048)
```
---

## Calculator Examples

### Fingerprint Calculators

```python
from molfeat.calc import FPCalculator

# ECFP (Extended-Connectivity Fingerprints)
ecfp = FPCalculator("ecfp", radius=3, fpSize=2048)
fp = ecfp("CCO")  # Ethanol
print(f"ECFP shape: {fp.shape}")  # (2048,)

# MACCS keys
maccs = FPCalculator("maccs")
fp = maccs("c1ccccc1")  # Benzene
print(f"MACCS shape: {fp.shape}")  # (167,)

# Count-based fingerprints
ecfp_count = FPCalculator("ecfp-count", radius=3)
fp_count = ecfp_count("CC(C)CC(C)C")  # Non-binary counts

# MAP4 fingerprints
map4 = FPCalculator("map4")
fp = map4("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin
```
### Descriptor Calculators

```python
from molfeat.calc import RDKitDescriptors2D, MordredDescriptors

# RDKit 2D descriptors (200+ properties)
desc2d = RDKitDescriptors2D()
descriptors = desc2d("CCO")
print(f"Number of 2D descriptors: {len(descriptors)}")

# Get descriptor names
names = desc2d.columns
print(f"First 5 descriptors: {names[:5]}")

# Mordred descriptors (1800+ properties)
mordred = MordredDescriptors()
descriptors = mordred("c1ccccc1O")  # Phenol
print(f"Mordred descriptors: {len(descriptors)}")
```
### Pharmacophore Calculators

```python
from molfeat.calc import CATSCalculator

# 2D CATS descriptors
cats = CATSCalculator(mode="2D", scale="raw")
descriptors = cats("CC(C)Cc1ccc(C)cc1C")  # Substituted toluene example
print(f"CATS descriptors: {descriptors.shape}")  # (21,)

# 3D CATS descriptors (require a conformer)
cats3d = CATSCalculator(mode="3D", scale="num")
```
---

## Transformer Examples

### Basic Transformer Usage

```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Prepare data
smiles_list = [
    "CCO",
    "CC(=O)O",
    "c1ccccc1",
    "CC(C)O",
    "CCCC"
]

# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Transform molecules
features = transformer(smiles_list)
print(f"Features shape: {features.shape}")  # (5, 2048)
```
### Error Handling

```python
# Handle invalid SMILES gracefully
smiles_with_errors = [
    "CCO",       # Valid
    "invalid",   # Invalid
    "CC(=O)O",   # Valid
    "xyz123",    # Invalid
]

transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    n_jobs=-1,
    verbose=True,        # Log errors
    ignore_errors=True   # Continue on failure
)

features = transformer(smiles_with_errors)
# Returns an array with None for failed molecules
print(features)  # [array(...), None, array(...), None]
```
### Concatenating Multiple Featurizers

```python
from molfeat.trans import FeatConcat, MoleculeTransformer
from molfeat.calc import FPCalculator

# Combine MACCS (167) + ECFP (2048) = 2215 dimensions
concat_calc = FeatConcat([
    FPCalculator("maccs"),
    FPCalculator("ecfp", radius=3, fpSize=2048)
])

transformer = MoleculeTransformer(concat_calc, n_jobs=-1)
features = transformer(smiles_list)
print(f"Combined features shape: {features.shape}")  # (n, 2215)

# Triple combination
triple_concat = FeatConcat([
    FPCalculator("maccs"),
    FPCalculator("ecfp"),
    FPCalculator("rdkit")
])
```
### Saving and Loading Configurations

```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Create and save transformer
transformer = MoleculeTransformer(
    FPCalculator("ecfp", radius=3, fpSize=2048),
    n_jobs=-1
)

# Save to YAML
transformer.to_state_yaml_file("my_featurizer.yml")

# Save to JSON
transformer.to_state_json_file("my_featurizer.json")

# Load from saved state
loaded_transformer = MoleculeTransformer.from_state_yaml_file("my_featurizer.yml")

# Use loaded transformer
features = loaded_transformer(smiles_list)
```
---

## Pretrained Model Examples

### Using the ModelStore

```python
from molfeat.store.modelstore import ModelStore

# Initialize model store
store = ModelStore()

# List all available models
print(f"Total available models: {len(store.available_models)}")

# Search for specific models
chemberta_models = store.search(name="ChemBERTa")
for model in chemberta_models:
    print(f"- {model.name}: {model.description}")

# Get model information
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
print(f"Model: {model_card.name}")
print(f"Version: {model_card.version}")
print(f"Authors: {model_card.authors}")

# View usage instructions
model_card.usage()

# Load model directly
transformer = store.load("ChemBERTa-77M-MLM")
```
### ChemBERTa Embeddings

```python
from molfeat.trans.pretrained import PretrainedMolTransformer

# Load ChemBERTa model
chemberta = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)

# Generate embeddings
smiles = ["CCO", "CC(=O)O", "c1ccccc1"]
embeddings = chemberta(smiles)
print(f"ChemBERTa embeddings shape: {embeddings.shape}")
# Output: (3, 768) - 768-dimensional embeddings

# Use in an ML pipeline (labels: your target values)
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2
)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```
### ChemGPT Models

```python
# Small model (4.7M parameters)
chemgpt_small = PretrainedMolTransformer("ChemGPT-4.7M", n_jobs=-1)

# Medium model (19M parameters)
chemgpt_medium = PretrainedMolTransformer("ChemGPT-19M", n_jobs=-1)

# Large model (1.2B parameters)
chemgpt_large = PretrainedMolTransformer("ChemGPT-1.2B", n_jobs=-1)

# Generate embeddings
embeddings = chemgpt_small(smiles)
```
### Graph Neural Network Models

```python
# GIN models with different pre-training objectives
gin_masking = PretrainedMolTransformer("gin-supervised-masking", n_jobs=-1)
gin_infomax = PretrainedMolTransformer("gin-supervised-infomax", n_jobs=-1)
gin_edgepred = PretrainedMolTransformer("gin-supervised-edgepred", n_jobs=-1)

# Generate graph embeddings
embeddings = gin_masking(smiles)
print(f"GIN embeddings shape: {embeddings.shape}")

# Graphormer (for quantum chemistry)
graphormer = PretrainedMolTransformer("Graphormer-pcqm4mv2", n_jobs=-1)
embeddings = graphormer(smiles)
```
---

## Machine Learning Integration

### Scikit-learn Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Create ML pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# Train and evaluate
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)

# Cross-validation
scores = cross_val_score(pipeline, smiles_all, y_all, cv=5)
print(f"CV scores: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
### Grid Search for Hyperparameter Tuning

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Define pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', SVC())
])

# Define parameter grid
param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['rbf', 'linear'],
    'classifier__gamma': ['scale', 'auto']
}

# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(smiles_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
```
### Multiple Featurizer Comparison

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from molfeat.trans import FeatConcat, MoleculeTransformer
from molfeat.calc import FPCalculator, RDKitDescriptors2D

# Test different featurizers
featurizers = {
    'ECFP': FPCalculator("ecfp"),
    'MACCS': FPCalculator("maccs"),
    'RDKit': FPCalculator("rdkit"),
    'Descriptors': RDKitDescriptors2D(),
    'Combined': FeatConcat([
        FPCalculator("maccs"),
        FPCalculator("ecfp")
    ])
}

results = {}
for name, calc in featurizers.items():
    transformer = MoleculeTransformer(calc, n_jobs=-1)
    X_train = transformer(smiles_train)
    X_test = transformer(smiles_test)

    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)

    y_pred = clf.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred)
    results[name] = auc

    print(f"{name}: AUC = {auc:.3f}")
```
### PyTorch Deep Learning

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Custom dataset: featurize all molecules up front
class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.features = transformer(smiles)
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return (
            torch.tensor(self.features[idx], dtype=torch.float32),
            self.labels[idx]
        )

# Prepare data
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
train_dataset = MoleculeDataset(smiles_train, y_train, transformer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Simple neural network
class MoleculeClassifier(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.network(x)

# Train model
model = MoleculeClassifier(input_dim=2048)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()

for epoch in range(10):
    for batch_features, batch_labels in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_features).squeeze()
        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()
```

---
## Advanced Usage Patterns

### Custom Preprocessing

```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
import datamol as dm

class CustomTransformer(MoleculeTransformer):
    def preprocess(self, mol):
        """Custom preprocessing: standardize the molecule."""
        if isinstance(mol, str):
            mol = dm.to_mol(mol)

        # Standardize
        mol = dm.standardize_mol(mol)

        # Remove salts
        mol = dm.remove_salts(mol)

        return mol

# Use custom transformer
transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles_list)
```
|
||||
### Featurization with Conformers
|
||||
|
||||
```python
|
||||
import datamol as dm
|
||||
from molfeat.calc import RDKitDescriptors3D
|
||||
|
||||
# Generate conformers
|
||||
def prepare_3d_mol(smiles):
|
||||
mol = dm.to_mol(smiles)
|
||||
mol = dm.add_hs(mol)
|
||||
mol = dm.conform.generate_conformers(mol, n_confs=1)
|
||||
return mol
|
||||
|
||||
# 3D descriptors
|
||||
calc_3d = RDKitDescriptors3D()
|
||||
|
||||
smiles = "CC(C)Cc1ccc(C)cc1C"
|
||||
mol_3d = prepare_3d_mol(smiles)
|
||||
descriptors_3d = calc_3d(mol_3d)
|
||||
```
|
||||
|
||||
### Parallel Batch Processing

```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
import time

# Large dataset
smiles_large = load_large_dataset()  # e.g., 100,000 molecules

# Test different parallelization levels
for n_jobs in [1, 2, 4, -1]:
    transformer = MoleculeTransformer(
        FPCalculator("ecfp"),
        n_jobs=n_jobs
    )

    start = time.time()
    features = transformer(smiles_large)
    elapsed = time.time() - start

    print(f"n_jobs={n_jobs}: {elapsed:.2f}s")
```
### Caching for Expensive Operations

```python
from molfeat.trans.pretrained import PretrainedMolTransformer
import pickle

# Load expensive pretrained model
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)

# Cache embeddings for reuse
cache_file = "embeddings_cache.pkl"

try:
    # Try loading cached embeddings
    with open(cache_file, "rb") as f:
        embeddings = pickle.load(f)
    print("Loaded cached embeddings")
except FileNotFoundError:
    # Compute and cache
    embeddings = transformer(smiles_list)
    with open(cache_file, "wb") as f:
        pickle.dump(embeddings, f)
    print("Computed and cached embeddings")
```

---
## Common Workflows

### Virtual Screening Workflow

```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.ensemble import RandomForestClassifier

# 1. Prepare training data (known actives/inactives)
train_smiles = load_training_data()
train_labels = load_training_labels()  # 1=active, 0=inactive

# 2. Featurize training set
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)

# 3. Train classifier
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(X_train, train_labels)

# 4. Featurize screening library
screening_smiles = load_screening_library()  # e.g., 1M compounds
X_screen = transformer(screening_smiles)

# 5. Predict and rank
predictions = clf.predict_proba(X_screen)[:, 1]
ranked_indices = predictions.argsort()[::-1]

# 6. Get top hits
top_n = 1000
top_hits = [screening_smiles[i] for i in ranked_indices[:top_n]]
```
### QSAR Model Building

```python
from molfeat.calc import RDKitDescriptors2D
from molfeat.trans import MoleculeTransformer
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import numpy as np

# Load QSAR dataset
smiles = load_molecules()
y = load_activity_values()  # e.g., IC50, logP

# Featurize with interpretable descriptors
transformer = MoleculeTransformer(RDKitDescriptors2D(), n_jobs=-1)
X = transformer(smiles)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Build linear model
model = Ridge(alpha=1.0)
scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
print(f"R² = {scores.mean():.3f} (+/- {scores.std():.3f})")

# Fit final model
model.fit(X_scaled, y)

# Interpret feature importance
feature_names = transformer.featurizer.columns
importance = np.abs(model.coef_)
top_features_idx = importance.argsort()[-10:][::-1]

print("Top 10 important features:")
for idx in top_features_idx:
    print(f"  {feature_names[idx]}: {model.coef_[idx]:.3f}")
```
### Similarity Search

```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Query molecule
query_smiles = "CC(=O)Oc1ccccc1C(=O)O"  # Aspirin

# Database of molecules
database_smiles = load_molecule_database()  # Large collection

# Compute fingerprints
calc = FPCalculator("ecfp")
query_fp = calc(query_smiles).reshape(1, -1)

transformer = MoleculeTransformer(calc, n_jobs=-1)
database_fps = transformer(database_smiles)

# Compute similarity
similarities = cosine_similarity(query_fp, database_fps)[0]

# Find most similar
top_k = 10
top_indices = similarities.argsort()[-top_k:][::-1]

print(f"Top {top_k} similar molecules:")
for i, idx in enumerate(top_indices, 1):
    print(f"{i}. {database_smiles[idx]} (similarity: {similarities[idx]:.3f})")
```

---
## Troubleshooting

### Handling Invalid Molecules

```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Use ignore_errors to skip invalid molecules
transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    ignore_errors=True,
    verbose=True
)

# Filter out None values after transformation
features = transformer(smiles_list)
valid_mask = [f is not None for f in features]
valid_features = [f for f in features if f is not None]
valid_smiles = [s for s, m in zip(smiles_list, valid_mask) if m]
```
### Memory Management for Large Datasets

```python
import numpy as np

# Process in chunks for very large datasets
def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
    all_features = []

    for i in range(0, len(smiles_list), chunk_size):
        chunk = smiles_list[i:i+chunk_size]
        features = transformer(chunk)
        all_features.append(features)
        print(f"Processed {i+len(chunk)}/{len(smiles_list)}")

    return np.vstack(all_features)

# Use with a large dataset
features = featurize_in_chunks(large_smiles_list, transformer)
```
### Reproducibility

```python
import random
import numpy as np
import torch

# Set all random seeds
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)

# Save the exact configuration
transformer.to_state_yaml_file("config.yml")

# Document the version
import molfeat
print(f"molfeat version: {molfeat.__version__}")
```