Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

skills/molfeat/SKILL.md

@@ -0,0 +1,505 @@
---
name: molfeat
description: "Molecular featurization for ML (100+ featurizers). ECFP, MACCS, descriptors, pretrained models (ChemBERTa), convert SMILES to features, for QSAR and molecular ML."
---
# Molfeat - Molecular Featurization Hub
## Overview
Molfeat is a comprehensive Python library for molecular featurization that unifies 100+ pre-trained embeddings and hand-crafted featurizers. Convert chemical structures (SMILES strings or RDKit molecules) into numerical representations for machine learning tasks including QSAR modeling, virtual screening, similarity searching, and deep learning applications. Features fast parallel processing, scikit-learn compatible transformers, and built-in caching.
## When to Use This Skill
This skill should be used when working with:
- **Molecular machine learning**: Building QSAR/QSPR models, property prediction
- **Virtual screening**: Ranking compound libraries for biological activity
- **Similarity searching**: Finding structurally similar molecules
- **Chemical space analysis**: Clustering, visualization, dimensionality reduction
- **Deep learning**: Training neural networks on molecular data
- **Featurization pipelines**: Converting SMILES to ML-ready representations
- **Cheminformatics**: Any task requiring molecular feature extraction
## Installation
```bash
uv pip install molfeat
# With all optional dependencies
uv pip install "molfeat[all]"
```
**Optional dependencies for specific featurizers:**
- `molfeat[dgl]` - GNN models (GIN variants)
- `molfeat[graphormer]` - Graphormer models
- `molfeat[transformer]` - ChemBERTa, ChemGPT, MolT5
- `molfeat[fcd]` - FCD descriptors
- `molfeat[map4]` - MAP4 fingerprints
## Core Concepts
Molfeat organizes featurization into three hierarchical classes:
### 1. Calculators (`molfeat.calc`)
Callable objects that convert individual molecules into feature vectors. Accept RDKit `Chem.Mol` objects or SMILES strings.
**Use calculators for:**
- Single molecule featurization
- Custom processing loops
- Direct feature computation
**Example:**
```python
from molfeat.calc import FPCalculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
features = calc("CCO") # Returns numpy array (2048,)
```
### 2. Transformers (`molfeat.trans`)
Scikit-learn compatible transformers that wrap calculators for batch processing with parallelization.
**Use transformers for:**
- Batch featurization of molecular datasets
- Integration with scikit-learn pipelines
- Parallel processing (automatic CPU utilization)
**Example:**
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles_list) # Parallel processing
```
### 3. Pretrained Transformers (`molfeat.trans.pretrained`)
Specialized transformers for deep learning models with batched inference and caching.
**Use pretrained transformers for:**
- State-of-the-art molecular embeddings
- Transfer learning from large chemical datasets
- Deep learning feature extraction
**Example:**
```python
from molfeat.trans.pretrained import PretrainedMolTransformer
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
embeddings = transformer(smiles_list) # Deep learning embeddings
```
## Quick Start Workflow
### Basic Featurization
```python
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
# Load molecular data
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]
# Create calculator and transformer
calc = FPCalculator("ecfp", radius=3)
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Featurize molecules
features = transformer(smiles)
print(f"Shape: {features.shape}") # (4, 2048)
```
### Save and Load Configuration
```python
# Save featurizer configuration for reproducibility
transformer.to_state_yaml_file("featurizer_config.yml")
# Reload exact configuration
loaded = MoleculeTransformer.from_state_yaml_file("featurizer_config.yml")
```
### Handle Errors Gracefully
```python
# Process dataset with potentially invalid SMILES
transformer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    ignore_errors=True,  # Continue on failures
    verbose=True         # Log error details
)
features = transformer(smiles_with_errors)
# Returns None for failed molecules
```
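With `ignore_errors=True`, failed molecules come back as `None`. A short follow-up sketch (mirroring the filtering pattern in `references/examples.md`) drops them together with their SMILES before stacking features:
```python
# Keep only molecules that featurized successfully
valid_pairs = [(s, f) for s, f in zip(smiles_with_errors, features) if f is not None]
valid_smiles = [s for s, _ in valid_pairs]
valid_features = [f for _, f in valid_pairs]
```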
## Choosing the Right Featurizer
### For Traditional Machine Learning (RF, SVM, XGBoost)
**Start with fingerprints:**
```python
# ECFP - Most popular, general-purpose
FPCalculator("ecfp", radius=3, fpSize=2048)
# MACCS - Fast, good for scaffold hopping
FPCalculator("maccs")
# MAP4 - Efficient for large-scale screening
FPCalculator("map4")
```
**For interpretable models:**
```python
# RDKit 2D descriptors (200+ named properties)
from molfeat.calc import RDKitDescriptors2D
RDKitDescriptors2D()
# Mordred (1800+ comprehensive descriptors)
from molfeat.calc import MordredDescriptors
MordredDescriptors()
```
**Combine multiple featurizers:**
```python
from molfeat.trans import FeatConcat
concat = FeatConcat([
    FPCalculator("maccs"),  # 167 dimensions
    FPCalculator("ecfp")    # 2048 dimensions
])  # Result: 2215-dimensional combined features
```
### For Deep Learning
**Transformer-based embeddings:**
```python
# ChemBERTa - Pre-trained on 77M PubChem compounds
PretrainedMolTransformer("ChemBERTa-77M-MLM")
# ChemGPT - Autoregressive language model
PretrainedMolTransformer("ChemGPT-1.2B")
```
**Graph neural networks:**
```python
# GIN models with different pre-training objectives
PretrainedMolTransformer("gin-supervised-masking")
PretrainedMolTransformer("gin-supervised-infomax")
# Graphormer for quantum chemistry
PretrainedMolTransformer("Graphormer-pcqm4mv2")
```
### For Similarity Searching
```python
# ECFP - General purpose, most widely used
FPCalculator("ecfp")
# MACCS - Fast, scaffold-based similarity
FPCalculator("maccs")
# MAP4 - Efficient for large databases
FPCalculator("map4")
# USR/USRCAT - 3D shape similarity
from molfeat.calc import USRDescriptors
USRDescriptors()
```
### For Pharmacophore-Based Approaches
```python
# FCFP - Functional group based
FPCalculator("fcfp")
# CATS - Pharmacophore pair distributions
from molfeat.calc import CATSCalculator
CATSCalculator(mode="2D")
# Gobbi - Explicit pharmacophore features (via the pharmacophore calculator;
# gobbi2D is not one of FPCalculator's fingerprint methods)
from molfeat.calc import Pharmacophore2D
Pharmacophore2D(factory="gobbi")
```
## Common Workflows
### Building a QSAR Model
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
# Featurize molecules
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X = transformer(smiles_train)
# Train model
model = RandomForestRegressor(n_estimators=100)
scores = cross_val_score(model, X, y_train, cv=5)
print(f"R² = {scores.mean():.3f}")
# Save configuration for deployment
transformer.to_state_yaml_file("production_featurizer.yml")
```
### Virtual Screening Pipeline
```python
from sklearn.ensemble import RandomForestClassifier
# Train on known actives/inactives
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)
clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_train, train_labels)
# Screen large library
X_screen = transformer(screening_library) # e.g., 1M compounds
predictions = clf.predict_proba(X_screen)[:, 1]
# Rank and select top hits
top_indices = predictions.argsort()[::-1][:1000]
top_hits = [screening_library[i] for i in top_indices]
```
### Similarity Search
```python
from sklearn.metrics.pairwise import cosine_similarity
# Query molecule
calc = FPCalculator("ecfp")
query_fp = calc(query_smiles).reshape(1, -1)
# Database fingerprints
transformer = MoleculeTransformer(calc, n_jobs=-1)
database_fps = transformer(database_smiles)
# Compute similarity
similarities = cosine_similarity(query_fp, database_fps)[0]
top_similar = similarities.argsort()[-10:][::-1]
```
### Scikit-learn Pipeline Integration
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
# Create end-to-end pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
# Train and predict directly on SMILES
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```
### Comparing Multiple Featurizers
```python
featurizers = {
    'ECFP': MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1),
    'MACCS': MoleculeTransformer(FPCalculator("maccs"), n_jobs=-1),
    'Descriptors': MoleculeTransformer(RDKitDescriptors2D(), n_jobs=-1),
    # Pretrained transformers are already batch featurizers; no wrapping needed
    'ChemBERTa': PretrainedMolTransformer("ChemBERTa-77M-MLM")
}
results = {}
for name, transformer in featurizers.items():
    X = transformer(smiles)
    # Evaluate with your ML model
    score = evaluate_model(X, y)
    results[name] = score
```
## Discovering Available Featurizers
Use the ModelStore to explore all available featurizers:
```python
from molfeat.store.modelstore import ModelStore
store = ModelStore()
# List all available models
all_models = store.available_models
print(f"Total featurizers: {len(all_models)}")
# Search for specific models
chemberta_models = store.search(name="ChemBERTa")
for model in chemberta_models:
    print(f"- {model.name}: {model.description}")
# Get usage information
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
model_card.usage() # Display usage examples
# Load model
transformer = store.load("ChemBERTa-77M-MLM")
```
## Advanced Features
### Custom Preprocessing
```python
import datamol as dm

class CustomTransformer(MoleculeTransformer):
    def preprocess(self, mol):
        """Custom preprocessing pipeline"""
        if isinstance(mol, str):
            mol = dm.to_mol(mol)
        mol = dm.standardize_mol(mol)
        mol = dm.remove_salts(mol)
        return mol

transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
```
### Batch Processing Large Datasets
```python
import numpy as np

def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
    """Process large datasets in chunks to manage memory"""
    all_features = []
    for i in range(0, len(smiles_list), chunk_size):
        chunk = smiles_list[i:i+chunk_size]
        features = transformer(chunk)
        all_features.append(features)
    return np.vstack(all_features)
```
### Caching Expensive Embeddings
```python
import pickle
cache_file = "embeddings_cache.pkl"
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
try:
    with open(cache_file, "rb") as f:
        embeddings = pickle.load(f)
except FileNotFoundError:
    embeddings = transformer(smiles_list)
    with open(cache_file, "wb") as f:
        pickle.dump(embeddings, f)
```
## Performance Tips
1. **Use parallelization**: Set `n_jobs=-1` to utilize all CPU cores
2. **Batch processing**: Process multiple molecules at once instead of loops
3. **Choose appropriate featurizers**: Fingerprints are faster than deep learning models
4. **Cache pretrained models**: Leverage built-in caching for repeated use
5. **Use float32**: Set `dtype=np.float32` when precision allows
6. **Handle errors efficiently**: Use `ignore_errors=True` for large datasets
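A minimal sketch combining tips 1, 5, and 6 (parallelism, float32 output, and error tolerance) in one transformer:
```python
import numpy as np
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    n_jobs=-1,           # tip 1: use all CPU cores
    dtype=np.float32,    # tip 5: compact output dtype
    ignore_errors=True   # tip 6: skip invalid molecules
)
```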
## Common Featurizers Reference
**Quick reference for frequently used featurizers:**
| Featurizer | Type | Dimensions | Speed | Use Case |
|------------|------|------------|-------|----------|
| `ecfp` | Fingerprint | 2048 | Fast | General purpose |
| `maccs` | Fingerprint | 167 | Very fast | Scaffold similarity |
| `desc2D` | Descriptors | 200+ | Fast | Interpretable models |
| `mordred` | Descriptors | 1800+ | Medium | Comprehensive features |
| `map4` | Fingerprint | 1024 | Fast | Large-scale screening |
| `ChemBERTa-77M-MLM` | Deep learning | 768 | Slow* | Transfer learning |
| `gin-supervised-masking` | GNN | Variable | Slow* | Graph-based models |
*First run is slow; subsequent runs benefit from caching
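Each row can be instantiated by name: fingerprints and descriptors via `get_calculator`, pretrained models via the ModelStore (both documented in `references/api_reference.md`):
```python
from molfeat.calc import get_calculator
from molfeat.store.modelstore import ModelStore

ecfp = get_calculator("ecfp")      # fingerprint rows
desc = get_calculator("desc2D")    # descriptor rows
chemberta = ModelStore().load("ChemBERTa-77M-MLM")  # deep learning rows
```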
## Resources
This skill includes comprehensive reference documentation:
### references/api_reference.md
Complete API documentation covering:
- `molfeat.calc` - All calculator classes and parameters
- `molfeat.trans` - Transformer classes and methods
- `molfeat.store` - ModelStore usage
- Common patterns and integration examples
- Performance optimization tips
**When to load:** Reference when implementing specific calculators, understanding transformer parameters, or integrating with scikit-learn/PyTorch.
### references/available_featurizers.md
Comprehensive catalog of all 100+ featurizers organized by category:
- Transformer-based language models (ChemBERTa, ChemGPT)
- Graph neural networks (GIN, Graphormer)
- Molecular descriptors (RDKit, Mordred)
- Fingerprints (ECFP, MACCS, MAP4, and 15+ others)
- Pharmacophore descriptors (CATS, Gobbi)
- Shape descriptors (USR, ElectroShape)
- Scaffold-based descriptors
**When to load:** Reference when selecting the optimal featurizer for a specific task, exploring available options, or understanding featurizer characteristics.
**Search tip:** Use grep to find specific featurizer types:
```bash
grep -i "chembert" references/available_featurizers.md
grep -i "pharmacophore" references/available_featurizers.md
```
### references/examples.md
Practical code examples for common scenarios:
- Installation and quick start
- Calculator and transformer examples
- Pretrained model usage
- Scikit-learn and PyTorch integration
- Virtual screening workflows
- QSAR model building
- Similarity searching
- Troubleshooting and best practices
**When to load:** Reference when implementing specific workflows, troubleshooting issues, or learning molfeat patterns.
## Troubleshooting
### Invalid Molecules
Enable error handling to skip invalid SMILES:
```python
transformer = MoleculeTransformer(
    calc,
    ignore_errors=True,
    verbose=True
)
```
### Memory Issues with Large Datasets
Process in chunks (see the `featurize_in_chunks` helper under Advanced Features) or use streaming approaches for datasets larger than 100K molecules.
### Pretrained Model Dependencies
Some models require additional packages. Install specific extras:
```bash
uv pip install "molfeat[transformer]" # For ChemBERTa/ChemGPT
uv pip install "molfeat[dgl]" # For GIN models
```
### Reproducibility
Save exact configurations and document versions:
```python
transformer.to_state_yaml_file("config.yml")
import molfeat
print(f"molfeat version: {molfeat.__version__}")
```
## Additional Resources
- **Official Documentation**: https://molfeat-docs.datamol.io/
- **GitHub Repository**: https://github.com/datamol-io/molfeat
- **PyPI Package**: https://pypi.org/project/molfeat/
- **Tutorial**: https://portal.valencelabs.com/datamol/post/types-of-featurizers-b1e8HHrbFMkbun6

skills/molfeat/references/api_reference.md

@@ -0,0 +1,428 @@
# Molfeat API Reference
## Core Modules
Molfeat is organized into several key modules that provide different aspects of molecular featurization:
- **`molfeat.store`** - Manages model loading, listing, and registration
- **`molfeat.calc`** - Provides calculators for single-molecule featurization
- **`molfeat.trans`** - Offers scikit-learn compatible transformers for batch processing
- **`molfeat.utils`** - Utility functions for data handling
- **`molfeat.viz`** - Visualization tools for molecular features
---
## molfeat.calc - Calculators
Calculators are callable objects that convert individual molecules into feature vectors. They accept either RDKit `Chem.Mol` objects or SMILES strings as input.
### SerializableCalculator (Base Class)
Abstract base class for all calculators. Subclasses must implement:
- `__call__()` - Required method for featurization
- `__len__()` - Optional, returns output length
- `columns` - Optional property, returns feature names
- `batch_compute()` - Optional, for efficient batch processing
**State Management Methods:**
- `to_state_json()` - Save calculator state as JSON
- `to_state_yaml()` - Save calculator state as YAML
- `from_state_dict()` - Load calculator from state dictionary
- `to_state_dict()` - Export calculator state as dictionary
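As a hedged sketch of this contract (the calculator itself is a hypothetical toy, not part of molfeat), a custom calculator only needs `__call__` plus the optional helpers:
```python
import numpy as np
import datamol as dm
from molfeat.calc import SerializableCalculator

class HeavyAtomCalculator(SerializableCalculator):
    """Toy calculator returning a single heavy-atom-count feature."""

    def __call__(self, mol):
        if isinstance(mol, str):
            mol = dm.to_mol(mol)
        return np.array([mol.GetNumHeavyAtoms()], dtype=np.float32)

    def __len__(self):
        return 1

    @property
    def columns(self):
        return ["num_heavy_atoms"]
```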
### FPCalculator
Computes molecular fingerprints. Supports 15+ fingerprint methods.
**Supported Fingerprint Types:**
**Structural Fingerprints:**
- `ecfp` - Extended-connectivity fingerprints (circular)
- `fcfp` - Functional-class fingerprints
- `rdkit` - RDKit topological fingerprints
- `maccs` - MACCS keys (166 structural keys; RDKit returns a 167-bit vector)
- `avalon` - Avalon fingerprints
- `pattern` - Pattern fingerprints
- `layered` - Layered fingerprints
**Atom-based Fingerprints:**
- `atompair` - Atom pair fingerprints
- `atompair-count` - Counted atom pairs
- `topological` - Topological torsion fingerprints
- `topological-count` - Counted topological torsions
**Specialized Fingerprints:**
- `map4` - MinHashed atom-pair fingerprint up to 4 bonds
- `secfp` - SMILES extended connectivity fingerprint
- `erg` - Extended reduced graphs
- `estate` - Electrotopological state indices
**Parameters:**
- `method` (str) - Fingerprint type name
- `radius` (int) - Radius for circular fingerprints (default: 3)
- `fpSize` (int) - Fingerprint size (default: 2048)
- `includeChirality` (bool) - Include chirality information
- `counting` (bool) - Use count vectors instead of binary
**Usage:**
```python
from molfeat.calc import FPCalculator
# Create fingerprint calculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
# Compute fingerprint for single molecule
fp = calc("CCO") # Returns numpy array
# Get fingerprint length
length = len(calc) # 2048
# Get feature names
names = calc.columns
```
**Common Fingerprint Dimensions:**
- MACCS: 167 dimensions
- ECFP (default): 2048 dimensions
- MAP4 (default): 1024 dimensions
### Descriptor Calculators
**RDKitDescriptors2D**
Computes 2D molecular descriptors using RDKit.
```python
from molfeat.calc import RDKitDescriptors2D
calc = RDKitDescriptors2D()
descriptors = calc("CCO") # Returns 200+ descriptors
```
**RDKitDescriptors3D**
Computes 3D molecular descriptors (requires conformer generation).
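A short sketch of the required conformer step, mirroring the 3D example in `references/examples.md` (assumes datamol's `conformers.generate` helper):
```python
import datamol as dm
from molfeat.calc import RDKitDescriptors3D

mol = dm.to_mol("CC(C)Cc1ccc(C)cc1C")
mol = dm.add_hs(mol)
mol = dm.conformers.generate(mol, n_confs=1)  # 3D coordinates required

calc = RDKitDescriptors3D()
descriptors = calc(mol)
```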
**MordredDescriptors**
Calculates over 1800 molecular descriptors using Mordred.
```python
from molfeat.calc import MordredDescriptors
calc = MordredDescriptors()
descriptors = calc("CCO")
```
### Pharmacophore Calculators
**Pharmacophore2D**
RDKit's 2D pharmacophore fingerprint generation.
**Pharmacophore3D**
Consensus pharmacophore fingerprints from multiple conformers.
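A hedged sketch, assuming `Pharmacophore2D` exposes a `factory` argument selecting the feature definitions (the `gobbi2D`/`pmapper2D` entries in `references/available_featurizers.md` suggest gobbi and pmapper factories):
```python
from molfeat.calc import Pharmacophore2D

ph2d = Pharmacophore2D(factory="gobbi")
fp = ph2d("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```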
**CATSCalculator**
Computes Chemically Advanced Template Search (CATS) descriptors - pharmacophore point pair distributions.
**Parameters:**
- `mode` - "2D" or "3D" distance calculations
- `dist_bins` - Distance bins for pair distributions
- `scale` - Scaling mode: "raw", "num", or "count"
```python
from molfeat.calc import CATSCalculator
calc = CATSCalculator(mode="2D", scale="raw")
cats = calc("CCO") # Returns 21 descriptors by default
```
### Shape Descriptors
**USRDescriptors**
Ultrafast shape recognition descriptors (multiple variants).
**ElectroShapeDescriptors**
Electrostatic shape descriptors combining shape, chirality, and electrostatics.
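Shape descriptors operate on 3D structures, so conformers must be generated first; a hedged sketch (same datamol assumption as the 3D-descriptor example above):
```python
import datamol as dm
from molfeat.calc import USRDescriptors

mol = dm.conformers.generate(dm.add_hs(dm.to_mol("CCO")), n_confs=1)
usr = USRDescriptors()
shape = usr(mol)  # 12-dimensional shape descriptor per the catalog
```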
### Graph-Based Calculators
**ScaffoldKeyCalculator**
Computes 40+ scaffold-based molecular properties.
**AtomCalculator**
Atom-level featurization for graph neural networks.
**BondCalculator**
Bond-level featurization for graph neural networks.
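A hedged sketch instantiating the scaffold-key calculator through the documented `get_calculator` factory, using the catalog name `scaffoldkeys` from `references/available_featurizers.md`:
```python
from molfeat.calc import get_calculator

scaffold_calc = get_calculator("scaffoldkeys")  # 40+ scaffold-based properties
props = scaffold_calc("CC(=O)Oc1ccccc1C(=O)O")
```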
### Utility Function
**get_calculator()**
Factory function to instantiate calculators by name.
```python
from molfeat.calc import get_calculator
# Instantiate any calculator by name
calc = get_calculator("ecfp", radius=3)
calc = get_calculator("maccs")
calc = get_calculator("desc2D")
```
Raises `ValueError` for unsupported featurizers.
---
## molfeat.trans - Transformers
Transformers wrap calculators into complete featurization pipelines for batch processing.
### MoleculeTransformer
Scikit-learn compatible transformer for batch molecular featurization.
**Key Parameters:**
- `featurizer` - Calculator or featurizer to use
- `n_jobs` (int) - Number of parallel jobs (-1 for all cores)
- `dtype` - Output data type (numpy float32/64, torch tensors)
- `verbose` (bool) - Enable verbose logging
- `ignore_errors` (bool) - Continue on failures (returns None for failed molecules)
**Essential Methods:**
- `transform(mols)` - Processes batches and returns representations
- `_transform(mol)` - Handles individual molecule featurization
- `__call__(mols)` - Convenience wrapper around transform()
- `preprocess(mol)` - Prepares input molecules (not automatically applied)
- `to_state_yaml_file(path)` - Save transformer configuration
- `from_state_yaml_file(path)` - Load transformer configuration
**Usage:**
```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
import datamol as dm
# Load molecules
smiles = dm.data.freesolv().sample(100).smiles.values
# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Featurize batch
features = transformer(smiles) # Returns numpy array (100, 2048)
# Save configuration
transformer.to_state_yaml_file("ecfp_config.yml")
# Reload
transformer = MoleculeTransformer.from_state_yaml_file("ecfp_config.yml")
```
**Performance:** Testing on 642 molecules showed 3.4x speedup using 4 parallel jobs versus single-threaded processing.
### FeatConcat
Concatenates multiple featurizers into unified representations.
```python
from molfeat.trans import FeatConcat
from molfeat.calc import FPCalculator
# Combine multiple fingerprints
concat = FeatConcat([
    FPCalculator("maccs"),  # 167 dimensions
    FPCalculator("ecfp")    # 2048 dimensions
])
# Result: 2215-dimensional features
transformer = MoleculeTransformer(concat, n_jobs=-1)
features = transformer(smiles)
```
### PretrainedMolTransformer
Subclass of `MoleculeTransformer` for pre-trained deep learning models.
**Unique Features:**
- `_embed()` - Batched inference for neural networks
- `_convert()` - Transforms SMILES/molecules into model-compatible formats:
  - SELFIES strings for language models
  - DGL graphs for graph neural networks
- Integrated caching system for efficient storage
**Usage:**
```python
from molfeat.trans.pretrained import PretrainedMolTransformer
# Load pretrained model
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
# Generate embeddings
embeddings = transformer(smiles)
```
### PrecomputedMolTransformer
Transformer for cached/precomputed features.
---
## molfeat.store - Model Store
Manages featurizer discovery, loading, and registration.
### ModelStore
Central hub for accessing available featurizers.
**Key Methods:**
- `available_models` - Property listing all available featurizers
- `search(name=None, **kwargs)` - Search for specific featurizers
- `load(name, **kwargs)` - Load a featurizer by name
- `register(name, card)` - Register custom featurizer
**Usage:**
```python
from molfeat.store.modelstore import ModelStore
# Initialize store
store = ModelStore()
# List all available models
all_models = store.available_models
print(f"Found {len(all_models)} featurizers")
# Search for specific model
results = store.search(name="ChemBERTa-77M-MLM")
if results:
    model_card = results[0]
    # View usage information
    model_card.usage()
    # Load the model
    transformer = model_card.load()
# Direct loading
transformer = store.load("ChemBERTa-77M-MLM")
```
**ModelCard Attributes:**
- `name` - Model identifier
- `description` - Model description
- `version` - Model version
- `authors` - Model authors
- `tags` - Categorization tags
- `usage()` - Display usage examples
- `load(**kwargs)` - Load the model
---
## Common Patterns
### Error Handling
```python
# Enable error tolerance
featurizer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    verbose=True,
    ignore_errors=True
)
# Failed molecules return None
features = featurizer(smiles_with_errors)
```
### Data Type Control
```python
# NumPy float32 (default)
features = transformer(smiles, enforce_dtype=True)
# PyTorch tensors
import torch
transformer = MoleculeTransformer(calc, dtype=torch.float32)
features = transformer(smiles)
```
### Persistence and Reproducibility
```python
# Save transformer state
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")
# Load from saved state
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
transformer = MoleculeTransformer.from_state_json_file("config.json")
```
### Preprocessing
```python
# Manual preprocessing
mol = transformer.preprocess("CCO")
# Transform with preprocessing
features = transformer.transform(smiles_list)
```
---
## Integration Examples
### Scikit-learn Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Create pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"))),
    ('classifier', RandomForestClassifier())
])
# Fit and predict
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```
### PyTorch Integration
```python
import torch
from torch.utils.data import Dataset, DataLoader
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.smiles = smiles
        self.labels = labels
        self.transformer = transformer

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        # Transformers expect batches, so wrap the single SMILES in a list
        features = self.transformer([self.smiles[idx]])[0]
        return torch.tensor(features), torch.tensor(self.labels[idx])

# Create dataset and dataloader
transformer = MoleculeTransformer(FPCalculator("ecfp"))
dataset = MoleculeDataset(smiles, labels, transformer)
loader = DataLoader(dataset, batch_size=32)
```
---
## Performance Tips
1. **Parallelization**: Use `n_jobs=-1` to utilize all CPU cores
2. **Batch Processing**: Process multiple molecules at once instead of loops
3. **Caching**: Leverage built-in caching for pretrained models
4. **Data Types**: Use float32 instead of float64 when precision allows
5. **Error Handling**: Set `ignore_errors=True` for large datasets with potential invalid molecules

skills/molfeat/references/available_featurizers.md

@@ -0,0 +1,333 @@
# Available Featurizers in Molfeat
This document provides a comprehensive catalog of all featurizers available in molfeat, organized by category.
## Transformer-Based Language Models
Pre-trained transformer models for molecular embeddings using SMILES/SELFIES representations.
### RoBERTa-style Models
- **Roberta-Zinc480M-102M** - RoBERTa masked language model trained on ~480M SMILES strings from ZINC database
- **ChemBERTa-77M-MLM** - Masked language model based on RoBERTa trained on 77M PubChem compounds
- **ChemBERTa-77M-MTR** - Multitask regression version trained on PubChem compounds
### GPT-style Autoregressive Models
- **GPT2-Zinc480M-87M** - GPT-2 autoregressive language model trained on ~480M SMILES from ZINC
- **ChemGPT-1.2B** - Large transformer (1.2B parameters) pretrained on PubChem10M
- **ChemGPT-19M** - Medium transformer (19M parameters) pretrained on PubChem10M
- **ChemGPT-4.7M** - Small transformer (4.7M parameters) pretrained on PubChem10M
### Specialized Transformer Models
- **MolT5** - Self-supervised framework for molecule captioning and text-based generation
## Graph Neural Networks (GNNs)
Pre-trained graph neural network models operating on molecular graph structures.
### GIN (Graph Isomorphism Network) Variants
All pre-trained on ChEMBL molecules with different objectives:
- **gin-supervised-masking** - Supervised with node masking objective
- **gin-supervised-infomax** - Supervised with graph-level mutual information maximization
- **gin-supervised-edgepred** - Supervised with edge prediction objective
- **gin-supervised-contextpred** - Supervised with context prediction objective
### Other Graph-Based Models
- **JTVAE_zinc_no_kl** - Junction-tree VAE for molecule generation (trained on ZINC)
- **Graphormer-pcqm4mv2** - Graph transformer pretrained on PCQM4Mv2 quantum chemistry dataset for HOMO-LUMO gap prediction
## Molecular Descriptors
Calculators for physico-chemical properties and molecular characteristics.
### 2D Descriptors
- **desc2D** / **rdkit2D** - 200+ RDKit 2D molecular descriptors including:
  - Molecular weight, logP, TPSA
  - H-bond donors/acceptors
  - Rotatable bonds
  - Ring counts and aromaticity
  - Molecular complexity metrics
### 3D Descriptors
- **desc3D** / **rdkit3D** - RDKit 3D molecular descriptors (requires conformer generation):
  - Inertial moments
  - PMI (Principal Moments of Inertia) ratios
  - Asphericity, eccentricity
  - Radius of gyration
### Comprehensive Descriptor Sets
- **mordred** - Over 1800 molecular descriptors covering:
  - Constitutional descriptors
  - Topological indices
  - Connectivity indices
  - Information content
  - 2D/3D autocorrelations
  - WHIM descriptors
  - GETAWAY descriptors
  - And many more
### Electrotopological Descriptors
- **estate** - Electrotopological state (E-State) indices encoding:
  - Atomic environment information
  - Electronic and topological properties
  - Heteroatom contributions
## Molecular Fingerprints
Binary or count-based fixed-length vectors representing molecular substructures.
### Circular Fingerprints (ECFP-style)
- **ecfp** / **ecfp:2** / **ecfp:4** / **ecfp:6** - Extended-connectivity fingerprints
  - The `:2`/`:4`/`:6` suffix is the diameter, i.e. twice the radius
  - Default: radius=3, 2048 bits
  - Most popular for similarity searching
- **ecfp-count** - Count version of ECFP (non-binary)
- **fcfp** / **fcfp-count** - Functional-class circular fingerprints
  - Similar to ECFP but uses functional groups
  - Better for pharmacophore-based similarity
### Path-Based Fingerprints
- **rdkit** - RDKit topological fingerprints based on linear paths
- **pattern** - Pattern fingerprints (similar to MACCS but automated)
- **layered** - Layered fingerprints with multiple substructure layers
### Key-Based Fingerprints
- **maccs** - MACCS keys (166 structural keys, 167-bit vector in RDKit)
  - Fixed set of predefined substructures
  - Good for scaffold hopping
  - Fast computation
- **avalon** - Avalon fingerprints
  - Similar to MACCS but more features
  - Optimized for similarity searching
### Atom-Pair Fingerprints
- **atompair** - Atom pair fingerprints
  - Encodes pairs of atoms and the distance between them
  - Good for 3D similarity
- **atompair-count** - Count version of atom pairs
### Topological Torsion Fingerprints
- **topological** - Topological torsion fingerprints
  - Encodes sequences of 4 connected atoms
  - Captures local topology
- **topological-count** - Count version of topological torsions
### MinHashed Fingerprints
- **map4** - MinHashed Atom-Pair fingerprint up to 4 bonds
  - Combines atom-pair and ECFP concepts
  - Default: 1024 dimensions
  - Fast and efficient for large datasets
- **secfp** - SMILES Extended Connectivity Fingerprint
  - Operates directly on SMILES strings
  - Captures both substructure and atom-pair information
### Extended Reduced Graph
- **erg** - Extended Reduced Graph
  - Uses pharmacophoric points instead of atoms
  - Reduces graph complexity while preserving key features
## Pharmacophore Descriptors
Features based on pharmacologically relevant functional groups and their spatial relationships.
### CATS (Chemically Advanced Template Search)
- **cats2D** - 2D CATS descriptors
  - Pharmacophore point pair distributions
  - Distances based on shortest path
  - 21 descriptors by default
- **cats3D** - 3D CATS descriptors
  - Euclidean distance based
  - Requires conformer generation
- **cats2D_pharm** / **cats3D_pharm** - Pharmacophore variants
### Gobbi Pharmacophores
- **gobbi2D** - 2D pharmacophore fingerprints
  - 8 pharmacophore feature types:
    - Hydrophobic
    - Aromatic
    - H-bond acceptor
    - H-bond donor
    - Positive ionizable
    - Negative ionizable
    - Lumped hydrophobe
  - Good for virtual screening
### Pmapper Pharmacophores
- **pmapper2D** / **pmapper3D** - 2D and 3D pharmacophore signatures
  - High-dimensional pharmacophore descriptors
  - Useful for QSAR and similarity searching
## Shape Descriptors
Descriptors capturing 3D molecular shape and electrostatic properties.
### USR (Ultrafast Shape Recognition)
- **usr** - Basic USR descriptors
  - 12 dimensions encoding shape distribution
  - Extremely fast computation
- **usrcat** - USR with pharmacophoric constraints
  - 60 dimensions (12 per feature type)
  - Combines shape and pharmacophore information
### Electrostatic Shape
- **electroshape** - ElectroShape descriptors
  - Combines molecular shape, chirality, and electrostatics
  - Useful for protein-ligand docking predictions
## Scaffold-Based Descriptors
Descriptors based on molecular scaffolds and core structures.
### Scaffold Keys
- **scaffoldkeys** - Scaffold key calculator
  - 40+ scaffold-based properties
  - Bioisosteric scaffold representation
  - Captures core structural features
## Graph Featurizers for GNN Input
Atom and bond-level features for constructing graph representations for Graph Neural Networks.
### Atom-Level Features
- **atom-onehot** - One-hot encoded atom features
- **atom-default** - Default atom featurization including:
  - Atomic number
  - Degree, formal charge
  - Hybridization
  - Aromaticity
  - Number of hydrogen atoms
### Bond-Level Features
- **bond-onehot** - One-hot encoded bond features
- **bond-default** - Default bond featurization including:
  - Bond type (single, double, triple, aromatic)
  - Conjugation
  - Ring membership
  - Stereochemistry
## Integrated Pretrained Model Collections
Molfeat integrates models from various sources:
### HuggingFace Models
Access to transformer models through HuggingFace hub:
- ChemBERTa variants
- ChemGPT variants
- MolT5
- Custom uploaded models
### DGL-LifeSci Models
Pre-trained GNN models from DGL-Life:
- GIN variants with different pre-training tasks
- AttentiveFP models
- MPNN models
### FCD (Fréchet ChemNet Distance)
- **fcd** - Pre-trained CNN for molecular generation evaluation
### Graphormer Models
- Graph transformers from Microsoft Research
- Pre-trained on quantum chemistry datasets
## Usage Notes
### Choosing a Featurizer
**For traditional ML (Random Forest, SVM, etc.):**
- Start with **ecfp** or **maccs** fingerprints
- Try **desc2D** for interpretable models
- Use **FeatConcat** to combine multiple fingerprints (see the sketch at the end of this section)
**For deep learning:**
- Use **ChemBERTa** or **ChemGPT** for transformer embeddings
- Use **gin-supervised-*** for graph neural network embeddings
- Consider **Graphormer** for quantum property predictions
**For similarity searching:**
- **ecfp** - General purpose, most popular
- **maccs** - Fast, good for scaffold hopping
- **map4** - Efficient for large-scale searches
- **usr** / **usrcat** - 3D shape similarity
**For pharmacophore-based approaches:**
- **fcfp** - Functional group based
- **cats2D/3D** - Pharmacophore pair distributions
- **gobbi2D** - Explicit pharmacophore features
**For interpretability:**
- **desc2D** / **mordred** - Named descriptors
- **maccs** - Interpretable substructure keys
- **scaffoldkeys** - Scaffold-based features
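A minimal sketch of the **FeatConcat** combination recommended above (same dimensions as the SKILL quick reference):
```python
from molfeat.calc import FPCalculator
from molfeat.trans import FeatConcat, MoleculeTransformer

# MACCS (167) + ECFP (2048) = 2215-dimensional combined features
concat = FeatConcat([
    FPCalculator("maccs"),
    FPCalculator("ecfp")
])
transformer = MoleculeTransformer(concat, n_jobs=-1)
```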
### Model Dependencies
Some featurizers require optional dependencies:
- **DGL models** (gin-*, jtvae): `pip install "molfeat[dgl]"`
- **Graphormer**: `pip install "molfeat[graphormer]"`
- **Transformers** (ChemBERTa, ChemGPT, MolT5): `pip install "molfeat[transformer]"`
- **FCD**: `pip install "molfeat[fcd]"`
- **MAP4**: `pip install "molfeat[map4]"`
- **All dependencies**: `pip install "molfeat[all]"`
### Accessing All Available Models
```python
from molfeat.store.modelstore import ModelStore
store = ModelStore()
all_models = store.available_models
# Print all available featurizers
for model in all_models:
    print(f"{model.name}: {model.description}")
# Search for specific types
transformers = [m for m in all_models if "transformer" in m.tags]
gnn_models = [m for m in all_models if "gnn" in m.tags]
fingerprints = [m for m in all_models if "fingerprint" in m.tags]
```
## Performance Characteristics
### Computational Speed (relative)
**Fastest:**
- maccs
- ecfp
- rdkit fingerprints
- usr
**Medium:**
- desc2D
- cats2D
- Most fingerprints
**Slower:**
- mordred (1800+ descriptors)
- desc3D (requires conformer generation)
- 3D descriptors in general
**Slowest (first run):**
- Pretrained models (ChemBERTa, ChemGPT, GIN)
- Note: Subsequent runs benefit from caching
### Dimensionality
**Low (< 200 dims):**
- maccs (167)
- usr (12)
- usrcat (60)
**Medium (200-2000 dims):**
- desc2D (~200)
- ecfp (2048 default, configurable)
- map4 (1024 default)
**High (> 2000 dims):**
- mordred (1800+)
- Concatenated fingerprints
- Some transformer embeddings
**Variable:**
- Transformer models (typically 768-1024)
- GNN models (depends on architecture)

skills/molfeat/references/examples.md

@@ -0,0 +1,723 @@
# Molfeat Usage Examples
This document provides practical examples for common molfeat use cases.
## Installation
```bash
# Recommended: Using conda/mamba
mamba install -c conda-forge molfeat
# Alternative: Using pip
pip install molfeat
# With all optional dependencies
pip install "molfeat[all]"
# With specific dependencies
pip install "molfeat[dgl]" # For GNN models
pip install "molfeat[graphormer]" # For Graphormer
pip install "molfeat[transformer]" # For ChemBERTa, ChemGPT
```
---
## Quick Start
### Basic Featurization Workflow
```python
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
# Load sample data
data = dm.data.freesolv().sample(100).smiles.values
# Single molecule featurization
calc = FPCalculator("ecfp")
features_single = calc(data[0])
print(f"Single molecule features shape: {features_single.shape}")
# Output: (2048,)
# Batch featurization with parallelization
transformer = MoleculeTransformer(calc, n_jobs=-1)
features_batch = transformer(data)
print(f"Batch features shape: {features_batch.shape}")
# Output: (100, 2048)
```
---
## Calculator Examples
### Fingerprint Calculators
```python
from molfeat.calc import FPCalculator
# ECFP (Extended-Connectivity Fingerprints)
ecfp = FPCalculator("ecfp", radius=3, fpSize=2048)
fp = ecfp("CCO") # Ethanol
print(f"ECFP shape: {fp.shape}") # (2048,)
# MACCS keys
maccs = FPCalculator("maccs")
fp = maccs("c1ccccc1") # Benzene
print(f"MACCS shape: {fp.shape}") # (167,)
# Count-based fingerprints
ecfp_count = FPCalculator("ecfp-count", radius=3)
fp_count = ecfp_count("CC(C)CC(C)C") # Non-binary counts
# MAP4 fingerprints
map4 = FPCalculator("map4")
fp = map4("CC(=O)Oc1ccccc1C(=O)O") # Aspirin
```
### Descriptor Calculators
```python
from molfeat.calc import RDKitDescriptors2D, MordredDescriptors
# RDKit 2D descriptors (200+ properties)
desc2d = RDKitDescriptors2D()
descriptors = desc2d("CCO")
print(f"Number of 2D descriptors: {len(descriptors)}")
# Get descriptor names
names = desc2d.columns
print(f"First 5 descriptors: {names[:5]}")
# Mordred descriptors (1800+ properties)
mordred = MordredDescriptors()
descriptors = mordred("c1ccccc1O") # Phenol
print(f"Mordred descriptors: {len(descriptors)}")
```
### Pharmacophore Calculators
```python
from molfeat.calc import CATSCalculator
# 2D CATS descriptors
cats = CATSCalculator(mode="2D", scale="raw")
descriptors = cats("CC(C)Cc1ccc(C)cc1C") # Cymene
print(f"CATS descriptors: {descriptors.shape}") # (21,)
# 3D CATS descriptors (requires conformer)
cats3d = CATSCalculator(mode="3D", scale="num")
```
---
## Transformer Examples
### Basic Transformer Usage
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
import datamol as dm
# Prepare data
smiles_list = [
    "CCO",
    "CC(=O)O",
    "c1ccccc1",
    "CC(C)O",
    "CCCC"
]
# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Transform molecules
features = transformer(smiles_list)
print(f"Features shape: {features.shape}") # (5, 2048)
```
### Error Handling
```python
# Handle invalid SMILES gracefully
smiles_with_errors = [
    "CCO",       # Valid
    "invalid",   # Invalid
    "CC(=O)O",   # Valid
    "xyz123",    # Invalid
]
transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    n_jobs=-1,
    verbose=True,        # Log errors
    ignore_errors=True   # Continue on failure
)
features = transformer(smiles_with_errors)
# Returns: array with None for failed molecules
print(features) # [array(...), None, array(...), None]
```
### Concatenating Multiple Featurizers
```python
from molfeat.trans import FeatConcat, MoleculeTransformer
from molfeat.calc import FPCalculator
# Combine MACCS (167) + ECFP (2048) = 2215 dimensions
concat_calc = FeatConcat([
    FPCalculator("maccs"),
    FPCalculator("ecfp", radius=3, fpSize=2048)
])
transformer = MoleculeTransformer(concat_calc, n_jobs=-1)
features = transformer(smiles_list)
print(f"Combined features shape: {features.shape}") # (n, 2215)
# Triple combination
triple_concat = FeatConcat([
    FPCalculator("maccs"),
    FPCalculator("ecfp"),
    FPCalculator("rdkit")
])
```
### Saving and Loading Configurations
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Create and save transformer
transformer = MoleculeTransformer(
    FPCalculator("ecfp", radius=3, fpSize=2048),
    n_jobs=-1
)
# Save to YAML
transformer.to_state_yaml_file("my_featurizer.yml")
# Save to JSON
transformer.to_state_json_file("my_featurizer.json")
# Load from saved state
loaded_transformer = MoleculeTransformer.from_state_yaml_file("my_featurizer.yml")
# Use loaded transformer
features = loaded_transformer(smiles_list)
```
---
## Pretrained Model Examples
### Using the ModelStore
```python
from molfeat.store.modelstore import ModelStore
# Initialize model store
store = ModelStore()
# List all available models
print(f"Total available models: {len(store.available_models)}")
# Search for specific models
chemberta_models = store.search(name="ChemBERTa")
for model in chemberta_models:
    print(f"- {model.name}: {model.description}")
# Get model information
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
print(f"Model: {model_card.name}")
print(f"Version: {model_card.version}")
print(f"Authors: {model_card.authors}")
# View usage instructions
model_card.usage()
# Load model directly
transformer = store.load("ChemBERTa-77M-MLM")
```
### ChemBERTa Embeddings
```python
from molfeat.trans.pretrained import PretrainedMolTransformer
# Load ChemBERTa model
chemberta = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
# Generate embeddings
smiles = ["CCO", "CC(=O)O", "c1ccccc1"]
embeddings = chemberta(smiles)
print(f"ChemBERTa embeddings shape: {embeddings.shape}")
# Output: (3, 768) - 768-dimensional embeddings
# Use in ML pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
# `labels` holds the corresponding target values
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2
)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```
### ChemGPT Models
```python
# Small model (4.7M parameters)
chemgpt_small = PretrainedMolTransformer("ChemGPT-4.7M", n_jobs=-1)
# Medium model (19M parameters)
chemgpt_medium = PretrainedMolTransformer("ChemGPT-19M", n_jobs=-1)
# Large model (1.2B parameters)
chemgpt_large = PretrainedMolTransformer("ChemGPT-1.2B", n_jobs=-1)
# Generate embeddings
embeddings = chemgpt_small(smiles)
```
### Graph Neural Network Models
```python
# GIN models with different pre-training objectives
gin_masking = PretrainedMolTransformer("gin-supervised-masking", n_jobs=-1)
gin_infomax = PretrainedMolTransformer("gin-supervised-infomax", n_jobs=-1)
gin_edgepred = PretrainedMolTransformer("gin-supervised-edgepred", n_jobs=-1)
# Generate graph embeddings
embeddings = gin_masking(smiles)
print(f"GIN embeddings shape: {embeddings.shape}")
# Graphormer (for quantum chemistry)
graphormer = PretrainedMolTransformer("Graphormer-pcqm4mv2", n_jobs=-1)
embeddings = graphormer(smiles)
```
---
## Machine Learning Integration
### Scikit-learn Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Create ML pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
# Train and evaluate
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
# Cross-validation
scores = cross_val_score(pipeline, smiles_all, y_all, cv=5)
print(f"CV scores: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
### Grid Search for Hyperparameter Tuning
```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Define pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', SVC())
])
# Define parameter grid
param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['rbf', 'linear'],
    'classifier__gamma': ['scale', 'auto']
}
# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(smiles_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
```
### Multiple Featurizer Comparison
```python
from sklearn.metrics import roc_auc_score
# Test different featurizers
featurizers = {
    'ECFP': FPCalculator("ecfp"),
    'MACCS': FPCalculator("maccs"),
    'RDKit': FPCalculator("rdkit"),
    'Descriptors': RDKitDescriptors2D(),
    'Combined': FeatConcat([
        FPCalculator("maccs"),
        FPCalculator("ecfp")
    ])
}
results = {}
for name, calc in featurizers.items():
    transformer = MoleculeTransformer(calc, n_jobs=-1)
    X_train = transformer(smiles_train)
    X_test = transformer(smiles_test)
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    y_pred = clf.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred)
    results[name] = auc
    print(f"{name}: AUC = {auc:.3f}")
```
### PyTorch Deep Learning
```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Custom dataset
class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.features = transformer(smiles)
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return (
            torch.tensor(self.features[idx], dtype=torch.float32),
            self.labels[idx]
        )
# Prepare data
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
train_dataset = MoleculeDataset(smiles_train, y_train, transformer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
# Simple neural network
class MoleculeClassifier(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.network(x)
# Train model
model = MoleculeClassifier(input_dim=2048)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()
for epoch in range(10):
    for batch_features, batch_labels in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_features).squeeze()
        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()
```
---
## Advanced Usage Patterns
### Custom Preprocessing
```python
from molfeat.trans import MoleculeTransformer
import datamol as dm
from molfeat.calc import FPCalculator

class CustomTransformer(MoleculeTransformer):
    def preprocess(self, mol):
        """Custom preprocessing: standardize molecule"""
        if isinstance(mol, str):
            mol = dm.to_mol(mol)
        # Standardize
        mol = dm.standardize_mol(mol)
        # Remove salts
        mol = dm.remove_salts(mol)
        return mol
# Use custom transformer
transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles_list)
```
### Featurization with Conformers
```python
import datamol as dm
from molfeat.calc import RDKitDescriptors3D
# Generate conformers
def prepare_3d_mol(smiles):
    mol = dm.to_mol(smiles)
    mol = dm.add_hs(mol)
    # Conformer generation lives in datamol's conformers module
    mol = dm.conformers.generate(mol, n_confs=1)
    return mol
# 3D descriptors
calc_3d = RDKitDescriptors3D()
smiles = "CC(C)Cc1ccc(C)cc1C"
mol_3d = prepare_3d_mol(smiles)
descriptors_3d = calc_3d(mol_3d)
```
### Parallel Batch Processing
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
import time
# Large dataset
smiles_large = load_large_dataset() # e.g., 100,000 molecules
# Test different parallelization levels
for n_jobs in [1, 2, 4, -1]:
    transformer = MoleculeTransformer(
        FPCalculator("ecfp"),
        n_jobs=n_jobs
    )
    start = time.time()
    features = transformer(smiles_large)
    elapsed = time.time() - start
    print(f"n_jobs={n_jobs}: {elapsed:.2f}s")
```
### Caching for Expensive Operations
```python
from molfeat.trans.pretrained import PretrainedMolTransformer
import pickle
# Load expensive pretrained model
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
# Cache embeddings for reuse
cache_file = "embeddings_cache.pkl"
try:
    # Try loading cached embeddings
    with open(cache_file, "rb") as f:
        embeddings = pickle.load(f)
    print("Loaded cached embeddings")
except FileNotFoundError:
    # Compute and cache
    embeddings = transformer(smiles_list)
    with open(cache_file, "wb") as f:
        pickle.dump(embeddings, f)
    print("Computed and cached embeddings")
```
---
## Common Workflows
### Virtual Screening Workflow
```python
from molfeat.calc import FPCalculator
from sklearn.ensemble import RandomForestClassifier
import datamol as dm
# 1. Prepare training data (known actives/inactives)
train_smiles = load_training_data()
train_labels = load_training_labels() # 1=active, 0=inactive
# 2. Featurize training set
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)
# 3. Train classifier
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(X_train, train_labels)
# 4. Featurize screening library
screening_smiles = load_screening_library() # e.g., 1M compounds
X_screen = transformer(screening_smiles)
# 5. Predict and rank
predictions = clf.predict_proba(X_screen)[:, 1]
ranked_indices = predictions.argsort()[::-1]
# 6. Get top hits
top_n = 1000
top_hits = [screening_smiles[i] for i in ranked_indices[:top_n]]
```
### QSAR Model Building
```python
from molfeat.calc import RDKitDescriptors2D
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import numpy as np
# Load QSAR dataset
smiles = load_molecules()
y = load_activity_values() # e.g., IC50, logP
# Featurize with interpretable descriptors
transformer = MoleculeTransformer(RDKitDescriptors2D(), n_jobs=-1)
X = transformer(smiles)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Build linear model
model = Ridge(alpha=1.0)
scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
print(f"R² = {scores.mean():.3f} (+/- {scores.std():.3f})")
# Fit final model
model.fit(X_scaled, y)
# Interpret feature importance
feature_names = transformer.featurizer.columns
importance = np.abs(model.coef_)
top_features_idx = importance.argsort()[-10:][::-1]
print("Top 10 important features:")
for idx in top_features_idx:
    print(f"  {feature_names[idx]}: {model.coef_[idx]:.3f}")
```
### Similarity Search
```python
from molfeat.calc import FPCalculator
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Query molecule
query_smiles = "CC(=O)Oc1ccccc1C(=O)O" # Aspirin
# Database of molecules
database_smiles = load_molecule_database() # Large collection
# Compute fingerprints
calc = FPCalculator("ecfp")
query_fp = calc(query_smiles).reshape(1, -1)
transformer = MoleculeTransformer(calc, n_jobs=-1)
database_fps = transformer(database_smiles)
# Compute similarity
similarities = cosine_similarity(query_fp, database_fps)[0]
# Find most similar
top_k = 10
top_indices = similarities.argsort()[-top_k:][::-1]
print(f"Top {top_k} similar molecules:")
for i, idx in enumerate(top_indices, 1):
    print(f"{i}. {database_smiles[idx]} (similarity: {similarities[idx]:.3f})")
```
---
## Troubleshooting
### Handling Invalid Molecules
```python
# Use ignore_errors to skip invalid molecules
transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    ignore_errors=True,
    verbose=True
)
# Filter out None values after transformation
features = transformer(smiles_list)
valid_mask = [f is not None for f in features]
valid_features = [f for f in features if f is not None]
valid_smiles = [s for s, m in zip(smiles_list, valid_mask) if m]
```
### Memory Management for Large Datasets
```python
# Process in chunks for very large datasets
import numpy as np

def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
    all_features = []
    for i in range(0, len(smiles_list), chunk_size):
        chunk = smiles_list[i:i+chunk_size]
        features = transformer(chunk)
        all_features.append(features)
        print(f"Processed {i+len(chunk)}/{len(smiles_list)}")
    return np.vstack(all_features)
# Use with large dataset
features = featurize_in_chunks(large_smiles_list, transformer)
```
### Reproducibility
```python
import random
import numpy as np
import torch
# Set all random seeds
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
set_seed(42)
# Save exact configuration
transformer.to_state_yaml_file("config.yml")
# Document version
import molfeat
print(f"molfeat version: {molfeat.__version__}")
```