Initial commit

This commit is contained in:

skills/molfeat/SKILL.md (new file, 505 lines)
@@ -0,0 +1,505 @@
---
name: molfeat
description: "Molecular featurization for ML (100+ featurizers). ECFP, MACCS, descriptors, pretrained models (ChemBERTa), convert SMILES to features, for QSAR and molecular ML."
---

# Molfeat - Molecular Featurization Hub

## Overview

Molfeat is a comprehensive Python library for molecular featurization that unifies 100+ pre-trained embeddings and hand-crafted featurizers. It converts chemical structures (SMILES strings or RDKit molecules) into numerical representations for machine learning tasks, including QSAR modeling, virtual screening, similarity searching, and deep learning. It offers fast parallel processing, scikit-learn compatible transformers, and built-in caching.

## When to Use This Skill

This skill should be used when working with:
- **Molecular machine learning**: Building QSAR/QSPR models, property prediction
- **Virtual screening**: Ranking compound libraries for biological activity
- **Similarity searching**: Finding structurally similar molecules
- **Chemical space analysis**: Clustering, visualization, dimensionality reduction
- **Deep learning**: Training neural networks on molecular data
- **Featurization pipelines**: Converting SMILES to ML-ready representations
- **Cheminformatics**: Any task requiring molecular feature extraction

## Installation

```bash
uv pip install molfeat

# With all optional dependencies
uv pip install "molfeat[all]"
```

**Optional dependencies for specific featurizers:**
- `molfeat[dgl]` - GNN models (GIN variants)
- `molfeat[graphormer]` - Graphormer models
- `molfeat[transformer]` - ChemBERTa, ChemGPT, MolT5
- `molfeat[fcd]` - FCD descriptors
- `molfeat[map4]` - MAP4 fingerprints

## Core Concepts

Molfeat organizes featurization into three hierarchical classes:

### 1. Calculators (`molfeat.calc`)

Callable objects that convert individual molecules into feature vectors. They accept RDKit `Chem.Mol` objects or SMILES strings.

**Use calculators for:**
- Single molecule featurization
- Custom processing loops
- Direct feature computation

**Example:**
```python
from molfeat.calc import FPCalculator

calc = FPCalculator("ecfp", radius=3, fpSize=2048)
features = calc("CCO")  # Returns numpy array (2048,)
```

### 2. Transformers (`molfeat.trans`)

Scikit-learn compatible transformers that wrap calculators for batch processing with parallelization.

**Use transformers for:**
- Batch featurization of molecular datasets
- Integration with scikit-learn pipelines
- Parallel processing (automatic CPU utilization)

**Example:**
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles_list)  # Parallel processing
```

### 3. Pretrained Transformers (`molfeat.trans.pretrained`)

Specialized transformers for deep learning models with batched inference and caching.

**Use pretrained transformers for:**
- State-of-the-art molecular embeddings
- Transfer learning from large chemical datasets
- Deep learning feature extraction

**Example:**
```python
from molfeat.trans.pretrained import PretrainedMolTransformer

transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
embeddings = transformer(smiles_list)  # Deep learning embeddings
```

## Quick Start Workflow

### Basic Featurization

```python
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

# Load molecular data
smiles = ["CCO", "CC(=O)O", "c1ccccc1", "CC(C)O"]

# Create calculator and transformer
calc = FPCalculator("ecfp", radius=3)
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Featurize molecules
features = transformer(smiles)
print(f"Shape: {features.shape}")  # (4, 2048)
```

### Save and Load Configuration

```python
# Save featurizer configuration for reproducibility
transformer.to_state_yaml_file("featurizer_config.yml")

# Reload exact configuration
loaded = MoleculeTransformer.from_state_yaml_file("featurizer_config.yml")
```

### Handle Errors Gracefully

```python
# Process dataset with potentially invalid SMILES
transformer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    ignore_errors=True,  # Continue on failures
    verbose=True         # Log error details
)

features = transformer(smiles_with_errors)
# Returns None for failed molecules
```
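When `ignore_errors=True`, failed molecules come back as `None`, so downstream arrays and labels must be realigned. A minimal follow-up sketch in plain Python/NumPy, assuming a `labels` list parallel to the input SMILES:

```python
import numpy as np

# Keep only successfully featurized molecules, with labels kept aligned
valid = [i for i, f in enumerate(features) if f is not None]
X = np.stack([features[i] for i in valid])
y = [labels[i] for i in valid]
```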
## Choosing the Right Featurizer

### For Traditional Machine Learning (RF, SVM, XGBoost)

**Start with fingerprints:**
```python
# ECFP - Most popular, general-purpose
FPCalculator("ecfp", radius=3, fpSize=2048)

# MACCS - Fast, good for scaffold hopping
FPCalculator("maccs")

# MAP4 - Efficient for large-scale screening
FPCalculator("map4")
```

**For interpretable models:**
```python
# RDKit 2D descriptors (200+ named properties)
from molfeat.calc import RDKitDescriptors2D
RDKitDescriptors2D()

# Mordred (1800+ comprehensive descriptors)
from molfeat.calc import MordredDescriptors
MordredDescriptors()
```

**Combine multiple featurizers:**
```python
from molfeat.trans import FeatConcat

concat = FeatConcat([
    FPCalculator("maccs"),  # 167 dimensions
    FPCalculator("ecfp")    # 2048 dimensions
])  # Result: 2215-dimensional combined features
```

### For Deep Learning

**Transformer-based embeddings:**
```python
# ChemBERTa - Pre-trained on 77M PubChem compounds
PretrainedMolTransformer("ChemBERTa-77M-MLM")

# ChemGPT - Autoregressive language model
PretrainedMolTransformer("ChemGPT-1.2B")
```

**Graph neural networks:**
```python
# GIN models with different pre-training objectives
PretrainedMolTransformer("gin-supervised-masking")
PretrainedMolTransformer("gin-supervised-infomax")

# Graphormer for quantum chemistry
PretrainedMolTransformer("Graphormer-pcqm4mv2")
```

### For Similarity Searching

```python
# ECFP - General purpose, most widely used
FPCalculator("ecfp")

# MACCS - Fast, scaffold-based similarity
FPCalculator("maccs")

# MAP4 - Efficient for large databases
FPCalculator("map4")

# USR/USRCAT - 3D shape similarity
from molfeat.calc import USRDescriptors
USRDescriptors()
```

### For Pharmacophore-Based Approaches

```python
# FCFP - Functional group based
FPCalculator("fcfp")

# CATS - Pharmacophore pair distributions
from molfeat.calc import CATSCalculator
CATSCalculator(mode="2D")

# Gobbi - Explicit pharmacophore features
FPCalculator("gobbi2D")
```

## Common Workflows

### Building a QSAR Model

```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Featurize molecules
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X = transformer(smiles_train)

# Train model
model = RandomForestRegressor(n_estimators=100)
scores = cross_val_score(model, X, y_train, cv=5)
print(f"R² = {scores.mean():.3f}")

# Save configuration for deployment
transformer.to_state_yaml_file("production_featurizer.yml")
```

### Virtual Screening Pipeline

```python
from sklearn.ensemble import RandomForestClassifier

# Train on known actives/inactives
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)
clf = RandomForestClassifier(n_estimators=500)
clf.fit(X_train, train_labels)

# Screen large library
X_screen = transformer(screening_library)  # e.g., 1M compounds
predictions = clf.predict_proba(X_screen)[:, 1]

# Rank and select top hits
top_indices = predictions.argsort()[::-1][:1000]
top_hits = [screening_library[i] for i in top_indices]
```

### Similarity Search

```python
from sklearn.metrics.pairwise import cosine_similarity

# Query molecule
calc = FPCalculator("ecfp")
query_fp = calc(query_smiles).reshape(1, -1)

# Database fingerprints
transformer = MoleculeTransformer(calc, n_jobs=-1)
database_fps = transformer(database_smiles)

# Compute similarity
similarities = cosine_similarity(query_fp, database_fps)[0]
top_similar = similarities.argsort()[-10:][::-1]
```
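Cosine similarity works on any feature vectors, but for binary fingerprints the conventional metric is Tanimoto (Jaccard) similarity. A minimal NumPy sketch, assuming the fingerprints above are binary 0/1 arrays:

```python
import numpy as np

def tanimoto(query, database):
    """Tanimoto similarity between one binary fingerprint and many."""
    query = query.astype(np.int64).ravel()
    database = database.astype(np.int64)
    shared = (database & query).sum(axis=1)             # on-bits in common
    union = database.sum(axis=1) + query.sum() - shared
    return shared / union

similarities = tanimoto(query_fp, database_fps)
top_similar = similarities.argsort()[-10:][::-1]
```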
### Scikit-learn Pipeline Integration

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier

# Create end-to-end pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# Train and predict directly on SMILES
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```

### Comparing Multiple Featurizers

```python
from molfeat.calc import FPCalculator, RDKitDescriptors2D
from molfeat.trans import MoleculeTransformer
from molfeat.trans.pretrained import PretrainedMolTransformer

featurizers = {
    'ECFP': MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1),
    'MACCS': MoleculeTransformer(FPCalculator("maccs"), n_jobs=-1),
    'Descriptors': MoleculeTransformer(RDKitDescriptors2D(), n_jobs=-1),
    'ChemBERTa': PretrainedMolTransformer("ChemBERTa-77M-MLM")  # already a transformer
}

results = {}
for name, transformer in featurizers.items():
    X = transformer(smiles)
    # Evaluate with your own ML model (evaluate_model is a user-supplied helper)
    score = evaluate_model(X, y)
    results[name] = score
```

## Discovering Available Featurizers

Use the ModelStore to explore all available featurizers:

```python
from molfeat.store.modelstore import ModelStore

store = ModelStore()

# List all available models
all_models = store.available_models
print(f"Total featurizers: {len(all_models)}")

# Search for specific models
chemberta_models = store.search(name="ChemBERTa")
for model in chemberta_models:
    print(f"- {model.name}: {model.description}")

# Get usage information
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
model_card.usage()  # Display usage examples

# Load model
transformer = store.load("ChemBERTa-77M-MLM")
```

## Advanced Features

### Custom Preprocessing

```python
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

class CustomTransformer(MoleculeTransformer):
    def preprocess(self, mol):
        """Custom preprocessing pipeline"""
        if isinstance(mol, str):
            mol = dm.to_mol(mol)
        mol = dm.standardize_mol(mol)
        mol = dm.remove_salts(mol)
        return mol

transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
```
### Batch Processing Large Datasets

```python
import numpy as np

def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
    """Process large datasets in chunks to manage memory"""
    all_features = []
    for i in range(0, len(smiles_list), chunk_size):
        chunk = smiles_list[i:i+chunk_size]
        features = transformer(chunk)
        all_features.append(features)
    return np.vstack(all_features)
```
### Caching Expensive Embeddings

```python
import pickle

cache_file = "embeddings_cache.pkl"
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)

try:
    with open(cache_file, "rb") as f:
        embeddings = pickle.load(f)
except FileNotFoundError:
    embeddings = transformer(smiles_list)
    with open(cache_file, "wb") as f:
        pickle.dump(embeddings, f)
```
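For workloads where the same SMILES recur across calls, an in-memory cache that only featurizes unseen molecules avoids recomputation entirely. A small sketch in plain Python/NumPy:

```python
import numpy as np

cache = {}

def featurize_cached(smiles_batch, transformer):
    """Featurize a batch, computing embeddings only for unseen SMILES."""
    missing = [s for s in smiles_batch if s not in cache]
    if missing:
        for s, feat in zip(missing, transformer(missing)):
            cache[s] = feat
    return np.stack([cache[s] for s in smiles_batch])
```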
## Performance Tips

1. **Use parallelization**: Set `n_jobs=-1` to utilize all CPU cores
2. **Batch processing**: Process multiple molecules at once instead of loops
3. **Choose appropriate featurizers**: Fingerprints are faster than deep learning models
4. **Cache pretrained models**: Leverage built-in caching for repeated use
5. **Use float32**: Set `dtype=np.float32` when precision allows
6. **Handle errors efficiently**: Use `ignore_errors=True` for large datasets
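Tips 1 and 5 combined in a minimal sketch (the `dtype` argument is documented in the API reference; `smiles_list` stands in for your data):

```python
import numpy as np
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

# All CPU cores, float32 output (half the memory of float64)
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1, dtype=np.float32)
features = transformer(smiles_list)
```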
## Common Featurizers Reference

**Quick reference for frequently used featurizers:**

| Featurizer | Type | Dimensions | Speed | Use Case |
|------------|------|------------|-------|----------|
| `ecfp` | Fingerprint | 2048 | Fast | General purpose |
| `maccs` | Fingerprint | 167 | Very fast | Scaffold similarity |
| `desc2D` | Descriptors | 200+ | Fast | Interpretable models |
| `mordred` | Descriptors | 1800+ | Medium | Comprehensive features |
| `map4` | Fingerprint | 1024 | Fast | Large-scale screening |
| `ChemBERTa-77M-MLM` | Deep learning | 768 | Slow* | Transfer learning |
| `gin-supervised-masking` | GNN | Variable | Slow* | Graph-based models |

*First run is slow; subsequent runs benefit from caching

## Resources

This skill includes comprehensive reference documentation:

### references/api_reference.md
Complete API documentation covering:
- `molfeat.calc` - All calculator classes and parameters
- `molfeat.trans` - Transformer classes and methods
- `molfeat.store` - ModelStore usage
- Common patterns and integration examples
- Performance optimization tips

**When to load:** Reference when implementing specific calculators, understanding transformer parameters, or integrating with scikit-learn/PyTorch.

### references/available_featurizers.md
Comprehensive catalog of all 100+ featurizers organized by category:
- Transformer-based language models (ChemBERTa, ChemGPT)
- Graph neural networks (GIN, Graphormer)
- Molecular descriptors (RDKit, Mordred)
- Fingerprints (ECFP, MACCS, MAP4, and 15+ others)
- Pharmacophore descriptors (CATS, Gobbi)
- Shape descriptors (USR, ElectroShape)
- Scaffold-based descriptors

**When to load:** Reference when selecting the optimal featurizer for a specific task, exploring available options, or understanding featurizer characteristics.

**Search tip:** Use grep to find specific featurizer types:
```bash
grep -i "chembert" references/available_featurizers.md
grep -i "pharmacophore" references/available_featurizers.md
```

### references/examples.md
Practical code examples for common scenarios:
- Installation and quick start
- Calculator and transformer examples
- Pretrained model usage
- Scikit-learn and PyTorch integration
- Virtual screening workflows
- QSAR model building
- Similarity searching
- Troubleshooting and best practices

**When to load:** Reference when implementing specific workflows, troubleshooting issues, or learning molfeat patterns.

## Troubleshooting

### Invalid Molecules
Enable error handling to skip invalid SMILES:
```python
transformer = MoleculeTransformer(
    calc,
    ignore_errors=True,
    verbose=True
)
```
### Memory Issues with Large Datasets
Process in chunks (see `featurize_in_chunks` under Advanced Features) or use streaming approaches for datasets of more than ~100K molecules.
### Pretrained Model Dependencies
Some models require additional packages. Install specific extras:
```bash
uv pip install "molfeat[transformer]"  # For ChemBERTa/ChemGPT
uv pip install "molfeat[dgl]"          # For GIN models
```

### Reproducibility
Save exact configurations and document versions:
```python
transformer.to_state_yaml_file("config.yml")
import molfeat
print(f"molfeat version: {molfeat.__version__}")
```

## Additional Resources

- **Official Documentation**: https://molfeat-docs.datamol.io/
- **GitHub Repository**: https://github.com/datamol-io/molfeat
- **PyPI Package**: https://pypi.org/project/molfeat/
- **Tutorial**: https://portal.valencelabs.com/datamol/post/types-of-featurizers-b1e8HHrbFMkbun6

skills/molfeat/references/api_reference.md (new file, 428 lines)
@@ -0,0 +1,428 @@
# Molfeat API Reference

## Core Modules

Molfeat is organized into several key modules that provide different aspects of molecular featurization:

- **`molfeat.store`** - Manages model loading, listing, and registration
- **`molfeat.calc`** - Provides calculators for single-molecule featurization
- **`molfeat.trans`** - Offers scikit-learn compatible transformers for batch processing
- **`molfeat.utils`** - Utility functions for data handling
- **`molfeat.viz`** - Visualization tools for molecular features

---

## molfeat.calc - Calculators

Calculators are callable objects that convert individual molecules into feature vectors. They accept either RDKit `Chem.Mol` objects or SMILES strings as input.

### SerializableCalculator (Base Class)

Abstract base class for all calculators. When subclassing, implement the following (a minimal subclass sketch follows the method lists below):
- `__call__()` - Required method for featurization
- `__len__()` - Optional, returns output length
- `columns` - Optional property, returns feature names
- `batch_compute()` - Optional, for efficient batch processing

**State Management Methods:**
- `to_state_json()` - Save calculator state as JSON
- `to_state_yaml()` - Save calculator state as YAML
- `from_state_dict()` - Load calculator from state dictionary
- `to_state_dict()` - Export calculator state as dictionary
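A minimal custom-calculator sketch. It assumes `SerializableCalculator` is importable from `molfeat.calc` as documented above; the two descriptors computed are purely illustrative:

```python
import numpy as np
import datamol as dm
from rdkit.Chem import rdMolDescriptors
from molfeat.calc import SerializableCalculator

class SimpleCountCalculator(SerializableCalculator):
    """Toy calculator returning two interpretable atom/ring counts."""

    def __call__(self, mol):
        if isinstance(mol, str):
            mol = dm.to_mol(mol)
        return np.array(
            [mol.GetNumHeavyAtoms(), rdMolDescriptors.CalcNumRings(mol)],
            dtype=float,
        )

    def __len__(self):
        return 2

    @property
    def columns(self):
        return ["num_heavy_atoms", "num_rings"]

calc = SimpleCountCalculator()
print(calc("c1ccccc1O"))  # [7. 1.] for phenol
```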
### FPCalculator

Computes molecular fingerprints. Supports 15+ fingerprint methods.

**Supported Fingerprint Types:**

**Structural Fingerprints:**
- `ecfp` - Extended-connectivity fingerprints (circular)
- `fcfp` - Functional-class fingerprints
- `rdkit` - RDKit topological fingerprints
- `maccs` - MACCS keys (166 structural keys, returned as a 167-dimensional vector)
- `avalon` - Avalon fingerprints
- `pattern` - Pattern fingerprints
- `layered` - Layered fingerprints

**Atom-based Fingerprints:**
- `atompair` - Atom pair fingerprints
- `atompair-count` - Counted atom pairs
- `topological` - Topological torsion fingerprints
- `topological-count` - Counted topological torsions

**Specialized Fingerprints:**
- `map4` - MinHashed atom-pair fingerprint up to 4 bonds
- `secfp` - SMILES extended connectivity fingerprint
- `erg` - Extended reduced graphs
- `estate` - Electrotopological state indices

**Parameters:**
- `method` (str) - Fingerprint type name
- `radius` (int) - Radius for circular fingerprints (default: 3)
- `fpSize` (int) - Fingerprint size (default: 2048)
- `includeChirality` (bool) - Include chirality information
- `counting` (bool) - Use count vectors instead of binary

**Usage:**
```python
from molfeat.calc import FPCalculator

# Create fingerprint calculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)

# Compute fingerprint for single molecule
fp = calc("CCO")  # Returns numpy array

# Get fingerprint length
length = len(calc)  # 2048

# Get feature names
names = calc.columns
```
**Common Fingerprint Dimensions:**
- MACCS: 167 dimensions
- ECFP (default): 2048 dimensions
- MAP4 (default): 1024 dimensions

### Descriptor Calculators

**RDKitDescriptors2D**
Computes 2D molecular descriptors using RDKit.

```python
from molfeat.calc import RDKitDescriptors2D

calc = RDKitDescriptors2D()
descriptors = calc("CCO")  # Returns 200+ descriptors
```

**RDKitDescriptors3D**
Computes 3D molecular descriptors (requires conformer generation).

**MordredDescriptors**
Calculates over 1800 molecular descriptors using Mordred.

```python
from molfeat.calc import MordredDescriptors

calc = MordredDescriptors()
descriptors = calc("CCO")
```

### Pharmacophore Calculators

**Pharmacophore2D**
RDKit's 2D pharmacophore fingerprint generation.

**Pharmacophore3D**
Consensus pharmacophore fingerprints from multiple conformers.

**CATSCalculator**
Computes Chemically Advanced Template Search (CATS) descriptors - pharmacophore point pair distributions.

**Parameters:**
- `mode` - "2D" or "3D" distance calculations
- `dist_bins` - Distance bins for pair distributions
- `scale` - Scaling mode: "raw", "num", or "count"

```python
from molfeat.calc import CATSCalculator

calc = CATSCalculator(mode="2D", scale="raw")
cats = calc("CCO")  # Returns 21 descriptors by default
```

### Shape Descriptors

**USRDescriptors**
Ultrafast shape recognition descriptors (multiple variants).

**ElectroShapeDescriptors**
Electrostatic shape descriptors combining shape, chirality, and electrostatics.

### Graph-Based Calculators

**ScaffoldKeyCalculator**
Computes 40+ scaffold-based molecular properties.

**AtomCalculator**
Atom-level featurization for graph neural networks.

**BondCalculator**
Bond-level featurization for graph neural networks.

### Utility Function

**get_calculator()**
Factory function to instantiate calculators by name.

```python
from molfeat.calc import get_calculator

# Instantiate any calculator by name
calc = get_calculator("ecfp", radius=3)
calc = get_calculator("maccs")
calc = get_calculator("desc2D")
```

Raises `ValueError` for unsupported featurizers.

---
## molfeat.trans - Transformers

Transformers wrap calculators into complete featurization pipelines for batch processing.

### MoleculeTransformer

Scikit-learn compatible transformer for batch molecular featurization.

**Key Parameters:**
- `featurizer` - Calculator or featurizer to use
- `n_jobs` (int) - Number of parallel jobs (-1 for all cores)
- `dtype` - Output data type (numpy float32/64, torch tensors)
- `verbose` (bool) - Enable verbose logging
- `ignore_errors` (bool) - Continue on failures (returns None for failed molecules)

**Essential Methods:**
- `transform(mols)` - Processes batches and returns representations
- `_transform(mol)` - Handles individual molecule featurization
- `__call__(mols)` - Convenience wrapper around transform()
- `preprocess(mol)` - Prepares input molecules (not automatically applied)
- `to_state_yaml_file(path)` - Save transformer configuration
- `from_state_yaml_file(path)` - Load transformer configuration

**Usage:**
```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
import datamol as dm

# Load molecules
smiles = dm.data.freesolv().sample(100).smiles.values

# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Featurize batch
features = transformer(smiles)  # Returns numpy array (100, 2048)

# Save configuration
transformer.to_state_yaml_file("ecfp_config.yml")

# Reload
transformer = MoleculeTransformer.from_state_yaml_file("ecfp_config.yml")
```

**Performance:** Testing on 642 molecules showed a 3.4x speedup using 4 parallel jobs versus single-threaded processing.

### FeatConcat

Concatenates multiple featurizers into unified representations.

```python
from molfeat.trans import FeatConcat
from molfeat.calc import FPCalculator

# Combine multiple fingerprints
concat = FeatConcat([
    FPCalculator("maccs"),  # 167 dimensions
    FPCalculator("ecfp")    # 2048 dimensions
])

# Result: 2215-dimensional features
transformer = MoleculeTransformer(concat, n_jobs=-1)
features = transformer(smiles)
```
### PretrainedMolTransformer

Subclass of `MoleculeTransformer` for pre-trained deep learning models.

**Unique Features:**
- `_embed()` - Batched inference for neural networks
- `_convert()` - Transforms SMILES/molecules into model-compatible formats
  - SELFIES strings for language models
  - DGL graphs for graph neural networks
- Integrated caching system for efficient storage

**Usage:**
```python
from molfeat.trans.pretrained import PretrainedMolTransformer

# Load pretrained model
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)

# Generate embeddings
embeddings = transformer(smiles)
```

### PrecomputedMolTransformer

Transformer that serves features from a precomputed cache instead of recomputing them.

---

## molfeat.store - Model Store

Manages featurizer discovery, loading, and registration.

### ModelStore

Central hub for accessing available featurizers.

**Key Methods:**
- `available_models` - Property listing all available featurizers
- `search(name=None, **kwargs)` - Search for specific featurizers
- `load(name, **kwargs)` - Load a featurizer by name
- `register(name, card)` - Register custom featurizer

**Usage:**
```python
from molfeat.store.modelstore import ModelStore

# Initialize store
store = ModelStore()

# List all available models
all_models = store.available_models
print(f"Found {len(all_models)} featurizers")

# Search for specific model
results = store.search(name="ChemBERTa-77M-MLM")
if results:
    model_card = results[0]

    # View usage information
    model_card.usage()

    # Load the model
    transformer = model_card.load()

# Direct loading
transformer = store.load("ChemBERTa-77M-MLM")
```
**ModelCard Attributes:**
- `name` - Model identifier
- `description` - Model description
- `version` - Model version
- `authors` - Model authors
- `tags` - Categorization tags
- `usage()` - Display usage examples
- `load(**kwargs)` - Load the model

---

## Common Patterns

### Error Handling

```python
# Enable error tolerance
featurizer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    verbose=True,
    ignore_errors=True
)

# Failed molecules return None
features = featurizer(smiles_with_errors)
```

### Data Type Control

```python
# NumPy float32 (default)
features = transformer(smiles, enforce_dtype=True)

# PyTorch tensors
import torch
transformer = MoleculeTransformer(calc, dtype=torch.float32)
features = transformer(smiles)
```

### Persistence and Reproducibility

```python
# Save transformer state
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")

# Load from saved state
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
transformer = MoleculeTransformer.from_state_json_file("config.json")
```

### Preprocessing

```python
# Manual preprocessing
mol = transformer.preprocess("CCO")

# Transform with preprocessing
features = transformer.transform(smiles_list)
```

---

## Integration Examples

### Scikit-learn Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Create pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"))),
    ('classifier', RandomForestClassifier())
])

# Fit and predict
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```

### PyTorch Integration

```python
import torch
from torch.utils.data import Dataset, DataLoader
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.smiles = smiles
        self.labels = labels
        self.transformer = transformer

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        features = self.transformer(self.smiles[idx])
        return torch.tensor(features), torch.tensor(self.labels[idx])

# Create dataset and dataloader
transformer = MoleculeTransformer(FPCalculator("ecfp"))
dataset = MoleculeDataset(smiles, labels, transformer)
loader = DataLoader(dataset, batch_size=32)
```

Note that this featurizes one molecule per `__getitem__` call; for large datasets, featurize the whole list once up front (as in the examples reference) and index into the resulting array.

---

## Performance Tips

1. **Parallelization**: Use `n_jobs=-1` to utilize all CPU cores
2. **Batch Processing**: Process multiple molecules at once instead of loops
3. **Caching**: Leverage built-in caching for pretrained models
4. **Data Types**: Use float32 instead of float64 when precision allows
5. **Error Handling**: Set `ignore_errors=True` for large datasets with potential invalid molecules

skills/molfeat/references/available_featurizers.md (new file, 333 lines)
@@ -0,0 +1,333 @@
# Available Featurizers in Molfeat

This document provides a comprehensive catalog of all featurizers available in molfeat, organized by category.

## Transformer-Based Language Models

Pre-trained transformer models for molecular embeddings using SMILES/SELFIES representations.

### RoBERTa-style Models
- **Roberta-Zinc480M-102M** - RoBERTa masked language model trained on ~480M SMILES strings from the ZINC database
- **ChemBERTa-77M-MLM** - Masked language model based on RoBERTa trained on 77M PubChem compounds
- **ChemBERTa-77M-MTR** - Multitask regression version trained on PubChem compounds

### GPT-style Autoregressive Models
- **GPT2-Zinc480M-87M** - GPT-2 autoregressive language model trained on ~480M SMILES from ZINC
- **ChemGPT-1.2B** - Large transformer (1.2B parameters) pretrained on PubChem10M
- **ChemGPT-19M** - Medium transformer (19M parameters) pretrained on PubChem10M
- **ChemGPT-4.7M** - Small transformer (4.7M parameters) pretrained on PubChem10M

### Specialized Transformer Models
- **MolT5** - Self-supervised framework for molecule captioning and text-based generation

## Graph Neural Networks (GNNs)

Pre-trained graph neural network models operating on molecular graph structures.

### GIN (Graph Isomorphism Network) Variants
All pre-trained on ChEMBL molecules with different objectives:
- **gin-supervised-masking** - Supervised with node masking objective
- **gin-supervised-infomax** - Supervised with graph-level mutual information maximization
- **gin-supervised-edgepred** - Supervised with edge prediction objective
- **gin-supervised-contextpred** - Supervised with context prediction objective

### Other Graph-Based Models
- **JTVAE_zinc_no_kl** - Junction-tree VAE for molecule generation (trained on ZINC)
- **Graphormer-pcqm4mv2** - Graph transformer pretrained on the PCQM4Mv2 quantum chemistry dataset for HOMO-LUMO gap prediction

## Molecular Descriptors

Calculators for physico-chemical properties and molecular characteristics.

### 2D Descriptors
- **desc2D** / **rdkit2D** - 200+ RDKit 2D molecular descriptors including:
  - Molecular weight, logP, TPSA
  - H-bond donors/acceptors
  - Rotatable bonds
  - Ring counts and aromaticity
  - Molecular complexity metrics

### 3D Descriptors
- **desc3D** / **rdkit3D** - RDKit 3D molecular descriptors (requires conformer generation)
  - Inertial moments
  - PMI (Principal Moments of Inertia) ratios
  - Asphericity, eccentricity
  - Radius of gyration

### Comprehensive Descriptor Sets
- **mordred** - Over 1800 molecular descriptors covering:
  - Constitutional descriptors
  - Topological indices
  - Connectivity indices
  - Information content
  - 2D/3D autocorrelations
  - WHIM descriptors
  - GETAWAY descriptors
  - And many more

### Electrotopological Descriptors
- **estate** - Electrotopological state (E-State) indices encoding:
  - Atomic environment information
  - Electronic and topological properties
  - Heteroatom contributions

## Molecular Fingerprints

Binary or count-based fixed-length vectors representing molecular substructures.

### Circular Fingerprints (ECFP-style)
- **ecfp** / **ecfp:2** / **ecfp:4** / **ecfp:6** - Extended-connectivity fingerprints
  - Radius variants (the 2, 4, 6 suffixes correspond to diameter, i.e. twice the radius; see the sketch after this list)
  - Default: radius=3, 2048 bits
  - Most popular for similarity searching
- **ecfp-count** - Count version of ECFP (non-binary)
- **fcfp** / **fcfp-count** - Functional-class circular fingerprints
  - Similar to ECFP but uses functional groups
  - Better for pharmacophore-based similarity
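A small sketch of how these catalog names map onto calculators, assuming the `ecfp:4`-style names listed above are accepted by `get_calculator` just like the plain `ecfp` name:

```python
from molfeat.calc import FPCalculator, get_calculator

calc_default = FPCalculator("ecfp")         # radius=3 (diameter 6), 2048 bits
calc_ecfp4 = get_calculator("ecfp:4")       # diameter 4, i.e. radius 2
calc_counts = get_calculator("ecfp-count")  # count vector instead of binary

fp = calc_default("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```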
### Path-Based Fingerprints
- **rdkit** - RDKit topological fingerprints based on linear paths
- **pattern** - Pattern fingerprints (similar to MACCS but automated)
- **layered** - Layered fingerprints with multiple substructure layers

### Key-Based Fingerprints
- **maccs** - MACCS keys (166-bit structural keys)
  - Fixed set of predefined substructures
  - Good for scaffold hopping
  - Fast computation
- **avalon** - Avalon fingerprints
  - Similar to MACCS but more features
  - Optimized for similarity searching

### Atom-Pair Fingerprints
- **atompair** - Atom pair fingerprints
  - Encodes pairs of atoms and the distance between them
  - Good for 3D similarity
- **atompair-count** - Count version of atom pairs

### Topological Torsion Fingerprints
- **topological** - Topological torsion fingerprints
  - Encodes sequences of 4 connected atoms
  - Captures local topology
- **topological-count** - Count version of topological torsions

### MinHashed Fingerprints
- **map4** - MinHashed Atom-Pair fingerprint up to 4 bonds
  - Combines atom-pair and ECFP concepts
  - Default: 1024 dimensions
  - Fast and efficient for large datasets
- **secfp** - SMILES Extended Connectivity Fingerprint
  - Operates directly on SMILES strings
  - Captures both substructure and atom-pair information

### Extended Reduced Graph
- **erg** - Extended Reduced Graph
  - Uses pharmacophoric points instead of atoms
  - Reduces graph complexity while preserving key features

## Pharmacophore Descriptors

Features based on pharmacologically relevant functional groups and their spatial relationships.

### CATS (Chemically Advanced Template Search)
- **cats2D** - 2D CATS descriptors
  - Pharmacophore point pair distributions
  - Distance based on shortest path
  - 21 descriptors by default
- **cats3D** - 3D CATS descriptors
  - Euclidean distance based
  - Requires conformer generation
- **cats2D_pharm** / **cats3D_pharm** - Pharmacophore variants

### Gobbi Pharmacophores
- **gobbi2D** - 2D pharmacophore fingerprints
  - 8 pharmacophore feature types, including:
    - Hydrophobic
    - Aromatic
    - H-bond acceptor
    - H-bond donor
    - Positive ionizable
    - Negative ionizable
    - Lumped hydrophobe
  - Good for virtual screening

### Pmapper Pharmacophores
- **pmapper2D** - 2D pharmacophore signatures
- **pmapper3D** - 3D pharmacophore signatures
  - High-dimensional pharmacophore descriptors
  - Useful for QSAR and similarity searching

## Shape Descriptors

Descriptors capturing 3D molecular shape and electrostatic properties.

### USR (Ultrafast Shape Recognition)
- **usr** - Basic USR descriptors
  - 12 dimensions encoding shape distribution
  - Extremely fast computation
- **usrcat** - USR with pharmacophoric constraints
  - 60 dimensions (12 per feature type)
  - Combines shape and pharmacophore information

### Electrostatic Shape
- **electroshape** - ElectroShape descriptors
  - Combines molecular shape, chirality, and electrostatics
  - Useful for protein-ligand docking predictions

## Scaffold-Based Descriptors

Descriptors based on molecular scaffolds and core structures.

### Scaffold Keys
- **scaffoldkeys** - Scaffold key calculator
  - 40+ scaffold-based properties
  - Bioisosteric scaffold representation
  - Captures core structural features

## Graph Featurizers for GNN Input

Atom- and bond-level features for constructing graph representations for Graph Neural Networks.

### Atom-Level Features
- **atom-onehot** - One-hot encoded atom features
- **atom-default** - Default atom featurization including:
  - Atomic number
  - Degree, formal charge
  - Hybridization
  - Aromaticity
  - Number of hydrogen atoms

### Bond-Level Features
- **bond-onehot** - One-hot encoded bond features
- **bond-default** - Default bond featurization including:
  - Bond type (single, double, triple, aromatic)
  - Conjugation
  - Ring membership
  - Stereochemistry

## Integrated Pretrained Model Collections

Molfeat integrates models from various sources:

### HuggingFace Models
Access to transformer models through the HuggingFace hub:
- ChemBERTa variants
- ChemGPT variants
- MolT5
- Custom uploaded models

### DGL-LifeSci Models
Pre-trained GNN models from DGL-LifeSci:
- GIN variants with different pre-training tasks
- AttentiveFP models
- MPNN models

### FCD (Fréchet ChemNet Distance)
- **fcd** - Pre-trained CNN for molecular generation evaluation

### Graphormer Models
- Graph transformers from Microsoft Research
- Pre-trained on quantum chemistry datasets

## Usage Notes

### Choosing a Featurizer

**For traditional ML (Random Forest, SVM, etc.):**
- Start with **ecfp** or **maccs** fingerprints
- Try **desc2D** for interpretable models
- Use **FeatConcat** to combine multiple fingerprints

**For deep learning:**
- Use **ChemBERTa** or **ChemGPT** for transformer embeddings
- Use **gin-supervised-*** for graph neural network embeddings
- Consider **Graphormer** for quantum property predictions

**For similarity searching:**
- **ecfp** - General purpose, most popular
- **maccs** - Fast, good for scaffold hopping
- **map4** - Efficient for large-scale searches
- **usr** / **usrcat** - 3D shape similarity

**For pharmacophore-based approaches:**
- **fcfp** - Functional group based
- **cats2D/3D** - Pharmacophore pair distributions
- **gobbi2D** - Explicit pharmacophore features

**For interpretability:**
- **desc2D** / **mordred** - Named descriptors
- **maccs** - Interpretable substructure keys
- **scaffoldkeys** - Scaffold-based features

### Model Dependencies

Some featurizers require optional dependencies:

- **DGL models** (gin-*, jtvae): `pip install "molfeat[dgl]"`
- **Graphormer**: `pip install "molfeat[graphormer]"`
- **Transformers** (ChemBERTa, ChemGPT, MolT5): `pip install "molfeat[transformer]"`
- **FCD**: `pip install "molfeat[fcd]"`
- **MAP4**: `pip install "molfeat[map4]"`
- **All dependencies**: `pip install "molfeat[all]"`

### Accessing All Available Models

```python
from molfeat.store.modelstore import ModelStore

store = ModelStore()
all_models = store.available_models

# Print all available featurizers
for model in all_models:
    print(f"{model.name}: {model.description}")

# Search for specific types
transformers = [m for m in all_models if "transformer" in m.tags]
gnn_models = [m for m in all_models if "gnn" in m.tags]
fingerprints = [m for m in all_models if "fingerprint" in m.tags]
```

## Performance Characteristics

### Computational Speed (relative)
**Fastest:**
- maccs
- ecfp
- rdkit fingerprints
- usr

**Medium:**
- desc2D
- cats2D
- Most fingerprints

**Slower:**
- mordred (1800+ descriptors)
- desc3D (requires conformer generation)
- 3D descriptors in general

**Slowest (first run):**
- Pretrained models (ChemBERTa, ChemGPT, GIN)
- Note: subsequent runs benefit from caching

### Dimensionality

**Low (< 200 dims):**
- maccs (167)
- usr (12)
- usrcat (60)

**Medium (200-2000 dims):**
- desc2D (~200)
- ecfp (2048 default, configurable)
- map4 (1024 default)

**High (> 2000 dims):**
- mordred (1800+)
- Concatenated fingerprints
- Some transformer embeddings

**Variable:**
- Transformer models (typically 768-1024)
- GNN models (depends on architecture)

skills/molfeat/references/examples.md (new file, 723 lines)
@@ -0,0 +1,723 @@
# Molfeat Usage Examples

This document provides practical examples for common molfeat use cases.

## Installation

```bash
# Recommended: Using conda/mamba
mamba install -c conda-forge molfeat

# Alternative: Using pip
pip install molfeat

# With all optional dependencies
pip install "molfeat[all]"

# With specific dependencies
pip install "molfeat[dgl]"          # For GNN models
pip install "molfeat[graphormer]"   # For Graphormer
pip install "molfeat[transformer]"  # For ChemBERTa, ChemGPT
```

---

## Quick Start

### Basic Featurization Workflow

```python
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer

# Load sample data
data = dm.data.freesolv().sample(100).smiles.values

# Single molecule featurization
calc = FPCalculator("ecfp")
features_single = calc(data[0])
print(f"Single molecule features shape: {features_single.shape}")
# Output: (2048,)

# Batch featurization with parallelization
transformer = MoleculeTransformer(calc, n_jobs=-1)
features_batch = transformer(data)
print(f"Batch features shape: {features_batch.shape}")
# Output: (100, 2048)
```

---

## Calculator Examples

### Fingerprint Calculators

```python
from molfeat.calc import FPCalculator

# ECFP (Extended-Connectivity Fingerprints)
ecfp = FPCalculator("ecfp", radius=3, fpSize=2048)
fp = ecfp("CCO")  # Ethanol
print(f"ECFP shape: {fp.shape}")  # (2048,)

# MACCS keys
maccs = FPCalculator("maccs")
fp = maccs("c1ccccc1")  # Benzene
print(f"MACCS shape: {fp.shape}")  # (167,)

# Count-based fingerprints
ecfp_count = FPCalculator("ecfp-count", radius=3)
fp_count = ecfp_count("CC(C)CC(C)C")  # Non-binary counts

# MAP4 fingerprints
map4 = FPCalculator("map4")
fp = map4("CC(=O)Oc1ccccc1C(=O)O")  # Aspirin
```

### Descriptor Calculators

```python
from molfeat.calc import RDKitDescriptors2D, MordredDescriptors

# RDKit 2D descriptors (200+ properties)
desc2d = RDKitDescriptors2D()
descriptors = desc2d("CCO")
print(f"Number of 2D descriptors: {len(descriptors)}")

# Get descriptor names
names = desc2d.columns
print(f"First 5 descriptors: {names[:5]}")

# Mordred descriptors (1800+ properties)
mordred = MordredDescriptors()
descriptors = mordred("c1ccccc1O")  # Phenol
print(f"Mordred descriptors: {len(descriptors)}")
```

### Pharmacophore Calculators

```python
from molfeat.calc import CATSCalculator

# 2D CATS descriptors
cats = CATSCalculator(mode="2D", scale="raw")
descriptors = cats("CC(C)Cc1ccc(C)cc1C")  # Cymene
print(f"CATS descriptors: {descriptors.shape}")  # (21,)

# 3D CATS descriptors (requires conformer)
cats3d = CATSCalculator(mode="3D", scale="num")
```

---

## Transformer Examples

### Basic Transformer Usage

```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
import datamol as dm

# Prepare data
smiles_list = [
    "CCO",
    "CC(=O)O",
    "c1ccccc1",
    "CC(C)O",
    "CCCC"
]

# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)

# Transform molecules
features = transformer(smiles_list)
print(f"Features shape: {features.shape}")  # (5, 2048)
```

### Error Handling

```python
# Handle invalid SMILES gracefully
smiles_with_errors = [
    "CCO",       # Valid
    "invalid",   # Invalid
    "CC(=O)O",   # Valid
    "xyz123",    # Invalid
]

transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    n_jobs=-1,
    verbose=True,       # Log errors
    ignore_errors=True  # Continue on failure
)

features = transformer(smiles_with_errors)
# Returns: array with None for failed molecules
print(features)  # [array(...), None, array(...), None]
```
### Concatenating Multiple Featurizers

```python
from molfeat.trans import FeatConcat, MoleculeTransformer
from molfeat.calc import FPCalculator

# Combine MACCS (167) + ECFP (2048) = 2215 dimensions
concat_calc = FeatConcat([
    FPCalculator("maccs"),
    FPCalculator("ecfp", radius=3, fpSize=2048)
])

transformer = MoleculeTransformer(concat_calc, n_jobs=-1)
features = transformer(smiles_list)
print(f"Combined features shape: {features.shape}")  # (n, 2215)

# Triple combination
triple_concat = FeatConcat([
    FPCalculator("maccs"),
    FPCalculator("ecfp"),
    FPCalculator("rdkit")
])
```

### Saving and Loading Configurations

```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Create and save transformer
transformer = MoleculeTransformer(
    FPCalculator("ecfp", radius=3, fpSize=2048),
    n_jobs=-1
)

# Save to YAML
transformer.to_state_yaml_file("my_featurizer.yml")

# Save to JSON
transformer.to_state_json_file("my_featurizer.json")

# Load from saved state
loaded_transformer = MoleculeTransformer.from_state_yaml_file("my_featurizer.yml")

# Use loaded transformer
features = loaded_transformer(smiles_list)
```

---

## Pretrained Model Examples

### Using the ModelStore

```python
from molfeat.store.modelstore import ModelStore

# Initialize model store
store = ModelStore()

# List all available models
print(f"Total available models: {len(store.available_models)}")

# Search for specific models
chemberta_models = store.search(name="ChemBERTa")
for model in chemberta_models:
    print(f"- {model.name}: {model.description}")

# Get model information
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
print(f"Model: {model_card.name}")
print(f"Version: {model_card.version}")
print(f"Authors: {model_card.authors}")

# View usage instructions
model_card.usage()

# Load model directly
transformer = store.load("ChemBERTa-77M-MLM")
```

### ChemBERTa Embeddings

```python
from molfeat.trans.pretrained import PretrainedMolTransformer

# Load ChemBERTa model
chemberta = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)

# Generate embeddings
smiles = ["CCO", "CC(=O)O", "c1ccccc1"]
embeddings = chemberta(smiles)
print(f"ChemBERTa embeddings shape: {embeddings.shape}")
# Output: (3, 768) - 768-dimensional embeddings

# Use in ML pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2
)

clf = RandomForestClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```

### ChemGPT Models

```python
# Small model (4.7M parameters)
chemgpt_small = PretrainedMolTransformer("ChemGPT-4.7M", n_jobs=-1)

# Medium model (19M parameters)
chemgpt_medium = PretrainedMolTransformer("ChemGPT-19M", n_jobs=-1)

# Large model (1.2B parameters)
chemgpt_large = PretrainedMolTransformer("ChemGPT-1.2B", n_jobs=-1)

# Generate embeddings
embeddings = chemgpt_small(smiles)
```

### Graph Neural Network Models

```python
# GIN models with different pre-training objectives
gin_masking = PretrainedMolTransformer("gin-supervised-masking", n_jobs=-1)
gin_infomax = PretrainedMolTransformer("gin-supervised-infomax", n_jobs=-1)
gin_edgepred = PretrainedMolTransformer("gin-supervised-edgepred", n_jobs=-1)

# Generate graph embeddings
embeddings = gin_masking(smiles)
print(f"GIN embeddings shape: {embeddings.shape}")

# Graphormer (for quantum chemistry)
graphormer = PretrainedMolTransformer("Graphormer-pcqm4mv2", n_jobs=-1)
embeddings = graphormer(smiles)
```

---

## Machine Learning Integration

### Scikit-learn Pipeline

```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Create ML pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])

# Train and evaluate
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)

# Cross-validation
scores = cross_val_score(pipeline, smiles_all, y_all, cv=5)
print(f"CV scores: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
### Grid Search for Hyperparameter Tuning

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Define pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', SVC())
])

# Define parameter grid
param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['rbf', 'linear'],
    'classifier__gamma': ['scale', 'auto']
}

# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(smiles_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
```

### Multiple Featurizer Comparison

```python
from sklearn.metrics import roc_auc_score
from molfeat.calc import FPCalculator, RDKitDescriptors2D
from molfeat.trans import FeatConcat, MoleculeTransformer

# Test different featurizers
featurizers = {
    'ECFP': FPCalculator("ecfp"),
    'MACCS': FPCalculator("maccs"),
    'RDKit': FPCalculator("rdkit"),
    'Descriptors': RDKitDescriptors2D(),
    'Combined': FeatConcat([
        FPCalculator("maccs"),
        FPCalculator("ecfp")
    ])
}

results = {}
for name, calc in featurizers.items():
    transformer = MoleculeTransformer(calc, n_jobs=-1)
    X_train = transformer(smiles_train)
    X_test = transformer(smiles_test)

    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)

    y_pred = clf.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred)
    results[name] = auc

    print(f"{name}: AUC = {auc:.3f}")
```

### PyTorch Deep Learning

```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

# Custom dataset: featurize once up front, serve tensors per item
class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.features = transformer(smiles)
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return (
            torch.tensor(self.features[idx], dtype=torch.float32),
            self.labels[idx]
        )

# Prepare data
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
train_dataset = MoleculeDataset(smiles_train, y_train, transformer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Simple neural network
class MoleculeClassifier(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.network(x)

# Train model (input_dim matches the 2048-bit ECFP default)
model = MoleculeClassifier(input_dim=2048)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()

for epoch in range(10):
    for batch_features, batch_labels in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_features).squeeze()
        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()
```
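
After training, evaluation follows the usual PyTorch pattern: switch to eval mode and disable gradients. A minimal sketch, assuming a `test_loader` built the same way as `train_loader` from held-out data:

```python
from sklearn.metrics import roc_auc_score

# Evaluate on a held-out loader (test_loader is an assumed counterpart
# of train_loader, built from smiles_test / y_test)
model.eval()
all_probs, all_labels = [], []
with torch.no_grad():
    for batch_features, batch_labels in test_loader:
        all_probs.append(model(batch_features).squeeze(-1))
        all_labels.append(batch_labels)

y_prob = torch.cat(all_probs).numpy()
y_true = torch.cat(all_labels).numpy()
print(f"Test AUC: {roc_auc_score(y_true, y_prob):.3f}")
```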

---

## Advanced Usage Patterns

### Custom Preprocessing

```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
import datamol as dm

class CustomTransformer(MoleculeTransformer):
    def preprocess(self, mol):
        """Custom preprocessing: standardize the molecule first"""
        if isinstance(mol, str):
            mol = dm.to_mol(mol)

        # Standardize
        mol = dm.standardize_mol(mol)

        # Remove salts and solvents
        mol = dm.remove_salts_solvents(mol)

        return mol

# Use custom transformer
transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles_list)
```

### Featurization with Conformers

```python
import datamol as dm
from molfeat.calc import RDKitDescriptors3D

# Generate a 3D conformer before computing 3D descriptors
def prepare_3d_mol(smiles):
    mol = dm.to_mol(smiles)
    mol = dm.add_hs(mol)
    mol = dm.conformers.generate(mol, n_confs=1)
    return mol

# 3D descriptors
calc_3d = RDKitDescriptors3D()

smiles = "CC(C)Cc1ccc(C)cc1C"
mol_3d = prepare_3d_mol(smiles)
descriptors_3d = calc_3d(mol_3d)
```
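
The same 3D calculator also works in batch mode when wrapped in a `MoleculeTransformer`, provided each molecule already carries a conformer. A short sketch (`smiles_list` is an assumed input):

```python
from molfeat.trans import MoleculeTransformer

# Embed a conformer for every molecule first, then featurize in parallel
mols_3d = [prepare_3d_mol(s) for s in smiles_list]
transformer_3d = MoleculeTransformer(RDKitDescriptors3D(), n_jobs=-1)
features_3d = transformer_3d(mols_3d)
```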

### Parallel Batch Processing

```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
import time

# Large dataset
smiles_large = load_large_dataset()  # e.g., 100,000 molecules

# Test different parallelization levels
for n_jobs in [1, 2, 4, -1]:
    transformer = MoleculeTransformer(
        FPCalculator("ecfp"),
        n_jobs=n_jobs
    )

    start = time.time()
    features = transformer(smiles_large)
    elapsed = time.time() - start

    print(f"n_jobs={n_jobs}: {elapsed:.2f}s")
```

### Caching for Expensive Operations

```python
from molfeat.trans.pretrained import PretrainedMolTransformer
import pickle

# Load expensive pretrained model
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)

# Cache embeddings for reuse
cache_file = "embeddings_cache.pkl"

try:
    # Try loading cached embeddings
    with open(cache_file, "rb") as f:
        embeddings = pickle.load(f)
    print("Loaded cached embeddings")
except FileNotFoundError:
    # Compute and cache
    embeddings = transformer(smiles_list)
    with open(cache_file, "wb") as f:
        pickle.dump(embeddings, f)
    print("Computed and cached embeddings")
```
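
When the embeddings are a plain numeric array, NumPy's `.npy` format is a lighter-weight alternative to pickle for the same try/compute/save pattern:

```python
import os
import numpy as np

cache_file = "embeddings_cache.npy"

if os.path.exists(cache_file):
    embeddings = np.load(cache_file)
else:
    embeddings = np.asarray(transformer(smiles_list))
    np.save(cache_file, embeddings)
```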

---

## Common Workflows

### Virtual Screening Workflow

```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.ensemble import RandomForestClassifier

# 1. Prepare training data (known actives/inactives)
train_smiles = load_training_data()
train_labels = load_training_labels()  # 1=active, 0=inactive

# 2. Featurize training set
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)

# 3. Train classifier
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(X_train, train_labels)

# 4. Featurize screening library
screening_smiles = load_screening_library()  # e.g., 1M compounds
X_screen = transformer(screening_smiles)

# 5. Predict and rank by descending activity probability
predictions = clf.predict_proba(X_screen)[:, 1]
ranked_indices = predictions.argsort()[::-1]

# 6. Get top hits
top_n = 1000
top_hits = [screening_smiles[i] for i in ranked_indices[:top_n]]
```
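
Persisting the ranked hits with their scores keeps the screen auditable; a short sketch with pandas (the output filename is illustrative):

```python
import pandas as pd

hits = pd.DataFrame({
    "smiles": top_hits,
    "score": predictions[ranked_indices[:top_n]],
})
hits.to_csv("top_hits.csv", index=False)
```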

### QSAR Model Building

```python
from molfeat.calc import RDKitDescriptors2D
from molfeat.trans import MoleculeTransformer
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import numpy as np

# Load QSAR dataset
smiles = load_molecules()
y = load_activity_values()  # e.g., IC50, logP

# Featurize with interpretable descriptors
transformer = MoleculeTransformer(RDKitDescriptors2D(), n_jobs=-1)
X = transformer(smiles)

# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Build linear model
model = Ridge(alpha=1.0)
scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
print(f"R² = {scores.mean():.3f} (+/- {scores.std():.3f})")

# Fit final model
model.fit(X_scaled, y)

# Interpret feature importance via coefficient magnitudes
feature_names = transformer.featurizer.columns
importance = np.abs(model.coef_)
top_features_idx = importance.argsort()[-10:][::-1]

print("Top 10 important features:")
for idx in top_features_idx:
    print(f"  {feature_names[idx]}: {model.coef_[idx]:.3f}")
```
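
To score new molecules later, the fitted scaler must be reused before prediction; a short sketch, assuming a `new_smiles` list:

```python
# Featurize new molecules, then apply the *fitted* scaler
X_new = transformer(new_smiles)
y_pred = model.predict(scaler.transform(X_new))
```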

### Similarity Search

```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Query molecule
query_smiles = "CC(=O)Oc1ccccc1C(=O)O"  # Aspirin

# Database of molecules
database_smiles = load_molecule_database()  # Large collection

# Compute fingerprints
calc = FPCalculator("ecfp")
query_fp = calc(query_smiles).reshape(1, -1)

transformer = MoleculeTransformer(calc, n_jobs=-1)
database_fps = transformer(database_smiles)

# Compute similarity
similarities = cosine_similarity(query_fp, database_fps)[0]

# Find most similar
top_k = 10
top_indices = similarities.argsort()[-top_k:][::-1]

print(f"Top {top_k} similar molecules:")
for i, idx in enumerate(top_indices, 1):
    print(f"{i}. {database_smiles[idx]} (similarity: {similarities[idx]:.3f})")
```
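
For binary fingerprints such as ECFP, Tanimoto similarity is the more conventional cheminformatics metric than cosine. A minimal NumPy sketch over the same arrays, assuming the fingerprints are 0/1-valued:

```python
def bulk_tanimoto(query_fp, database_fps):
    # Tanimoto = |A AND B| / |A OR B| for binary vectors
    query = query_fp.astype(bool).ravel()
    db = database_fps.astype(bool)
    intersection = (db & query).sum(axis=1)
    union = (db | query).sum(axis=1)
    return intersection / np.maximum(union, 1)  # guard against empty fingerprints

tanimoto = bulk_tanimoto(query_fp, database_fps)
top_indices = tanimoto.argsort()[-top_k:][::-1]
```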

---

## Troubleshooting

### Handling Invalid Molecules

```python
# Pass ignore_errors at transform time to skip molecules that fail
# featurization; failed entries come back as None instead of raising
transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    verbose=True
)
features = transformer(smiles_list, ignore_errors=True)

# Filter out None values after transformation
valid_mask = [f is not None for f in features]
valid_features = [f for f in features if f is not None]
valid_smiles = [s for s, m in zip(smiles_list, valid_mask) if m]
```
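
Alternatively, filter out unparseable SMILES before featurizing; `datamol.to_mol` returns `None` for invalid input:

```python
import datamol as dm

# Keep only SMILES that RDKit can actually parse
valid_smiles = [s for s in smiles_list if dm.to_mol(s) is not None]
features = transformer(valid_smiles)
```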

### Memory Management for Large Datasets

```python
import numpy as np

# Process in chunks for very large datasets
def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
    all_features = []

    for i in range(0, len(smiles_list), chunk_size):
        chunk = smiles_list[i:i+chunk_size]
        features = transformer(chunk)
        all_features.append(features)
        print(f"Processed {i+len(chunk)}/{len(smiles_list)}")

    return np.vstack(all_features)

# Use with large dataset
features = featurize_in_chunks(large_smiles_list, transformer)
```
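
If even the stacked result is too large for memory, chunks can be written directly into a memory-mapped array on disk. A sketch assuming the feature width is known up front (2048 for the default ECFP):

```python
import numpy as np

n_mols, n_dims = len(large_smiles_list), 2048  # 2048 assumes default ECFP size
features_mm = np.lib.format.open_memmap(
    "features.npy", mode="w+", dtype=np.float32, shape=(n_mols, n_dims)
)

for i in range(0, n_mols, 10000):
    chunk = large_smiles_list[i:i + 10000]
    features_mm[i:i + len(chunk)] = transformer(chunk)

features_mm.flush()
```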

### Reproducibility

```python
import random
import numpy as np
import torch

# Set all random seeds
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)

# Save exact configuration
transformer.to_state_yaml_file("config.yml")

# Document version
import molfeat
print(f"molfeat version: {molfeat.__version__}")
```
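
The saved state file can rebuild an identical transformer later; a minimal sketch, assuming molfeat exposes a matching `from_state_yaml_file` loader:

```python
from molfeat.trans import MoleculeTransformer

# Restore the exact featurizer configuration saved above
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
```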