# DeepChem Workflows

This document provides detailed workflows for common DeepChem use cases.

## Workflow 1: Molecular Property Prediction from SMILES

**Goal**: Predict molecular properties (e.g., solubility, toxicity, activity) from SMILES strings.

### Step-by-Step Process

#### 1. Prepare Your Data
Data should be in CSV format with at minimum:
- A column with SMILES strings
- One or more columns with property values (targets)

Example CSV structure:

```csv
smiles,solubility,toxicity
CCO,-0.77,0
CC(=O)OC1=CC=CC=C1C(=O)O,-1.19,1
```

#### 2. Choose Featurizer
Decision tree (instantiation sketch below):
- **Small dataset (<1K)**: Use `CircularFingerprint` or `RDKitDescriptors`
- **Medium dataset (1K-100K)**: Use `CircularFingerprint` or `MolGraphConvFeaturizer`
- **Large dataset (>100K)**: Use graph-based featurizers (`MolGraphConvFeaturizer`, `DMPNNFeaturizer`)
- **Transfer learning**: Use pretrained model featurizers (`GroverFeaturizer`)
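
As a quick reference, here is how each of the featurizers above is instantiated. The constructor arguments shown are common defaults, not requirements; check the `dc.feat` docs for your DeepChem version.

```python
import deepchem as dc

fp = dc.feat.CircularFingerprint(radius=2, size=2048)  # ECFP4-style bit vectors
desc = dc.feat.RDKitDescriptors()                      # physicochemical descriptors
graph = dc.feat.MolGraphConvFeaturizer()               # graph objects for GCN-style models
dmpnn = dc.feat.DMPNNFeaturizer()                      # directed message-passing graphs
```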

#### 3. Load and Featurize Data
```python
import deepchem as dc

# For fingerprint-based
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
# OR for graph-based
featurizer = dc.feat.MolGraphConvFeaturizer()

loader = dc.data.CSVLoader(
    tasks=['solubility', 'toxicity'],  # column names to predict
    feature_field='smiles',            # column with SMILES
    featurizer=featurizer
)
dataset = loader.create_dataset('data.csv')
```

#### 4. Split Data
**Critical**: Use `ScaffoldSplitter` for drug discovery to prevent data leakage.

```python
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset,
    frac_train=0.8,
    frac_valid=0.1,
    frac_test=0.1
)
```

#### 5. Transform Data (Optional but Recommended)
```python
# Normalize regression targets using statistics computed on the training set
transformers = [
    dc.trans.NormalizationTransformer(
        transform_y=True,
        dataset=train
    )
]

for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)
```

#### 6. Select and Train Model
```python
# For fingerprints
model = dc.models.MultitaskRegressor(
    n_tasks=2,                # number of properties to predict
    n_features=2048,          # fingerprint size
    layer_sizes=[1000, 500],  # hidden layer sizes
    dropouts=0.25,
    learning_rate=0.001
)

# OR for graphs (requires MolGraphConvFeaturizer features from step 3)
model = dc.models.GCNModel(
    n_tasks=2,
    mode='regression',
    batch_size=128,
    learning_rate=0.001
)

# Train
model.fit(train, nb_epoch=50)
```

#### 7. Evaluate
```python
metric = dc.metrics.Metric(dc.metrics.r2_score)
# Pass the transformers so predictions are un-normalized before scoring
train_score = model.evaluate(train, [metric], transformers)
valid_score = model.evaluate(valid, [metric], transformers)
test_score = model.evaluate(test, [metric], transformers)

print(f"Train R²: {train_score}")
print(f"Valid R²: {valid_score}")
print(f"Test R²: {test_score}")
```

#### 8. Make Predictions
```python
# Predict on new molecules, using the same featurizer settings as training
new_smiles = ['CCO', 'CC(C)O', 'c1ccccc1']
new_featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
new_features = new_featurizer.featurize(new_smiles)
new_dataset = dc.data.NumpyDataset(X=new_features)

# Pass the fitted transformers to predict() so outputs are mapped back to
# the original scale; the new dataset itself has no labels to transform
predictions = model.predict(new_dataset, transformers)
```

---

## Workflow 2: Using MoleculeNet Benchmark Datasets

**Goal**: Quickly train and evaluate models on standard benchmarks.

### Quick Start
```python
import deepchem as dc

# Load benchmark dataset. GCNModel consumes graph features, so pass a
# MolGraphConvFeaturizer instance (the 'GraphConv' string selects ConvMol
# features meant for the TensorFlow GraphConvModel instead)
tasks, datasets, transformers = dc.molnet.load_tox21(
    featurizer=dc.feat.MolGraphConvFeaturizer(),
    splitter='scaffold'
)
train, valid, test = datasets

# Train model
model = dc.models.GCNModel(
    n_tasks=len(tasks),
    mode='classification'
)
model.fit(train, nb_epoch=50)

# Evaluate
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
print(f"Test ROC-AUC: {test_score}")
```

### Available Featurizer Options
When calling `load_*()` functions (combined usage sketch after the splitter list below):
- `'ECFP'`: Extended-connectivity fingerprints (circular fingerprints)
- `'GraphConv'`: Graph convolution features
- `'Weave'`: Weave features
- `'Raw'`: Raw SMILES strings
- `'smiles2img'`: 2D molecular images

### Available Splitter Options
- `'scaffold'`: Scaffold-based splitting (recommended for drug discovery)
- `'random'`: Random splitting
- `'stratified'`: Stratified splitting (preserves class distributions)
- `'butina'`: Butina clustering-based splitting
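
A sketch of how the string options combine in a single `load_*()` call; the 1024-bit default for `'ECFP'` is an assumption worth verifying for your DeepChem version:

```python
import deepchem as dc

# ECFP fingerprints with a stratified split on the same benchmark
tasks, datasets, transformers = dc.molnet.load_tox21(
    featurizer='ECFP',
    splitter='stratified'
)
train, valid, test = datasets
print(train.X.shape)  # (n_molecules, 1024) with the default ECFP size
```
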
---

## Workflow 3: Hyperparameter Optimization

**Goal**: Find optimal model hyperparameters systematically.

### Using GridHyperparamOpt
```python
import deepchem as dc

# Load data
tasks, datasets, transformers = dc.molnet.load_bbbp(
    featurizer='ECFP',
    splitter='scaffold'
)
train, valid, test = datasets

# Define parameter grid
params_dict = {
    'layer_sizes': [[1000], [1000, 500], [1000, 1000]],
    'dropouts': [0.0, 0.25, 0.5],
    'learning_rate': [0.001, 0.0001]
}

# Define model builder function; GridHyperparamOpt calls it once per
# combination of hyperparameters from the grid
def model_builder(**model_params):
    return dc.models.MultitaskClassifier(
        n_tasks=len(tasks),
        n_features=1024,  # default ECFP size used by MoleculeNet loaders
        **model_params
    )

# Set up optimizer
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
optimizer = dc.hyper.GridHyperparamOpt(model_builder)

# Run optimization; all_results maps each hyperparameter combination to
# its validation score
best_model, best_params, all_results = optimizer.hyperparam_search(
    params_dict,
    train,
    valid,
    metric,
    output_transformers=transformers
)

print(f"Best parameters: {best_params}")
print(f"All validation scores: {all_results}")
```

---

## Workflow 4: Transfer Learning with Pretrained Models

**Goal**: Leverage pretrained models for improved performance on small datasets.

### Using ChemBERTa
```python
import deepchem as dc
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load your data; DummyFeaturizer keeps raw SMILES strings, since the
# tokenizer handles featurization
loader = dc.data.CSVLoader(
    tasks=['activity'],
    feature_field='smiles',
    featurizer=dc.feat.DummyFeaturizer()
)
dataset = loader.create_dataset('data.csv')

# Split data
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)

# Load pretrained ChemBERTa. HuggingFaceModel wraps a transformers model
# and tokenizer rather than taking a checkpoint name directly
checkpoint = 'seyonec/ChemBERTa-zinc-base-v1'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
hf_model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=1
)
model = dc.models.HuggingFaceModel(
    model=hf_model,
    tokenizer=tokenizer,
    task='regression'
)

# Fine-tune
model.fit(train, nb_epoch=10)

# Evaluate
predictions = model.predict(test)
```

### Using GROVER
```python
# GROVER: pretrained on large unlabeled molecular graph corpora.
# Sketch only: current GroverModel constructors also require graph-related
# arguments (e.g. atom/bond feature dimensions) and inputs featurized with
# dc.feat.GroverFeaturizer; consult the GroverModel docs for details.
model = dc.models.GroverModel(
    task='classification',
    n_tasks=1,
    model_dir='./grover_model'
)

# Fine-tune on your data
model.fit(train_dataset, nb_epoch=20)
```

---

## Workflow 5: Molecular Generation with GANs

**Goal**: Generate novel molecules with desired properties.

### Basic MolGAN
```python
import deepchem as dc
import tensorflow as tf

# Load training molecules. MolGAN needs graph tensors from MolGanFeaturizer
# (GraphConv features will not work here), so load raw SMILES first;
# MoleculeNet stores the SMILES strings as dataset ids.
tasks, datasets, _ = dc.molnet.load_qm9(featurizer='Raw', splitter='random')
train, _, _ = datasets
featurizer = dc.feat.MolGanFeaturizer(max_atom_count=9)
features = featurizer.featurize(train.ids)
# Drop molecules the featurizer could not encode
features = [f for f in features if hasattr(f, 'adjacency_matrix')]
dataset = dc.data.NumpyDataset(
    X=[f.adjacency_matrix for f in features],
    y=[f.node_features for f in features]
)

# Create MolGAN; vertices/edges/nodes describe the graph representation,
# not layer widths
gan = dc.models.BasicMolGANModel(
    learning_rate=0.001,
    vertices=9,  # max atoms in molecule
    edges=5,     # number of bond types
    nodes=5      # number of atom types
)

# fit_gan consumes an iterable of batches of one-hot graph tensors
def iterbatches(epochs):
    for _ in range(epochs):
        for batch in dataset.iterbatches(batch_size=gan.batch_size,
                                         pad_batches=True):
            yield {gan.data_inputs[0]: tf.one_hot(batch[0], gan.edges),
                   gan.data_inputs[1]: tf.one_hot(batch[1], gan.nodes)}

# Train
gan.fit_gan(iterbatches(100), generator_steps=0.2, checkpoint_interval=10)

# Generate new molecules and decode the graphs back to RDKit molecules
generated = gan.predict_gan_generator(1000)
generated_molecules = featurizer.defeaturize(generated)
```

### Conditional Generation
Property-targeted generation needs a GAN whose generator and discriminator receive the target properties as extra inputs. The stock `BasicMolGANModel` does not expose a `conditional` flag; the sketch below assumes a hypothetical subclass, here called `ConditionalMolGAN` (not part of DeepChem), that overrides `get_conditional_input_shapes()` from the base `GAN` class. The `conditional_inputs` argument to `predict_gan_generator()` is part of the real GAN API.

```python
import numpy as np
from deepchem.models.optimizers import ExponentialDecay

# Hypothetical conditional MolGAN subclass (not a DeepChem class)
gan = ConditionalMolGAN(
    learning_rate=ExponentialDecay(0.001, 0.9, 1000)
)

# Train; the batches must also yield the conditional property inputs
gan.fit_gan(iterbatches(100))

# Generate molecules biased toward target properties
target_properties = np.array([[5.0, 300.0]])  # e.g., [logP, MW]
molecules = gan.predict_gan_generator(
    1000,
    conditional_inputs=[target_properties]
)
```

---

## Workflow 6: Materials Property Prediction

**Goal**: Predict properties of crystalline materials.

### Using Crystal Graph Convolutional Networks
```python
import deepchem as dc

# Load materials data. A self-contained route is a MoleculeNet materials
# benchmark (perovskite formation energies); CGCNNFeaturizer builds crystal
# graphs from the structures. For your own CIF files, construct structures
# with pymatgen and featurize them the same way.
tasks, datasets, transformers = dc.molnet.load_perovskite(
    featurizer=dc.feat.CGCNNFeaturizer(),
    splitter='random'
)
train, valid, test = datasets

# Create CGCNN model
model = dc.models.CGCNNModel(
    n_tasks=1,
    mode='regression',
    batch_size=32,
    learning_rate=0.001
)

# Train
model.fit(train, nb_epoch=100)

# Evaluate
metric = dc.metrics.Metric(dc.metrics.mae_score)
test_score = model.evaluate(test, [metric])
```

---

## Workflow 7: Protein Sequence Analysis

**Goal**: Predict protein properties from sequences.

### Using ProtBERT
```python
import deepchem as dc
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load protein sequences as raw strings; FASTA files carry no labels, so
# attach classification targets to the dataset separately before training
loader = dc.data.FASTALoader(featurizer=dc.feat.DummyFeaturizer())
dataset = loader.create_dataset('proteins.fasta')

# Wrap ProtBERT; as with ChemBERTa, HuggingFaceModel takes a transformers
# model plus tokenizer. Note ProtBERT's tokenizer expects space-separated
# residues ("M K T ..."), which may require preprocessing the sequences.
checkpoint = 'Rostlab/prot_bert'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
hf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model = dc.models.HuggingFaceModel(
    model=hf_model,
    tokenizer=tokenizer,
    task='classification'
)

# Split and train
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
model.fit(train, nb_epoch=5)

# Predict
predictions = model.predict(test)
```

---

## Workflow 8: Custom Model Integration

**Goal**: Use your own PyTorch/scikit-learn models with DeepChem.

### Wrapping Scikit-Learn Models
```python
from sklearn.ensemble import RandomForestRegressor
import deepchem as dc

# Create scikit-learn model
sklearn_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42
)

# Wrap in DeepChem
model = dc.models.SklearnModel(model=sklearn_model)

# Use with DeepChem datasets
model.fit(train)
predictions = model.predict(test)

# Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
score = model.evaluate(test, [metric])
```

### Creating Custom PyTorch Models
```python
import torch
import torch.nn as nn
import deepchem as dc

class CustomNetwork(nn.Module):
    def __init__(self, n_features, n_tasks):
        super().__init__()
        self.fc1 = nn.Linear(n_features, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, n_tasks)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        return self.fc3(x)

# Wrap in DeepChem TorchModel. TorchModel expects a DeepChem Loss (or a
# function of (outputs, labels, weights)), so use L2Loss rather than a
# raw torch.nn loss module
model = dc.models.TorchModel(
    model=CustomNetwork(n_features=2048, n_tasks=1),
    loss=dc.models.losses.L2Loss(),
    output_types=['prediction']
)

# Train
model.fit(train, nb_epoch=50)
```

---

## Common Pitfalls and Solutions

### Issue 1: Data Leakage in Drug Discovery
**Problem**: Random splitting lets structurally similar molecules land in both train and test sets, inflating test scores.
**Solution**: Always use `ScaffoldSplitter` for molecular datasets (see Workflow 1, step 4).

### Issue 2: Imbalanced Classification
**Problem**: Poor performance on the minority class.
**Solution**: Use `BalancingTransformer` or weighted metrics.
```python
# Reweight samples so each class contributes equally to the loss
transformer = dc.trans.BalancingTransformer(dataset=train)
train = transformer.transform(train)
```

### Issue 3: Memory Issues with Large Datasets
**Problem**: Dataset doesn't fit in memory.
**Solution**: Use `DiskDataset` instead of `NumpyDataset`; it stores shards on disk and streams batches during training.
```python
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)
```

### Issue 4: Overfitting on Small Datasets
**Problem**: Model memorizes training data.
**Solutions** (sketch below):
1. Use stronger regularization (increase dropout)
2. Use simpler models (Random Forest, Ridge)
3. Apply transfer learning (pretrained models)
4. Collect more data
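
A minimal sketch of options 1 and 2, assuming 2048-bit fingerprint features; the values are illustrative, and `weight_decay_penalty` is a standard `MultitaskRegressor` argument:

```python
import deepchem as dc
from sklearn.ensemble import RandomForestRegressor

# Option 1: smaller network with heavier regularization
model = dc.models.MultitaskRegressor(
    n_tasks=1,
    n_features=2048,
    layer_sizes=[500],          # fewer/smaller hidden layers
    dropouts=0.5,               # stronger dropout
    weight_decay_penalty=0.001  # L2 penalty on the weights
)

# Option 2: a simpler model, often hard to beat on small datasets
rf = dc.models.SklearnModel(model=RandomForestRegressor(n_estimators=100))
```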

### Issue 5: Poor Graph Neural Network Performance
**Problem**: GNN performs worse than fingerprints.
**Solutions** (sketch below):
1. Check that the dataset is large enough (GNNs typically need >10K samples)
2. Increase training epochs
3. Try different GNN architectures (AttentiveFP, DMPNN)
4. Use pretrained models (GROVER)
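
A sketch of option 3; constructor defaults are assumptions. AttentiveFP consumes `MolGraphConvFeaturizer(use_edges=True)` graphs, while DMPNN expects `DMPNNFeaturizer` output:

```python
import deepchem as dc

# AttentiveFP: graph attention networks using edge features
attentivefp = dc.models.AttentiveFPModel(n_tasks=1, mode='regression')

# DMPNN: directed message passing (as in Chemprop)
dmpnn = dc.models.DMPNNModel(n_tasks=1, mode='regression')
```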