DeepChem Workflows
This document provides detailed workflows for common DeepChem use cases.
Workflow 1: Molecular Property Prediction from SMILES
Goal: Predict molecular properties (e.g., solubility, toxicity, activity) from SMILES strings.
Step-by-Step Process
1. Prepare Your Data
Data should be in CSV format with at minimum:
- A column with SMILES strings
- One or more columns with property values (targets)
Example CSV structure:
smiles,solubility,toxicity
CCO,-0.77,0
CC(=O)OC1=CC=CC=C1C(=O)O,-1.19,1
2. Choose Featurizer
Decision tree:
- Small dataset (<1K): Use CircularFingerprint or RDKitDescriptors
- Medium dataset (1K-100K): Use CircularFingerprint or MolGraphConvFeaturizer
- Large dataset (>100K): Use graph-based featurizers (MolGraphConvFeaturizer, DMPNNFeaturizer)
- Transfer learning: Use pretrained model featurizers (GroverFeaturizer)
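The decision tree above can be sketched as a small helper function (a hypothetical illustration for choosing a featurizer name by dataset size; DeepChem does not ship this function):

```python
def suggest_featurizer(n_samples, transfer_learning=False):
    """Map dataset size to a suggested DeepChem featurizer name,
    following the decision tree above. Hypothetical helper."""
    if transfer_learning:
        return "GroverFeaturizer"
    if n_samples < 1_000:
        return "CircularFingerprint"   # or RDKitDescriptors
    if n_samples <= 100_000:
        return "MolGraphConvFeaturizer"  # or CircularFingerprint
    return "DMPNNFeaturizer"

print(suggest_featurizer(500))      # small dataset
print(suggest_featurizer(50_000))   # medium dataset
```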
3. Load and Featurize Data
import deepchem as dc
# For fingerprint-based
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
# OR for graph-based
featurizer = dc.feat.MolGraphConvFeaturizer()
loader = dc.data.CSVLoader(
    tasks=['solubility', 'toxicity'],  # column names to predict
    feature_field='smiles',            # column with SMILES
    featurizer=featurizer
)
dataset = loader.create_dataset('data.csv')
4. Split Data
Critical: Use ScaffoldSplitter for drug discovery to prevent data leakage.
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset,
    frac_train=0.8,
    frac_valid=0.1,
    frac_test=0.1
)
5. Transform Data (Optional but Recommended)
transformers = [
    dc.trans.NormalizationTransformer(
        transform_y=True,
        dataset=train
    )
]
for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)
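NormalizationTransformer standardizes targets using statistics fit on the training split only, then applies the same transform to every split. The idea can be sketched in plain Python (a conceptual illustration, not DeepChem's implementation):

```python
from statistics import mean, pstdev

def fit_normalizer(y_train):
    """Compute mean/std on the training targets only, so no
    information leaks from the valid/test splits, and return
    a function that applies the same standardization."""
    mu, sigma = mean(y_train), pstdev(y_train)
    return lambda ys: [(y - mu) / sigma for y in ys]

# Fit on training targets, then reuse for valid/test targets
normalize = fit_normalizer([-0.77, -1.19, -2.10, 0.35])
scaled = normalize([-0.77, -1.19])
```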
6. Select and Train Model
# For fingerprints
model = dc.models.MultitaskRegressor(
    n_tasks=2,                 # number of properties to predict
    n_features=2048,           # fingerprint size
    layer_sizes=[1000, 500],   # hidden layer sizes
    dropouts=0.25,
    learning_rate=0.001
)
# OR for graphs
model = dc.models.GCNModel(
    n_tasks=2,
    mode='regression',
    batch_size=128,
    learning_rate=0.001
)
# Train
model.fit(train, nb_epoch=50)
7. Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
train_score = model.evaluate(train, [metric])
valid_score = model.evaluate(valid, [metric])
test_score = model.evaluate(test, [metric])
print(f"Train R²: {train_score}")
print(f"Valid R²: {valid_score}")
print(f"Test R²: {test_score}")
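dc.metrics.r2_score computes the usual coefficient of determination; as a plain-Python sketch of the formula (not DeepChem's implementation):

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot: the fraction of target
    variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

print(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # perfect fit: 1.0
```

A model that only predicts the target mean scores 0; worse-than-mean predictions go negative.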
8. Make Predictions
# Predict on new molecules
new_smiles = ['CCO', 'CC(C)O', 'c1ccccc1']
new_featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
new_features = new_featurizer.featurize(new_smiles)
new_dataset = dc.data.NumpyDataset(X=new_features)
# Apply same transformations
for transformer in transformers:
    new_dataset = transformer.transform(new_dataset)
predictions = model.predict(new_dataset)
Workflow 2: Using MoleculeNet Benchmark Datasets
Goal: Quickly train and evaluate models on standard benchmarks.
Quick Start
import deepchem as dc
# Load benchmark dataset
tasks, datasets, transformers = dc.molnet.load_tox21(
    featurizer='GraphConv',
    splitter='scaffold'
)
train, valid, test = datasets
# Train model
model = dc.models.GCNModel(
    n_tasks=len(tasks),
    mode='classification'
)
model.fit(train, nb_epoch=50)
# Evaluate
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
print(f"Test ROC-AUC: {test_score}")
Available Featurizer Options
When calling load_*() functions:
- 'ECFP': Extended-connectivity fingerprints (circular fingerprints)
- 'GraphConv': Graph convolution features
- 'Weave': Weave features
- 'Raw': Raw SMILES strings
- 'smiles2img': 2D molecular images
Available Splitter Options
- 'scaffold': Scaffold-based splitting (recommended for drug discovery)
- 'random': Random splitting
- 'stratified': Stratified splitting (preserves class distributions)
- 'butina': Butina clustering-based splitting
Workflow 3: Hyperparameter Optimization
Goal: Find optimal model hyperparameters systematically.
Using GridHyperparamOpt
import deepchem as dc
import numpy as np
# Load data
tasks, datasets, transformers = dc.molnet.load_bbbp(
    featurizer='ECFP',
    splitter='scaffold'
)
train, valid, test = datasets
# Define parameter grid
params_dict = {
    'layer_sizes': [[1000], [1000, 500], [1000, 1000]],
    'dropouts': [0.0, 0.25, 0.5],
    'learning_rate': [0.001, 0.0001]
}
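Grid search trains one model per combination in the grid, so cost grows multiplicatively: the grid above yields 3 × 3 × 2 = 18 models. The expansion can be sketched with itertools.product:

```python
from itertools import product

params_dict = {
    'layer_sizes': [[1000], [1000, 500], [1000, 1000]],
    'dropouts': [0.0, 0.25, 0.5],
    'learning_rate': [0.001, 0.0001],
}

# Expand the grid into one dict of hyperparameters per model to train
keys = list(params_dict)
combinations = [dict(zip(keys, values))
                for values in product(*params_dict.values())]
print(len(combinations))  # 18
```

Doubling the options for any one hyperparameter doubles the total number of training runs, which is why grids are usually kept small.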
# Define model builder function
def model_builder(model_params, model_dir):
    return dc.models.MultitaskClassifier(
        n_tasks=len(tasks),
        n_features=1024,
        **model_params
    )
# Setup optimizer
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
optimizer = dc.hyper.GridHyperparamOpt(model_builder)
# Run optimization
best_model, best_params, all_results = optimizer.hyperparam_search(
    params_dict,
    train,
    valid,
    metric,
    transformers=transformers
)
print(f"Best parameters: {best_params}")
print(f"Best validation score: {max(all_results.values())}")  # all_results maps each hyperparameter setting to its validation score
Workflow 4: Transfer Learning with Pretrained Models
Goal: Leverage pretrained models for improved performance on small datasets.
Using ChemBERTa
import deepchem as dc
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load your data
loader = dc.data.CSVLoader(
    tasks=['activity'],
    feature_field='smiles',
    featurizer=dc.feat.DummyFeaturizer()  # ChemBERTa tokenizes raw SMILES itself
)
dataset = loader.create_dataset('data.csv')
# Split data
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)
# Load pretrained ChemBERTa: HuggingFaceModel wraps a transformers model plus its tokenizer
checkpoint = 'seyonec/ChemBERTa-zinc-base-v1'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
hf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=1)
model = dc.models.HuggingFaceModel(
    model=hf_model,
    tokenizer=tokenizer,
    task='regression'
)
# Fine-tune
model.fit(train, nb_epoch=10)
# Evaluate
predictions = model.predict(test)
Using GROVER
# GROVER: pre-trained on molecular graphs
model = dc.models.GroverModel(
    task='classification',
    n_tasks=1,
    model_dir='./grover_model'
)
# Fine-tune on your data
model.fit(train_dataset, nb_epoch=20)
Workflow 5: Molecular Generation with GANs
Goal: Generate novel molecules with desired properties.
Basic MolGAN
import deepchem as dc
# Load training data (molecules for the generator to learn from)
tasks, datasets, _ = dc.molnet.load_qm9(
    featurizer='GraphConv',
    splitter='random'
)
train, _, _ = datasets
# Create and train MolGAN
gan = dc.models.BasicMolGANModel(
    learning_rate=0.001,
    vertices=9,  # max atoms per molecule
    edges=5,     # number of bond types
    nodes=[128, 256, 512]
)
# Train
gan.fit_gan(
    train,
    nb_epoch=100,
    generator_steps=0.2,
    checkpoint_interval=10
)
# Generate new molecules
generated_molecules = gan.predict_gan_generator(1000)
Conditional Generation
# For property-targeted generation
import numpy as np
from deepchem.models.optimizers import ExponentialDecay
gan = dc.models.BasicMolGANModel(
    learning_rate=ExponentialDecay(0.001, 0.9, 1000),
    conditional=True  # enable conditional generation
)
# Train with properties
gan.fit_gan(train, nb_epoch=100)
# Generate molecules with target properties
target_properties = np.array([[5.0, 300.0]]) # e.g., [logP, MW]
molecules = gan.predict_gan_generator(
    1000,
    conditional_inputs=target_properties
)
Workflow 6: Materials Property Prediction
Goal: Predict properties of crystalline materials.
Using Crystal Graph Convolutional Networks
import deepchem as dc
# Load materials data (structure files in CIF format)
loader = dc.data.CIFLoader()
dataset = loader.create_dataset('materials.csv')
# Split data
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
# Create CGCNN model
model = dc.models.CGCNNModel(
    n_tasks=1,
    mode='regression',
    batch_size=32,
    learning_rate=0.001
)
# Train
model.fit(train, nb_epoch=100)
# Evaluate
metric = dc.metrics.Metric(dc.metrics.mae_score)
test_score = model.evaluate(test, [metric])
Workflow 7: Protein Sequence Analysis
Goal: Predict protein properties from sequences.
Using ProtBERT
import deepchem as dc
# Load protein sequence data
loader = dc.data.FASTALoader()
dataset = loader.create_dataset('proteins.fasta')
# Use ProtBERT: HuggingFaceModel wraps a transformers model plus its tokenizer
from transformers import AutoModelForSequenceClassification, AutoTokenizer
checkpoint = 'Rostlab/prot_bert'
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
hf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
model = dc.models.HuggingFaceModel(
    model=hf_model,
    tokenizer=tokenizer,
    task='classification'
)
# Split and train
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
model.fit(train, nb_epoch=5)
# Predict
predictions = model.predict(test)
Workflow 8: Custom Model Integration
Goal: Use your own PyTorch/scikit-learn models with DeepChem.
Wrapping Scikit-Learn Models
from sklearn.ensemble import RandomForestRegressor
import deepchem as dc
# Create scikit-learn model
sklearn_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
# Wrap in DeepChem
model = dc.models.SklearnModel(model=sklearn_model)
# Use with DeepChem datasets
model.fit(train)
predictions = model.predict(test)
# Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
score = model.evaluate(test, [metric])
Creating Custom PyTorch Models
import torch
import torch.nn as nn
import deepchem as dc
class CustomNetwork(nn.Module):
    def __init__(self, n_features, n_tasks):
        super().__init__()
        self.fc1 = nn.Linear(n_features, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, n_tasks)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        return self.fc3(x)
# Wrap in DeepChem TorchModel
model = dc.models.TorchModel(
    model=CustomNetwork(n_features=2048, n_tasks=1),
    loss=dc.models.losses.L2Loss(),  # TorchModel expects a DeepChem Loss (or a compatible function)
    output_types=['prediction']
)
# Train
model.fit(train, nb_epoch=50)
Common Pitfalls and Solutions
Issue 1: Data Leakage in Drug Discovery
Problem: Using random splitting allows similar molecules in train and test sets.
Solution: Always use ScaffoldSplitter for molecular datasets.
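A scaffold split keeps all molecules sharing a Bemis-Murcko scaffold in the same partition, so the test set contains genuinely unseen chemotypes. The grouping logic can be sketched in plain Python (scaffold keys below are hypothetical placeholders; DeepChem derives them from structures via RDKit):

```python
def group_split(scaffold_of, frac_train=0.8):
    """Assign whole scaffold groups to train or test so no scaffold
    appears in both partitions. Conceptual sketch only; DeepChem's
    ScaffoldSplitter fills the training set with the largest groups first."""
    groups = {}
    for smiles, scaffold in scaffold_of.items():
        groups.setdefault(scaffold, []).append(smiles)
    train, test = [], []
    n_total = len(scaffold_of)
    for members in sorted(groups.values(), key=len, reverse=True):
        if len(train) + len(members) <= frac_train * n_total:
            train.extend(members)
        else:
            test.extend(members)
    return train, test

# Hypothetical scaffold assignments for five molecules
scaffolds = {'CCO': 'acyclic', 'CCN': 'acyclic', 'CCC': 'acyclic',
             'c1ccccc1': 'benzene', 'Cc1ccccc1': 'benzene'}
train, test = group_split(scaffolds, frac_train=0.6)
```

A random split would happily put 'c1ccccc1' in train and the near-identical 'Cc1ccccc1' in test, inflating test scores.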
Issue 2: Imbalanced Classification
Problem: Poor performance on minority class.
Solution: Use BalancingTransformer or weighted metrics.
transformer = dc.trans.BalancingTransformer(dataset=train)
train = transformer.transform(train)
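Conceptually, BalancingTransformer reweights examples so every class carries equal total weight; the effect can be sketched by hand (plain-Python illustration of inverse-frequency weighting, not the exact DeepChem formula):

```python
from collections import Counter

def balancing_weights(labels):
    """Weight each example inversely to its class frequency so that
    every class contributes the same total weight to the loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return [n / (k * counts[y]) for y in labels]

# With a 4:1 imbalance, the lone positive gets 4x the weight of each negative
weights = balancing_weights([0, 0, 0, 0, 1])
```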
Issue 3: Memory Issues with Large Datasets
Problem: Dataset doesn't fit in memory.
Solution: Use DiskDataset instead of NumpyDataset.
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)
Issue 4: Overfitting on Small Datasets
Problem: Model memorizes training data. Solutions:
- Use stronger regularization (increase dropout)
- Use simpler models (Random Forest, Ridge)
- Apply transfer learning (pretrained models)
- Collect more data
Issue 5: Poor Graph Neural Network Performance
Problem: GNN performs worse than fingerprints. Solutions:
- Check if dataset is large enough (GNNs need >10K samples typically)
- Increase training epochs
- Try different GNN architectures (AttentiveFP, DMPNN)
- Use pretrained models (GROVER)