# DeepChem Workflows

This document provides detailed workflows for common DeepChem use cases.

## Workflow 1: Molecular Property Prediction from SMILES

**Goal**: Predict molecular properties (e.g., solubility, toxicity, activity) from SMILES strings.

### Step-by-Step Process

#### 1. Prepare Your Data
Data should be in CSV format with at minimum:
- A column with SMILES strings
- One or more columns with property values (targets)

Example CSV structure:
```csv
smiles,solubility,toxicity
CCO,-0.77,0
CC(=O)OC1=CC=CC=C1C(=O)O,-1.19,1
```

#### 2. Choose Featurizer
Decision tree:
- **Small dataset (<1K)**: Use `CircularFingerprint` or `RDKitDescriptors`
- **Medium dataset (1K-100K)**: Use `CircularFingerprint` or `MolGraphConvFeaturizer`
- **Large dataset (>100K)**: Use graph-based featurizers (`MolGraphConvFeaturizer`, `DMPNNFeaturizer`)
- **Transfer learning**: Use pretrained model featurizers (`GroverFeaturizer`)

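All of these featurizers live under `dc.feat` and share the same `featurize()` call. A minimal sketch instantiating each option (constructor arguments shown are illustrative):

```python
import deepchem as dc

# Bit-vector fingerprints and RDKit physicochemical descriptors
fp = dc.feat.CircularFingerprint(radius=2, size=2048)
desc = dc.feat.RDKitDescriptors()

# Graph featurizers consumed by graph neural networks
graph = dc.feat.MolGraphConvFeaturizer()
dmpnn = dc.feat.DMPNNFeaturizer()

# Every featurizer exposes the same interface
features = fp.featurize(['CCO', 'c1ccccc1'])
print(features.shape)  # (2, 2048)
```
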
#### 3. Load and Featurize Data
```python
import deepchem as dc

# For fingerprint-based features
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
# OR for graph-based features
featurizer = dc.feat.MolGraphConvFeaturizer()

loader = dc.data.CSVLoader(
    tasks=['solubility', 'toxicity'],  # column names to predict
    feature_field='smiles',            # column with SMILES
    featurizer=featurizer
)
dataset = loader.create_dataset('data.csv')
```

#### 4. Split Data
**Critical**: Use `ScaffoldSplitter` for drug discovery to prevent data leakage.

```python
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset,
    frac_train=0.8,
    frac_valid=0.1,
    frac_test=0.1
)
```

#### 5. Transform Data (Optional but Recommended)
```python
transformers = [
    dc.trans.NormalizationTransformer(
        transform_y=True,
        dataset=train
    )
]

for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)
```

#### 6. Select and Train Model
```python
# For fingerprint features
model = dc.models.MultitaskRegressor(
    n_tasks=2,                # number of properties to predict
    n_features=2048,          # fingerprint size
    layer_sizes=[1000, 500],  # hidden layer sizes
    dropouts=0.25,
    learning_rate=0.001
)

# OR for graph features (GCNModel expects MolGraphConvFeaturizer output)
model = dc.models.GCNModel(
    n_tasks=2,
    mode='regression',
    batch_size=128,
    learning_rate=0.001
)

# Train
model.fit(train, nb_epoch=50)
```

#### 7. Evaluate
```python
metric = dc.metrics.Metric(dc.metrics.r2_score)
train_score = model.evaluate(train, [metric])
valid_score = model.evaluate(valid, [metric])
test_score = model.evaluate(test, [metric])

print(f"Train R²: {train_score}")
print(f"Valid R²: {valid_score}")
print(f"Test R²: {test_score}")
```

#### 8. Make Predictions
```python
# Predict on new molecules
new_smiles = ['CCO', 'CC(C)O', 'c1ccccc1']
new_featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
new_features = new_featurizer.featurize(new_smiles)
new_dataset = dc.data.NumpyDataset(X=new_features)

# Pass the fitted transformers so the y-normalization is undone on the outputs
predictions = model.predict(new_dataset, transformers)
```

---

## Workflow 2: Using MoleculeNet Benchmark Datasets

**Goal**: Quickly train and evaluate models on standard benchmarks.

### Quick Start
```python
import deepchem as dc

# Load benchmark dataset
tasks, datasets, transformers = dc.molnet.load_tox21(
    featurizer='GraphConv',
    splitter='scaffold'
)
train, valid, test = datasets

# Train a model that matches the 'GraphConv' featurization
model = dc.models.GraphConvModel(
    n_tasks=len(tasks),
    mode='classification'
)
model.fit(train, nb_epoch=50)

# Evaluate
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
print(f"Test ROC-AUC: {test_score}")
```

### Available Featurizer Options
When calling `load_*()` functions:
- `'ECFP'`: Extended-connectivity fingerprints (circular fingerprints)
- `'GraphConv'`: Graph convolution features
- `'Weave'`: Weave features
- `'Raw'`: Raw SMILES strings
- `'smiles2img'`: 2D molecular images

### Available Splitter Options
- `'scaffold'`: Scaffold-based splitting (recommended for drug discovery)
- `'random'`: Random splitting
- `'stratified'`: Stratified splitting (preserves class distributions)
- `'butina'`: Butina clustering-based splitting

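Both options are plain strings passed to any MoleculeNet loader, so swapping the featurization or splitting strategy is a one-line change. For example (dataset choice is illustrative):

```python
import deepchem as dc

# Same dataset, fingerprint features and a random split this time
tasks, datasets, transformers = dc.molnet.load_tox21(
    featurizer='ECFP',
    splitter='random'
)
train, valid, test = datasets
print(train.X.shape)  # (n_molecules, 1024) ECFP bit vectors
```
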
---

## Workflow 3: Hyperparameter Optimization

**Goal**: Find optimal model hyperparameters systematically.

### Using GridHyperparamOpt
```python
import deepchem as dc

# Load data
tasks, datasets, transformers = dc.molnet.load_bbbp(
    featurizer='ECFP',
    splitter='scaffold'
)
train, valid, test = datasets

# Define parameter grid
params_dict = {
    'layer_sizes': [[1000], [1000, 500], [1000, 1000]],
    'dropouts': [0.0, 0.25, 0.5],
    'learning_rate': [0.001, 0.0001]
}

# Define model builder function (hyperparameters arrive as keyword arguments)
def model_builder(**model_params):
    return dc.models.MultitaskClassifier(
        n_tasks=len(tasks),
        n_features=1024,  # default ECFP size
        **model_params
    )

# Set up optimizer
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
optimizer = dc.hyper.GridHyperparamOpt(model_builder)

# Run optimization
best_model, best_params, all_results = optimizer.hyperparam_search(
    params_dict,
    train,
    valid,
    metric,
    output_transformers=transformers
)

print(f"Best parameters: {best_params}")
print(f"All validation scores: {all_results}")  # maps each parameter combination to its score
```

---

## Workflow 4: Transfer Learning with Pretrained Models

**Goal**: Leverage pretrained models for improved performance on small datasets.

### Using ChemBERTa
```python
import deepchem as dc
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load your data
loader = dc.data.CSVLoader(
    tasks=['activity'],
    feature_field='smiles',
    featurizer=dc.feat.DummyFeaturizer()  # ChemBERTa tokenizes raw SMILES itself
)
dataset = loader.create_dataset('data.csv')

# Split data
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)

# Load pretrained ChemBERTa and wrap it for DeepChem
checkpoint = 'seyonec/ChemBERTa-zinc-base-v1'
hf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=1)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = dc.models.HuggingFaceModel(
    model=hf_model,
    tokenizer=tokenizer,
    task='regression'
)

# Fine-tune
model.fit(train, nb_epoch=10)

# Evaluate
predictions = model.predict(test)
```

### Using GROVER
```python
# GROVER: pre-trained on molecular graphs
model = dc.models.GroverModel(
    task='classification',
    n_tasks=1,
    model_dir='./grover_model'
)

# Fine-tune on your data
model.fit(train_dataset, nb_epoch=20)
```

---

## Workflow 5: Molecular Generation with GANs

**Goal**: Generate novel molecules with desired properties.

### Basic MolGAN
```python
import deepchem as dc
import tensorflow as tf
from deepchem.feat.molecule_featurizers.molgan_featurizer import GraphMatrix

# Load training molecules for the generator to learn from
# (MoleculeNet datasets store SMILES strings as ids)
tasks, datasets, _ = dc.molnet.load_qm9(featurizer='GraphConv', splitter='random')
train, _, _ = datasets
smiles = train.ids

# Encode molecules as adjacency/node matrices, MolGAN's input format
featurizer = dc.feat.MolGanFeaturizer(max_atom_count=9)
features = [f for f in featurizer.featurize(smiles) if isinstance(f, GraphMatrix)]
dataset = dc.data.NumpyDataset([x.adjacency_matrix for x in features],
                               [x.node_features for x in features])

# Create MolGAN
gan = dc.models.BasicMolGANModel(
    learning_rate=0.001,
    vertices=9  # max atoms per molecule
)

# fit_gan consumes an iterator over batches of one-hot graph tensors
def iterbatches(epochs):
    for _ in range(epochs):
        for batch in dataset.iterbatches(batch_size=gan.batch_size, pad_batches=True):
            adjacency_tensor = tf.one_hot(batch[0], gan.edges)
            node_tensor = tf.one_hot(batch[1], gan.nodes)
            yield {gan.data_inputs[0]: adjacency_tensor,
                   gan.data_inputs[1]: node_tensor}

# Train
gan.fit_gan(iterbatches(100), generator_steps=0.2, checkpoint_interval=10)

# Generate new molecules and decode them back to RDKit mols
generated = gan.predict_gan_generator(1000)
generated_molecules = featurizer.defeaturize(generated)
```

### Conditional Generation
```python
# Property-targeted generation (sketch only: assumes a MolGAN variant that
# declares conditional inputs; DeepChem GANs expose them via
# get_conditional_input_shapes)
import numpy as np
from deepchem.models.optimizers import ExponentialDecay

gan = dc.models.BasicMolGANModel(
    learning_rate=ExponentialDecay(0.001, 0.9, 1000)
)

# Train, including the target properties in each training batch
gan.fit_gan(iterbatches(100))

# Generate molecules conditioned on target properties
target_properties = np.array([[5.0, 300.0]])  # e.g., [logP, MW]
molecules = gan.predict_gan_generator(
    1000,
    conditional_inputs=[target_properties]
)
```

---

## Workflow 6: Materials Property Prediction

**Goal**: Predict properties of crystalline materials.

### Using Crystal Graph Convolutional Networks
```python
import deepchem as dc

# Load a materials benchmark; crystal structures are featurized as graphs
# (perovskite is used here as an example dataset)
tasks, datasets, transformers = dc.molnet.load_perovskite(
    featurizer=dc.feat.CGCNNFeaturizer(),
    splitter='random'
)
train, valid, test = datasets

# Create CGCNN model
model = dc.models.CGCNNModel(
    n_tasks=1,
    mode='regression',
    batch_size=32,
    learning_rate=0.001
)

# Train
model.fit(train, nb_epoch=100)

# Evaluate
metric = dc.metrics.Metric(dc.metrics.mae_score)
test_score = model.evaluate(test, [metric])
```

---

## Workflow 7: Protein Sequence Analysis

**Goal**: Predict protein properties from sequences.

### Using ProtBERT
```python
import deepchem as dc
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load protein sequence data
loader = dc.data.FASTALoader()
dataset = loader.create_dataset('proteins.fasta')

# Wrap pretrained ProtBERT for DeepChem
checkpoint = 'Rostlab/prot_bert'
hf_model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = dc.models.HuggingFaceModel(
    model=hf_model,
    tokenizer=tokenizer,
    task='classification'
)

# Split and train (attach labels to the dataset for supervised fine-tuning)
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
model.fit(train, nb_epoch=5)

# Predict
predictions = model.predict(test)
```

---

## Workflow 8: Custom Model Integration

**Goal**: Use your own PyTorch/scikit-learn models with DeepChem.

### Wrapping Scikit-Learn Models
```python
from sklearn.ensemble import RandomForestRegressor
import deepchem as dc

# Create scikit-learn model
sklearn_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42
)

# Wrap in DeepChem
model = dc.models.SklearnModel(model=sklearn_model)

# Use with DeepChem datasets
model.fit(train)
predictions = model.predict(test)

# Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
score = model.evaluate(test, [metric])
```

### Creating Custom PyTorch Models
```python
import torch
import torch.nn as nn
import deepchem as dc

class CustomNetwork(nn.Module):
    def __init__(self, n_features, n_tasks):
        super().__init__()
        self.fc1 = nn.Linear(n_features, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, n_tasks)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        return self.fc3(x)

# Wrap in DeepChem TorchModel; the loss must take (outputs, labels, weights),
# so use DeepChem's L2Loss rather than nn.MSELoss
model = dc.models.TorchModel(
    model=CustomNetwork(n_features=2048, n_tasks=1),
    loss=dc.models.losses.L2Loss(),
    output_types=['prediction']
)

# Train
model.fit(train, nb_epoch=50)
```

---

## Common Pitfalls and Solutions

### Issue 1: Data Leakage in Drug Discovery
**Problem**: Random splitting lets structurally similar molecules land in both the train and test sets, inflating test scores.
**Solution**: Always use `ScaffoldSplitter` for molecular datasets.
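
A minimal sketch of the fix (file and column names are illustrative):

```python
import deepchem as dc

loader = dc.data.CSVLoader(tasks=['activity'], feature_field='smiles',
                           featurizer=dc.feat.CircularFingerprint())
dataset = loader.create_dataset('data.csv')

# Molecules sharing a Bemis-Murcko scaffold stay on the same side of the split
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)
```
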
### Issue 2: Imbalanced Classification
**Problem**: Poor performance on the minority class.
**Solution**: Use `BalancingTransformer` or weighted metrics.
```python
transformer = dc.trans.BalancingTransformer(dataset=train)
train = transformer.transform(train)
```

### Issue 3: Memory Issues with Large Datasets
**Problem**: Dataset doesn't fit in memory.
**Solution**: Use `DiskDataset` instead of `NumpyDataset`.
```python
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)
```

### Issue 4: Overfitting on Small Datasets
**Problem**: Model memorizes training data.
**Solutions** (see the sketch after this list):
1. Use stronger regularization (increase dropout)
2. Use simpler models (Random Forest, Ridge)
3. Apply transfer learning (pretrained models)
4. Collect more data
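
The first two solutions are one-line changes; a sketch with illustrative hyperparameters:

```python
import deepchem as dc
from sklearn.linear_model import Ridge

# 1. Stronger regularization: higher dropout, smaller layers
model = dc.models.MultitaskRegressor(
    n_tasks=1,
    n_features=2048,
    layer_sizes=[256],
    dropouts=0.5
)

# 2. Or a simple, well-regularized baseline
baseline = dc.models.SklearnModel(model=Ridge(alpha=1.0))
```
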
### Issue 5: Poor Graph Neural Network Performance
**Problem**: GNN performs worse than fingerprints.
**Solutions** (see the sketch after this list):
1. Check that the dataset is large enough (GNNs typically need >10K samples)
2. Increase training epochs
3. Try different GNN architectures (AttentiveFP, DMPNN)
4. Use pretrained models (GROVER)
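
As a sketch of solution 3, switching architectures is mostly a featurizer/model change; `AttentiveFPModel` needs edge features (file and column names are illustrative):

```python
import deepchem as dc

# AttentiveFP consumes graphs with edge features
featurizer = dc.feat.MolGraphConvFeaturizer(use_edges=True)
loader = dc.data.CSVLoader(tasks=['activity'], feature_field='smiles',
                           featurizer=featurizer)
dataset = loader.create_dataset('data.csv')

model = dc.models.AttentiveFPModel(n_tasks=1, mode='regression')
model.fit(dataset, nb_epoch=100)
```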