# DeepChem Workflows
This document provides detailed workflows for common DeepChem use cases.
## Workflow 1: Molecular Property Prediction from SMILES
**Goal**: Predict molecular properties (e.g., solubility, toxicity, activity) from SMILES strings.
### Step-by-Step Process
#### 1. Prepare Your Data
Data should be in CSV format with at minimum:
- A column with SMILES strings
- One or more columns with property values (targets)
Example CSV structure:
```csv
smiles,solubility,toxicity
CCO,-0.77,0
CC(=O)OC1=CC=CC=C1C(=O)O,-1.19,1
```
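Before handing a file like this to DeepChem, it can be worth a quick sanity check that every row has a SMILES string and numeric targets. A minimal stdlib sketch, using the example rows above (the column names follow the example; adapt them to your file):

```python
import csv
import io

# Example CSV content matching the structure above
raw = """smiles,solubility,toxicity
CCO,-0.77,0
CC(=O)OC1=CC=CC=C1C(=O)O,-1.19,1
"""

rows = list(csv.DictReader(io.StringIO(raw)))
for row in rows:
    assert row["smiles"], "missing SMILES"
    float(row["solubility"])  # raises ValueError if not numeric
    int(row["toxicity"])

print(len(rows))  # prints 2 (number of data rows)
```

In practice you would open your CSV file instead of the inline string; the check is the same.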
#### 2. Choose Featurizer
Decision tree:
- **Small dataset (<1K)**: Use `CircularFingerprint` or `RDKitDescriptors`
- **Medium dataset (1K-100K)**: Use `CircularFingerprint` or `MolGraphConvFeaturizer`
- **Large dataset (>100K)**: Use graph-based featurizers (`MolGraphConvFeaturizer`, `DMPNNFeaturizer`)
- **Transfer learning**: Use pretrained model featurizers (`GroverFeaturizer`)
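The decision tree above can be captured in a small helper. This is a hypothetical convenience function (not part of DeepChem); it returns class names from `dc.feat`, and the thresholds are the heuristics stated above, not hard limits:

```python
def pick_featurizer(n_samples: int, transfer_learning: bool = False) -> str:
    """Suggest a DeepChem featurizer class name for a given dataset size."""
    if transfer_learning:
        return "GroverFeaturizer"
    if n_samples < 1_000:
        return "CircularFingerprint"   # RDKitDescriptors also works here
    if n_samples <= 100_000:
        return "MolGraphConvFeaturizer"  # CircularFingerprint is also viable
    return "DMPNNFeaturizer"

print(pick_featurizer(500))     # prints CircularFingerprint
print(pick_featurizer(50_000))  # prints MolGraphConvFeaturizer
```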
#### 3. Load and Featurize Data
```python
import deepchem as dc
# For fingerprint-based
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
# OR for graph-based
featurizer = dc.feat.MolGraphConvFeaturizer()
loader = dc.data.CSVLoader(
    tasks=['solubility', 'toxicity'],  # column names to predict
    feature_field='smiles',            # column with SMILES
    featurizer=featurizer
)
dataset = loader.create_dataset('data.csv')
```
#### 4. Split Data
**Critical**: Use `ScaffoldSplitter` for drug discovery to prevent data leakage.
```python
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset,
    frac_train=0.8,
    frac_valid=0.1,
    frac_test=0.1
)
```
#### 5. Transform Data (Optional but Recommended)
```python
transformers = [
    dc.trans.NormalizationTransformer(
        transform_y=True,
        dataset=train
    )
]
for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)
```
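Under the hood, `NormalizationTransformer` with `transform_y=True` shifts and scales the targets using statistics computed from the dataset it was constructed with (here, `train`), then applies that same transform to every split. A plain-NumPy sketch of the idea, on toy target values:

```python
import numpy as np

# Toy targets standing in for the y-values of the train and valid splits
y_train = np.array([[-0.77], [-1.19], [0.30], [-2.05]])
y_valid = np.array([[0.10], [-1.50]])

# Statistics come from the training split only...
mu, sigma = y_train.mean(axis=0), y_train.std(axis=0)

# ...and the identical transform is applied to every split
y_train_t = (y_train - mu) / sigma
y_valid_t = (y_valid - mu) / sigma

print(np.allclose(y_train_t.mean(axis=0), 0.0))  # prints True
```

Computing `mu` and `sigma` from the full dataset instead of `train` would leak test statistics into training, which is why the transformer is built from `train`.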
#### 6. Select and Train Model
```python
# For fingerprints
model = dc.models.MultitaskRegressor(
    n_tasks=2,                # number of properties to predict
    n_features=2048,          # fingerprint size
    layer_sizes=[1000, 500],  # hidden layer sizes
    dropouts=0.25,
    learning_rate=0.001
)
# OR for graphs
model = dc.models.GCNModel(
    n_tasks=2,
    mode='regression',
    batch_size=128,
    learning_rate=0.001
)
# Train
model.fit(train, nb_epoch=50)
```
#### 7. Evaluate
```python
metric = dc.metrics.Metric(dc.metrics.r2_score)
train_score = model.evaluate(train, [metric])
valid_score = model.evaluate(valid, [metric])
test_score = model.evaluate(test, [metric])
print(f"Train R²: {train_score}")
print(f"Valid R²: {valid_score}")
print(f"Test R²: {test_score}")
```
#### 8. Make Predictions
```python
# Predict on new molecules (featurize with the SAME settings used in training)
new_smiles = ['CCO', 'CC(C)O', 'c1ccccc1']
new_featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
new_features = new_featurizer.featurize(new_smiles)
new_dataset = dc.data.NumpyDataset(X=new_features)
# Apply same transformations
for transformer in transformers:
    new_dataset = transformer.transform(new_dataset)
predictions = model.predict(new_dataset)
```
---
## Workflow 2: Using MoleculeNet Benchmark Datasets
**Goal**: Quickly train and evaluate models on standard benchmarks.
### Quick Start
```python
import deepchem as dc
# Load benchmark dataset
tasks, datasets, transformers = dc.molnet.load_tox21(
    featurizer='GraphConv',
    splitter='scaffold'
)
train, valid, test = datasets
# Train model
model = dc.models.GCNModel(
    n_tasks=len(tasks),
    mode='classification'
)
model.fit(train, nb_epoch=50)
# Evaluate
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
test_score = model.evaluate(test, [metric])
print(f"Test ROC-AUC: {test_score}")
```
### Available Featurizer Options
When calling `load_*()` functions:
- `'ECFP'`: Extended-connectivity fingerprints (circular fingerprints)
- `'GraphConv'`: Graph convolution features
- `'Weave'`: Weave features
- `'Raw'`: Raw SMILES strings
- `'smiles2img'`: 2D molecular images
### Available Splitter Options
- `'scaffold'`: Scaffold-based splitting (recommended for drug discovery)
- `'random'`: Random splitting
- `'stratified'`: Stratified splitting (preserves class distributions)
- `'butina'`: Butina clustering-based splitting
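To see why `'stratified'` matters for imbalanced labels, here is a from-scratch sketch of a stratified split (illustrative only; the real implementation lives in `dc.splits`). Indices are grouped by class and each group is split with the same ratio, so both splits preserve the class distribution:

```python
from collections import defaultdict

labels = [0, 0, 0, 0, 0, 0, 1, 1]  # imbalanced toy labels (3:1)

# Group sample indices by class
by_class = defaultdict(list)
for idx, y in enumerate(labels):
    by_class[y].append(idx)

# Split each class group with the same training fraction
frac_train = 0.5
train_idx, test_idx = [], []
for y, idxs in by_class.items():
    cut = int(len(idxs) * frac_train)
    train_idx += idxs[:cut]
    test_idx += idxs[cut:]

# Both splits keep the original 3:1 class ratio
print(sorted(train_idx), sorted(test_idx))  # prints [0, 1, 2, 6] [3, 4, 5, 7]
```

A plain random split on a dataset this imbalanced could easily put every minority-class sample into one side.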
---
## Workflow 3: Hyperparameter Optimization
**Goal**: Find optimal model hyperparameters systematically.
### Using GridHyperparamOpt
```python
import deepchem as dc
import numpy as np
# Load data
tasks, datasets, transformers = dc.molnet.load_bbbp(
    featurizer='ECFP',
    splitter='scaffold'
)
train, valid, test = datasets
# Define parameter grid
params_dict = {
    'layer_sizes': [[1000], [1000, 500], [1000, 1000]],
    'dropouts': [0.0, 0.25, 0.5],
    'learning_rate': [0.001, 0.0001]
}
# Define model builder function (signature expected by GridHyperparamOpt)
def model_builder(model_params, model_dir):
    return dc.models.MultitaskClassifier(
        n_tasks=len(tasks),
        n_features=1024,  # default ECFP size in the MoleculeNet loaders
        **model_params
    )
# Setup optimizer
metric = dc.metrics.Metric(dc.metrics.roc_auc_score)
optimizer = dc.hyper.GridHyperparamOpt(model_builder)
# Run optimization
best_model, best_params, all_results = optimizer.hyperparam_search(
    params_dict,
    train,
    valid,
    metric,
    transformers=transformers
)
print(f"Best parameters: {best_params}")
print(f"All validation scores: {all_results}")  # dict of params -> score
```
---
## Workflow 4: Transfer Learning with Pretrained Models
**Goal**: Leverage pretrained models for improved performance on small datasets.
### Using ChemBERTa
```python
import deepchem as dc
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load your data
loader = dc.data.CSVLoader(
    tasks=['activity'],
    feature_field='smiles',
    featurizer=dc.feat.DummyFeaturizer()  # ChemBERTa tokenizes raw SMILES itself
)
dataset = loader.create_dataset('data.csv')
# Split data
splitter = dc.splits.ScaffoldSplitter()
train, test = splitter.train_test_split(dataset)
# Load pretrained ChemBERTa; HuggingFaceModel wraps a transformers
# model instance together with its tokenizer
tokenizer = AutoTokenizer.from_pretrained('seyonec/ChemBERTa-zinc-base-v1')
hf_model = AutoModelForSequenceClassification.from_pretrained(
    'seyonec/ChemBERTa-zinc-base-v1', num_labels=1
)
model = dc.models.HuggingFaceModel(
    model=hf_model,
    tokenizer=tokenizer,
    task='regression'
)
# Fine-tune
model.fit(train, nb_epoch=10)
# Evaluate
predictions = model.predict(test)
```
### Using GROVER
```python
# GROVER: pre-trained on molecular graphs
model = dc.models.GroverModel(
    task='classification',
    n_tasks=1,
    model_dir='./grover_model'
)
# Fine-tune on your data
model.fit(train_dataset, nb_epoch=20)
```
---
## Workflow 5: Molecular Generation with GANs
**Goal**: Generate novel molecules with desired properties.
### Basic MolGAN
```python
import deepchem as dc
# Load training data (molecules for the generator to learn from)
tasks, datasets, _ = dc.molnet.load_qm9(
    featurizer='GraphConv',
    splitter='random'
)
train, _, _ = datasets
# Create and train MolGAN
gan = dc.models.BasicMolGANModel(
    learning_rate=0.001,
    vertices=9,  # max atoms per molecule
    edges=5      # number of bond types
)
# Train
gan.fit_gan(
    train,
    nb_epoch=100,
    generator_steps=0.2,
    checkpoint_interval=10
)
# Generate new molecules
generated_molecules = gan.predict_gan_generator(1000)
```
### Conditional Generation
```python
# For property-targeted generation
from deepchem.models.optimizers import ExponentialDecay
gan = dc.models.BasicMolGANModel(
learning_rate=ExponentialDecay(0.001, 0.9, 1000),
conditional=True # enable conditional generation
)
# Train with properties
gan.fit_gan(train, nb_epoch=100)
# Generate molecules with target properties
target_properties = np.array([[5.0, 300.0]]) # e.g., [logP, MW]
molecules = gan.predict_gan_generator(
1000,
conditional_inputs=target_properties
)
```
---
## Workflow 6: Materials Property Prediction
**Goal**: Predict properties of crystalline materials.
### Using Crystal Graph Convolutional Networks
```python
import deepchem as dc
# Load materials data (structure files in CIF format)
loader = dc.data.CIFLoader()
dataset = loader.create_dataset('materials.csv')
# Split data
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
# Create CGCNN model
model = dc.models.CGCNNModel(
    n_tasks=1,
    mode='regression',
    batch_size=32,
    learning_rate=0.001
)
# Train
model.fit(train, nb_epoch=100)
# Evaluate
metric = dc.metrics.Metric(dc.metrics.mae_score)
test_score = model.evaluate(test, [metric])
```
---
## Workflow 7: Protein Sequence Analysis
**Goal**: Predict protein properties from sequences.
### Using ProtBERT
```python
import deepchem as dc
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Load protein sequence data
loader = dc.data.FASTALoader()
dataset = loader.create_dataset('proteins.fasta')
# Use ProtBERT via the HuggingFace wrapper (model instance + tokenizer)
tokenizer = AutoTokenizer.from_pretrained('Rostlab/prot_bert')
hf_model = AutoModelForSequenceClassification.from_pretrained('Rostlab/prot_bert')
model = dc.models.HuggingFaceModel(
    model=hf_model,
    tokenizer=tokenizer,
    task='classification'
)
# Split and train
splitter = dc.splits.RandomSplitter()
train, test = splitter.train_test_split(dataset)
model.fit(train, nb_epoch=5)
# Predict
predictions = model.predict(test)
```
---
## Workflow 8: Custom Model Integration
**Goal**: Use your own PyTorch/scikit-learn models with DeepChem.
### Wrapping Scikit-Learn Models
```python
from sklearn.ensemble import RandomForestRegressor
import deepchem as dc
# Create scikit-learn model
sklearn_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42
)
# Wrap in DeepChem
model = dc.models.SklearnModel(model=sklearn_model)
# Use with DeepChem datasets
model.fit(train)
predictions = model.predict(test)
# Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
score = model.evaluate(test, [metric])
```
### Creating Custom PyTorch Models
```python
import torch
import torch.nn as nn
import deepchem as dc
class CustomNetwork(nn.Module):
    def __init__(self, n_features, n_tasks):
        super().__init__()
        self.fc1 = nn.Linear(n_features, 512)
        self.fc2 = nn.Linear(512, 256)
        self.fc3 = nn.Linear(256, n_tasks)
        self.relu = nn.ReLU()
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.dropout(x)
        x = self.relu(self.fc2(x))
        x = self.dropout(x)
        return self.fc3(x)

# Wrap in DeepChem TorchModel
model = dc.models.TorchModel(
    model=CustomNetwork(n_features=2048, n_tasks=1),
    loss=dc.models.losses.L2Loss(),  # DeepChem Loss handles (outputs, labels, weights)
    output_types=['prediction']
)
# Train
model.fit(train, nb_epoch=50)
```
---
## Common Pitfalls and Solutions
### Issue 1: Data Leakage in Drug Discovery
**Problem**: Using random splitting allows similar molecules in train and test sets.
**Solution**: Always use `ScaffoldSplitter` for molecular datasets.
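The idea behind scaffold splitting is that every molecule sharing a core scaffold lands in the same split, so the test set contains genuinely unseen chemotypes. A simplified sketch with precomputed scaffold keys (in practice the scaffold is derived from the structure itself, e.g. a Bemis-Murcko framework via RDKit; the molecule IDs and keys here are made up for illustration):

```python
from collections import defaultdict

# Hypothetical (molecule_id, scaffold_key) pairs
molecules = [
    ("m1", "benzene"), ("m2", "benzene"), ("m3", "benzene"),
    ("m4", "pyridine"), ("m5", "pyridine"),
    ("m6", "indole"),
]

# Group molecules by scaffold
groups = defaultdict(list)
for mol_id, scaffold in molecules:
    groups[scaffold].append(mol_id)

# Assign whole scaffold groups (largest first) to train until a budget is hit;
# a group is never split between train and test
ordered = sorted(groups.values(), key=len, reverse=True)
train, test = [], []
for group in ordered:
    (train if len(train) + len(group) <= 4 else test).extend(group)

print(train, test)  # prints ['m1', 'm2', 'm3', 'm6'] ['m4', 'm5']
```

With a random split, `m4` could end up in train and the near-identical `m5` in test, inflating the apparent test score.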
### Issue 2: Imbalanced Classification
**Problem**: Poor performance on minority class.
**Solution**: Use `BalancingTransformer` or weighted metrics.
```python
transformer = dc.trans.BalancingTransformer(dataset=train)
train = transformer.transform(train)
```
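`BalancingTransformer` works by reweighting samples so that each class contributes equally to the loss, with weights proportional to inverse class frequency. A NumPy sketch of that idea on toy labels:

```python
import numpy as np

y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])  # 8:2 imbalance

classes, counts = np.unique(y, return_counts=True)
# Inverse-frequency class weights, normalized so the mean sample weight is 1
class_weights = len(y) / (len(classes) * counts)
w = class_weights[np.searchsorted(classes, y)]

# Each class now carries equal total weight in the loss
print(w[y == 0].sum(), w[y == 1].sum())  # prints 5.0 5.0
```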
### Issue 3: Memory Issues with Large Datasets
**Problem**: Dataset doesn't fit in memory.
**Solution**: Use `DiskDataset` instead of `NumpyDataset`.
```python
dataset = dc.data.DiskDataset.from_numpy(X, y, w, ids)
```
### Issue 4: Overfitting on Small Datasets
**Problem**: Model memorizes training data.
**Solutions**:
1. Use stronger regularization (increase dropout)
2. Use simpler models (Random Forest, Ridge)
3. Apply transfer learning (pretrained models)
4. Collect more data
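For point 2, a regularized linear model such as ridge regression is a strong, hard-to-overfit baseline on fingerprint features (in practice via `sklearn.linear_model.Ridge` wrapped in `dc.models.SklearnModel`). A closed-form NumPy sketch on toy data, showing how the regularization strength caps model complexity:

```python
import numpy as np

# Toy regression data: only the first feature carries signal
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X[:, 0] * 2.0 + rng.standard_normal(100) * 0.1

# Closed-form ridge solution: w = (X^T X + alpha I)^-1 X^T y;
# larger alpha shrinks the weights, i.e. a simpler model
alpha = 10.0
w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

print(int(np.argmax(np.abs(w))))  # prints 0: the informative feature dominates
```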
### Issue 5: Poor Graph Neural Network Performance
**Problem**: GNN performs worse than fingerprints.
**Solutions**:
1. Check if dataset is large enough (GNNs need >10K samples typically)
2. Increase training epochs
3. Try different GNN architectures (AttentiveFP, DMPNN)
4. Use pretrained models (GROVER)