DeepChem API Reference
This document provides a comprehensive reference for DeepChem's core APIs, organized by functionality.
Data Handling
Data Loaders
File Format Loaders
- CSVLoader: Load tabular data from CSV files with customizable feature handling
- UserCSVLoader: User-defined CSV loading with flexible column specifications
- SDFLoader: Process molecular structure files (SDF format)
- JsonLoader: Import JSON-structured datasets
- ImageLoader: Load image data for computer vision tasks
Biological Data Loaders
- FASTALoader: Handle protein/DNA sequences in FASTA format
- FASTQLoader: Process FASTQ sequencing data with quality scores
- SAMLoader/BAMLoader/CRAMLoader: Support sequence alignment formats
Specialized Loaders
- DFTYamlLoader: Process density functional theory computational data
- InMemoryLoader: Load data directly from Python objects
Dataset Classes
- NumpyDataset: Wrap NumPy arrays for in-memory data manipulation
- DiskDataset: Manage larger datasets stored on disk, reducing memory overhead
- ImageDataset: Specialized container for image-based ML tasks
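A minimal sketch of building datasets in memory and on disk (the feature values below are random placeholders):
import numpy as np
import deepchem as dc

X = np.random.rand(4, 1024)  # placeholder feature matrix
y = np.random.rand(4, 1)     # placeholder labels
dataset = dc.data.NumpyDataset(X=X, y=y, ids=['CCO', 'CCN', 'CCC', 'CCCl'])
disk_dataset = dc.data.DiskDataset.from_numpy(X, y, data_dir='./my_dataset')  # shards written to disk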
Data Splitters
General Splitters
- RandomSplitter: Random dataset partitioning
- IndexSplitter: Split by specified indices
- SpecifiedSplitter: Use pre-defined splits
- RandomStratifiedSplitter: Stratified random splitting
- SingletaskStratifiedSplitter: Stratified splitting for single tasks
- TaskSplitter: Split for multitask scenarios
Molecule-Specific Splitters
- ScaffoldSplitter: Divide molecules by structural scaffolds (prevents data leakage)
- ButinaSplitter: Clustering-based molecular splitting
- FingerprintSplitter: Split based on molecular fingerprint similarity
- MaxMinSplitter: Maximize diversity between training/test sets
- MolecularWeightSplitter: Split by molecular weight properties
Best Practice: For drug discovery tasks, use ScaffoldSplitter so that structurally similar molecules cannot leak between training and test sets, which would otherwise inflate performance estimates; see the sketch below.
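For example, a scaffold split with explicit fractions (dataset is any DeepChem Dataset whose ids are SMILES strings):
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1)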
Transformers
Normalization
- NormalizationTransformer: Standard normalization (mean=0, std=1)
- MinMaxTransformer: Scale features to [0,1] range
- LogTransformer: Apply log transformation
- PowerTransformer: Box-Cox and Yeo-Johnson transformations
- CDFTransformer: Cumulative distribution function normalization
Task-Specific
- BalancingTransformer: Address class imbalance
- FeaturizationTransformer: Apply dynamic feature engineering
- CoulombFitTransformer: Quantum chemistry specific
- DAGTransformer: Directed acyclic graph transformations
- RxnSplitTransformer: Chemical reaction preprocessing
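Transformers follow a fit-on-train, apply-everywhere pattern; a minimal sketch:
# Regression: normalize labels using training-set statistics
norm = dc.trans.NormalizationTransformer(transform_y=True, dataset=train)
train, valid, test = [norm.transform(d) for d in (train, valid, test)]
# Classification: reweight samples to counter class imbalance
# balance = dc.trans.BalancingTransformer(dataset=train)
# train = balance.transform(train)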
Molecular Featurizers
Graph-Based Featurizers
Use these with graph neural networks (GCNs, MPNNs, etc.):
- ConvMolFeaturizer: Graph representations for graph convolutional networks
- WeaveFeaturizer: Atom and atom-pair "weave" features for Weave models
- MolGraphConvFeaturizer: Graph convolution-ready representations
- EquivariantGraphFeaturizer: Geometric graph features for equivariant GNNs (respects 3D rotations and translations)
- DMPNNFeaturizer: Directed message-passing neural network inputs
- GroverFeaturizer: Graph features consumed by the GROVER pre-trained model
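A minimal featurization sketch; the result is an array of GraphData objects that graph models such as GCNModel consume directly:
featurizer = dc.feat.MolGraphConvFeaturizer()
graphs = featurizer.featurize(['CCO', 'c1ccccc1'])  # array of GraphData objects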
Fingerprint-Based Featurizers
Use these with traditional ML (Random Forest, SVM, XGBoost):
- MACCSKeysFingerprint: 167-bit structural keys
- CircularFingerprint: Extended connectivity fingerprints (Morgan fingerprints)
  - Parameters: radius (default 2), size (default 2048), useChirality (default False)
- PubChemFingerprint: 881-bit structural descriptors
- Mol2VecFingerprint: Learned molecular vector representations
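For example, Morgan fingerprints as a plain NumPy array, ready for scikit-learn:
featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
fps = featurizer.featurize(['CCO', 'c1ccccc1'])  # shape (2, 2048) bit vectors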
Descriptor Featurizers
Calculate molecular properties directly:
- RDKitDescriptors: ~200 molecular descriptors (MW, LogP, H-donors, H-acceptors, TPSA, etc.)
- MordredDescriptors: Comprehensive structural and physicochemical descriptors
- CoulombMatrix: Interatomic distance matrices for 3D structures
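A quick descriptor sketch:
featurizer = dc.feat.RDKitDescriptors()
desc = featurizer.featurize(['CCO'])  # one row of ~200 descriptor values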
Sequence-Based Featurizers
For recurrent networks and transformers:
- SmilesToSeq: Convert SMILES strings to sequences
- SmilesToImage: Generate 2D image representations from SMILES
- RawFeaturizer: Pass through raw molecular data unchanged
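A hedged SmilesToSeq sketch: the featurizer requires an explicit char_to_idx vocabulary, built here by hand from the inputs (max_len=100 is an arbitrary choice; check your DeepChem version's signature):
smiles = ['CCO', 'c1ccccc1']
char_to_idx = {c: i for i, c in enumerate(sorted(set(''.join(smiles))))}
featurizer = dc.feat.SmilesToSeq(char_to_idx=char_to_idx, max_len=100)
seqs = featurizer.featurize(smiles)  # integer-encoded, padded sequences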
Selection Guide
| Use Case | Recommended Featurizer | Model Type |
|---|---|---|
| Graph neural networks | ConvMolFeaturizer, MolGraphConvFeaturizer | GCN, MPNN, GAT |
| Traditional ML | CircularFingerprint, RDKitDescriptors | Random Forest, XGBoost, SVM |
| Deep learning (non-graph) | CircularFingerprint, Mol2VecFingerprint | Dense networks, CNN |
| Sequence models | SmilesToSeq | LSTM, GRU, Transformer |
| 3D molecular structures | CoulombMatrix | Specialized 3D models |
| Quick baseline | RDKitDescriptors | Linear, Ridge, Lasso |
Models
Scikit-Learn Integration
- SklearnModel: Wrapper for any scikit-learn algorithm
  - Usage: SklearnModel(model=RandomForestRegressor())
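A fuller sketch, assuming train and test are DeepChem Datasets produced by a splitter:
from sklearn.ensemble import RandomForestRegressor
import deepchem as dc

model = dc.models.SklearnModel(model=RandomForestRegressor(n_estimators=100))
model.fit(train)
scores = model.evaluate(test, [dc.metrics.Metric(dc.metrics.r2_score)])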
Gradient Boosting
- GBDTModel: Gradient boosting decision trees (XGBoost, LightGBM)
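GBDTModel wraps an estimator instance in the same spirit as SklearnModel; a hedged sketch assuming xgboost is installed:
import xgboost
model = dc.models.GBDTModel(model=xgboost.XGBRegressor(n_estimators=500))
model.fit(train)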
PyTorch Models
Molecular Property Prediction
- MultitaskRegressor: Multi-task regression with shared representations
- MultitaskClassifier: Multi-task classification
- MultitaskFitTransformRegressor: Regression with learned transformations
- GCNModel: Graph convolutional networks
- GATModel: Graph attention networks
- AttentiveFPModel: Attentive fingerprint networks
- DMPNNModel: Directed message passing neural networks
- GroverModel: GROVER pre-trained transformer
- MATModel: Molecule attention transformer
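A hedged sketch for AttentiveFPModel; attentive-FP style models consume bond features, so the featurizer needs use_edges=True (constructor arguments may vary by DeepChem version):
featurizer = dc.feat.MolGraphConvFeaturizer(use_edges=True)  # bond features required
model = dc.models.AttentiveFPModel(n_tasks=1, mode='regression', batch_size=64)
model.fit(train, nb_epoch=30)  # train featurized as above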
Materials Science
- CGCNNModel: Crystal graph convolutional networks
- MEGNetModel: Materials graph networks
- LCNNModel: Lattice CNN for materials
Generative Models
- GANModel: Generative adversarial networks
- WGANModel: Wasserstein GAN
- BasicMolGANModel: Molecular GAN
- LSTMGenerator: LSTM-based molecule generation
- SeqToSeqModel: Sequence-to-sequence models
Physics-Informed Models
- PINNModel: Physics-informed neural networks
- HNNModel: Hamiltonian neural networks
- LNN: Lagrangian neural networks
- FNOModel: Fourier neural operators
Computer Vision
- CNN: Convolutional neural networks
- UNetModel: U-Net architecture for segmentation
- InceptionV3Model: Pre-trained Inception v3
- MobileNetV2Model: Lightweight mobile networks
Hugging Face Models
- HuggingFaceModel: General wrapper for HF transformers
- Chemberta: Chemical BERT for molecular property prediction
- MoLFormer: Molecular transformer architecture
- ProtBERT: Protein sequence BERT
- DeepAbLLM: Antibody large language models
Model Selection Guide
| Task | Recommended Model | Featurizer |
|---|---|---|
| Small dataset (<1000 samples) | SklearnModel (Random Forest) | CircularFingerprint |
| Medium dataset (1K-100K) | GBDTModel or MultitaskRegressor | CircularFingerprint or ConvMolFeaturizer |
| Large dataset (>100K) | GCNModel, AttentiveFPModel, or DMPNNModel | MolGraphConvFeaturizer |
| Transfer learning | GroverModel, Chemberta, MoLFormer | Model-specific |
| Materials properties | CGCNNModel, MEGNetModel | Structure-based |
| Molecule generation | BasicMolGANModel, LSTMGenerator | SmilesToSeq |
| Protein sequences | ProtBERT | Sequence-based |
MoleculeNet Datasets
Quick access to 30+ benchmark datasets via dc.molnet.load_*() functions.
Classification Datasets
- load_bace(): BACE-1 inhibitors (binary classification)
- load_bbbp(): Blood-brain barrier penetration
- load_clintox(): Clinical toxicity
- load_hiv(): HIV inhibition activity
- load_muv(): PubChem BioAssay (challenging, sparse)
- load_pcba(): PubChem screening data
- load_sider(): Adverse drug reactions (multi-label)
- load_tox21(): 12 toxicity assays (multi-task)
- load_toxcast(): EPA ToxCast screening
Regression Datasets
- load_delaney(): Aqueous solubility (ESOL)
- load_freesolv(): Solvation free energy
- load_lipo(): Lipophilicity (octanol-water partition)
- load_qm7/qm8/qm9(): Quantum mechanical properties
- load_hopv(): Organic photovoltaic properties
Protein-Ligand Binding
- load_pdbbind(): Binding affinity data
Materials Science
- load_perovskite(): Perovskite stability
- load_mp_formation_energy(): Materials Project formation energy
- load_mp_metallicity(): Metal vs. non-metal classification
- load_bandgap(): Electronic bandgap prediction
Chemical Reactions
- load_uspto(): USPTO reaction dataset
Usage Pattern
tasks, datasets, transformers = dc.molnet.load_bbbp(
    featurizer='GraphConv',  # or 'ECFP', 'Weave', 'Raw', etc.
    splitter='scaffold',     # or 'random', 'stratified', etc.
    reload=True              # reuse cached featurized data; set False to re-featurize
)
train, valid, test = datasets
Metrics
Common evaluation metrics available in dc.metrics:
Classification Metrics
- roc_auc_score: Area under ROC curve (binary/multi-class)
- prc_auc_score: Area under precision-recall curve
- accuracy_score: Classification accuracy
- balanced_accuracy_score: Balanced accuracy for imbalanced datasets
- recall_score: Sensitivity/recall
- precision_score: Precision
- f1_score: F1 score
Regression Metrics
- mean_absolute_error: MAE
- mean_squared_error: MSE
- root_mean_squared_error: RMSE
- r2_score: R² coefficient of determination
- pearson_r2_score: Squared Pearson correlation coefficient
- spearman_correlation: Spearman rank correlation
Multi-Task Metrics
Most metrics support multi-task evaluation by averaging over tasks.
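For example, averaging ROC AUC across tasks with np.mean as the task averager (model, test, and transformers as in the Training Pattern below):
import numpy as np
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
scores = model.evaluate(test, [metric], transformers)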
Training Pattern
Standard DeepChem workflow:
import deepchem as dc

# 1. Load data
loader = dc.data.CSVLoader(tasks=['task1'], feature_field='smiles',
                           featurizer=dc.feat.CircularFingerprint())
dataset = loader.create_dataset('data.csv')
# 2. Split data
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)
# 3. Transform data (optional)
transformers = [dc.trans.NormalizationTransformer(transform_y=True, dataset=train)]
for transformer in transformers:
train = transformer.transform(train)
valid = transformer.transform(valid)
test = transformer.transform(test)
# 4. Create and train model
model = dc.models.MultitaskRegressor(n_tasks=1, n_features=2048, layer_sizes=[1000])
model.fit(train, nb_epoch=50)
# 5. Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
train_score = model.evaluate(train, [metric], transformers)  # transformers untransform y for scoring
test_score = model.evaluate(test, [metric], transformers)
Common Patterns
Pattern 1: Quick Baseline with MoleculeNet
import deepchem as dc

tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP')
train, valid, test = datasets
model = dc.models.MultitaskClassifier(n_tasks=len(tasks), n_features=1024)
model.fit(train)
Pattern 2: Custom Data with Graph Networks
import deepchem as dc

featurizer = dc.feat.MolGraphConvFeaturizer()
loader = dc.data.CSVLoader(tasks=['activity'], feature_field='smiles',
                           featurizer=featurizer)
dataset = loader.create_dataset('my_data.csv')
train, test = dc.splits.RandomSplitter().train_test_split(dataset)
model = dc.models.GCNModel(mode='classification', n_tasks=1)
model.fit(train)
Pattern 3: Transfer Learning with Pretrained Models
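# Assumes train_dataset/test_dataset were featurized with dc.feat.GroverFeaturizer,
# which produces the graph inputs GROVER expects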
model = dc.models.GroverModel(task='classification', n_tasks=1)
model.fit(train_dataset)
predictions = model.predict(test_dataset)