DeepChem API Reference

This document provides a comprehensive reference for DeepChem's core APIs, organized by functionality.

Data Handling

Data Loaders

File Format Loaders

  • CSVLoader: Load tabular data from CSV files with customizable feature handling
  • UserCSVLoader: User-defined CSV loading with flexible column specifications
  • SDFLoader: Process molecular structure files (SDF format)
  • JsonLoader: Import JSON-structured datasets
  • ImageLoader: Load image data for computer vision tasks

Biological Data Loaders

  • FASTALoader: Handle protein/DNA sequences in FASTA format
  • FASTQLoader: Process FASTQ sequencing data with quality scores
  • SAMLoader/BAMLoader/CRAMLoader: Support sequence alignment formats

Specialized Loaders

  • DFTYamlLoader: Process density functional theory computational data
  • InMemoryLoader: Load data directly from Python objects

Dataset Classes

  • NumpyDataset: Wrap NumPy arrays for in-memory data manipulation
  • DiskDataset: Manage larger datasets stored on disk, reducing memory overhead
  • ImageDataset: Specialized container for image-based ML tasks
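
A minimal sketch of wrapping in-memory arrays (the random X and y below are placeholders):

import numpy as np
import deepchem as dc

# Placeholder data: 100 samples, 2048 features, 1 task
X = np.random.rand(100, 2048)
y = np.random.rand(100, 1)
dataset = dc.data.NumpyDataset(X=X, y=y)
print(dataset.X.shape, dataset.y.shape)  # (100, 2048) (100, 1)

For data too large for memory, DiskDataset.from_numpy(X, y) writes the same arrays into a DiskDataset.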

Data Splitters

General Splitters

  • RandomSplitter: Random dataset partitioning
  • IndexSplitter: Split by specified indices
  • SpecifiedSplitter: Use pre-defined splits
  • RandomStratifiedSplitter: Stratified random splitting
  • SingletaskStratifiedSplitter: Stratified splitting for single tasks
  • TaskSplitter: Split for multitask scenarios

Molecule-Specific Splitters

  • ScaffoldSplitter: Divide molecules by structural scaffolds (prevents data leakage)
  • ButinaSplitter: Clustering-based molecular splitting
  • FingerprintSplitter: Split based on molecular fingerprint similarity
  • MaxMinSplitter: Maximize diversity between training/test sets
  • MolecularWeightSplitter: Split by molecular weight properties

Best Practice: For drug discovery tasks, use ScaffoldSplitter so that structurally related molecules cannot appear in both training and test sets; random splits leak scaffold information and inflate evaluation scores.
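
A minimal split sketch (dataset is assumed to be any featurized DeepChem dataset, e.g. one produced by a loader above):

import deepchem as dc

splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(
    dataset, frac_train=0.8, frac_valid=0.1, frac_test=0.1)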

Transformers

Normalization

  • NormalizationTransformer: Standard normalization (mean=0, std=1)
  • MinMaxTransformer: Scale features to [0,1] range
  • LogTransformer: Apply log transformation
  • PowerTransformer: Apply power transformations (raise values to specified powers)
  • CDFTransformer: Cumulative distribution function normalization

Task-Specific

  • BalancingTransformer: Address class imbalance
  • FeaturizationTransformer: Apply dynamic feature engineering
  • CoulombFitTransformer: Quantum chemistry specific
  • DAGTransformer: Directed acyclic graph transformations
  • RxnSplitTransformer: Chemical reaction preprocessing
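
A sketch of re-weighting an imbalanced classification dataset (assumes train is an existing classification dataset; the dataset= keyword matches recent DeepChem versions):

import deepchem as dc

# Re-weight samples so each class contributes equally to the loss
balancer = dc.trans.BalancingTransformer(dataset=train)
train = balancer.transform(train)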

Molecular Featurizers

Graph-Based Featurizers

Use these with graph neural networks (GCNs, MPNNs, etc.):

  • ConvMolFeaturizer: Graph representations for graph convolutional networks
  • WeaveFeaturizer: "Weave" graph embeddings
  • MolGraphConvFeaturizer: Graph convolution-ready representations
  • EquivariantGraphFeaturizer: Geometric (3D-aware) graph features for equivariant networks
  • DMPNNFeaturizer: Directed message-passing neural network inputs
  • GroverFeaturizer: Pre-trained molecular embeddings

Fingerprint-Based Featurizers

Use these with traditional ML (Random Forest, SVM, XGBoost):

  • MACCSKeysFingerprint: 167-bit structural keys
  • CircularFingerprint: Extended connectivity fingerprints (Morgan fingerprints)
    • Parameters: radius (default 2), size (default 2048), useChirality (default False)
  • PubChemFingerprint: 881-bit structural descriptors
  • Mol2VecFingerprint: Learned molecular vector representations
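
A minimal featurization sketch with CircularFingerprint:

import deepchem as dc

featurizer = dc.feat.CircularFingerprint(radius=2, size=2048)
X = featurizer.featurize(['CCO', 'c1ccccc1'])  # ethanol, benzene
print(X.shape)  # (2, 2048)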

Descriptor Featurizers

Calculate molecular properties directly:

  • RDKitDescriptors: ~200 molecular descriptors (MW, LogP, H-donors, H-acceptors, TPSA, etc.)
  • MordredDescriptors: Comprehensive structural and physicochemical descriptors
  • CoulombMatrix: Interatomic distance matrices for 3D structures
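
Descriptor featurizers share the same featurize() interface:

import deepchem as dc

featurizer = dc.feat.RDKitDescriptors()
X = featurizer.featurize(['CCO'])  # one row of ~200 descriptor values
print(X.shape)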

Sequence-Based Featurizers

For recurrent networks and transformers:

  • SmilesToSeq: Convert SMILES strings to sequences
  • SmilesToImage: Generate 2D image representations from SMILES
  • RawFeaturizer: Pass through raw molecular data unchanged

Selection Guide

| Use Case                   | Recommended Featurizer                    | Model Type                  |
|----------------------------|-------------------------------------------|-----------------------------|
| Graph neural networks      | ConvMolFeaturizer, MolGraphConvFeaturizer | GCN, MPNN, GAT              |
| Traditional ML             | CircularFingerprint, RDKitDescriptors     | Random Forest, XGBoost, SVM |
| Deep learning (non-graph)  | CircularFingerprint, Mol2VecFingerprint   | Dense networks, CNN         |
| Sequence models            | SmilesToSeq                               | LSTM, GRU, Transformer      |
| 3D molecular structures    | CoulombMatrix                             | Specialized 3D models       |
| Quick baseline             | RDKitDescriptors                          | Linear, Ridge, Lasso        |

Models

Scikit-Learn Integration

  • SklearnModel: Wrapper for any scikit-learn algorithm
    • Usage: SklearnModel(model=RandomForestRegressor())
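
A slightly fuller sketch (assumes featurized train and test datasets, as produced in the patterns below):

import deepchem as dc
from sklearn.ensemble import RandomForestRegressor

model = dc.models.SklearnModel(model=RandomForestRegressor(n_estimators=100))
model.fit(train)              # train is a DeepChem dataset
y_pred = model.predict(test)  # returns a NumPy array of predictions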

Gradient Boosting

  • GBDTModel: Gradient boosting decision trees (XGBoost, LightGBM)

PyTorch Models

Molecular Property Prediction

  • MultitaskRegressor: Multi-task regression with shared representations
  • MultitaskClassifier: Multi-task classification
  • MultitaskFitTransformRegressor: Regression with learned transformations
  • GCNModel: Graph convolutional networks
  • GATModel: Graph attention networks
  • AttentiveFPModel: Attentive fingerprint networks
  • DMPNNModel: Directed message passing neural networks
  • GroverModel: GROVER pre-trained transformer
  • MATModel: Molecule attention transformer

Materials Science

  • CGCNNModel: Crystal graph convolutional networks
  • MEGNetModel: Materials graph networks
  • LCNNModel: Lattice CNN for materials

Generative Models

  • GANModel: Generative adversarial networks
  • WGANModel: Wasserstein GAN
  • BasicMolGANModel: Molecular GAN
  • LSTMGenerator: LSTM-based molecule generation
  • SeqToSeqModel: Sequence-to-sequence models

Physics-Informed Models

  • PINNModel: Physics-informed neural networks
  • HNNModel: Hamiltonian neural networks
  • LNN: Lagrangian neural networks
  • FNOModel: Fourier neural operators

Computer Vision

  • CNN: Convolutional neural networks
  • UNetModel: U-Net architecture for segmentation
  • InceptionV3Model: Pre-trained Inception v3
  • MobileNetV2Model: Lightweight mobile networks

Hugging Face Models

  • HuggingFaceModel: General wrapper for HF transformers
  • Chemberta: Chemical BERT for molecular property prediction
  • MoLFormer: Molecular transformer architecture
  • ProtBERT: Protein sequence BERT
  • DeepAbLLM: Antibody large language models
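
A hedged sketch of the generic wrapper; the constructor arguments (model, tokenizer, task) and the import path follow recent DeepChem versions and should be verified against your installed release:

from transformers import AutoModelForSequenceClassification, AutoTokenizer
from deepchem.models.torch_models import HuggingFaceModel

# Assumed checkpoint and parameter names -- verify before use
name = 'seyonec/ChemBERTa-zinc-base-v1'
hf_model = AutoModelForSequenceClassification.from_pretrained(name)
tokenizer = AutoTokenizer.from_pretrained(name)
model = HuggingFaceModel(model=hf_model, tokenizer=tokenizer, task='classification')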

Model Selection Guide

| Task                          | Recommended Model                         | Featurizer                               |
|-------------------------------|-------------------------------------------|------------------------------------------|
| Small dataset (<1000 samples) | SklearnModel (Random Forest)              | CircularFingerprint                      |
| Medium dataset (1K-100K)      | GBDTModel or MultitaskRegressor           | CircularFingerprint or ConvMolFeaturizer |
| Large dataset (>100K)         | GCNModel, AttentiveFPModel, or DMPNNModel | MolGraphConvFeaturizer                   |
| Transfer learning             | GroverModel, Chemberta, MoLFormer         | Model-specific                           |
| Materials properties          | CGCNNModel, MEGNetModel                   | Structure-based                          |
| Molecule generation           | BasicMolGANModel, LSTMGenerator           | SmilesToSeq                              |
| Protein sequences             | ProtBERT                                  | Sequence-based                           |

MoleculeNet Datasets

Quick access to 30+ benchmark datasets via dc.molnet.load_*() functions.

Classification Datasets

  • load_bace(): BACE-1 inhibitors (binary classification)
  • load_bbbp(): Blood-brain barrier penetration
  • load_clintox(): Clinical toxicity
  • load_hiv(): HIV inhibition activity
  • load_muv(): PubChem BioAssay (challenging, sparse)
  • load_pcba(): PubChem screening data
  • load_sider(): Adverse drug reactions (multi-label)
  • load_tox21(): 12 toxicity assays (multi-task)
  • load_toxcast(): EPA ToxCast screening

Regression Datasets

  • load_delaney(): Aqueous solubility (ESOL)
  • load_freesolv(): Solvation free energy
  • load_lipo(): Lipophilicity (octanol-water partition)
  • load_qm7/qm8/qm9(): Quantum mechanical properties
  • load_hopv(): Organic photovoltaic properties

Protein-Ligand Binding

  • load_pdbbind(): Binding affinity data

Materials Science

  • load_perovskite(): Perovskite stability
  • load_mp_formation_energy(): Materials Project formation energy
  • load_mp_metallicity(): Metal vs. non-metal classification
  • load_bandgap(): Electronic bandgap prediction

Chemical Reactions

  • load_uspto(): USPTO reaction dataset

Usage Pattern

tasks, datasets, transformers = dc.molnet.load_bbbp(
    featurizer='GraphConv',   # or 'ECFP', 'Weave', 'Raw', etc.
    splitter='scaffold',      # or 'random', 'stratified', etc.
    reload=True               # reuse a cached featurized copy on repeat calls (default)
)
train, valid, test = datasets
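
The returned transformers record what was applied to the data, so predictions can be mapped back to the original scale (assumes a trained model, as in the training pattern below):

import deepchem as dc

y_pred = model.predict(test)
y_pred_orig = dc.trans.undo_transforms(y_pred, transformers)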

Metrics

Common evaluation metrics available in dc.metrics:

Classification Metrics

  • roc_auc_score: Area under ROC curve (binary/multi-class)
  • prc_auc_score: Area under precision-recall curve
  • accuracy_score: Classification accuracy
  • balanced_accuracy_score: Balanced accuracy for imbalanced datasets
  • recall_score: Sensitivity/recall
  • precision_score: Precision
  • f1_score: F1 score

Regression Metrics

  • mean_absolute_error: MAE
  • mean_squared_error: MSE
  • root_mean_squared_error: RMSE
  • r2_score: R² coefficient of determination
  • pearson_r2_score: Pearson correlation
  • spearman_correlation: Spearman rank correlation

Multi-Task Metrics

Most metrics support multi-task evaluation by averaging over tasks.
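
For example, wrapping roc_auc_score with a per-task mean (np.mean is the task averager; assumes a trained model and test set, as in the pattern below):

import numpy as np
import deepchem as dc

metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
scores = model.evaluate(test, [metric])  # e.g. {'mean-roc_auc_score': ...}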

Training Pattern

Standard DeepChem workflow:

import deepchem as dc

# 1. Load data
loader = dc.data.CSVLoader(tasks=['task1'], feature_field='smiles',
                           featurizer=dc.feat.CircularFingerprint())
dataset = loader.create_dataset('data.csv')

# 2. Split data
splitter = dc.splits.ScaffoldSplitter()
train, valid, test = splitter.train_valid_test_split(dataset)

# 3. Transform data (optional)
transformers = [dc.trans.NormalizationTransformer(transform_y=True, dataset=train)]
for transformer in transformers:
    train = transformer.transform(train)
    valid = transformer.transform(valid)
    test = transformer.transform(test)

# 4. Create and train model
model = dc.models.MultitaskRegressor(n_tasks=1, n_features=2048, layer_sizes=[1000])
model.fit(train, nb_epoch=50)

# 5. Evaluate
metric = dc.metrics.Metric(dc.metrics.r2_score)
train_score = model.evaluate(train, [metric])
test_score = model.evaluate(test, [metric])

Common Patterns

Pattern 1: Quick Baseline with MoleculeNet

tasks, datasets, transformers = dc.molnet.load_tox21(featurizer='ECFP')
train, valid, test = datasets
model = dc.models.MultitaskClassifier(n_tasks=len(tasks), n_features=1024)
model.fit(train)

Pattern 2: Custom Data with Graph Networks

featurizer = dc.feat.MolGraphConvFeaturizer()
loader = dc.data.CSVLoader(tasks=['activity'], feature_field='smiles',
                           featurizer=featurizer)
dataset = loader.create_dataset('my_data.csv')
train, test = dc.splits.RandomSplitter().train_test_split(dataset)
model = dc.models.GCNModel(mode='classification', n_tasks=1)
model.fit(train)

Pattern 3: Transfer Learning with Pretrained Models

model = dc.models.GroverModel(task='classification', n_tasks=1)
model.fit(train_dataset)
predictions = model.predict(test_dataset)