# Molfeat API Reference
## Core Modules
Molfeat is organized into several key modules that provide different aspects of molecular featurization:
- **`molfeat.store`** - Manages model loading, listing, and registration
- **`molfeat.calc`** - Provides calculators for single-molecule featurization
- **`molfeat.trans`** - Offers scikit-learn compatible transformers for batch processing
- **`molfeat.utils`** - Utility functions for data handling
- **`molfeat.viz`** - Visualization tools for molecular features
---
## molfeat.calc - Calculators
Calculators are callable objects that convert individual molecules into feature vectors. They accept either RDKit `Chem.Mol` objects or SMILES strings as input.
### SerializableCalculator (Base Class)
Abstract base class for all calculators. When subclassing, implement the following (only `__call__()` is strictly required; a minimal sketch follows the lists below):
- `__call__()` - Required method for featurization
- `__len__()` - Optional, returns output length
- `columns` - Optional property, returns feature names
- `batch_compute()` - Optional, for efficient batch processing
**State Management Methods:**
- `to_state_json()` - Save calculator state as JSON
- `to_state_yaml()` - Save calculator state as YAML
- `from_state_dict()` - Load calculator from state dictionary
- `to_state_dict()` - Export calculator state as dictionary
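A minimal subclassing sketch (the class name and its two toy features are invented for illustration):
```python
import numpy as np
import datamol as dm
from molfeat.calc import SerializableCalculator

class ToyCountCalculator(SerializableCalculator):
    """Toy calculator returning two simple atom-level counts (illustrative only)."""

    def __call__(self, mol, **kwargs):
        mol = dm.to_mol(mol)  # accept SMILES strings or RDKit mols
        return np.array([mol.GetNumHeavyAtoms(), mol.GetRingInfo().NumRings()])

    def __len__(self):
        return 2

    @property
    def columns(self):
        return ["n_heavy_atoms", "n_rings"]

calc = ToyCountCalculator()
print(calc("c1ccccc1O"), len(calc), calc.columns)
```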
### FPCalculator
Computes molecular fingerprints. Supports 15+ fingerprint methods.
**Supported Fingerprint Types:**
**Structural Fingerprints:**
- `ecfp` - Extended-connectivity fingerprints (circular)
- `fcfp` - Functional-class fingerprints
- `rdkit` - RDKit topological fingerprints
- `maccs` - MACCS keys (166 structural keys; RDKit emits a 167-bit vector)
- `avalon` - Avalon fingerprints
- `pattern` - Pattern fingerprints
- `layered` - Layered fingerprints
**Atom-based Fingerprints:**
- `atompair` - Atom pair fingerprints
- `atompair-count` - Counted atom pairs
- `topological` - Topological torsion fingerprints
- `topological-count` - Counted topological torsions
**Specialized Fingerprints:**
- `map4` - MinHashed atom-pair fingerprint up to 4 bonds
- `secfp` - SMILES extended connectivity fingerprint
- `erg` - Extended reduced graphs
- `estate` - Electrotopological state indices
**Parameters:**
- `method` (str) - Fingerprint type name
- `radius` (int) - Radius for circular fingerprints (default: 3)
- `fpSize` (int) - Fingerprint size (default: 2048)
- `includeChirality` (bool) - Include chirality information
- `counting` (bool) - Use count vectors instead of binary
**Usage:**
```python
from molfeat.calc import FPCalculator
# Create fingerprint calculator
calc = FPCalculator("ecfp", radius=3, fpSize=2048)
# Compute fingerprint for single molecule
fp = calc("CCO") # Returns numpy array
# Get fingerprint length
length = len(calc) # 2048
# Get feature names
names = calc.columns
```
**Common Fingerprint Dimensions:**
- MACCS: 167 dimensions
- ECFP (default): 2048 dimensions
- MAP4 (default): 1024 dimensions
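These sizes can be verified directly, since every calculator reports its output length via `len()`:
```python
from molfeat.calc import FPCalculator

for method in ["maccs", "ecfp", "map4"]:
    print(method, len(FPCalculator(method)))
# Expected per the table above: 167, 2048, 1024
```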
### Descriptor Calculators
**RDKitDescriptors2D**
Computes 2D molecular descriptors using RDKit.
```python
from molfeat.calc import RDKitDescriptors2D
calc = RDKitDescriptors2D()
descriptors = calc("CCO") # Returns 200+ descriptors
```
**RDKitDescriptors3D**
Computes 3D molecular descriptors (requires conformer generation).
**MordredDescriptors**
Calculates over 1800 molecular descriptors using Mordred.
```python
from molfeat.calc import MordredDescriptors
calc = MordredDescriptors()
descriptors = calc("CCO")
```
### Pharmacophore Calculators
**Pharmacophore2D**
RDKit's 2D pharmacophore fingerprint generation.
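A short sketch of the 2D variant; the `factory` argument and its `"gobbi"` value are assumptions to check against your installed version:
```python
from molfeat.calc import Pharmacophore2D

# Assumed: `factory` selects the pharmacophore feature definitions
ph = Pharmacophore2D(factory="gobbi")
fp = ph("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
```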
**Pharmacophore3D**
Consensus pharmacophore fingerprints from multiple conformers.
**CATSCalculator**
Computes Chemically Advanced Template Search (CATS) descriptors - pharmacophore point pair distributions.
**Parameters:**
- `mode` - "2D" or "3D" distance calculations
- `dist_bins` - Distance bins for pair distributions
- `scale` - Scaling mode: "raw", "num", or "count"
```python
from molfeat.calc import CATSCalculator
calc = CATSCalculator(mode="2D", scale="raw")
cats = calc("CCO") # Returns 21 descriptors by default
```
### Shape Descriptors
**USRDescriptors**
Ultrafast shape recognition descriptors (multiple variants).
**ElectroShapeDescriptors**
Electrostatic shape descriptors combining shape, chirality, and electrostatics.
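Both require 3D conformers. A hedged sketch for USR (the default `USRDescriptors()` construction is an assumption):
```python
import datamol as dm
from molfeat.calc import USRDescriptors

# USR needs a 3D conformer; generate one with datamol first
mol = dm.conformers.generate(dm.add_hs(dm.to_mol("CCO")), n_confs=1)
usr = USRDescriptors()  # default variant assumed
shape_desc = usr(mol)   # 12-dimensional in the basic USR scheme
```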
### Graph-Based Calculators
**ScaffoldKeyCalculator**
Computes 40+ scaffold-based molecular properties.
**AtomCalculator**
Atom-level featurization for graph neural networks.
**BondCalculator**
Bond-level featurization for graph neural networks.
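A sketch of atom- and bond-level featurization; the exact output structure is version-dependent, so treat it as illustrative:
```python
from molfeat.calc import AtomCalculator, BondCalculator

atom_calc = AtomCalculator()
bond_calc = BondCalculator()
atom_feats = atom_calc("CCO")  # per-atom features for GNN input
bond_feats = bond_calc("CCO")  # per-bond features for GNN input
```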
### Utility Function
**get_calculator()**
Factory function to instantiate calculators by name.
```python
from molfeat.calc import get_calculator
# Instantiate any calculator by name
calc = get_calculator("ecfp", radius=3)
calc = get_calculator("maccs")
calc = get_calculator("desc2D")
```
Raises `ValueError` for unsupported featurizers.
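For example:
```python
from molfeat.calc import get_calculator

try:
    get_calculator("not-a-real-featurizer")
except ValueError as err:
    print(f"Unsupported featurizer: {err}")
```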
---
## molfeat.trans - Transformers
Transformers wrap calculators into complete featurization pipelines for batch processing.
### MoleculeTransformer
Scikit-learn compatible transformer for batch molecular featurization.
**Key Parameters:**
- `featurizer` - Calculator or featurizer to use
- `n_jobs` (int) - Number of parallel jobs (-1 for all cores)
- `dtype` - Output data type (numpy float32/64, torch tensors)
- `verbose` (bool) - Enable verbose logging
- `ignore_errors` (bool) - Continue on failures (returns None for failed molecules)
**Essential Methods:**
- `transform(mols)` - Processes batches and returns representations
- `_transform(mol)` - Handles individual molecule featurization
- `__call__(mols)` - Convenience wrapper around transform()
- `preprocess(mol)` - Prepares input molecules (not automatically applied)
- `to_state_yaml_file(path)` - Save transformer configuration
- `from_state_yaml_file(path)` - Load transformer configuration
**Usage:**
```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
import datamol as dm
# Load molecules
smiles = dm.data.freesolv().sample(100).smiles.values
# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Featurize batch
features = transformer(smiles) # Returns numpy array (100, 2048)
# Save configuration
transformer.to_state_yaml_file("ecfp_config.yml")
# Reload
transformer = MoleculeTransformer.from_state_yaml_file("ecfp_config.yml")
```
**Performance:** Testing on 642 molecules showed 3.4x speedup using 4 parallel jobs versus single-threaded processing.
### FeatConcat
Concatenates multiple featurizers into unified representations.
```python
from molfeat.trans import FeatConcat, MoleculeTransformer
from molfeat.calc import FPCalculator
# Combine multiple fingerprints
concat = FeatConcat([
    FPCalculator("maccs"),  # 167 dimensions
    FPCalculator("ecfp")    # 2048 dimensions
])
# Result: 2215-dimensional features (167 + 2048)
transformer = MoleculeTransformer(concat, n_jobs=-1)
features = transformer(smiles)
```
### PretrainedMolTransformer
Subclass of `MoleculeTransformer` for pre-trained deep learning models.
**Unique Features:**
- `_embed()` - Batched inference for neural networks
- `_convert()` - Transforms SMILES/molecules into model-compatible formats
  - SELFIES strings for language models
  - DGL graphs for graph neural networks
- Integrated caching system for efficient storage
**Usage:**
```python
from molfeat.trans.pretrained import PretrainedMolTransformer
# Load pretrained model
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
# Generate embeddings
embeddings = transformer(smiles)
```
### PrecomputedMolTransformer
Transformer for cached/precomputed features.
---
## molfeat.store - Model Store
Manages featurizer discovery, loading, and registration.
### ModelStore
Central hub for accessing available featurizers.
**Key Methods:**
- `available_models` - Property listing all available featurizers
- `search(name=None, **kwargs)` - Search for specific featurizers
- `load(name, **kwargs)` - Load a featurizer by name
- `register(name, card)` - Register custom featurizer
**Usage:**
```python
from molfeat.store.modelstore import ModelStore
# Initialize store
store = ModelStore()
# List all available models
all_models = store.available_models
print(f"Found {len(all_models)} featurizers")
# Search for specific model
results = store.search(name="ChemBERTa-77M-MLM")
if results:
    model_card = results[0]
    # View usage information
    model_card.usage()
    # Load the model
    transformer = model_card.load()
# Direct loading
transformer = store.load("ChemBERTa-77M-MLM")
```
**ModelCard Attributes:**
- `name` - Model identifier
- `description` - Model description
- `version` - Model version
- `authors` - Model authors
- `tags` - Categorization tags
- `usage()` - Display usage examples
- `load(**kwargs)` - Load the model
---
## Common Patterns
### Error Handling
```python
# Enable error tolerance
featurizer = MoleculeTransformer(
    calc,
    n_jobs=-1,
    verbose=True,
    ignore_errors=True
)
# Failed molecules return None
features = featurizer(smiles_with_errors)
```
### Data Type Control
```python
# NumPy float32 (default)
features = transformer(smiles, enforce_dtype=True)
# PyTorch tensors
import torch
transformer = MoleculeTransformer(calc, dtype=torch.float32)
features = transformer(smiles)
```
### Persistence and Reproducibility
```python
# Save transformer state
transformer.to_state_yaml_file("config.yml")
transformer.to_state_json_file("config.json")
# Load from saved state
transformer = MoleculeTransformer.from_state_yaml_file("config.yml")
transformer = MoleculeTransformer.from_state_json_file("config.json")
```
### Preprocessing
```python
# Manual preprocessing
mol = transformer.preprocess("CCO")
# Transform with preprocessing
features = transformer.transform(smiles_list)
```
---
## Integration Examples
### Scikit-learn Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Create pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"))),
    ('classifier', RandomForestClassifier())
])
# Fit and predict
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
```
### PyTorch Integration
```python
import torch
from torch.utils.data import Dataset, DataLoader
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator

class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.smiles = smiles
        self.labels = labels
        self.transformer = transformer

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, idx):
        features = self.transformer(self.smiles[idx])
        return torch.tensor(features), torch.tensor(self.labels[idx])

# Create dataset and dataloader
transformer = MoleculeTransformer(FPCalculator("ecfp"))
dataset = MoleculeDataset(smiles, labels, transformer)
loader = DataLoader(dataset, batch_size=32)
```
---
## Performance Tips
1. **Parallelization**: Use `n_jobs=-1` to utilize all CPU cores
2. **Batch Processing**: Process multiple molecules at once instead of loops
3. **Caching**: Leverage built-in caching for pretrained models
4. **Data Types**: Use float32 instead of float64 when precision allows
5. **Error Handling**: Set `ignore_errors=True` for large datasets with potential invalid molecules

# Available Featurizers in Molfeat
This document provides a comprehensive catalog of all featurizers available in molfeat, organized by category.
## Transformer-Based Language Models
Pre-trained transformer models for molecular embeddings using SMILES/SELFIES representations.
### RoBERTa-style Models
- **Roberta-Zinc480M-102M** - RoBERTa masked language model trained on ~480M SMILES strings from ZINC database
- **ChemBERTa-77M-MLM** - Masked language model based on RoBERTa trained on 77M PubChem compounds
- **ChemBERTa-77M-MTR** - Multitask regression version trained on PubChem compounds
### GPT-style Autoregressive Models
- **GPT2-Zinc480M-87M** - GPT-2 autoregressive language model trained on ~480M SMILES from ZINC
- **ChemGPT-1.2B** - Large transformer (1.2B parameters) pretrained on PubChem10M
- **ChemGPT-19M** - Medium transformer (19M parameters) pretrained on PubChem10M
- **ChemGPT-4.7M** - Small transformer (4.7M parameters) pretrained on PubChem10M
### Specialized Transformer Models
- **MolT5** - Self-supervised framework for molecule captioning and text-based generation
## Graph Neural Networks (GNNs)
Pre-trained graph neural network models operating on molecular graph structures.
### GIN (Graph Isomorphism Network) Variants
All pre-trained on ChEMBL molecules with different objectives:
- **gin-supervised-masking** - Supervised with node masking objective
- **gin-supervised-infomax** - Supervised with graph-level mutual information maximization
- **gin-supervised-edgepred** - Supervised with edge prediction objective
- **gin-supervised-contextpred** - Supervised with context prediction objective
### Other Graph-Based Models
- **JTVAE_zinc_no_kl** - Junction-tree VAE for molecule generation (trained on ZINC)
- **Graphormer-pcqm4mv2** - Graph transformer pretrained on PCQM4Mv2 quantum chemistry dataset for HOMO-LUMO gap prediction
## Molecular Descriptors
Calculators for physico-chemical properties and molecular characteristics.
### 2D Descriptors
- **desc2D** / **rdkit2D** - 200+ RDKit 2D molecular descriptors including:
  - Molecular weight, logP, TPSA
  - H-bond donors/acceptors
  - Rotatable bonds
  - Ring counts and aromaticity
  - Molecular complexity metrics
### 3D Descriptors
- **desc3D** / **rdkit3D** - RDKit 3D molecular descriptors (requires conformer generation)
  - Inertial moments
  - PMI (Principal Moments of Inertia) ratios
  - Asphericity, eccentricity
  - Radius of gyration
### Comprehensive Descriptor Sets
- **mordred** - Over 1800 molecular descriptors covering:
  - Constitutional descriptors
  - Topological indices
  - Connectivity indices
  - Information content
  - 2D/3D autocorrelations
  - WHIM descriptors
  - GETAWAY descriptors
  - And many more
### Electrotopological Descriptors
- **estate** - Electrotopological state (E-State) indices encoding:
  - Atomic environment information
  - Electronic and topological properties
  - Heteroatom contributions
## Molecular Fingerprints
Binary or count-based fixed-length vectors representing molecular substructures.
### Circular Fingerprints (ECFP-style)
- **ecfp** / **ecfp:2** / **ecfp:4** / **ecfp:6** - Extended-connectivity fingerprints
  - The numeric suffix denotes the diameter (e.g., ecfp:4 = radius 2)
  - Default: radius=3, 2048 bits
  - Most popular for similarity searching
- **ecfp-count** - Count version of ECFP (non-binary)
- **fcfp** / **fcfp-count** - Functional-class circular fingerprints
  - Similar to ECFP but uses functional groups
  - Better for pharmacophore-based similarity
### Path-Based Fingerprints
- **rdkit** - RDKit topological fingerprints based on linear paths
- **pattern** - Pattern fingerprints (similar to MACCS but automated)
- **layered** - Layered fingerprints with multiple substructure layers
### Key-Based Fingerprints
- **maccs** - MACCS keys (166 structural keys; RDKit emits a 167-bit vector)
  - Fixed set of predefined substructures
  - Good for scaffold hopping
  - Fast computation
- **avalon** - Avalon fingerprints
  - Similar to MACCS but more features
  - Optimized for similarity searching
### Atom-Pair Fingerprints
- **atompair** - Atom pair fingerprints
  - Encodes pairs of atoms and the distance between them
  - Good for 3D similarity
- **atompair-count** - Count version of atom pairs
### Topological Torsion Fingerprints
- **topological** - Topological torsion fingerprints
  - Encodes sequences of 4 connected atoms
  - Captures local topology
- **topological-count** - Count version of topological torsions
### MinHashed Fingerprints
- **map4** - MinHashed Atom-Pair fingerprint up to 4 bonds
  - Combines atom-pair and ECFP concepts
  - Default: 1024 dimensions
  - Fast and efficient for large datasets
- **secfp** - SMILES Extended Connectivity Fingerprint
  - Operates directly on SMILES strings
  - Captures both substructure and atom-pair information
### Extended Reduced Graph
- **erg** - Extended Reduced Graph
  - Uses pharmacophoric points instead of atoms
  - Reduces graph complexity while preserving key features
## Pharmacophore Descriptors
Features based on pharmacologically relevant functional groups and their spatial relationships.
### CATS (Chemically Advanced Template Search)
- **cats2D** - 2D CATS descriptors
  - Pharmacophore point pair distributions
  - Distances based on the shortest path
  - 21 descriptors by default
- **cats3D** - 3D CATS descriptors
  - Euclidean distance based
  - Requires conformer generation
- **cats2D_pharm** / **cats3D_pharm** - Pharmacophore variants
### Gobbi Pharmacophores
- **gobbi2D** - 2D pharmacophore fingerprints
  - Built from pharmacophore feature types including:
    - Hydrophobic
    - Aromatic
    - H-bond acceptor
    - H-bond donor
    - Positive ionizable
    - Negative ionizable
    - Lumped hydrophobe
  - Good for virtual screening
### Pmapper Pharmacophores
- **pmapper2D** - 2D pharmacophore signatures
- **pmapper3D** - 3D pharmacophore signatures
  - High-dimensional pharmacophore descriptors
  - Useful for QSAR and similarity searching
## Shape Descriptors
Descriptors capturing 3D molecular shape and electrostatic properties.
### USR (Ultrafast Shape Recognition)
- **usr** - Basic USR descriptors
  - 12 dimensions encoding the shape distribution
  - Extremely fast computation
- **usrcat** - USR with pharmacophoric constraints
  - 60 dimensions (12 per atom subset: all atoms plus four pharmacophore subsets)
  - Combines shape and pharmacophore information
### Electrostatic Shape
- **electroshape** - ElectroShape descriptors
  - Combines molecular shape, chirality, and electrostatics
  - Useful for protein-ligand docking predictions
## Scaffold-Based Descriptors
Descriptors based on molecular scaffolds and core structures.
### Scaffold Keys
- **scaffoldkeys** - Scaffold key calculator
  - 40+ scaffold-based properties
  - Bioisosteric scaffold representation
  - Captures core structural features
## Graph Featurizers for GNN Input
Atom and bond-level features for constructing graph representations for Graph Neural Networks.
### Atom-Level Features
- **atom-onehot** - One-hot encoded atom features
- **atom-default** - Default atom featurization including:
  - Atomic number
  - Degree, formal charge
  - Hybridization
  - Aromaticity
  - Number of hydrogen atoms
### Bond-Level Features
- **bond-onehot** - One-hot encoded bond features
- **bond-default** - Default bond featurization including:
  - Bond type (single, double, triple, aromatic)
  - Conjugation
  - Ring membership
  - Stereochemistry
## Integrated Pretrained Model Collections
Molfeat integrates models from various sources:
### HuggingFace Models
Access to transformer models through HuggingFace hub:
- ChemBERTa variants
- ChemGPT variants
- MolT5
- Custom uploaded models
### DGL-LifeSci Models
Pre-trained GNN models from DGL-Life:
- GIN variants with different pre-training tasks
- AttentiveFP models
- MPNN models
### FCD (Fréchet ChemNet Distance)
- **fcd** - Pre-trained CNN for molecular generation evaluation
### Graphormer Models
- Graph transformers from Microsoft Research
- Pre-trained on quantum chemistry datasets
## Usage Notes
### Choosing a Featurizer
**For traditional ML (Random Forest, SVM, etc.):**
- Start with **ecfp** or **maccs** fingerprints
- Try **desc2D** for interpretable models
- Use **FeatConcat** to combine multiple fingerprints
**For deep learning:**
- Use **ChemBERTa** or **ChemGPT** for transformer embeddings
- Use **gin-supervised-*** for graph neural network embeddings
- Consider **Graphormer** for quantum property predictions
**For similarity searching:**
- **ecfp** - General purpose, most popular
- **maccs** - Fast, good for scaffold hopping
- **map4** - Efficient for large-scale searches
- **usr** / **usrcat** - 3D shape similarity
**For pharmacophore-based approaches:**
- **fcfp** - Functional group based
- **cats2D/3D** - Pharmacophore pair distributions
- **gobbi2D** - Explicit pharmacophore features
**For interpretability:**
- **desc2D** / **mordred** - Named descriptors
- **maccs** - Interpretable substructure keys
- **scaffoldkeys** - Scaffold-based features
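As a quick sketch, the traditional-ML picks above can be instantiated by name and compared side by side (output shapes assume molfeat defaults):
```python
from molfeat.calc import get_calculator
from molfeat.trans import MoleculeTransformer

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O"]
for name in ["ecfp", "maccs", "desc2D"]:
    transformer = MoleculeTransformer(get_calculator(name), n_jobs=-1)
    features = transformer(smiles)
    print(name, features.shape)  # e.g. ecfp -> (3, 2048)
```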
### Model Dependencies
Some featurizers require optional dependencies:
- **DGL models** (gin-*, jtvae): `pip install "molfeat[dgl]"`
- **Graphormer**: `pip install "molfeat[graphormer]"`
- **Transformers** (ChemBERTa, ChemGPT, MolT5): `pip install "molfeat[transformer]"`
- **FCD**: `pip install "molfeat[fcd]"`
- **MAP4**: `pip install "molfeat[map4]"`
- **All dependencies**: `pip install "molfeat[all]"`
### Accessing All Available Models
```python
from molfeat.store.modelstore import ModelStore
store = ModelStore()
all_models = store.available_models
# Print all available featurizers
for model in all_models:
    print(f"{model.name}: {model.description}")
# Search for specific types
transformers = [m for m in all_models if "transformer" in m.tags]
gnn_models = [m for m in all_models if "gnn" in m.tags]
fingerprints = [m for m in all_models if "fingerprint" in m.tags]
```
## Performance Characteristics
### Computational Speed (relative)
**Fastest:**
- maccs
- ecfp
- rdkit fingerprints
- usr
**Medium:**
- desc2D
- cats2D
- Most fingerprints
**Slower:**
- mordred (1800+ descriptors)
- desc3D (requires conformer generation)
- 3D descriptors in general
**Slowest (first run):**
- Pretrained models (ChemBERTa, ChemGPT, GIN)
- Note: Subsequent runs benefit from caching
### Dimensionality
**Low (< 200 dims):**
- maccs (167)
- usr (12)
- usrcat (60)
**Medium (200-2000 dims):**
- desc2D (~200)
- ecfp (2048 default, configurable)
- map4 (1024 default)
**High (> 2000 dims):**
- mordred (1800+)
- Concatenated fingerprints
- Some transformer embeddings
**Variable:**
- Transformer models (typically 768-1024)
- GNN models (depends on architecture)

# Molfeat Usage Examples
This document provides practical examples for common molfeat use cases.
## Installation
```bash
# Recommended: Using conda/mamba
mamba install -c conda-forge molfeat
# Alternative: Using pip
pip install molfeat
# With all optional dependencies
pip install "molfeat[all]"
# With specific dependencies
pip install "molfeat[dgl]" # For GNN models
pip install "molfeat[graphormer]" # For Graphormer
pip install "molfeat[transformer]" # For ChemBERTa, ChemGPT
```
---
## Quick Start
### Basic Featurization Workflow
```python
import datamol as dm
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
# Load sample data
data = dm.data.freesolv().sample(100).smiles.values
# Single molecule featurization
calc = FPCalculator("ecfp")
features_single = calc(data[0])
print(f"Single molecule features shape: {features_single.shape}")
# Output: (2048,)
# Batch featurization with parallelization
transformer = MoleculeTransformer(calc, n_jobs=-1)
features_batch = transformer(data)
print(f"Batch features shape: {features_batch.shape}")
# Output: (100, 2048)
```
---
## Calculator Examples
### Fingerprint Calculators
```python
from molfeat.calc import FPCalculator
# ECFP (Extended-Connectivity Fingerprints)
ecfp = FPCalculator("ecfp", radius=3, fpSize=2048)
fp = ecfp("CCO") # Ethanol
print(f"ECFP shape: {fp.shape}") # (2048,)
# MACCS keys
maccs = FPCalculator("maccs")
fp = maccs("c1ccccc1") # Benzene
print(f"MACCS shape: {fp.shape}") # (167,)
# Count-based fingerprints
ecfp_count = FPCalculator("ecfp-count", radius=3)
fp_count = ecfp_count("CC(C)CC(C)C") # Non-binary counts
# MAP4 fingerprints
map4 = FPCalculator("map4")
fp = map4("CC(=O)Oc1ccccc1C(=O)O") # Aspirin
```
### Descriptor Calculators
```python
from molfeat.calc import RDKitDescriptors2D, MordredDescriptors
# RDKit 2D descriptors (200+ properties)
desc2d = RDKitDescriptors2D()
descriptors = desc2d("CCO")
print(f"Number of 2D descriptors: {len(descriptors)}")
# Get descriptor names
names = desc2d.columns
print(f"First 5 descriptors: {names[:5]}")
# Mordred descriptors (1800+ properties)
mordred = MordredDescriptors()
descriptors = mordred("c1ccccc1O") # Phenol
print(f"Mordred descriptors: {len(descriptors)}")
```
### Pharmacophore Calculators
```python
from molfeat.calc import CATSCalculator
# 2D CATS descriptors
cats = CATSCalculator(mode="2D", scale="raw")
descriptors = cats("CC(C)Cc1ccc(C)cc1C") # Cymene
print(f"CATS descriptors: {descriptors.shape}") # (21,)
# 3D CATS descriptors (requires conformer)
cats3d = CATSCalculator(mode="3D", scale="num")
```
---
## Transformer Examples
### Basic Transformer Usage
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
import datamol as dm
# Prepare data
smiles_list = [
    "CCO",
    "CC(=O)O",
    "c1ccccc1",
    "CC(C)O",
    "CCCC"
]
# Create transformer
calc = FPCalculator("ecfp")
transformer = MoleculeTransformer(calc, n_jobs=-1)
# Transform molecules
features = transformer(smiles_list)
print(f"Features shape: {features.shape}") # (5, 2048)
```
### Error Handling
```python
# Handle invalid SMILES gracefully
smiles_with_errors = [
    "CCO",       # Valid
    "invalid",   # Invalid
    "CC(=O)O",   # Valid
    "xyz123",    # Invalid
]
transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    n_jobs=-1,
    verbose=True,       # Log errors
    ignore_errors=True  # Continue on failure
)
features = transformer(smiles_with_errors)
# Returns: array with None for failed molecules
print(features) # [array(...), None, array(...), None]
```
### Concatenating Multiple Featurizers
```python
from molfeat.trans import FeatConcat, MoleculeTransformer
from molfeat.calc import FPCalculator
# Combine MACCS (167) + ECFP (2048) = 2215 dimensions
concat_calc = FeatConcat([
    FPCalculator("maccs"),
    FPCalculator("ecfp", radius=3, fpSize=2048)
])
transformer = MoleculeTransformer(concat_calc, n_jobs=-1)
features = transformer(smiles_list)
print(f"Combined features shape: {features.shape}") # (n, 2215)
# Triple combination
triple_concat = FeatConcat([
    FPCalculator("maccs"),
    FPCalculator("ecfp"),
    FPCalculator("rdkit")
])
```
### Saving and Loading Configurations
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Create and save transformer
transformer = MoleculeTransformer(
    FPCalculator("ecfp", radius=3, fpSize=2048),
    n_jobs=-1
)
# Save to YAML
transformer.to_state_yaml_file("my_featurizer.yml")
# Save to JSON
transformer.to_state_json_file("my_featurizer.json")
# Load from saved state
loaded_transformer = MoleculeTransformer.from_state_yaml_file("my_featurizer.yml")
# Use loaded transformer
features = loaded_transformer(smiles_list)
```
---
## Pretrained Model Examples
### Using the ModelStore
```python
from molfeat.store.modelstore import ModelStore
# Initialize model store
store = ModelStore()
# List all available models
print(f"Total available models: {len(store.available_models)}")
# Search for specific models
chemberta_models = store.search(name="ChemBERTa")
for model in chemberta_models:
    print(f"- {model.name}: {model.description}")
# Get model information
model_card = store.search(name="ChemBERTa-77M-MLM")[0]
print(f"Model: {model_card.name}")
print(f"Version: {model_card.version}")
print(f"Authors: {model_card.authors}")
# View usage instructions
model_card.usage()
# Load model directly
transformer = store.load("ChemBERTa-77M-MLM")
```
### ChemBERTa Embeddings
```python
from molfeat.trans.pretrained import PretrainedMolTransformer
# Load ChemBERTa model
chemberta = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
# Generate embeddings
smiles = ["CCO", "CC(=O)O", "c1ccccc1"]
embeddings = chemberta(smiles)
print(f"ChemBERTa embeddings shape: {embeddings.shape}")
# Output: (3, 768) - 768-dimensional embeddings
# Use in ML pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    embeddings, labels, test_size=0.2
)
clf = RandomForestClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)
```
### ChemGPT Models
```python
# Small model (4.7M parameters)
chemgpt_small = PretrainedMolTransformer("ChemGPT-4.7M", n_jobs=-1)
# Medium model (19M parameters)
chemgpt_medium = PretrainedMolTransformer("ChemGPT-19M", n_jobs=-1)
# Large model (1.2B parameters)
chemgpt_large = PretrainedMolTransformer("ChemGPT-1.2B", n_jobs=-1)
# Generate embeddings
embeddings = chemgpt_small(smiles)
```
### Graph Neural Network Models
```python
# GIN models with different pre-training objectives
gin_masking = PretrainedMolTransformer("gin-supervised-masking", n_jobs=-1)
gin_infomax = PretrainedMolTransformer("gin-supervised-infomax", n_jobs=-1)
gin_edgepred = PretrainedMolTransformer("gin-supervised-edgepred", n_jobs=-1)
# Generate graph embeddings
embeddings = gin_masking(smiles)
print(f"GIN embeddings shape: {embeddings.shape}")
# Graphormer (for quantum chemistry)
graphormer = PretrainedMolTransformer("Graphormer-pcqm4mv2", n_jobs=-1)
embeddings = graphormer(smiles)
```
---
## Machine Learning Integration
### Scikit-learn Pipeline
```python
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Create ML pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
# Train and evaluate
pipeline.fit(smiles_train, y_train)
predictions = pipeline.predict(smiles_test)
# Cross-validation
scores = cross_val_score(pipeline, smiles_all, y_all, cv=5)
print(f"CV scores: {scores.mean():.3f} (+/- {scores.std():.3f})")
```
### Grid Search for Hyperparameter Tuning
```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
# Define pipeline
pipeline = Pipeline([
    ('featurizer', MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)),
    ('classifier', SVC())
])
# Define parameter grid
param_grid = {
    'classifier__C': [0.1, 1, 10],
    'classifier__kernel': ['rbf', 'linear'],
    'classifier__gamma': ['scale', 'auto']
}
# Grid search
grid_search = GridSearchCV(pipeline, param_grid, cv=5, n_jobs=-1)
grid_search.fit(smiles_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best score: {grid_search.best_score_:.3f}")
```
### Multiple Featurizer Comparison
```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from molfeat.calc import FPCalculator, RDKitDescriptors2D
from molfeat.trans import FeatConcat, MoleculeTransformer

# Test different featurizers
featurizers = {
    'ECFP': FPCalculator("ecfp"),
    'MACCS': FPCalculator("maccs"),
    'RDKit': FPCalculator("rdkit"),
    'Descriptors': RDKitDescriptors2D(),
    'Combined': FeatConcat([
        FPCalculator("maccs"),
        FPCalculator("ecfp")
    ])
}
results = {}
for name, calc in featurizers.items():
    transformer = MoleculeTransformer(calc, n_jobs=-1)
    X_train = transformer(smiles_train)
    X_test = transformer(smiles_test)
    clf = RandomForestClassifier(n_estimators=100)
    clf.fit(X_train, y_train)
    y_pred = clf.predict_proba(X_test)[:, 1]
    auc = roc_auc_score(y_test, y_pred)
    results[name] = auc
    print(f"{name}: AUC = {auc:.3f}")
```
### PyTorch Deep Learning
```python
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
# Custom dataset
class MoleculeDataset(Dataset):
    def __init__(self, smiles, labels, transformer):
        self.features = transformer(smiles)
        self.labels = torch.tensor(labels, dtype=torch.float32)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return (
            torch.tensor(self.features[idx], dtype=torch.float32),
            self.labels[idx]
        )

# Prepare data
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
train_dataset = MoleculeDataset(smiles_train, y_train, transformer)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Simple neural network
class MoleculeClassifier(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, 256),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(256, 1),
            nn.Sigmoid()
        )

    def forward(self, x):
        return self.network(x)

# Train model
model = MoleculeClassifier(input_dim=2048)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = nn.BCELoss()
for epoch in range(10):
    for batch_features, batch_labels in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_features).squeeze()
        loss = criterion(outputs, batch_labels)
        loss.backward()
        optimizer.step()
```
---
## Advanced Usage Patterns
### Custom Preprocessing
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
import datamol as dm
class CustomTransformer(MoleculeTransformer):
    def preprocess(self, mol):
        """Custom preprocessing: standardize the molecule"""
        if isinstance(mol, str):
            mol = dm.to_mol(mol)
        # Standardize
        mol = dm.standardize_mol(mol)
        # Strip salts and solvents
        mol = dm.remove_salts_solvents(mol)
        return mol

# Use custom transformer
transformer = CustomTransformer(FPCalculator("ecfp"), n_jobs=-1)
features = transformer(smiles_list)
```
### Featurization with Conformers
```python
import datamol as dm
from molfeat.calc import RDKitDescriptors3D
# Generate conformers
def prepare_3d_mol(smiles):
    mol = dm.to_mol(smiles)
    mol = dm.add_hs(mol)
    mol = dm.conformers.generate(mol, n_confs=1)
    return mol
# 3D descriptors
calc_3d = RDKitDescriptors3D()
smiles = "CC(C)Cc1ccc(C)cc1C"
mol_3d = prepare_3d_mol(smiles)
descriptors_3d = calc_3d(mol_3d)
```
### Parallel Batch Processing
```python
from molfeat.trans import MoleculeTransformer
from molfeat.calc import FPCalculator
import time
# Large dataset
smiles_large = load_large_dataset() # e.g., 100,000 molecules
# Test different parallelization levels
for n_jobs in [1, 2, 4, -1]:
    transformer = MoleculeTransformer(
        FPCalculator("ecfp"),
        n_jobs=n_jobs
    )
    start = time.time()
    features = transformer(smiles_large)
    elapsed = time.time() - start
    print(f"n_jobs={n_jobs}: {elapsed:.2f}s")
```
### Caching for Expensive Operations
```python
from molfeat.trans.pretrained import PretrainedMolTransformer
import pickle
# Load expensive pretrained model
transformer = PretrainedMolTransformer("ChemBERTa-77M-MLM", n_jobs=-1)
# Cache embeddings for reuse
cache_file = "embeddings_cache.pkl"
try:
    # Try loading cached embeddings
    with open(cache_file, "rb") as f:
        embeddings = pickle.load(f)
    print("Loaded cached embeddings")
except FileNotFoundError:
    # Compute and cache
    embeddings = transformer(smiles_list)
    with open(cache_file, "wb") as f:
        pickle.dump(embeddings, f)
    print("Computed and cached embeddings")
```
---
## Common Workflows
### Virtual Screening Workflow
```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.ensemble import RandomForestClassifier
import datamol as dm
# 1. Prepare training data (known actives/inactives)
train_smiles = load_training_data()
train_labels = load_training_labels() # 1=active, 0=inactive
# 2. Featurize training set
transformer = MoleculeTransformer(FPCalculator("ecfp"), n_jobs=-1)
X_train = transformer(train_smiles)
# 3. Train classifier
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(X_train, train_labels)
# 4. Featurize screening library
screening_smiles = load_screening_library() # e.g., 1M compounds
X_screen = transformer(screening_smiles)
# 5. Predict and rank
predictions = clf.predict_proba(X_screen)[:, 1]
ranked_indices = predictions.argsort()[::-1]
# 6. Get top hits
top_n = 1000
top_hits = [screening_smiles[i] for i in ranked_indices[:top_n]]
```
### QSAR Model Building
```python
from molfeat.calc import RDKitDescriptors2D
from molfeat.trans import MoleculeTransformer
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
import numpy as np
# Load QSAR dataset
smiles = load_molecules()
y = load_activity_values() # e.g., IC50, logP
# Featurize with interpretable descriptors
transformer = MoleculeTransformer(RDKitDescriptors2D(), n_jobs=-1)
X = transformer(smiles)
# Standardize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Build linear model
model = Ridge(alpha=1.0)
scores = cross_val_score(model, X_scaled, y, cv=5, scoring='r2')
print(f"R² = {scores.mean():.3f} (+/- {scores.std():.3f})")
# Fit final model
model.fit(X_scaled, y)
# Interpret feature importance
feature_names = transformer.featurizer.columns
importance = np.abs(model.coef_)
top_features_idx = importance.argsort()[-10:][::-1]
print("Top 10 important features:")
for idx in top_features_idx:
    print(f"  {feature_names[idx]}: {model.coef_[idx]:.3f}")
```
### Similarity Search
```python
from molfeat.calc import FPCalculator
from molfeat.trans import MoleculeTransformer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np
# Query molecule
query_smiles = "CC(=O)Oc1ccccc1C(=O)O" # Aspirin
# Database of molecules
database_smiles = load_molecule_database() # Large collection
# Compute fingerprints
calc = FPCalculator("ecfp")
query_fp = calc(query_smiles).reshape(1, -1)
transformer = MoleculeTransformer(calc, n_jobs=-1)
database_fps = transformer(database_smiles)
# Compute similarity
similarities = cosine_similarity(query_fp, database_fps)[0]
# Find most similar
top_k = 10
top_indices = similarities.argsort()[-top_k:][::-1]
print(f"Top {top_k} similar molecules:")
for i, idx in enumerate(top_indices, 1):
    print(f"{i}. {database_smiles[idx]} (similarity: {similarities[idx]:.3f})")
```
---
## Troubleshooting
### Handling Invalid Molecules
```python
# Use ignore_errors to skip invalid molecules
transformer = MoleculeTransformer(
    FPCalculator("ecfp"),
    ignore_errors=True,
    verbose=True
)
# Filter out None values after transformation
features = transformer(smiles_list)
valid_mask = [f is not None for f in features]
valid_features = [f for f in features if f is not None]
valid_smiles = [s for s, m in zip(smiles_list, valid_mask) if m]
```
### Memory Management for Large Datasets
```python
# Process in chunks for very large datasets
import numpy as np

def featurize_in_chunks(smiles_list, transformer, chunk_size=10000):
    all_features = []
    for i in range(0, len(smiles_list), chunk_size):
        chunk = smiles_list[i:i+chunk_size]
        features = transformer(chunk)
        all_features.append(features)
        print(f"Processed {i+len(chunk)}/{len(smiles_list)}")
    return np.vstack(all_features)
# Use with large dataset
features = featurize_in_chunks(large_smiles_list, transformer)
```
### Reproducibility
```python
import random
import numpy as np
import torch
# Set all random seeds
def set_seed(seed=42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(42)
# Save exact configuration
transformer.to_state_yaml_file("config.yml")
# Document version
import molfeat
print(f"molfeat version: {molfeat.__version__}")
```