Initial commit
218
skills/datamol/references/reactions_data.md
Normal file
@@ -0,0 +1,218 @@
# Datamol Reactions and Data Modules Reference

## Reactions Module (`datamol.reactions`)

The reactions module enables programmatic application of chemical transformations using SMARTS reaction patterns.

### Applying Chemical Reactions

#### `dm.reactions.apply_reaction(rxn, reactants, as_smiles=False, sanitize=True, single_product_group=True, rm_attach=True, product_index=0)`
Apply a chemical reaction to reactant molecules.
- **Parameters**:
  - `rxn`: Reaction object (from SMARTS pattern)
  - `reactants`: Tuple of reactant molecules
  - `as_smiles`: Return SMILES strings (True) or molecule objects (False)
  - `sanitize`: Sanitize product molecules
  - `single_product_group`: Return single product (True) or all product groups (False)
  - `rm_attach`: Remove attachment point markers
  - `product_index`: Which product to return from reaction
- **Returns**: Product molecule(s) or SMILES
- **Example**:
  ```python
  import datamol as dm
  from rdkit.Chem import rdChemReactions

  # Define reaction: alcohol + carboxylic acid → ester
  rxn = rdChemReactions.ReactionFromSmarts(
      '[C:1][OH:2].[C:3](=[O:4])[OH:5]>>[C:1][O:2][C:3](=[O:4])'
  )

  # Apply to reactants (order must match the reactant templates)
  alcohol = dm.to_mol("CCO")
  acid = dm.to_mol("CC(=O)O")
  product = dm.reactions.apply_reaction(rxn, (alcohol, acid))
  ```

### Creating Reactions

Reactions are typically created from SMARTS patterns using RDKit:
```python
from rdkit.Chem import rdChemReactions

# Reaction pattern: [reactant1].[reactant2]>>[product]
rxn = rdChemReactions.ReactionFromSmarts(
    '[1*][*:1].[1*][*:2]>>[*:1][*:2]'
)
```
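
To make the attachment-point pattern above concrete, here is a minimal sketch that runs it with plain RDKit (`RunReactants`) rather than through datamol. The two fragments are hypothetical inputs chosen for illustration, each carrying one `[1*]` attachment point.

```python
from rdkit import Chem
from rdkit.Chem import rdChemReactions

# Join two fragments at their attachment points
# ([1*] is a dummy atom carrying isotope label 1).
rxn = rdChemReactions.ReactionFromSmarts("[1*][*:1].[1*][*:2]>>[*:1][*:2]")

# Hypothetical example fragments: an ethyl and a methoxy group.
frag_a = Chem.MolFromSmiles("[1*]CC")
frag_b = Chem.MolFromSmiles("[1*]OC")

# RunReactants returns a tuple of product sets, one per template match.
product_sets = rxn.RunReactants((frag_a, frag_b))

product_smiles = set()
for products in product_sets:
    for mol in products:
        Chem.SanitizeMol(mol)  # products are returned unsanitized
        product_smiles.add(Chem.MolToSmiles(mol))

print(product_smiles)
```

The unmapped `[1*]` atoms are consumed by the reaction, while the mapped neighbours (`:1` and `:2`) are bonded to each other in the product.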

### Validation Functions

The module includes functions to:
- **Check if molecule is reactant**: Verify if molecule matches reactant pattern
- **Validate reaction**: Check if reaction is synthetically reasonable
- **Process reaction files**: Load reactions from files or databases
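
The first two checks can be sketched with the underlying RDKit reaction object. This example deliberately uses RDKit's own `Validate` and `IsMoleculeReactant` methods, not datamol's wrappers, whose names and signatures are not given here; the helper `matches_any_reactant` is a hypothetical name introduced for illustration.

```python
from rdkit import Chem
from rdkit.Chem import rdChemReactions

amide_rxn = rdChemReactions.ReactionFromSmarts(
    "[N:1].[C:2](=[O:3])[OH]>>[N:1][C:2](=[O:3])"
)
amide_rxn.Initialize()  # build the reactant matchers before querying them

# Validate() reports template-level problems as (warnings, errors).
n_warnings, n_errors = amide_rxn.Validate(silent=True)


def matches_any_reactant(rxn, smiles):
    """Return True if the molecule matches at least one reactant template."""
    mol = Chem.MolFromSmiles(smiles)
    return mol is not None and rxn.IsMoleculeReactant(mol)


# Ethylamine and acetic acid fit a template; benzene fits neither.
results = {
    smi: matches_any_reactant(amide_rxn, smi)
    for smi in ["CCN", "CC(=O)O", "c1ccccc1"]
}
print(n_errors, results)
```

Note that `Validate` only checks that the templates are internally consistent (for example, that atom-map numbers line up); it says nothing about whether the chemistry is feasible.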

### Common Reaction Patterns

**Amide formation**:
```python
# Amine + carboxylic acid → amide
amide_rxn = rdChemReactions.ReactionFromSmarts(
    '[N:1].[C:2](=[O:3])[OH]>>[N:1][C:2](=[O:3])'
)
```

**Suzuki coupling**:
```python
# Aryl halide + boronic acid → biaryl
suzuki_rxn = rdChemReactions.ReactionFromSmarts(
    '[c:1][Br].[c:2][B]([OH])[OH]>>[c:1][c:2]'
)
```

**Functional group transformations**:
```python
# Alcohol + acid chloride → ester
esterification = rdChemReactions.ReactionFromSmarts(
    '[C:1][OH:2].[C:3](=[O:4])[Cl]>>[C:1][O:2][C:3](=[O:4])'
)
```

### Workflow Example

```python
import datamol as dm
from rdkit.Chem import rdChemReactions

# 1. Define reaction
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'  # Acid → acid chloride
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)

# 2. Apply to molecule library
acid_smiles_list = ["CC(=O)O", "OC(=O)c1ccccc1"]  # example input: acetic and benzoic acid
acids = [dm.to_mol(smi) for smi in acid_smiles_list]
acid_chlorides = []

for acid in acids:
    try:
        product = dm.reactions.apply_reaction(
            rxn,
            (acid,),  # Single reactant as tuple
            sanitize=True
        )
        acid_chlorides.append(product)
    except Exception as e:
        print(f"Reaction failed: {e}")

# 3. Validate products
valid_products = [p for p in acid_chlorides if p is not None]
```

### Key Concepts

- **SMARTS**: SMiles ARbitrary Target Specification - pattern language for reactions
- **Atom Mapping**: Numbers like [C:1] preserve atom identity through reaction
- **Attachment Points**: [1*] represents generic connection points
- **Reaction Validation**: Not all SMARTS reactions are chemically reasonable
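
Atom mapping can be inspected without any chemistry toolkit. The sketch below uses only the standard library to list which mapped reactant atoms do not reappear in the product; `atom_map_numbers` is a hypothetical helper, and the simple `>>` split assumes a reaction SMARTS with no agents section.

```python
import re


def atom_map_numbers(smarts_side):
    """Collect the atom-map numbers (the N in [X:N]) on one side of a reaction SMARTS."""
    return {int(n) for n in re.findall(r":(\d+)\]", smarts_side)}


# Esterification pattern: alcohol + carboxylic acid → ester
rxn_smarts = "[C:1][OH:2].[C:3](=[O:4])[OH:5]>>[C:1][O:2][C:3](=[O:4])"
reactant_side, product_side = rxn_smarts.split(">>")

reactant_maps = atom_map_numbers(reactant_side)
product_maps = atom_map_numbers(product_side)

# Mapped reactant atoms that never reappear are dropped by the reaction
# (here, the hydroxyl oxygen of the acid).
lost = sorted(reactant_maps - product_maps)
print(lost)
```

Atoms whose map number appears on both sides keep their identity through the reaction; everything else in the reactant templates is either removed or left to the toolkit's defaults.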

---

## Data Module (`datamol.data`)

The data module provides convenient access to curated molecular datasets for testing and learning.

### Available Datasets

#### `dm.data.cdk2(as_df=True, mol_column='mol')`
RDKit CDK2 dataset - kinase inhibitor data.
- **Parameters**:
  - `as_df`: Return as DataFrame (True) or list of molecules (False)
  - `mol_column`: Name for molecule column
- **Returns**: Dataset with molecular structures and activity data
- **Use case**: Small dataset for algorithm testing
- **Example**:
  ```python
  cdk2_df = dm.data.cdk2(as_df=True)
  print(cdk2_df.shape)
  print(cdk2_df.columns)
  ```

#### `dm.data.freesolv()`
FreeSolv dataset - experimental and calculated hydration free energies.
- **Contents**: 642 molecules with:
  - IUPAC names
  - SMILES strings
  - Experimental hydration free energy values
  - Calculated values
- **Warning**: "Only meant to be used as a toy dataset for pedagogic and testing purposes"
- **Not suitable for**: Benchmarking or production model training
- **Example**:
  ```python
  freesolv_df = dm.data.freesolv()
  # Columns: iupac, smiles, expt (kcal/mol), calc (kcal/mol)
  ```

#### `dm.data.solubility(as_df=True, mol_column='mol')`
RDKit solubility dataset with train/test splits.
- **Contents**: Aqueous solubility data with pre-defined splits
- **Columns**: Includes 'split' column with 'train' or 'test' values
- **Use case**: Testing ML workflows with proper train/test separation
- **Example**:
  ```python
  import numpy as np

  sol_df = dm.data.solubility(as_df=True)

  # Split into train/test
  train_df = sol_df[sol_df['split'] == 'train']
  test_df = sol_df[sol_df['split'] == 'test']

  # Use for model development
  # dm.to_fp featurizes one molecule at a time, so stack the per-molecule fingerprints
  X_train = np.stack([dm.to_fp(mol) for mol in train_df['mol']])
  y_train = train_df['SOL']  # target column name; confirm with sol_df.columns
  ```

### Usage Guidelines

**For testing and tutorials**:
```python
# Quick dataset for testing code
df = dm.data.cdk2()
mols = df['mol'].tolist()

# Test descriptor calculation
descriptors_df = dm.descriptors.batch_compute_many_descriptors(mols)

# Test clustering
clusters = dm.cluster_mols(mols, cutoff=0.3)
```

**For learning workflows**:
```python
# Complete ML pipeline example
import datamol as dm
import numpy as np
from sklearn.ensemble import RandomForestRegressor

sol_df = dm.data.solubility()

# Preprocessing
train = sol_df[sol_df['split'] == 'train']
test = sol_df[sol_df['split'] == 'test']

# Featurization: dm.to_fp takes a single molecule, so stack one fingerprint per row
X_train = np.stack([dm.to_fp(mol) for mol in train['mol']])
X_test = np.stack([dm.to_fp(mol) for mol in test['mol']])

# Model training (example)
model = RandomForestRegressor()
model.fit(X_train, train['SOL'])  # target column name; confirm with sol_df.columns
predictions = model.predict(X_test)
```

### Important Notes

- **Toy Datasets**: Designed for pedagogical purposes, not production use
- **Small Size**: Limited number of compounds suitable for quick tests
- **Pre-processed**: Data already cleaned and formatted
- **Citations**: Check dataset documentation for proper attribution if publishing

### Best Practices

1. **Use for development only**: Don't draw scientific conclusions from toy datasets
2. **Validate on real data**: Always test production code on actual project data
3. **Proper attribution**: Cite original data sources if using in publications
4. **Understand limitations**: Know the scope and quality of each dataset