Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/datamol/references/reactions_data.md
+++ b/skills/datamol/references/reactions_data.md
@@ -0,0 +1,218 @@
+# Datamol Reactions and Data Modules Reference
+
+## Reactions Module (`datamol.reactions`)
+
+The reactions module enables programmatic application of chemical transformations using SMARTS reaction patterns.
+
+### Applying Chemical Reactions
+
+#### `dm.reactions.apply_reaction(rxn, reactants, as_smiles=False, sanitize=True, single_product_group=True, rm_attach=True, product_index=0)`
+Apply a chemical reaction to reactant molecules.
+- **Parameters**:
+  - `rxn`: Reaction object (from SMARTS pattern)
+  - `reactants`: Tuple of reactant molecules
+  - `as_smiles`: Return SMILES strings (True) or molecule objects (False)
+  - `sanitize`: Sanitize product molecules
+  - `single_product_group`: Return single product (True) or all product groups (False)
+  - `rm_attach`: Remove attachment point markers
+  - `product_index`: Which product to return from reaction
+- **Returns**: Product molecule(s) or SMILES
+- **Example**:
+  ```python
+  import datamol as dm
+  from rdkit.Chem import rdChemReactions
+
+  # Define reaction: alcohol + carboxylic acid → ester
+  rxn = rdChemReactions.ReactionFromSmarts(
+      '[C:1][OH:2].[C:3](=[O:4])[OH:5]>>[C:1][O:2][C:3](=[O:4])'
+  )
+
+  # Apply to reactants
+  alcohol = dm.to_mol("CCO")
+  acid = dm.to_mol("CC(=O)O")
+  product = dm.reactions.apply_reaction(rxn, (alcohol, acid))
+  ```
+
+### Creating Reactions
+
+Reactions are typically created from SMARTS patterns using RDKit:
+```python
+from rdkit.Chem import rdChemReactions
+
+# Reaction pattern: [reactant1].[reactant2]>>[product]
+rxn = rdChemReactions.ReactionFromSmarts(
+    '[1*][*:1].[1*][*:2]>>[*:1][*:2]'
+)
+```
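
The `reactants>>products` layout noted in the comment above can be inspected without RDKit. The sketch below is a plain-Python illustration only: `split_reaction_smarts` is a hypothetical helper, not part of datamol or RDKit, and it assumes a simple reaction SMARTS with no agents section and no component-level grouping, so that splitting on `.` is safe.

```python
def split_reaction_smarts(rxn_smarts: str) -> tuple[list[str], list[str]]:
    """Split a simple reaction SMARTS into reactant and product templates.

    Hypothetical helper for illustration; RDKit does the real parsing.
    """
    if rxn_smarts.count(">>") != 1:
        raise ValueError("expected exactly one '>>' separator")
    reactant_side, product_side = rxn_smarts.split(">>")
    # Each '.'-separated piece is one template (naive: ignores grouping)
    reactants = [t for t in reactant_side.split(".") if t]
    products = [t for t in product_side.split(".") if t]
    return reactants, products


reactants, products = split_reaction_smarts("[1*][*:1].[1*][*:2]>>[*:1][*:2]")
print("reactant templates:", reactants)
print("product templates:", products)
```

The number of reactant templates is the number of molecules the reaction expects, which is why the `reactants` argument of `apply_reaction` is a tuple.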
+
+### Validation Functions
+
+The module includes functions to:
+- **Check whether a molecule is a reactant**: Verify that a molecule matches one of the reaction's reactant patterns
+- **Validate a reaction**: Check that a reaction definition is chemically reasonable before applying it
+- **Process reaction files**: Load reactions from files or databases
+
+### Common Reaction Patterns
+
+**Amide formation**:
+```python
+# Amine + carboxylic acid → amide
+amide_rxn = rdChemReactions.ReactionFromSmarts(
+    '[N:1].[C:2](=[O:3])[OH]>>[N:1][C:2](=[O:3])'
+)
+```
+
+**Suzuki coupling**:
+```python
+# Aryl bromide + boronic acid → biaryl
+# The explicit single bond (-) keeps the new ring-ring bond from being read as aromatic
+suzuki_rxn = rdChemReactions.ReactionFromSmarts(
+    '[c:1][Br].[c:2][B]([OH])[OH]>>[c:1]-[c:2]'
+)
+```
+
+**Functional group transformations**:
+```python
+# Alcohol + acyl chloride → ester
+esterification = rdChemReactions.ReactionFromSmarts(
+    '[C:1][OH:2].[C:3](=[O:4])[Cl]>>[C:1][O:2][C:3](=[O:4])'
+)
+```
+
+### Workflow Example
+
+```python
+import datamol as dm
+from rdkit.Chem import rdChemReactions
+
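+# 1. Define reaction: carboxylic acid → acid chloride
+# The hydroxyl is replaced by chlorine, so neither atom carries a map number
+rxn_smarts = '[C:1](=[O:2])[OH]>>[C:1](=[O:2])Cl'
+rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
+
+# 2. Apply to molecule library
+acid_smiles_list = ["CC(=O)O", "OC(=O)c1ccccc1"]  # example inputs
+acids = [dm.to_mol(smi) for smi in acid_smiles_list]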
+acid_chlorides = []
+
+for acid in acids:
+    try:
+        product = dm.reactions.apply_reaction(
+            rxn,
+            (acid,),  # Single reactant as tuple
+            sanitize=True
+        )
+        acid_chlorides.append(product)
+    except Exception as e:
+        print(f"Reaction failed: {e}")
+
+# 3. Validate products
+valid_products = [p for p in acid_chlorides if p is not None]
+```
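
The loop above prints a failure without recording which input caused it. A minimal sketch of one way to keep that information follows. `apply_to_library` and `stand_in_reaction` are hypothetical names introduced here; the stand-in does no chemistry and only takes the place of a call such as `dm.reactions.apply_reaction(rxn, (acid,))` so the pattern runs without RDKit.

```python
from typing import Any, Callable, Sequence


def apply_to_library(
    transform: Callable[[Any], Any], items: Sequence[Any]
) -> tuple[list[Any], list[tuple[int, str]]]:
    """Apply `transform` to every item, collecting products and failures."""
    products: list[Any] = []
    failures: list[tuple[int, str]] = []
    for index, item in enumerate(items):
        try:
            result = transform(item)
        except Exception as exc:  # keep going, but remember what failed
            failures.append((index, str(exc)))
            continue
        if result is None:
            failures.append((index, "no product"))
        else:
            products.append(result)
    return products, failures


def stand_in_reaction(smiles: str) -> str:
    """Placeholder for a real reaction call; NOT real chemistry."""
    if "C(=O)O" not in smiles:
        raise ValueError("no carboxylic acid group")
    return f"product of {smiles}"


products, failures = apply_to_library(stand_in_reaction, ["CC(=O)O", "CCO"])
print(products)
print(failures)
```

Each failure is stored with the index of its input, so problem molecules can be looked up in the original library afterwards.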
+
+### Key Concepts
+
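+- **SMARTS**: SMiles ARbitrary Target Specification - pattern language for substructures; a reaction SMARTS joins reactant and product patterns with `>>`
+- **Atom Mapping**: Map numbers such as `[C:1]` track which reactant atom becomes which product atom
+- **Attachment Points**: `[1*]` matches any atom carrying isotope label 1, used here to mark generic connection points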
+- **Reaction Validation**: Not all SMARTS reactions are chemically reasonable
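
To make the atom-mapping idea concrete, the sketch below compares the map numbers on each side of a reaction SMARTS using only a regular expression. `compare_atom_maps` is a hypothetical helper written for this illustration, not a datamol or RDKit function, and it does not validate the SMARTS itself.

```python
import re

# Matches the map number in atom expressions such as [C:1] or [OH:5].
# An isotope label such as [1*] has no ':' and is deliberately not matched.
_MAP_NUMBER = re.compile(r":(\d+)\]")


def map_numbers(side: str) -> set[int]:
    """Return the set of atom-map numbers found in one side of a reaction."""
    return {int(n) for n in _MAP_NUMBER.findall(side)}


def compare_atom_maps(rxn_smarts: str) -> dict[str, set[int]]:
    """Report which map numbers are kept, lost, or appear only in products."""
    reactant_side, product_side = rxn_smarts.split(">>")
    in_reactants = map_numbers(reactant_side)
    in_products = map_numbers(product_side)
    return {
        "kept": in_reactants & in_products,
        "reactant_only": in_reactants - in_products,
        "product_only": in_products - in_reactants,
    }


report = compare_atom_maps(
    "[C:1][OH:2].[C:3](=[O:4])[OH:5]>>[C:1][O:2][C:3](=[O:4])"
)
print(report)
```

For the esterification pattern, map number 5 (the acid's hydroxyl oxygen) appears only on the reactant side, which is how the pattern expresses that this atom leaves. A map number that appears only on the product side usually signals a typo in the pattern.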
+
+---
+
+## Data Module (`datamol.data`)
+
+The data module provides convenient access to curated molecular datasets for testing and learning.
+
+### Available Datasets
+
+#### `dm.data.cdk2(as_df=True, mol_column='mol')`
+RDKit CDK2 dataset - kinase inhibitor data.
+- **Parameters**:
+  - `as_df`: Return as DataFrame (True) or list of molecules (False)
+  - `mol_column`: Name for molecule column
+- **Returns**: Dataset with molecular structures and activity data
+- **Use case**: Small dataset for algorithm testing
+- **Example**:
+  ```python
+  cdk2_df = dm.data.cdk2(as_df=True)
+  print(cdk2_df.shape)
+  print(cdk2_df.columns)
+  ```
+
+#### `dm.data.freesolv()`
+FreeSolv dataset - experimental and calculated hydration free energies.
+- **Contents**: 642 molecules with:
+  - IUPAC names
+  - SMILES strings
+  - Experimental hydration free energy values
+  - Calculated values
+- **Warning**: "Only meant to be used as a toy dataset for pedagogic and testing purposes"
+- **Not suitable for**: Benchmarking or production model training
+- **Example**:
+  ```python
+  freesolv_df = dm.data.freesolv()
+  # Columns: iupac, smiles, expt (kcal/mol), calc (kcal/mol)
+  ```
+
+#### `dm.data.solubility(as_df=True, mol_column='mol')`
+RDKit solubility dataset with train/test splits.
+- **Contents**: Aqueous solubility data with pre-defined splits
+- **Columns**: Includes 'split' column with 'train' or 'test' values
+- **Use case**: Testing ML workflows with proper train/test separation
+- **Example**:
+  ```python
+  sol_df = dm.data.solubility(as_df=True)
+
+  # Split into train/test
+  train_df = sol_df[sol_df['split'] == 'train']
+  test_df = sol_df[sol_df['split'] == 'test']
+
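+  # Use for model development (dm.to_fp takes one molecule at a time)
+  X_train = [dm.to_fp(mol) for mol in train_df['mol']]
+  y_train = train_df['SOL']  # target column in the RDKit solubility SDF; confirm with sol_df.columns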
+  ```
+
+### Usage Guidelines
+
+**For testing and tutorials**:
+```python
+# Quick dataset for testing code
+df = dm.data.cdk2()
+mols = df['mol'].tolist()
+
+# Test descriptor calculation
+descriptors_df = dm.descriptors.batch_compute_many_descriptors(mols)
+
+# Test clustering (returns cluster indices and the clustered molecules)
+cluster_indices, cluster_mols = dm.cluster_mols(mols, cutoff=0.3)
+```
+
+**For learning workflows**:
+```python
+# Complete ML pipeline example
+sol_df = dm.data.solubility()
+
+# Preprocessing
+train = sol_df[sol_df['split'] == 'train']
+test = sol_df[sol_df['split'] == 'test']
+
+# Featurization (dm.to_fp takes one molecule at a time)
+X_train = [dm.to_fp(mol) for mol in train['mol']]
+X_test = [dm.to_fp(mol) for mol in test['mol']]
+
+# Model training (example)
+from sklearn.ensemble import RandomForestRegressor
+model = RandomForestRegressor()
+model.fit(X_train, train['SOL'])  # 'SOL' is the target column in the RDKit solubility SDF
+predictions = model.predict(X_test)
+```
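
The pipeline above stops at `predictions`. One common next step is to score them against the held-out targets; the sketch below computes root-mean-square error in plain Python. `rmse` is a hypothetical helper introduced here, and the numbers are made up for illustration rather than taken from any dataset.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root-mean-square error between observed and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one value")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up values, for illustration only
observed = [-3.0, -1.0, 0.0]
predicted = [-2.0, -1.0, 2.0]
print(f"RMSE: {rmse(observed, predicted):.3f}")
```

In the pipeline, the same function could be called with the test-set target values and `predictions`, both converted to lists.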
+
+### Important Notes
+
+- **Toy Datasets**: Designed for pedagogical purposes, not production use
+- **Small Size**: Limited number of compounds suitable for quick tests
+- **Pre-processed**: Data already cleaned and formatted
+- **Citations**: Check dataset documentation for proper attribution if publishing
+
+### Best Practices
+
+1. **Use for development only**: Don't draw scientific conclusions from toy datasets
+2. **Validate on real data**: Always test production code on actual project data
+3. **Proper attribution**: Cite original data sources if using in publications
+4. **Understand limitations**: Know the scope and quality of each dataset