Initial commit

This commit is contained in:
Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

@@ -0,0 +1,131 @@
# Datamol Conformers Module Reference
The `datamol.conformers` module provides tools for generating and analyzing 3D molecular conformations.
## Conformer Generation
### `dm.conformers.generate(mol, n_confs=None, rms_cutoff=None, minimize_energy=True, method='ETKDGv3', add_hs=True, ...)`
Generate 3D molecular conformers.
- **Parameters**:
  - `mol`: Input molecule
  - `n_confs`: Number of conformers to generate (auto-determined based on rotatable bonds if None)
  - `rms_cutoff`: RMS threshold in Ångströms for filtering similar conformers (removes duplicates)
  - `minimize_energy`: Apply UFF energy minimization (default: True)
  - `method`: Embedding method. Options:
    - `'ETDG'` - Experimental Torsion Distance Geometry
    - `'ETKDG'` - ETDG plus basic chemical knowledge terms (e.g. planar aromatic rings)
    - `'ETKDGv2'` - ETKDG with updated torsion preferences
    - `'ETKDGv3'` - ETKDG with improved handling of small rings and macrocycles (default, recommended)
  - `add_hs`: Add hydrogens before embedding (default: True, critical for quality)
  - `random_seed`: Set for reproducibility
- **Returns**: Molecule with embedded conformers
- **Example**:
```python
mol = dm.to_mol("CCO")
mol_3d = dm.conformers.generate(mol, n_confs=10, rms_cutoff=0.5)
conformers = mol_3d.GetConformers() # Access all conformers
```
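The `n_confs=None` behaviour described above ties the conformer count to molecular flexibility. A minimal sketch of such a rule follows; the function name and the thresholds (8 and 12 rotatable bonds; 50, 200 and 300 conformers) are illustrative assumptions, not values confirmed by this reference.

```python
def default_n_confs(n_rotatable_bonds: int) -> int:
    """Hypothetical heuristic: more flexible molecules get more conformers."""
    if n_rotatable_bonds < 0:
        raise ValueError("n_rotatable_bonds must be non-negative")
    if n_rotatable_bonds < 8:  # assumed threshold for "rigid" molecules
        return 50
    if n_rotatable_bonds < 12:  # assumed threshold for "moderately flexible"
        return 200
    return 300


for n_rot in (2, 10, 15):
    print(n_rot, default_n_confs(n_rot))
```

Passing an explicit `n_confs` overrides any such rule, which is the safer choice when results must be comparable across molecules.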
## Conformer Clustering
### `dm.conformers.cluster(mol, rms_cutoff=1.0, already_aligned=False, centroids=False)`
Group conformers by RMS distance.
- **Parameters**:
  - `rms_cutoff`: Clustering threshold in Ångströms (default: 1.0)
  - `already_aligned`: Whether conformers are pre-aligned
  - `centroids`: Return centroid conformers (True) or cluster groups (False)
- **Returns**: Cluster information or centroid conformers
- **Use case**: Identify distinct conformational families
### `dm.conformers.return_centroids(mol, conf_clusters, centroids=True)`
Extract representative conformers from clusters.
- **Parameters**:
  - `conf_clusters`: Sequence of cluster indices from `cluster()`
  - `centroids`: Return single molecule (True) or list of molecules (False)
- **Returns**: Centroid conformer(s)
## Conformer Analysis
### `dm.conformers.rmsd(mol)`
Calculate pairwise RMSD matrix across all conformers.
- **Requirements**: Minimum 2 conformers
- **Returns**: NxN matrix of RMSD values
- **Use case**: Quantify conformer diversity
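To make the NxN matrix concrete, here is a minimal pure-Python sketch of a pairwise RMSD computation. It assumes each conformer is a plain list of `(x, y, z)` tuples with identical atom ordering and performs no alignment or symmetry handling; the helper names are hypothetical and this is not the library's implementation.

```python
import math


def rmsd(conf_a, conf_b):
    """RMSD between two conformers given as equal-length lists of (x, y, z)."""
    if len(conf_a) != len(conf_b):
        raise ValueError("conformers must have the same number of atoms")
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(conf_a, conf_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(conf_a))


def pairwise_rmsd(confs):
    """Symmetric N x N matrix (list of lists) with zeros on the diagonal."""
    n = len(confs)
    if n < 2:
        raise ValueError("need at least 2 conformers")
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = rmsd(confs[i], confs[j])
    return matrix


conf_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
conf_b = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
matrix = pairwise_rmsd([conf_a, conf_b])
```

Large off-diagonal values indicate a diverse conformer set; a matrix of near-zero values suggests the conformers are close to duplicates.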
### `dm.conformers.sasa(mol, n_jobs=1, ...)`
Calculate Solvent Accessible Surface Area (SASA) using FreeSASA.
- **Parameters**:
  - `n_jobs`: Number of parallel jobs used across conformers
- **Returns**: Array of SASA values (one per conformer)
- **Storage**: Values stored in each conformer as property `'rdkit_free_sasa'`
- **Example**:
```python
sasa_values = dm.conformers.sasa(mol_3d)
# Or access from conformer properties
conf = mol_3d.GetConformer(0)
sasa = conf.GetDoubleProp('rdkit_free_sasa')
```
## Low-Level Conformer Manipulation
### `dm.conformers.center_of_mass(mol, conf_id=-1, use_atoms=True, round_coord=None)`
Calculate molecular center.
- **Parameters**:
  - `conf_id`: Conformer index (-1 for first conformer)
  - `use_atoms`: Use atomic masses (True) or geometric center (False)
  - `round_coord`: Decimal precision for rounding
- **Returns**: 3D coordinates of center
- **Use case**: Centering molecules for visualization or alignment
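The difference between the mass-weighted and geometric centers can be sketched in a few lines. The function below is a hypothetical stand-alone helper that works on plain coordinate tuples, not the library routine.

```python
def center(coords, masses=None):
    """Return the (x, y, z) center of a set of atoms.

    With `masses` the result is the mass-weighted center of mass;
    without it every atom counts equally (geometric center).
    """
    if not coords:
        raise ValueError("need at least one atom")
    if masses is None:
        masses = [1.0] * len(coords)
    if len(masses) != len(coords):
        raise ValueError("need exactly one mass per atom")
    total = sum(masses)
    return tuple(
        sum(m * xyz[axis] for m, xyz in zip(masses, coords)) / total
        for axis in range(3)
    )


coords = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
geometric = center(coords)                    # midpoint of the two atoms
weighted = center(coords, masses=[1.0, 3.0])  # pulled toward the heavier atom
```

The two results coincide only when all atoms have the same mass, so the choice matters for molecules containing heavy atoms.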
### `dm.conformers.get_coords(mol, conf_id=-1)`
Retrieve atomic coordinates from a conformer.
- **Returns**: Nx3 numpy array of atomic positions
- **Example**:
```python
positions = dm.conformers.get_coords(mol_3d, conf_id=0)
# positions.shape: (num_atoms, 3)
```
### `dm.conformers.translate(mol, conf_id=-1, transform_matrix=None)`
Reposition conformer using transformation matrix.
- **Modification**: Operates in-place
- **Use case**: Aligning or repositioning molecules
## Workflow Example
```python
import datamol as dm
# 1. Create molecule and generate conformers
mol = dm.to_mol("CC(C)CCO") # Isopentanol
mol_3d = dm.conformers.generate(
    mol,
    n_confs=50,  # Generate 50 initial conformers
    rms_cutoff=0.5,  # Filter similar conformers
    minimize_energy=True,  # Minimize energy
)
# 2. Analyze conformers
n_conformers = mol_3d.GetNumConformers()
print(f"Generated {n_conformers} unique conformers")
# 3. Calculate SASA
sasa_values = dm.conformers.sasa(mol_3d)
# 4. Cluster conformers
clusters = dm.conformers.cluster(mol_3d, rms_cutoff=1.0, centroids=False)
# 5. Get representative conformers
centroids = dm.conformers.return_centroids(mol_3d, clusters)
# 6. Access 3D coordinates
coords = dm.conformers.get_coords(mol_3d, conf_id=0)
```
## Key Concepts
- **Distance Geometry**: Method for generating 3D structures from connectivity information
- **ETKDG**: Uses experimental torsion angle preferences and additional chemical knowledge
- **RMS Cutoff**: Lower values keep more conformers (only near-duplicates are pruned); higher values keep fewer, more distinct conformers
- **Energy Minimization**: Relaxes structures to nearest local energy minimum
- **Hydrogens**: Critical for accurate 3D geometry - always include during embedding
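The effect of the RMS cutoff can be illustrated with a greedy filter. This is a simplified sketch on plain coordinate lists (no alignment, hypothetical helper names), not the pruning routine used by the library.

```python
import math


def rmsd(conf_a, conf_b):
    """RMSD between two conformers given as equal-length lists of (x, y, z)."""
    squared = sum(
        (p - q) ** 2
        for atom_a, atom_b in zip(conf_a, conf_b)
        for p, q in zip(atom_a, atom_b)
    )
    return math.sqrt(squared / len(conf_a))


def prune_by_rms(confs, rms_cutoff):
    """Keep a conformer only if it is farther than rms_cutoff from all kept ones."""
    kept = []
    for conf in confs:
        if all(rmsd(conf, other) > rms_cutoff for other in kept):
            kept.append(conf)
    return kept


confs = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0)],  # near-duplicate of the first
    [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)],  # clearly different
]
low_cutoff = prune_by_rms(confs, rms_cutoff=0.01)
high_cutoff = prune_by_rms(confs, rms_cutoff=0.5)
```

With the low cutoff all three conformers survive; with the higher cutoff the near-duplicate is dropped and two remain.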

@@ -0,0 +1,130 @@
# Datamol Core API Reference
This document covers the main functions available in the datamol namespace.
## Molecule Creation and Conversion
### `to_mol(mol, ...)`
Convert SMILES string or other molecular representations to RDKit molecule objects.
- **Parameters**: Accepts SMILES strings, InChI, or other molecular formats
- **Returns**: `rdkit.Chem.Mol` object
- **Common usage**: `mol = dm.to_mol("CCO")`
### `from_inchi(inchi)`
Convert InChI string to molecule object.
### `from_smarts(smarts)`
Convert SMARTS pattern to molecule object.
### `from_selfies(selfies)`
Convert SELFIES string to molecule object.
### `copy_mol(mol)`
Create a copy of a molecule object to avoid modifying the original.
## Molecule Export
### `to_smiles(mol, ...)`
Convert molecule object to SMILES string.
- **Common parameters**: `canonical=True`, `isomeric=True`
### `to_inchi(mol, ...)`
Convert molecule to InChI string representation.
### `to_inchikey(mol)`
Convert molecule to InChI key (fixed-length hash).
### `to_smarts(mol)`
Convert molecule to SMARTS pattern.
### `to_selfies(mol)`
Convert molecule to SELFIES (Self-Referencing Embedded Strings) format.
## Sanitization and Standardization
### `sanitize_mol(mol, ...)`
Enhanced version of RDKit's sanitize operation using mol→SMILES→mol conversion and aromatic nitrogen fixing.
- **Purpose**: Fix common molecular structure issues
- **Returns**: Sanitized molecule or None if sanitization fails
### `standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True, ...)`
Apply comprehensive standardization procedures including:
- Metal disconnection
- Normalization (charge corrections)
- Reionization
- Fragment handling (largest fragment selection)
### `standardize_smiles(smiles, ...)`
Apply SMILES standardization procedures directly to a SMILES string.
### `fix_mol(mol)`
Attempt to fix molecular structure issues automatically.
### `fix_valence(mol)`
Correct valence errors in molecular structures.
## Molecular Properties
### `reorder_atoms(mol, ...)`
Ensure consistent atom ordering for the same molecule regardless of original SMILES representation.
- **Purpose**: Maintain reproducible feature generation
### `remove_hs(mol, ...)`
Remove hydrogen atoms from molecular structure.
### `add_hs(mol, ...)`
Add explicit hydrogen atoms to molecular structure.
## Fingerprints and Similarity
### `to_fp(mol, fp_type='ecfp', ...)`
Generate molecular fingerprints for similarity calculations.
- **Fingerprint types**:
  - `'ecfp'` - Extended Connectivity Fingerprints (Morgan)
  - `'fcfp'` - Functional Connectivity Fingerprints
  - `'maccs'` - MACCS keys
  - `'topological'` - Topological fingerprints
  - `'atompair'` - Atom pair fingerprints
- **Common parameters**: `n_bits`, `radius`
- **Returns**: Numpy array or RDKit fingerprint object
### `pdist(mols, ...)`
Calculate pairwise Tanimoto distances between all molecules in a list.
- **Supports**: Parallel processing via `n_jobs` parameter
- **Returns**: Distance matrix
### `cdist(mols1, mols2, ...)`
Calculate Tanimoto distances between two sets of molecules.
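For reference, the Tanimoto distance behind both functions is one minus the ratio of shared on-bits to total on-bits. The sketch below represents each fingerprint as a set of on-bit indices; the function names are hypothetical and chosen so they do not shadow the library's `pdist` and `cdist`.

```python
def tanimoto_distance(bits_a, bits_b):
    """1 - |A & B| / |A | B| for two collections of on-bit indices."""
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty fingerprints are identical
    return 1.0 - len(a & b) / len(union)


def cross_distances(fps_1, fps_2):
    """Rectangular matrix: rows follow fps_1, columns follow fps_2."""
    return [[tanimoto_distance(x, y) for y in fps_2] for x in fps_1]


def pairwise_distances(fps):
    """Square, symmetric matrix over a single list of fingerprints."""
    return cross_distances(fps, fps)


fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
matrix = pairwise_distances(fps)
```

A distance of 0 means identical bit sets and 1 means no bits in common, which is why small clustering cutoffs produce tight clusters.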
## Clustering and Diversity
### `cluster_mols(mols, cutoff=0.2, feature_fn=None, n_jobs=1)`
Cluster molecules using Butina clustering algorithm.
- **Parameters**:
  - `cutoff`: Distance threshold (default 0.2)
  - `feature_fn`: Custom function for molecular features
  - `n_jobs`: Parallelization (-1 for all cores)
- **Important**: Builds full distance matrix - suitable for ~1000 structures, not for 10,000+
- **Returns**: Tuple of cluster indices (one list of molecule indices per cluster) and the corresponding clustered molecules
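The reason a full distance matrix is needed becomes clear from the algorithm itself. Below is a simplified Butina-style sketch over a precomputed matrix; unlike the classic formulation it recounts neighbours in every round, and the function name is hypothetical.

```python
def butina(dist, cutoff):
    """Cluster indices 0..n-1 given a full distance matrix.

    Returns a list of clusters; each cluster lists its centroid first.
    """
    n = len(dist)
    # Neighbours of i: every other item within the cutoff distance.
    neighbours = [
        {j for j in range(n) if j != i and dist[i][j] <= cutoff}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Centroid: unassigned item with most unassigned neighbours
        # (ties resolved in favour of the lowest index).
        centroid = max(
            sorted(unassigned),
            key=lambda i: len(neighbours[i] & unassigned),
        )
        members = [centroid] + sorted(neighbours[centroid] & unassigned)
        clusters.append(members)
        unassigned -= set(members)
    return clusters


dist = [
    [0.00, 0.10, 0.15, 0.90],
    [0.10, 0.00, 0.30, 0.80],
    [0.15, 0.30, 0.00, 0.85],
    [0.90, 0.80, 0.85, 0.00],
]
clusters = butina(dist, cutoff=0.2)
```

Memory grows quadratically with the number of molecules because every pair is compared, which is what limits the approach to libraries of a few thousand structures.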
### `pick_diverse(mols, npick, ...)`
Select diverse subset of molecules based on fingerprint diversity.
### `pick_centroids(mols, npick, ...)`
Select centroid molecules representing clusters.
## Graph Operations
### `to_graph(mol)`
Convert molecule to graph representation for graph-based analysis.
### `get_all_path_between(mol, start, end)`
Find all paths between two atoms in molecular structure.
## DataFrame Integration
### `to_df(mols, smiles_column='smiles', mol_column='mol')`
Convert list of molecules to pandas DataFrame.
### `from_df(df, smiles_column='smiles', mol_column='mol')`
Convert pandas DataFrame to list of molecules.

@@ -0,0 +1,195 @@
# Datamol Descriptors and Visualization Reference
## Descriptors Module (`datamol.descriptors`)
The descriptors module provides tools for computing molecular properties and descriptors.
### Specialized Descriptor Functions
#### `dm.descriptors.n_aromatic_atoms(mol)`
Calculate the number of aromatic atoms.
- **Returns**: Integer count
- **Use case**: Aromaticity analysis
#### `dm.descriptors.n_aromatic_atoms_proportion(mol)`
Calculate ratio of aromatic atoms to total heavy atoms.
- **Returns**: Float between 0 and 1
- **Use case**: Quantifying aromatic character
#### `dm.descriptors.n_charged_atoms(mol)`
Count atoms with nonzero formal charge.
- **Returns**: Integer count
- **Use case**: Charge distribution analysis
#### `dm.descriptors.n_rigid_bonds(mol)`
Count rigid (non-rotatable) bonds: every bond except acyclic single bonds, so ring bonds and double or triple bonds all count.
- **Returns**: Integer count
- **Use case**: Molecular flexibility assessment
#### `dm.descriptors.n_stereo_centers(mol)`
Count stereogenic centers (chiral centers).
- **Returns**: Integer count
- **Use case**: Stereochemistry analysis
#### `dm.descriptors.n_stereo_centers_unspecified(mol)`
Count stereocenters lacking stereochemical specification.
- **Returns**: Integer count
- **Use case**: Identifying incomplete stereochemistry
### Batch Descriptor Computation
#### `dm.descriptors.compute_many_descriptors(mol, properties_fn=None, add_properties=True)`
Compute multiple molecular properties for a single molecule.
- **Parameters**:
  - `properties_fn`: Dictionary mapping descriptor names to descriptor functions (None uses the defaults)
  - `add_properties`: Also compute the default descriptors when `properties_fn` is given
- **Returns**: Dictionary of descriptor name → value pairs
- **Default descriptors include**:
  - Molecular weight, LogP, number of H-bond donors/acceptors
  - Aromatic atoms, stereocenters, rotatable bonds
  - TPSA (Topological Polar Surface Area)
  - Ring count, heteroatom count
- **Example**:
```python
mol = dm.to_mol("CCO")
descriptors = dm.descriptors.compute_many_descriptors(mol)
# Returns a dict keyed by descriptor name, e.g. {'mw': ..., 'clogp': ..., 'tpsa': ..., ...}
```
#### `dm.descriptors.batch_compute_many_descriptors(mols, properties_fn=None, add_properties=True, n_jobs=1, batch_size=None, progress=False)`
Compute descriptors for multiple molecules in parallel.
- **Parameters**:
  - `mols`: List of molecules
  - `n_jobs`: Number of parallel jobs (-1 for all cores)
  - `batch_size`: Chunk size for parallel processing
  - `progress`: Show progress bar
- **Returns**: Pandas DataFrame with one row per molecule
- **Example**:
```python
mols = [dm.to_mol(smi) for smi in smiles_list]
df = dm.descriptors.batch_compute_many_descriptors(
    mols,
    n_jobs=-1,
    progress=True
)
```
### RDKit Descriptor Access
#### `dm.descriptors.any_rdkit_descriptor(name)`
Retrieve any descriptor function from RDKit by name.
- **Parameters**: `name` - Descriptor function name (e.g., 'MolWt', 'TPSA')
- **Returns**: RDKit descriptor function
- **Available descriptors**: From `rdkit.Chem.Descriptors` and `rdkit.Chem.rdMolDescriptors`
- **Example**:
```python
tpsa_fn = dm.descriptors.any_rdkit_descriptor('TPSA')
tpsa_value = tpsa_fn(mol)
```
### Common Use Cases
**Drug-likeness Filtering (Lipinski's Rule of Five)**:
```python
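descriptors = dm.descriptors.compute_many_descriptors(mol)
# Key names follow datamol's default descriptor set; check descriptors.keys() if they differ
is_druglike = (
    descriptors['mw'] <= 500 and
    descriptors['clogp'] <= 5 and
    descriptors['n_lipinski_hbd'] <= 5 and
    descriptors['n_lipinski_hba'] <= 10
)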
```
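The Rule of Five is commonly applied by counting violations and tolerating at most one, rather than requiring every criterion to pass. A minimal sketch of that variant follows; it works on a plain dictionary, and the key names are assumptions that should be renamed to match your descriptor output.

```python
def lipinski_violations(desc):
    """Count Rule-of-Five violations from a descriptor dictionary.

    The key names used here are assumptions for this sketch.
    """
    checks = [
        desc["mw"] > 500,             # molecular weight
        desc["clogp"] > 5,            # lipophilicity
        desc["n_lipinski_hbd"] > 5,   # hydrogen-bond donors
        desc["n_lipinski_hba"] > 10,  # hydrogen-bond acceptors
    ]
    return sum(checks)


example = {"mw": 520.0, "clogp": 3.2, "n_lipinski_hbd": 2, "n_lipinski_hba": 6}
n_violations = lipinski_violations(example)
is_druglike = n_violations <= 1  # one violation is usually tolerated
```

Reporting the violation count instead of a single boolean also makes it easy to rank borderline compounds.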
**ADME Property Analysis**:
```python
df = dm.descriptors.batch_compute_many_descriptors(compound_library)
# Filter by TPSA for blood-brain barrier penetration
bbb_candidates = df[df['tpsa'] < 90]
```
---
## Visualization Module (`datamol.viz`)
The viz module provides tools for rendering molecules and conformers as images.
### Main Visualization Function
#### `dm.viz.to_image(mols, legends=None, n_cols=4, use_svg=False, mol_size=(200, 200), highlight_atom=None, highlight_bond=None, outfile=None, max_mols=None, copy=True, indices=False, ...)`
Generate image grid from molecules.
- **Parameters**:
  - `mols`: Single molecule or list of molecules
  - `legends`: String or list of strings as labels (one per molecule)
  - `n_cols`: Number of molecules per row (default: 4)
  - `use_svg`: Output SVG format (True) or PNG (False, default)
  - `mol_size`: Tuple (width, height) or single int for square images
  - `highlight_atom`: Atom indices to highlight (list or dict)
  - `highlight_bond`: Bond indices to highlight (list or dict)
  - `outfile`: Save path (local or remote, supports fsspec)
  - `max_mols`: Maximum number of molecules to display
  - `indices`: Draw atom indices on structures (default: False)
  - `align`: Align molecules using MCS (Maximum Common Substructure)
- **Returns**: Image object (can be displayed in Jupyter) or saves to file
- **Example**:
```python
# Basic grid
dm.viz.to_image(mols[:10], legends=[dm.to_smiles(m) for m in mols[:10]])
# Save to file
dm.viz.to_image(mols, outfile="molecules.png", n_cols=5)
# Highlight substructure
dm.viz.to_image(mol, highlight_atom=[0, 1, 2], highlight_bond=[0, 1])
# Aligned visualization
dm.viz.to_image(mols, align=True, legends=activity_labels)
```
### Conformer Visualization
#### `dm.viz.conformers(mol, n_confs=None, align_conf=True, n_cols=3, sync_views=True, remove_hs=True, ...)`
Display multiple conformers in grid layout.
- **Parameters**:
  - `mol`: Molecule with embedded conformers
  - `n_confs`: Number or list of conformer indices to display (None = all)
  - `align_conf`: Align conformers for comparison (default: True)
  - `n_cols`: Grid columns (default: 3)
  - `sync_views`: Synchronize 3D views when interactive (default: True)
  - `remove_hs`: Remove hydrogens for clarity (default: True)
- **Returns**: Grid of conformer visualizations
- **Use case**: Comparing conformational diversity
- **Example**:
```python
mol_3d = dm.conformers.generate(mol, n_confs=20)
dm.viz.conformers(mol_3d, n_confs=10, align_conf=True)
```
### Circle Grid Visualization
#### `dm.viz.circle_grid(center_mol, circle_mols, mol_size=200, circle_margin=50, act_mapper=None, ...)`
Create concentric ring visualization with central molecule.
- **Parameters**:
  - `center_mol`: Molecule at center
  - `circle_mols`: List of molecule lists (one list per ring)
  - `mol_size`: Image size per molecule
  - `circle_margin`: Spacing between rings (default: 50)
  - `act_mapper`: Activity mapping dictionary for color-coding
- **Returns**: Circular grid image
- **Use case**: Visualizing molecular neighborhoods, SAR analysis, similarity networks
- **Example**:
```python
# Show a reference molecule surrounded by similar compounds
dm.viz.circle_grid(
    center_mol=reference,
    circle_mols=[nearest_neighbors, second_tier]
)
```
### Visualization Best Practices
1. **Use legends for clarity**: Always label molecules with SMILES, IDs, or activity values
2. **Align related molecules**: Use `align=True` in `to_image()` for SAR analysis
3. **Adjust grid size**: Set `n_cols` based on molecule count and display width
4. **Use SVG for publications**: Set `use_svg=True` for scalable vector graphics
5. **Highlight substructures**: Use `highlight_atom` and `highlight_bond` to emphasize features
6. **Save large grids**: Use `outfile` parameter to save rather than display in memory

@@ -0,0 +1,174 @@
# Datamol Fragments and Scaffolds Reference
## Scaffolds Module (`datamol.scaffold`)
Scaffolds represent the core structure of molecules, useful for identifying structural families and analyzing structure-activity relationships (SAR).
### Murcko Scaffolds
#### `dm.to_scaffold_murcko(mol)`
Extract Bemis-Murcko scaffold (molecular framework).
- **Method**: Removes side chains, retaining ring systems and linkers
- **Returns**: Molecule object representing the scaffold
- **Use case**: Identify core structures across compound series
- **Example**:
```python
mol = dm.to_mol("c1ccc(cc1)CCN") # Phenethylamine
scaffold = dm.to_scaffold_murcko(mol)
scaffold_smiles = dm.to_smiles(scaffold)
# Returns: 'c1ccccc1' (benzene; the ethylamine side chain is removed because it does not link two rings)
```
**Workflow for scaffold analysis**:
```python
# Extract scaffolds from compound library
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
# Count scaffold frequency
from collections import Counter
scaffold_counts = Counter(scaffold_smiles)
most_common = scaffold_counts.most_common(10)
```
### Fuzzy Scaffolds
#### `dm.scaffold.fuzzy_scaffolding(mol, ...)`
Generate fuzzy scaffolds with enforceable groups that must appear in the core.
- **Purpose**: More flexible scaffold definition allowing specified functional groups
- **Use case**: Custom scaffold definitions beyond Murcko rules
### Applications
**Scaffold-based splitting** (for ML model validation):
```python
# Group compounds by scaffold
scaffold_to_mols = {}
for mol, scaffold in zip(mols, scaffolds):
    smi = dm.to_smiles(scaffold)
    if smi not in scaffold_to_mols:
        scaffold_to_mols[smi] = []
    scaffold_to_mols[smi].append(mol)
# Ensure train/test sets have different scaffolds
```
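Once compounds are grouped by scaffold, the split itself only has to keep each group intact. The sketch below shows one common greedy strategy (largest groups fill the training set first); the function name, the `test_fraction` parameter and the placeholder molecule IDs are assumptions for illustration.

```python
def scaffold_split(scaffold_to_items, test_fraction=0.2):
    """Assign whole scaffold groups to train or test, largest groups first."""
    groups = sorted(scaffold_to_items.values(), key=len, reverse=True)
    n_total = sum(len(group) for group in groups)
    train_target = (1.0 - test_fraction) * n_total
    train, test = [], []
    for group in groups:
        # A group is never divided, so a scaffold cannot appear on both sides.
        if len(train) + len(group) <= train_target:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


groups = {
    "c1ccccc1": ["m1", "m2", "m3", "m4"],
    "c1ccncc1": ["m5", "m6", "m7"],
    "C1CCCCC1": ["m8", "m9"],
    "c1ccoc1": ["m10"],
}
train, test = scaffold_split(groups, test_fraction=0.25)
```

Because rare scaffolds end up in the test set, this split tends to give a more pessimistic, and usually more realistic, estimate of how a model generalizes to new chemotypes than a random split.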
**SAR analysis**:
```python
# Group by scaffold and analyze activity
import numpy as np
for scaffold_smi, molecules in scaffold_to_mols.items():
    # get_activity is a placeholder for your own activity lookup
    activities = [get_activity(mol) for mol in molecules]
    print(f"Scaffold: {scaffold_smi}, Mean activity: {np.mean(activities)}")
```
---
## Fragments Module (`datamol.fragment`)
Molecular fragmentation breaks molecules into smaller pieces based on chemical rules, useful for fragment-based drug design and substructure analysis.
### BRICS Fragmentation
#### `dm.fragment.brics(mol, ...)`
Fragment molecule using BRICS (Breaking Retrosynthetically Interesting Chemical Substructures).
- **Method**: Dissects based on 16 chemically meaningful bond types
- **Consideration**: Considers chemical environment and surrounding substructures
- **Returns**: List of fragment molecules (convert with `dm.to_smiles` for SMILES strings)
- **Use case**: Retrosynthetic analysis, fragment-based design
- **Example**:
```python
mol = dm.to_mol("c1ccccc1CCN")
fragments = dm.fragment.brics(mol)
# Fragments carry numbered dummy atoms ([n*]) that mark
# where a bond was cut (attachment points)
```
### RECAP Fragmentation
#### `dm.fragment.recap(mol, ...)`
Fragment molecule using RECAP (Retrosynthetic Combinatorial Analysis Procedure).
- **Method**: Dissects based on 11 predefined bond types
- **Rules**:
  - Leaves alkyl groups smaller than 5 carbons intact
  - Preserves cyclic bonds
- **Returns**: List of fragment molecules (convert with `dm.to_smiles` for SMILES strings)
- **Use case**: Combinatorial library design
- **Example**:
```python
mol = dm.to_mol("CCCCCc1ccccc1")
fragments = dm.fragment.recap(mol)
```
### MMPA Fragmentation
#### `dm.fragment.mmpa_frag(mol, ...)`
Fragment for Matched Molecular Pair Analysis.
- **Purpose**: Generate fragments suitable for identifying molecular pairs
- **Use case**: Analyzing how small structural changes affect properties
- **Example**:
```python
fragments = dm.fragment.mmpa_frag(mol)
# Used to find pairs of molecules differing by single transformation
```
### Comparison of Methods
| Method | Bond Types | Preserves Cycles | Best For |
|--------|-----------|------------------|----------|
| BRICS | 16 | Yes | Retrosynthetic analysis, fragment recombination |
| RECAP | 11 | Yes | Combinatorial library design |
| MMPA | Variable | Depends | Structure-activity relationship analysis |
### Fragmentation Workflow
```python
import datamol as dm
# 1. Fragment a molecule
mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O") # Aspirin
brics_frags = dm.fragment.brics(mol)
recap_frags = dm.fragment.recap(mol)
# 2. Analyze fragment frequency across library
# (assumes fragments come back as molecule objects; SMILES make them countable)
all_fragments = []
for lib_mol in molecule_library:  # molecule_library: your own list of molecules
    frags = dm.fragment.brics(lib_mol)
    all_fragments.extend(dm.to_smiles(frag) for frag in frags)
# 3. Identify common fragments
from collections import Counter
fragment_counts = Counter(all_fragments)
common_fragments = fragment_counts.most_common(20)
# 4. Convert fragments back to molecules (remove attachment points)
import re
def clean_fragment(frag_smiles):
    # Replace every attachment point marker ([1*], [2*], ...) with hydrogen
    clean = re.sub(r'\[\d+\*\]', '[H]', frag_smiles)
    return dm.to_mol(clean)
```
### Advanced: Fragment-Based Virtual Screening
```python
# Build fragment library from known actives
active_fragments = set()
for active_mol in active_compounds:
    frags = dm.fragment.brics(active_mol)
    active_fragments.update(dm.to_smiles(frag) for frag in frags)
# Screen compounds for presence of active fragments
def score_by_fragments(mol, fragment_set):
    mol_frags = {dm.to_smiles(frag) for frag in dm.fragment.brics(mol)}
    if not mol_frags:
        return 0.0  # no fragments, nothing to compare
    overlap = mol_frags & fragment_set
    return len(overlap) / len(mol_frags)
# Score screening library
scores = [score_by_fragments(mol, active_fragments) for mol in screening_lib]
```
### Key Concepts
- **Attachment Points**: Marked with [1*], [2*], etc. in fragment SMILES
- **Retrosynthetic**: Fragmentation mimics synthetic disconnections
- **Chemically Meaningful**: Breaks occur at typical synthetic bonds
- **Recombination**: Fragments can theoretically be recombined into valid molecules
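Attachment point labels can be handled with plain string processing before any chemistry toolkit is involved. The sketch below extracts the labels and normalizes them away so that fragments differing only in label compare equal; the helper names are hypothetical and the inputs are illustrative fragment strings.

```python
import re

# Matches attachment points such as [1*], [16*] or an unlabelled [*].
ATTACHMENT = re.compile(r"\[(\d*)\*\]")


def attachment_labels(frag_smiles):
    """Return the numeric labels of all attachment points, in order (0 if unlabelled)."""
    return [int(label) if label else 0 for label in ATTACHMENT.findall(frag_smiles)]


def strip_labels(frag_smiles):
    """Replace every labelled attachment point with an unlabelled '*' atom."""
    return ATTACHMENT.sub("*", frag_smiles)


labels = attachment_labels("[8*]CC[16*]")
normalized = strip_labels("[8*]CC[16*]")
same_core = strip_labels("[16*]c1ccccc1") == strip_labels("[5*]c1ccccc1")
```

Keeping the labels preserves information about which bond type was cut, so strip them only when counting or matching fragment cores.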

@@ -0,0 +1,109 @@
# Datamol I/O Module Reference
The `datamol.io` module provides comprehensive file handling for molecular data across multiple formats.
## Reading Molecular Files
### `dm.read_sdf(filename, sanitize=True, remove_hs=True, as_df=True, mol_column='mol', ...)`
Read Structure-Data File (SDF) format.
- **Parameters**:
  - `filename`: Path to SDF file (supports local and remote paths via fsspec)
  - `sanitize`: Apply sanitization to molecules
  - `remove_hs`: Remove explicit hydrogens
  - `as_df`: Return as DataFrame (True) or list of molecules (False)
  - `mol_column`: Name of molecule column in DataFrame
  - `n_jobs`: Number of parallel jobs
- **Returns**: DataFrame or list of molecules
- **Example**: `df = dm.read_sdf("compounds.sdf", as_df=True)`
### `dm.read_smi(filename, smiles_column='smiles', mol_column='mol', as_df=True, ...)`
Read SMILES file (space-delimited by default).
- **Common format**: SMILES followed by molecule ID/name
- **Example**: `df = dm.read_smi("molecules.smi")`
### `dm.read_csv(filename, smiles_column='smiles', mol_column=None, ...)`
Read CSV file with optional automatic SMILES-to-molecule conversion.
- **Parameters**:
  - `smiles_column`: Column containing SMILES strings
  - `mol_column`: If specified, creates molecule objects from SMILES column
- **Example**: `df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")`
### `dm.read_excel(filename, sheet_name=0, smiles_column='smiles', mol_column=None, ...)`
Read Excel files with molecule handling.
- **Parameters**:
  - `sheet_name`: Sheet to read (index or name)
  - Other parameters similar to `read_csv`
- **Example**: `df = dm.read_excel("compounds.xlsx", sheet_name="Sheet1")`
### `dm.read_molblock(molblock, sanitize=True, remove_hs=True)`
Parse MOL block string (molecular structure text representation).
### `dm.read_mol2file(filename, sanitize=True, remove_hs=True, cleanupSubstructures=True)`
Read Mol2 format files.
### `dm.read_pdbfile(filename, sanitize=True, remove_hs=True, proximityBonding=True)`
Read Protein Data Bank (PDB) format files.
### `dm.read_pdbblock(pdbblock, sanitize=True, remove_hs=True, proximityBonding=True)`
Parse PDB block string.
### `dm.open_df(filename, ...)`
Universal DataFrame reader - automatically detects format.
- **Supported formats**: CSV, Excel, Parquet, JSON, SDF
- **Example**: `df = dm.open_df("data.csv")` or `df = dm.open_df("molecules.sdf")`
## Writing Molecular Files
### `dm.to_sdf(mols, filename, mol_column=None, ...)`
Write molecules to SDF file.
- **Input types**:
  - List of molecules
  - DataFrame with molecule column
  - Sequence of molecules
- **Parameters**:
  - `mol_column`: Column name if input is DataFrame
- **Example**:
```python
dm.to_sdf(mols, "output.sdf")
# or from DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")
```
### `dm.to_smi(mols, filename, mol_column=None, ...)`
Write molecules to SMILES file with optional validation.
- **Format**: SMILES strings with optional molecule names/IDs
### `dm.to_xlsx(df, filename, mol_columns=None, ...)`
Export DataFrame to Excel with rendered molecular images.
- **Parameters**:
  - `mol_columns`: Columns containing molecules to render as images
- **Special feature**: Automatically renders molecules as images in Excel cells
- **Example**: `dm.to_xlsx(df, "molecules.xlsx", mol_columns=["mol"])`
### `dm.to_molblock(mol, ...)`
Convert molecule to MOL block string.
### `dm.to_pdbblock(mol, ...)`
Convert molecule to PDB block string.
### `dm.save_df(df, filename, ...)`
Save DataFrame in multiple formats (CSV, Excel, Parquet, JSON).
## Remote File Support
All I/O functions support remote file paths through fsspec integration:
- **Supported protocols**: S3 (AWS), GCS (Google Cloud), Azure, HTTP/HTTPS
- **Example**:
```python
dm.read_sdf("s3://bucket/compounds.sdf")
dm.read_csv("https://example.com/data.csv")
```
## Key Parameters Across Functions
- **`sanitize`**: Apply molecule sanitization (default: True)
- **`remove_hs`**: Remove explicit hydrogens (default: True)
- **`as_df`**: Return DataFrame vs list (default: True for most functions)
- **`n_jobs`**: Number of parallel jobs (-1 = all cores, 1 = sequential)
- **`mol_column`**: Name of molecule column in DataFrames
- **`smiles_column`**: Name of SMILES column in DataFrames

@@ -0,0 +1,218 @@
# Datamol Reactions and Data Modules Reference
## Reactions Module (`datamol.reactions`)
The reactions module enables programmatic application of chemical transformations using SMARTS reaction patterns.
### Applying Chemical Reactions
#### `dm.reactions.apply_reaction(rxn, reactants, as_smiles=False, sanitize=True, single_product_group=True, rm_attach=True, product_index=0)`
Apply a chemical reaction to reactant molecules.
- **Parameters**:
  - `rxn`: Reaction object (from SMARTS pattern)
  - `reactants`: Tuple of reactant molecules
  - `as_smiles`: Return SMILES strings (True) or molecule objects (False)
  - `sanitize`: Sanitize product molecules
  - `single_product_group`: Return single product (True) or all product groups (False)
  - `rm_attach`: Remove attachment point markers
  - `product_index`: Which product to return from reaction
- **Returns**: Product molecule(s) or SMILES
- **Example**:
```python
import datamol as dm
from rdkit.Chem import rdChemReactions
# Define reaction: alcohol + carboxylic acid → ester
rxn = rdChemReactions.ReactionFromSmarts(
    '[C:1][OH:2].[C:3](=[O:4])[OH:5]>>[C:1][O:2][C:3](=[O:4])'
)
# Apply to reactants
alcohol = dm.to_mol("CCO")
acid = dm.to_mol("CC(=O)O")
product = dm.reactions.apply_reaction(rxn, (alcohol, acid))
```
### Creating Reactions
Reactions are typically created from SMARTS patterns using RDKit:
```python
from rdkit.Chem import rdChemReactions
# Reaction pattern: [reactant1].[reactant2]>>[product]
rxn = rdChemReactions.ReactionFromSmarts(
    '[1*][*:1].[1*][*:2]>>[*:1][*:2]'
)
```
### Validation Functions
The module includes functions to:
- **Check if molecule is reactant**: Verify if molecule matches reactant pattern
- **Validate reaction**: Check if reaction is synthetically reasonable
- **Process reaction files**: Load reactions from files or databases
### Common Reaction Patterns
**Amide formation**:
```python
# Amine + carboxylic acid → amide
amide_rxn = rdChemReactions.ReactionFromSmarts(
    '[N:1].[C:2](=[O:3])[OH]>>[N:1][C:2](=[O:3])'
)
```
**Suzuki coupling**:
```python
# Aryl halide + boronic acid → biaryl
suzuki_rxn = rdChemReactions.ReactionFromSmarts(
    '[c:1][Br].[c:2][B]([OH])[OH]>>[c:1]-[c:2]'  # explicit single bond between the two rings
)
```
**Functional group transformations**:
```python
# Alcohol + acyl chloride → ester
esterification = rdChemReactions.ReactionFromSmarts(
    '[C:1][OH:2].[C:3](=[O:4])[Cl]>>[C:1][O:2][C:3](=[O:4])'
)
```
### Workflow Example
```python
import datamol as dm
from rdkit.Chem import rdChemReactions
# 1. Define reaction
rxn_smarts = '[C:1](=[O:2])[OH]>>[C:1](=[O:2])Cl'  # Acid → acid chloride (hydroxyl is replaced, so it carries no atom map)
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
# 2. Apply to molecule library
acids = [dm.to_mol(smi) for smi in acid_smiles_list]
acid_chlorides = []
for acid in acids:
    try:
        product = dm.reactions.apply_reaction(
            rxn,
            (acid,),  # Single reactant as tuple
            sanitize=True
        )
        acid_chlorides.append(product)
    except Exception as e:
        print(f"Reaction failed: {e}")
# 3. Validate products
valid_products = [p for p in acid_chlorides if p is not None]
```
### Key Concepts
- **SMARTS**: SMiles ARbitrary Target Specification - pattern language for substructures; reaction SMARTS join reactant and product patterns with `>>`
- **Atom Mapping**: Numbers like [C:1] preserve atom identity through reaction
- **Attachment Points**: [1*] represents generic connection points
- **Reaction Validation**: Not all SMARTS reactions are chemically reasonable
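A quick way to see how atom mapping tracks atoms through a reaction is to compare the map numbers on each side of the `>>`. The sketch below does this with a regular expression only; it assumes the simple `reactants>>products` form without agents, and the helper names are hypothetical.

```python
import re

# Captures the number in patterns such as [C:1] or [OH:5].
MAP_NUMBER = re.compile(r":(\d+)\]")


def atom_maps(smarts_side):
    """Set of atom map numbers found on one side of a reaction SMARTS."""
    return {int(number) for number in MAP_NUMBER.findall(smarts_side)}


def compare_maps(rxn_smarts):
    """Report mapped atoms that are lost or that appear only in the products."""
    reactants, products = rxn_smarts.split(">>")
    reactant_maps, product_maps = atom_maps(reactants), atom_maps(products)
    return {
        "lost": reactant_maps - product_maps,         # mapped atoms that leave
        "product_only": product_maps - reactant_maps, # often a sign of a typo
    }


esterification = "[C:1][OH:2].[C:3](=[O:4])[OH:5]>>[C:1][O:2][C:3](=[O:4])"
report = compare_maps(esterification)
```

Here atom 5, the acid hydroxyl, is reported as lost, matching the water that leaves during esterification. A check like this catches mapping mistakes before a reaction is applied to a whole library.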
---
## Data Module (`datamol.data`)
The data module provides convenient access to curated molecular datasets for testing and learning.
### Available Datasets
#### `dm.data.cdk2(as_df=True, mol_column='mol')`
RDKit CDK2 dataset - kinase inhibitor data.
- **Parameters**:
  - `as_df`: Return as DataFrame (True) or list of molecules (False)
  - `mol_column`: Name for molecule column
- **Returns**: Dataset with molecular structures and activity data
- **Use case**: Small dataset for algorithm testing
- **Example**:
```python
cdk2_df = dm.data.cdk2(as_df=True)
print(cdk2_df.shape)
print(cdk2_df.columns)
```
#### `dm.data.freesolv()`
FreeSolv dataset - experimental and calculated hydration free energies.
- **Contents**: 642 molecules with:
  - IUPAC names
  - SMILES strings
  - Experimental hydration free energy values
  - Calculated values
- **Warning**: "Only meant to be used as a toy dataset for pedagogic and testing purposes"
- **Not suitable for**: Benchmarking or production model training
- **Example**:
```python
freesolv_df = dm.data.freesolv()
# Columns: iupac, smiles, expt (kcal/mol), calc (kcal/mol)
```
#### `dm.data.solubility(as_df=True, mol_column='mol')`
RDKit solubility dataset with train/test splits.
- **Contents**: Aqueous solubility data with pre-defined splits
- **Columns**: Includes 'split' column with 'train' or 'test' values
- **Use case**: Testing ML workflows with proper train/test separation
- **Example**:
```python
sol_df = dm.data.solubility(as_df=True)
# Split into train/test
train_df = sol_df[sol_df['split'] == 'train']
test_df = sol_df[sol_df['split'] == 'test']
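# Use for model development (dm.to_fp featurizes one molecule at a time)
import numpy as np
X_train = np.stack([dm.to_fp(mol) for mol in train_df['mol']])
y_train = train_df['SOL']  # target column name assumed; check sol_df.columns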
```
### Usage Guidelines
**For testing and tutorials**:
```python
# Quick dataset for testing code
df = dm.data.cdk2()
mols = df['mol'].tolist()
# Test descriptor calculation
descriptors_df = dm.descriptors.batch_compute_many_descriptors(mols)
# Test clustering
cluster_indices, cluster_mols = dm.cluster_mols(mols, cutoff=0.3)
```
**For learning workflows**:
```python
# Complete ML pipeline example
sol_df = dm.data.solubility()
# Preprocessing
train = sol_df[sol_df['split'] == 'train']
test = sol_df[sol_df['split'] == 'test']
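# Featurization (dm.to_fp featurizes one molecule at a time)
import numpy as np
X_train = np.stack([dm.to_fp(mol) for mol in train['mol']])
X_test = np.stack([dm.to_fp(mol) for mol in test['mol']])
# Model training (example)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, train['SOL'])  # target column name assumed; check sol_df.columns
predictions = model.predict(X_test)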
```
### Important Notes
- **Toy Datasets**: Designed for pedagogical purposes, not production use
- **Small Size**: Limited number of compounds suitable for quick tests
- **Pre-processed**: Data already cleaned and formatted
- **Citations**: Check dataset documentation for proper attribution if publishing
### Best Practices
1. **Use for development only**: Don't draw scientific conclusions from toy datasets
2. **Validate on real data**: Always test production code on actual project data
3. **Proper attribution**: Cite original data sources if using in publications
4. **Understand limitations**: Know the scope and quality of each dataset