Initial commit
This commit is contained in:
700
skills/datamol/SKILL.md
Normal file
700
skills/datamol/SKILL.md
Normal file
@@ -0,0 +1,700 @@
|
||||
---
|
||||
name: datamol
|
||||
description: "Pythonic wrapper around RDKit with simplified interface and sensible defaults. Preferred for standard drug discovery: SMILES parsing, standardization, descriptors, fingerprints, clustering, 3D conformers, parallel processing. Returns native rdkit.Chem.Mol objects. For advanced control or custom parameters, use rdkit directly."
|
||||
---
|
||||
|
||||
# Datamol Cheminformatics Skill
|
||||
|
||||
## Overview
|
||||
|
||||
Datamol is a Python library that provides a lightweight, Pythonic abstraction layer over RDKit for molecular cheminformatics. Simplify complex molecular operations with sensible defaults, efficient parallelization, and modern I/O capabilities. All molecular objects are native `rdkit.Chem.Mol` instances, ensuring full compatibility with the RDKit ecosystem.
|
||||
|
||||
**Key capabilities**:
|
||||
- Molecular format conversion (SMILES, SELFIES, InChI)
|
||||
- Structure standardization and sanitization
|
||||
- Molecular descriptors and fingerprints
|
||||
- 3D conformer generation and analysis
|
||||
- Clustering and diversity selection
|
||||
- Scaffold and fragment analysis
|
||||
- Chemical reaction application
|
||||
- Visualization and alignment
|
||||
- Batch processing with parallelization
|
||||
- Cloud storage support via fsspec
|
||||
|
||||
## Installation and Setup
|
||||
|
||||
Guide users to install datamol:
|
||||
|
||||
```bash
|
||||
uv pip install datamol
|
||||
```
|
||||
|
||||
**Import convention**:
|
||||
```python
|
||||
import datamol as dm
|
||||
```
|
||||
|
||||
## Core Workflows
|
||||
|
||||
### 1. Basic Molecule Handling
|
||||
|
||||
**Creating molecules from SMILES**:
|
||||
```python
|
||||
import datamol as dm
|
||||
|
||||
# Single molecule
|
||||
mol = dm.to_mol("CCO") # Ethanol
|
||||
|
||||
# From list of SMILES
|
||||
smiles_list = ["CCO", "c1ccccc1", "CC(=O)O"]
|
||||
mols = [dm.to_mol(smi) for smi in smiles_list]
|
||||
|
||||
# Error handling
|
||||
mol = dm.to_mol("invalid_smiles") # Returns None
|
||||
if mol is None:
|
||||
print("Failed to parse SMILES")
|
||||
```
|
||||
|
||||
**Converting molecules to SMILES**:
|
||||
```python
|
||||
# Canonical SMILES
|
||||
smiles = dm.to_smiles(mol)
|
||||
|
||||
# Isomeric SMILES (includes stereochemistry)
|
||||
smiles = dm.to_smiles(mol, isomeric=True)
|
||||
|
||||
# Other formats
|
||||
inchi = dm.to_inchi(mol)
|
||||
inchikey = dm.to_inchikey(mol)
|
||||
selfies = dm.to_selfies(mol)
|
||||
```
|
||||
|
||||
**Standardization and sanitization** (always recommend for user-provided molecules):
|
||||
```python
|
||||
# Sanitize molecule
|
||||
mol = dm.sanitize_mol(mol)
|
||||
|
||||
# Full standardization (recommended for datasets)
|
||||
mol = dm.standardize_mol(
|
||||
mol,
|
||||
disconnect_metals=True,
|
||||
normalize=True,
|
||||
reionize=True
|
||||
)
|
||||
|
||||
# For SMILES strings directly
|
||||
clean_smiles = dm.standardize_smiles(smiles)
|
||||
```
|
||||
|
||||
### 2. Reading and Writing Molecular Files
|
||||
|
||||
Refer to `references/io_module.md` for comprehensive I/O documentation.
|
||||
|
||||
**Reading files**:
|
||||
```python
|
||||
# SDF files (most common in chemistry)
|
||||
df = dm.read_sdf("compounds.sdf", mol_column='mol')
|
||||
|
||||
# SMILES files
|
||||
df = dm.read_smi("molecules.smi", smiles_column='smiles', mol_column='mol')
|
||||
|
||||
# CSV with SMILES column
|
||||
df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")
|
||||
|
||||
# Excel files
|
||||
df = dm.read_excel("compounds.xlsx", sheet_name=0, mol_column="mol")
|
||||
|
||||
# Universal reader (auto-detects format)
|
||||
df = dm.open_df("file.sdf") # Works with .sdf, .csv, .xlsx, .parquet, .json
|
||||
```
|
||||
|
||||
**Writing files**:
|
||||
```python
|
||||
# Save as SDF
|
||||
dm.to_sdf(mols, "output.sdf")
|
||||
# Or from DataFrame
|
||||
dm.to_sdf(df, "output.sdf", mol_column="mol")
|
||||
|
||||
# Save as SMILES file
|
||||
dm.to_smi(mols, "output.smi")
|
||||
|
||||
# Excel with rendered molecule images
|
||||
dm.to_xlsx(df, "output.xlsx", mol_columns=["mol"])
|
||||
```
|
||||
|
||||
**Remote file support** (S3, GCS, HTTP):
|
||||
```python
|
||||
# Read from cloud storage
|
||||
df = dm.read_sdf("s3://bucket/compounds.sdf")
|
||||
df = dm.read_csv("https://example.com/data.csv")
|
||||
|
||||
# Write to cloud storage
|
||||
dm.to_sdf(mols, "s3://bucket/output.sdf")
|
||||
```
|
||||
|
||||
### 3. Molecular Descriptors and Properties
|
||||
|
||||
Refer to `references/descriptors_viz.md` for detailed descriptor documentation.
|
||||
|
||||
**Computing descriptors for a single molecule**:
|
||||
```python
|
||||
# Get standard descriptor set
|
||||
descriptors = dm.descriptors.compute_many_descriptors(mol)
|
||||
# Returns: {'mw': 46.07, 'logp': -0.03, 'hbd': 1, 'hba': 1,
|
||||
# 'tpsa': 20.23, 'n_aromatic_atoms': 0, ...}
|
||||
```
|
||||
|
||||
**Batch descriptor computation** (recommended for datasets):
|
||||
```python
|
||||
# Compute for all molecules in parallel
|
||||
desc_df = dm.descriptors.batch_compute_many_descriptors(
|
||||
mols,
|
||||
n_jobs=-1, # Use all CPU cores
|
||||
progress=True # Show progress bar
|
||||
)
|
||||
```
|
||||
|
||||
**Specific descriptors**:
|
||||
```python
|
||||
# Aromaticity
|
||||
n_aromatic = dm.descriptors.n_aromatic_atoms(mol)
|
||||
aromatic_ratio = dm.descriptors.n_aromatic_atoms_proportion(mol)
|
||||
|
||||
# Stereochemistry
|
||||
n_stereo = dm.descriptors.n_stereo_centers(mol)
|
||||
n_unspec = dm.descriptors.n_stereo_centers_unspecified(mol)
|
||||
|
||||
# Flexibility
|
||||
n_rigid = dm.descriptors.n_rigid_bonds(mol)
|
||||
```
|
||||
|
||||
**Drug-likeness filtering (Lipinski's Rule of Five)**:
|
||||
```python
|
||||
# Filter compounds
|
||||
def is_druglike(mol):
|
||||
desc = dm.descriptors.compute_many_descriptors(mol)
|
||||
return (
|
||||
desc['mw'] <= 500 and
|
||||
desc['logp'] <= 5 and
|
||||
desc['hbd'] <= 5 and
|
||||
desc['hba'] <= 10
|
||||
)
|
||||
|
||||
druglike_mols = [mol for mol in mols if is_druglike(mol)]
|
||||
```
|
||||
|
||||
### 4. Molecular Fingerprints and Similarity
|
||||
|
||||
**Generating fingerprints**:
|
||||
```python
|
||||
# ECFP (Extended Connectivity Fingerprint, default)
|
||||
fp = dm.to_fp(mol, fp_type='ecfp', radius=2, n_bits=2048)
|
||||
|
||||
# Other fingerprint types
|
||||
fp_maccs = dm.to_fp(mol, fp_type='maccs')
|
||||
fp_topological = dm.to_fp(mol, fp_type='topological')
|
||||
fp_atompair = dm.to_fp(mol, fp_type='atompair')
|
||||
```
|
||||
|
||||
**Similarity calculations**:
|
||||
```python
|
||||
# Pairwise distances within a set
|
||||
distance_matrix = dm.pdist(mols, n_jobs=-1)
|
||||
|
||||
# Distances between two sets
|
||||
distances = dm.cdist(query_mols, library_mols, n_jobs=-1)
|
||||
|
||||
# Find most similar molecules
|
||||
from scipy.spatial.distance import squareform
|
||||
dist_matrix = squareform(dm.pdist(mols))
|
||||
# Lower distance = higher similarity (Tanimoto distance = 1 - Tanimoto similarity)
|
||||
```
|
||||
|
||||
### 5. Clustering and Diversity Selection
|
||||
|
||||
Refer to `references/core_api.md` for clustering details.
|
||||
|
||||
**Butina clustering**:
|
||||
```python
|
||||
# Cluster molecules by structural similarity
|
||||
clusters = dm.cluster_mols(
|
||||
mols,
|
||||
cutoff=0.2, # Tanimoto distance threshold (0=identical, 1=completely different)
|
||||
n_jobs=-1 # Parallel processing
|
||||
)
|
||||
|
||||
# Each cluster is a list of molecule indices
|
||||
for i, cluster in enumerate(clusters):
|
||||
print(f"Cluster {i}: {len(cluster)} molecules")
|
||||
cluster_mols = [mols[idx] for idx in cluster]
|
||||
```
|
||||
|
||||
**Important**: Butina clustering builds a full distance matrix - suitable for ~1000 molecules, not for 10,000+.
|
||||
|
||||
**Diversity selection**:
|
||||
```python
|
||||
# Pick diverse subset
|
||||
diverse_mols = dm.pick_diverse(
|
||||
mols,
|
||||
npick=100 # Select 100 diverse molecules
|
||||
)
|
||||
|
||||
# Pick cluster centroids
|
||||
centroids = dm.pick_centroids(
|
||||
mols,
|
||||
npick=50 # Select 50 representative molecules
|
||||
)
|
||||
```
|
||||
|
||||
### 6. Scaffold Analysis
|
||||
|
||||
Refer to `references/fragments_scaffolds.md` for complete scaffold documentation.
|
||||
|
||||
**Extracting Murcko scaffolds**:
|
||||
```python
|
||||
# Get Bemis-Murcko scaffold (core structure)
|
||||
scaffold = dm.to_scaffold_murcko(mol)
|
||||
scaffold_smiles = dm.to_smiles(scaffold)
|
||||
```
|
||||
|
||||
**Scaffold-based analysis**:
|
||||
```python
|
||||
# Group compounds by scaffold
|
||||
from collections import Counter
|
||||
|
||||
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
|
||||
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
|
||||
|
||||
# Count scaffold frequency
|
||||
scaffold_counts = Counter(scaffold_smiles)
|
||||
most_common = scaffold_counts.most_common(10)
|
||||
|
||||
# Create scaffold-to-molecules mapping
|
||||
scaffold_groups = {}
|
||||
for mol, scaf_smi in zip(mols, scaffold_smiles):
|
||||
if scaf_smi not in scaffold_groups:
|
||||
scaffold_groups[scaf_smi] = []
|
||||
scaffold_groups[scaf_smi].append(mol)
|
||||
```
|
||||
|
||||
**Scaffold-based train/test splitting** (for ML):
|
||||
```python
|
||||
# Ensure train and test sets have different scaffolds
|
||||
scaffold_to_mols = {}
|
||||
for mol, scaf in zip(mols, scaffold_smiles):
|
||||
if scaf not in scaffold_to_mols:
|
||||
scaffold_to_mols[scaf] = []
|
||||
scaffold_to_mols[scaf].append(mol)
|
||||
|
||||
# Split scaffolds into train/test
|
||||
import random
|
||||
scaffolds = list(scaffold_to_mols.keys())
|
||||
random.shuffle(scaffolds)
|
||||
split_idx = int(0.8 * len(scaffolds))
|
||||
train_scaffolds = scaffolds[:split_idx]
|
||||
test_scaffolds = scaffolds[split_idx:]
|
||||
|
||||
# Get molecules for each split
|
||||
train_mols = [mol for scaf in train_scaffolds for mol in scaffold_to_mols[scaf]]
|
||||
test_mols = [mol for scaf in test_scaffolds for mol in scaffold_to_mols[scaf]]
|
||||
```
|
||||
|
||||
### 7. Molecular Fragmentation
|
||||
|
||||
Refer to `references/fragments_scaffolds.md` for fragmentation details.
|
||||
|
||||
**BRICS fragmentation** (16 bond types):
|
||||
```python
|
||||
# Fragment molecule
|
||||
fragments = dm.fragment.brics(mol)
|
||||
# Returns: set of fragment SMILES with attachment points like '[1*]CCN'
|
||||
```
|
||||
|
||||
**RECAP fragmentation** (11 bond types):
|
||||
```python
|
||||
fragments = dm.fragment.recap(mol)
|
||||
```
|
||||
|
||||
**Fragment analysis**:
|
||||
```python
|
||||
# Find common fragments across compound library
|
||||
from collections import Counter
|
||||
|
||||
all_fragments = []
|
||||
for mol in mols:
|
||||
frags = dm.fragment.brics(mol)
|
||||
all_fragments.extend(frags)
|
||||
|
||||
fragment_counts = Counter(all_fragments)
|
||||
common_frags = fragment_counts.most_common(20)
|
||||
|
||||
# Fragment-based scoring
|
||||
def fragment_score(mol, reference_fragments):
|
||||
mol_frags = dm.fragment.brics(mol)
|
||||
overlap = mol_frags.intersection(reference_fragments)
|
||||
return len(overlap) / len(mol_frags) if mol_frags else 0
|
||||
```
|
||||
|
||||
### 8. 3D Conformer Generation
|
||||
|
||||
Refer to `references/conformers_module.md` for detailed conformer documentation.
|
||||
|
||||
**Generating conformers**:
|
||||
```python
|
||||
# Generate 3D conformers
|
||||
mol_3d = dm.conformers.generate(
|
||||
mol,
|
||||
n_confs=50, # Number to generate (auto if None)
|
||||
rms_cutoff=0.5, # Filter similar conformers (Ångströms)
|
||||
minimize_energy=True, # Minimize with UFF force field
|
||||
method='ETKDGv3' # Embedding method (recommended)
|
||||
)
|
||||
|
||||
# Access conformers
|
||||
n_conformers = mol_3d.GetNumConformers()
|
||||
conf = mol_3d.GetConformer(0) # Get first conformer
|
||||
positions = conf.GetPositions() # Nx3 array of atom coordinates
|
||||
```
|
||||
|
||||
**Conformer clustering**:
|
||||
```python
|
||||
# Cluster conformers by RMSD
|
||||
clusters = dm.conformers.cluster(
|
||||
mol_3d,
|
||||
rms_cutoff=1.0,
|
||||
centroids=False
|
||||
)
|
||||
|
||||
# Get representative conformers
|
||||
centroids = dm.conformers.return_centroids(mol_3d, clusters)
|
||||
```
|
||||
|
||||
**SASA calculation**:
|
||||
```python
|
||||
# Calculate solvent accessible surface area
|
||||
sasa_values = dm.conformers.sasa(mol_3d, n_jobs=-1)
|
||||
|
||||
# Access SASA from conformer properties
|
||||
conf = mol_3d.GetConformer(0)
|
||||
sasa = conf.GetDoubleProp('rdkit_free_sasa')
|
||||
```
|
||||
|
||||
### 9. Visualization
|
||||
|
||||
Refer to `references/descriptors_viz.md` for visualization documentation.
|
||||
|
||||
**Basic molecule grid**:
|
||||
```python
|
||||
# Visualize molecules
|
||||
dm.viz.to_image(
|
||||
mols[:20],
|
||||
legends=[dm.to_smiles(m) for m in mols[:20]],
|
||||
n_cols=5,
|
||||
mol_size=(300, 300)
|
||||
)
|
||||
|
||||
# Save to file
|
||||
dm.viz.to_image(mols, outfile="molecules.png")
|
||||
|
||||
# SVG for publications
|
||||
dm.viz.to_image(mols, outfile="molecules.svg", use_svg=True)
|
||||
```
|
||||
|
||||
**Aligned visualization** (for SAR analysis):
|
||||
```python
|
||||
# Align molecules by common substructure
|
||||
dm.viz.to_image(
|
||||
similar_mols,
|
||||
align=True, # Enable MCS alignment
|
||||
legends=activity_labels,
|
||||
n_cols=4
|
||||
)
|
||||
```
|
||||
|
||||
**Highlighting substructures**:
|
||||
```python
|
||||
# Highlight specific atoms and bonds
|
||||
dm.viz.to_image(
|
||||
mol,
|
||||
highlight_atom=[0, 1, 2, 3], # Atom indices
|
||||
highlight_bond=[0, 1, 2] # Bond indices
|
||||
)
|
||||
```
|
||||
|
||||
**Conformer visualization**:
|
||||
```python
|
||||
# Display multiple conformers
|
||||
dm.viz.conformers(
|
||||
mol_3d,
|
||||
n_confs=10,
|
||||
align_conf=True,
|
||||
n_cols=3
|
||||
)
|
||||
```
|
||||
|
||||
### 10. Chemical Reactions
|
||||
|
||||
Refer to `references/reactions_data.md` for reactions documentation.
|
||||
|
||||
**Applying reactions**:
|
||||
```python
|
||||
from rdkit.Chem import rdChemReactions
|
||||
|
||||
# Define reaction from SMARTS
|
||||
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]'
|
||||
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
|
||||
|
||||
# Apply to molecule
|
||||
reactant = dm.to_mol("CC(=O)O") # Acetic acid
|
||||
product = dm.reactions.apply_reaction(
|
||||
rxn,
|
||||
(reactant,),
|
||||
sanitize=True
|
||||
)
|
||||
|
||||
# Convert to SMILES
|
||||
product_smiles = dm.to_smiles(product)
|
||||
```
|
||||
|
||||
**Batch reaction application**:
|
||||
```python
|
||||
# Apply reaction to library
|
||||
products = []
|
||||
for mol in reactant_mols:
|
||||
try:
|
||||
prod = dm.reactions.apply_reaction(rxn, (mol,))
|
||||
if prod is not None:
|
||||
products.append(prod)
|
||||
except Exception as e:
|
||||
print(f"Reaction failed: {e}")
|
||||
```
|
||||
|
||||
## Parallelization
|
||||
|
||||
Datamol includes built-in parallelization for many operations. Use `n_jobs` parameter:
|
||||
- `n_jobs=1`: Sequential (no parallelization)
|
||||
- `n_jobs=-1`: Use all available CPU cores
|
||||
- `n_jobs=4`: Use 4 cores
|
||||
|
||||
**Functions supporting parallelization**:
|
||||
- `dm.read_sdf(..., n_jobs=-1)`
|
||||
- `dm.descriptors.batch_compute_many_descriptors(..., n_jobs=-1)`
|
||||
- `dm.cluster_mols(..., n_jobs=-1)`
|
||||
- `dm.pdist(..., n_jobs=-1)`
|
||||
- `dm.conformers.sasa(..., n_jobs=-1)`
|
||||
|
||||
**Progress bars**: Many batch operations support `progress=True` parameter.
|
||||
|
||||
## Common Workflows and Patterns
|
||||
|
||||
### Complete Pipeline: Data Loading → Filtering → Analysis
|
||||
|
||||
```python
|
||||
import datamol as dm
|
||||
import pandas as pd
|
||||
|
||||
# 1. Load molecules
|
||||
df = dm.read_sdf("compounds.sdf")
|
||||
|
||||
# 2. Standardize
|
||||
df['mol'] = df['mol'].apply(lambda m: dm.standardize_mol(m) if m else None)
|
||||
df = df[df['mol'].notna()] # Remove failed molecules
|
||||
|
||||
# 3. Compute descriptors
|
||||
desc_df = dm.descriptors.batch_compute_many_descriptors(
|
||||
df['mol'].tolist(),
|
||||
n_jobs=-1,
|
||||
progress=True
|
||||
)
|
||||
|
||||
# 4. Filter by drug-likeness
|
||||
druglike = (
|
||||
(desc_df['mw'] <= 500) &
|
||||
(desc_df['logp'] <= 5) &
|
||||
(desc_df['hbd'] <= 5) &
|
||||
(desc_df['hba'] <= 10)
|
||||
)
|
||||
filtered_df = df[druglike]
|
||||
|
||||
# 5. Cluster and select diverse subset
|
||||
diverse_mols = dm.pick_diverse(
|
||||
filtered_df['mol'].tolist(),
|
||||
npick=100
|
||||
)
|
||||
|
||||
# 6. Visualize results
|
||||
dm.viz.to_image(
|
||||
diverse_mols,
|
||||
legends=[dm.to_smiles(m) for m in diverse_mols],
|
||||
outfile="diverse_compounds.png",
|
||||
n_cols=10
|
||||
)
|
||||
```
|
||||
|
||||
### Structure-Activity Relationship (SAR) Analysis
|
||||
|
||||
```python
|
||||
# Group by scaffold
|
||||
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
|
||||
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
|
||||
|
||||
# Create DataFrame with activities
|
||||
sar_df = pd.DataFrame({
|
||||
'mol': mols,
|
||||
'scaffold': scaffold_smiles,
|
||||
'activity': activities # User-provided activity data
|
||||
})
|
||||
|
||||
# Analyze each scaffold series
|
||||
for scaffold, group in sar_df.groupby('scaffold'):
|
||||
if len(group) >= 3: # Need multiple examples
|
||||
print(f"\nScaffold: {scaffold}")
|
||||
print(f"Count: {len(group)}")
|
||||
print(f"Activity range: {group['activity'].min():.2f} - {group['activity'].max():.2f}")
|
||||
|
||||
# Visualize with activities as legends
|
||||
dm.viz.to_image(
|
||||
group['mol'].tolist(),
|
||||
legends=[f"Activity: {act:.2f}" for act in group['activity']],
|
||||
align=True # Align by common substructure
|
||||
)
|
||||
```
|
||||
|
||||
### Virtual Screening Pipeline
|
||||
|
||||
```python
|
||||
# 1. Generate fingerprints for query and library
|
||||
query_fps = [dm.to_fp(mol) for mol in query_actives]
|
||||
library_fps = [dm.to_fp(mol) for mol in library_mols]
|
||||
|
||||
# 2. Calculate similarities
|
||||
from scipy.spatial.distance import cdist
|
||||
import numpy as np
|
||||
|
||||
distances = dm.cdist(query_actives, library_mols, n_jobs=-1)
|
||||
|
||||
# 3. Find closest matches (min distance to any query)
|
||||
min_distances = distances.min(axis=0)
|
||||
similarities = 1 - min_distances # Convert distance to similarity
|
||||
|
||||
# 4. Rank and select top hits
|
||||
top_indices = np.argsort(similarities)[::-1][:100] # Top 100
|
||||
top_hits = [library_mols[i] for i in top_indices]
|
||||
top_scores = [similarities[i] for i in top_indices]
|
||||
|
||||
# 5. Visualize hits
|
||||
dm.viz.to_image(
|
||||
top_hits[:20],
|
||||
legends=[f"Sim: {score:.3f}" for score in top_scores[:20]],
|
||||
outfile="screening_hits.png"
|
||||
)
|
||||
```
|
||||
|
||||
## Reference Documentation
|
||||
|
||||
For detailed API documentation, consult these reference files:
|
||||
|
||||
- **`references/core_api.md`**: Core namespace functions (conversions, standardization, fingerprints, clustering)
|
||||
- **`references/io_module.md`**: File I/O operations (read/write SDF, CSV, Excel, remote files)
|
||||
- **`references/conformers_module.md`**: 3D conformer generation, clustering, SASA calculations
|
||||
- **`references/descriptors_viz.md`**: Molecular descriptors and visualization functions
|
||||
- **`references/fragments_scaffolds.md`**: Scaffold extraction, BRICS/RECAP fragmentation
|
||||
- **`references/reactions_data.md`**: Chemical reactions and toy datasets
|
||||
|
||||
## Best Practices
|
||||
|
||||
1. **Always standardize molecules** from external sources:
|
||||
```python
|
||||
mol = dm.standardize_mol(mol, disconnect_metals=True, normalize=True, reionize=True)
|
||||
```
|
||||
|
||||
2. **Check for None values** after molecule parsing:
|
||||
```python
|
||||
mol = dm.to_mol(smiles)
|
||||
if mol is None:
|
||||
# Handle invalid SMILES
|
||||
```
|
||||
|
||||
3. **Use parallel processing** for large datasets:
|
||||
```python
|
||||
result = dm.operation(..., n_jobs=-1, progress=True)
|
||||
```
|
||||
|
||||
4. **Leverage fsspec** for cloud storage:
|
||||
```python
|
||||
df = dm.read_sdf("s3://bucket/compounds.sdf")
|
||||
```
|
||||
|
||||
5. **Use appropriate fingerprints** for similarity:
|
||||
- ECFP (Morgan): General purpose, structural similarity
|
||||
- MACCS: Fast, smaller feature space
|
||||
- Atom pairs: Considers atom pairs and distances
|
||||
|
||||
6. **Consider scale limitations**:
|
||||
- Butina clustering: ~1,000 molecules (full distance matrix)
|
||||
- For larger datasets: Use diversity selection or hierarchical methods
|
||||
|
||||
7. **Scaffold splitting for ML**: Ensure proper train/test separation by scaffold
|
||||
|
||||
8. **Align molecules** when visualizing SAR series
|
||||
|
||||
## Error Handling
|
||||
|
||||
```python
|
||||
# Safe molecule creation
|
||||
def safe_to_mol(smiles):
|
||||
try:
|
||||
mol = dm.to_mol(smiles)
|
||||
if mol is not None:
|
||||
mol = dm.standardize_mol(mol)
|
||||
return mol
|
||||
except Exception as e:
|
||||
print(f"Failed to process {smiles}: {e}")
|
||||
return None
|
||||
|
||||
# Safe batch processing
|
||||
valid_mols = []
|
||||
for smiles in smiles_list:
|
||||
mol = safe_to_mol(smiles)
|
||||
if mol is not None:
|
||||
valid_mols.append(mol)
|
||||
```
|
||||
|
||||
## Integration with Machine Learning
|
||||
|
||||
```python
|
||||
# Feature generation
|
||||
X = np.array([dm.to_fp(mol) for mol in mols])
|
||||
|
||||
# Or descriptors
|
||||
desc_df = dm.descriptors.batch_compute_many_descriptors(mols, n_jobs=-1)
|
||||
X = desc_df.values
|
||||
|
||||
# Train model
|
||||
from sklearn.ensemble import RandomForestRegressor
|
||||
model = RandomForestRegressor()
|
||||
model.fit(X, y_target)
|
||||
|
||||
# Predict
|
||||
predictions = model.predict(X_test)
|
||||
```
|
||||
|
||||
## Troubleshooting
|
||||
|
||||
**Issue**: Molecule parsing fails
|
||||
- **Solution**: Use `dm.standardize_smiles()` first or try `dm.fix_mol()`
|
||||
|
||||
**Issue**: Memory errors with clustering
|
||||
- **Solution**: Use `dm.pick_diverse()` instead of full clustering for large sets
|
||||
|
||||
**Issue**: Slow conformer generation
|
||||
- **Solution**: Reduce `n_confs` or increase `rms_cutoff` to generate fewer conformers
|
||||
|
||||
**Issue**: Remote file access fails
|
||||
- **Solution**: Ensure fsspec and appropriate cloud provider libraries are installed (s3fs, gcsfs, etc.)
|
||||
|
||||
## Additional Resources
|
||||
|
||||
- **Datamol Documentation**: https://docs.datamol.io/
|
||||
- **RDKit Documentation**: https://www.rdkit.org/docs/
|
||||
- **GitHub Repository**: https://github.com/datamol-io/datamol
|
||||
131
skills/datamol/references/conformers_module.md
Normal file
131
skills/datamol/references/conformers_module.md
Normal file
@@ -0,0 +1,131 @@
|
||||
# Datamol Conformers Module Reference
|
||||
|
||||
The `datamol.conformers` module provides tools for generating and analyzing 3D molecular conformations.
|
||||
|
||||
## Conformer Generation
|
||||
|
||||
### `dm.conformers.generate(mol, n_confs=None, rms_cutoff=None, minimize_energy=True, method='ETKDGv3', add_hs=True, ...)`
|
||||
Generate 3D molecular conformers.
|
||||
- **Parameters**:
|
||||
- `mol`: Input molecule
|
||||
- `n_confs`: Number of conformers to generate (auto-determined based on rotatable bonds if None)
|
||||
- `rms_cutoff`: RMS threshold in Ångströms for filtering similar conformers (removes duplicates)
|
||||
- `minimize_energy`: Apply UFF energy minimization (default: True)
|
||||
- `method`: Embedding method - options:
|
||||
- `'ETDG'` - Experimental Torsion Distance Geometry
|
||||
- `'ETKDG'` - ETDG with additional basic knowledge
|
||||
- `'ETKDGv2'` - Enhanced version 2
|
||||
- `'ETKDGv3'` - Enhanced version 3 (default, recommended)
|
||||
- `add_hs`: Add hydrogens before embedding (default: True, critical for quality)
|
||||
- `random_seed`: Set for reproducibility
|
||||
- **Returns**: Molecule with embedded conformers
|
||||
- **Example**:
|
||||
```python
|
||||
mol = dm.to_mol("CCO")
|
||||
mol_3d = dm.conformers.generate(mol, n_confs=10, rms_cutoff=0.5)
|
||||
conformers = mol_3d.GetConformers() # Access all conformers
|
||||
```
|
||||
|
||||
## Conformer Clustering
|
||||
|
||||
### `dm.conformers.cluster(mol, rms_cutoff=1.0, already_aligned=False, centroids=False)`
|
||||
Group conformers by RMS distance.
|
||||
- **Parameters**:
|
||||
- `rms_cutoff`: Clustering threshold in Ångströms (default: 1.0)
|
||||
- `already_aligned`: Whether conformers are pre-aligned
|
||||
- `centroids`: Return centroid conformers (True) or cluster groups (False)
|
||||
- **Returns**: Cluster information or centroid conformers
|
||||
- **Use case**: Identify distinct conformational families
|
||||
|
||||
### `dm.conformers.return_centroids(mol, conf_clusters, centroids=True)`
|
||||
Extract representative conformers from clusters.
|
||||
- **Parameters**:
|
||||
- `conf_clusters`: Sequence of cluster indices from `cluster()`
|
||||
- `centroids`: Return single molecule (True) or list of molecules (False)
|
||||
- **Returns**: Centroid conformer(s)
|
||||
|
||||
## Conformer Analysis
|
||||
|
||||
### `dm.conformers.rmsd(mol)`
|
||||
Calculate pairwise RMSD matrix across all conformers.
|
||||
- **Requirements**: Minimum 2 conformers
|
||||
- **Returns**: NxN matrix of RMSD values
|
||||
- **Use case**: Quantify conformer diversity
|
||||
|
||||
### `dm.conformers.sasa(mol, n_jobs=1, ...)`
|
||||
Calculate Solvent Accessible Surface Area (SASA) using FreeSASA.
|
||||
- **Parameters**:
|
||||
- `n_jobs`: Parallelization for multiple conformers
|
||||
- **Returns**: Array of SASA values (one per conformer)
|
||||
- **Storage**: Values stored in each conformer as property `'rdkit_free_sasa'`
|
||||
- **Example**:
|
||||
```python
|
||||
sasa_values = dm.conformers.sasa(mol_3d)
|
||||
# Or access from conformer properties
|
||||
conf = mol_3d.GetConformer(0)
|
||||
sasa = conf.GetDoubleProp('rdkit_free_sasa')
|
||||
```
|
||||
|
||||
## Low-Level Conformer Manipulation
|
||||
|
||||
### `dm.conformers.center_of_mass(mol, conf_id=-1, use_atoms=True, round_coord=None)`
|
||||
Calculate molecular center.
|
||||
- **Parameters**:
|
||||
- `conf_id`: Conformer index (-1 for first conformer)
|
||||
- `use_atoms`: Use atomic masses (True) or geometric center (False)
|
||||
- `round_coord`: Decimal precision for rounding
|
||||
- **Returns**: 3D coordinates of center
|
||||
- **Use case**: Centering molecules for visualization or alignment
|
||||
|
||||
### `dm.conformers.get_coords(mol, conf_id=-1)`
|
||||
Retrieve atomic coordinates from a conformer.
|
||||
- **Returns**: Nx3 numpy array of atomic positions
|
||||
- **Example**:
|
||||
```python
|
||||
positions = dm.conformers.get_coords(mol_3d, conf_id=0)
|
||||
# positions.shape: (num_atoms, 3)
|
||||
```
|
||||
|
||||
### `dm.conformers.translate(mol, conf_id=-1, transform_matrix=None)`
|
||||
Reposition conformer using transformation matrix.
|
||||
- **Modification**: Operates in-place
|
||||
- **Use case**: Aligning or repositioning molecules
|
||||
|
||||
## Workflow Example
|
||||
|
||||
```python
|
||||
import datamol as dm
|
||||
|
||||
# 1. Create molecule and generate conformers
|
||||
mol = dm.to_mol("CC(C)CCO") # Isopentanol
|
||||
mol_3d = dm.conformers.generate(
|
||||
mol,
|
||||
n_confs=50, # Generate 50 initial conformers
|
||||
rms_cutoff=0.5, # Filter similar conformers
|
||||
minimize_energy=True # Minimize energy
|
||||
)
|
||||
|
||||
# 2. Analyze conformers
|
||||
n_conformers = mol_3d.GetNumConformers()
|
||||
print(f"Generated {n_conformers} unique conformers")
|
||||
|
||||
# 3. Calculate SASA
|
||||
sasa_values = dm.conformers.sasa(mol_3d)
|
||||
|
||||
# 4. Cluster conformers
|
||||
clusters = dm.conformers.cluster(mol_3d, rms_cutoff=1.0, centroids=False)
|
||||
|
||||
# 5. Get representative conformers
|
||||
centroids = dm.conformers.return_centroids(mol_3d, clusters)
|
||||
|
||||
# 6. Access 3D coordinates
|
||||
coords = dm.conformers.get_coords(mol_3d, conf_id=0)
|
||||
```
|
||||
|
||||
## Key Concepts
|
||||
|
||||
- **Distance Geometry**: Method for generating 3D structures from connectivity information
|
||||
- **ETKDG**: Uses experimental torsion angle preferences and additional chemical knowledge
|
||||
- **RMS Cutoff**: Lower values = more unique conformers; higher values = fewer, more distinct conformers
|
||||
- **Energy Minimization**: Relaxes structures to nearest local energy minimum
|
||||
- **Hydrogens**: Critical for accurate 3D geometry - always include during embedding
|
||||
130
skills/datamol/references/core_api.md
Normal file
130
skills/datamol/references/core_api.md
Normal file
@@ -0,0 +1,130 @@
|
||||
# Datamol Core API Reference
|
||||
|
||||
This document covers the main functions available in the datamol namespace.
|
||||
|
||||
## Molecule Creation and Conversion
|
||||
|
||||
### `to_mol(mol, ...)`
|
||||
Convert SMILES string or other molecular representations to RDKit molecule objects.
|
||||
- **Parameters**: Accepts SMILES strings, InChI, or other molecular formats
|
||||
- **Returns**: `rdkit.Chem.Mol` object
|
||||
- **Common usage**: `mol = dm.to_mol("CCO")`
|
||||
|
||||
### `from_inchi(inchi)`
|
||||
Convert InChI string to molecule object.
|
||||
|
||||
### `from_smarts(smarts)`
|
||||
Convert SMARTS pattern to molecule object.
|
||||
|
||||
### `from_selfies(selfies)`
|
||||
Convert SELFIES string to molecule object.
|
||||
|
||||
### `copy_mol(mol)`
|
||||
Create a copy of a molecule object to avoid modifying the original.
|
||||
|
||||
## Molecule Export
|
||||
|
||||
### `to_smiles(mol, ...)`
|
||||
Convert molecule object to SMILES string.
|
||||
- **Common parameters**: `canonical=True`, `isomeric=True`
|
||||
|
||||
### `to_inchi(mol, ...)`
|
||||
Convert molecule to InChI string representation.
|
||||
|
||||
### `to_inchikey(mol)`
|
||||
Convert molecule to InChI key (fixed-length hash).
|
||||
|
||||
### `to_smarts(mol)`
|
||||
Convert molecule to SMARTS pattern.
|
||||
|
||||
### `to_selfies(mol)`
|
||||
Convert molecule to SELFIES (Self-Referencing Embedded Strings) format.
|
||||
|
||||
## Sanitization and Standardization
|
||||
|
||||
### `sanitize_mol(mol, ...)`
|
||||
Enhanced version of RDKit's sanitize operation using mol→SMILES→mol conversion and aromatic nitrogen fixing.
|
||||
- **Purpose**: Fix common molecular structure issues
|
||||
- **Returns**: Sanitized molecule or None if sanitization fails
|
||||
|
||||
### `standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True, ...)`
|
||||
Apply comprehensive standardization procedures including:
|
||||
- Metal disconnection
|
||||
- Normalization (charge corrections)
|
||||
- Reionization
|
||||
- Fragment handling (largest fragment selection)
|
||||
|
||||
### `standardize_smiles(smiles, ...)`
|
||||
Apply SMILES standardization procedures directly to a SMILES string.
|
||||
|
||||
### `fix_mol(mol)`
|
||||
Attempt to fix molecular structure issues automatically.
|
||||
|
||||
### `fix_valence(mol)`
|
||||
Correct valence errors in molecular structures.
|
||||
|
||||
## Molecular Properties
|
||||
|
||||
### `reorder_atoms(mol, ...)`
|
||||
Ensure consistent atom ordering for the same molecule regardless of original SMILES representation.
|
||||
- **Purpose**: Maintain reproducible feature generation
|
||||
|
||||
### `remove_hs(mol, ...)`
|
||||
Remove hydrogen atoms from molecular structure.
|
||||
|
||||
### `add_hs(mol, ...)`
|
||||
Add explicit hydrogen atoms to molecular structure.
|
||||
|
||||
## Fingerprints and Similarity
|
||||
|
||||
### `to_fp(mol, fp_type='ecfp', ...)`
|
||||
Generate molecular fingerprints for similarity calculations.
|
||||
- **Fingerprint types**:
|
||||
- `'ecfp'` - Extended Connectivity Fingerprints (Morgan)
|
||||
- `'fcfp'` - Functional Connectivity Fingerprints
|
||||
- `'maccs'` - MACCS keys
|
||||
- `'topological'` - Topological fingerprints
|
||||
- `'atompair'` - Atom pair fingerprints
|
||||
- **Common parameters**: `n_bits`, `radius`
|
||||
- **Returns**: Numpy array or RDKit fingerprint object
|
||||
|
||||
### `pdist(mols, ...)`
|
||||
Calculate pairwise Tanimoto distances between all molecules in a list.
|
||||
- **Supports**: Parallel processing via `n_jobs` parameter
|
||||
- **Returns**: Distance matrix
|
||||
|
||||
### `cdist(mols1, mols2, ...)`
|
||||
Calculate Tanimoto distances between two sets of molecules.
|
||||
|
||||
## Clustering and Diversity
|
||||
|
||||
### `cluster_mols(mols, cutoff=0.2, feature_fn=None, n_jobs=1)`
|
||||
Cluster molecules using Butina clustering algorithm.
|
||||
- **Parameters**:
|
||||
- `cutoff`: Distance threshold (default 0.2)
|
||||
- `feature_fn`: Custom function for molecular features
|
||||
- `n_jobs`: Parallelization (-1 for all cores)
|
||||
- **Important**: Builds full distance matrix - suitable for ~1000 structures, not for 10,000+
|
||||
- **Returns**: List of clusters (each cluster is a list of molecule indices)
|
||||
|
||||
### `pick_diverse(mols, npick, ...)`
|
||||
Select diverse subset of molecules based on fingerprint diversity.
|
||||
|
||||
### `pick_centroids(mols, npick, ...)`
|
||||
Select centroid molecules representing clusters.
|
||||
|
||||
## Graph Operations
|
||||
|
||||
### `to_graph(mol)`
|
||||
Convert molecule to graph representation for graph-based analysis.
|
||||
|
||||
### `get_all_path_between(mol, start, end)`
|
||||
Find all paths between two atoms in molecular structure.
|
||||
|
||||
## DataFrame Integration
|
||||
|
||||
### `to_df(mols, smiles_column='smiles', mol_column='mol')`
|
||||
Convert list of molecules to pandas DataFrame.
|
||||
|
||||
### `from_df(df, smiles_column='smiles', mol_column='mol')`
|
||||
Convert pandas DataFrame to list of molecules.
|
||||
195
skills/datamol/references/descriptors_viz.md
Normal file
195
skills/datamol/references/descriptors_viz.md
Normal file
@@ -0,0 +1,195 @@
|
||||
# Datamol Descriptors and Visualization Reference
|
||||
|
||||
## Descriptors Module (`datamol.descriptors`)
|
||||
|
||||
The descriptors module provides tools for computing molecular properties and descriptors.
|
||||
|
||||
### Specialized Descriptor Functions
|
||||
|
||||
#### `dm.descriptors.n_aromatic_atoms(mol)`
|
||||
Calculate the number of aromatic atoms.
|
||||
- **Returns**: Integer count
|
||||
- **Use case**: Aromaticity analysis
|
||||
|
||||
#### `dm.descriptors.n_aromatic_atoms_proportion(mol)`
|
||||
Calculate ratio of aromatic atoms to total heavy atoms.
|
||||
- **Returns**: Float between 0 and 1
|
||||
- **Use case**: Quantifying aromatic character
|
||||
|
||||
#### `dm.descriptors.n_charged_atoms(mol)`
|
||||
Count atoms with nonzero formal charge.
|
||||
- **Returns**: Integer count
|
||||
- **Use case**: Charge distribution analysis
|
||||
|
||||
#### `dm.descriptors.n_rigid_bonds(mol)`
|
||||
Count non-rotatable bonds (neither single bonds nor ring bonds).
|
||||
- **Returns**: Integer count
|
||||
- **Use case**: Molecular flexibility assessment
|
||||
|
||||
#### `dm.descriptors.n_stereo_centers(mol)`
|
||||
Count stereogenic centers (chiral centers).
|
||||
- **Returns**: Integer count
|
||||
- **Use case**: Stereochemistry analysis
|
||||
|
||||
#### `dm.descriptors.n_stereo_centers_unspecified(mol)`
|
||||
Count stereocenters lacking stereochemical specification.
|
||||
- **Returns**: Integer count
|
||||
- **Use case**: Identifying incomplete stereochemistry
|
||||
|
||||
### Batch Descriptor Computation
|
||||
|
||||
#### `dm.descriptors.compute_many_descriptors(mol, properties_fn=None, add_properties=True)`
|
||||
Compute multiple molecular properties for a single molecule.
|
||||
- **Parameters**:
|
||||
- `properties_fn`: Custom list of descriptor functions
|
||||
- `add_properties`: Include additional computed properties
|
||||
- **Returns**: Dictionary of descriptor name → value pairs
|
||||
- **Default descriptors include**:
|
||||
- Molecular weight, LogP, number of H-bond donors/acceptors
|
||||
- Aromatic atoms, stereocenters, rotatable bonds
|
||||
- TPSA (Topological Polar Surface Area)
|
||||
- Ring count, heteroatom count
|
||||
- **Example**:
|
||||
```python
|
||||
mol = dm.to_mol("CCO")
|
||||
descriptors = dm.descriptors.compute_many_descriptors(mol)
|
||||
# Returns: {'mw': 46.07, 'logp': -0.03, 'hbd': 1, 'hba': 1, ...}
|
||||
```
|
||||
|
||||
#### `dm.descriptors.batch_compute_many_descriptors(mols, properties_fn=None, add_properties=True, n_jobs=1, batch_size=None, progress=False)`
|
||||
Compute descriptors for multiple molecules in parallel.
|
||||
- **Parameters**:
|
||||
- `mols`: List of molecules
|
||||
- `n_jobs`: Number of parallel jobs (-1 for all cores)
|
||||
- `batch_size`: Chunk size for parallel processing
|
||||
- `progress`: Show progress bar
|
||||
- **Returns**: Pandas DataFrame with one row per molecule
|
||||
- **Example**:
|
||||
```python
|
||||
mols = [dm.to_mol(smi) for smi in smiles_list]
|
||||
df = dm.descriptors.batch_compute_many_descriptors(
|
||||
mols,
|
||||
n_jobs=-1,
|
||||
progress=True
|
||||
)
|
||||
```
|
||||
|
||||
### RDKit Descriptor Access
|
||||
|
||||
#### `dm.descriptors.any_rdkit_descriptor(name)`
|
||||
Retrieve any descriptor function from RDKit by name.
|
||||
- **Parameters**: `name` - Descriptor function name (e.g., 'MolWt', 'TPSA')
|
||||
- **Returns**: RDKit descriptor function
|
||||
- **Available descriptors**: From `rdkit.Chem.Descriptors` and `rdkit.Chem.rdMolDescriptors`
|
||||
- **Example**:
|
||||
```python
|
||||
tpsa_fn = dm.descriptors.any_rdkit_descriptor('TPSA')
|
||||
tpsa_value = tpsa_fn(mol)
|
||||
```
|
||||
|
||||
### Common Use Cases
|
||||
|
||||
**Drug-likeness Filtering (Lipinski's Rule of Five)**:
|
||||
```python
|
||||
descriptors = dm.descriptors.compute_many_descriptors(mol)
|
||||
is_druglike = (
|
||||
descriptors['mw'] <= 500 and
|
||||
descriptors['logp'] <= 5 and
|
||||
descriptors['hbd'] <= 5 and
|
||||
descriptors['hba'] <= 10
|
||||
)
|
||||
```
|
||||
|
||||
**ADME Property Analysis**:
|
||||
```python
|
||||
df = dm.descriptors.batch_compute_many_descriptors(compound_library)
|
||||
# Filter by TPSA for blood-brain barrier penetration
|
||||
bbb_candidates = df[df['tpsa'] < 90]
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Visualization Module (`datamol.viz`)
|
||||
|
||||
The viz module provides tools for rendering molecules and conformers as images.
|
||||
|
||||
### Main Visualization Function
|
||||
|
||||
#### `dm.viz.to_image(mols, legends=None, n_cols=4, use_svg=False, mol_size=(200, 200), highlight_atom=None, highlight_bond=None, outfile=None, max_mols=None, copy=True, indices=False, ...)`
|
||||
Generate image grid from molecules.
|
||||
- **Parameters**:
|
||||
- `mols`: Single molecule or list of molecules
|
||||
- `legends`: String or list of strings as labels (one per molecule)
|
||||
- `n_cols`: Number of molecules per row (default: 4)
|
||||
- `use_svg`: Output SVG format (True) or PNG (False, default)
|
||||
- `mol_size`: Tuple (width, height) or single int for square images
|
||||
- `highlight_atom`: Atom indices to highlight (list or dict)
|
||||
- `highlight_bond`: Bond indices to highlight (list or dict)
|
||||
- `outfile`: Save path (local or remote, supports fsspec)
|
||||
- `max_mols`: Maximum number of molecules to display
|
||||
- `indices`: Draw atom indices on structures (default: False)
|
||||
- `align`: Align molecules using MCS (Maximum Common Substructure)
|
||||
- **Returns**: Image object (can be displayed in Jupyter) or saves to file
|
||||
- **Example**:
|
||||
```python
|
||||
# Basic grid
|
||||
dm.viz.to_image(mols[:10], legends=[dm.to_smiles(m) for m in mols[:10]])
|
||||
|
||||
# Save to file
|
||||
dm.viz.to_image(mols, outfile="molecules.png", n_cols=5)
|
||||
|
||||
# Highlight substructure
|
||||
dm.viz.to_image(mol, highlight_atom=[0, 1, 2], highlight_bond=[0, 1])
|
||||
|
||||
# Aligned visualization
|
||||
dm.viz.to_image(mols, align=True, legends=activity_labels)
|
||||
```
|
||||
|
||||
### Conformer Visualization
|
||||
|
||||
#### `dm.viz.conformers(mol, n_confs=None, align_conf=True, n_cols=3, sync_views=True, remove_hs=True, ...)`
|
||||
Display multiple conformers in grid layout.
|
||||
- **Parameters**:
|
||||
- `mol`: Molecule with embedded conformers
|
||||
- `n_confs`: Number or list of conformer indices to display (None = all)
|
||||
- `align_conf`: Align conformers for comparison (default: True)
|
||||
- `n_cols`: Grid columns (default: 3)
|
||||
- `sync_views`: Synchronize 3D views when interactive (default: True)
|
||||
- `remove_hs`: Remove hydrogens for clarity (default: True)
|
||||
- **Returns**: Grid of conformer visualizations
|
||||
- **Use case**: Comparing conformational diversity
|
||||
- **Example**:
|
||||
```python
|
||||
mol_3d = dm.conformers.generate(mol, n_confs=20)
|
||||
dm.viz.conformers(mol_3d, n_confs=10, align_conf=True)
|
||||
```
|
||||
|
||||
### Circle Grid Visualization
|
||||
|
||||
#### `dm.viz.circle_grid(center_mol, circle_mols, mol_size=200, circle_margin=50, act_mapper=None, ...)`
|
||||
Create concentric ring visualization with central molecule.
|
||||
- **Parameters**:
|
||||
- `center_mol`: Molecule at center
|
||||
- `circle_mols`: List of molecule lists (one list per ring)
|
||||
- `mol_size`: Image size per molecule
|
||||
- `circle_margin`: Spacing between rings (default: 50)
|
||||
- `act_mapper`: Activity mapping dictionary for color-coding
|
||||
- **Returns**: Circular grid image
|
||||
- **Use case**: Visualizing molecular neighborhoods, SAR analysis, similarity networks
|
||||
- **Example**:
|
||||
```python
|
||||
# Show a reference molecule surrounded by similar compounds
|
||||
dm.viz.circle_grid(
|
||||
center_mol=reference,
|
||||
circle_mols=[nearest_neighbors, second_tier]
|
||||
)
|
||||
```
|
||||
|
||||
### Visualization Best Practices
|
||||
|
||||
1. **Use legends for clarity**: Always label molecules with SMILES, IDs, or activity values
|
||||
2. **Align related molecules**: Use `align=True` in `to_image()` for SAR analysis
|
||||
3. **Adjust grid size**: Set `n_cols` based on molecule count and display width
|
||||
4. **Use SVG for publications**: Set `use_svg=True` for scalable vector graphics
|
||||
5. **Highlight substructures**: Use `highlight_atom` and `highlight_bond` to emphasize features
|
||||
6. **Save large grids**: Use `outfile` parameter to save rather than display in memory
|
||||
174
skills/datamol/references/fragments_scaffolds.md
Normal file
174
skills/datamol/references/fragments_scaffolds.md
Normal file
@@ -0,0 +1,174 @@
|
||||
# Datamol Fragments and Scaffolds Reference
|
||||
|
||||
## Scaffolds Module (`datamol.scaffold`)
|
||||
|
||||
Scaffolds represent the core structure of molecules, useful for identifying structural families and analyzing structure-activity relationships (SAR).
|
||||
|
||||
### Murcko Scaffolds
|
||||
|
||||
#### `dm.to_scaffold_murcko(mol)`
|
||||
Extract Bemis-Murcko scaffold (molecular framework).
|
||||
- **Method**: Removes side chains, retaining ring systems and linkers
|
||||
- **Returns**: Molecule object representing the scaffold
|
||||
- **Use case**: Identify core structures across compound series
|
||||
- **Example**:
|
||||
```python
|
||||
mol = dm.to_mol("c1ccc(cc1)CCN") # Phenethylamine
|
||||
scaffold = dm.to_scaffold_murcko(mol)
|
||||
scaffold_smiles = dm.to_smiles(scaffold)
|
||||
# Returns: 'c1ccccc1CC' (benzene ring + ethyl linker)
|
||||
```
|
||||
|
||||
**Workflow for scaffold analysis**:
|
||||
```python
|
||||
# Extract scaffolds from compound library
|
||||
scaffolds = [dm.to_scaffold_murcko(mol) for mol in mols]
|
||||
scaffold_smiles = [dm.to_smiles(s) for s in scaffolds]
|
||||
|
||||
# Count scaffold frequency
|
||||
from collections import Counter
|
||||
scaffold_counts = Counter(scaffold_smiles)
|
||||
most_common = scaffold_counts.most_common(10)
|
||||
```
|
||||
|
||||
### Fuzzy Scaffolds
|
||||
|
||||
#### `dm.scaffold.fuzzy_scaffolding(mol, ...)`
|
||||
Generate fuzzy scaffolds with enforceable groups that must appear in the core.
|
||||
- **Purpose**: More flexible scaffold definition allowing specified functional groups
|
||||
- **Use case**: Custom scaffold definitions beyond Murcko rules
|
||||
|
||||
### Applications
|
||||
|
||||
**Scaffold-based splitting** (for ML model validation):
|
||||
```python
|
||||
# Group compounds by scaffold
|
||||
scaffold_to_mols = {}
|
||||
for mol, scaffold in zip(mols, scaffolds):
|
||||
smi = dm.to_smiles(scaffold)
|
||||
if smi not in scaffold_to_mols:
|
||||
scaffold_to_mols[smi] = []
|
||||
scaffold_to_mols[smi].append(mol)
|
||||
|
||||
# Ensure train/test sets have different scaffolds
|
||||
```
|
||||
|
||||
**SAR analysis**:
|
||||
```python
|
||||
# Group by scaffold and analyze activity
|
||||
for scaffold_smi, molecules in scaffold_to_mols.items():
|
||||
activities = [get_activity(mol) for mol in molecules]
|
||||
print(f"Scaffold: {scaffold_smi}, Mean activity: {np.mean(activities)}")
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Fragments Module (`datamol.fragment`)
|
||||
|
||||
Molecular fragmentation breaks molecules into smaller pieces based on chemical rules, useful for fragment-based drug design and substructure analysis.
|
||||
|
||||
### BRICS Fragmentation
|
||||
|
||||
#### `dm.fragment.brics(mol, ...)`
|
||||
Fragment molecule using BRICS (Breaking Retrosynthetically Interesting Chemical Substructures).
|
||||
- **Method**: Dissects based on 16 chemically meaningful bond types
|
||||
- **Consideration**: Considers chemical environment and surrounding substructures
|
||||
- **Returns**: Set of fragment SMILES strings
|
||||
- **Use case**: Retrosynthetic analysis, fragment-based design
|
||||
- **Example**:
|
||||
```python
|
||||
mol = dm.to_mol("c1ccccc1CCN")
|
||||
fragments = dm.fragment.brics(mol)
|
||||
# Returns fragments like: '[1*]CCN', '[1*]c1ccccc1', etc.
|
||||
# [1*] represents attachment points
|
||||
```
|
||||
|
||||
### RECAP Fragmentation
|
||||
|
||||
#### `dm.fragment.recap(mol, ...)`
|
||||
Fragment molecule using RECAP (Retrosynthetic Combinatorial Analysis Procedure).
|
||||
- **Method**: Dissects based on 11 predefined bond types
|
||||
- **Rules**:
|
||||
- Leaves alkyl groups smaller than 5 carbons intact
|
||||
- Preserves cyclic bonds
|
||||
- **Returns**: Set of fragment SMILES strings
|
||||
- **Use case**: Combinatorial library design
|
||||
- **Example**:
|
||||
```python
|
||||
mol = dm.to_mol("CCCCCc1ccccc1")
|
||||
fragments = dm.fragment.recap(mol)
|
||||
```
|
||||
|
||||
### MMPA Fragmentation
|
||||
|
||||
#### `dm.fragment.mmpa_frag(mol, ...)`
|
||||
Fragment for Matched Molecular Pair Analysis.
|
||||
- **Purpose**: Generate fragments suitable for identifying molecular pairs
|
||||
- **Use case**: Analyzing how small structural changes affect properties
|
||||
- **Example**:
|
||||
```python
|
||||
fragments = dm.fragment.mmpa_frag(mol)
|
||||
# Used to find pairs of molecules differing by single transformation
|
||||
```
|
||||
|
||||
### Comparison of Methods
|
||||
|
||||
| Method | Bond Types | Preserves Cycles | Best For |
|
||||
|--------|-----------|------------------|----------|
|
||||
| BRICS | 16 | Yes | Retrosynthetic analysis, fragment recombination |
|
||||
| RECAP | 11 | Yes | Combinatorial library design |
|
||||
| MMPA | Variable | Depends | Structure-activity relationship analysis |
|
||||
|
||||
### Fragmentation Workflow
|
||||
|
||||
```python
|
||||
import datamol as dm
|
||||
|
||||
# 1. Fragment a molecule
|
||||
mol = dm.to_mol("CC(=O)Oc1ccccc1C(=O)O") # Aspirin
|
||||
brics_frags = dm.fragment.brics(mol)
|
||||
recap_frags = dm.fragment.recap(mol)
|
||||
|
||||
# 2. Analyze fragment frequency across library
|
||||
all_fragments = []
|
||||
for mol in molecule_library:
|
||||
frags = dm.fragment.brics(mol)
|
||||
all_fragments.extend(frags)
|
||||
|
||||
# 3. Identify common fragments
|
||||
from collections import Counter
|
||||
fragment_counts = Counter(all_fragments)
|
||||
common_fragments = fragment_counts.most_common(20)
|
||||
|
||||
# 4. Convert fragments back to molecules (remove attachment points)
|
||||
def clean_fragment(frag_smiles):
|
||||
# Remove [1*], [2*], etc. attachment point markers
|
||||
clean = frag_smiles.replace('[1*]', '[H]')
|
||||
return dm.to_mol(clean)
|
||||
```
|
||||
|
||||
### Advanced: Fragment-Based Virtual Screening
|
||||
|
||||
```python
|
||||
# Build fragment library from known actives
|
||||
active_fragments = set()
|
||||
for active_mol in active_compounds:
|
||||
frags = dm.fragment.brics(active_mol)
|
||||
active_fragments.update(frags)
|
||||
|
||||
# Screen compounds for presence of active fragments
|
||||
def score_by_fragments(mol, fragment_set):
|
||||
mol_frags = dm.fragment.brics(mol)
|
||||
overlap = mol_frags.intersection(fragment_set)
|
||||
return len(overlap) / len(mol_frags)
|
||||
|
||||
# Score screening library
|
||||
scores = [score_by_fragments(mol, active_fragments) for mol in screening_lib]
|
||||
```
|
||||
|
||||
### Key Concepts
|
||||
|
||||
- **Attachment Points**: Marked with [1*], [2*], etc. in fragment SMILES
|
||||
- **Retrosynthetic**: Fragmentation mimics synthetic disconnections
|
||||
- **Chemically Meaningful**: Breaks occur at typical synthetic bonds
|
||||
- **Recombination**: Fragments can theoretically be recombined into valid molecules
|
||||
109
skills/datamol/references/io_module.md
Normal file
109
skills/datamol/references/io_module.md
Normal file
@@ -0,0 +1,109 @@
|
||||
# Datamol I/O Module Reference
|
||||
|
||||
The `datamol.io` module provides comprehensive file handling for molecular data across multiple formats.
|
||||
|
||||
## Reading Molecular Files
|
||||
|
||||
### `dm.read_sdf(filename, sanitize=True, remove_hs=True, as_df=True, mol_column='mol', ...)`
|
||||
Read Structure-Data File (SDF) format.
|
||||
- **Parameters**:
|
||||
- `filename`: Path to SDF file (supports local and remote paths via fsspec)
|
||||
- `sanitize`: Apply sanitization to molecules
|
||||
- `remove_hs`: Remove explicit hydrogens
|
||||
- `as_df`: Return as DataFrame (True) or list of molecules (False)
|
||||
- `mol_column`: Name of molecule column in DataFrame
|
||||
- `n_jobs`: Enable parallel processing
|
||||
- **Returns**: DataFrame or list of molecules
|
||||
- **Example**: `df = dm.read_sdf("compounds.sdf")`
|
||||
|
||||
### `dm.read_smi(filename, smiles_column='smiles', mol_column='mol', as_df=True, ...)`
|
||||
Read SMILES file (space-delimited by default).
|
||||
- **Common format**: SMILES followed by molecule ID/name
|
||||
- **Example**: `df = dm.read_smi("molecules.smi")`
|
||||
|
||||
### `dm.read_csv(filename, smiles_column='smiles', mol_column=None, ...)`
|
||||
Read CSV file with optional automatic SMILES-to-molecule conversion.
|
||||
- **Parameters**:
|
||||
- `smiles_column`: Column containing SMILES strings
|
||||
- `mol_column`: If specified, creates molecule objects from SMILES column
|
||||
- **Example**: `df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")`
|
||||
|
||||
### `dm.read_excel(filename, sheet_name=0, smiles_column='smiles', mol_column=None, ...)`
|
||||
Read Excel files with molecule handling.
|
||||
- **Parameters**:
|
||||
- `sheet_name`: Sheet to read (index or name)
|
||||
- Other parameters similar to `read_csv`
|
||||
- **Example**: `df = dm.read_excel("compounds.xlsx", sheet_name="Sheet1")`
|
||||
|
||||
### `dm.read_molblock(molblock, sanitize=True, remove_hs=True)`
|
||||
Parse MOL block string (molecular structure text representation).
|
||||
|
||||
### `dm.read_mol2file(filename, sanitize=True, remove_hs=True, cleanupSubstructures=True)`
|
||||
Read Mol2 format files.
|
||||
|
||||
### `dm.read_pdbfile(filename, sanitize=True, remove_hs=True, proximityBonding=True)`
|
||||
Read Protein Data Bank (PDB) format files.
|
||||
|
||||
### `dm.read_pdbblock(pdbblock, sanitize=True, remove_hs=True, proximityBonding=True)`
|
||||
Parse PDB block string.
|
||||
|
||||
### `dm.open_df(filename, ...)`
|
||||
Universal DataFrame reader - automatically detects format.
|
||||
- **Supported formats**: CSV, Excel, Parquet, JSON, SDF
|
||||
- **Example**: `df = dm.open_df("data.csv")` or `df = dm.open_df("molecules.sdf")`
|
||||
|
||||
## Writing Molecular Files
|
||||
|
||||
### `dm.to_sdf(mols, filename, mol_column=None, ...)`
|
||||
Write molecules to SDF file.
|
||||
- **Input types**:
|
||||
- List of molecules
|
||||
- DataFrame with molecule column
|
||||
- Sequence of molecules
|
||||
- **Parameters**:
|
||||
- `mol_column`: Column name if input is DataFrame
|
||||
- **Example**:
|
||||
```python
|
||||
dm.to_sdf(mols, "output.sdf")
|
||||
# or from DataFrame
|
||||
dm.to_sdf(df, "output.sdf", mol_column="mol")
|
||||
```
|
||||
|
||||
### `dm.to_smi(mols, filename, mol_column=None, ...)`
|
||||
Write molecules to SMILES file with optional validation.
|
||||
- **Format**: SMILES strings with optional molecule names/IDs
|
||||
|
||||
### `dm.to_xlsx(df, filename, mol_columns=None, ...)`
|
||||
Export DataFrame to Excel with rendered molecular images.
|
||||
- **Parameters**:
|
||||
- `mol_columns`: Columns containing molecules to render as images
|
||||
- **Special feature**: Automatically renders molecules as images in Excel cells
|
||||
- **Example**: `dm.to_xlsx(df, "molecules.xlsx", mol_columns=["mol"])`
|
||||
|
||||
### `dm.to_molblock(mol, ...)`
|
||||
Convert molecule to MOL block string.
|
||||
|
||||
### `dm.to_pdbblock(mol, ...)`
|
||||
Convert molecule to PDB block string.
|
||||
|
||||
### `dm.save_df(df, filename, ...)`
|
||||
Save DataFrame in multiple formats (CSV, Excel, Parquet, JSON).
|
||||
|
||||
## Remote File Support
|
||||
|
||||
All I/O functions support remote file paths through fsspec integration:
|
||||
- **Supported protocols**: S3 (AWS), GCS (Google Cloud), Azure, HTTP/HTTPS
|
||||
- **Example**:
|
||||
```python
|
||||
dm.read_sdf("s3://bucket/compounds.sdf")
|
||||
dm.read_csv("https://example.com/data.csv")
|
||||
```
|
||||
|
||||
## Key Parameters Across Functions
|
||||
|
||||
- **`sanitize`**: Apply molecule sanitization (default: True)
|
||||
- **`remove_hs`**: Remove explicit hydrogens (default: True)
|
||||
- **`as_df`**: Return DataFrame vs list (default: True for most functions)
|
||||
- **`n_jobs`**: Enable parallel processing (None = all cores, 1 = sequential)
|
||||
- **`mol_column`**: Name of molecule column in DataFrames
|
||||
- **`smiles_column`**: Name of SMILES column in DataFrames
|
||||
218
skills/datamol/references/reactions_data.md
Normal file
218
skills/datamol/references/reactions_data.md
Normal file
@@ -0,0 +1,218 @@
|
||||
# Datamol Reactions and Data Modules Reference
|
||||
|
||||
## Reactions Module (`datamol.reactions`)
|
||||
|
||||
The reactions module enables programmatic application of chemical transformations using SMARTS reaction patterns.
|
||||
|
||||
### Applying Chemical Reactions
|
||||
|
||||
#### `dm.reactions.apply_reaction(rxn, reactants, as_smiles=False, sanitize=True, single_product_group=True, rm_attach=True, product_index=0)`
|
||||
Apply a chemical reaction to reactant molecules.
|
||||
- **Parameters**:
|
||||
- `rxn`: Reaction object (from SMARTS pattern)
|
||||
- `reactants`: Tuple of reactant molecules
|
||||
- `as_smiles`: Return SMILES strings (True) or molecule objects (False)
|
||||
- `sanitize`: Sanitize product molecules
|
||||
- `single_product_group`: Return single product (True) or all product groups (False)
|
||||
- `rm_attach`: Remove attachment point markers
|
||||
- `product_index`: Which product to return from reaction
|
||||
- **Returns**: Product molecule(s) or SMILES
|
||||
- **Example**:
|
||||
```python
|
||||
from rdkit import Chem
|
||||
|
||||
# Define reaction: alcohol + carboxylic acid → ester
|
||||
rxn = Chem.rdChemReactions.ReactionFromSmarts(
|
||||
'[C:1][OH:2].[C:3](=[O:4])[OH:5]>>[C:1][O:2][C:3](=[O:4])'
|
||||
)
|
||||
|
||||
# Apply to reactants
|
||||
alcohol = dm.to_mol("CCO")
|
||||
acid = dm.to_mol("CC(=O)O")
|
||||
product = dm.reactions.apply_reaction(rxn, (alcohol, acid))
|
||||
```
|
||||
|
||||
### Creating Reactions
|
||||
|
||||
Reactions are typically created from SMARTS patterns using RDKit:
|
||||
```python
|
||||
from rdkit.Chem import rdChemReactions
|
||||
|
||||
# Reaction pattern: [reactant1].[reactant2]>>[product]
|
||||
rxn = rdChemReactions.ReactionFromSmarts(
|
||||
'[1*][*:1].[1*][*:2]>>[*:1][*:2]'
|
||||
)
|
||||
```
|
||||
|
||||
### Validation Functions
|
||||
|
||||
The module includes functions to:
|
||||
- **Check if molecule is reactant**: Verify if molecule matches reactant pattern
|
||||
- **Validate reaction**: Check if reaction is synthetically reasonable
|
||||
- **Process reaction files**: Load reactions from files or databases
|
||||
|
||||
### Common Reaction Patterns
|
||||
|
||||
**Amide formation**:
|
||||
```python
|
||||
# Amine + carboxylic acid → amide
|
||||
amide_rxn = rdChemReactions.ReactionFromSmarts(
|
||||
'[N:1].[C:2](=[O:3])[OH]>>[N:1][C:2](=[O:3])'
|
||||
)
|
||||
```
|
||||
|
||||
**Suzuki coupling**:
|
||||
```python
|
||||
# Aryl halide + boronic acid → biaryl
|
||||
suzuki_rxn = rdChemReactions.ReactionFromSmarts(
|
||||
'[c:1][Br].[c:2][B]([OH])[OH]>>[c:1][c:2]'
|
||||
)
|
||||
```
|
||||
|
||||
**Functional group transformations**:
|
||||
```python
|
||||
# Alcohol → ester
|
||||
esterification = rdChemReactions.ReactionFromSmarts(
|
||||
'[C:1][OH:2].[C:3](=[O:4])[Cl]>>[C:1][O:2][C:3](=[O:4])'
|
||||
)
|
||||
```
|
||||
|
||||
### Workflow Example
|
||||
|
||||
```python
|
||||
import datamol as dm
|
||||
from rdkit.Chem import rdChemReactions
|
||||
|
||||
# 1. Define reaction
|
||||
rxn_smarts = '[C:1](=[O:2])[OH:3]>>[C:1](=[O:2])[Cl:3]' # Acid → acid chloride
|
||||
rxn = rdChemReactions.ReactionFromSmarts(rxn_smarts)
|
||||
|
||||
# 2. Apply to molecule library
|
||||
acids = [dm.to_mol(smi) for smi in acid_smiles_list]
|
||||
acid_chlorides = []
|
||||
|
||||
for acid in acids:
|
||||
try:
|
||||
product = dm.reactions.apply_reaction(
|
||||
rxn,
|
||||
(acid,), # Single reactant as tuple
|
||||
sanitize=True
|
||||
)
|
||||
acid_chlorides.append(product)
|
||||
except Exception as e:
|
||||
print(f"Reaction failed: {e}")
|
||||
|
||||
# 3. Validate products
|
||||
valid_products = [p for p in acid_chlorides if p is not None]
|
||||
```
|
||||
|
||||
### Key Concepts
|
||||
|
||||
- **SMARTS**: SMiles ARbitrary Target Specification - pattern language for reactions
|
||||
- **Atom Mapping**: Numbers like [C:1] preserve atom identity through reaction
|
||||
- **Attachment Points**: [1*] represents generic connection points
|
||||
- **Reaction Validation**: Not all SMARTS reactions are chemically reasonable
|
||||
|
||||
---
|
||||
|
||||
## Data Module (`datamol.data`)
|
||||
|
||||
The data module provides convenient access to curated molecular datasets for testing and learning.
|
||||
|
||||
### Available Datasets
|
||||
|
||||
#### `dm.data.cdk2(as_df=True, mol_column='mol')`
|
||||
RDKit CDK2 dataset - kinase inhibitor data.
|
||||
- **Parameters**:
|
||||
- `as_df`: Return as DataFrame (True) or list of molecules (False)
|
||||
- `mol_column`: Name for molecule column
|
||||
- **Returns**: Dataset with molecular structures and activity data
|
||||
- **Use case**: Small dataset for algorithm testing
|
||||
- **Example**:
|
||||
```python
|
||||
cdk2_df = dm.data.cdk2(as_df=True)
|
||||
print(cdk2_df.shape)
|
||||
print(cdk2_df.columns)
|
||||
```
|
||||
|
||||
#### `dm.data.freesolv()`
|
||||
FreeSolv dataset - experimental and calculated hydration free energies.
|
||||
- **Contents**: 642 molecules with:
|
||||
- IUPAC names
|
||||
- SMILES strings
|
||||
- Experimental hydration free energy values
|
||||
- Calculated values
|
||||
- **Warning**: "Only meant to be used as a toy dataset for pedagogic and testing purposes"
|
||||
- **Not suitable for**: Benchmarking or production model training
|
||||
- **Example**:
|
||||
```python
|
||||
freesolv_df = dm.data.freesolv()
|
||||
# Columns: iupac, smiles, expt (kcal/mol), calc (kcal/mol)
|
||||
```
|
||||
|
||||
#### `dm.data.solubility(as_df=True, mol_column='mol')`
|
||||
RDKit solubility dataset with train/test splits.
|
||||
- **Contents**: Aqueous solubility data with pre-defined splits
|
||||
- **Columns**: Includes 'split' column with 'train' or 'test' values
|
||||
- **Use case**: Testing ML workflows with proper train/test separation
|
||||
- **Example**:
|
||||
```python
|
||||
sol_df = dm.data.solubility(as_df=True)
|
||||
|
||||
# Split into train/test
|
||||
train_df = sol_df[sol_df['split'] == 'train']
|
||||
test_df = sol_df[sol_df['split'] == 'test']
|
||||
|
||||
# Use for model development
|
||||
X_train = dm.to_fp(train_df[mol_column])
|
||||
y_train = train_df['solubility']
|
||||
```
|
||||
|
||||
### Usage Guidelines
|
||||
|
||||
**For testing and tutorials**:
|
||||
```python
|
||||
# Quick dataset for testing code
|
||||
df = dm.data.cdk2()
|
||||
mols = df['mol'].tolist()
|
||||
|
||||
# Test descriptor calculation
|
||||
descriptors_df = dm.descriptors.batch_compute_many_descriptors(mols)
|
||||
|
||||
# Test clustering
|
||||
clusters = dm.cluster_mols(mols, cutoff=0.3)
|
||||
```
|
||||
|
||||
**For learning workflows**:
|
||||
```python
|
||||
# Complete ML pipeline example
|
||||
sol_df = dm.data.solubility()
|
||||
|
||||
# Preprocessing
|
||||
train = sol_df[sol_df['split'] == 'train']
|
||||
test = sol_df[sol_df['split'] == 'test']
|
||||
|
||||
# Featurization
|
||||
X_train = dm.to_fp(train['mol'])
|
||||
X_test = dm.to_fp(test['mol'])
|
||||
|
||||
# Model training (example)
|
||||
from sklearn.ensemble import RandomForestRegressor
|
||||
model = RandomForestRegressor()
|
||||
model.fit(X_train, train['solubility'])
|
||||
predictions = model.predict(X_test)
|
||||
```
|
||||
|
||||
### Important Notes
|
||||
|
||||
- **Toy Datasets**: Designed for pedagogical purposes, not production use
|
||||
- **Small Size**: Limited number of compounds suitable for quick tests
|
||||
- **Pre-processed**: Data already cleaned and formatted
|
||||
- **Citations**: Check dataset documentation for proper attribution if publishing
|
||||
|
||||
### Best Practices
|
||||
|
||||
1. **Use for development only**: Don't draw scientific conclusions from toy datasets
|
||||
2. **Validate on real data**: Always test production code on actual project data
|
||||
3. **Proper attribution**: Cite original data sources if using in publications
|
||||
4. **Understand limitations**: Know the scope and quality of each dataset
|
||||
Reference in New Issue
Block a user