# Datamol Core API Reference

This document covers the main functions available in the `datamol` namespace.

## Molecule Creation and Conversion

### `to_mol(mol, ...)`

Convert a SMILES string to an RDKit molecule object.

- **Parameters**: Accepts a SMILES string or an existing `rdkit.Chem.Mol`; for InChI, SMARTS, or SELFIES input, use the dedicated `from_*` functions below
- **Returns**: `rdkit.Chem.Mol` object, or `None` if the input cannot be parsed
- **Common usage**: `mol = dm.to_mol("CCO")`

### `from_inchi(inchi)`

Convert an InChI string to a molecule object.

### `from_smarts(smarts)`

Convert a SMARTS pattern to a molecule object.

### `from_selfies(selfies)`

Convert a SELFIES string to a molecule object.

### `copy_mol(mol)`

Create a copy of a molecule object to avoid modifying the original.

## Molecule Export

### `to_smiles(mol, ...)`

Convert a molecule object to a SMILES string.

- **Common parameters**: `canonical=True`, `isomeric=True`

### `to_inchi(mol, ...)`

Convert a molecule to its InChI string representation.

### `to_inchikey(mol)`

Convert a molecule to its InChIKey (a fixed-length hash of the InChI).

### `to_smarts(mol)`

Convert a molecule to a SMARTS pattern.

### `to_selfies(mol)`

Convert a molecule to SELFIES (Self-Referencing Embedded Strings) format.

## Sanitization and Standardization

### `sanitize_mol(mol, ...)`

Enhanced version of RDKit's sanitize operation that uses a mol→SMILES→mol round trip and aromatic nitrogen fixing.

- **Purpose**: Fix common molecular structure issues
- **Returns**: Sanitized molecule, or `None` if sanitization fails

### `standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True, ...)`

Apply comprehensive standardization procedures, including:

- Metal disconnection
- Normalization (charge corrections)
- Reionization
- Fragment handling (largest fragment selection)

### `standardize_smiles(smiles, ...)`

Apply the standardization procedures directly to a SMILES string.

### `fix_mol(mol)`

Attempt to fix molecular structure issues automatically.

### `fix_valence(mol)`

Correct valence errors in molecular structures.

## Molecular Properties

### `reorder_atoms(mol, ...)`

Ensure consistent atom ordering for the same molecule regardless of the original SMILES representation.

- **Purpose**: Maintain reproducible feature generation

### `remove_hs(mol, ...)`

Remove hydrogen atoms from a molecular structure.

### `add_hs(mol, ...)`

Add explicit hydrogen atoms to a molecular structure.

## Fingerprints and Similarity

### `to_fp(mol, fp_type='ecfp', ...)`

Generate molecular fingerprints for similarity calculations.

- **Fingerprint types**:
  - `'ecfp'` - Extended Connectivity Fingerprints (Morgan)
  - `'fcfp'` - Functional Connectivity Fingerprints
  - `'maccs'` - MACCS keys
  - `'topological'` - Topological fingerprints
  - `'atompair'` - Atom pair fingerprints
- **Common parameters**: `n_bits`, `radius`
- **Returns**: Numpy array or RDKit fingerprint object

### `pdist(mols, ...)`

Calculate pairwise Tanimoto distances between all molecules in a list.

- **Supports**: Parallel processing via the `n_jobs` parameter
- **Returns**: Distance matrix

### `cdist(mols1, mols2, ...)`

Calculate Tanimoto distances between two sets of molecules.

## Clustering and Diversity

### `cluster_mols(mols, cutoff=0.2, feature_fn=None, n_jobs=1)`

Cluster molecules using the Butina clustering algorithm.

- **Parameters**:
  - `cutoff`: Distance threshold (default 0.2)
  - `feature_fn`: Custom function for computing molecular features
  - `n_jobs`: Parallelization (-1 for all cores)
- **Important**: Builds the full distance matrix, so it is suitable for roughly 1,000 structures but not for 10,000 or more
- **Returns**: A tuple `(cluster_indices, cluster_mols)`: the clusters as lists of molecule indices, and the same clusters as lists of molecules

### `pick_diverse(mols, npick, ...)`

Select a diverse subset of molecules based on fingerprint diversity.

### `pick_centroids(mols, npick, ...)`

Select centroid molecules representing clusters.

## Graph Operations

### `to_graph(mol)`

Convert a molecule to a graph representation for graph-based analysis.

### `get_all_path_between(mol, start, end)`

Find all paths between two atoms in a molecular structure.

## DataFrame Integration

### `to_df(mols, smiles_column='smiles', mol_column='mol')`

Convert a list of molecules to a pandas DataFrame.

### `from_df(df, smiles_column='smiles', mol_column='mol')`

Convert a pandas DataFrame to a list of molecules.