Datamol Core API Reference

This document covers the main functions available in the top-level datamol namespace, conventionally imported as dm (import datamol as dm).

Molecule Creation and Conversion

to_mol(mol, ...)

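Convert a SMILES string (or an existing RDKit molecule) to an RDKit molecule object.

  • Parameters: A SMILES string or an rdkit.Chem.Mol; for InChI, SMARTS, or SELFIES input use from_inchi, from_smarts, or from_selfies
  • Returns: rdkit.Chem.Mol object, or None if the input cannot be parsed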
  • Common usage: mol = dm.to_mol("CCO")

from_inchi(inchi)

Convert InChI string to molecule object.

from_smarts(smarts)

Convert SMARTS pattern to molecule object.

from_selfies(selfies)

Convert a SELFIES string to a SMILES string (the default) or to a molecule object when as_mol=True.

copy_mol(mol)

Create a copy of a molecule object to avoid modifying the original.

Molecule Export

to_smiles(mol, ...)

Convert molecule object to SMILES string.

  • Common parameters: canonical=True, isomeric=True

to_inchi(mol, ...)

Convert molecule to InChI string representation.

to_inchikey(mol)

Convert molecule to InChIKey (a fixed-length, 27-character hash of the InChI string).

to_smarts(mol)

Convert molecule to SMARTS pattern.

to_selfies(mol)

Convert molecule to SELFIES (Self-Referencing Embedded Strings) format.

Sanitization and Standardization

sanitize_mol(mol, ...)

Enhanced version of RDKit's sanitize operation using mol→SMILES→mol conversion and aromatic nitrogen fixing.

  • Purpose: Fix common molecular structure issues
  • Returns: Sanitized molecule or None if sanitization fails

standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True, ...)

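Apply RDKit-based standardization steps, each switched on or off by a keyword argument:

  • Metal disconnection (disconnect_metals, off by default)
  • Normalization of functional groups and charge-separated forms (normalize)
  • Reionization (reionize)
  • Further options covered by the ... in the signature, such as neutralization (uncharge) and stereochemistry assignment (stereo)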

standardize_smiles(smiles, ...)

Apply SMILES standardization procedures directly to a SMILES string.

fix_mol(mol)

Attempt to fix molecular structure issues automatically.

fix_valence(mol)

Correct valence errors in molecular structures.

Molecular Properties

reorder_atoms(mol, ...)

Ensure consistent atom ordering for the same molecule regardless of original SMILES representation.

  • Purpose: Maintain reproducible feature generation

remove_hs(mol, ...)

Remove hydrogen atoms from molecular structure.

add_hs(mol, ...)

Add explicit hydrogen atoms to molecular structure.

Fingerprints and Similarity

to_fp(mol, fp_type='ecfp', ...)

Generate molecular fingerprints for similarity calculations.

  • Fingerprint types:
    • 'ecfp' - Extended Connectivity Fingerprints (Morgan)
    • 'fcfp' - Functional Connectivity Fingerprints
    • 'maccs' - MACCS keys
    • 'topological' - Topological fingerprints
    • 'atompair' - Atom pair fingerprints
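  • Common parameters: radius and the fingerprint length, forwarded as keyword arguments to the underlying RDKit generator (the length keyword depends on the installed version, e.g. nBits or fpSize, rather than n_bits)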
  • Returns: NumPy array by default, or an RDKit fingerprint object when as_array=False

pdist(mols, ...)

Calculate pairwise Tanimoto distances between all molecules in a list.

  • Supports: Parallel processing via n_jobs parameter
  • Returns: Distance matrix

cdist(mols1, mols2, ...)

Calculate Tanimoto distances between two sets of molecules.
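
To make the quantity concrete, here is a plain-Python sketch of the Tanimoto distance on sets of on-bit indices. It is not datamol's implementation; the function names and the toy fingerprints are hypothetical.

```python
def tanimoto_distance(bits_a, bits_b):
    """Return 1 - |A & B| / |A | B| for two sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return 1.0 - len(bits_a & bits_b) / union

def pairwise_distances(fps):
    """Square matrix over one collection (the pdist-style case)."""
    return [[tanimoto_distance(a, b) for b in fps] for a in fps]

def cross_distances(fps_1, fps_2):
    """Rectangular matrix between two collections (the cdist-style case)."""
    return [[tanimoto_distance(a, b) for b in fps_2] for a in fps_1]

fps = [{1, 2, 3}, {2, 3, 4}, {7, 8}]
square = pairwise_distances(fps)
rect = cross_distances(fps[:2], fps[2:])

print(square[0][1])  # 2 shared bits out of 4 distinct bits -> 0.5
print(rect)          # no shared bits -> [[1.0], [1.0]]
```

A distance of 0 means identical bit sets and 1 means no bits in common.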

Clustering and Diversity

cluster_mols(mols, cutoff=0.2, feature_fn=None, n_jobs=1)

Cluster molecules using Butina clustering algorithm.

  • Parameters:
    • cutoff: Distance threshold (default 0.2)
    • feature_fn: Custom function for molecular features
    • n_jobs: Parallelization (-1 for all cores)
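  • Important: Builds the full pairwise distance matrix, so time and memory grow quadratically with the number of molecules; suitable for roughly 1,000 structures, not for 10,000+
  • Returns: A tuple (cluster_indices, cluster_mols): the clusters as groups of molecule indices, and the same clusters as groups of molecule objects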
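
The following plain-Python sketch shows the idea behind Butina clustering and why the distance matrix limits scale. It is a simplified illustration, not the code datamol or RDKit runs; butina_sketch and pair_count are hypothetical names.

```python
def pair_count(n):
    """Number of distinct pairwise distances among n molecules."""
    return n * (n - 1) // 2

def butina_sketch(dist, cutoff):
    """Simplified Butina clustering on a square distance matrix."""
    n = len(dist)
    # Neighbors of i: every other item within the distance cutoff.
    neighbors = [
        {j for j in range(n) if j != i and dist[i][j] <= cutoff}
        for i in range(n)
    ]
    # Visit candidate centroids from most to fewest neighbors.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        # The centroid comes first, followed by its still-unassigned neighbors.
        members = [i] + sorted(neighbors[i] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters

# Items 0, 1, 2 are close to each other; item 3 is far from all of them.
dist = [
    [0.00, 0.10, 0.15, 0.90],
    [0.10, 0.00, 0.10, 0.90],
    [0.15, 0.10, 0.00, 0.90],
    [0.90, 0.90, 0.90, 0.00],
]
print(butina_sketch(dist, cutoff=0.2))  # [[0, 1, 2], [3]]
print(pair_count(1000))                 # 499500
print(pair_count(10000))                # 49995000
```

Going from 1,000 to 10,000 molecules multiplies the number of distances by about 100, which is why the full-matrix approach stops being practical.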

pick_diverse(mols, npick, ...)

Select diverse subset of molecules based on fingerprint diversity.

pick_centroids(mols, npick, ...)

Select centroid molecules representing clusters.

Graph Operations

to_graph(mol)

Convert molecule to graph representation for graph-based analysis.

get_all_path_between(mol, start, end)

Find all paths between two atoms in molecular structure.
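
What "all paths" means is easiest to see on a ring. The sketch below enumerates simple paths in a plain adjacency dictionary; it is a conceptual illustration in standard Python, not datamol's implementation, and all_simple_paths is a hypothetical name.

```python
def all_simple_paths(adjacency, start, end):
    """Enumerate every path from start to end that visits no atom twice."""
    paths = []

    def walk(node, path):
        if node == end:
            paths.append(path)
            return
        for neighbor in adjacency[node]:
            if neighbor not in path:
                walk(neighbor, path + [neighbor])

    walk(start, [start])
    return paths

# Six-membered ring: atom i is bonded to atoms i-1 and i+1 (mod 6).
ring = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}

paths = all_simple_paths(ring, 0, 3)
print(sorted(paths))  # [[0, 1, 2, 3], [0, 5, 4, 3]]
```

In an acyclic molecule there is exactly one such path between any two atoms; rings are what introduce alternatives.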

DataFrame Integration

to_df(mols, smiles_column='smiles', mol_column='mol')

Convert list of molecules to pandas DataFrame.

from_df(df, smiles_column='smiles', mol_column='mol')

Convert pandas DataFrame to list of molecules.