Initial commit

Zhongwei Li
2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions

@@ -0,0 +1,130 @@
# Datamol Core API Reference
This document covers the main functions available in the top-level `datamol` namespace. Examples assume the conventional import `import datamol as dm`.
## Molecule Creation and Conversion
### `to_mol(mol, ...)`
Convert a SMILES string to an RDKit molecule object.
- **Parameters**: Accepts a SMILES string or an existing `rdkit.Chem.Mol`; for InChI, SMARTS, or SELFIES input use `from_inchi`, `from_smarts`, or `from_selfies`
- **Returns**: `rdkit.Chem.Mol` object, or `None` if the input cannot be parsed
- **Common usage**: `mol = dm.to_mol("CCO")`
### `from_inchi(inchi)`
Convert InChI string to molecule object.
### `from_smarts(smarts)`
Convert SMARTS pattern to molecule object.
### `from_selfies(selfies)`
Convert SELFIES string to molecule object.
### `copy_mol(mol)`
Create a copy of a molecule object to avoid modifying the original.
## Molecule Export
### `to_smiles(mol, ...)`
Convert molecule object to SMILES string.
- **Common parameters**: `canonical=True`, `isomeric=True`
### `to_inchi(mol, ...)`
Convert molecule to InChI string representation.
### `to_inchikey(mol)`
Convert molecule to InChIKey (a fixed-length, 27-character hashed form of the InChI).
### `to_smarts(mol)`
Convert molecule to SMARTS pattern.
### `to_selfies(mol)`
Convert molecule to SELFIES (Self-Referencing Embedded Strings) format.
## Sanitization and Standardization
### `sanitize_mol(mol, ...)`
Enhanced version of RDKit's sanitize operation using mol→SMILES→mol conversion and aromatic nitrogen fixing.
- **Purpose**: Fix common molecular structure issues
- **Returns**: Sanitized molecule or None if sanitization fails
### `standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True, ...)`
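Apply standardization procedures, each toggled by its own flag:
- Metal disconnection (`disconnect_metals`)
- Normalization of functional groups and charges (`normalize`)
- Reionization (`reionize`)

Largest-fragment selection is a separate step; use `keep_largest_fragment(mol)` for it rather than relying on `standardize_mol`.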
### `standardize_smiles(smiles, ...)`
Apply SMILES standardization procedures directly to a SMILES string.
### `fix_mol(mol)`
Attempt to fix molecular structure issues automatically.
### `fix_valence(mol)`
Correct valence errors in molecular structures.
## Atom Ordering and Hydrogens
### `reorder_atoms(mol, ...)`
Ensure consistent atom ordering for the same molecule regardless of original SMILES representation.
- **Purpose**: Maintain reproducible feature generation
### `remove_hs(mol, ...)`
Remove hydrogen atoms from molecular structure.
### `add_hs(mol, ...)`
Add explicit hydrogen atoms to molecular structure.
## Fingerprints and Similarity
### `to_fp(mol, fp_type='ecfp', ...)`
Generate molecular fingerprints for similarity calculations.
- **Fingerprint types**:
  - `'ecfp'` - Extended Connectivity Fingerprints (Morgan)
  - `'fcfp'` - Functional Connectivity Fingerprints
  - `'maccs'` - MACCS keys
  - `'topological'` - Topological fingerprints
  - `'atompair'` - Atom pair fingerprints
- **Common parameters**: `n_bits`, `radius`
- **Returns**: NumPy array by default (`as_array=True`), or an RDKit fingerprint object when `as_array=False`
### `pdist(mols, ...)`
Calculate pairwise Tanimoto distances between all molecules in a list.
- **Supports**: Parallel processing via `n_jobs` parameter
- **Returns**: Distance matrix
### `cdist(mols1, mols2, ...)`
Calculate Tanimoto distances between two sets of molecules.
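
To show what the entries of such a matrix mean, here is a library-free sketch of the Tanimoto distance, one minus the ratio of shared to total on-bits, applied across two sets. The on-bit sets are made-up stand-ins for fingerprints, and `tanimoto_distance` is a hypothetical helper, not a datamol function.

```python
def tanimoto_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b|, with two empty sets treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Toy "fingerprints": each set holds the indices of the bits that are on.
group1 = [{1, 2, 3}, {2, 3, 4}]
group2 = [{1, 2, 3}, {7, 8}, {3, 4}]

# Rows follow group1, columns follow group2, giving a 2 x 3 matrix.
matrix = [[tanimoto_distance(x, y) for y in group2] for x in group1]
```

Identical sets give 0.0 and disjoint sets give 1.0, which is the range to expect from the real function as well.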
## Clustering and Diversity
### `cluster_mols(mols, cutoff=0.2, feature_fn=None, n_jobs=1)`
Cluster molecules using Butina clustering algorithm.
- **Parameters**:
  - `cutoff`: Distance threshold (default 0.2)
  - `feature_fn`: Custom function for molecular features
  - `n_jobs`: Parallelization (-1 for all cores)
- **Important**: Builds full distance matrix - suitable for ~1000 structures, not for 10,000+
- **Returns**: Tuple `(cluster_indices, cluster_mols)`: the clusters as lists of molecule indices, and the same clusters as lists of molecule objects
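
The size warning follows from simple arithmetic: the number of pairwise distances grows quadratically with the number of molecules. The sketch below works this out, assuming 8 bytes per stored distance (a float64); actual memory use depends on how the matrix is held.

```python
def n_pairwise(n: int) -> int:
    """Number of distinct molecule pairs among n molecules."""
    return n * (n - 1) // 2


BYTES_PER_DISTANCE = 8  # assumption: one float64 per pair

for n in (1_000, 10_000, 100_000):
    pairs = n_pairwise(n)
    megabytes = pairs * BYTES_PER_DISTANCE / 1e6
    print(f"{n:>7,} molecules: {pairs:>13,} distances, about {megabytes:,.0f} MB")
```

Going from 1,000 to 10,000 molecules multiplies the work by roughly 100, not 10.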
### `pick_diverse(mols, npick, ...)`
Select diverse subset of molecules based on fingerprint diversity.
### `pick_centroids(mols, npick, ...)`
Select centroid molecules representing clusters.
## Graph Operations
### `to_graph(mol)`
Convert molecule to graph representation for graph-based analysis.
### `get_all_path_between(mol, start, end)`
Find all paths between two atoms in molecular structure.
## DataFrame Integration
### `to_df(mols, smiles_column='smiles', mol_column='mol')`
Convert list of molecules to pandas DataFrame.
### `from_df(df, smiles_column='smiles', mol_column='mol')`
Convert pandas DataFrame to list of molecules.