Initial commit

2025-11-30 08:30:10 +08:00
commit f0bd18fb4e
824 changed files with 331919 additions and 0 deletions
--- a/skills/datamol/references/core_api.md
+++ b/skills/datamol/references/core_api.md
@@ -0,0 +1,130 @@
+# Datamol Core API Reference
+
+This document covers the main functions available in the datamol namespace.
+
+## Molecule Creation and Conversion
+
+### `to_mol(mol, ...)`
+Convert a SMILES string (or an existing RDKit molecule) to an RDKit molecule object.
+- **Parameters**: Accepts a SMILES string or a `Mol`; for InChI, SMARTS, or SELFIES input use the `from_*` functions below
+- **Returns**: `rdkit.Chem.Mol` object, or `None` if the input cannot be parsed
+- **Common usage**: `mol = dm.to_mol("CCO")`
+
+### `from_inchi(inchi)`
+Convert InChI string to molecule object.
+
+### `from_smarts(smarts)`
+Convert SMARTS pattern to molecule object.
+
+### `from_selfies(selfies, ...)`
+Convert a SELFIES string to a SMILES string or, with `as_mol=True`, to a molecule object.
+
+### `copy_mol(mol)`
+Create a copy of a molecule object to avoid modifying the original.
+
+## Molecule Export
+
+### `to_smiles(mol, ...)`
+Convert molecule object to SMILES string.
+- **Common parameters**: `canonical=True`, `isomeric=True`
+
+### `to_inchi(mol, ...)`
+Convert molecule to InChI string representation.
+
+### `to_inchikey(mol)`
+Convert molecule to an InChIKey (fixed-length, 27-character hash of the InChI).
+
+### `to_smarts(mol)`
+Convert molecule to SMARTS pattern.
+
+### `to_selfies(mol)`
+Convert molecule to SELFIES (Self-Referencing Embedded Strings) format.
+
+## Sanitization and Standardization
+
+### `sanitize_mol(mol, ...)`
+Enhanced version of RDKit's sanitize operation using mol→SMILES→mol conversion and aromatic nitrogen fixing.
+- **Purpose**: Fix common molecular structure issues
+- **Returns**: Sanitized molecule or None if sanitization fails
+
+### `standardize_mol(mol, disconnect_metals=False, normalize=True, reionize=True, ...)`
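+Apply RDKit-based standardization steps, each controlled by a flag:
+- Metal disconnection (`disconnect_metals`, off by default)
+- Normalization (`normalize`: functional-group and charge corrections)
+- Reionization (`reionize`)
+- Uncharging and stereochemistry assignment (`uncharge`, `stereo`); fragment selection is not part of this function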
+
+### `standardize_smiles(smiles, ...)`
+Apply SMILES standardization procedures directly to a SMILES string.
+
+### `fix_mol(mol)`
+Attempt to fix molecular structure issues automatically.
+
+### `fix_valence(mol)`
+Correct valence errors in molecular structures.
+
+## Molecular Properties
+
+### `reorder_atoms(mol, ...)`
+Ensure consistent atom ordering for the same molecule regardless of original SMILES representation.
+- **Purpose**: Maintain reproducible feature generation
+
+### `remove_hs(mol, ...)`
+Remove hydrogen atoms from molecular structure.
+
+### `add_hs(mol, ...)`
+Add explicit hydrogen atoms to molecular structure.
+
+## Fingerprints and Similarity
+
+### `to_fp(mol, fp_type='ecfp', ...)`
+Generate molecular fingerprints for similarity calculations.
+- **Fingerprint types**:
+  - `'ecfp'` - Extended Connectivity Fingerprints (Morgan)
+  - `'fcfp'` - Functional Connectivity Fingerprints
+  - `'maccs'` - MACCS keys
+  - `'topological'` - Topological fingerprints
+  - `'atompair'` - Atom pair fingerprints
+- **Common parameters**: `n_bits`, `radius`
+- **Returns**: NumPy array by default (`as_array=True`), otherwise an RDKit fingerprint object
+
+### `pdist(mols, ...)`
+Calculate pairwise Tanimoto distances between all molecules in a list.
+- **Supports**: Parallel processing via `n_jobs` parameter
+- **Returns**: Distance matrix (square, n × n, by default; see the `squareform` parameter)
+
+### `cdist(mols1, mols2, ...)`
+Calculate Tanimoto distances between two sets of molecules.
+
+## Clustering and Diversity
+
+### `cluster_mols(mols, cutoff=0.2, feature_fn=None, n_jobs=1)`
+Cluster molecules using Butina clustering algorithm.
+- **Parameters**:
+  - `cutoff`: Distance threshold (default 0.2)
+  - `feature_fn`: Custom function for molecular features
+  - `n_jobs`: Parallelization (-1 for all cores)
+- **Important**: Computes all pairwise distances (n(n-1)/2 for n molecules), so it suits around 1,000 structures, not 10,000+
+- **Returns**: Tuple of (cluster indices, cluster molecules); each cluster is a list of molecule indices or of molecules
+
+### `pick_diverse(mols, npick, ...)`
+Select diverse subset of molecules based on fingerprint diversity.
+
+### `pick_centroids(mols, npick, ...)`
+Select centroid molecules representing clusters.
+
+## Graph Operations
+
+### `to_graph(mol)`
+Convert molecule to graph representation for graph-based analysis.
+
+### `get_all_path_between(mol, atom_idx_1, atom_idx_2, ...)`
+Find all paths between two atoms, given by their atom indices, in a molecular structure.
+
+## DataFrame Integration
+
+### `to_df(mols, smiles_column='smiles', mol_column=None, ...)`
+Convert a list of molecules to a pandas DataFrame, with molecule properties as columns.
+
+### `from_df(df, smiles_column='smiles', mol_column=None, ...)`
+Convert a pandas DataFrame to a list of molecules.
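
Note on the `pdist` and `cluster_mols` entries in `core_api.md` above: the sketch below shows, in plain Python and without datamol or RDKit, roughly what a Tanimoto distance matrix and a Butina-style clustering pass over it look like. It is a simplified illustration under stated assumptions, not datamol's implementation. The helper names (`tanimoto_distance`, `pairwise_distances`, `butina_like_clusters`) and the toy fingerprints are hypothetical.

```python
from itertools import combinations


def tanimoto_distance(a: frozenset, b: frozenset) -> float:
    """Return 1 - |a & b| / |a | b| for two sets of 'on' fingerprint bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two empty fingerprints count as identical
    return 1.0 - len(a & b) / union


def pairwise_distances(fps):
    """Build the full symmetric n x n distance matrix (every pair is compared)."""
    n = len(fps)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        dist[i][j] = dist[j][i] = tanimoto_distance(fps[i], fps[j])
    return dist


def butina_like_clusters(dist, cutoff=0.2):
    """Greedy Butina-style pass: points with the most neighbors become centroids first."""
    n = len(dist)
    # Neighbors of i are all other points within the distance cutoff.
    neighbors = [
        [j for j in range(n) if j != i and dist[i][j] <= cutoff] for i in range(n)
    ]
    # Visit candidates by descending neighbor count, ties broken by index.
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    assigned = set()
    clusters = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + [j for j in neighbors[centroid] if j not in assigned]
        assigned.update(members)
        clusters.append(members)
    return clusters


# Hypothetical toy fingerprints: each set lists the indices of the bits that are on.
fps = [
    frozenset({1, 2, 3, 4, 5}),
    frozenset({1, 2, 3, 4, 5, 6}),
    frozenset({1, 2, 3, 4, 6}),
    frozenset({10, 11, 12}),
]
dist = pairwise_distances(fps)
clusters = butina_like_clusters(dist, cutoff=0.2)
print(clusters)
```

Because every pair is compared, n molecules need n(n-1)/2 distances: 499,500 for 1,000 molecules and 49,995,000 for 10,000. That growth is the scaling concern behind the size warning on `cluster_mols`.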