# Datamol I/O Module Reference

The `datamol.io` module provides comprehensive file handling for molecular data across multiple formats.

## Reading Molecular Files

### `dm.read_sdf(filename, sanitize=True, remove_hs=True, as_df=False, mol_column='mol', ...)`

Read Structure-Data File (SDF) format.

- **Parameters**:
  - `filename`: Path to SDF file (supports local and remote paths via fsspec)
  - `sanitize`: Apply sanitization to molecules
  - `remove_hs`: Remove explicit hydrogens
  - `as_df`: Return a DataFrame (`True`) or a list of molecules (`False`, the default)
  - `mol_column`: Name of molecule column in DataFrame
  - `n_jobs`: Number of parallel jobs
- **Returns**: DataFrame or list of molecules
- **Example**: `df = dm.read_sdf("compounds.sdf", as_df=True)`

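
To illustrate what an SDF file contains and why it maps naturally onto either a list of molecules or DataFrame rows, here is a minimal standard-library sketch. It is not how `dm.read_sdf` is implemented (datamol relies on RDKit for parsing); `split_sdf_records` and `parse_sdf_properties` are hypothetical helper names, and the MOL blocks in the sample are empty placeholders. Each record ends with a `$$$$` line, and data fields appear as `> <NAME>` followed by a value.

```python
from typing import Dict, List


def split_sdf_records(text: str) -> List[str]:
    """Split SDF text into one string per molecule record."""
    records = []
    current = []
    for line in text.splitlines():
        if line.strip() == "$$$$":  # record delimiter in the SDF format
            records.append("\n".join(current))
            current = []
        else:
            current.append(line)
    return records


def parse_sdf_properties(record: str) -> Dict[str, str]:
    """Collect `> <NAME>` data fields from a single SDF record."""
    props = {}
    lines = record.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            name = line[line.index("<") + 1 : line.rindex(">")]
            value_lines = []
            i += 1
            # The field value runs until the next blank line.
            while i < len(lines) and lines[i].strip():
                value_lines.append(lines[i])
                i += 1
            props[name] = "\n".join(value_lines)
        i += 1
    return props


# Two placeholder records (zero-atom MOL blocks, for illustration only).
sample = """ethanol
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ID>
mol-1

$$$$
methane
  sketch

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ID>
mol-2

$$$$
"""

records = split_sdf_records(sample)
rows = [parse_sdf_properties(record) for record in records]
```

Each entry in `rows` corresponds to one molecule record, which is roughly the shape a DataFrame result takes: one row per molecule, one column per data field.
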
### `dm.read_smi(filename, smiles_column='smiles', mol_column='mol', as_df=True, ...)`

Read SMILES file (space-delimited by default).

- **Common format**: SMILES followed by molecule ID/name
- **Example**: `df = dm.read_smi("molecules.smi")`

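
The common `.smi` layout described above can be sketched with plain Python. This only illustrates the line format (a SMILES string, whitespace, then an optional name); it does not build molecule objects, and `parse_smi_line` is a hypothetical helper, not part of datamol.

```python
from typing import Optional, Tuple


def parse_smi_line(line: str) -> Tuple[str, Optional[str]]:
    """Split one .smi line into (smiles, name); name is None if absent."""
    # Split on the first run of whitespace so names may contain spaces.
    parts = line.strip().split(maxsplit=1)
    smiles = parts[0]
    name = parts[1] if len(parts) > 1 else None
    return smiles, name


smi_text = "CCO ethanol\nc1ccccc1 benzene\nC\n"

# Skip blank lines, then parse each remaining line.
entries = [parse_smi_line(line) for line in smi_text.splitlines() if line.strip()]
```
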
### `dm.read_csv(filename, smiles_column='smiles', mol_column=None, ...)`

Read CSV file with optional automatic SMILES-to-molecule conversion.

- **Parameters**:
  - `smiles_column`: Column containing SMILES strings
  - `mol_column`: If specified, creates molecule objects from SMILES column
- **Example**: `df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")`

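
Because `smiles_column` must name an existing column, a mismatch in spelling or case is a common source of errors. The sketch below uses only the standard `csv` module to show the lookup; `extract_smiles` is a hypothetical helper and the data is made up for illustration. It is not datamol's implementation.

```python
import csv
import io
from typing import List, TextIO


def extract_smiles(handle: TextIO, smiles_column: str) -> List[str]:
    """Return the values of the SMILES column from CSV text."""
    reader = csv.DictReader(handle)
    # Column names are matched exactly, so "smiles" != "SMILES".
    if reader.fieldnames is None or smiles_column not in reader.fieldnames:
        raise KeyError(
            f"Column {smiles_column!r} not found; available: {reader.fieldnames}"
        )
    return [row[smiles_column] for row in reader]


csv_text = "ID,SMILES,activity\nmol-1,CCO,0.5\nmol-2,c1ccccc1,1.2\n"

smiles = extract_smiles(io.StringIO(csv_text), smiles_column="SMILES")
```
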
### `dm.read_excel(filename, sheet_name=0, smiles_column='smiles', mol_column=None, ...)`

Read Excel files with molecule handling.

- **Parameters**:
  - `sheet_name`: Sheet to read (index or name)
  - Other parameters similar to `read_csv`
- **Example**: `df = dm.read_excel("compounds.xlsx", sheet_name="Sheet1")`

### `dm.read_molblock(molblock, sanitize=True, remove_hs=True)`

Parse MOL block string (molecular structure text representation).

### `dm.read_mol2file(filename, sanitize=True, remove_hs=True, cleanupSubstructures=True)`

Read Mol2 format files.

### `dm.read_pdbfile(filename, sanitize=True, remove_hs=True, proximityBonding=True)`

Read Protein Data Bank (PDB) format files.

### `dm.read_pdbblock(pdbblock, sanitize=True, remove_hs=True, proximityBonding=True)`

Parse PDB block string.

### `dm.open_df(filename, ...)`

Universal DataFrame reader that automatically detects the format.

- **Supported formats**: CSV, Excel, Parquet, JSON, SDF
- **Example**: `df = dm.open_df("data.csv")` or `df = dm.open_df("molecules.sdf")`

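
One way to picture automatic format detection is a lookup from file extension to reader. The sketch below assumes detection is driven by the extension alone; the `FORMAT_BY_SUFFIX` table and `detect_format` function are hypothetical and only cover the formats listed above.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a format label.
FORMAT_BY_SUFFIX = {
    ".csv": "csv",
    ".xlsx": "excel",
    ".xls": "excel",
    ".parquet": "parquet",
    ".json": "json",
    ".sdf": "sdf",
}


def detect_format(path: str) -> str:
    """Return a format label based on the (case-insensitive) extension."""
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return FORMAT_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


local_format = detect_format("data.csv")
remote_format = detect_format("s3://bucket/compounds.SDF")
```

A dispatcher built this way would then call the matching reader for the detected label and fail early on unknown extensions.
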
## Writing Molecular Files

### `dm.to_sdf(mols, filename, mol_column=None, ...)`

Write molecules to SDF file.

- **Input types**:
  - List or other sequence of molecules
  - DataFrame with molecule column
- **Parameters**:
  - `mol_column`: Column name if input is DataFrame
- **Example**:

```python
import datamol as dm

# from a list of molecules
dm.to_sdf(mols, "output.sdf")
# or from DataFrame
dm.to_sdf(df, "output.sdf", mol_column="mol")
```

### `dm.to_smi(mols, filename, mol_column=None, ...)`

Write molecules to SMILES file with optional validation.

- **Format**: SMILES strings with optional molecule names/IDs

### `dm.to_xlsx(df, filename, mol_columns=None, ...)`

Export DataFrame to Excel with rendered molecular images.

- **Parameters**:
  - `mol_columns`: Columns containing molecules to render as images
- **Special feature**: Automatically renders molecules as images in Excel cells
- **Example**: `dm.to_xlsx(df, "molecules.xlsx", mol_columns=["mol"])`

### `dm.to_molblock(mol, ...)`

Convert molecule to MOL block string.

### `dm.to_pdbblock(mol, ...)`

Convert molecule to PDB block string.

### `dm.save_df(df, filename, ...)`

Save DataFrame in multiple formats (CSV, Excel, Parquet, JSON).

## Remote File Support

All file-based I/O functions accept remote paths through fsspec integration:

- **Supported protocols**: S3 (AWS), GCS (Google Cloud), Azure, HTTP/HTTPS
- **Example**:

```python
import datamol as dm

dm.read_sdf("s3://bucket/compounds.sdf")
dm.read_csv("https://example.com/data.csv")
```

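
The protocol is carried by the prefix of the path, so the same function call works for local and remote files. As a rough illustration of that idea (not fsspec's actual resolution logic, which is richer), the hypothetical `path_protocol` helper below extracts the prefix with the standard library.

```python
from urllib.parse import urlsplit


def path_protocol(path: str) -> str:
    """Return the URL scheme of a path, or 'file' for plain local paths.

    Note: Windows drive letters (e.g. "C:\\data") would be misread as a
    scheme; this sketch ignores that case.
    """
    scheme = urlsplit(path).scheme
    return scheme if scheme else "file"


protocols = [
    path_protocol("s3://bucket/compounds.sdf"),
    path_protocol("https://example.com/data.csv"),
    path_protocol("data/compounds.sdf"),
]
```
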
## Key Parameters Across Functions

- **`sanitize`**: Apply molecule sanitization (default: True)
- **`remove_hs`**: Remove explicit hydrogens (default: True)
- **`as_df`**: Return a DataFrame (`True`) or a list of molecules (`False`)
- **`n_jobs`**: Number of parallel jobs (`-1` = all cores, `1` = sequential)
- **`mol_column`**: Name of molecule column in DataFrames
- **`smiles_column`**: Name of SMILES column in DataFrames