# Datamol I/O Module Reference

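The `datamol.io` module reads and writes molecular data in SDF, SMILES, CSV, Excel, Mol2, and PDB formats, as well as generic DataFrame formats such as Parquet and JSON. Examples assume `import datamol as dm`.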

## Reading Molecular Files

### `dm.read_sdf(urlpath, sanitize=True, as_df=False, mol_column=None, remove_hs=True, n_jobs=1, ...)`
Read a Structure-Data File (SDF).
- **Parameters**:
  - `urlpath`: Path to the SDF file (local or remote, resolved via fsspec)
  - `sanitize`: Sanitize each molecule after parsing
  - `remove_hs`: Remove explicit hydrogens
  - `as_df`: Return a DataFrame (True) or a list of molecules (False, the default)
  - `mol_column`: Name of the molecule column when `as_df=True`; if left as `None`, no molecule column is added
  - `n_jobs`: Number of parallel jobs for the DataFrame conversion (1 = sequential, -1 = all cores)
- **Returns**: List of molecules by default, or a DataFrame when `as_df=True`
- **Example**: `df = dm.read_sdf("compounds.sdf", as_df=True, mol_column="mol")`

### `dm.read_smi(urlpath)`
Read a SMILES (`.smi`) file and return a list of molecules.
- **Common format**: One molecule per line, with the SMILES string first, optionally followed by a whitespace-separated molecule ID or name
- **Example**: `mols = dm.read_smi("molecules.smi")`
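
The file layout described above can be illustrated with the standard library alone. The sketch below writes a two-line `.smi` file and splits each line into a SMILES string and an optional name; `parse_smi_line` is a hypothetical helper for illustration and is not part of datamol.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def parse_smi_line(line: str) -> Tuple[str, Optional[str]]:
    """Split one .smi line into (smiles, name); name is None when absent."""
    parts = line.strip().split(maxsplit=1)
    smiles = parts[0]
    name = parts[1] if len(parts) > 1 else None
    return smiles, name


# Write a small SMILES file: one molecule per line, SMILES first, then a name.
smi_path = Path(tempfile.mkdtemp()) / "molecules.smi"
smi_path.write_text("CCO ethanol\nc1ccccc1 benzene\n")

# Parse every non-empty line back into (smiles, name) pairs.
records = [
    parse_smi_line(line)
    for line in smi_path.read_text().splitlines()
    if line.strip()
]
print(records)
```

A file written this way can then be handed to `dm.read_smi` to obtain molecule objects.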

### `dm.read_csv(urlpath, smiles_column=None, mol_column='mol', ...)`
Read a CSV file, optionally building molecule objects from a SMILES column. Extra keyword arguments are passed to `pandas.read_csv`.
- **Parameters**:
  - `smiles_column`: Column containing SMILES strings; when set, molecules are built from it
  - `mol_column`: Name of the molecule column that is created
- **Example**: `df = dm.read_csv("data.csv", smiles_column="SMILES", mol_column="mol")`

### `dm.read_excel(urlpath, sheet_name=0, smiles_column=None, mol_column='mol', ...)`
Read an Excel file, optionally building molecule objects from a SMILES column.
- **Parameters**:
  - `sheet_name`: Sheet to read (index or name)
  - `smiles_column`, `mol_column`: Same meaning as in `read_csv`
- **Example**: `df = dm.read_excel("compounds.xlsx", sheet_name="Sheet1")`

### `dm.read_molblock(molblock, sanitize=True, remove_hs=True)`
Parse a MOL block string (the text representation of a single molecular structure) into a molecule.

### `dm.read_mol2file(urlpath, sanitize=True, cleanup_substructures=True, remove_hs=True)`
Read a Mol2 file.

### `dm.read_pdbfile(urlpath, sanitize=True, remove_hs=True, proximity_bonding=True)`
Read a Protein Data Bank (PDB) file.

### `dm.read_pdbblock(pdbblock, sanitize=True, remove_hs=True, proximity_bonding=True)`
Parse a PDB block string into a molecule.

### `dm.open_df(urlpath, ...)`
Generic DataFrame reader that selects the parser from the file extension.
- **Supported formats**: CSV, Excel, Parquet, JSON, SDF
- **Example**: `df = dm.open_df("data.csv")` or `df = dm.open_df("molecules.sdf")`
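
Extension-based detection of this kind can be sketched as a lookup on the file suffix. This only illustrates the idea and is not datamol's implementation; `guess_format` and the suffix table are assumptions made for the example.

```python
from pathlib import PurePosixPath

# Assumed suffix table covering the formats listed above.
SUFFIX_TO_FORMAT = {
    ".csv": "csv",
    ".xlsx": "excel",
    ".parquet": "parquet",
    ".json": "json",
    ".sdf": "sdf",
}


def guess_format(path: str) -> str:
    """Return a format label for `path` based on its final suffix."""
    suffix = PurePosixPath(path).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"Unsupported file extension: {suffix!r}") from None


# The lookup is case-insensitive and also works on URL-style paths.
for name in ["data.csv", "molecules.SDF", "s3://bucket/table.parquet"]:
    print(name, "->", guess_format(name))
```

A lookup like this fails fast on unknown extensions, which is usually preferable to guessing a parser from file contents.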

## Writing Molecular Files

### `dm.to_sdf(mols, urlpath, mol_column=None, ...)`
Write molecules to an SDF file.
- **Input types**:
  - Sequence of molecules (for example a list)
  - DataFrame with a molecule column
- **Parameters**:
  - `mol_column`: Name of the molecule column when the input is a DataFrame
- **Example**:
  ```python
  dm.to_sdf(mols, "output.sdf")
  # or from DataFrame
  dm.to_sdf(df, "output.sdf", mol_column="mol")
  ```
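
After writing, a quick sanity check is to count the records in the file. In the SDF format each record ends with a line containing `$$$$`, so the count needs only the standard library. `count_sdf_records` is a hypothetical helper, and the sample text below is a stripped-down stand-in for real SDF content.

```python
def count_sdf_records(sdf_text: str) -> int:
    """Count SDF records by their `$$$$` terminator lines."""
    return sum(1 for line in sdf_text.splitlines() if line.strip() == "$$$$")


# Stand-in text: the molecule blocks are elided, only the delimiters matter here.
sample = "mol-1\n...\nM  END\n$$$$\nmol-2\n...\nM  END\n$$$$\n"
print(count_sdf_records(sample))
```

For a file produced by `dm.to_sdf`, the count would be expected to match the number of molecules passed in, assuming none were skipped during writing.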

### `dm.to_smi(mols, urlpath, error_if_empty=False)`
Write molecules to a SMILES file. Set `error_if_empty=True` to raise an error when there are no molecules to write.
- **Format**: SMILES strings with optional molecule names/IDs

### `dm.to_xlsx(mols, urlpath, smiles_column='smiles', mol_column='mol', ...)`
Export molecules or a DataFrame to an Excel file with rendered molecule images.
- **Parameters**:
  - `mol_column`: Column containing the molecules to render as images
- **Special feature**: Molecules in `mol_column` are drawn as images inside the Excel cells
- **Example**: `dm.to_xlsx(df, "molecules.xlsx", mol_column="mol")`

### `dm.to_molblock(mol, ...)`
Convert molecule to MOL block string.

### `dm.to_pdbblock(mol, ...)`
Convert molecule to PDB block string.

### `dm.save_df(data, urlpath, ...)`
Save a DataFrame as CSV, Excel, Parquet, or JSON; the format is chosen from the file extension.

## Remote File Support

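The file-based I/O functions accept remote paths through fsspec, provided the matching fsspec backend is installed. The block parsers (`read_molblock`, `read_pdbblock`) take strings rather than paths.
- **Supported protocols**: S3 (AWS), GCS (Google Cloud), Azure, HTTP/HTTPS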
- **Example**:
  ```python
  dm.read_sdf("s3://bucket/compounds.sdf")
  dm.read_csv("https://example.com/data.csv")
  ```
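
When paths come from user input, it can help to tell remote and local paths apart before reading. The sketch below does this with the standard library; `is_remote` and the scheme set are assumptions for illustration, and which schemes actually resolve depends on the fsspec backends installed (for example `s3fs`, `gcsfs`, `adlfs`).

```python
from urllib.parse import urlparse

# Assumed scheme names for the protocols listed above.
REMOTE_SCHEMES = {"s3", "gs", "gcs", "az", "abfs", "http", "https"}


def is_remote(path: str) -> bool:
    """Return True when `path` starts with one of the assumed remote schemes."""
    return urlparse(path).scheme.lower() in REMOTE_SCHEMES


# Classify a few example paths before handing them to a reader.
for p in ["s3://bucket/compounds.sdf", "https://example.com/data.csv", "compounds.sdf"]:
    print(p, "->", "remote" if is_remote(p) else "local")
```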

## Key Parameters Across Functions

- **`sanitize`**: Apply molecule sanitization (default: True)
- **`remove_hs`**: Remove explicit hydrogens (default: True)
- **`as_df`**: Return a DataFrame instead of a list of molecules (default: False in `read_sdf`)
- **`n_jobs`**: Number of parallel jobs (1 = sequential, -1 = all cores)
- **`mol_column`**: Name of molecule column in DataFrames
- **`smiles_column`**: Name of SMILES column in DataFrames
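
These parameters are typically set together. The sketch below collects one combination in a plain dictionary and forwards it to `dm.read_sdf`; `sdf_read_options` and `load_sdf_table` are hypothetical helper names, and datamol is imported inside the loader so the options can be built and inspected without it installed.

```python
from typing import Any, Dict


def sdf_read_options(parallel: bool = False) -> Dict[str, Any]:
    """Build keyword arguments for reading an SDF file into a DataFrame."""
    return {
        "sanitize": True,      # sanitize each parsed molecule
        "remove_hs": True,     # drop explicit hydrogens
        "as_df": True,         # return a DataFrame rather than a list
        "mol_column": "mol",   # name of the molecule column
        "n_jobs": -1 if parallel else 1,  # -1 = all cores, 1 = sequential
    }


def load_sdf_table(path: str, parallel: bool = False):
    """Read `path` with the options above (requires datamol to be installed)."""
    import datamol as dm  # imported lazily; only needed when a file is read

    return dm.read_sdf(path, **sdf_read_options(parallel))


print(sdf_read_options(parallel=True))
```

Keeping the options in one place makes it easier to apply the same settings across several files, for example `load_sdf_table("compounds.sdf", parallel=True)`.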