# Chemistry and Molecular File Formats Reference

This reference covers file formats commonly used in computational chemistry, cheminformatics, molecular modeling, and related fields.

## Structure File Formats

### .pdb - Protein Data Bank

**Description:** Standard format for 3D structures of biological macromolecules

**Typical Data:** Atomic coordinates, residue information, secondary structure, crystal structure data

**Use Cases:** Protein structure analysis, molecular visualization, docking studies

**Python Libraries:**
- `Biopython`: `Bio.PDB`
- `MDAnalysis`: `MDAnalysis.Universe('file.pdb')`
- `PyMOL`: `pymol.cmd.load('file.pdb')`
- `ProDy`: `prody.parsePDB('file.pdb')`

**EDA Approach:**
- Structure validation (bond lengths, angles, clashes)
- Secondary structure analysis
- B-factor distribution
- Missing residues/atoms detection
- Ramachandran plots for validation
- Surface area and volume calculations

### .cif - Crystallographic Information File

**Description:** Structured data format for crystallographic information

**Typical Data:** Unit cell parameters, atomic coordinates, symmetry operations, experimental data

**Use Cases:** Crystal structure determination, structural biology, materials science

**Python Libraries:**
- `gemmi`: `gemmi.cif.read_file('file.cif')`
- `PyCifRW`: `CifFile.ReadCif('file.cif')`
- `Biopython`: `Bio.PDB.MMCIFParser()`

**EDA Approach:**
- Data completeness check
- Resolution and quality metrics
- Unit cell parameter analysis
- Symmetry group validation
- Atomic displacement parameters
- R-factors and validation metrics

### .mol - MDL Molfile

**Description:** Chemical structure file format by MDL/Accelrys

**Typical Data:** 2D/3D coordinates, atom types, bond orders, charges

**Use Cases:** Chemical database storage, cheminformatics, drug design

**Python Libraries:**
- `RDKit`: `Chem.MolFromMolFile('file.mol')`
- `Open Babel`: `pybel.readfile('mol', 'file.mol')`
- `ChemoPy`: For descriptor calculation

**EDA Approach:**
- Molecular property calculation (MW, logP, TPSA)
- Functional group analysis
- Ring system detection
- Stereochemistry validation
- 2D/3D coordinate consistency
- Valence and charge validation

### .mol2 - Tripos Mol2

**Description:** Complete 3D molecular structure format with atom typing

**Typical Data:** Coordinates, SYBYL atom types, bond types, charges, substructures

**Use Cases:** Molecular docking, QSAR studies, drug discovery

**Python Libraries:**
- `RDKit`: `Chem.MolFromMol2File('file.mol2')`
- `Open Babel`: `pybel.readfile('mol2', 'file.mol2')`
- `MDAnalysis`: Can parse mol2 topology

**EDA Approach:**
- Atom type distribution
- Partial charge analysis
- Bond type statistics
- Substructure identification
- Conformational analysis
- Energy minimization status check
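To make the descriptor-oriented EDA for .mol/.mol2 files concrete, here is a minimal RDKit sketch; the filename `example.mol` is a placeholder, and the descriptor set is only a starting point.

```python
# Minimal .mol EDA sketch: parse a molfile with RDKit and compute the
# descriptors mentioned above (MW, logP, TPSA) plus a ring count.
from rdkit import Chem
from rdkit.Chem import Descriptors, rdMolDescriptors

mol = Chem.MolFromMolFile("example.mol")  # placeholder filename
if mol is None:
    raise ValueError("RDKit could not parse the molfile")

print("Molecular weight:", Descriptors.MolWt(mol))
print("logP (Crippen):", Descriptors.MolLogP(mol))
print("TPSA:", rdMolDescriptors.CalcTPSA(mol))
print("Rings:", rdMolDescriptors.CalcNumRings(mol))
```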
### .sdf - Structure Data File

**Description:** Multi-structure file format with associated data

**Typical Data:** Multiple molecular structures with properties/annotations

**Use Cases:** Chemical databases, virtual screening, compound libraries

**Python Libraries:**
- `RDKit`: `Chem.SDMolSupplier('file.sdf')`
- `Open Babel`: `pybel.readfile('sdf', 'file.sdf')`
- `PandasTools` (RDKit): For DataFrame integration

**EDA Approach:**
- Dataset size and diversity metrics
- Property distribution analysis (MW, logP, etc.)
- Structural diversity (Tanimoto similarity)
- Missing data assessment
- Outlier detection in properties
- Scaffold analysis

### .xyz - XYZ Coordinates

**Description:** Simple Cartesian coordinate format

**Typical Data:** Atom types and 3D coordinates

**Use Cases:** Quantum chemistry, geometry optimization, molecular dynamics

**Python Libraries:**
- `ASE`: `ase.io.read('file.xyz')`
- `Open Babel`: `pybel.readfile('xyz', 'file.xyz')`
- `cclib`: For parsing QM outputs with xyz

**EDA Approach:**
- Geometry analysis (bond lengths, angles, dihedrals)
- Center of mass calculation
- Moment of inertia
- Molecular size metrics
- Coordinate validation
- Symmetry detection

### .smi / .smiles - SMILES String

**Description:** Line notation for chemical structures

**Typical Data:** Text representation of molecular structure

**Use Cases:** Chemical databases, literature mining, data exchange

**Python Libraries:**
- `RDKit`: `Chem.MolFromSmiles(smiles)`
- `Open Babel`: Can parse SMILES
- `DeepChem`: For ML on SMILES

**EDA Approach:**
- SMILES syntax validation
- Descriptor calculation from SMILES
- Fingerprint generation
- Substructure searching
- Tautomer enumeration
- Stereoisomer handling

### .pdbqt - AutoDock PDBQT

**Description:** Modified PDB format for AutoDock docking

**Typical Data:** Coordinates, partial charges, atom types for docking

**Use Cases:** Molecular docking, virtual screening

**Python Libraries:**
- `Meeko`: For PDBQT preparation
- `Open Babel`: Can read PDBQT
- `ProDy`: Limited PDBQT support

**EDA Approach:**
- Charge distribution analysis
- Rotatable bond identification
- Atom type validation
- Coordinate quality check
- Hydrogen placement validation
- Torsion definition analysis

### .mae - Maestro Format

**Description:** Schrödinger's proprietary molecular structure format

**Typical Data:** Structures, properties, annotations from the Schrödinger suite

**Use Cases:** Drug discovery, molecular modeling with Schrödinger tools

**Python Libraries:**
- `schrodinger.structure`: Requires a Schrödinger installation
- Custom parsers for basic reading

**EDA Approach:**
- Property extraction and analysis
- Structure quality metrics
- Conformer analysis
- Docking score distributions
- Ligand efficiency metrics

### .gro - GROMACS Coordinate File

**Description:** Molecular structure file for GROMACS MD simulations

**Typical Data:** Atom positions, velocities, box vectors

**Use Cases:** Molecular dynamics simulations, GROMACS workflows

**Python Libraries:**
- `MDAnalysis`: `Universe('file.gro')`
- `MDTraj`: `mdtraj.load_gro('file.gro')`
- `GromacsWrapper`: For GROMACS integration

**EDA Approach:**
- System composition analysis
- Box dimension validation
- Atom position distribution
- Velocity distribution (if present)
- Density calculation
- Solvation analysis

## Computational Chemistry Output Formats

### .log - Gaussian Log File

**Description:** Output from Gaussian quantum chemistry calculations

**Typical Data:** Energies, geometries, frequencies, orbitals, populations

**Use Cases:** QM calculations, geometry optimization, frequency analysis

**Python Libraries:**
- `cclib`: `cclib.io.ccread('file.log')`
- `GaussianRunPack`: For Gaussian workflows
- Custom parsers with regex

**EDA Approach:**
- Convergence analysis
- Energy profile extraction
- Vibrational frequency analysis
- Orbital energy levels
- Population analysis (Mulliken, NBO)
- Thermochemistry data extraction
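As a rough sketch of the Gaussian log EDA above, cclib can pull energies and frequencies directly; `opt_freq.log` is a hypothetical filename, and the attributes shown exist only if the job produced them.

```python
# Extract convergence, energy, and frequency data from a Gaussian log
# file with cclib. SCF energies are reported in eV by cclib.
import cclib

data = cclib.io.ccread("opt_freq.log")  # hypothetical filename
print("Optimization converged:", getattr(data, "optdone", "n/a"))
print("Final SCF energy (eV):", data.scfenergies[-1])
if hasattr(data, "vibfreqs"):
    n_imaginary = sum(1 for f in data.vibfreqs if f < 0)
    print("Imaginary frequencies:", n_imaginary)
```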
### .out - Quantum Chemistry Output

**Description:** Generic output file from various QM packages

**Typical Data:** Calculation results, energies, properties

**Use Cases:** QM calculations across different software

**Python Libraries:**
- `cclib`: Universal parser for QM outputs
- `ASE`: Can read some output formats

**EDA Approach:**
- Software-specific parsing
- Convergence criteria check
- Energy and gradient trends
- Basis set and method validation
- Computational cost analysis

### .wfn / .wfx - Wavefunction Files

**Description:** Wavefunction data for quantum chemical analysis

**Typical Data:** Molecular orbitals, basis sets, density matrices

**Use Cases:** Electron density analysis, QTAIM analysis

**Python Libraries:**
- `Multiwfn`: Interface via Python
- `Horton`: For wavefunction analysis
- Custom parsers for specific formats

**EDA Approach:**
- Orbital population analysis
- Electron density distribution
- Critical point analysis (QTAIM)
- Molecular orbital visualization
- Bonding analysis

### .fchk - Gaussian Formatted Checkpoint

**Description:** Formatted checkpoint file from Gaussian

**Typical Data:** Complete wavefunction data, results, geometry

**Use Cases:** Post-processing Gaussian calculations

**Python Libraries:**
- `cclib`: Can parse fchk files
- `GaussView` Python API (if available)
- Custom parsers

**EDA Approach:**
- Wavefunction quality assessment
- Property extraction
- Basis set information
- Gradient and Hessian analysis
- Natural orbital analysis

### .cube - Gaussian Cube File

**Description:** Volumetric data on a 3D grid

**Typical Data:** Electron density, molecular orbitals, ESP on grid

**Use Cases:** Visualization of volumetric properties

**Python Libraries:**
- `cclib`: `cclib.io.ccread('file.cube')`
- `ase.io`: `ase.io.read('file.cube')`
- `pyquante`: For cube file manipulation

**EDA Approach:**
- Grid dimension and spacing analysis
- Value distribution statistics
- Isosurface value determination
- Integration over volume
- Comparison between different cubes

## Molecular Dynamics Formats

### .dcd - Binary Trajectory

**Description:** Binary trajectory format (CHARMM, NAMD)

**Typical Data:** Time series of atomic coordinates

**Use Cases:** MD trajectory analysis

**Python Libraries:**
- `MDAnalysis`: `Universe(topology, 'traj.dcd')`
- `MDTraj`: `mdtraj.load_dcd('traj.dcd', top='topology.pdb')`
- `PyTraj` (Amber): Limited support

**EDA Approach:**
- RMSD/RMSF analysis
- Trajectory length and frame count
- Coordinate range and drift
- Periodic boundary handling
- File integrity check
- Time step validation

### .xtc - Compressed Trajectory

**Description:** GROMACS compressed trajectory format

**Typical Data:** Compressed coordinates from MD simulations

**Use Cases:** Space-efficient MD trajectory storage

**Python Libraries:**
- `MDAnalysis`: `Universe(topology, 'traj.xtc')`
- `MDTraj`: `mdtraj.load_xtc('traj.xtc', top='topology.pdb')`

**EDA Approach:**
- Compression ratio assessment
- Precision loss evaluation
- RMSD over time
- Structural stability metrics
- Sampling frequency analysis

### .trr - GROMACS Trajectory

**Description:** Full precision GROMACS trajectory

**Typical Data:** Coordinates, velocities, forces from MD

**Use Cases:** High-precision MD analysis

**Python Libraries:**
- `MDAnalysis`: Full support
- `MDTraj`: Can read trr files
- `GromacsWrapper`

**EDA Approach:**
- Full system dynamics analysis
- Energy conservation check (with velocities)
- Force analysis
- Temperature and pressure validation
- System equilibration assessment
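The trajectory EDA steps above (frame counts, RMSD over time) follow the same pattern regardless of format. A minimal MDAnalysis sketch, assuming a GROMACS topology/trajectory pair with placeholder names:

```python
# Basic trajectory EDA with MDAnalysis: system size, frame count, and
# backbone RMSD over time (the same code works for .dcd, .xtc, or .trr).
import MDAnalysis as mda
from MDAnalysis.analysis import rms

u = mda.Universe("system.gro", "traj.xtc")  # placeholder filenames
print("Atoms:", len(u.atoms), "| Frames:", len(u.trajectory))

rmsd = rms.RMSD(u, select="backbone").run()
# results.rmsd columns: frame index, time (ps), RMSD (Angstrom)
print(rmsd.results.rmsd[:5])
```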
### .nc / .netcdf - Amber NetCDF Trajectory

**Description:** Network Common Data Form trajectory

**Typical Data:** MD coordinates, velocities, forces

**Use Cases:** Amber MD simulations, large trajectory storage

**Python Libraries:**
- `MDAnalysis`: NetCDF support
- `PyTraj`: Native Amber analysis
- `netCDF4`: Low-level access

**EDA Approach:**
- Metadata extraction
- Trajectory statistics
- Time series analysis
- Replica exchange analysis
- Multi-dimensional data extraction

### .top - GROMACS Topology

**Description:** Molecular topology for GROMACS

**Typical Data:** Atom types, bonds, angles, force field parameters

**Use Cases:** MD simulation setup and analysis

**Python Libraries:**
- `ParmEd`: `parmed.load_file('system.top')`
- `MDAnalysis`: Can parse topology
- Custom parsers for specific fields

**EDA Approach:**
- Force field parameter validation
- System composition
- Bond/angle/dihedral distribution
- Charge neutrality check
- Molecule type enumeration

### .psf - Protein Structure File (CHARMM)

**Description:** Topology file for CHARMM/NAMD

**Typical Data:** Atom connectivity, types, charges

**Use Cases:** CHARMM/NAMD MD simulations

**Python Libraries:**
- `MDAnalysis`: Native PSF support
- `ParmEd`: Can read PSF files

**EDA Approach:**
- Topology validation
- Connectivity analysis
- Charge distribution
- Atom type statistics
- Segment analysis

### .prmtop - Amber Parameter/Topology

**Description:** Amber topology and parameter file

**Typical Data:** System topology, force field parameters

**Use Cases:** Amber MD simulations

**Python Libraries:**
- `ParmEd`: `parmed.load_file('system.prmtop')`
- `PyTraj`: Native Amber support

**EDA Approach:**
- Force field completeness
- Parameter validation
- System size and composition
- Periodic box information
- Atom mask creation for analysis

### .inpcrd / .rst7 - Amber Coordinates

**Description:** Amber coordinate/restart file

**Typical Data:** Atomic coordinates, velocities, box info

**Use Cases:** Starting coordinates for Amber MD

**Python Libraries:**
- `ParmEd`: Works with prmtop
- `PyTraj`: Amber coordinate reading

**EDA Approach:**
- Coordinate validity
- System initialization check
- Box vector validation
- Velocity distribution (if restart)
- Energy minimization status

## Spectroscopy and Analytical Data

### .jcamp / .jdx - JCAMP-DX

**Description:** Data exchange format from the Joint Committee on Atomic and Molecular Physical Data (JCAMP-DX)

**Typical Data:** Spectroscopic data (IR, NMR, MS, UV-Vis)

**Use Cases:** Spectroscopy data exchange and archiving

**Python Libraries:**
- `jcamp`: `jcamp.jcamp_reader('file.jdx')`
- `nmrglue`: For NMR JCAMP files
- Custom parsers for specific subtypes

**EDA Approach:**
- Peak detection and analysis
- Baseline correction assessment
- Signal-to-noise calculation
- Spectral range validation
- Integration analysis
- Comparison with reference spectra

### .mzML - Mass Spectrometry Markup Language

**Description:** Standard XML format for mass spectrometry data

**Typical Data:** MS/MS spectra, chromatograms, metadata

**Use Cases:** Proteomics, metabolomics, mass spectrometry workflows

**Python Libraries:**
- `pymzml`: `pymzml.run.Reader('file.mzML')`
- `pyteomics`: `pyteomics.mzml.read('file.mzML')`
- `MSFileReader` wrappers

**EDA Approach:**
- Scan count and types
- MS level distribution
- Retention time range
- m/z range and resolution
- Peak intensity distribution
- Data completeness
- Quality control metrics
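A quick way to get the scan-level statistics listed above is to stream the file with pyteomics; `run.mzML` is a placeholder name, and pymzml offers an equivalent iterator interface.

```python
# Quick mzML survey: count scans per MS level and track the overall
# m/z range while streaming spectra with pyteomics.
from collections import Counter
from pyteomics import mzml

levels = Counter()
mz_min, mz_max = float("inf"), float("-inf")
with mzml.read("run.mzML") as reader:  # placeholder filename
    for spectrum in reader:
        levels[spectrum.get("ms level")] += 1
        mz = spectrum["m/z array"]
        if len(mz):
            mz_min = min(mz_min, mz.min())
            mz_max = max(mz_max, mz.max())

print("Scans per MS level:", dict(levels))
print("m/z range:", (mz_min, mz_max))
```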
### .mzXML - Mass Spectrometry XML

**Description:** Open XML format for MS data

**Typical Data:** Mass spectra, retention times, peak lists

**Use Cases:** Legacy MS data, metabolomics

**Python Libraries:**
- `pymzml`: Can read mzXML
- `pyteomics.mzxml`
- `lxml` for direct XML parsing

**EDA Approach:**
- Similar to mzML
- Version compatibility check
- Conversion quality assessment
- Peak picking validation

### .raw - Vendor Raw Data

**Description:** Proprietary instrument data files (Thermo, Bruker, etc.)

**Typical Data:** Raw instrument signals, unprocessed data

**Use Cases:** Direct instrument data access

**Python Libraries:**
- `pymsfilereader`: For Thermo RAW files
- `ThermoRawFileParser`: CLI wrapper
- Vendor-specific APIs (Thermo, Bruker Compass)

**EDA Approach:**
- Instrument method extraction
- Raw signal quality
- Calibration status
- Scan function analysis
- Chromatographic quality metrics

### .d - Agilent Data Directory

**Description:** Agilent's data folder structure

**Typical Data:** LC-MS, GC-MS data and metadata

**Use Cases:** Agilent instrument data processing

**Python Libraries:**
- `agilent-reader`: Community tools
- `Chemstation` Python integration
- Custom directory parsing

**EDA Approach:**
- Directory structure validation
- Method parameter extraction
- Signal file integrity
- Calibration curve analysis
- Sequence information extraction

### .fid - NMR Free Induction Decay

**Description:** Raw NMR time-domain data

**Typical Data:** Time-domain NMR signal

**Use Cases:** NMR processing and analysis

**Python Libraries:**
- `nmrglue`: `nmrglue.bruker.read_fid('fid')`
- `nmrstarlib`: For NMR-STAR files

**EDA Approach:**
- Signal decay analysis
- Noise level assessment
- Acquisition parameter validation
- Apodization function selection
- Zero-filling optimization
- Phasing parameter estimation

### .ft - NMR Frequency-Domain Data

**Description:** Processed NMR spectrum

**Typical Data:** Frequency-domain NMR data

**Use Cases:** NMR analysis and interpretation

**Python Libraries:**
- `nmrglue`: Comprehensive NMR support
- `pyNMR`: For processing

**EDA Approach:**
- Peak picking and integration
- Chemical shift calibration
- Multiplicity analysis
- Coupling constant extraction
- Spectral quality metrics
- Reference compound identification

### .spc - Spectroscopy File

**Description:** Thermo Galactic spectroscopy format

**Typical Data:** IR, Raman, UV-Vis spectra

**Use Cases:** Spectroscopic data from various instruments

**Python Libraries:**
- `spc`: `spc.File('file.spc')`
- Custom parsers for binary format

**EDA Approach:**
- Spectral resolution
- Wavelength/wavenumber range
- Baseline characterization
- Peak identification
- Derivative spectra calculation

## Chemical Database Formats

### .inchi - International Chemical Identifier

**Description:** Text identifier for chemical substances

**Typical Data:** Layered chemical structure representation

**Use Cases:** Chemical database keys, structure searching

**Python Libraries:**
- `RDKit`: `Chem.MolFromInchi(inchi)`
- `Open Babel`: InChI conversion

**EDA Approach:**
- InChI validation
- Layer analysis
- Stereochemistry verification
- InChI key generation
- Structure round-trip validation

### .cdx / .cdxml - ChemDraw Exchange

**Description:** ChemDraw drawing file format

**Typical Data:** 2D chemical structures with annotations

**Use Cases:** Chemical drawing, publication figures

**Python Libraries:**
- `RDKit`: Can import some CDXML
- `Open Babel`: Limited support
- `ChemDraw` Python API (commercial)

**EDA Approach:**
- Structure extraction
- Annotation preservation
- Style consistency
- 2D coordinate validation
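To illustrate the InChI key generation and round-trip validation listed under .inchi above, a short RDKit sketch (an aspirin SMILES is used as a stand-in input):

```python
# InChI round-trip check: molecule -> InChI -> molecule, compared via
# canonical SMILES, plus InChIKey generation.
from rdkit import Chem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # example input (aspirin)

inchi = Chem.MolToInchi(mol)
inchi_key = Chem.InchiToInchiKey(inchi)
roundtrip = Chem.MolFromInchi(inchi)

print("InChI:", inchi)
print("InChIKey:", inchi_key)
print("Round-trip OK:", Chem.MolToSmiles(roundtrip) == Chem.MolToSmiles(mol))
```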
### .cml - Chemical Markup Language

**Description:** XML-based chemical structure format

**Typical Data:** Chemical structures, reactions, properties

**Use Cases:** Semantic chemical data representation

**Python Libraries:**
- `RDKit`: CML support
- `Open Babel`: Good CML support
- `lxml`: For XML parsing

**EDA Approach:**
- XML schema validation
- Namespace handling
- Property extraction
- Reaction scheme analysis
- Metadata completeness

### .rxn - MDL Reaction File

**Description:** Chemical reaction structure file

**Typical Data:** Reactants, products, reaction arrows

**Use Cases:** Reaction databases, synthesis planning

**Python Libraries:**
- `RDKit`: `AllChem.ReactionFromRxnFile('file.rxn')`
- `Open Babel`: Reaction support

**EDA Approach:**
- Reaction balancing validation
- Atom mapping analysis
- Reagent identification
- Stereochemistry changes
- Reaction classification

### .rdf - Reaction Data File

**Description:** Multi-reaction file format

**Typical Data:** Multiple reactions with data

**Use Cases:** Reaction databases

**Python Libraries:**
- `RDKit`: RDF reading capabilities
- Custom parsers

**EDA Approach:**
- Reaction yield statistics
- Condition analysis
- Success rate patterns
- Reagent frequency analysis

## Computational Output and Data

### .hdf5 / .h5 - Hierarchical Data Format

**Description:** Container for scientific data arrays

**Typical Data:** Large arrays, metadata, hierarchical organization

**Use Cases:** Large dataset storage, computational results

**Python Libraries:**
- `h5py`: `h5py.File('file.h5', 'r')`
- `pytables`: Advanced HDF5 interface
- `pandas`: Can read HDF5

**EDA Approach:**
- Dataset structure exploration
- Array shape and dtype analysis
- Metadata extraction
- Memory-efficient data sampling
- Chunk optimization analysis
- Compression ratio assessment

### .pkl / .pickle - Python Pickle

**Description:** Serialized Python objects

**Typical Data:** Any Python object (molecules, dataframes, models)

**Use Cases:** Intermediate data storage, model persistence

**Python Libraries:**
- `pickle`: Built-in serialization
- `joblib`: Enhanced pickling for large arrays
- `dill`: Extended pickle support

**EDA Approach:**
- Object type inspection
- Size and complexity analysis
- Version compatibility check
- Security validation (only unpickle data from trusted sources)
- Deserialization testing

### .npy / .npz - NumPy Arrays

**Description:** NumPy array binary format

**Typical Data:** Numerical arrays (coordinates, features, matrices)

**Use Cases:** Fast numerical data I/O

**Python Libraries:**
- `numpy`: `np.load('file.npy')`
- Direct memory mapping for large files

**EDA Approach:**
- Array shape and dimensions
- Data type and precision
- Statistical summary (mean, std, range)
- Missing value detection
- Outlier identification
- Memory footprint analysis

### .mat - MATLAB Data File

**Description:** MATLAB workspace data

**Typical Data:** Arrays, structures from MATLAB

**Use Cases:** MATLAB-Python data exchange

**Python Libraries:**
- `scipy.io`: `scipy.io.loadmat('file.mat')`
- `h5py`: For v7.3 MAT files

**EDA Approach:**
- Variable extraction and types
- Array dimension analysis
- Structure field exploration
- MATLAB version compatibility
- Data type conversion validation

### .csv - Comma-Separated Values

**Description:** Tabular data in text format

**Typical Data:** Chemical properties, experimental data, descriptors

**Use Cases:** Data exchange, analysis, machine learning

**Python Libraries:**
- `pandas`: `pd.read_csv('file.csv')`
- `csv`: Built-in module
- `polars`: Fast CSV reading

**EDA Approach:**
- Data type inference
- Missing value patterns
- Statistical summaries
- Correlation analysis
- Distribution visualization
- Outlier detection
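For tabular formats like CSV, the first-pass EDA above usually reduces to a few pandas calls; `descriptors.csv` is a placeholder filename.

```python
# First-pass CSV EDA with pandas: shape, dtypes, missing-value
# fractions, and per-column summary statistics.
import pandas as pd

df = pd.read_csv("descriptors.csv")  # placeholder filename
print(df.shape)
print(df.dtypes)
print(df.isna().mean().sort_values(ascending=False).head())
print(df.describe().T)
```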
### .json - JavaScript Object Notation

**Description:** Structured text data format

**Typical Data:** Chemical properties, metadata, API responses

**Use Cases:** Data interchange, configuration, web APIs

**Python Libraries:**
- `json`: Built-in JSON support
- `pandas`: `pd.read_json()`
- `ujson`: Faster JSON parsing

**EDA Approach:**
- Schema validation
- Nesting depth analysis
- Key-value distribution
- Data type consistency
- Array length statistics

### .parquet - Apache Parquet

**Description:** Columnar storage format

**Typical Data:** Large tabular datasets stored column-wise for efficient access

**Use Cases:** Big data, efficient columnar analytics

**Python Libraries:**
- `pandas`: `pd.read_parquet('file.parquet')`
- `pyarrow`: Direct Parquet access
- `fastparquet`: Alternative implementation

**EDA Approach:**
- Column statistics from metadata
- Partition analysis
- Compression efficiency
- Row group structure
- Fast sampling for large files
- Schema evolution tracking
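One advantage of Parquet for the EDA listed above is that row counts, schema, and row-group structure live in the file footer, so they can be inspected without loading the table. A sketch with pyarrow; the filename and column names are placeholders.

```python
# Inspect Parquet metadata without reading the data, then load only
# selected columns with pandas.
import pyarrow.parquet as pq
import pandas as pd

pf = pq.ParquetFile("results.parquet")  # placeholder filename
print("Rows:", pf.metadata.num_rows, "| Row groups:", pf.metadata.num_row_groups)
print(pf.schema_arrow)

# Columnar layout: read just the columns of interest (placeholder names).
df = pd.read_parquet("results.parquet", columns=["smiles", "mol_weight"])
print(df.head())
```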