# Chemistry and Molecular File Formats Reference

This reference covers file formats commonly used in computational chemistry, cheminformatics, molecular modeling, and related fields.

## Structure File Formats

### .pdb - Protein Data Bank
**Description:** Standard format for 3D structures of biological macromolecules
**Typical Data:** Atomic coordinates, residue information, secondary structure, crystal structure data
**Use Cases:** Protein structure analysis, molecular visualization, docking studies
**Python Libraries:**
- `Biopython`: `Bio.PDB.PDBParser()`
- `MDAnalysis`: `MDAnalysis.Universe('file.pdb')`
- `PyMOL`: `pymol.cmd.load('file.pdb')`
- `ProDy`: `prody.parsePDB('file.pdb')`
**EDA Approach:**
- Structure validation (bond lengths, angles, clashes)
- Secondary structure analysis
- B-factor distribution
- Missing residues/atoms detection
- Ramachandran plots for validation
- Surface area and volume calculations

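
Two of the PDB checks above (B-factor distribution and missing-residue detection) can be sketched without any third-party library by reading the fixed-width `ATOM`/`HETATM` columns directly. This is a minimal illustration, not a replacement for the parsers listed above; the function names are hypothetical, and it assumes well-formed records with all columns present.

```python
import statistics


def parse_atom_records(pdb_text):
    """Return (atom_name, res_name, chain_id, res_seq, b_factor) per ATOM/HETATM line."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append((
                line[12:16].strip(),   # atom name, columns 13-16
                line[17:20].strip(),   # residue name, columns 18-20
                line[21],              # chain ID, column 22
                int(line[22:26]),      # residue sequence number, columns 23-26
                float(line[60:66]),    # temperature factor, columns 61-66
            ))
    return atoms


def b_factor_summary(atoms):
    """Basic statistics of the B-factor column."""
    values = [atom[4] for atom in atoms]
    return {"n_atoms": len(values), "mean": statistics.mean(values), "max": max(values)}


def numbering_gaps(atoms, chain_id):
    """Residue numbers absent between the first and last residue of one chain."""
    present = {atom[3] for atom in atoms if atom[2] == chain_id}
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]
```

A gap in residue numbering is only a hint: some entries use non-contiguous numbering on purpose, so the result should be cross-checked against the `REMARK 465` records or the sequence.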
### .cif - Crystallographic Information File
**Description:** Structured data format for crystallographic information
**Typical Data:** Unit cell parameters, atomic coordinates, symmetry operations, experimental data
**Use Cases:** Crystal structure determination, structural biology, materials science
**Python Libraries:**
- `gemmi`: `gemmi.cif.read_file('file.cif')`
- `PyCifRW`: `CifFile.ReadCif('file.cif')`
- `Biopython`: `Bio.PDB.MMCIFParser()`
**EDA Approach:**
- Data completeness check
- Resolution and quality metrics
- Unit cell parameter analysis
- Symmetry group validation
- Atomic displacement parameters
- R-factors and validation metrics

### .mol - MDL Molfile
**Description:** Chemical structure file format originally developed by MDL (now BIOVIA)
**Typical Data:** 2D/3D coordinates, atom types, bond orders, charges
**Use Cases:** Chemical database storage, cheminformatics, drug design
**Python Libraries:**
- `RDKit`: `Chem.MolFromMolFile('file.mol')`
- `Open Babel`: `pybel.readfile('mol', 'file.mol')`
- `ChemoPy`: For descriptor calculation
**EDA Approach:**
- Molecular property calculation (MW, logP, TPSA)
- Functional group analysis
- Ring system detection
- Stereochemistry validation
- 2D/3D coordinate consistency
- Valence and charge validation

### .mol2 - Tripos Mol2
**Description:** Complete 3D molecular structure format with atom typing
**Typical Data:** Coordinates, SYBYL atom types, bond types, charges, substructures
**Use Cases:** Molecular docking, QSAR studies, drug discovery
**Python Libraries:**
- `RDKit`: `Chem.MolFromMol2File('file.mol2')`
- `Open Babel`: `pybel.readfile('mol2', 'file.mol2')`
- `MDAnalysis`: Can parse mol2 topology
**EDA Approach:**
- Atom type distribution
- Partial charge analysis
- Bond type statistics
- Substructure identification
- Conformational analysis
- Energy minimization status check

### .sdf - Structure Data File
**Description:** Multi-structure file format with associated data
**Typical Data:** Multiple molecular structures with properties/annotations
**Use Cases:** Chemical databases, virtual screening, compound libraries
**Python Libraries:**
- `RDKit`: `Chem.SDMolSupplier('file.sdf')`
- `Open Babel`: `pybel.readfile('sdf', 'file.sdf')`
- `PandasTools` (RDKit): `PandasTools.LoadSDF('file.sdf')` for DataFrame integration
**EDA Approach:**
- Dataset size and diversity metrics
- Property distribution analysis (MW, logP, etc.)
- Structural diversity (Tanimoto similarity)
- Missing data assessment
- Outlier detection in properties
- Scaffold analysis

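
Dataset size and missing-data assessment for an SDF can be approximated before any chemistry toolkit is involved, because records are separated by `$$$$` lines and data items use `> <NAME>` headers. The sketch below is a plain-text illustration with hypothetical function names; it does not validate the molfile block itself.

```python
def split_sdf_records(sdf_text):
    """Split SDF text into per-molecule line lists using the '$$$$' delimiter."""
    records, current = [], []
    for line in sdf_text.splitlines():
        if line.strip() == "$$$$":
            records.append(current)
            current = []
        else:
            current.append(line)
    if any(line.strip() for line in current):
        records.append(current)  # trailing record without a closing delimiter
    return records


def data_fields(record):
    """Return {field_name: value} for the '> <NAME>' data items of one record."""
    fields = {}
    i = 0
    while i < len(record):
        line = record[i]
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            name = line[line.index("<") + 1:line.rindex(">")]
            i += 1
            values = []
            while i < len(record) and record[i].strip():
                values.append(record[i])
                i += 1
            fields[name] = "\n".join(values)
        i += 1
    return fields


def missing_field_counts(records):
    """For every field seen anywhere, count the records where it is absent or empty."""
    parsed = [data_fields(record) for record in records]
    names = set().union(*parsed) if parsed else set()
    return {name: sum(1 for p in parsed if not p.get(name)) for name in sorted(names)}
```

For anything beyond counting, the records should still go through `Chem.SDMolSupplier` or an equivalent parser so that unparsable structures are detected as well.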
### .xyz - XYZ Coordinates
**Description:** Simple Cartesian coordinate format
**Typical Data:** Atom types and 3D coordinates
**Use Cases:** Quantum chemistry, geometry optimization, molecular dynamics
**Python Libraries:**
- `ASE`: `ase.io.read('file.xyz')`
- `Open Babel`: `pybel.readfile('xyz', 'file.xyz')`
- `cclib`: For parsing QM outputs with xyz
**EDA Approach:**
- Geometry analysis (bond lengths, angles, dihedrals)
- Center of mass calculation
- Moment of inertia
- Molecular size metrics
- Coordinate validation
- Symmetry detection

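
The XYZ layout (atom count, comment line, then one `symbol x y z` line per atom) is simple enough that coordinate validation and basic geometry checks can be sketched with the standard library. The helper names below are hypothetical, and `centroid` is the unweighted geometric center; a true center of mass would additionally need a table of atomic masses.

```python
import math


def parse_xyz(xyz_text):
    """Parse a single-frame XYZ file into (comment, [(symbol, x, y, z), ...])."""
    lines = xyz_text.splitlines()
    n_atoms = int(lines[0].split()[0])
    comment = lines[1].strip()
    atoms = []
    for line in lines[2:2 + n_atoms]:
        symbol, x, y, z = line.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))
    if len(atoms) != n_atoms:
        raise ValueError(f"header declares {n_atoms} atoms, found {len(atoms)}")
    return comment, atoms


def centroid(atoms):
    """Unweighted geometric center of the coordinates."""
    n = len(atoms)
    return tuple(sum(atom[i] for atom in atoms) / n for i in (1, 2, 3))


def distance(atom_a, atom_b):
    """Euclidean distance between two parsed atoms, in the file's length unit."""
    return math.dist(atom_a[1:], atom_b[1:])
```

The `ValueError` on a mismatched atom count doubles as a cheap truncation check for files written by interrupted jobs.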
### .smi / .smiles - SMILES String
**Description:** Line notation for chemical structures
**Typical Data:** Text representation of molecular structure
**Use Cases:** Chemical databases, literature mining, data exchange
**Python Libraries:**
- `RDKit`: `Chem.MolFromSmiles(smiles)`
- `Open Babel`: `pybel.readstring('smi', smiles)`
- `DeepChem`: For ML on SMILES
**EDA Approach:**
- SMILES syntax validation
- Descriptor calculation from SMILES
- Fingerprint generation
- Substructure searching
- Tautomer enumeration
- Stereoisomer handling

### .pdbqt - AutoDock PDBQT
**Description:** Modified PDB format for AutoDock docking
**Typical Data:** Coordinates, partial charges, atom types for docking
**Use Cases:** Molecular docking, virtual screening
**Python Libraries:**
- `Meeko`: For PDBQT preparation
- `Open Babel`: Can read PDBQT
- `ProDy`: Limited PDBQT support
**EDA Approach:**
- Charge distribution analysis
- Rotatable bond identification
- Atom type validation
- Coordinate quality check
- Hydrogen placement validation
- Torsion definition analysis

### .mae - Maestro Format
**Description:** Schrödinger's proprietary molecular structure format
**Typical Data:** Structures, properties, annotations from Schrödinger suite
**Use Cases:** Drug discovery, molecular modeling with Schrödinger tools
**Python Libraries:**
- `schrodinger.structure`: Requires Schrödinger installation
- Custom parsers for basic reading
**EDA Approach:**
- Property extraction and analysis
- Structure quality metrics
- Conformer analysis
- Docking score distributions
- Ligand efficiency metrics

### .gro - GROMACS Coordinate File
**Description:** Molecular structure file for GROMACS MD simulations
**Typical Data:** Atom positions, velocities, box vectors
**Use Cases:** Molecular dynamics simulations, GROMACS workflows
**Python Libraries:**
- `MDAnalysis`: `Universe('file.gro')`
- `MDTraj`: `mdtraj.load_gro('file.gro')`
- `GromacsWrapper`: For GROMACS integration
**EDA Approach:**
- System composition analysis
- Box dimension validation
- Atom position distribution
- Velocity distribution (if present)
- Density calculation
- Solvation analysis

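
System composition and box checks for a `.gro` file can be sketched from its fixed-width layout: a title line, an atom count, one line per atom (residue number, residue name, atom name, atom number, then positions in nm), and a final box line. The functions below are a hypothetical illustration that assumes a single frame and the standard column widths.

```python
from collections import Counter


def parse_gro(gro_text):
    """Parse a single-frame .gro file into (title, atoms, box)."""
    lines = gro_text.splitlines()
    title = lines[0].strip()
    n_atoms = int(lines[1])
    atoms = []
    for line in lines[2:2 + n_atoms]:
        atoms.append({
            "resid": int(line[0:5]),
            "resname": line[5:10].strip(),
            "name": line[10:15].strip(),
            "xyz": (float(line[20:28]), float(line[28:36]), float(line[36:44])),
            "has_velocity": len(line.rstrip()) > 44,  # velocity columns follow the positions
        })
    box = [float(value) for value in lines[2 + n_atoms].split()]
    return title, atoms, box


def composition(atoms):
    """Number of atoms per residue name."""
    return Counter(atom["resname"] for atom in atoms)


def box_volume(box):
    """Box volume in nm^3: the product of the first three (diagonal) box entries."""
    return box[0] * box[1] * box[2]
```

The diagonal product is also the volume of a triclinic box in this format, because the GROMACS box matrix is stored with the upper off-diagonal elements fixed at zero.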
## Computational Chemistry Output Formats

### .log - Gaussian Log File
**Description:** Output from Gaussian quantum chemistry calculations
**Typical Data:** Energies, geometries, frequencies, orbitals, populations
**Use Cases:** QM calculations, geometry optimization, frequency analysis
**Python Libraries:**
- `cclib`: `cclib.io.ccread('file.log')`
- `GaussianRunPack`: For Gaussian workflows
- Custom parsers with regex
**EDA Approach:**
- Convergence analysis
- Energy profile extraction
- Vibrational frequency analysis
- Orbital energy levels
- Population analysis (Mulliken, NBO)
- Thermochemistry data extraction

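
As an example of the "custom parsers with regex" route, the energy profile of a Gaussian job can be pulled from its `SCF Done:` lines. This is a minimal sketch with hypothetical function names; it assumes the usual `SCF Done:  E(METHOD) =  value  A.U. after  N cycles` line layout and does not cover post-SCF energies such as MP2 or coupled cluster.

```python
import re

# Matches e.g. " SCF Done:  E(RB3LYP) =  -76.4089533236     A.U. after   10 cycles"
SCF_DONE = re.compile(
    r"SCF Done:\s+E\(([^)]+)\)\s+=\s+(-?\d+\.\d+)\s+A\.U\.\s+after\s+(\d+)\s+cycles"
)


def scf_energies(log_text):
    """Return (method, energy_in_hartree, n_cycles) for every SCF Done line."""
    return [(m.group(1), float(m.group(2)), int(m.group(3)))
            for m in SCF_DONE.finditer(log_text)]


def energy_steps(energies):
    """Energy change between consecutive SCF results, e.g. along an optimization."""
    values = [energy for _, energy, _ in energies]
    return [later - earlier for earlier, later in zip(values, values[1:])]


def terminated_normally(log_text):
    """True if the log contains Gaussian's normal-termination message."""
    return "Normal termination of Gaussian" in log_text
```

For anything beyond a quick look, `cclib.io.ccread` exposes the same information as parsed attributes and handles many more output variants.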
### .out - Quantum Chemistry Output
**Description:** Generic output file from various QM packages
**Typical Data:** Calculation results, energies, properties
**Use Cases:** QM calculations across different software
**Python Libraries:**
- `cclib`: Universal parser for QM outputs
- `ASE`: Can read some output formats
**EDA Approach:**
- Software-specific parsing
- Convergence criteria check
- Energy and gradient trends
- Basis set and method validation
- Computational cost analysis

### .wfn / .wfx - Wavefunction Files
**Description:** Wavefunction data for quantum chemical analysis
**Typical Data:** Molecular orbitals, basis sets, density matrices
**Use Cases:** Electron density analysis, QTAIM analysis
**Python Libraries:**
- `Multiwfn`: Standalone program; can be driven from Python scripts (for example via `subprocess`)
- `HORTON`: For wavefunction analysis
- Custom parsers for specific formats
**EDA Approach:**
- Orbital population analysis
- Electron density distribution
- Critical point analysis (QTAIM)
- Molecular orbital visualization
- Bonding analysis

### .fchk - Gaussian Formatted Checkpoint
**Description:** Formatted checkpoint file from Gaussian
**Typical Data:** Complete wavefunction data, results, geometry
**Use Cases:** Post-processing Gaussian calculations
**Python Libraries:**
- `cclib`: Can parse fchk files
- `GaussView` Python API (if available)
- Custom parsers
**EDA Approach:**
- Wavefunction quality assessment
- Property extraction
- Basis set information
- Gradient and Hessian analysis
- Natural orbital analysis

### .cube - Gaussian Cube File
**Description:** Volumetric data on a 3D grid
**Typical Data:** Electron density, molecular orbitals, ESP on grid
**Use Cases:** Visualization of volumetric properties
**Python Libraries:**
- `cclib`: Volume utilities in `cclib.method.volume`
- `ase.io`: `ase.io.read('file.cube')` for the atoms, `ase.io.cube.read_cube_data('file.cube')` for the grid values
- `pyquante`: For cube file manipulation
**EDA Approach:**
- Grid dimension and spacing analysis
- Value distribution statistics
- Isosurface value determination
- Integration over volume
- Comparison between different cubes

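
Grid dimension, spacing, and volume integration can be sketched directly from the cube layout: two comment lines, a line with the atom count and grid origin, three lines with the number of points and step vector per axis, one line per atom, then the grid values. The helpers below are hypothetical and assume a plain (non-orbital) cube, where no extra orbital-index line precedes the values.

```python
import math


def read_cube_header(cube_text):
    """Return atom count, origin, grid shape, axis step vectors, and spacing."""
    lines = cube_text.splitlines()
    first = lines[2].split()
    n_atoms = abs(int(first[0]))           # a negative count marks an orbital cube
    origin = tuple(float(v) for v in first[1:4])
    shape, axes = [], []
    for line in lines[3:6]:
        parts = line.split()
        shape.append(abs(int(parts[0])))   # the sign may be used to flag the length unit
        axes.append(tuple(float(v) for v in parts[1:4]))
    spacing = tuple(math.sqrt(sum(c * c for c in axis)) for axis in axes)
    return {"n_atoms": n_atoms, "origin": origin, "shape": tuple(shape),
            "axes": tuple(axes), "spacing": spacing}


def read_cube_values(cube_text, header):
    """Read the grid values of a plain cube and check the count against the header."""
    lines = cube_text.splitlines()[6 + header["n_atoms"]:]
    values = [float(token) for line in lines for token in line.split()]
    if len(values) != math.prod(header["shape"]):
        raise ValueError("number of grid values does not match the header")
    return values


def voxel_volume(header):
    """Volume of one grid cell: |a . (b x c)| of the three step vectors."""
    a, b, c = header["axes"]
    cross = (b[1] * c[2] - b[2] * c[1], b[2] * c[0] - b[0] * c[2], b[0] * c[1] - b[1] * c[0])
    return abs(sum(x * y for x, y in zip(a, cross)))


def integrate(values, header):
    """Riemann-sum integral of the grid values over the cube volume."""
    return sum(values) * voxel_volume(header)
```

For an electron-density cube the integral should come out close to the number of electrons, which makes it a quick sanity check on grid extent and resolution.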
## Molecular Dynamics Formats

### .dcd - Binary Trajectory
**Description:** Binary trajectory format (CHARMM, NAMD)
**Typical Data:** Time series of atomic coordinates
**Use Cases:** MD trajectory analysis
**Python Libraries:**
- `MDAnalysis`: `Universe(topology, 'traj.dcd')`
- `MDTraj`: `mdtraj.load_dcd('traj.dcd', top='topology.pdb')`
- `PyTraj` (Amber): Limited support
**EDA Approach:**
- RMSD/RMSF analysis
- Trajectory length and frame count
- Coordinate range and drift
- Periodic boundary handling
- File integrity check
- Time step validation

### .xtc - Compressed Trajectory
**Description:** GROMACS compressed trajectory format
**Typical Data:** Compressed coordinates from MD simulations
**Use Cases:** Space-efficient MD trajectory storage
**Python Libraries:**
- `MDAnalysis`: `Universe(topology, 'traj.xtc')`
- `MDTraj`: `mdtraj.load_xtc('traj.xtc', top='topology.pdb')`
**EDA Approach:**
- Compression ratio assessment
- Precision loss evaluation
- RMSD over time
- Structural stability metrics
- Sampling frequency analysis

### .trr - GROMACS Trajectory
**Description:** Full precision GROMACS trajectory
**Typical Data:** Coordinates, velocities, forces from MD
**Use Cases:** High-precision MD analysis
**Python Libraries:**
- `MDAnalysis`: Full support
- `MDTraj`: Can read trr files
- `GromacsWrapper`
**EDA Approach:**
- Full system dynamics analysis
- Energy conservation check (with velocities)
- Force analysis
- Temperature and pressure validation
- System equilibration assessment

### .nc / .netcdf - Amber NetCDF Trajectory
**Description:** Network Common Data Form trajectory
**Typical Data:** MD coordinates, velocities, forces
**Use Cases:** Amber MD simulations, large trajectory storage
**Python Libraries:**
- `MDAnalysis`: NetCDF support
- `PyTraj`: Native Amber analysis
- `netCDF4`: Low-level access
**EDA Approach:**
- Metadata extraction
- Trajectory statistics
- Time series analysis
- Replica exchange analysis
- Multi-dimensional data extraction

### .top - GROMACS Topology
**Description:** Molecular topology for GROMACS
**Typical Data:** Atom types, bonds, angles, force field parameters
**Use Cases:** MD simulation setup and analysis
**Python Libraries:**
- `ParmEd`: `parmed.load_file('system.top')`
- `MDAnalysis`: Can parse topology
- Custom parsers for specific fields
**EDA Approach:**
- Force field parameter validation
- System composition
- Bond/angle/dihedral distribution
- Charge neutrality check
- Molecule type enumeration

### .psf - Protein Structure File (CHARMM)
**Description:** Topology file for CHARMM/NAMD
**Typical Data:** Atom connectivity, types, charges
**Use Cases:** CHARMM/NAMD MD simulations
**Python Libraries:**
- `MDAnalysis`: Native PSF support
- `ParmEd`: Can read PSF files
**EDA Approach:**
- Topology validation
- Connectivity analysis
- Charge distribution
- Atom type statistics
- Segment analysis

### .prmtop - Amber Parameter/Topology
**Description:** Amber topology and parameter file
**Typical Data:** System topology, force field parameters
**Use Cases:** Amber MD simulations
**Python Libraries:**
- `ParmEd`: `parmed.load_file('system.prmtop')`
- `PyTraj`: Native Amber support
**EDA Approach:**
- Force field completeness
- Parameter validation
- System size and composition
- Periodic box information
- Atom mask creation for analysis

### .inpcrd / .rst7 - Amber Coordinates
**Description:** Amber coordinate/restart file
**Typical Data:** Atomic coordinates, velocities, box info
**Use Cases:** Starting coordinates for Amber MD
**Python Libraries:**
- `ParmEd`: Works with prmtop
- `PyTraj`: Amber coordinate reading
**EDA Approach:**
- Coordinate validity
- System initialization check
- Box vector validation
- Velocity distribution (if restart)
- Energy minimization status

## Spectroscopy and Analytical Data

### .jcamp / .jdx - JCAMP-DX
**Description:** Text-based data exchange format defined by the Joint Committee on Atomic and Molecular Physical Data (JCAMP)
**Typical Data:** Spectroscopic data (IR, NMR, MS, UV-Vis)
**Use Cases:** Spectroscopy data exchange and archiving
**Python Libraries:**
- `jcamp`: `jcamp.jcamp_readfile('file.jdx')`
- `nmrglue`: For NMR JCAMP files
- Custom parsers for specific subtypes
**EDA Approach:**
- Peak detection and analysis
- Baseline correction assessment
- Signal-to-noise calculation
- Spectral range validation
- Integration analysis
- Comparison with reference spectra

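
Spectral range validation for JCAMP-DX can start from the header alone, since metadata is stored as `##LABEL=value` records ahead of the data table. The sketch below uses hypothetical helper names, normalizes labels only by upper-casing them, and checks that `DELTAX` agrees with `(LASTX - FIRSTX) / (NPOINTS - 1)`.

```python
import math

DATA_TABLE_LABELS = {"XYDATA", "XYPOINTS", "PEAK TABLE"}


def read_jcamp_header(text):
    """Collect '##LABEL=value' records that appear before the first data table."""
    header = {}
    for line in text.splitlines():
        line = line.split("$$", 1)[0].strip()   # '$$' starts a comment
        if not line.startswith("##"):
            continue
        label, _, value = line[2:].partition("=")
        label = label.strip().upper()
        if label in DATA_TABLE_LABELS:
            break                               # numeric data follows; stop here
        header[label] = value.strip()
    return header


def x_axis_is_consistent(header, rel_tol=1e-3):
    """Compare DELTAX with the step implied by FIRSTX, LASTX, and NPOINTS."""
    first, last = float(header["FIRSTX"]), float(header["LASTX"])
    n_points = int(header["NPOINTS"])
    implied = (last - first) / (n_points - 1)
    return math.isclose(implied, float(header["DELTAX"]), rel_tol=rel_tol)
```

A failed consistency check usually points at a truncated data table or a header edited by hand, and is worth resolving before peak positions are trusted.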
### .mzML - Mass Spectrometry Markup Language
**Description:** Standard XML format for mass spectrometry data
**Typical Data:** MS/MS spectra, chromatograms, metadata
**Use Cases:** Proteomics, metabolomics, mass spectrometry workflows
**Python Libraries:**
- `pymzml`: `pymzml.run.Reader('file.mzML')`
- `pyteomics`: `pyteomics.mzml.read('file.mzML')`
- `MSFileReader` wrappers
**EDA Approach:**
- Scan count and types
- MS level distribution
- Retention time range
- m/z range and resolution
- Peak intensity distribution
- Data completeness
- Quality control metrics

### .mzXML - Mass Spectrometry XML
**Description:** Open XML format for MS data
**Typical Data:** Mass spectra, retention times, peak lists
**Use Cases:** Legacy MS data, metabolomics
**Python Libraries:**
- `pyteomics`: `pyteomics.mzxml.read('file.mzXML')`
- `pymzml`: Targets mzML; convert mzXML to mzML first (for example with `msconvert`)
- `lxml` for direct XML parsing
**EDA Approach:**
- Same checks as for mzML
- Version compatibility check
- Conversion quality assessment
- Peak picking validation

### .raw - Vendor Raw Data
**Description:** Proprietary instrument data files (for example Thermo and Waters)
**Typical Data:** Raw instrument signals, unprocessed data
**Use Cases:** Direct instrument data access
**Python Libraries:**
- `pymsfilereader`: For Thermo RAW files
- `ThermoRawFileParser`: Command-line tool that converts Thermo RAW files to open formats such as mzML
- Vendor-specific APIs (Thermo, Bruker Compass)
**EDA Approach:**
- Instrument method extraction
- Raw signal quality
- Calibration status
- Scan function analysis
- Chromatographic quality metrics

### .d - Agilent Data Directory
**Description:** Agilent's data folder structure
**Typical Data:** LC-MS, GC-MS data and metadata
**Use Cases:** Agilent instrument data processing
**Python Libraries:**
- `agilent-reader`: Community tools
- `Chemstation` Python integration
- Custom directory parsing
**EDA Approach:**
- Directory structure validation
- Method parameter extraction
- Signal file integrity
- Calibration curve analysis
- Sequence information extraction

### .fid - NMR Free Induction Decay
**Description:** Raw NMR time-domain data
**Typical Data:** Time-domain NMR signal
**Use Cases:** NMR processing and analysis
**Python Libraries:**
- `nmrglue`: `nmrglue.bruker.read('experiment_dir')` for Bruker data, `nmrglue.varian.read('experiment.fid')` for Varian/Agilent data
- `nmrstarlib`: For NMR-STAR files
**EDA Approach:**
- Signal decay analysis
- Noise level assessment
- Acquisition parameter validation
- Apodization function selection
- Zero-filling optimization
- Phasing parameter estimation

### .ft - NMR Frequency-Domain Data
**Description:** Processed NMR spectrum (typically written by NMRPipe)
**Typical Data:** Frequency-domain NMR data
**Use Cases:** NMR analysis and interpretation
**Python Libraries:**
- `nmrglue`: `nmrglue.pipe.read('file.ft')`
- `pyNMR`: For processing
**EDA Approach:**
- Peak picking and integration
- Chemical shift calibration
- Multiplicity analysis
- Coupling constant extraction
- Spectral quality metrics
- Reference compound identification

### .spc - Spectroscopy File
**Description:** Thermo Galactic spectroscopy format
**Typical Data:** IR, Raman, UV-Vis spectra
**Use Cases:** Spectroscopic data from various instruments
**Python Libraries:**
- `spc`: `spc.File('file.spc')`
- Custom parsers for binary format
**EDA Approach:**
- Spectral resolution
- Wavelength/wavenumber range
- Baseline characterization
- Peak identification
- Derivative spectra calculation

## Chemical Database Formats

### .inchi - International Chemical Identifier
**Description:** Text identifier for chemical substances
**Typical Data:** Layered chemical structure representation
**Use Cases:** Chemical database keys, structure searching
**Python Libraries:**
- `RDKit`: `Chem.MolFromInchi(inchi)`
- `Open Babel`: InChI conversion
**EDA Approach:**
- InChI validation
- Layer analysis
- Stereochemistry verification
- InChIKey generation
- Structure round-trip validation

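
Layer analysis needs no toolkit, because an InChI is a `/`-separated string: a version segment, an optional formula, then layers identified by a one-letter prefix such as `c` (connections), `h` (hydrogens), `q` (charge), or `t`/`m`/`s` (tetrahedral stereo). The functions below are a hypothetical sketch that only splits the string; it does not validate the chemistry.

```python
def inchi_layers(inchi):
    """Split an InChI string into (layer_name, content) pairs, in order."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, *segments = inchi[len(prefix):].split("/")
    layers = [("version", version)]
    for index, segment in enumerate(segments):
        if index == 0 and segment and not segment[0].islower():
            layers.append(("formula", segment))        # formula layer has no prefix letter
        elif segment:
            layers.append((segment[0], segment[1:]))   # e.g. ('c', '1-2-3')
    return layers


def has_stereo_layer(inchi):
    """True if any double-bond (b) or tetrahedral (t, m, s) stereo layer is present."""
    return any(name in ("b", "t", "m", "s") for name, _ in inchi_layers(inchi))
```

For example, `inchi_layers("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")` yields the version, the formula `C2H6O`, and the `c` and `h` layers. Pairs are returned as a list rather than a dict because a prefix can legitimately occur more than once, for instance after a fixed-hydrogen or isotopic layer.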
### .cdx / .cdxml - ChemDraw Exchange
**Description:** ChemDraw drawing file format
**Typical Data:** 2D chemical structures with annotations
**Use Cases:** Chemical drawing, publication figures
**Python Libraries:**
- `RDKit`: Can import some CDXML
- `Open Babel`: Limited support
- `ChemDraw` Python API (commercial)
**EDA Approach:**
- Structure extraction
- Annotation preservation
- Style consistency
- 2D coordinate validation

### .cml - Chemical Markup Language
**Description:** XML-based chemical structure format
**Typical Data:** Chemical structures, reactions, properties
**Use Cases:** Semantic chemical data representation
**Python Libraries:**
- `RDKit`: CML output via `Chem.MolToCMLBlock(mol)`
- `Open Babel`: Good CML support
- `lxml`: For XML parsing
**EDA Approach:**
- XML schema validation
- Namespace handling
- Property extraction
- Reaction scheme analysis
- Metadata completeness

### .rxn - MDL Reaction File
**Description:** Chemical reaction structure file
**Typical Data:** Reactants, products, reaction arrows
**Use Cases:** Reaction databases, synthesis planning
**Python Libraries:**
- `RDKit`: `AllChem.ReactionFromRxnFile('file.rxn')`
- `Open Babel`: Reaction support
**EDA Approach:**
- Reaction balancing validation
- Atom mapping analysis
- Reagent identification
- Stereochemistry changes
- Reaction classification

### .rdf - Reaction Data File
**Description:** Multi-reaction file format
**Typical Data:** Multiple reactions with data
**Use Cases:** Reaction databases
**Python Libraries:**
- `RDKit`: RDF reading capabilities
- Custom parsers
**EDA Approach:**
- Reaction yield statistics
- Condition analysis
- Success rate patterns
- Reagent frequency analysis

## Computational Output and Data

### .hdf5 / .h5 - Hierarchical Data Format
**Description:** Container for scientific data arrays
**Typical Data:** Large arrays, metadata, hierarchical organization
**Use Cases:** Large dataset storage, computational results
**Python Libraries:**
- `h5py`: `h5py.File('file.h5', 'r')`
- `PyTables`: Advanced HDF5 interface
- `pandas`: `pd.read_hdf('file.h5')`
**EDA Approach:**
- Dataset structure exploration
- Array shape and dtype analysis
- Metadata extraction
- Memory-efficient data sampling
- Chunk optimization analysis
- Compression ratio assessment

### .pkl / .pickle - Python Pickle
**Description:** Serialized Python objects
**Typical Data:** Any Python object (molecules, dataframes, models)
**Use Cases:** Intermediate data storage, model persistence
**Python Libraries:**
- `pickle`: Built-in serialization
- `joblib`: Enhanced pickling for large arrays
- `dill`: Extended pickle support
**EDA Approach:**
- Object type inspection
- Size and complexity analysis
- Version compatibility check
- Security validation (trusted source)
- Deserialization testing

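
Unpickling can execute arbitrary code, so the security and version checks above are best done before calling `pickle.load`. One conservative approach, sketched here, combines `pickletools.genops` (which reads opcodes without executing them) with an `Unpickler` subclass that overrides `find_class` to allow only listed globals. The class name, function names, and allow-list are illustrative assumptions and would need to be adapted per project.

```python
import io
import pickle
import pickletools


def pickle_protocol(data):
    """Return the protocol number from the leading PROTO opcode, or None for protocols 0/1."""
    opcode, arg, _position = next(pickletools.genops(data))
    return arg if opcode.name == "PROTO" else None


class AllowListUnpickler(pickle.Unpickler):
    """Unpickler that refuses any global not named in ALLOWED."""

    ALLOWED = {("collections", "OrderedDict")}  # illustrative allow-list

    def find_class(self, module, name):
        if (module, name) in self.ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def safe_loads(data):
    """Deserialize bytes, allowing only built-in containers and allow-listed globals."""
    return AllowListUnpickler(io.BytesIO(data)).load()
```

Plain containers of numbers and strings load without touching `find_class`, so they pass. Anything that needs a class or function import is rejected unless it has been listed explicitly.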
### .npy / .npz - NumPy Arrays
**Description:** NumPy array binary format
**Typical Data:** Numerical arrays (coordinates, features, matrices)
**Use Cases:** Fast numerical data I/O
**Python Libraries:**
- `numpy`: `np.load('file.npy')`
- `numpy`: `np.load('file.npy', mmap_mode='r')` for memory-mapped access to large files
**EDA Approach:**
- Array shape and dimensions
- Data type and precision
- Statistical summary (mean, std, range)
- Missing value detection
- Outlier identification
- Memory footprint analysis

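
Shape, dtype, and an element count can be read from a `.npy` file without loading the array, because the file starts with a small self-describing header: the magic bytes `\x93NUMPY`, a two-byte version, a little-endian header length, and a Python dict literal. The following is a minimal standard-library sketch with a hypothetical function name; `np.load(..., mmap_mode='r')` gives the same information when NumPy is available.

```python
import ast
import math
import struct

NPY_MAGIC = b"\x93NUMPY"


def read_npy_header(fileobj):
    """Return version, dtype string, memory order, and shape from an open .npy file."""
    if fileobj.read(6) != NPY_MAGIC:
        raise ValueError("not a .npy file")
    major, minor = fileobj.read(2)
    if major == 1:
        (header_len,) = struct.unpack("<H", fileobj.read(2))   # 2-byte length in format 1.0
    elif major in (2, 3):
        (header_len,) = struct.unpack("<I", fileobj.read(4))   # 4-byte length in 2.0 and 3.0
    else:
        raise ValueError(f"unsupported .npy version {major}.{minor}")
    encoding = "utf-8" if major >= 3 else "latin1"
    header = ast.literal_eval(fileobj.read(header_len).decode(encoding))
    shape = tuple(header["shape"])
    return {
        "version": (major, minor),
        "dtype": header["descr"],
        "fortran_order": header["fortran_order"],
        "shape": shape,
        "n_elements": math.prod(shape),
    }
```

Typical use is `with open('file.npy', 'rb') as fh: info = read_npy_header(fh)`. Structured dtypes appear as a list in `dtype` rather than a string.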
### .mat - MATLAB Data File
**Description:** MATLAB workspace data
**Typical Data:** Arrays, structures from MATLAB
**Use Cases:** MATLAB-Python data exchange
**Python Libraries:**
- `scipy.io`: `scipy.io.loadmat('file.mat')`
- `h5py`: For v7.3 MAT files
**EDA Approach:**
- Variable extraction and types
- Array dimension analysis
- Structure field exploration
- MATLAB version compatibility
- Data type conversion validation

### .csv - Comma-Separated Values
**Description:** Tabular data in text format
**Typical Data:** Chemical properties, experimental data, descriptors
**Use Cases:** Data exchange, analysis, machine learning
**Python Libraries:**
- `pandas`: `pd.read_csv('file.csv')`
- `csv`: Built-in module
- `polars`: Fast CSV reading
**EDA Approach:**
- Data types inference
- Missing value patterns
- Statistical summaries
- Correlation analysis
- Distribution visualization
- Outlier detection

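
A first pass over missing-value patterns needs only the built-in `csv` module. The sketch below counts, per column, the cells that are empty or hold a common missing-value marker. The marker set and function name are assumptions for illustration and should be matched to how the dataset actually encodes missing data.

```python
import csv
import io
from collections import Counter

# Assumed markers for "missing"; adjust to the dataset's conventions.
MISSING_MARKERS = {"", "NA", "N/A", "NaN", "nan", "null", "None"}


def missing_value_counts(csv_text):
    """Return (n_rows, {column: number of missing cells}) for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = Counter({name: 0 for name in reader.fieldnames})
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in reader.fieldnames:
            value = row.get(name)
            # DictReader fills short rows with None, which also counts as missing.
            if value is None or value.strip() in MISSING_MARKERS:
                counts[name] += 1
    return n_rows, dict(counts)
```

With `pandas`, `pd.read_csv(path).isna().sum()` gives a comparable summary, though its default set of missing-value markers differs from the one assumed here.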
### .json - JavaScript Object Notation
**Description:** Structured text data format
**Typical Data:** Chemical properties, metadata, API responses
**Use Cases:** Data interchange, configuration, web APIs
**Python Libraries:**
- `json`: Built-in JSON support
- `pandas`: `pd.read_json()`
- `ujson`: Faster JSON parsing
**EDA Approach:**
- Schema validation
- Nesting depth analysis
- Key-value distribution
- Data type consistency
- Array length statistics

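
Nesting depth and key distribution can be computed with a short recursive walk over the parsed document. The two helpers below are a hypothetical sketch: scalars count as depth 0, each dict or list adds one level, and list items are collapsed under a `[]` suffix so that repeated records share one path.

```python
import json
from collections import Counter


def nesting_depth(value):
    """Depth of nested dicts/lists; a scalar has depth 0."""
    if isinstance(value, dict):
        return 1 + max((nesting_depth(v) for v in value.values()), default=0)
    if isinstance(value, list):
        return 1 + max((nesting_depth(v) for v in value), default=0)
    return 0


def key_paths(value, prefix=""):
    """Yield a dotted path for every scalar leaf, e.g. 'props.ids[]'."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from key_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for item in value:
            yield from key_paths(item, prefix + "[]")
    else:
        yield prefix


def path_counts(json_text):
    """Count how often each leaf path occurs in a JSON document."""
    return Counter(key_paths(json.loads(json_text)))
```

Comparing `path_counts` across records of the same kind is a quick way to spot keys that are present in some entries and missing in others.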
### .parquet - Apache Parquet
**Description:** Columnar storage format
**Typical Data:** Large tabular datasets stored column by column with per-column compression
**Use Cases:** Big data, efficient columnar analytics
**Python Libraries:**
- `pandas`: `pd.read_parquet('file.parquet')`
- `pyarrow`: Direct parquet access
- `fastparquet`: Alternative implementation
**EDA Approach:**
- Column statistics from metadata
- Partition analysis
- Compression efficiency
- Row group structure
- Fast sampling for large files
- Schema evolution tracking