
General Scientific Data Formats Reference

This reference covers general-purpose scientific data formats used across multiple disciplines.

Numerical and Array Data

.npy - NumPy Array

Description: Binary NumPy array format
Typical Data: N-dimensional arrays of any data type
Use Cases: Fast I/O for numerical data, intermediate results
Python Libraries:

  • numpy: np.load('file.npy'), np.save()
  • Memory-mapped access: np.load('file.npy', mmap_mode='r')

EDA Approach:
  • Array shape and dimensionality
  • Data type and precision
  • Statistical summary (mean, std, min, max, percentiles)
  • Missing or invalid values (NaN, inf)
  • Memory footprint
  • Value distribution and histogram
  • Sparsity analysis
  • Correlation structure (if 2D)
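A minimal EDA pass along these lines, using a synthetic array and a temporary file (the name sample.npy is illustrative):

```python
import os
import tempfile
import numpy as np

# Create a sample array and round-trip it through the .npy format.
arr = np.array([[1.0, 2.0], [3.0, np.nan]])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.npy")
    np.save(path, arr)
    loaded = np.load(path)                 # use mmap_mode='r' for large files

    shape = loaded.shape                   # array shape and dimensionality
    dtype = loaded.dtype                   # data type and precision
    n_nan = int(np.isnan(loaded).sum())    # missing/invalid values
    mean = float(np.nanmean(loaded))       # summary statistic ignoring NaN
    nbytes = loaded.nbytes                 # memory footprint
```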

.npz - Compressed NumPy Archive

Description: Multiple NumPy arrays in one file
Typical Data: Collections of related arrays
Use Cases: Saving multiple arrays together, compressed storage
Python Libraries:

  • numpy: np.load('file.npz') returns dict-like object
  • np.savez() or np.savez_compressed()

EDA Approach:
  • List of contained arrays
  • Individual array analysis
  • Relationships between arrays
  • Total file size and compression ratio
  • Naming conventions
  • Data consistency checks
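The first three checks can be sketched with two related synthetic arrays (the archive name arrays.npz is illustrative):

```python
import os
import tempfile
import numpy as np

# Save two related arrays into one compressed .npz archive, then inspect it.
x = np.arange(5)
y = x ** 2

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "arrays.npz")
    np.savez_compressed(path, x=x, y=y)

    with np.load(path) as archive:            # dict-like NpzFile
        names = sorted(archive.files)         # list of contained arrays
        shapes = {k: archive[k].shape for k in names}
        consistent = archive["x"].shape == archive["y"].shape  # consistency check
```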

.csv - Comma-Separated Values

Description: Plain text tabular data
Typical Data: Experimental measurements, results tables
Use Cases: Universal data exchange, spreadsheet export
Python Libraries:

  • pandas: pd.read_csv('file.csv')
  • csv: Built-in module
  • polars: High-performance CSV reading
  • numpy: np.loadtxt() or np.genfromtxt()

EDA Approach:
  • Row and column counts
  • Data type inference
  • Missing value patterns and frequency
  • Column statistics (numeric: mean, std; categorical: frequencies)
  • Outlier detection
  • Correlation matrix
  • Duplicate row detection
  • Header and index validation
  • Encoding issues detection
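A sketch of the basic pandas checks, parsing a small in-memory CSV with one missing value and one duplicate row (all data synthetic):

```python
import io
import pandas as pd

# A small CSV with a missing value and a duplicate row, parsed in memory.
csv_text = "id,temp,label\n1,20.5,a\n2,,b\n3,21.0,a\n3,21.0,a\n"
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape                   # row and column counts
missing = df.isna().sum().to_dict()         # missing value patterns and frequency
n_dupes = int(df.duplicated().sum())        # duplicate row detection
temp_mean = df["temp"].mean()               # column statistics (NaN is skipped)
```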

.tsv / .tab - Tab-Separated Values

Description: Tab-delimited tabular data
Typical Data: Similar to CSV but tab-separated
Use Cases: Bioinformatics, text processing output
Python Libraries:

  • pandas: pd.read_csv('file.tsv', sep='\t')

EDA Approach:
  • Same as CSV format
  • Tab vs space validation
  • Quote handling

.xlsx / .xls - Excel Spreadsheets

Description: Microsoft Excel binary/XML formats
Typical Data: Tabular data with formatting, formulas
Use Cases: Lab notebooks, data entry, reports
Python Libraries:

  • pandas: pd.read_excel('file.xlsx')
  • openpyxl: Full Excel file manipulation
  • xlrd: Reading .xls (legacy)

EDA Approach:
  • Sheet enumeration and names
  • Per-sheet data analysis
  • Formula evaluation
  • Merged cells handling
  • Hidden rows/columns
  • Data validation rules
  • Named ranges
  • Formatting-only cells detection

.json - JavaScript Object Notation

Description: Hierarchical text data format
Typical Data: Nested data structures, metadata
Use Cases: API responses, configuration, results
Python Libraries:

  • json: Built-in module
  • pandas: pd.read_json()
  • ujson: Faster JSON parsing

EDA Approach:
  • Schema inference
  • Nesting depth
  • Key-value distribution
  • Array lengths
  • Data type consistency
  • Missing keys
  • Duplicate detection
  • Size and complexity metrics
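Nesting depth and missing-key checks can be implemented with a short recursive helper; the document below is a made-up example:

```python
import json

# Nested JSON document; compute nesting depth and check key consistency.
doc = json.loads('{"run": 1, "samples": [{"id": 1, "val": 0.5}, {"id": 2}]}')

def depth(node):
    """Maximum nesting depth over dicts and lists; scalars count as 0."""
    if isinstance(node, dict):
        return 1 + max((depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((depth(v) for v in node), default=0)
    return 0

max_depth = depth(doc)

# Missing keys: which record keys are not present in every record?
all_keys = set().union(*(s.keys() for s in doc["samples"]))
common_keys = set.intersection(*(set(s.keys()) for s in doc["samples"]))
missing_keys = all_keys - common_keys
```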

.xml - Extensible Markup Language

Description: Hierarchical markup format
Typical Data: Structured data with metadata
Use Cases: Standards-based data exchange, APIs
Python Libraries:

  • lxml: lxml.etree.parse()
  • xml.etree.ElementTree: Built-in XML parser
  • xmltodict: Convert XML to dict

EDA Approach:
  • Schema/DTD validation
  • Element hierarchy and depth
  • Namespace handling
  • Attribute vs element content
  • CDATA sections
  • Text content extraction
  • Sibling and child counts
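A sketch of child counting, attribute inspection, and text extraction with the built-in parser (the document itself is invented):

```python
import xml.etree.ElementTree as ET

# Parse a small XML document and survey its element hierarchy.
xml_text = """<experiment id="42">
  <sample name="a"><value unit="K">300</value></sample>
  <sample name="b"><value unit="K">305</value></sample>
</experiment>"""
root = ET.fromstring(xml_text)

tag = root.tag                                         # root element name
n_samples = len(root.findall("sample"))                # child counts
units = {v.get("unit") for v in root.iter("value")}    # attribute content
values = [float(v.text) for v in root.iter("value")]   # text content extraction
```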

.yaml / .yml - YAML

Description: Human-readable data serialization
Typical Data: Configuration, metadata, parameters
Use Cases: Experiment configurations, pipelines
Python Libraries:

  • yaml (PyYAML): yaml.safe_load() preferred; yaml.load() requires an explicit Loader
  • ruamel.yaml: YAML 1.2 support

EDA Approach:
  • Configuration structure
  • Data type handling
  • List and dict depth
  • Anchor and alias usage
  • Multi-document files
  • Comments preservation
  • Validation against schema

.toml - TOML Configuration

Description: Configuration file format
Typical Data: Settings, parameters
Use Cases: Python package configuration, settings
Python Libraries:

  • tomli / tomllib: TOML reading (tomllib in Python 3.11+)
  • toml: Reading and writing

EDA Approach:
  • Section structure
  • Key-value pairs
  • Data type inference
  • Nested table validation
  • Required vs optional fields

.ini - INI Configuration

Description: Simple configuration format
Typical Data: Application settings
Use Cases: Legacy configurations, simple settings
Python Libraries:

  • configparser: Built-in INI parser

EDA Approach:
  • Section enumeration
  • Key-value extraction
  • Type conversion
  • Comment handling
  • Case sensitivity
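Section enumeration and type conversion with the built-in parser, on an invented INI fragment:

```python
import configparser

# Parse an INI fragment and enumerate sections and typed values.
ini_text = """[detector]
gain = 2.5
enabled = true

[output]
path = results
"""
config = configparser.ConfigParser()
config.read_string(ini_text)

sections = config.sections()                        # section enumeration
gain = config.getfloat("detector", "gain")          # type conversion
enabled = config.getboolean("detector", "enabled")  # "true"/"false" handling
```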

Binary and Compressed Data

.hdf5 / .h5 - Hierarchical Data Format 5

Description: Container for large scientific datasets
Typical Data: Multi-dimensional arrays, metadata, groups
Use Cases: Large datasets, multi-modal data, parallel I/O
Python Libraries:

  • h5py: h5py.File('file.h5', 'r')
  • pytables: Advanced HDF5 interface
  • pandas: HDF5 storage via HDFStore

EDA Approach:
  • Group and dataset hierarchy
  • Dataset shapes and dtypes
  • Attributes and metadata
  • Compression and chunking strategy
  • Memory-efficient sampling
  • Dataset relationships
  • File size and efficiency
  • Access patterns optimization
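Hierarchy walking, attributes, and chunking can be sketched with h5py on a tiny file written in a temporary directory (group/dataset names are invented):

```python
import os
import tempfile
import h5py
import numpy as np

# Build a tiny HDF5 file, then walk its group/dataset hierarchy.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.h5")
    with h5py.File(path, "w") as f:
        grp = f.create_group("raw")
        dset = grp.create_dataset("signal", data=np.arange(10.0),
                                  chunks=(5,), compression="gzip")
        dset.attrs["units"] = "mV"          # attribute metadata

    with h5py.File(path, "r") as f:
        items = []
        f.visit(items.append)               # group and dataset hierarchy
        shape = f["raw/signal"].shape       # dataset shape
        units = f["raw/signal"].attrs["units"]
        chunks = f["raw/signal"].chunks     # chunking strategy
```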

.zarr - Chunked Array Storage

Description: Cloud-optimized chunked arrays
Typical Data: Large N-dimensional arrays
Use Cases: Cloud storage, parallel computing, streaming
Python Libraries:

  • zarr: zarr.open('file.zarr')
  • xarray: Zarr backend support

EDA Approach:
  • Array metadata and dimensions
  • Chunk size optimization
  • Compression codec and ratio
  • Synchronizer and store type
  • Multi-scale hierarchies
  • Parallel access performance
  • Attribute metadata

.gz / .gzip - Gzip Compressed

Description: Compressed data files
Typical Data: Any compressed text or binary
Use Cases: Compression for storage/transfer
Python Libraries:

  • gzip: Built-in gzip module
  • pandas: Automatic gzip handling in read functions

EDA Approach:
  • Compression ratio
  • Original file type detection
  • Decompression validation
  • Header information
  • Multi-member archives
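Compression ratio, header inspection, and round-trip validation in memory, on synthetic repetitive text:

```python
import gzip

# Compress repetitive text, compare sizes, then validate the round trip.
original = b"temperature,pressure\n" * 200
compressed = gzip.compress(original)

ratio = len(original) / len(compressed)      # compression ratio
restored = gzip.decompress(compressed)       # decompression validation
is_gzip = compressed[:2] == b"\x1f\x8b"      # gzip magic bytes in the header
```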

.bz2 - Bzip2 Compressed

Description: Bzip2 compression
Typical Data: Highly compressed files
Use Cases: Better compression than gzip at the cost of speed
Python Libraries:

  • bz2: Built-in bz2 module
  • pandas: Automatic bz2 handling in read functions

EDA Approach:
  • Compression efficiency
  • Decompression time
  • Content validation

.zip - ZIP Archive

Description: Archive with multiple files
Typical Data: Collections of files
Use Cases: File distribution, archiving
Python Libraries:

  • zipfile: Built-in ZIP support
  • pandas: Can read zipped CSVs

EDA Approach:
  • Archive member listing
  • Compression method per file
  • Total vs compressed size
  • Directory structure
  • File type distribution
  • Extraction validation
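Member listing, size comparison, and integrity testing on an in-memory archive (member names are invented):

```python
import io
import zipfile

# Build an in-memory ZIP archive and inspect its members.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data/a.csv", "x,y\n1,2\n")
    zf.writestr("readme.txt", "notes " * 100)

with zipfile.ZipFile(buf) as zf:
    names = zf.namelist()         # archive member listing
    bad = zf.testzip()            # extraction validation (None means all CRCs OK)
    sizes = {i.filename: (i.file_size, i.compress_size) for i in zf.infolist()}
```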

.tar / .tar.gz - TAR Archive

Description: Unix tape archive
Typical Data: Multiple files and directories
Use Cases: Software distribution, backups
Python Libraries:

  • tarfile: Built-in TAR support

EDA Approach:
  • Member file listing
  • Compression (if .tar.gz, .tar.bz2)
  • Directory structure
  • Permissions preservation
  • Extraction testing
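Listing and extraction testing on a small in-memory .tar.gz (the member path is invented):

```python
import io
import tarfile

# Create a small in-memory .tar.gz archive and list its members.
payload = b"step,value\n1,0.5\n"
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="results/run1.csv")
    info.size = len(payload)
    tar.addfile(info, io.BytesIO(payload))

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    names = tar.getnames()                         # member file listing
    member = tar.getmember("results/run1.csv")
    content = tar.extractfile(member).read()       # extraction testing
```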

Time Series and Waveform Data

.wav - Waveform Audio

Description: Audio waveform data
Typical Data: Acoustic signals, audio recordings
Use Cases: Acoustic analysis, ultrasound, signal processing
Python Libraries:

  • scipy.io.wavfile: scipy.io.wavfile.read()
  • wave: Built-in module
  • soundfile: Enhanced audio I/O

EDA Approach:
  • Sample rate and duration
  • Bit depth and channels
  • Amplitude distribution
  • Spectral analysis (FFT)
  • Signal-to-noise ratio
  • Clipping detection
  • Frequency content
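Sample rate, duration, bit depth, and channel checks, shown with the built-in wave module on a synthetic sine tone:

```python
import io
import math
import struct
import wave

# Write one second of a 440 Hz sine wave, then read back its parameters.
rate, freq, n_frames = 8000, 440.0, 8000
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 16-bit samples
    w.setframerate(rate)
    frames = b"".join(
        struct.pack("<h", int(20000 * math.sin(2 * math.pi * freq * i / rate)))
        for i in range(n_frames)
    )
    w.writeframes(frames)

buf.seek(0)
with wave.open(buf, "rb") as w:
    channels = w.getnchannels()
    sample_rate = w.getframerate()
    duration = w.getnframes() / sample_rate   # duration in seconds
    bit_depth = w.getsampwidth() * 8
```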

.mat - MATLAB Data

Description: MATLAB workspace variables
Typical Data: Arrays, structures, cells
Use Cases: MATLAB-Python interoperability
Python Libraries:

  • scipy.io: scipy.io.loadmat()
  • h5py: For MATLAB v7.3 files (HDF5-based)
  • mat73: Pure Python for v7.3

EDA Approach:
  • Variable names and types
  • Array dimensions
  • Structure field exploration
  • Cell array handling
  • Sparse matrix detection
  • MATLAB version compatibility
  • Metadata extraction
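Variable enumeration can be sketched by round-tripping a tiny synthetic workspace through scipy (variable names are invented):

```python
import os
import tempfile
import numpy as np
from scipy.io import savemat, loadmat

# Round-trip a small workspace through the .mat format and list its variables.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "workspace.mat")
    savemat(path, {"A": np.eye(2), "label": "run1"})

    data = loadmat(path)
    # loadmat adds __header__/__version__/__globals__ metadata keys.
    variables = sorted(k for k in data if not k.startswith("__"))
    shape = data["A"].shape
```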

.edf - European Data Format

Description: Time series data (especially medical)
Typical Data: EEG, physiological signals
Use Cases: Medical signal storage
Python Libraries:

  • pyedflib: EDF/EDF+ reading and writing
  • mne: Neurophysiology data (supports EDF)

EDA Approach:
  • Signal count and names
  • Sampling frequencies
  • Signal ranges and units
  • Recording duration
  • Annotation events
  • Data quality (saturation, noise)
  • Patient/study information

.csv (Time Series)

Description: CSV with timestamp column
Typical Data: Time-indexed measurements
Use Cases: Sensor data, monitoring, experiments
Python Libraries:

  • pandas: pd.read_csv() with parse_dates

EDA Approach:
  • Temporal range and resolution
  • Sampling regularity
  • Missing time points
  • Trend and seasonality
  • Stationarity tests
  • Autocorrelation
  • Anomaly detection
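Range, regularity, and missing-timestamp checks on a synthetic three-row series with one gap:

```python
import io
import pandas as pd

# Time-indexed CSV with one missing sample; check range and regularity.
csv_text = (
    "timestamp,value\n"
    "2024-01-01 00:00:00,1.0\n"
    "2024-01-01 00:01:00,1.5\n"
    "2024-01-01 00:03:00,2.0\n"   # 00:02 is missing
)
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"],
                 index_col="timestamp")

span = df.index.max() - df.index.min()     # temporal range
inferred = pd.infer_freq(df.index)         # None means irregular sampling
n_missing = len(pd.date_range(df.index.min(), df.index.max(),
                              freq="1min").difference(df.index))
```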

Geospatial and Environmental Data

.shp - Shapefile

Description: Geospatial vector data
Typical Data: Geographic features (points, lines, polygons)
Use Cases: GIS analysis, spatial data
Python Libraries:

  • geopandas: gpd.read_file('file.shp')
  • fiona: Lower-level shapefile access
  • pyshp: Pure Python shapefile reader

EDA Approach:
  • Geometry type and count
  • Coordinate reference system
  • Bounding box
  • Attribute table analysis
  • Geometry validity
  • Spatial distribution
  • Multi-part features
  • Associated files (.shx, .dbf, .prj)

.geojson - GeoJSON

Description: JSON format for geographic data
Typical Data: Features with geometry and properties
Use Cases: Web mapping, spatial analysis
Python Libraries:

  • geopandas: Native GeoJSON support
  • json: Parse as JSON, then process

EDA Approach:
  • Feature count and types
  • CRS specification
  • Bounding box calculation
  • Property schema
  • Geometry complexity
  • Nesting structure

.tif / .tiff (Geospatial)

Description: GeoTIFF with spatial reference
Typical Data: Satellite imagery, DEMs, rasters
Use Cases: Remote sensing, terrain analysis
Python Libraries:

  • rasterio: rasterio.open('file.tif')
  • gdal: Geospatial Data Abstraction Library
  • xarray with rioxarray: N-D geospatial arrays

EDA Approach:
  • Raster dimensions and resolution
  • Band count and descriptions
  • Coordinate reference system
  • Geotransform parameters
  • NoData value handling
  • Pixel value distribution
  • Histogram analysis
  • Overviews and pyramids

.nc / .netcdf - Network Common Data Form

Description: Self-describing array-based data
Typical Data: Climate, atmospheric, oceanographic data
Use Cases: Scientific datasets, model output
Python Libraries:

  • netCDF4: netCDF4.Dataset('file.nc')
  • xarray: xr.open_dataset('file.nc')

EDA Approach:
  • Variable enumeration
  • Dimension analysis
  • Time series properties
  • Spatial coverage
  • Attribute metadata (CF conventions)
  • Coordinate systems
  • Chunking and compression
  • Data quality flags

.grib / .grib2 - Gridded Binary

Description: Meteorological data format
Typical Data: Weather forecasts, climate data
Use Cases: Numerical weather prediction
Python Libraries:

  • pygrib: GRIB file reading
  • xarray with cfgrib: GRIB to xarray

EDA Approach:
  • Message inventory
  • Parameter and level types
  • Spatial grid specification
  • Temporal coverage
  • Ensemble members
  • Forecast vs analysis
  • Data packing and precision

.hdf4 - HDF4 Format

Description: Older HDF format
Typical Data: NASA Earth Science data
Use Cases: Satellite data (MODIS, etc.)
Python Libraries:

  • pyhdf: HDF4 access
  • gdal: Can read HDF4

EDA Approach:
  • Scientific dataset listing
  • Vdata and attributes
  • Dimension scales
  • Metadata extraction
  • Quality flags
  • Conversion to HDF5 or NetCDF

Specialized Scientific Formats

.fits - Flexible Image Transport System

Description: Astronomy data format
Typical Data: Images, tables, spectra from telescopes
Use Cases: Astronomical observations
Python Libraries:

  • astropy.io.fits: fits.open('file.fits')
  • fitsio: Alternative FITS library

EDA Approach:
  • HDU (Header Data Unit) structure
  • Image dimensions and WCS
  • Header keyword analysis
  • Table column descriptions
  • Data type and scaling
  • FITS convention compliance
  • Checksum validation

.asdf - Advanced Scientific Data Format

Description: Next-generation data format for astronomy
Typical Data: Complex hierarchical scientific data
Use Cases: James Webb Space Telescope data
Python Libraries:

  • asdf: asdf.open('file.asdf')

EDA Approach:
  • Tree structure exploration
  • Schema validation
  • Internal vs external arrays
  • Compression methods
  • YAML metadata
  • Version compatibility

.root - ROOT Data Format

Description: CERN ROOT framework format
Typical Data: High-energy physics data
Use Cases: Particle physics experiments
Python Libraries:

  • uproot: Pure Python ROOT reading
  • ROOT: Official PyROOT bindings

EDA Approach:
  • TTree structure
  • Branch types and entries
  • Histogram inventory
  • Event loop statistics
  • File compression
  • Split level analysis

.txt - Plain Text Data

Description: Generic text-based data
Typical Data: Tab/space-delimited, custom formats
Use Cases: Simple data exchange, logs
Python Libraries:

  • pandas: pd.read_csv() with custom delimiters
  • numpy: np.loadtxt(), np.genfromtxt()
  • Built-in file reading

EDA Approach:
  • Format detection (delimiter, header)
  • Data type inference
  • Comment line handling
  • Missing value codes
  • Column alignment
  • Encoding detection

.dat - Generic Data File

Description: Binary or text data
Typical Data: Instrument output, custom formats
Use Cases: Various scientific instruments
Python Libraries:

  • Format-specific: requires knowledge of the structure
  • numpy: np.fromfile() for binary data
  • struct: Parse binary structures

EDA Approach:
  • Binary vs text determination
  • Header detection
  • Record structure inference
  • Endianness
  • Data type patterns
  • Validation with documentation
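Header detection and explicit-endianness parsing, sketched against a hypothetical record layout (a 4-byte magic string followed by little-endian float32 samples; both the magic and the layout are invented for illustration):

```python
import numpy as np

# Hypothetical binary layout: 4-byte magic header, then little-endian float32.
raw = b"DAT1" + np.arange(4, dtype="<f4").tobytes()

magic = raw[:4]                                 # header detection
samples = np.frombuffer(raw[4:], dtype="<f4")   # explicit endianness and dtype
n_records = samples.size                        # record count from payload size
```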

.log - Log Files

Description: Text logs from software/instruments
Typical Data: Timestamped events, messages
Use Cases: Troubleshooting, experiment tracking
Python Libraries:

  • Built-in file reading
  • pandas: Structured log parsing
  • Regular expressions for parsing

EDA Approach:
  • Log level distribution
  • Timestamp parsing
  • Error and warning frequency
  • Event sequencing
  • Pattern recognition
  • Anomaly detection
  • Session boundaries
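Regex-based parsing with a log-level tally, on an invented four-line log in a common "timestamp LEVEL message" layout:

```python
import re
from collections import Counter

# Parse timestamped log lines with a regex and tally log levels.
log_text = """2024-01-01 10:00:00 INFO start acquisition
2024-01-01 10:00:05 WARNING sensor 3 drift detected
2024-01-01 10:00:09 ERROR sensor 3 timeout
2024-01-01 10:00:12 INFO acquisition complete"""

pattern = re.compile(r"^(\S+ \S+) (\w+) (.*)$")
records = [pattern.match(line).groups() for line in log_text.splitlines()]

levels = Counter(level for _, level, _ in records)   # log level distribution
errors = [msg for _, level, msg in records if level == "ERROR"]
```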