# General Scientific Data Formats Reference

This reference covers general-purpose scientific data formats used across multiple disciplines.

## Numerical and Array Data

### .npy - NumPy Array

**Description:** Binary NumPy array format

**Typical Data:** N-dimensional arrays of any data type

**Use Cases:** Fast I/O for numerical data, intermediate results

**Python Libraries:**

- `numpy`: `np.load('file.npy')`, `np.save()`
- Memory-mapped access: `np.load('file.npy', mmap_mode='r')`

**EDA Approach:**

- Array shape and dimensionality
- Data type and precision
- Statistical summary (mean, std, min, max, percentiles)
- Missing or invalid values (NaN, inf)
- Memory footprint
- Value distribution and histogram
- Sparsity analysis
- Correlation structure (if 2D)

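A minimal first-pass sketch of this approach (the file name is illustrative; a small synthetic array stands in for real data):

```python
import numpy as np

# Create a small illustrative array on disk (stand-in for a real .npy file).
arr = np.random.default_rng(0).normal(size=(100, 3))
arr[5, 1] = np.nan
np.save("example.npy", arr)

# Memory-map so large files are not read into RAM all at once.
data = np.load("example.npy", mmap_mode="r")
print("shape:", data.shape, "dtype:", data.dtype)
print("memory footprint (bytes):", data.nbytes)
print("NaN count:", int(np.isnan(data).sum()))
print("min/mean/max:", np.nanmin(data), np.nanmean(data), np.nanmax(data))
```

Using the `nan*` reductions keeps the summary robust when invalid values are present.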
### .npz - Compressed NumPy Archive

**Description:** Multiple NumPy arrays in one file

**Typical Data:** Collections of related arrays

**Use Cases:** Saving multiple arrays together, compressed storage

**Python Libraries:**

- `numpy`: `np.load('file.npz')` returns dict-like object
- `np.savez()` or `np.savez_compressed()`

**EDA Approach:**

- List of contained arrays
- Individual array analysis
- Relationships between arrays
- Total file size and compression ratio
- Naming conventions
- Data consistency checks

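Listing the contained arrays is usually the first step; a small sketch (array names and file name are illustrative):

```python
import numpy as np

# Save two related arrays together (stand-ins for real data).
x = np.arange(10)
y = x ** 2
np.savez_compressed("example.npz", x=x, y=y)

# NpzFile is dict-like and lazy: arrays decompress on access.
with np.load("example.npz") as archive:
    names = sorted(archive.files)                      # contained arrays
    shapes = {k: archive[k].shape for k in names}      # per-array dimensions
print(names, shapes)
```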
### .csv - Comma-Separated Values

**Description:** Plain text tabular data

**Typical Data:** Experimental measurements, results tables

**Use Cases:** Universal data exchange, spreadsheet export

**Python Libraries:**

- `pandas`: `pd.read_csv('file.csv')`
- `csv`: Built-in module
- `polars`: High-performance CSV reading
- `numpy`: `np.loadtxt()` or `np.genfromtxt()`

**EDA Approach:**

- Row and column counts
- Data type inference
- Missing value patterns and frequency
- Column statistics (numeric: mean, std; categorical: frequencies)
- Outlier detection
- Correlation matrix
- Duplicate row detection
- Header and index validation
- Encoding issue detection

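A quick-look pass over a CSV with pandas might look like this (the inline CSV text stands in for a real file path):

```python
import io
import pandas as pd

# Illustrative CSV content with a missing value and a duplicate row.
csv_text = "sample,temp_c,ph\nA,20.5,7.0\nB,,7.4\nA,20.5,7.0\n"
df = pd.read_csv(io.StringIO(csv_text))

n_rows, n_cols = df.shape
missing = df.isna().sum()               # missing values per column
n_dupes = int(df.duplicated().sum())    # fully duplicated rows
print(df.dtypes)
print("rows:", n_rows, "cols:", n_cols, "duplicates:", n_dupes)
print(df.describe())                    # numeric column statistics
```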
### .tsv / .tab - Tab-Separated Values

**Description:** Tab-delimited tabular data

**Typical Data:** Similar to CSV but tab-separated

**Use Cases:** Bioinformatics, text processing output

**Python Libraries:**

- `pandas`: `pd.read_csv('file.tsv', sep='\t')`

**EDA Approach:**

- Same as CSV format
- Tab vs space validation
- Quote handling

### .xlsx / .xls - Excel Spreadsheets

**Description:** Microsoft Excel binary/XML formats

**Typical Data:** Tabular data with formatting, formulas

**Use Cases:** Lab notebooks, data entry, reports

**Python Libraries:**

- `pandas`: `pd.read_excel('file.xlsx')`
- `openpyxl`: Full Excel file manipulation
- `xlrd`: Reading legacy .xls files (no longer reads .xlsx)

**EDA Approach:**

- Sheet enumeration and names
- Per-sheet data analysis
- Formula evaluation
- Merged cell handling
- Hidden rows/columns
- Data validation rules
- Named ranges
- Detection of formatting-only cells

### .json - JavaScript Object Notation

**Description:** Hierarchical text data format

**Typical Data:** Nested data structures, metadata

**Use Cases:** API responses, configuration, results

**Python Libraries:**

- `json`: Built-in module
- `pandas`: `pd.read_json()`
- `ujson`: Faster JSON parsing

**EDA Approach:**

- Schema inference
- Nesting depth
- Key-value distribution
- Array lengths
- Data type consistency
- Missing keys
- Duplicate detection
- Size and complexity metrics

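Nesting depth is cheap to compute with a recursive walk; a small sketch using only the standard library (the document contents are illustrative):

```python
import json

# Illustrative JSON document (stands in for a real file's contents).
doc = json.loads('{"run": {"id": 7, "params": {"lr": 0.01}}, "tags": ["a", "b"]}')

def max_depth(node):
    """Nesting depth of dicts/lists; scalars count as depth 0."""
    if isinstance(node, dict):
        return 1 + max((max_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((max_depth(v) for v in node), default=0)
    return 0

depth = max_depth(doc)
print("top-level keys:", sorted(doc), "nesting depth:", depth)
```

The same walk extends naturally to collecting key frequencies or per-path type counts for schema inference.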
### .xml - Extensible Markup Language

**Description:** Hierarchical markup format

**Typical Data:** Structured data with metadata

**Use Cases:** Standards-based data exchange, APIs

**Python Libraries:**

- `lxml`: `lxml.etree.parse()`
- `xml.etree.ElementTree`: Built-in XML
- `xmltodict`: Convert XML to dict

**EDA Approach:**

- Schema/DTD validation
- Element hierarchy and depth
- Namespace handling
- Attribute vs element content
- CDATA sections
- Text content extraction
- Sibling and child counts

### .yaml / .yml - YAML

**Description:** Human-readable data serialization

**Typical Data:** Configuration, metadata, parameters

**Use Cases:** Experiment configurations, pipelines

**Python Libraries:**

- `yaml`: `yaml.safe_load()` (prefer over `yaml.load()` for untrusted input)
- `ruamel.yaml`: YAML 1.2 support

**EDA Approach:**

- Configuration structure
- Data type handling
- List and dict depth
- Anchor and alias usage
- Multi-document files
- Comment preservation
- Validation against schema

### .toml - TOML Configuration

**Description:** Configuration file format

**Typical Data:** Settings, parameters

**Use Cases:** Python package configuration, settings

**Python Libraries:**

- `tomli` / `tomllib`: TOML reading (tomllib in Python 3.11+)
- `toml`: Reading and writing

**EDA Approach:**

- Section structure
- Key-value pairs
- Data type inference
- Nested table validation
- Required vs optional fields

### .ini - INI Configuration

**Description:** Simple configuration format

**Typical Data:** Application settings

**Use Cases:** Legacy configurations, simple settings

**Python Libraries:**

- `configparser`: Built-in INI parser

**EDA Approach:**

- Section enumeration
- Key-value extraction
- Type conversion
- Comment handling
- Case sensitivity

## Binary and Compressed Data

### .hdf5 / .h5 - Hierarchical Data Format 5

**Description:** Container for large scientific datasets

**Typical Data:** Multi-dimensional arrays, metadata, groups

**Use Cases:** Large datasets, multi-modal data, parallel I/O

**Python Libraries:**

- `h5py`: `h5py.File('file.h5', 'r')`
- `tables` (PyTables): Advanced HDF5 interface
- `pandas`: HDF5 storage via HDFStore

**EDA Approach:**

- Group and dataset hierarchy
- Dataset shapes and dtypes
- Attributes and metadata
- Compression and chunking strategy
- Memory-efficient sampling
- Dataset relationships
- File size and efficiency
- Access-pattern optimization

### .zarr - Chunked Array Storage

**Description:** Cloud-optimized chunked arrays

**Typical Data:** Large N-dimensional arrays

**Use Cases:** Cloud storage, parallel computing, streaming

**Python Libraries:**

- `zarr`: `zarr.open('file.zarr')`
- `xarray`: Zarr backend support

**EDA Approach:**

- Array metadata and dimensions
- Chunk size optimization
- Compression codec and ratio
- Synchronizer and store type
- Multi-scale hierarchies
- Parallel access performance
- Attribute metadata

### .gz / .gzip - Gzip Compressed

**Description:** Compressed data files

**Typical Data:** Any compressed text or binary

**Use Cases:** Compression for storage/transfer

**Python Libraries:**

- `gzip`: Built-in gzip module
- `pandas`: Automatic gzip handling in read functions

**EDA Approach:**

- Compression ratio
- Original file type detection
- Decompression validation
- Header information
- Multi-member archives

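Compression ratio and round-trip validation can be checked entirely in memory; a small sketch with synthetic repetitive text standing in for real data:

```python
import gzip

payload = b"measurement,value\n" + b"a,1.0\n" * 1000  # highly repetitive text
compressed = gzip.compress(payload)

ratio = len(payload) / len(compressed)
print(f"{len(payload)} -> {len(compressed)} bytes (ratio {ratio:.1f}x)")

# Decompression validation: the round trip must reproduce the original bytes.
assert gzip.decompress(compressed) == payload
# Header information: a gzip stream starts with the magic bytes 0x1f 0x8b.
assert compressed[:2] == b"\x1f\x8b"
```

The magic-byte check is also a quick way to detect the original file type of an unlabeled download.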
### .bz2 - Bzip2 Compressed

**Description:** Bzip2 compression

**Typical Data:** Highly compressed files

**Use Cases:** Better compression than gzip

**Python Libraries:**

- `bz2`: Built-in bz2 module
- Automatic handling in pandas

**EDA Approach:**

- Compression efficiency
- Decompression time
- Content validation

### .zip - ZIP Archive

**Description:** Archive with multiple files

**Typical Data:** Collections of files

**Use Cases:** File distribution, archiving

**Python Libraries:**

- `zipfile`: Built-in ZIP support
- `pandas`: Can read zipped CSVs

**EDA Approach:**

- Archive member listing
- Compression method per file
- Total vs compressed size
- Directory structure
- File type distribution
- Extraction validation

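A sketch of archive inspection using an in-memory ZIP (member names and contents are illustrative):

```python
import io
import zipfile

# Build a small in-memory archive (stand-in for a real .zip on disk).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data/results.csv", "x,y\n1,2\n" * 500)
    zf.writestr("README.txt", "experiment notes")

with zipfile.ZipFile(buf) as zf:
    members = zf.namelist()                                   # member listing
    sizes = {i.filename: (i.file_size, i.compress_size)
             for i in zf.infolist()}                          # total vs compressed
    bad = zf.testzip()                                        # None => all CRCs pass
print(members, sizes, "corrupt member:", bad)
```

`testzip()` gives extraction validation without writing anything to disk.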
### .tar / .tar.gz - TAR Archive

**Description:** Unix tape archive

**Typical Data:** Multiple files and directories

**Use Cases:** Software distribution, backups

**Python Libraries:**

- `tarfile`: Built-in TAR support

**EDA Approach:**

- Member file listing
- Compression (if .tar.gz, .tar.bz2)
- Directory structure
- Permissions preservation
- Extraction testing

## Time Series and Waveform Data

### .wav - Waveform Audio

**Description:** Audio waveform data

**Typical Data:** Acoustic signals, audio recordings

**Use Cases:** Acoustic analysis, ultrasound, signal processing

**Python Libraries:**

- `scipy.io.wavfile`: `scipy.io.wavfile.read()`
- `wave`: Built-in module
- `soundfile`: Enhanced audio I/O

**EDA Approach:**

- Sample rate and duration
- Bit depth and channels
- Amplitude distribution
- Spectral analysis (FFT)
- Signal-to-noise ratio
- Clipping detection
- Frequency content

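Sample rate, bit depth, channel count, and duration come straight from the header; a self-contained sketch using the built-in `wave` module (the synthesized tone stands in for a real recording):

```python
import math
import struct
import wave

# Synthesize one second of a 440 Hz tone as a stand-in for a real recording.
rate, freq = 8000, 440
frames = b"".join(
    struct.pack("<h", int(20000 * math.sin(2 * math.pi * freq * i / rate)))
    for i in range(rate)
)
with wave.open("tone.wav", "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)      # 16-bit samples
    w.setframerate(rate)
    w.writeframes(frames)

# Read back the header fields that drive the rest of the EDA.
with wave.open("tone.wav", "rb") as w:
    channels, width = w.getnchannels(), w.getsampwidth()
    sr, n = w.getframerate(), w.getnframes()
    duration = n / sr
print(f"{channels} ch, {8 * width}-bit, {sr} Hz, {duration:.2f} s")
```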
### .mat - MATLAB Data

**Description:** MATLAB workspace variables

**Typical Data:** Arrays, structures, cells

**Use Cases:** MATLAB-Python interoperability

**Python Libraries:**

- `scipy.io`: `scipy.io.loadmat()`
- `h5py`: For MATLAB v7.3 files (HDF5-based)
- `mat73`: Pure Python for v7.3

**EDA Approach:**

- Variable names and types
- Array dimensions
- Structure field exploration
- Cell array handling
- Sparse matrix detection
- MATLAB version compatibility
- Metadata extraction

### .edf - European Data Format

**Description:** Time series data (especially medical)

**Typical Data:** EEG, physiological signals

**Use Cases:** Medical signal storage

**Python Libraries:**

- `pyedflib`: EDF/EDF+ reading and writing
- `mne`: Neurophysiology data (supports EDF)

**EDA Approach:**

- Signal count and names
- Sampling frequencies
- Signal ranges and units
- Recording duration
- Annotation events
- Data quality (saturation, noise)
- Patient/study information

### .csv (Time Series)

**Description:** CSV with timestamp column

**Typical Data:** Time-indexed measurements

**Use Cases:** Sensor data, monitoring, experiments

**Python Libraries:**

- `pandas`: `pd.read_csv()` with `parse_dates`

**EDA Approach:**

- Temporal range and resolution
- Sampling regularity
- Missing time points
- Trend and seasonality
- Stationarity tests
- Autocorrelation
- Anomaly detection

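Sampling regularity and missing time points can be checked by comparing the observed index against a regular grid; a sketch (the inline sensor log is illustrative):

```python
import io
import pandas as pd

# Illustrative sensor log with one missing 10-minute sample.
csv_text = (
    "timestamp,value\n"
    "2024-01-01 00:00,1.0\n"
    "2024-01-01 00:10,1.2\n"
    "2024-01-01 00:30,1.1\n"
)
ts = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"],
                 index_col="timestamp")

span = ts.index.max() - ts.index.min()
step = ts.index.to_series().diff().min()             # smallest observed interval
expected = pd.date_range(ts.index.min(), ts.index.max(), freq=step)
missing = expected.difference(ts.index)              # gaps in the regular grid
print("span:", span, "step:", step, "missing:", list(missing))
```

Using the smallest observed interval as the nominal step is a heuristic; irregular sensors may need a domain-supplied sampling rate instead.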
## Geospatial and Environmental Data

### .shp - Shapefile

**Description:** Geospatial vector data

**Typical Data:** Geographic features (points, lines, polygons)

**Use Cases:** GIS analysis, spatial data

**Python Libraries:**

- `geopandas`: `gpd.read_file('file.shp')`
- `fiona`: Lower-level shapefile access
- `pyshp`: Pure Python shapefile reader

**EDA Approach:**

- Geometry type and count
- Coordinate reference system
- Bounding box
- Attribute table analysis
- Geometry validity
- Spatial distribution
- Multi-part features
- Associated files (.shx, .dbf, .prj)

### .geojson - GeoJSON

**Description:** JSON format for geographic data

**Typical Data:** Features with geometry and properties

**Use Cases:** Web mapping, spatial analysis

**Python Libraries:**

- `geopandas`: Native GeoJSON support
- `json`: Parse as JSON then process

**EDA Approach:**

- Feature count and types
- CRS specification
- Bounding box calculation
- Property schema
- Geometry complexity
- Nesting structure

### .tif / .tiff (Geospatial)

**Description:** GeoTIFF with spatial reference

**Typical Data:** Satellite imagery, DEMs, rasters

**Use Cases:** Remote sensing, terrain analysis

**Python Libraries:**

- `rasterio`: `rasterio.open('file.tif')`
- `gdal`: Geospatial Data Abstraction Library
- `xarray` with `rioxarray`: N-D geospatial arrays

**EDA Approach:**

- Raster dimensions and resolution
- Band count and descriptions
- Coordinate reference system
- Geotransform parameters
- NoData value handling
- Pixel value distribution
- Histogram analysis
- Overviews and pyramids

### .nc / .netcdf - Network Common Data Form

**Description:** Self-describing array-based data

**Typical Data:** Climate, atmospheric, oceanographic data

**Use Cases:** Scientific datasets, model output

**Python Libraries:**

- `netCDF4`: `netCDF4.Dataset('file.nc')`
- `xarray`: `xr.open_dataset('file.nc')`

**EDA Approach:**

- Variable enumeration
- Dimension analysis
- Time series properties
- Spatial coverage
- Attribute metadata (CF conventions)
- Coordinate systems
- Chunking and compression
- Data quality flags

### .grib / .grib2 - Gridded Binary

**Description:** Meteorological data format

**Typical Data:** Weather forecasts, climate data

**Use Cases:** Numerical weather prediction

**Python Libraries:**

- `pygrib`: GRIB file reading
- `xarray` with `cfgrib`: GRIB to xarray

**EDA Approach:**

- Message inventory
- Parameter and level types
- Spatial grid specification
- Temporal coverage
- Ensemble members
- Forecast vs analysis
- Data packing and precision

### .hdf4 - HDF4 Format

**Description:** Older HDF format

**Typical Data:** NASA Earth Science data

**Use Cases:** Satellite data (MODIS, etc.)

**Python Libraries:**

- `pyhdf`: HDF4 access
- `gdal`: Can read HDF4

**EDA Approach:**

- Scientific dataset listing
- Vdata and attributes
- Dimension scales
- Metadata extraction
- Quality flags
- Conversion to HDF5 or NetCDF

## Specialized Scientific Formats

### .fits - Flexible Image Transport System

**Description:** Astronomy data format

**Typical Data:** Images, tables, spectra from telescopes

**Use Cases:** Astronomical observations

**Python Libraries:**

- `astropy.io.fits`: `fits.open('file.fits')`
- `fitsio`: Alternative FITS library

**EDA Approach:**

- HDU (Header Data Unit) structure
- Image dimensions and WCS
- Header keyword analysis
- Table column descriptions
- Data type and scaling
- FITS convention compliance
- Checksum validation

### .asdf - Advanced Scientific Data Format

**Description:** Next-gen data format for astronomy

**Typical Data:** Complex hierarchical scientific data

**Use Cases:** James Webb Space Telescope data

**Python Libraries:**

- `asdf`: `asdf.open('file.asdf')`

**EDA Approach:**

- Tree structure exploration
- Schema validation
- Internal vs external arrays
- Compression methods
- YAML metadata
- Version compatibility

### .root - ROOT Data Format

**Description:** CERN ROOT framework format

**Typical Data:** High-energy physics data

**Use Cases:** Particle physics experiments

**Python Libraries:**

- `uproot`: Pure Python ROOT reading
- `ROOT`: Official PyROOT bindings

**EDA Approach:**

- TTree structure
- Branch types and entries
- Histogram inventory
- Event loop statistics
- File compression
- Split level analysis

### .txt - Plain Text Data

**Description:** Generic text-based data

**Typical Data:** Tab/space-delimited, custom formats

**Use Cases:** Simple data exchange, logs

**Python Libraries:**

- `pandas`: `pd.read_csv()` with custom delimiters
- `numpy`: `np.loadtxt()`, `np.genfromtxt()`
- Built-in file reading

**EDA Approach:**

- Format detection (delimiter, header)
- Data type inference
- Comment line handling
- Missing value codes
- Column alignment
- Encoding detection

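Delimiter and header detection can be automated with the standard library's `csv.Sniffer`; a sketch (the semicolon-delimited sample is illustrative):

```python
import csv

# Illustrative delimited text whose layout is unknown in advance.
sample = "id;value;flag\n1;3.14;yes\n2;2.72;no\n"

sniffer = csv.Sniffer()
dialect = sniffer.sniff(sample)           # infer the delimiter heuristically
has_header = sniffer.has_header(sample)   # guess whether row 0 is a header
rows = list(csv.reader(sample.splitlines(), dialect))
print("delimiter:", repr(dialect.delimiter), "header:", has_header)
```

The sniffer is heuristic, so its guesses are worth confirming against a manual look at the first few lines.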
### .dat - Generic Data File

**Description:** Binary or text data

**Typical Data:** Instrument output, custom formats

**Use Cases:** Various scientific instruments

**Python Libraries:**

- Format-specific: requires knowledge of structure
- `numpy`: `np.fromfile()` for binary
- `struct`: Parse binary structures

**EDA Approach:**

- Binary vs text determination
- Header detection
- Record structure inference
- Endianness
- Data type patterns
- Validation against instrument documentation

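Once a record layout is known (or hypothesized), `struct` makes it easy to parse and sanity-check; a sketch with a hypothetical little-endian layout of a uint32 id plus a float64 reading:

```python
import struct

# Hypothetical record layout: little-endian uint32 id + float64 reading.
record = struct.Struct("<Id")
blob = b"".join(record.pack(i, i * 0.5) for i in range(4))  # fake instrument dump

# Record-structure sanity check: the file length must be a whole number of records.
assert len(blob) % record.size == 0
records = [record.unpack_from(blob, off)
           for off in range(0, len(blob), record.size)]
print("record size:", record.size, "count:", len(records))
print(records[:2])
```

Trying both `<` and `>` prefixes and seeing which yields plausible values is a common way to settle endianness.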
### .log - Log Files

**Description:** Text logs from software/instruments

**Typical Data:** Timestamped events, messages

**Use Cases:** Troubleshooting, experiment tracking

**Python Libraries:**

- Built-in file reading
- `pandas`: Structured log parsing
- Regular expressions for parsing

**EDA Approach:**

- Log level distribution
- Timestamp parsing
- Error and warning frequency
- Event sequencing
- Pattern recognition
- Anomaly detection
- Session boundaries

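Log-level distribution is a natural first cut; a regex-based sketch (the timestamp format and log lines are illustrative, as real logs vary):

```python
import re
from collections import Counter

# Illustrative log lines (stand in for a real file's contents).
log_text = """\
2024-05-01 10:00:01 INFO run started
2024-05-01 10:00:05 WARNING sensor 3 drift
2024-05-01 10:00:09 ERROR sensor 3 offline
2024-05-01 10:00:12 INFO run finished
"""

# Capture the timestamp and level from each line; skip lines that don't match.
pattern = re.compile(r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>\w+)")
levels = Counter(
    m.group("level") for line in log_text.splitlines() if (m := pattern.match(line))
)
print(dict(levels))
```

The captured timestamps feed directly into event sequencing and session-boundary analysis.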